Key Responsibilities:
Strategic Platform Leadership:
- Translate the enterprise AI strategy defined by the AI Architect Principal, Enterprise Architecture, Enterprise AI Tech Leaders (AI Engineering, DS/ML Engineering, AI Emerging Tech), and AI Product Teams into actionable platform roadmaps and technical priorities.
- Partner with data, cloud, security, and infrastructure teams (both onshore and offshore) to define the end-to-end AI architecture framework, including compute, model lifecycle, and deployment strategies.
- Evaluate emerging technologies and recommend platform enhancements to improve model performance, scalability, and sustainability.
- Define and deliver the AI/ML/Agentic operations strategy, including the tool suite, standards, and technology/capability roadmap.
- Establish a point of view (POV) on key evaluations, including, but not limited to, “buy vs. build” decisions, platform assessments, tool comparisons, and cost/performance optimizations.
- Drive the technical strategy for the platform, balancing short-term needs with long-term scalability and reliability.
- Design, implement, and scale cloud-based infrastructure (GCP and Databricks) to support internal and external applications.
- Oversee platform architecture decisions, ensuring that the platform is robust, efficient, and capable of supporting growing business needs.
Architecture Design and Governance:
- Lead the design of core AI/ML platform components: data pipelines, model training and inference engines, orchestration workflows, and monitoring frameworks.
- Design continuous training pipelines and promote automation wherever possible.
- Establish architectural best practices, patterns, and standards for AI/ML development and deployment.
- Oversee design reviews and ensure compliance with enterprise architecture and regulatory requirements.
- Design and build the platform to meet AI Governance requirements, including the application and measurement of controls, and ensure observability requirements can be met.
Cross-Functional Collaboration:
- Work closely with product, data science, engineering, and security teams to operationalize AI/ML capabilities across the enterprise.
- Partner with the AI Architect Principal and AI Solution Architects to align agentic platform evolution with organizational priorities and technology roadmaps.
- Collaborate with security and operations teams to ensure the platform is secure, compliant, and maintains high uptime and reliability.
- Engage with leadership to align technical initiatives with organizational objectives.
Team Leadership and Talent Development:
- Manage and mentor a team of AI engineers, AI architects, and AI Ops engineers.
- Foster a high-performance culture emphasizing innovation, collaboration, and agile delivery.
- Support career development, technical upskilling, and diversity in AI technology roles, specifically those aligned to the Architecture space. Mentor and guide junior team members.
- Conduct regular performance reviews, provide career development guidance, and support team members’ growth and skill development.
- Own hiring, onboarding, and team-building activities to ensure the team has the right talent and skills.
- Provide hands-on technical leadership, leading by example in designing and building robust, scalable systems.
- Drive high standards of code quality, testing, and engineering practices.
- Advocate for platform engineering best practices, ensuring systems are maintainable, extensible, and documented.
Operational Excellence:
- Oversee platform scalability, reliability, and cost optimization across cloud and on-prem environments.
- Implement observability and monitoring tools to proactively identify performance or security issues.
- Ensure platform compliance with responsible AI, data privacy, and ethical ML principles.
- Build, develop, and execute an architecture audit process.
- Identify opportunities for automation, process improvements, and tooling that enhance platform reliability and efficiency.
- Stay current with emerging technologies and industry best practices and incorporate relevant trends into the platform strategy.
- Lead incident response and post-mortem reviews to continuously improve platform resilience.
Platform Performance & Reliability:
- Define, implement, and monitor key platform performance metrics, including system uptime, latency, and resource utilization.
- Ensure the platform is scalable and cost-efficient, optimizing for performance and operational cost management.
- Lead efforts to identify, troubleshoot, and resolve platform performance issues or outages.
Qualifications:
- Bachelor’s degree in computer science, a related field, or applicable work experience.
- 10+ years of experience in software development or architecture, including a minimum of 3 years in a leadership role.
- Deep knowledge of AI/ML ecosystems.
- Experience designing MLOps pipelines and AI Ops frameworks at enterprise scale.
- Strong understanding of cloud-native architecture (AWS, Azure, Databricks, or GCP), MLOps frameworks, and CI/CD principles.
- Experience with multi-agent architecture concepts: orchestration patterns, tool/skill registry, memory and state management, and agent observability.
- Experience setting technical standards and coordinating across distributed engineering teams.
- Proven ability to lead cross-functional engineering teams and deliver enterprise-scale AI solutions.
- Comfortable navigating novel technology solutions and managing ambiguity to deliver in evolving domains.
- Familiarity with security and compliance standards for platform operations.
- Strong analytical and problem-solving skills with a data-driven mindset.
- Excellent communication, project management, and stakeholder engagement skills.
- Comfortable presenting to senior leadership (VP+ level); able to present technical concepts to non-technical and executive audiences.


