Senior Technical Program Manager - AI Research Systems
US, CA, Santa Clara
NVIDIA
NVIDIA invented the graphics processor and drives advances in AI, HPC, gaming, creative design, autonomous vehicles, and robotics.
Joining NVIDIA's AI Efficiency Team means contributing to the infrastructure that powers our innovative AI research. This team focuses on optimizing the efficiency and resiliency of ML workloads, as well as developing scalable AI infrastructure tools and services. Our objective is to deliver a resilient and scalable environment for NVIDIA's AI researchers, providing them with the necessary resources and scale to foster innovation.
As a Technical Program Manager (TPM) on this team, you'll confront and oversee the unique challenges of building and maintaining the AI and data infrastructure necessary for training flagship models at an unprecedented scale. Your focus will be on increasing researcher productivity, as well as improving system stability, availability, and performance.
What you'll be doing:
Understand the challenges of training foundational models: Delve into the specific workflows and resource requirements of training innovative LLMs and Generative AI models. Identify the problems, scaling limitations, and failure points that researchers encounter.
Engineer Scalable and Resilient Solutions: Design, implement, and continuously refine highly scalable AI/ML infrastructure. Prioritize fault tolerance, automated recovery mechanisms, and proactive monitoring to minimize disruptions to critical research projects.
Increase Researcher Velocity: Develop streamlined processes for resource allocation, model deployment, and experimentation tracking. Collaborate closely with researchers to ensure the infrastructure seamlessly supports their rapidly evolving needs.
Lead Complex Technical Projects: Own the planning, execution, and delivery of complex infrastructure projects in a dynamic, fast-paced research environment. Balance agility with meticulous attention to detail, risk assessment, and long-term maintainability.
Collaborate for Success: Partner with diverse teams across engineering, research, and operations to drive solutions that address the complexities of large-scale AI development.
Resource Matching and Optimization: Collaborate with researchers to understand their computational needs (compute, memory, networking bandwidth, storage performance) and ensure optimal resource utilization.
Data Access and Pipelines: Design high-throughput data pipelines and storage solutions that seamlessly integrate with researcher workflows, enabling efficient access to massive datasets.
What We Need To See:
BS or MS degree or equivalent experience
8+ years of program management experience within the same or similar industries
Technical Expertise: Deep understanding of cloud infrastructure, distributed systems, large-scale ML/HPC workloads, Kubernetes, Slurm, and AWS services. Experience in handling petabyte-scale data and extreme-scale systems with tens of thousands of compute nodes is a plus.
Experience managing projects with one of the following workloads: ML (e.g., training and deploying large machine learning models) or HPC (e.g., deploying the hardware and software needed to run large batch compute jobs).
Experience within Cloud Infrastructure, particularly with compute, networking, and storage.
Software Development Background: Familiarity with AI/ML frameworks
Project Management Mastery: Demonstrated ability to manage numerous projects, prioritize in a high-pressure setting, identify and mitigate risks, and ensure on-time delivery.
Researcher-Centric Focus: Demonstrated ability to empathize with researchers, understand their problems, and translate their needs into technical requirements and map them to the abilities of AI infrastructure engineers and teams. Excellent communication and collaboration skills are essential.
Understanding of data pipeline requirements for large-scale ML training. This includes awareness of data throughput needs, data preprocessing steps, and potential bottlenecks. Familiarity with tools like Apache Spark or Ray is a plus.
Background in evaluating and selecting data storage solutions (HDFS, object storage, distributed file systems, such as Lustre) based on cost, performance, and research requirements.
You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
Tags: AWS Data pipelines Distributed Systems Engineering Generative AI HDFS HPC Kubernetes LLMs Machine Learning ML infrastructure ML models Model deployment Pipelines Research Spark
Perks/benefits: Career development Equity