Research Engineer - Codec Avatar ML Compute Team
Pittsburgh, PA
Our team cultivates an honest and considerate environment where self-motivated individuals thrive. We encourage a strong sense of ownership and embrace the ambiguity that comes with working on the frontiers of research. In this research engineer role on the Codec Avatar ML Compute team, you will serve as the point of contact for Meta's research GPU super clusters, parallelize massive ML models and data, and optimize other compute resources to enable groundbreaking research in relightable avatars, full-body avatars, and generative AI for codec avatars.
Research Engineer - Codec Avatar ML Compute Team Responsibilities
- Build efficient and scalable machine learning tooling for the GPU clusters within Meta research labs, a heterogeneous environment containing diverse system architectures and research workloads
- Build efficient and scalable data tooling for massive ML training data preprocessing and postprocessing using thousands of CPU/GPU nodes
- Provide on-call support and lead incident root cause analysis through multiple infrastructure layers (compute, storage, network) for GPU clusters and act as a final escalation point
- Work side by side with research scientists and engineers to take full advantage of modern GPUs for large scale multi-GPU training jobs
- Collaborate in a diverse team environment across multiple scientific and engineering disciplines, making the architectural tradeoffs required to rapidly train large scale ML models
- Provide guidance to other engineers on best practices to build mature tools which are highly reliable, secure, and scalable
- Influence outcomes within your immediate team, peer engineering teams, and with cross-functional stakeholders
- Ability to work independently, handle large projects simultaneously, and prioritize team roadmap and deliverables by balancing required effort with resulting impact
- Currently has, or is in the process of obtaining, a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta.
- 3+ years of experience coding in C++ and Python
- Experience in large scale ML system performance measurement, logging, and optimization
- Experience in writing system level infrastructure, libraries, and applications
- Prior experience in ML libraries such as PyTorch, TensorFlow, or cuDNN
- Experience with software development practices such as source control, unit testing, debugging and profiling
- Experience in developing performant software and systems
- Prior experience in large scale machine learning model training, including model parallelization strategies
- Prior experience in machine learning model compilers
- Prior experience in cluster coordination and strategy planning, including collecting/understanding needs of researchers, developing tools to improve research experience, providing guidance on best practices, coordinating distribution of compute/storage resources, forecasting compute/storage needs, and developing long-term user experience/compute/storage strategies
- Prior experience building tooling for monitoring and telemetry for large scale supercomputers
- Prior experience in debugging performance issues for large scale ML training tasks
- Prior experience in GPGPU development with CUDA, OpenCL or DirectCompute
- Familiarity with Linux observability tools, such as eBPF
Individual pay is determined by skills, qualifications, experience, and location. Compensation details listed in this posting reflect the base salary only, and do not include bonus, equity or sales incentives, if applicable. In addition to base salary, Meta offers benefits. Learn more about benefits at Meta.
Perks/benefits: Career development Equity Health care Salary bonus