HPC Engineer, Machine Learning Infrastructure - US Remote
United States - Remote
Applications have closed
Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science. Here at Hugging Face, we’re on a journey to advance good Machine Learning and make it more accessible. Along the way, we contribute to the development of technology for the better.
We have built the fastest-growing open-source library of pre-trained models in the world. With more than 1 million models and 320K+ stars on GitHub, Hugging Face technology is used in production by over 15,000 companies, including leading AI organizations such as Google, Elastic, Salesforce, Grammarly and NASA.
About the role:
We are looking for an HPC Engineer responsible for developing and scaling our large distributed cluster. The ideal candidate will have experience provisioning large compute clusters for AI workflows and strong experience helping teams establish best practices for reliability and scalability.
Responsibilities
- Design, develop, deploy, and maintain reliable and scalable infrastructure that enables efficient training workloads.
- Manage large compute clusters for AI training and development.
- Create tooling and infrastructure that abstract compute and storage in ML workflows.
- Measure and optimize system performance.
- Monitor and troubleshoot infrastructure issues, ensuring high availability and performance of AI workloads.
- Stay up to date with the latest advancements in AI infrastructure technologies and recommend improvements to enhance system efficiency and performance.
- Work closely with AI software engineering teams to ensure infrastructure can handle all system requirements.
- Provide primary operational support and engineering for multiple teams.
Qualifications:
- 7+ years of experience in a DevOps or Infrastructure Engineer role building machine learning infrastructure and working with large GPU clusters.
- Knowledge of cloud providers such as AWS and GCP, infrastructure-as-code frameworks, and observability tools.
- Familiarity with the Python scientific stack and PyTorch.
- Experience with data structures, data modeling, and database management as well as object and file storage systems.
- Strong communication, collaboration, and documentation skills.
- Experience with Linux, Git, containers, networking and command line tools.
- Strong programming skills in Python, Golang, and/or Rust.
About you:
If you are a passionate HPC Engineer with a keen interest in AI who thrives in a challenging and innovative setting, we would love to hear from you. Join our team and contribute to the advancement of AI technologies while working alongside talented professionals in a collaborative and stimulating environment.
More about Hugging Face
We are actively working to build a culture that values diversity, equity, and inclusivity. We are intentionally building a workplace where people feel respected and supported, regardless of who they are or where they come from. We believe this is foundational to building a great company and community. Hugging Face is an equal opportunity employer and we do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
We value development. You will work with some of the smartest people in our industry. We are an organization that has a bias for impact and is always challenging ourselves to continuously grow. We provide all employees with reimbursement for relevant conferences, training, and education.
We care about your well-being. We offer flexible working hours and remote options. We offer health, dental, and vision benefits for employees and their dependents. We also offer 12 weeks of parental leave (20 for birthing mothers) and unlimited paid time off.
We support our employees wherever they are. While we have office spaces in NYC and Paris, we're very distributed and all remote employees have the opportunity to visit our offices. If needed, we'll also outfit your workstation to ensure you succeed.
We want our teammates to be shareholders. All employees have company equity as part of their compensation package. If we succeed in becoming a category-defining platform in machine learning and artificial intelligence, everyone enjoys the upside.
We support the community. We believe major scientific advancements are the result of collaboration across the field. Join a community supporting the ML/AI community.
Tags: AWS DevOps Engineering GCP Git GitHub Golang GPU HPC Linux Machine Learning ML infrastructure Open Source Python PyTorch Rust Salesforce
Perks/benefits: Career development Conferences Equity Flex hours Flex vacation Health care Parental leave Unlimited paid time off