Lead Systems Engineer - GPU Management (AI/HPC)

Palo Alto

Applications have closed

Hippocratic AI

The First Safety Focused LLM for Healthcare


At Hippocratic AI, we are at the forefront of technological innovation, leveraging advanced computing resources to solve complex problems. Our dedicated GPU clusters, including high-end NVIDIA A100 and H100 models, are crucial for our data processing, machine learning, and computational tasks, including the development and optimization of Large Language Models (LLMs).

Position Overview:

As Lead Systems Engineer specializing in Slurm, HPC, and GPUs, you will design, implement, and maintain our advanced computing infrastructure. Your in-depth knowledge of Slurm, HPC principles, and GPU utilization will enable you to optimize system performance, ensure reliable operation, and support our growing computational needs.

Responsibilities:

  • GPU Cluster Management:

    • Run high-performance compute services, such as SageMaker, in public cloud environments (AWS, GCP, and Azure).

    • Knowledge of hardware components, such as GPUs (including high-end models like NVIDIA A100 and H100), and familiarity with NVIDIA Container Toolkit.

    • Experience in managing GPU nodes in cloud environments, ensuring optimal performance and reliability.

  • Orchestration and Automation:

    • Proficiency in Kubernetes for container orchestration and Slurm for workload management to efficiently distribute tasks across the GPU cluster.

    • Experience in setting up and configuring these orchestration tools to ensure high availability and scalability of cluster resources.

  • Troubleshooting and Debugging:

    • Ability to provide in-depth technical support for complex issues, including debugging and troubleshooting high-end GPUs.

    • Familiarity with debugging tools and techniques specific to GPU hardware and software.

  • Performance Optimization:

    • Continuous monitoring of system performance to identify bottlenecks and implement solutions to optimize resource utilization and throughput.

    • Knowledge of performance tuning techniques for GPU clusters and the ability to apply them effectively.

  • Security and Compliance:

    • Ensure adherence to security best practices and compliance requirements for GPU cluster infrastructure.

    • Implementation and management of security protocols and disaster recovery strategies to safeguard cluster resources and data.

  • Collaboration and Support:

    • Work closely with other engineering, research, and applied science teams to understand and support their computational needs.

    • Offer guidance and expertise on utilizing the GPU cluster efficiently for various tasks and applications.

    • Participate in planning and executing future expansion or enhancement of cluster capabilities to meet evolving computational requirements.

Requirements:

  • Education:

    • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field. Master’s degree preferred.

  • Experience:

    • At least 3 years of experience in managing and maintaining GPU clusters, preferably in the cloud, with hands-on experience with NVIDIA A100 and H100 GPUs or similar high-end models.

  • Technical Skills:

    • Proficiency in Kubernetes for container orchestration and management, with experience in deploying, scaling, and managing containerized applications within Kubernetes clusters, including familiarity with AWS Kubernetes services for cloud deployment and management.

    • Experience with Slurm for workload management in GPU cluster environments.

    • Deep understanding of GPU hardware, including experience with debugging and troubleshooting GPU issues.

    • Strong background in Linux/Unix administration, scripting (e.g., Bash, Python), and automation tools, with expertise in Ansible for configuration management and automation tasks.

    • Familiarity with network configuration, storage systems, and security protocols relevant to GPU clusters.

  • Problem-Solving:

    • Exceptional analytical and problem-solving skills, with the ability to handle complex technical challenges effectively.

  • Communication:

    • Excellent communication and documentation skills, capable of collaborating effectively across diverse teams.

About Hippocratic AI

Hippocratic AI is dedicated to developing a safety-focused large language model (LLM) tailored for the healthcare sector. We firmly believe in the potential of generative AI to significantly enhance global healthcare accessibility, provided it is developed and tested responsibly. Mirroring the principles of the Hippocratic Oath that guides medical professionals, our model is designed with the ethos of "Do No Harm."




Perks/benefits: Career development

Region: North America
Country: United States