Staff Deep Learning Training Optimization Engineer

San Francisco, CA

Applications have closed

Cruise LLC

Cruise is the leading self-driving car company driven to improve life in our cities by safely connecting people with places, things & experiences they love.

We're Cruise, a self-driving service designed for the cities we love.

We’re building the world’s most advanced, self-driving vehicles to safely connect people to the places, things, and experiences they care about. We believe self-driving vehicles will help save lives, reshape cities, give back time in transit, and restore freedom of movement for many.

Cruisers have the opportunity to grow and develop while learning from leaders at the forefront of their fields. With a culture of internal mobility, there's an opportunity to thrive in a variety of disciplines. This is a place for dreamers and doers to succeed.

If you are looking to play a part in making a positive impact in the world by advancing the revolutionary work of self-driving cars, join us.

About the role

The Autonomous Vehicle (AV) software stack relies heavily on machine learning techniques to perform a variety of tasks, each with different hardware/compute resource requirements. Throughout the life cycle of each machine learning model, skilled ML engineers (on both the training and inference sides) work closely to prepare it for robust, scalable, and compute/power-efficient inference on a resource-constrained hardware accelerator. Such a close working relationship is key to fast and successful deployment of intelligent systems on the car.

Cruise is looking for deep learning performance experts who can understand the big picture of training performance on GPUs, prioritizing and then solving problems across many dozens of state-of-the-art neural networks. In this position, you will be responsible for understanding, analyzing, profiling, and optimizing deep learning training workloads on state-of-the-art hardware and software platforms. You will be expected to evaluate and improve the performance of every stage of computation, from vertical scaling (data pipeline optimization, faster layer execution/scheduling, mixed precision, model/subgraph parallelism) to horizontal scaling (strong vs. weak scaling, communication collective tuning for latency/bandwidth) to convergence tuning for large batches (LARS, LAMB, etc.).

This is a tech-leadership role. You will be charged with defining and leading a strategic roadmap to scale the velocity and throughput of training large ML models that are deployed to our AVs. In collaboration with your team, you will identify the current gaps and explore the right training optimization strategies to invest in for the AI department. You will work closely with both ML engineers and ML platform engineers to scale these solutions to meet expected demand.

If you're interested in optimizing machine learning training and inference on different hardware accelerators, and want to test your skills with real-world (and practical) applications in the autonomous vehicle domain, let's chat!

Day-to-day responsibilities include: 

  • Serve as the technical leader driving the strategy for optimizing training workloads at Cruise by defining strategic investments in collaboration with partner teams in AI
  • Understand the big picture of training performance at Cruise, and define the technical roadmap for optimizing performance across many dozens of state-of-the-art neural networks
  • Build performance analysis tools (profilers, hotspot analysis, etc.) to diagnose the bottlenecks in the end-to-end training workflow here at Cruise
  • Define technical strategies for scaling up utilization of training workloads on cloud resources
  • Lead execution of the strategy in partnership with customer/partner teams within the AI department (e.g., perception, prediction, robotics, and infra teams), product managers, and TPMs
  • Bring and extend SOTA in training efficiency to scale up the velocity of AI at Cruise

You should apply for this role if you have the following qualifications:

  • Expertise in optimizing training workloads (in PyTorch or TensorFlow) for scaling out and scaling up training performance in the cloud / datacenter
  • Experience with defining technical strategy, vision and direction and bringing alignment across cross-functional teams.
  • Experience working as a tech lead (TL) and delivering impact through improved training efficiency
  • Experience in PyTorch performance tuning (such as enabling async data loading, avoiding unnecessary CPU-GPU sync, disabling redundant gradient calculations, enabling more effective op fusions, eliminating redundant ops) and familiarity with PyTorch general optimization APIs (such as buffer checkpointing)
  • Good understanding of deep learning framework building blocks, e.g., operator registry, CPU & GPU ops, and tensor memory management system (e.g., caching allocator), as well as performance analysis, diagnosis, and optimization for GPU workloads in DL framework runtimes
  • General C++ experience
  • MS or higher degree in CS/CE/EE, or equivalent industry experience

Bonus points!

  • Experience with deep learning optimization libraries such as DeepSpeed, or with MLPerf training optimization
  • Familiarity with distributed training packages in frameworks (such as torch.distributed, Horovod), libraries (such as NVIDIA’s NCCL), and other scaling technologies (such as Reduction Server) for scaling up performance on multi-GPU systems
  • Familiarity with exploiting model parallelism and data parallelism to improve performance in multi-node data centers
  • Experience with open-source deep learning stacks (TVM, XLA, etc.)
  • Familiarity/experience using autograd-capable compilers (such as JAX, JuliaGPU)
  • GPU programming (CUDA) and familiarity with deep learning stack (e.g., cuDNN, cuBLAS)
  • SIMD programming model (AVX2, NEON)

Why Cruise?

  • Our benefits are here to support the whole you:
    • Competitive salary and benefits 
    • 401(k) Cruise matching program 
    • Medical / dental / vision, AD+D and Life
    • One Medical membership
    • Flexible vacation and company paid holidays
    • Healthy meals and snacks provided for non-remote employees
    • Paid parental leave
    • Fertility Benefits 
    • Dependent Care Flexible Spending Account, subsidized by Cruise
    • Flexible Spending Account 
    • Monthly wellness stipend
    • Pre-tax Commuter Benefit Plan for non-remote employees
  • We’re Integrated
    • Through our partnerships with General Motors and Honda, we are the only self-driving company with fully integrated manufacturing at scale.
  • We’re Funded
    • GM, Honda, Microsoft, SoftBank, and T. Rowe Price have invested billions in Cruise. Their backing for our technology demonstrates their confidence in our progress, team, and vision and makes us one of the leading autonomous vehicle organizations in the industry. Our deep resources greatly accelerate our operating speed.
  • We’re Independent
    • We have our own governance, board of directors, equity, and investors. Our independence allows us to not just work on the edge of technology, but also define it.
  • We’re Vested
    • You won’t just own your work here, you’ll have the potential to own equity in Cruise, too. We are competing in a market that is projected to grow exponentially, which gives our company valuation room to grow.

Cruise LLC is an equal opportunity employer. We strive to create a supportive and inclusive workplace where contributions are valued and celebrated, and our employees thrive by being themselves and are inspired to do the best work of their lives. 

We seek applicants of all backgrounds and identities, across race, color, ethnicity, national origin or ancestry, citizenship, religion, sex, sexual orientation, gender identity or expression, veteran status, marital status, pregnancy or parental status, or disability. Applicants will not be discriminated against based on these or other protected categories or social identities. Cruise will consider for employment qualified applicants with arrest and conviction records, in accordance with applicable laws.

Cruise is committed to the full inclusion of all applicants. If reasonable accommodation is needed to participate in the job application or interview process, please let our recruiting team know or email HR@getcruise.com.

We proactively work to design hiring processes that promote equity and inclusion while mitigating bias. To help us track the effectiveness and inclusivity of our recruiting efforts, please consider answering the following demographic questions. Answering these questions is entirely voluntary. Your answers to these questions will not be shared with the hiring decision makers and will not impact the hiring decision in any way. Instead, Cruise will use this information not only to comply with any government reporting obligations but also to track our progress toward meeting our diversity, equity, inclusion, and belonging objectives.

Vaccine Mandate. 

At Cruise, we’re tasked with leading in the communities we serve, and doing our part to help keep our communities and our teams safe. Our #StaySafe culture transcends and informs all we do, and because of this, as of October 31, 2021, Cruise will be mandating COVID-19 vaccinations for all US-based Cruisers who need or want to access any of our US Cruise facilities or engage in any business travel, including attending any in-person Company-sponsored event.

If you are unable to get a vaccine due to a medical condition, disability, or a strongly-held religious belief, Cruise will consider requests for an accommodation.

Note to Recruitment Agencies: Cruise does not accept unsolicited agency resumes. Furthermore, Cruise does not pay placement fees for candidates submitted by any agency other than its approved partners.

Tags: APIs CUDA Deep Learning GPU Horovod Machine Learning ML models PyTorch Robotics TensorFlow

Perks/benefits: Career development Competitive pay Equity Fertility benefits Flex hours Flexible spending account Flex vacation Gear Health care Home office stipend Medical leave Parental leave Salary bonus Snacks / Drinks Wellness

Region: North America
Country: United States
