Machine Learning Infrastructure Engineer, ML Platform

Remote, San Francisco

Full Time
Stripe
The new standard in online payments

Posted 1 month ago

Infrastructure Engineer, Machine Learning Platform

Stripe's mission is to increase the GDP of the internet. To do this, we need to fight fraud at scale and build great software products, which means assembling strong machine learning teams and equipping them with the technologies they need to be effective. Our mission on the Machine Learning Platform team is to make these teams more impactful by providing reliable, flexible infrastructure that enables machine learning at scale.

The Machine Learning Platform team does this by designing and engineering the underlying infrastructure that powers experimentation, training, and serving for Stripe’s key machine learning systems. Our flagship products include Railyard and Diorama. Railyard provides an expressive and powerful interface for model training at scale. Diorama enables model serving in real time with strong reliability and latency guarantees. We work closely with ML engineers, data scientists, and platform infrastructure teams to build powerful, flexible, and user-friendly systems that substantially increase ML velocity across the company.

You will work on:

  • Building powerful, flexible, and user-friendly infrastructure that powers all of ML at Stripe
  • Designing and building fast, reliable services for ML model training and serving, and distributing that infrastructure across multiple regions
  • Creating services and libraries that enable ML engineers at Stripe to seamlessly transition from experimentation to production across Stripe’s systems
  • Pairing with product teams and ML modeling engineers to develop easy-to-use infrastructure for production ML models

We are looking for:

  • A strong engineering background and experience with data infrastructure and/or distributed systems
  • Experience optimizing the end-to-end performance of distributed systems
  • Experience developing and maintaining distributed systems built with open source tools
  • Experience with or strong interest in developing ML models

Nice to haves:

  • Experience with Scala and Python
  • Experience with Kubernetes
  • Experience with creating developer tools
  • Experience with model training and serving in production and at scale
  • Experience writing and debugging ETL jobs using a distributed data framework (such as Spark, Kafka, or Flink)

It’s not expected that you’ll have deep expertise in every dimension above, but you should be interested in learning any of the areas that are less familiar.

Job tags: Distributed Systems, Engineering, ETL, Kafka, Kubernetes, Machine Learning, ML, Open Source, Python, Scala, Spark
Job region(s): North America, Remote/Anywhere