Machine Learning Infrastructure Engineer, ML Platform

Seattle, San Francisco, Remote (North America)

Stripe

Stripe powers online and in-person payment processing and financial solutions for businesses of all sizes. Accept payments, send payouts, and automate financial processes with a suite of APIs and no-code tools.

Stripe's mission is to increase the GDP of the internet. To do this, we need to fight fraud at scale and build great software products, which means assembling strong machine learning teams and equipping them with the technologies they need to be effective. The Machine Learning Platform team's mission is to make these teams more impactful by providing reliable, flexible infrastructure that enables machine learning at scale.


The Machine Learning Platform team does this by designing and engineering the underlying infrastructure that powers experimentation, training, and serving for Stripe’s key machine learning systems. Our flagship products are Railyard and Diorama: Railyard provides an expressive, powerful interface for model training at scale, and Diorama serves models in real time with strong reliability and latency guarantees. We work closely with ML engineers, data scientists, and platform infrastructure teams to build the powerful, flexible, and user-friendly systems that substantially increase ML velocity across the company.


You will work on:

  • Building powerful, flexible, and user-friendly infrastructure that powers all of ML at Stripe
  • Designing and building fast, reliable services for ML model training and serving, and distributing that infrastructure across multiple regions
  • Creating services and libraries that enable ML engineers at Stripe to seamlessly transition from experimentation to production across Stripe’s systems
  • Pairing with product teams and ML modeling engineers to develop easy-to-use infrastructure for production ML models

We are looking for:

  • A strong engineering background and experience with data infrastructure and/or distributed systems
  • Experience optimizing the end-to-end performance of distributed systems
  • Experience developing and maintaining distributed systems built with open source tools
  • Experience with or strong interest in developing ML models


Nice-to-haves:

  • Experience with Scala and Python
  • Experience with Kubernetes
  • Experience with creating developer tools
  • Experience with model training and serving in production and at scale
  • Experience writing and debugging ETL jobs using a distributed data framework (such as Spark, Kafka, or Flink); a minimal example of this kind of job is sketched after this list
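
For candidates less familiar with this kind of work, here is a minimal sketch of the sort of ETL job the last bullet refers to, written with PySpark purely for illustration. The dataset paths, column names, and aggregation are hypothetical examples and are not taken from this posting or from Stripe's systems.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("example-etl").getOrCreate()

# Read raw charge events (hypothetical path and schema), keep only successful
# charges, and aggregate per-merchant daily volume as a toy feature table.
events = spark.read.parquet("s3://example-bucket/raw/charges/")
daily_volume = (
    events
    .filter(F.col("status") == "succeeded")
    .groupBy("merchant_id", F.to_date("created_at").alias("day"))
    .agg(
        F.count("*").alias("charge_count"),
        F.sum("amount").alias("total_amount"),
    )
)

# Write the derived table back out, partitioned by day, so downstream
# training jobs can consume it efficiently.
daily_volume.write.mode("overwrite").partitionBy("day").parquet(
    "s3://example-bucket/features/daily_merchant_volume/"
)

spark.stop()
```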

We don’t expect you to have deep expertise in every dimension above, but you should be interested in learning the areas that are less familiar to you.

Tags: Distributed Systems, Engineering, ETL, Flink, Kafka, Kubernetes, Machine Learning, ML models, Model training, Open Source, Python, Scala, Spark

Regions: Remote/Anywhere, North America
Country: United States