Staff Software Engineer, ML Infrastructure

San Francisco, CA

Full Time Senior-level / Expert USD 180K - 270K

Attentive

Attentive is the most comprehensive personalized text messaging solution. 99% open rates, 30%+ click-through rates, and 25x+ ROI.

Posted 4 weeks ago

About Attentive: Attentive® is the AI marketing platform for leading brands, designed to optimize message performance through 1:1 SMS and email interactions. Infusing intelligence at every stage of the consumer's purchasing journey, Attentive empowers businesses to achieve hyper-personalized communication with their customers on a large scale. Leveraging AI-powered tools, a mobile-first approach, two-way conversations, and enterprise-grade technology, Attentive drives billions in online revenue for brands around the globe. Trusted by over 8,000 leading brands such as CB2, Urban Outfitters, GUESS, Dickey’s Barbecue Pit, and Wyndham Resort, Attentive is the go-to solution for delivering powerful commerce experiences for consumers with the brands they love.
Attentive’s growth has been recognized by Deloitte’s Fast 500, LinkedIn’s Top Startups, and Forbes Cloud 100, all thanks to the hard work of our global employees!
Who we are

We’re looking for a self-motivated, highly driven Staff Software Engineer to join our Machine Learning Operations (MLOps) team. As a team, we enable Attentive’s Machine Learning (ML) practice to directly impact Attentive’s AI product suite by providing the tools to train, deploy, and run inference on ML models with higher velocity and performance while maintaining reliability. We build and maintain a foundational ML platform spanning the full ML lifecycle for consumption by ML engineers and data scientists. This is an exciting opportunity to join a rapidly growing MLOps team on the ground floor, with the ability to drive and influence the architectural roadmap enabling the entire ML organization at Attentive.
This team and role are responsible for building and operating the ML compute and orchestration architecture here at Attentive, which currently consists of a hosted notebook solution with Spark on AWS EMR, a multi-cluster CPU and GPU-enabled training and inference orchestrator leveraging Metaflow on Argo Workflows, and an ML feature store. We are excited to bring on more engineers to continue expanding this stack.

Why Attentive needs you

Define and lead cross-functional ML infrastructure and ML platform projects
Demonstrate the ability to analyze, troubleshoot, coordinate, and resolve complex ML infrastructure issues
Orchestrate Kubernetes and ML training / inference infrastructure exposed as an ML platform
Expose and manage environments, interfaces, and workflows to enable ML engineers to develop, build, and test ML models and services
Manage and expand our feature store implementation that allows ML teams to self-service data labeling, feature engineering, and batch inferencing
Close the latency gap on model inference to online, real-time model serving
Develop automation workflows to improve team efficiency and ML stability
Analyze and improve efficiency, scalability, and stability of various system resources
Partner with other teams and business stakeholders to deliver business initiatives
Help onboard new team members, provide mentorship and enable successful ramp up on your team's code bases

About you

You have been working in the areas of MLOps / ML Platform / Data Platform / Site Reliability Engineering / DevOps / Infrastructure for 8+ years, and have an understanding of best practices for DevOps applied to ML
You have successfully led major cross-functional, cross-team ML infrastructure or ML platform projects
Your passion is infrastructure and exposing platform capabilities through interfaces that enable high performance ML practices, rather than designing ML experiments (this team does not directly develop ML models)
You have deep experience in Kubernetes applied to ML use cases such as CPU & GPU training, hosting and exposing ML tools, and managing ML endpoints as web services
You understand the key differences between online and offline ML inferences and can voice the critical elements to be successful with each
You have a background in software development and are passionate about bringing that experience to bear on the world of ML infrastructure
You have experience with Infrastructure as Code using Terraform and can’t imagine a world without it
You understand the importance of CI/CD in building high-performing teams and have worked with tools like Jenkins, CircleCI, Argo Workflows, and ArgoCD
You are passionate about observability and have worked with tools such as Splunk, Nagios, Sensu, Datadog, and New Relic
You are very familiar with containers and container orchestration and have direct experience with vanilla Docker as well as Kubernetes, as both a user and an administrator

Some sample projects

Design and lead implementation of an online inference pipeline with champion/challenger model testing
Unite existing pipelines across data, ML, and platform teams to handle low-latency, high volume real-time streaming use cases in production inference workflows
Define golden path build and release pipelines for better reliability and Python package management
Identify opportunities to improve scalability, resiliency, and cost efficiency of GPU training and inference workflows
Design and lead implementation of a low-touch, automated model CI/CD pipeline

Our scale

8,000 brands powered by Attentive sent over 2.2 billion text messages over Cyber Week 2023 (Black Friday/Cyber Monday) representing a growth of 31% from 2022
We sent 32 billion SMS messages in 2023, up 32% YoY. That’s an average of 87 million per day
Our production cluster contains over 18,000 containers which serve 200+ services
Our streaming services process over 80 billion events per month

What we use

Our infrastructure runs primarily in Kubernetes hosted in AWS’s EKS
Infrastructure tooling includes Istio, Datadog, Terraform, Cloudflare, and Helm
Our backend is Java / Spring Boot microservices, built with Gradle, coupled with things like DynamoDB, Pulsar, Airflow, Postgres, PlanetScale, and Redis, hosted via AWS
Our frontend is built with React and TypeScript, and uses best practices like GraphQL, Storybook, Radix UI, Vite, esbuild, and Playwright
Our automation is driven by custom and open source machine learning models and lots of data, and is built with Python, Metaflow, HuggingFace 🤗, PyTorch, TensorFlow, and Pandas

You'll get competitive perks and benefits, from health & wellness to equity, to help you bring your best self to work.
For US based applicants:
- The US base salary range for this full-time position is $180,000 - $270,000 annually + equity + benefits
- Our salary ranges are determined by role, level and location
Attentive Company Values

Default to Action - Move swiftly and with purpose
Be One Unstoppable Team - Rally as each other’s champions
Champion the Customer - Our success is defined by our customers' success
Act Like an Owner - Take responsibility for Attentive’s success
Learn more about AWAKE, Attentive’s collective of employee resource groups.
If you do not meet all the requirements listed here, we still encourage you to apply! No job description is perfect, and we may also have another opportunity that closely matches your skills and experience.
At Attentive, we know that our Company's strength lies in the diversity of our employees. Attentive is an Equal Opportunity Employer and we welcome applicants from all backgrounds. Our policy is to provide equal employment opportunities for all employees, applicants and covered individuals regardless of protected characteristics. We prioritize and maintain a fair, inclusive and equitable workplace free from discrimination, harassment, and retaliation.

Tags: Airflow Architecture AWS CI/CD DevOps Docker DynamoDB Engineering Feature engineering GPU GraphQL Helm HuggingFace Java Kubernetes Machine Learning Microservices ML infrastructure ML models MLOps Model inference Open Source Pandas Pipelines PostgreSQL Pulsar Python PyTorch React Spark Splunk Streaming TensorFlow Terraform Testing TypeScript

Perks/benefits: Career development Competitive pay Equity Health care Startup environment Team events Wellness

Region: North America

Country: United States

Categories: Engineering Jobs Leadership Jobs Machine Learning Jobs
