Data Engineer
Bangalore, India
Airbnb
Airbnb is a mission-driven company dedicated to helping create a world where anyone can belong anywhere. It takes a unified team committed to our core values to achieve this goal. Airbnb's various functions embody the company's innovative spirit, and our fast-moving team is committed to leading as a 21st-century company.
A Data Engineer is responsible for designing, developing, producing and owning a specific business domain’s core models. These data models are often intended to be used not only by members of that business domain, but also data consumers from across the company. Common uses of this data include metrics generation, analytics, experimentation, reporting, and ML feature generation.
Data Engineers do not build datasets for individual applications (e.g. a report, a metric, an analysis) or for experimental nascent metrics. Application-specific datasets are typically built by the application owner, often using one or more core data models as inputs.
The responsibilities of a data engineer by development phase include:
- Define
- Identify and gather the most frequent data consumption use cases for the datasets they are designing. A critical question we expect DEs to weigh in on is whether existing data models can satisfy the need (or at least serve as a foundation) before building new data.
- Understand the impact of each requirement and use impact to inform prioritization decisions
- Define data governance requirements (data access & privacy PII, retention etc.)
- Define requirements for upstream data producers to satisfy intended data access patterns (including latency and completeness)
- Design
- Guide product design decisions to ensure that needs for data timeliness, accuracy, and completeness are addressed
- Design the data set required to support a specific business domain (i.e. the data model for that business domain)
- Identify required data to support data model requirements
- Work closely with online system engineers to influence the design of online data models (events and production tables) so that they meet the requirements of their offline data consumers
- Partner closely with data source owners (i.e. engineers and APIs and their owners) to specify and document data required to be ingested for the successful delivery of a requirement
- Define and own schema of events for ingested data
- Define and document data transformations that will need to be made to transform ingested data into data warehouse tables
- Validate the data model meets Data Warehouse standards
- Validate the data model integrates with adjacent data models
- Document tables & columns for data consumers
- Optimize data pipeline design to ensure compute and storage costs are efficient over time
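As an illustration of the kind of transformation a Data Engineer defines and documents in this phase, a raw ingested event might be mapped to a warehouse row roughly as follows. This is a minimal sketch; all event names, fields, and schemas here are hypothetical, not Airbnb's actual data models:

```python
from datetime import datetime, timezone

def transform_booking_event(event: dict) -> dict:
    """Map a hypothetical raw booking event to a warehouse fact row.

    The transformation is documented field by field (renames, type
    casts, derived columns) so data consumers can trace each output
    column back to its source.
    """
    return {
        "booking_id": int(event["id"]),
        "guest_id": int(event["guest"]["id"]),
        "listing_id": int(event["listing_id"]),
        # Cast epoch milliseconds to an ISO-8601 UTC date string
        "booked_date": datetime.fromtimestamp(
            event["created_at_ms"] / 1000, tz=timezone.utc
        ).date().isoformat(),
        # Derived column: price converted from cents to major units
        "gross_amount": event["price_cents"] / 100.0,
    }

raw = {
    "id": "42",
    "guest": {"id": "7"},
    "listing_id": "1001",
    "created_at_ms": 1_700_000_000_000,
    "price_cents": 12_345,
}
row = transform_booking_event(raw)
```

In practice the same mapping would be expressed as a Spark or SQL transformation inside a scheduled pipeline; the point is that each cast and derivation is explicit and documented.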
- Build
- Implement data pipelines (streaming & batch) to execute required data transformations to meet design requirements
- Validate incoming data to identify syntactic & semantic data quality issues
- Validate data through analysis and testing to ensure the data produced meets the requirements specifications
- Implement sufficient data quality checks (pre-checks, post-checks, anomaly detection) to preserve ongoing data quality
- Partner with data consumers to validate resulting tables address the intended business need
- Provide and solicit constructive peer review on data artifacts like pipeline design, data models, data quality checks, unit tests, code etc.
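To make the pre-check/post-check idea concrete, here is a minimal sketch of a post-check that could gate a pipeline run. The column name and thresholds are hypothetical; a production check would typically run inside the orchestration framework (e.g. as an Airflow task) against warehouse tables:

```python
def post_check(rows, key_column="booking_id", max_null_rate=0.01, min_rows=1):
    """Fail the run if the output is empty or a key column has too
    many nulls. Raising stops downstream tasks from consuming bad data."""
    if len(rows) < min_rows:
        raise ValueError(f"row count {len(rows)} below minimum {min_rows}")
    nulls = sum(1 for r in rows if r.get(key_column) is None)
    null_rate = nulls / len(rows)
    if null_rate > max_null_rate:
        raise ValueError(f"null rate {null_rate:.2%} exceeds {max_null_rate:.2%}")
    return True
```

Anomaly detection extends the same idea: instead of fixed thresholds, the check compares today's row counts or null rates against a historical baseline.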
- Maintain
- Continually improve, optimize and tune their data pipelines for performance, storage & cost efficiency. Simplify existing systems and minimize complexity. Reduce data technical debt.
- Actively deprecate low usage, low quality and legacy data and pipelines.
- Triage and promptly correct data quality bugs. Implement additional data quality checks to ensure issues are detected earlier
- Invest in automation where possible to reduce operational burden
- Foundations, Citizenship & Stewardship
- When building tooling to improve the general DE workflow, contribute to the development of tools that support the company-wide Data Engineering Paved Path rather than creating local one-off solutions
- Contribute to education of other DEs and data consumers on the data they curate for the business
- Be data-driven and influence data-driven decisions. Be factual in communication, use data effectively to tell stories, and be critical of decisions not founded on data
- Actively participate in recruiting, interviewing, mentoring and coaching. Champion the mission of Data Engineering by representing Airbnb at tech talks, blog posts, conferences, data meetups and communities
What Is A Data Engineer Not Responsible For?
- Implementing and ensuring the syntactic and semantic quality of events sent to the offline data ecosystem. This is owned by the online system’s engineering team or the Data Platform team (depending on the ingestion mechanism).
- Developing artifact-specific datasets for reports or models - generally owned by Data Science or Analysts throughout the company
- Developing or maintaining Minerva assets - generally owned by Data Science
- Developing or maintaining reports & dashboards - generally owned by Data Science or Analysts throughout the company
- Developing & maintaining specific ML features - generally owned by Data Science or ML Engineers throughout the company
Skills
Not every Data Engineer will require all of these skills, but we expect most Data Engineers to be strong in a significant number of these skills to be successful at Airbnb.
- Data Product Management
- Effective at building partnerships with business stakeholders, engineers and product to understand use cases from intended data consumers
- Able to create & maintain documentation to support users in understanding how to use tables/columns
- Data Architecture & Data Pipeline Implementation
- Experience creating and evolving dimensional data models & schema designs to structure data for business-relevant analytics. (Ex: familiarity with Kimball's data warehouse lifecycle)
- Strong experience using an ETL framework (ex: Airflow, Flume, Oozie etc.) to build and deploy production-quality ETL pipelines.
- Experience ingesting and transforming structured and unstructured data from internal and third-party sources into dimensional models.
- Experience with dispersal of data to OLTP stores (ex: MySQL, Cassandra, HBase, etc.) and fast analytics solutions (ex: Druid, Elasticsearch etc.).
- Data Systems Design
- Strong understanding of distributed storage and compute (S3, Hive, Spark)
- Knowledge in distributed system design, such as how map-reduce and distributed data processing work at scale
- Basic understanding of OLTP systems like Cassandra, HBase, Mussel, Vitess etc.
- Coding
- Experience building batch data pipelines in Spark
- Expertise in SQL
- General Software Engineering (e.g. proficiency coding in Python, Java, Scala)
- Experience writing data quality unit and functional tests.
- (Optional) Aptitude to learn and utilize data analytics tools to accelerate business needs
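A data quality unit test, as mentioned above, typically pins down the behavior of one transformation step. The sketch below uses Python's standard `unittest`; the deduplication logic is a hypothetical but common pipeline step, not a specific Airbnb implementation:

```python
import unittest

def dedupe_latest(rows):
    """Keep only the latest record per id, by updated_at timestamp.
    A typical merge/dedupe step worth covering with a unit test."""
    latest = {}
    for r in rows:
        key = r["id"]
        if key not in latest or r["updated_at"] > latest[key]["updated_at"]:
            latest[key] = r
    return list(latest.values())

class DedupeLatestTest(unittest.TestCase):
    def test_keeps_latest_version_per_id(self):
        rows = [
            {"id": 1, "updated_at": 1, "status": "pending"},
            {"id": 1, "updated_at": 2, "status": "confirmed"},
        ]
        out = dedupe_latest(rows)
        self.assertEqual(len(out), 1)
        self.assertEqual(out[0]["status"], "confirmed")
```

Functional tests extend this pattern to whole pipeline stages, running the transformation over a small fixture dataset and asserting on the output table.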
- (Optional) Stream Processing:
- Experience building Stream Processing jobs on Apache Flink, Apache Spark Streaming, Apache Samza, Apache Storm or similar streaming analytics technology.
- Experience with messaging systems (ex: Apache Kafka or RabbitMQ etc.)
- Experience designing and implementing distributed and real-time algorithms for stream data processing.
- Understand concepts of schema evolution, sharding, latency etc.
- Good understanding of Lambda Architecture, along with its advantages and drawbacks
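The core of a stream job like the ones described above can be sketched as a windowed aggregation. This toy version uses a tumbling window over an in-memory event list; a real Flink or Spark Streaming job would additionally handle watermarks, late data, schema evolution, and checkpointed state (all names here are illustrative):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Count events per key per tumbling window.

    events: iterable of (timestamp_ms, key) pairs.
    Each event lands in exactly one window, identified by its start time.
    """
    counts = defaultdict(int)
    for ts_ms, key in events:
        window_start = (ts_ms // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1000, "search"), (1500, "search"), (2200, "book")]
out = tumbling_window_counts(events, window_ms=1000)
# Window starting at 1000 ms holds both "search" events;
# window starting at 2000 ms holds the "book" event.
```

The Lambda Architecture mentioned above pairs a job like this (the speed layer) with a batch pipeline that recomputes the same aggregates authoritatively, trading duplicate logic for low-latency results.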