Getting into #dataengineering is actually pretty easy
- learn SQL
- learn Python
- learn Snowflake/BigQuery/Databricks
- learn data modeling
- learn data pipelines with Airflow
If you learn these 5 things, you’ll be interview-ready for a junior position for sure
I created a public GitHub repo with all the resources, books, companies, and social media accounts you should be following to stay current on data engineering topics.
I'm accepting PRs so we can crowdsource this effort!
#dataengineering
If I had to start learning #dataengineering all over again, I’d follow this plan, mostly in order:
- Learn SQL
— Aggregations with GROUP BY
— Joins (INNER, LEFT, FULL OUTER)
— Window functions
— Common table expressions
- Learn about data modeling
— read about data
I worked 2 years each at Meta, Airbnb and Netflix. Their engineering stacks are different and cultures have pros and cons.
- Meta
Stack I used: Hive, Spark, HDFS, Dataswarm, Unidash, Deltoid
Pros:
Tons of motivated people willing to help you
Great social events to make
The data engineer interview has 4-5 pieces:
- the SQL interview
Make sure you know:
Window functions, self-joins, common table expressions and SQL fundamentals
- the data modeling interview
Make sure you know:
Fact data modeling, dimensional data modeling, aggregate tables
I know data engineers who know just Python and SQL who make $500k at Netflix.
You don’t need to know the high performance languages to make a killing as a data engineer!
The best tech for each task:
- batch pipeline: Apache Spark
- data visualization: Apache Superset
- web API: Next.js (Spring Boot a close second)
- SQL database: Postgres
- NoSQL database: DynamoDB
- Graph database: Neo4j
- front end web: React
- front end mobile: React
When I was at Airbnb, I reduced the pricing and availability data sets to 3% their original size!
This removed a few petabytes from the cloud and made Jeff Bezos cry.
How did I do this?
1. I recognized that listing and listing night information should be in one table not
Seven months ago, I decided to leave my big tech job to build something on my own. I was inspired by @thejustinwelsh's solopreneur content and believed I could attain a similar life!
I was making $600k/year at my data engineering job at Airbnb. I made $600k in seven months as an
Data engineering is like taking all the frustrating parts of being a data analyst and combining them with all the frustrating parts of being a software engineer
Every SQL concept you should know to ace data engineering interviews:
- Basics
SELECT, FROM, WHERE, GROUP BY, ORDER BY and HAVING
- Window functions
Know the difference between RANK vs DENSE_RANK vs ROW_NUMBER
Know how PARTITION BY and ORDER BY work in the OVER clause
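Here's a small sketch of the three ranking functions side by side. The scores table, its columns, and the data are made up for illustration, and it assumes the SQLite bundled with your Python is 3.25 or newer (that's when window functions landed).

```python
import sqlite3

# Hypothetical table: the tie at 90 points is what separates the three functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [
        ("ann", "red", 100),
        ("bob", "red", 90),
        ("cat", "red", 90),
        ("dan", "red", 80),
        ("eve", "blue", 70),
    ],
)

# PARTITION BY restarts the numbering per team; ORDER BY decides who comes first.
# ROW_NUMBER gets a tie-breaker (player) so its output is deterministic.
ranking_rows = conn.execute(
    """
    SELECT player,
           RANK()       OVER (PARTITION BY team ORDER BY points DESC)         AS rnk,
           DENSE_RANK() OVER (PARTITION BY team ORDER BY points DESC)         AS dense_rnk,
           ROW_NUMBER() OVER (PARTITION BY team ORDER BY points DESC, player) AS row_num
    FROM scores
    ORDER BY team, row_num
    """
).fetchall()

for row in ranking_rows:
    print(row)
```

For the tied players, RANK gives both a 2 and then skips to 4, DENSE_RANK gives both a 2 and continues with 3, and ROW_NUMBER never repeats a value.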
AI isn't the cause of the tech hiring slow down! There was a law that went into effect in 2022 that updated Section 174 of the tax code.
Here are two scenarios to illustrate this:
- In 2021, you could found a startup and hire an engineer and pay them $100,000. Say your company
I migrated my data engineer handbook to:
This repo has over 7300 stars and all the resources you'd ever need to become an amazing data engineer!
#dataengineering
SQL interviews are common in data engineering. They’re even more common in big tech.
I wrote an article today revealing everything I know about them in my nine years of data engineering experience!
Link in my bio since Elon would bury it otherwise!
#dataengineering
Breaking into data engineering can be 100% free and 100% project-based!
Here are the steps:
- find a REST API you like as a data source. Maybe stocks, sports games, Pokémon, etc.
- learn Python to build a short script that reads that REST API and initially dumps to a CSV
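A minimal sketch of that first script, assuming the API returns a JSON array of flat objects. The function names and the sample payload are placeholders I made up, and the URL you pass to fetch_records is whatever API you picked.

```python
import csv
import io
import json
from urllib.request import urlopen


def fetch_records(url):
    """Fetch a JSON array of flat objects from a REST endpoint (url is your choice)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def records_to_csv(records, out_file):
    """Write a list of flat dicts to CSV, using the union of all keys as the header."""
    fieldnames = sorted({key for record in records for key in record})
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return len(records)


# Demo with an in-memory payload instead of a live call, so it runs offline.
sample = [
    {"name": "pikachu", "type": "electric"},
    {"name": "onix", "type": "rock", "weight": 210},
]
buffer = io.StringIO()
rows_written = records_to_csv(sample, buffer)
print(buffer.getvalue())
```

For a real run you would swap the sample list for fetch_records(your_url) and pass an open file, for example open("dump.csv", "w", newline=""), instead of the StringIO buffer.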
You don’t need to avoid COUNT(*) in your SQL. It doesn’t select all the columns. It just counts rows, and modern engines treat it the same as COUNT(1). Use either one for a basic row count, and COUNT(column) when you want the count of non-NULL values in a specific column.
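A quick way to check how the three forms behave is to run them against a table with some NULLs. This sketch uses SQLite with a made-up events table. Other engines may plan the query differently, but the results follow standard SQL semantics.

```python
import sqlite3

# Hypothetical table: two of the four rows have a NULL referrer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, referrer TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "ads"), (2, None), (3, "email"), (4, None)],
)

# COUNT(*) and COUNT(1) count rows; COUNT(referrer) counts non-NULL referrers.
count_star, count_one, count_referrer = conn.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(referrer) FROM events"
).fetchone()

print(count_star, count_one, count_referrer)
```

COUNT(*) and COUNT(1) come back identical, while COUNT(referrer) is smaller because it skips the NULLs.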
#dataengineering
Breaking into data engineering can be very confusing!
Should I learn Spark or Snowflake? Python or Scala? Airflow or Argo? Flink or Spark Streaming? AWS or GCP? Superset or Tableau?
Fundamentals are more important than technologies:
- understanding distributed
For the next week only, I’m removing the paywall on my data engineering interview articles.
I wrote four in depth articles on passing the following four big tech interviews:
- data structures and algos
- data modeling
- data architecture
- SQL
Link in bio since Elon would
SQL is deceptively complex. The order in which things apply isn’t that intuitive and can be frustrating when debugging queries.
Let’s talk about the ordering of a query and when each step is executed.
Here’s the query we’ll deconstruct:
SELECT
city,
SUM(weed_smoked) as
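The query in the post is cut off above, so here is a separate sketch on a hypothetical orders table. The numbered comments follow the standard logical evaluation order (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY), which is not the order you type the clauses in. Real engines are free to reorder physically as long as the result is the same.

```python
import sqlite3

# Hypothetical data, chosen so that every clause visibly changes the result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("Austin", 5), ("Austin", 50),
        ("Boston", 40), ("Boston", 45),
        ("Denver", 500), ("Denver", 1),
    ],
)

ordered_rows = conn.execute(
    """
    SELECT city, SUM(amount) AS total   -- 5. compute the output columns
    FROM orders                         -- 1. pick the source rows
    WHERE amount >= 5                   -- 2. filter individual rows (drops Denver's 1)
    GROUP BY city                       -- 3. collapse rows into one group per city
    HAVING SUM(amount) > 60             -- 4. filter whole groups (drops Austin at 55)
    ORDER BY total DESC                 -- 6. sort what is left
    """
).fetchall()

print(ordered_rows)
```

This is why an aggregate condition belongs in HAVING rather than WHERE: at the WHERE step the groups don't exist yet.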
I wrote a new article on passing data engineering data structures and algorithms interviews!
I cover:
- how to prepare for the interviews
- what to do on the day of the interview
- the exact leetcode questions I’ve seen in my career and more!
Check out the link in my bio
When I worked at Netflix, I built pipelines that processed over 2000 terabytes per day. Data pipelines play by different rules when you get to this scale.
I go into more detail here in this 2 min YouTube video you should check out!
#dataengineering
Don’t stop at SQL and Python when learning #dataengineering
Add:
- distributed computation
- data modeling
- Bash/Docker/DevOps
- a statically typed language like Java
you’ll make a lot more money if you do this
Data engineers come in a few levels:
- level 1
Knows Python and SQL. Can move data from point A to point B so long as it’s not too big
- level 2
Knows distributed compute basics like BigQuery and Spark. Can move data around on the order of single terabytes
- level 3
How I went from junior data engineer (L3) at Facebook to staff data engineer (L6) at Airbnb in 4 years.
- I got hired at Facebook in 2016 as a junior data engineer. I had 2 years of experience and I realized that I probably got hired at the wrong level. (1/13)
Python, SQL and Airflow will get you to $125k as a data engineer.
If you want more, you’ll need to adopt a software engineering mindset.
- how do you make these pipelines scalable to arbitrary sizes of data?
- how do you make data sets that are adaptable to inevitable
The S tier data engineering stack is:
- S3 and Apache Iceberg for storage
- Spark and Flink for compute
- Airflow or Mage or Prefect for orchestration
- Great Expectations for data quality
- Druid for fast columnar storage for dashboards
- AWS as the cloud platform
What’s
Quick guide to go from 0 to #dataengineering hero:
- learn SQL
Data Lemur is a great resource here
- learn Python
Do like… 30-40 leetcode easy and medium questions
- distributed compute
Get a trial of Databricks or Snowflake and find a training to learn about it
1/3
Starting out in the data field can be overwhelming. Should you be a data scientist? A data engineer? A data analyst? An ML engineer? The number of role options is overwhelming!
Here's some high-level guidance on how to pick between some of these roles.
1/5
You should pick SQL over Python for all pipelines that can use it!
Here’s why:
- SQL pipelines are going to be closer to the database and more likely to be optimized by default
- SQL is the common denominator language of data professionals allowing analysts to more easily
Fundamental concepts every data engineer should know because they don’t really change
- ANSI SQL
- distributed compute
- OLTP vs OLAP
- CAP theorem
- slowly-changing dimensional modeling
- fact data modeling
- logging best practices
- Avro / Thrift schemas
- idempotent
By changing the sort order of one of my Parquet tables at Airbnb, I was able to reduce its size from 35 GB to 1 GB! Since there are 365 partitions of this data, it goes from about 12.8 TB of data (35 GB x 365) to under 0.4 TB.
Remember when sorting your Parquet data that you should start with
Here is a picture of how my resume transformed between 2014 and 2023.
You'll see I didn't even list SQL or Python on my 2014 resume!
You're allowed to change your mind on the trajectory and direction of your technical career! I realized I didn't like mobile app development
3 months ago, I created a public GitHub repo with all the resources, books, companies, and social media accounts you should be following to stay current on data engineering topics.
This repo has ~6k stars now!
I'm still accepting PRs so we can crowdsource this effort and make
Data engineers often become bored of data engineering!
After a while of SQL + Python + Airflow, you start thinking all pipelines are the same and it’s copy and paste work.
Some strategies to help with this:
- become more end-to-end
Maybe that means building a dashboard. Maybe
Data products are what is going to elevate data engineering into the stratosphere!
They power everything you could imagine in the big tech companies!
- At Airbnb, I worked on a data product that helped detect "bad hosts" to increase guest satisfaction
- At Netflix, I worked
Most companies need the following data roles:
- Data engineer for master data management
- Data scientist for model development and experimentation
- Analytics engineer for KPI development and visualization
- Machine learning engineer for model development, deployment, monitor
Top 4 reasons why data engineering is the best data profession:
1. highest pay for the least education
Machine learning engineers and data scientists make 10-15% more but spend 30% more time in college. Data analysts make less than data engineers but require less schooling.
Picking the right storage technology depends on a lot of factors!
Picking the wrong one will always result in pain and migrations down the line!
These constraints are around:
- latency
Low latency is dominated by queues and caches. Data access in those data structures is
Window functions are critical in SQL interviews. Here's every piece dissected.
An example query for the question "Give me the rolling 30-day sum of revenue by department"
SELECT SUM(revenue) OVER (PARTITION BY department ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)
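A runnable sketch of that rolling sum, with two assumptions called out: the daily_revenue table and its data are made up, and there is exactly one row per department per day. A ROWS frame counts rows, not calendar days, so a 30-row window is 29 PRECEDING plus the current row. It needs SQLite 3.25 or newer for window functions.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical table: 40 consecutive days for one department, revenue of 1 per day,
# so the rolling sum equals the number of rows inside the window.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (department TEXT, date TEXT, revenue INTEGER)")
start = date(2024, 1, 1)
conn.executemany(
    "INSERT INTO daily_revenue VALUES (?, ?, ?)",
    [("toys", (start + timedelta(days=i)).isoformat(), 1) for i in range(40)],
)

rolling = conn.execute(
    """
    SELECT date,
           SUM(revenue) OVER (
               PARTITION BY department                     -- restart per department
               ORDER BY date                               -- walk forward in time
               ROWS BETWEEN 29 PRECEDING AND CURRENT ROW   -- 30 rows in total
           ) AS revenue_30d
    FROM daily_revenue
    ORDER BY date
    """
).fetchall()

print(rolling[0], rolling[29], rolling[-1])
```

The sum grows by one per day until the window is full and then stays flat at 30. If days can be missing, a row-based frame silently reaches further back in time, so you would need to fill the gaps first or use a date-range join instead.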
Data engineering has many "this or that" questions
- Python or Scala?
If you don't know either, start with Python. If you want to transition to the software/data engineer archetype, pick up Scala later.
- Streaming or Batch?
A vast majority of data engineering jobs are batch
The data engineer journey has a few levels:
- level 1
Am I an analyst or a data engineer?
At this level you’re probably doing a mixture of pipeline work and reporting. You like pipeline work more.
- level 2
Why are pipelines so complicated?
Here you learn about
Do you want to get better at data engineering?
Here are some free YouTube videos you should watch:
Data Modeling 100TBs to 5 TBs:
Data Lake fundamentals (Iceberg and Parquet):
Dimensional Data Modeling:
Level 1 data engineers: I use SQL
Level 2 data engineers: SQL is hard to test, you need TDD in your pipelines, data frames only!
Level 3 data engineers: I use SQL and dbt
#dataengineering
Data analysts don’t need to learn that much more SQL to become data engineers!
Data analysts have a mastery of the SELECT query! This is 80% of data engineering SQL tools!
Adding in a few other SQL commands will make it much easier to go from data analyst to data engineer!
-
I nearly tripled my salary in a year by transitioning from data analyst to data engineer!
I started my career as a data analyst in 2014 making $30k.
I decided I needed to upskill more.
I learned Linux, Hadoop fundamentals, Java MapReduce, and got more depth in my software
When I was in my early 20s, I believed that making $250k was going to be my "late career" earnings.
This belief changed in 2017 after working at Facebook for a year.
Working for a year with people whose parents paid $250k+ for their college made me realize that either:
Data engineering != data science != software engineering
So many companies have data engineers writing REST APIs, data scientists building pipelines and software engineers building models.
Hire your specialists for their special skills.
Don’t push them into inefficient
Some people have been asking for sample lectures from the boot camp content. Here's the very first data modeling lecture at full length to give you an idea if the boot camp is for you or not!
I hope you enjoy the 48 minutes of data engineering bliss!
Job requirements are mostly wishlists.
I applied to a staff data engineer role at Airbnb that required 10+ years of experience when I had 6 years of experience.
I got the job though!
Apply to jobs you don’t think you’re ready for! You might surprise yourself!
The data architecture interview is often the thing that stands between you and a fancy senior+ data engineering role in big tech!
I wrote a newsletter article covering the pieces that you need to remember to excel in these interviews!
Link in the bio since Elon would downrank
Slow ETL slaps data engineers on a daily basis!
If you want to speed up your ETL 10x, try these things out:
1. Cumulatively build your dimensions
Facebook keeps track of 30 days of user activity in an array. This makes calculating monthly active users much easier! You no
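To make the idea concrete, here is a toy sketch of a cumulative activity array. It is not Facebook's implementation. The function names, the dict standing in for yesterday's table, and the newest-first array layout are all assumptions for illustration.

```python
WINDOW_DAYS = 30


def roll_forward(history, active_today):
    """Put today's flag first and drop anything older than the window."""
    return ([1 if active_today else 0] + list(history))[:WINDOW_DAYS]


def build_today(yesterday, active_users):
    """Full-outer-join yesterday's cumulative table with today's active users."""
    today = {}
    for user in set(yesterday) | set(active_users):
        today[user] = roll_forward(yesterday.get(user, []), user in active_users)
    return today


def monthly_active_users(table):
    """A user is monthly active if any flag in their array is set."""
    return sum(1 for history in table.values() if any(history))


# Day 1: users a and b are active. Day 2: only a is active.
table = build_today({}, {"a", "b"})
table = build_today(table, {"a"})
mau_after_two_days = monthly_active_users(table)
print(table, mau_after_two_days)
```

Each daily run only reads yesterday's output plus today's activity, so monthly active users becomes a scan of one partition rather than a 30-day aggregation over raw events.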
What people think breaking into data engineering looks like:
- processing hundreds of terabytes at scale
- mastering Spark, Iceberg, Airflow
- knowing everything about data lakes and data architecture
- burning thousands of dollars on AWS compute just to get a job
What breaking
Here's what the average data engineering interview looks like in 2024:
- 1 hour algorithms in Python
Here you will be asked irrelevant questions about dynamic programming, linked lists, and inverting trees
- 1 hour SQL
Here you will be asked niche questions about recursive CTEs
Data engineers with strong software engineering skills will be in very high demand for the next 5 years!
Building end-to-end data products and not just data pipelines will unlock outsized value for companies!
Data products are full stack so DEs should upskill here: 1/2
After you’ve been in data for a while you realize tooling doesn’t matter that much!
- whether it’s Snowflake vs BigQuery vs Spark
It’s all distributed compute underneath the hood.
- whether it’s Airflow vs Prefect vs Mage vs Dagster
It’s all cron underneath the hood
-
My bold 5-year predictions about #dataengineering
- Streaming data eng jobs account for 15-20% of all data eng jobs, but pay the most
- Rust becomes a mainstream data engineering infrastructure language like Scala
- Spark starts looking like Hive does now
- Data engineers
When I was 17, I ran away from home and ultimately got tackled by my 300 pound step dad.
He screamed at me, “Zach you’re a drug addict!”
My journey since then has been kind of crazy.
I spent 17-22 lost. Going in and out of rehabs and feeling dejected and anxious. My one
When I worked at Netflix, I built a graph database that had over 40 different vertex types and 50 different edge types!
This extreme variety of data needs to be handled with care! I wrote a detailed blog post about everything you should consider here:
Distributed SQL is not the same as regular SQL!
These keywords cause shuffling in distributed environments:
- GROUP BY
- JOIN
- ORDER BY
- PARTITION BY
These keywords behave mostly the same everywhere:
- WHERE
- HAVING
- FROM
- SELECT
You’ll notice the word “BY”
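A toy model of why that is, in plain Python rather than a real engine. The lists stand in for partitions on different machines, and the function names and hashing choice are my own assumptions. A WHERE filter can run on each partition independently, but a GROUP BY first has to move rows so that equal keys end up together.

```python
import zlib


def local_filter(partitions, predicate):
    """WHERE: every partition filters its own rows, nothing moves."""
    return [[row for row in rows if predicate(row)] for rows in partitions]


def shuffle_by_key(partitions, key_fn, num_partitions):
    """Redistribute rows so all rows with the same key land in the same partition."""
    shuffled = [[] for _ in range(num_partitions)]
    moved = 0
    for source_index, rows in enumerate(partitions):
        for row in rows:
            # crc32 gives a hash that is stable across runs, unlike built-in hash().
            target = zlib.crc32(str(key_fn(row)).encode()) % num_partitions
            shuffled[target].append(row)
            if target != source_index:
                moved += 1  # this row had to cross the "network"
    return shuffled, moved


def grouped_sum(partitions, key_fn, value_fn):
    """GROUP BY key with SUM(value): shuffle first, then aggregate per partition."""
    shuffled, moved = shuffle_by_key(partitions, key_fn, len(partitions))
    totals = {}
    for rows in shuffled:
        for row in rows:
            totals[key_fn(row)] = totals.get(key_fn(row), 0) + value_fn(row)
    return totals, moved


# Two hypothetical partitions of (city, amount) rows.
partitions = [[("NYC", 10), ("SF", 5)], [("NYC", 20), ("SF", 7), ("LA", 1)]]
filtered = local_filter(partitions, lambda row: row[1] > 1)
totals, rows_moved = grouped_sum(filtered, lambda row: row[0], lambda row: row[1])
print(totals, rows_moved)
```

Real engines soften this with tricks like partial aggregation before the shuffle, but the data movement is still the expensive part.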
Refusing to grow beyond SQL and Python will limit your career growth as a data engineer!
Growing in the following areas will get you more money:
- data modeling
Knowing when to use cumulative table design to model your dimensions is critical.
Knowing how to efficiently model
Mid-level engineers often fall into the trap that doing more gets you promoted faster!
This bias sounds correct, though. Senior engineers write more code, and that’s why they’re senior, right?
I remember at Facebook I fell into this trap.
I became the main DE owning notifications,
Getting a big tech data engineer job in 2016:
- do you know SQL?
- yes
- here’s $500k
Getting a big tech data engineer job in 2024:
- do you know Spark, Kafka, Iceberg?
- yes
- did you shake hands with Bill Inmon when he invented the data warehouse?
- no
- rejected and
Data modeling has evolved beyond Kimball’s book
Here’s why:
- Kimball modeling didn’t think about distributed compute environments or large scale data
- Splitting everything up into tables that can’t be broadcast JOIN’d in Spark is expensive.
- Doing JOINs with extremely
Top five skills to break into data engineering:
- data modeling
Dimensional data modeling - what analysts use
Relational data modeling - what software engineers use
One Big Table data modeling - a new cutting edge way that is sometimes appropriate
- distributed compute
The
Data engineering SQL interviews always have a silly RANK question. Should you use RANK, DENSE_RANK, or ROW_NUMBER? Here’s a refresher!
For more free data engineering interview content, subscribe to my blog:
#dataengineering
Every engineer has one of two tech stacks:
- stack one:
MacBook, Discord, AWS, JavaScript, React, Jenkins, GitHub, FaceTime
- stack two:
Windows, Slack, Azure, Python, Vue, GitHub Actions, GitHub, Zoom
Which stack are you?
Data analytics is going to become more "Kafka-first" for a variety of reasons
- Relying on a data engineer to ETL the data is a bottleneck that a lot of companies don't want to worry about
- Technologies like Apache Pinot sit on top of Kafka and enable real-time analytics
Breaking into data engineering can feel complicated and overwhelming!
You need to learn the languages of the trade: SQL and Python.
You need to learn the tools of the trade: Spark, BigQuery, Airflow, Databricks, etc.
Then you need to show that you actually know this stuff!
I go
Once you’ve been in analytics long enough you realize there’s only like… 6 patterns
- Aggregation
Count things by other things
- Experimentation / Segmentation
Split people into groups and test product changes
- Accumulation vs Derivative
Think rolling sum or YoY
If you use Excel for data analytics, you’re a data analyst. You don’t have to know SQL and Python.
Don’t belittle others for using tools that are different from yours! It’s very impressive how far business can go with just Excel.
Data engineering interviews are frustrating because:
- some treat DE like software eng and give you ridiculous data structures and algorithms questions
- some treat DE like analytics eng and expect extremely in depth knowledge of dbt and metrics
- some treat DE like being a
The perfect data engineering portfolio project has the following things:
- a data modeling diagram
This shows you know how to build usable data tables.
- a live visualization people can view from the web
This is probably the thing people will look at and share. Without this
I intentionally don’t monetize my long form YouTube videos so y’all can have the best learning experience even if you can’t afford YouTube Premium!
Here are my best ad-free hits:
Data Lakes, Apache Iceberg and Parquet compression in 60 minutes:
I turn 29 + 1 today at 9:02 PM Pacific.
As I desperately cling to my 20s, here are 29 + 1 things I’ve learned during my time on this planet that have led to success
1. Always ask questions! The stupider the question the better!
2. Don’t ask what’s the least I can do. Ask
For the holidays, I'm offering ten full-ride scholarships to V4 boot camp. If you get selected, you'll get immediate access to V3 material and get a free seat in the V4 boot camp in the spring!
Here's the link to apply for the scholarship:
The data modeling round in big tech interviews weeds out the DEs who can't solve vague business problems!
I wrote a free article about everything you need to know to pass these interviews!
Link in my bio since Elon would downrank otherwise!
#dataengineering
Data engineering compensation can get kind of crazy as you climb the ladder in big tech!
- junior DEs usually make $180-200k
- mid-level makes $250-275k
- senior makes $300-350k
- staff makes $500-600k
Climbing the ladder is definitely worth it!
#dataengineering