Let me paint you a picture.
It’s 3am. Your on-call alert fires. A critical dashboard that the executive team reviews every morning is showing stale data. You log in, trace the issue, and find that your Spark job silently died four hours ago because a single executor ran out of memory on a skewed join.
The fix takes ten minutes. The damage — missed SLAs, a frantic Slack thread, a post-mortem doc — takes much longer to clean up.
This is not a story about bad engineers. It’s a story about what happens when good engineers build for the happy path and forget to design for the real world.
Advanced data engineering with Spark, dbt, and Airflow isn’t about knowing more syntax. It’s about knowing where each tool breaks — and building systems that survive anyway.
Apache Spark: The Hidden Cost of Data Skew
Spark is incredibly powerful, but it has one Achilles heel that catches even experienced engineers off guard: data skew.
When you’re joining two large datasets and one key appears disproportionately often (say, a NULL value, or a single mega-customer ID), Spark will assign all that work to one partition — and one executor. While other executors finish quickly and sit idle, the overloaded one struggles. Your job hangs. Eventually it times out or runs out of memory.
How to diagnose it
Open the Spark UI and look at the task durations for your shuffle stages. In the stage's summary metrics, compare the max task duration to the median: if one task takes 10x longer than the rest, you have skew.
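You can also check for hot keys in code before the join even runs. A minimal sketch, assuming the df and join_key used in the fixes below; what counts as "suspiciously heavy" is a judgment call for your data:

# Count rows per join key and surface the heaviest hitters
key_counts = df.groupBy("join_key").count().orderBy("count", ascending=False)
key_counts.show(10)  # a few keys dwarfing all the others means skew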
How to fix it
Option 1: Repartition before the join. This helps when the skew is spread across many moderately hot keys; a single mega-key will still hash to one partition, which is what Option 2 is for.
# Spreads the shuffle across more partitions
df = df.repartition(200, "join_key")
Option 2: Salting for severely skewed keys:
from pyspark.sql.functions import array, col, concat, explode, lit, rand

# Add a random salt (0-9) to the skewed dataset
df_skewed = df_skewed.withColumn("salt", (rand() * 10).cast("int"))
df_skewed = df_skewed.withColumn("salted_key", concat(col("join_key"), lit("_"), col("salt")))
# Replicate each row on the other side once per salt value
df_other = df_other.withColumn("salt", explode(array([lit(i) for i in range(10)])))
df_other = df_other.withColumn("salted_key", concat(col("join_key"), lit("_"), col("salt")))
# Join on the salted key
result = df_skewed.join(df_other, "salted_key")
Salting artificially distributes the hot key across multiple partitions. It’s not elegant, but it works.
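If you're on Spark 3.x, it's also worth knowing that adaptive query execution can handle much of this for you: AQE detects skewed shuffle partitions at runtime and splits them automatically. A minimal sketch of the relevant settings (both default to enabled in recent Spark versions):

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

Salting remains useful when AQE isn't available or doesn't catch your particular skew pattern.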
dbt: Test First, Transform Second
Most engineers treat dbt tests as an afterthought — something you bolt on after the models are working. Senior engineers do the opposite: they write the tests first.
Before you write a single line of SQL in a dbt model, ask yourself: What should always be true about the output of this model? Which columns should never be null? Which columns should always be unique? What referential integrity should exist between tables?
Then codify those assumptions in your schema.yml before you write the model.
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
When your dbt tests run in CI before every merge, bad data assumptions get caught in development — not at 3am in production.
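Wiring that into CI can be as small as one workflow file. A minimal sketch using GitHub Actions; the adapter and target are placeholders for whatever your project actually uses, and it assumes a profiles.yml is available to the CI environment:

# .github/workflows/dbt-ci.yml
name: dbt CI
on: pull_request
jobs:
  dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-core dbt-postgres  # swap in your adapter
      - run: dbt deps
      - run: dbt build --target ci  # builds models and runs tests together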
Advanced: Custom generic tests
{% test assert_positive_value(model, column_name) %}
select {{ column_name }}
from {{ model }}
where {{ column_name }} <= 0
{% endtest %}
Now you can apply assert_positive_value to any revenue or quantity column across all your models.
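Applying it looks just like a built-in test. A sketch against the fct_orders model above (the amount column is hypothetical):

models:
  - name: fct_orders
    columns:
      - name: amount
        tests:
          - assert_positive_value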
Airflow: Build for Late Data, Not Just Scheduled Data
Airflow’s default mental model is simple: a DAG runs at a scheduled time, does its work, and completes. But production data is rarely that cooperative.
APIs go down. Upstream jobs run late. Files arrive in S3 hours after they should. If your DAG assumes the data will be there at 6am because it’s usually there at 6am, you’re one bad day away from a failure.
Use sensors instead of assuming
# Airflow 2.x: the sensor lives in the Amazon provider package
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id='wait_for_daily_export',
    bucket_name='my-data-bucket',
    bucket_key='exports/{{ ds }}/orders.parquet',
    poke_interval=300,  # check every 5 minutes
    timeout=7200,  # fail after 2 hours
    mode='reschedule',  # don't hold a worker slot while waiting
    dag=dag
)
Setting mode='reschedule' is critical: it releases the worker slot between pokes so you don't exhaust your Airflow worker pool waiting for files.
Set SLAs on your critical DAGs
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    'daily_revenue_pipeline',
    schedule_interval='0 6 * * *',
    start_date=datetime(2024, 1, 1),  # required; pick your pipeline's epoch
    sla_miss_callback=notify_on_slack,
    default_args={
        'sla': timedelta(hours=2),
        'retries': 3,
        'retry_delay': timedelta(minutes=5)
    }
)
SLA misses trigger a callback — so you find out about slow jobs before your stakeholders do.
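The notify_on_slack callback above is yours to write. A minimal sketch, assuming a Slack incoming-webhook URL stored in an environment variable (SLACK_WEBHOOK_URL is a placeholder name):

import os
import requests

def notify_on_slack(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Airflow passes in the DAG and details of the missed SLAs
    text = f":warning: SLA missed in {dag.dag_id}: {task_list}"
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": text})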
Make every DAG idempotent
This is the rule that saves you during re-runs. Every DAG run should produce the same result whether it runs once or ten times. No duplicates, no partial loads.
Use INSERT OVERWRITE instead of INSERT INTO. Use date-partitioned tables and overwrite the partition. Use MERGE statements for slowly changing dimensions. Build idempotency in from day one, not as a fix after your first duplicate incident.
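As a concrete example, here's what an idempotent daily load can look like in Spark SQL, assuming a Hive-style table partitioned by ds (the table and column names are hypothetical). Re-running the same day replaces the partition instead of appending to it:

-- Overwrites only the {{ ds }} partition, so re-runs are safe
INSERT OVERWRITE TABLE analytics.fct_orders PARTITION (ds = '{{ ds }}')
SELECT order_id, customer_id, status, amount
FROM staging.orders
WHERE order_date = '{{ ds }}'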
Bringing It Together: The Senior Data Engineer Mindset
The tools — Spark, dbt, Airflow — are just tools. What makes a senior data engineer is the mindset they bring to using them.
Design for failure. Ask “what happens when this breaks?” before “will this work?”
Validate at every layer. dbt tests on the transformation layer. Spark data quality checks on the ingestion layer. Airflow sensors on the orchestration layer.
Instrument everything. Logs, metrics, SLA alerts. If you can’t observe it, you can’t fix it at 3am.
Build for re-runs. Idempotent pipelines mean that when something goes wrong (and it will), the recovery is a single button press.
The pipelines that survive in production aren’t the cleverest ones. They’re the ones built by engineers who assumed things would go wrong — and planned accordingly.
What’s one pattern from this list that you’re going to implement this week? Let me know in the comments below.
— Pushpjeet Cholkar, Data Engineer