Every data engineer eventually lands on the same question: “When do I use Spark vs dbt vs Airflow?”
If you’ve asked yourself this, you’re not alone. These three tools form the backbone of a modern data stack — but the confusion about when to use which one leads to some of the messiest pipeline architectures I’ve ever seen.
In this post, I’m going to break down each tool’s role, show you where they overlap (and where they absolutely don’t), and walk you through a practical architecture pattern that actually scales.
The Short Answer (Before We Dive In)
- Spark = distributed compute for large-scale data processing
- dbt = SQL-based transformation layer inside your data warehouse
- Airflow = orchestrator that schedules and monitors jobs
They’re not competitors. They’re teammates. The trick is giving each one the right job.
Apache Spark: Your Heavy-Lifting Engine
Apache Spark is a distributed computing framework designed to process massive amounts of data fast. We’re talking terabytes or petabytes, spread across a cluster of machines working in parallel.
When should you reach for Spark?
- Raw, unstructured data arriving from Kafka, S3, or HDFS.
- Data volumes that make single-machine processing impractical.
- Complex transformations that need to happen before data hits your warehouse.
- Streaming ingestion running alongside batch processing.
Spark is excellent at the ingestion and raw processing phase. It can read from almost any source, apply heavy transformations in PySpark or Scala, and write results to your data lake or warehouse.
What Spark is NOT: a scheduler, an orchestrator, or a transformation layer inside your warehouse. Using Spark to run light transformations on structured warehouse data is overkill — that’s dbt’s territory.
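To make the ingestion phase concrete, here's a minimal PySpark sketch. Everything specific in it — the bucket paths, the event schema, the column names — is a hypothetical stand-in, not from a real pipeline:

```python
# Hypothetical daily ingestion job: paths and columns are
# illustrative assumptions, adapt to your own lake layout.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_ingest").getOrCreate()

# Read raw JSON events from the lake (path is an assumption)
events = spark.read.json("s3://raw-bucket/events/2024-01-01/")

# Heavy, row-level cleanup before the data ever reaches the warehouse
cleaned = (
    events
    .filter(F.col("event_type").isNotNull())
    .withColumn("event_date", F.to_date("event_ts"))
    .dropDuplicates(["event_id"])
)

# Land the result as partitioned Parquet for the warehouse to pick up
cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://lake-bucket/events_clean/"
)
```

Notice the job does exactly one thing: raw data in, clean data out. No scheduling, no warehouse modeling — those belong to the other two tools.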
dbt: The Transformation Layer Your SQL Deserves
dbt (data build tool) changed how data engineers think about transformations. Instead of scattered SQL scripts with names like final_v3_FINAL.sql, dbt gives you a structured, version-controlled, testable transformation framework.
Here’s what makes dbt powerful:
- Modularity: write reusable SQL models that reference each other.
- Testing: define schema tests (not null, unique, accepted values) that run automatically.
- Documentation: auto-generate a data catalog from your models.
- Lineage: visualize how data flows from source to final table.
dbt runs inside your warehouse — Snowflake, BigQuery, Redshift, Databricks. It doesn’t move data; it transforms data that’s already there.
A Quick dbt Example
-- models/marts/fact_orders.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
)

SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    o.total_amount
FROM orders o
LEFT JOIN customers c
    ON o.customer_id = c.customer_id
That ref() function is dbt magic — it builds the dependency graph automatically, so dbt knows to run stg_orders and stg_customers before fact_orders.
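To see why that matters, here's a small, self-contained Python illustration — not dbt's internals, just the same idea — of how a graph of ref() dependencies yields a safe run order, using the model names from the example above (the raw_* source names are hypothetical):

```python
# Illustration of dependency-driven run order, the concept
# behind dbt's ref(). Maps each model to its upstream refs.
from graphlib import TopologicalSorter

deps = {
    "stg_orders": {"raw_orders"},          # staging models read raw sources
    "stg_customers": {"raw_customers"},
    "fact_orders": {"stg_orders", "stg_customers"},  # the ref()s above
}

# static_order() guarantees every model runs after its dependencies
run_order = list(TopologicalSorter(deps).static_order())
print(run_order)
```

You never write this ordering by hand in dbt — every ref() call adds an edge to the graph, and dbt derives the execution order for you.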
What dbt is NOT: a job scheduler, a data ingestion tool, or a substitute for Spark on large raw datasets.
Apache Airflow: The Conductor of Your Pipeline
Airflow is a workflow orchestration platform. Its job is simple but critical: run the right jobs, in the right order, at the right time — and tell you when something goes wrong.
You define workflows as DAGs (Directed Acyclic Graphs) in Python. A typical daily DAG looks like this: Spark ingests raw data → dbt transforms it → dbt tests validate it. Clean, readable, and version-controlled.
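Here's a minimal sketch of that daily DAG, assuming Airflow 2.x with the Spark provider installed — the task IDs, application path, and bash commands are illustrative, not a production config:

```python
# Sketch of the daily DAG described above; names and paths are assumptions.
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="daily_pipeline",
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    # Heavy lifting runs on the Spark cluster, not the Airflow worker
    ingest = SparkSubmitOperator(
        task_id="spark_ingest",
        application="/jobs/daily_ingest.py",
    )

    # dbt runs inside the warehouse; Airflow just triggers the CLI
    transform = BashOperator(task_id="dbt_run", bash_command="dbt run")
    validate = BashOperator(task_id="dbt_test", bash_command="dbt test")

    ingest >> transform >> validate
```

Note how thin every task is: each one hands work off to a system built for it and then waits for the result.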
The #1 Airflow mistake I see: Running heavy data processing logic inside Airflow operators. PythonOperators with 10,000-row Pandas loops, inline SQL queries that run for hours — this kills your Airflow workers. Airflow schedules work. It doesn’t do the heavy work itself.
The Architecture Pattern That Works
Here’s the pattern I’ve used on production pipelines handling hundreds of millions of rows daily: Airflow triggers a daily DAG → Spark ingests raw data to the data lake → Airflow triggers dbt → dbt transforms inside the warehouse → dbt tests validate data quality → BI tools and downstream consumers read clean data.
Each layer has one responsibility. Airflow handles scheduling and monitoring, Spark handles scale, dbt handles structured transformations. When something breaks, you know exactly where to look.
Common Mistakes to Avoid
1. Running Pandas in Airflow operators. Heavy compute belongs in Spark, not inside Airflow. If your DAG tasks take more than a few minutes, move the logic to a Spark job and trigger it from Airflow.
2. Using dbt for raw data ingestion. dbt reads from what’s already in your warehouse. It doesn’t pull from APIs, Kafka, or flat files. Use Spark, Fivetran, or a custom ingestion job for that.
3. Treating Spark as a scheduler. Spark has no built-in job scheduling or cross-job dependency management. You need an orchestrator like Airflow to coordinate when Spark jobs run and what happens when they fail.
4. No dbt tests. If you’re not running dbt test, you’re flying blind. Schema tests catch broken pipelines before your stakeholders do.
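For the fact_orders model from earlier, a minimal schema file with tests might look like this — a sketch, with the relationships target and column names assumed from the example:

```yaml
# models/marts/schema.yml — illustrative column tests
version: 2
models:
  - name: fact_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

A single dbt test run against this file catches duplicate orders, null keys, and orphaned customer references before anyone downstream does.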
Wrapping Up
Spark, dbt, and Airflow are genuinely complementary. Once you understand each tool’s lane, using them together feels natural — and your pipelines become dramatically more maintainable.
The key mental model: Airflow is the conductor. Spark is the muscle. dbt is the translator.
Give each tool its role and stay disciplined about not crossing the lanes. Have questions about your specific setup? Drop a comment below — I read every one of them.
— Pushpjeet Cholkar, Data Engineer