5 Silent Killers of Production Data Pipelines (And How to Fix Them)

I’ve seen pipelines fail in the most dramatic ways.

Not during development. Not during testing. In production. At 2 AM. Right before a stakeholder demo.

And almost every time, the root cause wasn’t bad code. It was a bad assumption — one that was quietly baked into the design and never questioned.

If you build data pipelines professionally, this post is for you. Let’s walk through five assumptions that silently kill pipelines, and more importantly, how to fix them before they become your emergency.


1. “The Source Schema Will Never Change”

This is the most common assumption I see — and the most dangerous.

You build an ingestion layer that reads from a MySQL table or a REST API. It works perfectly in dev and staging. You deploy to prod and call it a day.

Then three weeks later, the backend team renames a column. Or adds a NOT NULL constraint. Or changes a field from varchar to int. Nobody sends a Slack message. Nobody files a ticket. And your pipeline silently breaks — or worse, starts producing wrong results.

The fix: Validate schema at ingestion time, not just at build time. Tools like Great Expectations, Soda Core, or even a simple Python schema check can catch drift early. Make schema validation a first-class citizen of your pipeline, not an afterthought.

# Simple example: validate expected columns exist before processing
expected_columns = {"user_id", "event_type", "timestamp", "session_id"}
actual_columns = set(df.columns)

if not expected_columns.issubset(actual_columns):
    missing = expected_columns - actual_columns
    raise ValueError(f"Schema drift detected. Missing columns: {missing}")

2. “NULL Means Missing Data”

This one is subtle but important. NULL in a database can mean many different things: the data was never collected, the user explicitly opted out, the value is unknown, or it’s a default placeholder.

If you treat all NULLs the same way — replacing them with zeros, dropping them, or ignoring them — you might be making incorrect business decisions downstream.

The fix: Treat NULL handling as a business logic decision, not a technical default. Sit down with your data analyst or product team and ask: What does NULL mean in this context? Then encode that decision explicitly in your transformation layer, and document it.
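One way to encode those decisions explicitly. The column names and rules below are hypothetical, purely for illustration — suppose the team agreed that a missing discount means "no discount applied" while a missing age is genuinely unknown:

```python
rows = [
    {"order_id": 1, "discount": 0.1, "age": 34},
    {"order_id": 2, "discount": None, "age": None},
    {"order_id": 3, "discount": None, "age": 51},
]

cleaned = []
for row in rows:
    cleaned.append({
        "order_id": row["order_id"],
        # Business rule: a missing discount means none was applied -> 0.0
        "discount": row["discount"] if row["discount"] is not None else 0.0,
        # Business rule: a missing age is genuinely unknown -> keep None,
        # but add an explicit flag so downstream consumers can segment on it
        "age": row["age"],
        "age_known": row["age"] is not None,
    })
```

Two different NULLs, two different rules — and both are now visible in the code instead of hidden in someone's head.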


3. “The Pipeline Will Always Run on Schedule”

In an ideal world, your Airflow DAG fires at 6 AM every day, runs cleanly, and finishes in 20 minutes. In reality, you have late-arriving data, infrastructure hiccups, manual backfills, and retry logic that re-runs tasks multiple times.

If your pipeline isn’t idempotent — meaning that running it twice produces the same result as running it once — you’re one retry away from duplicate data or corrupted aggregates.

The fix: Design for idempotency from the start. Use INSERT OVERWRITE or MERGE instead of INSERT INTO. Add partition filters so reruns only affect the target date range. Test your pipeline by intentionally running it twice in a row and verifying the output is identical.

-- Non-idempotent (dangerous):
INSERT INTO orders_summary SELECT * FROM raw_orders WHERE date = '2026-04-13';

-- Idempotent (safe): a rerun replaces the partition instead of appending
INSERT OVERWRITE TABLE orders_summary PARTITION (date = '2026-04-13')
SELECT * FROM raw_orders WHERE date = '2026-04-13';
-- (in Hive/Spark, exclude the partition column from the SELECT list)
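The "run it twice" test is easy to sketch in plain Python. Here a dict keyed by partition date stands in for the target table (a toy stand-in, not a real warehouse API); the load function overwrites the whole partition, so a rerun is a no-op:

```python
target = {}  # partition date -> list of rows (stand-in for the target table)

def load_partition(target, source_rows, date):
    # Idempotent: overwrite the entire partition for `date`,
    # never append to it
    target[date] = [r for r in source_rows if r["date"] == date]

source = [
    {"date": "2026-04-13", "order_id": 1},
    {"date": "2026-04-13", "order_id": 2},
    {"date": "2026-04-14", "order_id": 3},
]

load_partition(target, source, "2026-04-13")
first_run = list(target["2026-04-13"])

# Run it again -- an idempotent load leaves the output unchanged
load_partition(target, source, "2026-04-13")
assert target["2026-04-13"] == first_run
```

Swap the dict for your real target table and the assert for a row-level diff, and you have a regression test for exactly the retry scenario above.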

4. “Row Count Equals Data Quality”

I’ve seen dashboards that proudly show ‘1,000,000 rows processed’ as a success metric. But here’s the truth: you can have a million rows and still have complete garbage. Row count tells you the pipeline ran. It tells you nothing about whether the data is correct.

The fix: Add meaningful data quality checks — completeness (are critical fields populated?), distributions (has the average order value suddenly dropped 80%?), referential integrity, and freshness. Libraries like Great Expectations, dbt tests, or custom Python scripts can automate these checks.
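A minimal sketch of checks that go beyond row count, in plain Python with made-up thresholds and column names — in practice you would pull the historical baseline from a metrics store rather than hard-code it:

```python
orders = [
    {"order_id": 101, "user_id": 1, "order_value": 25.0},
    {"order_id": 102, "user_id": 2, "order_value": 30.0},
    {"order_id": 103, "user_id": None, "order_value": 27.5},
    {"order_id": 104, "user_id": 4, "order_value": 2.0},
]

errors = []

# Completeness: critical fields must be populated
null_ratio = sum(1 for o in orders if o["user_id"] is None) / len(orders)
if null_ratio > 0.01:
    errors.append(f"user_id null ratio too high: {null_ratio:.0%}")

# Distribution: flag a sudden collapse in average order value
historical_avg = 28.0  # in practice, read from a metrics store
current_avg = sum(o["order_value"] for o in orders) / len(orders)
if current_avg < historical_avg * 0.5:
    errors.append(f"avg order value collapsed: {current_avg:.2f}")

# Freshness and referential integrity checks would follow the same pattern
if errors:
    print(f"{len(errors)} quality check(s) failed: {errors}")
```

All four rows "processed" successfully — and the completeness check still fails, which is exactly the point.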


5. “Logs Are Enough for Debugging”

Logs are great. They tell you that something went wrong, and sometimes they even tell you what. But when a data engineer gets paged at 2 AM because a dashboard is wrong, the question isn’t just ‘what happened?’ — it’s ‘why did it happen, and what upstream process caused it?’

That’s where data lineage comes in. Lineage gives you a graph of how data flows from source to destination — which table feeds which model, which model feeds which report.

The fix: Invest in lineage from day one. If you’re using dbt, lineage is built in. Tools like OpenLineage, Marquez, or DataHub can add lineage tracking without a major rewrite. The setup cost is small compared to the debugging cost it saves.


Putting It All Together

The best data engineers don’t just build pipelines that work in dev. They build pipelines that survive reality — schema changes, retries, bad data, and all.

Which of these has burned you in a real production incident? I’d love to hear your war stories in the comments.

— Pushpjeet Cholkar, Data Engineer
