Every week in data engineering teaches you something new — often the hard way. This week was no different. Between debugging a flaky Airflow DAG, refactoring a dbt model that had grown way too clever for its own good, and explaining a data discrepancy to a frustrated stakeholder, I picked up five lessons I want to carry into next week.
If you’re in the data space — whether you’re just starting out or have been building pipelines for years — I think you’ll recognise at least one of these.
1. Simplicity Is a Feature, Not a Shortcut
There’s a temptation in data engineering to show off. To write the cleverly nested SQL that handles 12 edge cases in one subquery. To build the Spark job that processes everything in a single stage. But cleverness has a cost: the code becomes much harder to maintain.
This week I revisited a pipeline I wrote six months ago and spent 45 minutes figuring out what I was trying to do. The fix? Break it into smaller dbt models, add a comment explaining why, and drop the clever tricks. A junior engineer can now understand it in under five minutes.
Rule of thumb: If your pipeline would confuse a smart colleague who hasn’t seen it before, it’s too complex.
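As a sketch of what "break it into smaller dbt models" might look like in practice: instead of one deeply nested query, each model does exactly one job. The model and column names below are made up for illustration, and `qualify` assumes a warehouse that supports it (e.g. Snowflake or BigQuery).

```sql
-- stg_orders.sql: one job only — clean and rename raw columns
select
    order_id,
    cast(ordered_at as date) as order_date,
    status
from {{ ref('raw_orders') }}
```

```sql
-- int_orders_deduped.sql: one job only — keep the latest row per order
-- (in a real dbt project this lives in its own file)
select *
from {{ ref('stg_orders') }}
qualify row_number() over (
    partition by order_id order by order_date desc
) = 1
```

Each model is now trivially readable on its own, and the lineage graph documents the pipeline for free.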
2. Always Ask “Why Do We Need This?”
Before writing a single line of code, ask your stakeholder: What decision will this data enable? This week a request came in for a new aggregation table. After two minutes of questions, it turned out the stakeholder just wanted a number already available in an existing dashboard. Two hours of engineering work avoided with a two-minute conversation.
3. Document at the Point of Understanding
We all know documentation matters. We all leave it for later. Later never comes. This week I tried documenting at the moment I understood something. It added 10 minutes to my day but saved a 30-minute “how does this work again?” on Friday afternoon.
Practical tip: In dbt, treat the description: field in schema.yml as mandatory. A single sentence explaining why a model exists is more valuable than a paragraph describing what SQL it runs.
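A minimal sketch of what a "why"-focused description might look like in schema.yml; the model and column names here are hypothetical:

```yaml
# schema.yml — illustrative example; names are made up
version: 2

models:
  - name: fct_daily_revenue
    description: >
      Exists so finance can reconcile daily revenue against the billing
      system; excludes returns per the agreed revenue definition.
    columns:
      - name: revenue_amount
        description: Net revenue, returns excluded.
```

One sentence on why the model exists, one on what each key column means. That is usually enough.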
4. Data Quality Issues Are Communication Issues
When a number is wrong, the instinct is to dive into SQL. But more often, the root cause isn’t technical — it’s a misunderstanding between engineering and the business about what a metric actually means. This week’s “data quality issue” turned out to be a disagreement about whether returns should be excluded from revenue before or after a certain date. Not a bug. A definition problem.
The fix: Before you debug, align on the definition. A shared data dictionary isn’t a luxury — it’s infrastructure.
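To make the returns example concrete, here is a small sketch (with invented numbers and a hypothetical cutoff date) showing how two reasonable definitions of "revenue" produce two different answers from the same data. Neither number is a bug; they just answer different questions.

```python
from datetime import date

# Hypothetical transactions: (sale_date, amount, is_return)
transactions = [
    (date(2024, 1, 10), 100.0, False),
    (date(2024, 1, 15), -20.0, True),   # a return before the cutoff
    (date(2024, 2, 10), 200.0, False),
    (date(2024, 2, 20), -50.0, True),   # a return after the cutoff
]

CUTOFF = date(2024, 2, 1)  # assumed policy-change date

def revenue_returns_always_excluded(txns):
    """Definition A: returns never count toward revenue."""
    return sum(amt for _, amt, is_ret in txns if not is_ret)

def revenue_returns_excluded_after_cutoff(txns, cutoff):
    """Definition B: returns only excluded from the cutoff date onward."""
    return sum(
        amt for d, amt, is_ret in txns
        if not (is_ret and d >= cutoff)
    )

a = revenue_returns_always_excluded(transactions)          # 300.0
b = revenue_returns_excluded_after_cutoff(transactions, CUTOFF)  # 280.0
print(a, b, b - a)  # the gap is the "data quality issue"
```

The 20-unit gap here is exactly the kind of discrepancy that triggers a debugging session, when what it actually needs is a line in the data dictionary.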
5. Knowing When to Stop Is a Skill
I spent three hours optimising a Spark job and got it 15% faster. Was that worth three hours? Probably not — the job ran once a day and the business didn’t care. Premature optimisation in data engineering is just as dangerous as in software engineering. Know your SLAs. Save your energy for the jobs that actually matter.
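A quick back-of-the-envelope payback check makes this judgment call concrete. The runtime below is an assumption for illustration, not a figure from the post; only the 15% speedup and the three hours of effort come from the story above.

```python
# Back-of-the-envelope payback for the Spark optimisation.
# runtime_min is an assumed value; speedup and effort are from the post.
runtime_min = 20.0          # assumed daily job runtime before optimisation
speedup = 0.15              # job got 15% faster
effort_min = 3 * 60         # 3 hours of engineering time, in minutes

saved_per_day = runtime_min * speedup        # minutes of runtime saved daily
payback_days = effort_min / saved_per_day    # days until the effort pays off
print(saved_per_day, payback_days)
```

Under these assumptions the optimisation "pays back" only after 60 days — and that still conflates engineer time with compute time, which are rarely worth the same. If no SLA is at risk, that is a strong signal to stop.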
Wrapping Up
The fundamentals don’t change: build things simply, understand the problem before you build, communicate clearly, and know when to stop. If this week was tough, you’re in good company. Keep going.
What was your biggest lesson this week? I’d love to hear it in the comments.
— Pushpjeet Cholkar, Data Engineer