Author: pushpjeet

  • 5 Silent Killers of Production Data Pipelines (And How to Fix Them)

    I’ve seen pipelines fail in the most dramatic ways.

    Not during development. Not during testing. In production. At 2 AM. Right before a stakeholder demo.

    And almost every time, the root cause wasn’t bad code. It was a bad assumption — one that was quietly baked into the design and never questioned.

    If you build data pipelines professionally, this post is for you. Let’s walk through five assumptions that silently kill pipelines, and more importantly, how to fix them before they become your emergency.


    1. “The Source Schema Will Never Change”

    This is the most common assumption I see — and the most dangerous.

    You build an ingestion layer that reads from a MySQL table or a REST API. It works perfectly in dev and staging. You deploy to prod and call it a day.

    Then three weeks later, the backend team renames a column. Or adds a NOT NULL constraint. Or changes a field from varchar to int. Nobody sends a Slack message. Nobody files a ticket. And your pipeline silently breaks — or worse, starts producing wrong results.

    The fix: Validate schema at ingestion time, not just at build time. Tools like Great Expectations, Soda Core, or even a simple Python schema check can catch drift early. Make schema validation a first-class citizen of your pipeline, not an afterthought.

    # Simple example: validate expected columns exist before processing
    # (df is the freshly ingested batch, e.g. a Pandas DataFrame)
    expected_columns = {"user_id", "event_type", "timestamp", "session_id"}
    actual_columns = set(df.columns)
    
    if not expected_columns.issubset(actual_columns):
        missing = expected_columns - actual_columns
        raise ValueError(f"Schema drift detected. Missing columns: {missing}")

    2. “NULL Means Missing Data”

    This one is subtle but important. NULL in a database can mean many different things: the data was never collected, the user explicitly opted out, the value is unknown, or it’s a default placeholder.

    If you treat all NULLs the same way — replacing them with zeros, dropping them, or ignoring them — you might be making incorrect business decisions downstream.

    The fix: Treat NULL handling as a business logic decision, not a technical default. Sit down with your data analyst or product team and ask: What does NULL mean in this context? Then encode that decision explicitly in your transformation layer, and document it.
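    To make that concrete, here's a minimal sketch of encoding two different NULL meanings explicitly in a transformation step. The column names and the business rules are hypothetical — the point is that each NULL gets a documented, deliberate treatment:

```python
import pandas as pd

# Hypothetical business rules for this dataset:
#   - a NULL discount means "no discount applied" -> safe to fill with 0.0
#   - a NULL customer_age means "unknown" -> must NOT be imputed; keep it visible
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "discount": [0.1, None, 0.25],
    "customer_age": [34, None, 51],
})

df["discount"] = df["discount"].fillna(0.0)   # business rule: NULL => no discount
df["age_known"] = df["customer_age"].notna()  # preserve "unknown" as its own signal
```

    Two NULLs, two completely different treatments — and both decisions now live in code and can be reviewed.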


    3. “The Pipeline Will Always Run on Schedule”

    In an ideal world, your Airflow DAG fires at 6 AM every day, runs cleanly, and finishes in 20 minutes. In reality, you have late-arriving data, infrastructure hiccups, manual backfills, and retry logic that re-runs tasks multiple times.

    If your pipeline isn’t idempotent — meaning, running it twice produces the same result as running it once — you’re one retry away from duplicate data or corrupted aggregates.

    The fix: Design for idempotency from the start. Use INSERT OVERWRITE or MERGE instead of INSERT INTO. Add partition filters so reruns only affect the target date range. Test your pipeline by intentionally running it twice in a row and verifying the output is identical.

    -- Non-idempotent (dangerous):
    INSERT INTO orders_summary SELECT * FROM raw_orders WHERE date = '2026-04-13';
    
    -- Idempotent (safe):
    INSERT OVERWRITE orders_summary PARTITION (date = '2026-04-13')
    SELECT * FROM raw_orders WHERE date = '2026-04-13';
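    The "run it twice" test can also be sketched in miniature. The in-memory store and `run_pipeline` below are toy stand-ins for a real warehouse, but they show the delete-then-reload pattern that makes a rerun safe:

```python
# Toy pipeline: "overwrite" a date partition in an in-memory store.
raw_orders = [
    {"order_id": 1, "date": "2026-04-13", "amount": 50},
    {"order_id": 2, "date": "2026-04-13", "amount": 75},
]

orders_summary: dict[tuple, dict] = {}

def run_pipeline(target_date: str) -> None:
    # Idempotent pattern: clear the target partition first, then reload it,
    # so a rerun replaces rows instead of duplicating them.
    for key in [k for k in orders_summary if k[1] == target_date]:
        del orders_summary[key]
    for row in raw_orders:
        if row["date"] == target_date:
            orders_summary[(row["order_id"], row["date"])] = row

run_pipeline("2026-04-13")
first = dict(orders_summary)
run_pipeline("2026-04-13")        # the intentional rerun
assert orders_summary == first    # identical output: the pipeline is idempotent
```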

    4. “Row Count Equals Data Quality”

    I’ve seen dashboards that proudly show ‘1,000,000 rows processed’ as a success metric. But here’s the truth: you can have a million rows and still have complete garbage. Row count tells you the pipeline ran. It tells you nothing about whether the data is correct.

    The fix: Add meaningful data quality checks — completeness (are critical fields populated?), distributions (has the average order value suddenly dropped 80%?), referential integrity, and freshness. Libraries like Great Expectations, dbt tests, or custom Python scripts can automate these checks.
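    As a sketch of what "meaningful" can look like in plain Python — the column names and thresholds here are made up, and tools like Great Expectations or dbt tests express the same ideas declaratively:

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame, baseline_avg_order: float) -> list[str]:
    """Return a list of human-readable check failures (empty = healthy)."""
    failures = []
    # Completeness: critical fields must be populated
    if df["order_id"].isna().any():
        failures.append("order_id has NULLs")
    # Distribution: flag a sudden collapse in average order value
    avg = df["order_value"].mean()
    if avg < 0.2 * baseline_avg_order:  # i.e. dropped more than 80%
        failures.append(f"avg order value {avg:.2f} vs baseline {baseline_avg_order:.2f}")
    return failures

batch = pd.DataFrame({"order_id": [1, 2, 3], "order_value": [10.0, 12.0, 11.0]})
failures = run_quality_checks(batch, baseline_avg_order=60.0)
```

    The batch above has a perfect row count and populated IDs — and still fails, because its average order value collapsed against the baseline. That's the check a row counter can never give you.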


    5. “Logs Are Enough for Debugging”

    Logs are great. They tell you that something went wrong, and sometimes they even tell you what. But when a data engineer gets paged at 2 AM because a dashboard is wrong, the question isn’t just ‘what happened?’ — it’s ‘why did it happen, and what upstream process caused it?’

    That’s where data lineage comes in. Lineage gives you a graph of how data flows from source to destination — which table feeds which model, which model feeds which report.

    The fix: Invest in lineage from day one. If you’re using dbt, lineage is built in. Tools like OpenLineage, Marquez, or DataHub can add lineage tracking without a major rewrite. The setup cost is small compared to the debugging cost it saves.


    Putting It All Together

    The best data engineers don’t just build pipelines that work in dev. They build pipelines that survive reality — schema changes, retries, bad data, and all.

      Which of these has burned you in a real production incident? I’d love to hear your war stories in the comments.

      — Pushpjeet Cholkar, Data Engineer

    • Weekly Reflection: 5 Hard Lessons I Learned as a Data Engineer This Week

      Every Sunday, I take 15 minutes to look back at the week — not just what I built, but how I thought. This habit has quietly become one of the most valuable things I do for my career.

      This week was one of those weeks where the biggest wins came from doing less, not more.

      1. Simpler Pipelines Beat Clever Ones (Almost Always)

      I inherited an Airflow DAG this week that had 14 tasks, custom sensors, dynamic task mapping, and enough conditional logic to make your head spin. It was impressive — but it was also breaking constantly and nobody could debug it in under an hour.

      We replaced it with a dbt model + a single cron job. Result: 80% less code, same output, and any junior engineer on the team can now understand and maintain it.

      The lesson? Complexity is not sophistication. If a pipeline needs a presentation to explain it, it’s already too complicated.

      2. Query Execution Plans Are Underrated

      I started spending 30 minutes each morning reviewing EXPLAIN ANALYZE output on our slowest queries. Within three days, I found two silent killers: a full table scan on a 200M-row table and a nested loop join picking the wrong strategy due to stale statistics.

      EXPLAIN ANALYZE
      SELECT *
      FROM orders o
      JOIN customers c ON o.customer_id = c.id
      WHERE o.created_at > NOW() - INTERVAL '7 days';

      Takeaway: Reading execution plans feels slow. Not reading them is slower.

      3. The Power of Saying No to Data Sources

      A stakeholder came to me with a “quick” request: connect 3 new data sources. Old me would’ve said yes. This week’s me asked: What decision will this data enable? Who will use it? How often? The answers were vague. The request got deprioritized.

      Every new data source is a long-term maintenance commitment. Be selective. A lean data platform that reliably serves 10 use cases is worth more than a sprawling one that partially serves 50.

      4. Documentation Debt Is Real (And Painful)

      I came back to a Python utility script I wrote 6 weeks ago. No comments. No README. No docstrings. I spent 45 minutes reverse-engineering what I had written.

      def normalize_event_timestamps(df: pd.DataFrame, tz: str = "UTC") -> pd.DataFrame:
          """
          Convert all timestamp columns to a unified timezone.
      
          Args:
              df: Input DataFrame with raw event data
              tz: Target timezone string (default: 'UTC')
      
          Returns:
              DataFrame with normalized timestamp columns
          """
          # implementation here

      A docstring + type hints. Takes 2 minutes. Saves 45 minutes later.

      5. The Mindset Shift That Changed My Week

      Stop asking “how do I build this?” Start asking “should I build this at all?”

      Most data problems are not engineering problems. They’re clarity problems. The best data engineers push back — not to be difficult, but to make sure the work they do actually matters.

      Wrapping Up

      If you’re a data engineer, spend 15 minutes every Sunday asking: What worked and why? What didn’t work and what would I do differently? What’s one thing I’ll carry into next week?

      Small habit. Big compounding returns. See you next Sunday 👋

      — Pushpjeet Cholkar, Data Engineer

    • Real-World AI Applications in 2026: What Data Engineers Need to Know

      Everyone’s talking about AI. But most of that conversation lives in the world of demos, benchmarks, and announcements.

      Let’s talk about where AI is actually running in production — quietly, reliably, at scale — and what that means for you as a data engineer.

      Fraud Detection: Real-Time ML at Scale

      Banks and payment processors were among the first to go all-in on production ML. Today, when you swipe your card, a model scores that transaction in under 100 milliseconds.

      These systems ingest streaming data (think Kafka), run it through feature stores, and call inference endpoints on models trained on billions of labeled transactions. The old rule-based systems have been replaced by gradient boosting models and neural nets that detect subtle behavioral patterns.

      What this requires from data engineering:

      • Real-time streaming pipelines (Kafka, Flink, Spark Streaming)
      • Feature stores with low-latency reads (Feast, Tecton, Redis)
      • Data quality monitoring — a bad feature can tank model performance overnight

      Demand Forecasting: Knowing What You’ll Buy Before You Do

      Retailers like Walmart, Zara, and Amazon have turned demand forecasting into a serious competitive advantage. Instead of static seasonal models, they now run AI systems that incorporate weather data, local events, social media trends, historical sales, and supply chain status — all in real time.

      Tech stack typically involved:

      • Time-series models (Prophet, NeuralProphet, DeepAR on AWS SageMaker)
      • Feature pipelines ingesting 50+ data sources
      • Orchestration via Airflow or Prefect
      • Results served into planning dashboards via dbt + Looker or Tableau

      This is a data engineering problem at its core. The model is only as good as the pipeline feeding it.

      Predictive Maintenance: Preventing Failures Before They Happen

      Manufacturing and energy companies are using IoT sensor data + ML to predict equipment failure before it happens. A turbine with 200 sensors generates millions of data points per day. ML models trained on historical failure patterns can now flag anomalies weeks in advance.

      The data pipeline challenge here is massive:

      • Ingesting high-frequency sensor streams
      • Handling missing data and sensor drift
      • Storing time-series data efficiently (InfluxDB, TimescaleDB, or Delta Lake with time partitioning)
      • Triggering alerts when anomaly scores cross thresholds
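      That last step can be sketched with a trailing-window z-score. The window size and threshold below are arbitrary illustrations, not tuned values — production systems typically use learned anomaly models rather than a fixed rule:

```python
from statistics import mean, stdev

def anomalous_indices(readings: list[float], window: int = 5,
                      z_threshold: float = 3.0) -> list[int]:
    """Flag indices whose reading deviates sharply from the trailing window."""
    flagged = []
    for i in range(window, len(readings)):
        base = readings[i - window:i]
        mu, sigma = mean(base), stdev(base)
        if sigma > 0 and abs(readings[i] - mu) / sigma > z_threshold:
            flagged.append(i)  # score crossed the threshold -> trigger an alert
    return flagged

sensor = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 25.0]  # last reading is the anomaly
```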

      AI-Assisted Code Reviews and Developer Tools

      Tools like GitHub Copilot, CodeRabbit, and Cursor are now embedded in daily development workflows. From a data perspective, these tools are powered by large language models fine-tuned on code, served via inference APIs with strict latency requirements.

      The impact on software teams is real: teams adopting these tools commonly report 30-40% reductions in PR review turnaround time, faster onboarding of new engineers, and fewer syntax-level bugs reaching production.

      Your Social Feed: The Most Visible AI in the World

      Every time you open Instagram, TikTok, LinkedIn, or YouTube, you’re triggering dozens of ML inference calls. Content ranking, ad targeting, notification timing, A/B test assignment — it’s all ML, running in real time, personalized to you specifically.

      The Common Thread: Data Engineering Is the Foundation

      Look at every example above. Every single one depends on:

      1. Clean, reliable data ingestion — if the pipeline breaks, the model breaks
      2. Feature engineering — raw data rarely goes straight into models
      3. Monitoring and data quality — models degrade silently when data shifts
      4. Scalable infrastructure — AI at scale requires petabyte-level thinking

      This is why data engineers are still the most underrated role in AI projects. The ML engineer gets the credit. The data engineer keeps the lights on.

      What You Should Take Away From This

      AI applications in 2026 are real, widespread, and deeply dependent on data infrastructure. As a data engineer, the smartest move is to understand what the models need — not just how to build pipelines, but how to build pipelines that serve real ML use cases.

      The gap between “data engineer” and “ML platform engineer” is closing. And the ones closing it fastest are the ones who understand both sides.

      What real-world AI application has impressed you the most? Leave a comment below — I read every one.

      — Pushpjeet Cholkar, Data Engineer

    • Spark, dbt, and Airflow: How to Use All Three Without Losing Your Mind

      Every data engineer eventually lands on the same question: “When do I use Spark vs dbt vs Airflow?”

      If you’ve asked yourself this, you’re not alone. These three tools form the backbone of a modern data stack — but the confusion about when to use which one leads to some of the messiest pipeline architectures I’ve ever seen.

      In this post, I’m going to break down each tool’s role, show you where they overlap (and where they absolutely don’t), and walk you through a practical architecture pattern that actually scales.

      The Short Answer (Before We Dive In)

      • Spark = distributed compute for large-scale data processing
      • dbt = SQL-based transformation layer inside your data warehouse
      • Airflow = orchestrator that schedules and monitors jobs

      They’re not competitors. They’re teammates. The trick is giving each one the right job.

      Apache Spark: Your Heavy-Lifting Engine

      Apache Spark is a distributed computing framework designed to process massive amounts of data fast. We’re talking terabytes or petabytes, spread across a cluster of machines working in parallel.

      When should you reach for Spark? Use it when you have raw, unstructured data coming from Kafka, S3, or HDFS. Use it when your data volume makes single-machine processing impractical. Use it when you need complex transformations before data hits your warehouse, or when you’re doing streaming ingestion alongside batch processing.

      Spark is excellent at the ingestion and raw processing phase. It can read from almost any source, apply heavy transformations in PySpark or Scala, and write results to your data lake or warehouse.

      What Spark is NOT: a scheduler, an orchestrator, or a transformation layer inside your warehouse. Using Spark to run light transformations on structured warehouse data is overkill — that’s dbt’s territory.

      dbt: The Transformation Layer Your SQL Deserves

      dbt (data build tool) changed how data engineers think about transformations. Instead of scattered SQL scripts with names like final_v3_FINAL.sql, dbt gives you a structured, version-controlled, testable transformation framework.

      Here’s what makes dbt powerful: Modularity lets you write reusable SQL models that reference each other. Testing lets you define schema tests (not null, unique, accepted values) that run automatically. Documentation auto-generates a data catalog from your models. Lineage lets you visualize how data flows from source to final table.

      dbt runs inside your warehouse — Snowflake, BigQuery, Redshift, Databricks. It doesn’t move data; it transforms data that’s already there.

      A Quick dbt Example

      -- models/marts/fact_orders.sql
      WITH orders AS (
          SELECT * FROM {{ ref('stg_orders') }}
      ),
      customers AS (
          SELECT * FROM {{ ref('stg_customers') }}
      )
      SELECT
          o.order_id,
          o.order_date,
          c.customer_name,
          o.total_amount
      FROM orders o
      LEFT JOIN customers c ON o.customer_id = c.customer_id

      That ref() function is dbt magic — it builds the dependency graph automatically, so dbt knows to run stg_orders and stg_customers before fact_orders.

      What dbt is NOT: a job scheduler, a data ingestion tool, or a substitute for Spark on large raw datasets.

      Apache Airflow: The Conductor of Your Pipeline

      Airflow is a workflow orchestration platform. Its job is simple but critical: run the right jobs, in the right order, at the right time — and tell you when something goes wrong.

      You define workflows as DAGs (Directed Acyclic Graphs) in Python. A typical daily DAG looks like this: Spark ingests raw data → dbt transforms it → dbt tests validate it. Clean, readable, and version-controlled.

      The #1 Airflow mistake I see: Running heavy data processing logic inside Airflow operators. PythonOperators with 10,000-row Pandas loops, inline SQL queries that run for hours — this kills your Airflow workers. Airflow schedules work. It doesn’t do the heavy work itself.

      The Architecture Pattern That Works

      Here’s the pattern I’ve used on production pipelines handling hundreds of millions of rows daily: Airflow triggers a daily DAG → Spark ingests raw data to the data lake → Airflow triggers dbt → dbt transforms inside the warehouse → dbt tests validate data quality → BI tools and downstream consumers read clean data.

      Each layer has one responsibility. Airflow handles scheduling and monitoring, Spark handles scale, dbt handles structured transformations. When something breaks, you know exactly where to look.
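      The ordering above can be reduced to a toy in plain Python. This is not real Airflow code — it's just the dependency chain (spark_ingest, then dbt_transform, then dbt_test) with each task shrunk to a log line; in a real DAG, each task would only trigger external work like spark-submit or dbt run, never do the heavy compute itself:

```python
# Toy stand-ins for the three stages; in production these would call
# spark-submit, `dbt run`, and `dbt test` respectively.
log: list[str] = []

def spark_ingest() -> None:
    log.append("spark: raw data -> data lake")

def dbt_transform() -> None:
    log.append("dbt: transform inside warehouse")

def dbt_test() -> None:
    log.append("dbt: validate data quality")

# The "DAG": run tasks in dependency order
for task in (spark_ingest, dbt_transform, dbt_test):
    task()
```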

      Common Mistakes to Avoid

      1. Running Pandas in Airflow operators. Heavy compute belongs in Spark, not inside Airflow. If your DAG tasks take more than a few minutes, move the logic to a Spark job and trigger it from Airflow.

      2. Using dbt for raw data ingestion. dbt reads from what’s already in your warehouse. It doesn’t pull from APIs, Kafka, or flat files. Use Spark, Fivetran, or a custom ingestion job for that.

      3. Treating Spark as a scheduler. Spark has no built-in job scheduling or dependency management. An orchestrator like Airflow is needed to coordinate when Spark jobs run.

      4. No dbt tests. If you’re not running dbt test, you’re flying blind. Schema tests catch broken pipelines before your stakeholders do.
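      For reference, a minimal dbt schema test might look like this — the model and column names follow the fact_orders example above, so treat it as a sketch rather than a drop-in file:

```yaml
# models/marts/schema.yml
version: 2

models:
  - name: fact_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```

      Run `dbt test` and these checks execute against the warehouse on every build.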

      Wrapping Up

      Spark, dbt, and Airflow are genuinely complementary. Once you understand each tool’s lane, using them together feels natural — and your pipelines become dramatically more maintainable.

      The key mental model: Airflow is the conductor. Spark is the muscle. dbt is the translator.

      Give each tool its role and stay disciplined about not crossing the lanes. Have questions about your specific setup? Drop a comment below — I read every one of them.

      — Pushpjeet Cholkar, Data Engineer

    • How Data Engineers Can Build a Personal Brand That Actually Opens Doors

      When I first heard the phrase “personal brand,” I pictured influencers with ring lights and perfectly curated feeds.

      I didn’t think it applied to me — a data engineer whose day job involves wrangling pipelines, debugging Spark jobs, and staring at YAML configs.

      But then something shifted. I started sharing what I was learning online. A concept I figured out. A mistake I made. A tool I was trying out. And slowly, people started noticing.

      That’s when I realized: personal branding for engineers isn’t about looking polished. It’s about building trust in public.

      Here’s what I’ve learned about doing it well.


      Why Personal Branding Matters More Than Ever for Data Engineers

      The data engineering field is growing fast. Companies are hiring. But so are thousands of other candidates with similar resumes.

      Your resume tells people what you’ve done. Your personal brand tells them how you think.

      That distinction is huge. Hiring managers, recruiters, and future collaborators often check LinkedIn, GitHub, or a blog before they ever reach out. What they find there either builds confidence in you — or doesn’t.

      A strong personal brand can mean:

      • Inbound job opportunities (instead of cold applications)
      • Speaking invitations at meetups and conferences
      • Collaboration requests from peers in the industry
      • A growing audience that values your perspective

      And the best part? You don’t need to be a senior engineer or a thought leader to start. You just need to be willing to share the journey.


      The #1 Mistake Engineers Make With Personal Branding

      Most engineers wait until they “know enough” to start sharing.

      They think: “I’ll post when I have something really valuable to say.”

      The result? They never post.

      Here’s the reframe: you don’t need to be the expert. You need to be one step ahead of someone else.

      If you just figured out how dbt incremental models work, write about it. There are hundreds of people right behind you who are confused by the exact same thing. Your explanation — written in your own words, from your own experience — is more valuable to them than any documentation.

      Teach what you know. Document what you’re learning. That’s the content formula.


      What to Post About as a Data Engineer

      Not sure what to share? Here are content ideas that consistently perform well:

      Share Your Learning

      • “I spent 3 hours debugging this Airflow DAG. Here’s what I found.”
      • “Finally understood window functions in SQL — here’s the simple way to think about it.”
      • “Tried Apache Iceberg for the first time. My honest take.”

      Share Your Process

      • Walk through how you approach a data modeling problem
      • Show a before/after of a messy query you cleaned up
      • Explain how you set up your local dev environment

      Share Your Opinion

      • “Hot take: most data pipelines are over-engineered”
      • “Why I think every data engineer should learn a little dbt”
      • “The most underrated skill in data engineering? Communication.”

      Share Career Lessons

      • Mistakes you made early in your career
      • What you wish you knew before your first data engineering job
      • How you prepared for a technical interview

      Mix these formats. The variety keeps things interesting and reaches different audiences.


      The Consistency Formula That Actually Works

      Going viral once won’t build a brand. Showing up consistently will.

      But consistency doesn’t mean daily posts forever. It means finding a sustainable rhythm and sticking to it.

      My suggestion: start with 3 posts per week on LinkedIn.

      Why LinkedIn? Because that’s where the professional data community lives. Your content reaches hiring managers, peers, and potential collaborators directly. Instagram and a blog are great supplements, but LinkedIn is where professional reputations are built in this space.

      Here’s a simple weekly template:

      • Monday: Teach something technical (a concept, tool, or pattern)
      • Wednesday: Share a career lesson or personal story
      • Friday: Ask the community a question or share your opinion

      That’s it. 3 posts a week, 3 different angles. You’ll cover technical depth, human connection, and community engagement all in one rhythm.


      How to Make Your Content Stand Out

      The data engineering space can feel crowded. Here’s how to differentiate:

      1. Write like you talk. Skip the jargon when plain language works. If you’d explain it to a colleague over coffee in simple terms, write it that way.

      2. Lead with the problem. Start posts with a pain point, not a solution. “Ever spent 2 hours debugging a pipeline only to find a typo?” — now you have my attention.

      3. Use your real experience. Generic advice is forgettable. “Here’s what happened to me when I tried X” is not.

      4. Be honest about what you don’t know. Counterintuitively, admitting you’re still figuring something out builds more trust than pretending you have all the answers.

      5. Engage in the comments. Reply to every comment, especially early on. Algorithms reward engagement, but more importantly, it turns followers into a real community.


      Building Beyond LinkedIn

      Once you have a posting rhythm on LinkedIn, here’s how to expand:

      • A blog (like this one) helps with SEO and gives you space for long-form thinking
      • Instagram lets you reach a different, often younger, audience with visual content
      • GitHub is your portfolio — keep it active and organized
      • Newsletters are powerful once you have a few hundred subscribers

      You don’t need all of these on day one. Pick one platform, go deep, then expand.


      Final Thoughts

      The engineers who stand out aren’t always the most senior or the most skilled.

      They’re the ones who are willing to show their work — to write about what they’re learning, share what they’re building, and help others along the way.

      Start small. Post something this week. It doesn’t have to be perfect.

      Your future self — the one with the inbound DMs, the speaking invitations, and the career options — will thank you.


      — Pushpjeet Cholkar, Data Engineer

      Follow me on LinkedIn and Instagram @me_the_data_engineer for daily content on data engineering, AI/ML, and career growth.

    • What Is a Data Engineer? A Plain-English Guide for 2026

      If you’ve ever wondered what a data engineer actually does — you’re not alone.

      Most people can picture a software engineer (they build apps) or a data scientist (they build models). But the data engineer? That role lives in a grey zone that even people inside tech companies struggle to explain clearly. Let me fix that.

      The Simple Version

      A data engineer builds and maintains the systems that move, store, and transform data — so that everyone else (data scientists, analysts, business teams, AI systems) can actually use it.

      Think of it this way: data scientists are chefs who cook amazing meals. Data engineers are the ones who built the kitchen, stocked the fridge, installed the plumbing, and made sure the electricity works. No kitchen → no meal. No data engineer → no working AI.

      What Does a Data Engineer Actually Do Day-to-Day?

      1. Building Data Pipelines

      A data pipeline is a system that automatically collects data from one place, transforms it, and delivers it somewhere useful. Data engineers design, build, and maintain these pipelines using tools like Apache Airflow, Python, and cloud platforms.
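      To make "collects, transforms, delivers" concrete, here's a toy pipeline in miniature. The source and destination are plain Python lists standing in for an API and a data warehouse:

```python
# Extract: pull raw rows from a source (here, hard-coded sample data)
def extract() -> list[dict]:
    return [{"user": "a", "amount": "10"}, {"user": "b", "amount": "25"}]

# Transform: clean the data (cast amounts from strings to integers)
def transform(rows: list[dict]) -> list[dict]:
    return [{**r, "amount": int(r["amount"])} for r in rows]

# Load: deliver the clean rows to a destination (a list standing in for a warehouse)
destination: list[dict] = []

def load(rows: list[dict]) -> None:
    destination.extend(rows)

load(transform(extract()))
```

      Real pipelines swap each stage for something industrial — an API client, a Spark job, a warehouse writer — but the extract-transform-load shape is the same.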

      2. Transforming Raw Data

      Raw data is messy — duplicate records, inconsistent formats, missing values. Data engineers clean and transform this data using tools like dbt (data build tool), SQL, and Spark.

      3. Managing Data Storage

      Where does all the data live? In data warehouses like Snowflake, BigQuery, or Redshift. Data engineers design the structure of these warehouses and make sure data is stored efficiently and queryable.

      4. Enabling AI and Analytics

      Every machine learning model needs training data. Every business dashboard needs reliable data. Data engineers are the ones making sure the right data gets to the right place in the right format.

      5. Ensuring Data Quality

      Bad data in → bad decisions out. Data engineers build monitoring systems to catch problems before they cause damage downstream.

      Key Tools in a Data Engineer’s Toolkit (2026)

      Here are the core tools every data engineer works with: Python and SQL for programming, Apache Airflow, Prefect, or Dagster for pipeline orchestration, dbt and Spark for data transformation, Snowflake, BigQuery, or Redshift as data warehouses, Kafka or Kinesis for streaming, and AWS, GCP, or Azure as the cloud foundation.

      Why Data Engineering Matters More Than Ever in 2026

      The rise of AI has made data engineering more important, not less. LLMs need massive clean datasets to train on — someone has to build the pipelines that collect, clean, and version that data. Real-time AI applications like fraud detection, personalisation, and recommendations need streaming data infrastructure. And AI governance and data quality are now regulatory requirements in many industries.

      How to Get Started in Data Engineering

      1. Learn Python — the primary language for data engineering
      2. Master SQL — you’ll use it every single day
      3. Understand databases — both relational (Postgres) and warehouses (BigQuery/Snowflake)
      4. Learn Apache Airflow — the most widely used pipeline orchestration tool
      5. Get hands-on with cloud — pick one: AWS, GCP, or Azure
      6. Build real projects — a portfolio of actual pipelines matters more than certifications

      Final Thoughts

      Data engineering is one of the most foundational, in-demand, and underappreciated roles in technology. As AI continues to reshape industries, the importance of clean, well-governed, well-engineered data will only grow. If you found this useful, I’ll be publishing regularly on data engineering, AI tools, and building a tech career. Follow along — there’s a lot more to come.

      — Pushpjeet Cholkar, Data Engineer
