Category: Uncategorized

  • How Data Engineers Can Build a Personal Brand That Opens Doors

    There’s a quiet frustration that many data engineers share but rarely talk about.

    You’ve spent years mastering Spark, dbt, Airflow, and Python. You’ve built pipelines that process millions of rows. You’ve solved complex data quality issues that nobody else could crack. And yet — the career opportunities you want aren’t coming fast enough.

    Here’s the uncomfortable truth: technical skills get you in the door, but your personal brand is what keeps it open.

    In 2026, the data engineering job market is competitive. Companies aren’t just looking for engineers who can write great code — they’re looking for engineers who can communicate, influence, and lead. And your personal brand is the signal they use to evaluate that.

    What Is a Personal Brand (And Why Should a Data Engineer Care)?

    A personal brand is simply the impression people have of you professionally. It’s what comes up when someone Googles your name. It’s the posts you share on LinkedIn. It’s the articles you write, the problems you solve publicly, and the way you explain complex concepts to others.

A strong personal brand can:

    • Get you noticed by recruiters who aren’t posting jobs publicly
    • Position you as a thought leader in your niche (Kafka? dbt? streaming pipelines?)
    • Help you command higher salaries because you’re seen as an expert, not just a candidate
    • Build a network that sends opportunities your way

    The best part? You don’t need to be an influencer. You just need to be consistent.

    Step 1: Pick Your 3 Core Topics

    Pick 3 topics that sit at the intersection of what you know deeply, what you enjoy talking about, and what your target audience cares about. For me, those are: Data Engineering (pipelines, architecture, tools), Python for data workflows, and AI tools for engineers.

    Step 2: Show Your Work, Not Just Your Results

    Instead of waiting until you’ve built the perfect data pipeline to talk about it, share your journey. Instead of posting “Just shipped a new ETL pipeline!”, try: “We had 3-hour data latency. Here’s how I rebuilt it using Kafka + Spark Streaming to get it under 5 minutes — and the mistake I almost made.” That version teaches something. It builds trust.

    Step 3: Choose Your Platform and Post Consistently

    LinkedIn is the most powerful platform for data engineers right now. A personal blog gives you a home base you own — and helps with SEO. One post per week, every week, for six months will do more for your career than ten posts in a burst followed by three months of silence.

    Step 4: Engage, Don’t Just Broadcast

    Leave thoughtful comments on posts from engineers you admire. Reply to every comment on your own posts. Share other people’s insights with your own take added. The engineers who grow fastest are the ones genuinely engaging — not just broadcasting.

    Step 5: Be Patient and Track What Works

    Building a personal brand is a long game. Expect the first 90 days to feel slow. Track what resonates — which posts get comments, which topics drive profile views. As a data engineer, you’re uniquely positioned to be analytical about your content strategy. Use that superpower.

    Your Action Plan for This Week

    1. Update your LinkedIn headline to reflect your specialty
    2. Write one post about a problem you solved recently
    3. Leave five genuine comments on posts by engineers you respect
    4. Start a notes document to capture ideas for future posts

    The engineers who will thrive in the next five years aren’t just the ones with the most technical skills. They’re the ones who can communicate their value, build trust at scale, and show up consistently. Your code already speaks for itself. Now it’s time to let your voice do the same.

    — Pushpjeet Cholkar, Data Engineer

  • 5 AI & ML Tools Every Data Engineer Should Know in 2026

    1. dbt + LLM Macros: AI-Powered Transformations

    What it is: dbt (data build tool) integrates with large language models to auto-generate column descriptions, test cases, and documentation from your SQL models.

    Why it matters: Writing documentation and tests is the least glamorous part of data engineering — and the most skipped. LLM-powered macros change that equation. You can prompt dbt to generate YAML documentation for an entire model in seconds.

    -- Build docs for a single model (the LLM-written descriptions come from macros, not this command itself)
    dbt docs generate --select my_model

    With tools like dbt-osmosis and emerging LLM integrations, you can now propagate column descriptions automatically across your DAG.

    2. MLflow: The Experiment Tracker You Already Need

    What it is: MLflow is an open-source platform for managing the ML lifecycle — experiment tracking, model registry, and model serving.

    Why data engineers need it: Even if you never train a model yourself, your pipelines feed them. When a model degrades, the first question is: “Did the training data change?” MLflow gives you the audit trail to answer that.

    import mlflow
    
    with mlflow.start_run():
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_metric("rmse", 0.84)
        mlflow.log_artifact("model.pkl")

    3. Feast: The Feature Store That Stops Pipeline Duplication

    What it is: Feast is an open-source feature store that lets you define, store, and serve ML features consistently across training and production environments.

    The problem it solves: Without a feature store, teams rebuild the same feature logic in multiple places. Models trained on one version of a feature get served predictions from a slightly different version — this is called training-serving skew, and it silently kills model accuracy.

    from feast import FeatureStore
    
    store = FeatureStore(repo_path=".")
    features = store.get_online_features(
        features=["user_stats:purchase_7d_avg"],
        entity_rows=[{"user_id": 1001}]
    ).to_dict()

    4. Great Expectations + Anomaly Detection: Quality Beyond Rules

    Why the combination matters: Rule-based expectations catch what you already know to look for. Anomaly detection catches distribution shifts, sudden value spikes, or gradual drift that no rule anticipated.

    import great_expectations as gx
    
    context = gx.get_context()
    validator = context.sources.pandas_default.read_csv("orders.csv")
    validator.expect_column_values_to_not_be_null("order_id")
    validator.expect_column_values_to_be_between("order_amount", min_value=0, max_value=50000)
    results = validator.validate()
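The anomaly-detection half doesn’t need a heavy tool to get started. Here is a minimal sketch in plain Python (the metric values and the 3-sigma threshold are illustrative, not from any specific library) that flags a daily metric drifting away from its recent history:

```python
import statistics

def is_anomalous(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's value if it sits more than `threshold` standard
    deviations from the mean of the recent history (a simple z-score check)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > threshold

# Daily row counts for the last week (illustrative numbers)
row_counts = [100_120, 99_870, 100_450, 100_010, 99_640]
is_anomalous(row_counts, 100_200)  # within the normal range
is_anomalous(row_counts, 42_000)   # sudden drop: flagged
```

A check like this runs alongside rule-based expectations and catches the shifts no rule anticipated.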

    5. Vertex AI / SageMaker Pipelines: Orchestrating ML Like a Data Engineer

    If you’ve used Apache Airflow, these tools map cleanly to concepts you already know: DAGs, steps, inputs/outputs, artifacts — just applied to ML workflows.

    from kfp import dsl
    
    @dsl.pipeline(name="training-pipeline")
    def training_pipeline(data_path: str):
        preprocess_task = preprocess_op(data_path=data_path)
        train_task = train_op(data=preprocess_task.outputs["processed_data"])
        evaluate_task = evaluate_op(model=train_task.outputs["model"])

    The Big Picture

    In 2026, a pipeline doesn’t just move data from A to B. It moves data from raw sources to clean features to trained models to reliable predictions to business outcomes. Every link in that chain needs engineering.

    Build both sides of the stack. That’s where the real leverage is.

    — Pushpjeet Cholkar, Data Engineer

  • Stop Treating Your Data Pipelines Like Scripts — Build Them Like Products

    Every data engineer has a story.

    It usually starts the same way: someone needed a quick data pull, so you wrote a Python script. It worked. Then it got scheduled. Then it fed a dashboard. Then the VP of Sales started refreshing that dashboard every morning before their 9am standup.

    Your “quick script” just became critical infrastructure — and nobody updated the README.

    This is one of the most common patterns in data engineering, and it’s also one of the most dangerous. When pipelines are built like throwaway scripts, they become time bombs. They break at the worst moments, they’re impossible to debug, and they’re terrifying to hand off to someone else.

    The fix? Start treating your data pipelines like products.

    1. Version Control Everything — Not Just the Code

    Most engineers version-control their Python files. But your pipeline is more than just code. Version control your SQL transformations, dbt models, schema definitions, DAG definitions, and infrastructure configs. When a schema changes without a Git commit, you lose traceability.

    Practical tip: Use a monorepo structure for your data platform. Tools like dbt make this natural — every model, test, and doc block lives in version control.

    2. Write Data Tests, Not Just Code Tests

    Unit tests catch bugs in your logic. Data tests catch bugs in your data — and in data engineering, the data is usually where the real surprises hide. Most production data issues aren’t caused by broken code — they’re caused by an upstream source sending nulls, a date field switching formats, or a join key returning duplicate rows.

    Test for not-null checks on critical columns, uniqueness constraints on primary keys, accepted values for categorical columns, referential integrity between tables, and row count anomalies. dbt has built-in generic tests for all of the above.
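If you’re not on dbt yet, the same generic tests can be sketched in plain Python. The column names and accepted values below are illustrative stand-ins, not part of any real schema:

```python
def run_data_tests(rows: list[dict]) -> list[str]:
    """Plain-Python stand-ins for dbt's generic tests:
    not_null, unique, and accepted_values."""
    failures = []

    # not_null on a critical column
    if any(r.get("order_id") is None for r in rows):
        failures.append("not_null failed: order_id")

    # unique on the primary key
    ids = [r["order_id"] for r in rows if r.get("order_id") is not None]
    if len(ids) != len(set(ids)):
        failures.append("unique failed: order_id")

    # accepted_values on a categorical column
    allowed = {"placed", "shipped", "returned"}
    if any(r.get("status") not in allowed for r in rows):
        failures.append("accepted_values failed: status")

    return failures

good = [{"order_id": 1, "status": "placed"}, {"order_id": 2, "status": "shipped"}]
bad = [{"order_id": 1, "status": "placed"}, {"order_id": 1, "status": "lost"}]
run_data_tests(good)  # no failures
run_data_tests(bad)   # duplicate key + unexpected status
```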

    3. Build Observability From Day One

    If your pipeline fails silently, does it even matter that it failed? The answer is yes — and your stakeholders will make it very clear when they figure out their dashboard is two days stale.

    Observability means alerting on failures, data freshness monitoring, row-level audit logs, and lineage tracking. The rule of thumb: you should know your pipeline is broken before your stakeholders do. Always.
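Freshness monitoring, for instance, can start as a one-function check that compares the latest load timestamp against an SLA. A minimal sketch (the two-hour threshold is an assumption to tune per table):

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, max_lag: timedelta) -> bool:
    """Return True if the data is within its freshness SLA.
    Wire the False branch into your alerting channel (Slack, PagerDuty, etc.)."""
    lag = datetime.now(timezone.utc) - last_loaded_at
    return lag <= max_lag

# Example: this table must never be more than 2 hours behind
last_load = datetime.now(timezone.utc) - timedelta(minutes=30)
if not check_freshness(last_load, max_lag=timedelta(hours=2)):
    raise RuntimeError("table is stale -- alert before stakeholders notice")
```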

    4. Document the WHY, Not Just the WHAT

    Code explains what it does. Documentation should explain why it does it. Six months from now, when someone needs to modify a complex transformation, they don’t need to know that this column is a LEFT JOIN — they need to know why it’s a LEFT JOIN and what business logic it encodes.

    Write dbt model descriptions that explain business context, keep an Architecture Decision Record file in your repo for major design choices, and update docs as part of your PR review process — not as an afterthought.

    5. Treat Pipeline Failures as Incidents

    When your production pipeline breaks, it’s not just a bug — it’s a business incident. Log it with full error context. Alert the right people — not just the on-call engineer, but the data consumers who are affected. Fix it with a proper root cause analysis, not just a git revert. Then post-mortem it.

    Teams that run post-mortems on data incidents ship more reliable pipelines over time, because they learn from failures instead of repeating them.

    The Product Mindset Shift

    All of these practices come down to one mental shift: think about the people downstream from your pipeline. Ask yourself before every pipeline you build: Who depends on this data? What breaks if this pipeline fails at 2am? How will I know it’s working correctly tomorrow? Would a new engineer understand this in 6 months?

    If you can answer those questions confidently, you’re not just writing scripts anymore. You’re building data infrastructure that lasts.

    Have a “temporary” pipeline that’s been running for years? Share your story in the comments 👇

    — Pushpjeet Cholkar, Data Engineer

  • 5 Apache Spark Optimization Tricks Every Data Engineer Should Know

    Most data engineers can write a Spark job. But writing one that’s actually fast? That’s where things get interesting.

    I’ve spent years working on large-scale data pipelines, and time and again I see the same performance mistakes show up in Spark jobs — even from experienced engineers. The good news: most of them are easy to fix once you know what to look for.

    Here are 5 Spark optimization tricks that have saved me hours of compute time and a lot of frustrated debugging.

    1. Stop Triggering Unnecessary Shuffles

    Shuffles are Spark’s most expensive operation. When a shuffle happens, data is redistributed across the network — taking time, memory, and I/O. Operations that commonly trigger shuffles: groupBy(), join() (unless broadcast), distinct(), and repartition().

    The fix: use broadcast joins for small tables.

    from pyspark.sql.functions import broadcast
    result = large_df.join(broadcast(small_lookup_df), on="id", how="left")

    2. Get Your Partition Count Right

    Target ~128MB per partition. Too few = idle cores. Too many = scheduling overhead.

    df.rdd.getNumPartitions()   # inspect the current partition count
    df = df.repartition(200)    # full shuffle: increase or evenly rebalance
    df = df.coalesce(50)        # merges partitions without a full shuffle

    Use coalesce() when reducing partitions (it avoids a full shuffle), and repartition() when increasing them or when you need an even distribution.
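The ~128MB rule of thumb can be turned into a quick sizing helper. A sketch (the target is a heuristic, not a Spark constant):

```python
def target_partitions(total_bytes: int, target_mb: int = 128) -> int:
    """Rough partition count for ~target_mb of data per partition."""
    return max(1, round(total_bytes / (target_mb * 1024 * 1024)))

# A 25 GB dataset lands at 200 partitions with the default target
target_partitions(25 * 1024**3)
```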

    3. Cache DataFrames You Use More Than Once

    Spark’s lazy evaluation means it recomputes DataFrames from scratch each time an action is triggered — unless you cache them.

    df = spark.read.parquet("s3://my-bucket/data/")
    df.cache()
    count = df.count()  # first action materializes the cache
    filtered = df.filter(df["status"] == "active")  # illustrative predicate; later actions reuse the cache

    4. Switch to Kryo Serialization

    Kryo can be 2–10x faster than Java’s default serializer. One config line:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
    )

    5. Actually Read the Spark UI

    The Spark UI (at http://localhost:4040) shows exactly where time is wasted — skewed partitions, slow stages, missed caches. Check the Stages tab, DAG visualization, and Executors tab. A task taking 5 minutes while 199 others take 10 seconds? That’s partition skew — fix it by salting your join key.

    Putting It All Together

    Checklist for every new Spark job: avoid shuffle-heavy operations where possible, check partition count, cache reused DataFrames, enable Kryo, and inspect the Spark UI after a test run. Optimizing Spark is systematic — find the bottleneck, understand it, fix it.

    — Pushpjeet Cholkar, Data Engineer

  • Why Data Engineers Need a Personal Brand (And How to Build One Without the Cringe)

    Let me paint you a picture.

    Two data engineers join the same company on the same day. Same skills. Same stack. Same team.

    Eighteen months later, one is a senior engineer with inbound recruiter messages and a growing online following. The other is still waiting for their “turn” at the next performance cycle.

    What separated them? It wasn’t the code. It was visibility.

    The Visibility Problem in Data Engineering

    Data engineering is one of the most impactful roles in a modern tech company. You build the infrastructure that powers product decisions, revenue models, and machine learning systems. Without you, data scientists are staring at empty Jupyter notebooks.

    And yet — data engineers are often the least visible people on the technical team. Our output lives in Airflow DAGs and dbt models that only your team appreciates. This invisibility has a career cost. A personal brand solves this.

    What Personal Branding Actually Means for Engineers

    First, let’s kill the cringe. Personal branding doesn’t mean becoming a LinkedIn influencer. For a data engineer, personal branding means one thing: making your expertise legible to the right people.

    The 5 High-ROI Moves for Building Your Brand as a Data Engineer

    1. Write About What You Just Solved

    Every week, you solve at least one problem that took you longer than it should have. A tricky dbt macro. A Spark memory tuning issue. A confusing Airflow dependency. Write 300 words about how you solved it and post it on LinkedIn or your blog. You will help dozens of engineers who are Googling the exact same problem.

    2. Narrate Your Architecture Decisions

    Most engineers document the what. Almost nobody documents the why. Why did you choose Kafka over Kinesis? Why did you pick Iceberg over Delta Lake? These decisions are gold. Write them up — share the best ones externally. This positions you as someone who thinks about engineering, not just implements it.

    3. Teach One Thing Every Week

    You know something that would be useful to someone at an earlier stage of their career. Teaching doesn’t require a YouTube channel. It can be a 5-minute Loom walkthrough, a reply to someone’s LinkedIn question, or a short “Today I Learned” post. Every time you teach, you reinforce your own learning and your reputation simultaneously.

    4. Be Consistent Over Being Viral

    The first 10 posts feel pointless. By post 50, you start getting DMs. By post 100, recruiters are finding you. Set a sustainable cadence — one LinkedIn post per week, one blog post per month. Two years of that consistency will transform your career.

    5. Engage, Don’t Just Broadcast

    The fastest way to grow your network is to add value in other people’s conversations first. Find the data engineers you respect. Comment thoughtfully on their posts. Share their work with your own take. This is how you get on people’s radars before you have a big following.

    A Week-1 Challenge

    This week, do one of these: write a LinkedIn post about a technical problem you solved recently, reply thoughtfully to three posts from data engineers you admire, or write up a short internal doc explaining an architecture decision you made. That’s it. No newsletter or podcast required yet.

    Final Thought

    You spent years learning SQL, Python, Spark, dbt, Airflow, and a dozen other tools. You’ve built systems that process millions of rows of data. Don’t let that expertise stay invisible. Your career is also a product. Build it with the same intention you bring to your pipelines.

    Start this week. One post. One doc. One comment. You’ve got this. 🚀

    — Pushpjeet Cholkar, Data Engineer

  • Two years ago, choosing an ML tool meant picking one of three options. Today, I track over 50 tools in the MLOps space—and new ones ship every week.

    But here’s the thing: more options don’t mean easier decisions. They mean paralysis.

    As a data engineer, you’re not here to evaluate every tool. You’re here to ship models, monitor them, and keep pipelines running at scale. This guide cuts through the noise and covers what actually matters in production.

    ## The Three Layers of Your ML Stack

    Modern ML infrastructure has three distinct challenges, and each needs a different tool.

    ### Layer 1: Feature Engineering & Storage

    This is where your ML maturity actually lives. Raw data → Features → Training → Inference. If features don’t flow smoothly between training and serving, you’re in trouble.

    **The Problem:** Most teams train with one data pipeline and serve from another. A feature computed in Spark during training might be computed in Pandas on the inference server. Slight differences in logic. Slight differences in timing. Your model silently degrades, and you don’t know why.

    **The Solution:** Feature stores.

    Three mature options exist today:

    – **Tecton** – The enterprise choice. SOC 2 compliant, strong operational support, battle-tested at scale. Costs are high, but the complexity is justified at enterprise scale.
    – **Feast** – The open-source backbone. Free, flexible, runs on Kubernetes, smaller community. Great if you want control and don’t need support.
    – **Databricks Feature Store** – If you’re already in the Databricks ecosystem, it’s deeply integrated and surprisingly good.

    **My take:** Most teams start with Feast. It teaches you what a feature store should be. Move to Tecton when your features become mission-critical.

    ### Layer 2: Model Serving & Inference

    You’ve trained a model. Now what? It lives in a notebook? Nope. It needs to serve requests at scale, in real-time, with sub-100ms latencies.

    **The Problem:** Data scientists export models from Scikit-learn, XGBoost, or PyTorch. Engineers containerize them. But the serving layer often becomes a bottleneck—custom Python Flask servers, inconsistent dependencies, no monitoring.

    **The Solution:** Specialized inference frameworks.

    Two leaders emerged:

    – **BentoML** – Designed for data engineers. One Python decorator turns your model into a production service. Handles batching, scaling, dependency management. Fast to deploy, mature community.
    – **Seldon Core** – Kubernetes-native. Runs on your cluster, scales with your workload, integrates with monitoring stacks. Steeper learning curve, but worth it at scale.

    **My take:** BentoML gets you 80% of the way there with 20% of the complexity. Use Seldon when you need predictable, declarative scaling.

    ### Layer 3: Model Monitoring & Observability

    This is where most teams fail silently.

    You ship a model. It works great in testing. But three months later, data drift sets in. Your model’s predictions are 40% less accurate than they were at training time. You have no idea. Your customers do.

    **The Problem:** ML is invisible. You can’t just use application monitoring. You need to watch for data drift, prediction drift, feature distribution changes, label shifts.

    **The Solution:** Dedicated ML monitoring tools.

    The ecosystem split into two camps:

    – **Arize & Whylabs** – Purpose-built for production ML. Dashboard views into model health, drift detection that works, integrations with all the tools you use. Not cheap, but focused.
    – **Open-source alternatives** – Alibi Detect, Great Expectations for data quality, Prometheus for basic metrics. Requires assembly but free.

    **My take:** If your model touches customers, Arize or Whylabs pays for itself in one prevented incident. If it’s internal, Great Expectations + Prometheus works.

    ## The Real-Time ML Shift

    One more trend worth discussing: batch processing is giving way to streaming.

    Yesterday’s architecture: Daily batch pipeline. Train models on yesterday’s data. Serve predictions from this morning’s batch.

    Tomorrow’s architecture: Real-time feature pipelines. Models trained on streaming data. Sub-second predictions.

    Tools enabling this shift:

    – **Kafka** – The backbone. If you’re building streaming features, you’re using Kafka.
    – **Flink** – Distributed stream processing at scale. Complex, but handles what Spark can’t.
    – **Bytewax** – Lightweight Python framework for stream processing. Newer, but impressive for ML workloads.

    ## Practical Decision Framework

    Here’s how I choose tools:

    **1. What’s your bottleneck?**
    – Can’t train consistently? Fix your features first. You need a feature store.
    – Model works in test but fails in production? You need monitoring.
    – Can’t serve fast enough? You need BentoML or Seldon.

    **2. What’s your scale?**
    – Under 10k requests/day? Start simple. BentoML + Great Expectations might be enough.
    – Over 100k requests/day? You need Seldon + proper monitoring.
    – Over 1M requests/day? You probably need specialized infrastructure (Tecton + Arize or custom).

    **3. What’s your team’s expertise?**
    – If you have Kubernetes experts, use Kubernetes-native tools (Seldon, KServe).
    – If you have Python experts, lean on Python-first tools (BentoML, Feast).
    – If you have data engineers (likely), build around data-centric tools (Feature stores, streaming).

    ## The Biggest Mistake

    Evaluation paralysis.

    I’ve seen teams spend six months comparing tools and ship nothing. The difference between Feast and Tecton matters less than actually having a feature store. The difference between BentoML and Seldon matters less than actually monitoring your model.

    Pick a tool. Use it for three months. Then evaluate. Tools improve monthly—your production insights are worth more than theoretical perfection.

    ## What’s Next?

    The ML tools landscape will keep evolving. Foundation models are changing what “serving” means. Prompt engineering is the new feature engineering. But the fundamentals stay the same:

    – Feature consistency between training and serving
    – Fast, reliable inference at scale
    – Continuous monitoring and drift detection

    Build your stack around these principles, and you’ll adapt to whatever tools emerge next year.

    — Pushpjeet Cholkar, Data Engineer

  • 5 Silent Killers of Production Data Pipelines (And How to Fix Them)

    I’ve seen pipelines fail in the most dramatic ways.

    Not during development. Not during testing. In production. At 2 AM. Right before a stakeholder demo.

    And almost every time, the root cause wasn’t bad code. It was a bad assumption — one that was quietly baked into the design and never questioned.

    If you build data pipelines professionally, this post is for you. Let’s walk through five assumptions that silently kill pipelines, and more importantly, how to fix them before they become your emergency.


    1. “The Source Schema Will Never Change”

    This is the most common assumption I see — and the most dangerous.

    You build an ingestion layer that reads from a MySQL table or a REST API. It works perfectly in dev and staging. You deploy to prod and call it a day.

    Then three weeks later, the backend team renames a column. Or adds a NOT NULL constraint. Or changes a field from varchar to int. Nobody sends a Slack message. Nobody files a ticket. And your pipeline silently breaks — or worse, starts producing wrong results.

    The fix: Validate schema at ingestion time, not just at build time. Tools like Great Expectations, Soda Core, or even a simple Python schema check can catch drift early. Make schema validation a first-class citizen of your pipeline, not an afterthought.

    # Simple example: validate expected columns exist before processing
    expected_columns = {"user_id", "event_type", "timestamp", "session_id"}
    actual_columns = set(df.columns)
    
    if not expected_columns.issubset(actual_columns):
        missing = expected_columns - actual_columns
        raise ValueError(f"Schema drift detected. Missing columns: {missing}")

    2. “NULL Means Missing Data”

    This one is subtle but important. NULL in a database can mean many different things: the data was never collected, the user explicitly opted out, the value is unknown, or it’s a default placeholder.

    If you treat all NULLs the same way — replacing them with zeros, dropping them, or ignoring them — you might be making incorrect business decisions downstream.

    The fix: Treat NULL handling as a business logic decision, not a technical default. Sit down with your data analyst or product team and ask: What does NULL mean in this context? Then encode that decision explicitly in your transformation layer, and document it.


    3. “The Pipeline Will Always Run on Schedule”

    In an ideal world, your Airflow DAG fires at 6 AM every day, runs cleanly, and finishes in 20 minutes. In reality, you have late-arriving data, infrastructure hiccups, manual backfills, and retry logic that re-runs tasks multiple times.

    If your pipeline isn’t idempotent — meaning, running it twice produces the same result as running it once — you’re one retry away from duplicate data or corrupted aggregates.

    The fix: Design for idempotency from the start. Use INSERT OVERWRITE or MERGE instead of INSERT INTO. Add partition filters so reruns only affect the target date range. Test your pipeline by intentionally running it twice in a row and verifying the output is identical.

    -- Non-idempotent (dangerous):
    INSERT INTO orders_summary SELECT * FROM raw_orders WHERE date = '2026-04-13';
    
    -- Idempotent (safe):
    INSERT OVERWRITE orders_summary PARTITION (date = '2026-04-13')
    SELECT * FROM raw_orders WHERE date = '2026-04-13';
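The rerun test is worth automating. A toy in-memory version of the two write patterns shows why the overwrite is safe (the `warehouse` dict stands in for a partitioned table):

```python
warehouse: dict[str, list] = {}  # partition -> rows; stands in for a partitioned table

def insert_into(partition: str, rows: list) -> None:
    """Append-only write: NOT idempotent, reruns duplicate data."""
    warehouse.setdefault(partition, []).extend(rows)

def insert_overwrite(partition: str, rows: list) -> None:
    """Overwrite the partition: reruns converge to the same state."""
    warehouse[partition] = list(rows)

rows = [("o1", 100), ("o2", 250)]

insert_overwrite("2026-04-13", rows)
insert_overwrite("2026-04-13", rows)   # simulate a retry
assert warehouse["2026-04-13"] == rows  # identical output: idempotent

insert_into("2026-04-14", rows)
insert_into("2026-04-14", rows)        # simulate a retry
assert len(warehouse["2026-04-14"]) == 4  # duplicated rows: not idempotent
```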

    4. “Row Count Equals Data Quality”

    I’ve seen dashboards that proudly show ‘1,000,000 rows processed’ as a success metric. But here’s the truth: you can have a million rows and still have complete garbage. Row count tells you the pipeline ran. It tells you nothing about whether the data is correct.

    The fix: Add meaningful data quality checks — completeness (are critical fields populated?), distributions (has the average order value suddenly dropped 80%?), referential integrity, and freshness. Libraries like Great Expectations, dbt tests, or custom Python scripts can automate these checks.


    5. “Logs Are Enough for Debugging”

    Logs are great. They tell you that something went wrong, and sometimes they even tell you what. But when a data engineer gets paged at 2 AM because a dashboard is wrong, the question isn’t just ‘what happened?’ — it’s ‘why did it happen, and what upstream process caused it?’

    That’s where data lineage comes in. Lineage gives you a graph of how data flows from source to destination — which table feeds which model, which model feeds which report.

    The fix: Invest in lineage from day one. If you’re using dbt, lineage is built in. Tools like OpenLineage, Marquez, or DataHub can add lineage tracking without a major rewrite. The setup cost is small compared to the debugging cost it saves.


    Putting It All Together

    The best data engineers don’t just build pipelines that work in dev. They build pipelines that survive reality — schema changes, retries, bad data, and all.

      Which of these has burned you in a real production incident? I’d love to hear your war stories in the comments.

      — Pushpjeet Cholkar, Data Engineer

    • Real-World AI Applications in 2026: What Data Engineers Need to Know

      Everyone’s talking about AI. But most of that conversation lives in the world of demos, benchmarks, and announcements.

      Let’s talk about where AI is actually running in production — quietly, reliably, at scale — and what that means for you as a data engineer.

      Fraud Detection: Real-Time ML at Scale

      Banks and payment processors were among the first industries to go all-in on production ML. Today, when you swipe your card, a model scores that transaction in under 100 milliseconds.

      These systems ingest streaming data (think Kafka), run it through feature stores, and call inference endpoints on models trained on billions of labeled transactions. The old rule-based systems have been replaced by gradient boosting models and neural nets that detect subtle behavioral patterns.

      What this requires from data engineering:

      • Real-time streaming pipelines (Kafka, Flink, Spark Streaming)
      • Feature stores with low-latency reads (Feast, Tecton, Redis)
      • Data quality monitoring — a bad feature can tank model performance overnight

      Demand Forecasting: Knowing What You’ll Buy Before You Do

      Retailers like Walmart, Zara, and Amazon have turned demand forecasting into a serious competitive advantage. Instead of static seasonal models, they now run AI systems that incorporate weather data, local events, social media trends, historical sales, and supply chain status — all in real time.

      Tech stack typically involved:

      • Time-series models (Prophet, NeuralProphet, DeepAR on AWS SageMaker)
      • Feature pipelines ingesting 50+ data sources
      • Orchestration via Airflow or Prefect
      • Results served into planning dashboards via dbt + Looker or Tableau

      This is a data engineering problem at its core. The model is only as good as the pipeline feeding it.

      Predictive Maintenance: Preventing Failures Before They Happen

      Manufacturing and energy companies are using IoT sensor data + ML to predict equipment failure before it happens. A turbine with 200 sensors generates millions of data points per day. ML models trained on historical failure patterns can now flag anomalies weeks in advance.

      The data pipeline challenge here is massive:

      • Ingesting high-frequency sensor streams
      • Handling missing data and sensor drift
      • Storing time-series data efficiently (InfluxDB, TimescaleDB, or Delta Lake with time partitioning)
      • Triggering alerts when anomaly scores cross thresholds
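      The last step — alerting on anomaly scores — can be sketched with a rolling z-score check, a deliberately simple stand-in for the trained anomaly models these systems actually run:

      ```python
      from collections import deque
      from statistics import mean, stdev

      def detect_anomalies(readings, window=20, threshold=3.0):
          """Flag readings whose z-score against a rolling window exceeds
          the threshold. Toy version of the per-sensor anomaly scoring a
          predictive-maintenance pipeline runs on every new reading."""
          history = deque(maxlen=window)
          alerts = []
          for i, value in enumerate(readings):
              if len(history) >= 2:
                  mu, sigma = mean(history), stdev(history)
                  if sigma > 0 and abs(value - mu) / sigma > threshold:
                      alerts.append((i, value))
              history.append(value)
          return alerts

      # Stable vibration readings with one spike the model should flag.
      readings = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.1, 9.8, 1.0, 0.9]
      print(detect_anomalies(readings, window=5))
      ```

      Note that the spike itself then enters the window — handling sensor drift and contaminated baselines is exactly the "massive pipeline challenge" described above.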

      AI-Assisted Code Reviews and Developer Tools

      Tools like GitHub Copilot, CodeRabbit, and Cursor are now embedded in daily development workflows. From a data perspective, these tools are powered by large language models fine-tuned on code, served via inference APIs with strict latency requirements.

      The impact on software teams is real: teams adopting these tools commonly report a 30-40% reduction in PR review turnaround time, faster onboarding of new engineers, and fewer syntax-level bugs making it to production.


      Your Social Feed: The Most Visible AI in the World

      Every time you open Instagram, TikTok, LinkedIn, or YouTube, you’re triggering dozens of ML inference calls. Content ranking, ad targeting, notification timing, A/B test assignment — it’s all ML, running in real time, personalized to you specifically.

      The Common Thread: Data Engineering Is the Foundation

      Look at every example above. Every single one depends on:

      1. Clean, reliable data ingestion — if the pipeline breaks, the model breaks
      2. Feature engineering — raw data rarely goes straight into models
      3. Monitoring and data quality — models degrade silently when data shifts
      4. Scalable infrastructure — AI at scale requires petabyte-level thinking

      This is why data engineers are still the most underrated role in AI projects. The ML engineer gets the credit. The data engineer keeps the lights on.

      What You Should Take Away From This

      AI applications in 2026 are real, widespread, and deeply dependent on data infrastructure. As a data engineer, the smartest move is to understand what the models need — not just how to build pipelines, but how to build pipelines that serve real ML use cases.

      The gap between “data engineer” and “ML platform engineer” is closing. And the ones closing it fastest are the ones who understand both sides.

      What real-world AI application has impressed you the most? Leave a comment below — I read every one.

      — Pushpjeet Cholkar, Data Engineer

    • Spark, dbt, and Airflow: How to Use All Three Without Losing Your Mind

      Every data engineer eventually lands on the same question: “When do I use Spark vs dbt vs Airflow?”

      If you’ve asked yourself this, you’re not alone. These three tools form the backbone of a modern data stack — but the confusion about when to use which one leads to some of the messiest pipeline architectures I’ve ever seen.

      In this post, I’m going to break down each tool’s role, show you where they overlap (and where they absolutely don’t), and walk you through a practical architecture pattern that actually scales.

      The Short Answer (Before We Dive In)

      • Spark = distributed compute for large-scale data processing
      • dbt = SQL-based transformation layer inside your data warehouse
      • Airflow = orchestrator that schedules and monitors jobs

      They’re not competitors. They’re teammates. The trick is giving each one the right job.

      Apache Spark: Your Heavy-Lifting Engine

      Apache Spark is a distributed computing framework designed to process massive amounts of data fast. We’re talking terabytes or petabytes, spread across a cluster of machines working in parallel.

      When should you reach for Spark? Use it when you have raw, unstructured data coming from Kafka, S3, or HDFS. Use it when your data volume makes single-machine processing impractical. Use it when you need complex transformations before data hits your warehouse, or when you’re doing streaming ingestion alongside batch processing.

      Spark is excellent at the ingestion and raw processing phase. It can read from almost any source, apply heavy transformations in PySpark or Scala, and write results to your data lake or warehouse.
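      The core idea Spark scales out — split data into partitions, transform each partition in parallel, combine the results — can be illustrated locally. This is a toy sketch using threads on one machine, not PySpark code; Spark does the same thing across a cluster:

      ```python
      from concurrent.futures import ThreadPoolExecutor

      def process_partition(partition):
          # Stand-in for a heavy per-partition transformation
          # (parse, filter, aggregate).
          return sum(x * x for x in partition)

      def parallel_sum_of_squares(data, num_partitions=4):
          """Split data into partitions, process them in parallel,
          and combine the partial results -- Spark's map/reduce shape."""
          size = max(1, len(data) // num_partitions)
          partitions = [data[i:i + size] for i in range(0, len(data), size)]
          with ThreadPoolExecutor() as pool:
              return sum(pool.map(process_partition, partitions))

      print(parallel_sum_of_squares(list(range(1000))))
      ```

      Spark's value is doing this across machines, with fault tolerance and shuffles handled for you — which is also why it's overkill for data that fits comfortably on one node.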

      What Spark is NOT: a scheduler, an orchestrator, or a transformation layer inside your warehouse. Using Spark to run light transformations on structured warehouse data is overkill — that’s dbt’s territory.

      dbt: The Transformation Layer Your SQL Deserves

      dbt (data build tool) changed how data engineers think about transformations. Instead of scattered SQL scripts with names like final_v3_FINAL.sql, dbt gives you a structured, version-controlled, testable transformation framework.

      Here’s what makes dbt powerful: Modularity lets you write reusable SQL models that reference each other. Testing lets you define schema tests (not null, unique, accepted values) that run automatically. Documentation auto-generates a data catalog from your models. Lineage lets you visualize how data flows from source to final table.

      dbt runs inside your warehouse — Snowflake, BigQuery, Redshift, Databricks. It doesn’t move data; it transforms data that’s already there.

      A Quick dbt Example

      -- models/marts/fact_orders.sql
      WITH orders AS (
          SELECT * FROM {{ ref('stg_orders') }}
      ),
      customers AS (
          SELECT * FROM {{ ref('stg_customers') }}
      )
      SELECT
          o.order_id,
          o.order_date,
          c.customer_name,
          o.total_amount
      FROM orders o
      LEFT JOIN customers c ON o.customer_id = c.customer_id

      That ref() function is dbt magic — it builds the dependency graph automatically, so dbt knows to run stg_orders and stg_customers before fact_orders.
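      The schema tests mentioned earlier live in a YAML file alongside the models. A minimal example for the fact_orders model above (column choices are illustrative):

      ```yaml
      # models/marts/schema.yml
      version: 2

      models:
        - name: fact_orders
          columns:
            - name: order_id
              tests:
                - unique
                - not_null
            - name: total_amount
              tests:
                - not_null
      ```

      Running dbt test then checks every declared constraint against the warehouse and fails loudly when one breaks.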

      What dbt is NOT: a job scheduler, a data ingestion tool, or a substitute for Spark on large raw datasets.

      Apache Airflow: The Conductor of Your Pipeline

      Airflow is a workflow orchestration platform. Its job is simple but critical: run the right jobs, in the right order, at the right time — and tell you when something goes wrong.

      You define workflows as DAGs (Directed Acyclic Graphs) in Python. A typical daily DAG looks like this: Spark ingests raw data → dbt transforms it → dbt tests validate it. Clean, readable, and version-controlled.
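      What Airflow's scheduler does under the hood — resolve dependencies, run what's ready, repeat — can be illustrated with a toy executor. This is not Airflow code (real DAGs use the DAG class and operators); it's just the underlying idea, with the three-step daily pipeline above as the example:

      ```python
      # Toy illustration of DAG execution: run each task only after
      # all of its upstream dependencies have completed.

      def run_dag(tasks, dependencies):
          """tasks: {name: callable}; dependencies: {name: [upstream names]}."""
          done, order = set(), []
          while len(done) < len(tasks):
              ready = [t for t in tasks
                       if t not in done
                       and all(u in done for u in dependencies.get(t, []))]
              if not ready:
                  raise ValueError("cycle detected - not a valid DAG")
              for task in ready:
                  tasks[task]()  # in Airflow, a worker executes this
                  done.add(task)
                  order.append(task)
          return order

      order = run_dag(
          tasks={
              "spark_ingest": lambda: print("Spark: raw data -> data lake"),
              "dbt_run": lambda: print("dbt: transform in warehouse"),
              "dbt_test": lambda: print("dbt: validate data quality"),
          },
          dependencies={"dbt_run": ["spark_ingest"], "dbt_test": ["dbt_run"]},
      )
      print(order)
      ```

      Airflow adds the parts this sketch ignores — schedules, retries, backfills, alerting — which is precisely why you let it orchestrate instead of chaining cron jobs.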

      The #1 Airflow mistake I see: Running heavy data processing logic inside Airflow operators. PythonOperators with 10,000-row Pandas loops, inline SQL queries that run for hours — this kills your Airflow workers. Airflow schedules work. It doesn’t do the heavy work itself.

      The Architecture Pattern That Works

      Here’s the pattern I’ve used on production pipelines handling hundreds of millions of rows daily: Airflow triggers a daily DAG → Spark ingests raw data to the data lake → Airflow triggers dbt → dbt transforms inside the warehouse → dbt tests validate data quality → BI tools and downstream consumers read clean data.

      Each layer has one responsibility. Airflow handles scheduling and monitoring, Spark handles scale, dbt handles structured transformations. When something breaks, you know exactly where to look.

      Common Mistakes to Avoid

      1. Running Pandas in Airflow operators. Heavy compute belongs in Spark, not inside Airflow. If your DAG tasks take more than a few minutes, move the logic to a Spark job and trigger it from Airflow.

      2. Using dbt for raw data ingestion. dbt reads from what’s already in your warehouse. It doesn’t pull from APIs, Kafka, or flat files. Use Spark, Fivetran, or a custom ingestion job for that.

      3. Treating Spark as a scheduler. Spark has no built-in job scheduling or dependency management. You need an orchestrator like Airflow to coordinate when Spark jobs run.

      4. No dbt tests. If you’re not running dbt test, you’re flying blind. Schema tests catch broken pipelines before your stakeholders do.

      Wrapping Up

      Spark, dbt, and Airflow are genuinely complementary. Once you understand each tool’s lane, using them together feels natural — and your pipelines become dramatically more maintainable.

      The key mental model: Airflow is the conductor. Spark is the muscle. dbt is the translator.

      Give each tool its role and stay disciplined about not crossing the lanes. Have questions about your specific setup? Drop a comment below — I read every one of them.

      — Pushpjeet Cholkar, Data Engineer