Author: pushpjeet

  • How Data Engineers Can Build a Personal Brand That Opens Doors

    There’s a quiet frustration that many data engineers share but rarely talk about.

    You’ve spent years mastering Spark, dbt, Airflow, and Python. You’ve built pipelines that process millions of rows. You’ve solved complex data quality issues that nobody else could crack. And yet — the career opportunities you want aren’t coming fast enough.

    Here’s the uncomfortable truth: technical skills get you in the door, but your personal brand is what keeps it open.

    In 2026, the data engineering job market is competitive. Companies aren’t just looking for engineers who can write great code — they’re looking for engineers who can communicate, influence, and lead. And your personal brand is the signal they use to evaluate that.

    What Is a Personal Brand (And Why Should a Data Engineer Care)?

    A personal brand is simply the impression people have of you professionally. It’s what comes up when someone Googles your name. It’s the posts you share on LinkedIn. It’s the articles you write, the problems you solve publicly, and the way you explain complex concepts to others.

    A strong personal brand can:

    • Get you noticed by recruiters who aren’t posting jobs publicly
    • Position you as a thought leader in your niche (Kafka? dbt? streaming pipelines?)
    • Help you command higher salaries because you’re seen as an expert, not just a candidate
    • Build a network that sends opportunities your way

    The best part? You don’t need to be an influencer. You just need to be consistent.

    Step 1: Pick Your 3 Core Topics

    Pick 3 topics that sit at the intersection of what you know deeply, what you enjoy talking about, and what your target audience cares about. For me, those are: Data Engineering (pipelines, architecture, tools), Python for data workflows, and AI tools for engineers.

    Step 2: Show Your Work, Not Just Your Results

    Instead of waiting until you’ve built the perfect data pipeline to talk about it, share your journey. Instead of posting “Just shipped a new ETL pipeline!”, try: “We had 3-hour data latency. Here’s how I rebuilt it using Kafka + Spark Streaming to get it under 5 minutes — and the mistake I almost made.” That version teaches something. It builds trust.

    Step 3: Choose Your Platform and Post Consistently

    LinkedIn is the most powerful platform for data engineers right now. A personal blog gives you a home base you own — and helps with SEO. One post per week, every week, for six months will do more for your career than ten posts in a burst followed by three months of silence.

    Step 4: Engage, Don’t Just Broadcast

    Leave thoughtful comments on posts from engineers you admire. Reply to every comment on your own posts. Share other people’s insights with your own take added. The engineers who grow fastest are the ones genuinely engaging — not just broadcasting.

    Step 5: Be Patient and Track What Works

    Building a personal brand is a long game. Expect the first 90 days to feel slow. Track what resonates — which posts get comments, which topics drive profile views. As a data engineer, you’re uniquely positioned to be analytical about your content strategy. Use that superpower.

    Your Action Plan for This Week

    1. Update your LinkedIn headline to reflect your specialty
    2. Write one post about a problem you solved recently
    3. Leave five genuine comments on posts by engineers you respect
    4. Start a notes document to capture ideas for future posts

    The engineers who will thrive in the next five years aren’t just the ones with the most technical skills. They’re the ones who can communicate their value, build trust at scale, and show up consistently. Your code already speaks for itself. Now it’s time to let your voice do the same.

    — Pushpjeet Cholkar, Data Engineer

  • 5 AI & ML Tools Every Data Engineer Should Know in 2026

    1. dbt + LLM Macros: AI-Powered Transformations

    What it is: dbt (data build tool) integrates with large language models to auto-generate column descriptions, test cases, and documentation from your SQL models.

    Why it matters: Writing documentation and tests is the least glamorous part of data engineering — and the most skipped. LLM-powered macros change that equation. You can prompt dbt to generate YAML documentation for an entire model in seconds.

    # Build dbt's documentation site for a single model
    # (LLM-powered macros can fill in the descriptions beforehand)
    dbt docs generate --select my_model

    With tools like dbt-osmosis and emerging LLM integrations, you can now propagate column descriptions automatically across your DAG.
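
    The propagation idea is simple enough to sketch. Here is a minimal, library-free illustration of the inherit-by-column-name rule (the model names and columns are hypothetical, and real dbt-osmosis operates on YAML schema files, not Python dicts):

```python
# Toy sketch: a downstream column with no description inherits the
# description of the same-named column in an upstream model.
upstream_docs = {
    "stg_orders": {
        "order_id": "Primary key of the order.",
        "order_amount": "Order total in USD.",
    },
}

downstream_docs = {"fct_orders": {"order_id": None, "order_amount": None}}

def propagate(upstream, downstream):
    for cols in downstream.values():
        for col, desc in cols.items():
            if desc is None:
                # look for a same-named, documented column upstream
                for up_cols in upstream.values():
                    if up_cols.get(col):
                        cols[col] = up_cols[col]
                        break
    return downstream

docs = propagate(upstream_docs, downstream_docs)
print(docs["fct_orders"]["order_id"])  # inherited from stg_orders
```

    In a real project the same walk runs over dbt's manifest and writes the inherited descriptions back into the schema YAML.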

    2. MLflow: The Experiment Tracker You Already Need

    What it is: MLflow is an open-source platform for managing the ML lifecycle — experiment tracking, model registry, and model serving.

    Why data engineers need it: Even if you never train a model yourself, your pipelines feed them. When a model degrades, the first question is: “Did the training data change?” MLflow gives you the audit trail to answer that.

    import mlflow
    
    with mlflow.start_run():
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_metric("rmse", 0.84)
        mlflow.log_artifact("model.pkl")

    3. Feast: The Feature Store That Stops Pipeline Duplication

    What it is: Feast is an open-source feature store that lets you define, store, and serve ML features consistently across training and production environments.

    The problem it solves: Without a feature store, teams rebuild the same feature logic in multiple places. Models trained on one version of a feature get served predictions from a slightly different version — this is called training-serving skew, and it silently kills model accuracy.

    from feast import FeatureStore
    
    store = FeatureStore(repo_path=".")
    features = store.get_online_features(
        features=["user_stats:purchase_7d_avg"],
        entity_rows=[{"user_id": 1001}]
    ).to_dict()

    4. Great Expectations + Anomaly Detection: Quality Beyond Rules

    Why the combination matters: Rule-based expectations catch what you already know to look for. Anomaly detection catches distribution shifts, sudden value spikes, or gradual drift that no rule anticipated.

    import great_expectations as gx
    
    context = gx.get_context()
    validator = context.sources.pandas_default.read_csv("orders.csv")
    validator.expect_column_values_to_not_be_null("order_id")
    validator.expect_column_values_to_be_between("order_amount", min_value=0, max_value=50000)
    results = validator.validate()
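
    To see the difference in practice, here is a small, stdlib-only sketch of the anomaly side: a z-score check on daily row counts. The counts and the 3-sigma threshold are invented for illustration; real anomaly detection tools fit richer models, but the principle is the same:

```python
import statistics

def is_row_count_anomaly(history, today, z_threshold=3.0):
    """Flag today's row count if it sits more than z_threshold
    standard deviations away from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

history = [10_120, 9_980, 10_050, 10_200, 9_910, 10_075, 10_140]
print(is_row_count_anomaly(history, 10_060))  # a typical day
print(is_row_count_anomaly(history, 3_200))   # upstream dropped rows
```

    No rule in the earlier block would have caught the second case, because every individual row could still pass its column checks.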

    5. Vertex AI / SageMaker Pipelines: Orchestrating ML Like a Data Engineer

    If you’ve used Apache Airflow, these tools map cleanly to concepts you already know: DAGs, steps, inputs/outputs, artifacts — just applied to ML workflows.

    from kfp import dsl

    # preprocess_op, train_op, and evaluate_op are pipeline components
    # (e.g. built with @dsl.component) defined elsewhere in the project
    @dsl.pipeline(name="training-pipeline")
    def training_pipeline(data_path: str):
        preprocess_task = preprocess_op(data_path=data_path)
        train_task = train_op(data=preprocess_task.outputs["processed_data"])
        evaluate_task = evaluate_op(model=train_task.outputs["model"])

    The Big Picture

    In 2026, a pipeline doesn’t just move data from A to B. It moves data from raw sources to clean features to trained models to reliable predictions to business outcomes. Every link in that chain needs engineering.

    Build both sides of the stack. That’s where the real leverage is.

    — Pushpjeet Cholkar, Data Engineer

  • Stop Treating Your Data Pipelines Like Scripts — Build Them Like Products

    Every data engineer has a story.

    It usually starts the same way: someone needed a quick data pull, so you wrote a Python script. It worked. Then it got scheduled. Then it fed a dashboard. Then the VP of Sales started refreshing that dashboard every morning before their 9am standup.

    Your “quick script” just became critical infrastructure — and nobody updated the README.

    This is one of the most common patterns in data engineering, and it’s also one of the most dangerous. When pipelines are built like throwaway scripts, they become time bombs. They break at the worst moments, they’re impossible to debug, and they’re terrifying to hand off to someone else.

    The fix? Start treating your data pipelines like products.

    1. Version Control Everything — Not Just the Code

    Most engineers version-control their Python files. But your pipeline is more than just code. Version control your SQL transformations, dbt models, schema definitions, DAG definitions, and infrastructure configs. When a schema changes without a Git commit, you lose traceability.

    Practical tip: Use a monorepo structure for your data platform. Tools like dbt make this natural — every model, test, and doc block lives in version control.

    2. Write Data Tests, Not Just Code Tests

    Unit tests catch bugs in your logic. Data tests catch bugs in your data — and in data engineering, the data is usually where the real surprises hide. Most production data issues aren’t caused by broken code — they’re caused by an upstream source sending nulls, a date field switching formats, or a join key returning duplicate rows.

    Test for not-null checks on critical columns, uniqueness constraints on primary keys, accepted values for categorical columns, referential integrity between tables, and row count anomalies. dbt has built-in generic tests for all of the above.
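
    In dbt, those checks live in a schema file next to the model. A hedged sketch (the model and column names are placeholders; not_null, unique, accepted_values, and relationships are dbt's built-in generic tests):

```yaml
# models/schema.yml (illustrative)
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ["placed", "shipped", "returned"]
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
```

    Run dbt test and every one of these becomes a query that fails your build before bad data reaches a dashboard.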

    3. Build Observability From Day One

    If your pipeline fails silently, does it even matter that it failed? The answer is yes — and your stakeholders will make it very clear when they figure out their dashboard is two days stale.

    Observability means alerting on failures, data freshness monitoring, row-level audit logs, and lineage tracking. The rule of thumb: you should know your pipeline is broken before your stakeholders do. Always.
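
    Freshness monitoring in particular is cheap to start with. A minimal sketch, assuming you can query the table's last load timestamp and that a two-hour SLA applies (both are assumptions for illustration, not a standard):

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at, max_staleness=timedelta(hours=2)):
    """Return an alert message if the table is staler than the SLA,
    else None. In production this would page someone, not return a string."""
    age = datetime.now(timezone.utc) - last_loaded_at
    if age > max_staleness:
        return f"STALE: last load {age} ago exceeds SLA of {max_staleness}"
    return None

# a load three hours ago breaches a two-hour SLA
stale = datetime.now(timezone.utc) - timedelta(hours=3)
print(check_freshness(stale))
```

    Wire that check into your orchestrator's schedule and you have the beginnings of the "know before your stakeholders do" rule.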

    4. Document the WHY, Not Just the WHAT

    Code explains what it does. Documentation should explain why it does it. Six months from now, when someone needs to modify a complex transformation, they don’t need to know that this column is a LEFT JOIN — they need to know why it’s a LEFT JOIN and what business logic it encodes.

    Write dbt model descriptions that explain business context, keep an Architecture Decision Record file in your repo for major design choices, and update docs as part of your PR review process — not as an afterthought.

    5. Treat Pipeline Failures as Incidents

    When your production pipeline breaks, it’s not just a bug — it’s a business incident. Log it with full error context. Alert the right people — not just the on-call engineer, but the data consumers who are affected. Fix it with a proper root cause analysis, not just a git revert. Then post-mortem it.

    Teams that run post-mortems on data incidents ship more reliable pipelines over time, because they learn from failures instead of repeating them.

    The Product Mindset Shift

    All of these practices come down to one mental shift: think about the people downstream from your pipeline. Ask yourself before every pipeline you build: Who depends on this data? What breaks if this pipeline fails at 2am? How will I know it’s working correctly tomorrow? Would a new engineer understand this in 6 months?

    If you can answer those questions confidently, you’re not just writing scripts anymore. You’re building data infrastructure that lasts.

    Have a “temporary” pipeline that’s been running for years? Share your story in the comments 👇

    — Pushpjeet Cholkar, Data Engineer

  • Seven Days, Seven Lessons: A Data Engineer’s Weekly Reflection

    Sundays are for slowing down. Not for scrolling through tutorials, not for chasing the next framework — but for actually sitting with what the week taught you.

    This week was unusually rich. I spent seven days writing about Spark partitioning, AI tools, Python idioms, career moves, advanced Airflow patterns, and real-world healthcare AI. By Saturday night, I noticed something strange: the daily topics were all different, but the lessons kept rhyming.

    Here are the seven that stuck. Not theory — things I actually changed my mind about this week.

    1. Tools Are Disposable. Judgment Is Not.

    Early in my career, I collected tools like trading cards. Airflow, dbt, Spark, Kafka, Flink, Snowflake, Databricks, Polars, DuckDB, Iceberg — if it had a logo, I wanted it on my résumé.

    This week I watched a senior engineer replace a 200-line Airflow DAG with a 40-line Python script and a cron job. The pipeline ran faster, broke less often, and was readable by a junior hire on day one.

    The lesson: Most of the time, the question isn’t “which tool is best?” It’s “do we even need a tool here?” Judgment is what turns a toolkit into a career.

    2. Fundamentals Compound. Trends Don’t.

    I’ve paid for three courses on “next-generation” data warehouses in the last two years. The knowledge that has actually served me across every one of those warehouses? How query planners work. How indexes get chosen. Why a seemingly innocent OR in a WHERE clause can destroy a plan.

    Fundamentals are boring to post about. They don’t trend. But they compound for decades.

    The lesson: Spend 20% of your learning budget on shiny things. Spend 80% on the fundamentals — SQL internals, distributed systems, data modeling, Linux.

    3. AI Isn’t Replacing Data Engineers. It’s Replacing a Certain Kind of Data Engineer.

    Every week there’s a new “AI will replace data engineers” post. This week I experimented openly with using AI to scaffold dbt models, write Spark transforms, and review my Python.

    The honest result: AI is extraordinary at boilerplate. It is still bad at judgment, architecture, cost modeling, and political navigation inside a company.

    The lesson: If your day-to-day is 80% boilerplate, 2026 is a wake-up call. If you spend your day on schemas, trade-offs, stakeholder alignment, and system design — AI is a jetpack, not a guillotine.

    4. Writing Publicly Is the Best Career Move I’ve Made.

    I didn’t get my last opportunity from a job board. I got it because someone read my LinkedIn posts and decided I thought clearly.

    Writing publicly forces something a promotion never will: you have to actually understand your own work well enough to explain it to strangers. That pressure makes you a better engineer.

    The lesson: Even if nobody reads it for the first six months, keep writing. The audience is a bonus. The clarity is the product.

    5. The Hardest Skill in 2026 Is Saying “Let’s Not Build That.”

    This one hurts to admit. For years, I measured my value by what I built. Pipelines shipped. DAGs authored. Models deployed.

    This week I killed three proposed pipelines before they started. Each would have added 3–5 weeks of work, two new data sources, and an ongoing maintenance burden. The business outcome we actually needed? A spreadsheet and a stakeholder conversation.

    The lesson: The best data engineers I know have a finely tuned “not now” reflex. They optimize for problems solved, not code shipped.

    6. Taste Is the Real Moat.

    You can teach someone Spark. You cannot teach them, in a weekend, to sense when a pipeline is getting too clever. To feel when a schema is drifting toward technical debt. To notice that a dashboard is answering the wrong question.

    That sensitivity is taste. It comes from reading other people’s code, breaking your own in production, and paying attention on purpose.

    The lesson: If you want to stand out in a field full of certifications, build taste. It takes years. It’s also the one thing the machine can’t clone.

    7. Unlearning Matters as Much as Learning.

    I started the week planning to write about new things I learned. I ended it realizing half the value was in unlearning — habits, tools, and opinions I had outgrown.

    • Unlearned: pandas is the only option. (Polars handled the heavy lifting in a fraction of the time.)
    • Unlearned: every pipeline deserves a DAG. (Some deserve a cron job.)
    • Unlearned: silent senior engineers are humble. (They’re just invisible. Speak up.)

    The lesson: Your growth isn’t only what you add. It’s what you’re willing to let go of.

    Closing Thought

    One week is a small window. But a week of deliberate attention will teach you more than a month of passive consumption.

    If you’re reading this on a Sunday, I’ll ask you what I asked myself this morning: What did you unlearn this week?

    Write it down. It’s probably the most valuable thing you touched all week.

    See you in the next post.

    — Pushpjeet Cholkar, Data Engineer

  • AI in Healthcare 2026: What’s Actually Running in Hospitals (and Why Data Engineers Are the Heroes Behind It)

    There’s a gap between AI headlines and AI reality.

    The headlines talk about chatbots passing medical boards and startups promising to “replace doctors.” The reality is more interesting — and much less flashy.

    Inside real hospitals in 2026, AI is already in the clinical workflow. It’s not replacing anyone. It’s quietly making every doctor, nurse, and pharmacist better at their job. And behind every one of those AI systems is a data engineering team doing some of the hardest, most regulated work in the industry.

    Let’s unpack what’s actually in production right now — and what it takes to build it.

    1. Medical Imaging AI: From Hype to Standard of Care

    Radiology was one of the first areas where AI moved from demo to daily use. In 2026, it’s genuinely part of the workflow.

    When you get an MRI or a CT scan at a modern hospital, the images don’t just go straight to the radiologist. They often pass through an AI pre-read first. Models trained on millions of labeled scans flag suspicious regions — early-stage lung nodules, brain bleeds, diabetic retinopathy, breast cancer indicators — before a human ever opens the file.

    The accuracy on narrow tasks is remarkable. For some cancer subtypes, the models now meet or beat specialist radiologists on sensitivity. But critically, the model doesn’t decide. The radiologist does. AI is the tireless junior partner that never misses a detail because it’s been a long day.

    What this requires from data engineering

    Medical imaging pipelines deal with DICOM files — large, metadata-rich, and privacy-sensitive. You’re moving hundreds of megabytes per study, sometimes gigabytes, across hospital networks and into inference systems. That means streaming ingestion, de-identification (stripping patient info before training), and deterministic audit trails. Every scan the model ever sees must be traceable back to a consent and a data use agreement.
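
    At its core, de-identification is a deny-list over metadata. A deliberately toy, stdlib-only sketch (the field names loosely mirror DICOM attributes; a real pipeline would use a library such as pydicom and a full HIPAA Safe Harbor field list):

```python
# Fields that must never reach a training set (illustrative subset)
PHI_FIELDS = {"PatientName", "PatientID", "PatientBirthDate",
              "InstitutionName", "ReferringPhysicianName"}

def deidentify(study_metadata):
    """Return a copy of the metadata with PHI fields removed, plus a
    record of what was stripped, for the audit trail."""
    clean = {k: v for k, v in study_metadata.items() if k not in PHI_FIELDS}
    stripped = sorted(set(study_metadata) & PHI_FIELDS)
    return clean, stripped

scan = {"PatientName": "DOE^JANE", "PatientID": "12345",
        "Modality": "CT", "StudyDate": "20260114"}
clean, stripped = deidentify(scan)
print(clean)     # only non-PHI fields survive
print(stripped)  # what the audit log records
```

    The audit half matters as much as the stripping half: regulators ask not just whether PHI was removed, but whether you can prove it.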

    2. Drug Discovery: Biotech Gets a Cheat Code

    The old drug discovery process: 10+ years, $2+ billion per drug, single-digit percentage success rates.

    The new process in 2026: AI-designed molecules, protein folding predicted in seconds (thanks to AlphaFold and successors), and simulation-first pipelines that eliminate millions of candidates before a single lab experiment.

    Companies like Insilico Medicine, Recursion, and Isomorphic Labs have shown that generative models for molecular design can take months off discovery timelines. Some AI-designed candidates are already in Phase 2 trials.

    What this requires from data engineering

    Training molecular models at scale takes petabytes of structured chemistry and biology data. You’re building pipelines that ingest research papers, patent databases, assay results, genomic sequences, and 3D protein structures — and keeping all of it synchronized, versioned, and reproducible. Reproducibility is the big one: if a model suggests a drug candidate, regulators want to see exactly which data trained it.
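
    Reproducibility often comes down to content-addressing the training inputs. A minimal sketch of the idea: fingerprint the exact file set a model was trained on, so a later audit can prove what went in (the paths and contents are invented):

```python
import hashlib

def dataset_fingerprint(files):
    """Hash a sorted manifest of (path, content-hash) pairs so the exact
    training set can be identified and verified later."""
    manifest = hashlib.sha256()
    for path, content in sorted(files.items()):
        file_hash = hashlib.sha256(content).hexdigest()
        manifest.update(f"{path}:{file_hash}\n".encode())
    return manifest.hexdigest()

v1 = dataset_fingerprint({"assays/batch1.csv": b"mol,ic50\nA,12\n"})
v2 = dataset_fingerprint({"assays/batch1.csv": b"mol,ic50\nA,13\n"})
print(v1 != v2)  # any change to the training data changes the fingerprint
```

    Tools like DVC and lakeFS build on exactly this kind of content addressing; the fingerprint is what you attach to the model in your registry.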

    3. ER Triage: AI That Decides Who Gets Seen First

    Emergency rooms are chaos by design. When patients walk in, triage nurses have to decide — in seconds — who’s having a heart attack and who has the flu.

    Modern ERs in 2026 use AI triage assistants that pull together symptoms, vitals, prior history from the EHR, and even smartphone-reported data to produce a risk score. It doesn’t replace the triage nurse. It catches the edge cases tired humans miss on a 12-hour shift.

    In reported deployments, hospitals using ML-based triage have seen meaningful reductions in missed sepsis cases and faster time-to-treatment for cardiac events.

    What this requires from data engineering

    Real-time ingestion from multiple systems: vitals monitors, EHR databases, lab systems, sometimes wearables. Sub-second feature computation. Strict governance on model inputs — you can’t use protected attributes like race in triage scoring without creating ethical and legal disasters.

    4. Ambient Clinical Documentation: Giving Doctors Their Evenings Back

    If you’ve talked to a doctor in the last few years, you’ve probably heard one complaint above all others: charting.

    Doctors spend up to two hours per day after their last patient writing notes in the EHR. In 2026, ambient AI (think Nuance DAX, Abridge, Nabla) listens to the doctor-patient conversation, transcribes it, structures it into the right EHR fields, and surfaces it for the doctor to quickly review and sign.

    This isn’t a research project. It’s shipping at scale. Major health systems have rolled it out to thousands of clinicians. The reported effect on physician burnout is substantial.

    What this requires from data engineering

    Secure audio pipelines. Real-time streaming transcription with medical vocabulary. PHI redaction for training data. Integration with a dozen different EHR vendors. And audit logging that satisfies HIPAA — every recorded conversation must be tied to a consent workflow.

    5. ICU Deterioration Models: Forecasting the Next Six Hours

    In an ICU, small changes matter. A slight drift in heart rate variability, a subtle trend in lactate levels, a blood pressure pattern — these can predict cardiac arrest or sepsis hours before it happens.

    Modern ICU early-warning systems use time-series ML to continuously score every patient. When the risk crosses a threshold, nurses get a notification. The best models have been shown to predict deterioration 6+ hours before a human clinician would have noticed.

    This is AI saving lives, not by being clever, but by being vigilant.

    What this requires from data engineering

    Continuous streaming ingestion from bedside monitors, lab systems, and medication pumps. Time-series feature engineering at scale. Alert fatigue management — too many false positives and nurses start ignoring the system. Rigorous A/B testing in a life-or-death environment.
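
    The scoring idea behind these systems can be caricatured in a few lines. A deliberately simplified, stdlib-only sketch (production systems use trained time-series models over many correlated vitals; the window size and 3-sigma threshold here are invented):

```python
import statistics
from collections import deque

class VitalMonitor:
    """Flag a reading that deviates sharply from the recent baseline."""
    def __init__(self, window=12, z_threshold=3.0):
        self.readings = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value):
        alert = False
        if len(self.readings) >= 3:
            mean = statistics.mean(self.readings)
            stdev = statistics.stdev(self.readings) or 1e-9
            alert = abs(value - mean) / stdev > self.z_threshold
        self.readings.append(value)
        return alert

monitor = VitalMonitor()
for hr in [72, 74, 71, 73, 72, 70, 74]:
    monitor.observe(hr)          # stable baseline: no alerts
print(monitor.observe(118))      # sudden jump: alert fires
```

    Even this toy version shows why alert fatigue is an engineering problem: the threshold is a dial, and every notch trades missed events against ignored pages.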

    The Common Thread: Data Engineering on Hard Mode

    Every one of these AI applications has the same foundation: a team of data engineers figuring out how to move sensitive data through complex systems without breaking privacy laws, without losing fidelity, and without introducing bias.

    Healthcare data engineering is data engineering on hard mode:

    • HIPAA compliance — every byte must have a legal basis for existing where it exists
    • PHI handling — de-identification, pseudonymization, minimum-necessary principles
    • Consent tracking — patients can revoke consent, and your pipelines must respect that retroactively
    • Audit trails — regulators can and will ask who touched what data and when
    • Vendor integration hell — EHRs from Epic, Cerner, Meditech, and dozens of smaller players, each with their own APIs and quirks

    It’s hard. It’s slow. It’s unsexy. And it’s arguably the most meaningful data engineering work on the planet right now.

    Why This Matters If You’re in Data Engineering

    If you’re a data engineer early in your career and you want to work on something that matters, healthcare AI is one of the most compelling spaces in 2026. You’ll learn streaming at scale, you’ll master privacy engineering, you’ll deal with data governance most companies can only dream of, and you’ll work on systems where every bug actually matters.

    Every fraud detection pipeline protects some money. Every healthcare pipeline, done right, extends someone’s life.

    That’s a rare thing.

    Key Takeaways

    • AI is already running in clinical workflows across radiology, pharma, ER triage, documentation, and the ICU.
    • The magic is the quiet assist — not the replacement of doctors.
    • Every single one of these systems depends on regulated, auditable, clean data pipelines.
    • Healthcare data engineering combines the hardest parts of streaming, governance, and compliance.
    • If you want your pipelines to matter, this is one of the best places to build them.

    — Pushpjeet Cholkar, Data Engineer

  • Polars vs Pandas in 2026: When to Switch and When to Stay

    For over a decade, pandas has been the default answer to “how do I work with tabular data in Python?” It’s on every data engineer’s resume, in every tutorial, and baked into countless production pipelines. But in 2026, something has shifted. A Rust-powered challenger called Polars has matured from curiosity to production-ready tool, and data teams across the industry are quietly rewriting their hot paths.

    So is it time to switch? The honest answer is: sometimes. Let’s break it down.

    Why Pandas Became the Standard

    Before we criticize pandas, let’s be fair to it. Pandas won because it was good enough, early. Wes McKinney shipped it in 2008, and by the time most of us started doing serious data work, it already had the ecosystem, the Stack Overflow answers, and the muscle memory of a generation of analysts and engineers. Every notebook tutorial assumes pandas. Every ML library accepts a DataFrame. That gravity is hard to fight.

    But pandas also carries the scars of its age. It was designed before multi-core laptops were the norm, before Parquet was ubiquitous, and before anyone expected to process tens of gigabytes on a single machine. The API reflects that history — it’s quirky, it’s inconsistent in places, and it’s notoriously eager. Everything loads into memory. Everything runs single-threaded by default. Every .apply() is a Python for-loop wearing a disguise.

    What Polars Does Differently

    Polars is not just “pandas but faster.” It’s a rethink of what a DataFrame library should be in 2026.

    Lazy Evaluation

    The single biggest shift is lazy evaluation. When you write a Polars query using the lazy API, nothing executes immediately. Instead, Polars builds a query plan — much like a database would — and then optimizes it before running. It prunes unused columns. It pushes filters down closer to the I/O. It reorders joins for efficiency.

    The practical effect: if you read a 100-column Parquet file but only use 5 columns, Polars reads 5 columns from disk. Pandas reads 100 and throws 95 away. On a big file, that’s the difference between coffee break and lunch break.

    Parallelism Out of the Box

    Polars uses every core on your machine automatically. No multiprocessing, no joblib, no wrestling with the GIL. Aggregations, joins, and window functions all fan out across cores. On a modern 8-core laptop, that’s an 8x speedup you get for free.

    Memory Efficiency

    Polars uses Apache Arrow as its backing memory format. That means contiguous columnar buffers, explicit null handling, and no Python object overhead for every string. In my experience, a dataset that consumes 16GB in pandas will comfortably fit in 4-6GB in Polars.

    Expressions

    The Polars API is built around expressions — composable objects that describe a transformation. You write pl.col("revenue") * pl.col("quantity") and Polars handles the vectorization, parallelization, and type handling. No more .apply(lambda row: ...) anti-patterns.

    A Concrete Example

    Here’s a small benchmark from a real project I did last week. I was aggregating a year of clickstream data — about 120GB of Parquet files.

    Pandas version:

    import pandas as pd

    # eager: the whole dataset is read into memory before the groupby runs
    # (read_parquet takes a file or directory path, not a glob pattern)
    df = pd.read_parquet("clickstream/")
    result = df.groupby(["user_id", "event_date"])["revenue"].sum().reset_index()

    This crashed my 32GB machine. I had to chunk it manually.

    Polars version:

    import polars as pl
    result = (
        pl.scan_parquet("clickstream/*.parquet")
          .group_by(["user_id", "event_date"])
          .agg(pl.col("revenue").sum())
          .collect()
    )

    Ran in 14 minutes. No chunking. No manual memory management. The scan_parquet + lazy pattern let Polars stream the data through, only holding aggregation state in memory.

    When Pandas Is Still the Right Call

    I’m not here to tell you to delete pandas. There are plenty of cases where pandas is still the pragmatic choice.

    You’re Working With Small Data

    If your DataFrame fits in a few hundred megabytes and runs in seconds, the performance gap doesn’t matter. The pandas ecosystem, documentation, and Stack Overflow answers will save you more time than Polars will.

    You Need a Specific Ecosystem Integration

    Plotting with Matplotlib, feeding into scikit-learn, using Great Expectations — many libraries accept pandas DataFrames as a first-class input. Polars has a .to_pandas() method that makes interop easy, but if you’re bouncing back and forth a lot, the conversions add up.

    You Have Existing Code

    Rewriting a 5,000-line pandas codebase in Polars is not a weekend project. Be strategic. Identify the bottleneck stages and convert those. Leave the rest alone.

    Migration Tips

    If you’re ready to try Polars, here’s my recommended path.

    Start by installing both libraries side by side. You don’t have to pick one. Then pick your slowest pipeline stage — probably an aggregation over a big file — and rewrite just that stage in Polars. Read the input with pl.scan_parquet or pl.scan_csv, do the transformation, and use .collect() or .collect().to_pandas() to hand it back to the rest of your pipeline.

    Expect the API to feel alien for the first few days. .iloc is gone. .loc is gone. .apply is almost never the right answer. Instead, everything is an expression: pl.col("x").filter(pl.col("y") > 0).sum(). Once it clicks, you’ll wonder how you lived without it.

    Finally, read the Polars user guide. It’s one of the best-written pieces of open-source documentation I’ve encountered. Two hours with it will save you two weeks of Stack Overflow searches.

    The Honest Verdict

    Pandas is not going away. It’s the English of data tools — imperfect, quirky, but everyone speaks it. Polars is the precision instrument you reach for when the workload actually demands it. Learn both. Use the right one for the job. Stop writing overnight batch jobs when a 10-minute query will do.

    — Pushpjeet Cholkar, Data Engineer

  • 5 Apache Spark Optimization Tricks Every Data Engineer Should Know

    Most data engineers can write a Spark job. But writing one that’s actually fast? That’s where things get interesting.

    I’ve spent years working on large-scale data pipelines, and time and again I see the same performance mistakes show up in Spark jobs — even from experienced engineers. The good news: most of them are easy to fix once you know what to look for.

    Here are 5 Spark optimization tricks that have saved me hours of compute time and a lot of frustrated debugging.

    1. Stop Triggering Unnecessary Shuffles

    Shuffles are Spark’s most expensive operation. When a shuffle happens, data is redistributed across the network — taking time, memory, and I/O. Operations that trigger shuffles: groupBy(), wide join(), distinct(), repartition().

    The fix: use broadcast joins for small tables.

    from pyspark.sql.functions import broadcast
    result = large_df.join(broadcast(small_lookup_df), on="id", how="left")

    2. Get Your Partition Count Right

    Target ~128MB per partition. Too few = idle cores. Too many = scheduling overhead.

    df.rdd.getNumPartitions()  # inspect the current partition count
    df = df.repartition(200)   # full shuffle: increases count, evens distribution
    df = df.coalesce(50)       # merges partitions without a full shuffle

    Use coalesce() when reducing the partition count (it avoids a full shuffle), and repartition() when increasing it or when you need an even distribution.
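    As a back-of-the-envelope check, you can derive a starting partition count from the input size. The helper below is illustrative (the name and the 128 MB default are mine, not a Spark API):

    ```python
    def target_partitions(total_bytes: int, target_mb: int = 128) -> int:
        """Rule-of-thumb partition count: ceil(total size / target partition size)."""
        target = target_mb * 1024 * 1024
        return max(1, -(-total_bytes // target))  # ceiling division

    print(target_partitions(10 * 1024**3))  # a 10 GB input -> 80 partitions
    ```

    Treat the result as a starting point, then tune against what the Spark UI actually shows.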

    3. Cache DataFrames You Use More Than Once

    Spark’s lazy evaluation means it recomputes DataFrames from scratch each time an action is triggered — unless you cache them.

    df = spark.read.parquet("s3://my-bucket/data/")
    df.cache()                 # marks df for caching; nothing is computed yet
    count = df.count()         # first action materializes the cache
    filtered = df.filter(...)  # subsequent actions read from the cache

    4. Switch to Kryo Serialization

    Kryo can be 2–10x faster than Java’s default serializer. One config line:

    spark = (
        SparkSession.builder
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
    )

    5. Actually Read the Spark UI

    The Spark UI (at http://localhost:4040) shows exactly where time is wasted — skewed partitions, slow stages, missed caches. Check the Stages tab, DAG visualization, and Executors tab. A task taking 5 minutes while 199 others take 10 seconds? That’s partition skew — fix it by salting your join key.
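    Salting itself is simple: append a random suffix to the hot key on the large side, and replicate the small side once per suffix so the join still matches. A dependency-free sketch of the idea (the key names and salt count are made up; in Spark you would do this with withColumn and a small cross-joined salt range):

    ```python
    import random

    N_SALTS = 4  # how many ways to split each hot key

    # Large side: every row gets a random salt appended to its key.
    large = [("user_1", v) for v in range(8)] + [("user_2", 99)]
    salted_large = [(f"{k}#{random.randrange(N_SALTS)}", v) for k, v in large]

    # Small side: replicate each key once per salt so every salted key matches.
    lookup = {"user_1": "A", "user_2": "B"}
    salted_lookup = {f"{k}#{i}": val for k, val in lookup.items() for i in range(N_SALTS)}

    # The join now spreads "user_1" across N_SALTS partitions instead of one.
    joined = [(k, v, salted_lookup[k]) for k, v in salted_large]
    ```

    The row count is unchanged; only the distribution of the hot key across partitions improves.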

    Putting It All Together

    Checklist for every new Spark job: avoid shuffle-heavy operations where possible, check partition count, cache reused DataFrames, enable Kryo, and inspect the Spark UI after a test run. Optimizing Spark is systematic — find the bottleneck, understand it, fix it.

    — Pushpjeet Cholkar, Data Engineer

  • Why Data Engineers Need a Personal Brand (And How to Build One Without the Cringe)

    Let me paint you a picture.

    Two data engineers join the same company on the same day. Same skills. Same stack. Same team.

    Eighteen months later, one is a senior engineer with inbound recruiter messages and a growing online following. The other is still waiting for their “turn” at the next performance cycle.

    What separated them? It wasn’t the code. It was visibility.

    The Visibility Problem in Data Engineering

    Data engineering is one of the most impactful roles in a modern tech company. You build the infrastructure that powers product decisions, revenue models, and machine learning systems. Without you, data scientists are staring at empty Jupyter notebooks.

    And yet — data engineers are often the least visible people on the technical team. Your output lives in Airflow DAGs and dbt models that only your team appreciates. This invisibility has a career cost. A personal brand solves this.

    What Personal Branding Actually Means for Engineers

    First, let’s kill the cringe. Personal branding doesn’t mean becoming a LinkedIn influencer. For a data engineer, personal branding means one thing: making your expertise legible to the right people.

    The 5 High-ROI Moves for Building Your Brand as a Data Engineer

    1. Write About What You Just Solved

    Every week, you solve at least one problem that took you longer than it should have. A tricky dbt macro. A Spark memory tuning issue. A confusing Airflow dependency. Write 300 words about how you solved it and post it on LinkedIn or your blog. You will help dozens of engineers who are Googling the exact same problem.

    2. Narrate Your Architecture Decisions

    Most engineers document the what. Almost nobody documents the why. Why did you choose Kafka over Kinesis? Why did you pick Iceberg over Delta Lake? These decisions are gold. Write them up — share the best ones externally. This positions you as someone who thinks about engineering, not just implements it.

    3. Teach One Thing Every Week

    You know something that would be useful to someone at an earlier stage of their career. Teaching doesn’t require a YouTube channel. It can be a 5-minute Loom walkthrough, a reply to someone’s LinkedIn question, or a short “Today I Learned” post. Every time you teach, you reinforce your own learning and your reputation simultaneously.

    4. Be Consistent Over Being Viral

    The first 10 posts feel pointless. By post 50, you start getting DMs. By post 100, recruiters are finding you. Set a sustainable cadence — one LinkedIn post per week, one blog post per month. Two years of that consistency will transform your career.

    5. Engage, Don’t Just Broadcast

    The fastest way to grow your network is to add value in other people’s conversations first. Find the data engineers you respect. Comment thoughtfully on their posts. Share their work with your own take. This is how you get on people’s radars before you have a big following.

    A Week-1 Challenge

    This week, do one of these: write a LinkedIn post about a technical problem you solved recently, reply thoughtfully to three posts from data engineers you admire, or write up a short internal doc explaining an architecture decision you made. That’s it. No newsletter or podcast required yet.

    Final Thought

    You spent years learning SQL, Python, Spark, dbt, Airflow, and a dozen other tools. You’ve built systems that process millions of rows of data. Don’t let that expertise stay invisible. Your career is also a product. Build it with the same intention you bring to your pipelines.

    Start this week. One post. One doc. One comment. You’ve got this. 🚀

    — Pushpjeet Cholkar, Data Engineer

  • Two years ago, choosing an ML tool meant picking one of three options. Today, I track over 50 tools in the MLOps space—and new ones ship every week.

    But here’s the thing: more options don’t mean easier decisions. They mean paralysis.

    As a data engineer, you’re not here to evaluate every tool. You’re here to ship models, monitor them, and keep pipelines running at scale. This guide cuts through the noise and covers what actually matters in production.

    ## The Three Layers of Your ML Stack

    Modern ML infrastructure has three distinct challenges, and each needs a different tool.

    ### Layer 1: Feature Engineering & Storage

    This is where your ML maturity actually lives. Raw data → Features → Training → Inference. If features don’t flow smoothly between training and serving, you’re in trouble.

    **The Problem:** Most teams train with one data pipeline and serve from another. A feature computed in Spark during training might be recomputed in Pandas on the inference server. Slight differences in logic. Slight differences in timing. Your model silently degrades, and you don’t know why.

    **The Solution:** Feature stores.

    Three mature options exist today:

    – **Tecton** – The enterprise choice. SOC 2 compliant, strong operational support, battle-tested at scale. Cost is high; complexity is justified.
    – **Feast** – The open-source backbone. Free, flexible, runs on Kubernetes, smaller community. Great if you want control and don’t need support.
    – **Databricks Feature Store** – If you’re already in the Databricks ecosystem, it’s deeply integrated and surprisingly good.

    **My take:** Most teams start with Feast. It teaches you what a feature store should be. Move to Tecton when your features become mission-critical.
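    Whichever product you pick, the core discipline is the same: one canonical definition per feature, executed by both the training and serving paths. A dependency-free sketch of that idea (the feature names are invented):

    ```python
    # Single source of truth for feature logic.
    FEATURES = {
        "days_since_signup": lambda row: (row["event_ts"] - row["signup_ts"]) // 86_400,
        "is_weekend": lambda row: row["day_of_week"] in (5, 6),
    }

    def featurize(row: dict) -> dict:
        """Used verbatim when building training sets and when serving online."""
        return {name: fn(row) for name, fn in FEATURES.items()}

    featurize({"event_ts": 864_000, "signup_ts": 0, "day_of_week": 6})
    # -> {"days_since_signup": 10, "is_weekend": True}
    ```

    A feature store is, at heart, this dictionary with versioning, storage, and low-latency retrieval bolted on.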

    ### Layer 2: Model Serving & Inference

    You’ve trained a model. Now what? It lives in a notebook? Nope. It needs to serve requests at scale, in real time, with sub-100ms latency.

    **The Problem:** Data scientists export models from Scikit-learn, XGBoost, or PyTorch. Engineers containerize them. But the serving layer often becomes a bottleneck—custom Python Flask servers, inconsistent dependencies, no monitoring.

    **The Solution:** Specialized inference frameworks.

    Two leaders emerged:

    – **BentoML** – Designed for data engineers. One Python decorator turns your model into a production service. Handles batching, scaling, dependency management. Fast to deploy, mature community.
    – **Seldon Core** – Kubernetes-native. Runs on your cluster, scales with your workload, integrates with monitoring stacks. Steeper learning curve, but worth it at scale.

    **My take:** BentoML gets you 80% of the way there with 20% of the complexity. Use Seldon when you need predictable, declarative scaling.

    ### Layer 3: Model Monitoring & Observability

    This is where most teams fail silently.

    You ship a model. It works great in testing. But three months later, data drift sets in. Your model’s predictions are 40% less accurate than when you trained it. You have no idea. Your customers do.

    **The Problem:** ML is invisible. You can’t just use application monitoring. You need to watch for data drift, prediction drift, feature distribution changes, label shifts.

    **The Solution:** Dedicated ML monitoring tools.

    The ecosystem split into two camps:

    – **Arize & Whylabs** – Purpose-built for production ML. Dashboard views into model health, drift detection that works, integrations with all the tools you use. Not cheap, but focused.
    – **Open-source alternatives** – Alibi Detect, Great Expectations for data quality, Prometheus for basic metrics. Requires assembly but free.

    **My take:** If your model touches customers, Arize or Whylabs pays for itself in one prevented incident. If it’s internal, Great Expectations + Prometheus works.
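    If you want intuition for what these tools compute, here is one of the simplest drift checks: the Population Stability Index between a reference sample (your training data) and a live sample. This is a simplified sketch of what Arize-style drift detection automates, and the 0.2 threshold is a common rule of thumb, not a standard:

    ```python
    import math

    def psi(expected, actual, bins=10):
        """Population Stability Index between two numeric samples.
        ~0 means matching distributions; > 0.2 usually warrants investigation."""
        lo, hi = min(expected), max(expected)
        edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

        def bucket_fracs(xs):
            counts = [0] * bins
            for x in xs:
                counts[sum(x > e for e in edges)] += 1
            return [max(c / len(xs), 1e-6) for c in counts]  # avoid log(0)

        e, a = bucket_fracs(expected), bucket_fracs(actual)
        return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

    reference = [float(i) for i in range(100)]
    print(psi(reference, reference))                    # 0.0: no drift
    print(psi(reference, [x + 50 for x in reference]))  # large: drift detected
    ```

    Run this per feature on a schedule and alert on the threshold, and you have the skeleton of a monitoring pipeline — the commercial tools add dashboards, baselining, and root-cause tooling on top.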

    ## The Real-Time ML Shift

    One more trend worth discussing: batch processing is giving way to streaming.

    Yesterday’s architecture: Daily batch pipeline. Train models on yesterday’s data. Serve predictions from this morning’s batch.

    Tomorrow’s architecture: Real-time feature pipelines. Models trained on streaming data. Sub-second predictions.

    Tools enabling this shift:

    – **Kafka** – The backbone. If you’re building streaming features, you’re using Kafka.
    – **Flink** – Distributed stream processing at scale. Complex, but handles what Spark can’t.
    – **Bytewax** – Lightweight Python framework for stream processing. Newer, but impressive for ML workloads.

    ## Practical Decision Framework

    Here’s how I choose tools:

    **1. What’s your bottleneck?**
    – Can’t train consistently? Fix your features first. You need a feature store.
    – Model works in test but fails in production? You need monitoring.
    – Can’t serve fast enough? You need BentoML or Seldon.

    **2. What’s your scale?**
    – Under 10k requests/day? Start simple. BentoML + Great Expectations might be enough.
    – Over 100k requests/day? You need Seldon + proper monitoring.
    – Over 1M requests/day? You probably need specialized infrastructure (Tecton + Arize or custom).

    **3. What’s your team’s expertise?**
    – If you have Kubernetes experts, use Kubernetes-native tools (Seldon, KServe).
    – If you have Python experts, lean on Python-first tools (BentoML, Feast).
    – If you have data engineers (likely), build around data-centric tools (Feature stores, streaming).

    ## The Biggest Mistake

    Evaluation paralysis.

    I’ve seen teams spend six months comparing tools and ship nothing. The difference between Feast and Tecton matters less than actually having a feature store. The difference between BentoML and Seldon matters less than actually monitoring your model.

    Pick a tool. Use it for three months. Then evaluate. Tools improve monthly—your production insights are worth more than theoretical perfection.

    ## What’s Next?

    The ML tools landscape will keep evolving. Foundation models are changing what “serving” means. Prompt engineering is the new feature engineering. But the fundamentals stay the same:

    – Feature consistency between training and serving
    – Fast, reliable inference at scale
    – Continuous monitoring and drift detection

    Build your stack around these principles, and you’ll adapt to whatever tools emerge next year.

    — Pushpjeet Cholkar, Data Engineer