1. dbt + LLM Macros: AI-Powered Transformations
What it is: dbt (data build tool) integrates with large language models to auto-generate column descriptions, test cases, and documentation from your SQL models.
Why it matters: Writing documentation and tests is the least glamorous part of data engineering — and the most skipped. LLM-powered macros change that equation. You can prompt dbt to generate YAML documentation for an entire model in seconds.
-- Build the documentation site for a single model
-- (dbt renders docs from your YAML; the LLM step is what fills that YAML in)
dbt docs generate --select my_model
With tools like dbt-osmosis and emerging LLM integrations, you can now propagate column descriptions automatically across your DAG.
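The output of that propagation is ordinary dbt YAML. A hypothetical schema.yml entry (model and column names invented for illustration) might look like:

```yaml
models:
  - name: my_model
    columns:
      - name: order_id
        description: "Unique identifier for the order (inherited from stg_orders)."
```

Because it's plain YAML, generated descriptions can be reviewed in a pull request like any other code change.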
2. MLflow: The Experiment Tracker You Already Need
What it is: MLflow is an open-source platform for managing the ML lifecycle — experiment tracking, model registry, and model serving.
Why data engineers need it: Even if you never train a model yourself, your pipelines feed them. When a model degrades, the first question is: “Did the training data change?” MLflow gives you the audit trail to answer that.
import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)  # hyperparameters
    mlflow.log_metric("rmse", 0.84)          # evaluation metrics
    mlflow.log_artifact("model.pkl")         # files: models, plots, data samples
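To make "did the training data change?" answerable, you can record a fingerprint of the training file with every run. A minimal stdlib sketch (the function name is ours, not an MLflow API):

```python
import hashlib

def dataset_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream-hash a training file so each run records exactly which data it saw."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Inside the run above, log it alongside the other params:
# mlflow.log_param("data_sha256", dataset_fingerprint("train.csv"))
```

When a model degrades, comparing this hash across runs tells you immediately whether the training data shifted.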
3. Feast: The Feature Store That Stops Pipeline Duplication
What it is: Feast is an open-source feature store that lets you define, store, and serve ML features consistently across training and production environments.
The problem it solves: Without a feature store, teams rebuild the same feature logic in multiple places. Models trained on one version of a feature get served predictions from a slightly different version — this is called training-serving skew, and it silently kills model accuracy.
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Fetch the same feature definition the training pipeline used,
# served from the low-latency online store
features = store.get_online_features(
    features=["user_stats:purchase_7d_avg"],
    entity_rows=[{"user_id": 1001}],
).to_dict()
4. Great Expectations + Anomaly Detection: Quality Beyond Rules
Why the combination matters: Rule-based expectations catch what you already know to look for. Anomaly detection catches distribution shifts, sudden value spikes, or gradual drift that no rule anticipated.
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("orders.csv")

# Rule-based checks: catch what you already know to look for
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("order_amount", min_value=0, max_value=50000)

results = validator.validate()
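The anomaly-detection half can start very simply. A stdlib sketch of a z-score check (our own helper, not a Great Expectations API) that flags values a fixed range rule would wave through:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean.
    A between(0, 50000) rule passes a $49,999 order; a distribution check won't."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

amounts = [52, 48, 50, 49, 51, 47, 50, 49_999]  # one spiked order, still under the cap
print(zscore_outliers(amounts, threshold=2.0))  # flags the spike
```

In production you would run checks like this per batch and alert on drift, alongside the rule-based suite.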
5. Vertex AI / SageMaker Pipelines: Orchestrating ML Like a Data Engineer
If you’ve used Apache Airflow, these tools map cleanly to concepts you already know: DAGs, steps, inputs/outputs, artifacts — just applied to ML workflows.
from kfp import dsl

# preprocess_op, train_op, and evaluate_op are assumed to be
# component definitions declared elsewhere in the project
@dsl.pipeline(name="training-pipeline")
def training_pipeline(data_path: str):
    preprocess_task = preprocess_op(data_path=data_path)
    train_task = train_op(data=preprocess_task.outputs["processed_data"])
    evaluate_task = evaluate_op(model=train_task.outputs["model"])
The Big Picture
In 2026, a pipeline doesn’t just move data from A to B. It moves data from raw sources to clean features to trained models to reliable predictions to business outcomes. Every link in that chain needs engineering.
Build both sides of the stack. That’s where the real leverage is.
— Pushpjeet Cholkar, Data Engineer