Notes on a boring data pipeline
In favor of the cron job. A defense of small infrastructure for small problems, and the failure mode of choosing the interesting tool.
By Robel Wolde
The data pipeline I am most proud of is profoundly boring. It is a Python script, run by cron, that pulls from a vendor API, normalizes some fields, and inserts into a Postgres table. There is a try/except that logs to a file. The whole thing fits in 240 lines.
It has been running, untouched, for thirteen months.
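For readers who want to picture the shape of such a script, here is a minimal sketch. It is not the actual 240 lines. The field names, the `normalize` rules, and the `fetch` and `insert` callables are all hypothetical stand-ins; in a real script `fetch` would call the vendor API and `insert` would write to Postgres through whatever driver you use.

```python
import logging
from typing import Callable, Iterable

log = logging.getLogger("pipeline")


def configure_logging(path: str = "pipeline.log") -> None:
    """Send log output to a plain file (the path is an assumption)."""
    logging.basicConfig(
        filename=path,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )


def normalize(record: dict) -> dict:
    """Map one raw vendor record onto stored columns (hypothetical fields)."""
    return {
        "vendor_id": str(record["id"]).strip(),
        "email": (record.get("email") or "").strip().lower(),
        "amount_cents": round(float(record.get("amount") or 0) * 100),
    }


def run(fetch: Callable[[], Iterable[dict]],
        insert: Callable[[list], None]) -> int:
    """Fetch, normalize, insert. Returns an exit code: 0 on success, 1 on failure."""
    try:
        rows = [normalize(r) for r in fetch()]
        insert(rows)
        log.info("inserted %d rows", len(rows))
        return 0
    except Exception:
        # The single try/except: record the traceback and report failure.
        log.exception("pipeline run failed")
        return 1
```

Passing `fetch` and `insert` in as arguments is a choice made for this sketch so that it runs without a network or a database. The entry point would call `configure_logging()` and then `sys.exit(run(...))` with the real functions, so that the script's exit status reflects whether the run succeeded.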
In that time I have watched three other pipelines in the same organization replace themselves at least once each. Airflow to Prefect. Prefect to Dagster. A streaming pipeline that briefly involved Kafka, then briefly involved Pulsar, and now involves a GitHub Actions cron that calls a Lambda. Each migration was justified, in the moment, by promises about scale we did not need or observability we never used.
I want to argue for the boring path. Not because everyone should write cron scripts — they should not — but because the failure mode of interesting infrastructure is consistent and underdiscussed.
The drift
Interesting infrastructure has a maintenance gradient. When the person who chose the tool leaves, their replacement learns it under duress, during an incident. The choice gets re-evaluated. A new tool is proposed. The proposal sounds good because it is the latest iteration of an evolving discourse on data engineering, and the previous tool is by now two iterations behind. The migration begins. Six months later you have two pipelines doing the same thing — the old one, still running but distrusted, and the new one, not quite finished. By month nine somebody has decided that neither is right and the answer is a third thing.
Boring infrastructure does not have this gradient because there is no discourse to fall behind. Cron has been correct for forty years. Postgres will outlast our careers. A Python script with explicit error handling is something every backend engineer can read. The cost of onboarding the next person is one afternoon.
What boring buys you
The interesting tools are interesting because they are solving real problems — for someone, somewhere. If you have those problems, you should use them. Most teams do not.
Most teams have a few hundred records arriving every few minutes, a couple of joins, a denormalized table consumed by a dashboard. This is “small data” in any sense that matters. Small data wants a small tool. The small tool buys you:
- Predictable failure. Cron keeps no memory of the last run. If the script exits nonzero, that run is lost and nothing else happens. The next scheduled run picks up. There is no DAG state to inspect, no scheduler to restart.
- Cheap debugging. `python pipeline.py` runs the whole thing locally with the same code path as production. There is no equivalent for a Dagster sensor.
- Trivial observability. Tail the log. The vocabulary you need to debug it is the vocabulary you already know.
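The claim that the next run picks up depends on the script being written so that a failed run leaves nothing half-done. One common way to get that, sketched here as an assumption and not as the author's method, is a high-water mark that only advances after a successful insert. The `state` dict, the `updated_at` field, and the function names are hypothetical; in practice the mark might come from a `MAX(updated_at)` query against the target table.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable, Optional


def pending(records: Iterable[dict], mark: Optional[datetime]):
    """Return records newer than the mark (oldest first) and the new mark."""
    fresh = sorted(
        (r for r in records if mark is None or r["updated_at"] > mark),
        key=lambda r: r["updated_at"],
    )
    new_mark = fresh[-1]["updated_at"] if fresh else mark
    return fresh, new_mark


def run_once(state: dict,
             fetch: Callable[[], Iterable[dict]],
             insert: Callable[[list], None]) -> int:
    """One scheduled run. The mark moves only if the insert succeeded."""
    mark = state.get("mark")
    try:
        fresh, new_mark = pending(fetch(), mark)
        insert(fresh)
    except Exception:
        return 1  # mark untouched: the next run covers the same records
    state["mark"] = new_mark
    return 0
```

With this shape, a run that fails returns 1 and changes nothing, and the following run fetches everything since the last mark, including what the failed run missed. It assumes the insert is all-or-nothing, for example a single database transaction.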
The boring path concedes some things. You will not get distributed retries. You will not get a beautiful UI. You will not be able to brag about it in a system design interview. These are real costs.
When boring is wrong
Boring is wrong when your job actually demands the interesting tool: thousands of jobs with complex dependencies, multi-tenant infrastructure, real exactly-once semantics, hot reloads. If that is you, build it. The mistake is reaching for it because it sounds correct, then living with the maintenance overhead while serving small-data problems.
The moment I watch for comes when the pipeline has been running long enough that someone says “we should rewrite this.” If the rewrite would replace 240 lines of Python with 800 lines of YAML and a new vendor relationship, the answer is almost always no. If the rewrite would address a specific failure mode that has actually occurred, the answer is probably yes.
A small heuristic
Whenever I am about to introduce a new tool, I write down the next three things it makes harder. Not easier. Harder. Onboarding. Local development. The version upgrade I will need to do in eighteen months. If I cannot articulate those costs, I do not understand the tool well enough to choose it. If I can, I usually find that the boring alternative does not have them.
Boring is not a state. It is a discipline of declining temptation. Two of my favorite engineers do this constantly. Their work shows up at code review meetings, never at conferences.