Why Data Engineers Don’t Work in a Vacuum
What goes wrong when data engineers are left out of the loop
1. Introduction
It starts with a message in a company-wide Slack channel:
“Hey, can someone from Data Engineering check the pipeline? Our dashboard is blank again. Also, where do we find the definition of ‘active user’? We used to pull it from user_metrics, but it seems off now.”
Silence.
Then a ping to a data engineer directly.
“Hey! Do you know who owns this dataset? And whether the numbers changed last week? Finance is asking.”
The data engineer reads the thread, sighs, and starts the usual ritual:
Opening Power BI to check the broken dashboard, poking through BigQuery tables to validate the data, flipping between dbt and Dataform to trace model dependencies, scanning Airflow DAGs and a web of cron jobs, and diving into the logs of a third-party replication tool that syncs data from an external vendor — all while thinking:
“We restructured that pipeline two sprints ago — why didn’t anyone tell us this dashboard still relied on the old logic?”
This is the kind of situation that plays out daily in many companies.
And it’s built on a quiet, persistent myth:
That data engineers are individual contributors who work best when left alone to “just build pipelines.”
But here’s the truth:
Data engineering is one of the most collaborative roles in tech. Data engineers don’t work in a vacuum; they operate at the intersection of business context, technical infrastructure, and stakeholder needs.
2. What the Myth Misses
There’s a persistent idea in some teams — especially product-driven or engineering-heavy ones — that once a pipeline is “up and running,” data engineers can step back. That their job ends at ingestion, transformation, and delivery.
But here’s what that myth misses:
Every table represents a decision. Every metric reflects a definition. And every pipeline encodes assumptions that must be made explicit to remain useful.
In a remote world, where decisions are async and tribal knowledge doesn’t spread through overheard conversations, data engineers can’t afford to build in isolation. Their work touches every team — and when they’re not in the loop, cracks form fast.
3. When Communication Breaks Down, Pipelines Become Problems
Let’s walk through some real-world scenarios where poor communication — not poor code — created major issues:
3.1 🔍 The Unused Dashboard
An analyst spent three weeks building a dashboard using a new dataset delivered by the data engineering team. But users in finance rejected it because the revenue logic didn’t match their monthly close process. Why? The DE team had modeled revenue based on invoices sent, not revenue recognized.
At first glance, these sound like the same thing — but they’re not, especially in accounting and financial reporting. Here’s the distinction:
📦 Invoices Sent (Billed Revenue)
This is when a company sends an invoice to a customer — typically triggered when a product is delivered or a service is provided.
- Example:
You sell a software subscription. On Jan 1, you send an invoice for $1,200 covering a 12-month plan.
→ This is invoiced revenue — the full $1,200 is now “billed.”
📊 Revenue Recognized (Accrual-Based Revenue)
This follows accounting rules (like GAAP or IFRS), which require companies to recognize revenue over time, as the service is actually rendered or the product is used.
- In the same example:
You can’t recognize all $1,200 in January. You must spread it out over the 12 months.
→ So you recognize $100/month from Jan to Dec.
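The straight-line schedule above can be sketched in a few lines of Python (a deliberate simplification: real GAAP/IFRS recognition also deals with partial months, refunds, and contract modifications):

```python
def straight_line_recognition(invoice_total: float, months: int) -> list[float]:
    """Spread an invoiced amount evenly across the service period."""
    monthly = round(invoice_total / months, 2)
    return [monthly] * months

# A $1,200 annual subscription invoiced on Jan 1:
schedule = straight_line_recognition(1200.00, 12)
print(schedule[0])    # 100.0  -> recognized in January
print(sum(schedule))  # 1200.0 -> total recognized over the year
```

The point is not the arithmetic; it is that “revenue” in the dashboard must match whichever of these numbers Finance actually closes the books with.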
❗ Why It Matters
Different teams care about different views of revenue:
- Sales may want to see invoices sent (to track deals and cash flow).
- Finance cares about recognized revenue (for compliance, forecasts, and earnings reports).
- Product may care about usage-based revenue (if billing is tied to product consumption).
If a DE builds a dashboard showing “revenue” based only on invoices — and Finance is expecting recognized revenue — you have a data trust issue. Even though the pipeline technically works, the definition is wrong for the context.
✅ What Should Happen Instead?
This is where data engineers shine — by asking clarifying questions early:
- “When you say ‘revenue’, do you mean billed, recognized, or collected?”
- “How do we handle prepayments or multi-month contracts?”
- “Should we follow accounting standards or operational metrics?”
This kind of communication ensures pipelines reflect the right truth — not just any truth.
The problem wasn’t the pipeline; it was the missing conversation.
3.2 🚨 The Cost Explosion
A machine learning team rolls out a new model. It’s powered by a dataset that’s refreshed daily — no big deal, right? Except… no one double-checks how that dataset is stored.
Behind the scenes, it’s backed by an unpartitioned table in BigQuery, containing tens of millions of rows. Each day, as the model retrains and scores, the entire table is scanned. Again. And again. And again.
At first, no one notices.
Until the next billing cycle hits — and the data team is staring at a 3x increase in cloud spend, with no clear explanation.
🧠 What went wrong?
Here’s the issue: the table wasn’t partitioned, which means every query scanned the entire dataset — regardless of whether it needed a day’s worth or a year’s worth of data.
Partitioning is one of the most basic (and critical) data design choices in systems like BigQuery. When used properly (say, by event_date or created_at), it allows queries to read only the relevant slices of data. Without it, even small jobs become unnecessarily expensive.
In this case:
- The ML team didn’t realize how expensive it would be to hit that table daily.
- The DE team didn’t know the model was using that dataset in production.
- And no one had set up cost alerts or query volume monitoring.
It wasn’t a technical failure — it was a communication failure.
And it cost the company thousands.
💬 What could’ve prevented it?
- A simple Slack message: “Hey, is this dataset optimized for daily use?”
- A shared dashboard monitoring table size, query scans, or daily spend
- Or even better — a culture of shared ownership between ML and DE teams
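A scan-volume alert like the one suggested above can start as a few lines of code. This is a sketch with simulated numbers, not a production monitor (real setups would pull these figures from audit logs or billing exports):

```python
def bytes_scanned_alert(daily_bytes: list[int], window: int = 7,
                        threshold: float = 2.0) -> bool:
    """Alert when the latest day's scanned volume exceeds `threshold`
    times the average of the previous `window` days."""
    *history, today = daily_bytes
    baseline = sum(history[-window:]) / min(window, len(history))
    return today > threshold * baseline

# Simulated daily scan volumes (GB): a steady week, then the ML job lands
usage = [100, 110, 95, 105, 100, 98, 102, 340]
print(bytes_scanned_alert(usage))  # True: today tripled the baseline
```

Even a crude check like this turns a surprise on next month's invoice into a Slack ping the same day the pattern changes.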
The takeaway?
Cloud costs are not just an infrastructure concern. They’re a collaboration concern. When teams don’t talk, you don’t just risk broken pipelines — you risk burning money.
3.3 ❓ The Broken Trust
A product manager opens a dashboard they used to rely on.
They glance at the numbers, pause, and frown. Something feels off — active users seem lower than expected, and revenue doesn’t align with last month’s report.
They ask around:
“Hey, does anyone know who owns this dashboard?”
“Has the data changed recently?”
“Which table are these metrics coming from?”
Silence.
Eventually, someone says, “That dataset hasn’t been touched in months. I think the original owner left the company.”
Another says, “Didn’t the schema change recently? I remember seeing a dbt job fail last week…”
At this point, it’s too late. The PM closes the dashboard. They no longer trust the data.
🧠 What went wrong?
This isn’t just about data being “wrong” — it’s about uncertainty and lack of transparency. In remote settings, where teams rely heavily on dashboards and async tools, trust in data is everything.
And it erodes quickly when:
- Ownership is unclear
- Definitions are undocumented
- Schema changes go unannounced
- Downstream consumers aren’t looped in
In this case, the data may not have been broken at all. But no one could confidently say what it represented, who maintained it, or when it last changed.
🧰 Why this happens
Without a business glossary or data catalog, key knowledge about datasets lives in people’s heads — or worse, in Slack threads long forgotten.
Without clear lineage, consumers can’t trace where numbers come from.
And when remote teams don’t proactively communicate, small changes ripple into big consequences, unnoticed until trust is gone.
💬 What could’ve prevented it?
- A clearly assigned owner for the dataset
- A data catalog entry with descriptions and freshness metadata
- Notification of schema or logic changes through Slack, GitHub PRs, or a data contract
- A short check-in: “Hey, this dashboard still in use? Thinking of refactoring the model behind it.”
The takeaway?
People don’t stop using dashboards because of one bad number. They stop because no one can explain it. And once trust is broken, it’s hard to win back — no matter how technically correct the data might be.
4. What Great Data Engineers Actually Do
Yes, they write code. Yes, they build systems. But that’s just the surface.
Here’s what lies underneath:
4.1 💼 They Learn the Business
Great DEs don’t stop at understanding the schema. They learn how revenue is booked, how supply chains operate, how customer churn is calculated. That’s what lets them model data meaningfully, not just correctly.
“Understanding the business context is what turns a table into a story.”
4.2 🔁 They Manage the Full Data Lifecycle
From ingestion to deletion, DEs handle data with care. They think about lineage, versioning, retention, and regulatory impact. Data doesn’t just flow — it ages and evolves.
“Data doesn’t just flow. It ages, evolves, and sometimes needs to be let go.”
4.3 💸 They Optimize for Cost and Sustainability
Cloud costs aren’t invisible. They’re just delayed. Data engineers help teams build responsibly, designing pipelines that are fast and efficient.
“Every poorly partitioned table is a budget leak.”
4.4 🔍 They Prioritize Observability
If it can’t be monitored, it can’t be trusted. DEs define data SLAs, freshness metrics, and alerts — and make the data platform observable by design.
“If you can’t observe it, you can’t trust it.”
4.5 🛠️ They Guide the Stack
They help choose the right tools at the right time — whether that’s shifting from batch to streaming or monolith to modular.
“It’s not just about building pipelines — it’s about choosing the right systems to scale the business.”
4.6 📚 They Build Culture
In remote teams especially, culture doesn’t spread through osmosis. Senior DEs mentor, write documentation, set standards, and model curiosity.
“A strong data culture is often built by the quiet leadership of DEs.”
4.7 🧪 They Enable Experimentation
Without data engineers, product and growth teams can’t run A/B tests, track experiments, or measure outcomes effectively.
“Without data engineers, you’re not testing — you’re guessing.”
5. So, No — They Don’t Work in a Vacuum
Especially in remote settings, where async work is the default and miscommunication is the norm, data engineers are often the bridge — not the bottleneck.
They’re the ones turning requests into requirements, uncertainty into lineage, and reactive dashboards into reliable infrastructure.
6. The Best Data Engineers:
✅ Know their stakeholders by name.
✅ Understand the “why” behind every table.
✅ Turn chaos into clarity — and pipelines into engines of trust, speed, and scale.
They don’t just build pipelines. They build partnerships — and they make the entire business stronger.