M365 Show with Mirko Peters - Microsoft 365 Digital Workplace Daily

Is Your Dataflow Reusable—or a One-Trick Disaster?

Picture this: your lakehouse looks calm, clean Delta tables shining back at you. But without partitioning, schema enforcement, or incremental refresh, it’s not a lakehouse—it’s a swamp. And swamps eat performance, chew through storage, and turn your patience into compost. I’ve seen it happen in more tenants than I care to count.

Here’s the fix: stick around, because I’ll give you a 60‑second checklist you can run against any dataflow—parameters, modular queries, Delta targets, and partitioning. Dataflows Gen2 use Power Query/M, so the same rules about modular queries and functions still apply.

Subscribe at m365.show, grab the checklist, and let’s see if your “working” dataflow is actually a time bomb.

Why Your 'Working' Dataflow is Actually a Time Bomb

The real issue hiding in plain sight is this: your dataflow can look fine today and still be hanging by a thread. Most people assume that if it refreshes without error, it’s “done.” But that’s like saying your car is road‑worthy because it started once and the check engine light is off. Sure, it ran this morning—but what happens when something upstream changes and the entire pipeline starts throwing fits? That silent culprit is schema drift. Add one column, shift field order, tweak a data type, and your flow can tip over with no warning.

For most admins, this is where the blind spot kicks in. The obsession is always: “Did it refresh?” If yes, gold star. They stop there. But survival in the real world isn’t just about refreshing once; it’s about durability when change hits. And change always shows up—especially when you’re dealing with a CRM that keeps sprouting fields, an ERP system that can’t maintain column stability, or CSV files generously delivered by a teammate who thinks “metadata” is just a suggestion. That’s why flexibility and modularity aren’t buzzwords—they’re guardrails. Without them, your “fixed” pipe bursts as soon as the water pressure shifts.

And the fallout is never contained to the person who built the flow. Schema drift moves like a chain reaction in a chemical lab. One new field upstream, and within minutes you’ve got a dashboard graveyard. There’s Finance pushing panic because their forecast failed. Marketing complaints stack up because ad spend won’t tie out. The exec team just wants a slide with charts instead of cryptic error codes. You—the admin—are stuck explaining why a “tiny change” now has 20 dashboards flashing red. That’s not user error; that’s design fragility.

Here’s the blunt truth: Dataflows Gen2, like every ETL process, are built on assumptions: that a column exists, that its data type, order, and consistency hold. Break those assumptions, and your joins, filters, and calculations collapse. Real‑world schemas don’t sit politely; they drift constantly. So unless your dataflow was built to absorb these changes, it’s fragile by default. Think of it like relying on duct tape to hold the plumbing: it works in the moment, but it won’t survive the first surge of pressure.

The smart move isn’t hope. It’s defense. If schema drift has already burned you, there’s a quick diagnostic: run the 60‑second checklist. First, does your flow enforce schema contracts or land data in a Delta table where schema evolution is controlled? Second, does it include logic to ingest new columns dynamically instead of instantly breaking? Third, are your joins coded defensively—validating types, handling nulls—rather than assuming perfect input? If you can’t check those boxes, then you’re not done; you’ve just delayed failure.
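
To make checks two and three concrete: if the data lands through a notebook step, a defensive schema guard might look like the sketch below. It’s a minimal example, not the checklist itself; the table, column, and path names (bronze_orders, OrderId, Files/landing/orders/) are assumptions for illustration.

```python
# Minimal sketch of a defensive schema guard before landing data in Delta.
# Table, column, and path names are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType, DateType, DecimalType

spark = SparkSession.builder.getOrCreate()

# The contract your downstream joins and filters assume.
expected = {
    "OrderId": StringType(),
    "CustomerId": StringType(),
    "OrderDate": DateType(),
    "Amount": DecimalType(18, 2),
}

df = spark.read.option("header", True).csv("Files/landing/orders/")

# 1. Surface drift instead of discovering it downstream.
new_cols = set(df.columns) - set(expected)
if new_cols:
    print(f"Schema drift detected, new columns not yet mapped: {new_cols}")

# 2. Add any missing expected columns as nulls and cast everything to contract
#    types, so joins never meet a surprise type or a missing field.
for name, dtype in expected.items():
    col = F.col(name) if name in df.columns else F.lit(None)
    df = df.withColumn(name, col.cast(dtype))

# 3. Land only the contracted columns; the Delta table enforces this schema on write.
df.select(*expected).write.format("delta").mode("append").saveAsTable("bronze_orders")
```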

And before you think, “Great, so everything’s doomed,” there’s mitigation available. Fabric and Data Factory support strategies like dynamic schema handling in mapping data flows and parameterized queries that adapt without rewrites. CloudThat and others highlight how dynamic schema detection plus metadata repositories for mappings can reduce the fragility. Those aren’t silver bullets, but they keep your pipelines from detonating every time a developer adds a field on Friday afternoon.

One important caveat: even a “healthy” Dataflow Gen2 design has limits. Dataflows don’t handle massive datasets as well as Spark, and wide joins or deep transformations can slow refreshes to a crawl. If you know volumes will hit high scale, offload the heavy work to Spark notebooks and keep Dataflows for lighter prep. Key2Consulting and CloudThat both call this out in practice. Treat Dataflows as one tool in the kit—not the hammer for every job.

Bottom line, a so‑called working dataflow that can’t weather schema drift or large‑scale growth isn’t reliable. It’s fragile, adding silent debt into your system. And disposable pipelines, stacked on top of each other, create a tower of quick fixes nobody wants to maintain.

That puts us at the next layer of the problem: the bad habits baked in from the very start. Think your setup looks clean? Let’s run it against the three sins that turn “working” pipelines into a nonstop ticket machine.

The Three Deadly Sins of Dataflow Design

Here’s where most dataflows go sideways: the three deadly sins of design. They’re simple, they’re common, and they guarantee headaches—hardcoding values, piling on spaghetti logic, and ignoring the fact that scale exists. We’ve all slipped into them because in the moment they look like shortcuts. The problem is when those “shortcuts” snake into production and you’re left with fragile pipelines no one wants to untangle.

First up: hardcoding. You’re tired, the refresh is failing, so you paste a file path or static date directly into your query. It works. For now. But what you’ve actually done is cement brittle assumptions into your pipeline. The second someone moves that file, renames a table, or asks for the same logic in a different workspace, the entire thing snaps. A better fix is dead simple—centralize values. Either store them as parameters inside your dataflow or, if you’re managing multiple environments, put your config data in a metadata table. SQL or Cosmos DB both work fine for this. Then your flows don’t care which folder or server they’re pointing at—you just swap the parameter, and everything still refreshes.
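
Here’s what that metadata-driven setup can look like if the prep happens in a notebook. A minimal sketch: the pipeline_config table, its columns, and the environment value are all assumptions, and a SQL or Cosmos DB store works the same way.

```python
# Minimal sketch of metadata-driven configuration instead of hardcoded paths.
# The pipeline_config table, its columns, and the env value are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

env = "prod"  # in practice this arrives as a pipeline or notebook parameter

# One row per environment: source_path, target_table, and similar settings.
config = (
    spark.table("pipeline_config")
    .filter(f"environment = '{env}'")
    .first()
    .asDict()
)

# The flow no longer cares which folder or server it points at;
# promoting to another environment means changing a row, not the code.
df = spark.read.format("delta").load(config["source_path"])
df.write.format("delta").mode("overwrite").saveAsTable(config["target_table"])
```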

Sin two: spaghetti logic. It usually starts clean—three steps to connect, transform, and load. Fast forward a few months and you’ve got twenty chained queries full of nested merges, conditional splits, and mystery filters no one admits to writing. At that point, your dataflow feels less like logic and more like a plate of noodles that shocks you if you pick the wrong one. Debugging is guesswork, collaboration is impossible, and governance goes out the window because nobody can even explain where the fields came from. The fix? Break the work into named, single-purpose queries. Use functions in Power Query M for reusable bits like date handling or path parsing. Yes, Dataflows Gen2 supports this, but remember: reusing those blocks across workspaces has limits. If you need true reuse across your tenant, build the canonical version in a single “source of truth” dataflow or push complex transformations down into your lakehouse or notebook layer. Bottom line—write logic in chunks that humans can read tomorrow, not in one monster chain nobody can ever touch again.
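
To make the “reusable bits” idea concrete, here’s a hedged sketch of a path-parsing helper of the kind you might push down to the notebook layer; the same pattern works as a Power Query M function. The folder layout and function name are assumptions.

```python
# Minimal sketch of pulling one reusable bit, path parsing, out of the chain
# and into a single named function. The folder layout shown is an assumption.
import re
from datetime import date
from typing import Optional


def snapshot_date_from_path(path: str) -> Optional[date]:
    """Extract the snapshot date from paths like 'Files/crm/2024/05/31/export.csv'."""
    match = re.search(r"/(\d{4})/(\d{2})/(\d{2})/", path)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Every query that needs the snapshot date calls the same function,
# instead of re-implementing the regex in three slightly different ways.
print(snapshot_date_from_path("Files/crm/2024/05/31/export.csv"))  # 2024-05-31
```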

Last sin: ignoring scale. On demo-sized test data, everything looks magical. Four thousand rows? Instant refresh. Then the real dataset drops—four million rows with concurrent refreshes—and suddenly your dataflow is standing in quicksand. It backs up the refresh queue, hogs compute, and everything else grinds to a halt. This isn’t Fabric being weak; it’s your design never accounting for production volume. Small joins turn into bottlenecks, wide transformations chew through memory, and incremental refresh gets ignored until jobs start timing out. If you actually want things to run, test with production-like volumes early. Use coalesce to cut down partition counts and repartition strategically so the engine isn’t juggling thousands of tiny chunks. And give yourself a hard rule: if your refresh times balloon past usable, it’s time to either push the heavy transformations into a Spark notebook or tune the partitioning until they behave. Test at scale, or production will test you instead.
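
If the heavy lifting does move to a Spark notebook, the partition-count advice translates into a couple of lines. A minimal sketch, with table names and the target partition count as assumptions:

```python
# Minimal sketch of keeping output partition counts sane before writing.
# Table names and the target partition count are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("bronze_orders")

# coalesce only merges existing partitions down (no full shuffle), so it is
# the cheap option when data is simply split into too many tiny pieces.
df.coalesce(16).write.format("delta").mode("overwrite").saveAsTable("silver_orders")

# repartition does a full shuffle; reserve it for when the data must be
# redistributed by a key, e.g. before a large join or a partitioned write.
(
    df.repartition("OrderMonth")
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("OrderMonth")
    .saveAsTable("silver_orders_by_month")
)
```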

Here’s the kicker—these sins don’t live in isolation. Hardcode enough values and you’ll be rewriting every time the environment shifts. Let spaghetti logic grow and you’re one step away from a full black-box nobody understands. Ignore scale, and eventually workloads pile up until the whole refresh ecosystem collapses. Each one of these mistakes adds debt; mix them and you’re trading resilience for fragility at compound interest.

Fixing them isn’t about perfection—it’s about giving yourself guardrails. Safe defaults like parameters, modular queries, and realistic testing keep your pipelines stable enough to survive the normal chaos of changing schemas and growing datasets. The trick now is turning those guardrails into your standard operating mode, so you build flows that adapt instead of collapse.

And that’s the bridge to the real differentiator—the design habits that actually make dataflows reusable instead of disposable.

The Secret Sauce: Modularity and Parameterization

So how do you actually keep a dataflow from turning into a throwaway experiment? The answer comes down to two things: modularity and parameterization. Skip those and you’re not building pipelines—you’re cobbling together one-offs that collapse the minute requirements shift. A reusable dataflow is one that can drop into another project, adapt with minimal tweaks, and still function. Anything else is just glorified copy-paste maintenance.

Modularity starts with carving transformations into standalone steps. Picture a block of logic that standardizes customer names. If you bury it inside a bloated 50-step chain, it’s stuck there forever. Pull that same logic into a separate function, and suddenly it’s a reusable tool across any data source. That’s the shift: building small, utility-style steps with a single clear purpose, instead of fusing everything into one unreadable chain. Power Query M makes this easy—you can write custom functions with a defined input and output, then reuse them across queries in Dataflows Gen2. The habit to build here is simple: every complex transform should be a named function with a clear input/output signature and a single parameter bag. If you can’t explain it in one sentence, split it.
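
As a sketch of what that single-purpose shape looks like when the shared logic lives at the notebook layer (the Power Query M version follows the same pattern: one named function, one clear input and output); the column name is an assumption:

```python
# Minimal sketch of one single-purpose transform with a clear signature.
# The column name is an assumption for illustration.
from pyspark.sql import DataFrame, functions as F


def standardize_customer_names(df: DataFrame, name_col: str = "CustomerName") -> DataFrame:
    """Trim, collapse internal whitespace, and title-case customer names."""
    cleaned = F.initcap(F.regexp_replace(F.trim(F.col(name_col)), r"\s+", " "))
    return df.withColumn(name_col, cleaned)


# Because the logic lives in one named function, every source that carries
# customer names reuses it instead of copying the steps into another chain:
# customers = standardize_customer_names(spark.table("bronze_customers"))
```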

Parameters are the other half of the equation, and yet they’re treated like garnish by most admins. Without them, you end up hardcoding connector strings, file paths, or query dates into every pipeline. It feels faster in the moment, but the second you switch environments or adjust schedules, you’re hacking code again. Put all those values into parameters—pipeline-level parameters in Fabric, or better yet, pull them from a centralized metadata store if you need cross-workspace consistency. Then a path or key change becomes a one-line tweak instead of 15 near-identical rewrites. CloudThat and others call this metadata-driven design, and it’s the antidote to Groundhog Day maintenance.

A practical example: handling date filters. Without parameters, you’re littering static “today minus one” logic in multiple queries. Change the rule, and you’re chasing it everywhere. Parameterize that filter once—or wrap it in a custom function—and use it consistently. Next time the business decides “yesterday” should actually mean “last business day,” you adjust it in one place. It’s the same principle Excel power users learned when they stopped copy-pasting macros across 15 files. Update once, fix everywhere. That’s reusability in action.
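
A minimal sketch of that “define the rule once” idea, assuming a simple last-business-day policy that ignores holidays; the function name and the rule itself are illustrative, not a given:

```python
# Minimal sketch of defining the date rule once and reusing it everywhere.
# The "last business day" policy shown ignores holidays and is only an example.
from datetime import date, timedelta
from typing import Optional


def reporting_cutoff(today: Optional[date] = None) -> date:
    """Return the last business day before 'today' (weekends only, no holiday calendar)."""
    current = today or date.today()
    cutoff = current - timedelta(days=1)
    while cutoff.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        cutoff -= timedelta(days=1)
    return cutoff


# Every query filters with the same rule; when the business redefines
# "yesterday", this function is the only place that changes.
print(reporting_cutoff(date(2024, 6, 10)))  # Monday -> 2024-06-07 (Friday)
```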

When you approach dataflows this way, they start working like Lego kits. You snap in the reusable “clean customer” piece, click it onto the “validate transaction” piece, and then add the “calculate totals” tile. Building a new workflow doesn’t require duct tape or rewriting—it’s just assembly. The bigger payoff is maintenance. When a policy changes, you patch the block once and every dependent dataflow inherits it automatically. No black-box chain reactions, no endless debugging.

Now, here’s the caveat: Dataflows Gen2 doesn’t give you unlimited reusability across workspaces. Functions and queries don’t magically sync tenant-wide. If you truly want global reuse and version control, push heavy shared logic down into the lakehouse or a Fabric notebook. Keep workspace-level dataflows small, modular, and parameterized, then centralize the big, shared transformations at the storage or notebook layer. Think of it as a hybrid model: flexibility up top, governance underneath.

The reason governance teams cheer for this approach is that modularity plus parameters make your dataflows transparent. Functions and parameter sets are easier to audit, easier to document, and far less prone to mystery logic creeping in. No more spaghetti queries nobody can explain—just clean building blocks that spell out what they do. And since duplication is gone, the volume of break/fix tickets drops like a rock.

So the rule is clear: modularize your transformations into functions, parameterize every environmental value, and keep shared logic centralized when reuse must cross workspaces. Do this, and dataflows stop being disposable hacks and start becoming durable assets. Skip it, and you’re signing up for copy-paste chaos.

But even the cleanest logic needs a reliable foundation to stand on. And in Fabric, that foundation is your lakehouse. If the storage layer underneath isn’t governed properly, everything built on top will eventually sink.

Taming the Lakehouse with Delta Lake

Which brings us to the piece that gives your lakehouse actual discipline: Delta Lake. In Fabric, Delta isn’t optional—it’s the difference between governed storage and a heap of CSVs disguised as a “platform.” What Delta brings to the table are three things you can’t fake: ACID transactions, schema enforcement, and versioning with time travel. ACID means no more half-written data corrupting reports—every write is atomic, consistent, isolated, and durable. Schema enforcement blocks sneaky mismatched data types or surprise columns before they poison your pipeline. And version logs with time travel give you the rollback and audit trail when compliance asks, “What did this dataset look like last quarter?” That’s not theory; that’s operational survival.

The opposite is what you see all the time: admins drag and drop CSVs into OneLake, declare victory, and forget that governance is more than shared storage. On the surface, it looks centralized. Reality? A junk pile. No schema validation, no transaction logs, no audit history. Querying across that is like fishing blindfolded in mud—you might catch something, but most of what you reel in is garbage. And when schema drift hits—say someone renames or reorders columns—you’re left with silent corruption until an angry manager points out that one report shows 112 sales and another swears it’s 109. Without ACID, schema enforcement, and time travel, you can’t even tell where the truth went.

Delta Lake fixes this by putting hard structure around the chaos. Every write produces a transaction log. Every change produces a trackable version. Queries hit governed tables instead of raw files, so when formats shift they break fast and visibly instead of corrupting weeks of output. With schema enforcement, if your vendor slips in two new fields, the job fails where you can see it, not downstream in someone’s quarterly board deck. And with time travel, you can literally query yesterday’s or last week’s snapshot—perfect for audits, debugging, or recovering from the intern who wrote a join that ate half the dataset.
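
In a Fabric notebook, both behaviors are a few lines. A minimal sketch, with table names and the version and timestamp values as assumptions:

```python
# Minimal sketch of Delta schema enforcement and time travel from a notebook.
# Table names, the version number, and the timestamp are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Schema enforcement: an append whose columns or types don't match the target
# table fails loudly here, instead of silently corrupting downstream reports.
new_batch = spark.read.format("delta").load("Files/staging/orders")
new_batch.write.format("delta").mode("append").saveAsTable("silver_orders")

# Time travel: query the table as of an earlier version or point in time,
# which is how you answer "what did this dataset look like last quarter?"
as_of_version = spark.sql("SELECT * FROM silver_orders VERSION AS OF 12")
as_of_timestamp = spark.sql("SELECT * FROM silver_orders TIMESTAMP AS OF '2024-06-30'")
```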

Here’s the metaphor: dumping flat files into storage is a junk drawer. Technically everything’s “saved,” but you won’t find what you need without blood pressure spiking. Delta Lake is a cabinet that actually locks, labels, and versions. You can track changes, retrieve old entries, and trust that the count doesn’t change depending on which light switch is on. Which one would you rather bet your finance reports on?

And if someone on your team says, “We’ll skip Delta, we don’t need the overhead,” point them at audit time. Without transactional logs and schema enforcement, you’ll have inconsistent row counts between dashboards nobody can reconcile, and a forensic process as painful as piecing together a shredded Excel workbook. If compliance ever knocks, you’ll wish you had the logs Delta generates for free.

Real‑life example: LinkedIn documented a pipeline that used a gateway to pull CSVs directly into OneLake and then leaned on dynamic schema options. That works—but the wise part was landing them in Delta tables, which gave them controlled schema handling and versioning instead of betting on flat files to play nice forever. It’s the difference between “refresh fails safe and we fix it in five minutes” versus “three weeks of corrupted financials and loss of trust.”

And it’s not just governance—Delta helps you keep performance sane too. Queries don’t need to churn through countless micro-files just to return a month of data. You can partition data by logical attributes like date or region, which slashes the amount scanned. On top of that, Fabric supports performance practices for Delta such as OPTIMIZE, which merges those tiny files into efficient chunks; VACUUM, which clears out obsolete data files; and Z-ordering, which clusters data around heavily queried columns so filtering doesn’t drag. They’re simple to run, and they stop your storage from bloating while keeping queries snappy.
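
Those maintenance commands are plain Delta SQL you can schedule from a notebook. A minimal sketch, with the table name, Z-order column, and retention window as assumptions:

```python
# Minimal sketch of routine Delta maintenance run from a Fabric notebook.
# Table name, Z-order column, and retention window are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files and cluster a heavily filtered column together on disk.
spark.sql("OPTIMIZE silver_orders ZORDER BY (CustomerId)")

# Remove data files no longer referenced by the transaction log;
# 168 hours keeps seven days of history available for time travel.
spark.sql("VACUUM silver_orders RETAIN 168 HOURS")
```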

I’ve seen the cost of ignoring this. One client thought solid dataflow logic alone gave them safety. Then a vendor dropped extra columns in the weekly file, nobody noticed, and it corrupted weeks of downstream reporting. It wasn’t malice—it was just a lack of schema enforcement. Once we flipped to Delta, the file failed loudly on day one, they adjusted the mapping, and trust in the reports was intact. Same engineers, same tools—the only difference was using Delta’s structure instead of flat file faith.

Bottom line—Delta Lake isn’t extra credit. It’s insurance baked into your storage. Skip it, and you’ll find out the hard way during a compliance check or a performance crunch. Use it, and your lakehouse stops being a liability and becomes a reliable engine you can actually build on.

Now, governance is only half the battle. Even with Delta in place, sloppy design choices around partitioning or brute-force reloads will drag you back into slow query hell. And that’s where a lot of admins still bleed time.

Performance Pitfalls: Partitioning and Incremental Refresh Done Right

Partitioning and incremental refresh are the two basics that decide if your pipeline flies or crawls. Partitioning means you chop data into segments that match how it’s usually queried—say by month or region—so Fabric only pulls the slices it needs. Incremental refresh skips the brute force reloads and processes only new or changed data since the last run. Together, they’re the reason some queries feel instant while others make you wonder if the server died.

Partitioning is hands down the sharpest tool to speed up large datasets, but only if you pick the right grain. Align partitions with real query patterns. If most reports are month-based, partition by month. If rollups are by region, slice by region. What you don’t do is carve up by random columns like user ID or churn out daily folders for ten years of data. That’s how you create hundreds of tiny files that waste more time being managed than actually read. Imagine cutting a pizza into a hundred shards—it’s technically partitioned, but good luck eating it. The overhead buries any benefit. Bottom line: pick natural query keys, avoid high-cardinality junk, and always test your approach at production scale. Something slick on 10,000 rows can crash and burn once you load 10 million.
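
Here’s what that grain decision looks like when writing a Delta table from a notebook. A minimal sketch where the table, columns, and month/region grain are assumptions standing in for whatever your reports actually filter on:

```python
# Minimal sketch of partitioning a Delta table by how it is actually queried.
# Table and column names are assumptions; the point is the grain, not the schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.table("bronze_orders").withColumn(
    "OrderMonth", F.date_format("OrderDate", "yyyy-MM")
)

# Month and region match the report filters, so queries prune whole partitions.
# Partitioning by a high-cardinality key like OrderId would produce the
# hundred-shard pizza described above.
(
    orders.write.format("delta")
    .mode("overwrite")
    .partitionBy("OrderMonth", "Region")
    .saveAsTable("silver_orders_partitioned")
)
```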

Even if partitions are perfect, full reloads will grind things down. That’s why incremental refresh matters. Instead of repeatedly hauling every row across the wire, it just picks up what changed. Think of it like a phone update. Nobody wipes the OS to reinstall it whole every time; they patch what’s new. Skip incremental refresh and you’re basically re-flashing the entire phone, night after night. Real-world difference? We’ve seen nightly jobs processing tens of millions of rows drop from multi-hour marathons down to minute-level updates just by flipping to incremental mode. Same hardware, no code changes, completely different outcome.
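
In Dataflows Gen2 this is a setting you switch on; if the load runs at the notebook layer instead, the same “patch what changed” idea is a Delta MERGE. A minimal sketch, with the watermark column, key column, and table names as assumptions:

```python
# Minimal sketch of an incremental "patch what changed" load using Delta MERGE.
# Watermark column, key column, and table names are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forName(spark, "silver_orders")

# Only rows modified since the last successful load cross the wire.
# (First run: seed the table with a full load before switching to this pattern.)
last_load = spark.table("silver_orders").agg(F.max("ModifiedAtUtc")).collect()[0][0]
changes = spark.table("bronze_orders").filter(F.col("ModifiedAtUtc") > F.lit(last_load))

# Upsert: update matched keys, insert new ones; untouched history stays put.
(
    target.alias("t")
    .merge(changes.alias("s"), "t.OrderId = s.OrderId")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```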

And here’s the catch: piling on more capacity won’t fix poor design here. You can burn through budget on bigger machines, but if you’re still slicing data into toothpick partitions and reloading everything nightly, performance will sink. Capacity helps only if the design underneath is efficient. Partitioning tied to behavior, plus refresh that patches instead of rebuilds, solves more than renting a larger server ever will.

There are some easy practices to keep partitions healthy too. If you’re landing data in Delta tables, don’t let them rot into thousands of micro-files. Run OPTIMIZE to compact them into efficient blocks. And when writing out, use coalesce to reduce partition counts where possible. Repartitioning has its place, but coalesce is faster for merging down to a practical number without reshuffling the whole dataset. These tiny tweaks pay off in faster queries and smoother refresh cycles.

The net effect is refreshes that don’t feel like punishment. Instead of watching jobs chew compute for hours, you get predictable performance where dashboards pop when users actually open them. That’s the goal—not magic, not brute force, just applying partitioning and incremental refresh with common sense and a few guardrails.

Do these right, and you’ve taken fragile batch jobs and turned them into pipelines that scale without bankrupting your capacity budget. Miss them, and you’ll keep wondering why performance tanks every time volume doubles. At this point, the pattern is clear: good dataflows aren’t about hacks or shortcuts. They’re designed to stay consistent even as the landscape shifts. And that leads us to the last piece of the puzzle—building for resilience so they don’t just work once, but keep working when everything around them inevitably changes.

Conclusion

Here’s the bottom line. Reusable dataflows don’t come from hope—they come from three habits practiced every time. One: modularize and parameterize so you’re not hand-editing the same logic fifteen times. Two: land everything in Delta Lake so ACID transactions, schema enforcement, and version logs keep your data trustworthy. Three: tune partitioning and use incremental refresh so jobs run fast instead of burning hours and credits. Those three cover 90% of what keeps pipelines durable.

Want the quick win? Subscribe to the podcast and leave me a review—otherwise I’ll assume you enjoy spending weekends debugging CSVs. Don’t make me call your boss.

Treat the lakehouse like an asset with governance, and you’ll watch maintenance tickets drop instead of multiply.
