If you've ever wondered why your data suddenly disappears from a report, or who exactly changed the source file feeding your monthly dashboard, you're not alone. Most teams are flying blind when it comes to seeing the full journey of their data.
Today, we're going to trace that journey inside Microsoft Fabric — from ingestion, through transformation, into analytics — and uncover how lineage, permissions, and the catalog work together to keep you in control. By the end, you'll see every hop your data makes, and exactly who can touch it.
Seeing the Invisible: The Path Data Actually Takes
Most people picture data traveling like a straight road: it leaves the source, passes through a few hands, and ends up neatly in a report. In reality, it’s closer to navigating an old building that’s been renovated a dozen times. You’ve got hallways that suddenly lead to locked doors, side passages you didn’t even know existed, and shortcuts that bypass major rooms entirely. That’s the challenge inside any modern analytics platform—your data’s path isn’t just a single pipeline, it’s a web of steps, connections, and transformations.
Microsoft Fabric’s Lakehouse model gives the impression of a single, unified home for your data. And it is unified—but under the hood, it’s a mix of specialized services working together. There’s a storage layer, an analytics layer, orchestration tools, and processing engines. They talk to each other constantly, passing data back and forth. Without the right tools to record those interactions, what you actually have is a maze with no map. You might know how records entered the system and which report they eventually landed in, but the middle remains a black box.
When that black box gets in the way, it’s usually during troubleshooting. Maybe a number is wrong in last month’s sales report. You check the report logic, it looks fine. The dataset it’s built on seems fine too. But somewhere upstream, a transformation changed the values, and no one documented it. That invisible hop—where the number stopped being accurate—becomes the needle in the haystack. And the longer a platform has been in use, the more invisible hops it tends to collect.
This is where Fabric’s approach to lineage takes the maze and lays down a breadcrumb trail. Take a simple example: data comes in through Data Factory. The moment the pipeline runs, lineage capture starts—without you having to configure anything special. Fabric logs not just the target table in the Lakehouse but also every source dataset, transformation step, and subsequent table or view created from it. It doesn’t matter if those downstream objects live in the same workspace or feed into another Fabric service—those links get recorded automatically in the background.
In practice, that means if you open the lineage view for a dataset, you’re not just seeing what it feeds—you’re seeing everything feeding it, all the way back to the ingestion point. It’s like tracking a shipment and seeing its path from the supplier’s warehouse, through every distribution center, truck, and sorting facility, instead of just getting a “delivered” notification. You get visibility over the entire chain, not just the start and finish.
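If you’d rather pull those recorded links out programmatically than browse them in the lineage view, the Power BI admin scanner API exposes workspace metadata with lineage included. Here’s a minimal Python sketch; the token and workspace ID are placeholders, and the result fields walked at the end are indicative of the scanner schema rather than a full contract.

```python
import time
import requests

# Placeholders: a real AAD token (with tenant admin scope) and a workspace GUID.
TOKEN = "<aad-access-token>"
WORKSPACE_ID = "<workspace-guid>"
BASE = "https://api.powerbi.com/v1.0/myorg/admin/workspaces"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Kick off a metadata scan with lineage edges included.
scan = requests.post(
    f"{BASE}/getInfo?lineage=True&datasourceDetails=True",
    headers=HEADERS,
    json={"workspaces": [WORKSPACE_ID]},
).json()

# Poll until the scan completes, then fetch the full result.
while requests.get(f"{BASE}/scanStatus/{scan['id']}", headers=HEADERS).json()["status"] != "Succeeded":
    time.sleep(5)
result = requests.get(f"{BASE}/scanResult/{scan['id']}", headers=HEADERS).json()

# Each dataset entry carries upstream references; exact field names
# vary by asset type, so treat this traversal as illustrative.
for ws in result.get("workspaces", []):
    for ds in ws.get("datasets", []):
        print(ds.get("name"), "<-", ds.get("upstreamDataflows", []))
```

Notice that nothing here created the lineage; the scan only reads back metadata Fabric already recorded as the pipelines ran.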
Now, there’s a big difference between choosing to document lineage and having the system do it for you. With user-driven documentation, it’s only as current as the last time someone updated it—assuming they remembered to update it at all. With Fabric, this happens as a side effect of using the platform. The metadata is generated as you create, move, and transform data, so it’s both current and accurate. That all but removes the human factor, which is the only way lineage maps ever stay trustworthy in a large, active environment.
It’s worth noting that what Fabric stores isn’t just a static diagram. That automatically generated metadata becomes the basis for other controls—controls that don’t just visualize the flow but actually enforce governance. It’s the foundation for connecting technical lineage to permissions, audit trails, and compliance cataloging. When you hear “metadata,” it can sound like passive information, but here it’s the scaffolding that other rules are built on.
And once you have that scaffolding in place, permissions stop being static access lists. They can reflect the actual relationships between datasets, reports, and workspaces. Which means you’re not granting access in isolation anymore—you’re granting it with the full context of where that data came from and where it’s going. That’s where lineage stops being just an operational tool for troubleshooting and becomes a strategic tool for governance.
Because once you can see the full path every dataset takes, you can make sure control over it travels just as consistently. And that’s exactly where permission inheritance steps in.
One Permission, Everywhere It Matters
Imagine giving someone permission to open a finished, polished report — only to find out they can now see the raw, unfiltered data behind it. It’s more common than you’d think. The intent is harmless: you want them to view the insights. But if the permissions aren’t aligned across every stage, you’ve just handed over access to things you never meant to share.
In the Lakehouse, Microsoft Fabric tries to solve this with permission inheritance. Instead of treating ingestion, storage, and analytics as isolated islands, it treats them like rooms inside the same building. If someone has a key to enter one room, and that room directly feeds into the next, they don’t need a separate key — the access decision flows naturally from the first. The model works by using your workspaces as the control point. Everything in that workspace — whether it’s a table in the Lakehouse, a semantic model in Power BI, or a pipeline in Data Factory — draws from the same set of permissions unless you override them on purpose.
In a more siloed environment, permissions are often mapped at each stage by different tools or even different teams: one team manages database roles, another manages storage ACLs, another handles report permissions. Over time, those separate lists drift apart. You lock something down in one place but forget to match it in another, or you remove a user from one system but they still have credentials cached in another. That’s how security drift creeps in — what was once a consistent policy slowly turns into a patchwork.
Let’s make this concrete. Picture a Lakehouse table holding sales transactions. It’s secured so that only the finance team can view it. Now imagine you build a Power BI dataset that pulls directly from that table, and then a dashboard on top of that dataset. In a traditional setup, you’d need to manually ensure that the Power BI dataset carries the same restrictions as the Lakehouse table. Miss something, and a user with only dashboard access could still query the source table and see sensitive details.
In Fabric, if the Lakehouse and the Power BI assets live in the same workspace, the permissions cascade automatically. That finance-only table is still finance-only when it’s viewed through Power BI. You don’t touch a single extra setting to make that happen. Fabric already knows that the dataset’s upstream source is a restricted table, so it doesn’t hand out access to the dataset without verifying the upstream rules.
The mechanics are straightforward but powerful. Because workspaces are the organizing unit, and everything inside follows the same security model, there’s no need to replicate ACLs or keep separate identity lists in sync. If you remove someone from the workspace, they’re removed everywhere that workspace’s assets appear. The administrative load drops sharply, but more importantly, the chances of accidental access go down with it.
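To make the mechanics concrete, here is roughly what that single control point looks like through the Fabric REST API. One role assignment against the workspace covers every asset inside it; the token and IDs below are placeholders you would supply.

```python
import requests

# Placeholders: a Fabric-scoped AAD token, plus workspace and user GUIDs.
TOKEN = "<aad-access-token>"
WORKSPACE_ID = "<workspace-guid>"
USER_OBJECT_ID = "<aad-user-object-id>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
URL = f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/roleAssignments"

# One call grants Viewer on the workspace; every item inside it (Lakehouse
# tables, pipelines, semantic models) follows that single decision.
resp = requests.post(URL, headers=HEADERS, json={
    "principal": {"id": USER_OBJECT_ID, "type": "User"},
    "role": "Viewer",
})
resp.raise_for_status()

# Revoking is the mirror image: deleting the assignment removes access
# everywhere the workspace's assets appear.
# requests.delete(f"{URL}/{role_assignment_id}", headers=HEADERS)
```

The design choice worth noticing is that there is no per-item ACL to mirror; the workspace is the unit, which is exactly why removal propagates everywhere at once.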
This is where the contrast with old methods becomes clear. In a classic warehouse + BI tool setup, you might have a database role in SQL Server, a folder permission in a file share, and a dataset permission in your reporting tool — all for the same logical data flow. Managing those in parallel means triple the work and triple the opportunity to miss a step. Even with automation scripts, that’s still extra moving parts to maintain.
The “one permission, many surfaces” approach means that a change at the source isn’t just reflected — it’s enforced everywhere downstream. If the Lakehouse table is locked, no derived dataset or visual bypasses that lock. For governance, that’s not a nice-to-have — it’s the control that stops data from leaking when reports are shared more widely than planned. It aligns your security model with your actual data flow, instead of leaving them as two separate conversations.
When you combine this with the lineage mapping we just talked about, those permissions aren’t operating in a void. They’re linked, visually and technically, to the exact paths your data takes. That makes it possible to see not just who has access, but how that access might propagate through connected datasets, transformations, and reports. And it’s one thing to enforce a policy — it’s another to be able to prove it, step by step, across your entire pipeline.
Of course, having aligned permissions is great, but if something goes wrong, you’ll want to know exactly who made changes and when. That’s where the audit trail becomes just as critical as the permission model itself.
A Single Source of Truth for What Happened and When
Ever try to figure out who broke a dashboard — and end up stuck in a reply-all thread that keeps growing while no one actually answers the question? You bounce between the data team, the BI team, and sometimes even the storage admins, piecing together guesses. Meanwhile, the person who actually made the change is probably wondering why the metrics look “different” today. This is the part of analytics work where the technical problem turns into a game of office politics.
Audit logs are Fabric’s way of taking that noise out of the equation. They act like a black box recorder for your entire Lakehouse environment. Every significant action is captured: who did it, what they touched, and when it happened. It’s not just a generic access log—Fabric ties these entries directly to specific objects in the platform. So if a dataset’s schema changes, you can see the exact user account that made it, along with a timestamp and the method they used.
Here’s where the connection to lineage makes a difference. If all you had was a folder of log files, you’d still end up manually cross-referencing IDs and timestamps to figure out the impact. But because Fabric already maps the data flow, those logs don’t live in isolation. You can view a dataset’s lineage, click on a node, and see precisely which actions were run against it. That means you can trace a broken metric right back to the transformation job it came from — and identify the person or process that ran it.
The coverage is broad, too. Fabric’s audit layer records access events, so you know when someone queried a table or opened a report. It logs creation and deletion of datasets, pipelines, and tables. Modifications get a record whether they’re structural, like changing a column type, or procedural, like editing a pipeline activity. Even publishing a new version of a Power BI report counts as an event, tied back to its lineage. All of it gets the same treatment: time, user, and object ID, stored in a consistent format.
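If you want to pull those events yourself rather than wait for an investigation, the Power BI admin activity events endpoint returns them as JSON, one UTC day per query. A sketch, assuming an admin-scoped token; the key names and activity values in the filter are illustrative rather than a fixed schema.

```python
import requests

# Placeholder token; the endpoint covers at most one UTC day per call.
TOKEN = "<aad-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
url = (
    "https://api.powerbi.com/v1.0/myorg/admin/activityevents"
    "?startDateTime='2024-05-01T00:00:00Z'&endDateTime='2024-05-01T23:59:59Z'"
)

# Page through the day's events via the continuation link.
events = []
while url:
    page = requests.get(url, headers=HEADERS).json()
    events.extend(page.get("activityEventEntities", []))
    url = page.get("continuationUri")

# Narrow to modifications against one artifact; keys and values here
# are examples of the event shape, not an exhaustive list.
for e in events:
    if e.get("Activity", "").startswith("Update") and e.get("ItemName") == "Sales":
        print(e.get("CreationTime"), e.get("UserId"), e.get("Activity"))
```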
This uniformity is what turns the logs into something usable for compliance. Regulatory audits don’t care about your internal tooling—they care that you can prove exactly who accessed sensitive data, under what authorizations, and what they did with it. Fabric’s audit trail can be queried to produce that history across ingestion, transformation, and output. If an HR dataset is classified as containing personal information, you can show not only the access list but every interaction that dataset had, right down to report exports.
Incident investigations work the same way. Say a number in a quarterly report doesn’t match the finance system. Instead of speculating, you go to the dataset feeding that report, pull its audit history, and see that two weeks ago a transformation step was added to a notebook. The person who committed that change is there in the log. You can verify if it was intentional, test the outcome, and fix the issue without having to untangle chains of hearsay.
One of the underappreciated parts here is how it integrates with Purview. While Fabric keeps the logs, Purview can pull them in alongside the catalog and lineage data from across the organization. That means the audit for a dataset in one workspace can be examined next to its related objects in other workspaces and even non-Fabric data sources. For large organizations, this keeps investigations from stalling at the borders between teams. Everything’s indexed in a single, searchable layer.
When you link logs and lineage like this, you get more than a record of events—you get a timeline of your data’s actual life. You can follow the route from source to report, while also seeing who stepped in at each point. It’s a complete view that connects human actions to data flows. That’s what saves you from chasing people down in email threads or making decisions based on guesswork.
And beyond solving technical problems, this level of visibility takes the politics out of post-mortems. You’re not relying on memory or conflicting descriptions—you’ve got a single, objective record. No matter how complex the pipeline or how many teams touched it, you can back every claim with the same source of truth. And once that visibility is in place, the obvious next step is to scale it out, so that same clarity exists across every dataset and every team in the organization. That’s where the catalog comes in.
Purview: The Map Room for Your Data Universe
Knowing the lineage inside one workspace is useful — but it’s also like knowing the street map of your own neighborhood without ever seeing the city plan. You can navigate locally, but if the delivery truck gets lost two suburbs over, you have no idea why it’s late. That’s the gap between workspace-level insight and an enterprise-wide view. And that’s exactly where Microsoft Purview steps in.
Purview sits above Fabric, acting like an index for everything the platform knows about your data’s structure, movement, and classification. Instead of digging into each workspace separately, you get a single catalog that brings lineage, definitions, and access rules into one place. It aggregates metadata from multiple Fabric environments — and from outside sources too — so your view isn’t limited by team or project boundaries.
The problem it solves is straightforward but critical. Without a central catalog, each team’s view of lineage ends at their own assets. The BI group might know exactly how their dashboards are built from their datasets. The data engineering team might know how those datasets were sourced and transformed from raw data. But unless they’re trading notes constantly, the full picture never exists in one system. Troubleshooting, compliance checks, and data discovery all slow down because you have to stitch fragments together manually.
In Purview’s catalog, lineage from ingestion to analytics is mapped across every Fabric workspace it’s connected to. Imagine opening a dataset’s page and not only seeing its lineage inside its current workspace, but also the ingestion pipeline in another workspace that feeds it, and the curated table two more steps upstream. That’s not a separate diagram you have to maintain — it’s read directly from Fabric’s metadata and preserved in the catalog. From there, anyone with the right access can navigate it like a continuous chain, no matter which logical or organizational boundaries it crosses.
One of the most tangible benefits is search. Purview isn’t just indexing object names; it understands classifications and sensitivity labels. If your compliance officer wants to know where all data containing “customer phone number” is stored or consumed, they can run a query across the catalog and get every instance — in Lakehouse tables, Power BI datasets, even Synapse artifacts. That search works because Purview stores both the technical metadata and the business metadata you’ve added, so “customer phone number” could match a column in a Lakehouse table as well as a field in a report’s data model.
That connection to business glossaries is where Purview goes beyond being a passive map. If you’ve defined common business terms, you can link them directly to datasets or columns in the catalog. It means that “Net Revenue” isn’t just a label in a report — it’s tied to the actual data source, transformation logic, and every report that uses it. For governance, this reduces ambiguity. Different teams aren’t debating definitions in chat threads; they’re all pointing to the same glossary entry, which links back to the exact data objects in Fabric.
Integration with technical assets is broad and consistent. Purview understands Power BI datasets, including their table and column structures. It knows Lakehouse tables and the pipelines feeding them. It registers Synapse notebooks, SQL scripts, and dataflow artifacts. And for each asset, it keeps track of lineage relationships and classifications. This makes it just as easy to trace the origin of a KPI in a Power BI report as it is to audit a transformation notebook’s impact on multiple downstream tables.
Centralizing all of this breaks down silos in a practical way. Without a single catalog, the security team might only see logs and permissions for their own systems, while the analytics team works in total isolation on reporting models. Purview creates overlap — the catalog becomes the single reference point for technical teams, analysts, and compliance officers alike. It means a governance policy written at the organizational level can be checked against real data flows, instead of relying on assumptions or self-reported documentation.
And that’s the point where technical reality meets compliance reporting. You’re not just drawing maps to satisfy curiosity. You’re connecting verified lineage to actual usage, classifications, and security rules in a way that can stand up to audits or investigations. Whether the question is “Where is this sensitive field stored?” or “Which reports depend on this table we’re changing?”, the answer is in the catalog — complete, current, and verifiable.
With that kind of organization-wide visibility in place, you can finally see how every piece of the pipeline connects. Which raises the next challenge: ensuring that transparency isn’t lost once the data starts changing inside transformations.
Keeping Transparency Through Every Transformation
Every time data goes through a transformation, you’re removing or reshaping something. Maybe it’s a simple column rename, maybe a full aggregation — but either way, the original form changes. If the system isn’t capturing that moment, you’re left with a number you can’t properly account for. It still looks valid in a report, but ask how it was calculated and you’ll find yourself digging through scripts, emails, and memory to reconstruct what happened.
Inside Microsoft Fabric, this is where the Synapse transformation layer earns its keep. Whether you’re working in SQL scripts, Spark notebooks, or Dataflows, each step that changes the data keeps its connection back to the original source. The Lakehouse doesn’t just store the output table — it also knows exactly which datasets or tables fed into it, and how they link together. Those links become part of the lineage graph, so you can navigate both the “before” and the “after” without guessing or relying on separate documentation.
The risk without transformation-level lineage is pretty straightforward. You start trusting aggregates or calculated fields that may be outdated, incomplete, or based on incorrect joins. You can double-check the final query if you have it, but that tells you nothing about upstream filters or derived columns created three models earlier. This is how well-meaning teams can ship KPIs that contradict each other — each one consistent within its own context, but not rooted in the same underlying data path.
Here’s a simple scenario. You’ve got a transaction table logging individual sales: date, product, region, amount. The business asks for weekly sales totals by region. In a notebook, you group by week and sum the amounts, creating an aggregated table. In most systems, the link back to the base table isn’t tracked beyond the notebook script itself. In Fabric, that weekly sales table still appears in the lineage graph with a live connection to the source transaction table. When you click that node, you see where it came from, which transformation objects touched it, and where it’s used downstream in reports. That connection doesn’t fade after the job completes — it’s part of the metadata until you delete the asset.
On the graph, each transformation appears as its own node: a Dataflow, a Notebook, a SQL script. You can see both the incoming edges — the datasets it consumes — and the outgoing edges — the tables, views, or datasets it produces. This makes it obvious when multiple outputs come from the same transformation. For example, a cleansing script might produce a curated table for analytics and a separate feed for machine learning. The lineage view shows those two paths branching from the same point, so any changes to that transformation are visible to the owners of both outputs.
What’s useful is that this scope isn’t limited to one type of tool. A Dataflow transforming a CSV has the same kind of upstream and downstream tracking as a Spark notebook joining two Lakehouse tables. That consistency is possible because Fabric treats these tools as peers on a shared metadata layer, recording lineage the same way regardless of which engine did the work. The fact you built something in SQL and your colleague built theirs in a visual Dataflow doesn’t mean you need two different ways to see the lineage.
This automatic, tool-agnostic mapping turns an abstract governance goal into something you can actually act on. Quality assurance teams can audit an entire calculation path, not just the last step. Compliance officers can prove that a sensitive field was removed at a specific transformation stage and never reintroduced. Analysts can check if two KPIs share a common base table before deciding whether they truly compare like-for-like. It’s not about policing work — it’s about trusting outputs because you can see and verify every step that shaped them.
In a BI environment, trust is fragile. One unexplained spike or mismatch erodes confidence quickly. When you’ve got transformation-level lineage baked in, you can answer “Where did this number come from?” with more than a shrug. You can click your way from the report through each transformation, all the way back to the original record. And when that degree of traceability is combined with governance controls, permissions, and catalogs, the result isn’t just visibility — it’s an entire data estate where every decision and every metric can be backed by proof. That’s what ties all of these capabilities together into something more than the sum of their parts.
Conclusion
In Fabric, lineage, permissions, logging, and cataloging aren’t extra features you bolt on later — they hold the Lakehouse together. They work in the background, connecting every source, transformation, and report with rules and proof you can actually rely on.
The clearer you see your data’s actual journey, the more confidently you can use it without creating risk. That’s the difference between trusting a number because it “looks right” and trusting it because you’ve verified every step.
Tomorrow, pick one of your data flows. Trace it start to finish. See what’s recorded — and what that visibility could save you.