How Snowflake Data Catalog Simplifies Data Discovery and Governance

Revelate


Data discovery isn’t merely nice to have; it’s a must. Any company that wants to seriously compete in today’s market needs to solve the three main data discovery problems: organizational data silos, the lack of a useful data catalog or inventory, and inadequate productization.

Data-rich organizations have amassed oceans of information, but data-mature organizations know exactly how much they have, what it is, where it lives, and most importantly, who can use it. Moving from data-rich to data-mature is no small task. Even rudimentary data discovery (the simple ability for anyone in an organization to find the data they need) is an expensive, time-consuming process. Among those who try, only a few overcome the challenges of cost, completeness, and sustainability.

Managing and accessing data is much more than granting read permissions to tables and databases. It’s about tracking who accessed what, when, for what purpose, and how they got permission to view it. It’s also about tracking how data has changed over time, why, and when. In a word: governance.

Thankfully, there are tools like Snowflake’s Data Catalog that simplify these difficult tasks of data discovery and governance. Snowflake’s centralized metadata repository makes data easy to find and understand. Its data lineage capabilities track the origins and evolution of data over time, and its role-based access controls ensure no one has inappropriate access to data.

Understanding data discovery and governance

Data discovery and governance are two huge topics of utmost importance. Data discovery is the process of discovering the data an organization has and using that data to glean valuable insights. Data governance, or a data governance framework, is a set of guidelines, policies, and procedures that define how an organization should manage, collect, and protect its data. There is, however, a lot more to these concepts than just their definitions.

Organizations with successful and thorough data discovery go beyond the simple notion that any user can search for data in their organization. These data-mature organizations have built and deployed user experiences designed for searching across an entire data ecosystem. They’ve gone all-in on data cataloging efforts and have even productized data to make it as consumable as possible for technical and non-technical data consumers. Their data discovery systems are tightly integrated with identity and access management (IAM) systems to ensure users have only the data access they need and nothing more.

Data discovery and governance don’t necessarily go hand-in-hand, but you can’t do data discovery well without effective data governance. You can, however, have a successful data governance program without any organizational data discovery capabilities. You won’t be surprised to find out that most large organizations lack data discovery capabilities. That’s not a bad thing per se, but it’ll be tough for those organizations to compete in a data-driven market.

Successful and thorough data governance ensures that organizations know, understand, and record everything related to data. Effective governance is a thorough set of automated and manual procedures that ensure organizations use data in accordance with appropriate norms, expectations, and security policies. Data-mature organizations with fully-automated governance are rare, but they do exist. There are significant challenges to building and deploying all of the systems that monitor and document data usage and notify the right people when there are exceptions.

Introducing Snowflake Data Catalog to your enterprise

Snowflake is a cloud-based data platform that provides a unified experience for data warehousing, data lakes, data engineering, data science, and data application development. It offers high performance, scalability, and security through its primary components:

  • Snowflake Database: A distributed, columnar database that stores data in the cloud and is optimized for analytical workloads
  • Snowflake Data Warehouse: A fully managed, scalable, and secure environment for data warehousing

Snowflake offers governed collaboration, scalable and elastic infrastructure via Snowgrid, and intelligent workloads.

A primary component of the Snowflake Data Platform is the Snowflake Data Catalog (SDC), which simplifies data discovery and governance with a variety of capabilities, such as:

  • Centralized metadata repository, which includes information about all data’s source, format, schema, and usage
  • Data lineage, which tracks the flow of data through an organization and identifies potential data quality issues
  • Automated governance policy enforcement, like using the catalog to define who has access to certain data sets and to track changes to data
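
To make the lineage idea concrete, here is a minimal sketch in plain Python. It does not call any Snowflake API; the table names and the edge list are invented for illustration. It only shows how lineage can be treated as a graph that is walked upstream to find every source feeding a dataset.

```python
from collections import defaultdict

# Hypothetical lineage edges as (upstream, downstream) pairs.
# All table names here are invented for illustration.
LINEAGE_EDGES = [
    ("raw.orders", "staging.orders_clean"),
    ("raw.customers", "staging.customers_clean"),
    ("staging.orders_clean", "marts.revenue_by_customer"),
    ("staging.customers_clean", "marts.revenue_by_customer"),
]


def upstream_sources(target, edges):
    """Return every dataset that feeds into `target`, directly or indirectly."""
    parents = defaultdict(set)
    for upstream, downstream in edges:
        parents[downstream].add(upstream)

    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


sources = upstream_sources("marts.revenue_by_customer", LINEAGE_EDGES)
print(sorted(sources))
```

When a data quality issue appears in a downstream table, a walk like this narrows the search to the datasets that could have introduced it.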

Ancillary benefits of the SDC include:

  • Increasing data literacy across an organization by making it easy to find data and make better decisions
  • Improving data quality by using lineage information to identify and correct data quality issues
  • Reducing data silos by making it easy to share metadata across different departments and teams, improving collaboration, and making it easier to get insights from data

Streamline data discovery with Snowflake Data Catalog

The SDC’s data discovery capability centers around four primary steps:

  1. Data collection: Identifying data that’s most relevant to a business use case
  2. Data preparation: Using an ELT/ETL process to get the data ready for the data consumer
  3. Visual analysis: A data exploration process that ensures you are asking the right questions of the data and getting the right answers
  4. Advanced analysis: Employing data mining and AI/ML to gain insights

These steps are built upon Snowflake’s ability to automatically extract metadata via ETL/ELT data pipelines and catalog it for better visibility. Once Snowflake collects and catalogs the metadata, users can search and explore it. This process applies to both simple and complex architectures, like data mesh and data lakehouses, and factors into data modeling and profiling scenarios.
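
As a rough illustration of what searching cataloged metadata means, the sketch below models a tiny in-memory catalog in plain Python. The `CatalogEntry` class, its fields, and the sample entries are assumptions made for this example; they are not Snowflake objects or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record: name, source, format, schema, and tags."""
    name: str
    source: str
    data_format: str
    columns: list
    tags: set = field(default_factory=set)


def search(entries, term):
    """Case-insensitive match against table names, column names, and tags."""
    term = term.lower()
    hits = []
    for entry in entries:
        haystack = [entry.name, *entry.columns, *entry.tags]
        if any(term in item.lower() for item in haystack):
            hits.append(entry.name)
    return hits


# Invented sample entries for illustration only.
CATALOG = [
    CatalogEntry("sales.orders", "erp", "table",
                 ["order_id", "customer_id", "amount"], {"finance"}),
    CatalogEntry("crm.contacts", "crm", "table",
                 ["contact_id", "email"], {"pii"}),
]

print(search(CATALOG, "customer"))  # matched through the customer_id column
print(search(CATALOG, "PII"))       # matched through the pii tag
```

The point of the sketch is that discovery works on metadata alone: a user can find the right table without reading any of the underlying rows.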

Enhance data governance with Snowflake Data Catalog

The SDC can automate governance via data access controls, permissions, lineage tracking, and data stewardship.

  • Access controls: Define who has access to what data with role assignments based on a user’s job function, department, or other criteria
  • Permissions: Define what actions a user can take on data (e.g., view, edit, or delete)
  • Lineage tracking: See how data flows through an organization and look for data quality issues
  • Data stewardship: Ensure that data is accurate, consistent, and complete
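
The split between access controls (who) and permissions (what actions) can be sketched in a few lines of plain Python. The role names, users, and dataset below are hypothetical, and the code is only a model of role-based access control, not how Snowflake implements it.

```python
# Hypothetical role -> set of (dataset, action) privileges.
ROLE_PRIVILEGES = {
    "analyst": {("sales.orders", "view")},
    "data_steward": {("sales.orders", "view"), ("sales.orders", "edit")},
}

# Hypothetical user -> roles assignment, e.g. based on job function.
USER_ROLES = {
    "dana": {"analyst"},
    "sam": {"data_steward"},
}


def is_allowed(user, dataset, action):
    """Return True if any of the user's roles grants `action` on `dataset`."""
    for role in USER_ROLES.get(user, set()):
        if (dataset, action) in ROLE_PRIVILEGES.get(role, set()):
            return True
    return False


print(is_allowed("dana", "sales.orders", "view"))  # analyst can view
print(is_allowed("dana", "sales.orders", "edit"))  # analyst cannot edit
print(is_allowed("sam", "sales.orders", "edit"))   # steward can edit
```

Because privileges attach to roles rather than to individuals, changing what a job function may do is a single update instead of one per user.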

The SDC offers additional sophisticated security features to enhance organizational data governance policies. For example, it offers:

  • Column-, row-, and policy-based data masking
  • Tagging sensitive data objects for usage tracking
  • Classification of sensitive data
  • A full record of access history
  • Full audit records of object dependencies for virtualized views and other indirect access methods
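
The masking feature in the list above can be illustrated with a small plain-Python model. In Snowflake itself, masking is configured with masking policies attached to columns; the sketch below does not reproduce that mechanism. The rule table, role names, and placeholder string are assumptions for the example.

```python
# Hypothetical rules: column name -> roles allowed to see the clear value.
MASKING_RULES = {
    "email": {"pii_reader"},
    "ssn": {"compliance"},
}


def mask_row(row, role, rules=MASKING_RULES, placeholder="***MASKED***"):
    """Return a copy of `row` with protected columns hidden from `role`."""
    masked = {}
    for column, value in row.items():
        allowed_roles = rules.get(column)
        if allowed_roles is not None and role not in allowed_roles:
            masked[column] = placeholder
        else:
            masked[column] = value
    return masked


row = {"name": "Ada", "email": "ada@example.com"}
print(mask_row(row, "analyst"))     # email is hidden
print(mask_row(row, "pii_reader"))  # email is visible
```

The same query returns different values depending on the caller's role, which is what lets a single table serve both sensitive and general-purpose use cases.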

Taking advantage of Snowflake’s suite of governance tools makes compliance with regulatory requirements much easier.

Collaborate and share data with Snowflake Data Catalog

Snowflake makes it easy to share data with other Snowflake accounts with Snowflake Shares. A Share is a way to grant access to a specific dataset to another Snowflake account.

To share data with the SDC, you first need to create a Share and enter the following information:

  • Share name: The name of the Share
  • Account ID: The ID of the account that you want to share the data with
  • Dataset: The dataset you want to share
  • Permissions: The permissions you want to grant to the other account

Once you create the share, the other account will be able to see the shared dataset in their SDC. They will also be able to query the dataset, just like they would any other dataset in their account.
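
For readers who prefer SQL to the UI, the sketch below assembles the statements that correspond to the four inputs above. The statement forms (`CREATE SHARE`, `GRANT ... TO SHARE`, `ALTER SHARE ... ADD ACCOUNTS`) follow Snowflake's secure data sharing commands, but treat this as an unverified sketch and check it against the current Snowflake documentation before use. The helper name, the identifier check, and every object name are hypothetical, and the code only builds strings; it does not connect to Snowflake.

```python
import re

# Conservative check for unquoted identifiers (an assumption for this sketch).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def _check(*names):
    for name in names:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")


def build_share_statements(share, database, schema, table, consumer_account):
    """Return the SQL statements for sharing one table with one account."""
    _check(share, database, schema, table)
    # Account identifiers may look like org.account, so check each part.
    _check(*consumer_account.split("."))
    return [
        f"CREATE SHARE {share};",
        f"GRANT USAGE ON DATABASE {database} TO SHARE {share};",
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO SHARE {share};",
        f"GRANT SELECT ON TABLE {database}.{schema}.{table} TO SHARE {share};",
        f"ALTER SHARE {share} ADD ACCOUNTS = {consumer_account};",
    ]


# All names below are invented for illustration.
statements = build_share_statements(
    "sales_share", "sales_db", "public", "orders", "partner_org.partner_acct"
)
for statement in statements:
    print(statement)
```

The grants mirror the Permissions input: the consumer receives read access to the listed object and nothing else.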

There are some limitations to this type of sharing, whether using Snowflake Listings, Snowflake Shares, or Delta Shares. First, Snowflake Shares aren’t easily searchable unless the data has already been shared with your account. Second, the Snowflake UI is not designed for non-technical users. When data is shared with a Snowflake account, it shows up as a table, or set of tables, with little-to-no helpful metadata to guide a consumer.

Consistent practices in governance, data quality, and data maturity build trust across an organization. There’s no guarantee that your organization will become data-driven because of sharing, but it’s guaranteed not to happen if there’s no sharing.



Benefits of SDC for data discovery and governance

In addition to the SDC’s benefits involving data sharing, lineage, and enhanced governance, there are plenty of other less-obvious and non-technical benefits:

  • Collaboration: The SDC supports collaboration, which makes it easy to share metadata with others. This collaboration can be done through comments, tags, and sharing permissions. This process can help provide everyone in the organization with access to the information they need to make informed decisions about data
  • Security: The SDC is secure, which ensures your metadata is protected. The catalog uses role-based access control (RBAC) to control who can access and modify metadata so that only authorized users have access to sensitive data
  • Increased visibility: The SDC makes it easy to see who has access to what data and what changes have been made. This feature helps identify potential data security risks and the steps to take to mitigate them
  • Improved compliance: The SDC can ensure data is used in a compliant manner because the catalog tracks data lineage and enforces data governance policies
  • Reduced risk: The SDC can reduce the risk of data breaches and other data-related incidents because the catalog can track data access and identify potential security risks
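
The visibility benefit above comes down to being able to answer "who accessed what, and when" from a log of access events. The following plain-Python sketch uses an invented event list to show the kind of summary involved; the field names and records are assumptions, not the shape of Snowflake's own access history.

```python
from collections import Counter
from datetime import datetime

# Invented access events for illustration only.
ACCESS_LOG = [
    {"user": "dana", "dataset": "crm.contacts", "at": datetime(2024, 1, 5, 9, 30)},
    {"user": "dana", "dataset": "crm.contacts", "at": datetime(2024, 1, 6, 14, 0)},
    {"user": "sam", "dataset": "sales.orders", "at": datetime(2024, 1, 6, 8, 15)},
]


def access_counts(log):
    """Count accesses per (user, dataset) pair."""
    return Counter((event["user"], event["dataset"]) for event in log)


def last_access(log, dataset):
    """Return the most recent access time for `dataset`, or None if unseen."""
    times = [event["at"] for event in log if event["dataset"] == dataset]
    return max(times) if times else None


counts = access_counts(ACCESS_LOG)
print(counts[("dana", "crm.contacts")])
print(last_access(ACCESS_LOG, "crm.contacts"))
```

Summaries like these are the raw material for spotting unusual access patterns and for answering auditor questions.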

Don’t skimp on data discovery and governance

Organizations that skimp on data discovery and governance initiatives, or skip them altogether, are missing out on huge opportunities. Not only do they forgo the competitive advantage that comes with internal data sharing, data-driven decision-making, and a culture of data-minded operations, they could also face difficult compliance and regulatory situations.

Snowflake does a lot out of the box to enhance data discovery and governance, but it doesn’t cover most of the critical business use cases that companies face today. That’s where Revelate steps in.

Revelate offers a seamless, automated pipeline to package up datasets into products. Easily share your data with technical and non-technical users across your organization with Revelate. While Snowflake and the SDC can break down organizational data silos, Revelate provides a proactive approach with a powerful metadata management framework and product fulfillment capabilities. It’s fully compatible with a variety of data lineage tools, data catalog tools, data transformation tools, and ELT and ETL tools. In short, all the tools.

But compatibility isn’t where Revelate stops; as a close data partner with Snowflake, Revelate understands how to meet your custom data sharing needs. We are proficient with Snowflake in a variety of data ecosystems. Our professional services team is happy to help you go from zero to marketplace-proficient.
