You want to buy a charcoal barbecue grill for your upcoming trip. So, you visit an online marketplace to find the one that best fits your needs. But here’s the catch: the online marketplace doesn’t have a product catalog, so there’s no way to filter and sort through its inventory. You end up spending far too much time searching for a grill you could have found quickly if there were a product catalog.
The challenges are similar for corporate users who want to view where a prospect is in the sales funnel—from lead to conversion to payment, sale, and customer service. If this data is spread across different platforms, tools, and accounts, it can be challenging to quickly discover and assess.
Data catalogs make data easy to find and access
A data catalog helps users understand which data exists and where, so they can get it without back-and-forth between departments. Data catalogs help companies create a comprehensive, searchable inventory of data assets that includes metadata such as location, format, and relationships to other data assets.
Data catalogs are especially critical to organizations where data is an underutilized asset. Think of it like this: one team might collect data for a specific purpose, but that data may have the potential to be used outside its original context. For example, customer success teams often collect valuable information about existing clients, which product teams can leverage to build new features or optimize existing ones.
Because data catalogs can identify common attributes across data sets, such as a customer ID or product code, they can bridge the disconnect between data assets. This makes it easier for team members and stakeholders to understand and leverage data assets. A data catalog provides both technical characteristics (such as data formats, structure, and quality metrics) and business context (like data descriptions, stakeholders, business rules, and usage examples) of data assets. This ensures users have access to well-rounded, accurate information.
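To make that bridging concrete, here’s a minimal Python sketch of how a catalog might surface shared attributes across registered schemas. The dataset names and columns are hypothetical:

```python
# Hypothetical dataset schemas as a catalog might store them: name -> columns.
schemas = {
    "crm_customers": {"customer_id", "full_name", "signup_date"},
    "orders": {"order_id", "customer_id", "product_code", "order_total"},
    "product_catalog": {"product_code", "category", "unit_price"},
}

def shared_attributes(schemas):
    """Return attributes that appear in more than one dataset,
    mapped to the datasets that share them."""
    seen = {}
    for dataset, columns in schemas.items():
        for column in sorted(columns):
            seen.setdefault(column, []).append(dataset)
    return {col: datasets for col, datasets in seen.items() if len(datasets) > 1}

print(shared_attributes(schemas))
# {'customer_id': ['crm_customers', 'orders'], 'product_code': ['orders', 'product_catalog']}
```

Even a simple pass like this reveals the join keys (here, customer ID and product code) that let teams connect otherwise siloed assets.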
Data catalogs are built within internal data marketplaces. Think of internal data marketplaces like virtual shopping malls and data catalogs as directories that provide a unified, structured listing of all available data assets along with their location.
There are plenty of benefits to building a data catalog. They help:
- Provide a scalable way for organizations to manage ever-increasing data assets
- Curate data more efficiently via indexing and searching
- Eliminate manual processes and dependencies with advanced search capabilities typically driven by machine learning and semantic context
- Support data-driven decision-making with access to technical and business metadata
- Simplify data compliance and governance as automation tools do the heavy lifting
Now that it’s clear what a data catalog is and its benefits, let’s examine how to create a data catalog that also maintains security and compliance standards.
Best practices for building a data catalog for internal marketplaces
Building a data catalog isn’t always straightforward. It can be challenging for organizations to identify data assets, maintain data quality, build classification schemes and taxonomies, and establish development requirements.
We have five best practices we recommend following when building a data catalog for your internal enterprise data marketplace. They are:
- Define metadata
- Establish data quality metrics
- Establish roles and responsibilities
- Define taxonomies and relate business assets to each other
- Plan and manage data flows
Best Practice 1: Define metadata
The first step to building a data catalog is to capture metadata from existing data assets within your organization. Metadata is structured information about a data asset, including a dataset description, context, file format and size, creation date, source, quality metrics, and usage.
For example, suppose a data catalog has a dataset called “Employee Information.” Its metadata might include the following:
- Description: The dataset contains employee personal information and performance metrics
- Format: Relational database (Oracle)
- Size: 100MB
- Columns:
- Employee ID
- Full name
- Department
- Position
- Date of birth
- Employment start date
- Salary
- Performance rating
- Data source: HR management system
- Data quality metrics:
- The source is 100% complete
- Data is verified and considered highly accurate
- Data adheres to predefined business rules and standards
- Data has been validated
- Business context: HR analytics, employee performance evaluation, workforce planning
- Data owner: Toby Flenderson (Human Resources)
- Last updated: June 20, 2023
- Tags: Employees, Human Resources, HR analytics, Performance Management
The metadata provides technical information, tags and descriptions for searchability, and operational metadata to demonstrate freshness.
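As an illustration, here’s roughly what that record could look like as a structured object in Python. The `DatasetMetadata` class and its field names are hypothetical stand-ins, not any specific catalog tool’s API:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    """Hypothetical catalog entry; fields mirror the example above."""
    name: str
    description: str
    data_format: str
    size_mb: int
    columns: list
    source: str
    owner: str
    last_updated: str
    tags: list = field(default_factory=list)

employee_info = DatasetMetadata(
    name="Employee Information",
    description="Employee personal information and performance metrics",
    data_format="Relational database (Oracle)",
    size_mb=100,
    columns=["Employee ID", "Full name", "Department", "Position",
             "Date of birth", "Employment start date", "Salary",
             "Performance rating"],
    source="HR management system",
    owner="Toby Flenderson (Human Resources)",
    last_updated="2023-06-20",
    tags=["Employees", "Human Resources", "HR analytics", "Performance Management"],
)
```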
For smaller organizations, handling metadata might be relatively simple due to lower volume and complexity. However, for medium- to enterprise-sized organizations with large, ever-increasing amounts of data, it’s tough to organize metadata in a scalable, structured manner. That’s where standardized metadata comes into play.
Standardized metadata and attributes
Standardized metadata provides a common set of metadata attributes and adheres to widely accepted frameworks or standards to describe and categorize data assets consistently. Some well-known standards include Dublin Core Metadata Initiative (DCMI), Data Documentation Initiative (DDI), Metadata Object Description Schema (MODS), and Common Warehouse Metamodel (CWM). Standardized metadata makes data discoverability easier and faster since it allows users to search, filter, and explore existing data easily.
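As a rough sketch, here’s how the internal metadata from the earlier example might be projected onto a handful of Dublin Core elements. The field-to-element mapping is illustrative, not a formal DCMI profile:

```python
# Internal metadata captured in Best Practice 1, mapped onto a few
# Dublin Core elements. Illustrative subset only, not a complete DCMI mapping.
internal = {
    "name": "Employee Information",
    "description": "Employee personal information and performance metrics",
    "format": "Relational database (Oracle)",
    "source": "HR management system",
    "owner": "Toby Flenderson (Human Resources)",
    "last_updated": "2023-06-20",
    "tags": ["Employees", "Human Resources", "HR analytics"],
}

dublin_core = {
    "dc:title": internal["name"],
    "dc:description": internal["description"],
    "dc:format": internal["format"],
    "dc:source": internal["source"],
    "dc:creator": internal["owner"],   # mapping owner to creator is a judgment call
    "dc:date": internal["last_updated"],
    "dc:subject": "; ".join(internal["tags"]),
}
```

Because every asset exposes the same elements, users can search and filter across the whole catalog without learning each team’s naming quirks.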
Once you have selected the metadata you want to add to your data catalog, explore and shortlist metadata management tools. Revelate is one such tool; it can define metadata and fully customize how consumers browse and filter products.
Best Practice 2: Establish data quality metrics
One of the ways you can build a better, more effective data catalog is with data quality metrics. They help assess data assets, benchmark what good-quality data really means, and set data validation rules to verify whether data assets match catalog definitions.
Data quality metrics also align with data compliance and governance policies, so they provide a handy way to monitor and enforce data quality standards across data catalogs.
There are four key data quality metrics you should consider when building a data catalog:
- Accuracy: Data must be factually correct, otherwise your resulting analyses and insights may be false
- Accessibility: Data should be readily available to authorized users when they need it. This means data isn’t just available, but also usable for anyone without prior experience working with a specific data asset
- Completeness: Data should be complete on its own; no critical information should be missing
- Timeliness: Data must be fresh to be accurate and valuable
There’s no one-size-fits-all approach to measuring data quality metrics. For example, most organizations use an error ratio to measure accuracy. Some companies report it as a positive percentage, like 97.5% accurate, while others express it as an error rate, like 2.5%. Similarly, some enterprises use automated duplicate detection tools to catch inaccuracies.
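To make the idea concrete, here’s a minimal Python sketch of computing a completeness score and an error ratio. The sample records, required fields, and validation rule are invented for illustration:

```python
records = [
    {"employee_id": 1, "department": "Sales", "salary": 55000},
    {"employee_id": 2, "department": None,    "salary": 61000},
    {"employee_id": 3, "department": "HR",    "salary": None},
]
required = ["employee_id", "department", "salary"]

def completeness(records, required):
    """Fraction of required fields that are populated across all records."""
    total = len(records) * len(required)
    filled = sum(1 for r in records for f in required if r.get(f) is not None)
    return filled / total

def error_ratio(records, is_valid):
    """Fraction of records that fail a given validation rule."""
    errors = sum(1 for r in records if not is_valid(r))
    return errors / len(records)

print(f"completeness: {completeness(records, required):.1%}")  # 77.8%
print(f"error ratio: {error_ratio(records, lambda r: r['salary'] is not None):.1%}")  # 33.3%
```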
Once you have identified the key data quality metrics for evaluating your data assets, document and register them in a centralized place within the data catalog. This way, team members can actively manage and track them.
Best Practice 3: Establish roles and responsibilities
Managing a data catalog can be chaotic when roles and responsibilities are unclear. A robust data governance team consists of data admins, custodians, stewards, and users.
This team can be further broken down into multiple roles: lineage admin, lineage editor, and lineage producer. The scope of each role is based on how much access and information you want to share with certain users.
Here’s an overview of the roles and responsibilities in a data catalog:
| Perspective | Typical roles | Needs and responsibilities |
| --- | --- | --- |
| Producer | Data owners, data stewards, data engineers | Control over data production processes (data quality and availability); preparing and certifying data for sharing and consumption by others; knowing which owned data is being used, by whom, and for what purpose; full audit to support normative and regulatory compliance |
| Oversight | C-level, data office, audit, legal, compliance | Maximizing ROI for data initiatives; boosting productivity and efficiency; automating processes and tasks to reduce costs and mitigate operational risk; a unified, homogeneous view of the company’s data usage and consumption; normative and regulatory compliance; the ability to value and monetize data |
| Consumer | Data analysts, data engineers, data scientists | Deeper knowledge of available data, along with its context and meaning; better decisions driven by access to high-quality, well-managed data; agility in data consumption and exploitation, leading to self-service |
First, identify sponsors and key stakeholders. These are the most crucial team members who will essentially be data owners and contributors. Typically, a data catalog sponsor is someone in an executive role who believes in and supports your data catalog, ensures it’s up-to-date, and strives to improve it continuously.
Then, identify data stewards (technical, business, and lead data stewards) who will create data policies and maintain data quality. Next, identify subject matter experts (SMEs) who have strong knowledge of your organization’s data and can connect the dots. They may also edit, review, and approve data before it’s finally uploaded to the data catalog.
Best Practice 4: Define taxonomies and relate business assets to each other
Taxonomies for data assets make data management easier. Classifying assets by attributes like applicable regulation, information sensitivity, and semantics and ontology gives users guiding principles for data usage, showing them exactly what data to use and how and where to use it.
Example data asset taxonomy:
Asset Name: Sales Data
Category: Sales
Data Type: Structured Data
Description: Data related to sales transactions, including purchase history and order details.
Source System: Sales Database
Data Fields: Order ID, Product ID, Purchase Date, Quantity, Unit Price, Total Price, Customer Name
Data Sensitivity: Low
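Stored as structured data, entries like this become filterable. Here’s a minimal Python sketch with hypothetical assets, assuming a simple Low/Medium/High sensitivity scale:

```python
# Hypothetical taxonomy entries; real catalogs would carry many more fields.
assets = [
    {"name": "Sales Data", "category": "Sales", "sensitivity": "Low"},
    {"name": "Employee Information", "category": "HR", "sensitivity": "High"},
    {"name": "Web Analytics", "category": "Marketing", "sensitivity": "Low"},
]

def find_assets(assets, category=None, max_sensitivity=None):
    """Filter catalog entries by category and by a ceiling on sensitivity."""
    levels = {"Low": 0, "Medium": 1, "High": 2}
    results = []
    for asset in assets:
        if category and asset["category"] != category:
            continue
        if max_sensitivity and levels[asset["sensitivity"]] > levels[max_sensitivity]:
            continue
        results.append(asset["name"])
    return results

print(find_assets(assets, max_sensitivity="Low"))
# ['Sales Data', 'Web Analytics']
```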
It’s also important to identify frequently accessed and business-critical data. Critical datasets are those that could significantly impact business continuity, reputation, or financial stability. Managing these important datasets ensures teams have better access to good-quality data. Data stewards and data engineers can generate patterns and logical groupings of data to further facilitate ease of data usage.
Additionally, SMEs can map relationships between different data assets by creating knowledge graphs and semantic maps to increase their use cases. For example, teams might discover cross-sell and upsell opportunities that improve the ROI of existing campaigns and generate new revenue streams based on the performance of sales and marketing assets.
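One lightweight way to sketch those relationships is an adjacency map between assets. The edges and asset names below are hypothetical:

```python
# Hypothetical asset relationships, as an SME might record them:
# each edge names how two catalog entries connect.
relationships = {
    "Sales Data": [("Customer Profiles", "joins on Customer ID"),
                   ("Marketing Campaigns", "attributed via Campaign ID")],
    "Customer Profiles": [("Support Tickets", "joins on Customer ID")],
}

def related_assets(start, relationships):
    """Walk the relationship map to find every asset reachable from `start`."""
    seen, stack = set(), [start]
    while stack:
        asset = stack.pop()
        for neighbor, _ in relationships.get(asset, []):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen

print(related_assets("Sales Data", relationships))
# {'Customer Profiles', 'Marketing Campaigns', 'Support Tickets'} (set order may vary)
```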
Best Practice 5: Plan and manage data flows
Data catalogs can track data lineage to illustrate where data is coming from and going to, and how it may have changed over time. This includes transformations like aggregation, filtering, or merging and when they might have occurred.
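Conceptually, a lineage entry just records each hop: where data came from, where it went, and how it changed. Here’s a minimal sketch; the step names and datasets are invented:

```python
from datetime import datetime, timezone

lineage = []

def record_step(source, target, transformation):
    """Append one lineage hop: origin, destination, and how the data changed."""
    lineage.append({
        "source": source,
        "target": target,
        "transformation": transformation,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })

record_step("raw_orders", "orders_clean", "filtered out test transactions")
record_step("orders_clean", "daily_revenue", "aggregated by day")

def upstream(dataset):
    """Trace the immediate upstream sources of a dataset."""
    return [step["source"] for step in lineage if step["target"] == dataset]

print(upstream("daily_revenue"))  # ['orders_clean']
```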
But here’s the catch: data lineage tools map out the flow of data within a set of domains or a specific domain, but they don’t enable users to identify data flows between disparate datasets.
To see data movement within your organization with more precision and granularity, use automated data ingestion, validation, and profiling tools. These tools have comprehensive and sophisticated capabilities that support a variety of data sources, making it easier for organizations of all sizes to query, validate, and profile ingested records.
Data governance managers, for example, can use lineage tools to track data accuracy, compliance, and security as data moves and changes throughout their ecosystem. They can lean on that data to make informed decisions about who is accessing and using which data, why, and whether that’s appropriate. Or, they could look for opportunities to improve data usage policies.
As these tools enrich and store technical and business metadata in the data catalog, users can access data assets in a logically ordered and secure environment.
Tips to ensure security and compliance
Insecure and ungoverned sensitive data may lead to compliance and regulatory issues. So it’s undeniably important for organizations to look holistically at how they store, share, and use data.
Employ security best practices
Use identity access management (IAM) and role-based user permissions to ensure the data in your data catalog is only accessible to authorized users.
This includes granting the minimum possible permissions to each set of IAM users to reduce the risk of unintended or malicious data deletion. To protect data sources from unauthorized modifications and security breaches, give the majority of users read-only access.
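Here’s a minimal sketch of that least-privilege idea. The role names and permission model are hypothetical; a real deployment would express this through its IAM provider’s policies:

```python
# Hypothetical role -> allowed actions, defaulting most users to read-only.
ROLE_PERMISSIONS = {
    "catalog_admin": {"read", "write", "delete"},
    "data_steward": {"read", "write"},
    "consumer": {"read"},  # the majority of users
}

def authorize(role, action):
    """Return True only if the role explicitly grants the action."""
    return action in ROLE_PERMISSIONS.get(role, set())

assert authorize("consumer", "read")
assert not authorize("consumer", "delete")  # least privilege: no destructive actions
```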
As your data catalog evolves, you can set up more security policies that ensure only authorized users obtain specific privileges.
Improve regulatory compliance
Data catalogs with robust compliance tools in place can automate data classification and profiling as well as enforce data protection rules to anonymize and restrict sensitive information access.
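As an illustrative sketch, such a rule might pseudonymize direct identifiers and redact sensitive fields before a record is exposed through the catalog. The field names and masking choices below are assumptions, not a specific tool’s behavior:

```python
import hashlib

SENSITIVE_FIELDS = {"salary", "date_of_birth"}    # redact entirely
IDENTIFIER_FIELDS = {"employee_id", "full_name"}  # pseudonymize via hashing

def anonymize(record):
    """Return a copy of the record that is safe for broad catalog access."""
    out = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            out[field] = "[REDACTED]"
        elif field in IDENTIFIER_FIELDS:
            out[field] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            out[field] = value
    return out

print(anonymize({"employee_id": 42, "full_name": "Toby Flenderson",
                 "department": "HR", "salary": 55000}))
```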
Failure to comply with industry and national compliance laws can lead to massive penalties. For example, violations against the California Consumer Privacy Act (CCPA) can lead to a civil penalty of anywhere from $2,500 to $7,500 per violation.
Another significant compliance law is the General Data Protection Regulation (GDPR), a European Union (EU) law that sets guidelines for businesses on how to collect, process, and use the personal data of users.
To improve regulatory compliance for your data catalog, consider providing a data risk control center for users and their teams so they can easily view up-to-date information from the privacy catalog. Ask your legal team to create data contracts that specifically lay out the collection, profiling, and usage of data for users.
Real-world examples of internal marketplace data catalogs
Now that you know what a data catalog is and the best practices for building one, let’s dive into some of the tools you can use to build your own.
1. Collibra
Collibra is a popular data cataloging tool that helps teams discover and catalog large volumes of data from multiple sources. It allows users to annotate databases with context-rich business metadata and technical metadata for better discovery and collaboration between different teams.
The data platform uses machine learning to automate workflows, which helps make sense of complex data sets faster.
However, given its advanced and rich feature set, Collibra comes with a steep learning curve, so you may want to hire professional data engineers who can help set up your data catalog.
2. Alation
Alation is another well-known data cataloging tool that leverages machine learning to drive pattern recognition to uncover insights about data usage, including user breakdowns, data popularity, and recommendations.
It also uses natural language search to improve users’ functional data literacy. Additionally, users can define a catalog of reusable terms, making it easier and faster to search for relevant information when needed. Users can also group similar types of data together and tag them with relevant keywords using the platform’s easy-to-use taxonomy structure.
Overall, Alation is a powerful data cataloging tool, but like any other platform out there, requires some time and practice to get familiar with its user interface.
Go beyond your data catalog
Building a modern data catalog goes beyond capturing data assets from multiple sources and managing metadata. It requires deep investment in automation and discovery techniques such as natural language processing, behavioral intelligence, and machine learning. With the right processes, people, and tools in place, you can build a data catalog that organizes and serves data in real time while eliminating manual dependencies.
Data catalogs provide a clear picture of an organization’s data and help expand its use cases and opportunities so more people within the organization can use it.
Ready to build a flourishing internal data-sharing system and culture? Download our guide to unleashing the benefits of internal data sharing in your organization.
Unlock Your Data's Potential with Revelate
Revelate provides a suite of capabilities for data sharing and data commercialization for our customers to fully realize the value of their data. Harness the power of your data today!