When US President Dwight Eisenhower spoke to the US National Security Council in 1953, he famously said, “Plans are useless, but planning is essential.” It’s unlikely he was talking about data catalogs and metadata management, but for the next few minutes, let’s just pretend that’s exactly what he was doing.
Organizations implement data catalogs to maximize the value of their data and to extract meaningful insights. This doesn’t happen by accident, nor is it a matter of “set it and forget it.” A successful data catalog initiative requires planning, acting, assessing, and reacting. As Eisenhower implied, plans aren’t perfect, but one of the ways to aim for perfection with a data catalog is by effectively managing your metadata.
No matter how much data you catalog, your efforts are likely to be practically useless without strong metadata management. Doing this well requires manual work, automation, and a few best practices. Metadata is what makes cataloged data discoverable, filterable, and usable, so it’s important to prioritize it as early as possible in your cataloging efforts.
Benefits of metadata management
Metadata makes your data useful. Without it, you’re leaving all the hard work to your data consumers. They would otherwise have no idea what data is available, who owns it, how to access it, or how to use it. When your metadata is managed, governed, and enforced, your company can experience these benefits:
- Improved data discovery: Metadata helps users find the data they need, including its location, format, and lineage. This ensures that the right data is used for the right purpose.
- Increased data understanding: Metadata provides information about the data’s meaning, quality, and limitations. This helps users make better decisions about using the data in a compliant and secure manner.
- Enhanced data governance: Metadata improves data governance by surfacing ownership, access permissions, and usage policies. This increases protection of an organization’s data assets.
- Increased data reuse: Metadata makes it easier for users to find and understand existing data, so they can reuse it instead of recreating it. This improves the likelihood and efficiency of data-driven decision-making.
- Improved data quality: Metadata improves data quality by surfacing provenance, lineage, and quality metrics. This helps users identify and address data quality issues, making the data more reliable and trustworthy.
When data is discoverable and its terms of usage are clear, consumers can get more out of the data and generate meaningful insights. For companies looking to increase data maturity and become more data-driven, metadata management is the way.
Establishing a metadata management strategy
Every business collects data for different purposes, which means that no two metadata management strategies will look the same. However, there are some fundamental pieces of establishing a metadata management strategy:
- Data lineage: The history of a data set, from its source to its current location (or destination). When consumers know how the data was created, processed, and changed over time, they have cause to trust it and use it with confidence.
- Data quality: The degree to which data is accurate, complete, and consistent. When consumers know that data is high quality, they will be more likely to use it.
- Data ownership: Who owns a data set and who has permission to access it. Clear ownership ensures that the data is used in a compliant and secure manner.
- Data usage: Information about how a data set is being used and potentially being misused.
- Data format: The structure of a data set, which helps users to understand how to interpret and use it in different applications.
- Data tags: Keywords or phrases used to describe a data set. Tags are a quick and easy way for consumers to understand the data at a glance.
- Data source: Where the data originated, whether an external platform (e.g. Facebook, HubSpot) or an internal system such as a database, file, or sensor.
- Data schema: The structure of the data, such as the field names and data types.
- Data retention policy: How long the data should be retained.
- Data sensitivity: The level of sensitivity of the data, such as public, private, or confidential.
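The elements listed above can be pictured as one record per data set. The following is a minimal sketch, not a prescribed model: the `DatasetMetadata` class, its field names, and the sample values are all hypothetical and would need to be adapted to your catalog.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    PRIVATE = "private"
    CONFIDENTIAL = "confidential"


@dataclass
class DatasetMetadata:
    """Hypothetical catalog record covering the elements listed above."""
    name: str
    source: str                 # where the data originated
    owner: str                  # accountable person or team
    schema: dict[str, str]      # field name -> data type
    data_format: str            # e.g. "csv" or "parquet"
    sensitivity: Sensitivity
    retention_days: int         # retention policy, expressed in days
    created_on: date
    tags: list[str] = field(default_factory=list)
    lineage: list[str] = field(default_factory=list)  # upstream data set names

    def retention_expires_on(self) -> date:
        """Date after which the retention policy no longer requires keeping the data."""
        return self.created_on + timedelta(days=self.retention_days)


# Illustrative values only.
record = DatasetMetadata(
    name="web_signups",
    source="internal CRM export",
    owner="growth-team",
    schema={"email": "string", "signup_date": "date"},
    data_format="csv",
    sensitivity=Sensitivity.CONFIDENTIAL,
    retention_days=365,
    created_on=date(2024, 1, 1),
    tags=["marketing", "pii"],
    lineage=["crm_raw_contacts"],
)
print(record.retention_expires_on())
```

Keeping every element in one typed record makes it easier to spot which attributes are missing for a given data set.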
With these foundational pieces of metadata in mind, define clear objectives, assess your metadata requirements, and create a governance framework. In other words, understand what you want your metadata to do for you, know your success criteria in making that possible, and design a system that ensures you stay on track. Set goals and KPIs so you know how to measure your future success.
It’s particularly important to consider how automation plays a role in all of this. No matter how large or small your organization, you want to minimize the chance of human error, backlogged work, or human dependencies. The more you can automate, the greater your chances for long-term data catalog success. You will still need humans to oversee and manage the framework and any automation, so make sure you have a cross-functional team identified for long-term maintenance.
Types of metadata
There are no limits on how much metadata you can collect and track, but there are a few categories of metadata that are important to know and understand:
- Descriptive metadata: Information about the data (e.g. title, author, date created) helps consumers find and understand the data.
- Structural metadata: Information about the structure of the data (e.g. schema, field names) helps consumers understand how the data is organized.
- Administrative metadata: Information about the management of the data (e.g. ownership, access permissions, usage policies) ensures compliance and security.
- Relationship metadata: Information about how data sets are related and for what purpose (e.g. customer purchase history and its relationship to demographic data).
- Provenance metadata: Information about the origin and history of the data (e.g. user survey data and who administered the survey) identifies potential quality issues.
- Process metadata: Information about the processes used to create and manage the data (e.g. whether data was created or processed by an algorithm and how quality was ensured).
Every data catalog and provider is likely to have their own set of metadata classifications. Be sure to identify what metadata is important for your business and what your competitors or industry peers may be doing to manage metadata.
Several ISO standards exist for metadata. These include:
- ISO/IEC 11179: This standard covers metadata registries. It defines the concepts, terminology, and components of metadata, along with guidelines for naming, defining, and registering data elements. A metadata registry is a repository that stores and manages metadata about data sets.
- ISO 25964: This standard provides guidelines for developing thesauri and making them interoperable with other vocabularies. It is relevant when you rely on controlled vocabularies for tags and classifications.
- ISO 2709: This standard defines a format for the exchange of records between different systems. It is best known as the basis for exchanging bibliographic metadata, such as MARC records.
- ISO 19115: This standard provides a set of guidelines for the creation and use of metadata for geographic information. It is a widely used standard for the exchange of metadata about geographic data sets.
The specific standards you use will vary depending on your organization’s needs and the specific data assets that are being managed. You might also consider adopting a shared metadata vocabulary or ontology to help data consumers understand the metadata and find what they need. Popular options include the Dublin Core terms maintained by the Dublin Core Metadata Initiative (DCMI) and vocabularies built on the Resource Description Framework (RDF).
Metadata capture and documentation
Every major data catalog platform available today offers metadata capture and documentation. Metadata is usually captured by one of these methods:
- Manual entry: A user manually enters metadata into the catalog. This is the most common method for capturing metadata, but it can be time-consuming and error-prone.
- Automatic extraction: Tools such as data profilers and metadata extractors pull metadata directly from data sources.
- Hybrid approach: A hybrid approach involves manually entering some of the metadata and automatically extracting other metadata. This can be a good way to capture the most important metadata manually, while also capturing the less important metadata automatically.
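To make automatic extraction concrete, here is a minimal sketch that derives structural metadata (field names, inferred types, row count) from CSV text using only the Python standard library. The function names `infer_type` and `extract_csv_metadata` and the sample data are hypothetical, and the type inference is deliberately simplistic; real profiling tools handle many more cases.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return "unknown"
    for type_name, caster in (("integer", int), ("decimal", float)):
        try:
            for v in non_empty:
                caster(v)
        except ValueError:
            continue  # at least one value did not parse; try the next type
        return type_name
    return "string"


def extract_csv_metadata(text):
    """Return row count and an inferred schema for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    fields = reader.fieldnames or []
    return {
        "row_count": len(rows),
        "schema": {f: infer_type([r[f] for r in rows]) for f in fields},
    }


# Illustrative sample, not real data.
sample = "id,price,city\n1,9.99,Oslo\n2,12.50,\n"
metadata = extract_csv_metadata(sample)
print(metadata)
```

In a hybrid approach, output like this would be stored automatically while a person adds the description, owner, and tags by hand.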
Metadata can be stored in a variety of places and in any number of formats, and every data source will have its own format for metadata. Common storage locations include:
- Data files: Metadata is stored in the data files themselves, often in simple data sets like CSV files.
- Data dictionaries: A centralized repository for storing information about data sets, such as their structure, format, and meaning.
- Metadata registries: A more specialized type of repository for storing metadata about large datasets, such as those used in data warehouses and data lakes.
- Data catalogs: Metadata is often stored across multiple data catalogs, which is common in large organizations where lines of business are logically separated.
Once you know where your metadata is stored, you need to standardize how it is captured from each source. Then you need to document the metadata attributes, which will differ for every data source and type. For example, you’ll want to know the data type of each attribute (e.g. integer, string, decimal) and standardize around that for all metadata related to one source, or many related sources.
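One way to standardize data types across sources is to map each source-specific type name to a small canonical set. The sketch below assumes a hypothetical `CANONICAL_TYPES` mapping and a `standardize_schema` helper; the type names in the mapping are examples, not a complete list for any particular database.

```python
# Hypothetical mapping from source-specific type names to canonical catalog types.
CANONICAL_TYPES = {
    "int": "integer", "int64": "integer", "bigint": "integer",
    "varchar": "string", "text": "string", "str": "string",
    "float": "decimal", "double": "decimal", "numeric": "decimal",
}


def standardize_schema(schema):
    """Map each field's source type to a canonical type.

    Returns the standardized schema plus the fields whose type could not be
    mapped, so a person can review them instead of the gap passing silently.
    """
    standardized, unmapped = {}, []
    for field_name, source_type in schema.items():
        # Drop length/precision suffixes such as "(255)" before the lookup.
        key = source_type.split("(")[0].strip().lower()
        if key in CANONICAL_TYPES:
            standardized[field_name] = CANONICAL_TYPES[key]
        else:
            standardized[field_name] = "unknown"
            unmapped.append(field_name)
    return standardized, unmapped


standardized, unmapped = standardize_schema(
    {"id": "BIGINT", "name": "VARCHAR(255)", "geo": "geometry"}
)
print(standardized, unmapped)
```

Returning the unmapped fields separately keeps the automated step honest about what it could not classify.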
With that in place, be sure to document everything you know about the metadata to ensure the data consumer knows how to interpret the dataset. For example, the word “page” may have various definitions as it relates to a dataset. It could be a page in a book or a page in a document or a page in a spreadsheet. Your documentation should provide clarity about as many aspects of metadata as possible. Some metadata is created by an internal process, other metadata is created by a data source, and the data consumer should know the difference.
Metadata capture is a process
Capturing (and documenting) metadata is a process. That process should be well understood, documented, and repeatable. As such, it’s likely to lead to metadata capture policies, procedures, and standards.
Wherever you have an opinion on, or a particular approach to, how metadata is captured, it needs to be a process. It can’t be “John in the data department captures all the consumer data from social media and he’s got his way of doing things. Mary in the finance department captures all the expense data and she’s got her own way of doing things.”
Make the process work for you and your organization. Otherwise, you’ll end up in a mess.
Metadata quality assurance
There are a million ways your metadata capture, documentation, and processes can go wrong. Even if nothing is overtly wrong, you may have a script that inadvertently mis-codes a metadata attribute. (Remember Eisenhower! Plans are useless!)
As such, you’ll want to regularly run QA on your metadata processes. If you use automation to manage your metadata, you’ll want to run QA before, during, and after you roll that automation out. If it’s manual, you’ll want enough safeguards in place to catch mistakes before they spread. Here are some recommendations:
- Define clear metadata quality standards: Define the specific attributes that will be tracked, as well as the quality requirements for each attribute. For example, “All dates should be captured as YYYY-MM-DD.”
- Implement automated quality checks: Once the quality standards have been defined, automated quality checks can be implemented. Use tools to scan for errors, such as missing values, invalid data types, and inconsistent values.
- Manually review metadata: Just because you’ve automated doesn’t mean your automation is doing what you expect. You can either write automation to check your automation, or have a team of data stewards or a data quality analyst review it. Identify errors that were not caught by the automated checks and check for overall metadata consistency.
- Continuously monitor metadata quality: Metadata quality assurance is an ongoing process. Be sure everything remains accurate and up-to-date by regularly running automated quality checks and manually reviewing the metadata.
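The recommendations above can be sketched as a small automated check. This example assumes metadata records are plain dictionaries and that `name`, `owner`, and `created_on` are required; those field names, `REQUIRED_FIELDS`, and `check_record` are hypothetical. It enforces the “YYYY-MM-DD” date rule from the first recommendation and reports missing values.

```python
import re
from datetime import datetime

REQUIRED_FIELDS = ("name", "owner", "created_on")  # assumed quality standard
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_valid_date(value):
    """True only for real calendar dates written exactly as YYYY-MM-DD."""
    if not isinstance(value, str) or not DATE_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False  # right shape, but not a real date (e.g. February 30)
    return True


def check_record(record):
    """Return a list of human-readable quality issues for one metadata record."""
    issues = []
    for field_name in REQUIRED_FIELDS:
        if not record.get(field_name):
            issues.append(f"missing value: {field_name}")
    created = record.get("created_on")
    if created and not is_valid_date(created):
        issues.append("invalid date: created_on")
    return issues


# Illustrative records.
print(check_record({"name": "orders", "owner": "", "created_on": "2024-02-28"}))
print(check_record({"name": "orders", "owner": "ops", "created_on": "02/28/2024"}))
```

A check like this can run on a schedule, with any non-empty result routed to a data steward for manual review.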
Successful metadata QA also includes employing a data profiling and cleansing strategy. Metadata profiling is the process of collecting information about the metadata, such as its data types, formats, and usage. Metadata cleansing is the process of cleaning up the metadata, such as by removing errors, duplicates, and inconsistencies.
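As a small illustration of profiling and cleansing, the sketch below normalizes tags (trimming whitespace, lowercasing, removing duplicates) and then counts how often each cleaned tag appears. It assumes records are dictionaries with an optional `tags` list; `cleanse_tags` and `profile_tags` are hypothetical helper names.

```python
from collections import Counter


def cleanse_tags(tags):
    """Normalize whitespace and case, drop empties and duplicates, keep order."""
    seen, cleaned = set(), []
    for tag in tags:
        normalized = " ".join(tag.split()).lower()
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


def profile_tags(records):
    """Count how many records carry each cleaned tag."""
    return Counter(
        tag for record in records for tag in cleanse_tags(record.get("tags", []))
    )


# Illustrative records.
records = [{"tags": ["PII", "sales"]}, {"tags": ["pii ", "PII"]}, {}]
print(profile_tags(records))
```

The profile makes inconsistencies visible (for example, a tag used only once may be a typo), and the cleansing step removes the ones that are safe to fix automatically.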
Metadata search and discovery
A number of methods exist for searching metadata and discovering data. The focus of this article is on data catalogs, which usually offer metadata search and data discovery features. Sometimes data catalogs are referred to as “metadata repositories,” which can include products like Alation, Collibra, Atlan, and Dremio. Other metadata search and discovery tools can include Google Search, Microsoft Search, and Solr, which can be deployed within a company’s data center (as appliances, VMs, or containers) or consumed as SaaS products.
When metadata is loaded into a data catalog, it gets indexed for searchability. The index can search for metadata by keyword, tag, or other criteria. Metadata indexing is a key part of metadata management and dramatically improves the speed of a search.
Metadata can be indexed by keyword, tag, or ontology. They each have pros and cons, including accuracy, processing time, and overall complexity. The specific indexing method you choose will likely depend on your specific needs and requirements. However, all of the indexing methods listed above can be used to improve the usability and findability of metadata.
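Keyword and tag indexing can be illustrated with a simple inverted index, which maps each token to the data sets whose metadata contains it. This is a sketch of the general technique, not how any particular catalog product implements search; the `build_index` and `search` functions and the sample catalog are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(catalog):
    """Map each token from names, descriptions, and tags to data set names."""
    index = defaultdict(set)
    for name, meta in catalog.items():
        text = " ".join([name, meta.get("description", ""), *meta.get("tags", [])])
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index, query):
    """Return data sets whose metadata contains every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(token, set()) for token in tokens))


# Illustrative catalog entries.
catalog = {
    "orders_2024": {"description": "Customer orders by region", "tags": ["sales"]},
    "customers": {"description": "Customer demographic data", "tags": ["pii", "sales"]},
}
index = build_index(catalog)
print(search(index, "customer sales"))
print(search(index, "pii"))
```

Because lookups hit the prebuilt index rather than scanning every record, search stays fast as the catalog grows, at the cost of rebuilding or updating the index when metadata changes.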
Metadata discovery is the broader process of finding metadata, while search is the narrower process of finding data using the metadata itself. Discovery can involve looking through metadata repositories, data dictionaries, and other sources of metadata, while search typically involves using a metadata search engine. Typically, discovery is a manual process of clicking and digging in whereas search is more of an automated process that minimizes work for the data consumer.
Metadata security and privacy
Metadata security and privacy are important considerations for organizations that collect and manage metadata. Metadata can contain sensitive information, such as the identity of individuals, the location of assets, or the details of transactions. If metadata is not properly secured, it could be accessed by unauthorized individuals or organizations, which could lead to privacy breaches or other security incidents.
Access to metadata should be governed by automated controls, like role-based access controls (RBAC). If you’re using a data catalog, make sure that it’s compatible with your organization’s security technologies and compliant with your overall privacy and governance requirements.
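A role-based control over metadata can be sketched as a filter that returns only the fields a role is allowed to see. The roles, field names, and the `ROLE_VISIBLE_FIELDS` mapping below are assumptions for illustration; in practice these rules would come from your identity provider or the catalog’s own permission model.

```python
# Hypothetical role-to-field permissions.
ROLE_VISIBLE_FIELDS = {
    "analyst": {"name", "description", "tags", "schema"},
    "steward": {"name", "description", "tags", "schema", "owner", "sensitivity"},
}


def visible_metadata(record, role):
    """Return only the metadata fields the given role may see.

    Unknown roles get an empty result (deny by default).
    """
    allowed = ROLE_VISIBLE_FIELDS.get(role, set())
    return {key: value for key, value in record.items() if key in allowed}


# Illustrative record.
record = {
    "name": "payroll_2024",
    "description": "Monthly payroll runs",
    "owner": "hr-team",
    "sensitivity": "confidential",
}
print(visible_metadata(record, "analyst"))
print(visible_metadata(record, "contractor"))
```

Denying by default means a newly added role or field stays hidden until someone explicitly grants access.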
Because metadata often contains sensitive information, you need to maintain its privacy and confidentiality. Identify what may be sensitive, protect it, monitor access to it, and educate employees so they know what is sensitive and how to protect it. Encrypt sensitive metadata at rest and in transit, and keep the security practices on your metadata repository current.
Metadata auditing and compliance
Treat your metadata as you would any other sensitive data: make sure you monitor who’s looking at it and that their usage is appropriate. Set up automation to watch access logs and raise alerts on any anomalous behavior. For example, if you see someone in customer support trying to access internal HR data, that is probably an inappropriate use of access.
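The customer support example can be expressed as a simple rule over access log events. The sketch below assumes each event is a dictionary with `user`, `department`, and `data_domain` keys and that `ALLOWED_DOMAINS` lists what each department may access; all of these names are hypothetical, and real anomaly detection usually combines rules like this with baselines of normal behavior.

```python
# Hypothetical policy: which data domains each department may access.
ALLOWED_DOMAINS = {
    "customer_support": {"tickets", "customers"},
    "hr": {"hr"},
}


def find_anomalies(access_log):
    """Return events where a department touched a domain outside its allow list.

    Departments missing from the policy are treated as having no access.
    """
    anomalies = []
    for event in access_log:
        allowed = ALLOWED_DOMAINS.get(event["department"], set())
        if event["data_domain"] not in allowed:
            anomalies.append(event)
    return anomalies


# Illustrative log entries.
access_log = [
    {"user": "ana", "department": "customer_support", "data_domain": "tickets"},
    {"user": "ben", "department": "customer_support", "data_domain": "hr"},
    {"user": "cy", "department": "hr", "data_domain": "hr"},
]
for event in find_anomalies(access_log):
    print(f"ALERT: {event['user']} ({event['department']}) accessed {event['data_domain']}")
```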
Make sure you understand the data regulations that apply to your company. Regulations can vary based on industry, geography, association with certain organizations (e.g. governments, healthcare companies), and what type of data you’re collecting. Metadata is no exception to the rules.
As with any other compliance process, be sure to do internal audits, pay for third-party audits, and prove to yourself that you are a good steward of your private information. Set up audit objectives, find the right methods, and document your findings.
Metadata training and awareness
Metadata is data, but it doesn’t necessarily sound like it’s the same kind of data that’s subject to rules and regulations. Don’t be fooled: it definitely is subject to compliance and regulation.
One of the best ways to become more data mature is to train all of your employees about how data is used. Whether it’s a question of access, appropriate use, or general checks and balances to ensure other employees are using data appropriately, everyone needs to be clear on how to use data for their role.
It’s important that all your employees understand nearly everything we’ve shared in this article:
- What is metadata and why is it important?
- What are the different types of metadata?
- What are the principles of good and bad metadata?
- What are the best practices for using metadata?
- What are the legal and compliance requirements for metadata?
When people have the information they need to be good stewards, they can live up to expectations and your business can thrive on data.
Use metadata management to improve your data catalog
While Eisenhower may not have been referring specifically to data catalogs and metadata management all those years ago, his words hold true in this context. Implementing a data catalog tool requires careful planning and proactive management of metadata to maximize the value of data and extract meaningful insights.
Metadata management plays a critical role in making data useful and accessible. It enables improved data discovery, increased data understanding, enhanced data governance, increased data reuse, and improved data quality. Without strong metadata management, cataloging efforts are likely to be practically useless. Metadata provides the necessary context and information for data consumers to know what data is available, how to access it, and how to use it effectively.
Establishing a metadata management strategy involves considering key elements such as data lineage, data quality, and data usage. By defining clear objectives, assessing metadata requirements, and creating a governance framework, organizations can ensure that metadata is captured, documented, and maintained effectively.