Companies of all sizes are getting absolutely crushed under the weight of their data. There’s too much infrastructure, there are too many first- and third-party data sources, and managing it is practically impossible. Perhaps most importantly, it’s incredibly expensive to maintain. Unsurprisingly, there’s been an explosion of interest in data catalogs over the past five years.
The bigger the company, the bigger its data ecosystem. Data teams are ripping out whatever hair they have left trying to centralize everything in a warehouse, contain costs, and respond to the growing pile of data request tickets in JIRA. They need a solution—fast.
The data team’s first instinct is to organize and make sense of everything. Data catalogs are the perfect solution—except that’s more cost and more to manage. Meanwhile, the executive team is asking, “Wait, we need to spend more on storing and managing our data? Are these costs just going to grow forever?”
The CFO wants to reduce costs and leverage data for business value. The data team wants to streamline and organize data and operations. These goals pull against each other, but cataloging and productization together can reconcile them.
What is a data catalog?
Data catalogs are software tools that allow companies to create a comprehensive inventory of all their data assets, including the location, format, relationships to other data, and other relevant details. The goal of a catalog is to provide a clear understanding of what data exists and where it is stored. This is why every data catalog website has a picture of a library somewhere. Like libraries, data catalogs make it easy to look for information and find it.
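To make the library analogy concrete, here is a minimal sketch of what a catalog inventory might look like in Python. The `CatalogEntry` fields, the `search` helper, and the sample asset names and locations are all assumptions made up for illustration; real catalog tools have their own, much richer models.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset in a hypothetical catalog inventory."""
    name: str
    location: str      # where the asset is stored
    data_format: str   # e.g. "table", "csv", "pdf"
    related_to: list[str] = field(default_factory=list)  # names of related assets
    description: str = ""


def search(entries: list[CatalogEntry], keyword: str) -> list[CatalogEntry]:
    """Return entries whose name or description mentions the keyword."""
    needle = keyword.lower()
    return [
        e for e in entries
        if needle in e.name.lower() or needle in e.description.lower()
    ]


# Illustrative entries; the names and locations are made up.
inventory = [
    CatalogEntry("orders", "warehouse.sales.orders", "table",
                 related_to=["customers"], description="One row per order"),
    CatalogEntry("customers", "warehouse.sales.customers", "table",
                 description="Customer master data"),
    CatalogEntry("web_logs", "s3://example-bucket/logs/", "csv",
                 description="Raw web server logs"),
]

for entry in search(inventory, "order"):
    print(entry.name, "->", entry.location)
```

Even in this toy form, the point of a catalog shows through: a consumer asks a question by keyword and gets back where the data lives, without filing a ticket.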
The concerns of a growing data ecosystem are less to do with having another data source and more to do with the increased volume of data. Not all data sources are created equal. Some may have simple datasets with one or two small tables that change every few minutes. Others may have dozens of large tables that change every few seconds. Either way, it’s important that data engineers set up processes for onboarding new data sources and ensuring the data is managed from day one. Otherwise, they’re defeating the purpose of having a data catalog in the first place.
This brings us to one major shortcoming of data catalogs: neatly organized data doesn’t solve the problems of a growing data ecosystem. The way to make data useful is by building data products. Data catalogs are great for labeling data, understanding what goes where, and streamlining operations, but organization doesn’t beget ROI.
What is a data product?
A data product is a collection of one or more “digital assets” that provide business value. Digital assets can be any type of self-contained data set, like a PDF, CSV file, or an individual table in a database. Just like physical products, they need to have utility and be usable by a consumer. They can’t simply be a solution in search of a problem.
It’s essential that data products follow the principles of product management; otherwise, a data product has no reason to exist:
- They must solve customer problems in a cost-effective and meaningful way
- They have a life cycle, requiring care and maintenance based on customer feedback, market conditions, and the costs of manufacturing and storage
- They are subject to the competitive landscape
Data products are the outcome of the data productization process. Data productization transforms raw data into a consumable product or service. Like any product or service in the real world, the resulting data product/service should fulfill a purpose. The data product market supports various flavors and levels of sophistication (e.g., entry-level and professional variants).
A good data product serves several goals:
- Extracting insights from data
- Building models
- Creating visualizations and reports
Catalogs are “infrastructure out,” products are “value in”
Building a data catalog is all about understanding the data infrastructure and putting labels on what’s there. Productizing is all about starting with the idea that problems could be solved with data in the infrastructure. The two are not in opposition but do come from different directions. In other words: data catalogs start by looking at the infrastructure (“infrastructure out”), and data products begin by looking for value (“value in”).
How and why data catalogs are built
Data catalogs are an excellent way to establish and grow a data-driven organizational culture. When teams and employees know they have self-service access to their organization’s vast pool of data assets, they’re much more likely to leverage data to solve their problems. Life is also much better for both data consumers and producers when people aren’t required to submit and resolve never-ending streams of data access requests.
Catalogs have additional benefits, like metadata management, data lineage tracking, and specialized data tracking. For example, a machine learning data catalog employs automated data discovery and classification. This helps data scientists and ML developers identify features, relationships, and patterns in data to make better predictions and maintain data quality. Other specialized catalogs are designed to manage industry-specific data, like geospatial, financial, scientific, and healthcare data. They are purpose-built to serve visualization-, research-, and compliance-related use cases.
Data governance and protections are essential enterprise data catalog capabilities. Data catalog features like role-based access control (RBAC) and data masking ensure that sensitive data is not inappropriately shared. This is especially important for organizations that rely heavily on information from finance, marketing, and sales. Appropriately configured catalogs proactively address security concerns and protect organizations from expensive compliance violations.
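As a rough illustration of how role-based access and data masking can work together, consider the sketch below. The role names, the permission strings, the `SENSITIVE_COLUMNS` set, and the masking rule (keep only the last four characters) are all hypothetical; commercial catalogs implement these controls in their own ways.

```python
from __future__ import annotations

# Hypothetical role-to-permission mapping; real catalogs define their own model.
ROLE_PERMISSIONS = {
    "analyst": {"read_masked"},
    "finance_admin": {"read_masked", "read_sensitive"},
}

# Assumed list of columns that hold sensitive values.
SENSITIVE_COLUMNS = {"email", "card_number"}


def mask(value: str) -> str:
    """Hide all but the last four characters of a value."""
    if len(value) <= 4:
        return "*" * len(value)
    return "*" * (len(value) - 4) + value[-4:]


def read_row(row: dict[str, str], role: str) -> dict[str, str]:
    """Return the row as the given role is allowed to see it."""
    permissions = ROLE_PERMISSIONS.get(role, set())
    if "read_masked" not in permissions:
        raise PermissionError(f"role {role!r} may not read this dataset")
    if "read_sensitive" in permissions:
        return dict(row)
    # Roles without sensitive access see masked values in sensitive columns.
    return {
        col: mask(val) if col in SENSITIVE_COLUMNS else val
        for col, val in row.items()
    }


row = {"customer": "Ada", "email": "ada@example.com",
       "card_number": "4111111111111111"}
print(read_row(row, "analyst"))
print(read_row(row, "finance_admin"))
```

The design choice worth noticing is that unknown roles are denied by default, and masking is applied at read time, so the stored data never has to be duplicated per audience.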
The data catalog process involves:
- Identifying every data source and destination
  - Example sources: PDFs, CSVs, databases
  - Example destinations: Data warehouses, data lakes, lakehouses
- Documenting the findings
- Collecting metadata, like types of data, owners, and whether it’s sensitive
- Defining data quality and security standards
- Establishing a governance framework
- Implementing data cataloging software
- Maintaining and improving the data catalog
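The metadata collection step in the list above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular catalog product works: the `collect_metadata` function, the type-guessing rules, and the `SENSITIVE_HINTS` naming convention are assumptions chosen for the example.

```python
from __future__ import annotations

import csv
import io

# Assumed naming convention for spotting potentially sensitive columns.
SENSITIVE_HINTS = ("email", "ssn", "phone")


def infer_type(values: list[str]) -> str:
    """Guess a column type from its string values."""
    if not values:
        return "unknown"

    def all_parse(parser) -> bool:
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def collect_metadata(csv_text: str, owner: str) -> dict:
    """Build a small metadata record for one CSV data source."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = {}
    for name in reader.fieldnames or []:
        values = [r[name] for r in rows]
        columns[name] = {
            "type": infer_type(values),
            "sensitive": any(h in name.lower() for h in SENSITIVE_HINTS),
        }
    return {"owner": owner, "row_count": len(rows), "columns": columns}


# Made-up sample data for illustration.
sample = "id,email,amount\n1,ada@example.com,19.99\n2,bob@example.com,5\n"
print(collect_metadata(sample, owner="sales-team"))
```

In practice, name-based sensitivity detection is only a starting point; flagged columns would still need review by the data owner.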
Once the data catalog is available to its end users, they can more easily find and use the data they seek. Catalogs are excellent for making raw data safely available, especially for internal users, but they put the burden of utility on those users. Though data catalogs can be used to curate data for specialized and general purposes, they don’t provide this value out of the box.
Data catalog use cases are more exploratory and less defined than data product use cases because the intentions are different.
How and why data products are built
Data products start with utility and value. Part of that value proposition is usability, meaning the data consumer (whether a human or a machine) can easily apply the product’s underlying data to their business problem.
Data products can be downloadable and contain curated collections of data. They can be distributed as zip files or through access to a protected file storage service. Examples of downloadable products include historical stock market data or log files from a web server. Data products can also present themselves as services, like an API or an interactive website like Google Analytics or Salesforce Einstein AI.
Building a data product involves:
- Identifying a problem a data consumer faces
- Understanding the problem’s potential value and the potential product-market fit
- Determining which data is required to create the product (and identifying any data gaps to fill)
- Building the product
- Designing the packaging (including product metadata)
- Delivering it to a marketplace
- Iterating using market and customer insights
- Marketing the product effectively
- Iterating on this process (this could take months or even years)
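For the packaging step, one plausible approach to a downloadable product is a zip archive that carries the data alongside a small metadata manifest. The sketch below assumes a `manifest.json` layout and a `package_product` helper that are invented for this example; they are not a standard or any marketplace's required format.

```python
from __future__ import annotations

import io
import json
import zipfile


def package_product(name: str, version: str, description: str,
                    files: dict[str, str]) -> bytes:
    """Bundle data files and a manifest.json into an in-memory zip archive."""
    # Hypothetical product metadata; a real marketplace may require more fields.
    manifest = {
        "name": name,
        "version": version,
        "description": description,
        "files": sorted(files),
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        for filename, content in files.items():
            archive.writestr(filename, content)
    return buffer.getvalue()


# Made-up product contents for illustration.
bundle = package_product(
    name="daily-prices",
    version="1.0.0",
    description="Illustrative historical price data",
    files={"prices.csv": "date,close\n2024-01-02,101.5\n"},
)

# Read the archive back to confirm what a consumer would receive.
with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
    print(archive.namelist())
```

Versioning the manifest is what makes the later iteration steps manageable: each revision of the product can ship as a new, identifiable release.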
Data products are the best way for organizations to build ROI on their data. For internal data sharing use cases, employees and partners can explore data to find meaningful and unique insights. These can be especially helpful for research and development, product management, financial controllers, and security teams.
For external data commercialization use cases, entirely new lines of business can be built on product-generated revenue. CFOs, in particular, love new sources of revenue. Commercial data products drive greater value from data and ensure data initiatives are aligned with the overall business strategy.
Data cataloging and data productization go hand-in-hand
Cataloging and productizing data are complementary but distinct processes. One does not beget the other, though data catalogs do provide a solid foundation for data products. Many organizations invest in data catalogs hoping for eventual productization, but aren’t prepared for the challenge. It takes a lot of data maturity to do either.
It’s a bad idea to build a catalog and ask, “What can we build with all this?” At the same time, it’s no good to design a data product and ask, “Where the heck do we find the data for this?” Catalogs provide a clear path toward productization, whereas productizing clarifies the need for a good catalog.
Both cataloging and data productization allow organizations to:
- Clean, refine, and process data
- Build models and algorithms to analyze it
- Create visualizations and reports
- Deliver insights to internal stakeholders
- Deliver value to external consumers
When catalogs and data products are developed together, companies can reach ROI more quickly. The catalog makes it easier for data product teams to discover and identify meaningful data that can go into a product. The data product generates revenue (or some other form of value) to offset the climbing costs of data infrastructure management. The data consumers are happy because they get what they need and can request that high-value products get made.
Data catalogs and data products are symbiotic and synergistic. A catalog and a suite of products create the 1+1=3 effect for an organization, especially because the combination steers the operating culture toward being more data-driven.
Catalogs or products—which is better?
Getting to successful data productization requires deep investments in people, processes, and technology. On the people side, productization requires—and begets—a data-driven organizational culture. On the process side, it requires orchestration and transformation of disparate data sources, each containing different pieces of value for the end product. On the technology side, data needs to be understood, appropriately accessed, and successfully moved, all at the same time.
Organizations that choose to productize with the aid of a data catalog will have much more success in each of these areas. The data catalog makes it possible to know what data can be productized, where it exists in a vast ecosystem, who owns it, and how it can be retrieved. In short, a data catalog eliminates several major hurdles in the journey to data productization.
Data productization and data catalogs offer synergistic benefits:
- A clear picture of the data makes it easier to productize
- Data products that meet data consumer needs can generate ROI and revenue
- Both strategies move organizations toward a data-driven culture
- Both catalogs and products accelerate time to value for internal and external use cases
Revelate is proud to partner with several leading data catalog providers, including Snowflake and Immuta.