Imagine you are a data scientist predicting the likelihood of customer churn. You have access to a large dataset of customer data, but it is structured in a way that makes it difficult to understand the relationships between the different entities. Traditional data structures don’t make those relationships easy to see, but an emerging type of structure called a knowledge graph does.
Knowledge graphs can represent anything (e.g. people, objects, credit card numbers) at varying levels of complexity. The connections between nodes can also represent complex processes.
For example, you could use a knowledge graph to represent customer data in a way that makes it easier to understand the relationships between the different entities. You could create a knowledge graph that includes entities for customers, products, and orders, then create relationships between these entities to represent which customers have purchased which products and which products have been returned by whom.
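As a minimal sketch of the example above, the customers, products, and orders can be stored as subject-relationship-object triples in plain Python. The entity names, the relationship labels (`PLACED`, `CONTAINS`, `RETURNED`), and the helper functions are all hypothetical and chosen for illustration. This is not data.world's API or storage model.

```python
from collections import defaultdict

# Hypothetical triples (subject, relationship, object); all names are illustrative.
TRIPLES = [
    ("customer:alice", "PLACED", "order:1001"),
    ("customer:bob", "PLACED", "order:1002"),
    ("order:1001", "CONTAINS", "product:kettle"),
    ("order:1002", "CONTAINS", "product:kettle"),
    ("order:1002", "CONTAINS", "product:toaster"),
    ("customer:bob", "RETURNED", "product:toaster"),
]


def build_index(triples):
    """Index triples by (subject, relationship) for direct lookups."""
    index = defaultdict(set)
    for subject, relationship, obj in triples:
        index[(subject, relationship)].add(obj)
    return index


def objects_of(index, subject, relationship):
    """Return every object linked to `subject` by `relationship`."""
    return index.get((subject, relationship), set())


def purchased_products(index, customer):
    """Follow customer -> order -> product to list what a customer bought."""
    products = set()
    for order in objects_of(index, customer, "PLACED"):
        products |= objects_of(index, order, "CONTAINS")
    return products


def returned_by(triples, product):
    """Find which customers returned a given product."""
    return {s for s, r, o in triples if r == "RETURNED" and o == product}


INDEX = build_index(TRIPLES)
```

Calling `purchased_products(INDEX, "customer:bob")` walks two relationships in a row, and `returned_by(TRIPLES, "product:toaster")` answers the "returned by whom" question directly from the relationship label.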
This simple example illustrates how a knowledge graph differs from other types of data structures. Data.world’s data catalog uses knowledge graphs to build relationships between data across a business’s ecosystem.
Whereas most catalog providers can say, “This data is related to this other data,” data.world can say, “This data represents a user and this other data represents an item they purchased and the connection between the two is a transaction.” In other words, you can contextualize all of your data and the relationships between them.
With data.world as your catalog and Revelate as your discovery, productization, and fulfillment platform, you can take a huge leap forward in your data maturity and commercialization efforts.
What do Revelate and data.world do?
Data.world is a cloud-based data catalog and collaboration platform that helps organizations discover, share, and use data. It improves data management practices and makes data more accessible and usable for both technical and non-technical users. Because it integrates with popular data analysis tools such as Jupyter Notebooks and Tableau, you can easily analyze data on data.world and then share the results with others.
Revelate is a data productization and fulfillment platform that sits on top of data.world. Revelate is designed to complement data catalogs with key features that catalogs typically cannot deliver:
- Data discoverability and access that conforms to any governance model
- Strong metadata capabilities that allow data to be thoroughly described, found, and understood
- RBAC-backed data fulfillment, with fully-automated permissioning and documentation
- Fully-automated data product manufacturing
Our robust metadata management makes it easy for data consumers to find and understand what’s in a data set. In a tool like Snowflake, a user gets access to a table, view, or user-defined function. Then they have to dig through indecipherable table and column names with little to no metadata to explain what they’re looking at. It sounds great to have access to raw data, but that puts all the onus of normalization and insight discovery on the data consumer.
This isn’t a big problem for organizations with high data maturity and a large roster of data-sophisticated professionals. But what about supply chain managers who just want to see what’s coming down the pike? Should they be expected to know whether to perform a left join on a table called “tbl_inbound_shipments” and a column called “inbound_shipment_status_cd”?
Revelate focuses on data collaboration by making data discoverable, collaborative, and rich with metadata. Data.world is not designed for most data sharing use cases, especially for non-technical users. Rather, it supports semantic metadata enrichment, focuses on collaboration within the data.world community, provides semantic modeling and linking across data sheets, and supports public sharing. It provides plenty of technical benefits, but does not provide a clear path for the non-technical business user.
Benefits of a knowledge graph-based data catalog
Most companies store data in data warehouses and data lakes with no formal relationships between the databases, tables, and other data assets (e.g. unstructured and semi-structured data). Data catalogs allow data teams to describe the data through simple relationships like foreign and primary keys, or parent-child relationships. This is helpful, but does not accurately reflect the complexities of data relationships.
A knowledge graph can go further by utilizing flexible relationship types, like “likes” or “subscribers” on social media. While most platforms may relate a subscriber ID to the ID of an account they follow (e.g. a “subscriber” table exists to reflect the relationships between two accounts), a knowledge graph can actually describe the relationship between two data entities as “this person subscribes to that person.” This contextualized relationship helps machine learning and artificial intelligence algorithms better explore and understand underlying datasets.
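The difference between a join-table row and a described relationship can be sketched as follows. The table rows, the account lookup, and the `SUBSCRIBES_TO` label are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical rows from a relational "subscriber" join table:
# each row only pairs two account IDs, with no stated meaning.
SUBSCRIBER_ROWS = [
    {"subscriber_id": 1, "account_id": 2},
    {"subscriber_id": 3, "account_id": 2},
]

# Hypothetical lookup from account ID to a readable entity name.
ACCOUNTS = {1: "person:ana", 2: "person:ben", 3: "person:cho"}


def rows_to_triples(rows, accounts, relationship="SUBSCRIBES_TO"):
    """Turn ID pairs into (subject, relationship, object) statements."""
    return [
        (accounts[row["subscriber_id"]], relationship, accounts[row["account_id"]])
        for row in rows
    ]


def describe(triple):
    """Render a triple as a sentence a person can read."""
    subject, relationship, obj = triple
    return f"{subject} {relationship.lower().replace('_', ' ')} {obj}"


triples = rows_to_triples(SUBSCRIBER_ROWS, ACCOUNTS)
```

Here `describe(triples[0])` yields a readable statement of who subscribes to whom, which is the kind of context a bare pair of IDs does not carry.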
The main advantages of a knowledge graph-based data catalog are:
- It’s easier to find the data you need because the relationships more accurately reflect how the human brain perceives and understands them
- AI/ML systems can better describe and summarize the data, even for complex queries (e.g. “Find every customer who purchased a particular product using PayPal”)
- Data visualizations can more easily be enriched with contextualized relationships
- Data isn’t just numbers and text, but entities with meaningful relationships (see Google’s “Things, not strings” article from 2012)
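The example query in the list above ("Find every customer who purchased a particular product using PayPal") can be sketched as a pattern match over labeled edges. The edges, the relationship labels, and the function name are assumptions made for illustration. A production graph store would normally expose a query language rather than a hand-written loop.

```python
# Hypothetical graph edges (subject, relationship, object).
EDGES = [
    ("customer:alice", "PLACED", "order:1"),
    ("customer:bob", "PLACED", "order:2"),
    ("customer:cara", "PLACED", "order:3"),
    ("order:1", "CONTAINS", "product:kettle"),
    ("order:2", "CONTAINS", "product:kettle"),
    ("order:3", "CONTAINS", "product:toaster"),
    ("order:1", "PAID_WITH", "payment:paypal"),
    ("order:2", "PAID_WITH", "payment:card"),
    ("order:3", "PAID_WITH", "payment:paypal"),
]


def customers_who_bought_with(edges, product, payment_method):
    """Match customer -PLACED-> order, where the same order
    CONTAINS the product and was PAID_WITH the payment method."""
    edge_set = set(edges)
    matches = set()
    for customer, relationship, order in edges:
        if relationship != "PLACED":
            continue
        contains = (order, "CONTAINS", product) in edge_set
        paid = (order, "PAID_WITH", payment_method) in edge_set
        if contains and paid:
            matches.add(customer)
    return sorted(matches)
```

Because each hop is a named relationship, the question maps onto the data almost word for word: placed an order, the order contains the product, the order was paid with PayPal.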
Some companies are advocating for open data initiatives with standardized frameworks that would allow for deeper relationship contextualization. Though the industry has yet to rally around a specific standard, data.world offers a catalog that’s prepared for that future.
Benefits of using Revelate and data.world together
Using Revelate as the interface to the data.world catalog improves data discovery and collaboration. Revelate’s central metadata repository helps users find the data they need more quickly and easily through natural language queries and rich metadata management where data product managers can describe their data products.
Revelate also enhances data.world’s data governance capabilities with sophisticated data lineage and data quality tracking capabilities. This is typically very difficult to do completely and correctly as every organization has its custom processes and organic data ecosystem. Revelate, however, can operate within any data governance framework (across a variety of data cataloging tools, too). No matter what your data governance model looks like, we plug right in and ensure that data lineage, tracking, and auditability are in place without any additional data management software.
Data management workflows are often complicated, even in an easy-to-use data catalog like data.world. Here are a few reasons why:
- There are innumerable data sources (in-house and third-party), each of which requires data pipelines
- Data pipeline construction and maintenance requires a significant investment, including 24/7 operational staffing
- Compliance and regulatory requirements are often changing
- Scaling infrastructure to accommodate increasing demand is neither cheap nor easy, especially as a company becomes more data-dependent
- Most people don’t know what data actually exists in their warehouses, how to access it, or who to ask for permission
Revelate simplifies the process with a productization pipeline that helps manage, package, and fulfill data sharing. Data.world does support quick and easy data discovery, but it’s not built around user-defined and user-understandable metadata. Think of it like buying a martini glass on Amazon. The data catalog behind the scenes might say there are 35 types of martini glasses for sale, but the metadata framework is the one that suggests one is perfect for a cold martini on a hot sunny day.
Technical requirements and integration complexity of data.world
Like many data catalogs, data.world can be complicated to set up and implement. Cataloging data is difficult, especially for users who are not familiar with data catalogs or data science. And while data.world offers dozens of data integrations, some competitors offer hundreds. This can create challenges for existing data pipelining and automation workflows:
- The ever-increasing number of data sources and destinations requires more pipelines with more functionality
- Data is often fragmented and incomplete, making it much harder to normalize as it goes downstream
- Data security and access controls cannot be consistently applied, nor can the process of adding to the overall data ecosystem
- Each data source integration has a different set of capabilities (e.g. change data capture with hard and soft deletes)
Data governance is also a challenging problem for most organizations dealing with any catalog, but some data.world users may experience the following:
- Data ownership: Data.world allows users to upload their own data to the platform, but it does not have a strict policy on data ownership. Instead, it leaves it up to users to decide who owns the data they upload. This can lead to confusion and conflict between users who believe they own the data and users who believe they have the right to use it, especially if multiple users upload the same data
- Data privacy: Data.world allows users to share their data with others, but complying with data privacy regulations can be complex and must be accounted for. If a user uploads a data set to data.world and claims ownership when, in fact, they don’t own it, this can open a can of worms for privacy and governance concerns
- Data quality: Data.world does not provide any guarantees about the quality of the data that is uploaded to the platform. This means that organizations need to be careful about the data that they use from data.world, as they may not be able to rely on the accuracy or completeness of the data
The final area of complexity is with costs. Data.world is a paid platform with simple pricing for non-enterprise users on its Community Edition. However, for enterprise customers, pricing is based on a number of factors, such as the number of users, the amount of stored data, and the features that are used. As of this writing, there are four enterprise pricing plans starting with 10 included users, and the option for add-on services.
Best practices for integrating Revelate and data.world
Though Revelate isn’t (yet) listed on the integrations page for data.world, we are certainly able to work on top of it. Here are some best practices for integrating these two platforms:
- Identify use cases and requirements: Know exactly what you want to see in a test integration. Make sure it’s a high-value, high-visibility use case that will get stakeholder buy-in and reduce any political friction you might encounter
- Develop an integration plan and timeline: Proving out a technology platform integration is usually a non-trivial engagement. Set some reasonable time constraints and expectations so you can plan for success. This way, if it takes too long, you can put pressure on either data.world or Revelate to help you get there
- Provide training and support for users: Be sure your teams are equipped and set up for success, with metrics that demonstrate functionality that’s meaningful to your business’s data goals. This includes making sure people know how to use both data.world and Revelate to fulfill whatever use cases are on their backlogs
- Establish clear data governance policies and procedures: A successful data platform integration should not disrupt your data governance policies and procedures. Make sure you know what your policies and procedures are, test for ways they could be violated, and close any loopholes that might allow compliance-breaking data access
Revelate + Data.world: Better data management and collaboration
Data.world is a data catalog with a strong set of features. Using Revelate in concert with data.world will bolster your data lineage, quality, governance, productization, and sharing efforts.
By using data.world’s knowledge graph-based functionality in conjunction with Revelate, organizations can further enhance their data management and collaboration practices. Revelate offers flexible metadata management and data productization capabilities, making data sets more discoverable and understandable for both technical and non-technical users. It simplifies data discovery and promotes collaboration by providing a central metadata repository and supporting natural language queries.
Integrating Revelate and data.world improves data governance efforts by enhancing data lineage, data quality tracking, and auditability. Revelate’s productization pipeline simplifies data sharing and fulfillment processes, enabling organizations to package and deliver data in user-defined formats. This integration enhances data discovery, collaboration, and metadata management, making it easier for organizations to derive value from their data assets.
While implementing and integrating data.world can be complex, organizations can navigate challenges by defining clear data ownership, privacy, and quality policies. It’s important to carefully plan and align use cases, provide user training and support, and establish robust data governance procedures.
Unlock Your Data's Potential with Revelate
Revelate provides a suite of capabilities for data sharing and data commercialization for our customers to fully realize the value of their data. Harness the power of your data today!