Partner Data Sharing for Databricks

Revelate

Most data consumers don’t know how or where to find the data available to them. The data they seek could be available internally or via organizational data partners. For example, suppose a food manufacturer partners with a logistics company to optimize its supply chain. In that case, the food manufacturer may want to share its data and production schedules with the logistics firm to streamline supply chain operations, ensure timely delivery of goods, and minimize inventory holding costs.

How can the logistics firm ensure their people and systems can find and consume the food manufacturer’s data easily? How can the food manufacturer provide ongoing data in a straightforward, automatic manner? Partner sharing can help resolve these questions.

Partner sharing is a data strategy that allows companies to share data with one another. This approach operates on the principle that companies can achieve more when collaborating and sharing data. Sometimes this collaboration can lead to new products, similar to how Uber and Google Maps shared mapping data to develop an enhanced ride-sharing service. Other times, it can lead to optimization, as with the food manufacturer and logistics company above.

Understanding Partner Sharing

What Is Partner Sharing?

Partner sharing occurs when data providers exchange data with business partners. It can involve consuming external data or distributing data externally. For example, Walmart and Procter & Gamble exchanged sales and inventory information, enabling Walmart to optimize inventory management and supply chain operations. The results? Improved customer satisfaction.

Partner sharing establishes thriving relationships that facilitate collaboration and engagement. Some key performance indicators (KPIs) that indicate a successful partner data sharing relationship include:

  • Business growth metrics like revenue, market share, customer acquisition, or customer retention
  • Operational efficiency metrics like reduced costs, optimized resources, or streamlined processes
  • Customer experience metrics like customer satisfaction and engagement
  • Innovation aspects such as new opportunities or product/service enhancements
  • Data quality metrics such as the quality and relevance of shared data and the insights it generates

Partner Sharing with Databricks

Databricks is an innovative software company that provides businesses with a state-of-the-art data processing platform. Its partner sharing capabilities enable companies to collaborate in real time with third-party vendors and suppliers on data analytics projects. 

Databricks allows authorized users to select subsets of data to share with their partners, then share data securely. Databricks also includes audit logging and access controls, which enable companies to remain compliant and track how partners use their data.

How Partner Sharing with Databricks Works

Databricks users can share data with partners via Delta Sharing, the open source data sharing protocol developed by Databricks. It supports fine-grained access controls so an owner or administrator can provide access to partners (or partner groups).

Permissions are assigned to users but enforced through Unity Catalog and Delta Sharing. Unity Catalog is Databricks’ centralized governance layer, which manages metadata and controls access to data. On Databricks, Delta Sharing is integrated with Unity Catalog, so the two technologies work closely together. With Unity Catalog and Delta Sharing, it is easy to track who has access to shared data and to audit how that data is being used.

Partners can use these Databricks technologies to access and interact with shared data and resources once the data owner configures the access control lists (ACLs). The partner can then work with the data owner or other partners to build machine learning models, run queries, and write, develop, and execute code.
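On the provider side, the flow described above can be sketched as a short sequence of SQL statements. The following is a minimal sketch, assuming a Unity Catalog-enabled workspace; the share, table, and recipient names are hypothetical, and in a Databricks notebook each statement would typically be passed to `spark.sql`.

```python
def build_share_statements(share, tables, recipient):
    """Return the SQL a provider might run to share tables with one partner.

    All names passed in are hypothetical placeholders.
    """
    statements = [f"CREATE SHARE IF NOT EXISTS {share}"]
    # Add each table (written as catalog.schema.table) to the share.
    statements += [f"ALTER SHARE {share} ADD TABLE {t}" for t in tables]
    # A recipient represents the partner that will read the share.
    statements.append(f"CREATE RECIPIENT IF NOT EXISTS {recipient}")
    # Grant the recipient read access to everything in the share.
    statements.append(f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}")
    return statements


statements = build_share_statements(
    share="production_schedule_share",
    tables=["main.supply_chain.production_schedule"],
    recipient="logistics_partner",
)
for stmt in statements:
    print(stmt)
```

Generating the statements from a function keeps the share definition reviewable before anything is executed, which can be useful when several partners receive different subsets of data.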

Benefits of Partner Sharing

Data sharing improves collaboration by allowing partners to easily and securely share and access data. Continuous data sharing leads to more data-driven decision-making, builds trust between partners, and demonstrates a commitment to protecting shared data. Other partner sharing benefits include:

  • Deeper insights into customer behaviors and preferences
  • Better decision-making driven by insights from shared diverse data sets
  • Expanded market reach and customer acquisition, enabling organizations to tap into new customer segments and gain insights that drive targeted marketing strategies
  • Cost savings by eliminating duplicative data efforts
  • Compliance and risk mitigation by ensuring data sharing practices align with legal requirements

In sum, partner sharing empowers businesses to gain a competitive advantage by providing a broader perspective, deeper insights, and collaborative opportunities for growth.

 

Advantages of Databricks

Here are some key features that make Databricks a top choice for many organizations that engage in partner sharing.

Handle complexity and balance costs with scalability

Databricks offers advanced features that ensure scalability. First, the platform relies on distributed computing to split its workload across multiple nodes, enabling faster and more efficient processing of large datasets. It also incorporates autoscaling capabilities to automatically adjust the size of the compute cluster based on workload requirements. 

In addition, Databricks dynamically and efficiently allocates memory and CPU resources to allow the system to scale with business needs. These features enable businesses to handle large and complex datasets, efficiently scale their processes, and manage costs.
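To make the autoscaling idea concrete, here is a minimal sketch of a cluster definition with a worker range and an idle timeout. The field names follow the shape of the Databricks Clusters API request body; the cluster name, the worker counts, and the placeholder runtime and node type values are assumptions for illustration only.

```python
import json


def autoscaling_cluster_spec(name, min_workers, max_workers, idle_minutes=30):
    """Build a cluster definition that scales between two worker counts."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("expected 1 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": "<runtime-version>",  # placeholder, workspace-specific
        "node_type_id": "<node-type>",         # placeholder, cloud-specific
        # The platform adds or removes workers within this range.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Shut the cluster down after this many idle minutes to limit cost.
        "autotermination_minutes": idle_minutes,
    }


spec = autoscaling_cluster_spec("partner-sharing-etl", min_workers=2, max_workers=8)
print(json.dumps(spec, indent=2))
```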

Streamline data workflows with automation

Databricks’ focus on automation streamlines your workflow and helps you achieve results more efficiently. Using automated infrastructure provisioning, Databricks can abstract away the complexities of infrastructure setup so users can focus on data tasks instead of hardware and software configuration. Databricks can also automate the creation, configuration, and management of compute clusters for a simpler setup. This task automation reduces the time and effort required to manage your infrastructure.

Collaborate across programming languages

Databricks’ web-based notebooks make it easy for team members with different expertise and programming language preferences to collaborate and communicate. Notebooks allow data analysts and engineers to work simultaneously on the same data using their preferred languages, such as SQL, Python, R, or Scala. Developers can run cells written in different programming languages in the same notebook and choose between running interactive or scheduled data workloads.

Advanced and efficient performance

Databricks’ advanced and efficient performance is a crucial value-add. It empowers organizations to process and analyze large-scale datasets at high speeds, enabling faster insights and decision-making. This also optimizes resource utilization and enhances overall productivity and competitiveness in data-intensive workflows.

Disadvantages of Databricks

Cost

Databricks pricing depends on the selected cloud services, which can be a stumbling block for smaller projects. Databricks notebooks that run code require a cluster for computation, and even simple data processing using Pandas incurs expenses for virtual machines and Databricks Units (DBUs).

Databricks users will find that it’s relatively easy to incur higher costs. For instance, running continuous extract-transform-load (ETL) processes can incur substantial overhead. Migrating data to other systems can also incur higher costs, particularly if the migration requires data transformation.

Users should monitor the cluster size to avoid incurring unnecessary costs, as a fellow team member might create clusters without proper oversight or fail to appropriately manage and terminate them when no longer needed.
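A simple review script can support this kind of oversight. The sketch below assumes cluster records shaped like entries from the Clusters API list response (`cluster_name`, `num_workers` or `autoscale`, and `autotermination_minutes`, where 0 means auto-termination is disabled). The sample records and the worker threshold are hypothetical.

```python
def flag_costly_clusters(clusters, max_workers_limit=8):
    """Return (cluster_name, reasons) pairs for clusters worth reviewing."""
    flagged = []
    for cluster in clusters:
        reasons = []
        # A value of 0 means the cluster never shuts down on its own.
        if cluster.get("autotermination_minutes", 0) == 0:
            reasons.append("no auto-termination")
        # Use the autoscaling ceiling if present, else the fixed size.
        size = cluster.get("autoscale", {}).get(
            "max_workers", cluster.get("num_workers", 0)
        )
        if size > max_workers_limit:
            reasons.append(f"up to {size} workers")
        if reasons:
            flagged.append((cluster["cluster_name"], reasons))
    return flagged


clusters = [
    {"cluster_name": "etl-nightly", "num_workers": 4, "autotermination_minutes": 30},
    {"cluster_name": "adhoc-analysis",
     "autoscale": {"min_workers": 2, "max_workers": 20},
     "autotermination_minutes": 60},
    {"cluster_name": "forgotten-sandbox", "num_workers": 2, "autotermination_minutes": 0},
]
flagged = flag_costly_clusters(clusters)
for name, reasons in flagged:
    print(name, "->", ", ".join(reasons))
```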

Steep Learning Curve

Databricks has a steep learning curve, and inadequate documentation can make finding relevant information or troubleshooting issues frustrating. Working with Delta Live Tables can be particularly challenging and complex. Databricks is geared towards technical users, requiring significant manual input for tasks such as resizing clusters or configuring updates.

Limited Customization

Databricks users have voiced concerns about limited customization options. This limitation can make integrating the platform with existing systems and workflows challenging, especially when making pipeline changes. According to these users, Databricks also limits the ability to customize the target path’s structure, compression capabilities, and format (for example, they report a lack of support for CSV file compression).

How To Share Data in Databricks: Delta Sharing

Databricks recognizes that not every organization you want to share data with has a Databricks account. That’s why it created Delta Sharing, which enables organizations to share data externally. Delta Sharing can also be used to share data internally, fostering collaboration between teams, departments, or business units.

What Is Delta Sharing?

Delta Sharing is an open protocol that lets you share data securely, regardless of which computing platform the data recipient uses. It is built into Databricks’ Unity Catalog data governance platform, so a Databricks user can share data with individuals or groups outside their organization, even if those recipients don’t use Databricks.
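On the recipient side, a partner without Databricks can read a shared table with the open source `delta-sharing` Python connector. The sketch below assumes the provider has sent a credential (profile) file, here called `config.share`; the share, schema, and table names are hypothetical. The connector is imported lazily so the address-building helper works even where the package is not installed.

```python
def table_url(profile_path, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' address of a shared table."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path, share, schema, table):
    """Read a shared table into a pandas DataFrame (needs network access)."""
    import delta_sharing  # installed separately: pip install delta-sharing

    return delta_sharing.load_as_pandas(
        table_url(profile_path, share, schema, table)
    )


url = table_url(
    "config.share",
    "production_schedule_share",
    "supply_chain",
    "production_schedule",
)
print(url)
```

Calling `load_shared_table` with the same arguments would fetch the data itself, provided the profile file is valid and the provider has granted access.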

Best Practices for Partner Sharing in Databricks

The following best practices, several of which come from Databricks’ own recommendations, help make partner sharing effective:

Define Clear Objectives and Guidelines

Set clear objectives and guidelines for data sharing: establish what the data is for, what you want it to accomplish, and how to use it. This ensures the shared data is highly valuable and relevant. It also lays the foundation for data productization, transforming raw data into actionable products that align with specific business objectives and end-user needs.

Once data partners agree to data productization, they can adopt a product-oriented mindset, implement efficient operational processes, and effectively manage data as a valuable product.

Without these objectives, data partners risk sharing irrelevant data that fails to generate valuable insights. Moreover, if the dataset includes outdated or unrelated customer information, it can lead to decisions based on inaccurate or irrelevant data.

Establish Clear Procurement Flows

Data partners aren’t always aware of what data they have access to. Data platforms like Databricks are great for sharing between technical, data-proficient people and teams, but they don’t come with data discoverability and productization out of the box.

To build an effective partner sharing methodology, it’s best to use Databricks (including Delta Sharing and Unity Catalog) as a back-end for a data productization and fulfillment platform. This way, non-technical users can easily find, assess, and procure the data they need without becoming experts in database schemas, navigating tables, and deciphering cryptic column naming conventions.

Monitor Partner Sharing Activities

Databricks provides audit logs that serve as an authoritative record of activity on the platform. Databricks strongly recommends that you configure audit logs for each cloud, set up automated pipelines to process those logs, and create alerts for important events.

For example, audit logs can show which verified user shared a Databricks notebook, and they record the cluster a non-verified user used to access the data. These audit logs are an indispensable tool for ensuring data partners comply with data governance policies.
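A log-processing pipeline can start out very small. The following sketch counts sharing-related events per user from JSON log lines. The field names (`serviceName`, `actionName`, `userIdentity.email`) are modeled on the Databricks audit log schema, but the sample records, the email addresses, and the exact action names are illustrative assumptions rather than real log output.

```python
import json
from collections import Counter

# Hypothetical sample records, one JSON document per line.
SAMPLE_LOG_LINES = [
    '{"serviceName": "unityCatalog", "actionName": "deltaSharingQueryTable",'
    ' "userIdentity": {"email": "partner@logistics.example"}}',
    '{"serviceName": "unityCatalog", "actionName": "deltaSharingListShares",'
    ' "userIdentity": {"email": "partner@logistics.example"}}',
    '{"serviceName": "notebook", "actionName": "runCommand",'
    ' "userIdentity": {"email": "analyst@foodco.example"}}',
    '{"serviceName": "unityCatalog", "actionName": "deltaSharingQueryTable",'
    ' "userIdentity": {"email": "auditor@logistics.example"}}',
]


def sharing_activity(lines, prefix="deltaSharing"):
    """Count events per user whose action name starts with the given prefix."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("actionName", "").startswith(prefix):
            counts[event["userIdentity"]["email"]] += 1
    return counts


activity = sharing_activity(SAMPLE_LOG_LINES)
for email, count in activity.most_common():
    print(email, count)
```

The same counting logic could feed an alert, for instance when a recipient's query volume rises far above its usual level.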

Manage Access Sharing Permissions

Databricks recommends assigning access-control policies to groups instead of individual users to ensure consistent, easy management. It also recommends that admins use account groups to grant access to Unity Catalog data or provide workspace permissions.

For example, the Finance account group can be granted access to financial data, while the Marketing and Sales account groups are not. Following these best practices ensures that companies properly secure data and only provide access to those who need it. This approach reduces the risk of data breaches or data misuse.
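The group-based approach can be sketched as a set of Unity Catalog `GRANT` statements issued to a group rather than to individuals. This is a minimal sketch: the group, catalog, schema, and table names are hypothetical. Because privileges must be granted explicitly, groups such as Marketing and Sales simply receive no grant here.

```python
def grants_for_group(group, catalog, schema, tables):
    """Return the GRANT statements that give one group read access to tables."""
    principal = f"`{group}`"  # backticks guard group names with special characters
    statements = [
        # Reading a table also requires usage rights on its catalog and schema.
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {principal}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {principal}",
    ]
    statements += [
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {principal}"
        for table in tables
    ]
    return statements


finance_grants = grants_for_group(
    "finance", "main", "financials", ["ledger", "invoices"]
)
for stmt in finance_grants:
    print(stmt)
```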

Ensure Security and Compliance

Best practices against the most probable security risks involve:

  • Leveraging multi-factor authentication
  • Restricting network access
  • Separating admin accounts from regular user accounts
  • Monitoring user activities to detect anomalies

Databricks also recommends managing personal access tokens for REST API authentication so that workspace users do not have to authenticate with passwords. While these tokens can serve as credentials when automating Databricks workspace-level functionality, they do not replace account-level passwords.
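As an illustration of token-based authentication, the sketch below prepares (but does not send) a REST request that carries a personal access token as a bearer token. The workspace URL is a made-up example, and reading the token from a `DATABRICKS_TOKEN` environment variable with a dummy fallback is an assumption made so the snippet runs anywhere.

```python
import os
import urllib.request


def build_api_request(host, path, token):
    """Prepare an authenticated GET request without sending it."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}{path}",
        # The token goes in the Authorization header instead of a password.
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Keep tokens out of source code; the fallback value is a dummy.
token = os.environ.get("DATABRICKS_TOKEN", "example-token")
request = build_api_request(
    "https://example.cloud.databricks.com", "/api/2.0/clusters/list", token
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would perform the call against a real workspace.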

Data-sharing partners should also employ SCIM provisioning to sync users and groups automatically from their identity provider to their Databricks account. This approach streamlines onboarding and offboarding processes and prevents unauthorized access to sensitive data.

Enhanced Data Collaboration With Revelate

Revelate, a platform providing advanced capabilities and tools designed for data sharing and collaboration across organizations, can greatly enhance the data sharing experience of Delta Sharing users. Revelate offers features such as data cataloging, data lineage, data access management, and data collaboration workflows that are tailored to facilitate seamless sharing and collaboration on data assets.

By leveraging Databricks’ partner sharing capabilities alongside Revelate’s specialized data collaboration features, enterprises can significantly improve their ability to share data with partners, streamline the collaboration process, and derive valuable insights from shared data assets.
