The Final Mile: How to Get Data Access and Distribution Right
In our last blog post, we explored the dynamics of the modern data buying (and, by extension, selling) process. In particular, we identified the three key steps in the process, starting with Discovery, proceeding through Procurement and ending logically enough with Data Access and Distribution.
Perhaps counter-intuitively, this final step – the actual delivery of the data to the consuming person or application that will derive value from it – sets the parameters for the rest of the data sales workflow.
In many ways, the characteristics of the datasets in question – as defined by the five Vs of Big Data, namely: volume, velocity, variety, veracity and value – will dictate the delivery options available to the seller. A real-time data feed requires a different delivery mechanism from a static database composed of flat files.
It is the responsibility of the data seller to put in place delivery options that are appropriate to the dataset being sold. Deciding what’s appropriate will require the seller to understand who the consumers are and how they will want to use the data.
Methods of accessing data
In most cases, the consumers’ preferences fall into four discrete categories: the subscription model, one-time purchase, on-demand access, and the services model, sometimes offered with a sandbox facility. Each has its benefits and shortcomings for the consumer, and implications for the design and delivery of the seller’s data service.
The subscription delivery model typically allows the consumer to take ownership of the data, ingesting it into an internal system and using the data in any way they see fit, subject to the licensing agreement. By definition, this approach requires the data buyer to maintain a team with a strong data engineering skillset. The buyer needs to understand how the data is delivered, what transformation processes are needed to make the data usable, how to establish multiple internal delivery pipelines where required, and how to ensure appropriate levels of redundancy, latency and security.
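To make the data engineering burden concrete, the sketch below shows a minimal version of the transformation and fan-out work a subscription buyer must build. The vendor field names, internal schema and in-memory “pipelines” are all hypothetical stand-ins for illustration, not any particular provider’s format.

```python
import csv
import io

# Hypothetical vendor delivery format and internal schema, for illustration only.
VENDOR_FIELDS = ["SYM", "PX", "TS"]  # vendor's proprietary column names

def transform(vendor_csv: str) -> list[dict]:
    """Map the vendor's delivery format onto the buyer's internal schema."""
    reader = csv.DictReader(io.StringIO(vendor_csv), fieldnames=VENDOR_FIELDS)
    return [
        {"symbol": row["SYM"], "price": float(row["PX"]), "timestamp": row["TS"]}
        for row in reader
    ]

def fan_out(records: list[dict], pipelines: list) -> None:
    """Deliver the transformed records to each internal consumer pipeline."""
    for pipeline in pipelines:
        pipeline.extend(records)  # stand-in for a queue, topic or table write

# Usage: two internal pipelines (e.g. risk and research) receive the same feed.
risk, research = [], []
fan_out(transform("ABC,101.5,2024-01-02T09:30:00"), [risk, research])
```

In a real deployment each of these steps also needs the redundancy, latency and security guarantees described above – which is precisely the engineering investment the subscription model demands of the buyer.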
This requirement for technical sophistication of the buying organization often leads sellers of subscription-based services to question the need for investment in making their data easy to consume. As a result, many subscription services – particularly in the capital markets segment – are delivered in proprietary or semi-proprietary formats, with little or no instruction on how to integrate the data with the buyer’s internal systems. And in many cases, the data is so critical that demand is high, giving little incentive for the seller to change its approach.
Subscription-based streaming services, in particular, are vulnerable to this situation. The onus is on the buyer to land the data onsite in real time without losing updates; any gaps may require expensive backfill and replay capabilities to restore later.
All of this makes it relatively difficult for smaller organizations, which often lack the requisite technical resources, to consume subscription-based services. As a result, subscription-based services – including streaming services – are often targeted at enterprise clients that are more likely to have in place the data engineering resource needed.
One-time purchases of datasets, meanwhile, present consumer organizations with the same technical challenges as subscriptions, except that they need to perform the required data transformation only once. These types of purchases often involve historical datasets with no updates, except corrections, for the buyer to accommodate. But while the one-time purchase model puts less onus on the buyer’s engineering sophistication than continuously delivered subscriptions, the need for consistent processing and pipelining still presents a barrier to entry for consumers.
On-demand delivery, alternatively, is a model in which the data buyer does not store the data in their own systems. Instead, the buyer receives the data into memory at the time of processing, and if the data needs any preparation, must transform it every time it is used. Consumers can still join the data with other internal datasets, but because it is never stored by client systems, flexibility in how it is used can be limited.
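The on-demand pattern just described can be sketched as follows: the data is fetched into memory and transformed at the point of use, never persisted by the buyer, and joined in memory with an internal dataset. The `fetch_remote` function and the field names are hypothetical stand-ins for a provider API call.

```python
def fetch_remote(symbol: str) -> dict:
    """Stand-in for a network call to the provider's hosted dataset."""
    return {"symbol": symbol, "px": "99.10"}

def market_value(symbol: str, internal_positions: dict) -> dict:
    record = fetch_remote(symbol)            # re-fetched on every use
    price = float(record["px"])              # transformed on every use
    qty = internal_positions.get(symbol, 0)  # joined with internal data in memory
    return {"symbol": symbol, "market_value": price * qty}

# Usage: nothing from the provider is ever written to the buyer's storage.
print(market_value("ABC", {"ABC": 100}))
```

Note that every call to `market_value` repeats both the network fetch and the transformation – the source of the networking and memory costs discussed below.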
At high volumes, this setup may introduce cost considerations, too. On-demand processing constantly makes calls to externally hosted datasets, inviting high networking costs as well as the cost of the additional memory required to process the incoming data. And the more complex the clusters of processors needed to handle the data become, the less ‘on-demand’ the service becomes, restricting its usage and value. As a result, on-demand services are better suited to lower volumes and smaller consuming organizations. Even with modern data sharing capabilities, this approach can incur high networking costs for the data providers.
This shifts responsibility for the usability of the data entirely to the data provider, which means organizations seeking to supply datasets on demand must invest in a more robust infrastructure that ensures the data is delivered in a consumable way to relatively smaller and less sophisticated buyers. In short, on-demand delivery needs to be as simple and easy as possible for the consumer, so that there is a strong incentive to remain within the on-demand platform.
Services delivery model
The services model approach is typically used to deliver a data service, such as analytics, rather than the raw underlying data itself. In essence, the data buyer asks a question of the dataset and the provider delivers an answer to that question in the form of a report, aggregated display or other analytical response. In more sophisticated instances, the consumer is able to modify the parameters of the question, allowing query of the host database, perhaps by time-frame, instrument type or some other characteristic.
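The question-and-answer exchange described above can be illustrated with a small sketch: the buyer supplies query parameters (instrument type and time-frame) and receives an aggregate, never the raw rows. The dataset and parameter names here are assumptions for illustration only.

```python
# Raw data held by the provider; never shipped to the buyer.
RAW_DATA = [
    {"instrument": "bond",   "date": "2024-01-02", "volume": 120},
    {"instrument": "bond",   "date": "2024-01-03", "volume": 80},
    {"instrument": "equity", "date": "2024-01-02", "volume": 300},
]

def query(instrument_type: str, start: str, end: str) -> dict:
    """Answer a parameterised question with an aggregate, not raw data."""
    rows = [
        r for r in RAW_DATA
        if r["instrument"] == instrument_type and start <= r["date"] <= end
    ]
    return {
        "instrument": instrument_type,
        "total_volume": sum(r["volume"] for r in rows),
    }

# Usage: the buyer varies the parameters but only ever sees the answer.
print(query("bond", "2024-01-01", "2024-01-31"))
```

Because only the aggregated answer leaves the provider’s systems, the underlying intellectual property in the raw dataset is protected – the key commercial appeal of this model.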
Providers of services model offerings may allow consumers access to the underlying raw data through a so-called sandbox, hosted by the seller. This approach can allow buyers to manipulate the raw data while protecting the IP of the provider. But buyers may be reluctant for competitive reasons to commingle their own data within the sandbox environment, limiting the facility’s value. In reality, the sandbox facility is most commonly used for testing the usefulness of datasets in a controlled way before buying.
As can be seen from the nuances of the various models of data access and distribution, the buyer and seller may take on more or less responsibility for rendering the datasets in question useful. This has implications for the design of the original data products, and for licensing agreements and other commercial considerations, which we’ll consider in our upcoming blog.