Dremio: A Unified Data Lake and Virtualization Tool

Describing Dremio can be a bit of a challenge. People’s definition of Dremio changes depending on who you ask or what you Google. Some people call it a data lake engine, some say it’s a semantic layer between users and data sources, and others call it a SQL lakehouse platform. Dremio describes itself as “the easy and open data lakehouse,” which doesn’t help much.

In this article, we will clearly define what Dremio does, how it fits into your data ecosystem, and how you can use it to enable data-driven decision making in your organization. Dremio is an excellent tool that can help your organization make decisions grounded in actual real-world data. Here’s how.

A unified data lake

Dremio is a data lakehouse, a portmanteau of data lake and data warehouse. This means you get the performance and trustworthy ACID (atomicity, consistency, isolation, durability) transactions of a data warehouse together with the scalability of a data lake. Data lakes are popular for their ability to manage data with varying degrees of structure and to support both business intelligence and machine learning workloads. Dremio combines these strengths into a hybrid data storage solution: the lakehouse.

The second thing to know about Dremio is that it offers self-service SQL analytics in near-real time. This is noteworthy because until recently, it was very difficult to get meaningful insights from a data analytics destination without waiting for computationally expensive data transformations and ETL/ELT pipelines. Dremio has solved this problem, even delivering sub-second performance on some analytical queries. Impressive!

The third thing to know about Dremio is that it’s built on the Apache Arrow project, in no small part because Dremio’s creators co-developed Apache Arrow. Arrow is a language-agnostic, columnar in-memory format for flat and hierarchical data, designed for efficient analytical operations on modern hardware such as CPUs and GPUs. Arrow also provides a standard interface for transferring data between different systems and programming languages.

To put it briefly, Dremio is a comprehensive data solution offering centralized data storage and streamlined operations. You can use it to query and analyze data really, really fast. This benefit makes it unique because there is little-to-no time wasted on computation, normalization, transformation, or data pipelines.

Data virtualization tool

Another unique feature of Dremio is called “data virtualization.” This feature allows you to see all of your data through one interface, even if the data is scattered in a million different places. It’s as if all of your USB cables were in one neatly-organized drawer, despite the fact that you keep finding them all over your home, in your car, and in your semi-monthly Monoprice shipment.

This is where the semantic layer terminology comes into play. Even though your data is scattered across multiple locations, such as SaaS providers, various servers in your cloud, and your data warehouse, Dremio presents it all as one easily accessible data destination. In other words, Dremio doesn’t care where your data lives, and it makes sure you don’t have to, either. It effectively eliminates the need for data movement and duplication between sources and destinations.

Even better, Dremio offers real-time access to these disparate data sources despite their differing data domiciles. This access simplifies data management, even for non-technical users. All you need to know is SQL and you’re good to go.

Powerful cloud analytics

Dremio is cloud-native and cloud-agnostic. You can deploy it into any of the popular public cloud providers, such as AWS, Azure, and GCP. It’s also available as a SaaS, making it inherently scalable and flexible. Need more capacity? Just add more compute.

The meat and potatoes of Dremio is its near-instant access to data analytics. It comes with a variety of visualization tools, which make it easy to spot trends and anomalies. It also integrates seamlessly with a variety of machine learning models, which speeds the time to value for predictions and data-driven decision making.

Because it provides analytics in the cloud at scale, it’s rather straightforward for any cloud infrastructure expert to get everything up and running (and if you opt for the SaaS model, there’s no infrastructure to manage at all).

Simplifying data pipelines

Anyone familiar with data integration might be asking, “How do they virtualize all these data sources and keep it fast?” Nearly every large company out there deals with setup and maintenance of expensive data integration and pipeline platforms. Dremio makes it sound like all of that is unnecessary—so what’s the catch?

To achieve this near-unbelievable performance without building data pipelines, Dremio uses a technique called “query pushdown.” Query pushdown optimizes queries by pushing as much of the processing as possible down to the data sources themselves. Dremio uses Apache Arrow to accelerate data access and processing, which makes query pushdown even more efficient.

Here’s how it works:

  1. A user submits a query to Dremio
  2. Dremio parses the query and determines which data sources it needs to execute it
  3. Dremio sends a sub-query to each data source, requesting only the data it needs
  4. The data sources execute their sub-queries and return the results to Dremio
  5. Dremio combines the partial results and returns the final answer to the user
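The flow above can be sketched in miniature. This is not Dremio’s API; it’s a toy illustration using two SQLite databases as stand-in data sources, with an aggregate pushed down to each source so that only small partial results travel back to the engine:

```python
import sqlite3

# Two stand-in "data sources" (in real life: a warehouse, a SaaS app, etc.).
east = sqlite3.connect(":memory:")
west = sqlite3.connect(":memory:")
for db, rows in [(east, [(120.0,), (310.0,)]), (west, [(75.5,), (42.25,)])]:
    db.execute("CREATE TABLE orders (amount REAL)")
    db.executemany("INSERT INTO orders VALUES (?)", rows)

# Pushdown: each source computes its own partial aggregate locally,
# so only one number per source crosses the wire.
pushed_down = "SELECT SUM(amount) FROM orders"
partials = [db.execute(pushed_down).fetchone()[0] for db in (east, west)]

# The engine combines the partial results into the final answer.
total = sum(partials)
print(total)  # 547.75
```

The key design point is step two: shipping the `SUM` to each source means the engine never has to copy raw rows out of the sources, which is exactly what makes pipelines unnecessary.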

By pushing the computational processing to the data sources themselves, Dremio greatly simplifies the traditional ELT/ETL data pipelining process.

Another impressive technique Dremio employs is metadata-driven transformation: instead of writing transformation code, users describe the desired transformation in metadata, and Dremio applies it for them. Here’s how it works:

  1. A user defines the transformation in metadata
  2. Dremio uses the metadata to transform the data
  3. The transformed data is returned to the user

You can define transformation metadata in a variety of ways, including:

  • SQL: Users can define the transformation in SQL
  • JSON: Users can define the transformation in JSON
  • UI: Users can define the transformation using the Dremio UI

Once the user defines the metadata, Dremio can use it to transform the data without the need for any coding.
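To make the idea concrete, here is a minimal, hypothetical sketch of metadata-driven transformation in Python. The `spec` structure is invented for illustration and is not Dremio’s actual metadata schema:

```python
# A toy metadata spec (invented for illustration; not Dremio's format).
spec = {
    "filter": {"column": "region", "equals": "east"},
    "select": ["region", "amount"],
}

def transform(rows, spec):
    """Apply a metadata-defined transformation to a list of dict rows."""
    f = spec.get("filter")
    if f:
        rows = [r for r in rows if r[f["column"]] == f["equals"]]
    cols = spec.get("select")
    if cols:
        rows = [{c: r[c] for c in cols} for r in rows]
    return rows

orders = [
    {"region": "east", "amount": 120.0, "internal_id": 1},
    {"region": "west", "amount": 75.5, "internal_id": 2},
]
result = transform(orders, spec)
print(result)  # [{'region': 'east', 'amount': 120.0}]
```

Notice that the user only edits the `spec`; the generic `transform` engine never changes, which is the whole appeal of driving transformations from metadata.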

These are just a few ways Dremio makes it much easier to prepare data for analysis.

Revelate: complementary data management tool

It’s no secret that we’re kinda crushing on Dremio and all of its data integration innovations. But we have a few tricks up our sleeve, too.

Dremio does a lot of things well, but it doesn’t do everything. Dremio and Revelate together are a formidable combination. Revelate integrates seamlessly with Dremio, and we solve all of the problems organizations face with data sharing. We have automated data productization pipelines, including preparation and profiling of data from various sources with barely any overhead. We’ve also made data discovery so easy, even your most data-reluctant employees will be dreaming of semantic layers and ACID transactions.

Even cooler, we help identify relevant datasets for teams across your organization. Our platform has automated access control and tight IAM integration for set-it-and-forget-it data governance. We have enhanced data lake capabilities with data cleansing, and we work with every type of data warehouse, lake, and lakehouse architecture. Need advanced data profiling and validation? We’ve got that too.

Whether you’re using an AWS data lake, Dremio, or your own homegrown solution, we’ve got you covered.

Unlock Your Data's Potential with Revelate

Revelate provides a suite of capabilities for data sharing and data commercialization for our customers to fully realize the value of their data. Harness the power of your data today!

Get Started