What is the Data Lakehouse?
The Data Lakehouse concept (often mentioned in the same breath as Delta Lake, one of the technologies behind it) has been making the rounds in the data and analytics community for a while now. But what is it exactly, and does it improve on the concepts we already know?
More or less since the beginning of the analytics industry as we know it, data warehouses have been the go-to resource for most enterprise analytics and reporting needs. In recent years, data lakes have entered the mix as a popular alternative, providing greater flexibility and lower storage costs. And although both definitely have their strengths, they also come with separate challenges:
Data warehouses are costly and inflexible, only storing highly structured data, while data lakes quickly become chaotic and difficult to manage, access and govern.
The data lakehouse seeks to address these issues, while marrying the benefits of its predecessors. In order to assess if it’s worth the hype, let’s first briefly brush up on the traditional approaches.
Data warehouses are still extremely popular, and rightfully so: They help make reporting straightforward and accessible thanks to the schema-based datasets of structured data, which can easily be queried through SQL. In addition to this being a great enabler for self-service analytics, traditional data warehouses of course offer all of the benefits of relational databases like ACID transactions, indexing, security, and caching features.
That’s great, so what more could you want? These benefits come at a price, since the highly organized nature of data warehouses makes them rigid and relatively inflexible. Ongoing engineering will in many cases be required to accommodate source system changes or new data requirements from your stakeholders. And if you find yourself dealing with unstructured data (like audio, images or video for data science purposes), forget about landing that type of content in your data warehouse. Although you can technically store semi-structured data in a data warehouse, it’s best to think of warehouses as tabular only. And as your datasets grow over time, so does your cloud bill - storing vast amounts of structured data is not cheap 😬
And that brings us to data lakes. The savior for all businesses looking for higher flexibility and lower storage costs? 🤔
As a lot of companies find themselves flooded with data, they don’t have the capacity to store all of it in neat, organized data warehouses. In a data lake you can store data in its raw format - no preprocessing needed to convert to a tabular format accepted by a data warehouse - and you can save it to an inexpensive data lake storage account (like ADLS Gen2). So you can react quicker to new data capture requirements and save money at the same time.
Additionally, data lakes provide a decentralized way of working with data, as compute and storage are separated. All contents of a data lake can be processed through the compute engine of choice. Data scientists can process data through a local Python notebook or using large Apache Spark clusters, thus reducing bottlenecks. Compute resources are also a lot more expensive than storage, so this decoupling allows data engineers to focus on optimizing compute-heavy workloads for cost reduction, and to shut down compute nodes as soon as they finish processing a job. This way you only pay for the processing power you use and make sure you aren’t billed for excess capacity.
But as you might have guessed, there’s also a catch this time around. Data lakes tend to grow quickly in size and this, combined with the ability to store any data type and oftentimes lack of structure and governance, can make them very difficult to extract value from (which was the point in the first place). As a worst-case scenario, your lake can deteriorate into a data swamp where value from captured data is never realized.
Unprocessed data types don’t exactly lend themselves to easy analysis either. That’s why some companies have adopted a two-tier approach.
Two-tier approach: Why not a bit of both?
So you store your data in a lake, load this into a data warehouse, and expose it to your users. Easy peasy, right? Businesses opting for this approach will quickly find themselves maintaining a data lake and most likely several data warehouses separately. On top of that, you also pay for storage in both your lake and your data warehouse, since you are essentially duplicating the data you expose in the data warehouse.
So what we need is a unified platform that can act as a lake in terms of flexibility and storage costs, while giving us the accessibility and structure of the data warehouse - and that’s exactly what the data lakehouse attempts to do!
The Data Lakehouse
This brings us to the main event: the data lakehouse. Is it possible to get the best of both worlds? And how does this differ from the traditional approaches?
A key difference is the implementation of data warehousing functionality on top of open data lake formats. This can be described as a dual-layered architecture, where the warehousing layer exposes lake data directly, while enforcing structure and schemas. For instance, the lake data can be stored in highly optimized Parquet files, which the warehousing layer will expose as external tables, allowing analysts to query data directly using SQL for BI and reporting needs. The data still only resides in one place, avoiding duplication costs and ensuring a single source of truth. Pretty cool, right?
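To make the "external table" idea a bit more concrete, here is a minimal sketch using only the Python standard library. It makes several assumptions that are not from any real lakehouse product: CSV files stand in for Parquet, a throwaway in-memory SQLite database stands in for a SQL engine that would scan the files in place, and the names `SALES_SCHEMA`, `read_external_table` and `query` are made up for illustration. The point is only that the files stay where they are, and the schema is enforced when they are read.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical schema for a "sales" table: column name -> Python type.
SALES_SCHEMA = {"order_id": int, "country": str, "amount": float}


def read_external_table(lake_dir, schema):
    """Scan every CSV file in lake_dir and yield rows cast to the schema.

    A missing column or a value that cannot be cast raises ValueError,
    which is the "schema enforcement" part of this sketch.
    """
    for path in sorted(Path(lake_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            for line_no, raw in enumerate(csv.DictReader(handle), start=2):
                try:
                    yield tuple(cast(raw[col]) for col, cast in schema.items())
                except (KeyError, ValueError, TypeError) as exc:
                    raise ValueError(
                        f"{path.name}:{line_no} violates schema: {exc}"
                    ) from exc


def query(lake_dir, table, schema, sql):
    """Run SQL over the lake files via a temporary in-memory engine.

    Nothing is persisted outside the lake: the SQLite database only lives
    for the duration of the query (a stand-in for a real scan-in-place engine).
    """
    sql_types = {int: "INTEGER", float: "REAL", str: "TEXT"}
    columns = ", ".join(f"{name} {sql_types[t]}" for name, t in schema.items())
    placeholders = ", ".join("?" for _ in schema)
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f"CREATE TABLE {table} ({columns})")
        conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            read_external_table(lake_dir, schema),
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def demo():
    """Write two small files to a temporary 'lake' and aggregate them with SQL."""
    with tempfile.TemporaryDirectory() as lake:
        Path(lake, "part-0.csv").write_text(
            "order_id,country,amount\n1,DK,100.0\n2,SE,50.5\n"
        )
        Path(lake, "part-1.csv").write_text("order_id,country,amount\n3,DK,25.0\n")
        return query(
            lake,
            "sales",
            SALES_SCHEMA,
            "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country",
        )


if __name__ == "__main__":
    print(demo())
```

In a real lakehouse the engine would read columnar files directly rather than load them into a temporary database, but the contract is the same: one copy of the data, a declared schema, and plain SQL on top.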
Another big step in establishing the data lakehouse as a proper design choice was the introduction of the Delta format. Delta stores data in versioned Parquet files and adds a transaction log on top of them. The log provides history and time travel options but, most importantly, introduces ACID transactions. This is a huge step towards enforcing data quality and control in data lake storage, since it adds transactional assurance to lake storage systems.
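If you are wondering how a log on top of immutable files can give you history and time travel, the toy example below may help. It is a heavily simplified sketch of the general idea, not the actual Delta protocol: JSON files stand in for Parquet, and the `_log` directory name, the file naming and the `commit` and `read` helpers are all assumptions made for illustration. It also only approximates atomicity, which a production format has to handle far more carefully.

```python
import json
import tempfile
from pathlib import Path

LOG_DIR = "_log"  # hypothetical directory name for this sketch


def _log_path(table_dir, version):
    return Path(table_dir, LOG_DIR, f"{version:08d}.json")


def latest_version(table_dir):
    """Return the highest committed version, or -1 for an empty table."""
    log_files = sorted(Path(table_dir, LOG_DIR).glob("*.json"))
    return int(log_files[-1].stem) if log_files else -1


def commit(table_dir, rows, remove=()):
    """Write rows to a new data file, then publish one log entry for it.

    Readers only see files that the log mentions, so the new data stays
    invisible until the entry exists. Opening the entry with mode "x"
    fails if that version number is already taken, so two writers cannot
    both claim the same version.
    """
    table = Path(table_dir)
    (table / LOG_DIR).mkdir(parents=True, exist_ok=True)
    version = latest_version(table) + 1
    data_file = f"part-{version:08d}.json"
    (table / data_file).write_text(json.dumps(rows))
    entry = {"add": [data_file], "remove": list(remove)}
    with _log_path(table, version).open("x") as handle:
        json.dump(entry, handle)
    return version


def read(table_dir, version=None):
    """Replay the log up to `version` (default: latest) and load the live files."""
    if version is None:
        version = latest_version(table_dir)
    live = []
    for v in range(version + 1):
        entry = json.loads(_log_path(table_dir, v).read_text())
        live = [f for f in live if f not in entry["remove"]] + entry["add"]
    rows = []
    for name in live:
        rows.extend(json.loads(Path(table_dir, name).read_text()))
    return rows


def demo():
    """Commit a row, 'update' it in a second commit, then read both versions."""
    with tempfile.TemporaryDirectory() as table:
        v0 = commit(table, [{"id": 1, "status": "new"}])
        # An update adds a rewritten file and removes the old one in one commit.
        v1 = commit(
            table,
            [{"id": 1, "status": "shipped"}],
            remove=["part-00000000.json"],
        )
        return read(table, v0), read(table, v1)


if __name__ == "__main__":
    print(demo())
```

Because old data files are never modified, reading an earlier version is just a matter of replaying fewer log entries, which is where the time travel comes from.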
If this hasn’t gotten you excited yet, I don’t know what will! A lot of cloud vendors are already providing lakehouse offerings (some more complete than others), allowing you to start building your data platform on lakehouse tech today. But as promising as this all seems, there is still room for improvement in certain aspects. Before you start building, keep in mind that some lakehouse components are still evolving and maturing. If you’re only looking for a proven and highly tested solution, you might be better off sticking with a traditional approach. Just be aware that you would be missing out on principles that could help keep your architecture future-proof for years to come, and you could have a migration on your hands before you know it.
I believe the wide adoption of this relatively new paradigm will keep development of lakehouse tech moving at a fast rate, since the underlying data management principles are solid. So if you can accept a little uncertainty in a few nonessential aspects, the data lakehouse is definitely worth considering over its predecessors.