When databases were first invented, data was small and storage and compute were expensive. The performance of early databases was achieved by tightly coupling the compute and storage layers, and for decades now databases have been squeezing every ounce of performance they can out of that design.
This is great when you have steady or at least predictable load on systems and when your data is relatively small.
Unfortunately, typical data volumes in the enterprise have grown dramatically. This growth is driven mostly by the proliferation of apps rather than by the volume any individual app generates.
What that means is that databases, relational or not, tend to be a great choice for most applications. The challenge with leveraging traditional databases as a data storage solution comes when data, oftentimes generated by multiple applications, must be brought together to drive operational insights.
The solution to this problem used to be relatively simple: take the data from the application’s operational database and put it in a data warehouse. This worked when operational databases had known schemas, when we had a small number of apps (remember when apps were built only for the most critical business processes?!?) and when only a small number of people were going to use the data in the warehouse, mostly for reporting purposes.
The rise of schemaless application databases, the sheer number of apps we’d like to derive insights from, and the democratization of analytics (from BI to data science) have complicated what was once simple. The business implication of neglecting this complexity is paying for compute on data that you’re not leveraging to drive insights.
Luckily, vendors have stepped up to the plate and provided a series of options to choose from when architecting data-intensive systems.
One option requires purchasing capabilities from two vendors: a computing framework vendor and a storage vendor. In the cloud, these tend to be a computational framework such as Spark and an object storage solution like S3, GCS, or Azure Blob/ADLS.
This approach is great if you’re going to be designing custom data-intensive applications to serve specific use cases. It also allows you to build compute-intensive logic across different storage solutions and can scale to some of the largest data-intensive requirements.
The downside of this approach is that it tends to be costly from a development, maintenance, and optimization perspective when you have simple use cases such as data replication, migration, or transformation.
Modern data warehouses have managed to dramatically decouple their storage and compute layers. BigQuery, Redshift, Snowflake, and Synapse all leverage cloud storage and independently scalable compute to drive insights from data. These solutions allow you to provision just enough compute for your needs at any given time and to scale it back when you don’t need it. This has allowed organizations to dramatically reduce the friction of deriving insights from data by making storage incredibly cheap, to the point of it becoming a non-issue.
Unfortunately, even modern data warehouses have their setbacks. First, their column-oriented data structures can make real-time ingest a challenge. Second, loading data into a warehouse tends to require specific methods to achieve optimal performance. Third, the examples mentioned above all have very different performance characteristics and tradeoffs; we’d recommend diving deep into these warehouses before making a selection. Finally, you generally interact with your data warehouse through SQL rather than a more general-purpose programming language like Python or Scala (though you can use libraries that simplify leveraging SQL from those languages). SQL as the main interface can be either a massive advantage or a limitation depending on the use cases at hand.
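To make the SQL-as-interface tradeoff concrete, here is a minimal sketch of the interaction model, using the standard library’s sqlite3 module as a stand-in for a warehouse connection (the `orders` table and its columns are hypothetical):

```python
# Sketch of the SQL-first interaction model. sqlite3 stands in for a
# warehouse connection; the `orders` table is a hypothetical example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 50.0)],
)

# The analytical work is expressed in SQL; Python only orchestrates.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 200.0), ('west', 50.0)]
```

Set-based transformations like this are where SQL shines; it is iterative logic (loops, branching, model training) where being confined to SQL starts to hurt.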
HTAP (hybrid transactional/analytical processing) systems are a new breed of databases that bring the best of transactional systems and data warehousing together. These systems support real-time operational use cases as well as large-scale analytics workloads.
These systems can be found in two main flavors:
Both of these flavors have their pros and cons, and a deep comparison of the two is outside the scope of this piece. What is important here is that HTAP systems allow you to scale compute and storage. The level of decoupling depends on the implementation, but you can generally get incredible flexibility and real-time performance from these new types of data systems.
Other technology vendors have also started to build decoupled systems. For example, MongoDB’s data lake solution allows you to leverage your MongoDB compute to extract insights from object stores.
Another alternative is virtualization of the data sources. This creative approach reframes the problem: instead of decoupling storage and compute at the analytics storage layer, why not leverage spare capacity at the source? Several tools accomplish this by sitting on top of your sources and using the resources available there for analytics purposes. The simplicity of this approach can be incredibly appealing at first glance, but it can prove disastrous in production environments where sources are resource constrained.
At the beginning of this article I said, “The challenge with leveraging traditional databases as a data storage solution comes when data, oftentimes generated by multiple applications, must be brought together to drive operational insights.”
I then went through a host of solutions that can help alleviate that problem, all of which use the same basic strategy: partial or complete decoupling of storage and compute. Each of these solutions has downsides: some are expensive to operate and maintain, some are too reliant on SQL, some struggle with real-time data ingest, and some are too simple to survive production environments.
All of these points of consideration are immensely important for one reason: bringing together data from disparate sources to drive operational insights is key to your business. There is no getting around it: today’s enterprise exists in a data-driven world.
Keep these things in mind when you’re thinking about architecting your data systems.
You can also check out our approach to solving challenges like getting teams the data they need to make decisions in formats they can work with.