In 1992, Arbor Software shipped the first version of Essbase, which stands for Extended Spreadsheet Database.
In 1998, Microsoft shipped OLAP Services with SQL Server 7.0, the product that later became Microsoft SQL Server Analysis Services (SSAS).
The time of multidimensional databases had fully arrived, and almost 30 years later these OLAP engines are still very much in play. For those who may be new to it, OLAP stands for Online Analytical Processing. If you want to totally geek out, have a look at Dr. Edgar F. Codd, who coined the term. Note: It doesn’t stand for OLD Analytical Processing!
Databases up to this point were essentially two-dimensional (records and fields) and required knowledge of a query language to retrieve data. With the arrival of OLAP, business users could ask questions of the data in a dimensional fashion, get speed-of-thought answers, and not have to learn any query language. Individuals could simply go into Excel and drill down, pivot, and swap dimensions and measures. It quickly became the default language of business. Instead of looking at records, fields, and facts, any user could come in and begin looking at Sales by Region, by Product, by Channel, by Market, by Scenario (Actual, Budget, Forecast, etc.), by Time Period, by, by, by... These “by’s” are what make OLAP multidimensional. The ability to drag and drop, drill down, and drill up, all within Excel with the click of a mouse (versus writing SQL), became the rage of the 1990s and in large part still is today.
Theoretically, these OLAP engines had no dimensional limit. Of course, in the real world it became evident very quickly that moving data off its source, the cardinality of dimensions, intensive calculations, and the sheer size of the data all played a key role in performance. In the early days, approaching 10 to 13 dimensions could, and very much did, begin to degrade performance, and not just query performance. Changing the “cube”, updating the data, and calculating data could begin to take hours, if not days. These challenges became acutely apparent as things progressed, and of course truly “big data” had not yet arrived.
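To make the “by” idea concrete, here is a minimal sketch of the kind of query a business user would otherwise have had to write by hand just to see Sales by Region by Scenario. The `sales` table, its columns, and the numbers are all made up for illustration, and an in-memory SQLite database simply stands in for whatever relational source a company might have had.

```python
import sqlite3

# Hypothetical two-dimensional "records and fields" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, scenario TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", "Actual", 120.0),
        ("East", "Budget", 100.0),
        ("West", "Actual", 80.0),
        ("West", "Actual", 20.0),
    ],
)

# "Sales by Region, by Scenario" expressed as SQL: one GROUP BY column per "by".
by_region_scenario = conn.execute(
    "SELECT region, scenario, SUM(amount) "
    "FROM sales GROUP BY region, scenario "
    "ORDER BY region, scenario"
).fetchall()

for region, scenario, total in by_region_scenario:
    print(region, scenario, total)
```

In an OLAP front end, the same answer comes from dragging Region and Scenario onto the grid, with no SQL in sight.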
MOLAP in the era of big data is simply unsustainable, for the same reasons that Essbase and SSAS (both MOLAP engines) became unsustainable late last century. MOLAP solutions move data and store every intersection of the cube. By “every intersection”, we mean every single member of every single dimension, by all measures. The result is what has typically been described as data explosion: cubes could get quite large. That explosion is part of what would begin to slow a cube down, and certainly part of why folks would start to limit the number of dimensions to 10 to 13 or fewer. This is no different today than it was then, except that big data makes all of these pains even more excruciating. This is why a hybrid approach such as the AtScale architecture is necessary.
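A bit of back-of-the-envelope arithmetic shows why “every intersection” hurts. The sketch below multiplies out the member counts of each dimension and the number of measures to get an upper bound on the values a fully materialized cube would hold. The dimension names and cardinalities are hypothetical, chosen only to illustrate the growth, and a real engine's storage will differ.

```python
from math import prod


def cube_cells(cardinalities, n_measures):
    """Upper bound on stored values if every intersection is materialized."""
    return prod(cardinalities.values()) * n_measures


# Hypothetical member counts per dimension (illustrative only).
dims = {
    "Region": 50,
    "Product": 1000,
    "Channel": 10,
    "Market": 100,
    "Scenario": 3,
    "Time": 36,
}

base = cube_cells(dims, n_measures=5)

# Adding a single 20-member dimension multiplies the whole cube by 20.
bigger = cube_cells({**dims, "Customer Segment": 20}, n_measures=5)

print(f"6 dimensions: {base:,} potential values")
print(f"7 dimensions: {bigger:,} potential values")
```

With these made-up numbers, six modest dimensions already imply 27 billion potential values, and one more dimension pushes that to 540 billion. That multiplication is the explosion.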
Challenges of OLAP on Big Data
Diving into a bit more detail, the data that was actually queried within these cubes was only a small subset of the overall cube, sometimes as little as 20% of its size. The remaining 80% was pure cost in management and performance degradation.
Fast forward. It’s the mid-2000s, and “Big Data” and “Hadoop” are steadily marching along. OLAP cubes are still in wide use and definitely “exploding” with data. What does “exploding” really mean? Yahoo! is running a 24TB Analysis Services cube that takes 7 straight days of non-stop compute to calculate and months to change in any way.
For many customers, “OLAP” and “Cube” had become dirty words. I know this because I lived it back in the late ’90s and early 2000s working with Essbase. I know it as well because when I started with AtScale in 2015, any prospect or customer that had any experience with “Cubing” immediately looked at me with suspicion, and rightly so! The questions were always the same:
How do you handle data explosion?
How do you handle big data?
How long does it take to process the cube?
How long does it take to make changes, like adding a dimension?
Do we have to move data? Do you support HBase? Do you support….?
How long, how long, how long….
It was easy to see that these OLAP admins and power users had been living with some pain. To add insult to injury, data was growing exponentially and business users still expected “speed of thought” response times. Of course, in many cases “speed of thought” had begun to slow down. Can we get it under 15 seconds? 10 seconds? 5? Speed of thought, a phrase coined back in the early ’90s, had morphed into “Can I get it by lunch?”
The answer was clear to the folks at Yahoo!: the world needed a new type of OLAP. So that’s what a few of them set out to build.
Towards A Brave New OLAP
Enter AtScale. How do we solve for truly large amounts of data, models with large numbers of dimensions and measures, and the absolute need for interactive query performance?
I would start meetings with those skeptical prospects with “This is not last century’s OLAP”, or with something along the lines of “The ’90s called and they want their OLAP back!” What I really meant was: this is not a MOLAP solution, it’s a hybrid.
So how do we take advantage of concepts like distributed compute? How do we keep the things that worked about Online Analytical Processing but lose the really painful stuff, all while taking big data into consideration? I’ve seen many attempts over the last few years, and most of them have failed or, worse, created even more complicated and extremely technical environments. For an example of a complicated environment, all one needs to do is look at the Big Data ecosystem. It’s enough to make one wish for the big-hair metal bands of the 1990s to come back.
Companies are taking their big data and making it small: pushing it to relational databases, moving it off platform, relying on many different technologies, and/or extracting it into various reporting tools. There, unfortunately, they are likely in violation of corporate governance and have the pleasure of fighting over whose numbers are the right numbers. It’s the same old rock band, but incredibly more complicated and incredibly more “BIG”. Worse, many vendors are pushing exactly this: move data, index data, calculate all intersections. It’s the same old MOLAP issues with a bigger problem: massive volumes of data.
So how does AtScale solve this problem? It’s quite simple: we create a virtual data warehouse, with “virtual” being the key. One might call us a virtual Hybrid OLAP (HOLAP). To the end user in Tableau or Excel, we look like the good old big hair band we all knew and loved back in the early ’90s, but without all the limitations, because we aren’t storing every single intersection. Business users can still see their business by Time, by Region, by Market, etc. They can still drill down, swap, and pivot, and more importantly they still have their performance.
This “virtual data warehouse” sits on a node of your data platform, where it intercepts inbound SQL or MDX from various BI tools and converts it into queries for Spark, Impala, Hive on Tez, Google BigQuery (GBQ), Redshift, etc., querying the data where it lies. AtScale creates intelligent aggregations that provide orders of magnitude higher query performance. We are leveraging the power of distributed compute and of these various platforms, and we are only storing smart slices of the cube that have actually been queried. These adaptive aggregates are stored right next to the raw data, meaning there is no data movement off the platform.
This means a couple of things. First, you gain all the advantages of a universal semantic layer, including but not limited to a single source of truth (no more fighting over whose numbers are correct) and data governance, since big data is NOT moved or reduced. Additionally, because of the virtual nature of AtScale cubes, you no longer have the restrictions of yesteryear. I’ve seen models with hundreds of dimensions and hundreds of measures. There are no “data explosions”, as AtScale’s intelligent aggregates only represent the 20-35% of the cube that is actually being queried.
You don’t find yourself in a situation where you have to buy a bigger or faster box in order to, in the words of Ricky Bobby, “go fast”. You don’t scale up, you scale across, again leveraging the power of distributed compute, or whatever the underlying platform is, and NOT bottlenecking yourself into one box that is trying to do all the heavy lifting and failing. As Impala, Spark, Hive on Tez or LLAP, GBQ, Redshift, Snowflake, or whatever platform you happen to be on gets faster, we get faster, whether it’s on premises or in the cloud. As you add new data nodes and compute power to the platform, we automatically get faster. If you decide, as one customer did, to move from an on-premises Hadoop cluster to Google BigQuery, it’s a simple plumbing redirect, not a months-long project with painful end-user disruption.
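For the curious, the idea of only storing the slices that actually get queried can be sketched in a few lines. To be clear, this is a toy, in-memory illustration of the general technique and not AtScale's implementation. The `AdaptiveAggregates` class, its methods, and the sample rows are all hypothetical. The first time a combination of dimensions is requested it scans the raw rows and keeps the result; later queries that need the same or fewer dimensions roll up from that stored aggregate instead of touching the raw data.

```python
from collections import defaultdict


class AdaptiveAggregates:
    """Toy query layer that materializes only the aggregates users ask for."""

    def __init__(self, rows, measure):
        self.rows = rows          # raw fact rows (list of dicts)
        self.measure = measure    # name of the numeric column to sum
        self.aggregates = {}      # sorted dim tuple -> {member tuple: total}

    def _rollup(self, source_rows, dims):
        totals = defaultdict(float)
        for row in source_rows:
            totals[tuple(row[d] for d in dims)] += row[self.measure]
        return dict(totals)

    def query(self, dims):
        dims = tuple(sorted(dims))
        # Stored aggregates whose dimensions cover the requested ones.
        covering = [k for k in self.aggregates if set(dims) <= set(k)]
        if covering:
            # Roll up from the smallest covering aggregate, not the raw rows.
            best = min(covering, key=lambda k: len(self.aggregates[k]))
            source = [
                dict(zip(best, members), **{self.measure: total})
                for members, total in self.aggregates[best].items()
            ]
            return self._rollup(source, dims), "aggregate"
        # Nothing suitable yet: scan raw data once and keep the result.
        result = self._rollup(self.rows, dims)
        self.aggregates[dims] = result
        return result, "raw"


# Hypothetical sample data.
rows = [
    {"region": "East", "product": "A", "sales": 10.0},
    {"region": "East", "product": "B", "sales": 5.0},
    {"region": "West", "product": "A", "sales": 7.0},
]

engine = AdaptiveAggregates(rows, measure="sales")
detail, detail_source = engine.query(["region", "product"])  # scans raw rows
by_region, region_source = engine.query(["region"])          # reuses aggregate
print(detail_source, region_source, by_region)
```

Nothing is precomputed for combinations nobody asks about, which is the opposite of calculating every intersection up front.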
It’s like you’ve got the best music from the early 1990s, without all the big hair. It’s a masterpiece movie from the past, digitally remastered to take complete advantage of modern capabilities.
It’s OLAP for the 21st century, without all the baggage.