The rapidly growing demand for business intelligence (BI) on big data is nothing new - the trend is clearly visible in the latest Big Data Maturity surveys (2015 and 2016). As shown in the graphic below, 75% of respondents plan to deploy BI workloads on their big data platforms, and 73% already have some BI use cases deployed.
A successful implementation of BI use cases on big data depends on several key functional and performance requirements. In this blog, we will take a look at how the innovative use of Apache Hive (backed by Druid indices) together with AtScale uniquely satisfies these requirements.
What’s Needed for BI on Big Data
A modern platform for BI on big data needs to satisfy a number of key criteria. While the list below is not exhaustive, some of these criteria include:
- The ability to support the most common data visualization tools, including Tableau, Microsoft Excel, Qlik, Spotfire, and others
- Support for the most common query languages used in business intelligence - SQL and MDX
- Interactive response times on the largest of data sets (on the order of billions of rows of data)
- User-friendly abstractions - showing measures, dimensions, and hierarchies instead of underlying database schemas
- A scale-out approach to supporting data growth and performance improvements
When we created AtScale, we kept the above requirements in mind. We focused our development efforts on innovating in areas where we didn’t see any existing solutions, and on partnering in areas where we felt there were existing technologies.
One of the key design goals of the AtScale architecture is to provide scale-out big data analysis capabilities by leveraging the performance and scale characteristics inherent in the underlying data platforms that we support. In essence, we believe that projects and software companies focused squarely on scale-out SQL-based data processing provide the perfect foundation for AtScale’s Adaptive Cache and Dimensional Calculation Engine. As a result of this approach, AtScale is able to benefit from the constant innovation happening in this rapidly developing space. For example, our recent BI-on-Big Data Benchmark found that the Hive SQL-on-Hadoop query engine demonstrated a 4X performance improvement within a development cycle of only six months!
Druid Integration with Hive…
As part of our close partnership with Hortonworks, we routinely discuss joint opportunities to make our vision of BI-on-Hadoop an even better experience for our shared customers. One such opportunity we identified was to leverage a performance-optimized storage and query layer called Druid. Without going into great detail, there are several characteristics that make Druid a great option for low-latency queries:
- Support for sub-second query response times on large aggregated data stores
- Built-in support for time-series analysis
- Excellent support for concurrent queries
- Horizontal scalability for terabyte scale data sets
To learn more about this open source project, you can visit http://druid.io.
Druid + AtScale = Really Fast OLAP
Because of these characteristics, we are very excited about Hortonworks’ investment to integrate Druid with the Apache Hive project. As discussed in Carter Shanklin and Slim Bouguerra’s great blog post, the integration of Hive and Druid further enhances Hive’s suitability for environments that require the interactive query response times of traditional OLAP applications, even on the largest of data sets. With this integration, AtScale will now be able to optionally store the aggregate tables that it maintains as part of the AtScale Adaptive Cache as Druid tables - all through the Hive query interface, as shown here.
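As a rough sketch of what this looks like at the Hive layer, the DDL below registers an aggregate table as a Druid data source using the Hive/Druid storage handler described in that blog post. The table and column names here are purely illustrative - they are not AtScale’s actual aggregate schema:

```sql
-- Illustrative example: materialize a daily sales aggregate as a
-- Druid-backed Hive table via the Hive/Druid storage handler.
-- Table and column names are hypothetical, not AtScale's real schema.
CREATE TABLE sales_agg_by_day
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "DAY")
AS
SELECT
  CAST(sale_date AS TIMESTAMP) AS `__time`,  -- Druid requires a timestamp column
  store_region,                              -- string columns become Druid dimensions
  product_category,
  SUM(sale_amount) AS total_sales,           -- numeric columns become Druid metrics
  COUNT(*)         AS sale_count
FROM sales_fact
GROUP BY CAST(sale_date AS TIMESTAMP), store_region, product_category;
```

Once created, the table can be queried through Hive with ordinary SQL, while the sub-second scans and aggregations are served by Druid underneath.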
Like Peanut Butter and Chocolate
As the same blog post notes, the Hive/Druid integration alone is not an OLAP platform, per se. For example, while Hive provides a SQL query interface to Druid tables, most enterprise-class OLAP solutions also need to support the MDX (Multidimensional Expressions) query language. Additionally, Druid and Hive on their own do not support required OLAP concepts such as hierarchies, multi-pass calculations, and distinct counts.
As such, we think the combination of AtScale + Hive/Druid makes an ideal platform for supporting OLAP-style analysis at Hadoop data scale. Big data practitioners and BI architects can use this combined solution to deliver a robust, scalable data platform for large-scale OLAP-style queries, while giving business users (and their BI tools of choice) a consistent and robust OLAP interface (using either SQL or MDX) that’s designed to intelligently deliver optimal query performance regardless of the underlying storage and query engine.
To find out more (and if you have questions), I’d like to invite you to join Hortonworks and AtScale for this webinar. Carter Shanklin and I will be hosting a conversation on EDW Optimization with customer Canadian Tire. The webinar format will be interactive, and we’d love to hear your thoughts and questions.