How to Improve BI Performance on Hadoop

I was late to the game when I unboxed my first iPhone in 2008. I was a mechanical engineering student at the time, and it wasn’t an easy decision to spend two months’ food budget on a “non-essential gadget”. Thankfully, it turned out to be much more than a toy; realizing the enormous potential of what you could do with data led me to embark on a business intelligence career and abandon my dream of creating the next fastest roller coaster.

Drowning in data but starving for insight

Companies aim to grow and increase revenue. Understanding customers and adopting Business Intelligence (BI) strategies have therefore become essential tasks. The concept behind BI is to use technology to gather, organize, analyze, and draw insight from data, and then to turn that insight into action plans that improve revenue and efficiency. When BI became mainstream in the early 1990s, enterprises stored data in one of the few databases then available, such as Oracle, Teradata, IBM DB2, or SQL Server. Only individuals with very specific technical skills were able, and allowed, to touch the data. To analyze the data or report on it, companies chose from a handful of full-service analytical reporting tools such as MicroStrategy, Business Objects, and Cognos.

Data size outgrowing traditional databases

I spent some time working at one of the large traditional database companies, whose mission was to provide a database focused on enterprise business analytics at large scale. The problem I ran into most often on customer projects was that data was growing in volume, velocity, and variety much faster than any existing tool could handle. This was around the time when Hadoop and other data lakes started to appear. The idea was to lower cost by running on commodity hardware and to make the clusters easy to scale. This was a great step forward on the infrastructure side. On the BI side, however, the full-service reporting tools could not query the data lakes efficiently, if at all.

The Rise of Self Service BI

Due to the complexity of traditional BI tools, data was still controlled by IT and data scientists. Business users would routinely submit requests for reports, and based on the requirements, the IT team would have to extract, transform, and load the data into an Operational Data Warehouse (ODW), then build static reports and dashboards for the business users. Curating datasets and running queries became a burden on IT, and the delay between a business user’s request and the finished report led to missed business opportunities. The idea of empowering business users to analyze data themselves began to surface. Tableau, Qlik, and Power BI are some of the major players in what has become known as Self Service BI.

So Data Marts/TDE/QVD?

Because these BI tools face performance and concurrency challenges when querying data lakes directly, they typically come with a hardware component that IT has to manage separately. Since not all data could be moved into the data mart, companies faced a difficult decision: give up data fidelity in exchange for performance and concurrency, which would effectively make their large quantities of data much smaller. Moving data between systems also creates latency and security risks. On top of that, the separate hardware servers for data marts are expensive and often proprietary.

What are the options?

The founders of AtScale all worked as technical contributors at various data platform and BI companies. They lived through these exact problems and came up with a solution that addresses each of them. AtScale allows any BI tool to share a Universal Semantic Layer; transformations, calculations, and aggregations only have to be defined in one place. Because AtScale understands both SQL and MDX, it can connect to any BI tool via JDBC, ODBC, XMLA, or REST protocols.
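To make the "defined in one place" idea concrete, here is a minimal Python sketch of a shared measure registry. It is only an illustration under stated assumptions, not AtScale's actual model format or API: the `Measure` class, the `MEASURES` registry, the `build_query` helper, and the `sales` table with its columns are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    name: str        # name exposed to every BI tool
    expression: str  # SQL expression, written once


# Hypothetical shared definitions; table and column names are made up.
MEASURES = {
    "total_sales": Measure("total_sales", "SUM(amount)"),
    "order_count": Measure("order_count", "COUNT(DISTINCT order_id)"),
}


def build_query(table, dimensions, measure_names):
    """Render a GROUP BY query from the shared measure definitions."""
    select_parts = list(dimensions)
    for name in measure_names:
        measure = MEASURES[name]
        select_parts.append(f"{measure.expression} AS {measure.name}")
    sql = f"SELECT {', '.join(select_parts)} FROM {table}"
    if dimensions:
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql


print(build_query("sales", ["region"], ["total_sales"]))
```

Every tool that asks for `total_sales` gets the same expression, so a change to the definition only has to be made in the registry.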

How does AtScale work with Hadoop?

AtScale is deployed on an edge node in your Hadoop cluster. A browser-based UI lets your data architects build OLAP data models visually from tables in the Hive metastore, and it does not matter whether the underlying data is structured in OLTP or third-normal form. Since AtScale uses a ROLAP architecture, BI users can begin querying the semantic layer right away; there’s no need to wait for cubes to fully build before querying.

What about performance?

AtScale improves performance by optimizing SQL and MDX queries from the BI tools and converting them into SQL-on-Hadoop queries. As BI users query the semantic layer, AtScale takes a heuristic approach and learns from the shape of their reports (dimensions, measures, filters, etc.). AtScale then creates and maintains aggregate tables, which are stored back in your data lake as Hive tables. When later queries come in from any BI tool, AtScale looks through the available aggregate tables and answers from them where it can, without full table scans or large joins. This not only ensures fast query response times, but also allows for much higher concurrency.
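The aggregate lookup described above can be sketched in a few lines of Python. This is a simplified illustration, not AtScale's actual algorithm: the `Aggregate` class, the `choose_table` function, and all table names are hypothetical, and the sketch assumes additive measures (such as sums and counts) that can be rolled up from a finer-grained aggregate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aggregate:
    table: str             # hypothetical aggregate table name
    dimensions: frozenset  # dimensions the aggregate is grouped by
    measures: frozenset    # additive measures it stores
    row_count: int         # used as a rough cost estimate


def choose_table(query_dims, query_measures, aggregates, fact_table):
    """Pick the smallest aggregate that covers the query, else the fact table."""
    dims = frozenset(query_dims)
    measures = frozenset(query_measures)
    # An aggregate covers the query if it keeps every requested dimension
    # and measure; extra dimensions can be rolled up at query time.
    candidates = [
        agg for agg in aggregates
        if dims <= agg.dimensions and measures <= agg.measures
    ]
    if not candidates:
        return fact_table  # no usable aggregate: scan the raw data
    return min(candidates, key=lambda agg: agg.row_count).table


aggregates = [
    Aggregate("agg_region_month", frozenset({"region", "month"}),
              frozenset({"total_sales"}), 1_200),
    Aggregate("agg_region", frozenset({"region"}),
              frozenset({"total_sales"}), 50),
]

print(choose_table(["region"], ["total_sales"], aggregates, "fact_sales"))
print(choose_table(["customer"], ["total_sales"], aggregates, "fact_sales"))
```

The first query is served from the smallest covering aggregate, while the second falls back to the fact table because no aggregate keeps the `customer` dimension. A query pattern like the second one is the kind a learning system would use as a signal to build a new aggregate.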

Wait, did you say security risks?

Enterprises put a lot of effort into securing the data lake cluster; however, when data leaves the cluster for data marts and extract systems, security policies have to be defined and maintained in both places. Data at rest and in transit between the systems also has to be encrypted, and secure transport protocols have to be ensured. This creates a high level of complexity and many chances for error. AtScale is deployed as part of your Hadoop cluster, and it integrates with LDAP, Active Directory, Kerberos, Sentry, and Ranger for user authentication and authorization. Data stays in the cluster, aggregates created by AtScale are stored back in the cluster, and BI tools query AtScale directly without moving any data out of the cluster. Admins also have complete control over users’ data access at a granular level across all data platforms.

During AtScale’s six-year history, we have been able to help some of the largest enterprises in the world, including Visa, Toyota, TD Bank, DBS, Aetna, and more. They were all experiencing the same issues we had faced, and our vision resonated with theirs. We are growing with our customers and are currently working on future-proofing data architectures. As new platforms and technologies emerge, AtScale is being designed to embrace data virtualization. We have expanded our supported platforms beyond Hadoop to Google BigQuery and Amazon Redshift, with more to come in the near future.
