November 7, 2019Microsoft Azure’s Big Move in the Cloud with Synapse
“Speed of Thought Analytics” is a somewhat vague term that refers to the user experience of people interacting with data. Generically it can be taken to mean analytics performance that supports the navigation of data at a pace in sync with the cognitive reasoning of the user. This can vary by use case – a power analyst ripping through an ad hoc analysis in an excel pivot table is moving faster than a sales team reviewing a Power BI dashboard during a forecast meeting. But in both cases, sluggishness in querying data adversely impacts user experience and puts a drag on the process of creating data-driven insights. When query times stretch to multiple seconds, users notice and start thinking about alternatives.
Cloud data platforms have completely revolutionized enterprise data management. And the raw query horsepower of modern cloud platforms is astounding. But modern enterprise data sets are huge – maybe not classic “Big Data,” but they are pretty freaking big. Delivering consistent speed of thought performance across the spectrum of analytics and BI use cases is – as we say in Boston – still wicked hard.
This post will summarize why speed of thought analytics is so hard, what data teams do to deal with the problem, and [spoiler alert] how AtScale helps. This is a companion post to a more technical one on how the AtScale platform works to accelerate BI and analytics query performance.
First off, it’s important to understand the challenges of data gravity. While the electronic 1s and 0s that comprise our enterprise data don’t weigh anything, the storage gear they sit on in the data center does. And moving those 1s and 0s very far from their resting place – even through the thick fiber optic pipes they can move around the world on – has physical limits. Getting data from the cloud data center to the Tableau workbook on your screen in the specific form you requested takes work. The decisions of where and how that work gets done (i.e., how queries are generated, where they are executed, how results are moved) dictate the performance perceived by the end user.
The oldest trick in the book to manage the data gravity problem is to move subsets of data from the data platform to a location more proximate to the end user (in extreme cases, maybe a CSV or spreadsheet on their local machine). Smaller data extracts can be more easily prepped and analyzed. The obvious limitations to this approach are:
- The data being analyzed is not automatically refreshed as new data comes in
- Local manipulations need to be repeated with every new extract
- The success and accuracy of this approach cannot be governed – either for accuracy or access control
- Data scale is limited to the power of the end user’s hardware and tools
The Physics of Full Table Scans
One of the main reasons to centralize data on a modern cloud data platform is to leverage the power of cloud query engines. Cloud query engines are the powertrains of any analysis of data sitting on the platform. They are fueled by the massive computing power of cloud data centers and can elastically scale on demand. Performance is certainly impacted by the power of the engine as well as the efficiency of the code that is executed on the engine. But no amount of power or sophistication of SQL can get around the physics of scanning through billions of rows of data.
BI and analytics are composed of “cuts” of data displayed as summary tables or charts. These cuts represent slices of data by summarizing metrics by dimensions (e.g., total revenue by quarter, average selling price of a product, total shipments by region). Analysts and decision-makers interacting with data move through it by specifying new cuts, which could include applying filters, adding or removing dimensions, or drilling up or down to different levels of detail. Each cut represents a new query. Executing every single query initiated by a data consumer with a full table scan would result in comically slow performance. The classic solution to this challenge is the same as above: work with an extract of data with the same limitations.
The term Online Analytical Processing (OLAP) was coined in 1993 to describe an alternative data structure to the normalized data tables used to efficiently store and manage raw application data. OLAP data applies the principles of dimensional analysis to define measures (i.e metrics) and dimensions (i.e. drill-down categories) in a structured way that 1) limits user interaction with data in such a way that the results can be guaranteed to make sense and 2) the types of user-requested data cuts are predictable enough to pre-calculate summary statistics that can be staged as “aggregates.” A traditional OLAP cube is a data structure that has all possible data cuts pre-calculated and staged as aggregates. The benefit is that when a data consumer requests a cut, there is no need to scan the raw table. The data cut is simply delivered to the user’s table or chart.
Traditional OLAP is challenged to deal with the scale of modern enterprise data. The scale problem comes from the size of transactional tables coming from applications (exceeding billions of rows), the multiple applications that create data (CRM, Finance, ERP, Support, HR, etc..) and the number of dimensions that can be used to categorize data. OLAP solutions like Microsoft SQL Server Analysis Services (SSAS) choke on cloud-scale data. The solution is the same as discussed above: analysts rely on extracting slices of data into SSAS so it can effectively scale. This approach brings the same limitations of disconnectedness, complexity of maintenance, and lack of centralized governance.
Aggregates and Materialized Data Structures
While traditional OLAP approaches do not support the needs of modern cloud-scale businesses, some of the principles of accelerating BI and analytics performance carry over. Modern data teams use different approaches to identifying aggregates that should be staged to answer queries of analysts without performing full table scans. These aggregates can be identified based on obvious business needs (e.g., predicting that analysts will want to see revenue by week, month, and quarter) or by automated behavioral analysis (e.g., if users continually request shipments by state, create an aggregate).
No matter which techniques are used to identify the value of creating an aggregate, it still needs to be created, staged, and maintained in a physical location. Data engineers will create, or “materialize,” either a new table or a view to store these aggregates. They will build scripts to keep the aggregate updated on some schedule and they will redirect query traffic to the materialized data structure when appropriate. This approach can approximate the performance of traditional OLAP while keeping data consumers directly linked to live cloud data vs. local extracts. The challenge is that it requires a sophisticated data engineering team to constantly monitor performance and decide when to create new aggregates and where to materialize data structures to store them. In addition, given the wide variety of consumer queries, engineering teams often need to create super-aggregates that serve a broad range of queries. This results in larger tables (and large table scans) and sub-optimal performance.
Dimensional Analytics, Semantic Models, and Analytics Query Acceleration
While OLAP as a data engineering strategy does not translate well to the modern world, three fundamental concepts that emerged in parallel still serve us well.
Dimensional Analytics refers to the analysis paradigm where we put guardrails around how data consumers can interact with analytics. The objective is not to constrain or limit the experience of analysts, but to ensure that the cuts they request from data correspond to the right way to look at a metric and the right way to drill up or down on a dimension. Dimensions such as time, geography, product, and customer can be structured into hierarchies that are governed by a central analytics team within a dimensional model. This ensures that data consumers have a single, consistent way to look at aggregations at different levels in a dimensional hierarchy.
A Semantic Model is the logical data model that links the view of data that an analyst sees to the physical structure of the raw data. The idea is to let data consumers interact with a business-oriented representation of data (e.g., revenue by month or quarter) while defining the query logic to pull the summarized data from the raw data store. Business metrics can be defined and standardized in a semantic model to simplify and reduce the risk of inconsistencies. Metrics may be simple counts or sums of units or dollars, or they can be more complicated calculations like gross margin or average selling price. The goal is to make data more accessible to a wider range of users and to ensure consistent results across different uses.
Analytics Query Acceleration can leverage the concepts of dimensional analytics and semantic models to provide a map or graph to inform aggregate creation and materialized views. Graph-based query optimizers automate aggregate selection and creation to keep aggregate tables efficient, reusable, and as small as possible.
How AtScale Accelerates Analytics on Cloud Data Platforms
This post has been a short summary of the basic premise behind AtScale’s unique implementation of a semantic layer platform. The approach packages a combination of capabilities that:
1) Empowers self-service analytics by creating a single source of governed enterprise metrics and analysis dimensions
2) Ensures a high-performance analytics experience for data consumers working with common BI platforms (including Excel, Power BI, Tableau, and Looker) on major cloud data platforms (including Snowflake, Databricks, Google BigQuery, Amazon Redshift, and Microsoft Azure Synapse).
One of AtScale’s most common use cases is helping organizations modernize their OLAP strategies by seamlessly moving away from Microsoft SSAS to cloud-first strategies. Check out our recent analytics performance benchmarks for more details.