November 17, 2022Making Everyone a Power Data Analyst with a Semantic Layer
This is the fifth blog in my series The Semantic Layer. Back to the Future: What was Old is New Again, where I discuss the seven core capabilities of a semantic layer. In this blog, I will dive deeper into the multidimensional calculation engine that serves as the query and calculation engine for a business-friendly view of data. It is the calculation engine that implements the semantic data model logic that we described in my last post: The Semantics of the Semantic Layer Part 4: Data Preparation.
As a reminder, the following diagram shows the six core capabilities for a semantic layer. This blog will focus on the “Multidimensional Calculation Engine”, highlighted in red:
For a semantic layer to function, it must translate the inbound, logical queries coming from the data consumers into physical SQL queries in the dialects of the underlying data platforms.
The semantic layer data model is ideally suited to define the digital version of the business, but it needs to be capable of expressing a wide variety of business concepts in a variety of contexts.
Tabular and Multidimensional
There are two types of semantic layers, or models, to consider: a tabular semantic layer and a multidimensional semantic layer.
The tabular or relational model was popularized by modeling gurus like EF Codd and Ralph Kimbal in the 70s and 80s. These modeling techniques rely on concepts like fact and dimension tables and are meant to make a relational database or data warehouse easier to query. A tabular view of the data is useful since it presents data as a simple, flatten view of data in rows and columns. This format is ideally suited for SQL-based use cases and tooling using common protocols like JDBC and ODBC.
The multidimensional data model goes one step further, though. By defining relationships and aggregation rules, the multidimensional semantic model adds a business friendly context and makes hand writing SQL either unnecessary or substantially more simplistic. For the widest range of uses and consumption styles, a multidimensional semantic layer offers more power in an easier to use package because it combines business-friendly metadata and data in one interface.
The relational database, with its row-oriented architecture, is well-suited to store transactional data, but relational databases are not well-suited to express business logic, since the business thinks in multiple dimensions or cells, instead of rows. It’s not an accident that the spreadsheet interface like Excel became the dominant tool for modeling business because it is designed for multidimensional analysis. The spreadsheet’s cell-based calculation architecture provides a full range of expressiveness that a row-based engine lacks. As such, the semantic layer must be capable of modeling data in cells, not rows.
The multidimensional expression language, or MDX, has hundreds of functions and operators to express a wide variety of cell-based calculations for expressing complex business logic. Let’s take a simple example of defining a metric for a 30 day moving average of sales. While this seems like a simple metric, it gets complicated when you wish to use the metric in queries with a variety of time periods (i.e. 30 day moving average by week, month, quarter, year, etc.).
Using a semantic layer with support for MDX, we can define this metric once using the following expression:
The above expression packs a lot of power in a compact package. In this example, we are using an AVG expression to create an average of the [Sales] metric, a LAG function to specific the lookback period (30 days), and a range expression using a “:” operator and the CURRENTMEMBER expression to dynamically calculate a range of values given a query context.
We can then use this common expression in a variety of queries without regard to its context. In the following example, we computed a 30 day moving average of sales by quarter for All-Purpose Bike Stands:
By simply changing [Order Quarter] to [Order Week] in the above query, the MDX expression would render the correct moving average automatically. Furthermore, changing the filter (All-Purpose Bike Stands) also returns the correct answer without further model modification.
Without a cell-based computation language, a SQL-based semantic layer would require hand coding each query variation and require a separate query definition for each unit of time. Doing so would require loads of custom SQL code that lacks reusability and is tough to maintain, making a SQL-based semantic layer engine a poor choice for expressing even simple business constructs.
The inventors of OLAP introduced the world to hierarchies in the 80s and it’s been a core feature for data visualization ever since. Hierarchies allow users to drill into data along a defined path, moving from less to more detail intuitively. For example, the product hierarchy below allows Excel Pivot Table users to choose their level of product detail and drill down for more granularity with a simple mouse click.
Hierarchies provide more value than just making data visualization more intuitive. Hierarchies also provide easy-to-use alternative representations of the same base data. For example, the date dimension in the model below (as viewed through Tableau) allows users to group order data by Year-Quarter-Month-Day, by Year-Quarter-Week-Day or by reporting period using a 4-4-5 calendar popular with the retail and manufacturing industries.
Without the ability to model hierarchies, end users lose critical functionality when using tools like Tableau, Power BI and Excel which support hierarchical visualizations.
Key Takeaway: A semantic layer must be backed by a multidimensional, cell-based engine to express complex business logic. Semantic layer solutions that use SQL-based calculation engines cannot express business constructs in a variety of contexts. A semantic layer must also support hierarchies to allow for intuitive drill paths and level relationships. Semantic layer solutions that only support dimensions and metrics do not provide an intuitive data navigation experience for end users.
“Anything by Anything”
Trying to guess what questions end users may ask of data is an exercise in futility. The needs of the business often outpace the ability for data teams to respond. It’s imperative, therefore, that a semantic layer be as flexible as possible in answering queries.
Recently, at the Coalesce 2021 Conference, Drew Banin, co-founder of dbt Labs, delivered an excellent presentation called “The Metric System”. Drew did a great job explaining the value of the semantic layer and he lays out the dbt Labs approach to delivering one.
In his presentation, Drew introduced the following example of a metric model:
I know this is meant to be a simple, illustrative example, but it (and similar metric layer solutions) demonstrates some serious shortcomings.
In the metric model example above, you will notice the following “filters” directive:
This directive limits the model to analysis by plan and country. While this is an illustrative example and more dimensions can be added to the model, ultimately this markup language is too simplistic.
A better approach is to use a multidimensional model that allows for “anything by anything” queries by supporting relationships between several entities (i.e. customer by location) with many-to-many, many-to-one and semi-additive relationships for expressing business rules.
For example, the multidimensional model below, based on an excerpt of the TPC-DS benchmark model, allows users to report sales and returns by source, by store, product, date, warehouse, customer and customer demographics in any combination desired.
The relationships between these business entities is crucial and SQL alone is too cumbersome for expressing these relationships.
Key Takeaway: A semantic layer must support a wide range of query patterns that are not constrained to a single view of data or subset of dimensions or metrics. Semantic layer solutions that define slices of data are not flexible enough to support a wide range of use cases and will force end users to wait for model owners to introduce new model views.
A Multidimensional Calculation Engine Is Critical
Besides serving as a metrics hub, a semantic layer —powered by a multidimensional calculation engine — creates a digital twin of your business. In order to express the complexities of business processes, the semantic layer platform must support dimensional expressions, hierarchies and entity relationships defined in a semantic model and backed by a graph-based query planner. In my next post, part six of eight, we’ll dive into the importance of the semantic layer’s performance management features.
In the meantime, If you are looking to skip ahead, I encourage you to read the full white paper “The Semantics of the Semantic Layer”.