May 12, 2022
The Semantics of the Semantic Layer Part 4: Data Preparation
I co-founded AtScale to focus on the challenges of supporting a large number of data analysts working on disparate sets of data managed in a massive lake. We borrowed the term “semantic layer” from the folks at Business Objects who originally coined it in the 1990s. The term was actually over 20 years old when we adopted it.
So what is a semantic layer exactly? If you Google the term, the following definition will pop up, which is a pretty darn good definition in my opinion (Google’s highlighted words, not mine):
Wikipedia defines a semantic layer as a business representation of data that allows end users to access data autonomously. Everyone can agree that a business-friendly view of data that provides users with self-service access to analytics is desirable — true data democratization. It’s easy to see why it is fundamental to scaling data and analytics.
The challenge is actually implementing a semantic layer in a way that just works.
We began building the AtScale semantic layer after working on big data from the trenches. We had to deal with the basic challenges of data scalability, query performance, metrics sprawl, complicated data pipelines and shadow business intelligence (BI). While the challenges seemed obvious to us, most of the industry was preoccupied with shifting data gravity to the cloud. With cloud data re-platforming in full swing, we are finally seeing attention turning to the last mile of enterprise analytics with the semantic layer topic surging in popularity.
The Semantics of a Semantic Layer
Cloud giants like Google and Snowflake, the unicorns like dbt Labs and a host of venture-backed startups are now talking about this critical new layer in the data and analytics stack. Some call it a “metrics layer”, or a “metrics hub” or “headless BI”, but most call it a “semantic layer”. I can’t tell you how happy it makes me that the industry is finally recognizing the importance of the semantic layer in a modern, cloud-first analytics stack. I couldn’t agree more that a logical, business-friendly view of data is what’s needed to make analytics accessible to everyone, not just data engineers and SQL jockeys.
While it might be just a matter of semantics, I prefer “semantic layer” over “metrics layer”, “metrics hub”, or “headless BI.” I think that the term “semantic layer” best describes this business-friendly data interface because it covers all types of data and use cases.
For example, the terms “metrics store” and “metrics layer” ignore the concept of “dimensions” altogether. Take a look at just about every BI tool on the market (e.g. Tableau, Looker, Power BI) and they all include measures (or metrics) and dimensions in their interfaces. Metrics measure something, while dimensions (e.g. “product”, “time” and “location”) categorize data so that metrics can be grouped and aggregated. So, terms using “metric” are confusing and don’t map to how these layers will be consumed.
The term “headless BI” is also problematic because it only covers business intelligence use cases. A universal data layer is useful to more than just business analysts and BI. Data scientists need to access a consistent business-friendly interface to data for building and training their models. Furthermore, application developers who are building data-driven applications also need interfaces to data. As such, the term “headless BI” is inadequate because it only covers a single use case: business intelligence.
More Than Just Metrics
There’s a reason why independent semantic layers have taken time to come to market — building a semantic layer is hard. Yes, a semantic layer serves as a common metrics store or single source of truth, but there’s much more to it than that. For a semantic layer to be viable, it needs to:
- Support any query tool, interface or protocol with a live connection to data
- Express the most complex business logic (serve as a digital twin) using a semantic model
- Return query results in under 2 seconds
- Govern access to data for every query
- Connect to any backend data store
The 7 Requirements of a Semantic Layer
The following diagram illustrates the core capabilities of a semantic layer:
1. Consumption Integration
For a semantic layer to be truly universal, it needs to support “live” query connections for everyone. This means a semantic layer must meet the following requirements for connecting users to data:
- Multimodal: It supports a variety of use cases and personas, including the business analyst, data scientist and application developer.
- Open: It supports a wide range of query tools using their native protocols, including SQL, MDX, DAX, Python, REST, JDBC and ODBC.
- Lightweight: It has a “zero footprint” on end users’ machines, which means that no client-side software or plugins are necessary to access the semantic layer.
2. Semantic Modeling
The core of the semantic layer is the data model. A business-friendly data layer cannot exist without a map of the logical elements (dimensions, metrics, hierarchies, KPIs) to the physical entities of databases, tables and relationships. In order to deliver a digital twin of the business, a semantic layer must meet the following requirements:
- Object-oriented: It supports reusable models and components to drive a hub and spoke analytics management style.
- Multi-source: It supports the ability to blend data from multiple sources in a model.
- Smart push down: It pushes work down to the data platforms for optimal performance.
- Plug & play: It supports pre-built models for known schemas (e.g. SaaS apps, 3rd party datasets).
- Programmable: It supports a CI/CD compatible markup language and shares its metadata via APIs with data catalogs & other tools.
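To make the idea of mapping logical elements to physical entities concrete, here is a minimal sketch in Python. It is not AtScale's modeling language; the class names, the `sales_fact` table and its columns are all hypothetical, chosen only to show how a business-friendly request can be translated into SQL against a physical table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str    # business-friendly name shown to users
    column: str  # physical column it maps to (hypothetical)


@dataclass(frozen=True)
class Metric:
    name: str         # business-friendly name shown to users
    column: str       # physical column it maps to (hypothetical)
    aggregation: str  # SQL aggregate function, e.g. "SUM" or "AVG"


@dataclass
class SemanticModel:
    table: str        # physical table behind the model (hypothetical)
    dimensions: dict  # logical name -> Dimension
    metrics: dict     # logical name -> Metric

    def to_sql(self, metric_names, dimension_names):
        """Translate a logical request into SQL against the physical table."""
        dims = [self.dimensions[n] for n in dimension_names]
        mets = [self.metrics[n] for n in metric_names]
        select = [f'{d.column} AS "{d.name}"' for d in dims]
        select += [f'{m.aggregation}({m.column}) AS "{m.name}"' for m in mets]
        sql = f"SELECT {', '.join(select)} FROM {self.table}"
        if dims:
            sql += " GROUP BY " + ", ".join(d.column for d in dims)
        return sql


model = SemanticModel(
    table="sales_fact",
    dimensions={"Region": Dimension("Region", "region_cd")},
    metrics={"Revenue": Metric("Revenue", "amount_usd", "SUM")},
)
sql = model.to_sql(["Revenue"], ["Region"])
print(sql)
```

The user asks for "Revenue by Region" and never sees `region_cd` or `amount_usd`. Because the model is plain data, it can also be versioned, reviewed and reused, which is the point of the object-oriented and programmable requirements above.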
3. Data Prep Virtualization
Data transformation is a requirement for a semantic layer. A semantic layer platform should support virtualized calculations for expressing business logic and be capable of generating multi-pass SQL queries to handle calculations that require different levels of granularity, like ratios and weighted averages. The semantic layer engine must deliver the following capabilities:
- Open: It supports transformation expressions using the native platform’s SQL dialect & MDX expressions.
- Multipass: It supports the ability to perform pre-query & post-query calculations for handling calculations at different levels of granularity.
- Virtualized: It supports inline transformations using direct queries without data movement or creating c.
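A small worked example shows why ratios need more than one pass. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the data platform; the `orders` table and its values are invented for illustration. It compares a naive single-pass query with a two-pass query that first aggregates to the region grain and then computes the ratio.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, revenue REAL, profit REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("East", 100.0, 50.0), ("East", 900.0, 90.0), ("West", 200.0, 40.0)],
)

# Naive single pass: averages the row-level ratios, ignoring order size.
naive = dict(conn.execute(
    "SELECT region, AVG(profit / revenue) FROM orders GROUP BY region"
))

# Multi-pass: pass 1 aggregates to the region grain,
# pass 2 computes the ratio on the aggregated values.
multipass_sql = """
WITH pass1 AS (
    SELECT region, SUM(revenue) AS revenue, SUM(profit) AS profit
    FROM orders
    GROUP BY region
)
SELECT region, profit / revenue AS margin
FROM pass1
"""
multipass = dict(conn.execute(multipass_sql))

print(naive)
print(multipass)
```

For East, the naive query averages a 50% margin and a 10% margin into roughly 30%, while the two-pass query divides total profit (140) by total revenue (1,000) and returns roughly 14%. The second figure is the revenue-weighted margin that a business user would expect, and nothing had to be copied or materialized to produce it.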
4. Multi-dimensional Calculation Engine
The semantic layer data model must be backed by a scalable, multi-dimensional engine to express a wide range of business concepts in a variety of contexts. The semantic layer engine must deliver the following capabilities:
- Cell-based: It supports matrix-style calculations (time intelligence, multi-pass, etc.) without pre-calculation using a multidimensional expression language like MDX or DAX.
- Graph-based: It supports thousands of dimensions, attributes and metrics using a graph-based query planner.
- Dynamic: It supports “anything by anything” queries with constraints and filters applied at query time.
5. Performance Optimization
A semantic layer must accelerate the performance of the underlying data platform in order to deliver “speed of thought” queries. Without acceleration, a semantic layer will likely be bypassed using BI tool extracts and imports, defeating the purpose of a semantic layer. As such, a semantic layer must include the following capabilities:
- Autonomous: It automatically tunes and improves performance using machine-learning and user query patterns.
- In Situ: It improves performance without moving data outside the native data platform or requiring a separate cluster for managing aggregates.
- Adaptive: It is always learning and improving performance based on system performance and user query behaviors.
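As a rough illustration of learning from user query patterns, the sketch below picks an aggregate to build from a query log and checks which later queries that aggregate could serve. It is a deliberately simple frequency count, not a machine-learning model and not a description of how any product works; the function names and the log format are assumptions, and the subset rule only holds for additive metrics such as sums.

```python
from collections import Counter


def suggest_aggregate(query_log):
    """Return the most frequently requested set of dimensions."""
    counts = Counter(frozenset(dims) for dims in query_log)
    dims, _ = counts.most_common(1)[0]
    return dims


def can_answer(aggregate_dims, query_dims):
    """An aggregate can serve a query if it keeps every dimension the query uses."""
    return set(query_dims) <= set(aggregate_dims)


# Hypothetical log: the dimensions each past query grouped by.
query_log = [
    ["region", "month"],
    ["month", "region"],
    ["region"],
    ["product", "month"],
]

aggregate = suggest_aggregate(query_log)
print(sorted(aggregate))
print(can_answer(aggregate, ["region"]))
print(can_answer(aggregate, ["product"]))
```

Here the region-by-month combination is requested most often, so it becomes the aggregate. A query by region alone can be rolled up from it, while a query by product still has to go to the detailed data.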
6. Analytics Governance
There’s a range of tools and products that act as a query layer to govern data access. Data governance is a core requirement for a semantic layer and thus must be built into it, rather than handled by a separate layer. For a semantic layer to satisfy a wide range of data governance needs, it must deliver the following capabilities:
- Integrated: It integrates with corporate directory services (e.g. AD, LDAP, Okta) for user identity management.
- Row & Column: It applies row-level filtering & column-level masking to every query based on user, group and role-based (RBAC) data access rules.
- Real-time: It enforces governance continuously and in real time at the time of query.
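The row and column rules above can be sketched as a query rewrite applied at query time. The example below is a simplified illustration, not a real product API: `AccessPolicy`, `governed_sql`, the `customers` table and the role name are all hypothetical, and a production system would resolve the policy from the directory service and avoid building predicates by string concatenation.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    row_filter: str = ""  # SQL predicate added to every query for this role
    masked_columns: set = field(default_factory=set)  # columns to hide


def governed_sql(table, columns, policy):
    """Build a SELECT that applies the policy's column masks and row filter."""
    select = [
        f"'***' AS {col}" if col in policy.masked_columns else col
        for col in columns
    ]
    sql = f"SELECT {', '.join(select)} FROM {table}"
    if policy.row_filter:
        sql += f" WHERE {policy.row_filter}"
    return sql


# Hypothetical role-to-policy mapping.
policies = {
    "emea_analyst": AccessPolicy("region = 'EMEA'", {"email"}),
    "admin": AccessPolicy(),
}

governed = governed_sql(
    "customers", ["name", "email", "region"], policies["emea_analyst"]
)
print(governed)
```

Because the rewrite happens on every query, the same rules apply no matter which tool issued it, which is what makes the governance real-time rather than a periodic audit.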
7. Data Integration
In modern data and analytics ecosystems, data lives in multiple silos, including on-premise, legacy data warehouses, data lakes, cloud data warehouses and SaaS applications like Salesforce. A semantic layer must be capable of accessing and modeling data across multiple sources with the following capabilities:
- Data Platform Optimized: It works with a variety of data silos equally well by supporting native platform dialects and optimizations.
- Virtualized & Federated: It supports blending of data across multiple data platforms and minimizes data movement with query push down. Where necessary, a semantic layer may work in conjunction with data virtualization engines that make it possible to federate large data sets that reside on physically disparate platforms.
- Extensible: It supports a variety of data types including nested data structures like JSON and supports native data platform extensions.
Not As Easy as It Looks
As you can see, building a semantic layer platform is not simply a matter of defining metrics with a cool new markup language. For a semantic layer to be practical and usable, it needs to:
- Be capable of expressing your most complex business constructs
- Be able to perform better than your underlying data platforms
- Be able to connect live to all your data platforms
- Be able to connect to all your data consumption tools
- Be able to govern every query at the user level
- Be able to scale to everyone in your business
If any of these semantic layer requirements is missing, working groups or individual users will inevitably and necessarily create a localized data set with a localized semantic model that fits their purpose. And then the semantic layer crumbles. In other words, it’s binary — it either works 100% or it doesn’t work at all. Therein lies the challenge for anyone building a universal semantic layer from scratch. It’s not good enough to deliver an MVP that sort of works and can be enhanced as you go. The MVP is not an MVP — it just needs to work completely on day one or no one will bother using it.
More to Come…
The team at AtScale has spent more than a decade working to deliver the vision of a universal semantic layer and making it work for real, demanding customers. A universal semantic layer has become a critical component in the modern data and analytics stack. We could not be more pleased to see our industry partners (and competitors) agree.
In the next few weeks, I will be diving into greater detail about each of the seven capabilities of a universal semantic layer. For more information on the semantic layer, download the white paper “The Semantics of the Semantic Layer”. It dives deep on the 7 key requirements and shares a decade of experience making it work for real, demanding enterprise customers.