May 26, 2021
A Semantic Layer for Shared Data with Secure, Open Source Delta Sharing
Data virtualization refers to a general technology approach of abstracting data away from its physical sources, such as data warehouses, data lakes, and application data, without copying or moving it. Data virtualization solutions are generally grouped with other data integration technologies, including ETL and ELT approaches.
AtScale is sometimes grouped in with traditional data virtualization vendors (for some examples, see this GigaOm Radar report on Data Virtualization) based on the approach we use to connect a semantic layer to raw data sources. While data virtualization concepts developed somewhat in parallel to semantic layers, it is useful to examine the origins and primary use cases of pure-play data virtualization solutions. Doing so makes it easier to understand how a semantic layer uniquely supports modern BI and analytics programs built on cloud data platforms.
Evolution of Data Virtualization
The term data virtualization originated in the late 2000s, initially describing a broader set of capabilities that included query federation, a concept that had emerged in the 1980s. The basic approach was to construct a virtual, logical data warehouse providing a single interface to retrieve data from multiple databases without any “physical” data integration. Queries could be generated at the virtual level and “passed” through to the underlying databases, with results returned to the user or client making the request.
The approach has been used for analytics use cases, reporting, data mining, and application design. It is sometimes positioned as an alternative to building a data warehouse, most notably when paired with a modern data lake infrastructure as a “Lake House.”
Along with adjacent ETL/ELT technologies, data virtualization has become a foundational technology in modern data infrastructure. Data virtualization has the advantage of not being subject to the laws of “data gravity,” since no physical data (except query results) is moved. The concept has become even more attractive with the rise of big data analytics and cloud data platforms, where it can be difficult (read: complicated, expensive, and slow) to physically move data for all data consumption use cases. The entire data integration category, including data virtualization, is an increasingly relevant component of the modern data fabric. When it comes to BI and analytics use cases, however, traditional data virtualization approaches are limited in their ability to support performant, multi-dimensional analysis.
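The pass-through pattern described above can be sketched in a few lines. This is a minimal illustration only, not any vendor's implementation: two in-memory SQLite databases stand in for separate physical sources, and the table names, sample rows, and the `revenue_by_region` helper are all hypothetical.

```python
import sqlite3

# Two independent in-memory databases stand in for separate physical sources.
crm = sqlite3.connect(":memory:")
crm.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC'), (3, 'EMEA');
""")

erp = sqlite3.connect(":memory:")
erp.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 2, 250.0), (12, 3, 50.0), (13, 1, 25.0);
""")


def revenue_by_region(crm_conn, erp_conn):
    """Hypothetical virtual-layer query: push one query down to each source,
    then combine only the returned result sets in the federation layer."""
    # Pushed down to the ERP source: aggregate there, move only the results.
    totals = dict(erp_conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    ).fetchall())
    # Pushed down to the CRM source.
    regions = crm_conn.execute(
        "SELECT customer_id, region FROM customers"
    ).fetchall()

    # The "join" happens virtually; neither source table is copied.
    result = {}
    for customer_id, region in regions:
        result[region] = result.get(region, 0.0) + totals.get(customer_id, 0.0)
    return result


federated_result = revenue_by_region(crm, erp)
print(federated_result)
```

The caller sees a single interface (`revenue_by_region`), while each underlying database answers only the part of the query it owns. Only query results cross the boundary between source and virtual layer.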
The Co-Evolution of the Analytics Semantic Layer
The term semantic layer emerged in the early 1990s with the original Business Objects patent describing the concept as a way of providing “a new data representation and a query technique, which allows information system end users to access (query) relational databases without knowing the relational structure or the structure query language (SQL).”
This approach established a whole school of thought around making dimensional analysis accessible to business analysts with limited understanding of database structure. It is easy to see the parallels to the origins of data virtualization – both focused on providing an alternative user interface to raw data in order to facilitate consumption and hide the underlying complexity of the raw data.
Traditional OLAP solutions ingest raw data and store dimensional aggregations in proprietary, physical data structures. These solutions leverage the semantic layer to present a more user-friendly, logical view of the data on a physically separate copy of the data. The combination of specialized data visualization tools and optimized, physical data structures (i.e., OLAP cubes) yielded the “speed of thought” analytics capability that is the foundation of modern business intelligence programs.
The key limitations of traditional OLAP are: 1) OLAP cubes must be loaded and frequently refreshed to stay current; 2) building OLAP cubes from big data sources or data with high dimensionality is impractical, leading to compromises in how granular analysis can get; and 3) different teams may have different analysis priorities, leading them to define alternative views to support their needs. This ultimately leads to cube sprawl and competing logical definitions (i.e., competing semantic layers) for different teams. In other words, there’s no single source of truth.
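The first limitation is easy to see in a toy model. The sketch below is an illustration under stated assumptions, not how any particular OLAP product stores data: the "cube" is just a dictionary of pre-computed totals, and `build_cube` and the sample rows are hypothetical.

```python
from collections import defaultdict


def build_cube(rows, dimensions, measure):
    """Materialize a physically separate copy: one pre-summed cell per
    combination of dimension values (a toy stand-in for an OLAP cube)."""
    cube = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        cube[key] += row[measure]
    return dict(cube)


raw_sales = [
    {"region": "EMEA", "year": 2020, "amount": 100.0},
    {"region": "EMEA", "year": 2021, "amount": 150.0},
    {"region": "APAC", "year": 2021, "amount": 200.0},
]

# Load the cube once from the raw data.
cube = build_cube(raw_sales, ("region", "year"), "amount")

# A new row lands in the source after the cube was loaded.
raw_sales.append({"region": "EMEA", "year": 2021, "amount": 50.0})

# The cube still answers instantly, but from a stale copy.
stale_value = cube[("EMEA", 2021)]

# Only a full refresh brings the cube back in line with the source.
refreshed_cube = build_cube(raw_sales, ("region", "year"), "amount")
fresh_value = refreshed_cube[("EMEA", 2021)]

print(stale_value, fresh_value)
```

The same structure hints at the second limitation: the number of possible cells grows with the product of the distinct values in each dimension, so adding dimensions or granularity quickly makes full materialization impractical.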
Integrating Data Virtualization into a Semantic Layer for BI & Analytics
AtScale extends the original notion of a semantic layer by combining the benefits of dimensional analytics with data virtualization and an intelligent orchestration engine (which we call autonomous data engineering). The goal has been to deliver powerful analysis capabilities to data consumers, regardless of which BI platform they use (e.g., Tableau, Power BI, Excel, Looker, and more). The key benefits of this approach are:
- Self-service BI Guardrails: Analysts and data scientists request data through the semantic layer’s business-friendly logical model, connecting to AtScale with tools of their choice. The semantic layer forms a single source of truth for business metrics, as well as the dimensions on which data consumers define queries.
- Live Cloud Data Access: The semantic model presented to the user maintains a virtualized connection to underlying cloud data sources. As users request data through the semantic layer, virtualized queries pull data from the underlying source in real time. Any transformation needed to translate raw data to the logical model happens on the fly, with no need to manually transform the underlying dataset – regardless of whether the raw data is in a cloud data warehouse (like Snowflake) or a data lake platform (like Databricks).
- Query Performance Acceleration: AtScale accelerates analytics query performance in two unique ways. First, by leveraging the concept of dimensionality, the AtScale engine automatically creates aggregations through its knowledge of data relationships and query behaviors. The aggregations are defined and stored within the cloud infrastructure, leveraging an AI engine and intelligent data locality using what we call “preferred aggregate storage.” Second, a cloud orchestration engine optimizes the use of the underlying cloud data platform. This capability is key to taking full advantage of organizations’ investments in powerful modern cloud data platforms, and to managing cloud costs.
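To make the last two points concrete, here is a rough sketch of how a logical model might be translated to SQL on the fly and routed to an aggregate table when one covers the request. It is not AtScale's implementation or API. The `MODEL` mapping, the `create_aggregate` and `compile_query` helpers, and all table and column names are hypothetical, and SQLite stands in for a cloud data platform. The aggregate here is created by hand, whereas the text describes the engine creating aggregates automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (region_cd TEXT, order_year INTEGER, product_cd TEXT, amt REAL);
    INSERT INTO fact_sales VALUES
        ('EMEA', 2020, 'A', 100.0),
        ('EMEA', 2021, 'A', 150.0),
        ('EMEA', 2021, 'B', 50.0),
        ('APAC', 2021, 'A', 200.0);
""")

# Hypothetical logical model: business-friendly names mapped to physical columns.
MODEL = {
    "table": "fact_sales",
    "dimensions": {"Region": "region_cd", "Year": "order_year", "Product": "product_cd"},
    "measures": {"Revenue": "SUM(amt)"},
}

# Registry of aggregate tables, keyed by the set of dimensions each one covers.
aggregates = {}


def create_aggregate(connection, dims):
    """Store a pre-summed table next to the raw data (same database)."""
    cols = [MODEL["dimensions"][d] for d in dims]
    name = "agg_" + "_".join(cols)
    col_list = ", ".join(cols)
    connection.execute(
        f"CREATE TABLE {name} AS "
        f"SELECT {col_list}, SUM(amt) AS amt FROM {MODEL['table']} GROUP BY {col_list}"
    )
    aggregates[frozenset(dims)] = name
    return name


def compile_query(dims, measure):
    """Translate a logical request to SQL, preferring the smallest
    aggregate that covers the requested dimensions."""
    cols = ", ".join(MODEL["dimensions"][d] for d in dims)
    table = MODEL["table"]
    candidates = [(len(k), t) for k, t in aggregates.items() if set(dims) <= k]
    if candidates:
        table = min(candidates)[1]
    # SUM is additive, so summing the pre-summed rows gives the same totals.
    return f"SELECT {cols}, {MODEL['measures'][measure]} FROM {table} GROUP BY {cols} ORDER BY {cols}"


sql_before = compile_query(["Region"], "Revenue")   # hits the raw fact table
create_aggregate(conn, ["Region", "Year"])
sql_after = compile_query(["Region"], "Revenue")    # rerouted to the aggregate

rows_before = conn.execute(sql_before).fetchall()
rows_after = conn.execute(sql_after).fetchall()
print(sql_after)
print(rows_after)
```

The person asking for "Revenue by Region" never sees `region_cd` or `amt`, and the answer is the same whether the query runs against the raw table or the aggregate. Only the amount of data scanned changes.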
The AtScale semantic layer plugs into an organization’s data fabric, delivering a unique approach to dimensional modeling and performance acceleration for analytic use cases. While AtScale is a true dimensional analysis platform, it is differentiated from traditional OLAP approaches in that it does not rely on a physical cube. Reliance on separate, physical data structures for dimensional analysis limits the scale of data access to days or weeks. Most modern analyst teams require access to multiple years’ worth of data to make informed, data-driven decisions.
AtScale has pioneered the concept of a virtualized, dimensional query capability that delivers all the speed and analytical sophistication of traditional OLAP solutions, without the baggage and scalability limitations of creating separate, physical copies of the data.
The AtScale semantic layer is a virtualized dimensional modeling and analytics platform.