The Complete Buyer’s Guide For Intelligent Data Virtualization

Future proof your technology choices and avoid vendor lock-in

By 2022, 60% of all organizations will implement data virtualization as one key delivery style in their data integration architecture.

Gartner Market Guide for Data Virtualization

ABOUT THIS GUIDE

Data virtualization technology has been around for quite some time, primarily supporting operational use cases. However, the explosion of data in size and variety, coupled with an increased focus on analytical use cases, has created new challenges for legacy data virtualization technologies. Business users' need for ad-hoc access to both live and historical data is steadily increasing, and that demand stretches the ability of even the most robust analytics tools to process these massive datasets. Moreover, the volume and speed with which data is generated are beyond the capacity and economic bounds of today’s typical enterprise infrastructures.
In this guide, we will lay out some key considerations for organizations looking to apply data virtualization to their analytics use cases. Along with key features and capabilities, we’ll discuss the differences between data virtualization and query federation, drill down on caching techniques and include a detailed ranking for evaluating vendors in the space.

Data Virtualization Defined

WHAT IS DATA VIRTUALIZATION?

Data virtualization technology is based on the execution of distributed data management processing, primarily for queries, against multiple heterogeneous data sources, and federation of query results into virtual views. This is followed by the consumption of these virtual views by applications, query/reporting tools, message-oriented middleware or other data management infrastructure components.

Gartner Market Guide for Data Virtualization

The Challenge

IS YOUR ORGANIZATION READY FOR DATA VIRTUALIZATION?

It wasn’t that long ago when most believed that a centralized data warehouse could be the data panacea for the enterprise. It didn’t take long to realize that the demands of the business were too fluid to wait on IT to collect, normalize and store data in a single physical location. Since then, the goal of providing an enterprise data “single source of truth” has become even more elusive. The explosion of data sizes, varieties and formats has made it all but impossible to standardize on a single storage format. Even worse, with the emerging popularity of the public cloud, data stewards now need to deal with data location as well. While it’s tough enough for those who need to engineer and secure the data, it’s even harder for the consumers of data. We are now asking our business users to engineer, wrangle and access data on-premise, in the cloud, in databases and in data lakes.

If these types of challenges sound familiar, you may be ready to explore data virtualization. By abstracting away how and where data is stored, data virtualization frees the enterprise from the inflexibility and costs that data silos beget. By presenting users with logical data views, data virtualization creates agility for the enterprise while future-proofing the technology choices for years to come.

Even though data virtualization technology has been around for several years, the emergence of the public cloud and the explosion in the number of data platforms and analytics tools has made virtualization an essential component for enterprises. According to Gartner, by 2022, “60% of all organizations will implement data virtualization as one key delivery style in their data integration architecture.”

The Approach

GETTING STARTED ON YOUR SEARCH

There are a variety of data virtualization approaches in the market and it’s important to understand the different styles before launching your search. According to Andrew Brust in his GigaOM report titled “Data Virtualization: A Spectrum of Approaches”, there are four main approaches or styles for data virtualization.


To further summarize, you can think of these approaches as pure data virtualization, database federation, next-generation ETL tools and autonomous data warehouses. If you are looking for the most flexible approach, the pure data virtualization platforms are your best choice. These platforms provide the best degree of future-proofing since they allow users to plug and play with new data platforms as they emerge. Also, these tools usually allow users to define virtual calculations, which promotes the consolidation of business terms, definitions and calculations in a central location. In contrast, the database federation tools force you into a single database platform, while the autonomous data warehouses require data movement and lock customers into a single vendor’s ecosystem.

Recommendation: The Core Data Virtualization style of data virtualization delivers the most flexibility for future-proofing your information architecture and minimizing vendor lock-in.

Intelligent Data Virtualization Defined

WHAT IS “INTELLIGENT” DATA VIRTUALIZATION?

Traditional data virtualization platforms typically rely on the operator to manually tune query performance. This becomes a difficult task, especially when dealing with multiple data platforms. A federated query that joins data from more than one data platform can create an unreasonable amount of real-time data movement. Simple memory and table caching are not enough to handle the variation in query patterns, and transferring atomic data for heterogeneous joins may create an unacceptable volume of network traffic, which results in a poor and frustrating user experience.

“Intelligent” data virtualization addresses these scale and performance issues. Intelligent data virtualization platforms avoid high volumes of real-time traffic from federated joins by creating and managing a distributed, data-platform-optimized cache. By inspecting query patterns along with data platform and network performance characteristics, this approach ensures that the heavy query lifting is pushed down to the remote data platforms. By avoiding unnecessary data transfers, intelligent data virtualization delivers more consistent query performance with far less resource consumption.
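The mechanics differ by vendor, but the core idea can be illustrated with a minimal sketch: inspect the query log, find the grouping sets that recur, and materialize an aggregate at each hot grain on the platform that owns the data. The log format, threshold and field names below are hypothetical, not any vendor's actual API.

```python
from collections import Counter

# Hypothetical query log: each entry records the grouping columns and measures
# requested by a user query against the virtual views.
query_log = [
    {"group_by": ("region", "month"), "measures": ("sum_sales",)},
    {"group_by": ("region", "month"), "measures": ("sum_sales", "count_orders")},
    {"group_by": ("customer_id",),    "measures": ("count_orders",)},
    {"group_by": ("region", "month"), "measures": ("sum_sales",)},
]

def recommend_aggregates(log, min_hits=2):
    """Count how often each grouping set appears and suggest caching the hot ones."""
    hits = Counter(q["group_by"] for q in log)
    recommendations = []
    for group_by, count in hits.items():
        if count >= min_hits:
            # Union all measures ever requested at this grain so a single aggregate
            # table can answer every query that groups by these columns.
            measures = sorted({m for q in log if q["group_by"] == group_by
                                 for m in q["measures"]})
            recommendations.append({"group_by": group_by, "measures": measures,
                                    "hits": count})
    return recommendations

if __name__ == "__main__":
    for rec in recommend_aggregates(query_log):
        print(rec)  # e.g. {'group_by': ('region', 'month'), 'measures': [...], 'hits': 3}
```

In a real platform this decision would also weigh network latency and the capabilities of each data platform, as described above; the sketch only shows the query-pattern half of the equation.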

Recommendation: When choosing a vendor, make sure that query performance management is automated based on user query patterns, network performance and data platform capabilities.

Key Considerations

Seven Considerations for Evaluation

When choosing a vendor, there are a few core capabilities to keep in mind. Depending on your needs, you can weigh the options accordingly. The following categories are further broken down in our checklist later in this document.

Analytics Versus Operational Use Cases

Until recently, traditional data virtualization has focused primarily on operational application integration use cases. In the last few years, enterprises have expanded their use of data virtualization beyond just development or test deployments. According to Gartner, in 2011 only 11% of surveyed organizations reported that they were utilizing data virtualization, primarily for operational application integration, as a semantic tier over multiple datasets that were not permanently stored in operational data stores, warehouses or marts. By 2018, however, as many as 40% of organizations were utilizing data virtualization, with the majority targeting analytics use cases.

If you are targeting operational application integration as a use case, traditional data virtualization platforms may be good enough. In these use cases, the data size and query profiles tend to be more predictable and thus more suitable to a one-time optimization effort. However, analytics workloads are much more varied and tend to scan and aggregate large amounts of data. For these use cases, it is essential that the virtualization solution support autonomous performance management. One-time, manual performance tuning is not suitable for analytical workloads and will likely lead to low end-user adoption due to unpredictable and poor query performance.

Recommendation: For analytical use cases, choose an intelligent data virtualization platform that automatically manages query performance.

Data Sources & Connections

There are two questions to consider when choosing a data virtualization solution with regard to data platform connections.

The first is rather obvious: how many different data platforms or data sources does the solution support? To maximize flexibility, your vendor should support data sources that are (1) relational (e.g., Oracle, Teradata, Snowflake), (2) file-based (e.g., CSV, JSON, XML, HDFS, S3), (3) API-based (e.g., REST, HTML) and (4) application-based (e.g., Salesforce, Workday, ServiceNow). With this type of coverage, your virtual view designers can incorporate just about any data, on-premise, in the cloud, structured and unstructured, without involving ETL or manual data movement.
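To make that breadth concrete, a virtualization catalog ends up holding connection metadata for very different kinds of endpoints while the views that reference them stay uniform. The sketch below is purely illustrative; the class, fields, connection strings and URI schemes are placeholders, not any vendor's actual catalog format.

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    kind: str        # "relational", "file", "api" or "application"
    location: str    # JDBC/ODBC URL, object-store path or HTTP endpoint (placeholders below)

catalog = [
    DataSource("orders",      "relational",  "jdbc:snowflake://acme.snowflakecomputing.com/SALES"),
    DataSource("clickstream", "file",        "s3://acme-data-lake/clickstream/*.json"),
    DataSource("fx_rates",    "api",         "https://api.example.com/v1/fx-rates"),
    DataSource("pipeline",    "application", "salesforce://acme.my.salesforce.com/Opportunity"),
]

# A virtual view simply references catalog entries; no data is copied or moved
# until a query actually runs against the view.
virtual_view = {
    "name": "sales_in_local_currency",
    "sources": ["orders", "fx_rates"],
    "joins": [("orders.currency", "fx_rates.currency")],
}

print([s.name for s in catalog], virtual_view["name"])
```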

The second consideration may be less obvious but is even more critical than the first. The question to ask is: where does the query processing happen? There are two different answers to this question.

The first answer is to deploy a “least common denominator” approach. In this style, the virtualization engine exposes its own proprietary SQL dialect (usually based on Postgres) and takes responsibility for all of the data processing and aggregation. This approach becomes problematic when your data platforms have varied data types or require platform-specific functions. With a least common denominator approach, the virtualization engine is required to stream atomic-level data to the vendor’s proprietary query engine in order to apply calculations and aggregation functions. This approach tends to break down when data is large, since query pushdown to the data platform is limited.

The second answer and approach is pure query pushdown. In this approach, queries are passed through to the data platform, and aggregations and platform-specific functions are performed remotely. In this case, for table joins, only the pre-aggregated data is streamed to the virtualization engine, which minimizes both query time and data movement.
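A minimal sketch of the pushdown style is shown below, using two in-memory SQLite databases as stand-ins for remote data platforms; the tables and values are invented for illustration. Each platform performs its own GROUP BY, and only the small pre-aggregated results cross the network to be joined by the virtualization layer.

```python
import sqlite3

# Stand-in for a cloud data warehouse holding actual sales.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('EMEA', 100), ('EMEA', 250), ('APAC', 75);
""")

# Stand-in for a data lake engine holding sales targets.
lake = sqlite3.connect(":memory:")
lake.executescript("""
    CREATE TABLE targets (region TEXT, target REAL);
    INSERT INTO targets VALUES ('EMEA', 400), ('APAC', 150);
""")

# Step 1: push the heavy aggregation down to each platform.
actuals = dict(warehouse.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
targets = dict(lake.execute(
    "SELECT region, SUM(target) FROM targets GROUP BY region"))

# Step 2: join the tiny pre-aggregated result sets in the virtualization layer.
for region in sorted(set(actuals) & set(targets)):
    print(region, actuals[region], targets[region],
          round(actuals[region] / targets[region], 2))
```

Contrast this with the least common denominator style, where every atomic sales row would stream into the proprietary engine before any aggregation could happen.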

For analytical use cases, it’s imperative to minimize query time data transfer for heterogeneous table joins.

Recommendation: For analytical use cases, choose a data virtualization platform that maximizes query pushdown to minimize query time data movement.

Development Environment

When data virtualization first arrived on the scene in the early 2000s, IT managed the data pipelines and the centralized data warehouse was king. Later, business intelligence (BI) innovators like Tableau and Qlik democratized reporting and analytics with their user-friendly, self-service-oriented tools. As these tools took off in the enterprise, IT’s role as the centralized arbiter of data access became less the norm and more of the exception.

When evaluating data virtualization vendors, it’s important to keep this democratization and self-service trend in mind. A citizen data scientist or business analyst should be able to design virtual data views and semantic models in the data virtualization platform. In fact, the process of defining virtual data views should be as easy as building a Tableau worksheet. With this in mind, it’s imperative that you choose a data virtualization vendor that supports a web-based (as opposed to desktop-based) design environment that promotes multiple, simultaneous designers and leverages a library to promote re-use and standardization.

Recommendation: Choose a data virtualization platform with a web-based, multi-user design environment with libraries to promote re-use and enforce standardization.

Calculations & Analytical Functions (OLAP)

If you are looking to address analytical use cases, it’s imperative that your chosen platform can handle the complex analytical calculations required by business applications. For example, just about every analytical use case requires time calculations and time intelligence. At a minimum, this means that your data virtualization platform must support period-over-period comparisons (e.g., this year versus last year), period-to-date totals (e.g., year to date, quarter to date) and moving averages (e.g., monthly average sales). In addition, many analytical use cases require semi-additive metrics: distinct counts for counting unique customers, and first and last functions for reporting on beginning and ending inventory levels. These types of calculations are often referred to as OLAP calculations. In addition to OLAP functionality, support for the OLAP protocol (MDX) means that the virtual platform can also serve queries from tools like Excel, Cognos, BusinessObjects and MicroStrategy that speak MDX.
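To make those calculation types concrete, the sketch below expresses a few of them in pandas purely for illustration; in a virtualization platform they would be defined once, declaratively, in the semantic model rather than computed in client code.

```python
import pandas as pd

# Illustrative monthly sales figures across two years.
sales = pd.DataFrame({
    "year":   [2022, 2022, 2022, 2023, 2023, 2023],
    "month":  [1, 2, 3, 1, 2, 3],
    "amount": [100.0, 120.0, 90.0, 110.0, 150.0, 95.0],
})

# Period-to-date: running total within each year (year to date).
sales["ytd"] = sales.groupby("year")["amount"].cumsum()

# Period-over-period: compare each month with the same month last year.
sales["amount_ly"] = sales.groupby("month")["amount"].shift(1)
sales["yoy_growth"] = sales["amount"] / sales["amount_ly"] - 1

# Moving average: trailing three-month average within each year.
sales["avg_3m"] = (sales.groupby("year")["amount"]
                        .transform(lambda s: s.rolling(3, min_periods=1).mean()))

print(sales)
```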

Without support for analytical functions and MDX (OLAP), analytical workloads must be pushed down into the tooling layer which subverts the power of the universal semantic layer and often results in inconsistent business definitions and conflicting reports.

Recommendation: Choose a data virtualization platform that supports OLAP-style calculations (time intelligence, semi-additive metrics) and the MDX protocol so that business logic stays in the universal semantic layer rather than in individual tools.

Query Performance & Caching Approaches

When evaluating vendors, this is arguably the area where you should spend most of your time. Without consistent and performant query serving, a virtualization platform has little value.

In analytical use cases, business users are accustomed to interactive query performance since they typically query proprietary analytical databases or cubes that are designed for fast queries. As a result, a virtualization platform needs to match or beat the performance of the native platforms and existing solutions it is replacing. To make matters worse, data virtualization queries often include heterogeneous database joins that further tax query performance.

Data virtualization solutions that simply cache query results or create cached tables are not sufficient for analytical use cases. As mentioned above, analytical queries are too variable and often scan and aggregate massive amounts of data on the fly.
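One common alternative to simple result caching is aggregate-aware caching: cache data once at a coarser grain and let the engine roll it up to answer many different queries. The sketch below is a hypothetical illustration of that matching logic; the structures and names are invented, not a specific product's design.

```python
# A cached aggregate kept at the (region, month) grain with an additive measure.
cached_aggregate = {
    "grain": {"region", "month"},
    "rows": [
        {"region": "EMEA", "month": 1, "sales": 350.0},
        {"region": "EMEA", "month": 2, "sales": 410.0},
        {"region": "APAC", "month": 1, "sales": 75.0},
    ],
}

def answer_from_cache(group_by, aggregate):
    """Roll the cached aggregate up to the requested grain, or signal a cache miss."""
    if not set(group_by) <= aggregate["grain"]:
        return None  # cache cannot answer; fall back to pushdown against the sources
    result = {}
    for row in aggregate["rows"]:
        key = tuple(row[col] for col in group_by)
        result[key] = result.get(key, 0.0) + row["sales"]  # SUM re-aggregates safely
    return result

print(answer_from_cache(["region"], cached_aggregate))       # hit: roll months up to region
print(answer_from_cache(["customer_id"], cached_aggregate))  # miss: column not in the grain
```

A single cached result set would only ever answer the exact query that produced it; a grain-based aggregate like this one can serve an entire family of query shapes.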

Recommendation: Choose a data virtualization vendor that includes a comprehensive performance management system that goes beyond simple caching techniques.

Client-side Requirements

In order to maximize the ROI of a data virtualization platform investment, it’s extremely important to drive broad adoption among your enterprise’s data consumers. There’s little value in a universal semantic layer if business analysts and data scientists don’t use it as their primary means of accessing data. Therefore, it’s imperative that your chosen data virtualization platform supports a lightweight client-side footprint. For example, does your data virtualization vendor require the installation of custom drivers on client desktops? Does the virtualization platform require a desktop application for managing or discovering new virtual views? If the answer to either question is “yes”, IT must push client-side software and manage versioning on every desktop that needs access to data. In a large enterprise, this can be a huge inhibitor to end-user adoption.

Recommendation: Choose a data virtualization vendor with a zero client-side footprint, leveraging existing tooling connectors for accessing the virtualization layer.

Security & Governance

Since data virtualization platforms serve as middleware for analytical queries, it’s imperative that the platform integrates with the enterprise’s security infrastructure. There are two main forms of security to consider: authentication & authorization.

First, a data virtualization platform must integrate with the enterprise’s single sign-on infrastructure in order to authenticate users, whether that be Active Directory (AD), LDAP, OAuth or another third-party authentication platform. These credentials must flow through from the client applications, and the data virtualization platform must synchronize users automatically.

Second, the data virtualization platform must include the ability to hide or mask sensitive columns, limit data rows based on user access rules and impersonate users when querying the underlying data sources. Impersonation is especially crucial since using a proxy user (instead of the query user) to query underlying data sources may circumvent security policies for those data platforms and force users to duplicate security policies in the virtualization layer.
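The sketch below is a hypothetical illustration of how these three controls interact at query time: the policy masks sensitive columns, appends a row-level predicate and, with impersonation enabled, connects to the source as the end user so the source's own policies still apply. The policy structure, role names and SQL shown are invented for illustration.

```python
# Hypothetical policy store keyed by role.
policies = {
    "analyst_eu": {
        "row_filter": "region = 'EMEA'",     # row-level security predicate
        "masked_columns": {"email", "ssn"},  # column-level masking
        "impersonate": True,                 # query sources as the end user
    },
}

def rewrite_query(user, role, columns, table):
    """Apply the role's policy before the query reaches the underlying source."""
    policy = policies[role]
    projected = [c if c not in policy["masked_columns"] else f"'***' AS {c}"
                 for c in columns]
    sql = f"SELECT {', '.join(projected)} FROM {table}"
    if policy["row_filter"]:
        sql += f" WHERE {policy['row_filter']}"
    # With impersonation, the connection to the source authenticates as `user`,
    # so the source platform's own security policies still apply; a shared proxy
    # account would bypass them and force duplicate policies in the virtualization layer.
    effective_user = user if policy["impersonate"] else "dv_service_account"
    return effective_user, sql

print(rewrite_query("jane@acme.com", "analyst_eu",
                    ["customer_id", "email", "revenue"], "sales"))
```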

Recommendation: Choose a data virtualization vendor that integrates with your single sign-on standards and supports column level security, row level security and impersonation.

Conclusion

So much has changed since data virtualization platforms hit the scene in the early 2000s. Data has grown in scale and become more varied, the data lake arrived to compete with the data warehouse, the public cloud became a compelling destination and data platforms continue to proliferate rather than consolidate. To summarize, here are the key recommendations to keep in mind as you choose your vendor:

  1. The Core Data Virtualization style of data virtualization delivers the most flexibility for future-proofing your information architecture and minimizing vendor lock-in.
  2. When choosing a vendor, make sure that query performance management is automated based on user query patterns, network performance and data platform capabilities.
  3. For analytical use cases, choose an intelligent data virtualization platform that automatically manages query performance.
  4. For analytical use cases, choose a data virtualization platform that maximizes query pushdown to minimize query time data movement.
  5. Choose a data virtualization platform with a web-based, multi-user design environment with libraries to promote re-use and enforce standardization.
  6. Choose a data virtualization platform that supports OLAP-style calculations (time intelligence, semi-additive metrics) and the MDX protocol so that business logic stays in the universal semantic layer rather than in individual tools.
  7. Choose a data virtualization vendor that includes a comprehensive performance management system that goes beyond simple caching techniques.
  8. Choose a data virtualization vendor with a zero client-side footprint, leveraging existing tooling connectors for accessing the virtualization layer.
  9. Choose a data virtualization vendor that integrates with your single sign-on standards and supports column level security, row level security and impersonation.

As you can see, there’s a lot to consider when choosing a data virtualization platform. Refer to this guide and ranking to help you bring order to the chaos and have the confidence to choose the best vendor to realize your ambitions of simplifying your data analytics stack.

ABOUT ATSCALE

The Global 2000 relies on AtScale – the intelligent data virtualization company – to provide a single, secured and governed workspace for distributed data. The combination of the company’s Autonomous Data Engineering™ and Universal Semantic Layer™ powers business intelligence and machine learning resulting in faster, more accurate business decisions at scale. For more information, visit www.atscale.com.
