How Data Virtualization Helps Orchestrate Security Policies

Data virtualization uses the concept of abstraction to decouple data-consuming clients from the means of materializing the answers to their questions. Abstraction is the most powerful technique we have in computer science. I think the ability to think abstractly is the difference between us and most other species on the planet (apologies in advance to my dog). As users demand more sophisticated software and business environments evolve, the complexity of their data environments inevitably increases dramatically. Unchecked, the collision of these two natural vectors is the perfect storm for massive security breaches.

What exactly are the security challenges with abstraction?

IT and management need security. If data is regulated, confidential or high value, it must be protected as such.

The business wants to move quickly on ideas, test hypotheses and make decisions to double down or cut bait (test before you invest). In order to execute in this fashion, a couple of conditions need to be satisfied:

Use case building needs to be agile.
Query performance has to be close to the “speed of thought”.

The business has wants, and IT and management have needs. If IT gets its way and the business remains underwhelmed by the company's analytical delivery capability, it is often lose-lose, because the company is still susceptible to data breaches.

Another way to think about the analytical framework is as operational (what can we do?) and policy (what may we do?).

I feel like this relationship is similar to the common software development axiom "Good, Fast, Cheap: pick any two", except that here it's Fast, Agile, Secure. Damn the torpedoes, I say. I want all three!

How Agile Analytics influences and impacts data security

The interesting part of delivering agile analytics is that it is never a single tech stack, product or vendor, and seamless integration between components is almost impossible. Moreover, the manner in which the components work with each other and the security of information as it passes between them become risk points. Questions arise, such as:

What are the data entitlements for the data warehouse, data marts, cubes, etc.?
How much can each user or group see? Single-subject area, multiple-subject area or everything?
What is the nature of their access? Read-only or update?
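
One way to make those questions concrete is to write the entitlements down as data. The sketch below is purely illustrative: the Entitlement record, the Access enum, the "*" wildcard, the is_allowed helper and the sample grants are all names I made up for this example, not part of any particular product.

```python
from dataclasses import dataclass
from enum import Enum

ALL_SUBJECT_AREAS = "*"  # hypothetical wildcard meaning "everything"


class Access(Enum):
    READ_ONLY = "read-only"
    UPDATE = "update"


@dataclass(frozen=True)
class Entitlement:
    principal: str            # a user or group name
    asset: str                # e.g. a warehouse, data mart or cube
    subject_areas: frozenset  # subject areas visible to the principal
    access: Access


def is_allowed(entitlements, principal, asset, subject_area, wants_update=False):
    """Return True if any entitlement grants the requested access."""
    for e in entitlements:
        if e.principal != principal or e.asset != asset:
            continue
        covers_area = (
            ALL_SUBJECT_AREAS in e.subject_areas
            or subject_area in e.subject_areas
        )
        if not covers_area:
            continue
        # UPDATE access implies read access; READ_ONLY never allows updates.
        if wants_update and e.access is not Access.UPDATE:
            continue
        return True
    return False


# Illustrative grants only.
GRANTS = [
    Entitlement("analysts", "sales_mart", frozenset({"orders"}), Access.READ_ONLY),
    Entitlement("data_stewards", "sales_mart", frozenset({ALL_SUBJECT_AREAS}), Access.UPDATE),
]
```

With these sample grants, is_allowed(GRANTS, "analysts", "sales_mart", "orders") is true, while the same call with wants_update=True is false. Each field answers one of the three questions: which asset, how much of it, and with what kind of access.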

The data topology in the analytics environment influences data accessibility and security. The recent trend has been towards user empowerment, and query results are frequently downloaded to products like Tableau and Power BI for further refinement and use.

How distributed is the data architecture supporting Business Intelligence?
Does the distributed nature of data increase security risks?
What are users doing with the data they download? Where is it ending up?
Are users sharing data with external parties with no auditability?

If you’ve been paying attention, you may think there has to be some sort of relationship between fast, agile, secure analytics and data virtualization, and you would be correct. The abstraction that data virtualization provides is the critical missing piece for maintaining native security benefits while layering in additional security concepts and technologies, without giving up on the agility and performance requirements your business consumers care about.

The single biggest mistake an organization can make in securing sensitive data is…

Not understanding where sensitive data resides, because there are no set policies to categorize data systematically and consistently. Consequently, there are no controls in place to ensure that all categories of data are handled appropriately.

Having false confidence that you’re aware of all functional activities throughout IT systems
Not securing devices sufficiently, so that data is physically “leaving the building”
Not properly classifying and encrypting data to protect it against current threats
Failing to protect networks and data from internal threats

No. Just kidding. The single biggest mistake an organization makes is…

Trusting its technology.

Not understanding the data lifecycle
Underestimating the necessity of managing software vulnerabilities
Not valuing the data to enable risk-based investments
Not adding security layers to data shared in the cloud
Relying on obsolete security models in complex IT environments
Not having a robust identity verification system

Why do these things happen?

Because data engineering is extremely difficult to get right. Hiring data engineers is incredibly difficult; it is perhaps the hardest job function to fill. According to a breakdown of data from Burning Glass’s Nova platform, which analyzes millions of active job postings, “data engineer” remains the top tech job, with an 88.3 percent increase in postings over the past twelve months (its top ranking remains unchanged from last month). Even after you manage to hire data engineers, the complexity of your data infrastructure increases with every new storage and processing system invented to solve the ever more specialized data problems that come with an increased appetite for more data and more uses of that data.

What is security, and how does data virtualization help?

The concept of data virtualization does not solve a problem by itself; it provides a framework or method for solving the problem via techniques that would be difficult or impossible to support in a non-virtualized environment.

The main contribution of a virtualized data warehouse is the automation of data engineering. It removes the biggest contributor to data-based security issues by decoupling access to the data, gaining an omniscient-level understanding of the query workloads, and then autonomously executing the jobs required to make the use case work. In some cases, that means defederation, pre-aggregation, on-the-fly query rewriting, and masking and policy enforcement, among many other things.

As for security, information security has three goals: confidentiality, integrity, and availability. Data governance is a fundamental part of security. The entitlements that governance manages ensure that only the appropriate users have access to specific data. Concretely, your data platform must have the following components:

Authentication and Authorization: Data consumers must confirm their identity via authentication and claim verified access to elements of data via authorization. Data virtualization lets you use the native authentication and authorization systems and also apply a secondary layer of authorization defined in a global policy management system.
Code-Free Role-Based Access Control: Many security professionals believe Keep It Simple, Stupid (KISS) is the right way to design security systems. A corollary is: everything should be made as simple as possible, but not simpler. Virtualization over a network of data warehouses allows the creation of a single, consistent security configuration experience across all of them.
Encryption Support: A virtualization server sitting in front of a network of data warehouses can provide functionality that may not be supported in the underlying systems, such as line-level encryption, Kerberos, etc. In this way, the virtualization server acts as a bastion host of sorts, one that can be hardened specifically to protect the servers sitting behind it.
Row-Level Security: Because the virtualization layer knows who is querying the data via their confirmed identity, it can dynamically apply a predicate (a WHERE clause) to their queries that joins to a security dimension. This provides a way to enforce entitlements in a virtualized environment where a direct connection to the data may not support that concept.
Perspectives: Data virtualization gives us the ability to decide which parts of a database schema are or are not displayed to a specific user or group of users. For example, the same dataset can serve both Sales and HR, as long as HR can see the SSN and Sales has no idea it is in the dataset. By reusing the exact same dataset, we avoid copies of data and potential leaks.
Automated Lineage: Autonomous data engineering hidden behind virtualization means that every data element, no matter how it is served and optimized, has a solid connection to the source of that data.
Consolidated Logging for Auditability: Virtualized access to data means there can be a single global log of the entire query workload, which makes auditing and reporting on the use of data much easier than an integration exercise that pulls log data out of every data warehouse and combines it for analysis.
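
To make the row-level and perspectives ideas a little more concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a real warehouse. Everything named here (the employees and user_regions tables, the PERSPECTIVES map, build_query, run_as and the sample rows) is a hypothetical illustration. A real virtualization layer would rewrite the user's own SQL rather than build a query from scratch, so treat this as a picture of the idea, not of any product's implementation.

```python
import sqlite3

# Hypothetical perspectives: the columns each role is allowed to see.
PERSPECTIVES = {
    "hr": ["employee_id", "name", "region", "ssn"],
    "sales": ["employee_id", "name", "region"],  # no SSN for Sales
}


def build_query(role):
    """Build a query limited to the role's columns and the user's rows."""
    # Column names come from the fixed map above, never from user input.
    columns = ", ".join(f"e.{col}" for col in PERSPECTIVES[role])
    return (
        f"SELECT {columns} FROM employees AS e "
        # The security dimension: which regions each user is entitled to.
        "JOIN user_regions AS s ON s.region = e.region "
        "WHERE s.user_id = ? "
        "ORDER BY e.employee_id"
    )


def run_as(conn, user_id, role):
    """Run the rewritten query on behalf of an authenticated user."""
    return conn.execute(build_query(role), (user_id,)).fetchall()


def make_demo_db():
    """Create an in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE employees (
            employee_id INTEGER, name TEXT, region TEXT, ssn TEXT
        );
        CREATE TABLE user_regions (user_id TEXT, region TEXT);
        INSERT INTO employees VALUES
            (1, 'Ann', 'EMEA', '000-00-0001'),
            (2, 'Bob', 'APAC', '000-00-0002');
        INSERT INTO user_regions VALUES
            ('hr_hana', 'EMEA'), ('hr_hana', 'APAC'), ('sales_sam', 'EMEA');
        """
    )
    return conn
```

Calling run_as(make_demo_db(), "sales_sam", "sales") returns only the EMEA row and never includes the ssn column, while the HR user sees both rows with the SSN. Both users read the same employees table, so no copy of the data is made.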

Sometimes your use cases for data are more exotic, such as sharing data outside of your company. In those cases, concepts like “Zero Trust” come into play, including:

SSO via OAuth/SAML (never share passwords, never require the user to give credentials to your applications).
Multi-factor authentication (challenges)
Integration with an external directory via LDAP/Active Directory or cloud-based identity solutions
Role-based access control
Validate every device

Differential Privacy: A statistical technique that aims to maximize the accuracy of queries against statistical databases while measuring (and thereby, hopefully, minimizing) the privacy impact on individuals whose information is in the database.
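
One common way to realize this idea is the Laplace mechanism, which adds calibrated random noise to a query result. The sketch below applies it to a simple count. It is my own illustration under stated assumptions: the function names, the sample data and the choice of epsilon are hypothetical, and a production system should rely on a vetted differential privacy library, since naive floating-point noise sampling like this has known weaknesses.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values that match `predicate`.

    A smaller epsilon means more noise: stronger privacy, lower accuracy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    # A counting query has sensitivity 1: adding or removing one person
    # changes the true answer by at most 1, so the noise scale is 1/epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

For example, private_count(ages, lambda a: a > 30, epsilon=1.0) returns the true count plus noise with a typical magnitude of about 1. Averaged over many calls the noise cancels out, which is why real deployments also track a privacy budget across queries.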

In summary, agile, self-service analytics is enabled by having solid operational analytics and an understandable, scalable approach to security and policy. Data infrastructure is only getting more complex, and data-driven enterprises move faster every day. The cognitive load required to be successful is massive, so the only scalable and efficient way forward is to automate security and hide that automation behind a data virtualization layer.

Good luck and godspeed!