December 17, 2019
Big Data Analytics in the Cloud for Today’s Distributed and Diverse Data
91% of businesses today are facing persistent barriers to digital transformation and are looking to make major investments in security and compliance, multi-cloud, self-service analytics and artificial intelligence. For businesses only too aware that they operate in a ‘transform or die’ landscape, performance is becoming table stakes. The next hurdle is ensuring that analytics keeps pace with the need for today’s data-driven businesses to move quickly and independently. All of the components of this hurdle boil down to data agility, and to finding a way to let the machines do the work, because modern data is too distributed, dynamic, and diverse for humans to manage. With AtScale’s new feature release, businesses can deliver on the promise of data transformation with a single enterprise view of all analytics data at scale.
Several years ago, if you had asked me what the biggest issue was for business-focused analytics consumers on the path to digital transformation, I would have answered scale and performance. Hands down, the amount of data people collected and stored represented a huge challenge for analysts. When I say performance, I mean it holistically: not just per-query execution speed, but also large query performance, small query performance, performance for specific workload types, and concurrent query performance. To be production-ready, an implementation must handle all of this at scale.
Today? I feel like the performance challenge has been largely addressed. Not that it’s solved, but we have technologies and techniques that can mitigate the majority of the challenge. In an absolute sense, performance will never be solved because people want answers in 0 ms – in fact, they actually want the answers to magically arrive before they even think of the question. Adding to that, the more data we consume, the bigger our appetite for data consumption gets.
So then, what’s the big new challenge?
I believe we should focus deeply on the growing demand for agility in analytics. Think about the overall process of inventing new analytics, or updating or refreshing existing analytics. The cycle runs from thinking of a valuable analysis, to getting the data together, asking the question, discovering you made a few mistakes, correcting those mistakes, and gathering more or different data. Finally you learn either that you had a great intuition and this is now a core KPI, or that, while it was a great exercise (the journey is more important than the destination but we mostly get paid by getting to the destination, right pilots?), the analysis did not yield the insight you were hoping for. That’s a process that can take a really long time. Today, it includes many manual human steps, and humans add latency.
In other words, there’s huge value in extreme analytical agility. Just imagine where we’d all be if we could be nimble with data. The pace of innovation from personal empowerment and a corporate perspective would be dazzling.
Reflecting on enterprise agility makes me remember when I joined PeopleSoft in the early 2000s. PeopleSoft, the (spiritual) precursor to Workday, required us to use the internal procurement tool if we wanted office supplies. It. Was. Terrible. I was used to using Amazon to “procure” things and, given that relative experience, I was no longer happy to muddle my way through some enterprise application with a poor UX. The point is, the easier technology has made things, the less we are willing to put up with bad experiences. Marc Benioff calls this the consumerization of the enterprise. Data and analytics are so critical to most people’s lives, and unfortunately there is still big-time friction standing between you and the ability to use the data to steer your actions.
So, what’s the path forward?
Anything you do on a repeated basis should be understood. Maybe not at a molecular level, but you need to understand the fundamentals. Often in technology, a problem space like analytics has both a subject matter complexity and an infrastructure complexity. So the goal of technology is not to separate the user from understanding the fundamentals of their problem space, it IS to free them from the constraints of their infrastructure.
To solve this class of “big problem” I often hear (and repeat) that abstraction is key. Thinking about how we got to where we are in the analytics industry reminds me of the Leibniz-Russell theory of perception: data infrastructure exists abstractly in our minds and concretely, potentially across multiple data centers, and it’s very difficult for us as humans to reconcile the abstract and the concrete.
For instance, think about an observation from daily life: railroad tracks appear to converge in the distance; however, we know they do not converge, because trains would crash if they did. The skeptic requires us to go to that region on the tracks in order to resolve this discrepancy between what we observe and what logic demands. But how close do we have to approach this region to verify non-convergence?
Said another way, what exists and what exists in your head are two different things that are difficult to reconcile.
How does this relate to IT infrastructure?
Even the most knowledgeable IT worker who built the infrastructure has at best an external perception of how it functions. We abstract the notion of a database, a network; even the workload itself is abstract. Ultimately, the disconnect between mind (what we are trying to accomplish) and body (the infrastructure that makes it real) means the situation is ripe for optimization.
We call this optimization benefit data virtualization: the ability to layer in services and to use the infrastructure to a degree that can only be achieved with the assistance of machines. Decoupling the question from how the answer is manifested, hiding all the complexity, and layering in multiple value-add services is the path forward for implementing technology that can and will solve previously unsolvable problems.
But hold on: the term Data Virtualization has existed for a while now, so why does it need to be redefined? Think about operating system virtualization. From when it was introduced to now, it has improved in performance and functionality to the point where it underpins the majority of OS rollouts in the cloud environment. OS virtualization is complex of course, but not nearly as complex as data virtualization. For years, Data Virtualization was a combination of Federated Query and Caching.
The problems there were that Federation doesn’t scale to production, and Caching can only accelerate workloads it has seen before.
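To make the caching limitation concrete, here is a minimal sketch of an exact-match result cache. The sample rows, the two toy query strings and the `ResultCache` class are all hypothetical, invented purely for illustration; this is not how any particular data virtualization product is built.

```python
from typing import Callable, Dict

# Hypothetical sample rows: (region, product, amount)
SALES = [
    ("east", "widget", 10),
    ("east", "gadget", 5),
    ("west", "widget", 7),
]


class ResultCache:
    """Exact-match result cache keyed by the query text (illustrative only)."""

    def __init__(self, execute: Callable[[str], Dict[str, int]]) -> None:
        self._execute = execute
        self._store: Dict[str, Dict[str, int]] = {}
        self.hits = 0
        self.misses = 0

    def query(self, text: str) -> Dict[str, int]:
        if text in self._store:
            self.hits += 1          # seen before: served from the cache
            return self._store[text]
        self.misses += 1            # never seen: must go back to the source
        result = self._execute(text)
        self._store[text] = result
        return result


def run_on_source(text: str) -> Dict[str, int]:
    """Stand-in for an expensive scan; understands two toy queries."""
    column = {"sum by region": 0, "sum by product": 1}[text]
    totals: Dict[str, int] = {}
    for row in SALES:
        totals[row[column]] = totals.get(row[column], 0) + row[2]
    return totals


cache = ResultCache(run_on_source)
cache.query("sum by region")   # miss: first time this query is seen
cache.query("sum by region")   # hit: identical query text
cache.query("sum by product")  # miss: same data, but a new question
```

The third call is the interesting one: the data has not changed, yet the cache is of no help because the question is new.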
We believe that our approach represents a solid evolution of those two data engineering techniques and builds on the very recent phenomenon of data engineering as a career.
Computers don’t have the philosophical schism that exists in human beings, nor do they have the ability to reason between the external and internal perception of differences. What they do have is an amazing ability to collect and collate information to make the best possible decision from observable metrics.
If Data Lands in a Database and No One Knows, Is It Valuable?
More concretely, the super-fantastic product team is excited to release a new customer-facing application we call “The Virtual Cube Catalog”. A data catalog is an essential piece of a contemporary company’s data infrastructure. It makes sense – if you are going to use and generate data on a consistent basis, the very next problem you run into is discoverability. If there is friction in finding data, agility is hurt, and people revert to their default “gut-instinct” mode of solving problems. This is, as computer guys say, sub-optimal.
“Everything should be made as simple as possible, but no simpler.”
– Albert Einstein
– Matthew Baird
Building a large-scale data platform that includes multiple end-user applications, has the ability to query across data sources, supports ML/AI and BI, scales to thousands of users, and autonomously drives performance of heterogeneous infrastructure should look easy from the user perspective. However, it is almost by definition difficult to do this with regard to the infrastructure requirements. High Availability (HA) and Disaster Recovery (DR) requirements alone mean standing up multiple servers; integrations with data warehouses often require Kerberos configuration, JVM tuning, and a compute cluster; and all this infrastructure means IT has to manage a sophisticated multi-machine implementation. Doing this raw with no support is possible, albeit time-consuming and difficult. To combat this complexity, AtScale is introducing the new AtScale Orchestrator application for holistically managing the entirety of the platform. Over time we expect Orchestrator to encompass more and more platform management activities, providing both a UI-based application interface and a scriptable API for folks that like to “automate all the things”.
The Matrix Of Pain
When the industry largely moved to hosted SaaS offerings, I thought I was putting the Matrix of Pain (MoP) behind me. The MoP is the combination of different infrastructure and associated configurations that must be tested every time a new release is to become Generally Available (GA). The thing is, every company has very specific reasons for choosing the data warehouse they use, and most analysts have spent years of their lives mastering their analytical tools. In fact, the cloud has created MORE choice and made the matrix even larger.
Again, the solution is automation.
Data Platform support for legacy data warehouses takes a big step forward in our 2020.1 release with Teradata getting native Kerberos and Impersonation support. The platform team has been on a tear, adding in support for HDP 3.1.0, CDH 6.2.X, Oracle 12C (Secure) and Teradata 16.20 (Secure). The number one rule our engineering team has is “No Wrong Answers”. We will never ship a product with a known accuracy issue. We invested a huge amount of engineering effort in enhancing our platform test automation framework to make regression-proof support for data warehouses straightforward. This allows our engineering team to focus on building new features and functions while remaining assured of the quality of the runtime.
With its 2020.1 platform release, AtScale delivers on a complete business intelligence and analytics platform with these three elements:
- Multi-Source Intelligent Data Model: AtScale’s logical data models are built via an intuitive user experience without copying or transforming existing data structures. AtScale’s autonomous data engineering further simplifies and accelerates the user experience by assembling the data needed for queries in a just-in-time fashion and then maintaining acceleration structures for subsequent workloads.
- Self-Optimizing Query Acceleration Structures: AtScale incorporated additional information into the creation and lifecycle of acceleration structures, including data locale and platform capabilities. AtScale alleviates the “lowest common denominator” approach to query planning that results in significant resources being wasted on manual data provisioning and movement. AtScale’s Autonomous Data Engineering automatically determines the necessary structures and their optimal location.
- Virtual Cube Catalog: AtScale’s new virtual cube catalog accelerates discoverability with comprehensive data lineage and metadata search capabilities that integrate natively into existing enterprise data catalogs. This new capability translates directly into business semantics and empowers business analysts and data scientists to locate the necessary data for business intelligence, reporting and AI/ML activities.
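As a rough illustration of what it means to maintain an acceleration structure for subsequent workloads, the sketch below uses one common technique: a pre-summed aggregate that coarser queries can roll up from. It is an assumption made for illustration only, not a description of AtScale’s implementation; the sample rows and the `build_aggregate` and `rollup` helpers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical fact rows: (region, product, amount)
FACTS: List[Tuple[str, str, int]] = [
    ("east", "widget", 10),
    ("east", "gadget", 5),
    ("west", "widget", 7),
    ("west", "widget", 3),
]


def build_aggregate(rows: List[Tuple[str, str, int]]) -> Dict[Tuple[str, str], int]:
    """Scan the raw rows once and pre-sum amounts at (region, product) grain."""
    agg: Dict[Tuple[str, str], int] = defaultdict(int)
    for region, product, amount in rows:
        agg[(region, product)] += amount
    return dict(agg)


def rollup(aggregate: Dict[Tuple[str, str], int], dimension_index: int) -> Dict[str, int]:
    """Answer a coarser query from the aggregate instead of the raw rows."""
    totals: Dict[str, int] = defaultdict(int)
    for key, amount in aggregate.items():
        totals[key[dimension_index]] += amount
    return dict(totals)


aggregate = build_aggregate(FACTS)   # built once, kept for later workloads
by_region = rollup(aggregate, 0)     # a later query, answered without a rescan
by_product = rollup(aggregate, 1)    # a different later query, same structure
```

Unlike an exact-match cache, the same stored structure serves questions it has never seen, as long as they can be derived from its grain.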
Data is too distributed, too dynamic and too diverse for humans to be expected to manage it. The good news is that the rise of the data engineering profession and the subsequent codification of best practices and heuristics for high-quality, scalable data management, coupled with the amazing work of data scientists and advances in machine learning, have created the perfect opportunity to take a good idea, Intelligent Data Virtualization, and make it into a great implementation.