The Future of Data Warehousing
Surprise! You never would have guessed the future is in the cloud.
We should not be overly concerned with “the future” of data warehousing. The past five years have brought fundamental new capabilities at the platform level, and most organizations are still wrestling with adopting these new technologies. We should be focused on “the now” of data warehousing: how to use the new technology as a platform for creating “the future” within your own organization. New approaches and technologies have effectively removed old limitations, and the future is now malleable – up for grabs to the most aggressive and creative companies. The future is how each individual company maximizes its use of new techniques and technologies to create fantastic customer experiences. The industry would benefit from a pause to digest the new capabilities.
What do Data Warehousing Customers Want?
It’s fair to say that customers of data warehousing technology are burnt out on the complexity of both implementing and using traditional data warehouses. Cost is also a sticking point, although if data warehouses delivered on their promise, the cost would be a much easier pill to swallow. It might be fine if you had a single data warehouse, but most (all?) businesses have multiple data warehouses, often on different architectures, that require costly, specialized DBAs and operations folks to achieve even modest success. So given all the complexity and cost, why bother?
The big trend in data is making all of your data available through one service. There are many ways to achieve this outcome. If you can consolidate your data into a single store, large-scale serverless data warehouse technologies like Google’s BigQuery are capable of storing and serving an enterprise’s data needs as a single endpoint. More likely, you already have so many data warehouses that consolidation is not an option, and you will have to explore the next generation of emerging data virtualization technologies that can present a single data service view into multiple data warehouses, whether on-premise, in the cloud, or any combination of the two.
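To make the virtualization idea concrete, here is a minimal sketch of a single query endpoint that routes requests to multiple underlying warehouses. Every class, method, and table name here is hypothetical and for illustration only; a real virtualization engine would also push down filters, joins, and aggregations to the backends.

```python
class Warehouse:
    """Stand-in for a connection to one physical warehouse (on-premise or cloud)."""

    def __init__(self, name, tables):
        self.name = name
        self.tables = tables  # table name -> list of row dicts

    def query(self, table):
        return self.tables.get(table, [])


class VirtualWarehouse:
    """Presents many warehouses to consumers as a single data service."""

    def __init__(self, backends):
        self.backends = backends

    def query(self, table):
        # Route to whichever backend holds the table; the consumer
        # never needs to know where the data physically lives.
        for backend in self.backends:
            if table in backend.tables:
                return backend.query(table)
        raise KeyError(f"table {table!r} not found in any backend")


on_prem = Warehouse("on_prem", {"orders": [{"id": 1, "total": 40}]})
cloud = Warehouse("cloud", {"customers": [{"id": 7, "name": "Acme"}]})
virtual = VirtualWarehouse([on_prem, cloud])

print(virtual.query("orders"))     # served from the on-premise warehouse
print(virtual.query("customers"))  # served from the cloud warehouse
```

The point of the sketch is the consumer-facing shape: one `query` entry point, many physical stores behind it.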
Also trending strongly is the desire to minimize the data engineering and operations requirements associated with traditional data warehousing. Serverless models minimize or eliminate the need for a data engineering/data ops team as the scaling and uptime requirements are outsourced to the cloud vendor.
Once you’ve got a super scalable, low-friction data service, the goal is to make that high-value data as available as possible to a diverse set of data consumers. The first step is to make the data discoverable via a data catalog. To decentralize access, you need to centralize access authorization: policy-based securing of data with support for Role-Based Access Control (RBAC), end-to-end encryption, auditing, lineage, and so on.
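The idea of centralized, policy-based authorization with RBAC and auditing can be sketched in a few lines. The policy shape, role names, and dataset names below are all hypothetical; real platforms express this through IAM policies or SQL grants rather than Python dictionaries.

```python
# Central policy store: which roles may read which datasets (assumed shape).
POLICIES = {
    "sales.orders": {"roles": {"analyst", "admin"}},
    "hr.salaries": {"roles": {"admin"}},
}

# Role assignments for each user (assumed, for illustration).
USER_ROLES = {
    "ada": {"analyst"},
    "greg": {"admin"},
}

AUDIT_LOG = []  # every authorization decision is recorded for auditing


def can_read(user, dataset):
    """Authorize a read request against the central policy and audit it."""
    granted_roles = POLICIES.get(dataset, {}).get("roles", set())
    allowed = bool(USER_ROLES.get(user, set()) & granted_roles)
    AUDIT_LOG.append((user, dataset, allowed))
    return allowed


print(can_read("ada", "sales.orders"))  # analysts may read sales data
print(can_read("ada", "hr.salaries"))   # salaries are admin-only
```

Because the decision runs through one choke point, access can be handed out broadly while governance stays in one place – the "centralize authorization to decentralize access" trade described above.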
Thinking about this holistically, you can’t easily and scalably achieve the goal of discoverability and secure governed self-service without delivering on the single data service vision.
To summarize, the next generation of data warehousing includes functionality that supports:
- A single gateway to all your data
- The ability to outsource the complexity of keeping that service highly available to all your consumers
- Policy-based access control and governance
- Discoverability through high-quality user interfaces
- Support for all consumers, from business intelligence to machine learning and data science use cases
The technology has arrived and can now be combined to deliver on the self-service vision. Done correctly, a true governed, universal self-service data initiative will not merely provide incremental value; it will be the engine for digital transformation.
What is a Cloud Data Warehouse?
It should not surprise anyone that a cloud-based data warehouse shares a lot of features and functionality with its on-premise brethren, but there are some key differentiators that have a meaningful impact on everything from operations to capabilities.
Keeping software up to date is a major concern with on-premise data warehouses, one that will be entirely in your rearview mirror if you transition to a cloud EDW. I predict no one will miss it, just as electric car drivers do not reminisce about the great experiences they had at gas stations. Staying current also means new and compelling functionality is delivered at regular intervals, whether it’s advanced geographic information system (GIS) support or built-in machine learning.
Success in using data will lead to one thing – an increased appetite for more data. Cloud databases are a scalable platform that can grow with your organization’s desire to use more data to make decisions. Arguably, they are the only way you can be successful with large-scale analytics in a potentially cost-effective way. The key phrase in the previous sentence is cost-effective. Adopting a cloud data warehouse will always mean paying for resources used, which maps directly to your ability and desire to use more breadth and depth of data. You trade data engineering and software license costs for a consumption-driven cost model. That can be hard to reconcile.
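The consumption-driven trade-off can be made concrete with back-of-the-envelope arithmetic. The per-terabyte rate below mirrors the published on-demand pricing of some serverless warehouses, but treat it as an assumption for illustration, not a quote from any vendor.

```python
PRICE_PER_TB_SCANNED = 5.00  # USD per TB scanned; assumed rate for illustration


def monthly_query_cost(tb_scanned_per_day, days=30):
    """Under consumption pricing, cost scales directly with data touched."""
    return tb_scanned_per_day * days * PRICE_PER_TB_SCANNED


# Doubling adoption doubles spend -- the trade described above:
print(monthly_query_cost(2))  # 2 TB/day for a month
print(monthly_query_cost(4))  # 4 TB/day for a month
```

This is exactly why broad adoption is both the goal and the bill: the more breadth and depth of data your consumers use, the more you pay, which is where consumption-path optimization earns its keep.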
Advantages of a Cloud Data Warehouse
- Data Access. Somewhat of a side effect of massive scalability, the ability to store many forms of data – not just purely relational – coupled with the innate elasticity of cloud-based services allows a very broad rollout of data services without having to worry about human and capital scaling requirements. This freedom from traditional friction points cannot be overstated when the goal is to enable as many people as possible with access. Nothing quashes momentum more than having to hire more data engineers or acquire more hardware.
- Performance & Scalability. Traditional on-premise data warehouses can definitely scale, no question. The real innovation is reduced dependence on DBAs and DataOps; you will still need these folks in your org, just at lower numbers. Companies like Google and Amazon definitely know how to make a highly available service.
- “NewTech”. There has been a convergence of attributes traditionally associated with data lakes into cloud databases, mainly in the machine learning and data science space. Some vendors have been more aggressive in courting those workloads and providing innate support. Expect innovation to continue beyond where we traditionally drew the boundaries of a data warehouse.
Concerns with a Cloud Data Warehouse
- Migration Strategy/Security Risks. There is nothing architecturally flawed here; this is purely mechanical. Data needs to be moved, and new security for that data needs to be implemented. Applications that generate data need to be repointed, and infrastructure that deposits data needs to be reconfigured.
- Cost. Legitimately a big concern. Given the nature of data, I don’t foresee any innovation on pricing other than consumption/resource-utilization-based models. Vendors try to abstract you from this with “all you can eat” pricing; however, you should always be aware of what broad adoption will cost you. Additionally, remember all those data engineers you saved by moving to the cloud? Yeah, this is where you may have to redeploy them – to optimize the consumption paths.
- Performance. After just raving about how great performance is, I am going to turn around and say it’s not enough. While you can pay for performance, achieving a good balance – rationalized unit economics for analysis – is still challenging, and I expect data warehouse and virtualization vendors to continue working their performance roadmaps in perpetuity.
The Future of Data Warehousing is the Cloud
The answer is pretty easy, actually: there is currently no viable on-premise competition for what cloud data warehouses provide. Organizations are moving to cloud data warehousing technologies for reasons of performance, security, agility, and operational simplicity.
Is Data Warehousing Dead?
Absolutely not. Data warehousing is more alive today than ever before and is the building block for most data-centric innovation. The concept of the data lake is converging with what cloud EDWs provide, and this convergence has given data warehouses a needed refresh in how we conceptually position them in an IT environment.
While I think this convergence and expansion of the data warehouse’s responsibilities is a great trend, the term EDW itself is both overloaded and carries baggage. The future of data warehousing may involve a name change.
The ability to bring on new exciting workloads such as data science, coupled with a more successful roll-out of self-service may be enough to overcome the sins of the past. Success cures all woes.
Snowflake
- Gaining traction, great mindshare.
- Compelling new functionality around data sharing.
- Very real use cases.
- Extremely popular in the data science community due to its easy-to-get-started, easy-to-consume model.
- Lives in between the completely serverless BigQuery model and the more configuration driven RedShift.
Google BigQuery
- The only entirely serverless offering. I’ve been exposed to the team of engineers and supporting technology that Google uses to keep BigQuery highly available and performant, and it is exactly what you would expect from Google.
- Best in class GIS support.
- Integrated Machine Learning declarative Language (BigQuery ML, or BQML).
Amazon Redshift
- The OG cloud data warehouse.
- Based on proven IP.
- Requires the most data engineering and configuration.
Azure SQL Data Warehouse
- While late to the game, Microsoft appears ready to fight with the Hyperscale and Synapse offerings.
- Based on the very mature SQL Server IP, so very compatible with any applications that previously talked to another version of that product.
Why Oracle does not create a global scale serverless offering is beyond me.
If you can put all your data in one big data warehouse, do it. While lock-in is a concern, the value of consolidating your data is incalculable. If you can’t – and let’s face it, most people can’t – look at virtualization technologies to create a single data warehouse consumption endpoint.
How can AtScale Help?
AtScale’s data abstraction layer removes the complexity of existing and future data platforms for business users, no matter where or how data is stored. AtScale’s intelligent data virtualization enables your move to a hybrid or multi-cloud data architecture while protecting your users from disruption.