December 17, 2019
Big Data Analytics in the Cloud for Today’s Distributed and Diverse Data
What is Cloud Migration?
Cloud migration is the process of moving data and application hosting infrastructure from on-premise data centers to a public cloud hosting service like Amazon AWS, Google Cloud Platform or Microsoft Azure.
Migrating your data center operations to a public cloud is not for the faint of heart. There are many things to consider when migrating an on-premise infrastructure to a third-party hosted environment. For this post, we’ll be focusing on migrating data platforms to the public cloud. In my experience working with a large number of enterprises migrating their data operations to the cloud, I’ve seen the good, the bad and the ugly. I’ve compiled a list of five things to keep in mind as you consider making this well-worth-it transformation.
What to Watch Out For When Moving to the Cloud
- Higher (and Less Predictable) Operating Cost
I’m always amazed at how often I hear from my customers about cloud sticker shock. I’ve experienced it myself when we moved our own testing infrastructure to AWS. The ease of spinning up new servers (and forgetting to shut them down) is the perfect recipe for ever-escalating costs. On the cloud data platform side, it can be even scarier. A malformed query can cost thousands of dollars if you’re on a consumption-based pricing plan, and that same query can eat up all your available resources if you’re on a fixed pricing plan. A consumption-based plan offers the most flexibility, but predicting its monthly charges is nearly impossible.
What You Can Do to Make Cloud Operating Costs More Predictable
Before migrating your data platform infrastructure to the cloud, you need to have a resource governance plan. Snowflake has a cool feature that allows you to create separate warehouses with different resourcing levels without duplicating data. This allows you to vary cost and resources at a fine grain depending on the priority of the function or group. Caching technology and a central data governance platform are another way of managing costs. At AtScale, we’ve saved tens of millions of dollars for our customers by avoiding redundant full table scans through our Adaptive Cache™ technology.
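To make the idea of a resource governance plan concrete, here is a minimal sketch of a per-team budget guard for a consumption-based plan. Everything in it is an assumption for illustration: the price per terabyte scanned, the `TeamBudget` class and the `authorize_query` function are hypothetical and do not correspond to any vendor’s API or pricing.

```python
from dataclasses import dataclass

# Hypothetical rate; real consumption-based prices vary by vendor and plan.
PRICE_PER_TB_SCANNED_USD = 5.00


@dataclass
class TeamBudget:
    """Monthly spending limit for one function or group (hypothetical)."""
    name: str
    monthly_limit_usd: float
    spent_usd: float = 0.0


def estimate_cost_usd(bytes_scanned: int) -> float:
    """Estimate a query's charge from the number of bytes it would scan."""
    return bytes_scanned / 1e12 * PRICE_PER_TB_SCANNED_USD


def authorize_query(budget: TeamBudget, bytes_scanned: int) -> bool:
    """Approve the query only if it fits in the team's remaining budget."""
    cost = estimate_cost_usd(bytes_scanned)
    if budget.spent_usd + cost > budget.monthly_limit_usd:
        return False  # reject before the runaway query is ever executed
    budget.spent_usd += cost
    return True
```

In a setup like this, a malformed query that would scan far more data than intended gets rejected up front instead of showing up on next month’s bill. Where the scan estimate comes from (a dry run, query plan statistics, or something else) depends on your platform.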
- Impact on Report and Application Migrations
For any database migration, applications and reports that have been written for a specific database dialect need to be ported to the new platform. An unfortunate side effect of the self-service BI revolution is that these data models and dialects are often baked directly into hundreds or even thousands of reports and dashboards. The effort to port these reports is enormous, and often the solution is to leave the existing reports on the older platform while new reports are written against the new data platform. Of course, this just complicates the stack for everyone, ensures an enduring presence for the legacy platform and sets up the same scenario for future migrations.
What You Can Do to Avoid the Impact of Cloud Migration on Reports and Applications
The key to driving agility and future-proofing your analytics stack is to separate core business logic from the application code, visualization tools and AI/ML platforms. This is easier said than done. The key is to adopt a server-side semantic layer that presents a scalable, consistent backend to your front ends. If you choose the right middle layer, you can abstract away the physical location and format of your data and give yourself the room and freedom to adopt new data platforms and analytics applications in the future.
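As a rough illustration of what separating business logic from the physical platform can look like, the sketch below defines each metric once and generates the SQL for whichever platform currently holds the data. The metric definitions, the table names and the `compile_metric` function are all hypothetical; a real semantic layer handles far more, including dialect differences, joins and security.

```python
# Hypothetical logical model: one business definition per metric.
METRICS = {
    "total_revenue": {"table": "sales", "column": "amount", "agg": "SUM"},
    "order_count": {"table": "sales", "column": "order_id", "agg": "COUNT"},
}

# Hypothetical mapping from logical tables to physical tables per platform.
PHYSICAL_TABLES = {
    "on_prem": {"sales": "dw.fact_sales"},
    "cloud": {"sales": "analytics.sales"},
}


def compile_metric(metric: str, platform: str) -> str:
    """Turn a logical metric name into SQL for the given platform."""
    spec = METRICS[metric]
    table = PHYSICAL_TABLES[platform][spec["table"]]
    return f"SELECT {spec['agg']}({spec['column']}) AS {metric} FROM {table}"
```

Reports ask for `total_revenue` and never mention a physical table, so moving the `sales` data to a new platform means updating one mapping rather than rewriting every dashboard.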
- Finding and Using Data in Multiple Locations
It’s hard enough for business analysts and data scientists to find and use the data they need to do their jobs. Throughout the self-service analytics revolution, we’ve asked them to become data engineers: they need to learn how to decipher data in all the various data platforms, dialects, types and formats before they can do their actual jobs. In fact, data scientists spend 80% of their time just preparing data for their models. In the cloud era, we’ve made their jobs even harder. Our downstream users now need to know where to get their data. Data is moving to the public cloud and to SaaS application clouds, and some of it will always remain on premise. To get a consolidated view of data, the people who use data to drive our businesses now need to figure out how to combine data in different locations, data platforms and formats.
What You Can Do to Minimize the Impact of Disparate Data
The most obvious solution to having data in disparate locations is to consolidate it in one place in the cloud using the new cloud data warehouse technologies like Snowflake and Google BigQuery. However, that’s a long-term prospect, and there’ll always be a shiny new platform that some data migrates to. Data virtualization is a key solution here. By abstracting away the location and format of the data, you deliver your business users and data scientists a single view of enterprise data, hiding where and how it is stored. Moreover, virtualization makes IT more agile and gives them the ability to store data in the most suitable platforms while giving them the flexibility to adopt new platforms in the future without re-architecting their stack or disrupting their downstream consumers.
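Here is a toy sketch of the virtualization idea, assuming each physical source can be wrapped in a simple reader function. The `VirtualCatalog` class, the reader functions and the sample rows are all hypothetical stand-ins; they are not how any particular virtualization product works.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Reader = Callable[[], Iterable[Row]]


class VirtualCatalog:
    """Maps a logical dataset name to one or more physical readers."""

    def __init__(self) -> None:
        self._readers: Dict[str, List[Reader]] = {}

    def register(self, dataset: str, reader: Reader) -> None:
        """Attach another physical source to a logical dataset."""
        self._readers.setdefault(dataset, []).append(reader)

    def read(self, dataset: str) -> List[Row]:
        """Return rows from every source behind the logical dataset."""
        rows: List[Row] = []
        for reader in self._readers[dataset]:
            rows.extend(reader())
        return rows


# Hypothetical stand-ins for an on-premise database and a cloud warehouse.
def read_on_prem_customers() -> List[Row]:
    return [{"id": 1, "name": "Acme", "source": "on_prem"}]


def read_cloud_customers() -> List[Row]:
    return [{"id": 2, "name": "Globex", "source": "cloud"}]


catalog = VirtualCatalog()
catalog.register("customers", read_on_prem_customers)
catalog.register("customers", read_cloud_customers)
```

An analyst calls `catalog.read("customers")` and gets one combined view without knowing where each row lives. If the on-premise data later moves to the cloud, only the registered reader changes.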
- Security Integration Challenges
In an on-premise environment, you’re used to owning 100% of your security, networking and compute infrastructure. Directory and single sign-on (SSO) technologies like AD and LDAP, VPNs, firewalls and the rest are locked down and in your control. When you move to the cloud, you now have to integrate your data center security stack with your cloud vendor’s stack. Therein lies the challenge. For the cloud data warehouses, you may need to deal with an entirely different authentication protocol. For example, Google BigQuery relies on the Google Identity Platform for authentication, which means you need to figure out how to sync your Active Directory (AD) with Google’s sign-in directory. There goes your carefully crafted SSO strategy that you worked so hard to deliver over the past decade.
What You Can Do to Orchestrate Security Policies
Again, abstraction and virtualization are the key to dealing with this problem. You can invest in an SSO solution like Okta or a virtualization platform like AtScale’s and centralize your authentication management. But you can’t stop there. You also need to consider managing access to your data assets using a data governance and security layer. I would recommend that you implement these abstractions before you start your cloud migration since they provide huge benefits behind the firewall and make cloud migrations less disruptive and risky.
- User Retraining
It’s already bad enough that you have to retrain your users on where to get their data, but if you also need to convince them to learn new tools for visualization and/or modeling, your chances of success drop dramatically. Moving to a new cloud data platform often requires that you either switch to new tools or re-engineer your data models to work on the new architecture. In either case, you will have grumpy downstream customers if you make them give up their tried-and-true BI tools or deliver them a solution where performance doesn’t match their existing on-premise experience.
What You Can Do to Make Cloud Migration Seamless for BI Users
Before you migrate to a cloud data platform, make sure there’s a path to support your existing tools and applications. Your prospective cloud vendor will undoubtedly try to convince you to trade out your existing tools for theirs. Don’t do it. Pick a solution and a cloud partner that will allow you to preserve your existing BI and AI tools’ licensing and training investments and provide an upgrade path to new tools in the future.
I hope this post doesn’t discourage you from taking the leap to modernize your data stack in the cloud. The agility the cloud brings is well worth the effort and our customers who have done so are happy they did. If you plan ahead and avoid these pitfalls, you and your users can reap the benefits of a truly agile data infrastructure now and in the future.