August 4, 2020Consider This When Managing Your Snowflake
For practical reasons, it is increasingly difficult—and sometimes impossible—to perform large-scale analytics with an on-premises data warehouse. Scaling physical hardware to the needs of your customers is so cost intensive that you might well find yourself running a data center rather than performing analytics on your data. If you’d rather keep your focus on your core business, you may have migrated, or be planning a migration, to a cloud data warehouse such as Snowflake.
What’s more, the cloud data warehouse is infinitely more scalable. You don’t have to blend data from different levels of granularity across different systems—instead, all the ETL and data is in the same place. Effectively, if you need to do something such as taking a data model table and turning it back into a data lake table, you can just join it across in a single SQL statement. This is a massive time savings in terms of data management.
With all of this said, there are some caveats. Organizations who move to the cloud are frequently surprised by its costs, wondering where the cost-saving benefits are once they see the bill. This is for two reasons. First, It can be difficult to understand cloud pricing models, which often scale in unexpected ways. Second, cost, processing power and efficiency don’t have a linear relationship in the cloud.
Moving your data warehouse to the cloud gives you massive advantages in terms of flexibility, scale, and potential cost savings as well—but by the same token it’s likely that you haven’t perfectly optimized your cloud data warehouse, you may be paying more than you should. Here are a few ways to lower the costs of running analytics with on the Snowflake platform and how to increase ROI at the same time.
One: Create a Centralized Cost Management System
Technically, all data within Snowflake is within a single data warehouse, but administrators can subdivide that data warehouse as much as they like, creating multiple data warehouses for the organization. This lets the company isolate data warehouses by department—in other words, if marketing runs a complicated SQL query, it won’t force the sales analytics team into a lengthy queue.
Even though you can create as many warehouses as you like, there’s a pitfall inherent to this approach. If you give each team its own data warehouse and then look away, you may yourself suddenly hemorrhaging costs into the cloud.
First, your team will get stuck on a difficult query. Next, they’ll set the data warehouse to 4XL in order to throw additional compute and memory at the query. Finally, they’ll get through the query, but they’ll forget to tune down the data warehouse. The data warehouse is effectively stuck at 4XL, which is twice as expensive as 3XL, which is twice as expensive as 2XL, and so on.
To avoid this, you need to make sure that your data warehouse scales to the size of its queries—and then scales back. Creating a cost oversight team helps with this for a few reasons. First, you have accountability—if a single department is using too many resources on their queries, there is a concrete reporting structure that can investigate and then rectify this issue.
Second, you can create uniform policies across your departments, such as automatically killing any query that runs longer than two hours, or automatically downscaling warehouse instances that are scaled too large for too long. Finally, there’s an appeals structure in place where if a department needs to run a query that would otherwise violate policy, the business leader has someone to ask about it.
Two: Understand How Data Warehouse Sizes Maps to Cost
There are no hard and fast rules governing warehouse size and how this relates to cost. With that said, in most cases you’ll find yourself making warehouses smaller rather than the other way around. Simple tasks—such as writing data on and off a disk—tend to need less compute power, so you can run these tasks on a small and inexpensive cluster.
You’re going to be surprised to find that in many cases you can reduce the size of a data warehouse—to the tune of a fairly large cost savings—with barely any perceptible difference in how the job executes. On rare occasions, the reverse will be true. Large aggregates and joins will take large amounts of computing power. Provisioning those jobs with the maximum amount of compute and memory means that they’ll get done faster. Sometimes, they’ll get done so fast that the cost of provisioning these jobs at 4XL is less than the cost of a medium-sized warehouse.
It’s easy to think of this as being more of an art than a science. You might ask yourself, if you’re running tens of thousands of jobs a week, how can you learn which jobs need to be right-sized in any kind of approachable or scalable way? Fortunately, the jobs themselves contain the answer.
Three: Visibility Equals (Cost) Control
Every job you run within Snowflake contains metadata which is stored and queryable. In other words, every job you’ve every run is stored within a data table—and you just happen to run an analytics platform. You have the tools to derive extremely detailed information from the jobs you run, visualize this information in a meaningful way, and then use this information not just to cut costs, but also to anticipate future costs (and cut those as well).
Let’s say that you take the data tables containing your query metadata and put them into Tableau. This means that you can see your queries as they’ve been run every month, every week, every day, and every hour. You can run reports on multiple axes—how long each query runs, how much each query costs, how many resources each query needs. Finally, you can condense these metrics into a report that is sent regularly to your cost management team.
As you look over this data, you’re going to see spikes. Some of these spikes are anomalies, and some anomalies are recurring. In other words, you can predict when you’re going to see anomalous spikes in your data. If you can predict these spikes, you can see what’s causing them—badly-formed queries, or queries that are wrong-sized for the warehouses they refer to, or tasks that are queueing for abnormally long amounts of time. Once you derive the cause, you can eliminate the slowdowns that accumulate both costs and frustration from you, your users, and your customers.
Four: Understand Your Control Mechanisms Within Snowflake
There are multiple levers you can pull that will allow you to control costs within your Snowflake implementation. If you’re using multiple warehouses within your implementation, you can easily isolate a single instance that’s causing an issue, and you can experiment with it until you find a mechanism that works to control costs and eliminate slowdowns.
- Horizontal Scaling
If a single warehouse doesn’t have enough compute power to handle the number simultaneous queries coming through, it will begin to queue them so it can handle one query at a time. Alternatively, you can set Snowflake to scale horizontally—spinning off an identically-sized cluster so it can handle queries in parallel. You’d think that horizontal scaling might increase costs, but there’s not always a linear relationship. There’s value in being able to finish a queued workload aggressively if you downscale immediately afterwards.
- Data Caching
Snowflake will cache data within warehouses. if you have a duplicative workload, you can consolidate that within a single warehouse. In other words, you can use caching to offset some of the cost of a larger warehouse. This is another situation where it turns out that using a larger warehouse size can save you money. While you’re powering more of a workload with a single warehouse, you can cut your costs overall because the warehouse is all using the same data, thus cutting your I/O time.
- Code Rewrite
No one claims to write SQL perfectly. Simple is better in terms of cost savings. In many cases, it can be more efficient to perform joins if you aggregate the output of a job into a flat data warehouse table and then query off that table.
By optimizing your query performance using these mechanisms, you’ll be able to generate cost savings, avoid anomalies, and increase the productivity of your workforce—but there’s still more you could be doing.
Five: Create a Universal Semantic Layer™ using AtScale
One last thing to consider is how difficult it may be to write a query in order to answer a business question. If you’re like most businesses, you don’t have enough data scientists to go around, which means that personnel—and not data warehouses—could be one of the major bottlenecks preventing your business leaders from accessing analytics.
Instead of hiring more data scientists or purchasing larger data warehouse volumes, you can instead create a universal semantic layer. This makes your data warehouse consistent, and easy to use—essentially liberating the productivity of your analytics team. Data scientists and analysts no longer need to spend long hours performing ETL and data engineering. Instead, they can move right to the querying phase, arriving at actionable insights that can assist your business even further.
AtScale provides a full multi-dimensional engine that provides a rich business friendly interface for users while ensuring consistency for key business metrics and definitions. In addition to the power of a semantic layer, AtScale’s single point of entry delivers a one stop data governance shop. You can apply your data governance policies at a logical or physical level while virtualizing and hiding the physical implementation of the data.
Simply put, AtScale’s Universal Semantic LayerTM provides labor-saving automation and makes your data even easier and safer to use.
As we said at the beginning of this article, the potential for advantages that will allow you to truly accelerate and scale your analytics is within reach. We hope these techniques help you start realizing the ROI you want to (and will be able to) achieve through your move to the cloud.