December 17, 2019Big Data Analytics in the Cloud for Today’s Distributed and Diverse Data
I’ve never seen a faster adoption of a new technology platform than I have with the introduction of cloud-based Data Warehouses. Companies of all sizes have embraced the power and ease of use of BigQuery, RedShift, Azure SQL Data Warehouse, and Snowflake without requiring expensive internal IT operations teams to scale those platforms. Cloud data warehouse vendors have amazing economies of scale with their devops employees: not only can they hire the best, they provide full time, interesting work to retain them.
The trade off of outsourcing data platform support and maintenance is the cost: if you aren’t careful it is very easy to spend a lot of money very quickly even during a POC phase, let alone a production roll out to a large audience. I had the rather uncomfortable experience of observing a 3rd party consultant execute a complex query on lots of data, go home for the night, and wake up $11,000 worth of compute later.
Sometimes a specific analysis is so important that the cost of interpreting the needed data for that analysis isn’t really an issue. However, the vast majority of analyses are done to make decisions that should result in the business performing better. As a result, the cost per analysis is instrumental in deciding whether to implement a use case or not. Reducing the cost per analytic increases the ROI of running analyses, leading to a more data driven business.
Cloud Warehouse Pricing Overview
The pricing of cloud based data warehouses is generally resource usage based, meaning you pay for the amount of data you store, query, and in some cases insert. Some warehouses, such as Google BigQuery, have an “all you can eat” model, which caps cost, but also caps the amount of resources you can employ to process your workloads.
When using a cloud data warehouse with this type of pricing model, it is critical to follow best practices such as avoiding scanning raw data and limiting expensive operations such as joins whenever possible. Following these best practices enables an enterprise to scale its cloud warehouses for use in business intelligence and roll it out to a large number of consumers, resulting in a more data-driven business.
Through experience deploying BI applications on cloud data warehouses at the largest companies in the world, AtScale discovered and implemented many techniques necessary to get the most analytic bang for your IT buck. These cost savings can be grouped into two categories:
- General, shared techniques that save money on every cloud data warehouse.
- Platform-specific optimizations.
General Cloud Warehouse Cost Optimizations
1.) Autonomous Data Engineering
The biggest resource and money saving technique AtScale applies is automated data engineering that creates and cradle-to-grave manages aggregated data structures, substituting them into user queries with no end-user knowledge.
Possibly the fastest and lowest cost way to structure data for business intelligence purposes is as a single table that can answer questions without doing a distributed, high-cost, JOIN operation. This is true to a degree and depends on the number of dimensions and cardinality of those dimensions. AtScale solves for the resource usage and performance optimum, which has the side effect of also reducing cost.
By automatically pre-aggregating data, the amount of data required to answer a question is reduced by roughly the total number of rows divided by average dimensional cardinality (number of unique values for this member). Often this is on the order of 10X to 1000X reduction in rows scanned to answer questions.
While the overwhelming bulk of costs for cloud databases is in the compute usage and not the storage costs, these data structures (tables) are automatically deleted when AtScale deems they are no longer useful – such as the case when the system adapts and builds a more useful data structure, therefore lowering your storage costs.
2.) BI Workload Patterns
The more we know about a problem, the better we can craft a solution. Business Intelligence workloads often follow some identifiable patterns that AtScale takes advantage of to save time and money.
A. Unique Dimensional Values
Interactive dashboard builders know this pattern well – each dimension has a filter of unique values somewhere on the dashboard that users employ to display the exact data they are interested in. For example, *“I want to see sales of blue shirts in Montana for June 2018.” *
It’s not unusual for there to be a dozen drop downs of unique values. For certain unnamed cloud warehouses this can add over a minute to the render time, even if the dimensional values come from very small (<100) row tables. In worst case scenarios where the dimensional values are contained entirely in the fact table – imagine where the sales transaction has a color field – you may need to scan the entire table of millions of rows to find out there are only 8 unique colors which could take multiple minutes (and lots of $$’s). With AtScale, once you have identified what your dimensions are, AtScale will go ahead and build a dimensional aggregate that can be pinned in memory for quick and easy access; those dozen filters will render in less than 100 milliseconds giving your business analysts an excellent experience. Since these types of queries are executed quite frequently when building dashboards as well as built in BI client features, the load reduction can be significant.
B. Dynamic Constraints
The transparent application of a security dimension is a very powerful and useful technique for using a single dataset to deliver analytics to a mixed audience. Imagine a situation where you have country managers looking at a large dataset and you want to only return the data that they are entitled to view. In this case you want to essentially filter the result set with the equivalent of a predict that looks something like “WHERE user.country = ‘ CANADA’”. In some cases this query can be satisfied with a simple filter, in other cases the relationship from a security perspective is more sophisticated and may require a join or other expensive construct. Since the Dynamic Constraint is applied on most queries, it makes sense in some cases to pre-construct or cache the lookups and avoid doing a more expensive JOIN entirely.
C. Runtime Filtering
Sometimes it’s possible to select a set of data and build an IN clause rather than doing a more expensive distributed JOIN operation. AtScale will use estimated cardinality statistics for dimensional values to rewrite the query appropriately. A runtime filter is a list of values for join key columns. When this list of values is pre-calculated and the query is rewritten to use the values, the SQL Engine can filter out non-matching values immediately after reading them, rather than doing a join which entails transmitting the raw data to another host to compare against the in-memory hash table.
D. Join Re-Ordering
While arguably the optimizers in each data warehouse’s SQL engine should do this automatically, the fact is, not all of them do. In those cases, AtScale maintains a set of statistics so the JOIN clauses can be rewritten to optimize the query.
AtScale has its own HLL implementation for distinct count estimation as well as a T-Sets based implementation for Quantile estimation. HLL allows for things like multiple distinct counts on Cloudera Impala in the same query without unions and sub queries, and you can merge them from agg tables and use incremental aggregation. These are all features that eliminate the need for large scale full table scans.
How AtScale Optimizes Cloud Warehouse Cost by Platform
If there is one thing I’ve learned about Business Intelligence on Big Data, and specifically on shared platforms such as Hadoop, it’s that every query you execute needs to be hyper-optimal. A small inefficiency multiplied by billions of rows of data or by a distributed JOIN quickly becomes a performance and cost issue. In other words, you need to sweat the details for every single interaction. Every database platform we support has different performance characteristics and features that allow for optimal interactions.
“To solve problems at scale, paradoxically, you have to know the smallest details,”
Alan Eustace, Google
Some not-to-be-named visualization tools generate terrible SQL. Since AtScale is the abstraction layer between visualization and warehouse, we can optimize those queries in any way we want. In some cases, we replace raw table reference with aggregates, in other cases we rewrite generalized SQL into queries optimized for the specific cloud data warehouse in use. AtScale understands different dialects and different version of those dialects very well, so we can optimize to use the fastest functions possible for each dialect.
1.) Azure SQL Data Warehouse
For Azure SQL DW we have benchmarked the different costs of using Microsoft supplied functions to do different operations such as date parsing, and we automatically use the most efficient function.
2.) BigQuery Cost Optimization
BigQuery is an amazing piece of technology, capable of easily querying petabytes of data. It’s also very easy to accidentally spend a lot of money. There are a number of best practices that Google recommends, and AtScale applies these automatically so you can scale out business intelligence on BigQuery to thousands of analysts in a cost-effective manner.
According to Google, it’s best practice to not use a LIMIT clause as a method of cost control:
“Applying a LIMIT clause to a query does not affect the amount of data that is read. It merely limits the results set output. You are billed for reading all bytes in the entire table as indicated by the query.”
All queries that do not hit a partition in BigQuery incur the cost of a full table scan, even if you put a LIMIT N constraint. When possible, AtScale uses the tabledata.list API to avoid the full table scan costs.
Partition Support is another way to greatly reduce the amount of data considered for scanning and joining. BigQuery supports a PARTITIONTIME column, which takes the timestamp at INSERT time which may not be suitable for data coming from an OLTP database for analysis in BigQuery. Google introduced Partitioned tables allowing you to bind the partitioning scheme to a specific TIMESTAMP or DATE column which gives you the freedom to do create the partitions based on the meaningful date columns, and not just the INSERT date. AtScale supports partitions in Aggregations to both speed up and reduce cost of queries.
AtScale can employ a technique that allows our customers to overcome the “date and timestamp” only partitioning schema supported by BigQuery. It is possible to map dimension IDs to unique dates and transparently substitute the unique date into the query as well as the ID to ensure partition pruning can occur, without the business analyst having to know about this exotic optimization.
BigQuery Optimization Example
Let’s look at BigQuery’s pricing for a detailed example. In this hypothetical scenario, you are leading a project to put dashboards in front of one thousand analytics users at your organization. These users will run approximately 100 queries per day.
In addition to the average scan size, you’ll have to also consider how complex your userbase’s queries are. Let’s assume for the sake of illustration that the average query scans 5GB of data. Under these conditions, the monthly cost of your queries will be about $50k.
Employees * queries_per_day * working_days = queries_per_month 1000 * 100 * 20 = 2,000,000 queries per month Queries_per_month * average_query_scan = monthly_scan_volume 2,000,000 * 5 GB = 10,000,000 GB or 10,000 TB Monthly_scan_volume * query_scan_cost = monthly_query_cost 10,000 TB * $5/TB = $50,000 per month
First, if your load is that high, you shouldn’t be purchasing On-Demand pricing. Instead, you should purchase the flat-rate model which starts at $40k/month and includes 2,000 slots.
Second, let’s take a hypothetical swing at how much the same load would cost with AtScale using a modest 10X reduction in average_query_scan, and a 25% reduction in queries_per_day due to techniques discussed later in this blog:
Employees x queries_per_day x working_days = queries_per_month 1000 x (100 x .75) x 20 = 1,500,000 queries per month Queries_per_month x average_query_scan = monthly_scan_volume 1,500,000 x .5 GB = 750,000 GB or 750 TB Monthly_scan_volume x query_scan_cost = monthly_query_cost 750 TB x $5/TB = $3,750 per month
That is a savings of $46,250 per month, or a 92.5% reduction in cost.
In real customer scenarios we’ve seen 36X (2-3 minutes → ~5 seconds) faster queries, and savings ranging from $10k/month for a single query type to savings of $190k/mo for the entire workload.
I love saving money (ask me about my $81 a month car lease). What I think happens in the real world is: When the cost of operational analytics decreases, companies gain a competitive advantage as they can afford to have more analysts gain more insights into their key business questions.
3.) AWS Redshift Cost Optimization
AtScale supports two of the top three Redshift data engineering-based optimizations:
Distribution Keys. Distributing appropriately across the slices means better parallelization during expensive operations. Keys are picked based on the following statistics: High Cardinality, uniform distribution, and commonly joined. Distribution Keys built into aggregate tables mean faster queries, and much lower CPU/Memory and IO utilization.
Sort Keys. Like Hadoop, Redshift can skip entire pages when filtering on dimensional values if the data is inserted in an ORDERED format. The SQL engine looks at the header for a page, decides if the values fall within the MIN and MAX values and then can skip entire swaths of data. We saw order-of-magnitude improvements when implemented correctly (It acts as a pseudo partition) reducing resource use when filtering by almost 90%. Since Time Series analysis is a key analytics use case, Sort Keys are incredibly impactful when doing Period over Period, Lead, Lag, and other time based calculations.
4.) Snowflake Cost Optimization
While all of the three cloud data warehouses will benefit from more optimal query formation and automatically engineered structures that answer questions with less resource consumption, Snowflake has a very interesting scaling model that lines up perfectly with some unique features of the AtScale’s intelligent data virtualization.
Snowflake supports what they call Multicluster Warehouses. This concept is described as the ability to scale compute resources to manage your user and concurrency needs as they change, such as during peak and off hours.
The credit usage model is calculated as such:
“The total number of servers for the warehouse is calculated by multiplying the warehouse size by the maximum number of clusters. This also indicates the maximum number of credits consumed by the warehouse per full hour of usage (i.e. if all clusters run during the hour).”
The benefits of multicluster warehouses are that there is no need to manually increase the size of the warehouse, or start additional warehouses and explicitly move the workloads to the newly created resources, and reverse the process when the load disappears.
AtScale supports the concept of using a different engine for different workloads:
- Small Interactive (user queries)
- Large Interactive (big user queries – Aggregate misses)
- System Queries (Aggregation creation)
For small interactive queries, which should be the majority of your queries, it’s best to set up a cluster of small, low-credit usage warehouses because it gives you the finest granularity for cost control. Imagine you use the largest (and most expensive) Snowflake warehouse, it could theoretically support 100 concurrent users. If you have any less than 100 users, you are paying for a resource you aren’t using. If you have 101 users, you will spawn a second warehouse and have the capacity for 200 users, thus over procuring by 99 users (almost 50%). However when you have a big workload to do, like building an aggregate of a 100 million row table, you definitely want to use the large warehouse to churn through that work as quickly as possible, the small warehouse will take too long.
AtScale supports multiple engines, and Snowflake supports multiple sized cloud data warehouse clusters and thus you can get a near optimal distribution of work with a very well fit, least-wasteful use of compute resources. Large jobs wake the large warehouse, do the work, and put it to sleep, while lots of smaller concurrent queries scale the small data warehouse cluster in a way that tracks to actual consumption.
Scaling Snowflake with Small vs Large Data warehouses
If you can only support one type of warehouse in Snowflake, you overprovision to handle all types of queries, even the big ones. AtScale can intelligently use small warehouses for the bulk of aggregated queries, and only use large warehouses for larger queries that either aggregate data, or do raw fact-based queries.
The difference in credit usage between the smallest and largest data warehouse types is significant. The largest warehouse types employ 118 times as much credit usage as the smallest warehouse types. User consumption of data can be modeled as a curve that ramps up at the beginning of day, and ramps down at end of day. Larger warehouses scale in a more coarse curve fit, where you overprocure and waste resources.
Coarse Grained (Large) vs Fine Grained (Small) Data Warehouse Scaling with Snowflake
Large data warehouses scale in larger units, meaning more waste until the capacity is used. Put another way, even if you only have one user querying, you still have an entire Large size Snowflake data warehouse allocated. Small data warehouses allow you to fit the number of warehouses used closer the the utilization curve.
With Large data warehouses, excess unused capacity can easily exceed 25% of your bill.
Since AtScale allows multiple data warehouses to be used automatically, large data warehouses will be used to to build aggregates and then go away, and small data warehouses will serve the queries that hit those aggregates. Fitting the smaller resources to the usage curve means much less wasted dollars due to over-procuring.
In summary, just as the Cloud transformed the Enterprise Datacenter, Cloud Data Warehouses are fundamentally game-changing for your enterprise data. Cost efficiency is a critical piece of successfully migrating from legacy on-premise data platforms to cloud data warehouses. AtScale’s technology relentlessly adapts to optimize the layout of your data and the efficiency of your multidimensional analytical queries to provide near-optimal performance and potentially an order of magnitude lower use of resources and dollars.