Building Time Series Analysis on Snowflake with a Semantic Layer
In a recent post, we discussed how a semantic layer helps scale data science and enterprise AI programs. With massive adoption of Snowflake’s cloud data platform, many organizations are shifting analytics and data science workloads to the Snowflake cloud. Leveraging the integration of AtScale’s semantic layer to Snowflake, data teams can position the data scientist teams to spend less time data wrangling and more time creating sophisticated models that create value for their organizations.
Predictions based on time-series analysis are one of the most common products delivered by data science teams. This post focuses on how AtScale + Snowflake create a powerful combination in this common use case.
What is Time-Series Analysis and Why is it Important?
Predictions based on time-series analysis are extremely common and used across a variety of industries. They are used to predict values that change over time. For instance, they are used to predict weather patterns that impact a range of other activities. They are used to predict sales that impact revenue forecasts, stock price performance, and inventory stocking requirements. They are used in medicine to establish baselines for heart or brain function. They are used in economics to predict interest rates.
Time-series analysis is a blanket term for a variety of statistical modeling techniques including trend analysis, seasonality/cyclicality analysis, and anomaly detection. Time-series is often coupled with regression analysis that finds relationships between different phenomena. Building a model to predict ice cream sales may look at: How sales change based on season (i.e. seasonality); How sales are impacted by cost of milk (regression); Long term shift in consumer preferences (trend analysis); The relevance of isolated events like Covid (anomaly detection).
Time-Series predictions are built by complex statistical models that analyze historical data. There are many different types of time series models (e.g. Auto-regressive, moving average, exponential smoothing) and many different regression models (e.g. linear, polynomial). And all models have multiple parameters on which they can be built. Modern data scientists leverage machine learning (ML) techniques to find the best model and set of input parameters for the prediction they are working on.
All of these models are fueled by data. Lots of data.
Why is Time Series Analysis Hard?
Time series analysis is based on complex statistical techniques designed by brilliant mathematicians. Luckily they are widely available and accessible from a variety of sources. ML algorithms are also pretty complicated, but also available in data science platforms that are increasingly accessible (i.e. AutoML tools). The practical challenge is managing the data that fuels time series analysis.
Data scientists typically spend an enormous amount of time wrangling data. In particular, preparing and maintaining the “time relative” data that fuels time series is difficult. Time series models are often based on Lag or Window-based measurements. A lag measurement is a comparison of values separated by a fixed time interval (e.g. sales of ice cream yesterday vs. same day last week or last year). A window-based measurement is an aggregation across a set period of time (e.g. 3 or 5 day rolling average). Lags and windows can be based on any number and any definition of time. Managing all these variations of input data is hard.
Adding to the data challenge is the need to create a uniform set of data that is aligned by a common definition of time. Models may incorporate data from different enterprise systems (e.g. inventory data in an ERP and sales data from a CRM). They may also incorporate data from outside the organization (e.g. daily temperature). Preparing and maintaining clean data that can fuel models is complicated. And as predictive models are moved into production (i.e. published to business users) the source data needs to be maintained on an ongoing basis. Ideally production models are not disrupted by shifts in raw data.
Why Snowflake is a Powerful Repository for Time-Series Data
The first data challenge to overcome is to create a practical central repository for time-series data. The Snowflake data cloud is an ideal solution with an established ecosystem of integrations and frameworks for building an enterprise-class, cloud-based data warehouse.
The ability to cost-effectively manage long histories of a wide range of enterprise data is table-stakes for building out time-series analysis.
Expanding available data to include external data providers augments data science programs. With the Snowflake Data Marketplace, data teams can expand their access to 3rd party data and simplify the process of integrating with their primary data. Brick and mortar retailers may benefit from incorporating foot traffic data into their predictions. Ecommerce companies can leverage sources of web traffic information. Ice cream vendors can use historical temperature charts.
Snowflake provides the infrastructure to integrate, curate, and query large and complex data sets that fuel time-series analysis.
How AtScale’s Semantic Layer on Snowflake Supports Time-Series Analysis
The AtScale semantic layer incorporates a powerful dimensional analysis engine. Data teams can leverage the platform to build conformed dimensions that facilitate all types of business intelligence and data analytics. Time is arguably the most important dimension that every data analyst or data scientist cares about when reporting or making predictions about the future of the business. AtScale supports teams managing some of the complexities of time.
First off, AtScale lets data teams create a conformed time dimension that can be used across all data sets. This simplifies joining disparate data sets from different sources – including data from 3rd parties. Without this, data scientists would need to manually prepare data sets that use different definitions and granularities in their time dimensions and to manually create the aggregation logic they need (e.g. hourly, daily, monthly, quarterly, annual, etc…).
While some elements of a time dimension are fundamentally the same across data sets, some organizations need to define custom definitions. For example, some companies may analyze time series based on a custom definition of fiscal quarter. Others may analyze data related to holidays not fixed to a date (e.g. black friday). Others may segment weekdays vs. weekends. Applying this kind of logic consistently across different datasets and data models is important.
When the time dimension is managed in AtScale, it is simple to define and automatically support any number of time relative measures. This means applying any number of window or lag scenarios in calculated changes in a measure over time. This can be done with basic metrics (like simple counts or quantities) as well as more complicated calculated metrics (like gross margin or sales ratios). The screenshot below shows a sample of features that are defined and maintained within AtScale.
When AtScale is used to define data inputs to a time series analysis, it automatically transforms the live data residing on the Snowflake cloud. This means teams can leverage all of the utilities of managing large data sets on Snowflake, while simplifying the data prep and complicated transforms associated with preparing data for time series analysis. AtScale lets data teams consume data on BI tools of their choice like Power BI and Excel. AtScale AI-Link lets data scientists use Python to integrate with their models and leverage AutoML platforms while leveraging the power of Snowflake infrastructure.
Finally, AtScale + Snowflake make it easier to publish predictive results out to the business community. Production models can write back predictive results through AtScale, into Snowflake. AtScale can then be used to deliver model outputs to business users through the same BI tools they use for reporting historical data. For example, retail executives could analyze detailed sales predictions alongside historical performance in a tool like Power BI. And leveraging the dimensional analysis capabilities of AtScale, they can easily drill up and drill down into data. They could drill down from quarterly predictions into monthly level predictions. They can drill down on a prediction of national sales to sales by state to sales by store. This added business context is key to realizing the value of time series outputs.
The screenshot below illustrates how model outputs can be plotted alongside historical data in a tool like Tableau. Analysts can leverage the dimension model to further investigate predictions at any level of detail.
AtScale and Snowflake have both independently earned reputations for delivering practical solutions to modern data teams. As data science becomes a growing priority for many organizations, there is an opportunity to combine the strengths of both companies to deliver unique capabilities to data scientists and the data teams supporting them.