May 6, 2019
Seamless Adoption of Snowflake in the Cloud: Rakuten with AtScale
This is the first of a three-part blog series on how AtScale and Snowflake together help enterprise data science teams scale and take advantage of the agility of cloud-based infrastructure. We have written before about combining AtScale and Snowflake for time series predictions. In this series, we take a more foundational view of the topic and look at some examples. For a related technical workshop, take a look at this video.
AtScale sits on top of Snowflake, simplifying the interface to raw cloud data for both business intelligence and data science users. AtScale integrates with Snowflake, coordinating query traffic and executing advanced analytic functions within the Snowflake Data Cloud. By leveraging Snowflake, AtScale optimizes both query performance and resource consumption, resulting in highly efficient use of infrastructure. This first post is an overview of the AtScale semantic layer and how it supports data science workloads on Snowflake. The second will be a credit card fraud detection demo of AtScale AI-Link, illustrating a Python connection to the AtScale semantic layer to support feature engineering, model training, and closed-loop reporting in Tableau. The third will illustrate how to use Java UDFs with Snowpark for Scala to create a model scoring function, which is then published to a database used for making predictions.
Snowflake for Data Science
Snowflake offers a cloud-based data storage and analytics service with a wide range of features, such as data sharing, scalable computing power, and support for third-party tools. The platform provides the functionality of an enterprise analytics database, but with no hardware to manage, virtually no software to install, and compute instances that scale and are available on demand.
The figure above shows how Snowflake’s characteristics support basic data science workflows.
Snowflake provides solutions for data warehousing, data lakes, data engineering, data science, data application development, and data sharing. Some key aspects of the platform:
- Supports many data formats, including CSV, JSON, XML, Parquet, and Avro
- Supports data analyst and BI activities through the visualization and dashboarding capabilities of Snowsight
- Supports feature engineering and transformation through Snowpark, a library that lets developers apply their existing data engineering and data pipeline skills, improving productivity and security and reducing the number of processing systems
- Works with an extensive list of partners that are popular in the data science ecosystem and supports programming languages such as Go, Java, .NET, Python, C, and Node.js
- Works with BI tools such as Tableau and Power BI to visualize the performance of models over time
- Runs predictions with external functions (AWS Lambda) or Java UDFs
More on Snowpark
Snowpark helps with multiple use cases, such as data transformation, data preparation and feature engineering, and ML scoring/inference to operationalize ML models in data pipelines, ELT systems, and data apps. It lets developers write in their preferred language and tools, build and debug data pipelines with familiar constructs such as DataFrames and functions, and use third-party libraries. Snowpark pushes all of its operations directly to Snowflake, without the need for Spark or any other intermediary.
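The pushdown idea can be illustrated with a small sketch. This is not the Snowpark API: `LazyFrame` is a hypothetical toy class, the `transactions` table and its columns are made up, and an in-memory SQLite database stands in for Snowflake. The point it tries to show is that DataFrame-style calls only record operations, and a single SQL statement is executed inside the database when results are requested.

```python
import sqlite3


class LazyFrame:
    """Toy DataFrame (hypothetical): records operations, compiles them to one SQL query."""

    def __init__(self, conn, table, where=None, group=None, aggs=None):
        self.conn = conn
        self.table = table
        self.where = where or []
        self.group = group or []
        self.aggs = aggs or []

    def filter(self, condition):
        # Nothing runs here; the condition is only recorded.
        return LazyFrame(self.conn, self.table, self.where + [condition], self.group, self.aggs)

    def group_by(self, *columns):
        return LazyFrame(self.conn, self.table, self.where, list(columns), self.aggs)

    def agg(self, **named_exprs):
        aggs = [f"{expr} AS {name}" for name, expr in named_exprs.items()]
        return LazyFrame(self.conn, self.table, self.where, self.group, self.aggs + aggs)

    def to_sql(self):
        select = ", ".join(self.group + self.aggs) or "*"
        sql = f"SELECT {select} FROM {self.table}"
        if self.where:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.where)
        if self.group:
            cols = ", ".join(self.group)
            sql += f" GROUP BY {cols} ORDER BY {cols}"
        return sql

    def collect(self):
        # The only place where work happens, and it happens inside the database.
        return self.conn.execute(self.to_sql()).fetchall()


# SQLite stands in for Snowflake; the table and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL, channel TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 20.0, "web"), (1, 35.0, "store"), (2, 500.0, "web"), (2, 15.0, "web"), (3, 60.0, "store")],
)

# Per-customer features for web transactions, expressed as chained calls.
features = (
    LazyFrame(conn, "transactions")
    .filter("channel = 'web'")
    .group_by("customer_id")
    .agg(txn_count="COUNT(*)", total_amount="SUM(amount)")
)
feature_sql = features.to_sql()
feature_rows = features.collect()
print(feature_sql)
print(feature_rows)
```

Because the raw rows never leave the database, only the small aggregated result crosses the wire, which is the property the paragraph above attributes to Snowpark.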
More on Java UDFs
For scenarios such as machine learning scoring, applying custom code, or using third-party libraries, Java UDFs offer the following benefits:
- Developers can build custom functionality in Snowflake using the JVM languages and popular libraries
- Snowpark ‘publishes’ functions developed in Scala as UDFs for execution in Snowflake via SQL or the Snowpark API
- Users can access this functionality as if it were a built-in function in Snowflake
- Administrators can rest easy: data never leaves Snowflake, and access and execution permissions for functions can be controlled
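As a rough analogy for the points above, the sketch below registers a scoring function with a database and then calls it from SQL like a built-in. It uses Python's `sqlite3.Connection.create_function` with an in-memory SQLite database as a local stand-in; it is not a Snowflake Java UDF. The function name `fraud_score`, the coefficients, and the table are all hypothetical.

```python
import math
import sqlite3

# Hypothetical coefficients of a small logistic-regression fraud model.
INTERCEPT = -4.0
WEIGHT_AMOUNT = 0.01
WEIGHT_FOREIGN = 1.5


def fraud_score(amount, is_foreign):
    """Return a probability-like score between 0 and 1 for one transaction."""
    z = INTERCEPT + WEIGHT_AMOUNT * amount + WEIGHT_FOREIGN * is_foreign
    return 1.0 / (1.0 + math.exp(-z))


# SQLite stands in for Snowflake; registering the function plays the role of publishing a UDF.
conn = sqlite3.connect(":memory:")
conn.create_function("fraud_score", 2, fraud_score)

conn.execute("CREATE TABLE transactions (txn_id INTEGER, amount REAL, is_foreign INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 50.0, 0), (2, 400.0, 1), (3, 900.0, 1)],
)

# The scoring function is now callable from plain SQL, next to the data.
flagged = conn.execute(
    """
    SELECT txn_id, fraud_score(amount, is_foreign) AS score
    FROM transactions
    WHERE fraud_score(amount, is_foreign) > 0.5
    ORDER BY txn_id
    """
).fetchall()
print(flagged)
```

In Snowflake, the equivalent function would be compiled JVM code and its use would be governed by the access and execution permissions mentioned in the list above; the calling pattern from SQL is the part this sketch is meant to convey.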
How AtScale Supports Snowflake Data Science Use Cases
The AtScale semantic layer is where you define your business concepts in a single place, exposing one source of governed metrics and analysis dimensions to all data consumers. This includes both quantitative and qualitative data, such as KPIs, time series analytics, calculations, dimensional information, and categorical information. Dimensional data refers to a structured way of looking at categorical information like time, geography, customers, vendors, and product or time hierarchies. When these concepts are defined in a single place, data consumers can navigate large data sets more reliably and consistently. Data scientists can leverage this consistent source of data within their models.
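To make "defined in one single place" concrete, here is a minimal sketch of the general idea, not of AtScale's actual modeling format or API. The `SEMANTIC_MODEL` dictionary, the `build_query` helper, and the `sales` table are hypothetical, and SQLite again stands in for Snowflake. Each metric and dimension is declared once, and every consumer gets its SQL generated from that single definition.

```python
import sqlite3

# Hypothetical semantic model: each metric and dimension is defined exactly once.
SEMANTIC_MODEL = {
    "table": "sales",
    "metrics": {
        "total_revenue": "SUM(quantity * unit_price)",
        "avg_order_value": "SUM(quantity * unit_price) / COUNT(DISTINCT order_id)",
    },
    "dimensions": {
        "region": "region",
        "order_month": "substr(order_date, 1, 7)",
    },
}


def build_query(metrics, dimensions, model=SEMANTIC_MODEL):
    """Compile requested metric and dimension names into one SQL statement."""
    dim_exprs = [model["dimensions"][d] for d in dimensions]
    select = [f"{expr} AS {name}" for name, expr in zip(dimensions, dim_exprs)]
    select += [f"{model['metrics'][m]} AS {m}" for m in metrics]
    sql = f"SELECT {', '.join(select)} FROM {model['table']}"
    if dim_exprs:
        grouping = ", ".join(dim_exprs)
        sql += f" GROUP BY {grouping} ORDER BY {grouping}"
    return sql


# SQLite stands in for Snowflake; the rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER, order_date TEXT, region TEXT, "
    "quantity INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2023-01-15", "East", 2, 10.0),
        (2, "2023-01-20", "East", 1, 30.0),
        (3, "2023-02-03", "West", 5, 4.0),
        (3, "2023-02-03", "West", 1, 40.0),  # second line item of order 3
    ],
)

# A BI dashboard and a feature pipeline both reuse the same definitions.
by_region = conn.execute(build_query(["total_revenue", "avg_order_value"], ["region"])).fetchall()
by_month = conn.execute(build_query(["total_revenue"], ["order_month"])).fetchall()
print(by_region)
print(by_month)
```

Changing the definition of `total_revenue` in the model changes it for every query built from it, which is the consistency benefit the paragraph describes.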
In addition to managing historical data within the AtScale semantic layer, it is possible to manage the outputs of data science models in the same way. As production ML models generate modeled insights, like predictions or recommendations, they can be written back to Snowflake through AtScale. This enables two important capabilities:
- Decision makers can work with data science output in the same tools they already use for historical data. Predictions can be studied within the same Power BI or Tableau dashboards managed by traditional BI teams.
- Leveraging the power of dimensional analysis, analysts and decision makers can drill into predictive results in the same way they would interact with historical data. A regional manager could look at predicted sales for the entire region next year and then drill into more granular predictions by store, by product line, by month, and so on. As AI-generated predictions become more commonplace and pervasive, a semantic layer that delivers business context becomes as important for modeled insights as it is for historical data.
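The write-back and drill-down pattern described above can be sketched as follows. This is an illustration under stated assumptions rather than AtScale's mechanism: the `sales_predictions` table, the `write_predictions` and `drill` helpers, and the region, store, month hierarchy are hypothetical, the prediction values are made-up sample data, and SQLite stands in for Snowflake.

```python
import sqlite3

# Hypothetical drill path, from coarsest to finest level.
HIERARCHY = ["region", "store", "month"]

# SQLite stands in for Snowflake.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_predictions (region TEXT, store TEXT, month TEXT, predicted_sales REAL)"
)


def write_predictions(conn, rows):
    """Persist model output so it can be queried like any historical table."""
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO sales_predictions VALUES (?, ?, ?, ?)", rows)


def drill(conn, depth, filters=None):
    """Aggregate predictions at the given hierarchy depth, optionally filtered."""
    if not 1 <= depth <= len(HIERARCHY):
        raise ValueError("depth must be between 1 and the number of hierarchy levels")
    filters = filters or {}
    unknown = set(filters) - set(HIERARCHY)
    if unknown:
        raise ValueError(f"unknown filter columns: {sorted(unknown)}")
    cols = ", ".join(HIERARCHY[:depth])
    sql = f"SELECT {cols}, SUM(predicted_sales) FROM sales_predictions"
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    sql += f" GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql, list(filters.values())).fetchall()


# Sample model output (made-up numbers) written back to the database.
write_predictions(
    conn,
    [
        ("North", "Store A", "2025-01", 100.0),
        ("North", "Store A", "2025-02", 120.0),
        ("North", "Store B", "2025-01", 80.0),
        ("South", "Store C", "2025-01", 200.0),
    ],
)

region_totals = drill(conn, depth=1)
north_by_store = drill(conn, depth=2, filters={"region": "North"})
print(region_totals)
print(north_by_store)
```

The same `drill` call would work unchanged on a table of historical sales with the same dimensions, which is why predictions stored this way can sit beside actuals in an existing dashboard.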
“With AtScale, data scientists can programmatically write results back to the semantic layer, and also, more importantly, they can automatically write predictions back to the semantic layer”
AtScale enables data, features, and relationships to be modeled and persisted on top of Snowflake data warehouses. The AtScale semantic layer platform can align BI and AI workloads, bridging data science teams with business intelligence. AtScale’s close integration with Snowflake lets joint customers leverage native DataFrame support via Snowpark to build sophisticated data pipelines for production ML models. Java UDFs, called from Snowflake, allow fast execution of compiled custom code and access to JVM-based languages and libraries directly in Snowflake.