How To Guide

Forecast with ML Better than Ever Before

On October 24, 2022, AtScale announced the newest release of AI-Link, v2.0.0, with new capabilities that support large-scale ML model training and inference for forecasting use cases with Spark, continuing to bridge the gap between BI and AI.

Too Many AI Projects Fail

Businesses see AI as a gateway to strategic advantage in the market and are looking for opportunities to incorporate this technology into their existing workflows. BCG and MIT conducted a survey of global business leaders and found that 83% view AI as an immediate strategic priority, and 75% of respondents believe AI will help them move into new businesses and ventures.

However, despite this growing interest and investment, successful adoption of AI has not been widespread. Year over year since 2020, over 79% of AI projects have remained in pilot or limited production, failing to deliver value to the business or end consumer; in fact, only 53% of AI projects make it into a production environment.

What’s more concerning is the lack of value these projects have produced so far: Gartner reported that over 85% of AI projects (including those in production) are not yet producing business value.

What PySpark Support in AtScale Means: More Success from AI Projects

To date, AtScale AI-Link has enabled organizations to programmatically interact with the semantic layer using Python via a REST endpoint and pandas dataframes. This allows users who know Python to take advantage of the rich business-defined metric store and the optimized query engine to query, transform, and define feature pipelines for ML training. However, pandas dataframes are not the only tool in the data scientist’s toolbox. Once datasets move into the domain of Big Data (gigabytes, terabytes, petabytes), data scientists must shift to distributed computing in order to process data effectively and train performant ML models. While pandas is very popular for creating features, it is limited to the memory of a single machine, making it non-viable for Big Data. AtScale helps with this, but ML models may need terabytes of data to train adequately. When data scientists reach this point, they turn to tools like the PySpark dataframe.

PySpark is a Python interface for Spark that enables Python applications to use Apache Spark’s distributed data processing capabilities. By offering native support for the PySpark library, AI-Link opens the door to processing the large-scale datasets used for ML applications while still leveraging the business definitions in the semantic layer. This comes on top of the native benefits of Spark. For example, Spark works on distributed in-memory data, making data access and processing, especially for real-time streaming, much faster than traditional MapReduce approaches.

With this latest announcement, AtScale AI-Link 2.0.0 natively supports reading data directly into a PySpark dataframe (no longer just pandas), allowing data scientists to access more data and train larger models, and to write those ML predictions back so business analysts can drive strategic decisions from AI.

This means more models created, more models in production, and more ROI from AI investment. 

How to Apply this: A Common ML Workflow

Let’s see this in action by walking through a common machine learning workflow. We will show how a user leverages PySpark and AI-Link to query business-defined metrics in AtScale, removing the need to write complex database queries. Once the user has this data in a PySpark dataframe, they are free to train a model with their ML framework of choice. To expose the predictions from the trained model, we will use AI-Link to write them back to both the semantic layer and the underlying data warehouse. We’ll then show how these predictions can be seamlessly grouped and manipulated as new measures within a business intelligence (BI) tool so subject matter experts or business analysts can understand them and drive strategic initiatives. We use Databricks for this demo, but the flow is compatible with any PySpark environment.

Step 1: Connect to AtScale Data Model

Connect to AtScale Data Model
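The connection step above can be sketched roughly as below. This is a minimal, hedged sketch: the `Client`, `select_project`, and `select_data_model` names reflect our reading of the AI-Link 2.0 Python API and may differ in your installed version, and the server URL and credentials are placeholders.

```python
def connect_to_data_model(server_url, organization, username, password):
    """Connect to AtScale and return a DataModel handle.

    The class and method names below follow the AI-Link 2.0 API as we
    understand it; check your installed version for exact signatures.
    """
    # atscale is the AI-Link package; imported inside the function so the
    # sketch can be read (and the function defined) without the dependency
    from atscale.client import Client

    client = Client(
        server=server_url,            # e.g. "https://atscale.example.com"
        organization=organization,
        username=username,
        password=password,
    )
    client.connect()
    # Pick the project, then the data model inside it
    project = client.select_project()
    return project.select_data_model()
```

With a `DataModel` in hand, every later step (querying features, writeback) goes through this one object.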

Step 2: Generate PySpark dataframe

Now that we have access to the DataModel, we can generate our PySpark dataframe using a JDBC connection. The required metadata and option keys are warehouse-specific; here we use Snowflake:

Generate PySpark dataframe
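A sketch of the warehouse-specific setup follows. The option dict uses Spark's standard JDBC data source keys and Snowflake's documented JDBC URL format; the account, credential, and warehouse values are placeholders, and the commented `spark.read` usage assumes the Snowflake JDBC driver jar is available on the cluster.

```python
def snowflake_jdbc_options(account, user, password, warehouse, database, schema):
    """Build the option dict for Spark's JDBC reader against Snowflake.

    Keys follow Spark's generic JDBC data source; the URL is Snowflake's
    documented JDBC connection string format.
    """
    return {
        "url": (
            f"jdbc:snowflake://{account}.snowflakecomputing.com/"
            f"?warehouse={warehouse}&db={database}&schema={schema}"
        ),
        "driver": "net.snowflake.client.jdbc.SnowflakeDriver",
        "user": user,
        "password": password,
    }

# Hedged usage inside a live Spark session:
#
# opts = snowflake_jdbc_options("my_account", "user", "pw",
#                               "COMPUTE_WH", "SALES_DB", "PUBLIC")
# df = (spark.read.format("jdbc")
#           .options(**opts)
#           .option("query", sql_text)   # SQL generated via the semantic layer
#           .load())
```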

We now have all the necessary information to get at the underlying data. Since we are leveraging the semantic model, we can reference features and measures that are defined across any tables with just their query names. For example:

Query data directly into a PySpark dataframe
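The query step can be sketched like this. The `get_data_spark_jdbc` method name and its parameters are our reading of the PySpark support announced in AI-Link 2.0 and may differ in your release; the feature query names in `feature_list` are illustrative, not from a real model.

```python
def load_features(data_model, spark_session, jdbc_options):
    """Query semantic-layer features straight into a PySpark dataframe.

    data_model:    an AI-Link DataModel handle
    spark_session: an active pyspark.sql.SparkSession
    jdbc_options:  warehouse-specific JDBC options (e.g. for Snowflake)

    The method name below is an assumption about the AI-Link 2.0 API;
    run help(data_model) to confirm it in your version.
    """
    # Query names from the semantic model - no table joins or SQL needed
    feature_list = ["date", "store_region", "total_sales"]  # illustrative
    return data_model.get_data_spark_jdbc(
        feature_list=feature_list,
        spark_session=spark_session,
        jdbc_format="jdbc",
        jdbc_options=jdbc_options,
    )
```

Because the semantic layer resolves the query names, the generated SQL handles joins and aggregation pushdown for you.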

Once the data is queried into a Spark dataframe, you can use your ML framework of choice for feature selection and generating predictions. For the sake of this post, we use a dummy prediction column called ‘pred_sales’.
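The dummy prediction can be produced as below. The 5% growth "model" is pure illustration standing in for a real fitted model, and the commented PySpark lines assume a dataframe `df` with a `total_sales` column from the previous step.

```python
def dummy_sales_prediction(total_sales):
    """Stand-in 'model' for this post's dummy pred_sales column.

    A real workflow would replace this with predictions from a fitted
    model (Spark MLlib, XGBoost, etc.).
    """
    return total_sales * 1.05  # pretend sales grow 5%

# Hedged PySpark usage (assumes `df` from the query step):
#
# from pyspark.sql import functions as F
# predict_udf = F.udf(dummy_sales_prediction, "double")
# df = df.withColumn("pred_sales", predict_udf(F.col("total_sales")))
```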

Step 3: Writeback Predictions for Analysis

With our predictions in place, exposing them to the business is straightforward. We simply connect to our source database, write our predictions as a dataframe to AtScale (which pushes them down to a new physical table in the source database), and create an aggregate feature from the predictions so they can be leveraged like any other measure in a BI tool:

Writeback Predictions for Analysis
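The writeback step looks roughly like the sketch below. The `writeback` and `create_aggregate_feature` method names and keyword arguments reflect our reading of the AI-Link API (the real API may expect an aggregation enum rather than the string used here), and the table and measure names are illustrative.

```python
def publish_predictions(data_model, warehouse_conn, predictions_df):
    """Write predictions back and expose them as an aggregate measure.

    Method names and parameters are assumptions about the AI-Link API;
    consult the AI-Link docs for the exact signatures in your release.
    """
    # Push the dataframe down to a new physical table in the source
    # warehouse and register it with the semantic layer
    data_model.writeback(
        dbconn=warehouse_conn,
        table_name="sales_predictions",   # illustrative table name
        dataframe=predictions_df,
    )
    # Surface the prediction column as a SUM-aggregated measure so BI
    # tools can slice it like any other metric; the real API may take an
    # aggregation enum instead of the plain string used here
    data_model.create_aggregate_feature(
        column_name="pred_sales",
        name="total_pred_sales",
        aggregation_type="SUM",
    )
```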

This aggregate feature can now be accessed like any other via BI tools or get_data: 

Query via BI tools or get_data
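Pulling the new measure back programmatically might look like this. `get_data` is the pandas-returning query method AI-Link has offered to date; the feature query names here are illustrative and assume the aggregate measure created in the previous step.

```python
def fetch_for_analysis(data_model):
    """Pull the new aggregate measure back as a pandas dataframe.

    Query names are illustrative; 'total_pred_sales' assumes the measure
    created during writeback.
    """
    return data_model.get_data(
        feature_list=["store_region", "total_pred_sales"]
    )
```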

And there we have it: in just a few steps, we’ve loaded data into a Spark dataframe while leveraging the query optimization of the semantic layer, made predictions, and written them back into the semantic layer with an aggregation method for widespread use.

We enable users to create new measures that help explain ML predictions and make them relevant to the business problem in easily referenced BI dashboards. AtScale optimizes aggregations so a user can programmatically extend data models and use the data warehouse engine for optimal query performance, even when working with datasets typically too large to load into a dataframe. This means 5x more AI-generated insights available to business users to drive team efficiencies, cost savings, and newly created business models.

With Spark support, we enable data scientists to take advantage of the semantic layer to drive the creation and analysis of business-defined AI model output in just a few steps, using their preferred platforms for AI, like Databricks notebooks, and deliver faster time to actionable insights for business stakeholders.

AtScale AI-Link opens the aperture for the consumption and use of AI across use cases and industries. If you have any questions, or would like a free trial to test AI-Link for yourself, feel free to reach out any time.

Request a Demo