Data Lake Intelligence with Amazon Redshift Spectrum

In a recent article in Information Age we talked about how the worlds of big data and the cloud are preparing the path for the emerging concept of Data Lake 2.0. We believe that Amazon’s approach to the cloud-enabled data lake very much aligns with this vision. In AtScale we want customers to have access to the tools they are familiar with to derive value and insights from their data. This post will dive deeper into the core components of the cloud data lake and show how customers are now able to deliver modern, agile business intelligence on top of the data stored in these lakes.

LEARN ABOUT AMAZON REDSHIFT AND ATSCALE

The Basics

The core components of Data Lake 2.0 include cloud storage, a shared data catalog, and access to multiple data processing engines. Amazon’s offerings in these areas are detailed below.

Shared Data Catalog: Amazon Web Services provides several mechanisms for sharing data catalogs between processing services. For example, the Amazon Glue Data Catalog can maintain and expose a shared data catalog service that can be used and accessed by services like Amazon Elastic MapReduce (EMR) and Amazon Redshift.
Built-for-Purpose Data Processing: with shared storage and a shared data catalog in place, AWS customers are able to seamlessly use processing services like Amazon EMR to do large scale batch data processing while also using Amazon Redshift Spectrum to execute analytic queries against the same base data.

While the components above are sufficient to act as a modern, cloud-enabled data platform, one of the core use cases that needs to be enabled for this platform to deliver business value is business intelligence. Data lakes and self-service data visualization must satisfy some core requirements to be a modern business intelligence platform.

These platforms must deliver a design experience that enables the creation of BI models (more traditionally, cubes) directly on data stored on cloud storage.
A successful platform needs to enable an interactive, highly concurrent query experience for business users without requiring data movement off the data lake.
A modern platform for data lake intelligence must support a wide range of client access modes, including both SQL and MDX (multidimensional expression language).

AtScale enables a design experience directly on data on Amazon S3 while it’s Adaptive Cache layer leverages Amazon Redshift to deliver an interactive query experience. This approach robustly satisfies all of the core requirements described above. AtScale’s open data interface allows Business Intelligence (BI) tools like Tableau and Microsoft Excel to easily access and query data without requiring additional data transformation, data movement or tool specific extracts.

In the rest of this post we will provide an end-to-end demonstration of how Amazon S3, Amazon Redshift, Amazon Spectrum, and AtScale can provide a complete solution for cloud-based data lake intelligence.

The Use Case

For purposes of this scenario, let’s imagine there is a need to allow business analysts to easily interact with and analyze the performance of content delivered using Amazon’s Cloudfront content delivery network (CDN). For this example, we will parse the raw logs (which are stored on S3), create a Hive table (also stored on S3), register this Hive table in Redshift (using Amazon Redshift Spectrum), create a virtual cube using AtScale, and then query and optimize the underlying data and queries using Amazon Redshift. The overall architecture of this approach is shown below.

Let’s dive into the specific steps to realize this use case.

Parse Raw Cloudfront Logs and Create a Hive Data Set

The logs for Cloudfront are stored on Amazon S3, and the first step to making them consumable is to use a batch processing system like EMR to prepare the data. Given the volume of raw CDN data and the batch nature of data parsing required, this Amazon EMR processing step might be run daily or hourly, writing back row level (but parsed) records back to Amazon S3.

The Hive query to create the table is:

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs ( DateObject Date,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
OS String,
Browser String,
BrowserVersion String
)
The Hive code that parses the log files using the RegEx SerDe looks like the following:

ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES (
“input.regex” = “^(?!#)([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+[^(]+(.*%20([^/]+)/$”
) LOCATION ‘${INPUT}/cloudfront/data/’;

Creating an External Table in Amazon Redshift Using Spectrum

Using the code above, a table called cloudfront_logs is created on Amazon S3, with a catalog structure registered in the shared Amazon Glue data catalog. Because of the shared nature of Amazon’s S3 storage and Glue data catalog, this new table can now be registered on Amazon Redshift using a feature called Spectrum. Spectrum allows users to run Amazon Redshift SQL queries against raw data stored on Amazon S3. Essentially, this extends the analytic power of Amazon Redshift beyond data stored on local disks by enabling access to vast amounts of data on the Amazon S3 “data lake”.

The process of registering an external table in Redshift using Spectrum is simple.

First, create an external schema that uses the shared data catalog:

create external schema spectrum from data catalog
database ‘spectrumdb’
iam_role ‘arn:aws:iam::123456789012:role/RedshiftSpectrumRole’
create external database if not exists;

Then, create an external table reference that points to the cloudfront data that was output from the EMR Hive process:

drop table spectrum.cloudfront;
create external table spectrum.cloudfront(
DateObject CHARACTER VARYING(100),
Time CHARACTER VARYING(100),
Location CHARACTER VARYING(100),
Bytes INT,
RequestIP CHARACTER VARYING(100),
Method CHARACTER VARYING(100),
Host CHARACTER VARYING(100),
Uri CHARACTER VARYING(100),
Status INT,
Referrer CHARACTER VARYING(100),
OS CHARACTER VARYING(100),
Browser CHARACTER VARYING(100),
BrowserVersion CHARACTER VARYING(100)
)
stored as parquet
location ‘s3://s3bucketname/parquet/cloudfront/’;
To view the external Spectrum table definition and confirm its creation you may query the SVV_EXTERNAL_COLUMNS system view within your Redshift database.

Modeling an AtScale Virtual Cube on the Cloudfront Data Set

Once the cloudfront data set in Amazon S3 has been registered within Redshift, it is now possible to start creating a cube within AtScale using this data set as the core building block. The screenshot below shows the design experience in AtSale when designing a virtual cube on top of data in Amazon Redshift and Amazon S3.

The screenshot above highlights several core capabilities that AtScale and Amazon are able to support when used as an end-to-end data lake intelligence solution:

The core fact table in this model, cloudfront, is available for modeling within AtScale (along with the associated ability to preview the data set). However, this “table” is simply an external reference, using Amazon Redshift Spectrum, to the raw data set residing in Amazon S3. This means AtScale cube designers can design virtual cubes directly on data in Amazon S3.
Note that this model is able to join raw data from Amazon S3 (the aforementioned cloudfront table) with data that already exists in Amazon Redshift tables. In this case there is a dimension, Cloudfront Location, that is sourced from a table, cloudfront_location, that is stored in an Amazon Redshift database.
AtScale can build a traditional dimensional model (with measures, dimensions, and hierarchies) with a mix of underlying schema representations. For example, while the Cloudfront Location is modeled like a traditional dimension in a star schema, the browser and OS dimensions are based directly from the fact table, taking the form of a degenerate dimension. This level of flexibility means that AtScale, Amazon Redshift, and Amazon S3 can be used together to support a broad range of raw data structures.

Executing Queries in Tableau and Excel

Once created, the AtScale virtual cube can be exposed to virtually any data visualization tool that generates SQL or MDX queries. Let’s take a look at what the cube looks like when queried using Tableau.

In this view, we are showing the total number of bytes distributed across cloudfront locations, and with the OS of the requestor highlighted by different colors. Note that there are 255,475 bytes associated with Linux for Cloudfront Locations in Miami, FL.

Looking at the query logs for this query in AtScale helps to better understand the mechanics of the system. First, the inbound query (the query from Tableau to the AtScale cube) shows a very simple SELECT with a GROUP BY:

SELECT
Cloudfront.Cloudfront_Location_2 AS cloudfront_location_2,
Cloudfront.Os AS os,
SUM(Cloudfront.m_bytes_sum) AS sum_m_bytes_sum_ok
FROM
cloudfront.cloudfront Cloudfront
GROUP BY
1,
2
Upon receiving this query, AtScale uses its knowledge of the AtScale Virtual Cube model, along with the availability of data on the Amazon Redshift cluster, to construct an appropriate query to push down to Amazon Redshift, as shown in the outbound query below:

SELECT
t_6.loc_nm_c7 AS cloudfront_location_2,
cloudfront_t4.os AS os,
CAST(SUM(cloudfront_t4.bytes) AS BIGINT) AS sum_m_bytes_sum_ok
FROM
spectrum.cloudfront AS cloudfront_t4
JOIN (
SELECT
cloudfront_locatio_t5.loc_cd AS loc_cd_c8,
cloudfront_locatio_t5.loc_nm AS loc_nm_c7
FROM
atscale.cloudfront_location AS cloudfront_locatio_t5
GROUP BY
1,
2
) AS t_6 ON cloudfront_t4.”location” = t_6.loc_cd_c8
GROUP BY
1,
2
Note that the query above is able to access both the cloudfront data in Amazon S3 (by querying the previously created spectrum.cloudfront table) and join it with data stored as a native Amazon Redshift table – the atscale.cloudfront_location table.

Let’s take a look at what it looks like to query this same AtScale virtual cube from Excel, looking at the same elements from the previous query.

Note that even though the BI client in this case is different, the results are identical to the Tableau results – 255,475 bytes for Linux and Miami, FL. Also, note that the inbound query for this result set was created in MDX (multidimensional expression language).

SELECT
NON EMPTY Hierarchize(
{ DrilldownLevel({ [Os].[Os].[All] },,, INCLUDE_CALC_MEMBERS) }
) DIMENSION PROPERTIES PARENT_UNIQUE_NAME,
HIERARCHY_UNIQUE_NAME ON COLUMNS,
NON EMPTY Hierarchize(
{ DrilldownLevel(
{ [Cloudfront Location].[Cloudfront_Location_2].[All] },,,
INCLUDE_CALC_MEMBERS
) }
) DIMENSION PROPERTIES PARENT_UNIQUE_NAME,
HIERARCHY_UNIQUE_NAME ON ROWS
FROM
[Cloudfront]
WHERE
([Measures].[m_bytes_sum]) CELL PROPERTIES VALUE,
FORMAT_STRING,
LANGUAGE,
BACK_COLOR,
FORE_COLOR,
FONT_FLAGS

AtScale Adaptive Cache on Amazon Redshift

As the scenario above highlights, AtScale’s virtual cube interface, with its support for both SQL and MDX clients, is able to act as a single and consistent semantic layer for business intelligence across multiple clients. This capability alone is of great value to enterprises that are trying to bridge the gap between their big data platforms and their BI consumers. An additional objective of data lake intelligence is to support interactive BI queries, regardless of user concurrency and underlying data size.

To enable interactive query performance, even on the largest data set, AtScale constantly monitors end user query patterns and determines if the creation of an aggregate table could more efficiently satisfy similar versions of the same query. The collection of aggregate tables that are created (and maintained) by AtScale for a specific cube is called the Adaptive Cache. These tables are stored directly on the host data platform – in the example above, this means AtScale aggregates are stored on Amazon Redshift. An example of one such aggregate definition is shown below:

This aggregate was created in response to the original Tableau query that requested the sum of bytes grouped by OS and Location. For the MDX query from AtScale, the query can be satisfied by the Adaptive Cache aggregate table in Redshift and not the raw data in Spectrum. The outbound query to satisfy the Excel-derived MDX query is shown below:

SELECT
as_agg_85acb300_os_t4.os_c3 AS c0,
t_6.cloudfront_locatio_c8 AS c1,
CAST(
SUM(as_agg_85acb300_os_t4.m_bytes_sum_c4) AS BIGINT
) AS c2
FROM
atscale.as_agg_85acb300_os AS as_agg_85acb300_os_t4
JOIN (
SELECT
as_agg_939860e8_cl_t5.cloudfront_location_2_c2 AS cloudfront_locatio_c8,
as_agg_939860e8_cl_t5.key_c1 AS key_c1_c7
FROM
atscale.as_agg_939860e8_cldfrntlctn2 AS as_agg_939860e8_cl_t5
GROUP BY
1,
2
) AS t_6 ON as_agg_85acb300_os_t4.key_c1 = t_6.key_c1_c7
GROUP BY
1,
2
Although this query hit the aggregate table (and as a result was faster than a query against the raw data) the results and user experience were consistent for the end user.

In Conclusion: What This Means for Data Lake Intelligence

Key summary points:

Modern data architectures will take advantage of these capabilities when supporting BI
Amazon has a flexible, robust data lake architecture with S3, Spectrum, and Redshift
With AtScale and Amazon, a vision of data lake intelligence can be realized

Learn About Amazon and AtScale

The Amazon data lake architecture that includes Amazon S3, Amazon Spectrum, and Amazon Redshift provides an affordable, elastic, and fully managed data lake solution. The robust architecture provides the ability to store, access, and analyze vast amounts of data in a secure fashion to drive business agility. Modern architectures need to service the needs of the business users while providing security, governance, and flexibility.

Organizations need to consider their business users in the development of their data lake and provide easy access to data using the tools they know. When using a universal semantic layer like AtScale on the data lake, business users can be efficient, effective, and empowered by using the common logic delivered by AtScale. Business users can spend time debating what can be done based on the data, instead of debating the data itself, thus achieving the vision for data lake intelligence.

Learn more about AtScale Cloud today @ www.atscale.com/cloud