December 31, 2019
Data Loading, one part of the ETL trinity, is conceptually simple. Where it gets interesting is the question of when you do the loading: before or after the Transformation step, in which data is converted from one format or structure to another.
ETL refers to the process of Extracting data from numerous source platforms, both relational and otherwise, applying Transformations – either within or outside of the target system – and Loading that potentially transformed data into a target data warehouse for consumption by analytics and visualization.
The interesting bits of Loading are the decisions around whether to load and then Transform, or Transform and then load. Technically, pre- or post-load transformation doesn’t affect the mechanics of loading itself. However, it can change the complexity: Transformation can change a data type, split fields, or make other changes that make loading much less of a straight line from Source to Target.
The decision to transform pre-load (ETL) or post-load (ELT) can be heavily influenced by many factors: the complexity of the transformation, the nature of the load technology, and the volume of data come to mind. The big data concept of MapReduce was invented precisely to “move the compute closer to the data, instead of the data closer to the compute.” So, for big data environments, it is often desirable to load first and then transform inside the data warehouse.
ETL vs. ELT – Differences in Performance
There are two common patterns you could follow when designing your data warehouse.
The big difference between loading or transforming first is performance. Databases that are distributed at their core, such as Amazon Redshift, Google BigQuery, and Snowflake, are naturally optimized for ELT. As mentioned, ETL with large data sets can be problematic. To illustrate the differences between ETL and ELT, we looked at three common operations: calculations, lookups (joins), and aggregations.
Calculations
Standard SQL has many ways to alter data, and software code can obviously change data as well. In ETL, code is applied to the data to change the structure or format prior to moving it into a new repository. In contrast, in ELT, you define a calculated or derived column for the data you’ve already moved and specify SQL expressions compatible with your target platform to drive your calculations.
Lookups (Joins)
In ETL, lookups are performed in a row-wise manner, relating the fact data to a dimension via a key. Related data is fetched via some sort of API, appended to the source rows, and then pushed to the target data warehouse. In ELT, you can do the equivalent of a Create Table As Select (CTAS) that performs the join via the RDBMS engine and deposits the result into a new table.
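A sketch of the ELT-style lookup, again with sqlite3; the star-schema fragment (`fact_sales`, `dim_region`) is a hypothetical example, and larger engines run the same CTAS pattern in parallel.

```python
import sqlite3

# Illustrative star-schema fragment: a fact table keyed to a dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (sale_id INTEGER, region_id INTEGER, amount REAL);
    CREATE TABLE dim_region (region_id INTEGER, region_name TEXT);
    INSERT INTO fact_sales VALUES (1, 10, 100.0), (2, 20, 250.0);
    INSERT INTO dim_region VALUES (10, 'East'), (20, 'West');
""")

# ELT lookup: CTAS performs the join inside the engine and materializes
# the result as a new table -- no row-by-row API calls from outside.
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT f.sale_id, d.region_name, f.amount
    FROM fact_sales f
    JOIN dim_region d ON f.region_id = d.region_id
""")
joined = conn.execute(
    "SELECT * FROM sales_by_region ORDER BY sale_id").fetchall()
```

The join never leaves the database, which is exactly the “move the compute to the data” point made above.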
Aggregations
In the ETL process, aggregations are an interesting subject. Data volume obviously matters, and if you have multiple use cases that deal with the data at different levels of granularity, you need to generate each of those views as well. When these views consider most, if not all, data elements, the combination of many variations plus lots of data can wreck your data ingest window.
ELT supports CTAS for creating aggregations, and this is significantly faster than data-movement ETL solutions.
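The same CTAS pattern builds an aggregate at a coarser grain in one statement; here is a minimal sketch with a hypothetical `events` table.

```python
import sqlite3

# Illustrative detail-grain table (names and values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, value REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, 5.0), (1, 7.0), (2, 3.0)])

# ELT aggregation: one CTAS materializes a per-user summary table,
# leaving the detailed rows intact for other use cases at finer grain.
conn.execute("""
    CREATE TABLE events_by_user AS
    SELECT user_id, COUNT(*) AS n, SUM(value) AS total
    FROM events
    GROUP BY user_id
""")
summary = conn.execute(
    "SELECT user_id, n, total FROM events_by_user ORDER BY user_id").fetchall()
```

Each additional granularity (per day, per region, and so on) is just another CTAS against the same loaded data, rather than another pass through the ingest pipeline.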
When should I choose ELT over ETL?
We believe ELT has significant benefits from both a performance and a maintenance perspective. That said, there are situations where ETL still makes sense: if you have a proprietary algorithm that is not expressible in the originating data system, you may need to “bring the data to the code” and load the data into the system where the algorithm can run.
In summary, while data loading in and of itself may not elicit excitement, the decisions around when in the process to transform are important, particularly when thinking about ETL as a means to the end of business users being able to consume data.
AtScale greatly simplifies this picture by enabling business people to leverage autonomous data engineering and achieve a result equivalent to ELT: accessing data from both legacy and modern data warehouses such as Oracle, Teradata, Google BigQuery, and Snowflake, and transforming it through virtualization, modeling, and a Universal Semantic Layer. This effectively frees up time from worrying about mechanics to focus on the higher-order analytics that drive business.