June 29, 2021
Building a Practical Data Fabric at Scale
Data transformation is the T in ETL. Data transformation means that data in one format is processed, either inside or outside the data store, and persisted in the new required format. This transformation can take several forms, spanning column storage types, file types, and even character encodings. There is no “single database” solution for enterprises, so data resides in different locations and formats, making data transformation necessary to ensure data from one application or database is available to other applications and databases.
Data transformation is one-third of the holy trinity of Extract, Transform & Load (ETL). ETL is a concept from the ’70s that was essential for data warehousing, as the technologies available before AtScale required the transactional (operational) schema and the reporting schema to be very different for performance reasons. ETL pipelines are developed to migrate (move) data from one database to another, and specifically to load data to and from data marts and data warehouses.
I’ll be clear: for me, ETL is a four-letter word, because it means someone had to execute a software project, and that’s how bugs happen. Executing software projects is complex and generally lengthy, while at the same time the keyboard is the bug-entry device. Do you know who develops software quickly and with potentially perfect, bug-free code? Computers. Automation and data intelligence can deliver you from the inevitable pain and suffering of software development. Some ETL is necessary, but too much makes your enterprise and all your data pipelines very fragile. This fragility creates a resistance to change by both IT and business users: the cost of implementing new ideas that use data is too high!
However, data is oxygen for the modern enterprise, and ETL has enabled a lot of value. Over the years and generations of technologies and development trends, ad hoc development of many legacy systems created information silos that contain redundant and inconsistent data. To answer many of the business’s analytical questions, it’s essential to integrate these information silos and leverage existing IT assets to create more flexible, agile enterprise systems. One way to do this is through data transformation.
So, what exactly is data transformation? In the ETL process, Transform is the step that converts the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables, by combining the data with other data, or by applying code.
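As a minimal sketch of the rule and lookup-table approaches, consider the snippet below. All field names, codes, and rules here are invented for illustration; real transforms would be driven by your own reference data:

```python
# Lookup table mapping source country codes to the target system's vocabulary.
# (Illustrative entries only.)
COUNTRY_LOOKUP = {"US": "United States", "GB": "United Kingdom", "DE": "Germany"}

def transform_row(row: dict) -> dict:
    out = dict(row)
    # Lookup-table transform: replace the code with the full country name.
    out["country"] = COUNTRY_LOOKUP.get(row["country"], "Unknown")
    # Rule-based transform: normalize free-text status values to a fixed set.
    out["status"] = "active" if row["status"].strip().lower() in ("a", "active") else "inactive"
    return out

print(transform_row({"country": "US", "status": " A "}))
# → {'country': 'United States', 'status': 'active'}
```

The same pattern scales up: the lookup table becomes a reference table in the warehouse, and the rules become CASE expressions in SQL.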
How Data Transformation is Accomplished
The first step of data transformation is mapping. Data mapping determines the relationship between the data elements of two use cases and defines how the data from the source application is transformed before it is loaded into the target. Data mapping produces a set of instructions or metadata that is needed before the actual data conversion takes place.
An example of field mapping: the information in one application might be rendered in upper case, while another application expects it in lower case.
The structure of stored data may also vary between data consumers, requiring semantic mapping prior to the transformation process. For instance, in a multinational company there may be a transform required to change the format used to represent dates, or simply to alter the precision of that particular data type.
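A data map for the two examples above (case normalization and date reformatting) might be sketched like this; the field names and formats are hypothetical:

```python
from datetime import datetime

# Illustrative data map: each target field names its source field and the
# transform to apply before the data is loaded into the target.
DATA_MAP = {
    "customer_name": ("CUSTOMER_NAME", str.lower),  # upper case → lower case
    "order_date": ("ORDER_DATE",
                   # US-style date → ISO 8601, as a multinational might require
                   lambda s: datetime.strptime(s, "%m/%d/%Y").strftime("%Y-%m-%d")),
}

def apply_map(source_row: dict) -> dict:
    return {target: fn(source_row[src]) for target, (src, fn) in DATA_MAP.items()}

print(apply_map({"CUSTOMER_NAME": "ACME CORP", "ORDER_DATE": "06/29/2021"}))
# → {'customer_name': 'acme corp', 'order_date': '2021-06-29'}
```

Note that the map itself is metadata: it can be generated, inspected, and governed separately from the code that executes it, which is exactly what the mapping step produces.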
Overall, there are two ways to approach ETL transformation:
- Multistage data transformation: Extracted data is moved to a staging area where transformations occur prior to loading the data into the warehouse.
- In-warehouse data transformation: Data is extracted and loaded into the analytics warehouse, and transformations are done there. This is sometimes referred to as Extract, Load, Transform (ELT).
In-warehouse transforms are gaining traction driven by two factors:
- The increased performance and scalability of the modern analytics database
- These types of transforms are expressed in SQL, the data manipulation language of choice
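To make the ELT pattern concrete, here is a minimal sketch using SQLite as a stand-in for the analytics warehouse; the table and column names are invented. Raw data is loaded first, then transformed in place with SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the analytics warehouse
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, country TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, 2000, "us"), (2, 500, "US"), (3, 1250, "de")])

# The Load already happened; the Transform runs inside the warehouse as SQL.
conn.execute("""
    CREATE TABLE orders AS
    SELECT order_id,
           amount_cents / 100.0 AS amount_dollars,  -- format revision
           UPPER(country)       AS country          -- cleaning/standardization
    FROM raw_orders
""")
result = conn.execute(
    "SELECT country, SUM(amount_dollars) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(result)  # → [('DE', 12.5), ('US', 25.0)]
```

Because the transform is just SQL, it inherits the warehouse's scalability for free, which is precisely the first factor above.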
AtScale allows for in-warehouse transforms, late-bound and materialized for performance. We empower data consumers to apply powerful transformations to their data warehouse using a feature we call Composer, a visual user interface that does not require the user to understand SQL.
Types of Data Transformation
- Deduplication: Duplicate records produce incorrect answers to queries, so removing duplicates is a common data transformation step.
- Format Revision: Date/Time conversions, units of measure, and character set encodings are common for multinational corporations that do business in the US (Imperial) and everywhere else in the world (metric). OK, fine, Myanmar and Liberia also use Imperial.
- Cleaning: Null handling and standardization on things like M/F for gender are critical for grouping dimensions and getting correct summarization of metric values.
- Key Engineering: Occasionally, the relationship between data stored in different databases is some function of a key. In these cases, key restructuring transforms are applied to normalize the key elements.
- Predication/Filtering: Only move data that satisfy the filter conditions.
- Summarization: A key element of Business Intelligence. Values are aggregated and stored at multiple levels as business metrics, for example, Order Quantity totals by city, state, and country.
- Derivation: Applying business rules to your data to derive new calculated values from existing data, for example, creating a revenue metric that subtracts taxes.
- Splitting: Splitting a single column into multiple columns.
- Data validation: Can be a simple “if/then” check or a multi-valued assessment, for example: if value A is “foo” and value B is greater than 100, reject this row of data.
- Integration: Similar to key engineering; standardizes how data elements are addressed. Data integration reconciles different data keys and values for data that should be the same.
- Joining: Standard database joining, and more exotic joining from API or unstructured sources. Consider an example where your customers are stored in Salesforce and your sales data is in a homegrown system. Wouldn’t it be nice to have intelligence across both?
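Several of the transformation types above can be sketched together in a single in-warehouse SQL statement. SQLite again stands in for the warehouse, and the table and column names are invented (the customers/sales split only loosely mirrors the Salesforce example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT, gender TEXT);
    INSERT INTO customers VALUES (1, 'Ada', 'female'), (2, 'Bob', 'M'), (2, 'Bob', 'M');
    CREATE TABLE sales (customer_id INTEGER, qty INTEGER, state TEXT);
    INSERT INTO sales VALUES (1, 3, 'CA'), (1, 2, 'CA'), (2, 5, 'NY'), (3, 9, NULL);
""")
summary = conn.execute("""
    WITH clean AS (
        SELECT DISTINCT                                   -- deduplication
               customer_id, name,
               CASE WHEN LOWER(gender) IN ('f', 'female')
                    THEN 'F' ELSE 'M' END AS gender       -- cleaning
        FROM customers
    )
    SELECT s.state, SUM(s.qty) AS total_qty               -- summarization
    FROM sales s
    JOIN clean c ON c.customer_id = s.customer_id         -- joining
    WHERE s.state IS NOT NULL                             -- filtering
    GROUP BY s.state
    ORDER BY s.state
""").fetchall()
print(summary)  # → [('CA', 5), ('NY', 5)]
```

One query deduplicates the customer dimension, standardizes gender codes, filters out rows with no state, joins the two sources, and rolls quantities up by state.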
How AtScale can Help
AtScale does not remove the requirement to transform data; however, like any good technology, transformation should be applied correctly to gain the promised benefit. AtScale gives you a very powerful platform for defining data transformations in a late-bound, performant, and declarative way to empower agile analytics.
- Transformations are able to be defined by business people using a visual UI.
- Transformations are also definable by complex scripting if sophistication is required.
- Transformations can reference custom User Defined Functions (UDF).
- Transformations are materialized in summary structures (aggregates) so you only pay the price once.
- Transformation metadata is structured and managed to support governance/lineage.