Data Transformation is the practice of enhancing data to improve its ability to address relevant business questions, including cleansing, filtering, attributing and structuring to define, construct and dimensionalize topically, semantically and consistently for effective querying. Data Transformation is part of a broader set of activities referred to as ETL: Extract, Transform and Load, which is the process of making source data available, accurate and relevant for insights creation, including for user access and reporting and analysis.
The purpose of data transformation is to enhance the raw data from its source in order to provide a consistent, accurate and understandable set of data that can be queried by a broad array of business insights creators and consumers to address relevant business questions effectively.
Primary Uses of Data Transformation Software
Data Transformation is used to enhance data from its source so that it is analysis-ready: relevant, business-defined, accurate and consistent. Changes to the data involve many different options, including filtering, attributing and structuring. Major transformation activities are defined below:
Cleansing – Data cleansing involves removing or altering source data that is not necessary, complete or accurate. Cleansing ensures that data is relevant, useful and understandable.
Aggregation – Data aggregation is the method for transforming data to conformed definitions across dimensions, such as aggregating data that is sourced at a day level to be aggregated at different levels, such as month, quarter, year-to-date, etc. Aggregation ensures that the data is defined consistently based on how the data is used by the business. Aggregation also improves query speed and performance.
Attributes – Data attributes are definitions of the attributes or standardized groupings of the attributes instantiated in the data to improve analysis relevance and consistency. This can involve creating standard definitions for ranges of values of data, such as defining business-relevant geographical regions, age-groups for demographic analysis, or instantiating other standardized segmentation and grouping. Other methods for data grouping , including discretization of continuous data: binning, trending, probability distribution fitting and association / regression.
Metrics / Features – Data Transformation can also involve creating metrics – standardized calculations (e.g., Key Performance Indicators) that are used for analysis. Metrics used for data science are often referred to as features, and these definitions are also defined and instantiated as part of the data transformation process.
Hierarchies – Data transformation can be the step to define and instantiate hierarchies, which enable logical movement from summary to detail in the data. For example, a geographical hierarchy might involve a sequence such as “where marketed”> “state”>”county” where there is a direct and complete relationship between the values between highest and lowest level of detail.
Normalization – Data normalization is the process of scaling the data to enable “apples-to-apples” comparisons of metrics and features across multiple data sets, either integrated or analyzed and compared separately. Normalization may also be used to eliminate noise or skews in the source data. Examples of normalization techniques are as follows:
- Min-max normalization: performing linear transformation on the original data.
- Z-score normalization: z-score normalization assigns a value to the value associated with its place / position relative to other values using a normal / gaussian distribution
- Decimal scaling: normalizing the value of the attribute by standardizing the decimal point placement.
Benefits of Well-Executed Data Transformation
The benefits of data transformation are to deliver data that is topically and semantically relevant, accurate and consistent. Key impact drivers are being able to easily integrate, analyze and apply data science to the data in order to generate insights with greater relevance, speed, clarity, alignment and confidence as follows:
- Relevance – Insights delivered from data are relevant to the business subject and user needs. Data are available, accurate, timely, understandable and comprehensive to address the business users’ needs.
- Speed – Insights created from data enable actions to be taken faster, because the insights are structured to address business questions more timely and effectively.
- Clarity – The insights are seen as being clear, compelling and accurate in such a way that conclusions reached from the data are consistent across the audience of users, and data are sufficient to address questions posed from multiple interpretations of the data and insights.
- Alignment – Because the data is relevant, timely, accurate and comprehensive and the insights are presented in a clear, consistent way that is easily understood, interpretation of the data is consistent, supporting improved alignment regarding decisions and actions to be taken.
- Confidence – The insights created from data are trusted, and performance of decisions and actions are measured using data such that the relationship between insights and effective, impactful decisions and actions is direct, positive and improving.
Common Roles and Responsibilities for Data Transformation
Business Intelligence and the resulting creation of actionable insights from data delivered to business users involves the following key roles:
- Data Engineers – Data engineers create and manage data pipelines that transport data from source to target, including creating and managing data transformations to ensure data arrives ready for analysis.
- Analytics Engineers – Analytics engineers support data scientists and other predictive and prescriptive analytics use cases, focusing on managing the entire data to model ops process, including data access, transformation, integration, DBMS management, BI and AI data ops and model ops.
- Data Modelers – Data Modelers are responsible for each type of data model: conceptual, logical and physical. Data Modelers may also be involved with defining specifications for data transformation and loading.
- Technical Architect – The technical architect is responsible for logical and physical technical infrastructure and tools. The technical architect works to ensure the data model and databases, including source and target data is physically able to be accessed, queried and analyzed by the various OLAP tools.
- Data Analyst / Business Analyst – Often a business analyst or more recently, data analyst are responsible for defining the uses and use cases of the data, as well as providing design input to data structure, particularly metrics, topical and semantic definitions, business questions / queries and outputs (reports and analyses) intended to be performed and improved. Responsibilities also include owning the roadmap for how data is going to be enhanced to address additional business questions and existing insights gaps.
- Insights Consumers – Insights consumers (e.g. business leaders and analysts) are responsible for using insights and analyses created by insights creators to improve business performance, including through improved awareness, plans, decisions and actions.
Key Business Processes Associated Data Transformation
The process for delivering data transformation are as follows:
- Profiling – Data are reviewed for relevance, completeness and accuracy by data creators and enablers. Profiling can and should occur for individual datasets and integrated data sets, both in raw form as was a ready-to-analyze structured form.
- Preparation / Transformation – Data are extracted, transformed, attributed and dimensionalized to be available in a ready-to-analyze form, often with standardized configurations and coded automation to enable faster data refresh and delivery. Data is typically made available in an easy to query form such as database, spreadsheet or Business Intelligence application.
- Integration – When multiple data sources are involved, integration involves combining multiple data sources into a single, structured, ready-to-analyze dataset. Integration involves creating a single data model and then extracting, transforming and loading the individual data sources to conform to the data model, making the data available for querying by data insights creators and consumers.
- Extraction / Aggregation – The integrated dataset is made available for querying, including, including aggregated to optimize query performance.
Common Technologies Categories Associated with Data Transformation
Technologies involved with data transformation are as follows:
- Data Engineering – Data engineering is the process and technology required to move data securely from source to target in a way that it is easily available and accessible.
- Data Transformation – Data transformation involves altering the data from its raw form to a structured form that is easy to analyze via queries. Transformation also involves enhancing the data to provide attributes and references that increase standardization and ease of integration with other data sources.
- Data Warehouse – Data warehouses store data that are used frequently and extensively by the business for reporting and analysis. Data warehouses are constructed to store the data in a way that is integrated, secure and easily accessible for standard and ad-hoc queries for many users.
- Data Lake – Data lakes are centralized data storage facilities that automate and standardize the process for acquiring data, storing it and making it available for profiling, preparation, data modeling, analysis and reporting / publishing. Data lakes are often created using cloud technology, which makes data storage very inexpensive, flexible and elastic.
- Trends / Outlook for Data Transformation
Key trends to watch in the Data Transformation arena are as follows:
- Semantic Layer – The semantic layer is a common, consistent representation of the data used for business intelligence used for reporting and analysis, as well as for analytics. The semantic layer is important, because it creates a common consistent way to define data in multidimensional form to ensure that queries made from and across multiple applications, including multiple business intelligence tools, can be done through one common definition, rather than having to create the data models and definitions within each tool, thus ensuring consistency and efficiency, including cost savings as well as the opportunity to improve query speed / performance.
- Automation – Increase emphasis is being placed by vendors on ease of use and automation to increase speed-to-insights. This includes offering “drag and drop” interfaces to execute data-related preparation activities and insights creation / queries without having to write code, including reusing activities and processes, both for repeating use as well as sharing.
- Self-service – As data grows, availability of qualified data technologists and analytics are very limited. To address this gap and increase productivity without having to lean 100% on IT resources to make data and analysis available, Self-service is increasingly available for data profiling, mining, preparation, reporting and analysis. In addition tools like the Semantic Layer offered by AtScale, Inc are also focused on enabling business users / data analysts to model data for business intelligence and analytics uses.
- Transferable – Increased effort is also underway to make data easier to consume, and this includes making data available for publishing easier, including using api’s and via objects that store elements of the insights.
- Observable – Recently, a host of new vendors are offering services referred to as “data observability”. Data observability is the practice of monitoring the data to understand how it is changing and being consumed. This trend, often called “dataops” closely mirrors the trend in software development called “devops” to track how applications are performing and being used to understand, anticipate and address performance gaps and improve areas proactively vs reactively.
AtScale and Data Transformation
AtScale is the leading provider of the Semantic Layer – to enable actionable insights and analytics to be delivered with increased speed, scale and cost effectiveness. Research confirms that companies that use a semantic layer improve their speed to insights by 4x – meaning that a typical project to launch a new data source with analysis and reporting capabilities taking 4 months can now be done in just one month using a semantic layer.
AtScale’s semantic layer is uniquely positioned to support data transformation – the ability to ensure that data are consistently defined and structured using common attributes, metrics and features in dimensional form, including automating the process of data inspection, cleansing, editing and refining it by adding additional attributes, hierarchies, metrics / features, and extracting / delivering ready-to-analyze data automatically made available as a ready-to-analyze source for any BI tool, whether it’s Tableau, Power BI or Excel. Moreover, this work only requires one resource who understands the data and how it is to be analyzed, eliminating the need for complexity and resource intensity. This approach to data transformation automation eliminates multiple data hand-offs, manual coding, the risk of duplicate extracts and suboptimal query performance.