Data extraction is the process of retrieving data from data sources for further data processing or storage.
Data extraction is arguably the most important part of the Extract/Transform/Load (ETL) process because it includes deciding which data is most valuable for achieving the business goal that drives the overall ETL effort.
Sometimes data is relatively easy to extract because it lives in a structured data store such as a relational database management system (RDBMS). In that case, Structured Query Language (SQL), a well-defined and standardized language, is a powerful way to do targeted extracts of exactly the data needed. SQL can also perform some of the translation/transformation, which makes it even more powerful.
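As a minimal sketch of a targeted SQL extract with a light transformation, the example below uses Python's built-in sqlite3 module against an in-memory database. The orders table, its columns, the sample rows, and the extract_recent_orders function are all hypothetical and exist only for illustration; a real RDBMS would need its own driver and schema.

```python
import sqlite3


def extract_recent_orders(conn, since):
    """Pull only the rows and columns needed, aggregating in SQL."""
    # UPPER and ROUND(SUM(...)) show SQL doing a small amount of
    # transformation during the extract itself.
    query = """
        SELECT customer_id,
               UPPER(country) AS country_code,
               ROUND(SUM(amount), 2) AS total_spent
        FROM orders
        WHERE order_date >= ?
        GROUP BY customer_id, country
        ORDER BY customer_id
    """
    return conn.execute(query, (since,)).fetchall()


# Hypothetical source data, held in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(customer_id INTEGER, country TEXT, amount REAL, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "us", 20.0, "2024-01-05"),
        (1, "us", 5.5, "2024-02-01"),
        (2, "de", 12.25, "2023-12-30"),
        (2, "de", 7.75, "2024-01-15"),
    ],
)

rows = extract_recent_orders(conn, "2024-01-01")
print(rows)
```

The WHERE clause is what makes the extract targeted: only orders on or after the cutoff date leave the source system, and the aggregation means downstream steps receive one row per customer rather than every order.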
In other cases, the data exists in a non-SQL data store or is even spread across many different digital, and potentially non-digital, formats. In these cases, more exotic tools or bespoke code are required. Unstructured data extraction generally makes projects longer, so the general rule of understanding the value of the data you are going to extract is even more important.
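To illustrate what "bespoke code" can look like, here is a small sketch that pulls invoice numbers and totals out of free text using Python's re module. The text format, the INVOICE_PATTERN expression, and the extract_invoices function are assumptions made up for this example; real unstructured sources usually need far more defensive parsing.

```python
import re

# Hypothetical pattern: an invoice number followed, somewhere later,
# by a dollar total. Real documents will vary much more than this.
INVOICE_PATTERN = re.compile(
    r"invoice\s+#(?P<invoice_id>\d+).*?total\s+of\s+\$(?P<total>\d+(?:\.\d{2})?)",
    re.IGNORECASE | re.DOTALL,
)


def extract_invoices(text):
    """Return one record per invoice mention found in free text."""
    records = []
    for match in INVOICE_PATTERN.finditer(text):
        records.append(
            {
                "invoice_id": int(match.group("invoice_id")),
                "total": float(match.group("total")),
            }
        )
    return records


sample = (
    "Hi team, Invoice #1042 was sent on Monday with a total of $310.50. "
    "Please also note invoice #1043, which has a total of $99."
)

records = extract_invoices(sample)
print(records)
```

Even this tiny case shows why unstructured extraction takes longer: the pattern silently depends on every invoice mention being followed by its own total, and any deviation in wording requires another rule.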
Streaming Data Pipeline vs. Batch Data Extraction
Another consideration in the extraction phase is the velocity of data. In some cases data is continuous, meaning new data elements arrive on a regular basis. This is sometimes referred to as a streaming pipeline of data and is more often applied to structured data. Streaming data use cases exist in all industries and are often employed for workloads in IoT, finance (fraud detection), security monitoring, healthcare, advertising, etc.
The more prevalent mode of extraction is batch. Batch extraction refers to a defined process running on a time interval. This discrete execution of the extraction process can approximate a streaming use case by running quite frequently. Most current data freshness requirements are measured in hours or minutes, not seconds or real time, so batch accounts for the overwhelming majority of implementations. Because of this popularity, more infrastructure and tools exist in the batch extraction space from established enterprise vendors such as Informatica, while many of the streaming tools, such as Kafka, originated in open source.
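One common way to run batch extraction on an interval is to remember a "watermark" (the last record already extracted) and pull only newer records on each run. The sketch below shows that idea with Python's sqlite3 module; the events table, the run_batch function, and the use of an auto-incrementing id as the watermark are assumptions for illustration, and in practice a scheduler such as cron would trigger each run.

```python
import sqlite3


def run_batch(conn, last_seen_id):
    """Extract only rows added since the previous batch run."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    # Advance the watermark only if this run actually found new rows.
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark


# Hypothetical source table, held in memory for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])

# First scheduled run picks up everything that exists so far.
first_batch, watermark = run_batch(conn, 0)

# New data arrives between runs.
conn.execute("INSERT INTO events (payload) VALUES ('c')")

# Later runs pick up only what is new, or nothing at all.
second_batch, watermark = run_batch(conn, watermark)
third_batch, watermark = run_batch(conn, watermark)

print(first_batch, second_batch, third_batch, watermark)
```

Shortening the interval between runs is how this batch pattern approximates streaming: the logic stays the same, and only the freshness of the extracted data changes.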
Why is Data Extraction Important for Business?
Decisions at this point in the pipeline can and will heavily influence the viability of downstream use of the data. It is therefore critical to always identify your business goal and to use a combination of human analysis and machine learning to identify the correlation between the source data you are extracting and the analysis or decisions you are hoping to optimize. One way to accomplish this is the Agile practice of running a spike solution (a simple program that explores potential solutions) to ensure the data you are investing in extracting is appropriate for the use case.