Data streaming is a strategy used when a source generates data continuously and near-real-time updates are required, giving analysts a more recent view of the (usually aggregated) data on which they base decisions. In the most extreme cases, the consumer of the data is no longer a human, for whom milliseconds are moot, but another process that uses the streamed data to control its flow.
Streaming isn’t new, and it isn’t particularly exotic. Only a very small number of use cases require millisecond-level streaming support, whereas the bulk (70%) of data freshness requirements can tolerate hours of delay.
Data Streaming vs Batch Data Movement
Data exists everywhere, and once it is captured from its native source, the challenge is getting it into an appropriate technology for analysis. Moving that data falls into two high-level categories: batch and streaming.
Batch data movement lets data accumulate for a given time interval before moving it in bulk to the new system. When analysis needs to be closer to real time, streaming systems route arriving data, potentially to several data stores at once, as it is ingested.
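The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the `BatchMover` and `StreamRouter` classes, the in-memory lists standing in for data stores, and the click events are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, object]
Sink = Callable[[Event], None]


class BatchMover:
    """Accumulates events and moves them in bulk when flush() is called."""

    def __init__(self, destination: List[Event]) -> None:
        self.buffer: List[Event] = []
        self.destination = destination

    def receive(self, event: Event) -> None:
        self.buffer.append(event)  # nothing reaches the destination yet

    def flush(self) -> int:
        """Move everything accumulated so far (in practice, run on a timer)."""
        moved = len(self.buffer)
        self.destination.extend(self.buffer)
        self.buffer = []
        return moved


class StreamRouter:
    """Forwards each event to every registered sink as soon as it arrives."""

    def __init__(self, sinks: List[Sink]) -> None:
        self.sinks = sinks

    def receive(self, event: Event) -> None:
        for sink in self.sinks:
            sink(event)


# Hypothetical destinations: plain lists and a dict stand in for data stores.
warehouse: List[Event] = []
raw_store: List[Event] = []
clicks_per_page: Dict[str, int] = defaultdict(int)


def count_click(event: Event) -> None:
    clicks_per_page[str(event["page"])] += 1


batch = BatchMover(warehouse)
stream = StreamRouter([raw_store.append, count_click])

for page in ["home", "pricing", "home"]:
    event = {"page": page}
    batch.receive(event)
    stream.receive(event)

# The streaming sinks are already current; the batch destination is still empty.
rows_before_flush = len(warehouse)
moved = batch.flush()
```

Until `flush()` runs, the warehouse sees none of the three events, while both streaming sinks have already processed them. That gap is the data freshness trade-off between the two approaches.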
What are the benefits of data streaming?
Aggregating and analyzing data in real-time allows for more agility and potentially higher quality decisions based on the most recently available data. For instance, if you are spending money on a pay-per-click program, every minute that goes by where you aren’t optimizing for the right search phrases costs you real money. The old axiom “time is money” defines the reason for using a streaming solution. If delays cause you to make less optimal decisions that cost or lose money, streaming could be the answer.
When is data streaming used?
Data streaming is used when real-time analysis is required, in scenarios such as:
- Retail – Customer Experience, beacons based on location, and dynamic 1:1 engagement tactics require the current state of the customer experience to maximize value.
- Manufacturing – Machinery monitoring where downtime means money is being lost.
- Connected Car – These new vehicle capabilities require immediate interaction with their environment.
- Cyber Security – Data breaches must be reported and corrected almost immediately to limit the damage done by criminals.
- Weather – Weather safety requires immediate streaming data notification for highly dynamic systems such as tornadoes.
- Healthcare – Monitoring patients in real time during operations or post operation can mean the difference between life and death.
Data Streaming Technologies
The Apache open source projects Spark, Kafka, Flume, and Flink are among the most popular streaming data frameworks, and commercial entities such as Confluent exist to support and augment them. Talend, Informatica, and Oracle are leaders in the commercial/enterprise space.
Successful implementation of data warehouse streaming requires a sophisticated streaming architecture and Big Data solution. Typically these architectures should be able to process and execute more than 100,000 transactions per second to support Big Data analytics and data lake initiatives.
Some Challenges with Data Streaming
Downstream/derived tables need to be kept up to date. Streaming information potentially invalidates derived or pre-aggregated tables, and data arriving at different intervals can create consistency issues at query time. This may not affect the validity of the answers you are getting, but you need to understand it in the context of your use case.
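To make the staleness problem concrete, here is a small Python sketch, assuming a hypothetical sales table and a pre-aggregated revenue-by-region table held in memory. It shows a snapshot going stale when a new row streams in, and one common remedy: applying each arriving row to the aggregate as a delta.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sale = Tuple[str, float]  # (region, amount); a hypothetical schema

base_table: List[Sale] = []
revenue_by_region: Dict[str, float] = defaultdict(float)  # derived table


def recompute() -> Dict[str, float]:
    """Rebuild the aggregate from scratch by scanning the base table."""
    totals: Dict[str, float] = defaultdict(float)
    for region, amount in base_table:
        totals[region] += amount
    return dict(totals)


def ingest(sale: Sale) -> None:
    """Append to the base table and apply the same delta to the aggregate."""
    region, amount = sale
    base_table.append(sale)
    revenue_by_region[region] += amount


ingest(("east", 10.0))
ingest(("west", 5.0))

# A pre-aggregated snapshot taken now is correct only until the next arrival.
snapshot = recompute()
ingest(("east", 2.5))

snapshot_is_stale = snapshot != recompute()
delta_is_current = dict(revenue_by_region) == recompute()
```

The snapshot no longer matches the base table after the third sale arrives, whereas the delta-maintained aggregate does. Real systems face the same choice at much larger scale, where a full recompute on every arrival is rarely affordable.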
Updates to data can be difficult. Streaming is generally an append-only scenario; in some cases, however, late-arriving data or changes to existing data must be taken into account.
Streaming technology can put new stress on both source and destination systems, adversely affecting performance. Because streaming requirements tend to be bespoke, custom development is often needed both to satisfy functional requirements and to scale efficiently across a large number of data sources.
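One way to handle both cases is to key each record by an identifier and by its event time, so that a late arrival lands in the period it belongs to and a correction replaces the earlier version instead of being counted twice. The sketch below assumes hypothetical order records with an `order_id`, an event hour, and an amount; it is an illustration, not the approach of any particular streaming framework.

```python
from typing import Dict, Tuple

# order_id -> (event_hour, amount); the latest version of each order wins
orders: Dict[str, Tuple[int, float]] = {}
revenue_by_hour: Dict[int, float] = {}


def upsert(order_id: str, event_hour: int, amount: float) -> None:
    """Insert a new order, or replace an existing one and fix the aggregate."""
    previous = orders.get(order_id)
    if previous is not None:
        # Back out the old contribution before applying the new one.
        prev_hour, prev_amount = previous
        revenue_by_hour[prev_hour] -= prev_amount
    orders[order_id] = (event_hour, amount)
    revenue_by_hour[event_hour] = revenue_by_hour.get(event_hour, 0.0) + amount


upsert("a", 9, 20.0)
upsert("b", 10, 5.0)
upsert("c", 9, 7.0)   # late arrival: occurred in hour 9, received after hour 10 data
upsert("a", 9, 25.0)  # correction to an order that was already counted
```

After these four calls, hour 9 holds the late order plus the corrected amount for order `a`, and hour 10 is unchanged. The cost is that the system must keep enough state (here, the `orders` dictionary) to undo earlier contributions, which is part of what makes updates harder than pure appends.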
What’s the alternative?
For small amounts of data, re-aggregating on each query is an option. In the big data era, however, that is not feasible at large scale.
Incremental update is an alternative that is much more straightforward to implement and can give 80-90% of the benefit of streaming at a much lower cost. AtScale enables incremental update, a configurable batch operation that runs on an interval and approximates streaming without the complexity and bespoke code required to bring refresh times down from minutes to seconds.
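The general idea behind an incremental update can be sketched as a watermark-based refresh: each run processes only the rows that arrived since the previous run. The following Python is a generic illustration under assumed names (`IncrementalAggregate`, a monotonically increasing sequence number on each row) and does not represent AtScale's actual implementation or configuration.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str, float]  # (sequence number, region, amount); hypothetical


class IncrementalAggregate:
    """Keeps an aggregate current by processing only rows past a watermark."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = {}
        self.watermark = 0  # highest sequence number already aggregated

    def refresh(self, source: List[Row]) -> int:
        """Fold in rows newer than the watermark; return how many were new."""
        new_rows = [row for row in source if row[0] > self.watermark]
        for seq, region, amount in new_rows:
            self.totals[region] = self.totals.get(region, 0.0) + amount
            self.watermark = max(self.watermark, seq)
        return len(new_rows)


source: List[Row] = [(1, "east", 10.0), (2, "west", 5.0)]
aggregate = IncrementalAggregate()

first_run = aggregate.refresh(source)   # picks up both existing rows
source.append((3, "east", 2.5))         # new data lands between refreshes
second_run = aggregate.refresh(source)  # picks up only the new row
third_run = aggregate.refresh(source)   # nothing new, so nothing to do
```

In practice a scheduler would call `refresh` on a configurable interval. Shortening the interval moves the result closer to streaming freshness, while each run stays a simple batch operation over a small slice of data.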