March 4, 2019Hortonworks chooses AtScale as its standard for BI-on-Hadoop
Part One – Modeling Many-to-Many Relationships Using Bridge Tables
I have been fortunate throughout my career to work with a range of companies, from some of the largest enterprises in the world to smaller companies that are disrupting industries with innovation. Typically, my work involves discussing data strategies, security, workload optimization, and testing with these organizations. While many of my conversations increasingly cover exciting new possibilities such as the democratization of operational analytics with Artificial Intelligence and Machine Learning, there are core business measurements that remain critical and – news flash! – they are not going away. Analysts use these core KPIs every day to turn data into actionable information for decision making. These calculations and the business logic to compute them still need a place in the modern data world. And AtScale makes this possible.
AtScale modeling uses typical star or snowflake concepts. In the AtScale Design Center, modelers usually work with fact tables, dimension tables, degenerate dimensions, measures, calculated measures and other familiar concepts. However, many AtScale customers have more advanced business rules that require extending this type of dimensional modeling to include more complex concepts such as many-to-many bridge tables. This post walks through how to effectively model these complex business requirements for impactful and efficient BI consumption.
One business rule I see quite often, especially in retail and sales geographic analytics, is that a Zip Code can belong to multiple Sales Territories, It is important to ensure that zip codes aggregate correctly up the geography hierarchy according to the appropriate business rules. For this example, there are a few notes for the calculation:
- A Zip Code (“Zip”) can belong to multiple Territories
- Sales Units per Zip are allocated in full to each of the Territories containing that Zip.
- Territory is resolved as a many-to-many relationship via the Territory – Zip Bridge Table, through to the Zip key to Sales Fact
Representing that via a logical model looks similar to the diagram below. Note that although there may be a “natural” hierarchy from Area -> Territory -> Zip, we have broken out the Area (not shown), Territory, and Zip dimensions into different dimensions and hierarchies.
Sample data for the Territory–Zip Bridge table is illustrated as below. Notice the complexities of the many to many relationship, such as ZipID 2 belonging to both TerritoryID 1 and 2:
A completed model in the AtScale Design Center supporting this rule looks like the diagram below. The model includes both the Zip and the Territory Zip Bridge dimensions. When creating analytics that require the Zip dimension only, we can get the facts via the relationship to the salesfact dataset.
Now, let’s take a look at the Territory Zip Bridge dimension:
The Area dimension follows the typical one-to-many relationship between Area and Territory Zip Bridge. The multiple arrow-heads illustrates that the Territory dimension is related to the Territory Zip Bridge dataset via a many-to-many relationship:
When we create the Territory Zip Bridge dimension, we create a single level hierarchy, but hide the Territory Zip Bridge level. As a result, the functionality of the many-to-many business rule is concealed from the business user, making the analytics of this complex relationship simpler; business users will only see the attributes needed for reporting. With the use of this dimension and relationship configuration, analyses that require the Area dimension will go through the one-to-many relationship of the Area dimension to Territory Zip Bridge dataset, while if Territory is required, the AtScale engine knows to go through and resolve the many-to-many relationship of Territory to the Territory Zip Bridge dataset.
Cube in Action:
Let’s take a look at some pivot tables in Excel that use the model described above. For the first use case, we select the Time and Product dimensions as filters, and the Zip dimension as rows. Sales Units is selected as the measure. Note that the sum of all the Zip Codes is the total of all of the zip codes that had sales for Product A in January 2017 – 550 sales units.
Now if we replace Zip with Territory, we aggregate each of the Zip Codes to their respective Territory (the sample bridge table is shown to see how it aggregates):
Territory 1 is the sum of Zip 1 and 2, or 223 units. Territory2 is the sum of Zip 2 and 3, or 175 units. The interesting calculation is the Grand Total – it still shows the correct sum as 550 units instead of just totaling up the rows to get 770 units. This illustrates the power of using MDX with AtScale, where the AtScale engine can break down the individual components of an MDX query into the correct, multi-query plans that run on the configured SQL engine.
Lastly, let’s add in Area as a higher-level aggregate of Territory. In this particular case, the North area is the sum of the unique values of Zip codes that belong to Territory 1 and Territory 2, or Zip 1, Zip 2 and Zip 3 – 140+83+92 = 315 sales units. Even more interesting is the South area is the sum of Zip 4 and Zip 5 where Territory 4 is already rolled into Territory 3 – 98+137 = 235 sales units.
AtScale allows for complex business rules to be modeled as part of its Design Center interface. In this case, a designer’s first instinct might be to model a Geography dimension such as Area -> Territory -> Zip. Modeling the hierarchy this way forces the aggregation rules up the hierarchy, which in this case would violate the business rules and lead to problems when analysts try to answer specific business questions involving geographic sales performance. Instead, we modeled Area, Territory, and Zip as separate dimensions that generated the correct results as we traversed the “hierarchy” of the geography, yet still allowed the end user to represent collapsible nodes in a pivot table. The key to this result is being able to leverage a many-to-many bridge table to allow AtScale to solve the complex relationships between the dimensions.
This is one example of using the power of AtScale’s multidimensional query engine and Design Center to model a more advanced dimensional model. In my next post, we will explore more complex rules, such as solving for multi-level metrics.