Five Sure-Fire Ways to Completely Ruin Your Data
If you’ve been following our blog, you know that we love to share our best practices and industry expert advice with you. What should you avoid doing with your data? Five members of the AtScale team share their thoughts.
Name: Dave Mariani, AtScale Co-Founder and Chief Strategy Officer
A: I think one of the biggest mistakes data engineers make is what I call “premature aggregation”: summarizing data too early in the processing pipeline, which makes the fine-grained data difficult or impossible for business users and data scientists to access. Instead, I recommend that customers store all the data they capture “as is” as files in a data lake and summarize it only if absolutely necessary for performance reasons.
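As a rough illustration of the point above, the sketch below keeps raw events at full grain and derives summaries only when a question is asked. The event fields (`day`, `user`, `amount`) and the helper `total_by` are made up for this example and are not taken from any particular product.

```python
from collections import defaultdict

# Hypothetical raw events, stored "as is" (one row per captured event).
raw_events = [
    {"day": "2024-01-01", "user": "ann", "amount": 20.0},
    {"day": "2024-01-01", "user": "bob", "amount": 5.0},
    {"day": "2024-01-02", "user": "ann", "amount": 12.5},
]


def total_by(events, key):
    """Summarize on demand: sum `amount` grouped by any column."""
    totals = defaultdict(float)
    for event in events:
        totals[event[key]] += event["amount"]
    return dict(totals)


# Both questions can be answered because the fine-grained rows still exist.
daily_totals = total_by(raw_events, "day")
user_totals = total_by(raw_events, "user")
```

Had only the daily totals been stored, the per-user question could not be answered later, because the `user` column would already be gone.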
Name: Chris Oshiro, Field CTO
A: While there are advantages to denormalizing data for some multidimensional use cases, I believe one of the biggest mistakes is denormalizing by default and trying to do so at a super granular level. While this avoids JOINs, which can be costly, denormalization often creates very sparsely filled data, because it combines datasets that naturally don’t belong in a single table. This impacts all of your calculations; you have to be extra careful around things like NULL handling, for example. Also, different datasets are recorded at different levels of granularity, so denormalization often aggregates data up to the least granular dataset. This loses a ton of detail and adds even more complexity to calculations, because you can lose proper counts.
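A small sketch of how mixed granularity and NULLs complicate calculations in a denormalized table. The data and field names are hypothetical. Note that this example keeps the finer grain and copies the coarser values onto it, which is the mirror image of rolling everything up to the least granular dataset; the counting problem shows up either way.

```python
# Hypothetical data: sales recorded per order, targets recorded per store.
sales = [
    {"store": "A", "order_id": 1, "revenue": 100.0},
    {"store": "A", "order_id": 2, "revenue": 50.0},
    {"store": "B", "order_id": 3, "revenue": 70.0},
]
targets = {"A": 120.0}  # store B has no target on file

# Denormalize: copy the store-level target onto every order-level row.
wide = [{**row, "target": targets.get(row["store"])} for row in sales]

# Pitfall 1: the store-level value repeats per order, so a plain sum overcounts.
naive_target_total = sum(r["target"] for r in wide if r["target"] is not None)

# Pitfall 2: missing matches become None (NULL) and must be handled explicitly.
rows_missing_target = sum(1 for r in wide if r["target"] is None)

# Keeping the datasets separate gives the correct total directly.
true_target_total = sum(targets.values())
```

Here the naive total counts store A's target twice, once for each of its orders, and store B's row carries a NULL that every downstream calculation has to account for.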
Name: Stella Valcheva, BI Expert
A: The absence of a company-wide data strategy leads to poor data quality and a lot of overhead for consolidation and analysis. If data is not handled according to a corporate standard, a company ends up dealing with data silos, misinterpretation and, eventually, wrong insights. When departments or geo-locations understand the data and its role in the company differently, full data synchronisation becomes very costly to achieve, and the data is therefore unlikely to be relied on for important decision making. Moreover, this is not a problem that can be solved solely by introducing an integrated IT solution. It is the company culture and the responsibility of each employee, beginning with the C-level, that turn data into an asset rather than a liability.
Name: Petar Staykov, Senior BI Architect
A: I think a common mistake that companies make is to organize their data in project-related structures rather than domain-related ones. When data is organized by project, every project adds datasets according to its own goals without taking into account the datasets already delivered by other projects. Over time, the redundancy grows, and different projects apply different transformations to similar datasets. Company data should be business process-oriented, not project-specific. That way, every single project contributes to the common semantic layer and supports the company’s data-driven strategy.
Name: Gergana Ilieva, BI Expert
A: Data is a double-edged sword, razor sharp on both sides. As organizations engage with increasing volumes, the need for company-wide governance becomes even more acute. Often data is locked away and accessible only by certain teams. Different business units operate in different ways and can’t see the value of data held in other places. Data is often stored in disjointed places with no clear structure; bits are missing; and there is no collection protocol. Companies often lack central oversight of the process and end up with increasing expenses due to data redundancy, inaccurate or incomplete insights, ineffective operations, and data leaks.