The 6 Principles of Modern Data Architecture
A version of this article originally appeared on the Cloudera VISION blog.
One of my favorite parts of my job at AtScale is that I get to spend time with customers and prospects, learning what’s important to them as they move to a modern data architecture. Lately, a consistent set of six themes has emerged during these discussions. The themes span industries, use cases and geographies, and I’ve come to think of them as the key principles underlying an enterprise data architecture.
Whether you’re responsible for data, systems, analysis, strategy or results, you can use the 6 principles of modern data architecture to help you navigate the fast-paced modern world of data and decisions. Think of them as the foundation for a data architecture that will allow your business to run at an optimized level today and into the future.
1. View data as a shared asset.
Enterprises that start with a vision of data as a shared asset ultimately outperform their competition, as CIO explains. Instead of allowing departmental data silos to persist, these enterprises ensure that all stakeholders have a complete view of the company. And by “complete,” I mean a 360-degree view of customer insights along with the ability to correlate valuable data signals from all business functions, including manufacturing and logistics. The result is improved corporate efficiency.
2. Provide the right interfaces for users to consume the data.
Putting data in one place isn’t enough to achieve the vision of a data-driven organization. For people (and systems) to benefit from a shared data asset, you need to provide the interfaces that make it easy for users to consume that data. This might be an OLAP interface for business intelligence, a SQL interface for data analysts, a real-time API for targeting systems, or the R language for data scientists. In the end, it’s about letting your people work in the tools they already know and that are right for the job they need to perform.
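To make this concrete, here is a minimal sketch of two interfaces over one shared table: a SQL aggregation for an analyst and a point-lookup function standing in for a real-time API. The table, columns, and function names are illustrative assumptions, not part of any particular product.

```python
import sqlite3

# Hypothetical shared data asset: one customer table (schema is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT, revenue REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "EMEA", 120.0), (2, "AMER", 340.0), (3, "EMEA", 75.0)])

# Interface 1: SQL for the data analyst.
def revenue_by_region():
    return dict(conn.execute(
        "SELECT region, SUM(revenue) FROM customers GROUP BY region"))

# Interface 2: a point-lookup "API" for a targeting system, backed by the
# same shared table rather than a departmental copy.
def get_customer(customer_id):
    row = conn.execute(
        "SELECT id, region, revenue FROM customers WHERE id = ?",
        (customer_id,)).fetchone()
    return {"id": row[0], "region": row[1], "revenue": row[2]} if row else None
```

Both consumers see the same numbers because both read the same asset; the interface, not the data, is what differs.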
3. Ensure security and access controls.
The emergence of unified data platforms like Snowflake, Google BigQuery, Amazon Redshift, and Hadoop has made it necessary to enforce data policies and access controls directly on the raw data, instead of in a web of downstream data stores and applications. Data security projects like Apache Sentry make this approach to unified data security a reality. Look for technologies that let you architect for security and deliver broad self-service access without compromising control.
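The idea of enforcing policy at the data layer, rather than in every downstream application, can be sketched as a single role-to-columns policy check that every read passes through. The roles, column names, and policy shape below are hypothetical, much simpler than what a system like Apache Sentry provides.

```python
# One central policy: roles map to the columns they may see.
# Roles and columns are illustrative assumptions.
POLICIES = {
    "analyst": {"region", "revenue"},                 # no personal identifiers
    "admin":   {"id", "email", "region", "revenue"},  # full access
}

def read_record(role, record):
    """Return only the columns the role's policy allows.

    Every consumer calls this one gate, so the rule lives in exactly
    one place instead of being re-implemented per application.
    """
    allowed = POLICIES.get(role, set())
    return {col: val for col, val in record.items() if col in allowed}

raw = {"id": 7, "email": "a@example.com", "region": "EMEA", "revenue": 120.0}
```

An analyst calling `read_record("analyst", raw)` gets region and revenue but never the email column; changing the policy changes it for every consumer at once.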
4. Establish a common vocabulary.
By investing in an enterprise data hub, enterprises can now create a shared data asset for multiple consumers across the business. However, it’s critical to ensure that users of this data analyze and understand it using a common vocabulary. Product catalogs, fiscal calendar dimensions, provider hierarchies and KPI definitions all need to be common, regardless of how users consume or analyze the data. Without this shared vocabulary, you’ll spend more time disputing or reconciling results than driving improved performance.
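One lightweight way to hold a common vocabulary is to keep each KPI definition in a single shared function that every consumer calls, rather than letting each team re-derive the metric. The metric, field names, and business rule below are illustrative assumptions.

```python
# Hypothetical order records; the "status" rule is the kind of detail
# that teams silently disagree on when definitions aren't shared.
ORDERS = [
    {"amount": 50.0, "status": "complete"},
    {"amount": 30.0, "status": "complete"},
    {"amount": 99.0, "status": "cancelled"},  # excluded by the shared definition
]

def average_order_value(orders):
    """The one shared definition: completed orders only."""
    completed = [o["amount"] for o in orders if o["status"] == "complete"]
    return sum(completed) / len(completed)
```

If marketing includes cancelled orders and finance excludes them, the two teams report different numbers from the same data; a single shared definition removes the dispute before it starts.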
5. Curate the data.
Time and time again, I’ve seen enterprises that have invested in Hadoop or a cloud-based data lake like Amazon S3 or Google Cloud Storage start to suffer when they allow self-service access to the raw data stored in these environments. Without proper data curation (which includes modeling important relationships, cleansing raw data and curating key dimensions and measures), end users can have a frustrating experience—which will vastly reduce the perceived and realized value of the underlying data. By investing in core functions that perform data curation, you have a better chance of realizing the value of the shared data asset.
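As a small illustration of what curation means in practice, here is a sketch that deduplicates raw records, normalizes inconsistent date formats, and derives a clean year dimension. The record shape, field names, and the two date formats are assumptions for the example, not a general-purpose pipeline.

```python
from datetime import datetime

# Raw lake records with the usual problems: duplicate ingestion and
# inconsistent date formats (field names are hypothetical).
raw_events = [
    {"id": "a1", "sold_on": "2023-04-01"},
    {"id": "a1", "sold_on": "2023-04-01"},   # duplicate ingestion
    {"id": "b2", "sold_on": "01/15/2023"},   # inconsistent date format
]

def curate(events):
    """Dedupe by id, normalize dates, and derive a 'year' dimension."""
    seen, curated = set(), []
    for e in events:
        if e["id"] in seen:
            continue
        seen.add(e["id"])
        # Try the known formats; real pipelines would handle failures too.
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                dt = datetime.strptime(e["sold_on"], fmt)
                break
            except ValueError:
                continue
        curated.append({"id": e["id"],
                        "sold_on": dt.date().isoformat(),
                        "year": dt.year})
    return curated

clean = curate(raw_events)
```

End users querying `clean` see one row per sale and one date format, instead of each analyst rediscovering and re-fixing the same problems in the raw data.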
6. Eliminate data copies and movement.
Every time data is moved, there is an impact on cost, accuracy and time. Talk to any IT group, or any business user for that matter, and they all agree: the fewer times data has to be moved, the better. Part of the promise of cloud data platforms and distributed file systems like Hadoop is a multi-structure, multi-workload environment for parallel processing of massive data sets, and these platforms scale linearly as workloads and data volumes grow. By eliminating extra data movement, a modern enterprise data architecture reduces cost and effort, improves accuracy and “data freshness,” and increases overall enterprise data agility.
Regardless of your industry, the role you play in your organization or where you are in your big data journey, I encourage you to adopt and share these principles as a means of establishing a sound foundation for building a modern big data architecture. While the path can seem long and challenging, with the right framework and principles, you can successfully make this transformation sooner than you think.
Tell us about your core principles of modern data architecture. What do you insist on, day in and day out, to manage big data for your organization? We’d love to hear your insights.
About the Author: As head of product management, Josh drives AtScale’s product roadmap and strategy. He started his career in data and analytics as the product manager for the first “Datamart in a Box” at Broadbase, and he ran product management at Yahoo! for one of the largest data and analytics operations in the world. Josh joined AtScale from Pivotal, where he was responsible for data products such as Greenplum, Pivotal HD and HAWQ.