Data Operations is the practice (e.g., frameworks, methods, capabilities, resources, processes and architecture) for delivering data to create insights and analytics with greater speed, scale, consistency, reliability, governance, security and cost effectiveness using modern cloud-based data platforms and tools applying agile principles. Data Operations shares core concepts with Software Development Operations (focusing on software product development, deployment and delivery), including the need for effective, scalable, reliable, governed, secure delivery of data from source to target.
Data Operations recognizes the need for a system to manage ever-increasing amounts of data from many different sources, delivered to many constituents with diverse needs. It represents an evolution from the typical centralized, monolithic enterprise data warehouse model to a distributed model that incorporates a data mesh for federated insights creation, supported by centralized, automated, self-service oriented services using a cloud-based data lakehouse (i.e., a centralized repository of ready-to-analyze, discoverable, governed sources, both fully modeled and integrated as well as individual).
The purpose of data operations is to deliver data with greater speed, scale, reliability, consistency, governance and cost effectiveness using modern cloud-based data platforms and agile principles. Fundamentally, Data Operations recognizes that data volume, velocity and variety are increasing, as are the needs of an ever-growing audience of diverse users: data analysts, data scientists and decision makers. This creates the need to move from a centralized, monolithic enterprise data store or data warehouse to a distributed model in which users create insights and analytics themselves, using centralized, self-service oriented infrastructure and tools that are secure and governed.
Key characteristics of effective data operations are as follows:
- Agile – Data Ops applies Agile principles in order to continually evolve to address emerging challenges and needs with speed, flexibility and ever-increasing maturity.
- Cloud-based – Data Ops is particularly useful as companies move to the cloud, including hybrids that may include multiple public and private clouds.
- Distributed – Data Ops is particularly effective when it is configured to support a distributed model of delivery, where centralized infrastructure for data access, data preparation, insights and analytics creation tools (particularly self-service), governance and security support decentralized insights and analytics creation. Terms like Data Fabric can be used to describe the centralized aspects of the infrastructure whereas Data Mesh refers to the distributed nature of insights and analytics creation: business users own responsibility for the data they use and the insights and analytics they create.
- Secure – Data Ops works to ensure that all aspects of data and the resulting creation and distribution of insights and analytics are secure, following principles, policies and processes that ensure secure data storage and transmission that is efficient and effective.
- Governed – Data Ops is implemented with effective data governance: policies and procedures for the creation and usage of data, insights and analytics, covering sourced, derived and published data and addressing availability, quality, discoverability, observability, access, sharing, changing and publishing. More recent technologies such as data catalogs, feature / metric stores and the semantic layer help to improve data governance across the entire data-to-insights lifecycle, including the source, derived and published levels.
- Standardized / Automated – Processes are standardized and automated, supporting ease of development, use and operations, which reduces resource dependencies and eliminates hand-offs between multiple functions.
- Self-Service – Data analysts, data scientists, business analysts and end users can work with data themselves using these standardized, automated processes, which reduces resource dependencies and eliminates hand-offs between multiple functions.
- Discoverable / Shareable / Reusable – Source data, ready-to-analyze data, insights-ready data, metrics, features and semantic layer models are available, discoverable, accessible (governed), shareable and reusable to ensure effective utilization and improvement without duplication.
Primary Uses of Data Operations
A 2022 survey by Nexla asked data professionals about the need for data operations. Key findings are highlighted below:
- 85% of respondents say their companies have teams working on ML or AI. This is up from 70% in 2021.
- 73% of respondents say their company has plans to hire in DataOps in the next year.
- Data professionals are only spending 14% of their time on analysis. The rest of the time is spent on required but low value-add tasks like data integration, data cleanup, and troubleshooting.
- Data engineers spend 18% of their time on troubleshooting. That works out to 9.3 weeks a year!
- Data pros are longing for automation in their jobs. Asked which tasks in their current role would benefit from automation, respondents answered as follows:
  - The majority, 56%, unsurprisingly said that data cleanup would benefit from automation.
  - Analysis was the second-most cited task, at 47%.
  - Data integration was close behind at 46%, followed by building data pipelines at 41%.
Major data operations activities are defined below:
Cleansing – Data cleansing involves removing or altering source data that is unnecessary, incomplete or inaccurate. Cleansing ensures that data is relevant, useful and understandable.
Transformation – Data transformation converts data to conformed definitions across dimensions. A common example is aggregation: data sourced at a day level is rolled up to other levels, such as month, quarter or year-to-date. Aggregation ensures that the data is defined consistently with how the business uses it, and it also improves query speed and performance. Other transformation methods include adding attributes, grouping and normalization.
Available / Discoverable – Data operations make data, including source, derived, integrated and published versions, discoverable and reusable through tools such as a data catalog, a metric / feature store and a semantic layer.
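As an illustration of the cleansing and transformation activities described above, the following minimal Python sketch drops unusable rows and then rolls day-level records up to month level. The sample records, the field names (`day`, `region`, `amount`) and the cleansing rules (exact duplicates, missing amounts and negative amounts are discarded) are assumptions made for this example, not rules prescribed by any particular tool.

```python
from collections import defaultdict
from datetime import date

# Hypothetical day-level sales records (all values are made up for illustration).
raw_rows = [
    {"day": date(2023, 1, 5), "region": "east", "amount": 100.0},
    {"day": date(2023, 1, 5), "region": "east", "amount": 100.0},   # exact duplicate (unnecessary)
    {"day": date(2023, 1, 20), "region": "east", "amount": None},   # missing amount (incomplete)
    {"day": date(2023, 1, 20), "region": "west", "amount": -40.0},  # negative sale (assumed inaccurate)
    {"day": date(2023, 2, 3), "region": "east", "amount": 250.0},
    {"day": date(2023, 2, 14), "region": "west", "amount": 75.5},
]


def cleanse(rows):
    """Drop rows that are duplicated, incomplete or inaccurate under the assumed rules."""
    seen = set()
    clean = []
    for row in rows:
        key = (row["day"], row["region"], row["amount"])
        if key in seen:
            continue  # unnecessary: an identical row was already seen
        seen.add(key)
        if row["amount"] is None or row["amount"] < 0:
            continue  # incomplete or inaccurate
        clean.append(row)
    return clean


def aggregate_by_month(rows):
    """Roll day-level amounts up to (year, month, region) totals."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["day"].year, row["day"].month, row["region"])] += row["amount"]
    return dict(totals)


clean_rows = cleanse(raw_rows)
monthly_totals = aggregate_by_month(clean_rows)
for key in sorted(monthly_totals):
    print(key, monthly_totals[key])
```

Cleansing runs before aggregation so that duplicate or invalid rows do not distort the monthly totals. In practice the same two steps are often expressed in SQL or in a data preparation tool rather than in hand-written code.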
Benefits of Well-Executed Data Operations
Well-executed data operations deliver data that is available for analysis and analytics use with speed, scale and cost effectiveness. Key benefits are listed below:
- Speed – Insights created from data enable actions to be taken faster, because the insights are structured to address business questions in a more timely and effective way.
- Scale – Data operations support an ever-increasing number of data sources, users and uses that are also increasingly diverse across functions, geographies and organizations.
- Cost Effectiveness – More data brings more cost for storage, compute and the resources to manage it. Data Ops is configured to minimize costs by providing consistent, effective processes that are automated, low code / no code, self-service oriented and reusable, with minimal resources and hand-offs.
Common Roles and Responsibilities for Data Operations
Business intelligence, and the resulting creation of actionable insights from data delivered to business users, involves the following key roles:
- Data Engineers – Data engineers create and manage data pipelines that transport data from source to target, including creating and managing data transformations to ensure data arrives ready for analysis.
- Analytics Engineers – Analytics engineers support data scientists and other predictive and prescriptive analytics use cases, focusing on managing the entire data to model ops process, including data access, transformation, integration, DBMS management, BI and AI data ops and model ops.
- Data Modelers – Data Modelers are responsible for each type of data model: conceptual, logical and physical. Data Modelers may also be involved with defining specifications for data transformation and loading.
- Technical Architect – The technical architect is responsible for the logical and physical technical infrastructure and tools. The technical architect works to ensure that the data model and databases, including source and target data, can be physically accessed, queried and analyzed by the various OLAP tools.
- Data Analyst / Business Analyst – A business analyst or, more recently, a data analyst is often responsible for defining the uses and use cases of the data, as well as providing design input on data structure, particularly the metrics, topical and semantic definitions, business questions / queries and outputs (reports and analyses) to be performed and improved. Responsibilities also include owning the roadmap for how the data will be enhanced to address additional business questions and existing insights gaps.
Key Business Processes Associated with Data Operations
The processes for delivering data operations include the following:
- Access – Data, often in structured, ready-to-analyze form, is made securely available to approved users, including insights creators and enablers.
- Profiling – Data are reviewed for relevance, completeness and accuracy by data creators and enablers. Profiling can and should occur for individual datasets and integrated datasets, both in raw form and in ready-to-analyze structured form.
- Preparation / Transformation – Data are extracted, transformed, attributed and dimensionalized to be available in a ready-to-analyze form, often with standardized configurations and coded automation to enable faster data refresh and delivery. Data is typically made available in an easy-to-query form such as a database, spreadsheet or Business Intelligence application.
- Integration – When multiple data sources are involved, integration involves combining multiple data sources into a single, structured, ready-to-analyze dataset. Integration involves creating a single data model and then extracting, transforming and loading the individual data sources to conform to the data model, making the data available for querying by data insights creators and consumers.
- Extraction / Aggregation – The integrated dataset is made available for querying, including in aggregated form to optimize query performance.
- Publish – Results of queries are made available for consumption via multiple forms, including as datasets, spreadsheets, reports, visualizations, dashboards and presentations.
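The profiling and integration steps in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the two source extracts, their field names, the target model (`customer_id`, `country`, `revenue`) and the helper names `profile` and `conform` are all hypothetical, and completeness is the only profiling measure shown.

```python
# Hypothetical source extracts that name the same concepts differently.
crm_rows = [
    {"customer_id": "C1", "country": "US", "revenue_usd": 1200.0},
    {"customer_id": "C2", "country": None, "revenue_usd": 300.0},
]
shop_rows = [
    {"cust": "C3", "ctry": "DE", "sales": 450.0},
    {"cust": "C4", "ctry": "FR", "sales": None},
]

# Assumed single target data model: every integrated row carries these fields.
TARGET_FIELDS = ("customer_id", "country", "revenue")

# Per-source mapping from target field name to source field name.
FIELD_MAPS = {
    "crm": {"customer_id": "customer_id", "country": "country", "revenue": "revenue_usd"},
    "shop": {"customer_id": "cust", "country": "ctry", "revenue": "sales"},
}


def profile(rows, fields):
    """Return the share of non-missing values per field (a simple completeness profile)."""
    if not rows:
        return {field: 0.0 for field in fields}
    return {
        field: sum(row.get(field) is not None for row in rows) / len(rows)
        for field in fields
    }


def conform(rows, source):
    """Rename source fields to the target model and tag each row with its source."""
    mapping = FIELD_MAPS[source]
    return [
        {**{target: row.get(src) for target, src in mapping.items()}, "source": source}
        for row in rows
    ]


# Integration: conform each source to the target model, then combine.
integrated = conform(crm_rows, "crm") + conform(shop_rows, "shop")

# Profiling the integrated dataset shows which fields still have gaps.
completeness = profile(integrated, TARGET_FIELDS)
print(completeness)
```

Running the same `profile` function on each raw extract as well as on the integrated dataset mirrors the point made above that profiling applies to both individual and integrated data.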
Common Technology Categories Associated with Data Operations
Technologies involved with data operations are as follows:
- Data Engineering – Data engineering is the process and technology required to move data securely from source to target in a way that it is easily available and accessible.
- Data Transformation – Data transformation involves altering the data from its raw form to a structured form that is easy to analyze via queries. Transformation also involves enhancing the data to provide attributes and references that increase standardization and ease of integration with other data sources.
- Data Preparation – Data preparation involves enhancing and aggregating data to make it ready for analysis, including to address a specific set of business questions.
- Data Modeling – Data modeling involves creating structure and consistency as well as standardization of the data via adding dimensionality, attributes, metrics and aggregation. Data models are both logical (reference) and physical. Data models ensure that data is structured in such a way that it can be stored and queried with transparency and effectiveness.
- Database – Databases store data for easy access, profiling, structuring and querying. Databases come in many forms to store many types of data.
- Data Warehouse – Data warehouses store data that are used frequently and extensively by the business for reporting and analysis. Data warehouses are constructed to store the data in a way that is integrated, secure and easily accessible for standard and ad-hoc queries for many users.
- Data Lake – Data lakes are centralized data storage facilities that automate and standardize the process for acquiring data, storing it and making it available for profiling, preparation, data modeling, analysis and reporting / publishing. Data lakes are often created using cloud technology, which makes data storage very inexpensive, flexible and elastic.
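To make the data modeling and database categories above more concrete, here is a minimal sketch of a dimensional model built with Python's standard `sqlite3` module. The table names (`dim_product`, `fact_sales`), the columns and the sample rows are assumptions chosen for illustration; a production warehouse or lake would use its own engine and a much richer model.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Physical model: one dimension table (attributes) and one fact table (metrics).
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL,
        category     TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_date  TEXT NOT NULL,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        amount     REAL NOT NULL
    );
    """
)

# Hypothetical sample data.
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Laptop", "Hardware"), (2, "Mouse", "Hardware"), (3, "Antivirus", "Software")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [
        ("2023-01-05", 1, 900.0),
        ("2023-01-09", 2, 25.0),
        ("2023-02-11", 3, 60.0),
        ("2023-02-12", 1, 950.0),
    ],
)

# The dimension attribute (category) gives the metric (amount) its business meaning.
category_totals = conn.execute(
    """
    SELECT d.category, SUM(f.amount) AS total_amount
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
print(category_totals)
conn.close()
```

Keeping attributes in the dimension table and measures in the fact table is one common way to add the dimensionality and consistency that the data modeling item describes.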
Trends / Outlook for Data Operations
Key trends to watch in the Data Operations arena are as follows:
- Semantic Layer – The semantic layer is a common, consistent representation of the data used for business intelligence (reporting and analysis) as well as for analytics. It is important because it defines data in multidimensional form once, so that queries made from and across multiple applications, including multiple business intelligence tools, go through one common definition rather than requiring data models and definitions to be created within each tool. This ensures consistency and efficiency, including cost savings and the opportunity to improve query speed / performance.
- Automation – Increased emphasis is being placed by vendors on ease of use and automation to increase speed-to-insights. This includes offering “drag and drop” interfaces to execute data preparation activities and insights creation / queries without having to write code, including reusing activities and processes, both for repeated use and for sharing.
- Self-service – As data grows, the availability of qualified data technologists and analysts is very limited. To address this gap and increase productivity without having to lean 100% on IT resources to make data and analysis available, self-service is increasingly available for data profiling, mining, preparation, reporting and analysis. In addition, tools like the Semantic Layer offered by AtScale, Inc are also focused on enabling business users / data analysts to model data for business intelligence and analytics uses.
- Transferable – Increased effort is also underway to make data easier to consume and publish, including through APIs and through objects that store elements of the insights.
- Observable – Recently, a host of new vendors have begun offering services referred to as “data observability”. Data observability is the practice of monitoring the data to understand how it is changing and being consumed. This trend, often called “dataops”, closely mirrors the trend in software development called “devops”, which tracks how applications are performing and being used in order to understand, anticipate and address performance gaps proactively rather than reactively.
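The semantic layer item in the list above describes defining metrics and dimensions once and reusing them across tools. The sketch below illustrates that general idea only; it is not AtScale's product or API. The `SEMANTIC_MODEL` dictionary, the `build_query` helper, the table and the sample rows are all hypothetical, and the SQL uses SQLite syntax so the example can run with Python's standard library.

```python
import sqlite3

# Hypothetical, tool-agnostic definitions kept in one shared place.
SEMANTIC_MODEL = {
    "table": "fact_sales",
    "metrics": {
        "total_revenue": "SUM(amount)",
        "order_count": "COUNT(*)",
    },
    "dimensions": {
        "month": "strftime('%Y-%m', sale_date)",
        "region": "region",
    },
}


def build_query(metric, dimensions, model=SEMANTIC_MODEL):
    """Compile a metric request into SQL using the shared definitions."""
    if metric not in model["metrics"]:
        raise KeyError(f"unknown metric: {metric}")
    unknown = [d for d in dimensions if d not in model["dimensions"]]
    if unknown:
        raise KeyError(f"unknown dimensions: {unknown}")
    group_exprs = [model["dimensions"][d] for d in dimensions]
    select_parts = [f"{expr} AS {name}" for expr, name in zip(group_exprs, dimensions)]
    select_parts.append(f"{model['metrics'][metric]} AS {metric}")
    sql = f"SELECT {', '.join(select_parts)} FROM {model['table']}"
    if group_exprs:
        grouping = ", ".join(group_exprs)
        sql += f" GROUP BY {grouping} ORDER BY {grouping}"
    return sql


# Two "consumers" ask different questions but share one definition of revenue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [
        ("2023-01-05", "east", 100.0),
        ("2023-01-20", "west", 40.0),
        ("2023-02-03", "east", 250.0),
    ],
)
by_month = conn.execute(build_query("total_revenue", ["month"])).fetchall()
by_region = conn.execute(build_query("total_revenue", ["region"])).fetchall()
print(by_month)
print(by_region)
conn.close()
```

Because both queries are generated from the same metric expression, a change to how revenue is defined would need to be made in one place only, which is the consistency benefit the semantic layer item describes.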
AtScale and Data Operations
AtScale is the leading provider of the Semantic Layer, which enables actionable insights and analytics to be delivered with increased speed, scale and cost effectiveness. Research confirms that companies that use a semantic layer improve their speed to insights by 4x, meaning that a typical project to launch a new data source with analysis and reporting capabilities that would take 4 months can be done in just one month using a semantic layer.
AtScale’s semantic layer is uniquely positioned to support data operations: it ensures that data are consistently defined and structured using common attributes, metrics and features in dimensional form. This includes automating the process of inspecting, cleansing, editing and refining data by adding attributes, hierarchies and metrics / features, and automatically delivering ready-to-analyze data as a source for any BI tool, whether it’s Tableau, Power BI or Excel. Moreover, this work requires only one resource who understands the data and how it is to be analyzed, eliminating unnecessary complexity and resource intensity. This approach to data operations automation eliminates multiple data hand-offs, manual coding, the risk of duplicate extracts and suboptimal query performance.