How to Architect for Data Consumption

This is my pet peeve: technical architects are building systems and applications that make data analysis complicated, error-prone, and inefficient. Enabling data consumption should be a first-class requirement of any system that is built. I explain here how we could architect differently.

Technical systems architects, including myself until recently, are used to building systems with considerations such as development time, robustness, and evolvability in mind. Analytics was an afterthought. Having spent a few years crunching data at various scales, I see the world differently: I see barriers to analytics all around in systems. Here are a few thoughts on the nature of these barriers and how to address them.

Understanding the Nature of Data Analysis

In order to architect for data, we first need to understand the factors that can make data analysis complex:

  1. Dataset Uncertainty. The specific datasets required for individual analytic tasks are determined by the business need. The dataset cannot always be predicted ahead of time.
  2. Time Uncertainty. The amount of time available depends on business timelines.
  3. Unknowns. Any hidden assumptions or contexts make analytics error prone.
  4. Data Validity. Most analysis assumes data is internally consistent. Errors are embarrassing and will trigger conflict with the engineering team. Reconciliations are very expensive as well.
  5. Deep Changes. Some analysis involves conducting experiments. These can entail deep changes to system architecture.
  6. Data Updates. Analysis sometimes involves updates to data. The interfaces have to be scalable because the updates tend to be detailed, touching every record (e.g., the customer segment for each customer).
  7. Untracked Data and Tooling. Analysts use small and large freestanding tools that are not integrated with the larger platforms. However, these tools and their output have to be incorporated into the larger system.
  8. Inefficiency. Business questions are often repetitive and have predictable structures. They often look at the same data and use standard methods, so repeating the work by hand each time wastes effort.

All Data Should Be (Eventually) Consumable

Any data being captured will be about the service being offered, the system delivering the service, or the consumption of the service. Optimizing and adapting a business will require understanding every aspect of the business and acting on it. It is inevitable that chasing business questions will lead to every data corner.

If some data is never consumed, then that is a signal too. The data and application must be examined for the relevance of the original business objective that led to the collection of the data in the first place. The analytical process too needs to be audited for soundness.

Architecture Should Cover Analytical Systems

Often, the analytics tools and processes are assessed to be “outside the system” being architected, and are therefore not accounted for. Yet architectural changes impact the dependent analytical tooling, processes, and results. The output of the analytical process is often untracked. The execution of the analytical process, especially on larger datasets, is often more efficient if it works within the computational framework provided by the application.

Architects should consider analytical frameworks to be a component of the system with two-way information exchange with the rest of the system, and provide frameworks to compute, store, and index analytical artifacts. This will allow the analytical process to scale with data, people, and questions, and reduce confusion.

Data Integrity and Quality Guarantees

The architecture should specify and guarantee integrity checks for data. Integrity issues, if discovered later, not only invalidate analysis, but also make the analysts (and worse, decision-makers) distrust the data provided by the system. These guarantees should be continuously checked.
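As a rough illustration of continuously checked guarantees, the sketch below declares integrity rules as named predicates and runs them over a batch of records. The names `IntegrityCheck` and `run_checks`, and the order fields, are hypothetical and not part of any particular system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class IntegrityCheck:
    """A named rule that every record is expected to satisfy."""
    name: str
    predicate: Callable[[dict], bool]


def run_checks(records: Iterable[dict], checks: List[IntegrityCheck]) -> Dict[str, List[int]]:
    """Return, for each check, the positions of the records that violate it."""
    failures: Dict[str, List[int]] = {check.name: [] for check in checks}
    for position, record in enumerate(records):
        for check in checks:
            if not check.predicate(record):
                failures[check.name].append(position)
    return failures


# Hypothetical rules for an orders dataset.
CHECKS = [
    IntegrityCheck("total_non_negative", lambda r: r["total"] >= 0),
    IntegrityCheck(
        "shipped_after_ordered",
        lambda r: r["shipped_at"] is None or r["shipped_at"] >= r["ordered_at"],
    ),
]

orders = [
    {"total": 20.0, "ordered_at": 1, "shipped_at": 3},
    {"total": -5.0, "ordered_at": 2, "shipped_at": None},
    {"total": 12.5, "ordered_at": 5, "shipped_at": 4},
]

report = run_checks(orders, CHECKS)
for name, positions in report.items():
    print(name, "violations at records", positions)
```

A job like this could run on a schedule or on every load, so that violations surface before an analyst builds on the data.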

Quality is a slightly different challenge. Any ambiguity or gap in the information collected, such as in timestamps or unstructured text, has an impact on the availability and value of analytics output. Architectural tradeoffs impacting data quality, scope, and availability should be coordinated with the analytics team.

Data Discoverability and Accessibility

Analysts are unlikely to read through design documents and code. The architecture should provide interfaces that allow users to discover what data is being stored in the system, and provide standardized interfaces to access it, short of going to the database. These interfaces should be self-describing and comprehensive, and should evolve with the rest of the system. This also implies that the architecture should come with a data governance structure, as well as implementation, management, and auditing as built-in capabilities.

Data Usability

Data cannot be used without metadata that provides the context. Metadata includes not only the semantics (meaning) of the data but also lineage, assumptions, dependencies, accuracy, and other information. This metadata is useful in determining the appropriateness, value, and scope of analytical processes. For example, if a certain database column is being deprecated, then analysts can update their tools and processes to ignore that column.

This metadata should be explicitly managed through the entire lifetime of the system, and the architecture should provide or support a data catalog that organizes the metadata. As the volume, lifetime, and diversity of data increase, the catalog becomes more and more valuable in focusing attention on the right data. Additional services such as search may be required to enable quick discovery of relevant metadata.
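To make the catalog idea concrete, here is a minimal in-memory sketch. The class names (`ColumnMetadata`, `DataCatalog`), the fields, and the example tables are all assumptions chosen for illustration; a real catalog would be persistent and far richer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ColumnMetadata:
    table: str
    column: str
    description: str  # semantics: what the column means
    lineage: List[str] = field(default_factory=list)      # where the values come from
    assumptions: List[str] = field(default_factory=list)  # context an analyst must know
    deprecated: bool = False


class DataCatalog:
    """Hypothetical catalog keyed by (table, column)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], ColumnMetadata] = {}

    def register(self, meta: ColumnMetadata) -> None:
        self._entries[(meta.table, meta.column)] = meta

    def deprecate(self, table: str, column: str) -> None:
        self._entries[(table, column)].deprecated = True

    def usable_columns(self, table: str) -> List[str]:
        """Columns of a table that analysts should still rely on."""
        return sorted(
            column
            for (entry_table, column), meta in self._entries.items()
            if entry_table == table and not meta.deprecated
        )

    def search(self, keyword: str) -> List[str]:
        """Case-insensitive keyword search over column names and descriptions."""
        needle = keyword.lower()
        return sorted(
            f"{meta.table}.{meta.column}"
            for meta in self._entries.values()
            if needle in meta.column.lower() or needle in meta.description.lower()
        )


catalog = DataCatalog()
catalog.register(ColumnMetadata(
    "customers", "segment", "Marketing segment assigned by the analytics team",
    lineage=["segmentation_model"],
))
catalog.register(ColumnMetadata("customers", "fax_number", "Legacy contact field"))
catalog.register(ColumnMetadata(
    "orders", "total", "Order total in account currency",
    assumptions=["tax included"],
))

catalog.deprecate("customers", "fax_number")
usable = catalog.usable_columns("customers")
hits = catalog.search("segment")
print(usable, hits)
```

With the deprecation recorded in one place, analytical tools can ask the catalog which columns remain usable instead of hard-coding that knowledge.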

Data Lifecycle and Management

Application data keeps evolving with architectural and implementation changes made during the lifetime of the application. The storage and access mechanisms, or the nature of the data itself, may change due to decisions about what data to keep, throw away, or archive. Bugs in applications may require the data to be modified or deprecated. Security policy changes may impact accessibility and coverage of data.

All these changes impact the scope, depth, defensibility, and reproducibility of analysis. The architecture should provide mechanisms like callback hooks and lineage tracking to enable analytical tooling to discover the impact of these changes and find ways of coping with them.
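One way callback hooks and lineage tracking could fit together is sketched below: a lineage graph records which datasets are derived from which, and a change announced on one dataset notifies subscribers of everything downstream. `LineageGraph`, `ChangeHooks`, and the dataset names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set


class LineageGraph:
    """Tracks which datasets are derived from which (upstream -> downstream)."""

    def __init__(self) -> None:
        self._downstream: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_dependency(self, upstream: str, downstream: str) -> None:
        self._downstream[upstream].add(downstream)

    def impacted(self, dataset: str) -> Set[str]:
        """All datasets transitively derived from the given dataset."""
        seen: Set[str] = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for child in self._downstream.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


class ChangeHooks:
    """Lets analytical tooling register callbacks for changes that reach its datasets."""

    def __init__(self, lineage: LineageGraph) -> None:
        self._lineage = lineage
        self._callbacks: DefaultDict[str, List[Callable[[str, str, str], None]]] = defaultdict(list)

    def on_change(self, dataset: str, callback: Callable[[str, str, str], None]) -> None:
        self._callbacks[dataset].append(callback)

    def announce(self, dataset: str, description: str) -> None:
        # Notify subscribers of the changed dataset and of everything derived from it.
        affected = {dataset} | self._lineage.impacted(dataset)
        for name in sorted(affected):
            for callback in self._callbacks.get(name, ()):
                callback(name, dataset, description)


lineage = LineageGraph()
lineage.add_dependency("raw_orders", "daily_revenue")
lineage.add_dependency("daily_revenue", "revenue_forecast")

notices = []
hooks = ChangeHooks(lineage)
hooks.on_change("revenue_forecast", lambda affected, source, what: notices.append((affected, source, what)))

hooks.announce("raw_orders", "currency column dropped")
print(notices)
```

Here the owner of the forecast learns about a change two steps upstream without having to read the application's release notes.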

Support for Right Abstractions

Business questions and the data that address those questions are often predictable and repetitive. It is not uncommon to repeat analysis for different timeframes, products, or geographies. Duplication in datasets or structures of analysis can be discovered over time, and the right abstractions (interfaces and data models) can be created to reduce or eliminate this duplication. The metadata discovery interface should include these abstractions as well, along with the context.
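A small example of such an abstraction: instead of one hand-written query per timeframe, product, or geography, a single parameterized function can serve all of them. The function `total_by` and the sales records are made up for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable


def total_by(records: Iterable[dict], dimension: str, measure: str,
             start: date, end: date) -> Dict[str, float]:
    """Sum `measure` per value of `dimension` for records dated in [start, end)."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        if start <= record["date"] < end:
            totals[record[dimension]] += record[measure]
    return dict(totals)


# Hypothetical sales records.
sales = [
    {"date": date(2022, 1, 5), "product": "A", "region": "EU", "revenue": 100.0},
    {"date": date(2022, 1, 20), "product": "B", "region": "US", "revenue": 50.0},
    {"date": date(2022, 2, 3), "product": "A", "region": "US", "revenue": 70.0},
]

# The same abstraction answers two differently shaped business questions.
january_by_product = total_by(sales, "product", "revenue", date(2022, 1, 1), date(2022, 2, 1))
two_months_by_region = total_by(sales, "region", "revenue", date(2022, 1, 1), date(2022, 3, 1))
print(january_by_product, two_months_by_region)
```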

Bulk Data Interfaces

Often the analyses involve constructing detailed models over a large number of records. For example, customer profiles have one record for each of the possibly millions of customers. The system must be updated with the output of these models. The architecture should provide bulk interfaces to enable efficient updates, and should support interfaces for multiple levels of granularity.
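A bulk interface might look like the following sketch, which writes customer segments in batches using Python's standard `sqlite3` module rather than issuing one request per customer. The `customers` table, the `bulk_update_segments` function, and the batch size are assumptions made for the example.

```python
import sqlite3
from typing import Iterable, List, Tuple


def _flush(conn: sqlite3.Connection, batch: List[Tuple[str, int]]) -> int:
    # One transaction per batch: committed on success, rolled back on error.
    with conn:
        cursor = conn.executemany(
            "UPDATE customers SET segment = ? WHERE id = ?", batch
        )
    return cursor.rowcount


def bulk_update_segments(conn: sqlite3.Connection,
                         assignments: Iterable[Tuple[int, str]],
                         batch_size: int = 1000) -> int:
    """Apply (customer_id, segment) pairs in batches; return the number of rows updated."""
    updated = 0
    batch: List[Tuple[str, int]] = []
    for customer_id, segment in assignments:
        batch.append((segment, customer_id))
        if len(batch) >= batch_size:
            updated += _flush(conn, batch)
            batch = []
    if batch:
        updated += _flush(conn, batch)
    return updated


# Hypothetical setup: five customers with no segment assigned yet.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, segment TEXT)")
conn.executemany(
    "INSERT INTO customers (id, segment) VALUES (?, ?)",
    [(i, "unknown") for i in range(1, 6)],
)
conn.commit()

new_segments = [(1, "high_value"), (2, "dormant"), (3, "high_value"), (4, "new"), (5, "dormant")]
updated = bulk_update_segments(conn, new_segments, batch_size=2)
print("rows updated:", updated)
```

Because the function accepts any iterable, the same interface can take a handful of corrections or a stream covering every customer.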

Experiments

Experiments are increasingly common, in which product variants are presented to end-customers in systematic ways to help analysts understand end-customer sensitivity towards various product attributes such as pricing and size. This creates multiple control paths through the entire system, and significantly complicates the architecture needed.

Architecting with an awareness of the scope, depth, and likelihood of the experiments will reduce the need for convoluted methods or patches to the architecture.
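One building block that can keep experiment paths manageable is a deterministic assignment function, so that every component of the system independently agrees on which variant a customer sees. The sketch below uses hashing for this; it is one common technique rather than a prescription, and the experiment name, customer identifiers, and variant labels are hypothetical.

```python
import hashlib
from typing import Sequence


def assign_variant(experiment: str, customer_id: str, variants: Sequence[str]) -> str:
    """Map a customer to a variant, stably, without storing any state."""
    key = f"{experiment}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as an integer bucket.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


VARIANTS = ["control", "discounted_price"]

# Every customer gets exactly one variant, and the same one on every call.
assignments = {
    f"cust-{n}": assign_variant("pricing_test", f"cust-{n}", VARIANTS)
    for n in range(1000)
}
counts = {variant: list(assignments.values()).count(variant) for variant in VARIANTS}
print(counts)
```

Logging the experiment name and assigned variant alongside each transaction then lets analysts join outcomes back to the variant that produced them.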

Summary

Analytics used to be an optional add-on. It is increasingly a core component as we make applications more intelligent and responsive. Thinking through the nature of analytics in the context of the business, both today and over the long term, and the kinds of capabilities that will enable efficient and effective analytics, will lead to better overall systems.

Adding data consumption to the goals of a system’s architecture will increase work in the short term in the form of new interfaces and mechanisms, but the “data debt” has to be paid at one time or another. Putting frameworks and approaches in place early on will reduce the long-term costs.

Technical architects often make assumptions about the use cases of data, the knowledge and skill of the data user, and the mechanisms to be provided. The data ecosystem is evolving rapidly, and it is best if these assumptions are explicitly identified and tested constantly so that systems stay in sync with emerging needs.

(Hat tip to Premkumar for highlighting a gap)

Dr. Venkata Pingali is an academic turned entrepreneur, and co-founder of Scribble Data. Scribble aims to reduce friction in consuming data through automation.
