
How to Architect for Data Consumption


This is my pet peeve: technical architects keep building systems and applications that make data analysis complicated, error-prone, and inefficient. Enabling data consumption should be a first-class requirement of any system that is built. Here I explain how we could architect differently to improve data consumption.

Technical systems architects, including myself until recently, are used to building systems with considerations such as development time, robustness, and evolvability. Analytics was an afterthought. Having spent a few years crunching data at various scales, I see the world differently. I see barriers to analytics all around in systems. Here are a few thoughts on the nature of these barriers and how to address them.

Understanding the Nature of Data Analysis

In order to architect for data, we first need to understand the factors that can make data analysis complex:

  1. Dataset Uncertainty. The specific datasets required for individual analytic tasks are determined by the business need. The dataset cannot always be predicted ahead of time.
  2. Time Uncertainty. The amount of time available depends on business timelines.
  3. Unknowns. Any hidden assumptions or contexts make analytics error-prone.
  4. Data Validity. Most analysis assumes data is internally consistent. Errors are embarrassing and will trigger conflict with the engineering team. Reconciliations are very expensive as well.
  5. Deep Changes. Some analysis involves conducting experiments. These can entail deep changes to system architecture.
  6. Data Updates. Analysis sometimes involves updates to data. The interfaces have to be scalable because the updates tend to be detailed, touching every record (e.g., the customer segment for each customer).
  7. Untracked Data and Tooling. Analysts use small and large freestanding tools that are not integrated with the larger platforms. However, these tools and their output have to be incorporated into the larger system.
  8. Inefficiency. Business questions are often repetitive and have predictable structures. They often look at the same data and use standard methods.

All Data Should Be (Eventually) Consumable

Any data being captured will be about the service being offered, the system delivering the service, or the consumption of the service. Optimizing and adapting a business will require understanding every aspect of the business and acting on it. It is inevitable that chasing business questions will lead to every data corner.

If some data is never consumed, then that is a signal too. The data and application must be examined for the relevance of the original business objective that led to the collection of the data in the first place. The analytical process too needs to be audited for soundness.

Architecture Should Cover Analytical Systems

Often, the analytics tools and processes are assessed to be “outside the system” being architected, and are therefore not accounted for. Yet architectural changes impact the dependent analytical tooling, processes, and results. The output of the analytical process is often untracked. The execution of the analytical process, especially on larger datasets, is often more efficient if it works within the computational framework provided by the application.

Architects should consider analytical frameworks to be a component of the system with two-way information exchange with the rest of the system, and provide frameworks to compute, store, and index analytical artifacts. This will allow the analytical process to scale with data, people, and questions, and reduce confusion.

Data Integrity and Quality Guarantees

The architecture should specify and guarantee integrity checks for data. Integrity issues, if discovered later, not only invalidate analysis, but also make the analysts (and worse, decision-makers) distrust the data provided by the system. These guarantees should be continuously checked.
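
As a rough illustration of what continuously checked guarantees could look like, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables, the two rules, and the `run_integrity_checks` helper are all assumptions made up for this example; a real system would define its own rules and run them on a schedule.

```python
import sqlite3

def run_integrity_checks(conn):
    """Return a dict mapping check name -> number of violating rows."""
    checks = {
        # Hypothetical rule: every order must reference an existing customer.
        "orphan_orders": """
            SELECT COUNT(*) FROM orders o
            LEFT JOIN customers c ON c.id = o.customer_id
            WHERE c.id IS NULL
        """,
        # Hypothetical rule: order amounts must not be negative.
        "negative_amounts": "SELECT COUNT(*) FROM orders WHERE amount < 0",
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}

# Illustrative in-memory data with one violation of each rule.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 3, 40.0), (12, 2, -5.0);
""")
violations = run_integrity_checks(conn)
print(violations)
```

Publishing the violation counts alongside the data, rather than keeping them inside the engineering team, is one way to let analysts judge how far to trust a dataset.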

Quality is a slightly different challenge. Any ambiguity or gaps in the information collected, for example in timestamps and unstructured text, affect the availability and value of analytics output. Architectural tradeoffs impacting data quality, scope, and availability should be coordinated with the analytics team.

Data Discoverability and Accessibility

Analysts are unlikely to read through design documents and code. The architecture should provide interfaces that allow users to discover what data is being stored in the system, and provide standardized interfaces to access it, short of going to the database. These interfaces should be self-describing and comprehensive, and should evolve with the rest of the system. This also implies that the architecture should come with a data governance structure, as well as implementation, management, and auditing as built-in capabilities.
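
One way to picture a self-describing interface is a function that reports what datasets exist and what fields they carry, generated from the live system so it cannot drift from it. The sketch below assumes a SQLite store and a hypothetical `describe_datasets` helper; in practice this would more likely sit behind an API than be called directly.

```python
import sqlite3

def describe_datasets(conn):
    """List every table with its columns and declared types."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    description = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        description[table] = [{"column": c[1], "type": c[2]} for c in columns]
    return description

# Illustrative schema; the table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")
catalog = describe_datasets(conn)
for table, columns in catalog.items():
    print(table, columns)
```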

Data Usability

Data cannot be used without metadata that provides the context. Metadata includes not only the semantics (meaning) of the data but also lineage, assumptions, dependencies, accuracy, and other information. This metadata is useful in determining the appropriateness, value, and scope of analytical processes. For example, if a certain database column is being deprecated, then analysts can update their tools and processes to ignore that column.

This metadata should be explicitly managed through the entire lifetime of the system, and the architecture should provide or support a data catalog that organizes the metadata. As the volume, lifetime, and diversity of data increase, the catalog becomes more and more valuable in focusing attention on the right data. Additional services such as search may be required to enable quick discovery of relevant metadata.
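
To make the idea of a searchable catalog concrete, here is a minimal in-memory sketch. The `CatalogEntry` fields and the `DataCatalog` class are hypothetical names chosen for illustration, not a reference to any particular catalog product; a production catalog would persist entries and track far more metadata.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str                                     # e.g. "customers.segment"
    description: str                              # semantics of the field
    lineage: list = field(default_factory=list)   # upstream sources
    deprecated: bool = False

class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword, include_deprecated=False):
        """Return sorted names whose name or description mentions keyword."""
        keyword = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if (keyword in e.name.lower() or keyword in e.description.lower())
            and (include_deprecated or not e.deprecated)
        )

data_catalog = DataCatalog()
data_catalog.register(CatalogEntry(
    "customers.segment", "Marketing segment assigned to each customer",
    lineage=["segmentation_model"]))
data_catalog.register(CatalogEntry(
    "customers.legacy_tier", "Old customer tier, replaced by segment",
    deprecated=True))

print(data_catalog.search("segment"))
print(data_catalog.search("segment", include_deprecated=True))
```

Hiding deprecated fields by default is a design choice in this sketch: it steers analysts toward current data while still letting them find old fields when reproducing past work.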

Data Lifecycle and Management

Application data keeps evolving with architectural and implementation changes made during the lifetime of the application. The storage and access mechanisms, or the nature of the data itself, may change with decisions about what data to keep, throw away, or archive. Bugs in applications may require the data to be modified or deprecated. Security policy changes may impact accessibility and coverage of data.

All these changes impact the scope, depth, defensibility, and reproducibility of analysis. The architecture should provide mechanisms like callback hooks and lineage tracking to enable analytical tooling to discover the impact and find ways of coping with the changes.
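
A callback hook can be as simple as a publish/subscribe registry keyed by dataset. The sketch below is one possible shape, with a hypothetical `ChangeNotifier` class and an invented change payload; it is meant to show the mechanism, not a specific framework.

```python
from collections import defaultdict

class ChangeNotifier:
    """Lets analytical tools register callbacks for dataset changes."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, dataset, callback):
        self._subscribers[dataset].append(callback)

    def publish(self, dataset, change):
        """Call every subscriber of dataset; return how many were notified."""
        callbacks = self._subscribers.get(dataset, [])
        for callback in callbacks:
            callback(dataset, change)
        return len(callbacks)

notifier = ChangeNotifier()
received = []  # stands in for an analytics tool reacting to the change

notifier.subscribe("orders", lambda dataset, change: received.append((dataset, change)))
notified = notifier.publish(
    "orders", {"type": "column_deprecated", "column": "discount_code"})
print(notified, received)
```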

Support for Right Abstractions

Business questions and the data that address those questions are often predictable and repetitive. It is not uncommon to repeat analysis for different timeframes, products, or geographies. Duplication in datasets or structures of analysis can be discovered over time, and the right abstractions (interfaces and data models) can be created to reduce or eliminate this duplication. The metadata discovery interface should include these abstractions as well, along with the context.

Bulk Data Interfaces

Often the analyses involve constructing detailed models over a large number of records. For example, customer profiles have one record for each of the possibly millions of customers. The system must be updated with the output of these models. The architecture should provide bulk interfaces to enable efficient updates, and should support interfaces for multiple levels of granularity.
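
As a small illustration of a bulk interface, the sketch below applies many per-customer updates in a single transaction instead of one call per record. It assumes a SQLite `customers` table and a hypothetical `bulk_update_segments` function; the segment labels are invented for the example.

```python
import sqlite3

def bulk_update_segments(conn, assignments):
    """Apply (customer_id, segment) pairs in one transaction; return rows changed."""
    with conn:  # commits on success, rolls back if any update fails
        cursor = conn.executemany(
            "UPDATE customers SET segment = ? WHERE id = ?",
            [(segment, customer_id) for customer_id, segment in assignments],
        )
    return cursor.rowcount

# Illustrative data: three customers without a segment yet.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, segment TEXT);
    INSERT INTO customers (id) VALUES (1), (2), (3);
""")
changed = bulk_update_segments(conn, [(1, "loyal"), (2, "new"), (3, "at_risk")])
rows = conn.execute("SELECT id, segment FROM customers ORDER BY id").fetchall()
print(changed, rows)
```

Wrapping the batch in one transaction means the system never exposes a half-applied model output, which matters when downstream reports read the same table.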

Experiments

Experiment designs in which product variants are presented to end-customers in systematic ways are increasingly common. They help analysts understand end-customer sensitivity towards various product attributes such as pricing and size. This creates multiple control paths through the entire system, and significantly complicates the architecture needed.

Architecting with an awareness of the scope, depth, and likelihood of the experiments will reduce the need for convoluted methods or patches to the architecture.
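
One building block that keeps experiment control paths manageable is a single, deterministic assignment function that every component can call. The sketch below uses hash-based bucketing, a common technique that I am swapping in for illustration; the `assign_variant` name, the experiment label, and the variant names are all assumptions.

```python
import hashlib

def assign_variant(experiment, customer_id, variants):
    """Deterministically map a customer to one variant of an experiment."""
    key = f"{experiment}:{customer_id}".encode("utf-8")
    # The same (experiment, customer) pair always hashes to the same bucket.
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

variants = ["control", "lower_price"]
assignments = {
    customer_id: assign_variant("pricing_test", customer_id, variants)
    for customer_id in range(1000)
}
print(assignments[42])
```

Because the assignment depends only on the experiment name and customer id, any service can recompute it without a shared lookup table, and analysts can reproduce the split later from the same inputs.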

Summary

Analytics used to be an optional add-on. It is increasingly a core component as we make applications more intelligent and responsive. Thinking through the nature of analytics in the context of the business, both today and over the long term, and the kinds of capabilities that will enable efficient and effective analytics will lead to better overall systems.

Adding data consumption to the goals of a system’s architecture will increase work in the short term in the form of new interfaces and mechanisms, but the “data debt” has to be paid at one time or another. Putting in place frameworks and approaches early on will reduce the long-term costs.

Technical architects often make assumptions about the use cases of data, the knowledge and skill of the data user, and the mechanisms to be provided. The data ecosystem is evolving rapidly, and it is best if these assumptions are explicitly identified and tested constantly so that the systems stay in sync with emerging needs.

(Hat tip to Premkumar for highlighting a gap)

Dr. Venkata Pingali is an academic turned entrepreneur, and co-founder of Scribble Data. Scribble aims to reduce friction in consuming data through automation.
