This is my pet peeve: technical architects are building systems and applications that make data analysis complicated, error-prone, and inefficient. We need to treat the enablement of data consumption as a first-class requirement of any system that is built. I explain here how we could architect differently.
Technical systems architects, including myself until recently, are used to building systems with considerations such as development time, robustness, and evolvability in mind. Analytics was an afterthought. Having spent a few years crunching data at various scales, I see the world differently. I see barriers to analytics all around in systems. Here are a few thoughts on the nature of these barriers and how to address them.
Understanding the Nature of Data Analysis
In order to architect for data, we first need to understand the factors that can make data analysis complex:
- Dataset Uncertainty. The specific datasets required for individual analytic tasks are determined by the business need. The dataset cannot always be predicted ahead of time.
- Time Uncertainty. The amount of time available for an analysis depends on business timelines.
- Unknowns. Any hidden assumptions or contexts make analytics error-prone.
- Data Validity. Most analysis assumes data is internally consistent. Errors are embarrassing and will trigger conflict with the engineering team. Reconciliations are very expensive as well.
- Deep Changes. Some analysis involves conducting experiments. These can entail deep changes to system architecture.
- Data Updates. Analysis sometimes involves updates to data. The update interfaces have to be scalable because the updates tend to be detailed, touching every record (e.g., the customer segment for each customer).
- Untracked Data and Tooling. Analysts use small and large freestanding tools that are not integrated with the larger platforms. However, these tools and their output have to be incorporated into the larger system.
- Inefficiency. Business questions are often repetitive and have predictable structures. They often look at the same data and use standard methods.
All Data Should Be (Eventually) Consumable
Any data being captured will be about the service being offered, the system delivering the service, or the consumption of the service. Optimizing and adapting a business will require understanding every aspect of the business and acting on it. It is inevitable that chasing business questions will lead to every data corner.
If some data is never consumed, then that is a signal too. The data and application must be examined for the relevance of the original business objective that led to the collection of the data in the first place. The analytical process too needs to be audited for soundness.
Architecture Should Cover Analytical Systems
Often, the analytics tools and processes are assessed to be “outside the system” being architected, and are therefore not accounted for. Yet architectural changes impact the dependent analytical tooling, process, and results. The output of the analytical process is often untracked. The execution of the analytical process, especially on larger datasets, is often more efficient if it works within the computational framework provided by the application.
Architects should consider analytical frameworks to be a component of the system with two-way information exchange with the rest of the system, and provide frameworks to compute, store, and index analytical artifacts. This will allow the analytical process to scale with data, people, and questions, and reduce confusion.
Data Integrity and Quality Guarantees
The architecture should specify and guarantee integrity checks for data. Integrity issues, if discovered later, not only invalidate analysis, but also make the analysts (and worse, decision-makers) distrust the data provided by the system. These guarantees should be continuously checked.
Quality is a slightly different challenge. Any ambiguity or gap in the information collected, for example in timestamps or unstructured text, affects the availability and value of analytics output. Architectural tradeoffs impacting data quality, scope, and availability should be coordinated with the analytics team.
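The continuously checked integrity guarantees described above could be expressed as named, declarative checks that the system runs on a schedule. The following is a minimal sketch, not a definitive implementation. The `IntegrityCheck` class, the `run_checks` function, and the sample `orders` records are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class IntegrityCheck:
    """A named rule that must hold over a whole dataset (hypothetical type)."""
    name: str
    predicate: Callable[[List[Record]], bool]


def run_checks(records: List[Record], checks: List[IntegrityCheck]) -> Dict[str, bool]:
    """Evaluate every check and return a pass/fail report keyed by check name."""
    return {check.name: check.predicate(records) for check in checks}


# Illustrative data: the third record reuses an order_id and has a negative amount.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "amount": 40.0},
    {"order_id": 2, "customer_id": 12, "amount": -5.0},
]

checks = [
    IntegrityCheck(
        "unique_order_id",
        lambda rows: len({r["order_id"] for r in rows}) == len(rows),
    ),
    IntegrityCheck(
        "non_negative_amount",
        lambda rows: all(r["amount"] >= 0 for r in rows),
    ),
    IntegrityCheck(
        "customer_id_present",
        lambda rows: all(r.get("customer_id") is not None for r in rows),
    ),
]

report = run_checks(orders, checks)
failed = [name for name, ok in report.items() if not ok]
print(failed)
```

Keeping the checks as data rather than scattering them through application code means the same list can be published to analysts, so they know which guarantees they can rely on.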
Data Discoverability and Accessibility
Analysts are unlikely to read through design documents and code. The architecture should provide interfaces that allow users to discover what data is being stored in the system, and provide standardized interfaces to access it without going directly to the database. These interfaces should be self-describing and comprehensive, and should evolve with the rest of the system. This also implies that the architecture should come with a data governance structure, as well as implementation, management, and auditing as built-in capabilities.
Data cannot be used without metadata that provides the context. Metadata includes not only the semantics (meaning) of the data but also lineage, assumptions, dependencies, accuracy, and other information. This metadata is useful in determining the appropriateness, value, and scope of analytical processes. For example, if a certain database column is being deprecated, then analysts can update their tools and processes to ignore that column.
This metadata should be explicitly managed through the entire lifetime of the system, and the architecture should provide or support a data catalog that organizes the metadata. As the volume, lifetime, and diversity of data increase, the catalog becomes more and more valuable in focusing attention on the right data. Additional services such as search may be required to enable quick discovery of relevant metadata.
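To make the idea of a catalog concrete, here is a small sketch of what a catalog entry and a search service might look like. It is an illustration under assumptions, not a reference design. `ColumnMeta`, `Catalog`, and the dataset and column names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ColumnMeta:
    """Context an analyst needs before using a column (hypothetical schema)."""
    name: str
    description: str              # semantics: what the value means
    source: str                   # lineage: where the value comes from
    assumptions: List[str] = field(default_factory=list)
    deprecated: bool = False


class Catalog:
    """In-memory catalog keyed by 'dataset.column'."""

    def __init__(self) -> None:
        self._columns: Dict[str, ColumnMeta] = {}

    def register(self, dataset: str, column: ColumnMeta) -> None:
        self._columns[f"{dataset}.{column.name}"] = column

    def search(self, term: str) -> List[str]:
        """Return keys whose name or description mentions the term."""
        term = term.lower()
        return sorted(
            key
            for key, col in self._columns.items()
            if term in key.lower() or term in col.description.lower()
        )

    def deprecated_columns(self) -> List[str]:
        """Return keys that analysts should stop depending on."""
        return sorted(key for key, col in self._columns.items() if col.deprecated)


catalog = Catalog()
catalog.register("customers", ColumnMeta(
    "segment", "Marketing segment assigned to the customer",
    source="segmentation job"))
catalog.register("customers", ColumnMeta(
    "legacy_region", "Sales region from the old CRM",
    source="CRM import", deprecated=True))
catalog.register("orders", ColumnMeta(
    "amount", "Order total in account currency",
    source="checkout service", assumptions=["tax included"]))

print(catalog.search("customer"))
print(catalog.deprecated_columns())
```

An analyst's tooling could call `deprecated_columns()` before each run and drop the listed columns, which is the deprecation scenario described above.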
Data Lifecycle and Management
Application data keeps evolving with architectural and implementation changes made during the lifetime of the application. The storage and access mechanisms, or the nature of the data itself, may change due to decisions about what data to keep, throw away, or archive. Bugs in applications may require the data to be modified or deprecated. Security policy changes may impact accessibility and coverage of data.
All these changes impact the scope, depth, defensibility, and reproducibility of analysis. The architecture should provide mechanisms like callback hooks and lineage tracking to enable analytical tooling to discover the impact of these changes and find ways of coping with them.
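One way to combine callback hooks with lineage tracking is sketched below: the system records which datasets are derived from which, and when a change is announced on a source, every subscriber on a downstream dataset is notified. This is a minimal sketch under assumptions. `ChangeNotifier` and the dataset names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set

Callback = Callable[[str, str], None]  # (affected dataset, change description)


class ChangeNotifier:
    """Tracks lineage edges and notifies subscribers of upstream changes."""

    def __init__(self) -> None:
        self._hooks: DefaultDict[str, List[Callback]] = defaultdict(list)
        self._children: DefaultDict[str, Set[str]] = defaultdict(set)

    def subscribe(self, dataset: str, callback: Callback) -> None:
        self._hooks[dataset].append(callback)

    def record_lineage(self, source: str, derived: str) -> None:
        """Declare that `derived` is computed from `source`."""
        self._children[source].add(derived)

    def downstream(self, dataset: str) -> Set[str]:
        """Return every dataset derived, directly or indirectly, from `dataset`."""
        seen: Set[str] = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for child in self._children.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    def announce(self, dataset: str, change: str) -> None:
        """Invoke hooks on the changed dataset and on everything downstream."""
        affected = [dataset] + sorted(self.downstream(dataset) - {dataset})
        for name in affected:
            for callback in self._hooks.get(name, []):
                callback(name, change)


notifier = ChangeNotifier()
notifier.record_lineage("orders", "daily_revenue")
notifier.record_lineage("daily_revenue", "revenue_forecast")

received = []
notifier.subscribe("revenue_forecast", lambda ds, change: received.append((ds, change)))
notifier.announce("orders", "column 'discount' deprecated")
print(received)
```

The forecast owner never subscribed to `orders` directly. The lineage edges are what carry the notification two steps downstream.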
Support for Right Abstractions
Business questions and the data that address those questions are often predictable and repetitive. It is not uncommon to repeat an analysis for different timeframes, products, or geographies. Duplication in datasets or structures of analysis can be discovered over time, and the right abstractions (interfaces and data models) can be created to reduce or eliminate this duplication. The metadata discovery interface should include these abstractions as well, along with the context.
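As a small illustration of such an abstraction, a repeated question like "what were sales for this period, product, or geography?" can be captured once as a parameterized function instead of being rewritten for each request. The `Sale` record, the `total_sales` function, and the sample data are hypothetical and only sketch the idea.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Sale:
    day: date
    product: str
    geography: str
    amount: float


def total_sales(
    sales: Iterable[Sale],
    start: date,
    end: date,
    product: Optional[str] = None,
    geography: Optional[str] = None,
) -> float:
    """Sum sales in [start, end], optionally narrowed to a product or geography."""
    return sum(
        s.amount
        for s in sales
        if start <= s.day <= end
        and (product is None or s.product == product)
        and (geography is None or s.geography == geography)
    )


sales = [
    Sale(date(2024, 1, 5), "basic", "EU", 100.0),
    Sale(date(2024, 1, 20), "pro", "EU", 250.0),
    Sale(date(2024, 2, 3), "basic", "US", 80.0),
    Sale(date(2024, 2, 15), "pro", "US", 300.0),
]

# The same abstraction answers three differently scoped questions.
january = total_sales(sales, date(2024, 1, 1), date(2024, 1, 31))
eu_year = total_sales(sales, date(2024, 1, 1), date(2024, 12, 31), geography="EU")
pro_feb = total_sales(sales, date(2024, 2, 1), date(2024, 2, 29), product="pro")
print(january, eu_year, pro_feb)
```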
Bulk Data Interfaces
Often the analyses involve constructing detailed models over a large number of records. For example, customer profiles have one record for each of the possibly millions of customers. The system must be updated with these changes. The architecture should provide bulk interfaces to enable efficient updates, and should support interfaces for multiple levels of granularity.
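A bulk interface of this kind might accept a stream of per-customer assignments and apply them in batches, one transaction per batch, rather than issuing one call per record. The sketch below uses Python's standard `sqlite3` module as a stand-in for the application's store. The `customers` table, its columns, and the `bulk_update_segments` function are assumptions made for illustration.

```python
import sqlite3
from typing import Iterable, List, Tuple

Assignment = Tuple[str, int]  # (segment, customer_id), in the order the SQL expects


def _flush(conn: sqlite3.Connection, batch: List[Assignment]) -> None:
    # One transaction per batch: committed on success, rolled back on error.
    with conn:
        conn.executemany(
            "UPDATE customers SET segment = ? WHERE customer_id = ?", batch
        )


def bulk_update_segments(
    conn: sqlite3.Connection,
    assignments: Iterable[Assignment],
    batch_size: int = 1000,
) -> int:
    """Apply segment assignments in batches; return how many were submitted."""
    submitted = 0
    batch: List[Assignment] = []
    for item in assignments:
        batch.append(item)
        if len(batch) >= batch_size:
            _flush(conn, batch)
            submitted += len(batch)
            batch = []
    if batch:
        _flush(conn, batch)
        submitted += len(batch)
    return submitted


# Illustrative setup: five customers with no segment yet.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT)")
conn.executemany(
    "INSERT INTO customers (customer_id) VALUES (?)", [(i,) for i in range(1, 6)]
)
conn.commit()

assignments = [("high_value" if i % 2 else "standard", i) for i in range(1, 6)]
submitted = bulk_update_segments(conn, assignments, batch_size=2)
segments = dict(
    conn.execute("SELECT customer_id, segment FROM customers ORDER BY customer_id")
)
print(submitted, segments)
```

Because the function takes any iterable, the analytical tool can stream millions of assignments from a file or generator without holding them all in memory, and `batch_size` gives the system a knob for trading transaction size against throughput.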
Experiment design is increasingly common: product variants are presented to end-customers in systematic ways to help analysts understand end-customer sensitivity to product attributes such as pricing and size. This creates multiple control paths through the entire system, and significantly complicates the architecture needed.
Architecting with an awareness of the scope, depth, and likelihood of the experiments will reduce the need for convoluted methods or patches to the architecture.
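One common building block for experiment-aware systems is deterministic variant assignment, so that every component along a control path can independently work out which variant a customer is in. The sketch below uses a hash of the experiment name and customer identifier. It is one possible approach, not something prescribed here, and `assign_variant`, the experiment name, and the variant labels are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Sequence


def assign_variant(experiment: str, customer_id: int, variants: Sequence[str]) -> str:
    """Map a customer to a variant, stably, without storing any state.

    The same (experiment, customer_id) pair always yields the same variant,
    and different experiments split customers independently of each other.
    """
    key = f"{experiment}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


VARIANTS = ["control", "lower_price"]

# Any service in the request path can recompute the assignment on its own.
first = assign_variant("pricing_test", 42, VARIANTS)
second = assign_variant("pricing_test", 42, VARIANTS)

# Rough look at how 1000 hypothetical customers are split across variants.
counts = Counter(assign_variant("pricing_test", cid, VARIANTS) for cid in range(1000))
print(first == second, dict(counts))
```

Logging the assigned variant alongside each recorded event would then let analysts join outcomes to variants without reverse-engineering the control paths.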
Analytics used to be an optional add-on. It is increasingly a core component as we make applications more intelligent and responsive. Thinking through the nature of analytics in the context of the business, both today and over the long term, and the kinds of capabilities that will enable efficient and effective analytics will lead to better overall systems.
Adding data consumption to the goals of a system’s architecture will increase work in the short term in the form of new interfaces and mechanisms, but the “data debt” has to be paid at one time or another. Putting frameworks and approaches in place early on will reduce the long-term costs.
Technical architects often make assumptions about the use cases of data, the knowledge and skill of the data user, and the mechanisms to be provided. The data ecosystem is evolving rapidly, and it is best if these assumptions are explicitly identified and tested constantly so that systems stay in sync with emerging needs.
(Hat tip to Premkumar for highlighting a gap)
Dr. Venkata Pingali is an academic turned entrepreneur, and co-founder of Scribble Data. Scribble aims to reduce friction in consuming data through automation.