Cloud-Native Architecture — Data Mesh Architecture

Shivakumar Goniwada Rudrappa
8 min read · Aug 14, 2021


Introduction

The industry has embraced decoupling applications into microservices with clear ownership, technology, and development. In a similar way, you need to consider decoupling Big Data management, and now is the right time to do it. This article explains how you can adopt the decoupling technique in Big Data management.

Currently, the big data platforms available in the industry are the Data Lake and the Data Warehouse. Both hold large volumes of data replicated from various siloed domain databases, either through ETL batch jobs or event-streaming jobs. In your organization, or your client's, the data lake is typically implemented with unclear responsibilities and no clear ownership of the domains in the lake.

Why do you need to decouple big data management?

In modern-day business, disruption is happening like never before, so we need to make sure our technology supports the business. The Data Lake and Data Warehouse are good but have their limitations, such as centralization of domains, unclear domain ownership, and so on. To overcome these challenges, the concept of Data Mesh provides a new way to address these common problems. Zhamak Dehghani of ThoughtWorks coined the term Data Mesh and has written a detailed paper on it.

What is Data Mesh?

Data Mesh essentially refers to the concept of breaking down data lakes and silos into smaller, more decentralized, domain-oriented datasets; it is like the shift from a monolithic legacy application toward a microservices architecture. In a nutshell, Data Mesh is to data what microservices architecture is to application development.

Decoupling

You are already familiar with microservices architecture and the approach of decoupling a legacy monolith into microservices based on domains using domain-driven design (explained in subsequent chapters). Domain-driven design addresses the problems in an application domain and the transactional data related to that domain, but usually we do not address domains in analytical data, the ownership of that data, and so on. Data Mesh addresses domain-driven design for data.

In the Data Lake and Data Warehouse, you might have observed ownership issues: there may be an owner who manages and operationalizes the big data platform, but that owner does not come from the domains. Ownership is very important. In your organization, you might have seen vertical towers for Finance, Healthcare, Retail, and so on; each tower has someone in charge who owns the entire team and its delivery for the related clients. In a similar way, you need an owner for each data domain.

Data Mesh Principles

The Data Mesh implementation is based on four principles, as depicted in the diagram below.

  • Domain-oriented decentralized data ownership and architecture: This principle is about applying the data domain-driven concept to decouple and decentralize data and its ownership.
  • Data as a Product: This principle is about addressing concerns around the accessibility, usability, and harmonization of distributed datasets.
  • Self-service data infrastructure as a platform: This principle is about the services and skills required to operate the data pipeline technologies and infrastructure in each domain.
  • Federated computational governance: This principle is about data governance and standardization for interoperability, enabled by a shared and harmonized self-service data infrastructure.

Data Mesh Architecture

Data Mesh refers to the concept of decoupling data lakes and silos into a smaller, decentralized, domain-based model. Analytical data can then scale in the way that microservices and polyglot persistence have allowed transactional data to scale. Zhamak Dehghani explains all four principles in great detail, which you can find here (https://martinfowler.com/articles/data-monolith-to-mesh.html). I will cover them briefly, in a more structured way, with an example of how you can implement Data Mesh in your project. I am using an eCommerce application as the example throughout.

How do current projects define data architecture?

In the present scenario, the data architecture looks like the figure below: a centralized data lake architecture whose goals are to ingest data from all corners of the enterprise; cleanse, enrich, and transform that data into the data lake; and serve the datasets in the lake to diverse requests.

Current Data Lake Architecture

The monolithic Data Lake platform contains and owns data that belongs to different domains, e.g., Customers, Sales KPIs, Inventory, Payments, Orders, etc. As the business changes, this kind of implementation is no longer able to support the required growth, because of diverse customers, wider adoption of the cloud-native approach in the application landscape, and the minimum viable product (MVP) delivery approach.

On the replication side, you create streams from diverse sources into the Data Lake. In most organizations, you do not build all the replication at once; you follow an iterative model and build as the business grows. For this replication, you may use an ETL approach or an event-based streaming approach. In both approaches the architecture is the same: ingestion, cleansing and transformation, then loading or subscribing to events. With this approach, if you want to add replication for a new domain, you need to change the whole set of replications, which leads to maintainability and testability problems.
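
To make the problem concrete, here is a minimal sketch of such a centralized pipeline, assuming hypothetical helper names (extract, cleanse, transform, load_to_lake are mine, not from any particular tool). The point is the shape: every domain flows through one shared code path, so adding a domain means editing and re-testing everything.

    # monolithic_pipeline.py: hypothetical sketch of a centralized lake pipeline

    def extract(domain):
        # stand-in for an ETL batch pull or event subscription per source
        return [{"domain": domain, "id": 1}]

    def cleanse(rows):
        # shared, domain-agnostic cleansing applied to every domain
        return [r for r in rows if r.get("id") is not None]

    def transform(rows):
        # shared enrichment and transformation logic
        return [{**r, "ingested": True} for r in rows]

    def load_to_lake(domain, rows):
        # single centralized landing zone
        print(f"loaded {len(rows)} rows into lake/{domain}")

    def run_lake_pipeline():
        # every domain is wired into one shared flow; adding a new domain
        # means editing this function and re-testing all existing replications
        for domain in ("customers", "orders", "inventory", "payments"):
            load_to_lake(domain, transform(cleanse(extract(domain))))

    run_lake_pipeline()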

The data ownership of today’s monolithic data lake platform is based on who builds the data lake with the skill of data engineering and specifically based on the tooling knowledge and not from the domain knowledge. In a nutshell, the ownership is based on technology and skills, not on the domain. The Data Mesh approach provides a solution to most of the problems you are facing in today’s monolithic big data approach.

Next Generation Cloud-Native Data Lake Implementation

This section explains the next-generation data lake implementation in the following steps.

Step 1: Self-service data infrastructure as a platform

The principle here is shifting dataset ownership from tool-specific to domain-specific. To support this approach, the data pipeline needs to move from a single ingestion, cleansing, transformation, and subscription flow to a domain-based approach.

For the domain-based approach, you need to split the replication pipeline by domain, for example a Customer pipeline, an Order pipeline, and so on. In this split, the team that owns the source database is required to take responsibility for domain-based cleansing, de-duplicating, and enriching of its domain events. Each domain dataset must establish service-level objectives for the quality of the data it provides.

Domain-Based Pipeline

For example, as shown in the figure above, your customer domain provides customer demographic details and 'add product to wish list' events; the customer domain pipeline includes cleansing and standardization and provides a stream of de-duplicated, near-real-time 'add product' events. Aggregations of such domains make up the 'New Data Domains'.

Customer demographics + 'add product to wish list' events = customer domain pipeline

To summarize the responsibilities of source and target: the source side of the 'Domain Data Pipeline' provides domain-related events and their cleansing; the target side subscribes to that data, forming the 'New Data Domains'.
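
Here is a minimal sketch of one such domain-owned pipeline, under my own assumptions (the helper names and the SLO fields are illustrative, not from the source). It shows the customer domain de-duplicating and standardizing its own events and declaring a quality SLO for the dataset it serves:

    # customer_pipeline.py: hypothetical sketch of a domain-owned pipeline

    # quality targets the customer team commits to for its consumers
    SLO = {"freshness_seconds": 60, "duplicate_rate": 0.0}

    def standardize(event):
        # domain-specific cleansing rules, owned by the customer team
        return {"customer_id": event["customer_id"],
                "product_id": event["product_id"].strip().upper(),
                "action": "add_to_wishlist"}

    def customer_pipeline(raw_events):
        seen = set()
        for event in raw_events:
            key = (event["customer_id"], event["product_id"])
            if key in seen:  # de-duplicate near-real-time events
                continue
            seen.add(key)
            yield standardize(event)

    events = [{"customer_id": 7, "product_id": " sku-42 "},
              {"customer_id": 7, "product_id": " sku-42 "}]  # duplicate is dropped
    print(list(customer_pipeline(events)), SLO)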

Step 2: Data as a Product

With the step above, data ownership and data pipeline implementation become the responsibility of the business domain, as shown in the figure below. This raises an important concern around the accessibility, usability, and harmonization of these new domain datasets.

Data as a Product with an as-a-service model

This is where you can implement Data-Domains-as-a-Service by exposing domain capabilities as APIs to the rest of the consumers in the organization. As part of the as-a-service approach, you need to create a set of well-designed APIs and events that are discoverable and well documented, with test sandboxes.
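
As a sketch of what such a data-product API could look like (the source does not prescribe a framework; Flask and the endpoint shape here are my assumptions), the customer domain might publish its dataset behind a small, documented endpoint:

    # customer_domain_api.py: hypothetical data-product API for the customer domain
    # Assumes Flask is installed (pip install flask); the endpoint shape is illustrative.
    from flask import Flask, jsonify, abort

    app = Flask(__name__)

    # in a real data product this would be the domain's governed dataset,
    # not an in-memory dict
    CUSTOMERS = {7: {"customer_id": 7, "segment": "retail", "wishlist": ["SKU-42"]}}

    @app.get("/customers/<int:customer_id>")
    def get_customer(customer_id):
        customer = CUSTOMERS.get(customer_id)
        if customer is None:
            abort(404)
        return jsonify(customer)

    if __name__ == "__main__":
        app.run(port=8080)  # consumers would find this via the org's API catalog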

Step 3: Data Infrastructure as a Platform

The main concern with distributing data ownership to the domains is the duplicated effort and skills required to operate the data pipeline technology stack and infrastructure in each domain. Harvesting and extracting domain-agnostic infrastructure capabilities into a data infrastructure platform removes the need to duplicate the effort of creating domain-related pipelines, storage, domain-specific streaming engines, and so on. The data infrastructure as a platform should be domain-agnostic, with the platform configured for each specific domain.

To build the data infrastructure for the Data Mesh, you can use existing infrastructure. For example, you can use AWS S3, Google Cloud Storage, or Azure Blob Storage to store the domain models; for the as-a-service layer, you can use standard API stacks and event stacks; and for the data pipelines, use event brokers and ETL tools, creating a separate pipeline and code base for each domain-related replication.
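
One way to keep the platform itself domain-agnostic, sketched under my own assumptions (the configuration fields and names are illustrative), is a generic pipeline runner that each domain drives purely through configuration:

    # platform_runner.py: hypothetical domain-agnostic pipeline runner
    # each domain supplies only configuration; the platform supplies the machinery

    CUSTOMER_DOMAIN = {
        "name": "customer",
        "source_topic": "crm.customer.events",      # event broker topic (illustrative)
        "sink_bucket": "s3://data-mesh/customer",   # could be GCS or Azure Blob instead
        "slo": {"freshness_seconds": 60},
    }

    def run_domain_pipeline(config):
        # the same generic steps run for every domain; only the config differs
        print(f"subscribing to {config['source_topic']}")
        print(f"writing curated data to {config['sink_bucket']}")
        print(f"enforcing SLO: {config['slo']}")

    run_domain_pipeline(CUSTOMER_DOMAIN)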

Step 4: Domain-oriented decentralized data ownership and architecture

To decouple and decentralize the monolithic data platform, we need to start thinking from a data-domain angle instead of just replicating data from heterogeneous sources to a target. In my eCommerce example, the customer domain owns and serves the dataset for access by any team, for any purpose. The physical location of the customer domain can be anywhere, such as Google Cloud Storage, AWS S3, or Azure Blob Storage on the respective cloud implementation, but the domain owner should be the same team that owns the overall customer domain in your enterprise.

The team that owns the customer domain is responsible not only for providing the business domain data but also for the truths of customer demographics and customers' likes and dislikes of products. Where a customer usage pattern requires transaction details that belong to other domains, you need to create a domain-specific dataset fit for that consumption.
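
A sketch of how a domain team might advertise ownership and location so that any team can discover and consume the dataset (every field name here is my assumption, not a standard):

    # customer_data_product.py: hypothetical data-product descriptor for discovery

    CUSTOMER_DATA_PRODUCT = {
        "domain": "customer",
        "owner_team": "customer-domain-team",        # same team that owns the business domain
        "location": "gs://ecommerce-mesh/customer",  # could equally be S3 or Azure Blob
        "api": "https://api.example.com/customers",  # as-a-service access point (illustrative)
        "datasets": ["demographics", "wishlist_events"],
        "slo": {"freshness_seconds": 60, "completeness": 0.99},
    }

    def discover(registry, domain):
        # consuming teams look the product up instead of asking a central lake team
        return next(p for p in registry if p["domain"] == domain)

    print(discover([CUSTOMER_DATA_PRODUCT], "customer")["location"])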

Step 5: Data Governance

The Data Mesh platform should be designed as a distributed data architecture under centralized governance and standardization for interoperability, enabled by a shared, self-service data infrastructure. Once the data infrastructure is mature, you can blend the centralized model with decentralized governance to improve innovation, independence, and so on.
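
As a small computational-governance sketch (my own illustration; the schema rule is an assumption), a shared check can enforce that every domain's published events conform to an agreed interoperable shape, while each domain runs the check inside its own pipeline:

    # governance_check.py: hypothetical federated schema check shared by all domains

    REQUIRED_FIELDS = {"customer_id", "event_type", "occurred_at"}  # org-wide standard

    def conforms(event):
        # a centrally agreed rule, executed by each domain's own pipeline
        return REQUIRED_FIELDS.issubset(event)

    good = {"customer_id": 7, "event_type": "add_to_wishlist",
            "occurred_at": "2021-08-14T10:00:00Z"}
    bad = {"customer_id": 7}
    print(conforms(good), conforms(bad))  # True False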

Conclusion

This article gives you insight into the problems of the current data lake architecture and how to build next-generation big data management by decoupling the existing data lake or data warehouse using the technique called Data Mesh. For detailed implementation guidance, refer to the ThoughtWorks blogs.
