The End of the Bronze Age: Rethinking the Medallion Architecture

Key Takeaways

  • Operational and analytical use cases are not able to access relevant, complete, and trustworthy data reliably. There needs to be a new approach to data processing.
  • While the multi-hop architecture has been around for decades and can bridge operational and analytical use cases, it’s inefficient, slow, expensive, and difficult to reuse.
  • The shift left approach takes the same data processing happening downstream and shifts it left (upstream) so more teams can access relevant, complete, and trustworthy data.
  • Data products are a key part of a shift left, forming the basis of data communications across the business.
  • Data contracts ensure healthy data products and provide a barrier between the internal and external data models, giving data product users a stable yet evolvable API and well-defined boundaries into business domains.

Operational and analytical use cases all face the same problem: they are unable to reliably access relevant, complete, and trustworthy data from across their organization. Instead, each use case typically requires cobbling together its own means for accessing data. ETL pipelines may provide a partial solution for data access for data analytics use cases, while a REST API may serve some ad hoc data access requests for operational use cases.

However, each independent solution requires its own implementation and maintenance, resulting in duplicate work, excessive costs, and similar yet slightly different data sets.

There is a better way to make data available to the people and systems who need it, regardless of whether they’re using it for operational, analytical, or something in between. It involves rethinking those archaic yet still commonly-used ETL patterns, the expensive and slow multi-hop data processing architectures, and the "everyone for themselves" mentality prevalent in data access responsibilities. It’s not only a shift in thinking but also a shift in where we do our data processing, who can use it, and how to implement it. In short, it’s a shift left. Take the very same work you’re already doing (or will be doing) downstream, and shift it left (upstream) so that everyone can benefit from it.

But what are we shifting left from?

Rethinking Data Lakes and Warehouses

A data lake is typically a multi-hop architecture, where data is processed and copied multiple times before eventually arriving at some level of quality and organization that can power a specific business use case. Data flows from left to right, beginning with some form of ETL from the source system into a data lake or data warehouse.

Multi-hop architectures have been around for decades as part of bridging the operational-analytical divide. However, they are inherently inefficient, slow, expensive, and difficult to reuse.

The medallion architecture is the most popular form of multi-hop architecture today. It is divided into three different medallion classifications or layers, according to the Olympic Medal standard: bronze, silver, and gold. Each of the three layers represents progressively higher quality, reliability, and guarantees - with bronze being the weakest and gold being the strongest.

  • Bronze layer: The bronze layer is the landing zone for raw imported data, often as a mirror of the source data model. Data practitioners then add structure, schemas, enrichment, and filtering to the raw data. The bronze layer is the primary data source for the higher-quality silver layer data sets.
  • Silver layer: The silver layer provides filtered, cleaned, structured, standardized, and (re)modeled data, suitable for analytics, reporting, and further advanced computations. These are the building blocks for calculating analytics, building reports, and populating dashboards, and are commonly organized around important business entities. For example, it may contain data sets representing all business customers, sales, receivables, inventory, and customer interactions, each well-formed, schematized, deduplicated, and verified as "trustworthy canonical data".
  • Gold layer: This layer delivers "business-level" and "application-aligned" data sets, purpose-built to provide data for specific applications, use cases, projects, reports, or data export. The gold layer data is predominantly de-normalized and optimized for reads, but you may still need to join against other data sets at query time. While the silver layer provides "building blocks", the gold layer provides the "building," built from the blocks and cemented together with additional logic.

The medallion architecture, a popular version of the multi-hop architecture
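To make the three layers concrete, here is a minimal sketch of the same progression expressed with pandas. The file name, column names, and cleaning rules are purely illustrative; in practice each hop is usually a separate scheduled job that writes its output back to the lake.

```python
import pandas as pd

# Bronze: raw landing zone, mirroring the source model with no cleaning applied.
bronze_orders = pd.read_json("raw_orders_export.json", lines=True)

# Silver: filtered, deduplicated, schematized "building block" data set.
silver_orders = (
    bronze_orders
    .dropna(subset=["order_id", "customer_id"])          # drop malformed rows
    .drop_duplicates(subset=["order_id"])                # deduplicate
    .astype({"order_id": "int64", "amount": "float64"})  # enforce a schema
)

# Gold: denormalized, purpose-built data set for one specific report.
gold_daily_revenue = (
    silver_orders
    .assign(order_date=pd.to_datetime(silver_orders["created_at"]).dt.date)
    .groupby("order_date", as_index=False)["amount"]
    .sum()
)
```

Each step produces yet another copy of the data, which is exactly the cost and fragility problem examined below.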

There are serious problems with the medallion architecture and, indeed, with all multi-hop architectures. Let’s examine why:

Flaw #1: The consumer is responsible for data access

The multi-hop medallion model is predicated on a pull system. A downstream data practitioner must first create an ETL job to pull data into the medallion architecture. Next, they have to remodel it, clean it, and put it into a usable form before any real work can begin. This is a reactive and untenable position, as the consumer bears all the responsibility for keeping the data available and the pipeline running, without having ownership of, or even influence over, the source model.

Furthermore, the ETL is coupled to the source data model, leading to a very tight and brittle coupling. Any changes made to the upstream system (such as the database schema) can cause the pipeline to break.
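As a minimal illustration of that coupling, consider a hypothetical nightly extract job. The table and column names below are assumptions, but the pattern is familiar: the consumer’s query mirrors the producer’s internal schema, so any upstream rename or restructuring breaks the pipeline.

```python
import sqlite3  # stand-in for whatever operational database the source uses

# The column list mirrors the producer's internal table. If the owning team
# renames "cust_status" or splits "full_name", this job breaks without warning.
SOURCE_QUERY = """
    SELECT cust_id, full_name, cust_status, last_login_ts
    FROM customers
"""

def pull_customers(conn: sqlite3.Connection) -> list[tuple]:
    """Nightly extract into the bronze layer."""
    return conn.execute(SOURCE_QUERY).fetchall()
```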

Flaw #2: The medallion architecture is expensive

Populating a bronze layer requires lots of data copying and processing power. Then, we immediately process and copy that data again to get it to a silver layer of quality. Each step incurs costs - loading data, network transfers, writing copies back to disk, and computing resources - and these costs quickly add up to a large bill. It’s inherently expensive to build and maintain, and results in significant wasted resources.

Costs can further balloon when coupled with consumer-centric responsibility for data access. A consumer who is unsure what data they can or can’t use is inclined to build their own pipeline from the source, increasing the total cost of both compute and human resources. The result may be a gold-layer project that fails to provide an adequate return on investment, simply due to the high costs incurred by this pattern.

Flaw #3: Restoring data quality is difficult

Suppose you’re successful at ETLing the data out of your system into your data lake. Now you need to denormalize it, restructure it, standardize it, remodel it, and make sense out of it - without making any mistakes.

A formerly popular TV show in the United States, CSI: Crime Scene Investigation, gave its audience an unrealistic sense of how much structure and form can be restored to data. In one scene, an investigator tells a colleague to "enhance" the image on the computer screen - the colleague zooms in (significantly), and the blurry, pixelated image suddenly reforms into a sharp rendition of a clue reflected in the suspect’s sunglasses. In reality, we would just get a close-up of a handful of very blocky pixels that no amount of "enhancing" would fix.

This is a cautionary tale. The bronze layer creators are not experts on the source domain model, yet they are charged with recreating a precise, picture-perfect rendition of the data. The work they perform in pursuit of this goal is challenging, and though they may come close to recreating what the source data model represents, it may not be precisely correct. This is where the next problem arises.

In a more practical scenario, consider three addresses entered by a human being: 123 Fake Street, 123 Fake st., and 123 FAKE STR. Which one is the correct address? Which one will the independent consumers of the data standardize it to? While each consumer of this unstandardized format will try their best to make an informed and rational processing decision, these relatively mundane data cleaning steps can cause havoc. One system may merge them into a single address, while another system may not, resulting in divergent results and conflicting reports.
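A minimal sketch of that divergence, assuming two hypothetical consumer teams each write their own normalization rules for the same raw values:

```python
import re

raw_addresses = ["123 Fake Street", "123 Fake st.", "123 FAKE STR"]

def normalize_team_a(addr: str) -> str:
    # Team A lowercases and expands the street suffix to its long form.
    addr = addr.strip().lower().rstrip(".")
    return re.sub(r"\b(st|str|street)$", "street", addr)

def normalize_team_b(addr: str) -> str:
    # Team B uppercases and abbreviates the street suffix instead.
    addr = addr.strip().upper().rstrip(".")
    return re.sub(r"\b(ST|STR|STREET)$", "ST", addr)

print({normalize_team_a(a) for a in raw_addresses})  # {'123 fake street'}
print({normalize_team_b(a) for a in raw_addresses})  # {'123 FAKE ST'}
```

Each team’s output is internally consistent, yet the two "canonical" addresses no longer match each other - which is exactly how the conflicting reports described above come about.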

There is no substitute for getting correct, high-fidelity data from the source, as tack-sharp and optimally constructed as possible. And if you do it correctly, you’ll be able to reuse it for all your use cases - operational, analytical, and everything in between.

Flaw #4: The bronze layer is fragile

The bronze layer is a dumping ground for whatever data was scraped out of the upstream system. It is overloaded with responsibilities and complex mappings between the source system data and the (allegedly) more trustworthy silver layer data. It’s a complex house of cards, relying on an orchestration of ETL and data cleaning jobs, while at the same time subject to breaking from any changes to its upstream input data.

The result is that the bronze layer is often broken in some way or another, requiring data reprocessing and post-mortems, resulting in inconsistent reports, dashboards, and analytics. And that’s if you’re lucky enough to notice that there’s a problem, and not be made aware of it by a very angry customer.

Trust is of utmost importance in business. All the data organizations I’ve worked in had some variation of a saying that boiled down to this: "Trust is easy to lose and hard to gain".

Bad data will make your customers lose trust in you. Show them one value on the dashboard but charge them another amount in the bill, and see how happy they are. Similar yet different datasets are common in a medallion architecture because all the work of standardizing data is put onto the consumer. It’s simply too easy to make mistakes, regardless of how diligent, well-meaning, or alert you and your team may be.

Flaw #5: No data reusability for operational workloads

Data pushed to an analytics endpoint remains largely (often only) in the analytical space, as does any cleanup, standardization, schematization, and transformation work. It is typically processed by periodic batch jobs, which are simply too slow for operational use cases.

Instead, operational workloads develop their own techniques and tools to access the data they need, further enlarging the divide between the operational and analytical space.

Shift Left Makes It Easier, Cheaper, and Simpler to Access Data

Shifting left is a rethink of how we communicate data around an organization. At its core, you take the very same work you’re already doing (or that you plan to do) and shift it to the left so that everyone can benefit from it. Shifting left eliminates duplicate pipelines, reduces the risk and impact of bad data, and lets you leverage high-quality data sources for both operational and analytical use cases. Shifting left is iterative and modular - you can shift just one data set to the left, or you can shift many. It all depends on your specific needs.

Shifting left works because you’re simply doing work that you’re already doing or will be doing, just upstream from where it is now. For example, you’ve already:

  • Mapped internal domain models to external domains in the data lake/warehouse
  • Tried to restore unstructured, low-fidelity data to the high-fidelity state of its source
  • Built an appropriate silver-layer data set to be used as a building block for more complicated work

Data products provide well-defined, high-quality, interoperable, and supported data that other teams, people, and systems can use as they need. Adopting data products enables easy access to data, without putting the onus on the consumer to figure out their own way to do it (flaw #1). Think of a data product as a first-class data citizen, published with the same guarantees and quality as any other product that you or your business creates.

All of the logic that previously converted raw bronze data to well-structured silver is shifted left out of your data lake and put inside the boundary of your data product. The cleanup, remodeling, and schema enforcement are performed as close to the source as possible, so you get a clean and well-defined data model early in the process. In doing so, you eliminate the need for each consumer to restore data quality on their own (flaw #3), and you reduce costs (flaw #2) by standardizing and cleaning the data once instead of multiple times across multiple downstream systems.
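As a minimal sketch, the same mapping that a bronze-to-silver job would otherwise perform can live as a small, source-owned function at the data product boundary. The field names and rules here are assumptions for illustration.

```python
def to_account_record(raw: dict) -> dict:
    """Map the internal source model to the published data product model.

    Owned by the source team and applied once, at the boundary, instead of
    being re-implemented in every downstream consumer's pipeline.
    """
    return {
        "account_id": int(raw["id"]),
        "email": raw["email"].strip().lower(),
        "country": raw.get("country", "").strip().upper(),
        "created_at": raw["created_at"],  # already ISO-8601 at the source
    }
```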

We’ll get further into the formal data product definition in a moment, but first, let’s turn our attention to how the data product customers (people, teams, and systems) can access the data.

Accessing a Data Product as a Stream or as a Table

A data product can provide its data in multiple ways - also known as modes. For example, it can provide the data through a stream (typically an Apache Kafka topic) and also as a table (typically some form of Parquet-based structure).

Analytical use cases have traditionally relied on periodic batch jobs and tabular data. However, modern use cases tend to require low latency streams to accomplish many analytical tasks, such as powering customer dashboards, predicting and shaping user engagement, and monitoring the real-time health of the business - just to name a few.

A data product with both stream and table modes

But streams can also drive operational use cases, which nicely resolves flaw #5 for us. Simply put, the same logic you use to standardize and structure the data for the table mode is perfectly fit for generating the stream mode. You get the ability to power both operational and analytical use cases for a singular investment into a stream/table data product.

Apache Iceberg is one of the leading ways to provide a materialized table from your event stream, so you can plug it into whatever analytical processing engine you require without making extra copies of the data. You save lots of time & effort (flaws #1 & #3) and money (flaw #2), while at the same time reducing misinterpretations, duplicate pipelines, and the proliferation of similar yet different data sets (flaw #4).

You can use Apache Kafka connectors to create an Iceberg table that’s representative of your event stream. Some service providers also offer first-party options that reduce complexity even further, such as Confluent’s Tableflow feature, which turns a Kafka topic directly into an Iceberg table.
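On the consuming side, here is a sketch of what this can look like with PyIceberg, assuming a catalog named "analytics" and a table named "data_products.orders" (both hypothetical). The point is that the analytical engine reads the data product’s table directly rather than re-copying the stream.

```python
from pyiceberg.catalog import load_catalog

# Catalog connection details come from local PyIceberg configuration;
# the catalog and table names here are illustrative.
catalog = load_catalog("analytics")
orders = catalog.load_table("data_products.orders")

# Scan the Iceberg table into Arrow for downstream analytics - no extra copy
# of the underlying data product is created along the way.
arrow_table = orders.scan().to_arrow()
```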

You may be asking, "But how do you keep the stream and the table consistent?"

One of the easiest ways is to use a write-through approach - write the data to the stream first, then materialize the table off of the stream. You can use a connector to generate the table, or you can write some custom code to materialize the stream into a table representation.

A data product with the table mode materialized from the stream

Writing to the stream first enables all the event-driven use cases across your organization. The derived materialized table, as part of the data product, enables the table-based use cases, such as bulk-loading data, batch-based querying, and powering systems that have no need for streaming data.
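A minimal sketch of that write-through flow, using a plain Kafka consumer to replay the stream into a keyed, in-memory view. The topic and field names are illustrative, and a production job would write to an Iceberg or Parquet table rather than a dictionary.

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-table-materializer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

orders_table = {}  # latest state per order_id

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Upsert: the stream remains the source of truth; the table is derived.
    orders_table[event["order_id"]] = event
```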

There are still a few more things to cover before we wrap up. Let’s now take a deeper look at the data product itself.

Data Products and Data Contracts

Data products are primarily a formalization of the work you’re already doing or would need to do to make data usable across your organization. The rigor with which you create and manage data products will vary based on your company culture, business needs, and data problems.

A strict and formal approach is often necessary when you have many important data sets spread across multiple systems, domains, and teams. In short, the more people who rely on the data, the more rigorously you should build, manage, and monitor your data products.

Common practices for building a successful data product include:

  • Designate a data product owner, who manages customer requests and releases. This person could be an existing product owner, a senior developer, or a team lead. However, it is important that they are part of the team that owns the original source of the data and that they are familiar with the internal data domain model.
  • Consult with customers (other teams, people, and systems). The data product owner is responsible for ensuring that the data product can meet their data requirements.
  • Establish a release management cycle to review, test, and integrate data product creation, updates, and deletion. Adding data product checks to an existing deployment pipeline is a great first step for ensuring data product quality.
  • Provide and populate a human-readable data catalog to enable easy data discovery, including sample data, use cases, and points of contact for questions and help (e.g., the data product owner).
  • Validate and verify that the data product adheres to the data contract established during its creation.

Data Contracts provide the formal agreement of the form and function of the data product and its API to all of its users, including the stream and table modes. Data contracts also provide a barrier between the internal and external data models, providing the data product users with a stable yet evolvable API and a well-defined boundary into other business domains.

Common aspects of a data contract include:

  • A specific schema technology, such as Avro, Protobuf, or JSON Schema, to name the most popular event schema formats. You also have choices for table formats, though nowadays Parquet tends to be the clear favorite.
  • The actual schema of the data, including the names, types, and default values of the fields. For streams, it also includes the event key information, as well as the key-partitioning strategy used to partition the stream.
  • Schema evolution rules, which go hand-in-hand with the schema technology. For example, if you need to add a new field to a schema (akin to adding a column to a database table), the schema evolution rules provide a framework for updating the data without causing unintentional breakages to downstream consumers (further alleviating flaw #4). A minimal schema sketch follows this list.
  • Social responsibilities, such as who to contact if you have questions about the data, and what level of support to expect should there be a problem.
  • Service-level agreements (SLAs) pertaining to support. If the data product fails in the middle of the night, a tier 1 SLA may see a person get out of bed to fix it, while a tier 4 SLA may be fixed within the next 3 business days.
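Here is a minimal sketch of the schema portion of such a contract, expressed as an Avro record; the field names and namespace are hypothetical. The "tier" field illustrates an evolution rule in action: it was added in a later version with a default value, which Avro’s schema resolution treats as a compatible change for existing consumers.

```python
import fastavro

account_contract = fastavro.parse_schema({
    "type": "record",
    "name": "Account",
    "namespace": "com.example.accounts",  # hypothetical domain namespace
    "fields": [
        {"name": "account_id", "type": "long"},
        {"name": "email",      "type": "string"},
        {"name": "country",    "type": "string"},
        # Added later; the default keeps old and new readers compatible.
        {"name": "tier", "type": "string", "default": "standard"},
    ],
})
```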

Creating a data contract requires input from the prospective consumers, especially those who depended on the original unreliable data. What pained them? What did they want fixed? What guarantees do they need to avoid failures caused by unknown changes?

These prospective consumers, along with the data product owner and their supporting teammates, then hammer out the roles, responsibilities, data definitions, and change management framework for the data product.

How Far Do You Shift Left?

Perhaps one of the most commonly asked questions is how far to take a shift left approach. The answer will vary based on several factors, but will generally follow one of three major patterns.

The first is a full shift to the left. The data product is created within the application code of the source system, and the data is emitted to an event-stream natively, in well-formed events. The data product owner is a member of the application’s team, and the data contract is managed alongside all the application’s other APIs. The application team is fully willing and able to build and manage the data product.

Type 1 - Shift all the way to the left
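A minimal sketch of what this can look like inside the source application, assuming a hypothetical account service and an "accounts" topic. In practice, the database write and the event publish would be kept in sync with a transactional outbox or similar mechanism, which is omitted here for brevity.

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def create_account(db, name: str, email: str) -> dict:
    """Operational write plus a contract-conformant event, in one code path."""
    account = {
        "account_id": db.next_id(),          # hypothetical persistence layer
        "name": name.strip(),
        "email": email.strip().lower(),
        "status": "ACTIVE",
    }
    db.insert("accounts", account)           # the operational write
    producer.produce(                        # the data product write
        "accounts",
        key=str(account["account_id"]),
        value=json.dumps(account),
    )
    return account
```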

However, not all systems allow for this level of integration. Legacy applications, for one, may restrict any further changes that are not critical to security. Socio-technical issues may also get in the way, with the source application team refusing to provide a deeper application integration. In this case, you’ll have to build the data product outside of the source application, and you’ll also need to resolve exactly what level of support the source application team is willing to give.

While socio-technical factors can be an impediment to shifting left, the reality is that the data cleaning and standardization still need to be done by someone, somewhere. One suboptimal but still workable approach is to appoint a team to manage the data product outside of the source application. This type 2 pattern is shown below, and leverages a change data capture (CDC) connector to take data from the database and transform it (e.g., to standardize some fields) before writing it to the output Account topic.

Type 2 - Shift left, but just outside the source application

The downside of this level of integration is that changes to the database table may break the data product. However, it may be a more palatable pattern, since the application team no longer needs to deal with the complaints of all the teams and systems whose pipelines they’ve broken. Instead, they can simply deal with the data product owner, reducing their own stress and opening them up to further cooperation in the future.
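A minimal sketch of that type 2 processor, assuming the CDC connector lands raw row images on a "cdc.accounts.raw" topic and the data product is published to an "accounts" topic (the topic names, field names, and payload shape are all illustrative):

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "account-data-product",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["cdc.accounts.raw"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    row = json.loads(msg.value())  # raw row image from the CDC connector
    account = {
        "account_id": row["id"],
        "email": (row.get("email") or "").strip().lower(),
        "country": (row.get("country") or "").strip().upper(),
    }
    producer.produce("accounts", key=str(account["account_id"]),
                     value=json.dumps(account))
```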

These two patterns both assume that you can get all the data you need from a single database. But what if you need data from multiple systems? This brings us to the Type 3 integration, where we use a fully external stream processor to bring together data from multiple streams into a single data product.

Type 3 - Building data products from multiple streams

This third type tends to show up as an organization matures with its shift left journey, and can have a flywheel effect. As data becomes available from more and more source systems, it becomes easier to access - increasing demand for data that is easy to mix and match as you see fit, without having to build and manage your own data pipelines and do all the work yourself.

In this example, we see that Flink SQL is used to create an Enriched Order topic, presumably by joining together Account information, Order information, and Product information, provided by three different source systems.
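A minimal sketch of that join, using PyFlink’s Table API to run Flink SQL. The table and column names are assumptions, and the Kafka-backed source and sink tables (orders, accounts, products, enriched_orders) are assumed to have been registered beforehand with CREATE TABLE statements.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming-mode table environment; connector jars and the CREATE TABLE
# statements for the Kafka-backed tables are assumed to be in place already.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    INSERT INTO enriched_orders
    SELECT o.order_id, a.account_name, p.product_name, o.amount
    FROM orders AS o
    JOIN accounts AS a ON o.account_id = a.account_id
    JOIN products AS p ON o.product_id = p.product_id
""")
```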

In Summary

Shifting left provides clean, reliable, and accessible data to all who need it in your organization. It is ultimately a reduction in complexity, overhead, and break-fix work and will free you and your colleagues to work on other, more valuable problems. Data products are the backbone of a shift left strategy, and form the basis of healthy data communication across your organization.

Shift left relies on a renegotiation of where work is done, and who is responsible for ensuring that it’s done properly. While it may seem like we’re adding work to the pile, we’re simply taking work that we would do in the bronze layer, such as cleaning, standardizing, and schematizing, and doing it upstream instead. This type of work is the easiest to shift to the left, because it puts the onus of clarifying what the data should mean on those who understand it best - the very teams, people, and systems who created it in the first place.

You may choose to do more advanced patterns with streams - such as complex time-based aggregations, business-specific application logic, or other types of processing work. However, the crux of shift left is simply to establish easy access to high-quality data products that replace the work commonly done in the bronze layer, bringing the benefit to all instead of just a select few.
