A Big Data Reference Architecture

Martijn Imrich, Partner


So we have arrived in the era of Big Data. Everyone is talking about Hadoop and Data Lakes, and asking why we should have a Data Warehouse at all.

And there’s no shortage of technology. Every day new technologies arrive with fancy names such as Spark, Splunk, Hunk, Hive, Pig, YARN and so on.

But how do you design an effective Big Data Architecture?

So-called Reference Architectures have created order in many domains, ranging from applications to Business Intelligence. This article suggests a Reference Architecture for Big Data.

Let’s start by looking at a well-known Reference Architecture for regular data: the Business Intelligence and Data Warehousing Architecture. It has been argued that Data Warehousing has now become obsolete. I would rather follow the strategy of embracing and enhancing what has proven useful.

In summary, a modern BI Architecture can be recognized by three main characteristics:

[Figure: the three main characteristics of a modern BI Architecture]

This BI Architecture has matured over the last 25 years and has now reached the phase of data warehouse automation. So why not take this Architecture and see how it can be extended to cater for Big Data as well?

A lot of “Big Data” experiments turn out to be “Regular Data” projects that use predictive modelling. These could be built on top of any modern BI Architecture. The big difference is the creation of a (huge) Analytical Base Table, which may require more advanced data preparation than usual. Data Mining tools typically work from such a big flat table to discover the patterns in it:
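As a minimal sketch of what such data preparation amounts to, the Python below flattens a dimension and a set of facts into one row per subject. The `customers` and `orders` rows, the field names and the `build_abt` helper are all hypothetical and only serve as illustration; a real Analytical Base Table would carry far more features.

```python
from collections import defaultdict

# Hypothetical source rows; in practice these would come from the
# dimensions and facts of the BI environment.
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "business"},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 500.0},
]


def build_abt(customers, orders):
    """Return one flat row per customer with aggregated order features."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for order in orders:
        totals[order["customer_id"]] += order["amount"]
        counts[order["customer_id"]] += 1

    abt = []
    for customer in customers:
        cid = customer["customer_id"]
        n = counts[cid]
        abt.append({
            "customer_id": cid,
            "segment": customer["segment"],
            "order_count": n,
            "total_amount": totals[cid],
            # Guard against customers without any orders.
            "avg_amount": totals[cid] / n if n else 0.0,
        })
    return abt


abt = build_abt(customers, orders)
```

Each row of `abt` describes exactly one customer, which is the shape a Data Mining tool would expect as input.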


This Architecture holds until the real “Big Data” comes in. Volume, Variety and Velocity are the reasons HDFS, Hadoop and now Spark came into being in the first place. So let’s use these technologies in the “Staging” and “Register” areas of this Reference Architecture. The Data Lake becomes the “schema on read” equivalent of the “schema on write” Data Vault.
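The difference between applying a schema while writing and applying it while reading can be sketched in a few lines of Python. This is an illustration only: the `SCHEMA` dictionary, the sensor fields and both helper functions are assumptions made for the example, not part of any particular product.

```python
import json

# Hypothetical schema: field name mapped to the type it is cast to.
SCHEMA = {"sensor_id": str, "temperature": float}


def write_with_schema(store, record):
    """Schema while writing: validate and shape the record before storing it."""
    row = {}
    for field, field_type in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        row[field] = field_type(record[field])
    store.append(row)


def read_with_schema(raw_lines):
    """Schema while reading: raw lines stay untouched, structure is applied per query."""
    for line in raw_lines:
        try:
            record = json.loads(line)
            yield {name: cast(record[name]) for name, cast in SCHEMA.items()}
        except (ValueError, KeyError, TypeError):
            continue  # skip lines that do not fit this particular reading


# Warehouse side: only records that fit the schema get in.
warehouse = []
write_with_schema(warehouse, {"sensor_id": "a1", "temperature": "21.5"})

# Lake side: everything is kept as it arrived.
lake = [
    '{"sensor_id": "a1", "temperature": 21.5, "firmware": "1.2"}',
    '{"sensor_id": "b2"}',
    'not json at all',
]
readings = list(read_with_schema(lake))
```

The lake keeps all three lines, including the ones this schema cannot interpret, so a later reading with a different schema can still make use of them. The warehouse rejects anything that does not fit at load time.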

This way the Big Data Reference Architecture now looks like this:

Table 2

It not only shows the BI and Big Data Architectures as complementary; the two also share value, because the BI Dimensions are presented to the Analytical environment. An interesting topic is how structures or patterns found in the lake could be fed to the Data Vault. Personally I would be interested in any development here!

So technologies can be positioned in the areas of this Reference Architecture where they are most effective.
Data modelling techniques can be positioned in the appropriate areas as well. Zooming in on data modelling at the core of this Architecture reveals interesting complementarity and similarity. The four areas seem complementary and complete enough to hold any data in whatever form or shape is needed for both BI and Big Data:

Table 3

But there is also a similarity: the Data Vault and the Analytical Base Table are both Subject-oriented structures. Who can tell whether there is a way to feed the Analytical Base Table directly from the equally Subject-oriented Data Vault?
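To make the question concrete, here is one possible shape such a direct feed could take. It is a sketch under strong assumptions, not an answer: the hub and satellite tables, their column names and both helper functions are hypothetical and heavily simplified.

```python
# Hypothetical, heavily simplified Data Vault tables: one hub with
# business keys and one satellite with dated descriptive attributes.
hub_customer = [
    {"customer_hk": "c1", "customer_bk": "CUST-001"},
    {"customer_hk": "c2", "customer_bk": "CUST-002"},
]
sat_customer_profile = [
    {"customer_hk": "c1", "load_date": "2015-01-01", "segment": "retail"},
    {"customer_hk": "c1", "load_date": "2015-06-01", "segment": "business"},
    {"customer_hk": "c2", "load_date": "2015-03-01", "segment": "retail"},
]


def latest_satellite_rows(satellite):
    """Keep the most recent satellite row per hub key (ISO dates sort as text)."""
    latest = {}
    for row in satellite:
        key = row["customer_hk"]
        if key not in latest or row["load_date"] > latest[key]["load_date"]:
            latest[key] = row
    return latest


def flatten_to_abt(hub, satellite):
    """Join each hub row to its latest satellite row: one flat row per subject."""
    latest = latest_satellite_rows(satellite)
    abt = []
    for hub_row in hub:
        sat_row = latest.get(hub_row["customer_hk"], {})
        abt.append({
            "customer": hub_row["customer_bk"],
            "segment": sat_row.get("segment"),
        })
    return abt


abt_rows = flatten_to_abt(hub_customer, sat_customer_profile)
```

Because both structures are organised around the same subject, the flattening reduces to picking the current satellite row per hub key. Whether that still holds once links and multiple satellites are involved is exactly the open question.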

There’s so much more to tell about this Big Data Reference Architecture, but for the purpose of this post I’ve been as brief as possible. Let me know how it works for you in designing an effective Big Data Architecture!
