An example flow

The flow

When building data warehouses, it is common to use multiple layers for incremental data processing. The structure of these layers can be determined according to your preferences.

To demonstrate what dewi can do for you, let's walk through an example data flow for a customer entity. However, please note that this is purely for illustrative purposes and may not always be the recommended approach ;)

Please refer to the image below for a visual representation of the flow we are about to build:

flow Show full screen image

Let's consider the following scenario: we have two source systems, a CRM system and an accounting system (Exact), both containing information about our fictional customer. Our goal is to create a unified customer dimension in the datamart for use with BI tools.

For the purpose of this discussion, we won't delve into the orchestration of these flows or the retrieval of data from the source systems. Instead, we will focus on configuring and building the necessary objects in the database.

In summary, we will set up an Operational Datastore to track historical changes in the source systems. Then, we will model, clean, and deduplicate the source data using a combination of datavault and businessvault. Finally, we will generate a datamart that can be consumed by BI tools such as Power BI.

Staging and Operational Datastore

We start with a Dutch CRM system as our source system, which contains a table called klant that holds all customer information. We need a destination for this data and also want to track changes made to that table.

The first step is to build the operational datastore tables in the data warehouse with the ODS type.

Here's an example YAML configuration:

type: ODS
entity: Customer
layer: OperationalDatastore
schema: crm_ods
name: klant
columns: 
- name: KlantID
  primary_key: True
- name: Code
- name: Naam
- name: Actief
  data_type: BIT
track_history: True
flows: 
  crm:
    upstream:
    - schema: crm_stg

Executing this configuration will create three tables in the data warehouse. Firstly, a staging or landing table in the staging layer, where we can load our data. Secondly, a klant table in the OperationalDatastore layer. Lastly, it creates a klant_HISTORY table. Additionally, it generates a few stored procedures to incrementally update or truncate/load the crm_ods.klant table.

We can follow the same approach for the other system that holds customer information.

Here's another example YAML configuration:

type: ODS
layer: OperationalDatastore
entity: Customer
schema: exact_ods
name: customer
columns: 
- name: CustomerID
  primary_key: True
- name: Code
- name: Name
- name: Active
  data_type: BIT
track_history: True
flows: 
  eol:
    upstream:
    - schema: eol_stg

By using this configuration, we can create the necessary tables and procedures in the data warehouse for the exact_ods.customer table, enabling change tracking and incremental loading.

Datavault

Next up is a really contrived example of an additional datavault implementation. We start with building an HUB table where we set the business_key to Code, as it is the code used universally within the business.

type: HUB
entity: Customer
business_keys:
- name: Code

Since we need to update the HUB from the CRM system, we'll add a flow to accomplish this from the ODS layer.

type: HUB
entity: Customer
business_keys:
- name: Code
flows:
  crm: 
    upstream:
    - schema: crm_ods
      table: klant

Additionally, there is information coming from the Exact system, so we'll add another flow for that.

type: HUB
entity: Customer
business_keys:
- name: Code
flows:
  crm: 
    upstream:
    - schema: crm_ods
      table: klant
  eol: 
    upstream:
    - schema: exact_ods
      table: customer

Next, we'll add some details to two satellite tables, one for each source system. Dewi is intelligent in this regard. When you add a SAT (satellite) for an entity, it will automatically attempt to find the corresponding HUB with the same entity. However, you can also configure this manually, see SAT attributes.

It's important to configure the same flow name as used for the HUB. This allows Dewi to merge these flows within the same stored procedures.

type: SAT
entity: Customer
name: Customer_CRM_SAT
additional_columns:
- name: KlantID
- name: Naam
- name: Actief
flows:
  crm:
    upstream:
    - schema: crm_ods
      table: klant

And the same goes for the exact table.

type: SAT
entity: Customer
name: Customer_EOL_SAT
additional_columns:
- name: CustomerID
- name: Name
- name: Active
flows:
  eol:
    upstream:
    - schema: exact_ods
      table: customer

Business vault

Now comes the part where we are going to implement some logic to merge both sources in the Business Vault. This use case is not yet supported very well, but we can mimick a normal Datavault flow with some CUST objects.

As we've seen before a SAT table needs a Staging table to load data from, this table does not have to be a physical table, this can be a VIEW and we can create a custom view with the CUST object, like this:

type: CUST
layer: DATAVAULT
schema: bv_stg
entity: Customer
name: customer_incremental
storage: virtual
columns:
- name: Code
- name: Name
- name: Active

In the DQL queries generated we can select from the upstream satellites and do some business logic like the following rule:

  • The CRM is leading, but when not available in the CRM take the name from Exact instead.

Now we can create the SAT. Because the view contains the 'full' dataset we only configure a full load in a flow we also will call 'full'.

type: SAT
entity: Customer
schema: bv
name: Customer_SAT
additional_columns:
- name: Name
- name: Active
flows:
  full: 
    load_types:
      full:
        enabled: True
    upstream:
    - schema: bv_stg
      table: customer_full

For the incremental load we can do something similar. Create an VIEW which contains only the records that need to be updated in the downstream table. So let's create a new flow and call it the incremental flow

type: CUST
layer: DATAVAULT
schema: bv_stg
entity: Customer
name: customer_incremental
storage: virtual
columns:
- name: Code
- name: Name
- name: Active`

And we need to add this flow to the SAT, so we modify our configuration a bit:

type: SAT
entity: Customer
schema: bv
name: Customer_SAT
additional_columns:
- name: Name
- name: Active
flows:
  full: 
    load_types:
      full:
        enabled: True
    upstream:
    - schema: bv_stg
      table: customer_full
  incremental: 
    load_types:
      incremental:
        enabled: True
    upstream:
    - schema: bv_stg
      table: customer_incremental

Datamart

Last in the line is to create a dimension table. For this example we'll create a virtual dimension table

type: DIM
entity: Customer
storage: virtual
columns:
- name: Customer HK
  primary_key: True
- name: Code
- name: Name
- name: Active

As datamart flows can contain some difficult business logic dewi will not try to guess how to build the table, but it'll provide you with a custom DQL file in which you can create the SELECT query of which you would like to create the view.

This brings us at the end of this example flow. For some more information you can check out all the different configuration properties there are on each of these configurations at Configuration schema's

For the full configuration of this flow download the link below:

Full flow implementation