Dewi configuration types

Staging

STG - Staging table

see STG properties for all the possible configurable properties

The STG type configures a staging table. A staging table is the most upstream table there is in a dataflow. So a staging table does not support defining dataflows.

The STG type defaults to the STAGING layer in the stg schema. The entity name is also considered the tablename when not defined differently

For a staging table it's important to configure all columns. Where a column needs to contain atleast a name and optionally a data_type (defaults to NVARCHAR(255)), whether or not the column is mandatory mandatory or if the column is part of the primary_key.

example:

type: STG
schema: tos_stg
entity: Vessel
name: Vessel
columns:
- name: IMONumber
  data_type: NVARCHAR(255)
  mandatory: true
  primary_key: true
- name: SourceKey
  data_type: NVARCHAR(255)
- name: VesselID
  data_type: NVARCHAR(255)
- name: VesselName
  data_type: NVARCHAR(255)
- name: CallSign
  data_type: NVARCHAR(255)
- name: IsActive
  data_type: BIT
- name: VesselLength
  data_type: DECIMAL(19,5)
- name: VesselWidth
  data_type: DECIMAL(19,5)

ODS Table

see ODS properties for all the possible configurable properties

The ODS type will make it easier to construct an Operational Datastore layer in the datawarehouse. When configuring the ODS type you will get a generated landing table for full, incremental, delete loads. The persistent staging table and when configured also a temporal table with history tracking. For every load type it'll generate load procedures which are optimized for the type of table. Non history tracking tables will perform better.

type: ODS
entity: activities
layer: STAGING # optional
schema: stg # optional
name: acitivities_CURRENT # optional
track_history: True # Current implementation will create a SQL Server system versioned temporal table for this
columns:
- name: ID
  primary_key: True # An ODS table does require a PK
- name: activity
- name: duration
  data_type: DECIMAL(19,8)  # this is allowed now :)
flow: # flows are optional. It'll default to a full and incremental load with a flow named default
  default:
    upstream:
    - schema: stg
      name: activity      # the ODS type is the first dewi type to create the staging table if it does not already exist in the config

Datavault

HUB - hub table

see HUB properties for all the possible configurable properties

A HUB table is defined by the business key it tracks and some default fields according to the datavault methodology.

The HUB type defaults to the DATAVAULT layer in the dv schema. The entity name is used to construct the table name. The default suffix of a HUB is _HUB

When a upstream table is found in the datamodel there is no need to define all columns or even the data type of the business keys.

The final configuration variable is whether to automatically track deletions from different sources. This is done with a Record Source Tracking Satellite and if track_source_deletions is set to true this table will be automatically added to the datamodel.

For the flows config the HUB type supports configuring a single upstream definition with references to a specific schema and table the custom query functionality is not supported for HUB tables. Furthermore the load_types can be manually enabled and disabled. default only full and incremental flows are enabled in a default flow.

example:

type: HUB
entity: Vessel
business_keys:
- name: IMONumber
track_source_deletions: true
flows:
  tos:
    upstream:
    - schema: tos_stg
  gos:
    upstream:
    - schema: gos_stg

SAT - Satellite table

see SAT properties for all the possible configurable properties

The SAT type defaults to the DATAVAULT layer in the dv schema. The entity name is used to construct the table name. The default suffix of a SAT is _SAT

A SAT table is always related to a single HUB table according to the datavault methodology. This defaults to the HUB table of the configured entity name. But it is possible to override this by configuring the hub key with either an entity or table name reference

hub: 
  schema: db_hub
  table: Vessel_HUB

Besides the relation to the hub a SAT table contains some additional columns. When a upstream table is found in the datamodel there is no need to define the data type of the additional columns.

On a additional column you can configure whether the column should be included in the hashdiff calculation and whether the column is mandatory

addtional_columns:
- name: X
  data_type: NVARCHAR(255)
  mandatory: false
  hash_diff_include: true

Finally a SAT always creates an additional view which is called the Current View. This view contains the current records of all HUB business keys.

For the flows config the SAT type supports configuring a single upstream definition with references to a specific schema and table the custom query functionality is not supported for SAT tables. Furthermore the load_types can be manually enabled and disabled. default only full and incremental flows are enabled in a default flow.

example:

type: SAT
entity: Vessel
additional_columns:
- name: SourceKey
- name: VesselID
- name: VesselName
- name: CallSign
- name: IsActive
- name: VesselLength
- name: VesselWidth
flows:
  tos:
    upstream:
    - schema: tos_stg
      table: Vessel

see LINK properties for all the possible configurable properties

A LINK table connects multiple HUB tables together.

The LINK type defaults to the DATAVAULT layer in the dv schema. The entity name is used to construct the table name. The default suffix of a LINK is _LINK

To create the relations with the hubs you need to configure which HUB tables to include in the link. You can reference them by their name, entity and schema. Additionally you can add whether the linked hub is the driving_key this defaults to False. When you want to link a single hub multiple times you need to also configure an alternative column_name. So when constructing a link for Employees and Managers which both are tracked in the "Person_HUB" then you can give the first a column_name of "Employee_HK" and the second "Manager_HK" which have a Foreign key relation with the "Person_HUB"."Person_HK"

hubs: 
- name: Vessel_HUB
  schema: dv
  entity: Vessel
  driving_key: False
  column_name: Vessel_HK

To make a LINK unique you sometimes need to add additional fields which aren't part of a HUB for this you can configure the key dependent_children which creates addtional columns (like in a staging table)

As with a HUB table a LINK table should also track whether the link still exists in each source that feeds the link. This is configured with the track_source_deletions flag and defaults to true. This creates a Record Tracking Satellite table for the link. When this is true it is not allowed to also add a LINK_SAT table to the link.

Like for the SAT table a Current View view will be automatically added to the datamodel.

For the flows config the LINK type supports configuring a single upstream definition with references to a specific schema and table the custom query functionality is not supported for LINK tables. Furthermore the load_types can be manually enabled and disabled. default only full and incremental flows are enabled in a default flow.

example

type: LINK
entity: VesselVisit
name: VesselVisit_LINK
hubs:
- name: VesselVisit_HUB
- name: Vessel_HUB
- name: VesselService_HUB
- name: VesselOperator_HUB
track_source_deletion: true
flows:
  tos:
    upstream:
    - schema: tos_stg
      table: VesselVisit

see LINK_SAT properties for all the possible configurable properties

A LINK_SAT is very comparable to a "normal" SAT table. So the configuration is mostly the same.

The LINK_SAT type defaults to the DATAVAULT layer in the dv schema. The entity name is used to construct the table name. The default suffix of a LINK_SAT is _LSAT

A LINK_SAT table is always related to a single LINK table according to the datavault methodology. This defaults to the LINK table of the configured entity name. But it is possible to override this by configuring the link key with either an entity or table name reference

link: 
  schema: db_hub
  table: Vessel_HUB

Besides the relation to the hub a LINK_SAT table contains some additional columns. When a upstream table is found in the datamodel there is no need to define the data type of the additional columns.

On a additional column you can configure whether the column should be included in the hashdiff calculation and whether the column is mandatory

addtional_columns:
- name: X
  data_type: NVARCHAR(255)
  mandatory: false
  hash_diff_include: true

Finally a LINK_SAT always creates an additional view which is called the Current View. This view contains the current records of all LINK records.

For the flows config the LINK_SAT type supports configuring a single upstream definition with references to a specific schema and table the custom query functionality is not supported for LINK_SAT tables. Furthermore the load_types can be manually enabled and disabled. default only full and incremental flows are enabled in a default flow.

Datamart

DIM - Dimension table

see DIM properties for all the possible configurable properties

A DIM table is like a STG or a CUST table a pretty defualt table object. the only requirement is that it MUST have a primary_key.

The DIM type defaults to the DATAMART layer in the dm schema. The entity name is used to construct the table name. The default suffix of a DIM is _DIM.

For a DIM table you can specify whether it is a virtual or physical object in the datawarehouse with the storage config variable. This controls whether to generate a VIEW statement or create stored procedures and a create table statement.

The columns need to be defined like in the STG table.

For the flows you need to be aware that when your storage is set to virtual dewi expects a flow with the name virtual. For a DIM table there is currently no support for automatically generated procedures or view statements. So you should always configure the query config variable. This will create a file in which you can specify a "SELECT" statement. This statement is then used as input for the CREATE VIEW or the stored procedures to update the DIM table. Be aware that when you update the query you should build the project again to also update the VIEW or STORED PROC statements.

example:

type: DIM
entity: Vessel
name: Vessel DIM
storage: physical
columns:
- name: Vessel HK
  data_type: CHAR(64) 
  mandatory: true
  primary_key: true
- name: IMO Number
  data_type: NVARCHAR(255) 
  mandatory: true
- name: Vessel ID
  data_type: NVARCHAR(255)
- name: Vessel Name
  data_type: NVARCHAR(255) 
  mandatory: true
- name: Call Sign
  data_type: NVARCHAR(255)
- name: Vessel Length
  data_type: DECIMAL(19,5)
- name: Vessel Width
  data_type: DECIMAL(19,5)
flows:
  default:
    load_types:
      full:
        enabled: True
    query:
      type: sql

FACT - Fact table

see FACT properties for all the possible configurable properties

A FACT table is like a STG or a CUST table a pretty default table object.

The FACT type defaults to the DATAMART layer in the dm schema. The entity name is used to construct the table name. The default suffix of a FACT is _FACT.

A FACT should have some columns which reference some dimension tables. These can be configured with the dimensions config variable. A dimension is referenced by it's name and schema. Addtionally you are able to set a column_name so you're able to create multiple relations to a single dimension table.

For the flows you need to be aware that when your storage is set to virtual dewi expects a flow with the name virtual. For a FACT table there is currently no support for automatically generated procedures or view statements. So you should always configure the query config variable. This will create a file in which you can specify a "SELECT" statement. This statement is then used as input for the CREATE VIEW or the stored procedures to update the FACT table. Be aware that when you update the query you should build the project again to also update the VIEW or STORED PROC statements.

example:

type: FACT
entity: VesselVisit
name: Vessel Visit FACT
storage: physical
dimensions:
- name: Vessel DIM
- name: Vessel Service DIM
- name: Vessel Operator DIM
- name: Quay DIM
- name: Date DIM
  column_name: Arrival Date
- name: Date DIM
  column_name: Departure Date
columns:
- name: Vessel Visit HK
  data_type: CHAR(64)
  mandatory: true
  primary_key: true
- name: Terminal ID
  data_type: NVARCHAR(255)
  mandatory: true
- name: Gross Berthing Hours
  data_type: DECIMAL(19,5) 
- name: Nett Berthing Hours
  data_type: DECIMAL(19,5) 

Custom

CUST - Custom table

see CUST properties for all the possible configurable properties

The CUST table is the "catchall" for functionality that is not yet exposed via one of the other types of tables.

A CUST table behaves like a STG table but with flows defined. You should not rely on the default values of the CUST table so you should fill in the entity, name, schema, layer, storage. Although not all are required it gives dewi more context when using these custom constructs.

For the flows you need to be aware that when your storage is set to virtual dewi expects a flow with the name virtual. For a CUST table there is currently no support for automatically generated procedures or view statements. So you should always configure the query config variable. This will create a file in which you can specify a "SELECT" statement. This statement is then used as input for the CREATE VIEW or the stored procedures to update the CUST table. Be aware that when you update the query you should build the project again to also update the VIEW or STORED PROC statements.

In the example below. I implemented a "custom" staging table which is built by a static custom query to generate some data. Which in turn can be used as an upstream for a HUB or a SAT

example:

type: CUST
entity: Shift
schema: mdm_stg
name: Shift
storage: physical
layer: STAGING
columns:
- name: TerminalID
  data_type: NVARCHAR(255) 
- name: ShiftDate
  data_type: DATE 
- name: ShiftID
  data_type: NVARCHAR(255) 
- name: ShiftStartDTS
  data_type: DATETIME2 
- name: ShiftFinishDTS
  data_type: DATETIME2 
- name: ShiftTypeID
  data_type: NVARCHAR(255)
flows:
  mdm:
    load_types:
      full:
        enabled: true
        query:
          type: sql