Dewi configuration types
Staging
STG - Staging table
See STG properties for all configurable properties.
The STG type configures a staging table.
A staging table is the most upstream table in a dataflow, so it does not support defining dataflows.
The STG type defaults to the STAGING layer in the stg schema.
The entity name is also used as the table name unless defined otherwise.
For a staging table it is important to configure all columns.
Each column needs at least a name. Optionally you can set a data_type (defaults to NVARCHAR(255)), whether the column is mandatory, and whether the column is part of the primary_key.
Example:
type: STG
schema: tos_stg
entity: Vessel
name: Vessel
columns:
  - name: IMONumber
    data_type: NVARCHAR(255)
    mandatory: true
    primary_key: true
  - name: SourceKey
    data_type: NVARCHAR(255)
  - name: VesselID
    data_type: NVARCHAR(255)
  - name: VesselName
    data_type: NVARCHAR(255)
  - name: CallSign
    data_type: NVARCHAR(255)
  - name: IsActive
    data_type: BIT
  - name: VesselLength
    data_type: DECIMAL(19,5)
  - name: VesselWidth
    data_type: DECIMAL(19,5)
ODS Table
See ODS properties for all configurable properties.
The ODS type makes it easier to construct an Operational Data Store layer in the data warehouse. Configuring the ODS type generates a landing table for full, incremental, and delete loads, a persistent staging table, and, when configured, a temporal table with history tracking. For every load type it generates load procedures optimized for the type of table; tables without history tracking will perform better.
type: ODS
entity: activities
layer: STAGING # optional
schema: stg # optional
name: activities_CURRENT # optional
track_history: True # the current implementation creates a SQL Server system-versioned temporal table for this
columns:
  - name: ID
    primary_key: True # an ODS table does require a primary key
  - name: activity
  - name: duration
    data_type: DECIMAL(19,8) # this is allowed now :)
flows: # flows are optional; defaults to a full and incremental load with a flow named default
  default:
    upstream:
      - schema: stg
        name: activity # the ODS type is the first dewi type to create the staging table if it does not already exist in the config
Datavault
HUB - Hub table
See HUB properties for all configurable properties.
A HUB table is defined by the business key it tracks plus some default fields according to the datavault methodology.
The HUB type defaults to the DATAVAULT layer in the dv schema.
The entity name is used to construct the table name. The default suffix of a HUB is _HUB.
When an upstream table is found in the datamodel there is no need to define all columns, or even the data type of the business keys.
The final configuration variable is whether to automatically track deletions from different sources. This is done with a Record Source Tracking Satellite: if track_source_deletions is set to true, this table is automatically added to the datamodel.
For the flows config, the HUB type supports configuring a single upstream definition with references to a specific schema and table; the custom query functionality is not supported for HUB tables. Furthermore, the load_types can be manually enabled and disabled. By default, only the full and incremental load types are enabled in a default flow.
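Enabling and disabling load types within a flow can be sketched as follows. This is a hedged sketch: the flow name tos and schema tos_stg are taken from the HUB example on this page, and the enabled flag follows the shape used in the DIM and CUST examples further down.

```yaml
# Sketch: keeping only the full load type enabled for a HUB flow
flows:
  tos:
    upstream:
      - schema: tos_stg
    load_types:
      full:
        enabled: true
      incremental:
        enabled: false
```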
Example:
type: HUB
entity: Vessel
business_keys:
  - name: IMONumber
track_source_deletions: true
flows:
  tos:
    upstream:
      - schema: tos_stg
  gos:
    upstream:
      - schema: gos_stg
SAT - Satellite table
See SAT properties for all configurable properties.
The SAT type defaults to the DATAVAULT layer in the dv schema.
The entity name is used to construct the table name. The default suffix of a SAT is _SAT.
A SAT table is always related to a single HUB table according to the datavault methodology. This defaults to the HUB table of the configured entity name, but it is possible to override this by configuring the hub key with either an entity or a table name reference:
hub:
  schema: db_hub
  table: Vessel_HUB
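Since the text above allows either an entity or a table name reference, the override could presumably also be written with an entity key. This variant is an assumption based on that wording, not a documented example:

```yaml
hub:
  entity: Vessel  # assumed entity-reference form; would resolve to Vessel_HUB
```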
Besides the relation to the hub, a SAT table contains some additional columns. When an upstream table is found in the datamodel there is no need to define the data type of the additional columns.
On an additional column you can configure whether the column should be included in the hashdiff calculation and whether the column is mandatory:
additional_columns:
  - name: X
    data_type: NVARCHAR(255)
    mandatory: false
    hash_diff_include: true
Finally, a SAT always creates an additional view called the Current View. This view contains the current records for all HUB business keys.
For the flows config, the SAT type supports configuring a single upstream definition with references to a specific schema and table; the custom query functionality is not supported for SAT tables. Furthermore, the load_types can be manually enabled and disabled. By default, only the full and incremental load types are enabled in a default flow.
Example:
type: SAT
entity: Vessel
additional_columns:
  - name: SourceKey
  - name: VesselID
  - name: VesselName
  - name: CallSign
  - name: IsActive
  - name: VesselLength
  - name: VesselWidth
flows:
  tos:
    upstream:
      - schema: tos_stg
        table: Vessel
LINK - Link table
See LINK properties for all configurable properties.
A LINK table connects multiple HUB tables together.
The LINK type defaults to the DATAVAULT layer in the dv schema.
The entity name is used to construct the table name. The default suffix of a LINK is _LINK.
To create the relations with the hubs you need to configure which HUB tables to include in the link.
You can reference them by their name, entity, and schema. Additionally, you can set whether the linked hub is the driving_key; this defaults to False. When you want to link a single hub multiple times you also need to configure an alternative column_name. So when constructing a link for Employees and Managers, both tracked in the "Person_HUB", you can give the first a column_name of "Employee_HK" and the second "Manager_HK"; both have a foreign key relation with "Person_HUB"."Person_HK".
hubs:
  - name: Vessel_HUB
    schema: dv
    entity: Vessel
    driving_key: False
    column_name: Vessel_HK
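The Employees and Managers scenario from the text could then be sketched like this; the hub and column names come from the prose, and the structure follows the hubs key shown above:

```yaml
hubs:
  - name: Person_HUB
    column_name: Employee_HK  # foreign key to "Person_HUB"."Person_HK"
  - name: Person_HUB
    column_name: Manager_HK   # foreign key to "Person_HUB"."Person_HK"
```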
To make a LINK unique you sometimes need to add additional fields which are not part of a HUB. For this you can configure the dependent_children key, which creates additional columns (like in a staging table).
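A minimal sketch of that key, assuming dependent_children entries take the same shape as staging-table columns; the column name here is illustrative:

```yaml
dependent_children:
  - name: LineNumber          # illustrative dependent child column
    data_type: NVARCHAR(255)
```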
As with a HUB table, a LINK table should also track whether the link still exists in each source that feeds the link. This is configured with the track_source_deletions flag and defaults to true.
This creates a Record Tracking Satellite table for the link. When this is true, it is not allowed to also add a LINK_SAT table to the link.
As for the SAT table, a Current View is automatically added to the datamodel.
For the flows config, the LINK type supports configuring a single upstream definition with references to a specific schema and table; the custom query functionality is not supported for LINK tables. Furthermore, the load_types can be manually enabled and disabled. By default, only the full and incremental load types are enabled in a default flow.
Example:
type: LINK
entity: VesselVisit
name: VesselVisit_LINK
hubs:
  - name: VesselVisit_HUB
  - name: Vessel_HUB
  - name: VesselService_HUB
  - name: VesselOperator_HUB
track_source_deletions: true
flows:
  tos:
    upstream:
      - schema: tos_stg
        table: VesselVisit
LINK_SAT - Link satellite table
See LINK_SAT properties for all configurable properties.
A LINK_SAT is very comparable to a "normal" SAT table, so the configuration is mostly the same.
The LINK_SAT type defaults to the DATAVAULT layer in the dv schema.
The entity name is used to construct the table name. The default suffix of a LINK_SAT is _LSAT.
A LINK_SAT table is always related to a single LINK table according to the datavault methodology. This defaults to the LINK table of the configured entity name, but it is possible to override this by configuring the link key with either an entity or a table name reference:
link:
  schema: db_hub
  table: Vessel_LINK
Besides the relation to the link, a LINK_SAT table contains some additional columns. When an upstream table is found in the datamodel there is no need to define the data type of the additional columns.
On an additional column you can configure whether the column should be included in the hashdiff calculation and whether the column is mandatory:
additional_columns:
  - name: X
    data_type: NVARCHAR(255)
    mandatory: false
    hash_diff_include: true
Finally, a LINK_SAT always creates an additional view called the Current View. This view contains the current records for all LINK records.
For the flows config, the LINK_SAT type supports configuring a single upstream definition with references to a specific schema and table; the custom query functionality is not supported for LINK_SAT tables. Furthermore, the load_types can be manually enabled and disabled. By default, only the full and incremental load types are enabled in a default flow.
Datamart
DIM - Dimension table
See DIM properties for all configurable properties.
A DIM table is, like a STG or a CUST table, a pretty default table object. The only requirement is that it MUST have a primary_key.
The DIM type defaults to the DATAMART layer in the dm schema.
The entity name is used to construct the table name. The default suffix of a DIM is _DIM.
For a DIM table you can specify whether it is a virtual or a physical object in the datawarehouse with the storage config variable. This controls whether dewi generates a VIEW statement or a CREATE TABLE statement with stored procedures.
The columns need to be defined like in the STG table.
For the flows you need to be aware that when your storage is set to virtual, dewi expects a flow with the name virtual. For a DIM table there is currently no support for automatically generated procedures or view statements, so you should always configure the query config variable. This creates a file in which you can specify a SELECT statement, which is then used as input for the CREATE VIEW statement or the stored procedures that update the DIM table. Be aware that when you update the query you should build the project again to also update the VIEW or stored procedure statements.
Example:
type: DIM
entity: Vessel
name: Vessel DIM
storage: physical
columns:
  - name: Vessel HK
    data_type: CHAR(64)
    mandatory: true
    primary_key: true
  - name: IMO Number
    data_type: NVARCHAR(255)
    mandatory: true
  - name: Vessel ID
    data_type: NVARCHAR(255)
  - name: Vessel Name
    data_type: NVARCHAR(255)
    mandatory: true
  - name: Call Sign
    data_type: NVARCHAR(255)
  - name: Vessel Length
    data_type: DECIMAL(19,5)
  - name: Vessel Width
    data_type: DECIMAL(19,5)
flows:
  default:
    load_types:
      full:
        enabled: True
    query:
      type: sql
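For comparison, a virtual dimension would use a flow named virtual, as described above. This is a sketch under that assumption; the column is trimmed from the physical example, and the generated query file holds the SELECT statement:

```yaml
type: DIM
entity: Vessel
storage: virtual
columns:
  - name: Vessel HK
    data_type: CHAR(64)
    primary_key: true   # a DIM must have a primary_key
flows:
  virtual:
    query:
      type: sql         # the generated file holds the SELECT used for CREATE VIEW
```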
FACT - Fact table
See FACT properties for all configurable properties.
A FACT table is, like a STG or a CUST table, a pretty default table object.
The FACT type defaults to the DATAMART layer in the dm schema.
The entity name is used to construct the table name. The default suffix of a FACT is _FACT.
A FACT should have some columns which reference dimension tables. These can be configured with the dimensions config variable. A dimension is referenced by its name and schema. Additionally, you can set a column_name so you are able to create multiple relations to a single dimension table.
For the flows you need to be aware that when your storage is set to virtual, dewi expects a flow with the name virtual. For a FACT table there is currently no support for automatically generated procedures or view statements, so you should always configure the query config variable. This creates a file in which you can specify a SELECT statement, which is then used as input for the CREATE VIEW statement or the stored procedures that update the FACT table. Be aware that when you update the query you should build the project again to also update the VIEW or stored procedure statements.
Example:
type: FACT
entity: VesselVisit
name: Vessel Visit FACT
storage: physical
dimensions:
  - name: Vessel DIM
  - name: Vessel Service DIM
  - name: Vessel Operator DIM
  - name: Quay DIM
  - name: Date DIM
    column_name: Arrival Date
  - name: Date DIM
    column_name: Departure Date
columns:
  - name: Vessel Visit HK
    data_type: CHAR(64)
    mandatory: true
    primary_key: true
  - name: Terminal ID
    data_type: NVARCHAR(255)
    mandatory: true
  - name: Gross Berthing Hours
    data_type: DECIMAL(19,5)
  - name: Nett Berthing Hours
    data_type: DECIMAL(19,5)
Custom
CUST - Custom table
See CUST properties for all configurable properties.
The CUST table is the "catchall" for functionality that is not yet exposed via one of the other table types.
A CUST table behaves like a STG table, but with flows defined.
You should not rely on the default values of the CUST table, so fill in the entity, name, schema, layer, and storage. Although not all are required, this gives dewi more context when using these custom constructs.
For the flows you need to be aware that when your storage is set to virtual, dewi expects a flow with the name virtual. For a CUST table there is currently no support for automatically generated procedures or view statements, so you should always configure the query config variable. This creates a file in which you can specify a SELECT statement, which is then used as input for the CREATE VIEW statement or the stored procedures that update the CUST table. Be aware that when you update the query you should build the project again to also update the VIEW or stored procedure statements.
The example below implements a "custom" staging table that is built by a static custom query to generate some data, which in turn can be used as an upstream for a HUB or a SAT.
Example:
type: CUST
entity: Shift
schema: mdm_stg
name: Shift
storage: physical
layer: STAGING
columns:
  - name: TerminalID
    data_type: NVARCHAR(255)
  - name: ShiftDate
    data_type: DATE
  - name: ShiftID
    data_type: NVARCHAR(255)
  - name: ShiftStartDTS
    data_type: DATETIME2
  - name: ShiftFinishDTS
    data_type: DATETIME2
  - name: ShiftTypeID
    data_type: NVARCHAR(255)
flows:
  mdm:
    load_types:
      full:
        enabled: true
    query:
      type: sql