The conceptual model of dewi¶
A quick heads-up before we start: Dewi is designed to manage 100% of the code within the data warehouse, which is what enables the more advanced features.
Process steps¶
The first step in using dewi is the configuration step, in which you define which tables you want in your data warehouse, how they relate to each other, and what they should look like.
The next step is to compile the configs into a datamodel. This datamodel contains all the information necessary to generate the SQL (and other) code.
The datamodel is stored in the ./dewi folder as datamodel.yaml.
This file is the input for the generate command, whose output is all the SQL and supporting files needed to build and deploy your project.
Configuration¶
The first thing to understand is that dewi is driven entirely by configuration files. These configuration files are formatted as YAML documents (YAML specification). YAML makes it possible to have two or more documents in the same file. Let's look at some examples:
Imagine we have a file called test.yaml with the following content:
test_key: test value
This translates to a single YAML document with a key test_key whose value is test value.
When you add --- to a YAML file, you start a new document within the same file.
So the following translates to two separate YAML documents:
test_key: test value 1
---
test_key: test value 2
Dewi uses this mechanism to define and configure all tables and stored procedures in a data warehouse, as well as some generic project configuration such as default values.
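For example, one dewi file could hold two table documents separated by ---, each configuring a table as described in the sections below (a minimal sketch; the table names are illustrative and the columns are omitted for brevity):
type: CUST
entity: Customer
layer: DATAMART
schema: dbo
name: CustomerTable
---
type: CUST
entity: Customer
layer: DATAMART
schema: dbo
name: CustomerAddressTable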
Project configuration¶
In the project configuration you can set things such as the character used for escaping identifiers in SQL queries (" or [ and ]) and default table suffixes (for example, that all dimension tables should end with "_DIM").
See project config properties for all configurable properties.
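As a sketch, a project configuration document might look something like this. Note that the property names below are hypothetical, purely to illustrate the idea; the project config properties reference lists the actual keys:
# hypothetical property names, for illustration only
identifier_escape_character: '"' # or the [ and ] pair on SQL Server
dimension_table_suffix: _DIM # e.g. all dimension tables end with _DIM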
Table configuration¶
The dewi datamodel is based on table-level configuration: for each table you would like in your data warehouse, you create a dewi YAML document (with a few exceptions).
In a dewi document you need to specify at least a few basic properties: the type of the table (see table types), the entity to which the table belongs (see entities), the schema of the table, and the layer (see layers).
On top of that, each table type has its own type-specific configuration.
A simple example of a "custom" table:
type: CUST
entity: Customer
layer: DATAMART
schema: dbo
name: CustomerTable
columns:
  - name: CustomerID
    data_type: NVARCHAR(255)
  - name: CustomerName
    data_type: NVARCHAR(255)
Flow configuration¶
Every table or view in a data warehouse gets its data by querying other tables (except for staging/landing tables). To describe the relationships between these tables in a particular dataflow, we use the terms upstream and downstream.
We define the base tables as upstream and all dependent tables as downstream tables.
So for example if you have a flow like this:
stg.Customer -> dv.Customer_HUB ---> dm.Customer_DIM
            \-> dv.Customer_SAT ---/
dm.Customer_DIM is downstream and has both dv.Customer_HUB and dv.Customer_SAT as upstream tables.
dv.Customer_HUB has stg.Customer as its upstream table.
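Expressed in the flows configuration described below, the flow for dm.Customer_DIM could declare both data vault tables as upstream (a minimal sketch; only the upstream part is shown):
flows:
  default:
    upstream:
      - schema: dv
        table: Customer_HUB
      - schema: dv
        table: Customer_SAT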
Terms like parent, child, ancestor, predecessor are reserved for physical relations between tables in the database (Foreign Keys).
To document these flows, every dewi YAML document can contain a flows config entry.
The structure of the flows configuration is the same for every table type, but the defaults differ per table type.
Documenting these dataflows makes it possible to generate stored procedures that update the downstream tables when new data arrives in an upstream table, and to visualize the data lineage of a specific column in a downstream table.
In the flow configuration we configure one or more flows to update the data within the current table.
To start configuring a flow, you give it a name. This name should describe the actual dataflow, especially when you configure multiple flows.
For example, when a table can be updated from both Salesforce and Exact Online, you can define two flows like so:
flows:
  salesforce:
  exactonline:
Within every flow you can configure which load types the flow uses, the upstream configuration for automatically generated procedures, and/or a query configuration if you have a query that cannot be generated automatically:
flows:
  default:
    load_types:
      full:
        enabled: True # full load enabled (no query config, so the procedures are generated automatically)
      incremental:
        enabled: True # incremental load enabled
        query:
          type: SQL # with a custom query (no path specified, so the default is used) which is used in the stored procedures
      partial: # partial load is disabled, so no stored procedures are generated
        enabled: False
      key: # key load is disabled
        enabled: False
    upstream: # the upstream table for generating stored procedures, defined as dbo.Customer
      - schema: dbo
        table: Customer
        column_mapping:
          CustomerID: CustomerNumber # columns of an upstream table can be remapped to columns in the current table
          CustomerName: Name
    query: # query configuration; redundant in this config, but can be used to define the query for every load type
      type: SQL
      path: ./path/to/query
Most of the time the flows are straightforward, so you do not have to define them for every table. Every table type has a default flow, so you only need to configure flows when you're doing something different.
Project directory structure¶
When starting a new project, the only thing you need is a directory (default: ./dewi) where you store your YAML files.
All commands will scan the entire directory recursively to build the datamodel.yaml.
When building the project, dewi creates new directories such as 'DataVault' and 'DataMart', in which it outputs the relevant SQL files. It also creates some default objects, such as a logging table and logging procedures.
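Assuming the defaults described here, a project could end up looking roughly like this (a sketch; the file names are illustrative and the exact location of the generated directories may differ):
project/
├── dewi/              # your YAML configuration documents (scanned recursively)
│   ├── customer.yaml
│   ├── orders.yaml
│   └── datamodel.yaml # written by the compile step
├── DataVault/         # created during the build, holds generated SQL files
└── DataMart/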
You can manually create a CustomSQL directory (or set "custom_sql_script_directory" in your project config); this directory is scanned recursively for SQL files, which are added to the SQLPROJ file.
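For example, to point dewi at a different directory (a sketch; the property name is taken from this section, the path value is illustrative):
custom_sql_script_directory: ./sql/custom # overrides the default CustomSQL directory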
Definitions¶
Layer¶
A data warehouse is usually split into several layers, such as 'staging', 'integration', 'data vault', 'business vault', and 'datamart'. When creating a new table, you define the layer to which the table belongs.
This way you can, for example, have multiple schemas that together form a single layer within the data warehouse.
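For example, two table documents could place different schemas in the same layer (a minimal sketch; the layer, schema, and table names are illustrative, and the columns are omitted for brevity):
type: CUST
entity: Customer
layer: DATAVAULT
schema: rawvault
name: CustomerRaw
---
type: CUST
entity: Customer
layer: DATAVAULT
schema: busvault
name: CustomerBusiness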
Entity¶
An entity is a way to group related tables together. For example, a 'Customer' entity can contain the following data vault tables: 'Customer_HUB', 'Customer_SAT', 'Customer_Telephonenumber_SAT', 'Customer_Order_LINK', etc.
By grouping them, we can, for example, create 'container' stored procedures that update all tables belonging to a given entity.
Schema¶
For SQL Server implementations this will literally be the schema of the table. In other systems it'll be the identifier or URI in the catalog.