Defining Data Model Quality Metrics for Data Vault 2.0 Model Evaluation
3.1. Defining Metrics for Data Vault 2.0 Data Model Quality Evaluation
CDTSHS: compares the generated data model to the source data model. Every table in the source dataset should end up as either a Hub table or a Satellite table. If the number of Hub tables is smaller than the number of tables in the original dataset, the difference should appear as additional Satellite tables rather than Hubs. Examples of such cases are a Dependent Child or a business concept that already has a Hub table in the Data Vault model.
RoT1: the ratio of the number of Satellite tables to the number of Hub tables. The number of Satellite tables must be equal to or greater than the number of Hub tables, since every Hub must have at least one Satellite but may have several.
RoT2: the ratio of the number of Hub tables to the number of Link tables. The number of Hub tables must be greater than the number of Link tables.
RPK: the ratio of the number of tables to the number of PKs. Every table (Hub, Link, Satellite) must have a primary key.
MaxD: the maximum depth of the model. The Data Vault 2.0 data model can be wide, but it should never be deeper than three levels; a third-level citizen is the maximum depth.
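The table-level metrics above can be evaluated programmatically once the table counts of the generated model are known. The following is a minimal sketch of such a check; the class and function names are hypothetical illustrations, not part of the methodology:

```python
from dataclasses import dataclass


@dataclass
class ModelCounts:
    """Table-level counts extracted from a generated Data Vault model."""
    hubs: int
    links: int
    satellites: int
    tables_with_pk: int  # tables that define a primary key


def table_level_checks(m: ModelCounts) -> dict:
    """Evaluate the table-level metrics RoT1, RoT2, and RPK."""
    total = m.hubs + m.links + m.satellites
    return {
        # RoT1: Satellites / Hubs must be >= 1 (every Hub has a Satellite)
        "RoT1_ok": m.hubs > 0 and m.satellites >= m.hubs,
        # RoT2: Hubs must outnumber Links
        "RoT2_ok": m.hubs > m.links,
        # RPK: every table must have a primary key
        "RPK_ok": total == m.tables_with_pk,
    }


# Example: 5 Hubs, 2 Links, 7 Satellites, all 14 tables have a PK
print(table_level_checks(ModelCounts(5, 2, 7, 14)))
# → {'RoT1_ok': True, 'RoT2_ok': True, 'RPK_ok': True}
```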
To verify that there are no columns missing, we used table and column metrics as follows:
RPKH: the ratio of the number of PK columns on Hub tables. Each Hub table should have exactly one PK.
RPKL: the ratio of the number of PK columns on Link tables. Each Link table should have exactly one PK.
RPKS: the ratio of the number of PK columns on Satellite tables. Each Satellite table should have at least two columns for the PK, since Satellites hold the history data of a Hub or a Link. Most of the time the value is two, but for multi-active Satellites it is three; therefore, we define the value as two or greater.
NoFKH: the number of FKs in Hub tables. Hub tables should not have any FKs.
RFKS: the ratio of FKs in Satellite tables. Each Satellite table should have exactly one FK.
RFKL: the ratio of FKs in Link tables. Each Link table should have at least two FKs.
RAH: the ratio of columns in Hub tables. Typically, a Hub table has four columns (PK, business key, Loaddate, and Recordsource), but if the business key spans more than one column, the Hub table has more than four columns.
RAL: the ratio of columns in Link tables. A Link table should include three columns (PK, Recordsource, Loaddate) and the FK columns (minimum two).
RAS: the ratio of columns in Satellite tables. A Satellite table should include two columns for the PK (the parent Hub PK + Loaddate), Recordsource, the Hashdiff column (optional), and all the actual data columns from the source dataset. A Satellite table would be of little use without at least one actual data column.
RMPKA: the ratio of mandatory PK columns. PK columns should be defined as mandatory in all tables.
RMFKA: the ratio of mandatory FK columns. FK columns should be defined as mandatory in all tables.
RMAH: the ratio of mandatory columns in Hubs. Hub tables should only have mandatory columns.
RMAL: the ratio of mandatory columns in Link tables. Link tables should only have mandatory columns.
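The column-level rules above can likewise be checked automatically given a simple representation of the generated tables. The sketch below is a hedged illustration under the assumption that each table is available as a list of column records with PK/FK/mandatory flags; all names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    is_pk: bool = False
    is_fk: bool = False
    mandatory: bool = True


@dataclass
class Table:
    kind: str            # "hub", "link", or "satellite"
    columns: list


def column_level_violations(tables):
    """Return (table kind, metric) pairs for violated column-level rules."""
    problems = []
    for t in tables:
        pks = sum(c.is_pk for c in t.columns)
        fks = sum(c.is_fk for c in t.columns)
        if t.kind == "hub" and (pks != 1 or fks != 0):
            problems.append((t.kind, "RPKH/NoFKH"))   # exactly 1 PK, no FKs
        if t.kind == "link" and (pks != 1 or fks < 2):
            problems.append((t.kind, "RPKL/RFKL"))    # 1 PK, at least 2 FKs
        if t.kind == "satellite" and (pks < 2 or fks != 1):
            problems.append((t.kind, "RPKS/RFKS"))    # 2+ PK cols, exactly 1 FK
        # RMAH/RMAL: Hub and Link columns must all be mandatory
        if t.kind in ("hub", "link") and any(not c.mandatory for c in t.columns):
            problems.append((t.kind, "RMAH/RMAL"))
    return problems


hub = Table("hub", [Column("hk", is_pk=True), Column("bk"),
                    Column("loaddate"), Column("recordsource")])
print(column_level_violations([hub]))  # → []
```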
Satellite tables have optional columns if those columns are optional in the source database. Therefore, we cannot check the ratio of mandatory columns in Satellites unless we compare them to those in the source tables. This could be carried out, but the assumption is that ChatGPT follows the column definitions of the original DDLs. Based on the experiments so far, this assumption holds.
Manually defined metrics:
TA, Data Vault 2.0 technical columns are correct:
In Hubs (loaddate, recordsource), TAB;
In Satellites (loaddate, recordsource, hashdiff), TAS;
In Links (loaddate, recordsource), TAL.
DTTA, Data types of Data Vault 2.0 technical columns are correct:
In Hubs, DTTAB;
In Satellites, DTTAS;
In Links, DTTAL.
The last two metrics are evaluated manually. There are two reasons for this: evaluating them reliably would require enforcing naming conventions, and ChatGPT is not able to follow such conventions without adding RAG to the process. The reply from ChatGPT is in line with this decision: “Please note that this is a simplified representation, and in a real-world scenario, you might need to consider additional aspects such as data types, constraints, indexes, and any other specific requirements of your target database platform”.
We could have metrics for indexes (at least PK and FK), but since this should be a generic set of metrics and different relational database management systems (RDBMSs) support and require different kinds of indexing, we abstain from using indexes as an element of the set of metrics.
We can define different weights for each metric to measure the amount of work needed for them, but this might make the model more complicated. This can be carried out as future research if needed, as well as using RAG to verify business keys, or to give instructions on the content of a PK column and the Hashdiff column, for example.
3.2. Empirically Evaluating the Metrics
We started our experiments in creating a new set of DDLs for the Data Vault 2.0 data model with the prompt:
“The DDLs for the database are following:
<DDLs of the source database>
Please generate the DDLs for the target database following the Data Vault 2.0 methodology”.
Then, we prompt-engineered ChatGPT to address the flaws we noticed by adding the following to the end of the prompt:
“Also remember that all Primary keys should be hash columns of type binary. Satellites should have a hashdiff column”.
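The requirement of binary hash keys and a Hashdiff column can be illustrated with a short sketch. The hashing scheme below (MD5 over concatenated, normalized business-key values) is only one common Data Vault convention, offered as an assumption; neither the prompt nor ChatGPT's output prescribes this exact scheme:

```python
import hashlib


def hash_key(*business_keys: str) -> bytes:
    """Binary hash key (PK) over one or more business-key values."""
    # Normalize before hashing so "cust-001 " and "CUST-001" agree
    payload = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(payload.encode("utf-8")).digest()  # 16-byte binary value


def hash_diff(descriptive_values: dict) -> bytes:
    """Hashdiff over a Satellite's descriptive columns, for change detection."""
    payload = "||".join(f"{k}={v}" for k, v in sorted(descriptive_values.items()))
    return hashlib.md5(payload.encode("utf-8")).digest()


pk = hash_key("CUST-001")
print(len(pk))  # → 16, i.e., fits a BINARY(16) column
```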
Model5 requires much more manual work than Model4. However, the inclusion of the two additional metrics in the 20-point system falsely suggests that the models are equivalent. Based on this experiment, these measures should not be used, or their weights should be reconsidered. Since we have acknowledged how simple it is to programmatically add missing technical columns or correct their data types, there is a compelling rationale to exclude metrics 19 and 20 from the set of metrics.
3.3. Manual Human Expert Review of the Models
A quick review by a human expert was performed for all models. The maximum score for a model was 20, as determined by the anticipated effort required to rectify the model. A score of 20 points indicates that the model is suitable for use without modifications, whereas any score below 20 indicates imperfections in the model. The human reviewer uses their own expertise and experience to assign the score.