Defining Data Model Quality Metrics for Data Vault 2.0 Model Evaluation

By inergency On Feb 9, 2024

3.1. Defining Metrics for Data Vault 2.0 Data Model Quality Evaluation

In our previous research [4], it was noticed that the models generated by ChatGPT can be poor in quality, resulting in more work for the database designer than if the model was manually created. Instead of simply designing the data model, the database designer now needs to check the generated model to find errors and then to fix them. The primary objective of this research is to identify a comprehensive set of metrics for assessing the quality of the Data Vault 2.0 data model. For instance, the intention is to assist the database designer in determining the utility of a Data Vault 2.0 model generated by ChatGPT relative to a manually created counterpart.

The issues causing most of the refactoring work in a data model are typically missing tables, primary keys (PK), or foreign keys (FK). Also, missing columns are a problem, because, even though adding them might be easy, spotting that they are missing can be time consuming. Data Vault 2.0 includes a set of technical columns that each Hub, Link and Satellite should have. If those columns are missing, or the datatype is wrong, it is not a time-consuming task to fix them, since it is a straightforward task that can be efficiently programmed using Oracle SQL Developer Data Modeler [29].

Based on this understanding and the Data Vault 2.0 methodology [5,6,7,8], we defined 20 metrics, for evaluating the data model quality of a Data Vault 2.0 model. To calculate these Data Vault 2.0 data model quality metrics, we require the measures described in Table 3.

The 20 metrics, shown in Table 4, belong to three categories: schema metrics, table/column metrics, and manually evaluated metrics. Schema metrics are used to verify that there are no missing tables, PKs, or FKs, table/column metrics are used to identify missing columns. The schema metrics are the following:

CDTSHS: compares the data model generated to the source data model. All the data in the source dataset should end up either as a Hub table or a Satellite table. If the number of Hub tables is less than the number of tables in the original dataset, there should be more Satellite tables than Hub tables. Examples of these are a Dependent Child or a business concept that already has a Hub table in the Data Vault model.
RoT1: the ratio of the number of Satellite tables and Hub tables. The number of Satellite tables must be equal or greater than the number of Hub tables, since every Hub must have at least one Satellite, but it can have several Satellites.
RoT2: the ratio of the number of Hub tables and Link tables. The number of Hub tables must be greater than the number of Link tables.
RPK: the ratio of the number of tables and PKs. Every table (Hub, Link, Satellite) must have a primary key.
MaxD: the maximum depth of the model. The Data Vault 2.0 data model can be wide, but it should never be deeper than three levels; the third level citizen is the max depth.

To verify that there are no columns missing, we used table and column metrics as follows:

6.: RPKH: the ratio of the number of PK columns on Hub tables. Each Hub table should have exactly one PK.
7.: RPKL: the ratio of the number of PK columns on Link tables. Each Link table should have exactly one PK.
8.: RPKS: the ratio of the number of PK columns on Satellite tables. Each Satellite table should have at least two columns for the PK, since Satellites hold the history data of a Hub or a Link. Most of the time the value is two but for multi-active satellites it is three; therefore, we define the value as being two or greater.
9.: NoFKH: the number of FKs in Hub tables. Hub tables should not have any FKs.
10.: RFKS: the ratio of FKs in Satellite tables. Each Satellite table should have exactly one FK.
11.: RFKL: the ratio of FKs in Link tables. Each Link table should have at least two FKs.
12.: RAH: the ratio of columns in Hub tables. Typically, the number of columns in a Hub table is four (PK + business key, Loaddate, Recordsource), but if the business key includes more than one column, then the number of columns in a Hub table is more than four.
13.: RAL: the ratio of columns in Link tables. A Link table should include three columns (PK, Recordsource, Loaddate) and the FK columns (minimum two).
14.: RAS: the ratio of columns in Satellite tables. A Satellite table should include two columns for the PK (the parent Hub PK + Loaddate), Recordsource, the Hashdiff column (optional), and all the actual data columns from the source dataset. The Satellite table would be useless if it did not include at least one actual data column.
15.: RMPKA: the ratio of mandatory PK columns. PK columns should be defined as mandatory in all tables.
16.: RMFKA: the ratio of mandatory FK columns. FK columns should be defined as mandatory in all tables.
17.: RMAH: the ratio of mandatory columns in Hubs. Hub tables should only have mandatory columns.
18.: RMAL: the ratio of mandatory columns in Link tables. Link tables should only have mandatory columns.

Satellite tables have optional columns if those columns are optional in the source database. Therefore, we cannot check the ratio of mandatory columns in Satellites, unless we compare them to those in the source tables. This can be carried out but the assumption is that ChatGPT follows the column definitions of the original DDLs. Based on the experiments so far, the following assumption holds.

Manually defined metrics:

19.

TA, Data Vault 2.0 technical columns are correct:

a.: In Hubs (loaddate, recordsource), TAB;
b.: In Satellites (loaddate, recordsource, hashdiff), TAS;
c.: In Links (loaddate, recordsource), TAL.

20.

DTTA, Data types of Data Vault 2.0 technical columns are correct:

a.: In Hubs, DTTAB;
b.: In Satellites, DTTAS;
c.: In Links, DTTAL.

The last two defined metrics are to be carried out manually. There are two reasons for this: to be able to do this in a reliable manner, we would need to follow this with naming conventions, and ChatGPT is not able to follow those without adding RAG to the process. The reply from ChatGPT is in line with this decision: “Please note that this is a simplified representation, and in a real-world scenario, you might need to consider additional aspects such as data types, constraints, indexes, and any other specific requirements of your target database platform”.

We could have metrics for indexes (at least PK and FK), but since these should be a generic set of metrics and different relational database management systems (RDBMs) support and require different kind of indexing, we abstain from using indexes as an element of the set of metrics.

The criteria (equation) of these 20 metrics are shown in Table 4. Each metric from 1 to 18 is assigned a score of 1 if the specified criteria are satisfied and 0 points if they are not. Metrics 19 and 20 are established by a human reviewer and validated on a scale of 0, 0.25, 0.5, 0.75, or 1, depending on the degree to which they adhere to the Data Vault 2.0 methodology requirements. The maximum number of points for a model is 20. The metrics we have defined cover the most critical mistakes in the data model with several metrics giving them a higher weight. For example, missing tables are identified by metric 1 and, depending on the table type, by metric 2 or 3. Alternatively, missing PKs are identified using metrics 4 and 6, 7, or 8, depending on the table type.

We could define different weights for each metric to measure the amount of work needed for them, but this would make the model more complicated. This can be carried out as future research if needed, as well as using RAG to verify business keys, or to give instructions on the content of a PK column and the Hashdiff column, for example.

3.2. Empirically Evaluating the Metrics

The next step was to empirically test the metrics. We used the manually created example Data Vault 2.0 data model and two generated models from our previous research [4]. The generated models were from May 2023 and September 2023. A human expert reviewed the models and chose the May version with prompt engineering and the September version without it. Then, we generated a model in January 2024 with the original prompting [4] and another model using prompt engineering: additional instructions were incorporated into the prompt to address the errors identified in the initially generated DDL.

We imported the generated DDLs to Oracle SQL Developer Data Modeler one by one and investigated the results. We obtained the measures outlined in Table 3. The measures can be either visualized on the user interface of Oracle SQL Developer Data Modeler, or programmatically obtained from the data model using JavaScript. We used the visual approach. After collecting the necessary measures, we calculated the metrics, as explained in Table 4. Finally, we used the metrics to evaluate the models. We also conducted a review by a human expert to compare the evaluations by our metrics and the human reviewer.

Figure 1 illustrates the source database used [4]. The source database consists of four tables: Customers, Orders, Orderlines, and Products. This data model was chosen because it is simple, but includes the needed structures to test the main techniques used in Data Vault 2.0 modeling. It has Hubs, Links, and Satellites and it also includes a concept called Dependent Child.

Figure 2 displays an exemplar data model for Data Vault 2.0, formulated using the source data shown in Figure 1. This data model follows the best practices of Data Vault 2.0 [4]. The data model consists of nine tables: three Hub tables, two Link tables, and four Satellite tables.

We started our experiments of creating a new set of DDLs for the Data Vault 2.0 data model by prompting:

“The DDLs for the database are following:

Please generate the DDLs for the target database following the Data Vault 2.0 methodology”.

The data model generated by this prompt is shown in Figure 3. This data model consists of nine tables: three Hub tables, two Link tables, and four Satellite tables. The biggest problem with the model is that the PK of Satellite tables is wrongly defined. Also, the data type of PKs is wrong.

Then, we prompted engineered ChatGPT to address the flaws noticed by adding in the end of the prompt:

“Also remember that all Primary keys should be hash columns of type binary. Satellites should have a hashdiff column”.

The model generated from this prompt is shown in Figure 4. This data model consists of only eight tables: three Hub tables, one Link table, and four Satellite tables. Also, many FKs are missing.

For the evaluation, we used five models: the example model created manually [4], Version May 2023 with prompt engineering [4], Version September 2023 [4], Version January 2024, and Version January 2024 with prompt engineering. In the following evaluation, we refer to these models as shown in Table 5.

As shown in Table 6, we calculated all the necessary measures to obtain the metrics.

Then, we used these measures to calculate the metrics. The metrics for each model are shown in Table 7, where metrics with values of zero are shown in red. The red color indicates that the metric was not met.

Table 8 illustrates the models along with their respective scores, presented in both the 18-point system and 20-point system. The higher the number of points, the better the model quality.

Model5 requires much more manual work than Model4. However, the inclusion of two additional metrics in the 20-point system falsely demonstrates their equivalence. Based on this experiment, these measures should not be used, or their weights should be reconsidered. As we have acknowledged the simplicity of programmatically adding missing technical columns or correcting their data types, there is a compelling rationale to exclude metrics 19 and 20 from the set of metrics.

3.3. Manual Human Expert Review of the Models

A quick review by a human expert was performed for all models. The maximum score for a model was 20, as determined by the anticipated effort required to rectify the model. A model receiving a score of 20 points suggests that the model is suitable for use without modifications, whereas any score below 20 indicates imperfections in the model. The human reviewer uses their own expertise and experience to determine metrics.

Model1 is a data model that follows the methodology and could be used without modifications, earning a score of 20. In Model2, the Dependent Child concept was not implemented correctly; even the data attributes were lost in the model. Fixing this model would require a lot of work. The score for Model2 is 14. Model3 has similar issues as Model2, but it also has an extra FK, which causes confusion. Model3 is worse than Model2, and the score for Model3 is 13. In Model4, the PKs in the Satellite tables were incorrect. These issues can be easily addressed using Oracle SQL Developer Data Modeler; however, the errors in primary keys remain notably significant. The PKs were of the wrong data type in all tables. Fixing the PK would be easy, but that would cause changes to FKs too. With a tool, fixing would not take too long. The score for Model4 is 18. Model5 has wrongly defined PKs in Satellites, and a Link table is missing. Model5 achieves a score of 16. If we use the 18-point system, which does not include metrics for the technical columns, Model1 would achieve a score of 18, Model2 12, Model3 11, Model4 17, and Model5 15. The human reviewer would say that only Model1 and Model4 are useful models. All human expert reviews are shown in Table 9.