Improving Classification Performance in Credit Card Fraud Detection by Using New Data Augmentation

Figure 1.
The architecture of K-CGAN: (a) K-CGAN Discriminator Architecture; (b) K-CGAN Generator Architecture with Novelty Loss.
Figure 2.
ROC curve of K-CGAN.
Figure 3.
ROC curve of Original Dataset.
Figure 4.
ROC curve using SMOTE.
Figure 5.
ROC curve using ADASYN.
Figure 6.
ROC curve using B-SMOTE.
Figure 7.
ROC curve using Vanilla GAN.
Figure 8.
ROC curve using WS GAN.
Figure 9.
ROC curve using SDG GAN.
Figure 10.
ROC curve using NS GAN.
Figure 11.
ROC curve using LS GAN.
Figure 12.
ROC curve using K-CGAN.
Figure 13.
Correlation comparison of Original Data, SMOTE, ADASYN, B-SMOTE, Novelty K-CGAN, Vanilla GAN, WS GAN, SDG GAN, NS GAN, and LS GAN methods.
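Figure 13 compares correlation structure across the original and augmented datasets. One simple way to make such a comparison numeric is the Frobenius distance between Pearson correlation matrices; the sketch below uses synthetic stand-in data rather than the Kaggle features:

```python
import numpy as np

def correlation_gap(real, synth):
    """Frobenius-norm distance between the Pearson correlation matrices
    of a real and a synthetic sample -- one way to score how well an
    augmentation method preserves pairwise feature structure."""
    c_real = np.corrcoef(real, rowvar=False)
    c_synth = np.corrcoef(synth, rowvar=False)
    return np.linalg.norm(c_real - c_synth)

rng = np.random.default_rng(42)
real = rng.normal(size=(500, 5))
gap_same = correlation_gap(real, real)                    # identical data -> 0
gap_noise = correlation_gap(real, rng.normal(size=(500, 5)))
```

A smaller gap indicates the synthetic sample reproduces the original pairwise correlations more faithfully.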
Figure 14.
Univariate V1 Feature Distribution comparison of Original Data, SMOTE, ADASYN, B-SMOTE, Novelty K-CGAN, Vanilla GAN, WS GAN, SDG GAN, NS GAN, and LS GAN methods.
Figure 15.
Univariate V5 Feature Distribution comparison of Original Data, SMOTE, ADASYN, B-SMOTE, Novelty K-CGAN, Vanilla GAN, WS GAN, SDG GAN, NS GAN, and LS GAN methods.
Figure 16.
Univariate V15 Feature Distribution comparison of Original Data, SMOTE, ADASYN, B-SMOTE, Novelty K-CGAN, Vanilla GAN, WS GAN, SDG GAN, NS GAN, and LS GAN methods.
Figure 17.
Bivariate V1 vs. V3 Feature Distribution comparison of Original Data, SMOTE, ADASYN, B-SMOTE, Novelty K-CGAN, Vanilla GAN, WS GAN, SDG GAN, NS GAN, and LS GAN methods.
Figure 18.
Bivariate V1 vs. V4 Feature Distribution comparison of Original Data, SMOTE, ADASYN, B-SMOTE, Novelty K-CGAN, Vanilla GAN, WS GAN, SDG GAN, NS GAN, and LS GAN methods.
Figure 19.
Bivariate V1 vs. V5 Feature Distribution comparison of Original Data, SMOTE, ADASYN, B-SMOTE, Novelty K-CGAN, Vanilla GAN, WS GAN, SDG GAN, NS GAN, and LS GAN methods.
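Figures 2 through 12 are ROC curves: true-positive rate plotted against false-positive rate as the decision threshold sweeps over classifier scores. A minimal sketch of how such a curve and its trapezoidal AUC are computed:

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC points (FPR, TPR) swept over score thresholds, plus the
    trapezoidal AUC. `labels` are 1 for fraud, 0 for legitimate."""
    order = np.argsort(-np.asarray(scores))   # descending by score
    y = np.asarray(labels)[order]
    tps = np.cumsum(y)                        # true positives at each cut
    fps = np.cumsum(1 - y)                    # false positives at each cut
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)
    return fpr, tpr, auc

# Scores that perfectly separate the classes give AUC = 1.0.
fpr, tpr, auc = roc_auc([0.9, 0.8, 0.7, 0.4, 0.3, 0.1],
                        [1, 1, 1, 0, 0, 0])
```

This sketch ignores tied scores, which production implementations handle by grouping equal-score examples into a single threshold step.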
Table 1.
Credit card dataset (sourced from Kaggle.com).
Dataset | No. of Attributes | No. of Instances | No. of Fraud Instances | No. of Legal Instances |
---|---|---|---|---|
Kaggle | 30 | 284,807 | 492 | 284,315 |
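The severity of the imbalance in Table 1 is worth stating explicitly, since it is what motivates the augmentation methods compared here:

```python
# Class counts from Table 1 (Kaggle credit card dataset).
fraud = 492
legal = 284_315
total = fraud + legal

# Fraction of fraudulent transactions -- the minority class that
# SMOTE, ADASYN, B-SMOTE, and the GAN variants oversample.
fraud_rate = fraud / total
imbalance_ratio = legal / fraud   # legitimate transactions per fraud

print(f"total={total}, fraud_rate={fraud_rate:.5f}, ratio={imbalance_ratio:.0f}:1")
```

Roughly 0.17% of transactions are fraudulent, i.e. about 578 legitimate transactions per fraud, which is why raw accuracy in Table 11 is near 1.0 for every method and precision/recall (Tables 8 and 9) are the more informative metrics.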
Table 2.
Credit card dataset features (first sample rows).
Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 |
---|---|---|---|---|---|---|---|---|
0 | −1.35981 | −0.07278 | 2.536347 | 1.378155 | −0.33832 | 0.462388 | 0.239599 | 0.098698 |
0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | −0.08236 | −0.0788 | 0.085102 |
1 | −1.35835 | −1.34016 | 1.773209 | 0.37978 | −0.5032 | 1.800499 | 0.791461 | 0.247676 |
1 | −0.96627 | −0.18523 | 1.792993 | −0.86329 | −0.01031 | 1.247203 | 0.237609 | 0.377436 |
2 | −1.15823 | 0.877737 | 1.548718 | 0.403034 | −0.40719 | 0.095921 | 0.592941 | −0.27053 |
2 | −0.42597 | 0.960523 | 1.141109 | −0.16825 | 0.420987 | −0.02973 | 0.476201 | 0.260314 |
4 | 1.229658 | 0.141004 | 0.045371 | 1.202613 | 0.191881 | 0.272708 | −0.00516 | 0.081213 |
7 | −0.64427 | 1.417964 | 1.07438 | −0.4922 | 0.948934 | 0.428118 | 1.120631 | −3.80786 |
7 | −0.89429 | 0.286157 | −0.11319 | −0.27153 | 2.669599 | 3.721818 | 0.370145 | 0.851084 |
V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 |
---|---|---|---|---|---|---|---|---|
0.363787 | 0.090794 | −0.5516 | −0.6178 | −0.99139 | −0.31117 | 1.468177 | −0.4704 | 0.207971 |
−0.25543 | −0.16697 | 1.612727 | 1.065235 | 0.489095 | −0.14377 | 0.635558 | 0.463917 | −0.1148 |
−1.51465 | 0.207643 | 0.624501 | 0.066084 | 0.717293 | −0.16595 | 2.345865 | −2.89008 | 1.109969 |
−1.38702 | −0.05495 | −0.22649 | 0.178228 | 0.507757 | −0.28792 | −0.63142 | −1.05965 | −0.68409 |
0.817739 | 0.753074 | −0.82284 | 0.538196 | 1.345852 | −1.11967 | 0.175121 | −0.45145 | −0.23703 |
−0.56867 | −0.37141 | 1.341262 | 0.359894 | −0.35809 | −0.13713 | 0.517617 | 0.401726 | −0.05813 |
0.46496 | −0.09925 | −1.41691 | −0.15383 | −0.75106 | 0.167372 | 0.050144 | −0.44359 | 0.002821 |
0.615375 | 1.249376 | −0.61947 | 0.291474 | 1.757964 | −1.32387 | 0.686133 | −0.07613 | −1.22213 |
−0.39205 | −0.41043 | −0.70512 | −0.11045 | −0.28625 | 0.074355 | −0.32878 | −0.21008 | −0.49977 |
V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 |
---|---|---|---|---|---|---|---|---|
0.025791 | 0.403993 | 0.251412 | −0.01831 | 0.277838 | −0.11047 | 0.066928 | 0.128539 | −0.18911 |
−0.18336 | −0.14578 | −0.06908 | −0.22578 | −0.63867 | 0.101288 | −0.33985 | 0.16717 | 0.125895 |
−0.12136 | −2.26186 | 0.52498 | 0.247998 | 0.771679 | 0.909412 | −0.68928 | −0.32764 | −0.1391 |
1.965775 | −1.23262 | −0.20804 | −0.1083 | 0.005274 | −0.19032 | −1.17558 | 0.647376 | −0.22193 |
−0.03819 | 0.803487 | 0.408542 | −0.00943 | 0.798278 | −0.13746 | 0.141267 | −0.20601 | 0.502292 |
0.068653 | −0.03319 | 0.084968 | −0.20825 | −0.55982 | −0.0264 | −0.37143 | −0.23279 | 0.105915 |
−0.61199 | −0.04558 | −0.21963 | −0.16772 | −0.27071 | −0.1541 | −0.78006 | 0.750137 | −0.25724 |
−0.35822 | 0.324505 | −0.15674 | 1.943465 | −1.01545 | 0.057504 | −0.64971 | −0.41527 | −0.05163 |
0.118765 | 0.570328 | 0.052736 | −0.07343 | −0.26809 | −0.20423 | 1.011592 | 0.373205 | −0.38416 |
V27 | V28 | Amount |
---|---|---|
0.133558 | −0.02105 | 149.62 |
−0.00898 | 0.014724 | 2.69 |
−0.05535 | −0.05975 | 378.66 |
0.062723 | 0.061458 | 123.5 |
0.219422 | 0.215153 | 69.99 |
0.253844 | 0.08108 | 3.67 |
0.034507 | 0.005168 | 4.99 |
−1.20692 | −1.08534 | 40.8 |
0.011747 | 0.142404 | 93.2 |
Table 3.
Variance inflation factor (VIF).
Feature | VIF |
---|---|
Time | 1.104214 |
V1 | 1.003973 |
V2 | 1.000397 |
V3 | 1.038927 |
V4 | 1.002805 |
V5 | 1.007125 |
V6 | 1.000983 |
V7 | 1.002670 |
V8 | 1.001018 |
V9 | 1.000367 |
V10 | 1.001049 |
V11 | 1.013779 |
V12 | 1.003927 |
V13 | 1.000932 |
V14 | 1.002786 |
V15 | 1.007373 |
V16 | 1.000528 |
V17 | 1.002051 |
V18 | 1.002158 |
V19 | 1.000196 |
V20 | 1.000669 |
V21 | 1.001252 |
V22 | 1.004694 |
V23 | 1.000729 |
V24 | 1.000058 |
V25 | 1.012106 |
V26 | 1.000409 |
V27 | 1.000941 |
V28 | 1.000440 |
Amount | 11.650240 |
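The values in Table 3 follow the standard definition VIF_j = 1/(1 − R_j²), where R_j² comes from regressing feature j on the remaining features; values near 1 indicate negligible collinearity, and the Amount value (~11.65) exceeds the common rule-of-thumb threshold of 10. A self-contained numpy sketch on synthetic (not the Kaggle) data:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X:
    VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j
    (plus an intercept) on all remaining columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = np.sum((y - A @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        out[j] = ss_tot / ss_res      # equals 1 / (1 - R^2)
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))        # near-independent columns
vifs = vif(X)                         # all values close to 1
```

Because V1–V28 are PCA components (mutually orthogonal by construction), their VIFs sitting just above 1 in Table 3 is exactly what this definition predicts.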
Table 4.
K-CGAN Generator Neural Network Hyperparameter Settings.
Parameter | Value |
---|---|
Learning Rate | 0.0001 |
Hidden Layer Activation | ReLU |
Optimizer | Adam |
Loss Function | Trained Discriminator Loss + KL Divergence |
Hidden Layers | 2 (128, 64 units) |
Dropout | 0.1 |
Random Noise Vector | 100 |
Kernel Initializer | glorot_uniform |
Kernel Regularizer | L2 method |
Total Learning Parameters | 36,837 |
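Table 4 specifies the generator objective as the trained discriminator loss plus a KL divergence term. The numpy sketch below illustrates such a combined objective; the moment-matching Gaussian form of the KL term and the `kl_weight` balance are illustrative assumptions, not the paper's exact novelty loss:

```python
import numpy as np

def bce(p, target):
    """Binary cross-entropy for discriminator scores p in (0, 1)."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def gaussian_kl(mu_g, var_g, mu_r, var_r):
    """KL(N(mu_g, var_g) || N(mu_r, var_r)) summed over features."""
    return 0.5 * np.sum(np.log(var_r / var_g)
                        + (var_g + (mu_g - mu_r) ** 2) / var_r - 1.0)

def generator_loss(d_scores_fake, fake, real, kl_weight=1.0):
    """Adversarial term (generator wants D(fake) -> 1) plus a KL term
    matching per-feature Gaussian moments of fake vs. real batches."""
    adv = bce(d_scores_fake, np.ones_like(d_scores_fake))
    kl = gaussian_kl(fake.mean(0), fake.var(0) + 1e-7,
                     real.mean(0), real.var(0) + 1e-7)
    return adv + kl_weight * kl

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(256, 8))
fake = rng.normal(0.1, 1.1, size=(256, 8))
loss = generator_loss(rng.uniform(0.3, 0.7, size=256), fake, real)
```

The KL term pulls generated feature statistics toward the real minority-class distribution, which is the intuition behind adding it to the plain adversarial loss.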
Table 5.
K-CGAN Discriminator Neural Network Hyperparameter Settings.
Parameter | Value |
---|---|
Learning Rate | 0.0001 |
Hidden Layer Activation | LeakyReLU |
Optimizer | Adam |
Loss Function | Binary Cross Entropy |
Hidden Layers | 2 (20, 10 units) |
Dropout | 0.1 |
Kernel Regularizer | L2 method |
Table 6.
GAN Generator Neural Network Hyperparameter Settings.
Parameter | Value |
---|---|
Learning Rate | 0.0001 |
Hidden Layer Activation | ReLU |
Optimizer | RMSprop |
Loss Function | Trained Discriminator Loss |
Hidden Layers | 64, 32 |
Dropout | 0.5 |
Random Noise Vector | 100 |
Table 7.
GAN Discriminator Neural Network Hyperparameter Settings.
Parameter | Value |
---|---|
Learning Rate | 0.0001 |
Hidden Layer Activation | LeakyReLU |
Optimizer | RMSprop |
Loss Function | Binary Cross Entropy |
Hidden Layers | 128, 64, 32 |
Dropout | 0.1 |
Table 8.
Precision values for classification methods for balanced dataset.
Classifier | K-CGAN | Original Dataset | SMOTE | ADASYN | B-SMOTE | Vanilla GAN | WS GAN | SDG GAN | NS GAN | LS GAN |
---|---|---|---|---|---|---|---|---|---|---|
XG-Boost | 0.999762 | 0.924370 | 0.999467 | 0.999182 | 0.999816 | 0.997085 | 0.988636 | 0.986072 | 0.980831 | 0.982405 |
Random Forest | 0.999776 | 0.931035 | 0.999762 | 0.999760 | 0.999958 | 0.994135 | 0.980170 | 0.986111 | 0.977564 | 0.982249 |
Nearest Neighbor | 0.999608 | 0.864865 | 0.982366 | 0.973762 | 0.997603 | 0.960606 | 0.954416 | 0.966197 | 0.954545 | 0.961194 |
MLP | 0.999692 | 0.881890 | 0.997690 | 0.997970 | 0.998082 | 0.982456 | 0.974504 | 0.957219 | 0.962145 | 0.959885 |
Logistic Regression | 0.999566 | 0.890110 | 0.974443 | 0.909084 | 0.994725 | 0.965732 | 0.958457 | 0.970149 | 0.949495 | 0.968051 |
Table 9.
Recall values for classification methods for balanced dataset.
Classifier | K-CGAN | Original Dataset | SMOTE | ADASYN | B-SMOTE | Vanilla GAN | WS GAN | SDG GAN | NS GAN | LS GAN |
---|---|---|---|---|---|---|---|---|---|---|
XG-Boost | 0.999706 | 0.827068 | 1.000000 | 0.999986 | 0.999703 | 0.955307 | 0.932976 | 0.917098 | 0.962382 | 0.941011 |
Random Forest | 0.999706 | 0.812030 | 1.000000 | 1.000000 | 0.999661 | 0.946927 | 0.927614 | 0.919689 | 0.956113 | 0.932584 |
Nearest Neighbor | 0.999706 | 0.721804 | 0.999804 | 1.000000 | 0.999746 | 0.885475 | 0.898123 | 0.888601 | 0.921630 | 0.904494 |
MLP | 0.999594 | 0.842105 | 1.000000 | 0.999929 | 0.999746 | 0.938547 | 0.922252 | 0.927461 | 0.956113 | 0.941011 |
Logistic Regression | 0.999608 | 0.609023 | 0.919681 | 0.860942 | 0.996383 | 0.865922 | 0.865952 | 0.841969 | 0.884013 | 0.851124 |
Table 10.
F1 Score values for classification methods for balanced dataset.
Classifier | K-CGAN | Original Dataset | SMOTE | ADASYN | B-SMOTE | Vanilla GAN | WS GAN | SDG GAN | NS GAN | LS GAN |
---|---|---|---|---|---|---|---|---|---|---|
XG-Boost | 0.999734 | 0.873016 | 0.999733 | 0.999584 | 0.999760 | 0.975749 | 0.960000 | 0.950336 | 0.971519 | 0.961263 |
Random Forest | 0.999741 | 0.867470 | 0.999881 | 0.999880 | 0.999809 | 0.969957 | 0.953168 | 0.951743 | 0.966720 | 0.956772 |
Nearest Neighbor | 0.999657 | 0.786885 | 0.991008 | 0.986707 | 0.998673 | 0.921512 | 0.925414 | 0.925776 | 0.937799 | 0.931983 |
MLP | 0.999643 | 0.861538 | 0.998844 | 0.998949 | 0.998913 | 0.960000 | 0.947658 | 0.942105 | 0.959119 | 0.950355 |
Logistic Regression | 0.999587 | 0.723214 | 0.946270 | 0.884358 | 0.995553 | 0.913108 | 0.909859 | 0.901526 | 0.915584 | 0.905830 |
Table 11.
Accuracy values for classification methods for balanced dataset.
Classifier | K-CGAN | Original Dataset | SMOTE | ADASYN | B-SMOTE | Vanilla GAN | WS GAN | SDG GAN | NS GAN | LS GAN |
---|---|---|---|---|---|---|---|---|---|---|
XG-Boost | 0.999733 | 0.999551 | 0.999733 | 0.999585 | 0.999761 | 0.999762 | 0.999594 | 0.999482 | 0.999748 | 0.999622 |
Random Forest | 0.999740 | 0.999537 | 0.999880 | 0.999880 | 0.999810 | 0.999706 | 0.999524 | 0.999496 | 0.999706 | 0.999580 |
Nearest Neighbor | 0.999655 | 0.999270 | 0.990905 | 0.986578 | 0.998678 | 0.999244 | 0.999244 | 0.999230 | 0.999454 | 0.999342 |
MLP | 0.999641 | 0.999494 | 0.998839 | 0.998952 | 0.998917 | 0.999608 | 0.999468 | 0.999384 | 0.999636 | 0.999510 |
Logistic Regression | 0.999585 | 0.999129 | 0.947643 | 0.887842 | 0.995568 | 0.999174 | 0.999104 | 0.999006 | 0.999272 | 0.999118 |
Table 12.
Comparison of classification models using the K-CGAN synthetic data only (sample of 30,000 valid and 30,000 fraud transactions generated by K-CGAN model).
Algorithm | Precision | Recall | F1 Score | Accuracy |
---|---|---|---|---|
XG-Boost | 1.0 | 1.000000 | 1.000000 | 1.00000 |
Random Forest | 1.0 | 0.982301 | 0.991071 | 0.99996 |
Nearest Neighbor | 1.0 | 0.929204 | 0.963303 | 0.99984 |
MLP | 1.0 | 1.000000 | 1.000000 | 1.00000 |
Logistic Regression | 1.0 | 0.946903 | 0.972727 | 0.99988 |
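The reported metrics are internally consistent with F1 = 2PR/(P + R); for example, the Table 12 rows can be reproduced from their precision and recall columns:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Table 12, Random Forest row: P = 1.0, R = 0.982301 -> F1 ~ 0.991071.
rf_f1 = f1(1.0, 0.982301)
# Table 12, Logistic Regression row: P = 1.0, R = 0.946903 -> F1 ~ 0.972727.
lr_f1 = f1(1.0, 0.946903)
```

With precision fixed at 1.0 for every classifier, F1 differences in Table 12 are driven entirely by recall.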