Credit Card Fraud Detection Using Binary Classification and ML.NET

It is no secret that we live in a digital era in which almost everything, including money management and e-commerce, is done online. As a result, credit card fraud is a common problem that can cause significant losses. Machine learning algorithms can help prevent it. Several methods have been proposed in the literature, using neural networks, logistic regression, Naive Bayes, and decision trees.

In this report, I compare these methods, with the focus on binary classification. The comparison covers more than one data set, and I have tried to include real-world examples to show scalability.

In an era where everything happens fast, we sometimes forget to check on our credit card security. For convenience, we save our credit cards in apps, which can lead to credit card fraud. AI can help detect such fraud, but there are many ways to apply it. We could rely on neural networks, logistic regression, Naive Bayes, decision trees, random forests, or a multilayer perceptron; here we focus on the performance of binary classification methods and compare it with other results.

Performance is key to any software, since our world moves at top speed. How often do we meet this problem? In 2020 the pandemic gave everyone a hard time, and with the global lockdown the popularity of online shopping increased drastically. This means that by the end of 2020 credit card fraud could double, and in times like these the last thing anyone wants is to lose all their money.

The problem has piqued the interest of many researchers. One proposed solution is to detect credit card fraud with decision trees and support vector machines; as Y. Sahin and E. Duman report, “as the size of the training data sets become larger, the accuracy performance of SVM based models reach the performance of the decision tree-based models.” The problem itself is not new. In 1994, Ghosh and Reilly created a neural network that detected and classified accounts as fraudulent with a higher success rate than rule-based procedures. In 1996, Dhar and Buescher used historical data on credit card transactions to create a fraud score model, using a clustering approach on a radial basis function network. Other approaches used classic algorithms such as gradient boosting, decision trees, and logistic regression, all with different results, which are compared in the following chapters.

Binary Classification Methods

As mentioned, there are many ways to solve this problem, but we focus on binary classification because, according to the paper Credit Card Fraud Detection, the best accuracy was achieved by binary classification methods. Random forests had an accuracy of 95.5%, followed by a decision tree algorithm with 94.3% and logistic regression with 90%.

Dataset

The dataset is a popular one that can be downloaded from Kaggle and is made up of transactions by European cardholders in 2013. It contains 284,807 transactions, of which only 492 are labeled as fraud. The dataset was transformed using principal component analysis (PCA): the variables V1 through V28 are PCA features, while Time, Amount, and Class are not. Because the distribution ratio of the classes is a crucial factor in the experimental results, the data needs some preprocessing.
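To make the imbalance concrete, here is a small sketch that works out the class ratio from the two counts quoted above. The variable names are my own, and the "always predict not fraud" baseline is an illustration rather than a model from the cited papers.

```python
# Class counts quoted in the text for the Kaggle dataset.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492

fraud_share = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS
legit_per_fraud = (TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS) / FRAUD_TRANSACTIONS

# Accuracy of a trivial model that labels every transaction as "not fraud".
majority_baseline_accuracy = 1 - fraud_share

print(f"fraud share: {fraud_share:.3%}")
print(f"legitimate transactions per fraud: {legit_per_fraud:.0f}")
print(f"always-'not fraud' accuracy: {majority_baseline_accuracy:.3%}")
```

Fraud makes up roughly 0.17% of the rows, so a model that never flags anything would still be more than 99.8% accurate. This is why accuracy alone says little on this dataset and why the class distribution has to be adjusted before training.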

Methodology

Not all features are useful, and keeping them may lead to overfitting, so we must carefully select the most important ones and remove the others to reduce training time and improve accuracy. To filter the valuable features, Will Koehrsen’s tool was used, which led to reducing the number of valuable features by 95%, so only 27 features continued to the next phase. Because the data is highly imbalanced, a method for adjusting the class distribution is used. The most common ones are oversampling the minority class, undersampling the majority class, or a hybrid of the two. A popular oversampling method, used in both articles, is SMOTE (Synthetic Minority Oversampling Technique), which is highly effective on imbalanced datasets.
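The core idea of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. The function below is a simplified illustration under that assumption, not the implementation used in the cited articles; the function name, parameters, and sample points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature fraud samples, for illustration only.
fraud_samples = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_samples = smote_sketch(fraud_samples, n_synthetic=6, k=2)
print(new_samples)
```

Because every synthetic point lies between two real fraud samples, the minority class grows without simply duplicating rows. Oversampling of this kind is normally applied to the training portion only, so that the test set keeps the real class ratio.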

Binary Classification Methods Used

Our goal throughout this article has been to see how different binary classification methods perform. The authors of the article built and trained the models; we compare the results and determine which has the better precision and accuracy. Logistic regression describes the relationship between a binary outcome and predictors that can be categorical, binary, or continuous.
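As a minimal sketch of how logistic regression produces a decision, the code below passes a weighted sum of the predictors through the sigmoid function and compares the result with a threshold. The weights, bias, inputs, and the 0.5 threshold are hypothetical and are not taken from any trained model in the cited papers.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fraud_probability(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the predictors."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(features, weights, bias, threshold=0.5):
    """Return 1 (fraud) if the estimated probability reaches the threshold."""
    return 1 if fraud_probability(features, weights, bias) >= threshold else 0


# Hypothetical weights and inputs, not taken from any trained model.
weights = [0.8, -1.2]
bias = -0.5
print(fraud_probability([2.0, 0.5], weights, bias))
print(classify([2.0, 0.5], weights, bias))
```

Training consists of finding the weights and bias that make these probabilities match the labeled data as closely as possible; the threshold can then be moved to trade precision against recall.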

Based on the predictors, logistic regression estimates the probability that an observation belongs to each category. Naive Bayes is another supervised learning algorithm; it is based on Bayes' theorem and assumes that the attributes have no dependencies on one another. In the experiment, the Bernoulli distribution was used for detecting fraudulent transactions. Decision trees are yet another supervised learning algorithm, whose structure resembles a real-life tree with three kinds of nodes: the root node, intermediate nodes, and leaf (terminal) nodes.
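The three kinds of nodes can be illustrated with a small hand-built tree. This is only a sketch: the feature names V14 and Amount come from the dataset, but the thresholds and the tree shape are made up, and a real tree would be learned from the training data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[str] = None      # feature tested at this node
    threshold: float = 0.0             # go left if value <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None        # set only on leaf nodes


def predict(node, sample):
    """Walk from the root to a leaf, checking one condition per level."""
    while node.label is None:
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.label


# Hypothetical tree: feature names follow the dataset, thresholds are made up.
tree = Node(                                   # root node
    feature="V14", threshold=-3.0,
    left=Node(feature="Amount", threshold=1.0,  # intermediate node
              left=Node(label=0), right=Node(label=1)),
    right=Node(label=0),                        # leaf node
)

print(predict(tree, {"V14": -5.0, "Amount": 250.0}))
print(predict(tree, {"V14": 0.2, "Amount": 250.0}))
```

The first sample takes the left branch twice and ends in a leaf labeled 1 (fraud); the second leaves the root to the right and is labeled 0.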

To classify a sample, a decision tree checks a condition at each level and follows the matching branch until it reaches a leaf, which gives the conclusion. A support vector machine is a supervised learning algorithm that trains on data already labeled with the correct categories and then looks for a boundary that separates those categories. Random forests can be used for classification or regression; they combine a collection of decision trees and typically outperform a single tree. The data set was split in an 80:20 ratio: 80% for training and 20% for testing.
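An 80:20 split on imbalanced data is usually done per class, so that the few fraud cases are shared between both parts in the same proportion. The sketch below assumes such a stratified split; the source does not say whether the authors stratified, and the function name and toy labels are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each class's share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels: 990 legitimate (0) and 10 fraudulent (1).
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)
test_frauds = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_frauds)  # 800 200 2
```

With a plain random split, a test set this small could end up with no fraud cases at all, which would make precision and recall meaningless.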

Experimental Results

As mentioned, we focus on the performance of binary classification methods and compare them on accuracy and precision. The test set contains 56,962 samples, of which 98 are fraudulent transactions.

Logistic Regression

Precision: 58.72%

Accuracy: 97.46%

Predicted as fraud: 1530

Actual fraud: 98

Predicted as not fraud: 55432

Actual not fraud: 56864

Naïve Bayes

Precision: 16.17%

Accuracy: 99.23%

Predicted as fraud: 501

Actual fraud: 98

Predicted as not fraud: 56461

Actual not fraud: 56864

Random Forest

Precision: 96.38%

Accuracy: 99.96%

Predicted as fraud: 83

Actual fraud: 98

Predicted as not fraud: 56879

Actual not fraud: 56864

Decision Tree

Precision: 98.14%

Accuracy: 97.08%

Support Vector Machine

Precision: 98.31%

Accuracy: 97.18%
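The reported counts can be cross-checked against the usual definitions: precision is TP / (TP + FP) and accuracy is (TP + TN) / total. The sketch below rebuilds the confusion matrix for Naive Bayes and random forest from the figures listed above; the function names are my own, and the rounding of true positives to a whole number is an assumption.

```python
def confusion_from_report(total, actual_fraud, predicted_fraud, precision):
    """Rebuild TP, FP, FN, TN from the counts and the reported precision."""
    tp = round(precision * predicted_fraud)
    fp = predicted_fraud - tp
    fn = actual_fraud - tp
    tn = total - tp - fp - fn
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision_score(tp, fp):
    return tp / (tp + fp)


def recall_score(tp, fn):
    return tp / (tp + fn)


TOTAL, ACTUAL_FRAUD = 56_962, 98
reports = {
    "Naive Bayes": (501, 0.1617),
    "Random Forest": (83, 0.9638),
}
for name, (predicted_fraud, reported_precision) in reports.items():
    tp, fp, fn, tn = confusion_from_report(
        TOTAL, ACTUAL_FRAUD, predicted_fraud, reported_precision
    )
    print(name, (tp, fp, fn, tn),
          f"accuracy={accuracy(tp, fp, fn, tn):.2%}",
          f"recall={recall_score(tp, fn):.2%}")
```

For both models the rebuilt accuracy agrees with the reported value (99.23% and 99.96%), and TN + FN equals the reported "predicted as not fraud" count. The same reconstruction does not work for the first model in the list: a precision of 58.72% on 1530 predicted frauds would imply about 898 true positives, more than the 98 actual frauds, so at least one of those reported figures appears to be mistyped.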

As the paper suggests, the results show that a classical approach can be as successful as more popular choices such as deep learning algorithms. The articles support this idea in more detail: “The findings of this study indicate promising results with SMOTE-based sampling techniques. The best recall score obtained was with SMOTE sampling strategy by DRF classifier at 0.81.”

Conclusion

As we’ve seen, credit card fraud represents a real threat. Moreover, this year has seen the introduction of applications that let you pay with NFC, which can be a huge problem when someone knows how to clone credit cards. Several ways have been proposed to combat this problem.

As the experimental results show, the classical algorithms can be as successful as a deep learning method, but only if the dataset is pre-processed with the SMOTE strategy. The best supervised learning algorithm in terms of precision was the support vector machine, with a precision of 98.31%; in terms of accuracy it was random forest, with an accuracy of 99.96%. This recalls the earlier remark by Sahin and Duman that “as the size of the training data sets become larger the accuracy performance of SVM-based models reach the performance of the decision tree-based models”.

The idea of using binary classification to solve this problem was also adopted by Microsoft to develop a model that can be trained and consumed as an API in ML.NET. The algorithm they used was FastTree, a highly optimized boosted-tree learner, applied to binary classification. I intend to analyze other papers that solved the same problem using binary classification methods, and I will also search for more up-to-date datasets from real banks across the world.
