Elena Canorea
Communications Lead
It is no secret that we live in a digital era in which almost everything is done online, including money management and e-commerce. As a result, credit card fraud is a frequent problem that can cause serious financial losses. Machine learning algorithms can help detect and prevent it. Several methods have been proposed in the literature, using neural networks, logistic regression, Naive Bayes, and decision trees.
In this report, I compare them, with the focus on binary classification methods. The comparison covers more than one data set, and I have tried to include real-world examples to show scalability.
In an era where everything happens fast, we sometimes forget to check our credit card security. For convenience, we save card details in apps, which can lead to credit card fraud. AI can help counter this, but there are many ways to apply it. We could rely on neural networks, logistic regression, Naive Bayes, decision trees, random forests, or a multilayer perceptron, but here we focus on the performance of binary classification methods and compare it with other results.
Performance is key to any software, since our world moves at top speed. How often do we encounter this problem? In 2020 the pandemic gave everyone a hard time, and with the global lockdown the popularity of online shopping increased drastically. This means that credit card fraud could double by the end of 2020, and in times like these, the last thing anyone wants is to lose all their money.
The problem has piqued the interest of many researchers. One solution is to detect credit card fraud with decision trees and support vector machines, as Y. Sahin and E. Duman showed: “as the size of the training data sets become larger, the accuracy performance of SVM based models reach the performance of the decision tree-based models.” But performance is not the whole story, and the problem is not new. For example, in 1994 Ghosh and Reilly created a neural network that detected and classified accounts as fraudulent with a higher success rate than rule-based procedures. In 1996, Dhar and Buescher used historical data on credit card transactions to create a fraud score model, applying a clustering approach with a radial basis function network. Other approaches used classic algorithms such as gradient boosting, decision trees, and logistic regression, all with different results, which are compared in the following sections.
As mentioned, there are many ways to approach this problem, but we focus on binary classification because, according to the paper Credit Card Fraud Detection, binary classification methods gave the best results in terms of accuracy. Random forests reached an accuracy of 95.5%, followed by a decision tree at 94.3% and logistic regression at 90%.
The dataset used is a popular one that can be downloaded from Kaggle and contains transactions made by European cardholders in 2013. It holds 284,807 transactions, of which only 492 were labeled as fraud. The dataset has been transformed using principal component analysis (PCA): the variables V1 through V28 are PCA features, and the remaining ones (Time, Amount, and Class) are not. Since the class distribution ratio is a crucial aspect of the experimental results, the data needs some preprocessing.
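To make the imbalance concrete, the short calculation below uses only the two counts quoted above; the variable names are illustrative.

```python
# Figures quoted in the text for the Kaggle dataset.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492

legit_transactions = TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS
fraud_share = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS
legit_per_fraud = legit_transactions / FRAUD_TRANSACTIONS

print(f"fraud share of all transactions: {fraud_share:.4%}")
print(f"legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

Fewer than two transactions in a thousand are fraudulent, which is why the class distribution has to be adjusted before training.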
Not all features are useful, and keeping them may lead to overfitting, so we must carefully select the most important ones and remove the others to reduce training time and improve accuracy. To filter the valuable features, Will Koehrsen’s tool was used, which led to reducing the number of valuable features by 95%, so only 27 features continued to the next phase. Because the data is highly imbalanced, a method for adjusting the class distribution is needed. The most common ones are oversampling the minority class, undersampling the majority class, or a hybrid of the two. A popular oversampling method, used in both articles, is SMOTE (Synthetic Minority Oversampling Technique), because it is highly effective on imbalanced datasets.
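The core idea of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. The version below is a simplified illustration written for this article, not the code used in the cited papers; the function name `smote`, its parameters, and the toy data are all assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length feature vectors (lists of floats).
    k: number of nearest minority neighbours to choose from.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class (fraud) samples with two features each.
fraud = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(fraud, n_new=6, k=2)
print(new_points)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region the minority class already occupies. In practice, oversampling should be applied to the training portion only, so that the test set keeps the real class ratio.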
From the start, our goal has been to compare the performance of different binary classification methods. The authors of the articles built and trained the models; here we compare their results and determine which performs better in terms of precision and accuracy.
Logistic regression describes the relationship between an outcome and predictors that can be categorical, binary, or continuous. Based on those predictors, it estimates the probability that an observation belongs to each category. Naive Bayes is another supervised learning algorithm, based on Bayes' theorem, in which the attributes are assumed to be independent of one another. In the experiment, the Bernoulli distribution was used for detecting fraudulent transactions. Decision trees are yet another supervised learning algorithm, whose structure resembles a real tree, with three kinds of nodes: the root node, intermediate nodes, and leaf (terminal) nodes.
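As a small illustration of how logistic regression turns predictors into a probability, the sketch below applies the logistic (sigmoid) function to a weighted sum of features. The weights, bias, and feature values are made up for the example; a real model would learn them from the training data.

```python
import math


def fraud_probability(features, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the predictors."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid for both large positive and negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(features, weights, bias, threshold=0.5):
    """Return 1 (fraud) if the estimated probability reaches the threshold."""
    return int(fraud_probability(features, weights, bias) >= threshold)


# Hypothetical weights and bias, chosen only for illustration.
weights = [1.5, -0.5]
bias = -1.0
print(fraud_probability([2.0, 1.0], weights, bias))
print(classify([2.0, 1.0], weights, bias))
print(classify([0.0, 1.0], weights, bias))
```

The threshold of 0.5 is a common default; on imbalanced data it is often tuned, since lowering it trades precision for recall.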
To classify correctly, a decision tree checks a condition at each level and follows the matching branch until it reaches a leaf, which gives the conclusion. A support vector machine is a supervised learning algorithm that trains on data already labeled with the correct categories and learns a boundary that separates them. Random forests can be used for classification or regression; they combine a collection of decision trees and typically outperform a single tree. The data set was split in an 80:20 ratio: 80% for training and 20% for testing.
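An 80:20 split of 284,807 transactions gives roughly 56,961 test samples, and 20% of 492 frauds is about 98, which matches the test-set figures reported with the results. Those counts are consistent with a split that preserves the class ratio (a stratified split), although the source does not say how the split was done. The sketch below shows one way to do it; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each class's share roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test portion.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 1 = fraud, 0 = legitimate (made-up data, not the real set).
labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(labels)
test_frauds = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_frauds)
```

Splitting per class guarantees that the rare fraud cases appear in both the training and the test set in the same proportion.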
As mentioned, we focus on the performance of binary classification methods and compare them on accuracy and precision. The test set contains 56,962 samples, of which 98 are fraudulent transactions.
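With so few frauds in the test set, accuracy alone can be misleading. The sketch below defines accuracy, precision, and recall from confusion-matrix counts and evaluates a trivial baseline that never flags fraud, using only the two test-set figures quoted above. The baseline is an illustration added here, not one of the models from the papers.

```python
TEST_SAMPLES = 56_962
TEST_FRAUDS = 98


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of transactions flagged as fraud that really are fraud."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of real frauds that the model catches."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Baseline that never flags fraud: every fraud is missed, no false alarms.
tp, fp = 0, 0
fn = TEST_FRAUDS
tn = TEST_SAMPLES - TEST_FRAUDS

baseline_accuracy = accuracy(tp, tn, fp, fn)
baseline_precision = precision(tp, fp)
baseline_recall = recall(tp, fn)
print(f"accuracy:  {baseline_accuracy:.2%}")
print(f"precision: {baseline_precision:.2%}")
print(f"recall:    {baseline_recall:.2%}")
```

A model that catches no fraud at all still scores about 99.83% accuracy here, so accuracy figures should always be read alongside precision and recall.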
Precision | Accuracy | Predicted as fraud | Actual fraud | Predicted as not fraud | Actual not fraud |
---|---|---|---|---|---|
58.72% | 97.46% | 1530 | 98 | 55432 | 56864 |
16.17% | 99.23% | 501 | 98 | 56461 | 56864 |
96.38% | 99.96% | 83 | 98 | 56879 | 56864 |
98.14% | 97.08% | | | | |
98.31% | 97.18% | | | | |
As the paper suggests, the results show that a classical approach can be as successful as more popular choices such as deep learning algorithms. The articles support this idea in more detail: “The findings of this study indicate promising results with SMOTE-based sampling techniques. The best recall score obtained was with SMOTE sampling strategy by DRF classifier at 0.81.”
As we have seen, credit card fraud is a real threat. This year has also seen the introduction of applications that let you pay with NFC, which can be a serious problem when someone knows how to clone credit cards. Several ways have been proposed to combat this problem.
As the experimental results show, classical algorithms can be as successful as a deep learning method, but only if the dataset is pre-processed with the SMOTE strategy. The best supervised learning algorithm in terms of precision was the support vector machine, with 98.31%, and in terms of accuracy the random forest, with 99.69%. This recalls the earlier remark that “as the size of the training data sets become larger the accuracy performance of SVM-based models reach the performance of the decision tree-based models”.
Microsoft also adopted binary classification for this problem, developing a model that can be trained and consumed as an API in ML.NET. It uses their FastTree algorithm, an optimized boosted-tree learner, for binary classification. Next, I intend to analyze other papers that tackle the same problem with binary classification methods, and I will look for more up-to-date datasets from real banks around the world.