The importance of Machine Learning and Data Science cannot be overstated. If you want to study past trends and train machines to learn, over time, how to define scenarios, identify and label events, or predict present or future values, data science is essential. Any such use case starts with studying the underlying data and modeling it with an appropriate algorithm, whose control parameters are then tuned to fit the data set. As a result, the developed application improves and solves the problem more efficiently.

In this blog, we illustrate the modeling of a data set using the machine learning paradigm of classification, with Credit Card Fraud Detection as the running example. Classification involves deriving a function that separates data into categories, or classes, from a training set of observations (instances) whose category membership is known. This function is then used to identify which category a new observation belongs to.
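As a toy illustration of the idea (not the method used later in this series), the sketch below derives a decision function from a tiny labeled training set, classifying a new observation by the nearest class mean. All data and labels here are made up for illustration:

```python
# Toy classification: learn a decision function from labeled
# training data, then use it to classify a new observation.
train = [
    ((1.0, 1.0), "non-fraud"),
    ((1.2, 0.8), "non-fraud"),
    ((5.0, 5.2), "fraud"),
    ((4.8, 5.1), "fraud"),
]

def class_means(data):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for (x, y), label in data:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {lbl: (sx / counts[lbl], sy / counts[lbl])
            for lbl, (sx, sy) in sums.items()}

def classify(point, means):
    """Assign the label whose class mean is closest (squared Euclidean)."""
    px, py = point
    return min(means,
               key=lambda lbl: (px - means[lbl][0]) ** 2
                             + (py - means[lbl][1]) ** 2)

means = class_means(train)
print(classify((5.1, 4.9), means))  # a new point near the "fraud" cluster
```

Real classifiers (such as the logistic regression used later in this series) learn far richer decision functions, but the training/prediction split is the same.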

**Problem Statement:**

The Credit Card Fraud Detection Problem includes modeling past credit card transactions with the knowledge of the ones that turned out to be fraud. This model is then used to identify whether a new transaction is fraudulent or not. Our aim here is to detect 100% of the fraudulent transactions while minimizing the incorrect fraud classifications.

**Data Set Analysis:**

The data set for this problem comes from Kaggle.

**Observations**

- The data set is highly skewed: only 492 of the 284,807 observations (0.172%) are frauds. Such skew is expected, since fraudulent transactions are naturally rare.
- The dataset consists of numerical values from 28 features transformed by Principal Component Analysis (PCA), namely V1 to V28. No metadata about the original features is provided, so no pre-analysis or domain-driven feature study could be done.
- The ‘Time’ and ‘Amount’ features are the only ones not transformed by PCA.
- There are no missing values in the dataset.

**Inferences drawn:**

- Owing to such imbalance, an algorithm that does no feature analysis and simply predicts every transaction as non-fraud would still achieve an accuracy of 99.828%. Accuracy is therefore not a meaningful measure of efficiency in our case; we need another standard of correctness for classifying transactions as fraud or non-fraud.
- The ‘Time’ feature does not indicate the actual time of the transaction and is more of a list of the data in chronological order. So we assume that the ‘Time’ feature has little or no significance in classifying a fraud transaction. Therefore, we eliminate this column from further analysis.
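The arithmetic behind that baseline-accuracy figure can be checked directly; the only inputs are the class counts quoted above:

```python
# The majority-class baseline: predict every transaction as non-fraud.
total = 284_807   # transactions in the data set
frauds = 492      # labeled fraud cases

fraud_rate = frauds / total * 100            # share of fraud cases
baseline_accuracy = (total - frauds) / total * 100

print(f"fraud rate: {fraud_rate:.3f}%")
print(f"all-non-fraud baseline accuracy: {baseline_accuracy:.3f}%")
```

A model that never flags a single fraud still scores above 99.8% accuracy, which is why accuracy alone is useless here.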

**Theory:**

Credit Card Fraud Detection is a typical example of classification. In this series, we focus more on feature modeling and on the possible business uses of the algorithm’s output than on the algorithm itself. We fit a binomial logistic regression model to the PCA-transformed Credit Card Fraud data and evaluate its performance with the ‘ROCR’ package in R.
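The original analysis was done in R; as a rough sketch of the same idea, the Python snippet below fits a logistic regression to a synthetic, highly imbalanced data set (scikit-learn and the generated data are our substitutions for illustration, not the Kaggle data or the R workflow):

```python
# Sketch: binomial logistic regression on a synthetic imbalanced data set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate ~0.5% positives to mimic a skewed fraud data set.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.995], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# predict_proba gives P(fraud) per transaction; a threshold on this
# probability (not necessarily 0.5) turns scores into class labels.
probs = model.predict_proba(X_te)[:, 1]
print(probs.min(), probs.max())
```

The key output is the per-transaction fraud probability, which is then cut at a chosen threshold, the idea behind the “Threshold Cutoff Probability” defined below.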

**Some Definitions:**

The following are essential definitions – in the current problem’s context – needed to understand the approaches mentioned later:

- True Positive: The fraud cases that the model predicted as ‘fraud.’
- False Positive: The non-fraud cases that the model predicted as ‘fraud.’
- True Negative: The non-fraud cases that the model predicted as ‘non-fraud.’
- False Negative: The fraud cases that the model predicted as ‘non-fraud.’
- Threshold Cutoff Probability: The probability cutoff above which a transaction is classified as ‘fraud,’ typically chosen where the true positive rate and true negative rate are jointly maximized. Note that this cutoff turns out to be very small here, which is reasonable given how rare frauds are.
- Accuracy: The measure of correct predictions made by the model – that is, the ratio of fraud transactions classified as fraud and non-fraud classified as non-fraud to the total transactions in the test data.
- Sensitivity: Sensitivity, or True Positive Rate, or Recall, is the ratio of correctly identified fraud cases to total fraud cases.
- Specificity: Specificity, or True Negative Rate, is the ratio of correctly identified non-fraud cases to total non-fraud cases.
- Precision: Precision is the ratio of correctly predicted fraud cases to total predicted fraud cases.
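A worked example may make these definitions concrete. The confusion-matrix counts below are hypothetical, chosen only for illustration:

```python
# Hypothetical confusion-matrix counts for a fraud classifier.
tp, fn = 80, 20    # fraud cases: correctly caught vs. missed
fp, tn = 30, 870   # non-fraud cases: falsely flagged vs. correctly passed

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall / true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision   = tp / (tp + fp)

print(accuracy, sensitivity, specificity, precision)
```

Note how the four metrics answer different questions: sensitivity looks only at the fraud rows, specificity only at the non-fraud rows, and precision only at the transactions the model flagged.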

**Incorrect Measures of Efficiency of a Data Model:**

Let’s look at the various measures of efficiency that fail at analyzing the correctness of the underlying data model.

- Total/Net Accuracy: One approach to gauging the model’s correctness is to use accuracy as the deciding parameter. But, as stated earlier, in a data set this skewed, even predicting every transaction as non-fraudulent yields only 492 wrong predictions out of 284,807. The accuracy looks excellent, yet such a model catches none of the fraud cases we actually care about. So we cannot use accuracy as the deciding factor here.
- Confusion Matrix: Merely tabulating the confusion matrix will not provide a clear picture of the model’s performance. Because fraud cases are so few, the variation in the confusion matrix is so small that it is comparable to the acceptable error on a balanced dataset (probably even smaller). So this measure on its own is also ruled out.
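Both failure modes show up at once if we write down the confusion matrix of the trivial “predict everything as non-fraud” model against this data set’s class counts:

```python
# Confusion matrix of the model that predicts every transaction as non-fraud.
tp, fn = 0, 492          # all 492 frauds are missed
fp, tn = 0, 284_315      # all non-frauds are (trivially) correct

accuracy = (tp + tn) / (tp + tn + fp + fn)   # looks excellent
sensitivity = tp / (tp + fn)                 # no fraud detected at all
print(accuracy, sensitivity)
```

Accuracy alone hides the complete failure on the fraud class, which is why the later posts in this series turn to sensitivity-aware metrics instead.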

This first part of a three-part blog series provides insight into the analysis of the data and the pitfalls of handling a skewed data set. In the subsequent posts, we will fit a model to this data set, analyze the results, and examine the measures of efficiency that can serve as metrics for the utility (correctness) of the modeling. So stay tuned for our next blog for continued work on Credit Card Fraud Detection.
