Machine Learning and Data Science are central to any task that involves studying past trends and training machines to define scenarios, identify and label events, or predict a present or future value. Approaching any such use case means studying the underlying data and modeling it with an appropriate algorithm. The control parameters of that algorithm then need to be tuned to fit the data set, so that the resulting application solves the problem more effectively.
In this blog, we illustrate the modeling of a data set using classification, a machine learning paradigm, with Credit Card Fraud Detection as the example. Classification involves deriving a function that separates data into categories, or classes, from a training set of observations (instances) whose category membership is known. This function is then used to identify the category to which a new observation belongs.
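To make the idea of "deriving a function from labeled data" concrete, here is a minimal sketch in Python. It uses a nearest-centroid rule, which is chosen only because it is short; it is not the algorithm used in this series. The function names and the toy data are made up for illustration.

```python
def fit_centroids(samples, labels):
    """Learn one mean vector (centroid) per class from labeled samples."""
    sums, counts = {}, {}
    for x, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))


# Toy training set (hypothetical values): observations with known class membership.
train_x = [[1.0, 1.2], [0.8, 1.0], [5.0, 4.8], [5.2, 5.1]]
train_y = ["legitimate", "legitimate", "fraud", "fraud"]

centroids = fit_centroids(train_x, train_y)

# The learned function is then applied to a new, unlabeled observation.
new_label = classify(centroids, [4.9, 5.0])
print(new_label)
```

The training step summarizes what is known about each class, and the classification step applies that summary to an observation the model has not seen.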
The Credit Card Fraud Detection problem involves modeling past credit card transactions, using the knowledge of which ones turned out to be fraudulent. This model is then used to identify whether a new transaction is fraudulent or not. Our aim here is to detect 100% of the fraudulent transactions while minimizing the number of legitimate transactions incorrectly classified as fraud.
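The two goals stated above can be expressed as numbers: the share of actual frauds that the model catches (often called recall) and the count of legitimate transactions wrongly flagged as fraud (false positives). The sketch below is a minimal Python illustration; the function name `fraud_metrics` and the label lists are hypothetical, with 1 meaning fraud and 0 meaning legitimate.

```python
def fraud_metrics(y_true, y_pred):
    """Summarize how well predictions meet the two stated goals.

    y_true, y_pred: sequences of 0 (legitimate) and 1 (fraud).
    """
    pairs = list(zip(y_true, y_pred))
    caught = sum(1 for t, p in pairs if t == 1 and p == 1)        # frauds detected
    missed = sum(1 for t, p in pairs if t == 1 and p == 0)        # frauds not detected
    false_alarms = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged as fraud
    total_fraud = caught + missed
    recall = caught / total_fraud if total_fraud else 0.0
    return {"recall": recall, "missed_frauds": missed, "false_positives": false_alarms}


# Made-up labels for eight transactions.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 0, 0]

metrics = fraud_metrics(y_true, y_pred)
print(metrics)
```

In these terms, the aim is a recall of 1.0 with as few false positives as possible.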
Data Set Analysis:
This problem is taken from Kaggle.
Credit Card Fraud Detection is a typical example of classification. Here, we have focused more on analyzing the feature modeling and the possible business use cases of the algorithm’s output than on the algorithm itself. We applied the Binomial Logistic Regression algorithm, together with the ‘ROCR’ package, to the PCA-transformed Credit Card Fraud data.
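For readers unfamiliar with binomial logistic regression, the following is a rough sketch of the technique in plain Python, fitted by batch gradient descent. It is not the R code used for this series: the function names, learning rate, iteration count, and the two-feature toy data (standing in for PCA components) are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the average log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss with respect to the linear score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that observation x belongs to class 1 (fraud)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy data (hypothetical): two features per transaction, 1 = fraud, 0 = legitimate.
X = [[-1.0, -0.5], [-0.8, -1.2], [-1.1, -0.9], [-0.6, -0.7], [1.2, 0.9], [0.9, 1.4]]
y = [0, 0, 0, 0, 1, 1]

w, b = fit_logistic(X, y)
print(predict_proba(w, b, [1.0, 1.0]), predict_proba(w, b, [-1.0, -1.0]))
```

The model outputs a probability rather than a hard label, so the cutoff used to call a transaction fraudulent is a separate choice that can be tuned to the business need.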
The following definitions, given in the context of the current problem, are needed to understand the approaches mentioned later:
Incorrect Measures of Efficiency of a Data Model:
Let’s look at the various measures of efficiency that fail to reflect the correctness of the underlying data model.
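Plain accuracy is a commonly cited example of a measure that misleads on a skewed data set. The Python sketch below uses made-up numbers (1,000 transactions, 2 of them fraudulent), not figures from the Kaggle data, and a deliberately useless "model" that labels every transaction as legitimate.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def frauds_caught(y_true, y_pred):
    """Number of actual frauds (label 1) that were predicted as fraud."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)


# Hypothetical skewed data set: 2 frauds among 1,000 transactions.
y_true = [1] * 2 + [0] * 998

# A "model" that never flags anything as fraud.
always_legitimate = [0] * len(y_true)

acc = accuracy(y_true, always_legitimate)
caught = frauds_caught(y_true, always_legitimate)
print(f"accuracy = {acc:.3f}, frauds caught = {caught}")
```

The accuracy looks excellent even though not a single fraud is detected, which is why accuracy alone says little about a model for this problem.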
This first part of a three-part blog series provides an insight into the analysis of the data and the pitfalls in handling a skewed data set. In the subsequent posts, we shall fit a model to this data set, analyze the results, and look into the measures of efficiency that can serve as metrics for the utility (correctness) of the modeling. So, stay tuned for our next blog for continued work on Credit Card Fraud Detection.