Credit Card Fraud Detection is one of the most popular datasets on Kaggle. Its appeal stems from the fact that transaction fraud detection is a practical application that many businesses care about. It’s pretty cool to stop crime with machine learning.

The dataset is relatively easy to work with given that it’s structured, doesn’t have missing values, and is under 1GB in size. Our goal is to build a binary classifier that identifies which transactions are fraudulent and which are genuine. One challenge is dealing with the highly unbalanced distribution of labels: only 492 of the 284,807 transactions in the dataset are fraudulent (0.173%). Less fraud is good for a credit card company, but makes life a little more difficult for machine learning engineers.
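
If you want to verify the imbalance yourself, a quick check with pandas will do it (this assumes a local copy of the Kaggle CSV, which Kaggle distributes as creditcard.csv with a capitalized Class column):

import pandas as pd

# assumes the Kaggle CSV saved locally; Kaggle names the label column "Class"
df = pd.read_csv("creditcard.csv")
fraud_count = (df["Class"] == 1).sum()
print(fraud_count, "of", len(df), "transactions are fraudulent")  # 492 of 284807
print("fraud rate: {:.3%}".format(fraud_count / len(df)))         # 0.173%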

While it would be cool to just build an accurate model, it would be more useful to build a production application that can automatically scale to handle more data, update when new data becomes available, and serve real-time predictions. This usually requires a lot of DevOps work, but we can do it with minimal effort using Cortex, an open source machine learning infrastructure platform. Cortex converts declarative configuration into scalable machine learning pipelines. In this guide, we’ll see how to use Cortex to build and deploy a fraud detection API using Kaggle’s dataset.

Setting up Data Ingestion

Typically, we’d configure Cortex to ingest data from a production data warehouse, but for this example we’ll use a public S3 bucket named cortex-examples. Machine learning applications should not tamper with the data warehouse, so Cortex ingests the data and manages it independently.

The environment configuration below tells Cortex to ingest the CSV data with the defined schema. There’s no need to write custom data wrangling scripts or manage Spark workloads for data processing.

We need to convert the CSV file into columns that we can transform and then use to build our model. For production applications, it’s good practice to perform type checking and to ensure that we don’t have any missing values. The raw_column configuration below tells Cortex to validate that class is always an integer and that time, v1-v28, and amount are all floats. Without these checks, we could have data quality issues that are hard to debug and that degrade model performance.

- kind: environment
  name: dev
  data:
    type: csv
    path: s3a://cortex-examples/fraud.csv
    csv_config:
      header: true
    schema: [time, v1, v2, v3, ..., amount, class]
    
- kind: raw_column
  name: time
  type: FLOAT_COLUMN
  required: true
 
- kind: raw_column
  name: v1
  type: FLOAT_COLUMN
  required: true

- kind: raw_column
  name: v2
  type: FLOAT_COLUMN
  required: true

# v3 - v28 omitted for brevity

- kind: raw_column
  name: amount
  type: FLOAT_COLUMN
  required: true

- kind: raw_column
  name: class
  type: INT_COLUMN
  required: true

Defining Data Transformations

Once the data is ingested and validated, we need to prepare it for training. time, v1-v28, and amount are numeric columns with very different ranges, so we should normalize them to prevent features from being treated as more or less important simply because of the magnitude of their values.

Normalization requires computing two aggregate values for each data column, namely the mean and standard deviation, and then transforming each value in the column (by subtracting the mean and dividing by the standard deviation). Cortex has these aggregation and transformation functions built in. The YAML below shows how to configure a Cortex pipeline to compute all the aggregates and transformed columns. Note that the target column class doesn’t need any modification because its values are already 0 or 1.

- kind: aggregate
  name: time_mean
  aggregator: cortex.mean
  inputs:
    columns:
      col: time

- kind: aggregate
  name: time_stddev
  aggregator: cortex.stddev
  inputs:
    columns:
      col: time

- kind: transformed_column
  name: time_normalized
  transformer: cortex.normalize
  inputs:
    columns:
      num: time
    args:
      mean: time_mean
      stddev: time_stddev

# v1 - v28 omitted for brevity

- kind: aggregate
  name: amount_mean
  aggregator: cortex.mean
  inputs:
    columns:
      col: amount

- kind: aggregate
  name: amount_stddev
  aggregator: cortex.stddev
  inputs:
    columns:
      col: amount

- kind: transformed_column
  name: amount_normalized
  transformer: cortex.normalize
  inputs:
    columns:
      num: amount
    args:
      mean: amount_mean
      stddev: amount_stddev
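
For intuition, the built-in aggregators and transformer above amount to ordinary z-score normalization. Here is a rough PySpark equivalent, sketched on a toy DataFrame (this is for illustration, not Cortex’s actual implementation):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
data = spark.createDataFrame([(149.62,), (2.69,), (378.66,)], ["amount"])

# z-score normalization: subtract the mean, divide by the standard deviation
stats = data.agg(F.mean("amount").alias("mean"), F.stddev("amount").alias("stddev")).first()
data = data.withColumn("amount_normalized", (F.col("amount") - stats["mean"]) / stats["stddev"])
data.show()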

Processing the Data

Now that the data preparation steps are defined, cortex deploy will launch and orchestrate all required workloads for processing the data at scale. We can deploy our application at any time and Cortex will create the desired state based on the configuration. Subsequent deployments will use cached resources when possible before launching additional workloads. Cortex streams output to our terminal in real time, which we can use to sanity check that our code is working correctly:

Ingesting fraud data from s3a://cortex-examples/fraud.csv
284807 rows ingested

...

v1_mean:    -2.237831565309384e-10
v1_stddev:  1.958695804149988

Transforming v1 to v1_normalized
v1:             -1.36    1.19    -1.36
v1_normalized:  -0.69    0.61    -0.69

...

amount_mean:    88.34961924204623
amount_stddev:  250.12010901734928

Transforming amount to amount_normalized
amount:             149.62     2.69    378.66
amount_normalized:    0.24    -0.34      1.16
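
We can check the transformed values by hand; plugging the first amount from the log into the normalization formula reproduces the output:

# (149.62 - amount_mean) / amount_stddev should match amount_normalized above
amount_mean = 88.34961924204623
amount_stddev = 250.12010901734928
print((149.62 - amount_mean) / amount_stddev)  # ~0.245, shown rounded as 0.24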

Configuring Model Training

We’ll use TensorFlow’s pre-made DNNClassifier to keep our example simple, although Cortex supports any TensorFlow code that implements the tf.estimator API.

import tensorflow as tf


def create_estimator(run_config, model_config):
    feature_columns = [
        tf.feature_column.numeric_column(feature_column["name"])
        for feature_column in model_config["feature_columns"]
    ]

    return tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=model_config["hparams"]["hidden_units"],
        n_classes=2,
        config=run_config,
    )
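
Cortex supplies the run_config and model_config arguments when it invokes this function. To make the expected shape concrete, here is a hypothetical model_config mirroring the configuration below (only two of the feature columns are shown):

import tensorflow as tf

run_config = tf.estimator.RunConfig()
# hypothetical dictionary illustrating the shape create_estimator expects;
# the real one is generated by Cortex from the model configuration
model_config = {
    "feature_columns": [{"name": "time_normalized"}, {"name": "amount_normalized"}],
    "hparams": {"hidden_units": [100, 100, 100]},
}
estimator = create_estimator(run_config, model_config)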

We configure Cortex to make the normalized columns available to the training workload, automatically split the dataset into 80% for training and 20% for evaluation, and train for 5000 steps.

- kind: model
  name: dnn
  path: dnn.py
  type: classification
  target_column: class
  feature_columns:
    [time_normalized, v1_normalized, v2_normalized, v3_normalized, v4_normalized, v5_normalized, v6_normalized, v7_normalized, v8_normalized, v9_normalized, v10_normalized, v11_normalized, v12_normalized, v13_normalized, v14_normalized, v15_normalized, v16_normalized, v17_normalized, v18_normalized, v19_normalized, v20_normalized, v21_normalized, v22_normalized, v23_normalized, v24_normalized, v25_normalized, v26_normalized, v27_normalized, v28_normalized, amount_normalized]
  hparams:
    hidden_units: [100, 100, 100]
  data_partition_ratio:
    training: 0.8
    evaluation: 0.2
  training:
    num_steps: 5000

Training the Model

Now, cortex deploy will launch and orchestrate the workloads required for training the model at scale:

...

loss = 0.02041925, step = 4501 (0.703 sec)
loss = 0.012982414, step = 4601 (0.612 sec)
loss = 0.004186087, step = 4701 (0.762 sec)
loss = 0.0035106726, step = 4801 (0.755 sec)
loss = 0.0006830689, step = 4901 (0.640 sec)

accuracy = 0.9965
accuracy_baseline = 0.995
auc = 0.92329776
auc_precision_recall = 0.60935247
precision = 0.60714287
recall = 0.85

99.7% accuracy looks good, but precision and recall are too low. This is because the dataset is highly unbalanced: almost every training sample is a genuine transaction, so the model can achieve near-perfect accuracy by leaning heavily toward the genuine class (note the accuracy_baseline of 99.5%, which is the score of always predicting the majority class on the evaluation split). Therefore, this model isn’t actually useful for fraud detection.
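
The same effect is easy to verify on the full dataset: a “model” that labels every transaction as genuine scores almost as well as ours:

# the trivial all-genuine strategy is wrong only on the 492 fraudulent rows
frauds, total = 492, 284807
print("all-genuine accuracy: {:.4%}".format(1 - frauds / total))  # 99.8273%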

Adding a Weight Column

We can address this problem in several ways: upsampling, downsampling, or weighting. Upsampling means duplicating samples from the rare class until the ratio of the classes is closer to 1, but this could get expensive in terms of compute resources and storage if the dataset is large. Alternatively, downsampling eliminates samples from the more common class until the ratio is closer to 1, but this could shrink our training dataset significantly. Weighting scales a training sample’s impact on the loss function based on its class. In this application, it will tell the model to take fraudulent transactions a lot more seriously.

We’ll opt for weighting and create a new data column containing a weight for each sample. The weight for the fraud class will be the fraction of the dataset that is genuine, and the weight for the genuine class will be the fraction that is fraudulent. So if 99% of transactions are genuine, a fraudulent sample gets a weight 99 times greater than a genuine sample.
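
With this dataset’s actual class distribution, the math works out as follows (a quick check; 492 of 284,807 transactions are fraudulent):

# weight for fraud samples = genuine fraction; weight for genuine samples = fraud fraction
frauds, total = 492, 284807
fraud_weight = (total - frauds) / total  # ~0.99827
genuine_weight = frauds / total          # ~0.00173
print(fraud_weight / genuine_weight)     # each fraud sample counts ~578x more in the loss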

We can implement this using PySpark. Cortex has a built-in class_distribution aggregation function, as well as support for custom PySpark code, which we’ll use to create the weight column:

def transform_spark(data, columns, args, transformed_column_name):
    import pyspark.sql.functions as F

    # maps each class label to its fraction of the dataset, e.g. {0: 0.998, 1: 0.002}
    distribution = args["class_distribution"]

    # genuine rows (class 0) are weighted by the fraud fraction;
    # fraudulent rows are weighted by the genuine fraction
    return data.withColumn(
        transformed_column_name,
        F.when(data[columns["col"]] == 0, distribution[1]).otherwise(distribution[0]),
    )
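
Before deploying, we can exercise transform_spark locally on a toy DataFrame (the distribution values here are hypothetical stand-ins; in production, Cortex passes the real class_distribution aggregate):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
data = spark.createDataFrame([(0,), (1,), (0,)], ["class"])

# genuine rows (class 0) receive the fraud fraction; fraud rows receive the genuine fraction
weighted = transform_spark(
    data,
    columns={"col": "class"},
    args={"class_distribution": {0: 0.998, 1: 0.002}},
    transformed_column_name="weight_column",
)
weighted.show()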

Below is the configuration for the class distribution aggregate, the custom PySpark transformer, and the transformed column. Cortex will automatically execute the aggregation and transformation workloads based on this configuration:

- kind: aggregate
  name: class_distribution
  aggregator: cortex.class_distribution_int
  inputs:
    columns:
      col: class

- kind: transformer
  name: weight
  path: weight.py
  inputs:
    columns:
      col: INT_COLUMN
    args:
      class_distribution: {INT: FLOAT}
  output_type: FLOAT_COLUMN

- kind: transformed_column
  name: weight_column
  transformer: weight
  inputs:
    columns:
      col: class
    args:
      class_distribution: class_distribution

Updating the Model

We can use the weight column in our model by making a small modification to the TensorFlow estimator implementation: we pass the weight_column argument to DNNClassifier, which tells the estimator to multiply each example’s contribution to the loss by its weight:

import tensorflow as tf


def create_estimator(run_config, model_config):
    feature_columns = [
        tf.feature_column.numeric_column(feature_column["name"])
        for feature_column in model_config["feature_columns"]
    ]

    return tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=model_config["hparams"]["hidden_units"],
        n_classes=2,
        weight_column="weight_column",
        config=run_config,
    )

And a small update to the model configuration: we add the training_columns field:

- kind: model
  name: dnn
  path: dnn.py
  type: classification
  target_column: class
  feature_columns:
    [time_normalized, v1_normalized, v2_normalized, v3_normalized, v4_normalized, v5_normalized, v6_normalized, v7_normalized, v8_normalized, v9_normalized, v10_normalized, v11_normalized, v12_normalized, v13_normalized, v14_normalized, v15_normalized, v16_normalized, v17_normalized, v18_normalized, v19_normalized, v20_normalized, v21_normalized, v22_normalized, v23_normalized, v24_normalized, v25_normalized, v26_normalized, v27_normalized, v28_normalized, amount_normalized]
  training_columns: [weight_column]
  hparams:
    hidden_units: [100, 100, 100]
  data_partition_ratio:
    training: 0.8
    evaluation: 0.2
  training:
    num_steps: 5000

We add weight_column to the training_columns list in the model configuration because it is not a feature column; it will only be used during training, not inference.

Retraining the Model

Once we’ve made these modifications, cortex deploy will create the weight column and retrain the model. The feature data won’t have to be ingested or normalized again because Cortex caches as much as possible.

...

loss = 0.01819101, step = 4501 (0.491 sec)
loss = 0.010051587, step = 4601 (0.461 sec)
loss = 0.008202191, step = 4701 (0.491 sec)
loss = 0.0047955187, step = 4801 (0.716 sec)
loss = 0.002248696, step = 4901 (0.464 sec)

accuracy = 0.9960093
accuracy_baseline = 0.74384576
auc = 0.9978455
auc_precision_recall = 0.99920136
precision = 0.9946642
recall = 1.0

These metrics look a lot better! For comparison, precision was 0.61 and recall was 0.85 before adding the weight column. Now our model is much more useful for detecting fraudulent transactions in production.

Configuring Prediction Serving

We can make the model available as a live web service that can serve real-time predictions using the configuration below:

- kind: api
  name: fraud
  model_name: dnn
  compute:
    replicas: 3

After deploying again, we can test the API with the following prediction request:

$ curl -k \
       -X POST \
       -H "Content-Type: application/json" \
       -d '{ "samples": [ { "amount": 10, "time": 123, "v1": 1.0, ...} ] }' \
       https://abc.amazonaws.com/fraud

{"classification_predictions": [{"class_ids":["1"]}]}

Cortex automatically applies the same data transformations to prediction requests that were used during training.
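
The same request from Python, if that’s more convenient (the URL is the placeholder endpoint from the curl example, and v2-v28 are omitted just as above):

import requests

sample = {"amount": 10, "time": 123, "v1": 1.0}  # v2-v28 omitted for brevity
response = requests.post(
    "https://abc.amazonaws.com/fraud",  # placeholder endpoint from the curl example
    json={"samples": [sample]},
    verify=False,  # skip TLS verification, like curl's -k flag
)
print(response.json())  # e.g. {"classification_predictions": [{"class_ids": ["1"]}]}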

Running this Yourself

Cortex is open source and free to download. The full fraud detection example code can be found in the Cortex repository on GitHub (github.com/cortexlabs/cortex).