RFM Segmentation and Analysis with Python


python data scientist data analysis

A good understanding of the customer is key to a successful business. Not all customers are created equally. Some (like me), might want the cheapest/free offer, others perhaps willing to pay more to get a premium service, and some might still doubt taking our full offer. Some might be happy with the product and thus keep using it, while others may be on the verge of churning due to various reasons. Different customer needs different actions and treatments, and segmentation is one way to explore it.

Customer segmentation involves grouping customers based on shared characteristics. We could do segmentation based on various aspects, such as demographic information (such as age, and gender), geographic (location), or even psychographics (application usage behavior). In this post, I’ll talk about RFM Segmentation, a segmentation based on Recency, Frequency, and Monetary (normally of transactions) from a user. This segmentation is quite powerful because RFM segmentation is directly linked to customer purchasing behavior, offering more actionable insights towards our customer.

Data Loading

First, let’s talk about the data. For this experiment, I’ll be using a sample online retail dataset that I got from Kaggle. This is transactional data per user across several countries. Here, I put the data into ../data folder. Since we are interested in only valid data, after loading the dataset, I’ll only remove all rows that contain NULL values and we will not doing any further data cleaning.

import pandas as pd
df = pd.read_csv('../data/data.csv', encoding = "ISO-8859-1", parse_dates=['InvoiceDate'])
df = df[~df.isna()].reset_index(drop=True)
print(df.shape)
(541909, 8)

In the end, we have around 542k transactions.

df.head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

From this data, the next step that we need to do is to construct Recency, Frequency, and Monetary features from each user.

Recency

In order to get recency per user, we first need to find out when the last transaction date from the data. We will use that date (the latest date) as the base date to find how recently the customer made a transaction.

MAX_DATE = df['InvoiceDate'].max().date()
MAX_DATE
datetime.date(2011, 12, 9)

After we have the base date, then we could aggregate per user, find the date of their last transaction, and subtract our base date from the date of their last transaction to get their recency. Here, the lower the recency is the better.

s_recency = df.groupby('CustomerID')['InvoiceDate'].max().apply(lambda x: (MAX_DATE - x.date()).days)
s_recency.head()
CustomerID
12346.0    325
12347.0      2
12348.0     75
12349.0     18
12350.0    310
Name: InvoiceDate, dtype: int64

Frequency

For frequency, all we need to do is count each customer’s total invoice ID. In these metrics, the higher the value, the more valuable the customers are (since they make frequent purchases).

s_frequency = df.groupby('CustomerID')['InvoiceNo'].nunique()
s_frequency.head()
CustomerID
12346.0    2
12347.0    7
12348.0    4
12349.0    1
12350.0    1
Name: InvoiceNo, dtype: int64

Monetary

Lastly, for monetary, we first need to find the total amount per transaction by multiplying the quantity by the price per unit. Then from that, we could aggregate per user and sum the total amount per transaction.

df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
s_monetary = df.groupby('CustomerID')['TotalPrice'].sum()

To make everything easier, let’s combine the three series into a single dataframe.

df_rfm = pd.concat([s_recency, s_frequency, s_monetary], axis=1).reset_index()
# replace the column name
df_rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']
df_rfm.head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

There are two ways to do segmentation using RFM.

  1. Either we create segmentation separately based on the Recency, Frequency, and Monetary values of each user and then combine them, or
  2. We could use the raw Recency, Frequency, and Monetary data and feed it into a clustering algorithm to find clusters automatically.

First, let’s explore the first option. In this option, we make separate segmentation first on each R, F, and M, and then combine them together to get the final segment.

Manual RFM

To make segmentation for each R, F, and M manually, the simplest technique that we could use is by using quartile. By dividing the data using it’s quartile, we will get 4 group/score from each R, F, and M.

Quartile Segmentation

df_rfm['Monetary_Score'] = pd.qcut(
    df_rfm['Monetary'],
    q=4,
    labels=[1, 2, 3, 4]
)
df_rfm[['Monetary', 'Monetary_Score']].head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

df_rfm['Frequency_Score'] = pd.qcut(
    df_rfm['Frequency'],
    q=3,
    labels=[1, 2, 3]
)
df_rfm[['Frequency', 'Frequency_Score']].head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

For frequency, we are using 3 bins only, since there are so many single transactions in the data that when we want to make a cut using 4 bins, the minimum and the first quartile have the same value (i.e. 1). This means more than 25% of our transactions are only one, therefore we have duplicate edges.

On Recency, we will score it in reverse, since for Recency, the lower the number is the better. We want to asociate group 4 with the lowest recency, highest monetary and higest frequency.

df_rfm['Recency_Score'] = pd.qcut(
    df_rfm['Recency'], 
    q=4, 
    labels=[4, 3, 2, 1]
)
df_rfm[['Recency', 'Recency_Score']].head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

df_rfm.head()[['CustomerID', 'Monetary_Score', 'Frequency_Score', 'Recency_Score']]
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

RFM Segments Classification

After we assign a group/score to our customer based on each of the R, F, and M attributes, the final step is to make segments based on this. There are many ways to make segments for RFM, and it’s more of an art than a science. Normally it depends on business needs and many other factors. Here, we will define 7 segments:

  1. Big Fish, our most valuable customer. Recent, frequent, and high-spending individuals. They contribute significantly to revenue and are crucial for sustained business success.
  2. Loyal Customer, customers that keep coming back frequently and have made recent purchases, even though the spending value is less than Big Fish.
  3. Regular Customer, customers with a balance across recency, frequency, and monetary aspects. They are consistent buyers who contribute steadily to the business.
  4. Infrequent Customer, customers who make sporadic purchases, showing lower engagement and frequency.
  5. Lost Customer, has not engaged recently, has made few purchases, and contributes minimally to revenue.
  6. Can’t Lose, customers who used to have frequent and high-value purchases but not recently.
  7. About to Churn, customers who might be considering leaving, with the last purchase a long time ago and it’s not a high-value purchase.

Based on the business definition above, then we could create a function that will return the category of a user based on its R, F, and M scores.

def make_category(R,F,M):
    if R == 4 and F == 3 and M >= 3: return 'Big Fish'
    elif R >= 3 and F >= 2: return 'Regular Customer'
    elif R >= 3 and F <= 1: return 'Seasonal/New Customer'
    elif R == 2 and F >= 2 and M >= 3: return "Can't Lose"
    elif R == 2: return 'About to Churn'
    elif R == 1: return 'Lost Customer'
    else: return '??'

df_rfm['category'] = df_rfm.apply(
    lambda x: make_category(
        x['Recency_Score'], 
        x['Frequency_Score'], 
        x['Monetary_Score']
    ),
    axis=1
)
df_rfm.head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

We could aggregate the category to find out the overall distribution of our customers.

import seaborn as sns
df_rfm_cat = df_rfm.groupby('category').size().to_frame('total').reset_index()
sns.barplot(data=df_rfm_cat, x='total', y='category')
<Axes: xlabel='total', ylabel='category'>

png

Unsupervised Learning RFM

Apart from manually quartile to segment the data, we could use another technique such as clustering to automatically find clusters from the data. Then, we could try to derive the characteristics of each segment based on the clustered data. In this simple experiment, we will use the good ol’ day K-Means to find the segment.

Data Scaling

Since the data has a different range (i.e. price in hundred thousand, but frequency in 1-100), it’s a good idea to scale the data so that when doing distance calculation, one variable doesn’t hugely impact the calculation. We could use Z-score normalization to transform the data to have a mean of 0 and a standard deviation of 1, so all the variables has the same range.

from sklearn.preprocessing import StandardScaler
scaler = {} # process one by one
for key in ('Monetary', 'Recency', 'Frequency'):
    scaler[key] = StandardScaler()
    df_rfm[f'{key}_Scaled'] = scaler[key].fit_transform(df_rfm[[key]])
df_rfm[['Monetary', 'Recency', 'Frequency', 'Monetary_Scaled', 'Frequency_Scaled', 'Recency_Scaled']].head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Find the best K

After the data is ready to be fed into our model, the next part is to find the best parameter. Since we are using vanilla KMeans, the most important parameter we need to choose is the number of clusters we want ($k$) since this algorithm can’t detect it automatically. One way we could find the best cluster is by a method called elbow method that are available in the package from yellowbrick

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans(random_state=5, n_init='auto')
visualizer = KElbowVisualizer(model, k=(2, 20))
X = df_rfm[['Monetary_Scaled', 'Frequency_Scaled', 'Recency_Scaled']]
visualizer.fit(X)
             estimator=KMeans(n_clusters=19, n_init=&#x27;auto&#x27;, random_state=5),
             k=(2, 20))</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class="sk-container" hidden><div class="sk-item sk-dashed-wrapped"><div class="sk-label-container"><div class="sk-label sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-1" type="checkbox" ><label for="sk-estimator-id-1" class="sk-toggleable__label sk-toggleable__label-arrow">KElbowVisualizer</label><div class="sk-toggleable__content"><pre>KElbowVisualizer(ax=&lt;Axes: &gt;,
             estimator=KMeans(n_clusters=19, n_init=&#x27;auto&#x27;, random_state=5),
             k=(2, 20))</pre></div></div></div><div class="sk-parallel"><div class="sk-parallel-item"><div class="sk-item"><div class="sk-label-container"><div class="sk-label sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-2" type="checkbox" ><label for="sk-estimator-id-2" class="sk-toggleable__label sk-toggleable__label-arrow">estimator: KMeans</label><div class="sk-toggleable__content"><pre>KMeans(n_clusters=19, n_init=&#x27;auto&#x27;, random_state=5)</pre></div></div></div><div class="sk-serial"><div class="sk-item"><div class="sk-estimator sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-3" type="checkbox" ><label for="sk-estimator-id-3" class="sk-toggleable__label sk-toggleable__label-arrow">KMeans</label><div class="sk-toggleable__content"><pre>KMeans(n_clusters=19, n_init=&#x27;auto&#x27;, random_state=5)</pre></div></div></div></div></div></div></div></div></div></div>

png

Based on the visualization above, it seems the most optimum K (given by the black dashed line) is around 8. So let’s settle with that.

Modeling

model = KMeans(8, random_state=5, n_init='auto')
df_rfm['cluster'] = model.fit_predict(X)
cluster_size = df_rfm.groupby('cluster').size()
cluster_size
cluster
0    2284
1     767
2       3
3     584
4     101
5       3
6     612
7      18
dtype: int64

It seems from out cluster, we have several cluster that only consists of few data. Since this probably doesn’t have any information, we will drop any clusters with members below 20.

df_rfm_filtered = df_rfm[df_rfm['cluster'].isin(cluster_size[cluster_size >= 20].index)]

Interpreting Result

In order to interpret the result, we can plot each cluster based on it’s R, F, and M value. From there, we could figure out what kind of customer that cluster representing.

df_rfm_filtered[['Monetary_Scaled', 'Recency_Scaled', 'Frequency_Scaled', 'cluster']]
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

# we need to convert our data into Tidy dataset first to be able to plot multiple boxplot
tidy_df = pd.melt(df_rfm_filtered, id_vars=['CustomerID', 'cluster'], value_vars=['Monetary_Scaled', 'Recency_Scaled', 'Frequency_Scaled'])
import matplotlib.pyplot as plt
ax = sns.boxplot(x="cluster", y="value", hue="variable", data=tidy_df, palette="Set1")
ax.grid(False)
ax.set(xlabel='Cluster ID', ylabel='Scaled Value')
ax.set(title='R, F, M Value Distribution Per Cluster')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1),  title='Variable')

png

Based on the graph above, we can identify the pattern among the clusters. For example:

  • We could clearly see cluster #4 is our most important customer, with high Monetary and Frequency, and also low Recency (meaning they made a purchase recently).
  • On the contrary, cluster #6 is our churn customer, identified by having a high score in Recency value, which means the last transaction they made was from a long time ago.
  • Cluster #3 is our regular customer, even though the purchasing power isn’t as strong as cluster #4, it still made frequent purchases recently.
  • Cluster #0 is probably our new customer, with recent transactions but with small and infrequent transactions.
  • Lastly, our cluster #1 could be a customer who is about to churn, with a quite high recency score and low frequency and recency (albeit not as low as cluster #6).

Follow Up

So, after we create the segmentation, what could we do next? How could this information help our business? RFM segmentation could help us to give a specific treatment to the right group.

For example, an “about-to-churn” customer might need regular notification about the benefit of our product, or if this customer is new but already about to churn, then we could gather feedback from them, what part of our product needs to improve, etc. For “regular customers”, the ideal path is most likely to increase purchasing power or the total transaction, either by giving the best recommendation or by giving incentives if they purchase in bulk.

This ‘special’ treatment based on the group’s needs is what makes the business thrive. Focus on what our customer needs and want, knowing that each customer (or in this case, each segment) is different.