Cohort Analysis With Python


python data analysis

Ilustration generated with Leonardo AI

Cohort analysis is one of the simple and effective tool to better understand how customer interact in business. We can look whether the customer are coming back, or whether their value are improved over time. A simple tools that should be on every data analytics toolbox. But first, what is a cohort?

What is cohort?

The technical definition of cohort analysis is behavioral analytics that breaks data into relevant groups and observes how their behavior changes over time. For example, a company might group their user based on their first registration, and observe how their interaction with the platform changes day, week, or month.

Usually, cohorts are displayed in a table format, with each row representing a different user group and a column representing time. For example, consider the cohort table of user purchase/transaction below.

Cohort example

Based on the number of rows, we have 13 groups representing users who first purchased at different months. The first groups (i.e. the first row) are the users that joined (have their first transaction) in December 2010. The second group is from January 2011, and so on.

Moving on to the column, the first column represents the percentage of users that still have transactions within the next month after they first purchase. So based on the data above, 38% of users that had their first transaction in December 2010 also had transactions in January 2011.

By looking at this, we can see the pattern for our users’ behavior, for example, whether they tend to come back to our platform a few months after they register, or whether there are a special occasion that makes certain month has higher recurring user compared to other month.

Creating Cohort

We know how valuable a cohort is, the question is how do we make one? For this demo, we will use data from https://archive.ics.uci.edu/dataset/352/online+retail.

This data consists of transactions for an online retailer that includes price and customer ID. The cohort that we will create will be a monthly transaction cohort.

For a cohort, what we need generally falls into three steps:

  1. Get the cohort group for each customer. Since we are building a monthly transaction cohort, the cohort group will be the month of the first transaction from each customer. So if a customer makes a transaction on Nov 2019, Oct 2019, and Dec 2019, their base month will be Nov 2019.
  2. The step of the cohort, which will be the difference between the transaction month and each customer base date. So if a base month for a customer is Nov 2019, then his transaction in Oct 2019 will be categorized as Step 1, the transaction in Nov 2019 will be Step 2.
  3. Pivot the data, based on the base month as the row and the step as the column.

Using Python

Using Python, after we load the data and remove duplicates, we first create an additional column for storing month-only information.

import pandas as pd
df = pd.read_csv(...)
...
df['Month'] = df['InvoiceDate'].dt.to_period('M')

Then we will use groupby along with transform to get the minimum month of each customer. Unlike other normal aggregations, the function will result in a column with the same shape as the inputted DataFrame, with the same group will have the same value.

After having this information, to calculate the step, we could simply subtract the month.

df['FirstMonth'] = df.groupby('CustomerID')['Month'].transform('min')
df['DiffMonth'] = (df['Month']- df['FirstMonth']).apply(lambda x: x.n)

Next, we will aggregate per base group, to count the total unique user and convert the transaction number into percentage.

# aggregate per base and step
df_agg = df.groupby(['FirstMonth', 'Month', 'DiffMonth']).agg(
    total_customers=('CustomerID', 'nunique')
).reset_index()

# get percentage
df_agg['perc_customer'] = df_agg['total_customers'] / \
    df_agg.sort_values(['FirstMonth', 'Month'])\
        .groupby(['FirstMonth'])['total_customers']\
        .transform('first')

Finally, we can pivot this table, and then plot it using our visualization tools of choice.

df_pivot = df_agg.pivot_table(
 index='FirstMonth', 
 columns='DiffMonth', 
 values='perc_customer'
)

import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(
    df_pivot, 
    ax=ax, 
    annot=True, 
    fmt='.0%', 
    cmap='GnBu'
)

Using SQL

The same step could be also applied when we are using SQL. What might be different is the way we get the information for each step.

First, To get the base month of each user, we could use the partition function first_value. In SQLite, we could use:

select first_value(InvoiceDate) over (partition by CustomerID order by InvoiceDate) as FirstInvoiceDate 
from TABLE

Second, For the step or month difference, since each SQL flavor might not have the right function, we could use simple arithmetic to convert the date into the total month (12*year+month) using strftime, and make the calculation based on that.

select
 strftime('%Y', InvoiceDate) * 12 + strftime('%m', InvoiceDate)) 
      - (strftime('%Y', FirstInvoiceDate) * 12 + strftime('%m', FirstInvoiceDate))

Finally, for the pivot and plotting function, it depends on the BI tools we use. When working with Superset, we could create a chart called Pivot Table to display our cohort. Different BI might have different ways to do Pivot. Some even have their own cohort chart, so make sure to check the documentation first.

Cohort in Superset

Analysis

So, given the cohort as above, what insight we could have? The most obvious information that we can look from it is the retention rate from each cohort group. Based on the data, only around 20–30% of customers make a repeat transaction after their first month. There are also some quite significant difference on several cohort group, with March to August having on average, a lower retention rate compare to overall median.

Another possible analysis that we can do is to make cohort based on customer average transaction value instead of the total transaction. This could be as a proxy to a customer lifetime value (CLTV) metrics, and shows whether the customer value for each cohort is going up on each month or not.

We could also look for the effect of our product change using this cohort. If for example, we change our website to be faster in March, and we could see that the next month transaction is significantly higher that the previous one (and assuming no other possible causation exists, such as monthly trend), then we can assume this changed is caused by our change.

Conclusion

As a tools, cohort analysis is an easy but powerful way to get better understanding about our customer and business. This technique can provide insights into user behavior and outcomes that can aid your business growth.