
Analyzing Customer Bases with R

  • Writer: Andy Chen
  • Jan 17, 2020
  • 4 min read

Updated: Mar 2, 2020

"Who" are my customers? The answer to this question is long coveted by marketers and retailers around the world, for it is the foundation to all marketing strategies. With the assistance of the RFM framework and the Pareto/NBD model, we can obtain a rough idea of our customer profile and further segment customers into cohorts. Moreover, we can develop personalized and dynamic marketing plans using these models, making sure we create a unique experience for our customers. Hoping to refine my customer analysis skills, I wanted to practice these methods and note down my thought process behind the analyses.


The data I used for the analyses is a Kaggle dataset with all the transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based online retailer, containing over 540K rows of transaction data. The data can be accessed here.


RFM Model

The RFM model stands for recency, frequency, and monetary value, intuitively measuring different aspects of each customer's behavior. One thing that's great about RFM is its simplicity. All you need is a transaction log labeled with the customers' IDs, a few lines of code, and voilà, there's your RFM model. Naturally, creating an RFM model is fairly easy using R; there's even a package created specifically for this purpose. However, I prefer building the model manually in order to precisely manipulate my variables of choice. This can be easily implemented with the powerful dplyr package.


The first step of our analysis is to reformat the dataframe into a customer-indexed table. For the three RFM measures, I wanted to know when each customer last purchased, how often they purchased, and the total amount they paid, respectively. Next, I gave each customer their RFM rankings to carry out further comparisons.
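Something along these lines, as a minimal sketch with dplyr: the column names come from the Kaggle dataset, while the reference date (the day after the final transaction) and the quintile rankings are my own choices, and I'm assuming InvoiceDate has already been parsed into a date class.

```r
library(dplyr)

# Reference date for recency: the day after the last transaction in the data
analysis_date <- as.Date("2011-12-10")

rfm <- transactions %>%
  filter(!is.na(CustomerID), Quantity > 0) %>%
  mutate(Amount = Quantity * UnitPrice) %>%
  group_by(CustomerID) %>%
  summarise(
    recency   = as.numeric(analysis_date - max(as.Date(InvoiceDate))),  # days since last purchase
    frequency = n_distinct(InvoiceNo),                                  # number of purchases
    monetary  = sum(Amount)                                             # total amount paid
  ) %>%
  mutate(
    r_rank = ntile(-recency, 5),   # 5 = most recent
    f_rank = ntile(frequency, 5),  # 5 = most frequent
    m_rank = ntile(monetary, 5)    # 5 = highest spend
  )
```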

I also created a simple function to calculate RFM scores based on the RFM rankings, which could be useful depending on our needs. For instance, we could create several breakpoints for each facet and use them to filter customers, extracting the customers that meet our score requirements.
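A possible version of that helper, continuing the sketch above: it treats the three rankings as digits of a combined score, and the breakpoints in the filtering example are arbitrary.

```r
# Combine the three rankings into a single three-digit score, e.g. 545
rfm_score <- function(r, f, m) {
  r * 100 + f * 10 + m
}

rfm <- rfm %>%
  mutate(score = rfm_score(r_rank, f_rank, m_rank))

# Example: extract only the customers above chosen breakpoints on each facet
vips <- rfm %>%
  filter(r_rank >= 4, f_rank >= 4, m_rank >= 3)
```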


Last but not least is the segmentation of our customers. This can be performed in many ways; in this case, I applied k-means clustering to make things easier for me, splitting the customers into five groups based on their RFM scores. By labeling each customer with the cluster results, we can compute summary statistics for each cluster, which can then be used to design strategies and campaigns for different customer segments.
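A sketch of that clustering step: clustering on the three scaled rankings (rather than the combined score) and the fixed seed are my own choices here, made so each facet carries equal weight.

```r
set.seed(42)  # k-means is sensitive to initialization, so fix the seed

km <- kmeans(scale(rfm[, c("r_rank", "f_rank", "m_rank")]), centers = 5)
rfm$cluster <- km$cluster

# Summary statistics per cluster, a starting point for targeted campaigns
rfm %>%
  group_by(cluster) %>%
  summarise(
    customers      = n(),
    mean_recency   = mean(recency),
    mean_frequency = mean(frequency),
    mean_monetary  = mean(monetary)
  )
```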

RFM is very powerful for identifying whether a customer is active, sleeping, or churning, and for picking out the most valuable customers. We can also add variables to provide a more complete view of our customers. Nevertheless, RFM still has many limitations, such as its inability to consider the "length" of each customer's lifetime. If we want to extract more information from our data, we'll need assistance from another model.


BG/NBD Model

In the Marketing Analytics course this past semester, I was introduced to the powerful Pareto/NBD model. First established in 1987, the Pareto/NBD model was developed to describe customers' repeat-buying behaviors and deal with some of the toughest questions in marketing. Which customers can be considered to have churned? How many transactions can be expected next month? How many customers will still be active a year from now? You get the idea. However, despite the respect it has earned among marketers, its parameters are computationally expensive to estimate, which has given me several headaches. But fear not, an alternative model, the beta-geometric/NBD (BG/NBD), comes to the rescue.


Explaining the models in statistical terms is much more complicated, but basically, both models follow these assumptions:

1. Customers are "alive" for a period of time, and then become permanently inactive at some point in their "lifetime".

2. While alive, a customer randomly purchases around his or her mean transaction rate. The total number of transactions follows a Poisson distribution, which is equivalent to assuming that the time between transactions is distributed exponentially.

3. Both the transaction rates and dropout rates vary across customers, with the heterogeneity of transaction rates following a gamma distribution.

The only difference between Pareto/NBD and BG/NBD lies in how and when customers become inactive. While Pareto/NBD assumes that dropout can occur at any point in time, BG/NBD supposes that dropout occurs immediately after a purchase; that is, after any transaction, a customer has a certain probability of becoming inactive. The number of transactions a customer makes before dropping out therefore follows a geometric distribution, while the heterogeneity of the dropout probability across customers follows a beta distribution. This makes the calculations much easier, giving us more time to come up with strategies rather than waiting for the estimation to finish running.
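In symbols, if p denotes a customer's dropout probability, the BG/NBD dropout process can be written as:

```latex
% Probability of becoming inactive right after the j-th transaction (geometric),
% with the heterogeneity of p across customers captured by a beta distribution
P(\text{inactive after transaction } j) = p\,(1 - p)^{\,j-1},
\qquad p \sim \mathrm{Beta}(a, b)
```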

I'm going to focus on the implementations of the model in this article, but you can find the detailed derivations of the Pareto/NBD here, and the BG/NBD here. Without further ado, let's dig right into the model building using the Buy Till You Die (BTYD) and BTYDplus packages in R.


To start off, we first need to convert our transaction log into a customer-by-sufficient-statistic (CBS) dataframe, which is the required data format for estimating model parameters. The "elog2cbs" function really comes in handy for this purpose. Then, we can estimate the parameters for the model using the "bgnbd.EstimateParameters" function. Normally, we would split the data into two parts: the calibration period for parameter estimation, and the holdout period for model validation. This can be easily done by specifying the "T.cal" parameter, but I'm just going to use the entire dataset and the default arguments here.
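Here's roughly how that looks. The reshaping of the transaction log into the cust/date/sales columns that elog2cbs expects is my own sketch; the weekly time unit is the function's default convention.

```r
library(dplyr)
library(BTYD)
library(BTYDplus)

# Reshape the transaction log into the event-log format BTYDplus expects
elog <- transactions %>%
  filter(!is.na(CustomerID), Quantity > 0) %>%
  transmute(
    cust  = as.character(CustomerID),
    date  = as.Date(InvoiceDate),
    sales = Quantity * UnitPrice
  )

# Convert the event log into a customer-by-sufficient-statistic (CBS) table;
# with no T.cal specified, the whole dataset serves as the calibration period
cbs <- elog2cbs(elog, units = "week")

# Estimate the four BG/NBD parameters (r, alpha, a, b) from x, t.x, and T.cal
params <- bgnbd.EstimateParameters(cbs[, c("x", "t.x", "T.cal")])
```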

After obtaining our parameters, our model can be used for many purposes, including calculating each customer's survival rate at the end of the calibration period, as well as their expected number of transactions over a given period of time. The timeframe of interest can be adjusted by tweaking the "T.star" parameter. Pretty neat, huh? For the finishing touch, just merge the results with the original dataframe, and enjoy analyzing your customers!
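Continuing the sketch above; the 12-week horizon for "T.star" is just an example value.

```r
# Probability that each customer is still "alive" at the end of calibration
cbs$p_alive <- bgnbd.PAlive(params, cbs$x, cbs$t.x, cbs$T.cal)

# Expected number of transactions per customer over the next 12 weeks
cbs$exp_trans <- bgnbd.ConditionalExpectedTransactions(
  params, T.star = 12, x = cbs$x, t.x = cbs$t.x, T.cal = cbs$T.cal
)

# Merge back with the RFM table for a combined customer view
customers <- rfm %>%
  mutate(cust = as.character(CustomerID)) %>%
  inner_join(cbs, by = "cust")
```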

Conclusion

The code I demonstrated above merely provides an outline for both of the methods, but hopefully this article will be helpful if I ever need a quick refresher in the future. Both RFM and the BG/NBD model are great techniques to help us analyze our customers, and to top that off, they are relatively simple to execute in R. Nonetheless, at the end of the day, what matters most is how we utilize the results of our models to create business value. Only by identifying the actual "goals" of the analysis can we truly power our strategies with data and eventually win over the hearts of our customers.
