Importing the necessary packages

The data

The data set contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

There are 541909 rows and 8 coulmns in our data

Data Pre-processing

Since the "Quantity" column signifies quantity of orders, it is safe to assume that the quantity can never be negative. We shall now look into different values of the "Quantity" column

We may notice above that the "Quantity" column 10624 rows that have values that are less than 0. We shall now drop these rows and retain only the records for which the "Quantity" is greater than 0.

Similarly we can notice that the "UnitPrice" column has values that are less than 0 for 2 records. Since, any price cannot be less than 0, we shall drop these records.

We notice that the two columns "Description" and "CustomerID" have NULL values.

We notice that just by dropping the CustomerID column to retain only the not NULL values, the NULL values in the description column has also been taken care of.

There are 5192 duplicate rows in the data set and now we shall remove them.

After data-preprocessing, we have 392732 rows and 8 columnns in our dataset

Feature Engineering

We have the columns for "Quantity" and "UnitPrice". Now we shall create a new column for Sales which is just the product of "Quantity" and "UnitPrice"

We notice that the InvoiceDate is of the 'object' datatype. We shall convert it into a datetime object so that it is easier to perform further manipulations

The InvoiceDate is now a datatime datatype. Now we shall create columns for date, year, month, week of the year, day of the week, hour of the day.

After the data manipulation and feature engineering, our data now has 392732 rows and 17 coulumns

After Feature Engineering, we have 392732 rows and 17 columnns in our dataset

Exploratory Data Analysis (EDA)

Count of transactions for each country

From the above chart, we can see that most of the cusomers (349227 which is almost 88% of the customers) are from United Kingdom.

And since the description of the dataset does mention that the data "is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers." So, it makes sense that most of the customers are from the UK.

Count of transactions, quantity of orders, and sales on each day of the year

Insights

From the above time series chart, we can see that there is an increasing trend in the count of orders over the year.
And there is also a seasonal trend in both the quantity of orders placed and the corresponding sales. However, we do notice that there are two peaks - one at the beginning of the year and one at the end. But we do not have enough evidence from the data to determine whether these are genuine data points or outliers. The description of the data says that "The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers." We can assume that some wholesale dealer bought bulk of the items on those particular days.

Similarly, we notice a peak at the end of the year, and since we do not know for sure whether it is an outlier, we shall have it in our analysis for now. If we know more about the data and the domain knowledge, we can make a more educated guess as to whether or not to include the datapoint in our analysis.

Quantity of orders and sales on each month of the year

Insights

Since the data is from 01/12/2010 and 09/12/2011, it seems logical that the number of orders are way more in 2011 as compared the number of orders in the year 2010. From the above two charts, we notice that the count of orders and the sales follow a similar pattern except for the months September and December where relatively less number of orders have produced comparitively more sales. We may infer that high priced good got sold more during these months as compared to other months.

Sales on each week of the year

Insights

We notice that the sales were at its peak during the 49th month of the year. However, as we noticed in the trend charts for the sales above, we found a peak in the last month which could have been on the 49th week.

Sales on each day of the week

Count of orders on each hour of the day

Insights

The sales seem to be peaking on Thursdays followed by high sales on Tuesday. Mondays, Wednesdays, and Fridays seem to having similar sales amounts. Sundays register the lowest sales. There are no sales on Saturday. So we may assume that the retail store does not function on Saturdays. And even with respect to everyday, the count of orders seem to peak around the afternoon and taper at the start and end of the day. We notice that the customer activity in at its highest with respect to placing orders during the day time from around 10 Am to 4 PM.

Count of orders on each session of the day

Most number of orders are placed during the mornings and afternoons.

Top 20 orders based on the description

The item named "White Hanging Heart T-Light Holder" seems to be the highest sold item in the retail store.

Customer segmentation

Methodology

RFM Analysis

In order to perform customer segmentation, we would be using RFM analysis. RFM analysis is a customer segmentation technique that uses the past purchase behavior of the customers to segment them into groups based on their activity.

RFM helps in dividing customers into various clusters to identify customers who are more likely to respond to promotions, marketing campaigns and also for future personalization services.

The RFM stands for

RECENCY (R): Days since last purchase
FREQUENCY (F): Total number of purchases
MONETARY VALUE (M): Total transaction value that a customer spent

The basic idea is to segment customers based on when their last purchase was, how often they’ve purchased in the past, and how much they’ve spent overall.

We will create those 3 customer attributes for each customer.

Recency

Recency denotes how recently a customer has made a purchase. We noticed earlier that the data ranges from 01/12/2010 and 09/12/2011. To calculate the value of Recency, I considered an arbitrary date of 2011-12-31 that is after the date range of the dataset. And from this date I calculated the Recency of an order. I took the most recent order date for each customer and subtracted it from this date. For example, if the result is 120 days, it means this customer’s last purchase was 120 days ago.

Frequency

For calculating the Frequency, we shall take the count of unique purchases made by each customer.

When we are taking the count of the frequency of purchases made by a customer, we need to understand that each customer has a unique customer ID and each order has a unique InvoiceNo. However, each customer could have bought multiple products in one single order.

For example, if we take a random customer with customer ID "12347.0", we may notice that a customer buys multiple products in a single order.

If we notice above, for the customer "12347.0", there have been 182 products that were bought which are spread across just 7 unique InvoiceNo. This shows that the frequency of purchase for the particular customer "12347.0" is 7 (afer removing duplicates) and not 182 (without removing duplicates)

Hence, it is imperative that we remove the duplicates for the "CustomerID" and "InvoiceNo" pair.

Monetary value

For the Monetary Value, we shall calculate the total amount spent by each customer.

For ease of use, we shall convert the Monetary_value to int

Insights

From the above charts,we notice that a the Recency values are Right Skewed denoting that most of the customers visited the online retail website recently which is a good sign for the company. Whereas the Frequency and the Monetary Values are also right skewed which denotes that the number of times that customers visit the website is less and the money they spend on the website is also low.

However, the correlation chart shows that there is a strong correlation between the Frequency and the Monetary Values which is a good sign for the business because when the customers who buy the products frequencly also spend more money it results in the increase in the sales revenue of the company.

Data Discretization for the RFM values

In this section, we shall now discretize the values of Recency, Frequency and Monetary Value to form customer segment groups. Based on this information, each customer will be assigned a score of 1 to 3 for the recency, frequency and monetary values of the purchase data and therefore we should have 3X3X3=27 groups in total. The higher the score, the better is the RFM value.

We shall now consolidate the scores of R, F, M into a singe value by concatinating the columns.

Business Recommendations

The following are the types of customers based on our RFM analysis and the corresponding business recommendations to improve the business.

Champions (5.78% of the customers):

RFM Score: 333

Who They Are: Highly engaged customers who have bought the most recent, the most often, and generated the most revenue. This segment are potential high-valued customers. These customers likely generate a disproportionately high percentage of overall revenues and thus focusing on keeping them happy should be a top priority.
Marketing Strategies: Communications with this group should make them feel valued and appreciated. We need to reward these customers. They can become early adopters for new products or new feature releases of the app and will help promote the brand. Suggest them to share your products with their friends or family using "Referral Program". It will help to increase conversion rates. Focus on loyalty programs for them. These customers have proven to have a higher willingness to pay, so instead of using discount pricing to generate incremental sales, we should focus on value added offers through product recommendations based on previous purchases. Further analyzing their individual preferences and affinities will provide additional opportunities for even more personalized messaging.

Potential Loyalists (20.48%of the customers):

RFM Score: 322, 323, 332, 223, 233

Who They Are: The customers in these RFM groups are our recent customers with average frequency and who spent a good amount.
Marketing Strategies: This group has the potential to become Champions. Thse are highly promising customers and need to taken care of by offering annual or quarterly membership programs to them with additional benefits or loyalty programs or recommend related products to upsell them may help them become our Champions. Loyalty programs are effective for these repeat visitors. Advocacy programs and reviews are also common strategies. Lastly, consider rewarding these customers with Free Shipping or other like benefits. By doing so, they will shop more frequently and for more amount.

Rookies - Newest Customers (13.20% of the customers):

RFM Score: 211, 311

Who They Are: Customers who visited the website recently often, but do not frequent it or spend a lot.
Marketing Strategies: These are the customers who take a look at articles in the products without making any purchases. These customers visit relatively often but only to look at the products and not buy them. Even if they do end up buying, they are only interested in the low cost items. Hence, providing them discounts at a lower cost to encourage their buying is the way to attract these customers. Having clear strategies in place for first time buyers such as triggered welcome emails will pay dividends.

At Risk Customers (0.36% of the customers):

RFM Score: 133, 123

Who They Are: Great past customers who haven't bought in awhile. They are your customers who purchased often and spent big amounts, but haven’t purchased recently.
Marketing Strategies: These are the customers who are in the risk of churning and we must bring them back with relevant promotions. Customers leave for a variety of reasons but one of the major advantages of these customers is that they have frequented the website many times in the past, so we have a good record of the product they are interested in. Hence, We need to send them personalized reactivation campaigns to reconnect, and offer renewals discounts and suggest helpful products to encourage more recent visits to the website. Suggest "Referral Program" and "Annual Membership Program" to prevent these customers from churning as they were frequent and high spenders in the past.

Average Customers (20.99% of the customers):

RFM Score: 212

Who They Are:These customers have purchased fairly recently and their spending is average, but the purchase frequency is relatively low.
Marketing Strategies: Considering the huge customer base, this group can have large potential values and we need to incentivize them to spend more. For example, we can reward them with special discount offers, free shipping or other benefits.

New Big Spenders (5.83% of the customers):

RFM Score: 313, 312

Who They Are:These customers are the ones who have a high overall RFM score based on most recent visits and high spending but are not frequent shoppers.
Marketing Strategies: We need to start building relationships with these customers by providing onboarding support and special offers to increase the frequency of their visits. This segment consists of potentially high-valued customers because they spend more and also have visited the site recently. Since they have purchased very recently with high spending amount, as long as we increase their frequency, they will become the best customers with high loyalty. They need to be cultivated by the brand to become long-term loyalists. We can welcome them with personalized email with a coupon code to encourage repeat purchases. We can also enroll these customers into a loyalty or membership program that rewards order frequency.

Can’t Lose Them (6.75% of the customers):

RFM Score: 222

Who They Are:These are customers who used to visit and purchase quite often, and spend a decent amount of money.
Marketing Strategies: The main aim with respect to these customers is to keep them coming back to the website for more purchases. We may run surveys to find out what could be improved about their experience and avoid losing them to a competitor by providing timely offers on relevant products and free-shipping benefits.

Churned Customers (12% of the customers):

RFM Score: 111

Who They Are:They probably bought once or very few times and they bought for very less amount.
Marketing Strategies: Give lost customers an incentive to come back. We can send this campaign directly to inactive customers with tagline phrases like "Haven’t seen you in a while", "Let’s catch up", "Come back and receive [promotion details]" which would encourage them to come back. Another way is to address common complaints through a social media campaign and create visibility of a productive change in process.

Overall Insights

Based on the above percentages of customer segments, we can say that the business has a huge potential for growth because the Average Customers and Potential Loyalists make up almost 40% of the customers. With the right marketing campaigns as discussed above, these customers can be encouraged to make more visits and purchases on the website.

K-means Clustering

Removing skewness in the data

One of the constraints of using K-Means algorithm is that there should be no skewness in the data. Hence I considered the following transformations to remove the skewness,

  1. Log Transformation
  2. Sqrt Transformation
  3. Box-Cox Transformation
  4. Cube-Root Transformation

I found that for "Recency" and "Frequency", the Box-CoX transformation seemed to work well in elimimating skewness. For the "Monetary Value", the cube root transformation, worked better

Based on the comparison above, we can see that the Skewness is much less in the rfm_skew_adj data which is transformed as compared to the rfm_df_copy data which is the raw values for "Recency", "Frequency", and "Monetary Value"

Standard scalar

Customer Segmentation using K-Means

Insights

Based on the above graph, we can see that the SSE(Error Sum of Squares) is reducing as we incraase the number of clusters. It is up to us to choose the number of clusters. We shall build two models now - one with 3 clusters and another with 8 clusters.

Building a K-means clustering model with 3 clusters

Building a K-means clustering model with 8 clusters

Insights

With the help of domain knowledge, we would be able to interpret the clusters better to know which customers we need to target for specific use cases and promotional deals.

Conclusion

We analyzed the online retail dataset to perform customer segmentation. We segmented the customers into 8 major categories as Champions, Potential Loyalists, Rookies - Newest Customers, At Risk Customers, Average Customers, New Big Spenders, Can’t Lose Them, Churned Customers and discussed the potential marketing strategies for each of the customer segments. We also implemented K-Means clustering ML algorithm that could be used on top of the RFM scores to perform customer segmentation.

In conclusion, by coupling our analysis with the insights from some domain knowledge experts, we would be able to make the best out of our data.

References: I referred various articles on customer segmentation from websites such as Medium, Analytics Vidhya, Stack Overflow and used the FB Prophet documentation for the time series forecasting.

I enjoyed working on this case study. Thank you very much for the opportunity!

Kishor Kumar Sridhar

kishorkumarsridhar@gmail.com