Importing the necessary packages
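A minimal sketch of the imports used throughout this notebook; the exact set is an assumption based on the libraries referenced later in the analysis (note that Prophet's import path is `fbprophet` in older releases and `prophet` in newer ones).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Prophet's import path depends on the installed version:
# older releases use `fbprophet`, newer ones use `prophet`.
from prophet import Prophet
```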

The data

The data set contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

There are 541909 rows and 8 columns in our data.
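A minimal loading sketch; the file name and encoding are assumptions (the UCI "Online Retail" extract is commonly distributed as a CSV encoded in ISO-8859-1).

```python
# Load the transactions; the file name and encoding are assumptions.
df = pd.read_csv("OnlineRetail.csv", encoding="ISO-8859-1")
print(df.shape)  # expected: (541909, 8)
```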

Data Pre-processing

Since the "Quantity" column signifies quantity of orders, it is safe to assume that the quantity can never be negative. We shall now look into different values of the "Quantity" column

We may notice above that the "Quantity" column has 10624 rows with values less than 0. We shall now drop these rows and retain only the records for which the "Quantity" is greater than 0.

Similarly, we notice that the "UnitPrice" column has values less than 0 for 2 records. Since a price cannot be negative, we shall drop these records as well.
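A sketch of both filters described above, assuming the DataFrame is named `df` as in the loading step:

```python
# Keep only records with a positive quantity ...
print((df["Quantity"] < 0).sum())   # 10624 rows reported above
df = df[df["Quantity"] > 0]

# ... and a non-negative unit price.
print((df["UnitPrice"] < 0).sum())  # 2 rows reported above
df = df[df["UnitPrice"] >= 0]
```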

We notice that the two columns "Description" and "CustomerID" have NULL values.

We notice that just by dropping the rows with NULL "CustomerID" values, the NULL values in the "Description" column have also been taken care of.

There are 5192 duplicate rows in the data set; we shall now remove them.
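A sketch of the NULL handling and de-duplication, continuing with the same `df`:

```python
# Dropping rows with a missing CustomerID also clears the NULL
# Descriptions, since those rows overlap.
df = df.dropna(subset=["CustomerID"])

# Remove exact duplicate transactions.
print(df.duplicated().sum())  # 5192 duplicates reported above
df = df.drop_duplicates()
print(df.shape)               # expected: (392732, 8)
```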

After data pre-processing, we have 392732 rows and 8 columns in our dataset.

Feature Engineering

We have the columns "Quantity" and "UnitPrice". We shall now create a new "Sales" column, which is simply the product of the two.

We notice that the InvoiceDate is of the 'object' datatype. We shall convert it into a datetime object so that it is easier to perform further manipulations.

The InvoiceDate is now a datetime datatype. Now we shall create columns for the date, year, month, week of the year, day of the week, and hour of the day.
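A sketch of the derived columns; the exact column names, and any additional features needed to reach the 17 columns reported below, are assumptions:

```python
# Sales = Quantity x UnitPrice
df["Sales"] = df["Quantity"] * df["UnitPrice"]

# Parse InvoiceDate and derive calendar features.
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
df["Date"] = df["InvoiceDate"].dt.date
df["Year"] = df["InvoiceDate"].dt.year
df["Month"] = df["InvoiceDate"].dt.month
df["Week"] = df["InvoiceDate"].dt.isocalendar().week
df["DayOfWeek"] = df["InvoiceDate"].dt.day_name()
df["Hour"] = df["InvoiceDate"].dt.hour
```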

After the data manipulation and feature engineering, we have 392732 rows and 17 columns in our dataset.

Exploratory Data Analysis (EDA)

Count of transactions for each country

From the above chart, we can see that most of the transactions (349227, which is almost 89% of them) come from customers in the United Kingdom.

And since the description of the dataset mentions that the data "is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.", it makes sense that most of the customers are from the UK.

Count of transactions, quantity of orders, and sales on each day of the year

Insights

From the above time series chart, we can see an increasing trend in the count of orders over the year.
There is also a seasonal pattern in both the quantity of orders placed and the corresponding sales. However, we notice two peaks, one at the beginning of the year and one at the end, and we do not have enough evidence from the data to determine whether these are genuine data points or outliers. The description of the data says that "The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers." We can assume that some wholesale dealers bought items in bulk on those particular days.

Similarly, we notice a peak at the end of the year, and since we do not know for sure whether it is an outlier, we shall keep it in our analysis for now. With more knowledge of the data and the domain, we could make a more educated decision as to whether or not to include the data point in our analysis.

Quantity of orders and sales on each month of the year

Insights

Since the data runs from 01/12/2010 to 09/12/2011, it seems logical that the number of orders is far higher in 2011 than in 2010. From the above two charts, we notice that the count of orders and the sales follow a similar pattern, except for the months of September and December, where a relatively small number of orders produced comparatively more sales. We may infer that higher-priced goods sold more during these months than in other months.

Sales on each week of the year

Insights

We notice that the sales were at their peak during the 49th week of the year. This is consistent with the trend charts for the sales above, where we found a peak in the last month that likely falls in the 49th week.

Sales on each day of the week

Count of orders on each hour of the day

Insights

The sales seem to peak on Thursdays, followed by high sales on Tuesdays. Mondays, Wednesdays, and Fridays seem to have similar sales amounts. Sundays register the lowest sales, and there are no sales at all on Saturdays, so we may assume that the retail store does not operate on Saturdays. Within each day, the count of orders peaks around the afternoon and tapers off at the start and end of the day; customer activity with respect to placing orders is at its highest from around 10 AM to 4 PM.

Count of orders on each session of the day

Most orders are placed during the mornings and afternoons.

Top 20 items based on the description

The item named "White Hanging Heart T-Light Holder" seems to be the highest-selling item in the retail store.

Sales forecasting using FB Prophet time series modeling

FB Prophet is a procedure for forecasting time series data based on a Generalized Additive Model (GAM), in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.
The systematic component of an additive model is the arithmetic sum of the individual effects of the predictors; Prophet combines seasonality, trend, and holidays.

y(t) = g(t) + s(t) + h(t) + ε(t)

where
g(t) = trend component (the trend function, which models non-periodic changes in the value of the time series)
s(t) = seasonal component (represents periodic changes, e.g., weekly and yearly seasonality)
h(t) = holiday component (represents the effects of holidays)
ε(t) = remainder component (represents any idiosyncratic changes or errors)

For running the FB Prophet model, it is mandatory to rename the date column to 'ds' and the sales column to 'y'.
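A sketch of the Prophet input preparation. The daily aggregation is an assumption, consistent with the 305 daily observations reported below:

```python
# Aggregate to total sales per day and rename to Prophet's schema.
daily = df.groupby("Date")["Sales"].sum().reset_index()
daily.columns = ["ds", "y"]
daily["ds"] = pd.to_datetime(daily["ds"])
print(len(daily), daily["y"].mean(), daily["y"].std())
```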

The time series ranges from 2010-12-01 until 2011-12-09; it has 305 observations, a mean of 29138.38, and a standard deviation of 17834.94.

We can notice that there is one huge spike on the last day of the dataset, corresponding to the date '2011-12-09', with the value 184329.66. Now let us see how the time series chart looks without that value.

Assumption

We notice that if we just remove that one data point, the time series chart looks much better. Although we do not know much beyond the description of the dataset, for now let us assume that this point is an outlier and remove it from our data.
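A sketch of one way to drop the spike. Note that the observation count below falls from 305 to 303, so a second extreme point appears to have been removed as well; the cutoff used here is an assumption:

```python
# Drop extreme daily totals; the threshold is an assumption chosen
# well above typical daily sales.
daily = daily[daily["y"] < 100000].reset_index(drop=True)
print(len(daily))  # 303 observations reported below
```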

The time series now ranges from 2010-12-01 until 2011-12-09; it has 303 observations, a mean of 28627.89, and a standard deviation of 15472.15.

We can use statsmodels to perform a decomposition of this time series. The decomposition of time series is a statistical task that deconstructs a time series into several components, each representing one of the underlying categories of patterns. With statsmodels we will be able to see the trend, seasonal, and residual components of our data.

Decomposing the time series
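A minimal decomposition sketch using statsmodels; the weekly period of 7 is an assumption for daily retail data, and the missing Saturdays are interpolated so the series has a regular daily frequency:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Reindex to a regular daily frequency (Saturdays are missing) and
# interpolate the gaps before decomposing.
series = daily.set_index("ds")["y"].asfreq("D").interpolate()
result = seasonal_decompose(series, model="additive", period=7)
result.plot()
plt.show()
```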

Inference

Trend Component:

We find that there is an overall upward trend in the data.

Seasonal Component:

We can see that there is definite seasonality in the data, causing the sales to fluctuate in a periodic manner.

Stationary and Seasonality Effect

Above, we can see that the data is somewhat non-stationary, as there is a slightly increasing trend toward the end, and it is definitely seasonal given the regular fluctuations at fixed time intervals. By stationarity, we mean that the mean, variance, and covariance should remain constant if we divide the data into segments with respect to time; seasonality means the data follows the same pattern when segregated by time intervals. In general, while working with time series data like this, we would need to make it stationary and non-seasonal. However, FB Prophet is capable of handling this type of data, especially the seasonal component, so we need not worry about that part of the preprocessing.
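As an illustrative addition (not part of the original analysis), an Augmented Dickey-Fuller test is a quick way to check stationarity formally:

```python
from statsmodels.tsa.stattools import adfuller

# H0: the series has a unit root (is non-stationary).
# A p-value above 0.05 fails to reject non-stationarity.
stat, pvalue, *_ = adfuller(daily["y"])
print(f"ADF statistic: {stat:.3f}, p-value: {pvalue:.3f}")
```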

There are two types of seasonality in any data.
One is additive, which can be considered as the result of adding numbers. This type of data tends to show a linear trend.
The other is multiplicative, which can be considered as the result of a compounding effect with percentage growth. This type of data tends to show an exponential trend.
Since our data tends to show a linear trend, we use the FB Prophet model with additive seasonality, meaning the effect of the seasonality is added to the trend to produce the forecast.

Splitting train and test data

Since our dataset has values for 303 days, we shall use approximately the first 80% of the data (244 data points) as training data and the remaining ~20% as test data.
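A sketch of the chronological split, assuming the prepared `daily` frame from earlier:

```python
# Chronological split: no shuffling for time series data.
split = 244  # per the write-up, roughly the first 80%
train = daily.iloc[:split]
test = daily.iloc[split:]
print(len(train), len(test))
```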

Building the FB Prophet model

Base Model
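A minimal base-model sketch with Prophet's default settings; the plotting calls are assumptions about how the figures below were produced:

```python
# Fit Prophet with default settings on the training data.
m = Prophet()
m.fit(train)

# Predict over the observed dates; the data has no Saturdays, so we
# reuse the actual dates rather than a generated daily calendar.
future = daily[["ds"]]
forecast = m.predict(future)

fig1 = m.plot(forecast)             # actuals, forecast, uncertainty
fig2 = m.plot_components(forecast)  # trend and seasonality components
plt.show()
```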

In the first plot above, the black dots represent the actual measurements, the blue line displays Prophet's forecast, and the light blue window indicates the uncertainty intervals. We also notice that the yearly seasonality is being overfitted, with a lot of fluctuation points.

Evaluation metrics

Mean Absolute Error (MAE): The Mean Absolute Error is the average of the absolute differences between the actual and predicted values.
Mean Absolute Percentage Error (MAPE): The Mean Absolute Percentage Error is the average of the absolute percentage errors.
Mean Squared Error (MSE): The Mean Squared Error represents the average of the squared differences between the original and predicted values.
Root Mean Squared Error (RMSE): The Root Mean Squared Error is the square root of the Mean Squared Error. It measures the standard deviation of the residuals.
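A sketch of these metrics as plain NumPy functions, evaluated on the held-out test window (the helper names are my own):

```python
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

# Align the forecast with the test window and evaluate.
y_true = test["y"].to_numpy()
y_pred = forecast["yhat"].iloc[-len(test):].to_numpy()
print(f"MAPE: {mape(y_true, y_pred):.0f}%")
```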

We shall use MAPE to evaluate our model. The MAPE for our base model is 44%, which indicates that, across all predicted points, our predictions are off from the true value by 44% on average.

Hyperparameter optimization

growth: This parameter can be set to either "linear" or "logistic". Looking at the trend line, we notice a trend that keeps on growing with no real saturation in sight; hence, we shall set this parameter to "linear". If the plotted trend showed a curve promising saturation, we would set this parameter to "logistic" instead.

seasonality_mode: This parameter can be set to either "additive" (Prophet's default) or "multiplicative". It indicates how the seasonality components should be combined with the trend. We use "additive" since our seasonality is almost constant over the entire period.

n_changepoints: Changepoints are the points in your data where there are sudden and abrupt changes in the trend. This parameter determines how many change points we would like to have.

changepoint_prior_scale: The changepoint_prior_scale indicates how flexible the trend is allowed to be at the changepoints, i.e., how closely the changepoints can fit the data. Larger values allow a more flexible trend, while smaller values constrain it. I found 0.095 to result in the lowest MAPE.

holidays_prior_scale: This parameter determines how much of an effect holidays should have on the predictions. The default value is 10; however, increasing it to 20 resulted in a lower MAPE. This makes sense because we are dealing with retail data, where holidays can have a noticeable effect on sales.

add_seasonality: Prophet offers many options to control seasonality: yearly, weekly, and daily seasonality and their granularity. I set yearly_seasonality, weekly_seasonality, and daily_seasonality all to False and then added my own seasonalities using the "add_seasonality" option, which gives us more power and control over seasonality. For example, for the "monthly" seasonality, the period is set to 30.5, which tells the model that what happened at a certain point is likely to happen again 30.5 days later; we set the periods for the other seasonalities similarly. The advantage of this method is that we can introduce new seasonalities based on domain knowledge, such as a quarterly seasonality, that are not available by default in the model; for example, setting the "period" parameter to 365/4 adds a quarterly seasonality. The other parameter we can tweak with this technique is the number of Fourier components (fourier_order), which controls how closely the seasonality can follow the data, i.e., how many bends the fitted curve can have. I played around with values ranging from 10 to 25 and retained the ones that gave the lowest MAPE. The prior_scale parameter allows the seasonalities to be more or less flexible; I got the lowest MAPE with a prior_scale of 0.00001 for the yearly seasonality.

add_country_holidays: FB Prophet gives us the option to include a country's holidays. Since we are dealing with a UK-based non-store online retailer, I chose to include the UK holidays.

I incorporated these hyperparameters into the model below.
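A sketch of the tuned model described above; the Fourier orders and the exact set of added seasonalities beyond those stated in the text are assumptions:

```python
m_tuned = Prophet(
    growth="linear",
    seasonality_mode="additive",
    changepoint_prior_scale=0.095,
    holidays_prior_scale=20,
    yearly_seasonality=False,
    weekly_seasonality=False,
    daily_seasonality=False,
)
# Custom seasonalities; Fourier orders are assumptions in the 10-25
# range explored in the text.
m_tuned.add_seasonality(name="monthly", period=30.5, fourier_order=10)
m_tuned.add_seasonality(name="weekly", period=7, fourier_order=10)
m_tuned.add_seasonality(name="yearly", period=365.25, fourier_order=10,
                        prior_scale=0.00001)
m_tuned.add_seasonality(name="quarterly", period=365 / 4, fourier_order=10)
m_tuned.add_country_holidays(country_name="UK")

m_tuned.fit(train)
forecast_tuned = m_tuned.predict(future)
y_pred_tuned = forecast_tuned["yhat"].iloc[-len(test):].to_numpy()
print(f"Tuned MAPE: {mape(y_true, y_pred_tuned):.0f}%")
```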

Insights

We notice that the trend line increases over the year.
The weekly peaks occur on Thursdays and Tuesdays.
Sales seem to peak during the later part of the year, around November, and dip during the middle of the year, around May.

Hyperparameter Optimization Results

After all the hyperparameter tuning, we notice that:
The Mean Absolute Error (MAE) is reduced to 10610.0 from 19395.0 in the base model.
The Mean Absolute Percentage Error (MAPE) is reduced to 26% from 44% in the base model.
The Mean Squared Error (MSE) is reduced to 193477641.0 from 526025768.0 in the base model.
The Root Mean Squared Error (RMSE) is reduced to 13910.0 from 22935.0 in the base model.

Interpretation of the model results for non-technical business stakeholders

The Mean Absolute Percentage Error (MAPE) is reduced to 26% from 44% in the base model. This indicates that, across all predicted sales values for the retail store, our predictions are off from the true value by 26% on average.
The Root Mean Squared Error (RMSE) is reduced to 13910.0 from 22935.0 in the base model. This means that the model's predicted sales for the retail store are generally about 13910 pounds sterling off.
With the help of specific domain knowledge, we may be able to improve these scores.

Business Recommendation

The changepoint effect

Recommendation

In the above chart, the changepoints (red dotted lines) denote the points where the trend changes (either increases or decreases). Based on the dates (shown above) on which the trend changes, we can expect an increase or a decrease in sales in future years around the same dates. This would help us run appropriate marketing campaigns and plan inventory and logistics activities. With the help of a domain knowledge expert, we could fine-tune the changepoints to reflect specific marketing strategy roll-outs and promotional deals.

The Holiday Effect

Recommendation

The holidays above are the United Kingdom-specific holidays that were added to the model. However, with the help of domain knowledge experts, if the company wants to include other occasions such as "Thanksgiving", "Superbowl", etc., we can provide them along with the other holidays to improve our forecast.

ARIMA Implementation
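A minimal SARIMAX sketch using statsmodels and the `mape` helper defined earlier; the (p, d, q)(P, D, Q, s) orders are assumptions, as the original write-up does not state them (a weekly seasonal period of 7 matches the daily data):

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Orders are assumptions; the seasonal period (s=7) matches the
# weekly pattern in the daily sales data.
sarimax = SARIMAX(
    train["y"],
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 7),
    enforce_stationarity=False,
    enforce_invertibility=False,
)
res = sarimax.fit(disp=False)

pred = res.forecast(steps=len(test))
print(f"SARIMAX MAPE: {mape(test['y'].to_numpy(), pred.to_numpy()):.0f}%")
```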

Insights

We find that the MAPE of the SARIMAX model is 35%, which means that its predictions are off from the true value by 35% on average.

Model Comparison

FB Prophet model: MAPE = 26%

SARIMAX model: MAPE = 35%

The lower the MAPE score, the better the model performance. Hence, our FB Prophet model performed better than the SARIMAX model in this case.

Conclusion

We used the sales data to create an FB Prophet time series model and a SARIMAX model to forecast future sales, and explored the hyperparameters we can tune to get the most out of the historic sales data. We found that our FB Prophet model works better than the SARIMAX model.

In conclusion, by coupling our analysis with the insights of domain knowledge experts, we would be able to make the best out of our data.

References: I referred to various articles on customer segmentation from websites such as Medium, Analytics Vidhya, and Stack Overflow, and used the FB Prophet documentation for the time series forecasting.

I enjoyed working on this case study. Thank you very much for the opportunity!

Kishor Kumar Sridhar

kishorkumarsridhar@gmail.com