Market Basket Analysis using the Apriori algorithm in Python


With the booming e-commerce industry, online shopping has become the norm. It is easily accessible, secure and convenient from a customer's perspective. As one of several strategies to increase customer retention and sales, retailers may cross-sell other products to the customer, often offering them as "frequently bought together" suggestions on most of these sites.

Ever wonder how retailers gather this information? Market Basket Analysis (MBA) helps retailers mine customer purchasing patterns and identify combinations of products frequently bought together. Applying association rule mining to these transaction sets allows for powerful if-then analysis. Retailers can use it to cluster items, strategize product placement, and plan catalog designs that better suit customer needs. MBA also enables retailers to make customer-aware online search recommendations and to tailor marketing campaigns and sales promotions.

Association Rule

Association rule mining is a key data mining technique in Market Basket Analysis that is often used to analyze customer buying habits and patterns. It is used when you want to find associations between different objects in a data set or find frequent patterns in the data. Association rules do not extract an individual's preference; rather, they find relationships between sets of items across every distinct transaction. This lets us discover purchasing patterns with minimal feature engineering or data cleansing. It differs from recommendation systems, which use collaborative filtering to identify individual preferences. Collaborative filtering makes automatic predictions about the interests of a user by collecting preferences from many users, using the responses of like-minded users to predict behavior for an active user. The motivation for collaborative filtering comes from the idea that people often get the best recommendations from someone with tastes similar to their own.

The Apriori algorithm is used to mine large data sets for frequent item sets and their association rules. These association rules can be used effectively in marketing and sales, surveys, and research.

Association rules are used to determine relations amongst multiple variables in a data set. Some measures used to filter rules for interest and significance include –

Support – It signifies the popularity of an item set. It is defined as the proportion of transactions containing the item set out of the total number of transactions.


Confidence – It predicts the likelihood of co-occurrence: of the transactions that contain item 1, the proportion that also contain item 2.


Lift – It measures the chance of co-occurrence of item 2 when item 1 is in a transaction, while also accounting for the overall popularity of item 2. It equals the confidence divided by the support of item 2, so a lift above 1 means the two items appear together more often than chance would suggest.
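To make these measures concrete, here is a minimal sketch that computes all three by hand for a hypothetical five-transaction basket (the items and counts are made up purely for illustration):

transactions = [
    {'milk', 'bread'},
    {'milk', 'butter'},
    {'bread', 'butter'},
    {'milk', 'bread', 'butter'},
    {'bread'},
]
n = len(transactions)

def support(items):
    # fraction of transactions that contain every item in `items`
    return sum(items <= t for t in transactions) / n

# measures for the rule {milk} => {bread}
supp = support({'milk', 'bread'})   # 2/5 = 0.40
conf = supp / support({'milk'})     # 0.40 / 0.60 = 0.67
lift = conf / support({'bread'})    # 0.67 / 0.80 = 0.83 (< 1, slightly negatively associated)
print(supp, conf, lift)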



Step-by-Step: Apriori Algorithm in Python - Market Basket Analysis

Problem Statement

Perform exploratory data analysis on the popular groceries dataset and apply the Apriori algorithm to find association rules using Python.

Dataset
Each row of our groceries.csv file is one transaction, recorded as a comma-separated list of the items purchased, so rows vary in length.


Step 1: Import the libraries:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
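Note: mlxtend is not part of the standard scientific Python stack; if the import above fails, it can usually be installed with pip install mlxtend.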

Step 2: Loading the dataset into a pandas dataframe:
Each line of the file is one transaction, so rows have different numbers of fields; we first find the widest row so that pandas can pad shorter rows with NaN.
# Input
data_file = "groceries.csv"
file_delimiter = ','

# Count the number of columns in each line of the file
with open(data_file, 'r') as temp_f:
    column_counts = [len(l.split(file_delimiter)) for l in temp_f.readlines()]

# The widest row determines how many columns the dataframe needs
largest_column_count = max(column_counts)

# Assigning a numeric name to each column
column_names = [i for i in range(0, largest_column_count)]

# loading data from csv into pandas dataframe (short rows are padded with NaN)
df = pd.read_csv(data_file, header=None, delimiter=file_delimiter, names=column_names)
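As an aside, mlxtend also provides a TransactionEncoder that builds a one-hot matrix directly from a list of transactions, which avoids the padded-NaN layout altogether. A sketch of that alternative (using the same data_file as above):

import csv
from mlxtend.preprocessing import TransactionEncoder

with open(data_file) as f:
    # drop empty fields left by trailing delimiters
    transactions = [[item for item in row if item] for row in csv.reader(f)]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)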

Step 3: Analyzing the data we have loaded from the csv:
# Analyzing the data we have loaded from the csv
df.head(10)
Step 4: Let's do some Exploratory Data Analysis (EDA):
First we will un-pivot (melt) the data to make the analysis easier.
# un-pivot the above dataset to do EDA
df_unpivoted = pd.melt(df.reset_index(), id_vars='index', var_name="item_num", value_name='item_name')
df_unpivoted.rename(columns={'index': 'transaction_no'}, inplace=True)
# delete all the records in the unpivoted dataset where the item_name is NaN
df_unpivoted.dropna(subset = ["item_name"], inplace=True)
Taking a glance at our unpivoted data frame:
#get all the records for 1st transaction
df_unpivoted.loc[df_unpivoted['transaction_no']==1]
Getting the count of the number of transactions each item occurred in, along with the respective percentage and the cumulative percentage.
# find the top 20 "sold items" that occur in the dataset
df_percent = pd.concat([df_unpivoted["item_name"].value_counts(), df_unpivoted["item_name"].value_counts(normalize=True)], keys=['counts', 'percent'], axis=1)
# find the cumulative percent based on the number of transactions an item appeared in
df_percent['cumulative_percent'] = df_percent['percent'].cumsum()
df_percent.head(20)
This shows that the top five items (i.e. whole milk, other vegetables, rolls/buns, soda and yogurt) account for 21.4% of all items sold, and the top 20 items alone account for over 50% of sales! This matters because we don't want to mine association rules for items that are bought very infrequently. With this information we can limit the items we explore when creating our association rules, which also keeps the number of possible item sets manageable.

Some graphical representation of the above data:
plt.figure(figsize=(15,9))
plt.title("Item Vs No Of transactions", fontsize=20)
ax = sns.barplot(x=df_percent.head(20).index, y="counts", data=df_percent.head(20));
ax.set_xlabel('Items', fontsize=15)
ax.set_ylabel('No of transactions item occurred', fontsize=15)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right', fontsize=12);
plt.figure(figsize=(15,9))
plt.title("Cumulative Percent", fontsize=20)
ax = sns.barplot(x=df_percent.head(20).index, y="cumulative_percent", data=df_percent.head(20));
ax.set_xlabel('Items', fontsize=15)
ax.set_ylabel('Cumulative transaction percent', fontsize=15)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right', fontsize=12);
  
Step 5: Pruning the dataset:
As the EDA shows, we should focus on finding associations among our top items, so let's do some pruning. This is an important data cleaning step that reduces and transforms our dataset for Apriori learning.
The code below creates a function which performs pruning based on the percentage of total sales and the length of a transaction (e.g. length_trans=2 indicates that we are interested in transactions with at least two items, and total_sales_perc=50 keeps the items which together contribute 50% of total sales).
# Create a function prune_dataset, which will help us reduce the size of our dataset
# based on our requirements. The function performs pruning based on the length of a
# transaction, i.e. the minimum number of items in a particular transaction, and on the
# percentage of total sales.
def prune_dataset(input_df, length_trans, total_sales_perc):
    # un-pivot the raw dataframe: one row per (transaction, item) pair
    df_unpivoted = pd.melt(input_df.reset_index(), id_vars='index', var_name="item_num", value_name='item_name')
    df_unpivoted.rename(columns={'index': 'transaction_no'}, inplace=True)
    # per-item counts, share of all items sold, and cumulative share
    df_percent = pd.concat([df_unpivoted["item_name"].value_counts(), df_unpivoted["item_name"].value_counts(normalize=True)], keys=['counts', 'percent'], axis=1)
    df_percent['cumulative_percent'] = df_percent['percent'].cumsum()
    # keep the items that together make up total_sales_perc percent of sales
    # (cumulative_percent is a fraction, so convert the threshold from a percentage)
    col_arr = df_percent.loc[df_percent['cumulative_percent'].round(2) <= total_sales_perc / 100.0].index.values
    # one-hot encode: one row per transaction, one 0/1 column per item
    df_pruned = (df_unpivoted.groupby(['transaction_no', 'item_name'])['item_name']
                 .count().unstack().reset_index().fillna(0)
                 .set_index('transaction_no'))
    df_pruned = df_pruned.astype('int64')
    # keep only transactions with at least length_trans items
    df_pruned = df_pruned[df_pruned.sum(axis=1) >= length_trans]
    # keep only the retained (top-selling) item columns
    df_pruned = df_pruned.loc[:, col_arr]
    return df_pruned
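The returned frame has one row per transaction and one 0/1 column per retained item, which is exactly the one-hot layout that mlxtend's apriori expects. A quick sanity check (the exact numbers depend on the thresholds chosen):

prune_dataset(df, 2, 50).shape   # (transactions kept, items kept)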

Step 6: Building an Apriori model:
We build our model with a minimum support threshold of 4% (min_support = 0.04).
df_apriori = prune_dataset(df, 2, 50)
frq_items = apriori(df_apriori, min_support=0.04, use_colnames=True)
# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items, metric="lift", min_threshold=1)
rules = rules.sort_values(['confidence', 'lift'], ascending=[False, False])

Step 7: Getting the metrics of the Apriori Model:
rules.loc[:, ['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(5)
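The rules frame also carries 'antecedent support' and 'consequent support' columns, so the identity lift = confidence / support(consequent) can be verified directly on any row:

top_rule = rules.iloc[0]
assert abs(top_rule['lift'] - top_rule['confidence'] / top_rule['consequent support']) < 1e-9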
Step 8: Decoding the Apriori algorithm:
Let's take a look at the top rule (based on confidence)

{whipped/sour cream} => {whole milk}

Support = 0.041298 : This means that whole milk and whipped/sour cream show up together in 4.1% of transactions.

Confidence = 0.466863 : 46.7% of the transactions that contain whipped/sour cream also contain whole milk.

Lift = 1.498178 : Customers who buy whipped/sour cream are about 1.5 times more likely to also buy whole milk than customers overall.

Finally, we can tune the Apriori algorithm by generating different association rules with different support and confidence thresholds.
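For instance, here is a sketch of one such tuning pass (the thresholds are illustrative, not recommendations):

# stricter support threshold, then keep only reasonably confident rules
frq_items_strict = apriori(df_apriori, min_support=0.05, use_colnames=True)
rules_strict = association_rules(frq_items_strict, metric="confidence", min_threshold=0.5)
rules_strict.sort_values('lift', ascending=False).head()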

Apriori is an effective way of finding association rules. Techniques like it are used by companies such as Amazon in recommender systems and by Google for the auto-complete feature. It can also be used on OTT platforms to recommend series and movies to viewers, and for ad targeting in online advertising.

I truly appreciate Rocky Jagtiani for his sincere efforts and guidance in Market Basket Analysis.

