Market Basket Analysis using the Apriori Algorithm in Python
With the booming e-commerce industry, online shopping has become the norm. From a customer's perspective it is accessible, secure, and convenient. As one of several strategies to increase customer retention and sales, retailers cross-sell other products to the customer, often surfacing them in a "frequently bought together" section on these sites.
Ever wonder how retailers gather this information?
Market Basket Analysis (MBA) helps retailers mine customer purchasing patterns and identify combinations of products that are frequently bought together. Applying association rule mining to these transaction sets allows for powerful if-then analysis: retailers can cluster items, strategize product placement, and plan catalog designs that better suit customer needs. MBA also enables retailers to make customer-aware online search recommendations and to tailor marketing campaigns and sales promotions.
Association Rules
Association rule mining is a key data mining technique in Market Basket Analysis, often used to analyze customer buying habits and patterns. It is used when you want to find associations between different objects in a data set, or frequent patterns in the data. Association rules do not capture an individual's preference; rather, they find relationships between sets of items across every distinct transaction, e.g. "customers who buy bread and butter also tend to buy milk". This makes it possible to discover purchasing patterns with minimal feature engineering or data cleansing. It differs from recommendation systems that use collaborative filtering to identify individual preferences: collaborative filtering makes automatic predictions about a user's interests by collecting preferences from many users, using the responses of like-minded users to predict the behavior of an active user. Its motivation is the idea that people often get the best recommendations from someone with tastes similar to their own.
The Apriori algorithm is used to mine large data sets for frequent item sets and their association rules. Its key property is that any subset of a frequent item set must itself be frequent, so candidate item sets are built up level by level from smaller frequent ones. These association rules can be used effectively for marketing and sales, surveys, and research.
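A minimal sketch of one Apriori pass on made-up transactions may help make the level-by-level idea concrete. This is purely illustrative (the transactions and threshold below are assumptions), not the mlxtend implementation used later:
from itertools import combinations

# Toy transactions, assumed purely for illustration
transactions = [
    {'milk', 'bread', 'butter'},
    {'milk', 'bread'},
    {'bread', 'butter'},
    {'milk', 'butter'},
]
min_support = 0.5

def support(itemset):
    # fraction of transactions that contain every item in itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Pass 1: single items that clear the support threshold
frequent_1 = [frozenset([i]) for i in set().union(*transactions)
              if support(frozenset([i])) >= min_support]
# Pass 2: candidate pairs are built only from frequent single items,
# because any subset of a frequent item set must itself be frequent
candidate_2 = {a | b for a, b in combinations(frequent_1, 2)}
frequent_2 = [c for c in candidate_2 if support(c) >= min_support]
print(frequent_2)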
Association rules are used to determine relations among multiple variables in the data set. Some measures used to filter rules of interest and significance include –
Support – It signifies the popularity of an item set: the proportion of transactions that contain the item set out of the total number of transactions.
Confidence – It predicts the likelihood that the second item appears in a transaction given that item 1 is present; it is the support of both items together divided by the support of item 1.
Lift – It measures how much more often the second item co-occurs with item 1 than expected from the second item's own popularity; it is the confidence divided by the support of item 2. A lift greater than 1 indicates a positive association. A quick worked check of these measures follows.
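The counts below are made up purely for illustration, to show how the three measures relate:
# Assumed toy counts: out of 100 transactions, bread appears in 40,
# butter in 30, and both together in 20
n, n_bread, n_butter, n_both = 100, 40, 30, 20

support_both = n_both / n            # 0.20
confidence = n_both / n_bread        # P(butter | bread) = 0.50
lift = confidence / (n_butter / n)   # 0.50 / 0.30 ≈ 1.67, a positive association
print(support_both, confidence, round(lift, 2))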
Step-by-Step: Apriori Algorithm in Python - Market Basket Analysis
Problem Statement
Given a file of grocery transactions (groceries.csv), where each line lists the items bought in one transaction, mine the frequent item sets and derive association rules that reveal which products are commonly bought together.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
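# Note: mlxtend is a third-party package; if it is missing,
# it can usually be installed with `pip install mlxtend`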
# Input
data_file = "groceries.csv"
file_delimiter = ','
# Determine the largest number of columns (items) any line in the file has,
# since each transaction lists a variable number of items
with open(data_file, 'r') as temp_f:
    # count the columns in each line
    column_counts = [len(line.split(file_delimiter)) for line in temp_f.readlines()]
largest_column_count = max(column_counts)
# Assign integer column names 0 .. largest_column_count-1
column_names = list(range(largest_column_count))
# Load the data from the csv into a pandas dataframe
df = pd.read_csv(data_file, header=None, delimiter=file_delimiter, names=column_names)
# Analyzing the data we have loaded from the csv
df.head(10)
# un-pivot the above dataset to do EDA
df_unpivoted = pd.melt(df.reset_index(), id_vars='index', var_name="item_num", value_name='item_name')
df_unpivoted.rename(columns={'index': 'transaction_no'}, inplace=True)
# delete all the records in the unpivoted dataset where the item_name is NaN
df_unpivoted.dropna(subset = ["item_name"], inplace=True)
Taking a glance at our unpivoted data frame
# get all the items in a single transaction (transaction_no 1; numbering starts at 0)
df_unpivoted.loc[df_unpivoted['transaction_no']==1]
# count how often each item occurs, both as a raw count and as a percent of all item occurrences
df_percent = pd.concat([df_unpivoted["item_name"].value_counts(), df_unpivoted["item_name"].value_counts(normalize=True)], keys=['counts', 'percent'], axis=1)
# cumulative percent of item occurrences, with items ordered from most to least frequent
df_percent['cumulative_percent'] = df_percent['percent'].cumsum()
df_percent.head(20)
plt.figure(figsize=(15,9))
plt.title("Item Vs No Of transactions", fontsize=20)
ax = sns.barplot(x=df_percent.head(20).index, y="counts", data=df_percent.head(20));
ax.set_xlabel('Items', fontsize=15)
ax.set_ylabel('No. of transactions the item occurred in', fontsize=15)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right', fontsize=12);
plt.figure(figsize=(15,9))
plt.title("Cumulative Percent", fontsize=20)
ax = sns.barplot(x=df_percent.head(20).index, y="cumulative_percent", data=df_percent.head(20));
ax.set_xlabel('Items', fontsize=15)
ax.set_ylabel('Cumulative percent of all items sold', fontsize=15)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right', fontsize=12);
# Create a function prune_dataset to reduce the size of our dataset based on our
# requirements. It prunes on two criteria: the length of a transaction, i.e. the
# minimum number of items in a particular transaction, and the percentage of
# total sales covered by the items kept.
def prune_dataset(input_df, length_trans, total_sales_perc):
    # un-pivot to one (transaction, item) pair per row, as done above
    df_unpivoted = pd.melt(input_df.reset_index(), id_vars='index', var_name="item_num", value_name='item_name')
    df_unpivoted.rename(columns={'index': 'transaction_no'}, inplace=True)
    df_percent = pd.concat([df_unpivoted["item_name"].value_counts(), df_unpivoted["item_name"].value_counts(normalize=True)], keys=['counts', 'percent'], axis=1)
    df_percent['cumulative_percent'] = df_percent['percent'].cumsum()
    # keep only the most popular items that together account for total_sales_perc
    # percent of sales (cumulative_percent is a fraction in [0, 1], so divide by 100)
    col_arr = df_percent.loc[df_percent['cumulative_percent'].round(2) <= total_sales_perc / 100].index.values
    # pivot back to one row per transaction, one column per item
    df_pruned = (df_unpivoted.groupby(['transaction_no', 'item_name'])['item_name']
                 .count().unstack().reset_index().fillna(0)
                 .set_index('transaction_no'))
    # apriori expects boolean (or 0/1) values rather than raw counts
    df_pruned = df_pruned.astype(bool)
    # keep only transactions containing at least length_trans items
    df_pruned = df_pruned[df_pruned.sum(axis=1) >= length_trans]
    df_pruned = df_pruned.loc[:, col_arr]
    return df_pruned
# keep transactions with at least 2 items and the items making up the top 50% of sales
df_apriori = prune_dataset(df, 2, 50)
# mine frequent item sets that appear in at least 4% of the pruned transactions
frq_items = apriori(df_apriori, min_support = 0.04, use_colnames = True)
# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
rules.loc[:,['consequents', 'antecedents', 'support', 'confidence', 'lift']].head(5)
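As a follow-up sketch, the mined rules can be filtered and printed in a readable if-then form. The confidence and lift thresholds below are arbitrary assumptions for illustration, not values tuned to this dataset:
# Hypothetical thresholds, chosen only for illustration
strong_rules = rules[(rules['confidence'] >= 0.3) & (rules['lift'] > 1.2)]
for _, r in strong_rules.iterrows():
    # antecedents/consequents are frozensets in the mlxtend output
    print(f"{set(r['antecedents'])} -> {set(r['consequents'])} "
          f"(support={r['support']:.3f}, confidence={r['confidence']:.2f}, lift={r['lift']:.2f})")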