Dec 1, 2019
How to run a Market Basket Analysis with TM1py and PAW
You’re walking through your local grocer. You rummage through your basket and manage to find your shopping list.
- Milk. Check.
- Eggs. Check.
- Bread. Check.
- Ben and Jerry’s Choc Chip Cookie Dough…right. To the ice cream aisle.
You make your way around the store to pick up the last of your supplies. Finally, you get to the check out.
“That’ll be $102.35 thank you.”
Ever wonder how a quick visit to the local grocer ends up in a $100+ bill? You quickly scan your basket: there’s a whole heap of stuff that definitely wasn’t on the list. Eh, you figure you’ll use it anyway.
Knowing those extra products that you’re likely to buy is hyper-valuable to retailers. It allows you to better estimate the average basket size of a customer. Average basket size is critical…why? Well, there are three key things that determine sales revenue for most retailers:
- Number of people that visit a store
- The percentage of people that converted (i.e. bought something)
- How much they spent on average (aka average basket size).
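Put those three together and you have a back-of-envelope revenue model (the numbers below are illustrative only, not from any real store):

# Revenue = visitors x conversion rate x average basket size
visitors = 1000            # illustrative figures only
conversion_rate = 0.6
average_basket_size = 102.35
revenue = visitors * conversion_rate * average_basket_size
print(f"${revenue:,.2f}")  # $61,410.00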
What is market basket analysis?
Chances are, if you’re a retailer you have some pretty useful data on historical purchases. Lucky for you, you can use this to predict what customers are likely to buy and get better estimates of basket size.
This type of forecasting is called market basket analysis. It allows you to get deep into customer buying patterns.
How does it work?
Say you had a whole heap of customer transaction data which told you what customers bought with each shopping trip. You’re able to use that data to create relationships (read: associations) between what a customer bought and what they’re likely to buy based on their historical spending patterns.
What’s even better is that you can leverage your existing IBM TM1 application to run this analysis. TM1py makes it easy to apply free advanced open-source machine learning models to your IBM Planning Analytics application. Read on below to follow along.
Objectives
We want to use the IBM Planning Analytics Workspace capabilities to build a heatmap that highlights links between items:
To do that we will need to write some Python code and then use IBM Planning Analytics Workspace to visualize the data:
- Load data from a csv file using Pandas
- Transform the data with some Python code
- Run the Market Basket Analysis with MLXTEND
- Send data to TM1 with TM1py
- Build a heatmap with IBM Planning Analytics Workspace
To begin with, spin up a new Jupyter notebook; alternatively, you can download the worked example below:
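If you’re starting from a fresh environment, a quick sanity check that the packages used in this post are available (if any import fails, install it with pip, e.g. pip install pandas mlxtend TM1py):

# These should all import cleanly before you start
import pandas
import mlxtend
import TM1py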
Loading data
The large majority of this post is based on the post at Practical Business Python on the same topic. To keep things simple and straightforward, you’ll be able to use the same data set that this post used to conduct your first pass of market basket analysis.
The data set used is the Online Retail Data Set hosted over at the UCI Machine Learning Repository. It’s based on transactions from a retailer based in the UK over 2010-2011.
Loading data from a CSV in Python is very simple: using the Pandas library, we can load the data into a Pandas data frame with just two lines of code.
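A minimal sketch, assuming the data set has been downloaded and saved locally as online_retail.csv (the file name and path are assumptions; adjust them to wherever you saved the file):

import pandas as pd

# Load the Online Retail data set into a data frame (file name assumed)
df = pd.read_csv('online_retail.csv')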
If you are not familiar with Pandas data frames, you should first read the article below:
Taking a closer look at the data, we can see that some values are negative. This means we will need to work on our data set before running the algorithms:
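For example, a quick way to preview the data and spot the problem rows:

# Preview the data and count rows with negative quantities
print(df.head())
print((df['Quantity'] < 0).sum(), 'rows have a negative quantity')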
Transforming data
The next step after loading the data is to clean it and then prepare it for our algorithms.
Cleanup and standardise your data
Nobody likes a messy workspace. Likewise, you’ll need to tidy up the data before it’s ready to analyse. Note: cleaning up is generally a thankless task, but without it you’ll basically be throwing junk in and getting junk out.
So, first strip any whitespace from the invoice descriptions:
df['Description'] = df['Description'].str.strip()
Then drop any rows that are missing invoice numbers and convert the invoice number column to strings. Each row in this data set represents a segment of a transaction, i.e. it only shows the quantity for a particular product purchased in one visit. For this reason the invoice number is super important, because it allows you to make associations between products: if two rows have the same invoice number, they were purchased in the same transaction.
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
In the data set there are a number of transactions whose invoice numbers are prefixed with C and which have negative quantities. Without knowing too much more about the dataset, we could probably assume that these are returns. Either way, they won’t really impact the analysis, as we’re looking for positive associations, i.e. things that were bought together.
So, given that…let’s strip out all of the lines whose invoice number contains a C. This should cut down the list of purchases with negative purchase quantities.
df = df[~df['InvoiceNo'].str.contains('C')]
Just to be sure, let’s remove any lines which have negative quantities purchased as well by running the following line.
df = df[~(df['Quantity'] < 0)]
Let’s display the first 5 lines of the new data set:
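In code, df.head() returns the first five rows by default:

# Show the first 5 lines of the cleaned data set
print(df.head())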
Create baskets and basket sets
Now that the boring stuff is done, it’s time to actually start analysing your data. You’re trying to analyse common groups of products customers buy together, so it makes sense to group single product purchases into baskets. Given we have the invoice number attached to each single product purchase, you can do this pretty easily by grouping on the invoice number.
For this example, we are going to reduce the dataset to focus only on items purchased in France. The following code filters the data set down to France, then groups the dataframe by invoice number and product description, summing the quantities.
basket = (df[df['Country'] == "France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum()
          .unstack()
          .reset_index()
          .fillna(0)
          .set_index('InvoiceNo'))
Running this should give you a new dataframe that represents a matrix of basket purchases.
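You can peek at the result directly in the notebook:

# Each row is an invoice (basket), each column is a product, and the values are quantities
print(basket.head())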
This is useful for analysing the number of each product purchased per basket, but not so much for analysing the basket relationships themselves. For this reason, we want to strip out the volume and simply create a flag to say whether or not ANY quantity of that particular product was purchased as part of that basket.
The process to convert this dataframe to one which just shows the occurrence of a purchase is called one-hot-encoding.
Given the data frame is already set up in the format that’s needed, we can run an encoder over it. The encoder below will run through each and every datapoint: if the value is greater than or equal to 1, it sets it to 1; if it is less than or equal to 0, it sets it to 0.
# Create encoder function
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

# Apply the encoder to every cell, then drop the POSTAGE line item
basket_sets = basket.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)
This means that in the figure above, the 24 packs of 10 Color Spaceboy Pens purchased in invoice 537468 will now be represented as a 1 to show that that particular product was purchased as part of that invoice (aka basket).
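You can spot-check that the encoding worked as expected:

# All values should now be 0/1 flags rather than quantities
print(basket_sets.iloc[:5, :5])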
Running Market Basket Analysis
Find frequent item sets
Awesome, so we’ve found what products are bought in particular baskets. This is great but…what we’re really interested in is what products are frequently bought together. We can do this using the Apriori algorithm.
Without getting too nerdy, the Apriori algorithm scans through the basket sets that have been created and finds the most commonly occurring groups of items (or purchases). It starts out by selecting individual products then extends out and finds groups of products that occur frequently within the data set.
The Apriori algorithm can be imported as part of the mlxtend library. To import it, run the code below:
# Import apriori
from mlxtend.frequent_patterns import apriori
You can then find the frequent itemsets by running the algorithm over the basket sets that were created after one-hot-encoding the basket dataframe.
# Determine frequent itemsets
frequent_itemsets = apriori(basket_sets, min_support=0.05, use_colnames=True)
The min_support parameter passed into the function limits the number of results returned based on support. What’s support, you ask? It’s the number of transactions containing the itemset as a proportion of the total number of transactions. It’s an indicator of how common that itemset is within a particular list of transactions.
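To get a feel for what min_support kept, you can sort the itemsets by support:

# Inspect the most frequent itemsets first
print(frequent_itemsets.sort_values('support', ascending=False).head(10))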
Create Association Rules
The final step in this analysis is to use Association Rule Mining to find relationships between the frequent item sets.
There are two key metrics that are important to analyse when performing Association Rule Mining. The first is confidence, which measures how frequently a rule holds true. E.g. say that in 7 out of the 10 transactions where people purchase milk, they also purchase bread. This would give a confidence of 0.7 for the rule milk → bread. The higher the value, the more likely the rule is to hold true.
The second one is lift. Lift is a measure of dependence, i.e. how relevant the rule is. This is important as there may be some products which are frequently bought regardless of what else is in the basket; rules involving them can show high confidence without telling you much, and a low lift flags them as less relevant. Lift is calculated as the support for the rule holding true divided by the product of the support for the antecedent and the support for the consequent. A lift greater than one means that the two components (the antecedent and the consequent) appear together more often than they would if they were independent. A figure less than one suggests the items are substitutes for one another.
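To make the two metrics concrete, here’s the milk/bread example worked through. Only the 7-out-of-10 figure comes from above; the other counts are illustrative assumptions:

# Illustrative basket counts: 20 transactions in total,
# milk in 10, bread in 8, milk & bread together in 7
n = 20
support_milk = 10 / n                 # 0.50
support_bread = 8 / n                 # 0.40
support_both = 7 / n                  # 0.35

# Confidence of milk -> bread: how often bread shows up in milk baskets
confidence = support_both / support_milk              # 0.70

# Lift: observed co-occurrence vs. what independence would predict
lift = support_both / (support_milk * support_bread)  # 1.75 -> dependent, not substitutes

print(confidence, lift)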
The good thing about the mlxtend package is that all of these metrics can be calculated with a few quick function calls.
First, import the association rules library…
# Import association rules
from mlxtend.frequent_patterns import association_rules
Then create association rules over the previously created basket sets dataframe…
# Generate rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
The result is a dataframe that shows key metrics for each of the rules found.
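From here, one common next step is to filter the dataframe down to the strongest rules. The thresholds below are illustrative; tune them to your data:

# Keep only rules with high lift and high confidence
strong_rules = rules[(rules['lift'] >= 6) & (rules['confidence'] >= 0.8)]
print(strong_rules)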
Sending data back into TM1
Now that our data is ready, we need to send this data into TM1 using the TM1py package.
The best way to do this is to create a new cube where antecedent (if you bought these products) and consequent elements are created dynamically in the dimensions.
This allows us to better visualise the relationship between the rules when opening up visualisations in PAW.
The first step to get to this stage is to import the TM1py package.
# Import TM1 service module
from TM1py.Services import TM1Service
# Import TM1 utils module
from TM1py.Utils import Utils
# MDX view
from TM1py.Objects import MDXView
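The script below assumes the two product dimensions and the Basket Matrix cube already exist in your TM1 model. If they don’t, here is a hedged sketch to create them first; the measure dimension name is an assumption, and the dimension order must match the cellset keys used in the main script:

from TM1py.Services import TM1Service
from TM1py.Objects import Dimension, Hierarchy, Element, Cube

with TM1Service(address='localhost', port='8009', user='admin', password='apple', ssl=True) as tm1:
    # Create the two product dimensions if they don't exist yet
    for dim_name in ('Product Antecedants', 'Product Consequents'):
        if not tm1.dimensions.exists(dim_name):
            tm1.dimensions.create(Dimension(dim_name, [Hierarchy(dim_name, dim_name)]))

    # Measure dimension holding the three rule metrics (name assumed)
    measure_dim = 'Basket Matrix Measure'
    if not tm1.dimensions.exists(measure_dim):
        elements = [Element('Lift', 'Numeric'),
                    Element('Confidence', 'Numeric'),
                    Element('Support', 'Numeric')]
        tm1.dimensions.create(Dimension(measure_dim, [Hierarchy(measure_dim, measure_dim, elements)]))

    # Cube wired to the three dimensions, in the same order as the cellset tuples
    if not tm1.cubes.exists('Basket Matrix'):
        tm1.cubes.create(Cube('Basket Matrix',
                              ['Product Consequents', 'Product Antecedants', measure_dim]))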
And finally, dynamically build the dimensions and load the cube. The script below will do the following:
- Get all elements from the dimension Product Antecedants
- Get all elements from the dimension Product Consequents
- Loop through the rules data set and prepare the new elements and the cell set to be sent to TM1
- Update the dimensions in TM1
- Send data to TM1
# Connect to TM1 (adjust the connection parameters to your environment)
with TM1Service(address='localhost', port='8009', user='admin', password='apple', ssl='true') as tm1:
    # -----
    # 1. Get 'Product Antecedants' information from TM1
    antecedent_dimension = tm1.dimensions.get('Product Antecedants')
    # Get default hierarchy
    antecedent_hierarchy = antecedent_dimension.hierarchies[0]
    # Parent element
    antecedent_parent = 'All Antecedants'
    # Add parent (if it does not already exist)
    if antecedent_parent not in antecedent_hierarchy.elements:
        antecedent_hierarchy.add_element(element_name=antecedent_parent, element_type='Consolidated')

    # -----
    # 2. Get 'Product Consequents' information from TM1
    consequent_dimension = tm1.dimensions.get('Product Consequents')
    # Get default hierarchy
    consequent_hierarchy = consequent_dimension.hierarchies[0]
    # Parent element
    consequent_parent = 'All Consequents'
    # Add parent (if it does not already exist)
    if consequent_parent not in consequent_hierarchy.elements:
        consequent_hierarchy.add_element(element_name=consequent_parent, element_type='Consolidated')

    # -----
    # 3. Loop through the rules data set and prepare elements and the cellset
    cellset = {}
    for key, row in rules.iterrows():
        antecedent_leaves = list(row.antecedents)
        consequent_leaves = list(row.consequents)
        antecedent_element_name = ' - '.join(antecedent_leaves)
        consequent_element_name = ' - '.join(consequent_leaves)
        # Add elements and edges only if they don't already exist
        if antecedent_element_name not in antecedent_hierarchy.elements:
            antecedent_hierarchy.add_element(element_name=antecedent_element_name, element_type='Numeric')
        if (antecedent_parent, antecedent_element_name) not in antecedent_hierarchy.edges:
            antecedent_hierarchy.add_edge(antecedent_parent, antecedent_element_name, 1)
        if consequent_element_name not in consequent_hierarchy.elements:
            consequent_hierarchy.add_element(element_name=consequent_element_name, element_type='Numeric')
        if (consequent_parent, consequent_element_name) not in consequent_hierarchy.edges:
            consequent_hierarchy.add_edge(consequent_parent, consequent_element_name, 1)
        # Build the cellset: one cell per metric for each (consequent, antecedent) pair
        cellset[(consequent_element_name, antecedent_element_name, 'Lift')] = row['lift']
        cellset[(consequent_element_name, antecedent_element_name, 'Confidence')] = row['confidence']
        cellset[(consequent_element_name, antecedent_element_name, 'Support')] = row['support']

    # -----
    # 4. Write the hierarchies back to TM1
    tm1.dimensions.update(antecedent_dimension)
    tm1.dimensions.update(consequent_dimension)

    # -----
    # 5. Write the data to TM1
    tm1.cubes.cells.write_values('Basket Matrix', cellset)
Building the heatmap with PAW
Once you’re in IBM Planning Analytics Workspace, create a new view and rearrange the dimensions so that the Consequents are on the row dimension, Antecedants are on the column dimension and the metric you’re interested in is on the title dimension.
Once that’s done, you can change the view from Exploration to Heat.
To visualise the metric a little bit better…
And…that’s it!