Data Analysis

Objective

The purpose is to provide analysis based on dataset¹ provided by going through all the necessary step and methods in this alongside explanation. My project collects data of sales records. I have wrangled and analysed data from last couple years of each country individually and visualized the data in form of excel dashboard for easy understanding of lending trend.

Import && Read Data

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import os
import pandas as pd
import matplotlib.pyplot as plt

#find file path directory
from dirname, _, filenames in os.walk("./source/data"):
    for filename in filenames:
        print(os.path.join(dirname, filename))


all_sales_df = pd.DataFrame()

for filename in os.listdir(dir_path):
    df = pd.read_csv(os.path.join(dirname, filename))
    all_sales_df = pd.concat([all_sales_df, df], ignore_index=True)

all_sales_df = all_sales_df.rename(mapper = str.strip, axis ='columns')
all_sales_df = all_sales_df.rename(columns= {'Order ID': 'Order_id',
                                              'Quantity Ordered': 'Quantity', 
                                              'Price Each': 'Price', 
                                              'Order Date': 'Date', 
                                              'Purchase Address': 'Address'})

#lowercase column name
column_name = list(all_sales_df.columns)
column_name = [x.lower().strip() for x in column_name] 
all_sales_df.columns = column_name

Data cleaning

First check for rows contain null and duplicates. Then check for non-numeric values in order_id, quantity, and price since the 3 column cannot be non-numeric. Then, correct format for each column, Int, Timestamp and String. When there is a few consideration about dropping duplicated rows. Check duplicated rows and their duplicates.

If there is null and duplicated rows then it need to be dropped/removed. For non-numeric in mentioned columns, apply to_numeric to remove the rows dataframe. Then check the result by applying the previous methods for a new defined dataframe(clean_sales_df).

identify data, null and non-numeric digits

1
2
3
4
5
6
7
8
9
#check for all rows contain null value
all_sales_df.isnull().sum()
#check duplicated rows
all_sales_df.duplicated().sum()
#check for non-numeric in order_id, quantity, and price
for i in ["order_id", "quantity", "price"]:
    all_sales_df[i].loc[pd.to_numeric(all_sales_df[i], errors='coerce').isnull()].unique()
#check order id in duplicated rows that are not null and is numeric
all_sales_df[all_sales_df.duplicated(keep=False) & all_sales_df['order_id'].notnull() & all_sales_df['order_id'].str.isnumeric()]['order_id'].head()

remove null rows

1
2
3
4
5
#drop null rows
clean_sales_df = all_sales_df.dropna(how='all')

for i in ["order_id", "quantity", "price"]:
    clean_sales_df = clean_sales_df[pd.to_numeric(clean_sales_df[i], errors='coerce').notnull()]

chagne data type

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
#change data type for column quantity, price, date, and address
clean_sales_df['quantity'] = clean_sales_df['quantity'].astype(int)
clean_sales_df['price'] = clean_sales_df['price'].astype(float)
clean_sales_df['date'] = pd.to_datetime(clean_sales_df['date'], format='%m/%d/%y %H:%M')
clean_sales_df['address'] = clean_sales_df['address'].astype(str)

#add city and state column
clean_sales_df['city'] = clean_sales_df['address'].apply(lambda x: x.split(',')[1].strip())
clean_sales_df['state'] = clean_sales_df['address'].apply(lambda x: x.split(',')[2].split(' ')[1].strip())

#add total sales column
clean_sales_df['total_sales'] = clean_sales_df['quantity'] * clean_sales_df['price']

** store cleaned dataset to new location **

Questions

Task 1

Q: What was the best Year for sales? How much was earned that Year? A: Create Annual Sales summary and find max/min values in Annual Sales

In this section clean_sales_df will be referred to find annual sales as annual_sales_df. Plot the annual_sales_df and find max/min values in annual_sales_df then print the result as conclusion.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
#find annual sales summary and convert to dataframe
annual_sales_df = pd.DataFrame(clean_sales_df.groupby(clean_sales_df['date'].dt.year)['total_sales'].sum())
annual_sales_df.reset_index(inplace=True)

#generate annual sales visual
with plt.style.context('ggplot'):
    plt.figure(figsize=(4, 4))
    sns.barplot(x='date', y='total_sales', data=annual_sales_df)
    plt.ticklabel_format(style='plain', axis='y')
    plt.title("Annual Sales", fontsize=20)
    plt.xlabel("Year")
    plt.ylabel("Sales")
    
    #add data callouts
    for x, y in enumerate(annual_sales_df['total_sales']):
        label = f"${y:,.2f}"
        plt.annotate(label, (x,y), textcoords='offset points', xytext=(0,1), ha='center', fontsize=12)
    plt.show()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
#find highest year and value in annual sales
highest_year = annual_sales_df['date'].iloc[annual_sales_df['total_sales'].idxmax()]
highest_value = annual_sales_df['total_sales'].max()

#find lowest year and value in annual sales
lowest_year = annual_sales_df['date'].iloc[annual_sales_df['total_sales'].idxmin()]
lowest_value = annual_sales_df['total_sales'].min()

#show result as visual
with plt.style.context('ggplot'):
    plt.subplots(figsize=(8, 1))
    plt.axis('off')
    plt.text(0.5, 0.5,
             f"Highest Year: {highest_year}, Total Sales: ${highest_value:,.2f}\n"
             f"Lowest Year: {lowest_year}, Total Sales: ${lowest_value:,.2f}",
             fontsize=14, ha='center', va='center')
    plt.title("Best & Worst Year", fontsize=20)
    plt.show()

1
2
3
4
5
6
#print result as conclusion
print(f"CONCLUSION\n"
      f"Total sales generated by company: ${highest_value+lowest_value:,.2f}\n"
      f"\n"
      f"With the highest sales is in {highest_year}, generating: ${highest_value:,.2f}\n"
      f"With the lowest sales is in {lowest_year}, generating: ${lowest_value:,.2f}")

Task 2

Q: What was the best month for sales? How much was earned that month? A: Create Monthly Sales summary and find largest/smallest values in Monthly Sales

In this section clean_sales_df will be referred to find monthly sales as monthly_sales_df. Plot the monthly_sales_df and find largest/smallest values in monthly_sales_df then print the result as conclusion.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
#find monthly sales summary and convert to dataframe
monthly_sales_df = pd.DataFrame(clean_sales_df.groupby([clean_sales_df['date'].dt.year, 
                                                        clean_sales_df['date'].dt.month])['total_sales'].sum())
monthly_sales_df.index = pd.to_datetime(monthly_sales_df.index.map('{0[0]}-{0[1]}'.format))
monthly_sales_df.index.name = 'date'
monthly_sales_df = monthly_sales_df.reset_index()

#find monthly sales average
average_monthly_sales = monthly_sales_df['total_sales'].mean()

#generate monthly sales visual
with plt.style.context('ggplot'):
    plt.figure(figsize=(16, 4))
    sns.lineplot(x='date', y='total_sales', data=monthly_sales_df)
    plt.gca().xaxis.set_major_formatter(DateFormatter('%y-%m'))
    plt.ticklabel_format(style='plain', axis='y')
    plt.title("Monthly Sales", fontsize=20)    
    plt.xlabel("Date")
    plt.ylabel("Sales")
    plt.xticks(monthly_sales_df['date'])

    #add axh line
    plt.axhline(y=average_monthly_sales, color='blue', linestyle='--', linewidth=1)
    plt.text(monthly_sales_df['date'].iloc[-12], average_monthly_sales+5,
             f"Avg: ${average_monthly_sales:,.2f}",
             color='blue', fontsize=10, ha='center', va='bottom')
    
    #add data callouts
    for x, y in zip(monthly_sales_df['date'], monthly_sales_df['total_sales']):
        label = f"${y:,.2f}"
        plt.annotate(label, (x,y), textcoords='offset points', xytext=(2,4), ha='center')
    plt.show()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
#find highest month and value in monthly sales
highest_months = monthly_sales_df.nlargest(3, 'total_sales')
highest_months['date'] = pd.to_datetime(highest_months['date']).dt.strftime('%B %Y')
highest_month_names = highest_months['date'].tolist()
highest_month_values = highest_months['total_sales'].tolist()

#find lowest month and value in monthly sales
lowest_months = monthly_sales_df.nsmallest(3, 'total_sales')
lowest_months['date'] = pd.to_datetime(lowest_months['date']).dt.strftime('%B %Y')
lowest_month_names = lowest_months['date'].tolist()
lowest_month_values = lowest_months['total_sales'].tolist()

#show result as visual
with plt.style.context('ggplot'):
    plt.subplots(figsize=(8, 1))
    plt.axis('off')
    plt.text(0.5, 0.5, 
             f"Highest Month: {highest_month_names[0]}, Total Sales: ${highest_month_values[0]:,.2f}\n"
             f"Lowest Month: {lowest_month_names[0]}, Total Sales: ${lowest_month_values[0]:,.2f}",
             fontsize=14, ha='center', va='center')
    plt.title("Best & Worst Month", fontsize=20)
    plt.show()

1
2
3
4
5
#print result as conclusion
print(f"CONCLUSION\n"
      f"Top 3 best month:\n{highest_months.to_string(index=False)}\n"
      f"\n"
      f"Bottom 3 worst month:\n{lowest_months.to_string(index=False)}")

Task 3

Q: What City had the highest number of sales? A: Create Sales by State/City and find largest/smallest values in Sales by State/City

This section separated by 2 for Sales by State and Sales by City from clean_sales_df. Purpose is to find both trend on state level and city level. Plot and find largest/smallest values in both dataframe then print the result as conclusion.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
#find sales by state and convert to dataframe
sales_by_state_df = pd.DataFrame(clean_sales_df.groupby('state')['total_sales'].sum())
sales_by_state_df = sales_by_state_df.sort_values(by='total_sales', ascending=False).reset_index()

#generate sales by state visual
with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(16, 4))
    sns.barplot(x='state', y='total_sales', data=sales_by_state_df ,palette='husl')
    plt.ticklabel_format(style='plain', axis='y')
    plt.title("Sales by State", fontsize=20)
    plt.xlabel("State")
    plt.ylabel("Sales")
    
    #add data callouts
    for x, y in enumerate(sales_by_state_df['total_sales']):
        label = f"${y:,.2f}"
        plt.annotate(label, (x,y), textcoords="offset points", xytext=(0,2), ha='center', fontsize=10)
    plt.show()

1
2
3
4
5
6
7
8
9
#create dictionary for state colors for use in sales by city visual
state_colors = {}
n_states = len(sales_by_state_df['state'])
palette = sns.color_palette('husl', n_colors=n_states)  #use same palette as previous state visual
for i, state in enumerate(sales_by_state_df['state']):
    state_colors[state] = palette[i]

#check the state colors
print(state_colors)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
#find highest month and value in sales by state
highest_state = sales_by_state_df.nlargest(3, 'total_sales')
highest_state_names = highest_state['state'].tolist()
highest_state_values = highest_state['total_sales'].tolist()

#find lowest month and value in sales by state
lowest_state = sales_by_state_df.nsmallest(3, 'total_sales')
lowest_state_names = lowest_state['state'].tolist()
lowest_state_values = lowest_state['total_sales'].to_list()

#show result as visual
with plt.style.context('fivethirtyeight'):
    plt.subplots(figsize=(8, 1))
    plt.axis('off')
    plt.text(0.5, 0.5, 
             f"Highest State: {highest_state_names[0]}, Total Sales: ${highest_state_values[0]:,.2f}\n"
             f"Lowest State: {lowest_state_names[0]}, Total Sales: ${lowest_state_values[0]:,.2f}",
             fontsize=14, ha='center', va='center')
    plt.title("Best & Worst State", fontsize=20)
    plt.show()

1
2
3
4
5
6
7
#find sales by city and convert to dataframe
sales_by_city_df = pd.DataFrame(clean_sales_df.groupby(['state', 'city'])['total_sales'].sum())
sales_by_city_df = sales_by_city_df.sort_values(by='total_sales', ascending=False).reset_index()

#modify city named "Portland" before generating visual (to avoid overwrite)
sales_by_city_df.loc[(sales_by_city_df['state'] == 'ME') & (sales_by_city_df['city'] == 'Portland'), 'city'] = 'Portland (Maine)'
sales_by_city_df.loc[(sales_by_city_df['state'] == 'OR') & (sales_by_city_df['city'] == 'Portland'), 'city'] = 'Portland (Oregon)'

1
2
3
4
5
#print result as conclusion
print(f"CONCLUSION\n"
      f"Top 3 best state sales:\n{highest_state.to_string(index=False)}\n"
      f"\n"
      f"Bottom 3 worst state sales:\n{lowest_state.to_string(index=False)}")

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
#generate visual for sales by city
with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(16, 4))
    sns.barplot(x='total_sales', y='city', hue='state', data=sales_by_city_df, dodge=False, palette=state_colors) #previous state_colors dict
    plt.title("Sales by City", fontsize=20)
    plt.xlabel("Sales")
    plt.ylabel("City")
    plt.legend(title="State", bbox_to_anchor=(1.05, 0.5), loc='center left', borderaxespad=0.)
    
    #add data callouts
    for y, x in enumerate(sales_by_city_df['total_sales']):
        label = f"${x:,.2f}"
        plt.annotate(label, (x, y), textcoords='offset points', xytext=(0, 0), ha='left', va='center', fontsize=10)
    plt.show()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
#find highest month and value in sales by city
highest_city = sales_by_city_df.nlargest(3, 'total_sales')
highest_city_names = highest_city['city'].tolist()
highest_city_values = highest_city['total_sales'].tolist()

#find lowest month and value in sales by city
lowest_city = sales_by_city_df.nsmallest(3, 'total_sales')
lowest_city_names = lowest_city['city'].tolist()
lowest_city_values = lowest_city['total_sales'].tolist()

#show result as visual
with plt.style.context('fivethirtyeight'):
    plt.subplots(figsize=(8, 1))
    plt.axis('off')
    plt.text(0.5, 0.5, 
             f"Highest City: {highest_city_names[0]}, Total Sales: ${highest_city_values[0]:,.2f}\n"
             f"Lowest City: {lowest_city_names[0]}, Total Sales: ${lowest_city_values[0]:,.2f}",
             fontsize=14, ha='center', va='center')
    plt.title("Highest & Lowest City Sales", fontsize=20)
    plt.show()

1
2
3
4
5
#print result as conclusion
print(f"CONCLUSION\n"
      f"Top 3 best city sales:\n{highest_city.to_string(index=False)}\n"
      f"\n"
      f"Bottom 3 worst city sales:\n{lowest_city.to_string(index=False)}")

Task 4

Q: What time should we display adverstisement to maximize likelihood of customer’s buying product? A: Find Monthly/Daily/Hourly Order Trend

This section will find the data related to orders by simply find count/unique number of order_id by month/day/hour. The data that will be referred here is clean_sales_2019_df(which also apply for the rest of the analysis).

Despite order id cannot be duplicate in the dataset there is multiple rows with the same order id. The reason why it’s not dropped is because each may contains different product. *not sure how to approach/may also need to change data cleaning section

But using .count() it will return the duplicated order id as well. If we want to count the order id as unique number then we can call .nunique() which will be used in following sections.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
#find monthly order and convert to dataframe
monthly_order_df = pd.DataFrame(clean_sales_2019_df.groupby([clean_sales_2019_df['date'].dt.year, 
                                                             clean_sales_2019_df['date'].dt.month])['order_id'].nunique())
monthly_order_df.index = pd.to_datetime(monthly_order_df.index.map('{0[0]}-{0[1]}'.format))
monthly_order_df.index.name = 'date'
monthly_order_df = monthly_order_df.rename(columns={'order_id': 'total_orders'}).reset_index()

#find average monthly order
average_monthly_order = monthly_order_df['total_orders'].mean()

#generate monthly order visual
with plt.style.context('bmh'):
    plt.figure(figsize=(16, 4))
    sns.lineplot(x='date', y='total_orders', data=monthly_order_df)
    plt.gca().xaxis.set_major_formatter(DateFormatter('%Y-%m'))
    plt.ticklabel_format(style='plain', axis='y')
    plt.title("Monthly Order Trend", fontsize=20)    
    plt.xlabel("Date")
    plt.ylabel("Order")
    plt.xticks(monthly_order_df['date'])

    #add axh line
    plt.axhline(y=average_monthly_order, color='red', linestyle='--', linewidth=1)
    plt.text(monthly_order_df['date'].iloc[-12], average_monthly_order+5, 
             f"Avg: {average_monthly_order:,.2f}", 
             color='red', fontsize=10, ha='center', va='bottom')

    #add data callouts
    for x, y in zip(monthly_order_df['date'], monthly_order_df['total_orders']):
        label = f"{(y):,}"
        plt.annotate(label, (x,y), textcoords='offset points', xytext=(2,4), ha='center')
    plt.show()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
#find 3 highest value in monthly order
highest_monthly_orders = monthly_order_df.nlargest(3, 'total_orders')
highest_monthly_orders['date'] = pd.to_datetime(highest_monthly_orders['date']).dt.strftime('%B')

#find 3 lowest value in monthly order
lowest_monthly_orders = monthly_order_df.nsmallest(3, 'total_orders')
lowest_monthly_orders['date'] = pd.to_datetime(lowest_monthly_orders['date']).dt.strftime('%B')

#print result as conclusion
print(f"CONCLUSION\n"
      f"Months with highest total orders:\n{highest_monthly_orders.to_string(index=False)}\n"
      f"\n"
      f"Months with lowest total orders:\n{lowest_monthly_orders.to_string(index=False)}\n"
      f"\n"
      f"*use as references to maximize/prioritize advertisement")

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#find daily order and convert to dataframe
daily_order_df = pd.DataFrame(clean_sales_2019_df.groupby(clean_sales_2019_df['date'].dt.dayofweek)['order_id'].nunique())
daily_order_df.columns = ['total_orders']
daily_order_df.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
daily_order_df.index.name = 'day_of_week'
daily_order_df = daily_order_df.reset_index()

#find average daily order
average_daily_order = daily_order_df['total_orders'].mean()

#generate daily order visual
with plt.style.context('bmh'):
    plt.figure(figsize=(16, 4))
    sns.lineplot(x='day_of_week', y='total_orders', data=daily_order_df)
    plt.ticklabel_format(style='plain', axis='y')
    plt.title("Daily Order Trend", fontsize=20)    
    plt.xlabel("Day of Week")
    plt.ylabel("Order")
    
    #add axh line
    plt.axhline(y=average_daily_order, color='red', linestyle='--', linewidth=1)
    plt.text(daily_order_df['day_of_week'].iloc[-7], average_daily_order-5, 
             f"Avg: {average_daily_order:,.2f}", 
             color='red', fontsize=10, ha='center', va='top')
    
    #add data callouts
    for x, y in enumerate(daily_order_df['total_orders']):
        label = f"{(y):,}"
        plt.annotate(label, (x,y), textcoords='offset points', xytext=(2,4), ha='center')
    plt.show()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
#find 3 highest value in daily orders
highest_daily_orders = daily_order_df.nlargest(3, 'total_orders')

#find 3 lowest value in daily orders
lowest_daily_orders = daily_order_df.nsmallest(3, 'total_orders')

#print result as conclusion
print(f"CONCLUSION\n"
      f"Days with highest total orders:\n{highest_daily_orders.to_string(index=False)}\n"
      f"\n"
      f"Days with lowest total orders:\n{lowest_daily_orders.to_string(index=False)}\n"
      f"\n"
      f"*use as references to maximize/prioritize advertisement")

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
#find hourly order and convert to dataframe
hourly_order_df = pd.DataFrame(clean_sales_2019_df.groupby(clean_sales_2019_df['date'].dt.hour)['order_id'].nunique())
hourly_order_df.columns = ['total_orders']
hourly_order_df.index = pd.to_datetime(hourly_order_df.index, format='%H').strftime('%H:%M')
hourly_order_df.index.name = 'hour'
hourly_order_df = hourly_order_df.reset_index()

#find average hourly order
average_hourly_order = hourly_order_df['total_orders'].mean()

#generate hourly order visual
with plt.style.context('bmh'):
    plt.figure(figsize=(16, 4))
    sns.lineplot(x='hour', y='total_orders', data=hourly_order_df)
    plt.ticklabel_format(style='plain', axis='y')
    plt.title("Hourly Order Trend", fontsize=20)    
    plt.xlabel("Hour")
    plt.ylabel("Order")
    
    #add axh line
    plt.axhline(y=average_hourly_order, color='red', linestyle='--', linewidth=2)
    plt.text(0.5, average_hourly_order+100, f"Avg: {average_hourly_order:,.2f}", color='red', fontsize=10, ha='center', va='bottom')

    #add data callouts
    for x, y in enumerate(hourly_order_df['total_orders']):
        label = f"{y:,}"
        plt.annotate(label, (x,y), textcoords='offset points', xytext=(2,4), ha='center')
    plt.show()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
#find 6 highest value in hourly orders
highest_hourly_orders = hourly_order_df.nlargest(6, 'total_orders')

#find 6 lowest value in hourly orders
lowest_hourly_orders = hourly_order_df.nsmallest(6, 'total_orders')

#print result as conclusion
print(f"CONCLUSION\n"
      f"Hours with highest total orders:\n{highest_hourly_orders.to_string(index=False)}\n"
      f"\n"
      f"Hours with lowest total orders:\n{lowest_hourly_orders.to_string(index=False)}\n"
      f"\n"
      f"*use as references to maximize/prioritize advertisement")

Task 5

Q: What product sold the most? Why do you think it sold the most? A: Find Products Popularity and Product Price List Comparision

Find data related to Product Popularity by calculate sum of quantity by each product. This return the total product being purchased. It also find price list by product which is used to create filled lines for comparision.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#create dataframe for product popularity
product_popularity_df = pd.DataFrame(clean_sales_2019_df.groupby('product')['quantity'].sum())
product_popularity_df.index.name = 'product'
product_popularity_df = product_popularity_df.sort_values(by='quantity', ascending=False)
product_popularity_df = product_popularity_df.reset_index()

#create dataframe for product price
product_price_df = pd.DataFrame(clean_sales_2019_df.groupby('product')['price'].first())
product_price_df = product_price_df.reset_index()

#merge the two dataframes on the product column
merged_popularity_price_df = pd.merge(product_popularity_df, product_price_df, on='product')

#generate product sales visual
with plt.style.context('classic'):
    fig, ax1 = plt.subplots(figsize=(16, 4))
    sns.barplot(x='product', y='quantity', data=merged_popularity_price_df, palette='husl', ax=ax1)
    plt.ticklabel_format(style='plain', axis='y')
    plt.title("Product Popularity", fontsize=20)
    plt.xlabel("Product")
    plt.ylabel("Quantity")
    plt.xticks(rotation=80)

    #add data callouts
    for x, y in enumerate(merged_popularity_price_df['quantity']):
        label = f"{y:,}"
        plt.annotate(label, (x,y), textcoords="offset points", xytext=(0,4), ha='center', fontsize=12)
    
    #plot product prices as filled line plot
    ax2 = ax1.twinx()
    ax2.plot(merged_popularity_price_df['product'], merged_popularity_price_df['price'], color='red')
    ax2.set_ylabel("Price")
    ax2.fill_between(merged_popularity_price_df['product'], merged_popularity_price_df['price'], alpha=0.2)
    plt.show()

1
2
3
4
5
6
7
8
9
#create dictionary for product colors for use in product related visual
product_colors = {}
n_products = len(merged_popularity_price_df['product'])
palette = sns.color_palette('husl', n_colors=n_products)  #use same palette as previous product visual
for i, product in enumerate(merged_popularity_price_df['product']):
    product_colors[product] = palette[i]

#check the state colors
print(product_colors)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
#find product sold most
product_sold_most = merged_popularity_price_df.nlargest(3, 'quantity')
product_sold_most_names = product_sold_most['product'].tolist()
product_sold_most_quantities = product_sold_most['quantity'].tolist()

#find product sold least
product_sold_least = merged_popularity_price_df.nsmallest(3, 'quantity')
product_sold_least_names = product_sold_least['product'].tolist()
product_sold_least_quantities = product_sold_least['quantity'].tolist()

#show result as visual
with plt.style.context('classic'):
    plt.subplots(figsize=(8, 1))
    plt.axis('off')
    plt.text(0.5, 0.5, 
             f"Most sold: {product_sold_most_names[0]}, Quantity: {product_sold_most_quantities[0]:,}\n"
             f"Least sold: {product_sold_least_names[0]}, Quantity: {product_sold_least_quantities[0]:,}",
             fontsize=14, ha='center', va='center')
    plt.title("Most & Least Sold Product", fontsize=20)
    plt.show()

1
2
3
4
5
#print result as conclusion
print(f"CONCLUSION\n"
      f"Top 3 most sold products:\n{product_sold_most.to_string(index=False)}\n"
      f"\n"
      f"Bottom 3 least sold products:\n{product_sold_least.to_string(index=False)}")

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
#find price for each item list
product_list_df = pd.DataFrame(clean_sales_2019_df.groupby('product')['price'].first())
product_list_df = product_list_df.sort_values(by='price', ascending=False)
product_list_df = product_list_df.reset_index()

#find product price average
average_product_price = product_list_df['price'].mean()

#generate product list visual
with plt.style.context('classic'):
    plt.figure(figsize=(16, 6))
    sns.barplot(x='price', y='product', data=product_list_df, palette=product_colors)
    plt.title("Product Price List", fontsize=20)
    plt.xlabel("Price")
    plt.ylabel("Product")
    
    #add axv line
    plt.axvline(x=average_product_price, color='red', linestyle='--', linewidth=1)
    plt.text(average_product_price+5, len(product_list_df)/2, 
             f"Avg: ${average_product_price:.2f}", 
             color='red', ha='left', va='center')
    plt.show()
    

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
#find expensive product
most_expensive_product = product_list_df.nlargest(3, 'price')

#find cheap product
most_cheap_product = product_list_df.nsmallest(3, 'price')

#print result as conclusion
print(f"CONCLUSION\n"
      f"Most expensive products:\n{most_expensive_product.to_string(index=False)}\n"
      f"\n"
      f"Most cheap products:\n{most_cheap_product.to_string(index=False)}")

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
#find average quantity per order id
average_item_per_order = clean_sales_2019_df.groupby('order_id')['quantity'].mean().mean()

#show result as visual
with plt.style.context('classic'):
    plt.subplots(figsize=(8, 1))
    plt.axis('off')
    plt.text(0.5, 0.5, 
             f"{average_item_per_order:,.2f}",
             fontsize=40, ha='center', va='center')
    plt.title("Average Item Each Order", fontsize=20)
    plt.show()

Task 6

Q: How much probability for next people will order certain products? A: Find Product Orders % of Total Sales

The following section is to find % product orders/sold compare to the total sales of all products. the result will give % of product which may define the likelihood of the product will be purchased next. This part is also the only section which used Squarify to plot the result as Treemap.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
#find total product sold
total_product_sold = product_popularity_df['quantity'].sum()

#find sizes and labels value in product popularity
sizes = product_popularity_df['quantity']
labels = product_popularity_df['product']

#calculate percentage for each product
percentages = [f'{100*sizes[i]/sizes.sum():.2f}%\n{labels[i]}' for i in range(len(sizes))]

#generate treemap visual for product sold percentage of total
with plt.style.context('seaborn-dark-palette'):
    plt.figure(figsize=(16, 8))
    squarify.plot(sizes=sizes, label=percentages, alpha=0.7,
                  color=[product_colors[label] for label in labels]) #get color from previous product color dict
    plt.axis('off')
    plt.title("Product % of Total", fontsize=20)
    plt.show()

1
2
3
4
5
6
7
8
#print the result as conclusion
print(f"CONCLUSION\n"
      f"List of probability:")

#find percentages for product and print as conclusion
for p in percentages:
    prob, label = p.split('\n')
    print(f"{label} = {prob}")

Task 7

Q: What products are most often sold together? A: Find Product Associate Rules using mlxtend library

First need to find the product sold by each order id and store them into series(product_by_order). Then transform the data into one-hot encoded matrix by applying TransactionEncoder().fit_transform(). The result will be stored as dataframe(encoded_df). Encoded data then used to generate frequent item sets using apriori algorithm the minimum support are defined as following. Lastly sort the order by value in ‘confidence’.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
#group product by order id
product_by_order = clean_sales_2019_df.groupby('order_id')['product'].apply(list)

#convert data to one-hot encoded matrix
te = TransactionEncoder()
onehot = te.fit_transform(product_by_order)
encoded_df = pd.DataFrame(onehot, columns=te.columns_)

#identify frequent itemsets
frequent_itemsets = apriori(encoded_df, min_support=0.000015, use_colnames=True)

#generate association rules
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.5)
rules = rules.sort_values('confidence', ascending=False)

#print rules data head
rules.head()

Although the rules can be presented as tables. In this case the rules is generated as string following more descriptive explanation. There may be other method to visualize the rules waiting to be explored. But for now the following will do.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
#create summary for each rule
#create list to store items
summaries = []

#iterrate throught available items in rules
for i, row in rules.iterrows():
    antecedents = ', '.join(list(row['antecedents']))
    consequents = list(row['consequents'])[0]
    support = row['support']
    confidence = row['confidence']
    lift = row['lift']
    leverage = row['leverage']
    conviction = row['conviction']
    
    #define ANSI escape codes for modifying the text's appearance
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    RED = "\033[91m"
    END = '\033[0m'

    #create summary in string
    summary = (
        f"Rule {i+1}:\n"
        f"Customers who bought {BLUE}{antecedents}{END} are more likely to buy {RED}{consequents}{END}.\n"
        f"This combination of products was purchased in {GREEN}{support*100:.3f}%{END} of all transactions.\n"
        f"When customers buy {BLUE}{antecedents}{END}, the chance of them also buying {RED}{consequents}{END} is {GREEN}{confidence*100:.2f}%{END}.\n"
        f"{RED}{consequents}{END} appears {GREEN}{lift:.2f}x{END} more likely when {BLUE}{antecedents}{END} appears compared to {RED}{consequents}{END} alone in general.\n"
        f"The rule's significance, as measured by leverage, is {GREEN}{leverage:.5f}{END}.\n"
        f"Dependence between {BLUE}{antecedents}{END} and {RED}{consequents}{END} is {GREEN}{conviction:.2f}{END}."
    )

    #append the summary
    summaries.append(summary)

#join all summaries into a single string
summary_string = "\n\n".join(summaries)

#print summary
print(f"CONCLUSION\n\n{summary_string}")

Summarize

It appears that the company generated a total of USD 34,465,537.94 in sales, with the majority of sales occurring in 2019, totaling USD 34,456,867.65. In contrast, 2020 only accounted for USD 8,670.29 in sales, which may be due to incomplete or missing data. The highest selling months were December, October, and April 2019, with sales of USD 4,608,295.70, USD 3,734,777.86 and USD 3,389,217.98 respectively, while the lowest sales occurred in January, September 2019, and January 2020, which accounted for only USD 3,915,878.85 of the total sales. The company generated the highest sales in California (CA), with San Francisco and Los Angeles accounting for USD 8,254,743.55 and USD 5,448,304.28 respectively. The lowest sales occurred in Portland, Maine, with only USD 449,321.38 in sales.

To maximize the company’s advertising efforts, it is recommended to advertise in December, October, and April, as the company generated the most orders during these months (24,004, 19,436, and 17,528, respectively). Tuesdays were found to be the most profitable day for advertising, generating 26,063 orders, while Fridays had the lowest number of orders. The best times for advertising were between 11:00-13:00 and 18:00-20:00, when the company typically had the highest sales counts.

In terms of product performance, some products performed better or worse than their price indicated. The finding result is as following:

The Bose SoundSport Headphones had a lower price, but sold less than the Apple Airpods Headphones. The 34in Ultrawide Monitor, 27in 4K Gaming Monitor, Flatscreen TV, 20in Monitor, Vareebadd Phone, LG Washing Machine, and LG Dryer all sold fewer units than the iPhone despite having lower prices. In contrast, the ThinkPad Laptop and Macbook Pro Laptop performed better than expected when compared to the LG Dryer, LG Washing Machine, Vareebadd Phone, and 20in Monitor despite having higher price. The 20in Monitor was the worst-performing product, selling only 4,123 units, compared to the 27in FHD Monitor, which sold 7,538 units despite having lower price. Finally, the probability of customers ordering specific products compared to all other products was calculated as following:

The USB-C Charging Cable had the highest probability, with 11.46%. Followed by the Wired Headphones at 9.83%. The iPhone had a probability of 3.28%. While the Google Phone had a probability of 2.65%. Additionally, when customers bought Apple Airpods Headphones, Wired Headphones, and Lightning Charging Cable, the chance of them also buying an iPhone was 100%, but this accounted for only 0.002% of total transactions. Similarly, when customers bought Wired Headphones, USB-C Charging Cable, and Bose SoundSport Headphones, the chance of them also buying a Google Phone was 60%, but this accounted for only 0.002% of total transactions.

Dataset ↩︎

Objective#

Import && Read Data#

Data cleaning#

identify data, null and non-numeric digits#

remove null rows#

chagne data type#

Questions#

Task 1#

Task 2#

Task 3#

Task 4#

Task 5#

Task 6#

Task 7#

Summarize#