Data Analysis
Objective
The purpose of this project is to analyse the provided dataset (dataset1), walking through each necessary step and method with an accompanying explanation. The project collects sales-record data. I wrangled and analysed the data from the last couple of years for each country individually, then visualized it in an Excel dashboard so the sales trends are easy to understand.

Import && Read Data
```python
import os
import pandas as pd
import matplotlib.pyplot as plt

# Directory that holds the sales CSV files
dir_path = "./source/data"

# List every file found under the data directory
for dirname, _, filenames in os.walk(dir_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Read each file and append it to one combined DataFrame
all_sales_df = pd.DataFrame()
for filename in os.listdir(dir_path):
    df = pd.read_csv(os.path.join(dir_path, filename))
    all_sales_df = pd.concat([all_sales_df, df], ignore_index=True)

# Strip stray whitespace from the column names, then shorten them
all_sales_df = all_sales_df.rename(mapper=str.strip, axis='columns')
all_sales_df = all_sales_df.rename(columns={'Order ID': 'Order_id',
                                            'Quantity Ordered': 'Quantity',
                                            'Price Each': 'Price',
                                            'Order Date': 'Date',
                                            'Purchase Address': 'Address'})

# Lowercase the column names
column_name = list(all_sales_df.columns)
column_name = [x.lower().strip() for x in column_name]
all_sales_df.columns = column_name
```

Data cleaning
First, check for rows that contain nulls and for duplicated rows. Then check for non-numeric values in order_id, quantity, and price, since these three columns must be numeric. Next, convert each column to its correct format: Int, Timestamp, or String. There are a few considerations before dropping duplicated rows, so first inspect the duplicated rows together with their duplicates. ...
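
The cleaning checks described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: `sample_df` is a small made-up stand-in for `all_sales_df`, the helper name `inspect_and_convert` is hypothetical, the date format string is a guess, and storing price as a float is my assumption. Duplicated rows are only reported, not dropped, because whether to drop them is still an open consideration.

```python
import pandas as pd

# Made-up sample standing in for all_sales_df (hypothetical values).
sample_df = pd.DataFrame({
    "order_id": ["1001", "1002", "1002", "abc", None],
    "quantity": ["2", "1", "1", "x", None],
    "price": ["11.95", "149.99", "149.99", "n/a", None],
    "date": ["04/19/19 08:46", "04/07/19 22:30", "04/07/19 22:30",
             "04/08/19 10:00", None],
    "address": ["917 1st St, Dallas, TX 75001",
                "682 Chestnut St, Boston, MA 02215",
                "682 Chestnut St, Boston, MA 02215",
                "1 Main St, Austin, TX 73301", None],
})


def inspect_and_convert(df, date_format="%m/%d/%y %H:%M"):
    """Report problem rows, then return a typed copy of the usable rows."""
    # Rows with at least one null value.
    null_rows = df[df.isnull().any(axis=1)]

    # keep=False flags every member of a duplicated group, so each
    # duplicated row is listed together with its duplicates.
    dup_rows = df[df.duplicated(keep=False)]

    # A value is non-numeric if it is present but cannot be parsed as a number.
    non_numeric_mask = pd.Series(False, index=df.index)
    for col in ["order_id", "quantity", "price"]:
        parsed = pd.to_numeric(df[col], errors="coerce")
        non_numeric_mask |= parsed.isna() & df[col].notna()
    non_numeric_rows = df[non_numeric_mask]

    # Assumption: null and non-numeric rows are excluded before conversion.
    # Duplicates are kept here on purpose.
    cleaned = df[~non_numeric_mask].dropna().copy()
    cleaned["order_id"] = pd.to_numeric(cleaned["order_id"]).astype("int64")
    cleaned["quantity"] = pd.to_numeric(cleaned["quantity"]).astype("int64")
    cleaned["price"] = pd.to_numeric(cleaned["price"])  # float (assumption)
    cleaned["date"] = pd.to_datetime(cleaned["date"], format=date_format)
    cleaned["address"] = cleaned["address"].astype(str)
    return null_rows, dup_rows, non_numeric_rows, cleaned


null_rows, dup_rows, non_numeric_rows, cleaned_df = inspect_and_convert(sample_df)
print(len(null_rows), len(dup_rows), len(non_numeric_rows), len(cleaned_df))
```

Returning the problem rows separately, rather than silently dropping them, keeps the decision about duplicates visible and makes it easy to review what would be removed before committing to it.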