DMV Lab 7
*************************************************************************************
Part II
Assignment 7
*************************************************************************************
Title: Data Loading, Storage and File Formats
Problem Statement: Analyzing Sales Data from Multiple File Formats
Dataset: Sales data in multiple file formats (e.g., CSV, Excel, JSON)
Description: The goal is to load and analyze sales data from different file formats,
including CSV, Excel, and JSON, and perform data cleaning, transformation, and analysis on
the dataset.
Tasks to Perform:
Obtain sales data files in various formats, such as CSV, Excel, and JSON.
1. Load the sales data from each file format into the appropriate data structures or
dataframes.
2. Explore the structure and content of the loaded data, identifying any
inconsistencies, missing values, or data quality issues.
3. Perform data cleaning operations, such as handling missing values, removing
duplicates, or correcting inconsistencies.
4. Convert the data into a unified format, such as a common dataframe or data
structure, to enable seamless analysis.
5. Perform data transformation tasks, such as merging multiple datasets, splitting
columns, or deriving new variables.
6. Analyze the sales data by performing descriptive statistics, aggregating data by
specific variables, or calculating metrics such as total sales, average order value, or
product category distribution.
7. Create visualizations, such as bar plots, pie charts, or box plots, to represent the
sales data and gain insights into sales trends, customer behavior, or product
performance.
Prerequisite:
1. Basics of Python Programming
2. Concepts of Data Loading, Preprocessing, Data Formatting, Data Normalization and
Data Cleaning.
Theory:
Reading data and making it accessible (often called data loading) is a necessary first step
for using most data analysis tools. The term parsing is also sometimes used to describe loading
text data and interpreting it as tables with different data types. This assignment focuses on data
input and output using pandas, although there are numerous tools in other libraries that help with
reading and writing data in various formats. pandas features a number of functions for reading
tabular data as a DataFrame object; pandas.read_csv is one of the most frequently used, for
loading CSV files.
Some of the data loading functions in pandas are summarized below:
Function         Description
read_csv         Load delimited data from a file, URL, or file-like object; uses comma as the default delimiter
read_fwf         Read data in fixed-width column format (i.e., no delimiters)
read_clipboard   Variation of read_csv that reads data from the clipboard; useful for converting tables from web pages
read_excel       Read tabular data from an Excel XLS or XLSX file
read_hdf         Read HDF5 files written by pandas
read_html        Read all tables found in the given HTML document
read_json        Read data from a JSON (JavaScript Object Notation) string representation, file, URL, or file-like object
read_feather     Read the Feather binary file format
read_orc         Read the Apache ORC binary file format
read_parquet     Read the Apache Parquet binary file format
read_pickle      Read an object stored by pandas using the Python pickle format
read_sas         Read a SAS dataset stored in one of the SAS system's custom storage formats
read_spss        Read a data file created by SPSS
read_sql         Read the results of a SQL query (using SQLAlchemy)
read_sql_table   Read a whole SQL table (using SQLAlchemy); equivalent to using a query that selects everything in that table with read_sql
read_stata       Read a dataset from Stata file format
read_xml         Read a table of data from an XML file
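As a minimal illustration of these loaders, the sketch below reads the sales data from the three file formats used in this assignment. The variable names mirror the notebook cells that follow (df2 for JSON, df3 for Excel); the CSV file name sales.csv and the variable df1 are assumptions by analogy with the later cells.

import pandas as pd

# Load the same sales data from three different file formats.
df1 = pd.read_csv('sales.csv')      # comma-delimited text (assumed file name)
df2 = pd.read_json('sales.json')    # JSON records
df3 = pd.read_excel('sales.xlsx')   # Excel workbook (first sheet by default)

# Quick sanity check: each load should report the same number of rows and columns.
print(df1.shape, df2.shape, df3.shape)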
Indexing:
Can treat one or more columns as the returned DataFrame's index, and controls whether to get column
names from the file, from arguments you provide, or not at all.
Type inference and data conversion:
Includes user-defined value conversions and custom lists of missing value markers.
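A minimal sketch of how these options look in practice for read_csv. The column names ORDERNUMBER, POSTALCODE and ORDERDATE are taken from the sales dataset shown later in this assignment; the file name sales.csv is an assumption.

import pandas as pd

df = pd.read_csv(
    'sales.csv',
    index_col='ORDERNUMBER',          # indexing: use an existing column as the row index
    dtype={'POSTALCODE': 'string'},   # type conversion: keep postal codes as text (preserves leading zeros)
    parse_dates=['ORDERDATE'],        # type inference: parse the date column while reading
)
print(df.index.name)
print(df.dtypes[['POSTALCODE', 'ORDERDATE']])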
First five rows of the sales data loaded from the CSV file:
   ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES        ORDERDATE   STATUS  QTR_ID  MONTH_ID  YEAR_ID  ...
4        10159               49     100.00               14  5205.27  10/10/2003 0:00  Shipped       4        10     2003  ...
5 rows × 25 columns
In [7]:
df2=pd.read_json('sales.json')
In [8]:
df2.head()
Out[8]:
   ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES        ORDERDATE   STATUS  QTR_ID  MONTH_ID  YEAR_ID  ...
4        10159               49     100.00               14  5205.27  10/10/2003 0:00  Shipped       4        10     2003  ...
5 rows × 25 columns
In [9]:
df3=pd.read_excel('sales.xlsx')
df3.head()
Out[9]:
   ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES        ORDERDATE   STATUS  QTR_ID  MONTH_ID  YEAR_ID  ...
4        10159               49     100.00               14  5205.27  10/10/2003 0:00  Shipped       4        10     2003  ...
5 rows × 25 columns
2. Explore the structure and content of the loaded data, identifying any
inconsistencies, missing values, or data quality issues.
In [9]:
df1.shape
Out[9]:
(2823, 25)
In [10]:
df2.shape
Out[10]:
(2823, 25)
In [11]:
df3.shape
Out[11]:
(2823, 25)
In [12]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
In [13]:
df1.isna().sum()
Out[13]:
ORDERNUMBER 0
QUANTITYORDERED 0
PRICEEACH 0
ORDERLINENUMBER 0
SALES 0
STATUS 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
CUSTOMERNAME 0
PHONE 0
POSTALCODE 76
COUNTRY 0
dtype: int64
● isnull() is the pandas function used to check for missing (null) values in a DataFrame or Series.
● isna() is an alias of isnull(); chained with .sum(), it gives the column-wise (or row-wise) count of missing values.
In [14]:
df1.isna()
      ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER  SALES  ORDERDATE  STATUS  QTR_ID  MONTH_ID  YEAR_ID  ...
0           False            False      False            False  False      False   False   False     False    False  ...
1           False            False      False            False  False      False   False   False     False    False  ...
2           False            False      False            False  False      False   False   False     False    False  ...
3           False            False      False            False  False      False   False   False     False    False  ...
4           False            False      False            False  False      False   False   False     False    False  ...
...           ...              ...        ...              ...    ...        ...     ...     ...       ...      ...  ...
2818        False            False      False            False  False      False   False   False     False    False  ...
2819        False            False      False            False  False      False   False   False     False    False  ...
2820        False            False      False            False  False      False   False   False     False    False  ...
2821        False            False      False            False  False      False   False   False     False    False  ...
2822        False            False      False            False  False      False   False   False     False    False  ...
The na_values option accepts a sequence of strings to add to the default list of strings recognized as missing.
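For example, extra missing-value markers can be declared when the file is read. This is only a sketch: the markers 'n/a' and '--' are purely illustrative, and sales.csv is the assumed CSV file name.

import pandas as pd

# Treat 'n/a' and '--' as missing values in addition to pandas' default markers
# (empty fields, 'NaN', 'NULL', ...).
df = pd.read_csv('sales.csv', na_values=['n/a', '--'])
print(df.isna().sum().head())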
In [15]:
df1.describe()
Out[15]:
(Summary statistics for the numeric columns: ORDERNUMBER, QUANTITYORDERED, PRICEEACH, ORDERLINENUMBER, SALES, QTR_ID, MONTH_ID, YEAR_ID, MSRP)
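The exploration above shows 76 missing values in POSTALCODE. A minimal sketch of the cleaning step (task 3), assuming missing postal codes are filled with a placeholder string and exact duplicate rows are dropped; both choices are illustrative, not part of the original notebook.

# Handle missing values and duplicates found during exploration.
df1['POSTALCODE'] = df1['POSTALCODE'].fillna('UNKNOWN')  # placeholder for the 76 missing codes
df1 = df1.drop_duplicates()                              # remove exact duplicate rows, if any
print(df1.isna().sum().sum())   # remaining missing values
print(df1.shape)                # shape after removing duplicates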
4. Convert the data into a unified format, such as a common dataframe or data
structure, to enable seamless analysis.
Working with datetime in Pandas can sometimes be tricky if your datetime column is a
string with messy, mixed formats. You will need to do some pre-processing and convert the
mixed formats to the standard datetime type in order to perform any subsequent data
analysis.
Let’s take a look at the following example. In this dataframe, we have a column named
‘orderdate’ which appears to be a datetime column but is actually a string column.
Moreover, the dates are all arranged in different formats.
For example, for ID 0, the date string is arranged in the DD.MM.YYYY format in which the
day is written first. For ID 1, the date string is displayed in the DD-MM-YY format in which
the day is also written first. For ID 2, the date string is arranged in the format
MM/DD/YYYY with the month being written first which is common in the U.S.
So how do we clean this messy date string column with mixed formats easily and efficiently?
Let’s first try Pandas’ to_datetime() method which can parse any valid date strings to
datetime easily. If we simply do the following without any additional arguments, we get the
results shown below:
df['orderdate'] = pd.to_datetime(df['orderdate'])
We can see that with the pd.to_datetime() method, we are able to parse the date strings
that have mixed formats to datetime with a standard format (default is YYYY-MM-DD).
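Putting the remaining tasks together, the sketch below stacks the three dataframes into one unified dataframe, standardizes the ORDERDATE column, and computes a simple aggregate with a bar plot. The concat/drop_duplicates step, the format handling notes, and the matplotlib usage are illustrative assumptions rather than part of the original notebook.

import pandas as pd
import matplotlib.pyplot as plt

# 4. Unified format: stack the CSV, JSON and Excel dataframes into one dataframe.
sales = pd.concat([df1, df2, df3], ignore_index=True)

# Standardize the order date to a true datetime column. If the strings genuinely
# mix several formats, pandas 2.x accepts format='mixed' (or dayfirst=True for
# day-first strings); with one consistent format, the plain call is enough.
sales['ORDERDATE'] = pd.to_datetime(sales['ORDERDATE'])

# Because the three files hold the same records here, drop the duplicates that
# stacking introduces; with disjoint files this call would simply be omitted.
sales = sales.drop_duplicates()

# 6. Descriptive analysis: total sales and average sale value per product line.
summary = sales.groupby('PRODUCTLINE')['SALES'].agg(total_sales='sum', average_sale='mean')
print(summary)

# 7. Visualization: bar plot of total sales by product line.
summary['total_sales'].plot(kind='bar', title='Total sales by product line')
plt.tight_layout()
plt.show()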
Conclusion:
We successfully loaded and analyzed sales data from different file formats, including CSV,
Excel, and JSON, and performed data cleaning, transformation, and analysis on the dataset.
Performance    Innovation    Completion    Total    Dated Sign of Subject Teacher
3              1             1             5