DMV Lab 7


Department of AI & DS Engineering Computer Laboratory I

*************************************************************************************
Part II
Assignment 7
*************************************************************************************
Title: Data Loading, Storage and File Formats
Problem Statement: Analyzing Sales Data from Multiple File Formats
Dataset: Sales data in multiple file formats (e.g., CSV, Excel, JSON)
Description: The goal is to load and analyze sales data from different file formats,
including CSV, Excel, and JSON, and perform data cleaning, transformation, and analysis on
the dataset.
Tasks to Perform:
Obtain sales data files in various formats, such as CSV, Excel, and JSON.
1. Load the sales data from each file format into the appropriate data structures or
dataframes.
2. Explore the structure and content of the loaded data, identifying any
inconsistencies, missing values, or data quality issues.
3. Perform data cleaning operations, such as handling missing values, removing
duplicates, or correcting inconsistencies.
4. Convert the data into a unified format, such as a common dataframe or data
structure, to enable seamless analysis.
5. Perform data transformation tasks, such as merging multiple datasets, splitting
columns, or deriving new variables.
6. Analyze the sales data by performing descriptive statistics, aggregating data by
specific variables, or calculating metrics such as total sales, average order value, or
product category distribution.
7. Create visualizations, such as bar plots, pie charts, or box plots, to represent the
sales data and gain insights into sales trends, customer behavior, or product
performance.

Objective of the Assignment:


Students should be able to load multiple file formats and perform data cleaning,
transformation, and analysis on the dataset.

Prerequisite:
1. Basics of Python Programming
2. Concepts of Data Loading, Preprocessing, Data Formatting, Data Normalization, and
Data Cleaning.

Matoshri College of Engineering & Research Centre, Nashik



Theory:
Reading data and making it accessible (often called data loading) is a necessary first step
for using most data analysis tools. The term parsing is also sometimes used to describe
loading text data and interpreting it as tables and different data types. This assignment
focuses on data input and output using pandas, although numerous tools in other libraries
also help with reading and writing data in various formats. Pandas features a number of
functions for reading tabular data as a DataFrame object; pandas.read_csv is one of the
most frequently used, loading delimited data from CSV files.
Some of the data loading functions in pandas are summarized below:
Function        Description
read_csv        Load delimited data from a file, URL, or file-like object; uses comma as the default delimiter
read_fwf        Read data in fixed-width column format (i.e., no delimiters)
read_clipboard  Variation of read_csv that reads data from the clipboard; useful for converting tables from web pages
read_excel      Read tabular data from an Excel XLS or XLSX file
read_hdf        Read HDF5 files written by pandas
read_html       Read all tables found in the given HTML document
read_json       Read data from a JSON (JavaScript Object Notation) string representation, file, URL, or file-like object
read_feather    Read the Feather binary file format
read_orc        Read the Apache ORC binary file format
read_parquet    Read the Apache Parquet binary file format
read_pickle     Read an object stored by pandas using the Python pickle format
read_sas        Read a SAS dataset stored in one of the SAS system's custom storage formats
read_spss       Read a data file created by SPSS
read_sql        Read the results of a SQL query (using SQLAlchemy)
read_sql_table  Read a whole SQL table (using SQLAlchemy); equivalent to a query that selects everything in that table using read_sql
read_stata      Read a dataset from the Stata file format
read_xml        Read a table of data from an XML file

Indexing:
One or more columns can be treated as the index of the returned DataFrame; column
names may come from the file, from arguments you provide, or be omitted entirely.
Type inference and data conversion:
Includes user-defined value conversions and custom lists of missing-value markers.


Date and time parsing:
Includes a combining capability, such as combining date and time information spread
over multiple columns into a single column in the result.
Iterating:
Support for iterating over chunks of very large files.
Unclean data issues:
Includes skipping rows or a footer, comments, or other minor things like numeric data
with thousands separated by commas.
Because data in the real world can be messy, some of the data loading functions
(especially pandas.read_csv) have accumulated a long list of optional arguments
over time.
Some of these functions perform type inference, because the column data types are
not part of the data format. That means you don’t necessarily have to specify which
columns are numeric, integer, Boolean, or string. Other data formats, like HDF5, ORC, and
Parquet, have the data type information embedded in the format.
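The capabilities listed above can be exercised through read_csv's optional arguments. A minimal sketch with inline data (not the lab's sales.csv), showing comment skipping, thousands separators, and date parsing during load:

```python
import io
import pandas as pd

# Inline stand-in for a messy export (hypothetical data, not the real file):
# a comment line to skip, thousands separators, and a date column to parse.
raw = io.StringIO(
    "# exported 2023\n"
    "ORDERNUMBER,ORDERDATE,SALES\n"
    '10107,2003-02-24,"2,871.00"\n'
    '10121,2003-05-07,"2,765.90"\n'
)

df = pd.read_csv(
    raw,
    comment="#",                # unclean data: skip comment lines
    thousands=",",              # numeric data with thousands separators
    parse_dates=["ORDERDATE"],  # date parsing during load
)

print(df.dtypes)                # SALES becomes float64, ORDERDATE datetime64
print(df["SALES"].sum())
```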
Let’s start by loading the sales data from different file formats such as CSV, Excel, and JSON.
1. Load the sales data from each file format into the appropriate data structures or
dataframes.
Since this is comma-delimited, we can then use pandas.read_csv to read it into a
DataFrame:
In [5]:
import pandas as pd
df1=pd.read_csv('sales.csv', encoding='ISO-8859-1')
In [6]:
df1.head()
Out[6]:

   ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES        ORDERDATE   STATUS  QTR_ID  MONTH_ID  YEAR_ID  ...
0        10107               30      95.70                2  2871.00   2/24/2003 0:00  Shipped       1         2     2003  ...
1        10121               34      81.35                5  2765.90    5/7/2003 0:00  Shipped       2         5     2003  ...
2        10134               41      94.74                2  3884.34    7/1/2003 0:00  Shipped       3         7     2003  ...
3        10145               45      83.26                6  3746.70   8/25/2003 0:00  Shipped       3         8     2003  ...
4        10159               49     100.00               14  5205.27  10/10/2003 0:00  Shipped       4        10     2003  ...

5 rows × 25 columns
In [7]:
df2=pd.read_json('sales.json')
In [8]:
df2.head()

Out[8]:

   ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES        ORDERDATE   STATUS  QTR_ID  MONTH_ID  YEAR_ID  ...
0        10107               30      95.70                2  2871.00   2/24/2003 0:00  Shipped       1         2     2003  ...
1        10121               34      81.35                5  2765.90    5/7/2003 0:00  Shipped       2         5     2003  ...
2        10134               41      94.74                2  3884.34    7/1/2003 0:00  Shipped       3         7     2003  ...
3        10145               45      83.26                6  3746.70   8/25/2003 0:00  Shipped       3         8     2003  ...
4        10159               49     100.00               14  5205.27  10/10/2003 0:00  Shipped       4        10     2003  ...

5 rows × 25 columns
In [9]:
df3=pd.read_excel('sales.xlsx')
df3.head()

Out[9]:


   ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES        ORDERDATE   STATUS  QTR_ID  MONTH_ID  YEAR_ID  ...
0        10107               30      95.70                2  2871.00   2/24/2003 0:00  Shipped       1         2     2003  ...
1        10121               34      81.35                5  2765.90    5/7/2003 0:00  Shipped       2         5     2003  ...
2        10134               41      94.74                2  3884.34    7/1/2003 0:00  Shipped       3         7     2003  ...
3        10145               45      83.26                6  3746.70   8/25/2003 0:00  Shipped       3         8     2003  ...
4        10159               49     100.00               14  5205.27  10/10/2003 0:00  Shipped       4        10     2003  ...

5 rows × 25 columns

2. Explore the structure and content of the loaded data, identifying any
inconsistencies, missing values, or data quality issues.

In [9]:

df1.shape

Out[9]:
(2823, 25)

In [10]:
df2.shape

Out[10]:
(2823, 25)

In [11]:
df3.shape

Out[11]:
(2823, 25)
In [12]:
df1.info()

Out[12]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822


Data columns (total 25 columns):


# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ORDERNUMBER 2823 non-null int64
1 QUANTITYORDERED 2823 non-null int64
2 PRICEEACH 2823 non-null float64
3 ORDERLINENUMBER 2823 non-null int64
4 SALES 2823 non-null float64
5 ORDERDATE 2823 non-null object
6 STATUS 2823 non-null object
7 QTR_ID 2823 non-null int64
8 MONTH_ID 2823 non-null int64
9 YEAR_ID 2823 non-null int64
10 PRODUCTLINE 2823 non-null object
11 MSRP 2823 non-null int64
12 PRODUCTCODE 2823 non-null object
13 CUSTOMERNAME 2823 non-null object
14 PHONE 2823 non-null object
15 ADDRESSLINE1 2823 non-null object
16 ADDRESSLINE2 302 non-null object
17 CITY 2823 non-null object
18 STATE 1337 non-null object
19 POSTALCODE 2747 non-null object
20 COUNTRY 2823 non-null object
21 TERRITORY 1749 non-null object
22 CONTACTLASTNAME 2823 non-null object
23 CONTACTFIRSTNAME 2823 non-null object
24 DEALSIZE 2823 non-null object
dtypes: float64(2), int64(7), object(16)
memory usage: 551.5+ KB

In [13]:
df1.isna().sum()
Out[13]:
ORDERNUMBER            0
QUANTITYORDERED        0
PRICEEACH              0
ORDERLINENUMBER        0
SALES                  0
ORDERDATE              0
STATUS                 0
QTR_ID                 0
MONTH_ID               0
YEAR_ID                0
PRODUCTLINE            0
MSRP                   0
PRODUCTCODE            0
CUSTOMERNAME           0
PHONE                  0
ADDRESSLINE1           0
ADDRESSLINE2        2521
CITY                   0
STATE               1486
POSTALCODE            76
COUNTRY                0
TERRITORY           1074
CONTACTLASTNAME        0
CONTACTFIRSTNAME       0
DEALSIZE               0
dtype: int64


3. Perform data cleaning operations, such as handling missing values, removing


duplicates, or correcting inconsistencies.

Checking of Missing Values in Dataset:

● isnull() is the function used to check for missing (null) values in pandas.
● isna() is an alias of isnull(); combined with sum(), it gives column-wise or
row-wise counts of missing values.

In [14]:
df1.isna()

      ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER  SALES  ORDERDATE  STATUS  QTR_ID  MONTH_ID  YEAR_ID  ...

0 False False False False False False False False False False

1 False False False False False False False False False False

2 False False False False False False False False False False

3 False False False False False False False False False False

4 False False False False False False False False False False

... ... ... ... ... ... ... ... ... ... ...

2818 False False False False False False False False False False

2819 False False False False False False False False False False

2820 False False False False False False False False False False

2821 False False False False False False False False False False

2822 False False False False False False False False False False

2823 rows × 25 columns

The na_values option accepts a sequence of strings to add to the default list of strings
recognized as missing.
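A short sketch of na_values in action, using inline data with made-up sentinel strings (not taken from the real sales files):

```python
import io
import pandas as pd

# Hypothetical snippet: this file uses "missing" and "-" for absent values.
raw = io.StringIO(
    "PRODUCTCODE,POSTALCODE\n"
    "S10_1678,10022\n"
    "S10_1949,missing\n"
    "S10_2016,-\n"
)

# Both sentinel strings are added to the default list of NA markers.
df = pd.read_csv(raw, na_values=["missing", "-"])
print(df["POSTALCODE"].isna().sum())   # both sentinels become NaN
```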

In [15]:
df1.describe()

Out[15]:

       ORDERNUMBER  QUANTITYORDERED    PRICEEACH  ORDERLINENUMBER         SALES       QTR_ID     MONTH_ID      YEAR_ID         MSRP
count  2823.000000      2823.000000  2823.000000      2823.000000   2823.000000  2823.000000  2823.000000  2823.000000  2823.000000
mean  10258.725115        35.092809    83.658544         6.466171   3553.889072     2.717676     7.092455  2003.815090   100.715551
std      92.085478         9.741443    20.174277         4.225841   1841.865106     1.203878     3.656633     0.699670    40.187912
min   10100.000000         6.000000    26.880000         1.000000    482.130000     1.000000     1.000000  2003.000000    33.000000
25%   10180.000000        27.000000    68.860000         3.000000   2203.430000     2.000000     4.000000  2003.000000    68.000000
50%   10262.000000        35.000000    95.700000         6.000000   3184.800000     3.000000     8.000000  2004.000000    99.000000
75%   10333.500000        43.000000   100.000000         9.000000   4508.000000     4.000000    11.000000  2004.000000   124.000000
max   10425.000000        97.000000   100.000000        18.000000  14082.800000     4.000000    12.000000  2005.000000   214.000000
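The steps above only detect missing values; actually handling them and removing duplicates, as Task 3 asks, can be sketched as follows on a toy frame (the column names mirror the dataset, but the rows and the "UNKNOWN" sentinel are made up for illustration):

```python
import pandas as pd

# Toy frame mirroring the issues found above: a missing POSTALCODE
# and a fully duplicated row.
df = pd.DataFrame({
    "ORDERNUMBER": [10107, 10121, 10121],
    "POSTALCODE": ["10022", None, None],
})

# Drop exact duplicate rows, then fill remaining missing postal codes
# with a sentinel instead of discarding whole rows.
clean = df.drop_duplicates()
clean = clean.fillna({"POSTALCODE": "UNKNOWN"})

print(clean)
```

Whether to fill or drop depends on the analysis: dropna() discards information, while a sentinel keeps the row available for aggregation.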

4. Convert the data into a unified format, such as a common dataframe or data
structure, to enable seamless analysis.
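One way to bring the three loaded frames into a single common DataFrame, sketched here with tiny stand-in frames (the real df1/df2/df3 come from the CSV, JSON, and Excel loads above), is to align the columns and concatenate:

```python
import pandas as pd

# Toy stand-ins for df1/df2/df3 loaded from the three file formats.
df1 = pd.DataFrame({"ORDERNUMBER": [10107], "SALES": [2871.00]})
df2 = pd.DataFrame({"ORDERNUMBER": [10121], "SALES": [2765.90]})
df3 = pd.DataFrame({"ORDERNUMBER": [10134], "SALES": [3884.34]})

# Reindex each source to a common column order so the concatenation
# lines up even if the readers ordered columns differently.
cols = df1.columns
unified = pd.concat(
    [d.reindex(columns=cols) for d in (df1, df2, df3)],
    ignore_index=True,
)
print(unified.shape)
```

In this lab the three files hold the same records, so in practice one frame can simply be chosen after verifying they agree; the concat pattern matters when the sources hold different rows.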

Working with datetime in Pandas can sometimes be tricky if your datetime column is a
string with messy, mixed formats. You will need to do some pre-processing and convert the
mixed formats to the standard datetime type in order to perform any subsequent data
analysis.

Let’s take a look at the following example. In this dataframe, we have a column named
‘orderdate’ which appears to be a datetime column but is actually a string column.
Moreover, the dates are all arranged in different formats.

For example, for ID 0, the date string is arranged in the DD.MM.YYYY format in which the
day is written first. For ID 1, the date string is displayed in the DD-MM-YY format in which
the day is also written first. For ID 2, the date string is arranged in the format
MM/DD/YYYY with the month being written first which is common in the U.S.

So how can we clean this messy date-string column with mixed formats, easily and efficiently?

Let’s first try Pandas’ to_datetime() method which can parse any valid date strings to
datetime easily. If we simply do the following without any additional arguments, we get the
results shown below:
df['orderdate'] = pd.to_datetime(df['orderdate'])


We can see that with the pd.to_datetime() method, we are able to parse the date strings
that have mixed formats to datetime with a standard format (default is YYYY-MM-DD).
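A runnable sketch of this, with made-up date strings in the mixed formats described above; note that recent pandas versions (2.0 and later) require format="mixed" to parse per-element, while older versions infer automatically:

```python
import pandas as pd

# Hypothetical mixed-format date strings (DD.MM.YYYY, DD-MM-YYYY, MM/DD/YYYY).
df = pd.DataFrame({"orderdate": ["24.02.2003", "25-08-2003", "10/10/2003"]})

# format="mixed" (pandas >= 2.0) infers the format for each element;
# dayfirst=True tells the parser the ambiguous samples are day-first.
df["orderdate"] = pd.to_datetime(df["orderdate"], format="mixed", dayfirst=True)
print(df["orderdate"].dt.strftime("%Y-%m-%d").tolist())
```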

5. Perform data transformation tasks, such as merging multiple datasets, splitting


columns, or deriving new variables.
Sort and filter
Often the ordering of data matters, particularly when manually inspecting datasets. In
addition, it can be useful to filter the dataset and only use particular rows or particular
columns of data. To do this, we will use the following functions from pandas: sort_values,
query, and filter.
#Sort by name
sorted_df = df.sort_values(by=['name'])
display(sorted_df)
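The other two tools mentioned above, query and filter, can be sketched on a toy frame (the columns here are hypothetical): query keeps rows matching a boolean expression, while filter selects columns by name pattern.

```python
import pandas as pd

# Toy frame with made-up rows to demonstrate query and filter.
df = pd.DataFrame({
    "name": ["Ann Lee", "Bob Ray"],
    "SALES": [2871.00, 482.13],
    "STATUS": ["Shipped", "Cancelled"],
})

big = df.query("SALES > 1000")          # row filter via a boolean expression
status_cols = df.filter(like="STATUS")  # column filter by name substring
print(big["name"].tolist())
print(status_cols.columns.tolist())
```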
New variables
Another common example of creating a new variable is when a dataset contains birthdates.
A more useful feature might be “age” rather than using the raw birthdate. The code below
first creates a function to calculate age and then creates a new column to contain this data.
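The age derivation described above can be sketched like this, with made-up birthdates and a fixed reference date so the result is reproducible (in practice you would use the current date):

```python
from datetime import date

import pandas as pd

# Made-up birthdates for illustration.
df = pd.DataFrame({"birthdate": pd.to_datetime(["1990-06-15", "2000-01-31"])})

def age(born, today=date(2023, 1, 1)):
    # Subtract one year if the birthday has not yet occurred this year.
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

df["age"] = df["birthdate"].apply(age)
print(df["age"].tolist())
```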
Splitting and combining
Often when working with text columns, data needs to be split and/or combined in various
ways. The code below demonstrates how the name column can be split into first and last
names. The next code block shows how to recombine the first and last names using the
‘last, first’ convention.
#Splitting
splitnames = df.copy()
split = splitnames['name'].str.split(' ', expand = True)
splitnames['first'] = split[0]
splitnames['last'] = split[1]
display(splitnames)
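The recombination step described above, sketched on toy first/last columns (the real ones come from the split shown earlier), joins the two string columns in the ‘last, first’ convention:

```python
import pandas as pd

# Toy first/last columns standing in for the split result above.
splitnames = pd.DataFrame({"first": ["Ann", "Bob"], "last": ["Lee", "Ray"]})

# Recombine using the 'last, first' convention.
splitnames["name"] = splitnames["last"] + ", " + splitnames["first"]
print(splitnames["name"].tolist())
```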

Conclusion:
We successfully loaded and analyzed sales data from different file formats, including CSV,
Excel, and JSON, and performed data cleaning, transformation, and analysis on the dataset.

Performance: 3    Innovation: 1    Completion: 1    Total: 5    Dated Sign of Subject Teacher
