Lab 6


National University of Technology (NUTECH)

Electrical Engineering Department
EE4407 - Machine Learning Lab

LAB No: 06

Name: M Ahmed Mustafa

Student ID: F-20603039

Lab 06: Data Cleaning and Preprocessing

Objective:

To introduce essential data cleaning and preprocessing techniques in the context of a supermarket sales dataset. Students will learn to handle missing values, detect and manage duplicates, standardize data formats, and perform basic transformations, preparing them for more advanced data analysis and machine learning tasks.

Tools/Software Requirements:

 Python 3.x
 Jupyter Notebook or any other Python IDE
 Pandas and NumPy libraries for data manipulation
 Sample supermarket sales dataset in CSV format

Overview of Common Data Cleaning Tasks:
 Handling Missing Values: Techniques to deal with missing data, such as imputation or removal.
 Detecting and Managing Duplicates: Identifying and handling duplicate entries in the dataset.
 Data Transformation: Standardizing and normalizing data.
 Data Type Conversion: Ensuring each column in the dataset is of the correct data type.
 Feature Engineering: Creating new features from existing data to better capture the underlying patterns.

Implementation
Import Necessary Libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load and Inspect the Dataset:
 Loading the dataset using pandas.

file_path = 'C:\\Users\\us\\Desktop\\superstore_final_dataset.csv'  # Replace with your file path

dataset = pd.read_csv(file_path, encoding='ISO-8859-1')

 Basic inspection using .head(), .info(), and .describe().

print(dataset.head())
print(dataset.info())
print(dataset.describe())

1. pd.read_csv(): Reads a comma-separated values (CSV) file into a DataFrame.
2. head(): Returns the first 5 rows of the DataFrame, allowing you to quickly check the format of your data and the first few entries.
3. info(): Provides a concise summary of your DataFrame, including the number of non-null values in each column and the data type.
4. describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. It's particularly useful for understanding the numerical fields in your dataset.

Handling Missing Values:
 Identifying missing values using .isnull().
You can use the .isnull() method combined with .sum() to count the number of missing values in each column.

missing_values = dataset.isnull().sum()
print(missing_values)

 Strategies to handle them, like filling with mean/median or dropping.
For numerical columns, you can fill missing values with the mean or median. The choice between mean and median depends on the data distribution; the median is more robust to outliers.

dataset['column_name'] = dataset['column_name'].fillna(dataset['column_name'].mean())

dataset['column_name'] = dataset['column_name'].fillna(dataset['column_name'].median())

If a column has too many missing values, or if the missing values are in rows that are not crucial for your analysis, you might choose to drop them.

dataset = dataset.dropna()  # dropping rows that contain missing values
dataset = dataset.drop(columns=['column_name'])  # dropping a column

Detecting and Removing Duplicates:
 Identifying duplicates in product names or other relevant fields.
You can use the duplicated() method in Pandas to find duplicate rows. For instance, if you want to find duplicates based on the Product_Name field:

duplicates = dataset[dataset.duplicated(subset=['Product_Name'], keep=False)]
print(duplicates)

Before removing duplicates, consider the following:
1. Valid Duplicates: In a supermarket dataset, duplicates might be valid. For example, the same product could be sold multiple times, and each instance would be a valid entry.
2. Differentiating Factors: Check if there are other columns that differentiate the duplicates, like transaction IDs, dates, or store locations. This would indicate that the duplicates are valid.
3. Data Entry Errors: Sometimes, duplicates are a result of data entry errors. For instance, if two entries with the same product name have significantly different prices without a clear reason, it might be an error.
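Point 2 above can be checked directly by comparing duplicates on Product_Name alone against duplicates on the full set of identifying columns. A minimal sketch on toy data (the Order_ID and Order_Date column names here are assumptions for illustration):

```python
import pandas as pd

# Toy data standing in for the supermarket dataset (column names assumed)
df = pd.DataFrame({
    'Order_ID': ['A1', 'A2', 'A2', 'A3'],
    'Product_Name': ['Soap', 'Soap', 'Soap', 'Milk'],
    'Order_Date': ['2017-01-01', '2017-01-02', '2017-01-02', '2017-01-03'],
})

# Rows repeating only the product name may be valid repeat sales...
name_dupes = df.duplicated(subset=['Product_Name'], keep=False)

# ...while rows identical across all identifying columns are more likely entry errors
full_dupes = df.duplicated(subset=['Order_ID', 'Product_Name', 'Order_Date'], keep=False)

print(name_dupes.sum())  # 3 rows share a product name
print(full_dupes.sum())  # 2 rows are exact duplicates
```

Only the exact duplicates (the second kind) are safe candidates for removal.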

Remove All Duplicates:
dataset = dataset.drop_duplicates(subset=['Product_Name'], keep=False)

Verifying Removal of Duplicates:
print(dataset.duplicated(subset=['Product_Name']).sum())

Data Type Conversion and Standardization:
 Converting data types, e.g., converting date strings to datetime objects.
Data type conversion is essential when the data types of the columns are not suitable for the operations you intend to perform. A common scenario is converting date strings to datetime objects, which facilitates time-based analysis.

dataset['Date_Column'] = pd.to_datetime(dataset['Date_Column'])
 Other Common Conversions:
Converting a column to numeric type:

dataset['Numeric_Column'] = pd.to_numeric(dataset['Numeric_Column'], errors='coerce')

Converting a column to a string type:

dataset['String_Column'] = dataset['String_Column'].astype(str)

 Standardizing units (if applicable), like converting all weights to kilograms.

If your dataset contains measurements (like weights, lengths, etc.), ensuring that all data points are in a consistent unit is crucial for accurate analysis.

Standardizing Weights (e.g., converting all weights to kilograms):
 Assume you have a weight column in different units (like grams) and you want to standardize it to kilograms.

dataset['Weight_kg'] = dataset['Weight_g'] / 1000.0

Other Standardizations:

 If you have measurements in various units (like miles and kilometers), you'd similarly convert them to a single standard unit.
 Currency conversions might also be relevant if dealing with international data.
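The currency case could be sketched as follows, assuming a hypothetical Currency column and a fixed illustrative exchange rate (a real analysis would look up actual rates for the transaction dates):

```python
import pandas as pd

# Toy frame with mixed currencies; column names and the rate are assumptions
df = pd.DataFrame({'Price': [100.0, 5.0], 'Currency': ['PKR', 'USD']})

# Hypothetical fixed conversion rate: 1 USD = 280 PKR
rates_to_pkr = {'PKR': 1.0, 'USD': 280.0}

# Map each row's currency to its rate and convert to a single unit
df['Price_PKR'] = df['Price'] * df['Currency'].map(rates_to_pkr)

print(df['Price_PKR'].tolist())  # [100.0, 1400.0]
```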

Note:
 Data Inspection: Always inspect your data first to understand the existing formats and units.
 Error Handling: When converting data types, especially to numeric types, be aware of non-numeric values that can cause errors. Using errors='coerce' in pd.to_numeric() can help by setting invalid parsing to NaN.
 Contextual Decisions: The choice of standardization (like which unit to standardize to) should be guided by the specific requirements of your analysis or the conventions in your field.
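The errors='coerce' behaviour can be seen on a small example:

```python
import pandas as pd

# A column mixing numbers with a stray non-numeric entry
s = pd.Series(['10', '20', 'n/a', '40'])

# Without errors='coerce', pd.to_numeric would raise a ValueError on 'n/a';
# with it, the unparsable value becomes NaN and is treated as missing data
nums = pd.to_numeric(s, errors='coerce')

print(nums.isnull().sum())  # 1 value could not be parsed
print(nums.mean())          # mean of the valid entries only
```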

Feature Engineering:
Feature engineering is a crucial step in data preprocessing, where you create new features from existing data to enhance the performance of machine learning models or to gain deeper insights during analysis. In the context of a supermarket dataset, there are several ways to engineer features. Let's go through some common methods:

 If you have a wide range of products, it might be useful to categorize them into broader groups. This can simplify analysis and might reveal trends that are not apparent at the individual product level.

Assume you have a column named 'Product_Type' which contains types of products. You can categorize these into broader categories.

def categorize_product(product_type):
    if product_type in ['Beverage', 'Snack']:
        return 'Food'
    elif product_type in ['Shampoo', 'Soap']:
        return 'Personal Care'
    # Add more conditions as needed
    else:
        return 'Other'

dataset['Product_Category'] = dataset['Product_Type'].apply(categorize_product)

 If your dataset contains date or time information, you can extract features like the day of the week, month, or time of day, which can be relevant for analysis.

Extract Day of the Week:

dataset['Day_of_Week'] = dataset['Date_Column'].dt.day_name()

Extract Month:

dataset['Month'] = dataset['Date_Column'].dt.month

Other Potential Features

 Customer Segmentation Features: If you have customer data, you might create features that segment customers into groups based on their buying behavior.
 Seasonal Features: For products whose sales might be affected by seasons (like ice cream, heaters), you could create a feature indicating the season.
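A seasonal feature of the kind described above could be sketched like this, assuming a datetime column and the Northern-Hemisphere month-to-season convention:

```python
import pandas as pd

# Toy order dates; a real dataset would use its own date column
dates = pd.to_datetime(pd.Series(['2017-01-15', '2017-07-04', '2017-10-20']))

# Map a month number to a season label (Northern-Hemisphere convention assumed)
def month_to_season(month):
    if month in (12, 1, 2):
        return 'Winter'
    elif month in (3, 4, 5):
        return 'Spring'
    elif month in (6, 7, 8):
        return 'Summer'
    else:
        return 'Autumn'

seasons = dates.dt.month.apply(month_to_season)
print(seasons.tolist())  # ['Winter', 'Summer', 'Autumn']
```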

Lab Task
Implement data pre-processing and cleaning in Python, on the given dataset.

Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

file_path='C:\\Users\\Student\\Desktop\\superstore_final_dataset.csv'
dataset = pd.read_csv(file_path, encoding='ISO-8859-1')
print(dataset.head())

# Check if 'Row_ID' is unique
is_unique = dataset['Row_ID'].is_unique
print(is_unique)

# Reset index and update 'Row_ID'
dataset = dataset.reset_index(drop=True)
dataset['Row_ID'] = dataset.index + 1

# Check for missing values
missing_values = dataset.isnull().sum()
print(missing_values)

# Handle missing 'Order_ID'
for index, row in dataset.iterrows():
    if pd.isnull(row['Order_ID']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['Order_ID'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Order_ID'] = match['Order_ID'].values[0]

# Handle missing 'Order_Date'
for index, row in dataset.iterrows():
    if pd.isnull(row['Order_Date']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['Order_Date'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Order_Date'] = pd.to_datetime(match['Order_Date'].values[0])

# Handle missing 'Ship_Date'
for index, row in dataset.iterrows():
    if pd.isnull(row['Ship_Date']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['Ship_Date'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Ship_Date'] = pd.to_datetime(match['Ship_Date'].values[0])

# Convert 'Order_Date' and 'Ship_Date' to datetime
dataset['Order_Date'] = pd.to_datetime(dataset['Order_Date'], errors='coerce')
dataset['Ship_Date'] = pd.to_datetime(dataset['Ship_Date'], errors='coerce')

# Check for missing values again
# missing_values_after_handling = dataset.isnull().sum()
# print(missing_values_after_handling)

# Print the updated dataset
print(dataset.head())

output_file_path = 'C:\\Users\\Student\\Desktop\\cleaned\\superstore_final_dataset_cleaned.csv'  # Path for the new file
dataset.to_csv(output_file_path, index=False, encoding='ISO-8859-1')
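For large datasets, the row-by-row iterrows() loops used above can be slow. A vectorized alternative, sketched here on toy data with the same column names, fills each customer's missing Order_ID with the first non-null value recorded for that customer; the same pattern applies to Order_Date and Ship_Date:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the lab dataset (same column names)
df = pd.DataFrame({
    'Customer_Name': ['Ali', 'Ali', 'Sara'],
    'Order_ID': ['O1', np.nan, 'O2'],
})

# groupby(...).transform('first') broadcasts each customer's first
# non-null Order_ID to every row belonging to that customer
first_per_customer = df.groupby('Customer_Name')['Order_ID'].transform('first')
df['Order_ID'] = df['Order_ID'].fillna(first_per_customer)

print(df['Order_ID'].tolist())  # ['O1', 'O1', 'O2']
```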

Dataset Before:

Conclusion:

In conclusion, this lab successfully achieved its objective of familiarizing students with essential data cleaning and preprocessing techniques using a supermarket sales dataset. Through practical exercises in Python, employing tools such as Jupyter Notebook and libraries like Pandas and NumPy, students gained proficiency in handling common data challenges.

The importance of addressing missing values, managing duplicates, standardizing data formats, and performing basic transformations was underscored throughout the lab. These foundational skills are crucial prerequisites for more advanced data analysis and machine learning tasks, providing students with a solid groundwork for future exploration in the realm of data science.

The utilization of a real-world supermarket sales dataset in CSV format enhanced the relevance of the exercise, allowing students to apply learned techniques to authentic data scenarios. As we progress into more complex analyses and modeling, the acquired skills in data cleaning and preprocessing will serve as indispensable tools for ensuring the quality and reliability of data, setting the stage for informed decision-making and advanced analytics in diverse applications.
