Lab 6


National University of Technology (NUTECH)

Electrical Engineering Department
EE4407 - Machine Learning Lab

LAB No: 06

Name: M Ahmed Mustafa

Student ID: F-20603039

Lab 06: Data Cleaning and Preprocessing

Objective:

To introduce essential data cleaning and preprocessing techniques in the context of a supermarket sales dataset. Students will learn to handle missing values, detect and manage duplicates, standardize data formats, and perform basic transformations, preparing them for more advanced data analysis and machine learning tasks.

Tools/Software Requirements:

 Python 3.x
 Jupyter Notebook or any other Python IDE
 Pandas and NumPy libraries for data manipulation
 Sample supermarket sales dataset in CSV format

Overview of Common Data Cleaning Tasks:
 Handling Missing Values: Techniques to deal with missing data, such as imputation or removal.
 Detecting and Managing Duplicates: Identifying and handling duplicate entries in the dataset.
 Data Transformation: Standardizing and normalizing data.
 Data Type Conversion: Ensuring each column in the dataset is of the correct data type.
 Feature Engineering: Creating new features from existing data to better capture the underlying patterns.

Implementation
Import Necessary Libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load and Inspect the Dataset:
 Loading the dataset using pandas.

file_path = 'C:\\Users\\us\\Desktop\\superstore_final_dataset.csv'  # Replace with your file path

dataset = pd.read_csv(file_path, encoding='ISO-8859-1')

 Basic inspection using .head(), .info(), and .describe().

print(dataset.head())
print(dataset.info())
print(dataset.describe())

1. pd.read_csv(): Reads a comma-separated values (CSV) file into a DataFrame.
2. head(): Returns the first 5 rows of the DataFrame, allowing you to quickly check the format of your data and the first few entries.
3. info(): Provides a concise summary of your DataFrame, including the number of non-null values in each column and the data type.
4. describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. It's particularly useful for understanding the numerical fields in your dataset.

Handling Missing Values:
 Identifying missing values using .isnull().
You can use the .isnull() method combined with .sum() to count the number of missing values in each column.

missing_values = dataset.isnull().sum()
print(missing_values)

 Strategies to handle them, like filling with mean/median or dropping.
For numerical columns, you can fill missing values with the mean or median. The choice between mean and median depends on the data distribution; the median is more robust to outliers.

dataset['column_name'] = dataset['column_name'].fillna(dataset['column_name'].mean())

dataset['column_name'] = dataset['column_name'].fillna(dataset['column_name'].median())

If a column has too many missing values, or if the missing values are in rows that are not crucial for your analysis, you might choose to drop them.

dataset = dataset.dropna()  # dropping rows that contain missing values
dataset = dataset.drop(columns=['column_name'])  # dropping a column

Detecting and Removing Duplicates:
 Identifying duplicates in product names or other relevant fields.
You can use the duplicated() method in Pandas to find duplicate rows. For instance, if you want to find duplicates based on the Product_Name field:

duplicates = dataset[dataset.duplicated(subset=['Product_Name'], keep=False)]
print(duplicates)

Before removing duplicates, consider the following:
1. Valid Duplicates: In a supermarket dataset, duplicates might be valid. For example, the same product could be sold multiple times, and each instance would be a valid entry.
2. Differentiating Factors: Check if there are other columns that differentiate the duplicates, like transaction IDs, dates, or store locations. This would indicate that the duplicates are valid.
3. Data Entry Errors: Sometimes, duplicates are a result of data entry errors. For instance, if two entries with the same product name have significantly different prices without a clear reason, it might be an error.
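Point 2 above can be checked directly by comparing duplicates on Product_Name alone against duplicates on the full set of identifying columns. A minimal sketch on toy data (the Order_ID and Order_Date column names here are assumptions for illustration):

```python
import pandas as pd

# Toy data standing in for the supermarket dataset (column names assumed)
df = pd.DataFrame({
    'Order_ID': ['A1', 'A2', 'A2', 'A3'],
    'Product_Name': ['Soap', 'Soap', 'Soap', 'Milk'],
    'Order_Date': ['2017-01-01', '2017-01-02', '2017-01-02', '2017-01-03'],
})

# Rows repeating only the product name may be valid repeat sales...
name_dupes = df.duplicated(subset=['Product_Name'], keep=False)

# ...while rows identical across all identifying columns are more likely entry errors
full_dupes = df.duplicated(subset=['Order_ID', 'Product_Name', 'Order_Date'], keep=False)

print(name_dupes.sum())  # 3 rows share a product name
print(full_dupes.sum())  # 2 rows are exact duplicates
```

Only the exact duplicates (the second kind) are safe candidates for removal.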

Remove All Duplicates:
dataset = dataset.drop_duplicates(subset=['Product_Name'], keep=False)

Verifying Removal of Duplicates:
print(dataset.duplicated(subset=['Product_Name']).sum())

Data Type Conversion and Standardization:
 Converting data types, e.g., converting date strings to datetime objects.
Data type conversion is essential when the data types of the columns are not suitable for the operations you intend to perform. A common scenario is converting date strings to datetime objects, which facilitates time-based analysis.

dataset['Date_Column'] = pd.to_datetime(dataset['Date_Column'])
 Other Common Conversions:
Converting a column to numeric type:

dataset['Numeric_Column'] = pd.to_numeric(dataset['Numeric_Column'], errors='coerce')

Converting a column to a string type:

dataset['String_Column'] = dataset['String_Column'].astype(str)

 Standardizing units (if applicable), like converting all weights to kilograms.

If your dataset contains measurements (like weights, lengths, etc.), ensuring that all data points are in a consistent unit is crucial for accurate analysis.

Standardizing Weights (e.g., converting all weights to kilograms):
 Assume you have a weight column in different units (like grams) and you want to standardize it to kilograms.

dataset['Weight_kg'] = dataset['Weight_g'] / 1000.0

Other Standardizations:

 If you have measurements in various units (like miles and kilometers), you'd similarly convert them to a single standard unit.
 Currency conversions might also be relevant if dealing with international data.
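The currency case could be sketched as follows, assuming a hypothetical Currency column and a fixed illustrative exchange rate (a real analysis would look up actual rates for the transaction dates):

```python
import pandas as pd

# Toy frame with mixed currencies; column names and the rate are assumptions
df = pd.DataFrame({'Price': [100.0, 5.0], 'Currency': ['PKR', 'USD']})

# Hypothetical fixed conversion rate: 1 USD = 280 PKR
rates_to_pkr = {'PKR': 1.0, 'USD': 280.0}

# Map each row's currency to its rate and convert to a single unit
df['Price_PKR'] = df['Price'] * df['Currency'].map(rates_to_pkr)

print(df['Price_PKR'].tolist())  # [100.0, 1400.0]
```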

Note:
 Data Inspection: Always inspect your data first to understand the existing formats and units.
 Error Handling: When converting data types, especially to numeric types, be aware of non-numeric values that can cause errors. Using errors='coerce' in pd.to_numeric() can help by setting invalid parsing to NaN.
 Contextual Decisions: The choice of standardization (like which unit to standardize to) should be guided by the specific requirements of your analysis or the conventions in your field.
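The errors='coerce' behaviour can be seen on a small example:

```python
import pandas as pd

# A column mixing numbers with a stray non-numeric entry
s = pd.Series(['10', '20', 'n/a', '40'])

# Without errors='coerce', pd.to_numeric would raise a ValueError on 'n/a';
# with it, the unparsable value becomes NaN and is treated as missing data
nums = pd.to_numeric(s, errors='coerce')

print(nums.isnull().sum())  # 1 value could not be parsed
print(nums.mean())          # mean of the valid entries only
```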

Feature Engineering:
Feature engineering is a crucial step in data preprocessing, where you create new features from existing data to enhance the performance of machine learning models or to gain deeper insights during analysis. In the context of a supermarket dataset, there are several ways to engineer features. Let's go through some common methods:

 If you have a wide range of products, it might be useful to categorize them into broader groups. This can simplify analysis and might reveal trends that are not apparent at the individual product level.

Assume you have a column named 'Product_Type' which contains types of products. You can categorize these into broader categories.

def categorize_product(product_type):
    if product_type in ['Beverage', 'Snack']:
        return 'Food'
    elif product_type in ['Shampoo', 'Soap']:
        return 'Personal Care'
    # Add more conditions as needed
    else:
        return 'Other'

dataset['Product_Category'] = dataset['Product_Type'].apply(categorize_product)

 If your dataset contains date or time information, you can extract features like the day of the week, month, or time of day, which can be relevant for analysis.

Extract Day of the Week:

dataset['Day_of_Week'] = dataset['Date_Column'].dt.day_name()

Extract Month:

dataset['Month'] = dataset['Date_Column'].dt.month

Other Potential Features

 Customer Segmentation Features: If you have customer data, you might create features that segment customers into groups based on their buying behavior.
 Seasonal Features: For products whose sales might be affected by seasons (like ice cream, heaters), you could create a feature indicating the season.
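A seasonal feature of the kind described above could be sketched like this, assuming a datetime column and the Northern-Hemisphere month-to-season convention:

```python
import pandas as pd

# Toy order dates; a real dataset would use its own date column
dates = pd.to_datetime(pd.Series(['2017-01-15', '2017-07-04', '2017-10-20']))

# Map a month number to a season label (Northern-Hemisphere convention assumed)
def month_to_season(month):
    if month in (12, 1, 2):
        return 'Winter'
    elif month in (3, 4, 5):
        return 'Spring'
    elif month in (6, 7, 8):
        return 'Summer'
    else:
        return 'Autumn'

seasons = dates.dt.month.apply(month_to_season)
print(seasons.tolist())  # ['Winter', 'Summer', 'Autumn']
```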

Lab Task
Implement data pre-processing and cleaning in Python, on the given dataset.

Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

file_path='C:\\Users\\Student\\Desktop\\superstore_final_dataset.csv'
dataset = pd.read_csv(file_path, encoding='ISO-8859-1')
print(dataset.head())

# Check if 'Row_ID' is unique
is_unique = dataset['Row_ID'].is_unique
print(is_unique)

# Reset index and update 'Row_ID'
dataset = dataset.reset_index(drop=True)
dataset['Row_ID'] = dataset.index + 1

# Check for missing values
missing_values = dataset.isnull().sum()
print(missing_values)

# Handle missing 'Order_ID'
for index, row in dataset.iterrows():
    if pd.isnull(row['Order_ID']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['Order_ID'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Order_ID'] = match['Order_ID'].values[0]

# Handle missing 'Order_Date'
for index, row in dataset.iterrows():
    if pd.isnull(row['Order_Date']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['Order_Date'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Order_Date'] = pd.to_datetime(match['Order_Date'].values[0])

# Handle missing 'Ship_Date'
for index, row in dataset.iterrows():
    if pd.isnull(row['Ship_Date']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['Ship_Date'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Ship_Date'] = pd.to_datetime(match['Ship_Date'].values[0])

# Convert 'Order_Date' and 'Ship_Date' to datetime
dataset['Order_Date'] = pd.to_datetime(dataset['Order_Date'], errors='coerce')
dataset['Ship_Date'] = pd.to_datetime(dataset['Ship_Date'], errors='coerce')

# Check for missing values again
# missing_values_after_handling = dataset.isnull().sum()
# print(missing_values_after_handling)

# Print the updated dataset
print(dataset.head())

output_file_path = 'C:\\Users\\Student\\Desktop\\cleaned\\superstore_final_dataset_cleaned.csv'  # Path for the new file
dataset.to_csv(output_file_path, index=False, encoding='ISO-8859-1')
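For large datasets, the row-by-row iterrows() loops used above can be slow. A vectorized alternative, sketched here on toy data with the same column names, fills each customer's missing Order_ID with the first non-null value recorded for that customer; the same pattern applies to Order_Date and Ship_Date:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the lab dataset (same column names)
df = pd.DataFrame({
    'Customer_Name': ['Ali', 'Ali', 'Sara'],
    'Order_ID': ['O1', np.nan, 'O2'],
})

# groupby(...).transform('first') broadcasts each customer's first
# non-null Order_ID to every row belonging to that customer
first_per_customer = df.groupby('Customer_Name')['Order_ID'].transform('first')
df['Order_ID'] = df['Order_ID'].fillna(first_per_customer)

print(df['Order_ID'].tolist())  # ['O1', 'O1', 'O2']
```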

Dataset Before:

Conclusion:

In conclusion, this lab successfully achieved its objective of familiarizing students with essential data cleaning and preprocessing techniques using a supermarket sales dataset. Through practical exercises in Python, employing tools such as Jupyter Notebook and libraries like Pandas and NumPy, students gained proficiency in handling common data challenges.

The importance of addressing missing values, managing duplicates, standardizing data formats, and performing basic transformations was underscored throughout the lab. These foundational skills are crucial prerequisites for more advanced data analysis and machine learning tasks, providing students with a solid groundwork for future exploration in the realm of data science.

The utilization of a real-world supermarket sales dataset in CSV format enhanced the relevance of the exercise, allowing students to apply learned techniques to authentic data scenarios. As we progress into more complex analyses and modeling, the acquired skills in data cleaning and preprocessing will serve as indispensable tools for ensuring the quality and reliability of data, setting the stage for informed decision-making and advanced analytics in diverse applications.
