Data Wrangling



Due to the rapid expansion of the amount of data and data sources available today, storing and
organizing large quantities of data for analysis is becoming increasingly necessary.

Data wrangling is the process of removing errors and combining complex data sets to make
them more accessible and easier to analyze.
Data wrangling can be defined as the process of cleaning, organizing, and transforming
raw data into the desired format for analysts to use for prompt decision-making.

Data wrangling enables businesses to tackle more complex data in less time, produce more
accurate results, and make better decisions.

Organizations increasingly rely on data wrangling tools to make data ready for
downstream analytics.
Importance of Data Wrangling

The main reasons for using data wrangling tools are:

•Making raw data usable. Accurately wrangled data helps ensure that only quality data enters the
downstream analysis.
•Gathering data from various sources into a centralized location so it can be used.
•Piecing together raw data in the required format and understanding the business context of the
data.
•Cleaning and converting source data into a standard format with automated data integration tools, so
that it can be reused according to end requirements. Businesses use this standardized data to perform
crucial cross-data-set analytics.
•Cleansing the data of noise and of flawed or missing elements.
•Preparing data for the data mining process, which involves gathering data and making sense of
it.
•Helping business users make concrete, timely decisions.
Data wrangling software typically performs six iterative steps of Discovering, Structuring,
Cleaning, Enriching, Validating, and Publishing data before it is ready for analytics.
1. Data discovery
The first step helps you make sense of the data you're working with. You'll also
need to keep the primary goal of the data analysis in mind during this step. For
example, if your organization wants to gain customer behavior insight, you might
sort customer data according to location, promotional codes, and purchases.

2. Data structuring
Once you've finished the first step, you might find that the raw data is
disorganized, incomplete, or misformatted for your purposes. That's where data
structuring comes into play. This is the process in which you transform that raw
data into a form appropriate for the analytical model you want to use to interpret
the data.
3. Data cleaning
During the data cleaning step, you remove data errors that might distort or damage the value of your analysis.
This includes tasks like standardising inputs, deleting empty cells, removing outliers, and deleting blank rows.
Ultimately, the goal is to ensure the data is as error-free as possible.

4. Enriching data
Once you've transformed your data into a more usable state, you must determine if you have all the data you
need for the project. If you don't, you can enrich it by adding values from other data sets. And if you do so, you
might have to repeat steps one through three for that new data.

5. Validating data
When you work on data validation, you verify that your data is consistent and of sufficient quality. During this
step, you might find some issues you need to address or that the data is ready to be analysed. This step is
typically completed using automated processes and requires some programming skills.

6. Publishing data
After validating your data, you're ready to publish it. In this step, you'll put it into whatever format you prefer
for sharing with other organisation members for analysis purposes. Use written reports or digital files,
depending on the nature of the data and the organisation's overarching goals.
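
The six steps above can be sketched in pandas as follows. This is only a minimal illustration: the orders and segments tables, their column names, and the validation rules are hypothetical and chosen to keep the example short.

```python
import pandas as pd

# Hypothetical raw data: two small sources with typical problems
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "amount": ["120", "80", "80", None, "95"],
    "city": [" Pune", "delhi", "delhi", "Mumbai ", "PUNE"],
})
segments = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "segment": ["retail", "retail", "wholesale", "retail"],
})

# 1. Discovery: inspect column types and count missing values
print(orders.dtypes)
print(orders.isna().sum())

# 2. Structuring: give each column the type the analysis needs
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")

# 3. Cleaning: standardise text, drop duplicate rows and rows without an amount
orders["city"] = orders["city"].str.strip().str.title()
clean = orders.drop_duplicates().dropna(subset=["amount"])

# 4. Enriching: add values from another data set
enriched = clean.merge(segments, on="customer_id", how="left")

# 5. Validating: fail early if basic expectations do not hold
assert enriched["amount"].ge(0).all()
assert enriched["segment"].notna().all()

# 6. Publishing: export in a shareable format (CSV text here)
csv_text = enriched.to_csv(index=False)
print(csv_text)
```

In practice the steps are iterative: a failed check in step 5 usually sends you back to cleaning or structuring.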
Benefits of Data Wrangling

•Data wrangling improves data usability by converting data into a format that is
compatible with the end system.
•It helps users build data flows quickly within an intuitive user interface and easily
schedule and automate the data-flow process.
•It integrates various types of information and their sources (such as databases, web
services, and files).
•It helps users process very large volumes of data and share data-flow techniques
easily.
Data Wrangling Tools

Some examples of basic data munging tools are:

•Spreadsheets / Excel Power Query - The most basic manual data wrangling tool

•OpenRefine - A data cleaning tool with a graphical interface; advanced transformations benefit from some programming skills

•Tabula - A tool for extracting tables from PDF files

•Google DataPrep - A data service that explores, cleans, and prepares data

•Data wrangler - A data cleaning and transforming tool


Data Wrangling Examples
Data wrangling techniques serve a variety of use cases. The most common are:
•Merging several data sources into one data set for analysis
•Identifying gaps or empty cells in data and either filling or removing them
•Deleting irrelevant or unnecessary data
•Identifying severe outliers in data and either explaining the inconsistencies or deleting them to
facilitate analysis
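
As a rough illustration of the gap-filling and outlier examples above, the sketch below fills a missing value with the column median and flags values outside 1.5 times the interquartile range (IQR). The sales table is hypothetical, and the 1.5 x IQR rule is one common convention, not the only way to define an outlier.

```python
import pandas as pd

# Hypothetical daily sales figures with one gap and one extreme value
sales = pd.DataFrame({
    "day": [1, 2, 3, 4, 5, 6, 7, 8],
    "units": [20, 22, None, 21, 19, 23, 250, 20],
})

# Identify gaps, then fill them with the column median
n_missing = sales["units"].isna().sum()
sales["units"] = sales["units"].fillna(sales["units"].median())

# Flag values that lie more than 1.5 * IQR beyond the quartiles
q1 = sales["units"].quantile(0.25)
q3 = sales["units"].quantile(0.75)
iqr = q3 - q1
is_outlier = (sales["units"] < q1 - 1.5 * iqr) | (sales["units"] > q3 + 1.5 * iqr)

outliers = sales[is_outlier]   # rows to investigate or explain
trimmed = sales[~is_outlier]   # rows kept for analysis
print(outliers)
```

Whether a flagged row is deleted or kept with an explanation is a judgment call that depends on the business context.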
Businesses also use data wrangling tools to:

•Detect corporate fraud
•Support data security
•Ensure accurate and recurring data modeling results
•Ensure business compliance with industry standards
•Perform Customer Behavior Analysis
•Reduce time spent on preparing data for analysis
•Promptly recognize the business value of your data
•Find out data trends
Data Wrangling Skills

• To perform a series of data transformations such as merging, ordering, and aggregating

• To use data science programming languages such as R, Python, Julia, and SQL on specified
data sets

• To make logical judgments based on the underlying business context


Data Wrangling in Python
Data wrangling is a crucial topic in data science and data analysis. In Python, the pandas
library is commonly used for data wrangling.
Data wrangling in Python deals with the following functionalities:
1.Data exploration: The data is studied, analyzed, and understood, often by visualizing
representations of it.
2.Dealing with missing values: Large datasets often contain missing (NaN) values. These
need to be handled, either by replacing them with the mean or the mode (the most frequent
value) of the column, or simply by dropping the rows that contain a NaN value.
3.Reshaping data: The data is manipulated according to the requirements; new data can be
added or existing data can be modified.
4.Filtering data: Sometimes datasets contain unwanted rows or columns, which need to be
removed or filtered out.
5.Other: After applying the above functionalities to the raw dataset, we get an efficient
dataset that meets our requirements and can be used for purposes such as data analysis,
machine learning, data visualization, and model training.
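
Reshaping often also means changing the layout of a table. As a small sketch (the subject columns and marks are made up for illustration), melt() turns a wide table into a long one and pivot() turns it back:

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per subject
wide = pd.DataFrame({
    "Name": ["Jai", "Princi"],
    "Maths": [90, 76],
    "Science": [85, 80],
})

# Wide -> long: one row per (student, subject) pair
long = wide.melt(id_vars="Name", var_name="Subject", value_name="Marks")

# Long -> wide again, indexed by student name
back = long.pivot(index="Name", columns="Subject", values="Marks")

print(long)
print(back)
```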
Data exploration in Python

In data exploration, we load the data into a dataframe and then display it in tabular
format.

# Import pandas package
import pandas as pd

# Assign data (missing marks are recorded as the string 'NaN')
data = {'Name': ['Jai', 'Princi', 'Gaurav',
                 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}

# Convert into DataFrame
df = pd.DataFrame(data)

# Display data
df
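
Beyond displaying the table, a few standard pandas calls give a quick overview of the data. The sketch below rebuilds the same dataframe so that it runs on its own; which checks are worth running is an assumption that depends on the dataset.

```python
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav',
                 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}
df = pd.DataFrame(data)

print(df.shape)                      # (rows, columns)
print(df.dtypes)                     # Marks is 'object': it mixes numbers and strings
print(df.head(3))                    # first three rows
print(df['Gender'].value_counts())   # frequency of each category
print(df.describe())                 # summary statistics for numeric columns only

# Count the marks that were recorded as the text 'NaN'
n_text_marks = (df['Marks'] == 'NaN').sum()
print(n_text_marks)
```

The object dtype of the Marks column is an early hint that the column needs cleaning before any arithmetic is done on it.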
Dealing with missing values in Python
As the previous output shows, the Marks column contains NaN entries, which represent
missing values. Here we handle them by replacing each missing value with the column
mean.

# Convert Marks to numbers; the 'NaN' strings become real NaN values
df['Marks'] = pd.to_numeric(df['Marks'], errors='coerce')

# Compute average (NaN values are skipped by default)
avg = df['Marks'].mean()

# Replace missing values
df['Marks'] = df['Marks'].fillna(avg)

# Display data
df
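
The mean is only one possible replacement. As a hedged sketch, the same marks (written here with real NaN values via NumPy, which is an assumption about how the data is stored) can also be filled with the median or dropped altogether:

```python
import numpy as np
import pandas as pd

marks = pd.Series([90, 76, np.nan, 74, 65, np.nan, 71], name='Marks')

filled_mean = marks.fillna(marks.mean())      # mean of the five known values
filled_median = marks.fillna(marks.median())  # less sensitive to extreme values
dropped = marks.dropna()                      # discard entries with missing marks

print(filled_mean)
print(filled_median)
print(dropped)
```

Dropping rows loses information, while filling keeps every row but adds values that were never observed; the better choice depends on the analysis.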
Data Replacing in Data Wrangling

In the Gender column, we can replace the categorical values with numeric codes.

# Categorize gender
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1}).astype(float)

# Display data
df
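
An explicit dictionary is one way to encode categories. The sketch below compares it with two other pandas options, category codes and indicator columns; it uses a standalone Series with the same values so that it runs on its own.

```python
import pandas as pd

gender = pd.Series(['M', 'F', 'M', 'M', 'M', 'F', 'F'], name='Gender')

# Explicit mapping; labels missing from the dict would become NaN
mapped = gender.map({'M': 0, 'F': 1})

# Category codes follow the sorted order of the labels: F -> 0, M -> 1
codes = gender.astype('category').cat.codes

# One indicator column per label
dummies = pd.get_dummies(gender, prefix='Gender')

print(mapped.tolist())
print(codes.tolist())
print(dummies.columns.tolist())
```

Note that the two numeric encodings assign the codes differently, so the mapping should always be recorded along with the data.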
Filtering data in Data Wrangling

Suppose we need the name, gender, and marks of the top-scoring students. Here we
remove the unwanted data using boolean indexing in pandas: first we keep only the
rows with high marks, then we drop the column we do not need.

# Filter top scoring students
df = df[df['Marks'] >= 75].copy()

# Remove age column from filtered DataFrame
df.drop('Age', axis=1, inplace=True)

# Display data
df

Hence, we have finally obtained an efficient dataset that can be further used for
various purposes.
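
The row filter and the column selection can also be done in one step without modifying the original dataframe. The sketch below assumes the two missing marks have already been replaced by the column mean of 75.2.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
    'Age': [17, 17, 18, 17, 18, 17, 17],
    'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
    'Marks': [90, 76, 75.2, 74, 65, 75.2, 71],  # 75.2 = mean of the known marks
})

# .loc takes a row condition and a list of columns to keep
top = df.loc[df['Marks'] >= 75, ['Name', 'Gender', 'Marks']]

# query() expresses the same row condition as a string
top_q = df.query('Marks >= 75')[['Name', 'Gender', 'Marks']]

print(top)
```

Because nothing is dropped in place, df still contains every row and the Age column afterwards.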
Data Wrangling Using Merge Operation

The merge operation combines two raw datasets into a single dataset in the desired format.

Syntax: pd.merge(data_frame1, data_frame2, on="field")

Here, field is the name of the column that is common to both dataframes.

For example: Suppose a teacher has two sets of data. The first contains the
students' details, and the second contains the pending fees status obtained from the
accounts office. The teacher can use the merge operation to combine the two and give
the data meaning. The merged data is easier to analyze, and the teacher saves the
time and effort of merging it manually.
Creating First Dataframe to Perform Merge Operation using Data Wrangling:

# Import module
import pandas as pd

# Creating DataFrame for student details
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106,
           107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot',
             'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly',
             'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# Printing details
print(details)
Creating Second Dataframe to Perform Merge Operation using Data Wrangling:

# Import module
import pandas as pd

# Creating DataFrame for fees status
fees_status = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105,
           106, 107, 108, 109, 110],
    'PENDING': ['5000', '250', 'NIL',
                '9000', '15000', 'NIL',
                '4500', '1800', '250',
                'NIL']})

# Printing fees_status
print(fees_status)
Data Wrangling Using Merge Operation:

# Import module
import pandas as pd

# Creating DataFrame for student details
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105,
           106, 107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot',
             'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# Creating DataFrame for fees status
fees_status = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105,
           106, 107, 108, 109, 110],
    'PENDING': ['5000', '250', 'NIL',
                '9000', '15000', 'NIL',
                '4500', '1800', '250', 'NIL']})

# Merging the two DataFrames on the common ID column
print(pd.merge(details, fees_status, on='ID'))
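
In the example above every ID appears in both tables. When that is not the case, the how and indicator parameters of pd.merge control which rows survive. The sketch below uses shortened, deliberately mismatched versions of the two tables; the mismatch is an assumption made for illustration.

```python
import pandas as pd

details = pd.DataFrame({'ID': [101, 102, 103],
                        'NAME': ['Jagroop', 'Praveen', 'Harjot']})
fees_status = pd.DataFrame({'ID': [102, 103, 104],
                            'PENDING': ['250', 'NIL', '9000']})

# Default (how='inner'): only IDs present in both tables
inner = pd.merge(details, fees_status, on='ID')

# how='left': every student is kept; missing fee status becomes NaN
left = pd.merge(details, fees_status, on='ID', how='left')

# how='outer' with indicator=True: all IDs, plus a _merge column
# showing which table each row came from
outer = pd.merge(details, fees_status, on='ID', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```

The _merge column is a convenient way to spot records that exist in only one source before deciding how to treat them.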
Data Wrangling Using Grouping Method

The grouping method in data wrangling provides results in terms of groups extracted
from a large dataset.
The pandas groupby() method splits a large dataset into groups based on the values of one or more columns.

Example: A car-selling company deals in brands from various car manufacturers, such as Maruti,
Toyota, Mahindra, and Ford, and has data on how many cars of each brand were sold in different
years. The company wants to extract only the data for cars sold in 2010. For this problem, we use
another data wrangling technique, the pandas groupby() method.
Creating Dataframe to use Grouping methods [car selling dataset]:

# Import module
import pandas as pd

# Creating data
car_selling_data = {
    'Brand': ['Maruti', 'Maruti', 'Maruti', 'Maruti',
              'Hyundai', 'Hyundai', 'Toyota', 'Mahindra', 'Mahindra',
              'Ford', 'Toyota', 'Ford'],
    'Year': [2010, 2011, 2009, 2013, 2010, 2011, 2011, 2010, 2013,
             2010, 2010, 2011],
    'Sold': [6, 7, 9, 8, 3, 5, 2, 8, 7, 2, 4, 2]}

# Creating DataFrame from car_selling_data
df = pd.DataFrame(car_selling_data)

# Printing DataFrame
print(df)
Creating Dataframe to use Grouping methods [data of the year 2010]:

# Import module
import pandas as pd

# Creating data
car_selling_data = {
    'Brand': ['Maruti', 'Maruti', 'Maruti', 'Maruti',
              'Hyundai', 'Hyundai', 'Toyota', 'Mahindra', 'Mahindra',
              'Ford', 'Toyota', 'Ford'],
    'Year': [2010, 2011, 2009, 2013, 2010, 2011, 2011, 2010, 2013,
             2010, 2010, 2011],
    'Sold': [6, 7, 9, 8, 3, 5, 2, 8, 7, 2, 4, 2]}

# Creating DataFrame for the provided data
df = pd.DataFrame(car_selling_data)

# Group the data by year and select the group for 2010
grouped = df.groupby('Year')
print(grouped.get_group(2010))
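
Selecting a single group is only one use of groupby(). More often the groups are aggregated. The sketch below reuses the same car-selling data and totals the Sold column per brand and per year; the choice of sum and mean as aggregates is just an example.

```python
import pandas as pd

df = pd.DataFrame({
    'Brand': ['Maruti', 'Maruti', 'Maruti', 'Maruti',
              'Hyundai', 'Hyundai', 'Toyota', 'Mahindra', 'Mahindra',
              'Ford', 'Toyota', 'Ford'],
    'Year': [2010, 2011, 2009, 2013, 2010, 2011, 2011, 2010, 2013,
             2010, 2010, 2011],
    'Sold': [6, 7, 9, 8, 3, 5, 2, 8, 7, 2, 4, 2]})

# A boolean filter selects the 2010 rows directly
sold_2010 = df[df['Year'] == 2010]

# Total cars sold per brand across all years
total_by_brand = df.groupby('Brand')['Sold'].sum()

# Several aggregates per year at once
by_year = df.groupby('Year')['Sold'].agg(['sum', 'mean'])

print(sold_2010)
print(total_by_brand)
print(by_year)
```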
Data Wrangling by Removing Duplication

The pandas duplicated() method helps us find duplicate values in a large dataset so that they
can be removed. Removing duplicate values is an important part of data wrangling.

Syntax: DataFrame.duplicated(subset=None, keep='first')


Here, subset is the column (or list of columns) in which to look for duplicate values.
The keep parameter has three options:

•If keep='first', the first occurrence is treated as the original, and all later
occurrences are marked as duplicates.

•If keep='last', the last occurrence is treated as the original, and all earlier
occurrences are marked as duplicates.

•If keep=False, all values that occur more than once are marked as
duplicates.
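
The effect of the three keep options can be seen on a small example. The roll numbers below are made up for illustration; duplicated() returns a boolean Series in which True marks a row considered a duplicate.

```python
import pandas as pd

rolls = pd.DataFrame({'Roll_no': [23, 54, 29, 36, 36, 54, 23]})

first = rolls.duplicated(subset='Roll_no', keep='first')
last = rolls.duplicated(subset='Roll_no', keep='last')
every = rolls.duplicated(subset='Roll_no', keep=False)

# Show the three masks side by side
print(pd.DataFrame({'Roll_no': rolls['Roll_no'],
                    'first': first, 'last': last, 'False': every}))
```

Roll number 29 occurs once, so it is never marked; every other value is marked according to the chosen option.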
For example, suppose a university is organizing an event. To participate, students
have to fill in their details in an online form so that the organizers can contact
them. A student might fill out the form multiple times, and multiple entries from a
single student make the organizers' work harder. The data that the organizers
receive can easily be wrangled by removing the duplicate values.
Creating a dataset of students who want to participate in the event:

# Import module
import pandas as pd

# Initializing data
student_data = {
    'Name': ['Amit', 'Praveen', 'Jagroop', 'Rahul',
             'Vishal', 'Suraj', 'Rishab', 'Satyapal', 'Amit', 'Rahul',
             'Praveen', 'Amit'],
    'Roll_no': [23, 54, 29, 36, 59, 38, 12, 45, 34, 36, 54, 23],
    'Email': ['xxxx@gmail.com', 'xxxxxx@gmail.com',
              'xxxxxx@gmail.com', 'xx@gmail.com',
              'xxxx@gmail.com', 'xxxxx@gmail.com',
              'xxxxx@gmail.com', 'xxxxx@gmail.com',
              'xxxxx@gmail.com', 'xxxxxx@gmail.com',
              'xxxxxxxxxx@gmail.com', 'xxxxxxxxxx@gmail.com']}

# Creating DataFrame from the data
df = pd.DataFrame(student_data)

# Printing DataFrame
print(df)
Removing duplicate data from the dataset using data wrangling:

# Import module
import pandas as pd

# Initializing data
student_data = {
    'Name': ['Amit', 'Praveen', 'Jagroop',
             'Rahul', 'Vishal', 'Suraj', 'Rishab', 'Satyapal', 'Amit', 'Rahul',
             'Praveen', 'Amit'],
    'Roll_no': [23, 54, 29, 36, 59, 38, 12, 45, 34, 36, 54, 23],
    'Email': ['xxxx@gmail.com', 'xxxxxx@gmail.com',
              'xxxxxx@gmail.com', 'xx@gmail.com',
              'xxxx@gmail.com', 'xxxxx@gmail.com',
              'xxxxx@gmail.com', 'xxxxx@gmail.com',
              'xxxxx@gmail.com', 'xxxxxx@gmail.com',
              'xxxxxxxxxx@gmail.com', 'xxxxxxxxxx@gmail.com']}

# Creating DataFrame
df = pd.DataFrame(student_data)

# df.duplicated('Roll_no') marks repeated Roll_no entries as True.
# The ~ (NOT) operator inverts the mask to keep the non-duplicate rows.
non_duplicate = df[~df.duplicated('Roll_no')]

# Printing non-duplicate values
print(non_duplicate)
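
pandas also provides drop_duplicates(), which combines the masking and the selection in one call. The sketch below uses a shortened, hypothetical version of the student data to show that the two approaches give the same rows.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Amit', 'Praveen', 'Amit', 'Rahul'],
                   'Roll_no': [23, 54, 23, 36]})

# Boolean mask, as in the example above
via_mask = df[~df.duplicated('Roll_no')]

# Same result in a single call
via_drop = df.drop_duplicates(subset='Roll_no', keep='first')

# The original row labels are kept; reset them if a clean 0..n-1 index is needed
renumbered = via_drop.reset_index(drop=True)

print(via_drop)
print(renumbered)
```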
Creating New Datasets Using the Concatenation of Two Datasets in Data Wrangling

We can join two dataframes in several ways. For our example of concatenating two datasets, we use
the pd.concat() function.
Creating Two Dataframes for Concatenation:

# Importing pandas module
import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age': [27, 24, 22, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad',
                     'Kannuaj'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
         'Mobile No': [97, 91, 58, 76]}

# Define a second dictionary containing employee data
data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
         'Age': [22, 32, 12, 52],
         'Address': ['Allahabad', 'Kannuaj',
                     'Allahabad', 'Kannuaj'],
         'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
         'Salary': [1000, 2000, 3000, 4000]}

# Convert the first dictionary into a DataFrame
df = pd.DataFrame(data1, index=[0, 1, 2, 3])

# Convert the second dictionary into a DataFrame
df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])
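We will join these two dataframes along axis 0.

res = pd.concat([df, df1])

Since data1 has no Salary column, the first four rows of the new dataframe res contain NaN
in the Salary column. Likewise, data2 has no Mobile No column, so the last four rows
contain NaN in the Mobile No column.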
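
Two pd.concat() parameters are often useful in this situation: ignore_index, which replaces the original row labels with a fresh index, and join='inner', which keeps only the columns present in both dataframes. The sketch below uses shortened, hypothetical versions of the two tables.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Mobile No': [97, 91]},
                  index=[0, 1])
df1 = pd.DataFrame({'Name': ['Dhiraj', 'Hitesh'], 'Salary': [3000, 4000]},
                   index=[1, 2])

# Default: union of columns, original index labels kept (label 1 repeats)
res = pd.concat([df, df1])

# Fresh 0..n-1 index instead of the original labels
res_clean = pd.concat([df, df1], ignore_index=True)

# Only the columns that both dataframes share
res_common = pd.concat([df, df1], join='inner')

print(res)
print(res_clean)
print(res_common)
```

Repeated index labels are legal in pandas, but they make label-based lookups ambiguous, so resetting the index is usually the safer choice after concatenation.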
