Lesson 5 Data Wrangling in Data Science.


ZETECH UNIVERSITY

DATA SCIENCE PROGRAMMING WITH PYTHON

Lesson 5: Data Wrangling in data science.

Data wrangling generally refers to transforming raw data into a usable form
for your analyses of interest, including loading, aggregating, merging, grouping,
concatenating and formatting.
Data is not useful until it can be analyzed and presented as insight that drives
better decision making, and it cannot be effectively analyzed until it is well
structured, clean, and converted into a suitable format. Simply put, that is why
good data wrangling is important.

Data wrangling with Python


Python is generally considered to be a data scientist’s best friend. According to a
2019 survey, 87% of data scientists said they regularly used Python, far more
than the next most used languages, SQL (44%) and R (31%).

Data wrangling ensures that data is prepared for automation and further
analysis.

Figure: Survey responses to "What programming language do you use on a regular basis?"

Data scientists spend 75% - 80% of their time wrangling data, which is not a
surprise at all. The goals of data wrangling, and the needs it serves, include:

1. It ensures the quality of the data.
2. It supports timely decision-making and speeds up data insights.
3. Noisy, flawed, and missing data are cleaned.
4. It makes sense of the resulting dataset, because gathering the data acts as a
preparation stage for the data mining process.
5. It helps in making concrete decisions by cleaning and structuring raw data
into the required format.
6. Raw data are pieced together into the required format.
7. To create a transparent and efficient system for data management, the best
solution is to have all data in a centralized location so it can be used in improving
compliance.
8. Wrangling the data helps make decisions promptly and helps the wrangler clean,
enrich, and transform the data into a clear picture.

Data wrangling Key Competencies:

1. Outlier/anomaly detection - Applying outlier detection techniques.
2. Missing values in data - Cleaning data by finding and replacing missing values using
data science libraries.
3. Duplicate values in data - Cleaning data by finding and removing duplicate values
using data science libraries.
4. Categorical data to numeric data - Transforming categorical data to numerical data
using data science libraries.
5. Group data based on values - In a single dataset, grouping data using data science
libraries.
6. Concatenate data along an axis - Concatenating data using Python data science
libraries.
7. Merge multiple sets of data into a single dataset - Joining multiple sets of data using
data science libraries.
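
Three of these competencies (outlier detection, duplicate removal and concatenation) can be sketched briefly with pandas. This is only an illustration: the readings, the ID and CITY columns, and the choice of the 1.5 x IQR rule for outliers are all assumptions, not part of the lesson's own data.

```python
import pandas as pd

# Hypothetical readings; the value 95 is planted as an obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 95])

# IQR rule (one common choice): flag points lying more than
# 1.5 * IQR below the first quartile or above the third quartile.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)

# Duplicate rows: keep the first occurrence of each repeated row.
records = pd.DataFrame({"ID": [1, 2, 2, 3], "CITY": ["A", "B", "B", "C"]})
deduplicated = records.drop_duplicates()

# Concatenating along axis 0 stacks frames that share columns row-wise.
more_records = pd.DataFrame({"ID": [4], "CITY": ["D"]})
combined = pd.concat([deduplicated, more_records], axis=0, ignore_index=True)
print(combined)
```

Other outlier rules (for example z-scores) would work equally well here; the IQR rule is used only because it needs no assumptions about the distribution.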

1. Data exploration - Here we define the data and then display it in a tabular
format.
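
A minimal sketch of this step, assuming pandas and NumPy are installed. The student records below are hypothetical, as is the AGE column; only the NAME, GENDER and MARKS columns are named in this lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical student records; np.nan marks a missing score.
data = {
    "NAME": ["Asha", "Brian", "Carol", "David", "Esther"],
    "AGE": [20, 21, 19, 22, 20],
    "GENDER": ["F", "M", "F", "M", "F"],
    "MARKS": [85, np.nan, 62, 90, np.nan],
}

# Build the table and display it.
df = pd.DataFrame(data)
print(df)

# Count missing values per column to see what needs cleaning.
print(df.isna().sum())
```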

2. Dealing with missing values - As the output of the previous step shows, the
MARKS column contains NaN values, which we handle by replacing them with the
column mean.
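
One way this replacement might look, using hypothetical marks. `Series.mean()` skips NaN values by default, so the mean is taken over the known scores only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two students have no recorded marks.
df = pd.DataFrame({
    "NAME": ["Asha", "Brian", "Carol", "David", "Esther"],
    "MARKS": [85, np.nan, 62, 90, np.nan],
})

# Mean of the known marks (NaN entries are ignored).
mean_marks = df["MARKS"].mean()

# Replace every NaN in MARKS with that mean.
df["MARKS"] = df["MARKS"].fillna(mean_marks)
print(df)
```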

3. Reshaping data - In the GENDER column, we can reshape the data by encoding
each category as a number.
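
A small sketch of this encoding with `Series.map()`. The names and the particular codes (M as 0, F as 1) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data with a categorical GENDER column.
df = pd.DataFrame({
    "NAME": ["Asha", "Brian", "Carol", "David", "Esther"],
    "GENDER": ["F", "M", "F", "M", "F"],
})

# Map each category to a number; the codes are arbitrary.
# Any value missing from the dictionary would become NaN.
df["GENDER"] = df["GENDER"].map({"M": 0, "F": 1})
print(df)
```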

Explanation
The pandas Series.map() function substitutes each value in a Series with
another value.

4. Filtering data - Suppose we need the name, gender and marks of the
top-scoring students. Here we need to remove the unwanted data.
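
A possible sketch of this filter. The cutoff of 75 marks for "top-scoring" and the unwanted AGE column are assumptions; the lesson does not state either.

```python
import pandas as pd

# Hypothetical cleaned data (missing marks already filled in).
df = pd.DataFrame({
    "NAME": ["Asha", "Brian", "Carol", "David", "Esther"],
    "AGE": [20, 21, 19, 22, 20],
    "GENDER": [1, 0, 1, 0, 1],
    "MARKS": [85.0, 79.0, 62.0, 90.0, 79.0],
})

# Keep only rows at or above the assumed cutoff of 75 marks.
top_students = df[df["MARKS"] >= 75]

# Drop the unwanted AGE column; axis=1 tells drop() to look for
# the label among the columns rather than the row index.
top_students = top_students.drop(["AGE"], axis=1)
print(top_students)
```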

Explanation
What does axis 1 in pandas mean?
A DataFrame object has two axes, “axis 0” and “axis 1”:
• “axis 0” represents rows and
• “axis 1” represents columns.
For example, dropping a label with axis 1 removes a column rather than a row.

Hence, we have finally obtained an efficient dataset which can be further used
for various purposes.

Now that we know the basics of data wrangling, we will discuss two operations
we can use to perform it:
a) Merge operation
b) Grouping Method

Wrangling Data Using Merge Operation


The merge operation is used to combine raw data into the desired format.

Syntax:
pd.merge(data_frame1, data_frame2, on="field")
For example: Suppose that a teacher has two types of data. The first consists
of student details, and the second consists of pending fee status taken from the
accounts office. The teacher can use the merge operation to combine the two and
give the data meaning. This makes the data easier to analyze and saves the
teacher the time and effort of merging it manually.

SECOND TYPE OF DATA

WRANGLING DATA USING MERGE OPERATION:
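
The teacher's scenario might be sketched as follows. Both tables, the shared ID key and the PENDING column are hypothetical stand-ins for the student details and fee status data.

```python
import pandas as pd

# First type of data: hypothetical student details.
details = pd.DataFrame({
    "ID": [101, 102, 103, 104],
    "NAME": ["Asha", "Brian", "Carol", "David"],
})

# Second type of data: hypothetical pending fees from the accounts office.
fees = pd.DataFrame({
    "ID": [101, 102, 103, 104],
    "PENDING": [5000, 0, 2500, 0],
})

# Join the two tables on the column they share.
merged = pd.merge(details, fees, on="ID")
print(merged)
```

By default `pd.merge` performs an inner join, so only IDs present in both tables appear in the result.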

Wrangling Data using Grouping Method


The grouping method in data analysis is used to produce results for the
various groups that make up a large dataset. In pandas, the groupby() method
splits a large dataset into groups of rows that share a value.

Example: A car-selling company stocks brands from various manufacturers such
as Maruti, Toyota, Mahindra and Ford, and has data on which cars were sold in
which years. The company wants to wrangle only the data for cars sold during the
year 2010. For this problem, we use another wrangling technique, the groupby()
method.

DATA OF THE YEAR 2010:
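
A sketch of how the 2010 records could be pulled out with `groupby()`. The sales figures and the BRAND, YEAR and SOLD column names are hypothetical.

```python
import pandas as pd

# Hypothetical sales records for several brands and years.
car_sales = pd.DataFrame({
    "BRAND": ["Maruti", "Toyota", "Mahindra", "Ford", "Maruti", "Toyota"],
    "YEAR": [2010, 2011, 2010, 2012, 2011, 2010],
    "SOLD": [120, 90, 75, 60, 130, 95],
})

# Split the rows into one group per year.
by_year = car_sales.groupby("YEAR")

# Retrieve only the rows belonging to the year 2010.
sales_2010 = by_year.get_group(2010)
print(sales_2010)

# The same grouping also gives per-year totals.
total_per_year = by_year["SOLD"].sum()
print(total_per_year)
```

If only one year is ever needed, a boolean filter such as `car_sales[car_sales["YEAR"] == 2010]` gives the same rows; `groupby()` pays off when every group is processed.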

