python 2.1.3 (2)

The document provides an overview of various data manipulation techniques in Pandas, including merging and joining datasets, aggregation and grouping, creating pivot tables, vectorized string operations, and working with time series data. It also introduces high-performance functions like eval() and query() for efficient data evaluation and filtering. Examples are included to illustrate the usage of each technique.

Uploaded by

hritikp266

3.

Combining Datasets: Merge and Join, Aggregation and Grouping, Pivot Tables, Vectorized String
Operations, Working with Time Series. High-Performance Pandas: eval() and query()

1. Combining Datasets: Merge and Join

Pandas provides several ways to combine datasets. The two most common are merge(), which
matches rows on one or more key columns (or on the index), and join(), which matches rows
on the index by default.

Merge:

The merge() function is used to combine two DataFrames based on common columns or
indices. It is similar to SQL joins (e.g., inner, left, right, and outer joins).

Example:

import pandas as pd

# Creating two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'Age': [25, 30, 35]
})

# Merging the DataFrames on 'ID'
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)

Output:

   ID   Name  Age
0   1  Alice   25
1   2    Bob   30

Explanation:

• The merge() function combines df1 and df2 on the common column ID.
• The how='inner' argument specifies an inner join, meaning only the rows whose ID
  appears in both DataFrames are kept.
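
The other join types follow the same pattern. As a minimal sketch, the code below reuses the frames from the example above and adds the indicator=True option of merge(), which labels where each row came from. The names left_df and outer_df are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 35]})

# Left join: keep every row of df1; Age is NaN where df2 has no matching ID
left_df = pd.merge(df1, df2, on='ID', how='left')
print(left_df)

# Outer join: keep rows from both sides; the _merge column records the source
outer_df = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(outer_df)
```

ID 3 exists only in df1 and ID 4 only in df2, so the outer join has four rows, two of which contain a missing value.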

Join:

The join() method combines DataFrames on their index by default; its on parameter lets the
calling DataFrame match on one of its own columns instead. It is most convenient when the
two DataFrames share an index.

Example:

df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}, index=[1, 2, 3])

df2 = pd.DataFrame({
    'Country': ['USA', 'Canada', 'UK']
}, index=[1, 2, 3])

# Using join to combine the DataFrames based on index
joined_df = df1.join(df2)
print(joined_df)

Output:

      Name  Age  Country
1    Alice   25      USA
2      Bob   30   Canada
3  Charlie   35       UK

Explanation:

• The join() function combines df1 and df2 using their index.
• The resulting DataFrame contains the columns from both df1 and df2.
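
When the indices only partly overlap, the how argument of join() decides which rows survive. The sketch below uses two hypothetical frames, people and countries, that are not part of the original example.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
countries = pd.DataFrame({'Country': ['USA', 'UK']}, index=[1, 3])

# Default how='left': every row of `people` is kept, missing matches become NaN
left_joined = people.join(countries)
print(left_joined)

# how='inner': only index labels present in both DataFrames are kept
inner_joined = people.join(countries, how='inner')
print(inner_joined)
```

Index label 2 has no match in countries, so the left join leaves its Country missing while the inner join drops the row.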

2. Aggregation and Grouping

Aggregation and grouping allow you to perform calculations (such as sum, mean, or count)
on subsets of your data.

GroupBy:

The groupby() function splits the data into groups based on some criteria, applies a function
to each group, and then combines the results.

Example:

df = pd.DataFrame({
    'Team': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Points': [10, 20, 15, 25, 30, 35]
})

# Grouping by 'Team' and calculating the sum of 'Points'
grouped = df.groupby('Team').sum()
print(grouped)

Output:

      Points
Team
A         55
B         80

Explanation:

• The groupby() function groups the data by the column Team.
• The sum() function is applied to each group to calculate the total Points for each
  team.
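
Several aggregations can be computed in one pass with agg(). The following is a minimal sketch on the same data; the output column names total, average, and games are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Points': [10, 20, 15, 25, 30, 35]
})

# Named aggregation: each keyword becomes an output column,
# given as (source column, aggregation function)
summary = df.groupby('Team').agg(
    total=('Points', 'sum'),
    average=('Points', 'mean'),
    games=('Points', 'count'),
)
print(summary)
```

The total column matches the sum() result above (55 for team A and 80 for team B).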

3. Pivot Tables

Pivot tables allow you to reshape data and perform aggregations. They are similar to Excel
pivot tables.

Example:

df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
    'Temperature': [32, 75, 30, 77]
})

# Creating a pivot table to find the average temperature by date and city
pivot_table = df.pivot_table(values='Temperature', index='Date',
                             columns='City', aggfunc='mean')
print(pivot_table)

Output:

City        Los Angeles  New York
Date
2021-01-01         75.0      32.0
2021-01-02         77.0      30.0

Explanation:

• The pivot_table() function reshapes the data.
• It aggregates Temperature by Date and City and calculates the mean temperature for
  each group.
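
One way to see what pivot_table() does is to rebuild the same table by hand with groupby(). This is only a sketch of the equivalence for this small dataset, not a description of how Pandas implements pivot tables internally; the names pivoted and by_hand are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
    'Temperature': [32, 75, 30, 77]
})

pivoted = df.pivot_table(values='Temperature', index='Date',
                         columns='City', aggfunc='mean')

# Built by hand: group on both keys, aggregate, then move City into the columns
by_hand = df.groupby(['Date', 'City'])['Temperature'].mean().unstack('City')

print(pivoted)
print(by_hand)
```

The aggfunc argument plays the role of the mean() call in the hand-built version: it decides how values that share the same Date and City are combined.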

4. Vectorized String Operations

Pandas provides vectorized operations for string manipulation through the .str accessor.
These operations are applied to an entire column or Series at once, so no explicit loop is
needed, and missing values are skipped rather than raising an error.

Example:

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Convert all names to uppercase using vectorized string operations
df['Name'] = df['Name'].str.upper()
print(df)

Output:

      Name         City
0    ALICE     New York
1      BOB  Los Angeles
2  CHARLIE      Chicago

Explanation:

• The str.upper() method is applied to the entire Name column, converting all the
  names to uppercase.
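
A practical difference from Python's regular string methods shows up when a value is missing. The sketch below uses a hypothetical Series named names with one missing entry; calling upper() directly on that entry would fail, while the .str accessor passes it through.

```python
import pandas as pd

names = pd.Series(['Alice', None, 'Charlie'])

# The .str accessor leaves the missing value missing instead of raising an error
upper = names.str.upper()
print(upper)

# Other .str methods follow the same element-wise pattern
starts_with_c = names.str.startswith('C')
lengths = names.str.len()
print(starts_with_c)
print(lengths)
```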

5. Working with Time Series

Pandas provides extensive functionality for working with time series data, including
generating ranges of dates, resampling data, and performing date/time operations.

Example:

# Creating six consecutive daily dates
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame({
    'Date': dates,
    'Temperature': [32, 35, 31, 30, 29, 28]
})

# Setting 'Date' as the index
df.set_index('Date', inplace=True)

# Resampling the data to get the average temperature per month
# ('M' means month end; pandas 2.2 and later deprecate it in favor of 'ME')
monthly_avg = df.resample('M').mean()
print(monthly_avg)

Output:

            Temperature
Date
2021-01-31    30.833333

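Explanation:

• The date_range() function generates six consecutive daily dates starting from
  '2021-01-01'.
• We resample the data by month using .resample('M') and calculate the mean of the
  Temperature column. All six dates fall in January, so the result is a single row
  labeled with the month end, 2021-01-31.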
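
Resampling works with other frequencies as well. As a small sketch under the same data, the code below averages over 2-day bins and selects a date range by label; the names temps, two_day_avg, and first_three are illustrative.

```python
import pandas as pd

dates = pd.date_range('2021-01-01', periods=6)  # daily frequency by default
temps = pd.Series([32, 35, 31, 30, 29, 28], index=dates, name='Temperature')

# Downsample to 2-day bins and average each bin
two_day_avg = temps.resample('2D').mean()
print(two_day_avg)

# Select a date range by label; both endpoints are included
first_three = temps.loc['2021-01-01':'2021-01-03']
print(first_three)
```

The three bins average to 33.5, 30.5, and 28.5.
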
6. High-Performance Pandas: eval() and query()

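Pandas provides two functions, eval() and query(), that evaluate expressions written as
strings. On large DataFrames they can be faster and use less memory than the equivalent
standard Pandas operations.

eval():

The eval() function evaluates an expression given as a string. The performance benefit
comes from avoiding intermediate arrays and depends on the optional numexpr library; for
small DataFrames such as the one below, plain arithmetic is usually faster.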

Example:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Using eval() to perform arithmetic operations on columns
df['C'] = pd.eval('df.A + df.B')
print(df)

Output:

   A   B   C
0  1  10  11
1  2  20  22
2  3  30  33
3  4  40  44

Explanation:

• The eval() function evaluates the expression 'df.A + df.B', and the result is
  assigned to a new column C.
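
DataFrames also have their own eval() method, which lets the expression refer to columns by bare name. The sketch below assumes the same A and B columns; scale is a hypothetical local variable, referenced inside the expression with the @ prefix.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
scale = 2  # hypothetical local variable, not part of the original example

# DataFrame.eval resolves bare names as columns and @names as local variables
df['C'] = df.eval('A + B')
df['D'] = df.eval('(A + B) * @scale')
print(df)
```

Column C matches the pd.eval() result above, and D is twice C.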

query():

The query() function allows you to filter rows of a DataFrame based on a condition
expressed as a string.

Example:

# Using query() to filter rows where A is greater than 2
filtered_df = df.query('A > 2')
print(filtered_df)

Output:

   A   B   C
2  3  30  33
3  4  40  44

Explanation:

• The query() function filters the DataFrame based on a condition. Here, we selected
  rows where the value in column A is greater than 2.
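
Conditions can be combined, and Python variables can be referenced with the @ prefix. The following sketch compares query() with the equivalent boolean mask; threshold is a hypothetical variable introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
threshold = 2  # hypothetical local variable, not part of the original example

# Combine conditions with and/or; @threshold refers to the Python variable
by_query = df.query('A > @threshold and B < 40')

# Equivalent boolean-mask form
by_mask = df[(df['A'] > threshold) & (df['B'] < 40)]

print(by_query)
print(by_query.equals(by_mask))
```

Both forms select the same rows; query() mainly saves repeating the DataFrame name in longer conditions.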

Questions:

1. What is the difference between the merge() and join() functions in Pandas? Provide an
example of when to use each of them.
2. What is a pivot table in Pandas? Explain how to create a pivot table and describe the role
of the aggfunc parameter.
3. Define vectorized string operations in Pandas. How are they different from using
Python’s regular string methods? Provide an example of using a vectorized string operation
on a column in a DataFrame.
4. Explain how to handle missing data in Pandas. What are the common techniques for
dealing with NaN values in a DataFrame?
5. What is the purpose of the eval() function in Pandas? How does it improve performance
compared to traditional methods for column operations?
6. Explain the query() function in Pandas. How is it used to filter data based on a specific
condition or expression?
