python 2.1.3 (2)
Combining Datasets: Merge and Join, Aggregation and Grouping, Pivot Tables, Vectorized String
Operations, Working with Time Series. High-Performance Pandas: eval() and query()
1. Combining Datasets: Merge and Join
Pandas provides several ways to combine multiple datasets. The two most common are
merge(), which matches rows on common columns or indices, and join(), which matches
rows on the index by default.
Merge:
The merge() function is used to combine two DataFrames based on common columns or
indices. It is similar to SQL joins (e.g., inner, left, right, and outer joins).
Example:
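import pandas as pd
# df1 is assumed here: the output only implies IDs 1 and 2 (Alice, Bob)
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'Age': [25, 30, 35]
})
# Inner join on the shared ID column
merged = pd.merge(df1, df2, on='ID', how='inner')
print(merged)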
Output:
   ID   Name  Age
0   1  Alice   25
1   2    Bob   30
Explanation:
The merge() function combines df1 and df2 on the common column ID.
The how='inner' argument specifies that we want an inner join, meaning only the
rows with common IDs will be kept.
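The other join types mentioned above differ only in the how argument. The following is a minimal sketch, assuming two small hypothetical DataFrames named left and right (the third row of left is invented for illustration); it compares inner, left, and outer joins on the same data.

```python
import pandas as pd

# Hypothetical data: the third row of `left` is an assumption for illustration.
left = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"ID": [1, 2, 4], "Age": [25, 30, 35]})

# Inner join: only IDs present in both frames (1 and 2).
inner = pd.merge(left, right, on="ID", how="inner")

# Left join: every row of `left`; Age is NaN where `right` has no match (ID 3).
left_join = pd.merge(left, right, on="ID", how="left")

# Outer join: the union of IDs from both frames (1, 2, 3, 4).
outer = pd.merge(left, right, on="ID", how="outer")

print(inner)
print(left_join)
print(outer)
```

A right join (how='right') mirrors the left join: it keeps every row of the second DataFrame instead of the first.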
Join:
The join() method combines DataFrames on their index by default; its on argument lets it
match a column of the calling DataFrame against the index of the other one. It is most
convenient when the two DataFrames share an index.
Example:
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}, index=[1, 2, 3])
df2 = pd.DataFrame({
    'Country': ['USA', 'Canada', 'UK']
}, index=[1, 2, 3])
# Combine the two DataFrames on their shared index
result = df1.join(df2)
print(result)
Output:
      Name  Age Country
1    Alice   25     USA
2      Bob   30  Canada
3  Charlie   35      UK
Explanation:
The join() function combines df1 and df2 using their index.
The resulting DataFrame contains the columns from both df1 and df2.
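When the indexes only partly overlap, join() keeps every row of the calling DataFrame, because it performs a left join by default. The sketch below assumes two hypothetical DataFrames, people and countries, and also writes the same operation as an index-on-index merge() for comparison.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap (labels 1 and 2).
people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"]}, index=[1, 2, 3])
countries = pd.DataFrame({"Country": ["USA", "Canada"]}, index=[1, 2])

# join() is a left join on the index by default, so label 3 is kept
# and its Country is missing.
joined = people.join(countries)

# The same operation written as an index-on-index merge.
merged = pd.merge(people, countries, left_index=True, right_index=True, how="left")

print(joined)
print(merged)
```

As a rule of thumb, join() is the shorter spelling when the keys are already in the index, while merge() is more flexible when the keys are ordinary columns.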
2. Aggregation and Grouping
Aggregation and grouping allow you to perform calculations (such as sum, mean, or count)
on subsets of your data.
GroupBy:
The groupby() function splits the data into groups based on some criteria, applies a function
to each group, and then combines the results.
Example:
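df = pd.DataFrame({
    'Team': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Points': [10, 20, 15, 25, 30, 35]
})
# Group the rows by team and sum the points within each group
result = df.groupby('Team').sum()
print(result)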
Output:
      Points
Team
A         55
B         80
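Explanation:
The rows are split into one group per Team, and the Points in each group are summed:
10 + 15 + 30 = 55 for team A and 20 + 25 + 35 = 80 for team B.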
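Several aggregations can be applied in one pass. The following sketch reuses the Team/Points data from the example and passes a list of function names to agg(); the variable name summary is chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["A", "B", "A", "B", "A", "B"],
    "Points": [10, 20, 15, 25, 30, 35],
})

# Split by Team, apply several aggregations to Points, combine into one table.
summary = df.groupby("Team")["Points"].agg(["sum", "mean", "count"])
print(summary)
```

Each requested function becomes one column of the result, with one row per group.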
3. Pivot Tables
Pivot tables allow you to reshape data and perform aggregations. They are similar to Excel
pivot tables.
Example:
df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
    'Temperature': [32, 75, 30, 77]
})
# Creating a pivot table to find the average temperature by date and city
pivot_table = df.pivot_table(values='Temperature', index='Date',
                             columns='City', aggfunc='mean')
print(pivot_table)
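Output:
City        Los Angeles  New York
Date
2021-01-01         75.0      32.0
2021-01-02         77.0      30.0
Explanation:
Each distinct Date becomes a row and each distinct City becomes a column. The aggfunc='mean'
argument averages the Temperature values that fall into the same cell; here every
Date and City pair occurs only once, so each cell holds that single reading as a float.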
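A pivot table can be read as a groupby followed by a reshape. The sketch below reuses the temperature data from the example and shows the equivalent groupby() plus unstack() form, along with a second pivot that swaps aggfunc to 'max'; the names via_groupby and pivot_max are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "City": ["New York", "Los Angeles", "New York", "Los Angeles"],
    "Temperature": [32, 75, 30, 77],
})

pivot = df.pivot_table(values="Temperature", index="Date",
                       columns="City", aggfunc="mean")

# The same reshaping expressed as group, aggregate, then move City to columns.
via_groupby = df.groupby(["Date", "City"])["Temperature"].mean().unstack("City")

# aggfunc decides how repeated (Date, City) pairs are combined;
# swapping in "max" changes the statistic, not the layout.
pivot_max = df.pivot_table(values="Temperature", index="Date",
                           columns="City", aggfunc="max")

print(pivot)
print(via_groupby)
print(pivot_max)
```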
4. Vectorized String Operations
Pandas provides vectorized string operations through the .str accessor. Each operation is
applied to an entire column or Series in a single call, which keeps the code concise and
handles missing values automatically.
Example:
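df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', 'Los Angeles', 'Chicago']
})
# Convert every name to uppercase in one vectorized call
df['Name'] = df['Name'].str.upper()
print(df)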
Output:
      Name         City
0    ALICE     New York
1      BOB  Los Angeles
2  CHARLIE      Chicago
Explanation:
The str.upper() method is applied to the entire Name column, converting all the
names to uppercase.
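Other .str methods follow the same pattern. The sketch below reuses the Name/City data and adds two illustrative columns (NameLength and TwoWordCity are hypothetical names), then contrasts the vectorized call with a plain Python loop and with a Series that contains a missing value.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Length of each name, computed for the whole column at once.
df["NameLength"] = df["Name"].str.len()

# Boolean column: which city names contain a space?
df["TwoWordCity"] = df["City"].str.contains(" ", regex=False)

# The element-by-element version using Python's regular string method.
upper_loop = [name.upper() for name in df["Name"]]

# A missing value passes through .str methods as missing instead of
# raising an error, which a plain None.upper() call would do.
with_missing = pd.Series(["alice", None])
upper_missing = with_missing.str.upper()

print(df)
print(upper_loop)
print(upper_missing)
```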
5. Working with Time Series
Pandas provides extensive functionality for working with time series data, including
generating ranges of dates, resampling data, and performing date/time operations.
Example:
Output:
            Temperature
Date
2021-01-31    31.666667
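Explanation:
The output has a single row labelled 2021-01-31, the month-end date for January, so the
daily readings were resampled to monthly frequency and averaged, giving a mean
temperature of about 31.67.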
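A minimal sketch of monthly resampling that is consistent with the output above. The three daily readings (30, 32, 33) are hypothetical values chosen so that their mean is 95 / 3; the original data may have been different. An offset object is passed to resample() rather than a string alias, since the alias for month-end frequency is not the same in every pandas version.

```python
import pandas as pd

# Hypothetical daily readings; these three values are an assumption.
dates = pd.date_range(start="2021-01-01", periods=3, freq="D")
df = pd.DataFrame({"Date": dates, "Temperature": [30, 32, 33]}).set_index("Date")

# Resample to month-end frequency and average the readings of each month.
monthly = df.resample(pd.offsets.MonthEnd()).mean()
print(monthly)
```

pd.date_range() generates the evenly spaced dates, set_index() makes them the DatetimeIndex that resample() requires, and mean() aggregates each monthly bin.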
6. High-Performance Pandas: eval() and query()
Pandas provides two high-performance functions, eval() and query(), for efficiently
evaluating expressions and filtering data.
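eval():
The pd.eval() function evaluates an expression given as a string. When the optional
numexpr library is installed, pandas uses it to evaluate the whole expression without
building large intermediate arrays, which helps mainly on large datasets.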
Example:
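import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})
# Evaluate the string expression and store the result in a new column C
df['C'] = pd.eval('df.A + df.B')
print(df)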
Output:
   A   B   C
0  1  10  11
1  2  20  22
2  3  30  33
3  4  40  44
Explanation:
The pd.eval() function evaluates the expression 'df.A + df.B', and the result is
assigned to a new column C.
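DataFrame objects also have their own eval() method, which resolves bare column names. The sketch below assumes the same A/B data and uses an assignment inside the expression; with_c and expected are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

# DataFrame.eval resolves bare column names, so the df. prefix is not needed.
# With an assignment in the expression it returns a new DataFrame that
# includes C and leaves df itself unchanged.
with_c = df.eval("C = A + B")

# Conventional column arithmetic gives the same values.
expected = df["A"] + df["B"]

print(with_c)
```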
query():
The query() function allows you to filter rows of a DataFrame based on a condition
expressed as a string.
Example:
# Keep only the rows where column A is greater than 2
result = df.query('A > 2')
print(result)
Output:
   A   B   C
2  3  30  33
3  4  40  44
Explanation:
The query() function allows you to filter the DataFrame based on a condition. Here,
we selected rows where the value in column A is greater than 2.
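A query() expression selects the same rows as the corresponding boolean mask. The sketch below rebuilds the A/B/C data, compares the two forms, and references a local Python variable with the @ prefix; the variable name threshold is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
df["C"] = df["A"] + df["B"]

# Filter with a string expression.
by_query = df.query("A > 2")

# The equivalent boolean-mask form.
by_mask = df[df["A"] > 2]

# Local Python variables can be referenced with the @ prefix.
threshold = 2  # hypothetical name used only for this illustration
by_variable = df.query("A > @threshold and B < 40")

print(by_query)
print(by_mask)
print(by_variable)
```

The string form tends to read more clearly as conditions grow, since column names do not need to be repeated as df['...'].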
Questions:
1. What is the difference between the merge() and join() functions in Pandas? Provide an
example of when to use each of them.
2. What is a pivot table in Pandas? Explain how to create a pivot table and describe the role
of the aggfunc parameter.
3. Define vectorized string operations in Pandas. How are they different from using
Python’s regular string methods? Provide an example of using a vectorized string operation
on a column in a DataFrame.
4. Explain how to handle missing data in Pandas. What are the common techniques for
dealing with NaN values in a DataFrame?
5. What is the purpose of the eval() function in Pandas? How does it improve performance
compared to traditional methods for column operations?
6. Explain the query() function in Pandas. How is it used to filter data based on a specific
condition or expression?