Basic Data Processing with Pandas
Pandas is a Python library used for working with data sets.
It has functions for analysing, cleaning, exploring, and manipulating data.
Why Use Pandas?
Pandas allows us to analyse large data sets and draw conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
Pandas Series
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
class pandas.Series(data=None, index=None, dtype=None, name=None,
copy=None, fastpath=_NoDefault.no_default): One-dimensional ndarray
with axis labels (including time series).
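The signature above also accepts dtype and name; a minimal sketch of both (the name 'scores' is illustrative):
>>> import pandas as pd
>>> pd.Series([1, 2, 3], dtype='float64', name='scores')
0    1.0
1    2.0
2    3.0
Name: scores, dtype: float64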
Constructing Series from a dictionary
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d)
>>> ser
a 1
b 2
c 3
dtype: int64
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['x', 'y', 'z'])
>>> ser
x   NaN
y   NaN
z   NaN
dtype: float64
Note that the dictionary keys ('a', 'b', 'c') do not match the requested index labels ('x', 'y', 'z'), so every value is NaN.
Constructing Series from a list
>>> d = [10, 20, 30]
>>> ser = pd.Series(data=d)
>>> ser
0    10
1    20
2    30
dtype: int64
Constructing Series from a scalar
>>> ser = pd.Series(50, index=['a', 'b', 'c'])
>>> ser
a    50
b    50
c    50
dtype: int64
Indexing and Selecting Data
Positional Indexing: Accessing data by position using integer-based indexing (like lists).
Label-based Indexing: Accessing data by index labels.
Boolean Indexing: Selecting data based on a condition
s = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
# Positional indexing
print(s.iloc[0]) # Output: 4
# Label-based indexing
print(s['b']) # Output: 7
# Boolean indexing
print(s[s > 0]) # Output: a 4
# b 7
# d 3
Vectorised Operations
Pandas Series supports vectorised operations, which means operations are applied to each
element of the Series without the need for an explicit loop.
s = pd.Series([1, 2, 3, 4])
# Arithmetic operations
print(s + 5) # Output: Series with each element incremented by 5
# Element-wise operations
print(s * 2) # Output: Series with each element multiplied by 2
Alignment of Data
When performing operations between two Series, Pandas automatically aligns the data based
on the index. If an index is missing, the result will have NaN for that index.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
# Adding Series
result = s1 + s2
print(result)
a NaN
b 6.0
c 8.0
d NaN
dtype: float64
Handling Missing Data
Pandas Series has built-in methods to handle missing data (NaN values).
.isnull(): Returns a Boolean Series indicating if each value is NaN.
.notnull(): Returns the inverse of .isnull().
.fillna(): Fills missing data with a specified value.
.dropna(): Removes missing data from the Series.
s = pd.Series([1, None, 3, None, 5])
# Detect missing values
print(s.isnull()) # Output: Boolean Series
0    False
1     True
2    False
3     True
4    False
dtype: bool
print(s.notnull()) # Output: Boolean Series
0     True
1    False
2     True
3    False
4     True
dtype: bool
# Fill missing values
s_filled = s.fillna(0)
print(s_filled) # Output: Series with `None` replaced by 0
0    1.0
1    0.0
2    3.0
3    0.0
4    5.0
dtype: float64
# Drop missing values
s_dropped = s.dropna()
print(s_dropped) # Output: Series with missing values removed
0    1.0
2    3.0
4    5.0
dtype: float64
Series Methods
Series objects come with a variety of methods for data manipulation and analysis:
.sum(): Returns the sum of all elements.
.mean(): Returns the mean of the elements.
.sort_values(): Sorts the Series by its values.
.rank(): Ranks the values in the Series.
.apply(): Applies a function to each element.
s = pd.Series([5, 3, 8, 2])
print(s.sum()) # Output: 18
print(s.mean()) # Output: 4.5
print(s.sort_values()) # Output: Sorted Series
3    2
1    3
0    5
2    8
dtype: int64
print(s.rank()) # Output: Series with ranks
0    3.0
1    2.0
2    4.0
3    1.0
dtype: float64
print(s.apply(lambda x: x**2)) # Output: Series with squared values
0    25
1     9
2    64
3     4
dtype: int64
Series and Time Series Data:
Pandas Series is well-suited for handling time series data. You can use date ranges as the
index to create a time series.
pandas.date_range(start=None, end=None, periods=None, freq=None, ...)
pd.date_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
pd.date_range(start='1/1/2018', periods=8)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
pd.date_range(end='1/1/2018', periods=8)
DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
'2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
dtype='datetime64[ns]', freq='D')
Specify start, end, and periods; the frequency is generated automatically
(linearly spaced).
pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
'2018-04-27 00:00:00'],
dtype='datetime64[ns]', freq=None)
dates = pd.date_range('20230101', periods=6)
s = pd.Series([1, 3, 5, 7, 9, 11], index=dates)
print(s)
2023-01-01 1
2023-01-02 3
2023-01-03 5
2023-01-04 7
2023-01-05 9
2023-01-06 11
Freq: D, dtype: int64
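With a DatetimeIndex, label-based slicing also accepts date strings (so-called partial string indexing); a short sketch reusing the Series above:
# Select a sub-period by date labels (both endpoints included)
print(s['2023-01-02':'2023-01-04'])
2023-01-02    3
2023-01-03    5
2023-01-04    7
Freq: D, dtype: int64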
Querying a Series
Querying a Series in Pandas involves selecting and filtering data based on various conditions.
Accessing Elements by Index: You can access elements in a Series using the index, either
by position (integer-based) or by label.
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# Access the first element by position
# (plain s[0] for positional access is deprecated in recent pandas; use .iloc)
print(s.iloc[0]) # Output: 10
# Access the third element by position
print(s.iloc[2]) # Output: 30
# Access element with index label 'b'
print(s['b']) # Output: 20
# Access multiple elements by label
print(s[['b', 'd']]) # Output: Series with elements 20 and 40
b 20
d 40
Slicing: You can slice a Series to get a subset of the data. Slicing can be done by position or
by label.
# Slice the first three elements
print(s[:3]) # Output: a 10
b 20
c 30
# Slice using labels
print(s['a':'c']) # Output: a 10
b 20
c 30
Boolean Indexing: Boolean indexing allows you to filter a Series based on a condition.
# Query elements greater than 20
print(s[s > 20]) # Output: c 30
d 40
You can combine multiple conditions using logical operators like & (and), | (or), and ~ (not).
# Query elements greater than 10 and less than 40
print(s[(s > 10) & (s < 40)]) # Output: b 20
c 30
Using isin() Method: The isin() method is used to filter data based on whether the elements
are in a list of values.
# Query elements that are either 10 or 30
print(s[s.isin([10, 30])]) # Output: a 10
c 30
Querying with .loc and .iloc: .loc[] is used for label-based indexing,
while .iloc[] is used for position-based indexing.
# Using .loc for label-based indexing
print(s.loc['b':'d']) # Output: b 20
c 30
d 40
# Using .iloc for position-based indexing
print(s.iloc[1:3]) # Output: b 20
c 30
Querying with Conditional Functions: You can also use functions like .where() and
.query() for conditional selection.
.where(): Returns elements that satisfy a condition, otherwise returns NaN.
# Keep elements greater than 20, replace others with NaN
print(s.where(s > 20)) # Output: a NaN
b NaN
c 30.0
d 40.0
.query(): Though primarily used for DataFrames, it can be used with Series in certain contexts
when converting to DataFrame temporarily.
df = s.to_frame(name='value')
result = df.query('value > 20')
print(result)
   value
c     30
d     40
Handling Missing Data: While querying, you might encounter missing data (NaN).
.isnull() and .notnull(): Check for NaN values.
s = pd.Series([10, None, 30, None, 50])
# Query non-null values
print(s[s.notnull()]) # Output: 0 10.0
2 30.0
4 50.0
.dropna(): Removes NaN values from the Series.
# Drop missing data
print(s.dropna()) # Output: 0 10.0
2 30.0
4 50.0
Pandas DataFrames
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table
with rows and columns.
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None,
copy=None): Two-dimensional, size-mutable, potentially heterogeneous tabular data.
We can create a Pandas DataFrame in the following ways:
Using Python Dictionary
Using Python List
From a File
Creating an Empty DataFrame
import pandas as pd
# create a dictionary
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
# create a dataframe from the dictionary
df = pd.DataFrame(data)
print(df)
    Name  Age      City
0   John   25  New York
1  Alice   30    London
2    Bob   35     Paris
# create a two-dimensional list
data = [['John', 25, 'New York'],
['Alice', 30, 'London'],
['Bob', 35, 'Paris']]
# create a DataFrame from the list
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)
    Name  Age      City
0   John   25  New York
1  Alice   30    London
2    Bob   35     Paris
# load data from a CSV file
df = pd.read_csv('data.csv')
print(df)
# create an empty DataFrame
df = pd.DataFrame()
print(df)
Empty DataFrame
Columns: []
Index: []
Basic DataFrame Operations
Viewing Data:
DataFrame.head(n=5): Return the first n rows.
For negative values of n, this function returns all rows except the last |n| rows
If n is larger than the number of rows, this function returns all rows.
df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
... 'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra
>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
>>> df.head(3)
animal
0 alligator
1 bee
2 falcon
>>> df.head(-3)
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
DataFrame.tail(n=5): Return the last n rows.
For negative values of n, this function returns all rows except the first |n| rows.
If n is larger than the number of rows, this function returns all rows.
>>> df.tail()
animal
4 monkey
5 parrot
6 shark
7 whale
8 zebra
>>> df.tail(3)
animal
6 shark
7 whale
8 zebra
>>> df.tail(-3)
animal
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra
DataFrame.info(): Print a concise summary of a DataFrame. This method prints
information about a DataFrame including the index dtype and columns, non-null values and
memory usage.
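The summary below assumes a small three-column sample frame along these lines (the values are illustrative; the reported memory usage depends on the contents):
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'int_col': [1, 2, 3, 4, 5],
...     'text_col': ['alpha', 'beta', 'gamma', 'delta', 'epsilon'],
...     'float_col': [0.0, 0.25, 0.5, 0.75, 1.0]
... })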
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 int_col 5 non-null int64
1 text_col 5 non-null object
2 float_col 5 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes
Accessing Columns and Rows
Single column: df['column_name'] or df.column_name
Multiple columns: df[['column1', 'column2']]
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification']])
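A single column, by contrast, comes back as a Series; a quick sketch with the same df:
# select one column as a Series
print(df['Name'])
0       Jai
1    Princi
2    Gaurav
3      Anuj
Name: Name, dtype: object
# attribute access is equivalent when the column name is a valid identifier
print(df.Name)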
Column addition: In order to add a column to a Pandas DataFrame, we can declare a new
list and assign it as a column of the existing DataFrame.
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Height': [5.1, 6.2, 5.1, 5.2],
'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# Declare a list that is to be converted into a column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
# Using 'Address' as the column name
# and equating it to the list
df['Address'] = address
# Observe the result
print(df)
Column Deletion: In order to delete a column from a Pandas DataFrame, we can use the
drop() method.
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
>>> df.drop(columns=['B', 'C'])
A D
0 0 3
1 4 7
2 8 11
Drop a row by index
>>> df.drop([0, 1])
A B C D
2 8 9 10 11
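The same column drop can be written with the axis parameter instead of the columns keyword; a brief equivalent sketch:
>>> df.drop(['B', 'C'], axis=1)  # same as df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11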
Selecting Rows
By label: df.loc[label]: Label-based data selector. The end index is included during
slicing.
By index: df.iloc[index]: Index-based data selector. The end index is excluded during
slicing.
What is loc Method?
The loc[ ] is a label-based method used for selecting data as well as updating it. This is done
by passing the name (label) of the row or column that we wish to select.
Syntax: loc[row_label, column_label]
#Creating a Sample DataFrame
df = pd.DataFrame({
    'id': [101, 102, 103, 104, 105, 106, 107],
    'age': [20, 22, 23, 21, 22, 21, 25],
    'group': ['A', 'B', 'C', 'C', 'B', 'A', 'A'],
    'city': ['Tier1', 'Tier2', 'Tier2', 'Tier3', 'Tier1', 'Tier2', 'Tier1'],
    'gender': ['M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'degree': ['econ', 'econ', 'maths', 'finance', 'history', 'science', 'marketing']
})
#Use the id column as the row labels so that loc[] can select rows by id
df = df.set_index('id')
Selecting a row using loc[ ]
#Selecting a row with label
df.loc[102]
Slicing using loc[ ]
#Slicing using loc[]
df.loc[101:103]
Filtering rows using loc[ ]
#Selecting all rows with a given condition
df.loc[df.age >= 22]
#Selecting rows with multiple conditions
df.loc[(df.age >= 22) & (df.city == 'Tier1')]
Filtering columns using loc[ ]
#Selecting columns with a given condition
df.loc[(df.gender == 'M'), ['group', 'degree']]
Updating columns using loc[ ]
#Updating a column with a given condition
df.loc[(df.gender == 'M'), ['group']] = 'A'
df
#Updating multiple columns with a given condition
df.loc[(df.gender == 'F'), ['group', 'city']] = ['B','Tier2']
df
What is iloc Method?
The iloc[ ] is an index-based method used for data selection. In this case, we pass the
positions of the row or column that we wish to select (0-based integer index).
Syntax: iloc[row_position, column_position]
Selecting a row using iloc
#Selecting rows with index
df.iloc[[2, 4]]  # 2 and 4 are integer positions, so the third and the fifth rows are displayed
Selecting rows and columns using iloc
#Selecting rows with particular index and particular columns
df.iloc[[0, 4], [1, 3]]  # [0, 4] are the row positions and [1, 3] the column positions
Slicing using iloc
#Selecting range of rows
df.iloc[1:5]  # position 5, the endpoint, is not included
We can also select a range of rows and columns:
#Selecting range of rows and columns
df.iloc[1:3,2:4]
Joining and Merging Dataframes
Pandas provides several functions for this purpose, primarily merge() and concat().
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'),
copy=None, indicator=False, validate=None): Merge DataFrame or named Series
objects with a database-style join. A named Series object is treated as a
DataFrame with a single named column.
how: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
Type of merge to be performed.
left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
right: use only keys from right frame, similar to a SQL right outer
join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL
inner join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves
the order of the left keys.
on: label or list: Column or index level names to join on. These must be found
in both DataFrames. If on is None and we are not merging on indexes, this
defaults to the intersection of the columns in both DataFrames.
left_on: Column(s) from the left DataFrame to use as keys.
right_on: Column(s) from the right DataFrame to use as keys.
suffixes: Suffix to apply to overlapping column names in the left and right side.
import pandas as pd

df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value1': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value2': [5, 6, 7, 8]
})
print(df1)
  key  value1
0   A       1
1   B       2
2   C       3
3   D       4
print(df2)
  key  value2
0   B       5
1   D       6
2   E       7
3   F       8

# Default merge: an inner join on the common column 'key'
result = pd.merge(df1, df2)
print(result)
  key  value1  value2
0   B       2       5
1   D       4       6

# Naming the join column explicitly gives the same result
result = pd.merge(df1, df2, on='key')
print(result)
  key  value1  value2
0   B       2       5
1   D       4       6

# Left join: keep every key from df1
result = pd.merge(df1, df2, how='left')
print(result)
  key  value1  value2
0   A       1     NaN
1   B       2     5.0
2   C       3     NaN
3   D       4     6.0

# Right join: keep every key from df2
result = pd.merge(df1, df2, how='right')
print(result)
  key  value1  value2
0   B     2.0       5
1   D     4.0       6
2   E     NaN       7
3   F     NaN       8

# Outer join: union of the keys from both frames
result = pd.merge(df1, df2, how='outer')
print(result)
  key  value1  value2
0   A     1.0     NaN
1   B     2.0     5.0
2   C     3.0     NaN
3   D     4.0     6.0
4   E     NaN     7.0
5   F     NaN     8.0

# Inner join: intersection of the keys
result = pd.merge(df1, df2, how='inner')
print(result)
  key  value1  value2
0   B       2       5
1   D       4       6

# Cross join: cartesian product; the overlapping 'key' column gets suffixes
result = pd.merge(df1, df2, how='cross')
print(result)
   key_x  value1 key_y  value2
0      A       1     B       5
1      A       1     D       6
2      A       1     E       7
3      A       1     F       8
4      B       2     B       5
5      B       2     D       6
6      B       2     E       7
7      B       2     F       8
8      C       3     B       5
9      C       3     D       6
10     C       3     E       7
11     C       3     F       8
12     D       4     B       5
13     D       4     D       6
14     D       4     E       7
15     D       4     F       8
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
... 'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
... 'value': [5, 6, 7, 8]})
>>> df1.merge(df2, left_on='lkey', right_on='rkey')
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 8
2 bar 2 bar 6
3 baz 3 baz 7
4 foo 5 foo 5
5 foo 5 foo 8
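The suffixes parameter described earlier controls how the overlapping value column is renamed; a short sketch with the same frames:
>>> df1.merge(df2, left_on='lkey', right_on='rkey',
...           suffixes=('_left', '_right'))
  lkey  value_left rkey  value_right
0  foo           1  foo            5
1  foo           1  foo            8
2  bar           2  bar            6
3  baz           3  baz            7
4  foo           5  foo            5
5  foo           5  foo            8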
pandas.concat
pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None,
levels=None, names=None, verify_integrity=False, sort=False, copy=None):
Concatenate pandas objects along a particular axis.
axis: {0/’index’, 1/’columns’}, default 0
Axis to concatenate along (0 for rows, 1 for columns).
join: {‘inner’, ‘outer’}, default ‘outer’
How to handle indexes on the other axis (or axes).
ignore_index: bool, default False
If True, do not use the index values along the concatenation axis.
The resulting axis will be labeled 0, …, n - 1. This is useful if you are
concatenating objects where the concatenation axis does not have
meaningful indexing information. Note the index values on the other
axes are still respected in the join.
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])

# Stack the rows (axis=0 is the default); original index values are kept
print(pd.concat([df1, df2]))
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

# Concatenate side by side along the columns
print(pd.concat([df1, df2], axis=1))
  letter  number letter  number
0      a       1      c       3
1      b       2      d       4

# Renumber the resulting rows 0, ..., n - 1
print(pd.concat([df1, df2], ignore_index=True))
  letter  number
0      a       1
1      b       2
2      c       3
3      d       4
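The join parameter only matters when the frames disagree on the other axis; a minimal sketch (df3 is a hypothetical frame introduced here for illustration):
# df3 shares only 'letter' and 'number' with df1
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])
print(pd.concat([df1, df3], join='inner'))  # keep only the common columns
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4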
Pivot Table
Pivot tables in Pandas are powerful tools for summarizing data, allowing you to aggregate and
reshape your data easily. The pivot_table() function in Pandas is similar to Excel's pivot tables.
pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
fill_value=None, margins=False, margins_name='All')
data: The DataFrame to pivot.
values: Column(s) to aggregate.
index: Column(s) to group by on the rows.
columns: Column(s) to group by on the columns.
aggfunc: Function to aggregate values (e.g., mean, sum, count, etc.). Default is mean.
fill_value: Value to replace missing values.
margins: Add all rows/columns (like Excel's pivot table "Grand Totals").
margins_name: Name of the margins row/column.
import pandas as pd

data = {
    'Region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250, 300, 400],
    'Quantity': [10, 15, 20, 25, 30, 40]
}
df = pd.DataFrame(data)
print(df)
  Region Product  Sales  Quantity
0   East       A    100        10
1   East       B    150        15
2   West       A    200        20
3   West       B    250        25
4   East       A    300        30
5   West       B    400        40

# Total Sales by Region and Product
pivot = pd.pivot_table(df, values='Sales', index='Region',
                       columns='Product', aggfunc='sum')
print(pivot)
Product    A    B
Region
East     400  150
West     200  650

# Several aggregation functions at once
pivot = pd.pivot_table(df, values='Sales', index='Region',
                       columns='Product', aggfunc=[sum, len])
print(pivot)
         sum      len
Product    A    B   A  B
Region
East     400  150   2  1
West     200  650   1  2

# Group on the rows only, with a multi-level index
pivot = pd.pivot_table(df, values='Sales',
                       index=['Region', 'Product'], aggfunc='sum')
print(pivot)
                Sales
Region Product
East   A          400
       B          150
West   A          200
       B          650

# Add grand totals with margins
pivot = pd.pivot_table(df, values='Sales', index='Region',
                       columns='Product', aggfunc='sum', margins=True)
print(pivot)
Product    A    B   All
Region
East     400  150   550
West     200  650   850
All      600  800  1400

# Aggregate several value columns at once
pivot = pd.pivot_table(df, values=['Sales', 'Quantity'],
                       index='Region', columns='Product', aggfunc='sum')
print(pivot)
        Quantity     Sales
Product        A   B     A    B
Region
East          40  15   400  150
West          20  65   200  650
Aggregation and Grouping
DataFrame.agg(func=None, axis=0): Aggregate using one or more operations over the
specified axis.
func: Function to use for aggregating the data, e.g. a list of functions and/or
function names such as ['sum', 'mean'].
axis{0 or ‘index’, 1 or ‘columns’}, default 0
If 0 or ‘index’: apply function to each column. If 1 or ‘columns’:
apply function to each row.
>>> import numpy as np
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])
Aggregate these functions over the rows.
>>> df.agg(['sum', 'min'])
A B C
sum 12.0 15.0 18.0
min 1.0 2.0 3.0
Different aggregations per column.
>>> df.agg({'A': ['sum', 'min'], 'B': ['min', 'max']})
A B
sum 12.0 NaN
min 1.0 2.0
max NaN 8.0
Aggregate over the columns.
>>> df.agg("mean", axis="columns")
0 2.0
1 5.0
2 8.0
3 NaN
DataFrame.groupby: The groupby() method is used to group data based on one or more
columns, and then you can apply aggregation functions such as mean(), sum(), count(), etc.
Syntax:
df.groupby('column_name').aggregate_function()
Example:
import pandas as pd
# Sample DataFrame
data = {
'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
'Employee': ['John', 'Anna', 'Mike', 'Sara', 'Paul', 'Kate'],
'Salary': [50000, 60000, 70000, 80000, 75000, 65000]
}
df = pd.DataFrame(data)
# Group by Department and calculate the mean salary for each department
grouped_df = df.groupby('Department')['Salary'].mean()
print(grouped_df)
output:
Department
Finance 70000.0
HR 55000.0
IT 75000.0
Name: Salary, dtype: float64
Common Aggregation Functions (a short sketch follows this list):
.sum(): Sum of values.
.mean(): Mean of values.
.count(): Count of non-null values.
.min(), .max(): Minimum and maximum values.
.std(), .var(): Standard deviation and variance.
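As a quick sketch, the grouped Salary column from the example above supports any of these reducers:
grouped = df.groupby('Department')['Salary']
print(grouped.count())  # number of non-null salaries per department
print(grouped.min())    # lowest salary per department
print(grouped.std())    # sample standard deviation per department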
Multiple Aggregations:
You can also apply multiple aggregation functions to the groups:
# Applying multiple aggregation functions
grouped_df = df.groupby('Department')['Salary'].agg(['mean', 'min', 'max'])
print(grouped_df)
output:
              mean    min    max
Department
Finance    70000.0  65000  75000
HR         55000.0  50000  60000
IT         75000.0  70000  80000
Grouping by Multiple Columns:
You can group by multiple columns as well:
# Group by Department and Employee
grouped_df = df.groupby(['Department', 'Employee'])['Salary'].sum()
print(grouped_df)
output:
Department Employee
Finance Kate 65000
Paul 75000
HR Anna 60000
John 50000
IT Mike 70000
Sara 80000
Name: Salary, dtype: int64
Summary tables in pandas
In pandas, you can create summary tables (or summary statistics tables) using several built-in
methods. These tables provide key statistics such as mean, count, sum, etc., for each group or
the entire dataset.
1. Basic Summary Statistics Table: .describe()
The .describe() method provides a summary of statistics like count, mean, standard
deviation, min, and max for numeric columns.
import pandas as pd
# Sample DataFrame
data = {
'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
'Salary': [50000, 60000, 70000, 80000, 75000, 65000],
'Years_Experience': [5, 7, 3, 10, 8, 6]
}
df = pd.DataFrame(data)
# Summary statistics
summary_table = df.describe()
print(summary_table)
output:
Salary Years_Experience
count 6.000000 6.000000
mean 66666.666667 6.500000
std 11690.292990 2.338090
min 50000.000000 3.000000
25% 60000.000000 5.250000
50% 67500.000000 6.500000
75% 75000.000000 7.750000
max 80000.000000 10.000000
2. Summary Table by Group: .groupby() + Aggregation
You can use the groupby() function combined with aggregation methods to generate
summary tables based on categorical columns.
Example: Summary by Department
# Group by Department and calculate summary statistics
grouped_summary = df.groupby('Department').agg({
'Salary': ['mean', 'sum', 'min', 'max'],
'Years_Experience': ['mean', 'sum', 'min', 'max']
})
print(grouped_summary)
output:
Salary Years_Experience
mean sum min max mean sum min max
Department
Finance 70000.0 140000 65000 75000 7.0 14 6 8
HR 55000.0 110000 50000 60000 6.0 12 5 7
IT 75000.0 150000 70000 80000 6.5 13 3 10
3. Crosstab Summary: pd.crosstab()
pd.crosstab() creates a summary table (cross-tabulation) of counts or aggregations
between two or more categorical columns.
Example: Cross-tabulation of Departments and Years of Experience
# Crosstab of Department and Years_Experience
crosstab_summary = pd.crosstab(df['Department'], df['Years_Experience'])
print(crosstab_summary)
output:
Years_Experience 3 5 6 7 8 10
Department
Finance 0 0 1 0 1 0
HR 0 1 0 1 0 0
IT 1 0 0 0 0 1
In this case, it shows the number of employees in each department with various years of experience.
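Beyond counts, crosstab can aggregate a value column directly through its values and aggfunc parameters; a short sketch with the same df (the result mirrors the pivot table in the next subsection):
# Mean salary for each Department / Years_Experience combination
crosstab_mean = pd.crosstab(df['Department'], df['Years_Experience'],
                            values=df['Salary'], aggfunc='mean')
print(crosstab_mean)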
4. Pivot Tables: pd.pivot_table()
Pivot tables provide more flexibility by allowing you to compute aggregated values across
multiple dimensions.
Example: Pivot Table of Salary by Department and Years of Experience
# Pivot table for Salary with Department and Years_Experience
pivot_summary = pd.pivot_table(df, values='Salary', index='Department',
columns='Years_Experience', aggfunc='mean')
print(pivot_summary)
output:
Years_Experience 3 5 6 7 8 10
Department
Finance NaN NaN 65000.0 NaN 75000.0 NaN
HR NaN 50000.0 NaN 60000.0 NaN NaN
IT 70000.0 NaN NaN NaN NaN 80000.0
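The fill_value parameter listed earlier replaces the NaN cells in sparse pivots like this one; a brief sketch:
# Same pivot, with missing combinations shown as 0 instead of NaN
pivot_filled = pd.pivot_table(df, values='Salary', index='Department',
                              columns='Years_Experience', aggfunc='mean',
                              fill_value=0)
print(pivot_filled)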
5. Custom Summary Table Using .apply()
If you need a highly customized summary, you can use .apply() to create a summary table
with user-defined functions.
Example: Custom Summary for Salary
# Custom function for range (max - min)
def salary_range(x):
    return x.max() - x.min()
# Apply custom aggregation to Salary
custom_summary = df.groupby('Department').Salary.apply(salary_range)
print(custom_summary)
output:
Department
Finance 10000
HR 10000
IT 10000
Name: Salary, dtype: int64
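The same custom reducer can also be passed inline to .agg as a lambda, which is convenient when combining it with built-in functions; a short equivalent sketch:
# Inline equivalent of salary_range, alongside built-in reducers
custom_summary = df.groupby('Department')['Salary'].agg(
    ['min', 'max', lambda x: x.max() - x.min()])
print(custom_summary)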