Basic Data Processing with Pandas
Pandas is a Python library used for working with data sets.
It has functions for analysing, cleaning, exploring, and manipulating data.
Why Use Pandas?
Pandas allows us to analyse large data sets and draw conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
Pandas Series
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
class pandas.Series(data=None, index=None, dtype=None, name=None,
copy=None, fastpath=_NoDefault.no_default): One-dimensional ndarray
with axis labels (including time series).
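The signature above also accepts dtype and name; a minimal sketch of both (the name 'scores' is illustrative):
>>> import pandas as pd
>>> pd.Series([1, 2, 3], dtype='float64', name='scores')
0    1.0
1    2.0
2    3.0
Name: scores, dtype: float64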
Constructing Series from a dictionary
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d)
>>> ser
a 1
b 2
c 3
dtype: int64
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['x', 'y', 'z'])
>>> ser
x   NaN
y   NaN
z   NaN
dtype: float64
Note that the dictionary keys ('a', 'b', 'c') do not match the requested index labels ('x', 'y', 'z'), so every value is NaN.
Constructing Series from a list
>>> d = [10, 20, 30]
>>> ser = pd.Series(data=d)
>>> ser
0    10
1    20
2    30
dtype: int64
Constructing Series from a scalar
>>> ser = pd.Series(50, index=['a', 'b', 'c'])
>>> ser
a    50
b    50
c    50
dtype: int64
Indexing and Selecting Data
Positional Indexing: Accessing data by position using integer-based indexing (like lists).
Label-based Indexing: Accessing data by index labels.
Boolean Indexing: Selecting data based on a condition
s = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
# Positional indexing
print(s.iloc[0]) # Output: 4
# Label-based indexing
print(s['b']) # Output: 7
# Boolean indexing
print(s[s > 0]) # Output: a 4
# b 7
# d 3
Vectorised Operations
Pandas Series supports vectorised operations, which means operations are applied to each
element of the Series without the need for an explicit loop.
s = pd.Series([1, 2, 3, 4])
# Arithmetic operations
print(s + 5) # Output: Series with each element incremented by 5
# Element-wise operations
print(s * 2) # Output: Series with each element multiplied by 2
Alignment of Data
When performing operations between two Series, Pandas automatically aligns the data based
on the index. If an index is missing, the result will have NaN for that index.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
# Adding Series
result = s1 + s2
print(result)
a NaN
b 6.0
c 8.0
d NaN
dtype: float64
Handling Missing Data
Pandas Series has built-in methods to handle missing data (NaN values).
.isnull(): Returns a Boolean Series indicating if each value is NaN.
.notnull(): Returns the inverse of .isnull().
.fillna(): Fills missing data with a specified value.
.dropna(): Removes missing data from the Series.
s = pd.Series([1, None, 3, None, 5])
# Detect missing values
print(s.isnull()) # Output: Boolean Series
0    False
1     True
2    False
3     True
4    False
dtype: bool
print(s.notnull()) # Output: Boolean Series
0     True
1    False
2     True
3    False
4     True
dtype: bool
# Fill missing values
s_filled = s.fillna(0)
print(s_filled) # Output: Series with `None` replaced by 0
0    1.0
1    0.0
2    3.0
3    0.0
4    5.0
dtype: float64
# Drop missing values
s_dropped = s.dropna()
print(s_dropped) # Output: Series with missing values removed
0    1.0
2    3.0
4    5.0
dtype: float64
Series Methods
Series objects come with a variety of methods for data manipulation and analysis:
.sum(): Returns the sum of all elements.
.mean(): Returns the mean of the elements.
.sort_values(): Sorts the Series by its values.
.rank(): Ranks the values in the Series.
.apply(): Applies a function to each element.
s = pd.Series([5, 3, 8, 2])
print(s.sum()) # Output: 18
print(s.mean()) # Output: 4.5
print(s.sort_values()) # Output: Sorted Series
3    2
1    3
0    5
2    8
dtype: int64
print(s.rank()) # Output: Series with ranks
0    3.0
1    2.0
2    4.0
3    1.0
dtype: float64
print(s.apply(lambda x: x**2)) # Output: Series with squared values
0    25
1     9
2    64
3     4
dtype: int64
Series and Time Series Data:
Pandas Series is well-suited for handling time series data. You can use date ranges as the
index to create a time series.
pandas.date_range(start=None, end=None, periods=None, freq=None, ...)
pd.date_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
pd.date_range(start='1/1/2018', periods=8)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
pd.date_range(end='1/1/2018', periods=8)
DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
'2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
dtype='datetime64[ns]', freq='D')
Specify start, end, and periods; the frequency is generated automatically
(linearly spaced).
pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
'2018-04-27 00:00:00'],
dtype='datetime64[ns]', freq=None)
dates = pd.date_range('20230101', periods=6)
s = pd.Series([1, 3, 5, 7, 9, 11], index=dates)
print(s)
2023-01-01 1
2023-01-02 3
2023-01-03 5
2023-01-04 7
2023-01-05 9
2023-01-06 11
Freq: D, dtype: int64
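With a DatetimeIndex, label-based slicing also accepts date strings (so-called partial string indexing); a short sketch reusing the Series above:
# Select a sub-period by date labels (both endpoints included)
print(s['2023-01-02':'2023-01-04'])
2023-01-02    3
2023-01-03    5
2023-01-04    7
Freq: D, dtype: int64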
Querying a Series
Querying a Series in Pandas involves selecting and filtering data based on various conditions.
Accessing Elements by Index: You can access elements in a Series using the index, either
by position (integer-based) or by label.
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# Access the first element by position
# (plain s[0] for positional access is deprecated in recent pandas; use .iloc)
print(s.iloc[0]) # Output: 10
# Access the third element by position
print(s.iloc[2]) # Output: 30
# Access element with index label 'b'
print(s['b']) # Output: 20
# Access multiple elements by label
print(s[['b', 'd']]) # Output: Series with elements 20 and 40
b 20
d 40
Slicing: You can slice a Series to get a subset of the data. Slicing can be done by position or
by label.
# Slice the first three elements
print(s[:3]) # Output: a 10
b 20
c 30
# Slice using labels
print(s['a':'c']) # Output: a 10
b 20
c 30
Boolean Indexing: Boolean indexing allows you to filter a Series based on a condition.
# Query elements greater than 20
print(s[s > 20]) # Output: c 30
d 40
You can combine multiple conditions using logical operators like & (and), | (or), and ~ (not).
# Query elements greater than 10 and less than 40
print(s[(s > 10) & (s < 40)]) # Output: b 20
c 30
Using isin() Method: The isin() method is used to filter data based on whether the elements
are in a list of values.
# Query elements that are either 10 or 30
print(s[s.isin([10, 30])]) # Output: a 10
c 30
Querying with .loc and .iloc: .loc[] is used for label-based indexing,
while .iloc[] is used for position-based indexing.
# Using .loc for label-based indexing
print(s.loc['b':'d']) # Output: b 20
c 30
d 40
# Using .iloc for position-based indexing
print(s.iloc[1:3]) # Output: b 20
c 30
Querying with Conditional Functions: You can also use functions like .where() and
.query() for conditional selection.
.where(): Returns elements that satisfy a condition, otherwise returns NaN.
# Keep elements greater than 20, replace others with NaN
print(s.where(s > 20)) # Output: a NaN
b NaN
c 30.0
d 40.0
.query(): Though primarily used for DataFrames, it can be used with Series in certain contexts
when converting to DataFrame temporarily.
df = s.to_frame(name='value')
result = df.query('value > 20')
print(result)
   value
c     30
d     40
Handling Missing Data: While querying, you might encounter missing data (NaN).
.isnull() and .notnull(): Check for NaN values.
s = pd.Series([10, None, 30, None, 50])
# Query non-null values
print(s[s.notnull()]) # Output: 0 10.0
2 30.0
4 50.0
.dropna(): Removes NaN values from the Series.
# Drop missing data
print(s.dropna()) # Output: 0 10.0
2 30.0
4 50.0
Pandas DataFrames
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table
with rows and columns.
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None,
copy=None): Two-dimensional, size-mutable, potentially heterogeneous tabular data.
We can create a Pandas DataFrame in the following ways:
Using Python Dictionary
Using Python List
From a File
Creating an Empty DataFrame
import pandas as pd
# create a dictionary
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
# create a dataframe from the dictionary
df = pd.DataFrame(data)
print(df)
    Name  Age      City
0   John   25  New York
1  Alice   30    London
2    Bob   35     Paris
# create a two-dimensional list
data = [['John', 25, 'New York'],
['Alice', 30, 'London'],
['Bob', 35, 'Paris']]
# create a DataFrame from the list
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)
    Name  Age      City
0   John   25  New York
1  Alice   30    London
2    Bob   35     Paris
# load data from a CSV file
df = pd.read_csv('data.csv')
print(df)
# create an empty DataFrame
df = pd.DataFrame()
print(df)
Empty DataFrame
Columns: []
Index: []
Basic DataFrame Operations
Viewing Data:
DataFrame.head(n=5): Return the first n rows.
For negative values of n, this function returns all rows except the last |n| rows
If n is larger than the number of rows, this function returns all rows.
df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
... 'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra
>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
>>> df.head(3)
animal
0 alligator
1 bee
2 falcon
>>> df.head(-3)
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
DataFrame.tail(n=5): Return the last n rows.
For negative values of n, this function returns all rows except the first |n| rows.
If n is larger than the number of rows, this function returns all rows.
>>> df.tail()
animal
4 monkey
5 parrot
6 shark
7 whale
8 zebra
>>> df.tail(3)
animal
6 shark
7 whale
8 zebra
>>> df.tail(-3)
animal
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra
DataFrame.info(): Print a concise summary of a DataFrame. This method prints
information about a DataFrame including the index dtype and columns, non-null values and
memory usage.
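The summary below assumes a small three-column sample frame along these lines (the values are illustrative; the reported memory usage depends on the contents):
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'int_col': [1, 2, 3, 4, 5],
...     'text_col': ['alpha', 'beta', 'gamma', 'delta', 'epsilon'],
...     'float_col': [0.0, 0.25, 0.5, 0.75, 1.0]
... })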
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 int_col 5 non-null int64
1 text_col 5 non-null object
2 float_col 5 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes
Accessing Columns and Rows
Single column: df['column_name'] or df.column_name
Multiple columns: df[['column1', 'column2']]
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification']])
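A single column, by contrast, comes back as a Series; a quick sketch with the same df:
# select one column as a Series
print(df['Name'])
0       Jai
1    Princi
2    Gaurav
3      Anuj
Name: Name, dtype: object
# attribute access is equivalent when the column name is a valid identifier
print(df.Name)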
Column addition: In order to add a column to a Pandas DataFrame, we can declare a new
list and assign it as a column of the existing DataFrame.
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Height': [5.1, 6.2, 5.1, 5.2],
'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# Declare a list that is to be converted into a column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
# Using 'Address' as the column name
# and equating it to the list
df['Address'] = address
# Observe the result
print(df)
Column Deletion: In order to delete a column from a Pandas DataFrame, we can use the
drop() method.
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
>>> df.drop(columns=['B', 'C'])
A D
0 0 3
1 4 7
2 8 11
Drop a row by index
>>> df.drop([0, 1])
A B C D
2 8 9 10 11
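The same column drop can be written with the axis parameter instead of the columns keyword; a brief equivalent sketch:
>>> df.drop(['B', 'C'], axis=1)  # same as df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11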
Selecting Rows
By label: df.loc[label]: Label-based data selector. The end index is included during
slicing.
By index: df.iloc[index]: Index-based data selector. The end index is excluded during
slicing.
What is loc Method?
The loc[ ] is a label-based method used for selecting data as well as updating it. This is done
by passing the name (label) of the row or column that we wish to select.
Syntax: loc[row_label, column_label]
#Creating a Sample DataFrame
df = pd.DataFrame({
    'id': [101, 102, 103, 104, 105, 106, 107],
    'age': [20, 22, 23, 21, 22, 21, 25],
    'group': ['A', 'B', 'C', 'C', 'B', 'A', 'A'],
    'city': ['Tier1', 'Tier2', 'Tier2', 'Tier3', 'Tier1', 'Tier2', 'Tier1'],
    'gender': ['M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'degree': ['econ', 'econ', 'maths', 'finance', 'history', 'science', 'marketing']
})
#Use the id column as the row labels so that loc[] can select rows by id
df = df.set_index('id')
Selecting a row using loc[ ]
#Selecting a row with label
df.loc[102]
Slicing using loc[ ]
#Slicing using loc[]
df.loc[101:103]
Filtering rows using loc[ ]
#Selecting all rows with a given condition
df.loc[df.age >= 22]
#Selecting rows with multiple conditions
df.loc[(df.age >= 22) & (df.city == 'Tier1')]
Filtering columns using loc[ ]
#Selecting columns with a given condition
df.loc[(df.gender == 'M'), ['group', 'degree']]
Updating columns using loc[ ]
#Updating a column with a given condition
df.loc[(df.gender == 'M'), ['group']] = 'A'
df
#Updating multiple columns with a given condition
df.loc[(df.gender == 'F'), ['group', 'city']] = ['B','Tier2']
df
What is iloc Method?
The iloc[ ] is an index-based method used for data selection. In this case, we pass the
positions of the row or column that we wish to select (0-based integer index).
Syntax: iloc[row_position, column_position]
Selecting a row using iloc
#Selecting rows with index
df.iloc[[2, 4]]  # 2 and 4 are integer positions, so the third and the fifth rows are displayed
Selecting rows and columns using iloc
#Selecting rows with particular index and particular columns
df.iloc[[0, 4], [1, 3]]  # [0, 4] are the row positions and [1, 3] the column positions
Slicing using iloc
#Selecting range of rows
df.iloc[1:5]  # position 5, the endpoint, is not included
We can also select a range of rows and columns:
#Selecting range of rows and columns
df.iloc[1:3,2:4]
Joining and Merging Dataframes
Pandas provides several functions for this purpose, primarily merge() and concat().
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'),
copy=None, indicator=False, validate=None): Merge DataFrame or named Series
objects with a database-style join. A named Series object is treated as a
DataFrame with a single named column.
how: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
Type of merge to be performed.
left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
right: use only keys from right frame, similar to a SQL right outer
join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL
inner join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves
the order of the left keys.
on: label or list: Column or index level names to join on. These must be found
in both DataFrames. If on is None and we are not merging on indexes, this
defaults to the intersection of the columns in both DataFrames.
left_on: Column(s) from the left DataFrame to use as keys.
right_on: Column(s) from the right DataFrame to use as keys.
suffixes: Suffix to apply to overlapping column names in the left and right side.
import pandas as pd

df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value1': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value2': [5, 6, 7, 8]
})
print(df1)
  key  value1
0   A       1
1   B       2
2   C       3
3   D       4
print(df2)
  key  value2
0   B       5
1   D       6
2   E       7
3   F       8

# Default merge: an inner join on the common column 'key'
result = pd.merge(df1, df2)
print(result)
  key  value1  value2
0   B       2       5
1   D       4       6

# Naming the join column explicitly gives the same result
result = pd.merge(df1, df2, on='key')
print(result)
  key  value1  value2
0   B       2       5
1   D       4       6

# Left join: keep every key from df1
result = pd.merge(df1, df2, how='left')
print(result)
  key  value1  value2
0   A       1     NaN
1   B       2     5.0
2   C       3     NaN
3   D       4     6.0

# Right join: keep every key from df2
result = pd.merge(df1, df2, how='right')
print(result)
  key  value1  value2
0   B     2.0       5
1   D     4.0       6
2   E     NaN       7
3   F     NaN       8

# Outer join: union of the keys from both frames
result = pd.merge(df1, df2, how='outer')
print(result)
  key  value1  value2
0   A     1.0     NaN
1   B     2.0     5.0
2   C     3.0     NaN
3   D     4.0     6.0
4   E     NaN     7.0
5   F     NaN     8.0

# Inner join: intersection of the keys
result = pd.merge(df1, df2, how='inner')
print(result)
  key  value1  value2
0   B       2       5
1   D       4       6

# Cross join: cartesian product; the overlapping 'key' column gets suffixes
result = pd.merge(df1, df2, how='cross')
print(result)
   key_x  value1 key_y  value2
0      A       1     B       5
1      A       1     D       6
2      A       1     E       7
3      A       1     F       8
4      B       2     B       5
5      B       2     D       6
6      B       2     E       7
7      B       2     F       8
8      C       3     B       5
9      C       3     D       6
10     C       3     E       7
11     C       3     F       8
12     D       4     B       5
13     D       4     D       6
14     D       4     E       7
15     D       4     F       8
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
... 'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
... 'value': [5, 6, 7, 8]})
>>> df1.merge(df2, left_on='lkey', right_on='rkey')
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 8
2 bar 2 bar 6
3 baz 3 baz 7
4 foo 5 foo 5
5 foo 5 foo 8
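The suffixes parameter described earlier controls how the overlapping value column is renamed; a short sketch with the same frames:
>>> df1.merge(df2, left_on='lkey', right_on='rkey',
...           suffixes=('_left', '_right'))
  lkey  value_left rkey  value_right
0  foo           1  foo            5
1  foo           1  foo            8
2  bar           2  bar            6
3  baz           3  baz            7
4  foo           5  foo            5
5  foo           5  foo            8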
pandas.concat
pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None,
levels=None, names=None, verify_integrity=False, sort=False, copy=None):
Concatenate pandas objects along a particular axis.
axis: {0/’index’, 1/’columns’}, default 0
Axis to concatenate along (0 for rows, 1 for columns).
join: {‘inner’, ‘outer’}, default ‘outer’
How to handle indexes on the other axis (or axes).
ignore_index: bool, default False
If True, do not use the index values along the concatenation axis.
The resulting axis will be labeled 0, …, n - 1. This is useful if you are
concatenating objects where the concatenation axis does not have
meaningful indexing information. Note the index values on the other
axes are still respected in the join.
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])

# Stack the rows (axis=0 is the default); original index values are kept
print(pd.concat([df1, df2]))
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

# Concatenate side by side along the columns
print(pd.concat([df1, df2], axis=1))
  letter  number letter  number
0      a       1      c       3
1      b       2      d       4

# Renumber the resulting rows 0, ..., n - 1
print(pd.concat([df1, df2], ignore_index=True))
  letter  number
0      a       1
1      b       2
2      c       3
3      d       4
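The join parameter only matters when the frames disagree on the other axis; a minimal sketch (df3 is a hypothetical frame introduced here for illustration):
# df3 shares only 'letter' and 'number' with df1
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])
print(pd.concat([df1, df3], join='inner'))  # keep only the common columns
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4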
Pivot Table
Pivot tables in Pandas are powerful tools for summarizing data, allowing you to aggregate and
reshape your data easily. The pivot_table() function in Pandas is similar to Excel's pivot tables.
pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
fill_value=None, margins=False, margins_name='All')
data: The DataFrame to pivot.
values: Column(s) to aggregate.
index: Column(s) to group by on the rows.
columns: Column(s) to group by on the columns.
aggfunc: Function to aggregate values (e.g., mean, sum, count, etc.). Default is mean.
fill_value: Value to replace missing values.
margins: Add all rows/columns (like Excel's pivot table "Grand Totals").
margins_name: Name of the margins row/column.
import pandas as pd

data = {
    'Region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250, 300, 400],
    'Quantity': [10, 15, 20, 25, 30, 40]
}
df = pd.DataFrame(data)
print(df)
  Region Product  Sales  Quantity
0   East       A    100        10
1   East       B    150        15
2   West       A    200        20
3   West       B    250        25
4   East       A    300        30
5   West       B    400        40

# Total Sales by Region and Product
pivot = pd.pivot_table(df, values='Sales', index='Region',
                       columns='Product', aggfunc='sum')
print(pivot)
Product    A    B
Region
East     400  150
West     200  650

# Several aggregation functions at once
pivot = pd.pivot_table(df, values='Sales', index='Region',
                       columns='Product', aggfunc=[sum, len])
print(pivot)
         sum      len
Product    A    B   A  B
Region
East     400  150   2  1
West     200  650   1  2

# Group on the rows only, with a multi-level index
pivot = pd.pivot_table(df, values='Sales',
                       index=['Region', 'Product'], aggfunc='sum')
print(pivot)
                Sales
Region Product
East   A          400
       B          150
West   A          200
       B          650

# Add grand totals with margins
pivot = pd.pivot_table(df, values='Sales', index='Region',
                       columns='Product', aggfunc='sum', margins=True)
print(pivot)
Product    A    B   All
Region
East     400  150   550
West     200  650   850
All      600  800  1400

# Aggregate several value columns at once
pivot = pd.pivot_table(df, values=['Sales', 'Quantity'],
                       index='Region', columns='Product', aggfunc='sum')
print(pivot)
        Quantity     Sales
Product        A   B     A    B
Region
East          40  15   400  150
West          20  65   200  650
Aggregation and Grouping
DataFrame.agg(func=None, axis=0): Aggregate using one or more operations over the
specified axis.
func: Function to use for aggregating the data, e.g. a list of functions and/or
function names such as ['sum', 'mean'].
axis{0 or ‘index’, 1 or ‘columns’}, default 0
If 0 or ‘index’: apply function to each column. If 1 or ‘columns’:
apply function to each row.
>>> import numpy as np
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])
Aggregate these functions over the rows.
>>> df.agg(['sum', 'min'])
A B C
sum 12.0 15.0 18.0
min 1.0 2.0 3.0
Different aggregations per column.
>>> df.agg({'A': ['sum', 'min'], 'B': ['min', 'max']})
A B
sum 12.0 NaN
min 1.0 2.0
max NaN 8.0
Aggregate over the columns.
>>> df.agg("mean", axis="columns")
0 2.0
1 5.0
2 8.0
3 NaN
DataFrame.groupby: The groupby() method is used to group data based on one or more
columns, and then you can apply aggregation functions such as mean(), sum(), count(), etc.
Syntax:
df.groupby('column_name').aggregate_function()
Example:
import pandas as pd
# Sample DataFrame
data = {
'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
'Employee': ['John', 'Anna', 'Mike', 'Sara', 'Paul', 'Kate'],
'Salary': [50000, 60000, 70000, 80000, 75000, 65000]
}
df = pd.DataFrame(data)
# Group by Department and calculate the mean salary for each department
grouped_df = df.groupby('Department')['Salary'].mean()
print(grouped_df)
output:
Department
Finance 70000.0
HR 55000.0
IT 75000.0
Name: Salary, dtype: float64
Common Aggregation Functions (a short sketch follows this list):
.sum(): Sum of values.
.mean(): Mean of values.
.count(): Count of non-null values.
.min(), .max(): Minimum and maximum values.
.std(), .var(): Standard deviation and variance.
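As a quick sketch, the grouped Salary column from the example above supports any of these reducers:
grouped = df.groupby('Department')['Salary']
print(grouped.count())  # number of non-null salaries per department
print(grouped.min())    # lowest salary per department
print(grouped.std())    # sample standard deviation per department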
Multiple Aggregations:
You can also apply multiple aggregation functions to the groups:
# Applying multiple aggregation functions
grouped_df = df.groupby('Department')['Salary'].agg(['mean', 'min', 'max'])
print(grouped_df)
output:
              mean    min    max
Department
Finance    70000.0  65000  75000
HR         55000.0  50000  60000
IT         75000.0  70000  80000
Grouping by Multiple Columns:
You can group by multiple columns as well:
# Group by Department and Employee
grouped_df = df.groupby(['Department', 'Employee'])['Salary'].sum()
print(grouped_df)
output:
Department Employee
Finance Kate 65000
Paul 75000
HR Anna 60000
John 50000
IT Mike 70000
Sara 80000
Name: Salary, dtype: int64
Summary tables in pandas
In pandas, you can create summary tables (or summary statistics tables) using several built-in
methods. These tables provide key statistics such as mean, count, sum, etc., for each group or
the entire dataset.
1. Basic Summary Statistics Table: .describe()
The .describe() method provides a summary of statistics like count, mean, standard
deviation, min, and max for numeric columns.
import pandas as pd
# Sample DataFrame
data = {
'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
'Salary': [50000, 60000, 70000, 80000, 75000, 65000],
'Years_Experience': [5, 7, 3, 10, 8, 6]
}
df = pd.DataFrame(data)
# Summary statistics
summary_table = df.describe()
print(summary_table)
output:
Salary Years_Experience
count 6.000000 6.000000
mean 66666.666667 6.500000
std 11690.292990 2.338090
min 50000.000000 3.000000
25% 60000.000000 5.250000
50% 67500.000000 6.500000
75% 75000.000000 7.750000
max 80000.000000 10.000000
2. Summary Table by Group: .groupby() + Aggregation
You can use the groupby() function combined with aggregation methods to generate
summary tables based on categorical columns.
Example: Summary by Department
# Group by Department and calculate summary statistics
grouped_summary = df.groupby('Department').agg({
'Salary': ['mean', 'sum', 'min', 'max'],
'Years_Experience': ['mean', 'sum', 'min', 'max']
})
print(grouped_summary)
output:
Salary Years_Experience
mean sum min max mean sum min max
Department
Finance 70000.0 140000 65000 75000 7.0 14 6 8
HR 55000.0 110000 50000 60000 6.0 12 5 7
IT 75000.0 150000 70000 80000 6.5 13 3 10
3. Crosstab Summary: pd.crosstab()
pd.crosstab() creates a summary table (cross-tabulation) of counts or aggregations
between two or more categorical columns.
Example: Cross-tabulation of Departments and Years of Experience
# Crosstab of Department and Years_Experience
crosstab_summary = pd.crosstab(df['Department'], df['Years_Experience'])
print(crosstab_summary)
output:
Years_Experience 3 5 6 7 8 10
Department
Finance 0 0 1 0 1 0
HR 0 1 0 1 0 0
IT 1 0 0 0 0 1
In this case, it shows the number of employees in each department with various years of experience.
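Beyond counts, crosstab can aggregate a value column directly through its values and aggfunc parameters; a short sketch with the same df (the result mirrors the pivot table in the next subsection):
# Mean salary for each Department / Years_Experience combination
crosstab_mean = pd.crosstab(df['Department'], df['Years_Experience'],
                            values=df['Salary'], aggfunc='mean')
print(crosstab_mean)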
4. Pivot Tables: pd.pivot_table()
Pivot tables provide more flexibility by allowing you to compute aggregated values across
multiple dimensions.
Example: Pivot Table of Salary by Department and Years of Experience
# Pivot table for Salary with Department and Years_Experience
pivot_summary = pd.pivot_table(df, values='Salary', index='Department',
columns='Years_Experience', aggfunc='mean')
print(pivot_summary)
output:
Years_Experience 3 5 6 7 8 10
Department
Finance NaN NaN 65000.0 NaN 75000.0 NaN
HR NaN 50000.0 NaN 60000.0 NaN NaN
IT 70000.0 NaN NaN NaN NaN 80000.0
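The fill_value parameter listed earlier replaces the NaN cells in sparse pivots like this one; a brief sketch:
# Same pivot, with missing combinations shown as 0 instead of NaN
pivot_filled = pd.pivot_table(df, values='Salary', index='Department',
                              columns='Years_Experience', aggfunc='mean',
                              fill_value=0)
print(pivot_filled)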
5. Custom Summary Table Using .apply()
If you need a highly customized summary, you can use .apply() to create a summary table
with user-defined functions.
Example: Custom Summary for Salary
# Custom function for range (max - min)
def salary_range(x):
    return x.max() - x.min()
# Apply custom aggregation to Salary
custom_summary = df.groupby('Department').Salary.apply(salary_range)
print(custom_summary)
output:
Department
Finance 10000
HR 10000
IT 10000
Name: Salary, dtype: int64
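The same custom reducer can also be passed inline to .agg as a lambda, which is convenient when combining it with built-in functions; a short equivalent sketch:
# Inline equivalent of salary_range, alongside built-in reducers
custom_summary = df.groupby('Department')['Salary'].agg(
    ['min', 'max', lambda x: x.max() - x.min()])
print(custom_summary)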