Unit - III
Pandas
Introducing Pandas Objects
Pandas objects are enhanced versions of NumPy structured arrays in which the rows and
columns are identified with labels rather than simple integer indices.
Three fundamental Pandas data structures: the Series, DataFrame, and Index.
import numpy as np
import pandas as pd
The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data.
It can be created from a list or array as follows:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
data.values
array([ 0.25, 0.5 , 0.75, 1. ])
data.index
RangeIndex(start=0, stop=4, step=1)
data[1]
0.5
data[1:3]
1 0.50
2 0.75
dtype: float64
Series as generalized NumPy array
We can use strings as an index:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
data['b']
0.5
We can even use non-contiguous or non-sequential indices:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=[2, 5, 3, 7])
data
2 0.25
5 0.50
3 0.75
7 1.00
dtype: float64
data[5]
0.5
Series as specialized dictionary
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
dtype: int64
population['California']
38332521
population['California':'Illinois']
California 38332521
Florida 19552860
Illinois 12882135
dtype: int64
Constructing Series objects
>>> pd.Series(data, index=index)
where index is an optional argument and data can be one of many entities (for example a list, a scalar, or a dictionary, as the examples below show).
pd.Series([2, 4, 6])
0 2
1 4
2 6
dtype: int64
pd.Series(5, index=[100, 200, 300])
100 5
200 5
300 5
dtype: int64
pd.Series({2:'a', 1:'b', 3:'c'})
1 b
2 a
3 c
dtype: object
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
3 c
2 a
dtype: object
The Pandas DataFrame Object
If a Series is an analog of a one-dimensional array with flexible indices, a
DataFrame is an analog of a two-dimensional array with both flexible row indices and
flexible column names.
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64
states = pd.DataFrame({'population': population,
'area': area})
states
            area    population
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
New York    141297  19651127
Texas       695662  26448193
states.index
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
states.columns
Index(['area', 'population'], dtype='object')
DataFrame as specialized dictionary
states['area']
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways.
From a single Series object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be
constructed from a single Series:
pd.DataFrame(population, columns=['population'])
            population
California  38332521
Florida     19552860
Illinois    12882135
New York    19651127
Texas       26448193
From a list of dicts
data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data)
a b
0 0 0
1 1 2
2 2 4
From a dictionary of Series objects
pd.DataFrame({'population': population,
'area': area})
            area    population
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
New York    141297  19651127
Texas       695662  26448193
From a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
foo bar
a 0.865257 0.213169
b 0.442759 0.108267
c 0.047110 0.905718
From a NumPy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
array([(0, 0.0), (0, 0.0), (0, 0.0)],
dtype=[('A', '<i8'), ('B', '<f8')])
pd.DataFrame(A)
A B
0 0 0.0
1 0 0.0
2 0 0.0
The Pandas Index Object
Both the Series and DataFrame objects contain an explicit index that lets you reference
and modify data.
The Index object is an interesting structure in itself, and it can be thought of either as
an immutable array or as an ordered set.
ind = pd.Index([2, 3, 5, 7, 11])
ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')
Index as immutable array
ind[1]
3
ind[::2]
Int64Index([2, 5, 11], dtype='int64')
print(ind.size, ind.shape, ind.ndim, ind.dtype)
5 (5,) 1 int64
One difference between Index objects and NumPy arrays is that indices are immutable;
that is, they cannot be modified via the normal means:
ind[1] = 0   # raises a TypeError: Index does not support mutable operations
Index as ordered set
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
Int64Index([3, 5, 7], dtype='int64')
indA | indB # union
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA ^ indB # symmetric difference
Int64Index([1, 2, 9, 11], dtype='int64')
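In newer versions of Pandas, the operator forms above are deprecated in favor of explicit set methods; a sketch of the equivalent calls (the results match the outputs shown above):
indA.intersection(indB)          # [3, 5, 7]
indA.union(indB)                 # [1, 2, 3, 5, 7, 9, 11]
indA.symmetric_difference(indB)  # [1, 2, 9, 11]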
Data Selection in Series
Indexers: loc, iloc, and ix
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1 a
3 b
5 c
dtype: object
# explicit index when indexing
data[1]
'a'
# implicit index when slicing
data[1:3]
3 b
5 c
dtype: object
First, the loc attribute allows indexing and slicing that always references the explicit
index:
data.loc[1]
'a'
data.loc[1:3]
1 a
3 b
dtype: object
The iloc attribute allows indexing and slicing that always references the implicit Python-
style index:
data.iloc[1]
'b'
data.iloc[1:3]
3 b
5 c
dtype: object
A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to
standard []-based indexing.
The purpose of the ix indexer will become more apparent in the context
of DataFrame objects.
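For reference, a minimal sketch (reusing the states DataFrame built earlier; exact results depend on its row and column order) of how loc and iloc carry over to DataFrames. Note that in recent versions of Pandas the ix indexer has been removed, so loc and iloc are the recommended indexers:
states.loc['California', 'area']   # explicit row and column labels
states.iloc[0, 0]                  # implicit integer positions: row 0, column 0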
Operating on Data in Pandas
One of the essential pieces of NumPy is the ability to perform quick element-wise
operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with
more sophisticated operations (trigonometric functions, exponential and logarithmic
functions, etc.).
Pandas inherits much of this functionality from NumPy, and the ufuncs are key to this.
Pandas includes a couple of useful twists, however (both are illustrated below):
i. For unary operations like negation and trigonometric functions, these ufuncs
will preserve index and column labels in the output.
ii. For binary operations such as addition and multiplication, Pandas will
automatically align indices when passing the objects to the ufunc.
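A minimal sketch (example values chosen here for illustration) of both behaviors:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
np.exp(s)        # unary ufunc: the result keeps the index ['a', 'b', 'c']

x = pd.Series([2, 4, 6], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
x + y            # indices are aligned; 'a' and 'd' have no match and become NaN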
Handling Missing Data
The difference between data found in many tutorials and data in the real world is that
real-world data is rarely clean and homogeneous.
In particular, many interesting datasets will have some amount of data missing.
Different data sources may indicate missing data in different ways.
Here we will look at some general considerations for missing data, see how Pandas chooses
to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python.
We refer to missing data in general as null, NaN, or NA values.
There are a number of schemes that have been developed to indicate the presence of
missing data in a table or DataFrame.
There are two general strategies: using a mask that globally indicates missing values, or choosing
a sentinel value (such as -9999, NaN, or None) that indicates a missing entry, as in the
sketch below.
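A minimal sketch (values chosen for illustration) of the sentinel approach Pandas takes: both None and NaN are treated as missing, and integer data containing them is upcast to floating point:
s = pd.Series([1, np.nan, 2, None])
s    # None is converted to NaN; the integers are upcast to float64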
Operating on Null Values
Pandas treats None and NaN as essentially interchangeable for indicating missing or null
values.
To facilitate this convention, there are several useful methods for detecting, removing,
and replacing null values in Pandas data structures (illustrated after the list). They are:
isnull(): Generate a boolean mask indicating missing values
notnull(): Opposite of isnull()
dropna(): Return a filtered version of the data
fillna(): Return a copy of the data with missing values filled or imputed
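A brief illustration of these methods on a small Series (example values chosen here for illustration):
data = pd.Series([1, np.nan, 'hello', None])

data.isnull()          # Boolean mask: True where values are null
data[data.notnull()]   # select only the non-null entries
data.dropna()          # return the data with null entries removed
data.fillna(0)         # return a copy with nulls replaced by 0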
Hierarchical Indexing
Pandas provides objects that can handle three-dimensional and four-dimensional data.
A more common pattern in practice, however, is to make use of hierarchical indexing (also known
as multi-indexing) to incorporate multiple index levels within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar
one-dimensional Series and two-dimensional DataFrame objects.
Pandas MultiIndex
import pandas as pd
import numpy as np
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
(California, 2000) 33871648
(California, 2010) 37253956
(New York, 2000) 18976457
(New York, 2010) 19378102
(Texas, 2000) 20851820
(Texas, 2010) 25145561
dtype: int64
index = pd.MultiIndex.from_tuples(index)
index
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
The MultiIndex contains multiple levels of indexing: in this case, the state names and the
years, as well as multiple labels for each data point which encode these levels.
If we re-index our series with this MultiIndex, we see the hierarchical representation of
the data:
pop = pop.reindex(index)
pop
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
pop[:, 2010]
California 37253956
New York 19378102
Texas 25145561
dtype: int64
MultiIndex as extra dimension
The unstack() method will quickly convert a multiply indexed Series into a
conventionally indexed DataFrame:
pop_df = pop.unstack()
pop_df
            2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561
The stack() method provides the opposite operation:
pop_df.stack()
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094,
4687374, 4318033,
5906301, 6879014]})
pop_df
                  total     under18
California  2000  33871648  9267089
            2010  37253956  9284094
New York    2000  18976457  4687374
            2010  19378102  4318033
Texas       2000  20851820  5906301
            2010  25145561  6879014
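As a quick illustration of why this extra-dimension view is useful (a sketch based on the pop_df data above), we can compute the fraction of the population under 18 and unstack it into a states-by-years table:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()   # rows: states, columns: years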
Methods of MultiIndex Creation
The most straightforward way to construct a multiply indexed Series or DataFrame is to
simply pass a list of two or more index arrays to the constructor.
df = pd.DataFrame(np.random.rand(4, 2),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
      data1     data2
a  1  0.554233  0.356072
   2  0.925244  0.219474
b  1  0.441759  0.610054
   2  0.171495  0.886688
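Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default: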
data = {('California', 2000): 33871648,
('California', 2010): 37253956,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561,
('New York', 2000): 18976457,
('New York', 2010): 19378102}
pd.Series(data)
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
Explicit MultiIndex constructors
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
MultiIndex level names
Sometimes it is convenient to name the levels of the MultiIndex. This can be
accomplished by passing the names argument to any of the
above MultiIndex constructors.
pop.index.names = ['state', 'year']
pop
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
Indexing and Slicing a MultiIndex
pop
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
pop.loc['California':'New York']
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
dtype: int64
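A few more access patterns for this multiply indexed Series (a sketch; the results follow from the data above):
pop['California', 2000]   # index with both levels: a single value
pop['California']         # partial indexing: all years for one state
pop[:, 2000]              # all states for the year 2000
pop[pop > 22000000]       # Boolean masking also works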
Combining Datasets: Concat and Append
import pandas as pd
import numpy as np
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
x = [[1, 2],
[3, 4]]
np.concatenate([x, x], axis=1)
array([[1, 2, 1, 2],
[3, 4, 3, 4]])
Simple Concatenation with pd.concat
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
1 A
2 B
3 C
4 D
5 E
6 F
dtype: object
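The examples below use make_df, a small helper (not part of Pandas) for building example DataFrames, and a display helper that simply shows several objects side by side; in a plain script, print() works as well. A minimal sketch of make_df, following the convention that each cell combines the column name and the row label:
def make_df(cols, ind):
    """Quickly make a DataFrame whose cells combine column name and row label."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

# make_df('AB', [1, 2]) produces:
#     A   B
# 1  A1  B1
# 2  A2  B2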
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis=1)")
Concatenation with joins
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
display('df5', 'df6',
"pd.concat([df5, df6], join='inner')")
The append() method
display('df1', 'df2', 'df1.append(df2)')
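Note that in recent versions of Pandas the append() method has been deprecated and later removed; pd.concat([df1, df2]) produces the same result.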
Merge and Join
Categories of Joins
The pd.merge() function implements a number of types of joins: the one-to-one, many-to-
one, and many-to-many joins.
All three types of joins are accessed via an identical call to the pd.merge() interface; the
type of join performed depends on the form of the input data.
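A minimal sketch (example data chosen here for illustration) of a one-to-one join, where the two DataFrames share an 'employee' column that pd.merge() uses as the key:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df3 = pd.merge(df1, df2)   # joins on the shared 'employee' column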