UNIT II EDA using python
Exploratory Data Analysis (Anna University)
EXPLORATORY DATA ANALYSIS
UNIT II EDA USING PYTHON
Data Manipulation using Pandas – Pandas Objects – Data Indexing
and Selection – Operating on Data – Handling Missing Data –
Hierarchical Indexing – Combining datasets – Concat, Append,
Merge and Join – Aggregation and grouping – Pivot Tables –
Vectorized String Operations.
2.1 Data manipulation with Pandas
Pandas is a newer package built on top of NumPy that provides an
efficient implementation of a DataFrame. DataFrames are essentially
multidimensional arrays with attached row and column labels, often
holding heterogeneous types and/or missing data. As well as
offering a convenient storage interface for labeled data, Pandas
implements a number of powerful data operations familiar to users of
both database frameworks and spreadsheet programs.
As we saw, NumPy's ndarray data structure provides essential features
for the type of clean, well-organized data typically seen in numerical
computing tasks. While it serves this purpose very well, its limitations
become clear when we need more flexibility (e.g., attaching labels to data,
working with missing data, etc.) and when attempting operations that do
not map well to element-wise broadcasting (e.g., groupings, pivots, etc.),
each of which is an important piece of analyzing the less structured data
available in many forms in the world around us. Pandas, and in
particular its Series and DataFrame objects, builds on the NumPy array
structure and provides efficient access to these sorts of "data munging"
tasks that occupy much of a data scientist's time.
In this chapter, we will focus on the mechanics of using Series,
DataFrame, and related structures effectively. We will use examples
drawn from real datasets where appropriate, but these examples are not
necessarily the focus.
Installing and Using Pandas
Installation of Pandas on your system requires NumPy to be installed, and
if building the library from source, requires the appropriate tools to
compile the C and Python sources on which Pandas is built. Once Pandas
is installed, you can import it and check the version:
In [1]: import pandas
        pandas.__version__
Out[1]: '0.18.1'
Just as we generally import NumPy under the alias np, we will import
Pandas under the alias pd:
In [2]: import pandas as pd
Reminder about Built-In Documentation
For example, to display all the contents of the pandas namespace, you
can type
In [3]: pd.<TAB>
And to display Pandas's built-in documentation, you can use this:
In [4]: pd?
2.2 Data Indexing and Selection
Data indexing in NumPy includes simple indexing (e.g., arr[2, 1]),
slicing (e.g., arr[:, 1:5]), masking (e.g., arr[arr > 0]), fancy
indexing (e.g., arr[0, [1, 5]]), and combinations thereof
(e.g., arr[:, [1, 5]]). Here we'll look at similar means
of accessing and modifying values in
Pandas Series and DataFrame objects. If you have used the NumPy
patterns, the corresponding patterns in Pandas will feel very familiar,
though there are a few quirks to be aware of.
Data Selection in Series
A Series object acts in many ways like a one-dimensional NumPy array,
and in many ways like a standard Python dictionary. If we keep these two
overlapping analogies in mind, it will help us to understand the patterns
of data indexing and selection in these arrays.
Series as dictionary

Like a dictionary, the Series object provides a
mapping from a collection of keys to a collection of values:
EXAMPLE 1
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
# a    0.25
# b    0.50
# c    0.75
# d    1.00
# dtype: float64

data['b']             # 0.5
'a' in data           # True
data.keys()           # Index(['a', 'b', 'c', 'd'], dtype='object')
list(data.items())    # [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

data['e'] = 1.25
data
# a    0.25
# b    0.50
# c    0.75
# d    1.00
# e    1.25
# dtype: float64
As shown above, we can use dictionary-like Python expressions and
methods to examine the keys/indices and values. Series objects can even
be modified with dictionary-like syntax: just as you can extend a
dictionary by assigning to a new key, you can extend a Series by
assigning to a new index value.
This easy mutability of the objects is a convenient feature: under the
hood, Pandas is making decisions about memory layout and data copying
that might need to take place; the user generally does not need to worry
about these issues.
Series as one-dimensional array
A Series builds on this dictionary-like interface and provides array-style
item selection via the same basic mechanisms as NumPy arrays – that is,
slices, masking, and fancy indexing. Examples of these are as follows:
EXAMPLE 2
# slicing by explicit index
data['a':'c']
# a    0.25
# b    0.50
# c    0.75
# dtype: float64

# slicing by implicit integer index
data[0:2]
# a    0.25
# b    0.50
# dtype: float64

# masking
data[(data > 0.3) & (data < 0.8)]
# b    0.50
# c    0.75
# dtype: float64

# fancy indexing
data[['a', 'e']]
# a    0.25
# e    1.25
# dtype: float64
Among these, slicing may be the source of the most confusion. Notice that
when slicing with an explicit index (i.e., data['a':'c']), the final index is
included in the slice, while when slicing with an implicit index (i.e.,
data[0:2]), the final index is excluded from the slice.
Indexers: loc, iloc, and ix
These slicing and indexing conventions can be a source of confusion. For
example, if your Series has an explicit integer index, an indexing
operation such as data[1] will use the explicit indices, while a slicing
operation like data[1:3] will use the implicit Python-style index.
Downloaded by Mohideen Abdul (mohideengamer@gmail.com)
lOMoARcPSD|50930822
EXAMPLE 3
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
# 1    a
# 3    b
# 5    c
# dtype: object

# explicit index when indexing
data[1]        # 'a'

# implicit index when slicing
data[1:3]
# 3    b
# 5    c
# dtype: object
Because of this potential confusion in the case of integer indexes, Pandas
provides some special indexer attributes that explicitly expose certain
indexing schemes. These are not functional methods, but attributes that
expose a particular slicing interface to the data in the Series.
First, the loc attribute allows indexing and slicing that always references
the explicit index:
EXAMPLE 4
data.loc[1]    # 'a'
data.loc[1:3]
# 1    a
# 3    b
# dtype: object
The iloc attribute allows indexing and slicing that always references the
implicit Python-style index:
EXAMPLE 5

data.iloc[1]   # 'b'
data.iloc[1:3]
# 3    b
# 5    c
# dtype: object
A third indexing attribute, ix, is a hybrid of the two, and for Series
objects is equivalent to standard []-based indexing. The purpose of the ix
indexer will become more apparent in the context of DataFrame objects,
which we will discuss in a moment. (Note that ix has since been deprecated
and removed in newer versions of Pandas; loc and iloc cover its use cases.)
One guiding principle of Python code is that "explicit is better than
implicit." The explicit nature of loc and iloc makes them very useful for
maintaining clean and readable code; especially in the case of integer
indexes, I recommend using both to make code easier to read and
understand, and to prevent subtle bugs due to the mixed indexing/slicing
convention.
Data Selection in DataFrame
Recall that a DataFrame acts in many ways like a two-dimensional or
structured array, and in other ways like a dictionary of Series structures
sharing the same index. These analogies can be helpful to keep in mind
as we explore data selection within this structure.
DataFrame as a dictionary
The first analogy we will consider is the DataFrame as a dictionary of
related Series objects. Let's return to our example of areas and
populations of states:
EXAMPLE 6

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data
#               area       pop
# California  423967  38332521
# Florida     170312  19552860
# Illinois    149995  12882135
# New York    141297  19651127
# Texas       695662  26448193
The individual Series that make up the columns of the DataFrame can be
accessed via dictionary-style indexing of the column name:
EXAMPLE 7

data['area']
# California    423967
# Florida       170312
# Illinois      149995
# New York      141297
# Texas         695662
# Name: area, dtype: int64
Equivalently, we can use attribute-style access with column names that
are strings:
EXAMPLE 8

data.area
data.area is data['area']   # True
data.pop is data['pop']     # False
This attribute-style column access actually accesses the exact same
object as the dictionary-style access. Though this is a useful shorthand,
keep in mind that it does not work for all cases! For example, if the
column names are not strings, or if the column names conflict with
methods of the DataFrame, this attribute-style access is not possible.
For example, the DataFrame has a pop() method, so data.pop will point to
this method rather than the "pop" column.
In particular, you should avoid the temptation to try column assignment
via attribute (i.e., use data['pop'] = z rather than data.pop = z).
Like with the Series objects discussed earlier, this dictionary-style syntax
can also be used to modify the object, in this case adding a new column:
EXAMPLE 9

data['density'] = data['pop'] / data['area']
data
#               area       pop     density
# California  423967  38332521   90.413926
# Florida     170312  19552860  114.806121
# Illinois    149995  12882135   85.883763
# New York    141297  19651127  139.076746
# Texas       695662  26448193   38.018740
This shows a preview of the straightforward syntax of element-by-element
arithmetic between Series objects; we'll dig into this further in
"Operating on Data in Pandas."
DataFrame as two-dimensional array
As mentioned previously, we can also view the DataFrame as an
enhanced two-dimensional array. We can examine the raw underlying
data array using the values attribute:
EXAMPLE 10

data.values
# array([[  4.23967000e+05,   3.83325210e+07,   9.04139261e+01],
#        [  1.70312000e+05,   1.95528600e+07,   1.14806121e+02],
#        [  1.49995000e+05,   1.28821350e+07,   8.58837628e+01],
#        [  1.41297000e+05,   1.96511270e+07,   1.39076746e+02],
#        [  6.95662000e+05,   2.64481930e+07,   3.80187404e+01]])
With this picture in mind, many familiar array-like observations can be
done on the DataFrame itself. For example, we can transpose the full
DataFrame to swap rows and columns:
EXAMPLE 11

data.T
When it comes to indexing of DataFrame objects, however, it is clear that
the dictionary-style indexing of columns precludes our ability to simply
treat it as a NumPy array. In particular, passing a single index to an array
accesses a row and passing a single "index" to a DataFrame accesses a
column:
EXAMPLE 12

data.values[0]
# array([  4.23967000e+05,   3.83325210e+07,   9.04139261e+01])

data['area']
# California    423967
# Florida       170312
# Illinois      149995
# New York      141297
# Texas         695662
# Name: area, dtype: int64
Thus for array-style indexing, we need another convention. Here Pandas
again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc
indexer, we can index the underlying array as if it is a simple NumPy
array (using the implicit Python-style index), but the DataFrame index
and column labels are maintained in the result:
EXAMPLE 13

data.iloc[:3, :2]
#               area       pop
# California  423967  38332521
# Florida     170312  19552860
# Illinois    149995  12882135
Similarly, using the loc indexer we can index the underlying data in an
array-like style but using the explicit index and column names:
EXAMPLE 14

data.loc[:'Illinois', :'pop']
#               area       pop
# California  423967  38332521
# Florida     170312  19552860
# Illinois    149995  12882135
Keep in mind that for integer indices, the ix indexer is subject to the
same potential sources of confusion as discussed for integer-indexed
Series objects.
2.3 Operating on Data
One of the essential pieces of NumPy is the ability to perform quick
element-wise operations, both with basic arithmetic (addition,
subtraction, multiplication, etc.) and with more sophisticated operations
(trigonometric functions, exponential and logarithmic functions, etc.).
Pandas inherits much of this functionality from NumPy, and the ufuncs
that we introduced in "Computation on NumPy Arrays: Universal
Functions" are key to this. Pandas includes a couple of useful
twists, however: for unary operations like negation and trigonometric
functions, these ufuncs will preserve index and column labels in the
output, and for binary operations such as addition and multiplication,
Pandas will automatically align indices when passing the objects to the
ufunc. This means that keeping the context of data and combining data
from different sources—both potentially error-prone tasks with raw
NumPy arrays—become essentially foolproof with Pandas. We will
additionally see that there are well-defined operations between
one-dimensional Series structures and two-dimensional DataFrame
structures.
Ufuncs: Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will
work on Pandas Series and DataFrame objects. Let’s start by defining a
simple Series and DataFrame on which to demonstrate this:
EXAMPLE 1

import pandas as pd
import numpy as np
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
# 0    6
# 1    3
# 2    7
# 3    4
# dtype: int64

df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df
#    A  B  C  D
# 0  6  9  2  6
# 1  7  4  3  7
# 2  7  2  5  4
If we apply a NumPy ufunc on either of these objects, the result will be
another Pandas object with the indices preserved:
EXAMPLE 2

np.exp(ser)
# 0     403.428793
# 1      20.085537
# 2    1096.633158
# 3      54.598150
# dtype: float64

# Or, for a slightly more complex calculation:
np.sin(df * np.pi / 4)

Any of the ufuncs discussed in "Computation on NumPy Arrays: Universal
Functions" can be used in a similar manner.
UFuncs: Index Alignment
For binary operations on two Series or DataFrame objects, Pandas will
align indices in the process of performing the operation.
Index alignment in Series
As an example, suppose we are combining two different data sources,
and find only the top three US states by area and the top three US states
by population:
EXAMPLE 3

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')
population / area
# Alaska              NaN
# California    90.413926
# New York            NaN
# Texas         38.018740
# dtype: float64
The resulting array contains the union of indices of the two input arrays,
which we could determine using standard Python set arithmetic on these
indices:
EXAMPLE 4

area.index | population.index
# Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
Any item for which one or the other does not have an entry is marked
with NaN, or "Not a Number," which is how Pandas marks missing data.
This index matching is implemented this way for any of Python's built-in
arithmetic expressions; any missing values are filled in with NaN by
default:
EXAMPLE 5

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
# 0    NaN
# 1    5.0
# 2    9.0
# 3    NaN
# dtype: float64
If using NaN values is not the desired behavior, we can modify the fill
value using appropriate object methods in place of the operators. For
example, calling A.add(B) is equivalent to calling A + B, but allows
optional explicit specification of the fill value for any elements in A or B
that might be missing:
EXAMPLE 6

A.add(B, fill_value=0)
# 0     2.0
# 1     5.0
# 2     9.0
# 3     5.0
# dtype: float64
Index alignment in DataFrame
A similar type of alignment takes place for both columns and indices
when you are performing operations on DataFrames:
EXAMPLE 7

A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
A + B
Notice that indices are aligned correctly irrespective of their order in the
two objects, and indices in the result are sorted. As was the case with
Series, we can use the associated object’s arithmetic method and pass
any desired fill_value to be used in place of missing entries. Here we’ll fill
with the mean of all values in A (which we compute by first stacking the
rows of A):
EXAMPLE 8

fill = A.stack().mean()
A.add(B, fill_value=fill)
Table 2-1 lists Python operators and their equivalent Pandas object
methods.

Table 2-1. Mapping between Python operators and Pandas methods

Python operator    Pandas method(s)
+                  add()
-                  sub(), subtract()
*                  mul(), multiply()
/                  truediv(), div(), divide()
//                 floordiv()
%                  mod()
**                 pow()
Ufuncs: Operations Between DataFrame and Series
When you are performing operations between a DataFrame and a Series,
the index and column alignment is similarly maintained. Operations
between a DataFrame and a Series are similar to operations between a
two-dimensional and one-dimensional NumPy array. Consider one
common operation, where we find the difference of a two-dimensional
array and one of its rows:
EXAMPLE 9

A = rng.randint(10, size=(3, 4))
A
# array([[1, 7, 5, 1],
#        [4, 0, 9, 5],
#        [8, 0, 9, 2]])

A - A[0]
# array([[ 0,  0,  0,  0],
#        [ 3, -7,  4,  4],
#        [ 7, -7,  4,  1]])
According to NumPy's broadcasting rules, subtraction between a
two-dimensional array and one of its rows is applied row-wise. In Pandas,
the convention similarly operates row-wise by default:
EXAMPLE 10

df = pd.DataFrame(A, columns=list('QRST'))
print(df)
df - df.iloc[0]

If you would instead like to operate column-wise, you can use the object
methods mentioned earlier, while specifying the axis keyword:

df.subtract(df['R'], axis=0)

Note that these DataFrame/Series operations, like the operations
discussed before, will automatically align indices between the two
elements:
EXAMPLE 11

halfrow = df.iloc[0, ::2]
halfrow
# Q    1
# S    5
# Name: 0, dtype: int64
EXAMPLE 12

df - halfrow
This preservation and alignment of indices and columns means that
operations on data in Pandas will always maintain the data context,
which prevents the types of silly errors that might come up when you are
working with heterogeneous and/or misaligned data in raw NumPy
arrays.
2.4 Handling Missing Data
The difference between data found in many tutorials and data in the real
world is that real-world data is rarely clean and homogeneous. In
particular, many interesting datasets will have some amount of data
missing. To make matters even more complicated, different data sources
may indicate missing data in different ways.
In this section, we will discuss some general considerations for missing
data, discuss how Pandas chooses to represent it, and demonstrate some
built-in Pandas tools for handling missing data in Python. Here and
throughout the book, we'll refer to missing data in general as null, NaN,
or NA values.
2.4.1 Pythonic missing data

The first sentinel value used by Pandas is None, a Python singleton object
that is often used for missing data in Python code. Because it is a Python
object, None cannot be used in arbitrary NumPy/Pandas arrays, but
only in arrays with data type 'object' (i.e., arrays of Python objects):
EXAMPLE 1

import numpy as np
import pandas as pd
vals1 = np.array([1, None, 3, 4])
vals1
# array([1, None, 3, 4], dtype=object)
This dtype=object means that the best common type representation
NumPy could infer for the contents of the array is that they are Python
objects. While this kind of object array is useful for some purposes, any
operations on the data will be done at the Python level, with much more
overhead than the typically fast operations seen for arrays with native
types:
EXAMPLE 2

vals2 = np.array([1, np.nan, 3, 4])
vals2
# array([  1.,  nan,   3.,   4.])
Notice that NumPy chose a native floating-point type for this array: this
means that unlike the object array from before, this array supports fast
operations pushed into compiled code. You should be aware that NaN is a
bit like a data virus–it infects any other object it touches. Regardless of
the operation, the result of arithmetic with NaN will be another NaN:
EXAMPLE 3

1 + np.nan    # nan
0 * np.nan    # nan
Note that this means that aggregates over the values are well defined
(i.e., they don't result in an error) but not always useful:
EXAMPLE 4

vals2.sum(), vals2.min(), vals2.max()
# (nan, nan, nan)
NumPy does provide some special aggregations that will ignore these
missing values:
EXAMPLE 5

np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
# (8.0, 1.0, 4.0)
Keep in mind that NaN is specifically a floating-point value; there is no
equivalent NaN value for integers, strings, or other types.
NaN and None in Pandas
NaN and None both have their place, and Pandas is built to handle the
two of them nearly interchangeably, converting between them where
appropriate:
EXAMPLE 6

pd.Series([1, np.nan, 2, None])
# 0    1.0
# 1    NaN
# 2    2.0
# 3    NaN
# dtype: float64
For types that don't have an available sentinel value, Pandas
automatically type-casts when NA values are present. For example, if we
set a value in an integer array to np.nan, it will automatically be upcast
to a floating-point type to accommodate the NA:
EXAMPLE 7

x = pd.Series(range(2), dtype=int)
x
# 0    0
# 1    1
# dtype: int64

x[0] = None
x
# 0    NaN
# 1    1.0
# dtype: float64
Notice that in addition to casting the integer array to floating point,
Pandas automatically converts the None to a NaN value. (Be aware that
there is a proposal to add a native integer NA to Pandas in the future; as
of this writing, it has not been included).
While this type of magic may feel a bit hackish compared to the more
unified approach to NA values in domain-specific languages like R, the
Pandas sentinel/casting approach works quite well in practice and in my
experience only rarely causes issues.
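Since this was written, Pandas (version 0.24 and later) has in fact added
an optional nullable integer dtype; a minimal sketch, assuming a
reasonably recent Pandas version:

# The capitalized 'Int64' dtype supports a native missing-value marker, pd.NA
pd.Series([1, None, 3], dtype='Int64')
# 0       1
# 1    <NA>
# 2       3
# dtype: Int64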
The following table lists the upcasting conventions in Pandas when NA
values are introduced:

Typeclass   Conversion when storing NAs   NA sentinel value
floating    No change                     np.nan
object      No change                     None or np.nan
integer     Cast to float64               np.nan
boolean     Cast to object                None or np.nan

Keep in mind that in Pandas, string data is always stored with an object
dtype.
2.4.2 Operating on Null Values
As we have seen, Pandas treats None and NaN as essentially
interchangeable for indicating missing or null values. To facilitate this
convention, there are several useful methods for detecting, removing, and
replacing null values in Pandas data structures. They are:
∙ isnull(): Generate a Boolean mask indicating missing values
∙ notnull(): Opposite of isnull()
∙ dropna(): Return a filtered version of the data
∙ fillna(): Return a copy of the data with missing values filled or imputed

We will conclude this section with a brief exploration and demonstration
of these routines.
2.4.3 Detecting null values
Pandas data structures have two useful methods for detecting null data:
isnull() and notnull(). Either one will return a Boolean mask over the
data. For example:
EXAMPLE 8

data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
# 0    False
# 1     True
# 2    False
# 3     True
# dtype: bool

data[data.notnull()]
# 0        1
# 2    hello
# dtype: object
As mentioned in "Data Indexing and Selection," Boolean masks can be
used directly as a Series or DataFrame index. The isnull() and notnull()
methods produce similar Boolean results for DataFrames.
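As a brief illustration (a minimal sketch using the imports above):

df_null = pd.DataFrame({'x': [1, np.nan], 'y': ['a', None]})
df_null.isnull()
#        x      y
# 0  False  False
# 1   True   True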
2.4.4 Dropping null values

In addition to the masking used before, there are the convenience
methods dropna() (which removes NA values) and fillna() (which fills in
NA values). For a Series, the result is straightforward: data.dropna()
simply returns the non-null entries. For a DataFrame, there are more
options. Consider the following DataFrame:
EXAMPLE 9

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

df.dropna()
#      0    1  2
# 1  2.0  3.0  5

df.dropna(axis='columns')
#    2
# 0  2
# 1  5
# 2  6
We cannot drop single values from a DataFrame; we can only drop full rows
or full columns. Depending on the application, you might want one or the
other, so dropna() gives a number of options for a DataFrame.
By default, dropna() will drop all rows in which any null value is
present, as shown above. Alternatively, you can drop NA values along a
different axis; axis=1 (or axis='columns') drops all columns containing
a null value.
Filling null values
Sometimes rather than dropping NA values, you'd rather replace them
with a valid value. This value might be a single number like zero, or it
might be some sort of imputation or interpolation from the good values.
You could do this in-place using the isnull() method as a mask, but
because it is such a common operation Pandas provides the fillna()
method, which returns a copy of the array with the null values replaced.
Consider the following Series:
EXAMPLE 10

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
# a    1.0
# b    NaN
# c    2.0
# d    NaN
# e    3.0
# dtype: float64
We can fill NA entries with a single value, such as zero:
EXAMPLE 11

data.fillna(0)
# a    1.0
# b    0.0
# c    2.0
# d    0.0
# e    3.0
# dtype: float64
We can specify a forward-fill to propagate the previous value forward:
EXAMPLE 12

data.fillna(method='ffill')
# a    1.0
# b    1.0
# c    2.0
# d    2.0
# e    3.0
# dtype: float64

Or we can specify a back-fill to propagate the next value backward:

data.fillna(method='bfill')
# a    1.0
# b    2.0
# c    2.0
# d    3.0
# e    3.0
# dtype: float64

For DataFrames, the options are similar, but we can also specify an axis
along which the fills take place:

df.fillna(method='ffill', axis=1)

Notice that if a previous value is not available during a forward fill,
the NA value remains.
2.5 Hierarchical Indexing
Up to this point we've been focused primarily on one-dimensional and
two-dimensional data, stored in Pandas Series and DataFrame objects,
respectively. Often it is useful to go beyond this and store
higher-dimensional data, that is, data indexed by more than one or two
keys. While Pandas does provide Panel and Panel4D objects that natively
handle three-dimensional and four-dimensional data (see "Aside: Panel
Data"), a far more common pattern in practice is to make use of
hierarchical indexing (also known as multi-indexing) to incorporate
multiple index levels within a single index. In this way,
higher-dimensional data can be compactly represented within the
familiar one-dimensional Series and two-dimensional DataFrame objects.
In this section, we'll explore the direct creation of MultiIndex objects,
considerations when indexing, slicing, and computing statistics across
multiply indexed data, and useful routines for converting between simple
and hierarchically indexed representations of your data.
Begin with the standard imports:
import pandas as pd
import numpy as np
A Multiply Indexed Series
Let's start by considering how we might represent two-dimensional data
within a one-dimensional Series. For concreteness, we will consider a
series of data where each point has a character and numerical key.
The bad way
Suppose you would like to track data about states from two different
years. Using the Pandas tools we've already covered, you might be
tempted to simply use Python tuples as keys:
EXAMPLE 1

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
# (California, 2000)    33871648
# (California, 2010)    37253956
# (New York, 2000)      18976457
# (New York, 2010)      19378102
# (Texas, 2000)         20851820
# (Texas, 2010)         25145561
# dtype: int64
Downloaded by Mohideen Abdul (mohideengamer@gmail.com)
lOMoARcPSD|50930822
With this indexing scheme, you can straightforwardly index or slice the
series based on this multiple index:
EXAMPLE 2

pop[('California', 2010):('Texas', 2000)]
# (California, 2010)    37253956
# (New York, 2000)      18976457
# (New York, 2010)      19378102
# (Texas, 2000)         20851820
# dtype: int64
But the convenience ends there. For example, if you need to select all
values from 2010, you'll need to do some messy (and potentially slow)
munging to make it happen:
EXAMPLE 3

pop[[i for i in pop.index if i[1] == 2010]]
# (California, 2010)    37253956
# (New York, 2010)      19378102
# (Texas, 2010)         25145561
# dtype: int64
This produces the desired result, but is not as clean (or as efficient for
large datasets) as the slicing syntax we've grown to love in Pandas.
The Better Way: Pandas MultiIndex
Fortunately, Pandas provides a better way. Our tuple-based indexing is
essentially a rudimentary multi-index, and the Pandas MultiIndex type
gives us the type of operations we wish to have. We can create a
multi-index from the tuples as follows:
EXAMPLE 4

index = pd.MultiIndex.from_tuples(index)
index
Notice that the MultiIndex contains multiple levels of indexing (in this
case, the state names and the years), as well as multiple labels for each
data point which encode these levels.
If we reindex our series with this MultiIndex, we see the hierarchical
representation of the data:
EXAMPLE 5

pop = pop.reindex(index)
pop
# California  2000    33871648
#             2010    37253956
# New York    2000    18976457
#             2010    19378102
# Texas       2000    20851820
#             2010    25145561
# dtype: int64
Here the first two columns of the Series representation show the multiple
index values, while the third column shows the data. Notice that some
entries are missing in the first column: in this multi-index representation,
any blank entry indicates the same value as the line above it.
Now to access all data for which the second index is 2010, we can simply
use the Pandas slicing notation:
EXAMPLE 6

pop[:, 2010]
# California    37253956
# New York      19378102
# Texas         25145561
# dtype: int64
Downloaded by Mohideen Abdul (mohideengamer@gmail.com)
lOMoARcPSD|50930822
The result is a singly indexed array with just the keys we're interested in.
This syntax is much more convenient (and the operation is much more
efficient!) than the home-spun tuple-based multi-indexing solution that
we started with. We'll now further discuss this sort of indexing operation
on hierarchically indexed data.
MultiIndex as extra dimension
You might notice something else here: we could easily have stored the
same data using a simple DataFrame with index and column labels. In
fact, Pandas is built with this equivalence in mind. The unstack() method
will quickly convert a multiply indexed Series into a conventionally
indexed DataFrame:
EXAMPLE 7

pop_df = pop.unstack()
pop_df
#                 2000      2010
# California  33871648  37253956
# New York    18976457  19378102
# Texas       20851820  25145561

Naturally, the stack() method provides the opposite operation:

pop_df.stack()
Seeing this, you might wonder why we would bother with hierarchical
indexing at all. The reason is simple: just as we were able to use
multi-indexing to represent two-dimensional data within a
one-dimensional Series, we can also use it to represent data of three or
more dimensions in a Series or DataFrame. Each extra level in a
multi-index represents an extra dimension of data; taking advantage of
this property gives us much more flexibility in the types of data we can
represent.
EXAMPLE 8

pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df
2.6 Combining Datasets

One essential feature offered by Pandas is its high-performance,
in-memory join and merge operations. If you have ever worked with
databases, you should be familiar with this type of data interaction. The
main interface for this is the pd.merge() function; below are a few
examples of how it works in practice.
2.6.1 Relational Algebra
The behavior implemented in pd.merge() is a subset of what is known as
relational algebra, which is a formal set of rules for manipulating
relational data, and forms the conceptual foundation of operations
available in most databases. The strength of the relational algebra
approach is that it proposes several primitive operations, which become
the building blocks of more complicated operations on any dataset. With
this lexicon of fundamental operations implemented efficiently in a
database or other program, a wide range of fairly complicated composite
operations can be performed.
2.6.2 Categories of Joins

The pd.merge() function implements a number of types of joins: the
one-to-one, many-to-one, and many-to-many joins. All three types of joins
are accessed via an identical call to the pd.merge() interface; the type of
join performed depends on the form of the input data. Here we will show
simple examples of the three types of merges, and discuss detailed
options further below.
2.6.2.1 One-to-one joins

Perhaps the simplest type of merge expression is the one-to-one join,
which is in many ways very similar to the column-wise concatenation
seen in "Combining Datasets: Concat & Append." As a concrete example,
consider the following two DataFrames, which contain information on
several employees in a company:
EXAMPLE 1

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering',
                              'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(df1)
print(df2)
To combine this information into a single DataFrame, we can use the
pd.merge() function:
EXAMPLE 2

df3 = pd.merge(df1, df2)
df3
#   employee        group  hire_date
# 0      Bob   Accounting       2008
# 1     Jake  Engineering       2012
# 2     Lisa  Engineering       2004
# 3      Sue           HR       2014
The pd.merge() function recognizes that each DataFrame has an
"employee" column, and automatically joins using this column as a key.
The result of the merge is a new DataFrame that combines the
information from the two inputs. Notice that the order of entries in each
column is not necessarily maintained: in this case, the order of the
"employee" column differs between df1 and df2, and the pd.merge()
function correctly accounts for this. Additionally, keep in mind that the
merge in general discards the index, except in the special case of merges
by index, as the following sketch shows.
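A minimal sketch of that special case, reusing df1 and df2 from above;
the left_index and right_index flags tell pd.merge() to join on the
index rather than on a column:

df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
pd.merge(df1a, df2a, left_index=True, right_index=True)
#                 group  hire_date
# employee
# Bob        Accounting       2008
# Jake      Engineering       2012
# Lisa      Engineering       2004
# Sue                HR       2014

For convenience, DataFrames also implement the join() method, which
performs exactly this index-based merge: df1a.join(df2a).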
2.6.2.2 Many-to-one joins
Many-to-one joins are joins in which one of the two key columns
contains duplicate entries. For the many-to-one case, the resulting
DataFrame will preserve those duplicate entries as appropriate. Consider
the following example of a many-to-one join:
EXAMPLE 3

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
df4
pd.merge(df3, df4)
2.6.2.3 Many-to-many joins
Downloaded by Mohideen Abdul (mohideengamer@gmail.com)
lOMoARcPSD|50930822
Many-to-many joins are a bit confusing conceptually, but are
nevertheless well defined. If the key column in both the left and right
array contains duplicates, then the result is a many-to-many merge. This
will be perhaps most clear with a concrete example. Consider the
following, where we have a DataFrame showing one or more skills
associated with a particular group. By performing a many-to-many join,
we can recover the skills associated with any individual person:
EXAMPLE 4

df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering',
                              'HR', 'HR'],
                    'skills': ['math', 'spreadsheets',
                               'coding', 'linux',
                               'spreadsheets', 'organization']})
df5
pd.merge(df1, df5)
These three types of joins can be used with other Pandas tools to
implement a wide array of functionality. But in practice, datasets are
rarely as clean as the one we're working with here. In the following section
we'll consider some of the options provided by pd.merge() that enable you
to tune how the join operations work.
Specification of the Merge Key
We've already seen the default behavior of pd.merge(): it looks for one or
more matching column names between the two inputs, and uses this as
the key. However, often the column names will not match so nicely, and
pd.merge() provides a variety of options for handling this.
2.6.3 The on keyword
Most simply, you can explicitly specify the name of the key column using
the on keyword, which takes a column name or a list of column names:
EXAMPLE 5

pd.merge(df1, df2, on='employee')
This option works only if both the left and right DataFrames have the
specified column name.
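When the key column names do not match between the two inputs,
pd.merge() also accepts the left_on and right_on keywords. A minimal
sketch (the salaries table here is hypothetical, invented for
illustration):

salaries = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'salary': [70000, 80000, 120000, 90000]})
# Join df1's 'employee' column against the hypothetical table's 'name' column
pd.merge(df1, salaries, left_on='employee', right_on='name')

The result retains both key columns; the redundant one can be dropped
with .drop('name', axis=1) if desired.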
2.6.4 Specifying Set Arithmetic for Joins

All of the preceding examples have glossed over one important
consideration in performing a join: the type of set arithmetic used in the
join. This comes up when a value appears in one key column but not the
other. Consider this example:
EXAMPLE 6

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['pepsi', 'coke']},
                   columns=['name', 'drink'])
print(df6)
print(df7)
print(pd.merge(df6, df7))
#    name   food  drink
# 0  Mary  bread  pepsi

# The default is equivalent to specifying an inner join explicitly:
pd.merge(df6, df7, how='inner')
Here we have merged two datasets that have only a single "name" entry in
common: Mary. By default, the result contains the intersection of the two
sets of inputs; this is what is known as an inner join.
Other options for the how keyword are 'outer', 'left', and 'right'. An outer
join returns a join over the union of the input columns, and fills in all
missing values with NAs:
EXAMPLE 7

pd.merge(df6, df7, how='outer')
The left join and right join return joins over the left entries and right
entries, respectively. For example:
EXAMPLE 8

pd.merge(df6, df7, how='left')
print(pd.merge(df6, df7, how='right'))
2.7 Aggregation and Grouping

An essential piece of analysis of large data is efficient summarization:
computing aggregations like sum(), mean(), median(), min(), and max(), in
which a single number gives insight into the nature of a potentially large
dataset. In this section we will explore aggregations in Pandas, from
simple operations akin to what we've seen on NumPy arrays, to more
sophisticated operations based on the concept of a groupby.
Planets Data

Here we will use the Planets dataset, available via the Seaborn package.
It gives information on planets that astronomers have discovered around
other stars (known as extrasolar planets, or exoplanets for short). It can
be downloaded with a simple Seaborn command:
EXAMPLE 1

import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
# (1035, 6)
EXAMPLE 2

planets.head()
2.7.1 Simple Aggregation in Pandas

Earlier we explored some of the data aggregations available for NumPy
arrays. As with a one-dimensional NumPy array, for a Pandas Series the
aggregates return a single value:
EXAMPLE 3

rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
# 0    0.374540
# 1    0.950714
# 2    0.731994
# 3    0.598658
# 4    0.156019
# dtype: float64
EXAMPLE 4

ser.sum()    # 2.8119254917081569
ser.mean()   # 0.56238509834163142
For a DataFrame, by default the aggregates return results within each
column:

EXAMPLE 5

df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df
df.mean()

By specifying the axis argument, you can instead aggregate within each
row:

df.mean(axis='columns')

Pandas Series and DataFrames include all of the common aggregates
mentioned in "Aggregations: Min, Max, and Everything in Between." In
addition, there is a convenience method, describe(), that computes several
common aggregates for each column and returns the result. Let's use this
on the Planets data, for now dropping rows with missing values:
EXAMPLE 6

planets.dropna().describe()
Listing of Pandas aggregation methods

S.No.   Aggregation         Description
1       count()             Total number of items
2       first(), last()     First and last item
3       mean(), median()    Mean and median
4       min(), max()        Minimum and maximum
5       std(), var()        Standard deviation and variance
6       mad()               Mean absolute deviation
7       prod()              Product of all items
8       sum()               Sum of all items

These are all methods of DataFrame and Series objects.
To go deeper into the data, however, simple aggregates are often not
enough. The next level of data summarization is the group by operation,
which allows you to quickly and efficiently compute aggregates on subsets
of data.
2.7.2 GroupBy

Simple aggregations can give you a flavor of your dataset, but often we
would prefer to aggregate conditionally on some label or index: this is
implemented in the so-called groupby operation. The name "group by"
comes from a command in the SQL database language, but it is perhaps
more illuminative to think of it in the terms first coined by Hadley
Wickham of Rstats fame: split, apply, combine.
Split, apply, combine
A canonical example of this split-apply-combine operation, where the
"apply" step is a summation aggregation, is illustrated in Figure 2-1.
The figure makes clear what the GroupBy accomplishes:
• The split step involves breaking up and grouping a DataFrame
depending on the value of the specified key.
• The apply step involves computing some function, usually an aggregate,
transformation, or filtering, within the individual groups.
• The combine step merges the results of these operations into an output
array.
While we could certainly do this manually using some combination of the
masking, aggregation, and merging commands covered earlier, it's
important to realize that the intermediate splits do not need to be explicitly
instantiated. Rather, the GroupBy can (often) do this in a single pass over
the data, updating the sum, mean, count, min, or other aggregate for
each group along the way. The power of the GroupBy is that it abstracts
away these steps: the user need not think about how the computation is
done under the hood, but rather can think about the operation as a whole.
As a concrete example, let's take a look at using Pandas for the computation
shown in Figure 2-1. We'll start by creating the input DataFrame:
EXAMPLE 7

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df
#   key  data
# 0   A     0
# 1   B     1
# 2   C     2
# 3   A     3
# 4   B     4
# 5   C     5
We can compute the most basic split-apply-combine operation with the
groupby() method of DataFrames, passing the name of the desired key
column:
EXAMPLE 8

df.groupby('key').sum()
#      data
# key
# A       3
# B       5
# C       7
The sum() method is just one possibility here; you can apply virtually any
common Pandas or NumPy aggregation function, as well as virtually any
valid DataFrame operation, as we will see in the following discussion.
EXAMPLE 9

import numpy as np
import pandas as pd
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'],
                  index=['AA', 'BB', 'cc', 'dd', 'ee', 'ff'])
df
Aggregation. We’re now familiar with GroupBy aggregations with sum(),
median(), and the like, but the aggregate() method allows for even more
flexibility. It can take a string, a function, or a list thereof, and compute
all the aggregates at once. Here is a quick example combining all these:
EXAMPLE 10

df.groupby('key').aggregate(['min', np.median, max])
Another useful pattern is to pass a dictionary mapping column names to
operations to be applied on that column.
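For example, a minimal sketch using the df defined in Example 9:

df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})
#      data1  data2
# key
# A        0      5
# B        1      7
# C        2      9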
2.8 Pivot Tables

A pivot table is a table of grouped values that aggregates all the individual items of a much bigger table.
A pivot table provides a summary of discrete categories, such as sums, averages, and various other
statistics of interest.
A pivot table serves as a very useful tool for you to explore and analyze your data, and makes it easy for
you to perform comparisons and view trends.
Loading the data
Let’s load the data into a Pandas DataFrame:
import pandas as pd

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
df = pd.read_csv(data_url)
df
Finding the mean values for each country
To start off, let’s find the mean of the various statistics for each country using the pivot_table() function:
pd.pivot_table(df,
               index='country',
               aggfunc='mean')
The index parameter specifies the index to use for the result of the function. The aggfunc parameter specifies
the function to apply on the numeric columns of the dataframe. The following figure shows the result and how
the various parameters dictate the outcome:
The default aggfunc value is 'mean' if you do not specify its value. You can also supply a
NumPy function such as np.mean, or any function that returns an aggregated value:

import numpy as np
pd.pivot_table(df,
               index='country',
               aggfunc=np.mean)
Finding the mean GDP and the mean, max, and min life expectancies

From the previous result you see that it does not really make sense to calculate the mean of the year column.
Also, you might want to know the minimum, maximum, and average life expectancies of each country. To do
so, you can specify a dictionary for the aggfunc parameter and indicate which function to apply to which column:

pd.pivot_table(df,
               index='country',
               aggfunc={'gdpPercap': np.mean,
                        'lifeExp': [np.mean, np.max, np.min]})
The above code snippet produces the following output:
Observe that since we did not specify the pop and year columns in the dictionary, they will no longer appear in
the result.
Finding the mean values for each country for each year

The index parameter also accepts a list of columns, which results in a multi-index dataframe. For
example, to know the mean GDP, life expectancy, and population of each country for each year from
1952 to 2007:

pd.pivot_table(df,
               index=['country', 'year'],
               aggfunc='mean')

The result is a multi-index dataframe with country and year as the index, and the rest of the numeric fields as
columns:
Finding the mean values for each continent
If you want to find the mean values for each continent, simply set the index parameter to continent:
pd.pivot_table(df,
               index='continent')
You will now see the following result:
Finding the mean population of each country

If you want to know the mean population for each country from 1952 to 2007, set the index to country and
values to pop:

pd.pivot_table(df,
               index='country',
               values='pop',
               aggfunc='mean')
The following shows the use of the values parameter:
Finding the mean life expectancies for each continent
To find the mean life expectancies for each continent, set the index and values parameters as follows:
pd.pivot_table(df,
               index='continent',
               values='lifeExp',
               aggfunc='mean')
You will see the result as follows:
What if you want to flip the columns and rows of the result? Easy: change the index parameter to columns:

pd.pivot_table(df,
               columns='continent',
               values='lifeExp',
               aggfunc='mean')
The following figure shows the result and the use of the various parameters:
Finding the life expectancies of each country in the various continents
Next, we want to know the life expectancies of each country in each of the five continents. We could do this:
pd.pivot_table(df,
               index='country',
               columns='continent',
               values='lifeExp',
               aggfunc='mean')

Now the life expectancies of each country will be displayed in the respective continent that the country belongs
to:
Notice the NaNs scattered in the result. If you do not want to see the NaNs, you can set the fill_value parameter
to fill them with some values, such as 0s:
pd.pivot_table(df,
               index='country',
               columns='continent',
               values='lifeExp',
               aggfunc='mean',
               fill_value=0)
You should now see 0s instead of NaNs:
Finding the mean life expectancies of each continent by year
Finally, let’s find the mean life expectancies of each continent and group them by year:
pd.pivot_table(df,
               index='year',
               columns='continent',
               values='lifeExp',
               aggfunc='mean')
The figure below shows the result and the use of the various parameters:
2.9 Vectorized String Operations
Vectorized string operations are an essential part of data analysis, especially when dealing with datasets that
have text data.
Traditionally, when dealing with string data in a dataset, programmers have to loop over the data and perform
operations on each element one at a time. This can be time-consuming, especially when dealing with large
datasets. Vectorized string operations solve this problem by allowing programmers to perform operations on
entire arrays of string data at once.
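As a quick illustration (a minimal sketch; the sample names are made up),
compare a plain Python loop with the vectorized equivalent, which also
handles missing data gracefully:

import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY'])

# A plain loop fails on the missing entry:
# [s.capitalize() for s in names]   # AttributeError on None

# The vectorized version operates on the whole array and skips missing values:
names.str.capitalize()
# 0    Peter
# 1     Paul
# 2      NaN
# 3     Mary
# dtype: object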
Advantages of vectorized string operations
1. Speed: As mentioned earlier, vectorized string operations are faster than traditional string operations as they
allow operations to be performed on entire arrays of string data at once.
2. Code simplification: Using vectorized string operations can lead to simpler and more concise code, as
programmers no longer need to loop over the data and perform operations on each element one at a time.
3. Ease of use: Vectorized string operations are easy to use, and programmers don’t need to have advanced
knowledge of string manipulation to use them.
Operations that can be performed using vectorized string operations
1. Concatenation: Concatenation is the process of joining two or more strings together.
2. Splitting: Splitting is the process of dividing a string into multiple parts based on a specific delimiter.
3. Substring extraction: Substring extraction is the process of extracting a part of a string.
4. Case conversion: Case conversion is the process of converting the case of a string to uppercase or lowercase.
5. Search and replace: Search and replace is the process of finding a specific substring in a string and replacing
it with a different substring.
Load the Titanic dataset

The examples below use the passenger name column of the Titanic dataset.
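A minimal loading sketch; the file path is an assumption, so point it at
your copy of the Kaggle-style Titanic CSV (it must contain a Name column
of the form "Braund, Mr. Owen Harris"):

import pandas as pd

titanic = pd.read_csv('titanic.csv')   # path is an assumption
titanic['Name'].head(3)
# 0                              Braund, Mr. Owen Harris
# 1    Cumings, Mrs. John Bradley (Florence Briggs Th...
# 2                               Heikkinen, Miss. Laina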
Example 1: Splitting

To split the name into separate first-name and last-name columns, we can use the vectorized str.split()
method:
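A sketch of one way to do this, assuming the "Last, First" name format in
the dataset loaded above (the LastName/FirstName column names are our
own):

# 'Braund, Mr. Owen Harris' -> ['Braund', ' Mr. Owen Harris']
parts = titanic['Name'].str.split(',', n=1, expand=True)
titanic['LastName'] = parts[0].str.strip()
titanic['FirstName'] = parts[1].str.strip()
titanic[['LastName', 'FirstName']].head(2)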
Example 2: Concatenation
To concatenate the first name and last name columns to create a full name column, we can use the vectorized
str.cat() method:
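A sketch using the FirstName and LastName columns created in Example 1:

titanic['FullName'] = titanic['FirstName'].str.cat(titanic['LastName'], sep=' ')
titanic['FullName'].head(1)
# 0    Mr. Owen Harris Braund
# Name: FullName, dtype: object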
Example 3: Substring extraction
To extract the title of each passenger from the name column, we can use the vectorized str.extract() method:
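A sketch using a regular expression to pull out the honorific that
follows the comma (e.g., "Mr", "Mrs", "Miss"):

titanic['Title'] = titanic['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
titanic['Title'].head(3)
# 0      Mr
# 1     Mrs
# 2    Miss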
Example 4: Replacing substrings
The str.replace() method can be used to replace specific substrings with other substrings within a string column.
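A minimal sketch replacing a fixed substring in each name (the
replacement chosen here is arbitrary):

titanic['Name'].str.replace('Mr.', 'Mister', regex=False).head(1)
# 0    Braund, Mister Owen Harris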
Example 5: Filtering
The str.contains() method can be used to filter a dataframe based on whether a string column contains any of a
list of substrings.
Filter out all the passengers whose name starts with “B” and ends with “e”
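A sketch of both ideas, assuming the titanic DataFrame and the LastName
column from Example 1:

# Rows whose name contains any of a list of substrings (joined with '|'):
titanic[titanic['Name'].str.contains('Miss|Mrs')].head()

# Passengers whose last name starts with 'B' and ends with 'e':
mask = (titanic['LastName'].str.startswith('B') &
        titanic['LastName'].str.endswith('e'))
titanic.loc[mask, 'LastName'].head()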
Example 6: Slicing

Vectorized string methods in Pandas also allow us to slice the strings in a Series using the familiar syntax of
Python's built-in slicing notation: str[start:stop:step]. As in Python's built-in slicing, the start index is
inclusive, the stop index is exclusive, and the step argument specifies the stride or interval of the slice.
Extract the first 3 characters of each name, extract the last 5 characters of each name, and reverse each
name, as in the sketch below:
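A minimal sketch of all three slices, assuming the titanic DataFrame from above:

titanic['Name'].str[:3].head(2)     # first 3 characters of each name
titanic['Name'].str[-5:].head(2)    # last 5 characters of each name
titanic['Name'].str[::-1].head(2)   # each name reversed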
Example 7: Case Conversion

• str.lower(): convert all text to lowercase
• str.upper(): convert all text to uppercase
• str.capitalize(): capitalize the first letter of the text
• str.title(): title-case each name, capitalizing the first letter of each word

See the sketch after this list.
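A minimal sketch applying each method to the name column, assuming the
titanic DataFrame from above:

titanic['Name'].str.lower().head(1)        # 'braund, mr. owen harris'
titanic['Name'].str.upper().head(1)        # 'BRAUND, MR. OWEN HARRIS'
titanic['Name'].str.capitalize().head(1)   # 'Braund, mr. owen harris'
titanic['Name'].str.title().head(1)        # 'Braund, Mr. Owen Harris'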