0% found this document useful (0 votes)
6 views

Pandas_Data_Analytics

The document provides an overview of pandas Series and DataFrames, detailing their structure, indexing methods, and various operations such as selection, transformation, and handling missing values. It explains the differences between methods and attributes, the use of Boolean masks, and the significance of vectorization for performance. Additionally, it covers advanced topics like MultiIndex, data merging, and the data preparation process.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views

Pandas_Data_Analytics

The document provides an overview of pandas Series and DataFrames, detailing their structure, indexing methods, and various operations such as selection, transformation, and handling missing values. It explains the differences between methods and attributes, the use of Boolean masks, and the significance of vectorization for performance. Additionally, it covers advanced topics like MultiIndex, data merging, and the data preparation process.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 61

Series

Series are:

one-dimensional
labeled arrays
of any data type
Or, said differently...

a sequence of values
with associated labels
dtype('O')

numpy expects homogenous ("same type") data:


...but strings are variable-length
SOLUTION:
Boolean Masks
used to index select items at scale
work with [] and .loc
need to be same length as series

pd.Series(['A', 'B', 'C'])[[True, False, True]]

0 A
2 C
dtype: object
List argument:

pd.Series(data=['this', 'is', 'fun'])

Dict argument:

pd.Series(data={0: 'this', 1: 'is', 2: 'fun'})

Also valid series:

pd.Series(data=0) pd.Series(data='weather')
argument

pd.Series(data=students)

parameter
Indexing With Callables

used for highly customizable indexing


work with [], .loc and .iloc
a single-argument function that returns indexing output

a list of labels no slice or boolean


mask support
list of booleans
a slice, etc
Methods vs Attributes

Method is a function bound to the object.

Attribute is a variable bound to the object.


Selection By Label
Approach Example Comment

slices, callables,
[] idx'ing series['label'] boolean masks

slices, callables,
.loc[] series.loc['label'] boolean masks

no slice or boolean
dot access series.label mask support

no slice support;
.get() series.get('label') provides default;
forgiving
Selection By Position
Approach Example Comment

slices, callables,
[] idx'ing series[0] boolean masks

slices, callables,
.iloc[] series.iloc[0] boolean masks

no slice or boolean
dot access series.0 mask support

no slice support;
.get() series.get(0) provides default;
forgiving
What is a csv?
COMMA-SEPARATED VALUES (.CSV) FILE

a type of text file containing values delimited by comma


Bools As Ints

True 1

False 0

The bool type inherits from (is a subclass of) int

bool -> int -> object


Median
The middlemost element in a sorted list of numbers.

10, 11, 12, 13, 14, 15, 16

10, 11, 12, 13, 14, 15, 16, 17


(13+14) / 2 = 13.5
diff()
the first discrete element-wise difference in a series

ser.diff(periods=1)
Dropping Or Filling NAs
.dropna(): excludes NAs from the series

.fillna(): replaces NAs with something else

Note: both methods return a copy of the series

unless

ser.fillna('new value', inplace=True)


Index by min/max
idxmin(): returns the label of the row with minimum value

idxmax(): returns the label of the row with maximum value

Note: if multiple min/max values,


only the first label is returned
Sequential vs Vectorized Ops
vectorization: running operations on entire arrays

func()
sequential

vectorized
func() func()

func()
Series Accounting
.size: number of elements in the series
series.size # 193

.count(): number of non-null elements


series.coun() # 162

.isna().sum(): number of null elements


series.isna().sum() # 31
Size and Shape
.size: number of elements in the series
series.size # 193
.shape: tuple of the dimensions
for a series: (1D) shape, i.e. length for series
series.shape # (193, )
len(): python built-in function
len(series) # 193
sort_values() & sort_index()
sort_values(): returns a new series, sorted by values

sort_index(): returns a new series, sorted by index labels

Default Params: asending=True,


inplace=False
na_position='last'
kind='quicksort'
Transforms
update(): modifies series values in place using another series
ser.update(other_series)
apply(): applies function (or ufunc) on each series value
most
ser.apply(np.sqrt) flexible

map(): subs series values with others from a function, series,


or dict more input
types
ser.map({'old_value' : 'new_value'})
value_counts()
a sorted series containing unique values and their counts

ser.value_counts( sort=True,
ascending=False,
dropna=True,
normalize=False )
Variance
the average of squared differences from the mean

mean

sum of
DATAFRAMES
FIRST KEY CONCEPT

dataframes have two


dimensions: labeled
indices and columns
DATAFRAMES
SECOND KEY CONCEPT

each column in a
dataframe is a series
DATAFRAMES
THIRD KEY CONCEPT

unlike series,
dataframes could be
heterogenous

dtype object int64 bool


Our Data Prep Process

COLLECT CREATE RENAME REPLACE &


UNITS MAPPER DF CONVERT
isolate the units create a dictionary of rename the column replace all the units
from each nutrition key:value pairs labels of nutrition from the dataframe
column label containing the old datafram values and convert
labels and the new values to floats
Our Data Prep Process

COLLECT CREATE RENAME REPLACE &


UNITS MAPPER DF CONVERT
isolate the units create a dictionary of rename the column replace all the units
from each nutrition key:value pairs labels of nutrition from the dataframe
column label containing the old datafram values and convert
labels and the new values to floats
SINGLE-PURPOSE
unlike .loc or .iloc, .at and .iat are only
used for accessing single values
Why use
.at or .iat? FASTER
because of the lack of overhead, they are
much more performant for their isolated
use-case
dropna() with subset
df.dropna(axis=0, subset=['gender'])
but only look at gender

drop DF.DROPNA()
rows
removes columns or rows with
missing values

SUBSET
restricts or localizes the method
application to specific
orthogonal labels
MORE WAYS TO DATAFRAME

dict of tuples dict of dicts


like dict of lists, but with tuples key:value pairs with column names as
keys and index-labeled key:value pairs
column-wise containing values

column-wise

dict of series list of dicts


a continuation of key concept #2 list of key:value pairs containing colum
labels and values
column-wise
row-wise
RANGE VS INT64INDEX

RangeIndex is a special case of Int64Index


both are immutable, sequences of numbers

RangeIndex is an optimized alternative

pd.RangeIndex(start=0, stop=8789, step=1)


APPLY MNEMONIC applies a function to a dataframe

DF.APPLY()

Is aggregation required?

yes no

DF.AGG() DF.TRANSFORM()
Binary (or bitwise) Operators
OPERATOR WHAT IS EXAMPLE

| or True | False -> True

& and True & False -> False

^ xor True ^ False -> True

complement
~ ~True -> -2
Comparators
COMPARISON OPERATOR PANDAS METHOD

< .lt() SUPPORT


FI LL_VALUE

≤ .le()
> .gt()
≥ .ge()

== .eq()
players.duplicated( WHAT COUNTS AS A DUPLICATE?

subset=['name', 'age'],
DEFAULT CUSTOM

keep='first') records with repeating values could be changed to a smaller

across all columns group of attributes using the


)(DETACILPUD

subset paramter

WHICH IS THE ORIGINAL?

DEFAULT CUSTOM

the first occurrence is could be changed to "first",

marked as the original "last" or "neither" using the

keep parmeter
fillna() axes and methods
FILL DIRECTIONS
AXIS=1
METHOD=FFILL

AXIS=0 AXIS=0
METHOD=FFILL METHOD=BFILL

AXIS=1
METHOD=BFILL
lookup(): another way to fancy index

players.lookup([450], ['age'])

array([30])
pandas

memory FLOATS INTS OBJECT

layout 3 cols
9 cols 7 cols
to pop() or not...
players.pop('age')

3 POINTS TO CONSIDER

pop() works on a single column at a time

pop() returns the removed ('popped') column as a series

pop() modifies the underlying dataframe (operates inplace)


Selection Terminology Recap
OPERATOR WHAT IS

players.loc[0:2] slicing

players.loc[players.age > 37] boolean masking

players.loc[132, 'name'] basic (label-based) indexing

fancy
players.loc[[0, 132], ['name', 'market_value']]
indexing
Two's Complement
VER
Y IM
P ORT
ANT

BIT INTEGER A NUMBER FORMAT


00000000 0
A system for representing signed
00000001 1
integers in computers. Using x bits we
00000010 2 could represent 2^x numbers.

11111111 -1
11111110 -2 For example, 32 bits represent

11111101 -3 4294967296 numbers, 64 bits

18446744073709551616, and so on.


Two's Complement
operator
BIT INTEGER

^
00000000 0
00000001 1 inverts the bits
00000010 2

11111111 -1
11111110
11111110 -2
11111101 -3
VECTORIZATION
o p e ra t io n
c o m p le t e
in 2 c y c le s
s , in s t e a d
of 6

CPU
GPU

Made possibly by SIMD at the processor-level


Results in operations that are multiple times faster!
Supported by NumPy, and by extension, pandas
copy view
A "COPY" OF THE DATA A "WINDOW" INTO THE DATA
DEUNITNOC YPOC SV WEIV
HOW DO WE TELL?
2-POINT RULE:

pandas loves to give us copies, but

if we use loc/iloc or at/iat, we are guaranteed to


get a view
what's the
difference? ...almost identical, but:

df.append()
.append() is a DataFrame
instance method

pd.concat()
.append() only operates along
the index axis
concat()

+ =

glues data sets together


a structure-focused operation
merge()

1 a v 7 1 a 9
3
9
b
c
+ r
a
4
9
= 3
9
b
c
1
2
b 1
c 2
k 3

combines data sets together based on the content they share


much more flexible than .concat()!
pd merge

how='inner' how='outer'

+ = + =

only the common keys are selected all keys are selected
similar to set intersection similar to set union
Join Cardinalities

1-1 1-M M-M


eg: person <-> DNA book - pages book - author

Dual-sided uniques One-sided uniques Dual-sided non-uniques


one of the merge objects both merge objects contain non-
both merge objects contain
contain non-unique values unique values
unique values in the respective
key
in the resulting pd.merge() the in the resulting pd.merge() the
records are repeated M times records are repeated M x M times
pd merge

how='left' how='right'

+ = + =

left keys are selected right keys are selected


Always consider sorting
the index Advantages
:NOITADNEMMOCER

improves retrieval performance, which


becomes significant
- for large dataframes, or
- frequent retrieval

enables slicing syntax

overall a good practice when working with


tabular data representations, including
pandas, Excel, SQL, etc
some of the components that

MULTIINDEX INTERNALS
make up MultiIndex objects,
also known as hierarchical
indices in pands

LEVELS NAMES LEVSHAPE VALUES


a list of lists containing a list containing the names a tuple containing the Use visual charts to
each label value for each of each level length of each level communicate info more
of the levels in the effectively.
MultiIndex

L0 L1
PANEL
deprecated since pandas v0.22
VS
- prefer df.MultiIndex for new projects
MULTIINDEX DF
- many of the same pandas concepts apply
for representing
hierarchical data - older docs still available online for panel
split data into groups
Split

Apply

Combine
apply .sum()
Split

Apply

NA_Sales NA_Sales
Combine
0.75 0.80
combine the output
Split
NA_Sales NA_Sales
0.75 0.80

Apply

Combine

You might also like