Series
Series are:
one-dimensional
labeled arrays
of any data type
Or, said differently...
a sequence of values
with associated labels
dtype('O')
numpy expects homogenous ("same type") data:
...but strings are variable-length
SOLUTION:
Boolean Masks
used to index select items at scale
work with [] and .loc
need to be same length as series
pd.Series(['A', 'B', 'C'])[[True, False, True]]
0 A
2 C
dtype: object
List argument:
pd.Series(data=['this', 'is', 'fun'])
Dict argument:
pd.Series(data={0: 'this', 1: 'is', 2: 'fun'})
Also valid series:
pd.Series(data=0) pd.Series(data='weather')
argument
pd.Series(data=students)
parameter
Indexing With Callables
used for highly customizable indexing
work with [], .loc and .iloc
a single-argument function that returns indexing output
a list of labels no slice or boolean
mask support
list of booleans
a slice, etc
Methods vs Attributes
Method is a function bound to the object.
Attribute is a variable bound to the object.
Selection By Label
Approach Example Comment
slices, callables,
[] idx'ing series['label'] boolean masks
slices, callables,
.loc[] series.loc['label'] boolean masks
no slice or boolean
dot access series.label mask support
no slice support;
.get() series.get('label') provides default;
forgiving
Selection By Position
Approach Example Comment
slices, callables,
[] idx'ing series[0] boolean masks
slices, callables,
.iloc[] series.iloc[0] boolean masks
no slice or boolean
dot access series.0 mask support
no slice support;
.get() series.get(0) provides default;
forgiving
What is a csv?
COMMA-SEPARATED VALUES (.CSV) FILE
a type of text file containing values delimited by comma
Bools As Ints
True 1
False 0
The bool type inherits from (is a subclass of) int
bool -> int -> object
Median
The middlemost element in a sorted list of numbers.
10, 11, 12, 13, 14, 15, 16
10, 11, 12, 13, 14, 15, 16, 17
(13+14) / 2 = 13.5
diff()
the first discrete element-wise difference in a series
ser.diff(periods=1)
Dropping Or Filling NAs
.dropna(): excludes NAs from the series
.fillna(): replaces NAs with something else
Note: both methods return a copy of the series
unless
ser.fillna('new value', inplace=True)
Index by min/max
idxmin(): returns the label of the row with minimum value
idxmax(): returns the label of the row with maximum value
Note: if multiple min/max values,
only the first label is returned
Sequential vs Vectorized Ops
vectorization: running operations on entire arrays
func()
sequential
vectorized
func() func()
func()
Series Accounting
.size: number of elements in the series
series.size # 193
.count(): number of non-null elements
series.coun() # 162
.isna().sum(): number of null elements
series.isna().sum() # 31
Size and Shape
.size: number of elements in the series
series.size # 193
.shape: tuple of the dimensions
for a series: (1D) shape, i.e. length for series
series.shape # (193, )
len(): python built-in function
len(series) # 193
sort_values() & sort_index()
sort_values(): returns a new series, sorted by values
sort_index(): returns a new series, sorted by index labels
Default Params: asending=True,
inplace=False
na_position='last'
kind='quicksort'
Transforms
update(): modifies series values in place using another series
ser.update(other_series)
apply(): applies function (or ufunc) on each series value
most
ser.apply(np.sqrt) flexible
map(): subs series values with others from a function, series,
or dict more input
types
ser.map({'old_value' : 'new_value'})
value_counts()
a sorted series containing unique values and their counts
ser.value_counts( sort=True,
ascending=False,
dropna=True,
normalize=False )
Variance
the average of squared differences from the mean
mean
sum of
DATAFRAMES
FIRST KEY CONCEPT
dataframes have two
dimensions: labeled
indices and columns
DATAFRAMES
SECOND KEY CONCEPT
each column in a
dataframe is a series
DATAFRAMES
THIRD KEY CONCEPT
unlike series,
dataframes could be
heterogenous
dtype object int64 bool
Our Data Prep Process
COLLECT CREATE RENAME REPLACE &
UNITS MAPPER DF CONVERT
isolate the units create a dictionary of rename the column replace all the units
from each nutrition key:value pairs labels of nutrition from the dataframe
column label containing the old datafram values and convert
labels and the new values to floats
Our Data Prep Process
COLLECT CREATE RENAME REPLACE &
UNITS MAPPER DF CONVERT
isolate the units create a dictionary of rename the column replace all the units
from each nutrition key:value pairs labels of nutrition from the dataframe
column label containing the old datafram values and convert
labels and the new values to floats
SINGLE-PURPOSE
unlike .loc or .iloc, .at and .iat are only
used for accessing single values
Why use
.at or .iat? FASTER
because of the lack of overhead, they are
much more performant for their isolated
use-case
dropna() with subset
df.dropna(axis=0, subset=['gender'])
but only look at gender
drop DF.DROPNA()
rows
removes columns or rows with
missing values
SUBSET
restricts or localizes the method
application to specific
orthogonal labels
MORE WAYS TO DATAFRAME
dict of tuples dict of dicts
like dict of lists, but with tuples key:value pairs with column names as
keys and index-labeled key:value pairs
column-wise containing values
column-wise
dict of series list of dicts
a continuation of key concept #2 list of key:value pairs containing colum
labels and values
column-wise
row-wise
RANGE VS INT64INDEX
RangeIndex is a special case of Int64Index
both are immutable, sequences of numbers
RangeIndex is an optimized alternative
pd.RangeIndex(start=0, stop=8789, step=1)
APPLY MNEMONIC applies a function to a dataframe
DF.APPLY()
Is aggregation required?
yes no
DF.AGG() DF.TRANSFORM()
Binary (or bitwise) Operators
OPERATOR WHAT IS EXAMPLE
| or True | False -> True
& and True & False -> False
^ xor True ^ False -> True
complement
~ ~True -> -2
Comparators
COMPARISON OPERATOR PANDAS METHOD
< .lt() SUPPORT
FI LL_VALUE
≤ .le()
> .gt()
≥ .ge()
== .eq()
players.duplicated( WHAT COUNTS AS A DUPLICATE?
subset=['name', 'age'],
DEFAULT CUSTOM
keep='first') records with repeating values could be changed to a smaller
across all columns group of attributes using the
)(DETACILPUD
subset paramter
WHICH IS THE ORIGINAL?
DEFAULT CUSTOM
the first occurrence is could be changed to "first",
marked as the original "last" or "neither" using the
keep parmeter
fillna() axes and methods
FILL DIRECTIONS
AXIS=1
METHOD=FFILL
AXIS=0 AXIS=0
METHOD=FFILL METHOD=BFILL
AXIS=1
METHOD=BFILL
lookup(): another way to fancy index
players.lookup([450], ['age'])
array([30])
pandas
memory FLOATS INTS OBJECT
layout 3 cols
9 cols 7 cols
to pop() or not...
players.pop('age')
3 POINTS TO CONSIDER
pop() works on a single column at a time
pop() returns the removed ('popped') column as a series
pop() modifies the underlying dataframe (operates inplace)
Selection Terminology Recap
OPERATOR WHAT IS
players.loc[0:2] slicing
players.loc[players.age > 37] boolean masking
players.loc[132, 'name'] basic (label-based) indexing
fancy
players.loc[[0, 132], ['name', 'market_value']]
indexing
Two's Complement
VER
Y IM
P ORT
ANT
BIT INTEGER A NUMBER FORMAT
00000000 0
A system for representing signed
00000001 1
integers in computers. Using x bits we
00000010 2 could represent 2^x numbers.
11111111 -1
11111110 -2 For example, 32 bits represent
11111101 -3 4294967296 numbers, 64 bits
18446744073709551616, and so on.
Two's Complement
operator
BIT INTEGER
^
00000000 0
00000001 1 inverts the bits
00000010 2
11111111 -1
11111110
11111110 -2
11111101 -3
VECTORIZATION
o p e ra t io n
c o m p le t e
in 2 c y c le s
s , in s t e a d
of 6
CPU
GPU
Made possibly by SIMD at the processor-level
Results in operations that are multiple times faster!
Supported by NumPy, and by extension, pandas
copy view
A "COPY" OF THE DATA A "WINDOW" INTO THE DATA
DEUNITNOC YPOC SV WEIV
HOW DO WE TELL?
2-POINT RULE:
pandas loves to give us copies, but
if we use loc/iloc or at/iat, we are guaranteed to
get a view
what's the
difference? ...almost identical, but:
df.append()
.append() is a DataFrame
instance method
pd.concat()
.append() only operates along
the index axis
concat()
+ =
glues data sets together
a structure-focused operation
merge()
1 a v 7 1 a 9
3
9
b
c
+ r
a
4
9
= 3
9
b
c
1
2
b 1
c 2
k 3
combines data sets together based on the content they share
much more flexible than .concat()!
pd merge
how='inner' how='outer'
+ = + =
only the common keys are selected all keys are selected
similar to set intersection similar to set union
Join Cardinalities
1-1 1-M M-M
eg: person <-> DNA book - pages book - author
Dual-sided uniques One-sided uniques Dual-sided non-uniques
one of the merge objects both merge objects contain non-
both merge objects contain
contain non-unique values unique values
unique values in the respective
key
in the resulting pd.merge() the in the resulting pd.merge() the
records are repeated M times records are repeated M x M times
pd merge
how='left' how='right'
+ = + =
left keys are selected right keys are selected
Always consider sorting
the index Advantages
:NOITADNEMMOCER
improves retrieval performance, which
becomes significant
- for large dataframes, or
- frequent retrieval
enables slicing syntax
overall a good practice when working with
tabular data representations, including
pandas, Excel, SQL, etc
some of the components that
MULTIINDEX INTERNALS
make up MultiIndex objects,
also known as hierarchical
indices in pands
LEVELS NAMES LEVSHAPE VALUES
a list of lists containing a list containing the names a tuple containing the Use visual charts to
each label value for each of each level length of each level communicate info more
of the levels in the effectively.
MultiIndex
L0 L1
PANEL
deprecated since pandas v0.22
VS
- prefer df.MultiIndex for new projects
MULTIINDEX DF
- many of the same pandas concepts apply
for representing
hierarchical data - older docs still available online for panel
split data into groups
Split
Apply
Combine
apply .sum()
Split
Apply
NA_Sales NA_Sales
Combine
0.75 0.80
combine the output
Split
NA_Sales NA_Sales
0.75 0.80
Apply
Combine