Pandas_Data_Analytics

The document provides an overview of pandas Series and DataFrames, detailing their structure, indexing methods, and various operations such as selection, transformation, and handling missing values. It explains the differences between methods and attributes, the use of Boolean masks, and the significance of vectorization for performance. Additionally, it covers advanced topics like MultiIndex, data merging, and the data preparation process.

Uploaded by

koushikfordownload


Series

Series are:

one-dimensional
labeled arrays
of any data type
Or, said differently...

a sequence of values
with associated labels
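dtype('O')

numpy expects homogeneous ("same type") data:

...but strings are variable-length
SOLUTION: store the strings as Python objects and keep only references to them in the array, which is why a series of strings has traditionally reported dtype('O') (object)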
Boolean Masks
used to select items at scale
work with [] and .loc
need to be the same length as the series

pd.Series(['A', 'B', 'C'])[[True, False, True]]

0 A
2 C
dtype: object
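In practice a mask is usually computed from a comparison rather than typed by hand. A minimal sketch, assuming a small made-up series named `temps` (not part of the original slides):

```python
import pandas as pd

# Hypothetical data, for illustration only.
temps = pd.Series([18, 25, 31, 22], index=['mon', 'tue', 'wed', 'thu'])

mask = temps > 24               # boolean Series, same length and index as temps
hot_days = temps[mask]          # [] accepts the mask
hot_days_loc = temps.loc[mask]  # .loc accepts it too

print(hot_days)
```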
List argument:

pd.Series(data=['this', 'is', 'fun'])

Dict argument:

pd.Series(data={0: 'this', 1: 'is', 2: 'fun'})

Also valid series:

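pd.Series(data=0)
pd.Series(data='weather')

Parameter vs argument:

pd.Series(data=students)

data is the parameter (the name in the function signature); students is the argument (the value passed in)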
Indexing With Callables

used for highly customizable indexing
work with [], .loc and .iloc
a single-argument function that returns indexing output:
a list of labels
a list of booleans
a slice, etc.
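A short sketch of the three kinds of indexing output a callable can return. The series `letters` and its labels are invented for the example:

```python
import pandas as pd

# Hypothetical series, for illustration only.
letters = pd.Series(['A', 'B', 'C', 'D'], index=['w', 'x', 'y', 'z'])

# Each callable receives the series itself and returns something
# the indexer already understands.
by_labels = letters.loc[lambda s: ['w', 'z']]   # a list of labels
by_mask = letters[lambda s: s > 'B']            # a boolean mask
by_slice = letters.iloc[lambda s: slice(0, 2)]  # a positional slice
```

Callables are handy at the end of a method chain, where the intermediate series has no name to refer to.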
Methods vs Attributes

Method is a function bound to the object.

Attribute is a variable bound to the object.

Selection By Label
Approach      Example               Comment
[] indexing   series['label']       slices, callables, boolean masks
.loc[]        series.loc['label']   slices, callables, boolean masks
dot access    series.label          no slice or boolean mask support
.get()        series.get('label')   no slice support; provides default; forgiving
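Selection By Position
Approach      Example          Comment
[] indexing   series[0]        slices, callables, boolean masks
.iloc[]       series.iloc[0]   slices, callables, boolean masks
dot access    (none)           series.0 is not valid Python, so dot access cannot select by position
.get()        series.get(0)    no slice support; provides default; forgiving

Note: on a series with non-integer labels, series[0] relies on a positional fallback that recent pandas versions deprecate; .iloc is the explicit choice.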
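The two tables above can be tried side by side. This is a sketch on a made-up series called `prices`; the labels and values are assumptions for the example:

```python
import pandas as pd

# Hypothetical series, for illustration only.
prices = pd.Series([1.5, 2.0, 0.75], index=['apple', 'pear', 'plum'])

by_label = prices.loc['pear']              # selection by label
by_position = prices.iloc[0]               # selection by position
label_slice = prices.loc['apple':'pear']   # label slices include the end label
position_slice = prices.iloc[0:2]          # positional slices exclude the end
missing = prices.get('kiwi', default=0.0)  # no KeyError, falls back to the default
```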
What is a csv?
COMMA-SEPARATED VALUES (.CSV) FILE

a type of text file containing values delimited by comma

Bools As Ints

True 1

False 0

The bool type inherits from (is a subclass of) int

bool -> int -> object
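Because True counts as 1, summing a boolean mask counts the matches and averaging it gives their share. A small sketch with invented ages:

```python
import pandas as pd

# bool is a subclass of int, so booleans take part in arithmetic.
assert issubclass(bool, int)
assert True + True == 2

# Hypothetical data, for illustration only.
ages = pd.Series([25, 41, 37, 52])
over_40 = ages > 40

count_over_40 = int(over_40.sum())     # number of True values
share_over_40 = float(over_40.mean())  # fraction of True values
```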

Median
The middlemost element in a sorted list of numbers.

Odd count: 10, 11, 12, 13, 14, 15, 16 -> the median is 13

Even count: 10, 11, 12, 13, 14, 15, 16, 17 -> average the two middle values

(13+14) / 2 = 13.5
diff()
the first discrete element-wise difference in a series

ser.diff(periods=1)
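A quick sketch of what `periods` changes, using made-up numbers. The first `periods` results have nothing to subtract from, so they come back as NaN:

```python
import pandas as pd

# Hypothetical data, for illustration only.
ser = pd.Series([10, 13, 11, 20])

step1 = ser.diff()           # each value minus the previous one
step2 = ser.diff(periods=2)  # each value minus the one two rows earlier
```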
Dropping Or Filling NAs
.dropna(): excludes NAs from the series

.fillna(): replaces NAs with something else

Note: both methods return a copy of the series unless inplace=True is passed:

ser.fillna('new value', inplace=True)
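The copy behaviour can be checked directly. A minimal sketch with an invented three-element series:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
ser = pd.Series([1.0, np.nan, 3.0])

dropped = ser.dropna()    # new series without the NA
filled = ser.fillna(0.0)  # new series with the NA replaced

# The original is untouched because neither call used inplace=True.
still_missing = int(ser.isna().sum())
```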

Index by min/max
idxmin(): returns the label of the row with minimum value

idxmax(): returns the label of the row with maximum value

Note: if multiple min/max values, only the first label is returned
Sequential vs Vectorized Ops
vectorization: running operations on entire arrays

[Diagram: func() applied sequentially vs vectorized]
Series Accounting
.size: number of elements in the series
series.size # 193

.count(): number of non-null elements

series.count() # 162

.isna().sum(): number of null elements

series.isna().sum() # 31
Size and Shape
.size: number of elements in the series
series.size # 193
.shape: tuple of the dimensions
for a series (1D): a one-element tuple holding its length
series.shape # (193,)
len(): python built-in function
len(series) # 193
sort_values() & sort_index()
sort_values(): returns a new series, sorted by values

sort_index(): returns a new series, sorted by index labels

Default Params: ascending=True,
inplace=False
na_position='last'
kind='quicksort'
Transforms
update(): modifies series values in place using another series
ser.update(other_series)
apply(): applies function (or ufunc) on each series value (most flexible)
ser.apply(np.sqrt)

map(): substitutes series values with others from a function, series, or dict (more input types)
ser.map({'old_value' : 'new_value'})
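The three transforms behave differently enough that a side-by-side sketch may help. All data here is invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
ser = pd.Series([1.0, 4.0, 9.0])
roots = ser.apply(np.sqrt)  # returns a new series

sizes = pd.Series(['s', 'm', 'xl'])
# Values missing from the dict become NaN.
mapped = sizes.map({'s': 'small', 'm': 'medium'})

# update() works in place and aligns on index labels.
ser.update(pd.Series([100.0], index=[1]))
```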
value_counts()
a sorted series containing unique values and their counts

ser.value_counts(sort=True,
                 ascending=False,
                 dropna=True,
                 normalize=False)
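A sketch of how `normalize` and `dropna` change the result, on a made-up series of colors:

```python
import pandas as pd

# Hypothetical data, for illustration only.
colors = pd.Series(['red', 'blue', 'red', None, 'red', 'blue'])

counts = colors.value_counts()                # most frequent first, NA ignored
shares = colors.value_counts(normalize=True)  # proportions instead of counts
with_na = colors.value_counts(dropna=False)   # NA gets its own row
```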
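Variance
the average of squared differences from the mean

\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

where \mu is the mean and \sum is the sum of the squared differences over all N values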
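The definition above divides by N. Note that pandas' `.var()` divides by N - 1 by default (`ddof=1`), so the two only agree when `ddof=0` is passed. A sketch on invented numbers:

```python
import pandas as pd

# Hypothetical data, for illustration only.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
by_hand = ((values - mean) ** 2).mean()  # average of squared differences

population_var = values.var(ddof=0)  # divides by N, matches by_hand
sample_var = values.var()            # pandas default, divides by N - 1
```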
DATAFRAMES
FIRST KEY CONCEPT

dataframes have two dimensions: labeled indices and columns

SECOND KEY CONCEPT

each column in a dataframe is a series

THIRD KEY CONCEPT

unlike series, dataframes can be heterogeneous: each column has its own dtype

example column dtypes: object, int64, bool

Our Data Prep Process

COLLECT UNITS
isolate the units from each nutrition column label

CREATE MAPPER
create a dictionary of key:value pairs containing the old labels and the new

RENAME DF
rename the column labels of nutrition dataframe

REPLACE & CONVERT
replace all the units from the dataframe values and convert values to floats
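The four steps could look roughly like the sketch below. The `nutrition` dataframe, its column names, the unit strings and the `<column>_<unit>` naming scheme are all assumptions made for the example; the original dataset is not shown in the slides.

```python
import pandas as pd

# Hypothetical stand-in for the nutrition dataframe.
nutrition = pd.DataFrame({
    'protein': ['3 g', '10 g'],
    'sodium': ['120 mg', '35.5 mg'],
})

# 1. COLLECT UNITS: take the unit letters from the first value of each column.
units = {
    col: nutrition[col].str.extract(r'([a-zA-Z]+)', expand=False).iloc[0]
    for col in nutrition.columns
}

# 2. CREATE MAPPER: old label -> new label that carries the unit.
mapper = {col: f'{col}_{unit}' for col, unit in units.items()}

# 3. RENAME DF
renamed = nutrition.rename(columns=mapper)

# 4. REPLACE & CONVERT: strip everything except digits and dots, then cast.
cleaned = renamed.apply(
    lambda col: col.str.replace(r'[^0-9.]', '', regex=True).astype(float)
)
```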
Why use .at or .iat?

SINGLE-PURPOSE
unlike .loc or .iloc, .at and .iat are only used for accessing single values

FASTER
because of the lack of overhead, they are much more performant for their isolated use-case
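A minimal sketch of single-value access, assuming a tiny made-up `players` dataframe (the real one from the course is not reproduced here). `.at` takes labels and `.iat` takes integer positions:

```python
import pandas as pd

# Hypothetical stand-in for the players dataframe.
players = pd.DataFrame({'name': ['Ann', 'Bo'], 'age': [31, 45]}, index=[10, 20])

age_by_label = players.at[20, 'age']   # row label 20, column label 'age'
name_by_position = players.iat[0, 0]   # first row, first column

players.at[10, 'age'] = 32             # single values can also be set
```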
dropna() with subset
df.dropna(axis=0, subset=['gender'])
drop rows, but only look at gender

DF.DROPNA()
removes columns or rows with missing values

SUBSET
restricts or localizes the method application to specific orthogonal labels
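The effect of `subset` is easiest to see next to the unrestricted call. A sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    'gender': ['F', None, 'M'],
    'age': [30, 41, np.nan],
})

any_missing = df.dropna(axis=0)                     # drops rows 1 and 2
only_gender = df.dropna(axis=0, subset=['gender'])  # drops row 1 only
```

With `axis=0` rows are dropped, and `subset` names the columns (the orthogonal axis) that are inspected.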
MORE WAYS TO DATAFRAME

dict of tuples: like dict of lists, but with tuples (column-wise)

dict of dicts: key:value pairs with column names as keys and index-labeled key:value pairs containing values (column-wise)

dict of series: a continuation of key concept #2 (column-wise)

list of dicts: list of key:value pairs containing column labels and values (row-wise)
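All four inputs can describe the same table. A sketch with two invented rows:

```python
import pandas as pd

# Column-wise inputs: each key is a column label.
from_tuples = pd.DataFrame({'name': ('Ann', 'Bo'), 'age': (31, 45)})
from_dicts = pd.DataFrame({'name': {0: 'Ann', 1: 'Bo'}, 'age': {0: 31, 1: 45}})
from_series = pd.DataFrame({
    'name': pd.Series(['Ann', 'Bo']),
    'age': pd.Series([31, 45]),
})

# Row-wise input: each dict is one record.
from_records = pd.DataFrame([
    {'name': 'Ann', 'age': 31},
    {'name': 'Bo', 'age': 45},
])
```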
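RANGE VS INT64INDEX

RangeIndex is a special case of Int64Index

both are immutable sequences of numbers

RangeIndex is an optimized alternative

pd.RangeIndex(start=0, stop=8789, step=1)

Note: pandas 2.0 removed the separate Int64Index class; an integer index is now a plain Index with an int64 dtype, and RangeIndex remains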

APPLY MNEMONIC

DF.APPLY() applies a function to a dataframe

Is aggregation required?

yes -> DF.AGG()
no -> DF.TRANSFORM()
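The mnemonic comes down to the shape of the result: aggregation collapses each column to one value, transformation keeps the original shape. A sketch on a made-up dataframe:

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

totals = df.agg('sum')                    # one value per column
doubled = df.transform(lambda x: x * 2)   # same shape as df

# apply() covers both cases; here the function aggregates each column.
spreads = df.apply(lambda col: col.max() - col.min())
```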
Binary (or bitwise) Operators
OPERATOR   WHAT IT IS   EXAMPLE
|          or           True | False -> True
&          and          True & False -> False
^          xor          True ^ False -> True
~          complement   ~True -> -2
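On boolean series the same operators work element-wise, which is how masks are combined. Note that `~` on a boolean series gives a logical NOT rather than the integer result shown for plain `~True`. A sketch with invented ages:

```python
import pandas as pd

# Hypothetical data, for illustration only.
ages = pd.Series([25, 41, 37])
young = ages < 30
old = ages > 40

either = young | old       # or
both = young & old         # and
exactly_one = young ^ old  # xor
neither = ~either          # complement, element-wise NOT
```

The parentheses-free form works here because the comparisons were stored in variables first; inline, each comparison needs its own parentheses, as in `(ages < 30) | (ages > 40)`.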
Comparators
COMPARISON OPERATOR   PANDAS METHOD
<                     .lt()
≤ (<= in code)        .le()
>                     .gt()
≥ (>= in code)        .ge()
==                    .eq()

The pandas methods support fill_value.
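What `fill_value` adds over the plain operator is a substitute for a missing value on one side of the comparison. A sketch with two invented series:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([2.0, 2.0, np.nan])

plain = a < b                   # comparisons against NaN are False
filled = a.lt(b, fill_value=0)  # a lone NaN is treated as 0 before comparing
```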
DUPLICATED()

players.duplicated(subset=['name', 'age'], keep='first')

WHAT COUNTS AS A DUPLICATE?

DEFAULT: records with repeating values across all columns
CUSTOM: could be changed to a smaller group of attributes using the subset parameter

WHICH IS THE ORIGINAL?

DEFAULT: the first occurrence is marked as the original
CUSTOM: could be changed to "first", "last" or neither (keep=False) using the keep parameter
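A sketch of how `subset` and `keep` change which rows get flagged. The `players` rows below are invented for the example:

```python
import pandas as pd

# Hypothetical stand-in for the players dataframe.
players = pd.DataFrame({
    'name': ['Ann', 'Ann', 'Bo', 'Ann'],
    'age': [31, 31, 45, 31],
    'club': ['X', 'Y', 'Z', 'X'],
})

default = players.duplicated()  # compares all columns, keeps the first
by_name_age = players.duplicated(subset=['name', 'age'], keep='first')
keep_last = players.duplicated(subset=['name', 'age'], keep='last')
flag_all = players.duplicated(subset=['name', 'age'], keep=False)
```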
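fillna() axes and methods
FILL DIRECTIONS

AXIS=0, METHOD=FFILL: fills downwards, each gap takes the last valid value above it
AXIS=0, METHOD=BFILL: fills upwards, each gap takes the next valid value below it
AXIS=1, METHOD=FFILL: fills rightwards, each gap takes the last valid value to its left
AXIS=1, METHOD=BFILL: fills leftwards, each gap takes the next valid value to its right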
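The four directions can be checked on a small grid. Recent pandas versions deprecate the `method=` keyword of `fillna()`, so this sketch uses the equivalent `ffill()` and `bfill()` methods instead; the data is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, 5.0, np.nan],
    'c': [7.0, 8.0, 9.0],
})

down = df.ffill(axis=0)   # take the value from the row above
up = df.bfill(axis=0)     # take the value from the row below
right = df.ffill(axis=1)  # take the value from the column to the left
left = df.bfill(axis=1)   # take the value from the column to the right
```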
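lookup(): another way to fancy index

players.lookup([450], ['age'])

array([30])

Note: DataFrame.lookup() was deprecated in pandas 1.2 and removed in pandas 2.0; players.loc[450, 'age'] returns the same single value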
pandas memory layout

[Diagram: columns stored in blocks grouped by dtype (FLOATS, INTS, OBJECT); the blocks shown hold 3, 9 and 7 columns]
to pop() or not...
players.pop('age')

3 POINTS TO CONSIDER

pop() works on a single column at a time

pop() returns the removed ('popped') column as a series

pop() modifies the underlying dataframe (operates inplace)

Selection Terminology Recap
EXAMPLE                                           WHAT IT IS
players.loc[0:2]                                  slicing
players.loc[players.age > 37]                     boolean masking
players.loc[132, 'name']                          basic (label-based) indexing
players.loc[[0, 132], ['name', 'market_value']]   fancy indexing
Two's Complement
VERY IMPORTANT

A NUMBER FORMAT
A system for representing signed integers in computers. Using x bits we could represent 2^x numbers.
For example, 32 bits represent 4294967296 numbers, 64 bits 18446744073709551616, and so on.

BIT        INTEGER
00000000   0
00000001   1
00000010   2
11111111   -1
11111110   -2
11111101   -3
Two's Complement
the ~ operator inverts the bits

BIT        INTEGER
00000000   0
00000001   1
00000010   2
11111111   -1
11111110   -2
11111101   -3

example: ~1 flips 00000001 into 11111110, which is -2
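The table can be reproduced in plain Python. The helper `to_bits` is a name made up for this sketch; it masks the value down to 8 bits so negative numbers show their two's-complement pattern.

```python
def to_bits(value, width=8):
    """Return the two's-complement bit string of value using width bits."""
    return format(value & (2 ** width - 1), f'0{width}b')

one = to_bits(1)            # '00000001'
inverted = to_bits(~1)      # every bit flipped
minus_three = to_bits(-3)

# x bits can represent 2**x distinct numbers.
count_32 = 2 ** 32
count_64 = 2 ** 64
```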
VECTORIZATION

[Diagram: CPU vs GPU; the operation completes in 2 cycles, instead of 6]

Made possible by SIMD at the processor level

Results in operations that are multiple times faster!
Supported by NumPy, and by extension, pandas
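A sketch contrasting the two styles on the same task. Both produce identical results; the vectorized form hands the whole array to NumPy in a single expression instead of looping in Python. No timings are claimed here, since they depend on the machine.

```python
import numpy as np

values = np.arange(1_000)

# Sequential: one Python-level multiplication per element.
sequential = np.array([v * 2 for v in values])

# Vectorized: one expression over the entire array.
vectorized = values * 2
```

Timing the two with `timeit` on larger arrays is a simple way to see the difference first-hand.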
copy: A "COPY" OF THE DATA
view: A "WINDOW" INTO THE DATA

VIEW VS COPY CONTINUED
HOW DO WE TELL?
2-POINT RULE:

pandas loves to give us copies, but

if we use loc/iloc or at/iat to assign values, we are guaranteed to operate on the original data (a view), not on a copy
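df.append() vs pd.concat()
what's the difference? ...almost identical, but:

.append() is a DataFrame instance method

.append() only operates along the index axis

Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() covers the same use case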
concat()

[Diagram: two data sets placed together to form one]

glues data sets together

a structure-focused operation
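A sketch of gluing along each axis, with invented frames. Along the index axis (`axis=0`, the default) `pd.concat()` does the job `.append()` was used for:

```python
import pandas as pd

# Hypothetical data, for illustration only.
top = pd.DataFrame({'name': ['Ann'], 'age': [31]})
bottom = pd.DataFrame({'name': ['Bo'], 'age': [45]})

# Stack rows; ignore_index=True renumbers the result 0..n-1.
stacked = pd.concat([top, bottom], ignore_index=True)

# Place columns side by side, aligned on the index.
extra = pd.DataFrame({'club': ['X', 'Y']})
side_by_side = pd.concat([stacked, extra], axis=1)
```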
merge()

[Diagram: two tables that share key values are combined into one table]

combines data sets together based on the content they share

much more flexible than pd.concat()!
pd.merge()

how='inner': only the common keys are selected; similar to set intersection
how='outer': all keys are selected; similar to set union
Join Cardinalities

1-1 (eg: person <-> DNA)
Dual-sided uniques
both merge objects contain unique values in the respective key

1-M (eg: book - pages)
One-sided uniques
one of the merge objects contains non-unique values
in the resulting pd.merge() the records are repeated M times

M-M (eg: book - author)
Dual-sided non-uniques
both merge objects contain non-unique values
in the resulting pd.merge() the records are repeated M x M times
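The row counts can be verified on toy data. The frames below are invented; the `validate` argument of `pd.merge()` is used as an optional guard that raises if the keys do not have the stated cardinality.

```python
import pandas as pd

# 1-M: each book_id is unique on the left and repeated on the right.
books = pd.DataFrame({'book_id': [1, 2], 'title': ['A', 'B']})
pages = pd.DataFrame({'book_id': [1, 1, 1, 2], 'page': [1, 2, 3, 1]})
one_to_many = pd.merge(books, pages, on='book_id', validate='one_to_many')

# M-M: the key repeats on both sides (2 rows x 3 rows for key 'x').
left = pd.DataFrame({'key': ['x', 'x'], 'l': [1, 2]})
right = pd.DataFrame({'key': ['x', 'x', 'x'], 'r': [1, 2, 3]})
many_to_many = pd.merge(left, right, on='key')
```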
pd.merge()

how='left': left keys are selected
how='right': right keys are selected
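The four `how` options differ only in which keys survive. A sketch on two invented frames that share the keys 'b' and 'c':

```python
import pandas as pd

# Hypothetical data, for illustration only.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'k'], 'rval': [10, 20, 30]})

inner = pd.merge(left, right, on='key', how='inner')      # common keys
outer = pd.merge(left, right, on='key', how='outer')      # all keys
left_join = pd.merge(left, right, on='key', how='left')   # keys from left
right_join = pd.merge(left, right, on='key', how='right') # keys from right
```

Rows without a partner on the other side are kept in the outer, left and right variants, with NaN filling the missing columns.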

RECOMMENDATION:
Always consider sorting the index

Advantages

improves retrieval performance, which becomes significant
- for large dataframes, or
- frequent retrieval

enables slicing syntax

overall a good practice when working with tabular data representations, including pandas, Excel, SQL, etc
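A minimal sketch of sorting the index and then slicing by label, on an invented frame. `is_monotonic_increasing` is a quick way to check whether an index is already sorted:

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'v': [1, 2, 3, 4]}, index=['d', 'b', 'a', 'c'])

was_sorted = df.index.is_monotonic_increasing
sorted_df = df.sort_index()
now_sorted = sorted_df.index.is_monotonic_increasing

window = sorted_df.loc['b':'c']  # label slice on the sorted index
```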
MULTIINDEX INTERNALS
some of the components that make up MultiIndex objects, also known as hierarchical indices in pandas

LEVELS: a list of lists containing each label value for each of the levels in the MultiIndex
NAMES: a list containing the names of each level
LEVSHAPE: a tuple containing the length of each level
VALUES: an array of tuples, one per entry, holding that entry's label at every level (L0, L1, ...)
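These components can be inspected on any MultiIndex. The sketch below builds a small one with `pd.MultiIndex.from_product`; the level names and labels are invented for the example.

```python
import pandas as pd

# Hypothetical two-level index, for illustration only.
idx = pd.MultiIndex.from_product(
    [['NA', 'EU'], [2019, 2020, 2021]],
    names=['region', 'year'],
)

levels = idx.levels      # unique labels per level
names = idx.names        # one name per level
levshape = idx.levshape  # length of each level
values = idx.values      # array of tuples, one per entry
```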
PANEL VS MULTIINDEX DF
for representing hierarchical data

- Panel is deprecated (since pandas v0.20, and removed in v0.25)
- prefer a DataFrame with a MultiIndex for new projects
- many of the same pandas concepts apply
- older docs still available online for panel
Split-Apply-Combine

Split: split data into groups
Apply: apply .sum()
Combine: combine the output

[Diagram: per-group NA_Sales results 0.75 and 0.80]
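In pandas the three steps are carried out by `groupby()` followed by an aggregation. The sketch below uses an invented `games` table with a `Genre` grouping column; only the `NA_Sales` column name comes from the slides, and the rows were chosen so that the group sums match the 0.75 and 0.80 shown there.

```python
import pandas as pd

# Hypothetical rows; Genre is an assumed grouping column.
games = pd.DataFrame({
    'Genre': ['Action', 'Action', 'Puzzle'],
    'NA_Sales': [0.5, 0.25, 0.8],
})

# Split by Genre, apply .sum() to each group, combine into one series.
totals = games.groupby('Genre')['NA_Sales'].sum()

# The split step can also be inspected on its own.
group_sizes = {name: len(group) for name, group in games.groupby('Genre')}
```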
