What is PANDAS
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating
data.
• Pandas can clean messy data sets, and make them readable and
relevant.
• Relevant data is very important in data science.
PANDAS vs SQL
• Pandas is ideal for working with smaller datasets that can fit into memory and
provides a more flexible and intuitive interface for data manipulation. SQL, on the
other hand, is ideal for working with larger datasets that cannot fit into memory
and provides powerful aggregation and filtering capabilities.
• When it comes to speed and performance, SQL has the upper hand over Pandas.
SQL is optimized for working with large datasets and can handle millions of rows of
data with ease. However, Pandas provides a more flexible and intuitive interface for
data manipulation, making it easier to work with for smaller datasets.
• Choosing between Pandas and SQL depends on the specific requirements of your
project. If you're working with smaller datasets or need more flexibility in data
manipulation, Pandas is the way to go. If you're working with larger datasets or
need more advanced aggregation and filtering capabilities, SQL is preferable.
Types of Data Structures in Pandas
• Series
• DataFrame
• Panel (deprecated and removed in pandas 1.0; use MultiIndex DataFrames instead)
Series
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
import pandas as pd
a = [12, 74, 26]
var = pd.Series(a)
print(var)
Labels
• We can create our own labels using the index parameter.
var = pd.Series(a, index = ["x", "y", "z"])
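A runnable sketch of the labelled Series above (variable names are illustrative):

```python
import pandas as pd

a = [12, 74, 26]
var = pd.Series(a)                          # default integer index: 0, 1, 2
var2 = pd.Series(a, index=["x", "y", "z"])  # custom labels via index

# elements can now be accessed by label
print(var2["y"])  # 74
```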
DATAFRAMES
• Data sets in Pandas are usually multi-dimensional
tables, called DataFrames.
• A Series is like a column; a DataFrame is the whole table.
• It is created using the DataFrame constructor.
pd.DataFrame()
import pandas as pd
pd.DataFrame([1,3,5,6], columns=['A'])
# Creates 4 rows and 1 column: each list element becomes a row.
pd.DataFrame([[23,43,89,1]], columns=['w','x','y','z'])
# Creates 1 row and 4 columns: the inner list is a single row.
Create dataframe using Dictionary
import pandas as pd
dic = {'A': [112], 'B': [78], 'C': [43]}  # values must be list-like, or pass an index
pd.DataFrame(dic)
Locate elements in a DataFrame
• To locate particular elements of a DataFrame we use the loc and iloc methods
in pandas.
• To locate an element based on index name and column name, we use loc.
df.loc[0]
df.loc[2,'x']
• To locate the element based on index number and column number we use
iloc.
df.iloc[2,3]
Other functions of loc and iloc
• df.loc[:,'A']
• df.loc[:,:]
• df.loc[2:6,'X']
• df.iloc[2:10,23:56]
• df.iloc[:]
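A small sketch of loc vs iloc on a toy DataFrame (labels are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 2, 3]},
                  index=['r0', 'r1', 'r2'])

# loc: label-based lookup
print(df.loc['r1', 'A'])     # 20
col_a = df.loc[:, 'A']       # the whole column A

# iloc: position-based lookup
print(df.iloc[1, 0])         # 20
first_two = df.iloc[0:2, :]  # first two rows
print(first_two.shape)       # (2, 2)
```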
Replace value using loc
• df.loc[1,'A'] = 34
• df.iloc[2,4] = 'ABC'
Read a file in Pandas
import pandas as pd
df = pd.read_csv('data.csv')
df.head()                # first 5 rows
df.tail()                # last 5 rows
df.shape                 # (rows, columns)
df.dtypes                # data type of each column
df.select_dtypes('int')  # only the integer columns
df.describe()            # summary statistics for numeric columns
Select and filter records
To select records there are multiple ways.
df['col1']
df[df.col1 == 'val']
df[df.col.isin([val1, val2, ..., valn])]
df.loc[df[df.col1 == 'val'].index, 'col2']
df[df.col.between(10, 200)]
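The filters above, sketched on hypothetical sales data:

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF'],
                   'sales': [150, 90, 300, 40]})

ny = df[df.city == 'NY']                 # rows where a column equals a value
subset = df[df.city.isin(['NY', 'SF'])]  # rows matching a list of values
mid = df[df.sales.between(50, 200)]      # rows with sales in [50, 200]
print(len(ny), len(subset), len(mid))    # 2 3 2
```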
Handle Null and duplicate values
• Detect NaN values
df.isnull().sum()        # NaN count per column
df.dropna(subset=[], how='all', inplace=True)  # drop rows that are all-NaN within the subset
df.fillna(value)         # fill NaNs with a given value
df.duplicated()          # boolean mask marking duplicate rows
df.drop_duplicates()     # remove duplicate rows
Drop a row or column in Dataframe
df.drop('A',axis=1)
df.drop(1,axis=0)
df.drop(df[df.col=='val'].index,axis=0)
df = df[~(df.col == 'val')]
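A quick sketch of the drop patterns above (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'x']})

no_col = df.drop('A', axis=1)                  # drop a column
no_row = df.drop(1, axis=0)                    # drop a row by index label
no_x = df.drop(df[df.B == 'x'].index, axis=0)  # drop rows matching a condition
print(len(no_x))  # 1
```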
Mean, Median, Mode
df['col'].mean()
df['col'].median()
df['col'].mode()
df['col'].max()
Insert a row and column to a dataframe
df.insert(col_pos, col_name, value)  # insert a column at a given position
df['col1'] = val                     # add a new column at the end
df.iloc[row_number] = [2,3,4,5]      # overwrite an existing row (use df.loc[new_label] to append)
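A sketch of adding columns and rows (values are illustrative; note that iloc assignment overwrites an existing row, while loc with a new label appends one):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

df['C'] = [5, 6]                 # add a new column at the end
df.insert(0, 'id', [10, 11])     # insert a column at position 0
df.loc[len(df)] = [12, 7, 8, 9]  # append a new row under a new label
print(df.shape)                  # (3, 4)
```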
Other Important Methods
• df.rename() - rename columns or index labels
• df.col.unique() - distinct values in a column
• df.nunique() - number of distinct values
• df.value_counts('col') - frequency of each value
• df.replace() - replace values
Plot in Pandas
• Bar plot
• Histogram
• Line plot
• Scatter plot
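A minimal sketch of these plot types via DataFrame.plot (assumes matplotlib is installed; the Agg backend avoids opening a window):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, no display needed
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 15, 30]})

ax_bar = df.plot(kind='bar', x='x', y='y')          # bar plot
ax_line = df.plot(kind='line', x='x', y='y')        # line plot
ax_scatter = df.plot(kind='scatter', x='x', y='y')  # scatter plot
ax_hist = df['y'].plot(kind='hist')                 # histogram
```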
Lambda Function
An anonymous function, used to express a small operation in a
single line of code.
It is convenient for short, throwaway operations, but it is not
faster than a regular def function.
Syntax : lambda x : x*2
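A quick sketch of the syntax above, including the common pairing with apply:

```python
import pandas as pd

double = lambda x: x * 2
print(double(5))  # 10

# lambdas are often combined with apply on a Series
s = pd.Series([1, 2, 3])
doubled = s.apply(lambda x: x * 2)
print(list(doubled))  # [2, 4, 6]
```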
Map vs apply vs applymap
• map is defined on Series data only
• applymap is defined on DataFrames only (deprecated since pandas 2.1 in favour of DataFrame.map)
• apply is defined on both
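A side-by-side sketch of the three (applymap still works on pandas 2.x, with a deprecation warning on newer versions):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

s2 = s.map(lambda x: x + 1)          # map: element-wise on a Series
col_sums = df.apply(sum)             # apply: per column (axis=0) or per row (axis=1)
df2 = df.applymap(lambda x: x * 10)  # applymap: element-wise on a DataFrame
print(list(s2), list(col_sums))      # [2, 3, 4] [3, 7]
```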
Sort and groupby
Sort a dataframe
df.sort_values('col1',ascending=False)
Groupby
df.groupby('col1')['col2'].mean()
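The two calls above on a toy table (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'B', 'A', 'B'],
                   'score': [10, 20, 30, 40]})

# sort rows by a column, highest first
top = df.sort_values('score', ascending=False)
print(top.iloc[0]['score'])  # 40

# average score per team
means = df.groupby('team')['score'].mean()
print(means['A'], means['B'])  # 20.0 30.0
```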
Iterate over dataframe
• for i in df.col:
      pass
• df.iterrows() - iterate row-wise as (index, Series) pairs
• df.items() - iterate column-wise (replaces iteritems(), removed in pandas 2.0)
• df.itertuples() - iterate row-wise as named tuples (faster than iterrows)
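The iteration patterns above, sketched on a tiny DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# iterrows: (index, row-as-Series) pairs
total = 0
for idx, row in df.iterrows():
    total += row['a'] + row['b']
print(total)  # 10

# itertuples: one named tuple per row (usually faster)
firsts = [t.a for t in df.itertuples()]

# items: (column_name, column-as-Series) pairs
names = [name for name, col in df.items()]
print(firsts, names)  # [1, 2] ['a', 'b']
```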
Handle datetime data
• from datetime import datetime
• pd.date_range(start, end, freq)
• datetime.strptime(x, '%Y-%m-%d')
• df.col.dt.strftime()
Ref : https://www.programiz.com/python-
programming/datetime/strptime
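The three calls above, sketched with illustrative dates:

```python
import pandas as pd
from datetime import datetime

# a range of daily dates
dates = pd.date_range(start='2024-01-01', end='2024-01-05', freq='D')
print(len(dates))  # 5

# parse a string into a datetime object
d = datetime.strptime('2024-01-03', '%Y-%m-%d')
print(d.day)  # 3

# format datetime values back into strings
df = pd.DataFrame({'when': dates})
labels = df.when.dt.strftime('%Y-%m-%d')
print(labels[0])  # 2024-01-01
```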
Concat and Joins
Like SQL, we can perform join operations in Pandas.
The operations are almost identical to SQL joins, apart from the syntax.
pd.merge(df1,df2,on='col1',how='')
Concat vs Join vs Merge
Concat is used to concatenate DataFrames row-wise or column-wise, depending on axis.
pd.concat()
Join is used to join the dataframes based on index.
df1.join(df2)
Merge will join the DataFrames based on columns, like a SQL join.
df.merge(df1, on=[], how='inner')
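A sketch contrasting the three on hypothetical data:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# merge: SQL-style join on a column
inner = pd.merge(left, right, on='key', how='inner')
print(len(inner))  # 2 (only keys 'b' and 'c' appear in both)

# concat: stack DataFrames along an axis
stacked = pd.concat([left, left], axis=0)
print(len(stacked))  # 6

# join: combine DataFrames on the index
joined = left.set_index('key').join(right.set_index('key'))
print(len(joined))  # 3 (left join on the index by default)
```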
Pivoting tables
• import pandas as pd
• pd.pivot_table(df,columns=[],index=[],values=[],aggfunc='mean')
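A worked example of pivot_table on toy sales data (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'LA'],
                   'year': [2023, 2024, 2023, 2024],
                   'sales': [100, 200, 50, 150]})

# one row per city, one column per year, mean of sales in each cell
table = pd.pivot_table(df, index='city', columns='year',
                       values='sales', aggfunc='mean')
print(table.loc['NY', 2024])  # 200.0
```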
Handle categorical values
• Categorical values represent the classes of a variable.
• In real-world data processing we can't work with class labels directly,
as models only understand numerical values.
Using get_dummies(), we can convert them to numerical columns; this is
known as one-hot encoding.
pd.get_dummies(df, columns=['col1'])
We can also use the map function to assign our own value to each class,
e.g. {'IND': 1, "AUS": 2, "PAK": 3}
df.col.map({'IND': 1, "AUS": 2, "PAK": 3})
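Both approaches on a toy column:

```python
import pandas as pd

df = pd.DataFrame({'country': ['IND', 'AUS', 'PAK', 'IND']})

# one-hot encoding: one indicator column per class
dummies = pd.get_dummies(df, columns=['country'])
print(list(dummies.columns))  # ['country_AUS', 'country_IND', 'country_PAK']

# manual mapping of classes to numbers
codes = df.country.map({'IND': 1, 'AUS': 2, 'PAK': 3})
print(list(codes))  # [1, 2, 3, 1]
```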
Save a Dataframe
• We can save a Pandas DataFrame to many file formats, e.g. df.to_csv(), df.to_excel(), df.to_json().
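A round-trip sketch with to_csv (the file path is illustrative):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# write to CSV, then read it back
path = os.path.join(tempfile.gettempdir(), 'out.csv')
df.to_csv(path, index=False)
round_trip = pd.read_csv(path)
print(round_trip.equals(df))  # True
```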
Outlier checking
• An outlier is a data point that significantly deviates from the rest of the data. It can be caused by
measurement errors, data entry errors, or simply natural variation in the data. Outliers can skew
statistical analyses and lead to incorrect conclusions.
We have several methods to identify them:
• IQR:
Defines outliers as values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
• Z-score:
Measures how many standard deviations a data point is from the mean. Values greater than 3 or less than
-3 are often considered outliers.
z =(x-mean)/std_dev
• Box plot:
Visually represents the distribution of the data and highlights potential outliers.
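The IQR and z-score rules above, sketched on a toy series:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is the obvious outlier

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(list(outliers))  # [95]

# z-score rule: |z| > 3
z = (s - s.mean()) / s.std()
flagged = s[z.abs() > 3]
print(len(flagged))  # 0 here: on a tiny sample the outlier inflates the std
```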
Methods for Handling Outliers
• Removing outliers:
• If outliers are due to errors or anomalies, they can be removed from
the dataset.
• Imputing outliers:
• If outliers are valid data points, they can be replaced with the mean,
median, or mode of the remaining data.
• Capping outliers:
• Outliers can be capped at a certain value, such as the 90th or 95th
percentile.
Read large files
To read large files efficiently we use chunking in pandas.
e.g
import pandas as pd
chunksize = 1000
for chunk in pd.read_csv('datasets.csv', chunksize=chunksize):
print(chunk)