Pandas
History:
pandas was initially developed by Wes McKinney in 2008 while he was working at
AQR Capital Management. He convinced AQR to allow him to open-source the
library. Another AQR employee, Chang She, joined as the second major contributor
to the library in 2012. Many versions of pandas have been released since then.
At the time of writing, the latest version was 1.0.1.
Advantages of pandas:
• Fast and efficient for manipulating and analyzing data.
• Data from different file objects can be loaded.
• Easy handling of missing data (represented as NaN) in floating point as well
as non-floating point data.
• Size mutability: columns can be inserted into and deleted from DataFrame and
higher-dimensional objects.
• Data set merging and joining.
• Flexible reshaping and pivoting of data sets
• Powerful group by functionality for performing split-apply-combine operations
on data sets.
pandas provides two main data structures for manipulating data:
1. Series ---> one-dimensional data
2. DataFrame ---> two-dimensional data
Series:
A Series can be created from various inputs, such as:
• Array
• Dict
• Scalar value or constant
Example:
import pandas as pd
import numpy as np
data = np.array(['chicken','mutton','fish'])
ser = pd.Series(data)
print(ser)
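The other two inputs listed above, a dict and a scalar value, can be sketched in the same way. The labels and numbers below are made up purely for illustration.

```python
import pandas as pd

# From a dict: the keys become the index labels, the values become the data.
prices = pd.Series({'chicken': 180, 'mutton': 650, 'fish': 300})
print(prices)

# From a scalar: the value is repeated once for every label in the index.
default_qty = pd.Series(1, index=['chicken', 'mutton', 'fish'])
print(default_qty)
```

When a scalar is used, the index should be supplied so that pandas knows how many times to repeat the value.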
DataFrame:
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a
tabular fashion in rows and columns. You can think of it as an SQL table or a
spreadsheet.
Features of DataFrame
• Columns can be of different types
• Size-mutable
• Labeled axes (rows and columns)
• Arithmetic operations can be performed on rows and columns
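The features listed above can be seen in a small sketch. The column names and values here are invented for illustration only.

```python
import pandas as pd

# Columns of different types: strings, integers and floats.
df = pd.DataFrame({
    'Name': ['Alex', 'Bob', 'Clarke'],
    'Age': [10, 12, 13],
    'Height': [1.40, 1.52, 1.58],
})
print(df.dtypes)

# Size mutability and arithmetic: add a computed column, then drop one.
df['AgeNextYear'] = df['Age'] + 1      # arithmetic applied to a whole column
df = df.drop(columns=['Height'])       # returns a new DataFrame without 'Height'

# Labeled axes: row labels (index) and column labels.
print(df.index.tolist())
print(df.columns.tolist())
```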
Create DataFrame
A pandas DataFrame can be created from various inputs, such as:
• Lists
• Dicts
• Series
• NumPy ndarrays
• Another DataFrame
Create an Empty DataFrame
The most basic DataFrame that can be created is an empty DataFrame.
import pandas as pd
df = pd.DataFrame()
print(df)
Create a DataFrame from Lists
The DataFrame can be created using a single list or a list of lists.
Example 1:
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
Example 2:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
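The remaining input types (dict, Series and NumPy ndarray) follow the same pattern. This is a minimal sketch; all names and values are illustrative.

```python
import numpy as np
import pandas as pd

# From a dict of lists: the keys become the column names.
df_dict = pd.DataFrame({'Name': ['Alex', 'Bob', 'Clarke'], 'Age': [10, 12, 13]})

# From a dict of Series: rows are aligned on the Series index,
# so the label 'c', which is missing from 'two', becomes NaN.
df_series = pd.DataFrame({
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([10, 20], index=['a', 'b']),
})

# From a NumPy ndarray: column names are supplied separately.
df_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['x', 'y'])

print(df_dict)
print(df_series)
print(df_array)
```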
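Handle CSV files:
# Use a raw string (r'...') so the backslashes in a Windows path are not
# treated as escape sequences.
df = pd.read_csv(r'E:\python batch\car_data.csv')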
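To try read_csv without the file above, an in-memory text buffer can stand in for a file on disk. The CSV contents below are invented; only the column names Year, owner and transmission are taken from these notes.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the rows are made up for illustration.
csv_text = """Year,owner,transmission
2012,First,Manual
2015,Second,Automatic
2018,First,Manual
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())   # first rows of the loaded data
```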
Record count:
len(df)     # number of rows
df.shape    # tuple of (number of rows, number of columns)
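The difference between the two can be checked on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2012, 2015, 2018],
                   'owner': ['First', 'Second', 'First']})

n_rows = len(df)                    # rows only
rows_from_shape, n_cols = df.shape  # shape unpacks into (rows, columns)
print(n_rows, rows_from_shape, n_cols)
```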
Select specific columns:
df.loc[:, ['owner', 'transmission']]
Sort by a column:
df.sort_values('Year')
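A self-contained sketch of column selection and sorting, using invented rows with the column names from above:

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2015, 2012, 2018],
    'owner': ['Second', 'First', 'First'],
    'transmission': ['Automatic', 'Manual', 'Manual'],
})

# Two equivalent ways to select a subset of columns.
subset = df.loc[:, ['owner', 'transmission']]
same_subset = df[['owner', 'transmission']]

# sort_values returns a new DataFrame; the original order is unchanged.
by_year = df.sort_values('Year')
newest_first = df.sort_values('Year', ascending=False)
print(by_year)
```

Note that the sorted result keeps the original row labels, so the index of by_year is no longer in ascending order.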
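Filter the data:
# Keep only the rows where Year is greater than 2013
df[df['Year']>2013]
# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
# val and val2 are placeholder column names.
df[(df.val > 0.5) & (df.val2 == 1)]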
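Filtering works by building a boolean mask and using it to index the DataFrame. A minimal sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2012, 2014, 2016, 2018],
    'transmission': ['Manual', 'Automatic', 'Manual', 'Automatic'],
})

# The comparison produces one True/False value per row.
mask = df['Year'] > 2013
recent = df[mask]

# Each condition needs its own parentheses because & binds more
# tightly than the comparison operators.
recent_manual = df[(df['Year'] > 2013) & (df['transmission'] == 'Manual')]
print(recent_manual)
```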
Replace nulls with default values
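# Assign the result back; calling fillna(..., inplace=True) on a selected
# column may not update the original DataFrame in recent pandas versions.
nba["College"] = nba["College"].fillna("No College")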
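The nba DataFrame is not defined in these notes, so the sketch below builds a tiny stand-in with invented values to show the effect of fillna:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the nba data set; the values are made up.
nba = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C'],
    'College': ['Texas', np.nan, None],
})

missing_before = int(nba['College'].isna().sum())   # count missing entries

# Replace every missing value in the column with a default.
nba['College'] = nba['College'].fillna('No College')
print(nba)
```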
Grouping the data:
df.groupby('Team').groups    # maps each Team value to the row labels in that group
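Grouping is usually followed by an aggregation, which is the split-apply-combine pattern mentioned under the advantages. A minimal sketch, with team names and points invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Kings', 'Riders', 'Kings', 'Devils'],
    'Points': [876, 741, 789, 812, 863],
})

grouped = df.groupby('Team')          # split: one group per Team value
print(grouped.groups)                 # group name -> row labels

totals = grouped['Points'].sum()      # apply + combine: one total per team
summary = grouped['Points'].agg(['mean', 'max'])   # several aggregates at once
print(totals)
print(summary)
```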
Joins:
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
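# Inner join (the default) on a single key
pd.merge(left, right, on='id')
# Inner join on multiple keys: rows must match on both id and subject_id
pd.merge(left, right, on=['id', 'subject_id'])
# Left join: keep every row of left; unmatched right-hand columns become NaN
pd.merge(left, right, on='subject_id', how='left')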
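The how argument decides which keys survive the merge. The sketch below rebuilds the same left and right DataFrames so that it runs on its own, and compares the row counts for three join types:

```python
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# inner: only subject_id values present in both frames
inner = pd.merge(left, right, on='subject_id', how='inner')
# left: every row of left, NaN where right has no match
left_join = pd.merge(left, right, on='subject_id', how='left')
# outer: every subject_id from either frame
outer = pd.merge(left, right, on='subject_id', how='outer')

print(len(inner), len(left_join), len(outer))
print(left_join)
```

Columns that appear in both frames but are not join keys (here id and Name) receive the default suffixes _x and _y in the result.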