Introduction to dataframe
What is pandas?
Pandas is an open source package that provides numerous tools for data analysis. It
offers fast, flexible and expressive data structures that can be used for many
different data manipulation tasks.
To use pandas in your Python IDE, you first need to import the library:
import pandas as pd
Pandas data structures
The two primary data structures of pandas are:
1. Series: a one-dimensional labeled array that can store data of any type. Its values
are mutable, but its size cannot be changed.
2. DataFrame: a two-dimensional, size-mutable structure that lets us store and
manipulate tabular data in rows of observations and columns of variables.
How to create series
A series may be created from:
1. A numpy array:
import pandas as pd
import numpy as np
array = np.array(["blue", "yellow", "pink", "purple"]) # get the array
series1 = pd.Series(array) #create the series from the array
print(series1)
2. A list:
values = [19, 175, 41, 22] # avoid the name "list", which would hide the built-in list()
series2 = pd.Series(values) # create the series from the list
print(series2)
https://www.tutorialspoint.com/python_pandas/python_pandas_series.htm
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.add.html
https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781787123137/3/ch03lvl1sec31/re-indexing-a-series
Series Changing Index
A big advantage we gain compared to NumPy arrays is that we can create a Series
using our own indexes.
For example:
import pandas as pd
color = ["pink", "white", "black", "blue"]
occurrence = [20, 15, 6, 43]
S = pd.Series(occurrence, index=color) # the colors become the index labels
print(S)
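Once a Series has custom labels, its values can be looked up by label as well as by position. A minimal sketch reusing the color example above (the variable names by_label and by_position are only illustrative):

```python
import pandas as pd

color = ["pink", "white", "black", "blue"]
occurrence = [20, 15, 6, 43]
S = pd.Series(occurrence, index=color)

by_label = S["white"]    # look up a value by its index label
by_position = S.iloc[0]  # look up a value by its integer position
print(by_label, by_position)
```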
Series addition
If we add two series with the same index, we get a new series with the same index
in which the corresponding values are added:
import pandas as pd
color=["pink", "white", "black", "blue"]
S1=pd.Series([20, 15, 6, 43], index=color)
S2=pd.Series([3, 22, 9, 10], index=color)
print(S1+S2)
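When the two indexes do not match exactly, pandas aligns the values by label rather than by position. The sketch below uses made-up values to show what that looks like, and how Series.add with fill_value can treat a missing label as 0:

```python
import pandas as pd

# Made-up example: "black" exists only in S1 and "blue" only in S2
S1 = pd.Series([20, 15, 6], index=["pink", "white", "black"])
S2 = pd.Series([3, 22, 10], index=["pink", "white", "blue"])

plain_sum = S1 + S2                    # labels found in only one series give NaN
filled_sum = S1.add(S2, fill_value=0)  # a missing label is treated as 0 instead
print(plain_sum)
print(filled_sum)
```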
Dataframe Introduction
A DataFrame is a 2-dimensional labeled data structure with columns of potentially
different types.
DataFrame columns are pandas Series.
You can think of a DataFrame as a spreadsheet, an SQL table, or a dictionary of Series.
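To see that a column really is a Series, we can pull one out of a small DataFrame and check its type. This is a minimal sketch with a made-up two-row table:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jack", "Thomas"], "age": [34, 30]})

age_column = df["age"]                    # selecting one column returns a Series
print(type(age_column))
print(age_column.index.equals(df.index))  # the column shares the DataFrame's row index
```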
How to create DataFrame
1. From a list:
import pandas as pd
rows = [['Jack', 34, 'Paris'], ['Thomas', 30, 'Roma'],
        ['Alexandre', 16, 'New York']] # one inner list per row
df = pd.DataFrame(rows, columns=['name', 'age', 'city'])
2. From a dictionary:
dictionary = { 'name' : ['Jack', 'Thomas', 'Alexandre'],
'age' : [34, 30, 16],
'city' : ['Paris', 'Roma', 'New York']}
df = pd.DataFrame(dictionary)
print(df)
3. From a NumPy array:
import numpy as np
import pandas as pd
my_numpy_array = np.random.randn(3, 4) # 3 rows and 4 columns of random values
df = pd.DataFrame(my_numpy_array, columns=["a", "b", "c", "d"])
print(df)
4. From a csv file:
Let's create a dataframe from a csv file:
import pandas as pd
df=pd.read_csv("csv file example", sep=";")
https://databricks.com/glossary/pandas-dataframe
Dataframe exportation to csv
1. We created a dataframe from a dictionary.
2. We exported it to a csv file.
3. We created a new dataframe from our csv file.
4. We used the head command to show the first 5 rows.
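The four steps above can be sketched as follows. The file name people.csv is hypothetical, and the file is written to a temporary folder so that the example leaves nothing behind:

```python
import os
import tempfile

import pandas as pd

# Step 1: create a dataframe from a dictionary
df = pd.DataFrame({
    "name": ["Jack", "Thomas", "Alexandre"],
    "age": [34, 30, 16],
    "city": ["Paris", "Roma", "New York"],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "people.csv")  # hypothetical file name
    df.to_csv(path, index=False)     # Step 2: export, leaving the row labels out of the file
    df_from_csv = pd.read_csv(path)  # Step 3: read the file back into a new dataframe

print(df_from_csv.head())            # Step 4: show the first rows (at most 5)
```

Passing index=False is a choice made for this sketch: without it, the row labels are written as an extra unnamed column.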
https://datatofish.com/export-dataframe-to-csv/
Getting information about dataframe
To show general information about the columns, such as their types and non-null counts, we write:
df.info()
Viewing our data
df.head() # show the first 5 rows of our data
df.tail() # show the last 5 rows of our data
Describing our data
Now, let’s use the describe command to compute summary statistics for the
numerical columns of our data.
df.describe() # detailed description of the numerical variables: count, mean, std, min, quartiles, max
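describe can also be applied to one specific column by selecting that column first. A minimal sketch, assuming the name/age/city table used earlier:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Thomas", "Alexandre"],
    "age": [34, 30, 16],
    "city": ["Paris", "Roma", "New York"],
})

age_stats = df["age"].describe()  # statistics for the age column only, returned as a Series
print(age_stats)
print(age_stats["mean"])          # individual statistics can be read by name
```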
https://note.nkmk.me/en/python-pandas-head-tail/
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
Dataframe Bracket Selection
In this course, we will often have to select specific rows or columns from our
DataFrame.
One of the easiest ways to do that is to use brackets:
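A minimal sketch of bracket selection, assuming the name/age/city table used earlier. The threshold of 18 in the last line is only an illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Thomas", "Alexandre"],
    "age": [34, 30, 16],
    "city": ["Paris", "Roma", "New York"],
})

ages = df["age"]                  # one column name: returns a Series
name_city = df[["name", "city"]]  # list of column names: returns a DataFrame
first_two = df[0:2]               # slice: returns rows by position
adults = df[df["age"] >= 18]      # boolean condition: returns the rows where it is True
print(adults)
```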
https://datatofish.com/select-rows-pandas-dataframe/
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
Dataframe loc/iloc
Dataframe loc
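The loc indexer allows us to extract rows and columns by index label. It is used with
square brackets rather than called like a function.
df.index = ["Jack", "Thomas", "Alexandre", "Anne"] # label the rows; df must have four rows here
df.loc[["Jack", "Thomas"]] # select the rows labeled "Jack" and "Thomas"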
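loc also accepts a column selection after the comma, as well as slices of labels. A minimal sketch, assuming a three-row table whose rows are labeled by name:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [34, 30, 16], "city": ["Paris", "Roma", "New York"]},
    index=["Jack", "Thomas", "Alexandre"],
)

one_value = df.loc["Jack", "city"]             # a single cell: row label, column label
subset = df.loc[["Jack", "Thomas"], ["age"]]   # lists of labels: returns a DataFrame
label_slice = df.loc["Jack":"Thomas"]          # label slices include both endpoints
print(one_value)
print(subset)
```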
Dataframe iloc
The iloc indexer follows the same bracket syntax as loc, but it extracts rows and columns
by integer position instead of by label.
subset = df.iloc[:, 1:3] # select the second and third columns, keeping all rows
print(subset)
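A few more position-based selections, sketched on the name/age/city table used earlier. Unlike label slices with loc, position slices exclude their end point:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Thomas", "Alexandre"],
    "age": [34, 30, 16],
    "city": ["Paris", "Roma", "New York"],
})

first_row = df.iloc[0]       # first row, returned as a Series
block = df.iloc[0:2, 1:3]    # rows 0-1 and columns 1-2 (the end positions are excluded)
last_name = df.iloc[-1, 0]   # negative positions count from the end
print(block)
```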
Setting index in dataframe
We can use the set_index() function if we want to replace the index using one or
more existing columns.
[Figure: the dataframe shown with its old index and with its new index]
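A minimal sketch of set_index, assuming the name/age/city table used earlier and choosing the name column as the new index:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Thomas", "Alexandre"],
    "age": [34, 30, 16],
    "city": ["Paris", "Roma", "New York"],
})
print(df.index)          # old index: the default 0, 1, 2

df_indexed = df.set_index("name")  # returns a new dataframe; df itself is unchanged
print(df_indexed.index)  # new index: the values of the name column
print(df_indexed.loc["Jack", "age"])
```

With the names as index, rows can then be selected by name with loc.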
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
Dataframe Concatenate
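pd.concat joins several pandas objects along an axis: axis=0 (the default) stacks rows, axis=1 places columns side by side. A minimal sketch, assuming the name/age/city table used earlier. The extra row and the country values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Thomas", "Alexandre"],
    "age": [34, 30, 16],
    "city": ["Paris", "Roma", "New York"],
})

# Stack rows: the new row below is made up for this example
new_rows = pd.DataFrame({"name": ["Anne"], "age": [27], "city": ["Lyon"]})
stacked = pd.concat([df, new_rows], ignore_index=True)  # renumber the rows 0..3

# Add a column: the country values are made up for this example
country = pd.DataFrame({"country": ["France", "Italy", "USA"]})
side_by_side = pd.concat([df, country], axis=1)         # rows are matched on their index
print(stacked)
print(side_by_side)
```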
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
Dataframe drop
To drop specified labels from rows or columns, we simply use the drop() method.
For example, we want to delete the country column we added previously:
df.drop("country", axis=1)
The drop() method has inplace=False as default, so it returns a new dataframe and leaves
df unchanged: you can see that the country column is not gone. Take a break and do some research.
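One way to make the change stick is to assign the result of drop() back to a variable. A minimal sketch, assuming a small table with a made-up country column:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Thomas"],
    "city": ["Paris", "Roma"],
    "country": ["France", "Italy"],  # made-up values for this example
})

dropped = df.drop("country", axis=1)    # returns a new dataframe without the column
still_there = "country" in df.columns   # the original df still has the column
print(still_there)

df = df.drop(columns="country")         # reassign to keep the result
print(df.columns.tolist())
```

Passing inplace=True to drop() is the other option; in that case the method modifies df directly and returns None.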
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
Dataframe with Pandas Recap
Pandas is an open source library which offers several tools for data analysis. It
provides fast, flexible and expressive data structures that can be used for numerous
data manipulation tasks.
We can create a DataFrame from a list, a dictionary, a NumPy array or a csv file.
To convert data to a csv file: data.to_csv("file.csv")
To see information about columns: dataframe.info()
To see a brief description of columns and their values: dataframe.describe()
To view the first 5 rows: dataframe.head()
To view the last 5 rows: dataframe.tail()
There are multiple ways to select rows and columns from pandas DataFrames. iloc
and loc are the main operations for retrieving data: iloc selects rows and columns
by integer position, whereas loc selects them by label.
We can also select specific rows or columns using brackets.