Introduction to Data Science
Using Python, Part 2
Pandas
Reading in Data From Excel
I have the following data saved in the file “Grades_Short.csv”:
Let’s see how we read this data into pandas:
Before you use pandas you must import it. Any time you use pandas, put the import line at the top of your code.
The data is read into a variable called df_grades using the built-in read_csv method, which takes the path to the file.
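A minimal sketch of the import and the read_csv call described above. The file contents here are invented for illustration and written to a temporary folder so the example runs anywhere; with the real file, only the import and the read_csv line are needed.

```python
import os
import tempfile

import pandas as pd  # import pandas once, at the top of your code

# Hypothetical file contents (not the real Grades_Short.csv data).
csv_text = "Name,Mini Exam 1,Final\nAlice,18.5,30\nBob,21.0,24\nCara,15.5,33\n"
folder = tempfile.mkdtemp()
path = os.path.join(folder, "Grades_Short.csv")
with open(path, "w") as f:
    f.write(csv_text)

# read_csv takes the path to the file and returns a dataframe.
df_grades = pd.read_csv(path)
print(df_grades)
```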
So, what is df_grades and how does it store the data?
Typing the name of any variable at the end of a code cell will display the contents of
the variable.
• df_grades is a pandas dataframe.
• The data is stored in a tabular format very similar to Excel.
• When the data file and the Jupyter notebook are in the same folder, the file name alone is the path.
• When Grades_Short.csv is in the folder Data, the folder name goes in front of the file name. “/” separates directories.
• When Grades_Short.csv is in the folder Data and the Jupyter notebook is in the folder Notebooks, “..” means go back one directory.
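The path rules above can be sketched as follows. This assumes Data and Notebooks are sibling folders inside one project folder, which is a guess about the layout; the folder is built in a temporary location with invented file contents so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Throwaway project folder with sibling Data and Notebooks folders (assumed layout).
project = tempfile.mkdtemp()
os.makedirs(os.path.join(project, "Data"))
os.makedirs(os.path.join(project, "Notebooks"))
with open(os.path.join(project, "Data", "Grades_Short.csv"), "w") as f:
    f.write("Name,Final\nAlice,30\nBob,24\n")  # invented contents

start_dir = os.getcwd()

# Working next to the Data folder: "/" separates directories.
os.chdir(project)
df_from_project = pd.read_csv("Data/Grades_Short.csv")

# Working inside Notebooks: ".." goes back one directory first.
os.chdir(os.path.join(project, "Notebooks"))
df_from_notebooks = pd.read_csv("../Data/Grades_Short.csv")

os.chdir(start_dir)  # restore the original working directory
```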
The head() Method
Using the head() method
• If the data is very large, you don’t want to print the entire dataframe to your output.
• The head(n) method returns the first n rows of the dataframe. If n is not supplied, the default is the first 5 rows.
• I like to run the head() method after I read in the dataframe to check that everything was read in correctly.
• There is also a tail(n) method that returns the last n rows of the dataframe.
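A short sketch of head() and tail(), using a small invented dataframe in place of the real grades file.

```python
import pandas as pd

# Hypothetical data: 8 students with invented Final scores.
df_grades = pd.DataFrame({
    "Name": [f"Student {i}" for i in range(8)],
    "Final": [30, 24, 33, 27, 36, 21, 29, 35],
})

first_five = df_grades.head()    # no n supplied, so the first 5 rows
first_two = df_grades.head(2)    # the first 2 rows
last_three = df_grades.tail(3)   # the last 3 rows
print(first_five)
```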
Basic Features
• The column names come back as an Index. Think of this as a list.
• Each column has a data type: object = string, float64 = decimal, int64 = integer.
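A small sketch of how these features can be inspected, assuming the columns, dtypes, and index attributes are the ones meant here. The dataframe and its column names are invented stand-ins for the grades data.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Mini Exam 1": [18.5, 21.0, 15.5, 19.0, 22.5],
    "Final": [30, 24, 33, 27, 36],
})

column_names = list(df_grades.columns)  # column names, converted to a plain list
column_types = df_grades.dtypes         # one data type per column
row_labels = list(df_grades.index)      # default index: the row numbers
print(column_types)
```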
• A dataframe has column names and row names. The row names are called the index.
• Pandas defaults to using the row number as the index, and it automatically recognizes that the first row of the file holds the column names.
• Next we discuss how to pick out various pieces of the dataframe.
Selecting a Single Column
• Between the square brackets, the column name must be given as a string.
• The output is the column as a series.
• A series is a one-dimensional dataframe. More on this in the slicing section.
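A minimal sketch of selecting one column with square brackets, on an invented stand-in for the grades data.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Mini Exam 1": [18.5, 21.0, 15.5, 19.0, 22.5],
    "Final": [30, 24, 33, 27, 36],
})

# The column name is a string between square brackets; the result is a series.
name_column = df_grades["Name"]
print(type(name_column))
```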
• Dot notation (the column name written as an attribute of the dataframe) is an exactly equivalent way to get the Name column.
• Pro: you don’t have to type brackets or quotes.
• Con: it won’t generalize to selecting multiple columns, won’t work if column names have spaces, and can’t be used to create new columns.
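A short sketch comparing the two ways of getting the Name column, on invented data. The "Mini Exam 1" column name is an assumption chosen to show the problem with spaces.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Mini Exam 1": [18.5, 21.0, 15.5, 19.0, 22.5],
    "Final": [30, 24, 33, 27, 36],
})

by_brackets = df_grades["Name"]
by_dot = df_grades.Name               # no brackets or quotes needed
same = by_brackets.equals(by_dot)     # both give the same series

# A column name containing spaces can only be reached with brackets.
mini_exam = df_grades["Mini Exam 1"]
```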
Selecting Multiple Columns
• Pass a list of strings that correspond to column names.
• You can select as many columns as you want.
• Columns don’t have to be contiguous.
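A minimal sketch of selecting several columns with a list of names, on invented data. Name and Final are not next to each other here, which shows that the columns need not be contiguous.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Mini Exam 1": [18.5, 21.0, 15.5, 19.0, 22.5],
    "Final": [30, 24, 33, 27, 36],
})

# Note the double brackets: the inner pair is the list of column names.
subset = df_grades[["Name", "Final"]]
print(subset)
```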
Storing Result
Why store a slice?
• We might want or have to do our analysis in steps.
• Less error-prone
• More readable
The variable name stores a series.
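A small sketch of storing a slice and then working with it in a second step. The variable name finals and the data are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Final": [30, 24, 33, 27, 36],
})

# Step 1: store the column in a variable (a series).
finals = df_grades["Final"]

# Step 2: keep working with the stored series.
first_two_finals = finals.head(2)
```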
Slicing a Series
• Slice/index through the index, which is usually numbers.
• Picking out a single element
• Contiguous slice: the end point is non-inclusive
• Arbitrary slice
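The three kinds of series slicing can be sketched as follows, assuming the default row-number index and an invented Name column.

```python
import pandas as pd

# Hypothetical series with the default index 0..4 (values invented).
names = pd.Series(["Alice", "Bob", "Cara", "Dan", "Eve"], name="Name")

single = names[0]             # single element, looked up by index label 0
contiguous = names[1:3]       # rows 1 and 2; the end point 3 is not included
arbitrary = names[[0, 2, 4]]  # any labels, given as a list
```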
Slicing a Data Frame
• There are a few ways to slice a data frame; we will use the .loc indexer.
• Access elements through the index labels and the column names.
• We will see how to change both of these labels later on.
• Pick a single value out: give the index label (a number) and then the column name (a string).
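A minimal sketch of picking out single values with .loc, on invented data with the default row-number index.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Mini Exam 1": [18.5, 21.0, 15.5, 19.0, 22.5],
    "Final": [30, 24, 33, 27, 36],
})

# .loc uses square brackets: index label first, then column name.
first_name = df_grades.loc[0, "Name"]
third_final = df_grades.loc[2, "Final"]
```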
• Pick out an entire row, that is, “pick out all columns”. The result, first_row, is a series.
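One way to pick out an entire row, assuming a bare ":" in the column position is used to mean "all columns". The data is invented.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Mini Exam 1": [18.5, 21.0, 15.5, 19.0, 22.5],
    "Final": [30, 24, 33, 27, 36],
})

# Index label 0, all columns. The result is a series whose
# index is the dataframe's column names.
first_row = df_grades.loc[0, :]
print(first_row)
```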
• Pick out a contiguous chunk. Endpoints are inclusive!
• Pick out an arbitrary chunk.
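A sketch of both kinds of chunk with .loc, on invented data. Unlike ordinary Python slices, label slices in .loc include both endpoints.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Mini Exam 1": [18.5, 21.0, 15.5, 19.0, 22.5],
    "Final": [30, 24, 33, 27, 36],
})

# Contiguous chunk: rows 1 through 3 and columns Name through Mini Exam 1,
# with both endpoints included.
contiguous = df_grades.loc[1:3, "Name":"Mini Exam 1"]

# Arbitrary chunk: lists of index labels and column names.
arbitrary = df_grades.loc[[0, 2, 4], ["Name", "Final"]]
```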
Built in Functions
How do I compute the average score on the final?
Use the built-in mean() method.
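A minimal sketch of the mean() method on a column, with invented Final scores.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Final": [30, 24, 33, 27, 36],
})

# Select the column, then call mean() on the resulting series.
average_final = df_grades["Final"].mean()
print(average_final)
```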
How do I compute the highest Mini Exam 1 score?
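One way to answer this is the series max() method, which is not named in the text above and is assumed here. The column name and scores are invented.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Mini Exam 1": [18.5, 21.0, 15.5, 19.0, 22.5],
})

# max() returns the largest value in the column.
highest_mini_exam = df_grades["Mini Exam 1"].max()
print(highest_mini_exam)
```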
I can actually get all key stats for numeric columns at once with the describe() method. summary_df is a dataframe! Notice that here the index is not row numbers…
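A short sketch of describe() on invented data. The index of the result holds the names of the statistics, so individual values can be picked out with .loc.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Mini Exam 1": [18.5, 21.0, 15.5, 19.0, 22.5],
    "Final": [30, 24, 33, 27, 36],
})

# Only the numeric columns are summarized by default.
summary_df = df_grades.describe()
print(summary_df)

# The index labels are statistic names, not row numbers.
mean_final = summary_df.loc["mean", "Final"]
```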
Other useful built-in methods:
value_counts(): Gives a count of the number of times each unique value appears in a column. Returns a series whose index holds the unique column values.
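A minimal sketch of value_counts(). The Section column is hypothetical, added only so that some values repeat.

```python
import pandas as pd

# Hypothetical data with a repeated-value column (invented for illustration).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Section": ["A", "B", "A", "A", "B"],
})

# The result is a series indexed by the unique values.
section_counts = df_grades["Section"].value_counts()
print(section_counts)
```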
unique(): Returns an array of all of the unique values.
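A minimal sketch of unique(), again using a hypothetical Section column with repeated values.

```python
import pandas as pd

# Hypothetical data with a repeated-value column (invented for illustration).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Section": ["A", "B", "A", "A", "B"],
})

# Each distinct value appears once, in order of first appearance.
sections = df_grades["Section"].unique()
print(sections)
```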
Attributes vs. Methods
When do I put a ()?
• Dataframe attributes are features of the dataframe. They take no ().
• Dataframe methods require computation for their output. They need a ().
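A small sketch of the difference, using shape and columns as example attributes and head() and mean() as example methods. These particular examples are my choice, and the data is invented.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Mini Exam 1": [18.5, 21.0, 15.5, 19.0, 22.5],
    "Final": [30, 24, 33, 27, 36],
})

# Attributes: features of the dataframe, no ().
n_rows, n_columns = df_grades.shape
column_names = list(df_grades.columns)

# Methods: they compute something, so they need ().
top_two = df_grades.head(2)
average_final = df_grades["Final"].mean()

# Leaving off the () gives back the method itself, not its result.
forgot_parentheses = df_grades.head
```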
Creating New Columns
Let’s create a useless new column of all 1s:
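A minimal sketch of adding a constant column, on invented data. The column name Ones is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Final": [30, 24, 33, 27, 36],
})

# Assigning a single value to a new column name fills every row with it.
df_grades["Ones"] = 1
print(df_grades)
```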
We can also create a column as a function of other columns. The Final was worth 36 points, so let’s create a column for each student’s percentage on it.
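A short sketch of the percentage column, using invented scores. The new column name Final Percent is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Final": [30, 24, 33, 27, 36],
})

# Arithmetic on a column is applied to every row at once.
df_grades["Final Percent"] = df_grades["Final"] / 36 * 100
print(df_grades)
```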
Deleting Columns
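One common way to delete a column is the drop() method with the columns argument, which is an assumption about the approach intended here. The data and the Ones column are invented.

```python
import pandas as pd

# Hypothetical stand-in for the grades data (values invented).
df_grades = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Final": [30, 24, 33, 27, 36],
    "Ones": [1, 1, 1, 1, 1],
})

# drop() returns a new dataframe; the original is left unchanged
# unless the result is assigned back to the same variable.
df_without_ones = df_grades.drop(columns=["Ones"])
print(df_without_ones)
```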