13-007 Datasets and DataFrames
JUPYTER
In this task, you will be using the Jupyter Notebook. This tool is described as
follows: “The Jupyter Notebook is an open-source web application that allows you
to create and share documents that contain live code, equations, visualisations and
narrative text. Uses include data cleaning and transformation, numerical
simulation, statistical modelling, data visualisation, machine learning, and much
more.”
1. Install Jupyter
First, ensure that you have the latest pip; older versions may have
trouble with some dependencies:
pip3 install --upgrade pip
Then install the Jupyter Notebook using:
pip3 install jupyter
Once you have installed Jupyter, you can start the notebook server from the
command line:
jupyter notebook
This will print some information about the notebook server in your terminal,
including the URL of the web application. The notebook will then open in
your browser.
Once the notebook has opened, you should see the dashboard showing the
list of notebooks, files, and subdirectories in the directory you’ve opened. You
can see an example of a Jupyter Notebook below:
To start a new notebook, click the New drop-down menu and click on
Python 3.
A new Jupyter notebook will look like the screenshot below. Make sure to
change the name of the notebook to Data Sources.
In Jupyter notebooks, you can specify whether a cell contains code or Markdown.
Markdown is a lightweight markup language that is used to embed
documentation or other textual information between code cells.
You can read data from a .csv file into a DataFrame using the read_csv() function
as shown below:
import pandas as pd
insurance_df = pd.read_csv("insurance.csv")
There are also other functions that can be used to read data from different sources
into a pandas DataFrame. For example, read_excel() can be used to read data
from an Excel spreadsheet file and read_sql() can be used to load data from a SQL
database. Sometimes it is easier to extract data from other sources into a .csv file
and then read it into a DataFrame.
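As a rough sketch of reading from more than one kind of source, the example below loads the same small table once from CSV text and once from a SQL query. The CSV content, the table name insurance and the in-memory SQLite database are all made up for illustration; a real project would point at its own file or database.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for a file such as insurance.csv
csv_text = "age,charges\n19,1600.5\n33,4400.0\n"
csv_df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical in-memory SQLite database holding the same rows
conn = sqlite3.connect(":memory:")
csv_df.to_sql("insurance", conn, index=False)

# read_sql() runs the query and returns the result as a DataFrame
sql_df = pd.read_sql("SELECT age, charges FROM insurance", conn)
conn.close()

print(sql_df)
```

Both DataFrames end up with the same columns and rows, so the rest of an analysis does not need to know where the data came from.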
There are many ways to specify columns in pandas. The simplest way is to use
dictionary notation for specific columns. In essence, a pandas DataFrame can be
thought of as a dictionary: each key is a column name and each value is the
corresponding column of values.
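As a small illustration of the dictionary notation, the sketch below builds a hypothetical three-row stand-in for insurance_df (the values are made up) and selects one column by its name.

```python
import pandas as pd

# Hypothetical stand-in for the insurance dataset (made-up values)
insurance_df = pd.DataFrame({
    "age": [19, 33, 45],
    "charges": [1600.5, 4400.0, 8200.25],
})

# The column name is used like a dictionary key; the result is a pandas Series
ages = insurance_df["age"]
print(ages)
```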
To select multiple columns, you simply need to specify a list of strings with each
column name:
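A minimal sketch of selecting several columns at once, again using a hypothetical stand-in for insurance_df with made-up values. Note the double square brackets: the outer pair indexes the DataFrame and the inner pair is the list of column names.

```python
import pandas as pd

# Hypothetical stand-in for the insurance dataset (made-up values)
insurance_df = pd.DataFrame({
    "age": [19, 33, 45],
    "bmi": [27.9, 22.7, 30.1],
    "charges": [1600.5, 4400.0, 8200.25],
})

# Passing a list of column names returns a new DataFrame with only those columns
selected = insurance_df[["age", "charges"]]
print(selected)
```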
You can also select rows that satisfy a condition. For example, comparing the
sepal_length column with 4.8 filters the dataset for all entries where the
sepal_length is less than 4.8.
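The filter described above might look like the following sketch. The DataFrame name iris_df and its values are assumptions made for illustration; only the sepal_length column and the 4.8 threshold come from the text.

```python
import pandas as pd

# Hypothetical data with a sepal_length column (made-up values)
iris_df = pd.DataFrame({
    "sepal_length": [5.1, 4.6, 4.7, 4.8, 5.0],
    "species": ["setosa", "setosa", "setosa", "setosa", "setosa"],
})

# The comparison produces a boolean Series: True where the condition holds
mask = iris_df["sepal_length"] < 4.8

# Indexing with the boolean Series keeps only the rows marked True
short_sepals = iris_df[mask]
print(short_sepals)
```

The row with a sepal_length of exactly 4.8 is not kept, because the comparison is strict.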
When attempting to gain insight into your data, it is often helpful to leverage
built-in methods to process your data — for example, finding the mean or total of a
column.
Explore the pandas documentation for a list of all methods relating to general
computations and descriptive statistics.
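For instance, the mean and total of a column can be computed with the mean() and sum() methods. The sketch below uses a hypothetical stand-in for insurance_df with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the insurance dataset (made-up values)
insurance_df = pd.DataFrame({
    "age": [19, 33, 45],
    "charges": [1000.0, 2000.0, 3000.0],
})

# Mean and total of a single column
mean_charges = insurance_df["charges"].mean()
total_charges = insurance_df["charges"].sum()

print(mean_charges)   # 2000.0
print(total_charges)  # 6000.0
```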
GROUPING IN PANDAS
Data analysis can sometimes get complicated, and more advanced functionality is
needed. Let’s say you want to average the insurance charges of all people between
the ages of 30 and 35. This can be done quite easily using:
# Get people in the 30-35 age group (30 and 35 included)
between_30_and_35 = insurance_df[(insurance_df['age'] >= 30) &
                                 (insurance_df['age'] <= 35)]
# Print mean charges for all people in the 30-35 age group
print(between_30_and_35['charges'].mean())
Alternatively, you can use the pandas DataFrame.query() method, which takes a
boolean expression written as a string, as shown below:
# Use the query method to get people in the 30-35 age group (30 and 35 included)
between_30_and_35 = insurance_df.query("age >= 30 and age <= 35")
# Print mean charges for all people in the 30-35 age group
print(between_30_and_35['charges'].mean())
Now let’s say you want to average the insurance charges for each age
group. This can still be done with the syntax you know, but it would take many
lines of code, and we want to keep our code simple and concise.
Thankfully, pandas provides a groupby() method that allows us to do this with one
line of code. The groupby() method splits the rows into groups, one for each
unique value of the specified column, and tells the aggregation to work
separately on each group.
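A minimal sketch of such a grouped average, assuming each distinct value in the age column is treated as its own group. The data below is a made-up stand-in for insurance_df, and this is one plausible form of the one-line call rather than the only one.

```python
import pandas as pd

# Hypothetical stand-in for the insurance dataset (made-up values)
insurance_df = pd.DataFrame({
    "age": [30, 30, 35, 35, 40],
    "charges": [1000.0, 3000.0, 2000.0, 4000.0, 5000.0],
})

# Group the rows by age, then average the charges within each group
mean_by_age = insurance_df.groupby("age")["charges"].mean()
print(mean_by_age)
```

The result is a Series indexed by age, with one mean value per group: 2000.0 for age 30, 3000.0 for age 35 and 5000.0 for age 40.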
Practical Task 1
Open the Datasets Task.ipynb file and complete the following tasks in the
notebook. Save your notebook to your task folder for submission.
1. Write the code that performs the action described in the following
statements.
a. Select the 'Limit' and 'Rating' columns of the first five observations
b. Select the first five observations with 4 cards
c. Sort the observations by 'Education'. Show users with a high
education value first.
2. Write a short explanation in the form of a comment for the following lines of
code.
a. df.iloc[:,:]
b. df.iloc[5:,5:]
c. df.iloc[:,0]
d. df.iloc[9,:]
Practical Task 2
Open and run the example file for this task in Jupyter Notebook before attempting
this task. Follow these steps:
Lynn, S. (2018). The Pandas DataFrame – loading, editing, and viewing data in
Python. Retrieved from Shane Lynn: Pandas Tutorials:
https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
Jupyter Team. (2015). Running the Notebook — Jupyter Documentation 4.1.1 alpha
documentation. Retrieved 18 August 2020, from
https://test-jupyter.readthedocs.io/en/latest/running.html