13-007 Datasets and DataFrames


TASK

Datasets and DataFrames


Introduction
WELCOME TO THE DATASETS AND DATAFRAMES TASK!

In the context of this task, when we refer to a dataset, we are referring to a
collection of related data. This data can be manipulated in various ways
programmatically. In this task, you will be using pandas DataFrames to manipulate
data.

JUPYTER

In this task, you will be using the Jupyter Notebook. This tool is described as
follows: “The Jupyter Notebook is an open-source web application that allows you
to create and share documents that contain live code, equations, visualisations and
narrative text. Uses include data cleaning and transformation, numerical
simulation, statistical modelling, data visualisation, machine learning, and much
more.”

To use this tool, do the following:

1. Install Jupyter

● Option 1: Installing Jupyter with pip

First, ensure that you have the latest pip; older versions may have
trouble with some dependencies:

pip3 install --upgrade pip

Then install the Jupyter Notebook using:

pip3 install jupyter

● Option 2: Installing Jupyter using Anaconda

○ Download Anaconda. We recommend downloading Anaconda’s latest
Python 3 version.

○ Install the version of Anaconda which you downloaded, following
the instructions on the download page.

2. Run the Jupyter Notebook

Once you have installed Jupyter, you can start the notebook server from the
command line:
jupyter notebook

This will print some information about the notebook server in your terminal,
including the URL of the web application. The notebook will then open in
your browser.

Once the notebook has opened, you should see the dashboard showing the
list of notebooks, files, and subdirectories in the directory you’ve opened. You
can see an example of a Jupyter Notebook below:

To start a new notebook, click the New drop-down menu and click on
Python 3.
A new Jupyter notebook will look like the screenshot below. Make sure to
change the name of the notebook to Data Sources.

In Jupyter notebooks you can specify whether a cell contains code or Markdown.
Markdown is a lightweight markup language that is used to embed documentation
or other textual information between code cells.

WHAT IS A PANDAS DATAFRAME?

The pandas library documentation defines a DataFrame as a “two-dimensional,
size-mutable” data structure with labelled rows and columns.

Image source: (Petrou, 2017)


In simple terms, think of a DataFrame as a table of data with the following
characteristics (Lynn, 2018):

● “There can be multiple rows and columns in the data.
● Each row represents a sample of data.
● Each column contains a different variable that describes the samples (rows).
● The data in every column is usually the same type of data – e.g. numbers,
strings, dates.
● Usually, unlike an Excel data set, DataFrames avoid having missing values,
and there are no gaps and empty values between rows or columns.”
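
To make these characteristics concrete, here is a minimal sketch of building a
small DataFrame by hand. The column names and values below are made up purely
for illustration and are not part of this task’s datasets:

import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
# (The data below is invented for illustration only.)
people_df = pd.DataFrame({
    "name": ["Anna", "Ben", "Carla"],   # strings
    "age": [28, 34, 41],                # numbers
    "married": [True, False, True]      # booleans
})

# Each row is one sample (person); each column is one variable.
print(people_df)
print(people_df.dtypes)  # shows that each column holds a single data type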

You can read data from a .csv file into a DataFrame using the read_csv() function
as shown below:

insurance_df = pd.read_csv("insurance.csv")

There are also other functions that can be used to read data from different sources
into a pandas DataFrame. For example, read_excel() can be used to read data
from an Excel spreadsheet file and read_sql() can be used to load data from a SQL
database. Sometimes it is easier to extract data from other sources into a .csv file
and then read it into a DataFrame.
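
As a rough sketch of these alternative readers (the Excel file name, sheet name,
database file, and table name below are placeholders, not files supplied with
this task):

import sqlite3
import pandas as pd

# Read one sheet from an Excel workbook (reading .xlsx files requires an
# engine such as openpyxl to be installed).
excel_df = pd.read_excel("insurance.xlsx", sheet_name="Sheet1")

# Read the result of a SQL query into a DataFrame. "insurance.db" and the
# "policies" table are placeholder names for illustration only.
connection = sqlite3.connect("insurance.db")
sql_df = pd.read_sql("SELECT * FROM policies", connection)
connection.close()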

SELECTING COLUMNS IN PANDAS

There are many ways to select columns in pandas. The simplest way is to use
dictionary notation for specific columns. In essence, a pandas DataFrame can be
thought of as a dictionary: the key is the column name and the value is the
corresponding column of values.

# Select only the species column
just_the_species = iris_df['species']
just_the_species.sample(5)

To select multiple columns, you simply need to specify a list of strings with each
column name:

# Select columns with sepal and petal information
sepal_and_petal_info = iris_df[['sepal_length', 'sepal_width',
                                'petal_length', 'petal_width']]
sepal_and_petal_info.sample(5)

You can also choose specific values to be included in your search (i.e. omit
certain rows from the results).

# Filter for specific values in a column
small_sepal_length = iris_df[iris_df['sepal_length'] < 4.8]
small_sepal_length.sample(5)

In essence, we are filtering the dataset for all entries where the sepal_length is less
than 4.8.

BUILT-IN DATAFRAME METHODS

When attempting to gain insight into your data, it is often helpful to leverage
built-in methods to process your data — for example, finding the mean or total of a
column.

Here is a list of common built-in computational and statistical methods in pandas:

● mean(): mean for each column
● min(): minimum for each column
● max(): maximum for each column
● std(): standard deviation for each column
● var(): variance for each column
● nunique(): number of unique values in each column
● count(): number of cells for each column or row that are not empty or
undefined (e.g. NaN)
● sum(): sum of values for each column or row

Explore the pandas documentation for a list of all methods relating to general
computations and descriptive statistics.
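
As an illustration only, a few of these methods could be applied as follows,
assuming the insurance_df DataFrame loaded earlier, with the numeric 'age' and
'charges' columns used in the grouping examples below:

# insurance_df is assumed to have been loaded with pd.read_csv("insurance.csv").
print(insurance_df['charges'].mean())   # average charge
print(insurance_df['age'].min())        # youngest age in the data
print(insurance_df['age'].max())        # oldest age in the data
print(insurance_df['charges'].std())    # standard deviation of charges
print(insurance_df['age'].nunique())    # number of distinct ages
print(insurance_df['charges'].count())  # number of non-missing charge values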

GROUPING IN PANDAS

Data analysis can sometimes get complicated, and more advanced functionality is
needed. Let’s say you want to average the insurance charges of all people between
the ages of 30 and 35. This can be done quite easily using:

# Get people in the 30-35 age group
between_30_and_35 = insurance_df[(insurance_df['age'] > 30) &
                                 (insurance_df['age'] < 35)]

# Print mean charges for all people in the 30-35 age group
print(between_30_and_35['charges'].mean())

Alternatively, you can also use the pandas DataFrame.query() method, which takes
a boolean expression string as an argument, as shown below:

# Use the query method to get people in the 30-35 age group
between_30_and_35 = insurance_df.query("age > 30 and age < 35")

# Print mean charges for all people in the 30-35 age group
print(between_30_and_35['charges'].mean())

Now let’s say you want to average the insurance charges of every person in each
age group. This can still be done with the syntax you already know, but it would
take many lines of code, and we want to keep our code simple and concise.
Thankfully, pandas provides a method that allows us to do this in one line of
code:

# Get the mean charges for each age
print(insurance_df.groupby('age')['charges'].mean())

The groupby() method tells pandas to apply the aggregation separately to each
unique group in the specified column.
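
As a further, optional sketch, groupby() can also be combined with the agg()
method to compute several statistics per group in one call, using the same
insurance_df, 'age', and 'charges' names as above:

# Mean, minimum, maximum, and count of charges for each age group.
charges_by_age = insurance_df.groupby('age')['charges'].agg(['mean', 'min', 'max', 'count'])
print(charges_by_age.head())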

For more information about working with Jupyter, please consult the first
chapter (“IPython: Beyond Normal Python”) in the book “Python Data Science
Handbook” by Jake VanderPlas.

Instructions
● Follow the instructions in this task to install Jupyter Notebook.
● In your command line interface, change directory (cd) to the current folder.
● Open Jupyter Notebook by typing: jupyter notebook
● Within this task folder, you will find Jupyter notebook examples. You can
open and explore them by going to Jupyter’s home screen and
double-clicking on the notebook.

○ Dataset_Examples.ipynb for Practical Task 1.
○ Report_Example.ipynb for Practical Task 2.

Practical Task 1

Open the Datasets Task.ipynb file and complete the following tasks in the
notebook. Save your notebook to your task folder for submission.
1. Write the code that performs the action described in the following
statements.
a. Select the 'Limit' and 'Rating' columns of the first five observations
b. Select the first five observations with 4 cards
c. Sort the observations by 'Education'. Show users with a high
education value first.
2. Write a short explanation in the form of a comment for the following lines of
code.
a. df.iloc[:,:]
b. df.iloc[5:,5:]
c. df.iloc[:,0]
d. df.iloc[9,:]
Practical Task 2

Open and run the example file for this task in Jupyter Notebook before attempting
this task. Follow these steps:

● Open the Report.ipynb in this folder.
● Create a DataFrame that contains the data in balance.txt.
● Write the code needed to produce a report that provides the following
information:
○ Compare the average income based on ethnicity.
○ On average, do married or single people have a higher balance?
○ What is the highest income in our dataset?
○ What is the lowest income in our dataset?
○ How many cards do we have recorded in our dataset? (Hint: use
sum())
○ How many females do we have information for vs how many males?
(Hint: use count())

HyperionDev strives to provide internationally excellent course content that
helps you achieve your learning outcomes.

Think that the content of this task, or this course as a whole, can be improved? Do
you think we’ve done a good job?

Click here to share your thoughts anonymously.


REFERENCES

Lynn, S. (2018). The Pandas DataFrame – loading, editing, and viewing data in
Python. Retrieved from Shane Lynn: Pandas Tutorials:
https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-
data-in-python/

Jupyter Team. (2015). Running the Notebook — Jupyter Documentation 4.1.1 alpha
documentation. Retrieved 18 August 2020, from
https://test-jupyter.readthedocs.io/en/latest/running.html

Petrou, T. (2017, October 27). Dissecting the anatomy of a DataFrame. Retrieved
from Pandas Cookbook: Recipes for Scientific Computing, Time Series Analysis and
Data Visualization using Python:
https://subscription.packtpub.com/book/big_data_and_business_intelligence/
9781784393878/1/ch01lvl1sec12/dissecting-the-anatomy-of-a-dataframe
