Data Science Visualization in R

This document provides an introduction to data visualization. It discusses how visualization can easily communicate information that is difficult to extract from tables of raw data. Effective visualization can help discover biases or errors in data before analysis. The document covers different types of variables - categorical (ordinal, non-ordinal) and numerical (discrete, continuous) - and how understanding the variable types is important for deciding how to visualize the data. It also introduces the concept of data distributions and how visualization can be used to summarize and analyze distributions.

Data Science: Visualization

Section 1: Introduction to Data Visualization and Distributions / 1.1 Introduction to Data Visualization / Introduction to Data Visualization
RAFAEL IRIZARRY: Looking at the numbers and character strings

in the final dataset is rarely useful.

To convince yourself, print and stare at the US murders data table.

You can do this with this code.

What do you learn from staring at this table?

How quickly can you determine which states have the largest populations?

Which states have the smallest?

How large is a typical state?

Is there a relationship between population size and total murders?

How do the murder rates vary across regions of the country?

For most human brains, it is quite difficult

to extract this information just from looking at the numbers.

In contrast, the answers to all of these questions

are readily available from examining this plot.

We are reminded of the saying, "A picture is worth a thousand words."

Data visualization provides a powerful way

to communicate a data driven finding.

In some cases, the visualization is so convincing that no follow up analysis

is required.

The growing availability of informative data sets and software tools

has led to increased reliance on data visualization

across many industries, academia, and government.

Visible examples are news organizations that are increasingly

embracing data journalism and including effective infographics

and charts as part of their reporting.

A particularly effective example is a Wall Street Journal article


showing data related to the impact of vaccines

on battling infectious diseases.

One of the graphs shows measles cases by US state

through the years with a vertical line demonstrating

when the vaccine was introduced.

Another striking example comes from the New York Times.

This article showed data on scores from the New York City regents exam.

These scores are collected for several reasons,

including to determine if a student graduates from high school.

In New York City, you need a 65 to pass.

The distribution of the test scores forces

us to notice something somewhat problematic.

The most common test score is the minimum passing grade,

with very few scores just below that value.

This unexpected result is consistent with students

close to passing having their scores bumped up.

This is an example of how data visualization can

lead to discoveries which would otherwise

be missed if we simply subject the data to a battery of data analysis tools

or procedures.

Data visualization is the strongest tool of what

we call exploratory data analysis.

John Tukey, considered the father of exploratory data analysis,

once said, the greatest value of a picture

is when it forces us to notice what we never expected to see.

We note that many widely used data analysis

tools were initiated by discoveries made with exploratory data analysis.

Exploratory data analysis is perhaps the most important part of data analysis,

yet it is often overlooked.

Data visualization is also now pervasive in philanthropic

and educational organizations.
One example comes from Gapminder. In the talks New Insights on Poverty

and The Best Stats You've Ever Seen, Hans Rosling

forced us to notice the unexpected with a series of plots

related to world health and economics.

In his videos, he used animated graphs to show us how the world was changing,

and how old narratives are no longer true.

It is also important to note that mistakes, biases, systematic errors,

and other unexpected problems often lead to data

that should be handled with care.

Failure to discover these problems often leads to flawed analyses

and false discoveries.

As an example, consider that measurement devices sometimes fail

and that most data analysis procedures are not designed to detect these.

Yet these procedures will still give you an answer.

The fact that it can be hard or impossible to notice an error just

from the reported results makes data visualization particularly important.

In this course, we will learn the basics of data visualization

and exploratory data analysis.

We will use motivating examples.

We will use the ggplot2 package to code.

To learn the basics, we will start with a somewhat artificial example, heights

reported by students.

Then we will cover two of the examples we mentioned,

world health and economics, and infectious diseases

trends in the United States.

Note that there's much more to data visualization than

will be covered here.

But you will get a very good introduction to the topic.

Textbook Link
This video corresponds to the textbook introduction to the data visualization
section.
Key points

 Plots of data easily communicate information that is difficult to extract from


tables of raw values.

 Data visualization is a key component of exploratory data analysis (EDA), in


which the properties of data are explored through visualization and
summarization techniques.

 Data visualization can help discover biases, systematic errors, mistakes and
other unexpected problems in data before those data are incorporated into
potentially flawed analysis.

 This course covers the basics of data visualization and EDA in R using
the ggplot2 package and motivating examples from world health,
economics and infectious disease.

Code

library(dslabs)

data(murders)

head(murders)
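The plot mentioned in the transcript is not reproduced in these notes. As a rough sketch of the kind of figure that makes those questions easy to answer (assuming the dslabs and ggplot2 packages are installed; this is an illustration, not the exact course figure), one could plot total murders against population:

# scatterplot of total murders versus population, colored by region;
# log scales spread out the many small states
library(dslabs)
library(ggplot2)
data(murders)
ggplot(murders, aes(x = population/10^6, y = total, color = region)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Population (millions, log scale)", y = "Total murders (log scale)")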

Section 1: Introduction to Data Visualization and Distributions / 1.1 Introduction to Data Visualization / Introduction to Distributions
RAFAEL IRIZARRY: You may have noticed that numerical data is often

summarized with an average value.

For example, the quality of a high school is sometimes

summarized with one number--

the average score in a standardized test.

Occasionally, a second number is reported

as well-- the standard deviation.

So, for example, you might read a report stating that scores at this high school

were 680 plus or minus 50.

The last number is the standard deviation.

Note that the report has summarized an entire vector of scores


with just two numbers.

Is this appropriate?

Is there any important piece of information

we're missing by only looking at this summary rather than the entire list?

It turns out that in some cases, these two numbers

are pretty much all we need to understand the data.

Data visualization techniques will help us determine when

this two-number summary is appropriate.

These same techniques will serve as alternatives for when

these two numbers are not enough.

Our first data visualization building block

is learning to summarize lists of factors or numeric vectors.

The most basic statistical summary of a list of objects or numbers

is its distribution.

Once a vector has been summarized as a distribution,

there are several data visualization techniques to effectively relay

this information.

Textbook Link
This video corresponds to the textbook introduction to the section on visualizing
distributions.

Key points

 The most basic statistical summary of a list of objects is its distribution.

 We will learn ways to visualize and analyze distributions in the upcoming


videos.

 In some cases, data can be summarized by a two-number summary: the


average and standard deviation. We will learn to use data visualization to
determine when that is appropriate.
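As a quick illustration of such a two-number summary in R, here is a sketch using a made-up vector of test scores (the numbers are invented for illustration only, not course data):

# hypothetical test scores, used only to illustrate the two-number summary
scores <- c(650, 700, 680, 720, 640, 710, 690)
c(average = mean(scores), sd = sd(scores))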
Section 1: Introduction to Data Visualization and Distributions / 1.1 Introduction to Data Visualization / Data Types

RAFAEL IRIZARRY: An important first step in deciding how to visualize data

is to know what type of data it is.

We will be working with two types of variables-- categoricals

and numericals.

Each can be divided into two further groups.

Categoricals can be divided into ordinals and non-ordinals.

And numerical variables can be divided into discrete or continuous.

Variables that are defined by a small number of groups


we call categorical data.

Two simple examples are sex, male or female, or regions of the states

that we looked at in the first course--

Northeast, South, North Central, West.

Some categorical data can be ordered.

For example, spiciness can be mild, medium, or hot.

Even if they are not numbers per se, they can still be ordered.

In statistics textbooks they sometimes refer to these as ordinal data.

The other data type is numerical.

Examples that we have seen or will see are population sizes, murder rates,

and heights.

We can further divide numerical data into continuous and discrete.

Continuous variables are those that can take any value such as heights

if measured with enough precision.

For example, a pair of twins may be 68.12 inches and 68.11 inches tall, respectively.

Counts such as population sizes are discrete

because they have to be whole numbers.

Note that discrete numeric data can be considered ordinal.

An example are heights rounded to the nearest inch.

Although this is technically true, we usually

reserve the term "ordinal data" for variables


belonging to a small number of different groups

with each group having many members.

In contrast, when we have many groups with few cases in each group,

we typically refer to this as discrete numerical variables.

So for example, the number of packs of cigarettes a person smokes a day

rounded to the closest pack-- so 0, 1, or 2--

would be considered ordinal.

While the number of cigarettes that we smoke--

0, 1, 2, 3 up to maybe 36--

would be considered a numerical variable.


But indeed, these examples can be considered both

when it comes to data visualization.

Now that we've learned about the different data types,

we're ready to learn about data visualization techniques.

Textbook Link
This video corresponds to the textbook section on variable types.

Key points

 Categorical data are variables that are defined by a small number of groups.

 Ordinal categorical data have an inherent order to the categories


(mild/medium/hot, for example).

 Non-ordinal categorical data have no order to the categories.

 Numerical data take a variety of numeric values.

 Continuous variables can take any value.

 Discrete variables are limited to sets of specific values.
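To make the variable types concrete, here is a small hypothetical R sketch (the vectors below are invented examples, not course data):

# non-ordinal categorical: region labels with no natural order
region <- factor(c("Northeast", "South", "North Central", "West"))
# ordinal categorical: spiciness levels have a natural order
spiciness <- factor(c("mild", "medium", "hot"),
                    levels = c("mild", "medium", "hot"), ordered = TRUE)
# discrete numerical: counts must be whole numbers
packs_per_day <- c(0L, 1L, 2L)
# continuous numerical: can take any value with enough precision
height <- c(68.12, 68.11)
str(region)
str(spiciness)
str(packs_per_day)
str(height)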


Section 1: Introduction to Data Visualization and Distributions / 1.2 Introduction to Distributions / Describe Heights to ET

RAFAEL IRIZARRY: Here we introduce a new motivating problem.

It is an artificial one, but it'll help us illustrate the concepts needed

to understand distributions.

Later, we'll move on to more realistic examples.

Pretend that we have to describe the heights of our classmates

to ET, an extraterrestrial that has never seen humans.

As a first step, we need to collect data.

To do this, we asked students to report their heights in inches.


We asked them to provide sex information because we

know there are two different groups of heights, males and females.

We collect the data and save it in a data frame.

We've done that already, and here are the first six entries.

One way to convey the heights to ET is to simply

send him the list of 924 heights.

But there are much more effective ways to convey this information,

and understanding the concepts of distributions will help.

To simplify the explanation at first, we focus on male heights.

The most basic statistical summary of a list of objects or numbers

is its distribution.

The simplest way to think of a distribution

is as a compact description of a list with many elements.

This concept should not be new for most of us.

For example, with categorical data, the distribution

simply describes the proportions of each unique category.

For example, the sex represented in the heights data set

can be summarized by the proportions of each of the two categories,

female and male.

This two category frequency table is the simplest form

of a distribution we can form.


Now, we don't really need to visualize it

since one number describes everything we need to know about this data set.

23% are females and the rest are males.

That's all we need to know.

When there are more categories, then a simple bar plot

describes the distribution.

Here's an example showing the proportions of state regions

for the 50 states, plus DC, that we looked at earlier.

We can see in the plot that the height of the bar

tells us the proportion of each of the four regions--


Northeast, South, North Central, and West.

Although this particular plot, a graphical representation of a frequency

table, does not provide much more insight than the table itself,

it is a first example of how we convert a vector

into a visualization that succinctly summarizes

all the information in that vector.

When the data is numerical, the task is much more challenging.

Why is it more complicated?

Well, in general, when data is not categorical,

reporting the frequency of each unique entry

is not an effective summary since most entries are unique.

For example, while several students reported a height of 68 inches,

only one student reported a height of 68.503937007874 inches.

And only one student reported a height of 68.8976377952756 inches.

We assume that they converted from 174 centimeters and 175 centimeters

to inches respectively.

Statistics textbooks teach us that a more useful way

to define a distribution for numerical data

is to define a function that reports the proportion of the data

below a value A for all possible values of A.

This function is called a cumulative distribution function or CDF.


In statistical textbooks, the following mathematical notation is used.

We define a function f of a and make that equal to the proportion of values

x less than or equal to a, which is represented with this Pr, meaning

proportion or probability, and then in parentheses

the event that we require, x less than or equal to a.

Here's a plot of the function f for the height data.

Like the frequency table does for categorical data,

the CDF defines the distribution for numerical data.

From the plot we can see, for example, that 14% of the students in our class

have heights below 66 inches.


How do we know this?

Because if we look at the value of f at 66, it's .138.

We also see that 84% of our students have heights below 72 inches.

This is because if we look at the value of f at 72, it's 0.839.

In fact, we can report the proportion of values between any two heights,

say a and b, by computing f of b, and then subtracting f of a.

This means that if we send this plot to ET,

he will have all the information needed to construct the entire list.

Final note, because CDFs can be determined mathematically,

as opposed to using data as we do here, the word empirical

is added to distinguish, and we use the term empirical CDF or ECDF.

Although the CDF provides all the information we need and it is widely

discussed in statistics textbooks, the plot

is actually not very popular in practice.

The main reason is that it does not easily

convey characteristics of interest, such as,

at what value is the distribution centered,

is the distribution symmetric, what range contains 95% of the data.

We can decipher all this from the plot, but it's not that easy.

Histograms are much preferred because they greatly

facilitate answering such questions.


Histograms sacrifice just a bit of information

to produce plots that are much easier to interpret.

The simplest way to make a histogram is to divide the span of our data

into non-overlapping bins of the same size.

Then for each bin, we count the number of values that fall in that interval.

The histogram plots these counts as bars, with the base of each bar being the interval.

Here's a histogram for the height data where we split the range of values

into one inch intervals.

So we start with the interval 53.5 to 54.5,

then divide the rest into one inch intervals,


all the way up to 80.5 to 81.5.

If we send this plot to ET, he will immediately

learn some important properties about our data.

First, the range of data is from 55 to 81,

but perhaps more importantly, he learns that the majority, more than 95%,

are between 63 and 75 inches.

Second, the heights are close to symmetric around 69 inches.

Also note, that by adding up counts, ET could

obtain very good approximations of the proportion of the data in any interval.

So it provides almost all the information

that we provided with the CDF.

We provide almost all the information contained

in the raw data, which is 708 heights, with just 23 bin counts.

So why is it an approximation?

What information do we lose?

Note that all values in each interval are treated as the same

when computing the bin heights.

So for example, the histogram does not distinguish

between 64, 64.1, and 64.2 inches.

Given that these differences are almost unnoticeable to the eye,

in this particular case, the practical implications are negligible.


Textbook links
This video corresponds to the following sections:

 textbook case study of describing student heights

 textbook section on the distribution function

 textbook section on the cumulative distribution function

 textbook section on histograms

Key points

 A distribution is a function or description that shows the possible values of a


variable and how often those values occur.

 For categorical variables, the distribution describes the proportions of each


category.

 A frequency table is the simplest way to show a categorical distribution.


Use prop.table() to convert a table of counts to a frequency
table. Barplots display the distribution of categorical variables and are a way to
visualize the information in frequency tables.

 For continuous numerical data, reporting the frequency of each unique entry is not
an effective summary as many or most values are unique. Instead, a distribution
function is required.

 The cumulative distribution function (CDF) is a function that reports the proportion


of data below a value a for all values of a: F(a)=Pr(x≤a).
 The proportion of observations between any two values a and b can be computed
from the CDF as F(b)−F(a).
 A histogram divides data into non-overlapping bins of the same size and plots the
count of values that fall in each bin.

Code

# load the dataset

library(dslabs)

data(heights)
# make a table of category proportions

prop.table(table(heights$sex))

Creating other graphs from the video


You will learn to create barplots and histograms in a later section of this course.

Every continuous distribution has a cumulative distribution function (CDF). The


CDF defines the proportion of the data below a given value a for all values of a:
F(a)=Pr(x≤a)

Any continuous dataset has a CDF, not only normal distributions. For example, the
male heights data we used in the previous section has this CDF:

As defined above, this plot of the CDF for male heights has height values a on the
x-axis and the proportion of students with heights of that value or lower ( F(a)) on
the y-axis.

The CDF is essential for calculating probabilities related to continuous data. In a


continuous dataset, the probability of a specific exact value is not informative
because most entries are unique. For example, in the student heights data, only
one individual reported a height of 68.8976377952756 inches, but many students
rounded similar heights to 69 inches. If we computed exact value probabilities, we
would find that being exactly 69 inches is much more likely than being a non-
integer exact height, which does not match our understanding that height is
continuous. We can instead use the CDF to obtain a useful summary, such as the
probability that a student is between 68.5 and 69.5 inches. 

For datasets that are not normal, the CDF can be calculated manually by defining a
function to compute the probability above. This function can then be applied to a
range of values across the range of the dataset to calculate a CDF. Given a
dataset my_data, the CDF can be calculated and plotted like this:

a <- seq(min(my_data), max(my_data), length = 100)    # define range of values spanning the dataset

cdf_function <- function(x) {    # computes prob. for a single value
  mean(my_data <= x)
}

cdf_values <- sapply(a, cdf_function)

plot(a, cdf_values)

The CDF defines the proportion of data below a cutoff a. To define the proportion
of values above a, we compute:
1−F(a)
To define the proportion of values between a and b, we compute:
F(b)−F(a)
Note that the CDF can help compute probabilities. The probability of observing a
randomly chosen value between a and b is equal to the proportion of values
between a and b, which we compute with the CDF.
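For example, the proportion of male students with heights between 66 and 72 inches could be computed from the empirical CDF like this (a minimal sketch, assuming the dslabs heights data used elsewhere in these notes):

library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
cdf <- function(a) mean(x <= a)   # empirical CDF for the male heights
cdf(72) - cdf(66)                 # proportion of values between 66 and 72 inches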

Section 1: Introduction to Data Visualization and Distributions / 1.2 Introduction to Distributions / Smooth Density Plots

RAFAEL IRIZARRY: Smooth density plots are similar to histograms,

but are aesthetically more appealing.

Here's what a smooth density plot looks like for our male heights data.

We're going to learn how to make these plots and what they mean.

Note that we no longer have sharp edges at the interval boundaries

and that many of the local peaks have been removed.


Also notice that the scale of the y-axis changed from counts to something

new called density.

To understand the smooth densities, we have

to understand estimates, a topic we don't cover until later.

However, we provide a heuristic explanation

to help you understand the basics, and so you

can use this useful data visualization tool and learn how to interpret it.

The main new concept you have to understand

is that we assume that our list of observed heights

comes from a much, much larger list of unobserved values.


So in the case of male heights, you can imagine

our list of student heights comes from a hypothetical list containing

all the heights of all the male students in all the world measured

very precisely.

So as an example, let's suppose that we have a million of these.

This list of values, like any other list of values, has a distribution.

And this is really what we want to report

to ET, since it is much more general.

Unfortunately, we don't get to see it.

However, we can make an assumption that helps us perhaps approximate.

Because we're assuming that we have a million values measured very precisely,

we can make a histogram with very, very small bins.

This is going to help us understand what smooth densities are.

The assumption is that if we do this, consecutive bins will be similar.

This is what we mean by smooth.

We don't have big jumps from bin to bin.

Here's a hypothetical histogram with a bin size of 1

from the hypothetical list of 1 million values.

And here's what happens when we make the bins smaller and smaller.

You can see that the edges start disappearing and the histogram

becomes smoother.
For this hypothetical situation, the smooth density

is basically the curve that goes through the top of the histogram bars

when the bins are very, very small.

You can see it in the plot as a blue curve.

To make the curve not depend on the hypothetical size

of the hypothetical list, we compute the curve on the frequency scale

rather than the count scale.

OK.

Now back to reality.

With our example, we don't have millions of measurements.


Instead, we have 708 and we can't make a histogram with very, very small bins.

So how can we estimate this hypothetical smooth curve

that we would see if we could see all the measurements?

So what we're going to do is we're going to start

by making a histogram with our data, compute frequencies rather than counts,

and using the bin size appropriate for our data,

so the histogram looks like this.

Because it is a small sample, we get unsmooth variation in the bar heights.

To smooth the histogram what we're going to do

is we're going to start by keeping the heights of the histogram bar shown here

with little points.

We keep these points.

And now we draw a smooth curve that goes through the top

of these histogram bars.

That's the blue curve.

And we get what we call the smooth density.

Here's what the final product looks like.

Before we continue, note that smooth is a relative term.

We can actually control the smoothness of the curve that

defines a smooth density through an option in the ggplot2 function

that computes the smooth density.


Here are two examples using different degrees

of smoothness on the same histogram.

On the left, we see a curve that is more jagged.

It goes up and down quickly.

And on the right, we see a curve that goes up and then goes down smoothly.

We need to make this choice with care, as the resulting visualization

can change our interpretation of the data.

We should select a degree of smoothness that we

can defend as being representative of the underlying data.

In the case of height, we really do have reason


to believe that the proportions of people

with similar heights should be about the same.

For example, the proportion that is 72 inches

should be more similar to the proportion that is 71 inches

than the proportion that is 78 or 65.

This implies that the curve should be pretty smooth, more

like the example on the right than the example on the left.

Always keep in mind that while the histogram is an assumption-free

summary, the smooth density is based on assumptions and choices

that you make as a data analyst.

Finally, to be able to interpret density plots,

we need to understand the units of the y-axis.

We point out that interpreting the y-axis of a smooth density plot

is not straightforward.

It is scaled so that the area under the density curve adds up to 1.

So if you imagine, you form a bin with a base that is 1 unit long.

The y-axis value tells us the proportion of values in that bin.

But this is only true if the bin is of size 1.

For other size intervals, the best way to determine the proportion of data

in that interval is by computing the proportion of the total area contained

in that interval.
Here's an example, if we take the proportion of values between 65 and 68,

it'll equal the proportion of the graph that is in that blue region

that we're showing you right there.

The proportion of this area is about 0.31, meaning that about 31%

of our values are between 65 and 68 inches.

Now that we understand this, we are ready to use a smooth density

as a summary.

For this dataset, we would feel quite comfortable

with the smoothness assumption.

And therefore, we're sharing the aesthetically pleasing figure


with ET, which you could use to understand our male height data.

As a final note, we point out that an advantage

of smooth densities over histograms is that it makes it easier

to compare two distributions.

This is in large part because the jagged edges of the histogram add clutter,

so when we're comparing two histograms, it makes it a little bit hard to see.

Here's an example of what it looks like when you use density plots.

With the right argument, ggplot2 automatically

shades the intersecting regions with different colors.

So it makes it very easy to get an idea of what the two distributions are.

Textbook link
This video corresponds to the textbook section on smooth density plots.

Key points

 Smooth density plots can be thought of as histograms where the bin width is


extremely or infinitely small. The smoothing function makes estimates of the true
continuous trend of the data given the available sample of data points.

 The degree of smoothness can be controlled by an argument in the plotting


function. (We will learn functions for plotting later.)

 While the histogram is an assumption-free summary, the smooth density plot is


shaped by assumptions and choices you make as a data analyst.
 The y-axis is scaled so that the area under the density curve sums to 1. This
means that interpreting values on the y-axis is not straightforward. To determine
the proportion of data in between two values, compute the area under the smooth
density curve in the region between those values.

 An advantage of smooth densities over histograms is that densities are easier to


compare visually.

A further note on histograms


Note that the choice of binwidth has a determinative effect on shape. There is no
"correct" choice for binwidth, and you can sometimes gain insights into the data by
experimenting with binwidths.
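As a rough ggplot2 sketch of the plots described in this section (assuming the dslabs and ggplot2 packages are installed; the options shown are illustrative, not the exact course figures):

library(dslabs)
library(ggplot2)
data(heights)
male <- heights[heights$sex == "Male", ]

# histogram with 1-inch bins
ggplot(male, aes(x = height)) +
  geom_histogram(binwidth = 1)

# smooth density; the adjust argument controls the degree of smoothness
ggplot(male, aes(x = height)) +
  geom_density(adjust = 2)

# comparing two distributions; alpha makes the overlapping densities visible
ggplot(heights, aes(x = height, fill = sex)) +
  geom_density(alpha = 0.2)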

Section 1: Introduction to Data Visualization and Distributions / 1.2 Introduction to Distributions / Normal Distribution

Normal Distribution
RAFAEL IRIZARRY: Histogram and density plots

provide excellent summaries of a distribution.

But can we summarize even further?

We often see the average and the standard deviation

used as a summary statistic for a list of numbers, a two number summary.

To understand what these summaries are and why they are so widely used,

we need to understand the normal distribution.

The normal distribution, also known as the bell curve

and as the Gaussian distribution, is one of the most famous

mathematical concepts in history.

A reason for this is that approximately normal distributions

occur in many situations.

Examples include gambling winnings, heights, weights, blood pressure,

standardized test scores, and experimental measurement error.

There are explanations for this, but we explain it later.

Here we focus on how the normal distribution helps us summarize data.

Rather than using data, the normal distribution

is defined with a mathematical formula.


Let's see what this is.

For any interval, a and b, the proportion of values in the interval

can be computed using this formula.

You don't need to memorize or understand the details of the formula.

We're going to have R code that computes it for us.

But note that it is completely defined by just 2 parameters, m and s.

The rest of the symbols in the formula represent the interval ends, a and b,

and known mathematical constants, pi and e.

These two parameters, m and s, are referred to as the average, also

called the mean, that's why we use the letter m,


and the standard deviation of the distribution respectively.

The distribution is symmetric, centered at the average,

and most values, about 95%, are within two standard deviations

from the average.

Here's what the distribution looks like when the average is zero

and the standard deviation is one.

The fact that the distribution is defined by just two parameters

implies that if a dataset is approximated by a normal distribution,

all information needed to describe this distribution

can be encoded in just two numbers, the average and the standard deviation,

which we now define for an arbitrary list of numbers.

For a list of numbers contained in a vector that we'll call x,

the average is simply defined as the sum of x divided by the length of x.

Here's the code that you use in R. And the standard deviation

is defined with the following formula.

It's the square root of the sum of the squared differences between the values

and the mean, divided by the length.

You can think of this as the average distance

between the values and their average.

Let's compute the average and the standard deviation

for the male heights, which we will store in an object


called x using this code.

The pre-built functions mean and sd can be used here.

We don't have to type out the code that we showed before,

because these functions are so common that pre-built functions exist.

So the code would actually look like this.

We see that the average height is 69.31 inches

and the standard deviation is 3.61 inches.

So here is a plot of the smooth density of our male heights.

We're making no assumptions here, other than assuming it's somewhat smooth.

And then overlaid with a black curve is the normal distribution


that has average 69.31, and standard deviation 3.61.

Note how close those two are.

This is telling us that the normal distribution approximates

the distribution of our male heights.

Before we continue, let's introduce the concept of standard units.

For data that is approximately normal, it

is convenient to think in terms of standard units.

The standard unit of a value tells us how many standard deviations away

from the average this value is.

Specifically for a value x, we define the standard unit

as z equals x minus the average divided by the standard deviation.

If you look back at the formula for the normal distribution,

you notice that what is being exponentiated is negative z squared over 2.

The maximum of e to the minus z squared over 2

is when z equals 0, which explains why the maximum of the distribution

is at the average.

It also explains the symmetry, since negative z squared

is symmetric around 0.

To understand why standard units are useful,

notice that if we convert normally distributed data into standard units,

we can quickly know if, for example, a person is about average height--that
would mean z equals 0.

A person that is tall would be z equals 2.

A person that is short would be z equals negative 2.

And extremely rare occurrences, say a 7 footer or something like that,

would have a z bigger than 3.

Or someone very short would have a z smaller than negative 3.

Note that it does not matter what the original units are.

These rules apply to any data that is approximately normal.

In R, we can quickly obtain standard units using the function scale.

Type in this code.


Now back to our example.

To see how many men are within two standard deviations from the average,

now that we were already converted to standard units, all we have to do

is count the number of z's that are less than 2 and bigger than negative 2,

and then divide by the total.

So we take the mean of this quantity that you see here in the code,

and we see that the proportion is 0.951.

Note that it's about 95%, which is exactly

what the normal distribution predicts.

So it's quite useful.

If we can assume that the data is approximately normal,

at least for this interval, we can predict the proportion

without actually looking at the data.

We simply know that 95% of the data for normally

distributed data is between negative 2 and 2.

Now to further confirm that in fact the approximation is a good one,

we need to look at other intervals.

And for this, we will use quantile plots.

Textbook link
For more information, consult this textbook section on the normal distribution.
Correction
At 3:27 and 3:50, the audio gives incorrect values for the average and standard
deviation. The code on screen and the transcript are correct.

Key points

 The normal distribution:

o Is centered around one value, the mean

o Is symmetric around the mean

o Is defined completely by its mean (μ) and standard deviation ( σ )

o Always has the same proportion of observations within a given distance of


the mean (for example, 95% within 2 σ)

 The standard deviation is the average distance between a value and the mean
value.

 Calculate the mean using the mean() function.

 Calculate the standard deviation using the sd() function or manually. 

 Standard units describe how many standard deviations a value is away from the
mean. The z-score, or number of standard deviations an observation x is away
from the mean μ:

Z = (x − μ) / σ

 Compute standard units with the scale() function.

 Important: to calculate the proportion of values that meet a certain condition, use
the mean() function on a logical vector. Because TRUE is converted to 1
and FALSE is converted to 0, taking the mean of this vector yields the proportion
of TRUE.

Equation for the normal distribution


The normal distribution is mathematically defined by the following formula for any
mean μ and standard deviation σ:

Pr(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx

Code
# define x as vector of male heights

library(tidyverse)

library(dslabs)

data(heights)

index <- heights$sex=="Male"

x <- heights$height[index]

# calculate the mean and standard deviation manually

average <- sum(x)/length(x)

SD <- sqrt(sum((x - average)^2)/length(x))

# built-in mean and sd functions - note that the audio and printed values disagree

average <- mean(x)

SD <- sd(x)

c(average = average, SD = SD)

# calculate standard units

z <- scale(x)

# calculate proportion of values within 2 SD of mean

mean(abs(z) < 2)

Note about the sd function


The built-in R function sd() calculates the standard deviation, but it divides
by length(x)-1 instead of length(x). When the length of the list is large, this
difference is negligible and you can use the built-in sd() function. Otherwise, you
should compute σ by hand. For this course series, assume that you should use
the sd() function unless you are told not to do so.
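A small illustration of the difference between the two divisors, using a made-up vector rather than course data:

v <- c(2, 4, 6, 8)
n <- length(v)
sqrt(sum((v - mean(v))^2) / n)   # population SD, dividing by n
sd(v)                            # built-in sd(), dividing by n - 1
sd(v) * sqrt((n - 1) / n)        # rescaling sd() to match the population SD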

Section 1: Introduction to Data Visualization and Distributions / 1.2 Introduction to Distributions / Normal Distribution: Standard Units and Z-scores

Normal Distribution: Standard Units and Z-scores

Standard units
For data that are approximately normal, standard units describe the number of
standard deviations an observation is from the mean. Standard units are denoted
by the variable z and are also known as z-scores.

For any value x from a normal distribution with mean μ and standard deviation σ,


the value in standard units is:

z = (x − μ) / σ

Standard units are useful for many reasons. Note that the formula for the normal
distribution is simplified by substituting z in the exponent:

Pr(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}z^2} dx
When z=0, the normal distribution is at a maximum, the mean μ. The function is
defined to be symmetric around z=0.
The normal distribution of z-scores is called the standard normal distribution and is
defined by μ=0 and σ=1.

Z-scores are useful to quickly evaluate whether an observation is average or


extreme. Z-scores near 0 are average. Z-scores above 2 or below -2 are
significantly above or below the mean, and z-scores above 3 or below -3 are
extremely rare. 

We will learn more about benchmark z-score values and their corresponding
probabilities below.

Code: Converting to standard units


The scale function converts a vector of approximately normally distributed values
into z-scores.

z <- scale(x)

You can compute the proportion of observations that are within 2 standard
deviations of the mean like this:

mean(abs(z) < 2)

The 68-95-99.7 Rule


The normal distribution is associated with the 68-95-99.7 rule. This rule describes
the probability of observing events within a certain number of standard deviations
of the mean. 
The probability distribution function for the normal distribution is defined such that:
 About 68% of observations will be within one standard deviation of the mean
(μ±σ). In standard units, this is equivalent to a z-score of ∣z∣≤1.
 About 95% of observations will be within two standard deviations of the
mean (μ±2σ). In standard units, this is equivalent to a z-score of ∣z∣≤2.

 About 99.7% of observations will be within three standard deviations of the


mean (μ±3σ). In standard units, this is equivalent to a z-score of ∣z∣≤3.
 We will learn how to compute these exact probabilities in a later section, as
well as probabilities for other intervals.
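As a preview of pnorm(), which is introduced in the next section, these benchmark proportions can be checked directly for the standard normal distribution:

pnorm(1) - pnorm(-1)   # approximately 0.68
pnorm(2) - pnorm(-2)   # approximately 0.95
pnorm(3) - pnorm(-3)   # approximately 0.997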
Section 1: Introduction to Data Visualization and Distributions / 1.2 Introduction to Distributions / The Normal CDF and pnorm

The normal CDF and pnorm


RAFAEL IRIZARRY: In the data visualization module,

we introduced the normal distribution as a useful approximation

to many naturally occurring distributions,

including that of height.

The cumulative distribution for the normal distribution


is defined by a mathematical formula, which in R can

be obtained with the function pnorm.

We say that a random quantity is normally distributed with average, avg,

and standard deviation, s, if its probability distribution

is defined by f of a equals pnorm(a, avg, s).

This is the code.

This is useful, because if we are willing to use the normal approximation

for say, height, we don't need the entire dataset

to answer questions such as, what is the probability that a randomly selected

student is taller than 70.5 inches.


We just need the average height and the standard deviation.

Then we can use this piece of code: 1 - pnorm(70.5, mean(x), sd(x)).

And that gives us the answer of 0.37.

The normal distribution is derived mathematically.

Apart from computing the average and the standard deviation,

we don't use data to define it.

Also the normal distribution is defined for continuous variables.

It is not described for discrete variables.

However, for practicing data scientists, pretty much everything we do

involves data, which is technically speaking discrete.

For example, we could consider our adult height data categorical

with each specific height a unique category.

The probability distribution would then be

defined by the proportion of students reporting each of those unique heights.

Here is what a plot of that would look like.

This would be the distribution function for those categories.

So each reported height gets a probability

defined by the proportion of students reporting it.

Now while most students rounded up their height to the nearest inch,

others reported values with much more precision.

For example, one student reported his height to be 69.6850393700787 inches.


What is that about?

What's that very, very precise number?

Well, it turns out, that's 177 centimeters.

So the student converted it to inches, and copied and pasted the result

into the place where they had to report their heights.

The probability assigned to this height is about 0.001.

It's 1 in 708.

However, the probability for 70 inches is 0.12.

This is much higher than what was reported with this other value.

But does it really make sense to think that the probability of being exactly 70
inches is so much higher than the probability of being 69.68?

Clearly, it is much more useful for data analytic purposes

to treat this outcome as a continuous numeric variable.

But keeping in mind that very few people, perhaps none,

are exactly 70 inches.

But rather, that people rounded to the nearest inch.

With continuous distributions, the probability of a singular value

is not even defined.

For example, it does not make sense to ask

what is the probability that a normally distributed value is 70.

Instead, we define probabilities for intervals.

So we could ask instead, what is the probability that someone

is between 69.99 and 70.01.

In cases like height in which the data is rounded,

the normal approximation is particularly useful

if we deal with intervals that include exactly one round number.

So for example, the normal distribution is

useful for approximating the proportion of students

reporting between 69.5 and 70.5.

Here are three other examples.

Look at the numbers that are being reported.


This is using the data, the actual data, not the approximation.

Now look at what we get when we use the approximation.

We get almost the same values.

For these particular intervals, the normal approximation is quite useful.

However, the approximation is not that useful for other intervals.

For example, those that don't include an integer.

Here are two examples.

If we use these two intervals, again, this

is the data, look at the approximations now with the normal distribution.

They're not that good.


In general, we call this situation discretization.

Although the true height distribution is continuous,

the reported heights tend to be more common at discrete values,

in this case, due to rounding.

As long as we are aware of how to deal with this reality,

the normal approximation can still be a very useful tool.

Textbook link
Here is the textbook section on the theoretical distribution and the normal
approximation.

Key points

 The normal distribution has a mathematically defined CDF which can be computed
in R with the function pnorm().

 pnorm(a, avg, s) gives the value of the cumulative distribution


function F(a) for the normal distribution defined by average avg and standard
deviation s.
 We say that a random quantity is normally distributed with average avg and
standard deviation s if the approximation pnorm(a, avg, s) holds for all
values of a.

 If we are willing to use the normal approximation for height, we can estimate the
distribution simply from the mean and standard deviation of our values.
 If we treat the height data as discrete rather than categorical, we see that the data
are not very useful because integer values are more common than expected due to
rounding. This is called discretization.

 With rounded data, the normal approximation is particularly useful when computing
probabilities of intervals of length 1 that include exactly one integer.

Code: Using pnorm to calculate probabilities


Given male heights x:

library(tidyverse)

library(dslabs)

data(heights)

x <- heights %>% filter(sex=="Male") %>% pull(height)

We can estimate the probability that a male is taller than 70.5 inches with:

1 - pnorm(70.5, mean(x), sd(x))

Code: Discretization and the normal approximation

# plot distribution of exact heights in data

plot(prop.table(table(x)), xlab = "a = Height in inches", ylab = "Pr(x = a)")

# probabilities in actual data over length 1 ranges containing an integer

mean(x <= 68.5) - mean(x <= 67.5)

mean(x <= 69.5) - mean(x <= 68.5)

mean(x <= 70.5) - mean(x <= 69.5)

# probabilities in normal approximation match well

pnorm(68.5, mean(x), sd(x)) - pnorm(67.5, mean(x), sd(x))


pnorm(69.5, mean(x), sd(x)) - pnorm(68.5, mean(x), sd(x))

pnorm(70.5, mean(x), sd(x)) - pnorm(69.5, mean(x), sd(x))

# probabilities in actual data over other ranges don't match normal approx as well

mean(x <= 70.9) - mean(x <= 70.1)

pnorm(70.9, mean(x), sd(x)) - pnorm(70.1, mean(x), sd(x))

Definition of quantiles
Quantiles are cutoff points that divide a dataset into intervals with set probabilities.
The qth quantile is the value at which q% of the observations are equal to or less
than that value.
Using the quantile function
Given a dataset data and desired quantile q, you can find the qth quantile
of data with:

quantile(data,q)

Percentiles
Percentiles are the quantiles that divide a dataset into 100 intervals each with 1%
probability. You can determine all percentiles of a dataset data like this:

p <- seq(0.01, 0.99, 0.01)

quantile(data, p)

Quartiles
Quartiles divide a dataset into 4 parts each with 25% probability. They are equal to
the 25th, 50th and 75th percentiles. The 25th percentile is also known as the 1st
quartile, the 50th percentile is also known as the median, and the 75th percentile is
also known as the 3rd quartile.

The summary() function returns the minimum, quartiles and maximum of a vector.

Examples
Load the heights dataset from the dslabs package:

library(dslabs)
data(heights)

Use summary() on the heights$height variable to find the quartiles:

summary(heights$height)

Find the percentiles of heights$height:

p <- seq(0.01, 0.99, 0.01)

percentiles <- quantile(heights$height, p)

Confirm that the 25th and 75th percentiles match the 1st and 3rd quartiles. Note
that quantile() returns a named vector. You can access the 25th and 75th
percentiles like this (adapt the code for other percentile values):

percentiles[names(percentiles) == "25%"]

percentiles[names(percentiles) == "75%"]
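One way to perform this confirmation, assuming the percentiles object defined above, is to compare the named entries directly with the quartiles returned by summary():

summary(heights$height)[c("1st Qu.", "Median", "3rd Qu.")]
percentiles[c("25%", "50%", "75%")]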

Finding quantiles with qnorm


Definition of qnorm
The qnorm() function gives the theoretical value of a quantile with probability p of
observing a value equal to or less than that quantile value given a normal
distribution with mean mu and standard deviation sigma:

qnorm(p, mu, sigma)

By default, mu=0 and sigma=1. Therefore, calling qnorm() with no arguments gives


quantiles for the standard normal distribution.

qnorm(p)

Recall that quantiles are defined such that p is the probability of a random


observation less than or equal to the quantile.

Relation to pnorm
The pnorm() function gives the probability that a value from a standard normal
distribution will be less than or equal to a z-score value z. Consider:
pnorm(-1.96) ≈0.025

The result of pnorm() is a probability; qnorm() reverses this, returning the quantile that corresponds to a given probability. Note that:

qnorm(0.025) ≈−1.96

qnorm() and pnorm() are inverse functions:

pnorm(qnorm(0.025)) =0.025
Theoretical quantiles
You can use qnorm() to determine the theoretical quantiles of a dataset: that is, the
theoretical value of quantiles assuming that a dataset follows a normal distribution.
Run the qnorm() function with the desired probabilities p, mean mu and standard
deviation sigma. 

Suppose male heights follow a normal distribution with a mean of 69 inches and
standard deviation of 3 inches. The theoretical quantiles are:

p <- seq(0.01, 0.99, 0.01)

theoretical_quantiles <- qnorm(p, 69, 3)

Theoretical quantiles can be compared to sample quantiles determined with the


quantile function in order to evaluate whether the sample follows a normal
distribution.

Section 1: Introduction to Data Visualization and Distributions / 1.3 Quantiles, Percentiles, and Boxplots / Quantile-Quantile Plots

RAFAEL IRIZARRY: In a previous video, we described

how, if a distribution is well approximated

by the normal distribution, we can have a very useful and


short summary.

But to check if, in fact, it is a good approximation,

we can use quantile-quantile plots, or q-q plots.


We start by defining a series of proportions, for example, p equals 0.05,

0.10, 0.15, up to 0.95.

Once this is defined for each p, we determine the value q,

so that the proportion of the values in the data below q is p.

The q's are referred to as the quantiles.

To give a quick example, for the male heights

data that we showed in previous videos, we have that 50% of the data

is below 69.5 inches.

So this means that if p equals 0.5, then the q associated

with that p is 69.5.

Now, we can make this computation for a series of p's.

If the quantiles for the data match the quantiles for the normal distribution,

then it must be because the data is approximated by a normal distribution.

To obtain the quantiles for the data, we can use the quantile function

in R like this.

So we're going to define an object called observed_quantiles,

and calculate the quantiles for x at the series of values of p,

which is stored in the object p.

To obtain the theoretical normal distribution quantiles


with the corresponding average and standard deviation,

we use the qnorm function, like this.

We're going to define an object called theoretical_quantiles.

To see if they match or not, we can plot them against each other,

and then draw an identity line to see if the points fall on the line.

We can do this using the plot function, like this.

This very simple piece of code produces this plot.

Note that the points fall almost on the line,

meaning that the normal approximation is a pretty good approximation.

Now, one final note.

This code becomes slightly simpler if we use standard units.

If we use standard units, we don't have to define

the mean and the standard deviation in the function qnorm.

So the code simplifies and looks like this.

Now, using the histogram, the density plots, and the q-q plots,

we have become convinced that the male height data is well approximated

with a normal distribution.

So in this case, we can report back to ET a very succinct summary.

Male heights follow a normal distribution,

with an average of 69.44 inches and a standard deviation of 3.27 inches.

With this information, ET will have everything

he needs to know to describe, and know what to expect,

when he meets our male students.

Textbook link
This video corresponds to the textbook section on quantile-quantile plots.

Key points

 Quantile-quantile plots, or QQ-plots, are used to check whether distributions are


well-approximated by a normal distribution.

 Given a proportion p, the quantile q is the value such that the proportion of values
in the data below q is p.

 In a QQ-plot, the sample quantiles in the observed data are compared to the
theoretical quantiles expected from the normal distribution. If the data are well-
approximated by the normal distribution, then the points on the QQ-plot will fall
near the identity line (sample = theoretical).

 Calculate sample quantiles (observed quantiles) using the quantile() function.

 Calculate theoretical quantiles with the qnorm() function. qnorm() will calculate


quantiles for the standard normal distribution (μ=0,σ=1) by default, but it can
calculate quantiles for any normal distribution
given mean and sd arguments. We will learn more about qnorm() in the
probability course.
 Note that we will learn alternate ways to make QQ-plots with less code later in the
series.

Code

# define x and z

library(tidyverse)

library(dslabs)
data(heights)

index <- heights$sex=="Male"

x <- heights$height[index]

z <- scale(x)

# proportion of data below 69.5

mean(x <= 69.5)

# calculate observed and theoretical quantiles

p <- seq(0.05, 0.95, 0.05)

observed_quantiles <- quantile(x, p)

theoretical_quantiles <- qnorm(p, mean = mean(x), sd = sd(x))

# make QQ-plot

plot(theoretical_quantiles, observed_quantiles)

abline(0,1)

# make QQ-plot with scaled values

observed_quantiles <- quantile(z, p)

theoretical_quantiles <- qnorm(p)

plot(theoretical_quantiles, observed_quantiles)

abline(0,1)
