Data Science Visualization in R
How quickly can you determine which states have the largest populations from a
table of raw numbers? With a well-designed plot the answer is nearly instant; with a
table alone, careful reading is required.
One motivating example comes from an article that showed data on scores from the
New York City regents exam: plotting the distribution of scores revealed a pattern
that no single summary statistic could.
Exploratory data analysis is perhaps the most important part of data analysis, yet it
is often overlooked. Visualization can reveal biases, systematic errors, and mistakes
in data, and most data analysis procedures are not designed to detect these yet.
Another example comes from GAPminder and the talks of Hans Rosling, New Insights
on Poverty and The Best Stats You've Ever Seen. In his videos, he used animated
graphs to show us how the world was changing.
To learn the basics, we will start with a somewhat artificial example: heights
reported by students.
Textbook Link
This video corresponds to the textbook introduction to the data visualization
section.
Key points
Data visualization can help discover biases, systematic errors, mistakes and
other unexpected problems in data before those data are incorporated into
potentially flawed analysis.
This course covers the basics of data visualization and EDA in R using
the ggplot2 package and motivating examples from world health,
economics and infectious disease.
Code
library(dslabs)
data(murders)
head(murders)
So, for example, you might read a report stating that scores at a particular high
school are summarized by a single number, such as the average. Is this appropriate?
What are we missing by only looking at this summary rather than the entire list?
The most basic statistical summary of a list of objects or numbers is its distribution.
Once we know the distribution, we can answer most questions about the data, and
data visualization is the most effective way to convey this information.
Textbook Link
This video corresponds to the textbook introduction to the section on visualizing
distributions.
Key points
The two main types of variables are categoricals and numericals.
Categorical variables are defined by a small number of groups. Two simple examples
are sex (male or female) and the regions of the states (Northeast, South, etc.).
Some categorical data are ordinal: even if they are not numbers per se, they can
still be ordered, like the spiciness levels mild, medium, and hot.
Numerical variables take on numeric values. Examples that we have seen or will see
are population sizes, murder rates, and heights.
Continuous variables are those that can take any value if measured precisely
enough, such as heights. For example, a pair of twins may be 68.12 inches and
68.11 inches, respectively. Discrete variables take on rounded values, such as
counts: 0, 1, 2, 3, up to maybe 36.
Discrete numerical data with few unique values can sometimes be treated as
ordinal; in contrast, when we have many groups with few cases in each group, we
typically treat the variable as numerical.
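These distinctions can be illustrated with a minimal R sketch. The variables below (sex, spiciness, height, packs) are made-up examples for illustration, not columns of a dslabs dataset:

```r
# hypothetical examples of each variable type
sex <- factor(c("Male", "Female", "Female", "Male"))          # categorical
spiciness <- factor(c("mild", "hot", "medium"),
                    levels = c("mild", "medium", "hot"),
                    ordered = TRUE)                           # ordinal categorical
height <- c(68.12, 68.11, 65.03, 70.48)                       # continuous numerical
packs <- c(0L, 1L, 2L, 3L)                                    # discrete numerical

class(sex)          # factors are how R stores categorical variables
is.ordered(spiciness)  # TRUE: the levels have a meaningful order
is.numeric(height)  # TRUE
```

Note that R's factor type captures both nominal and ordinal categorical data; the `ordered = TRUE` argument records the ordering of the levels.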
Textbook Link
This video corresponds to the textbook section on variable types.
Key points
Categorical data are variables that are defined by a small number of groups.
Numerical summaries alone are rarely enough to understand distributions; for the
heights data, for example, we need to know there are two different groups of
heights, males and females.
The first step in analyzing new data is to look at it. We've done that already, and
here are the first six entries. Scanning the whole list is not practical, however, and
there are much more effective ways to convey this information.
The most basic statistical summary of a list of objects or numbers is its distribution.
For categorical data such as sex, the distribution is simply the proportion in each
category. With two categories, one number describes everything we need to know
about this data set: once we know the proportion of females, the proportion of
males follows. A barplot of this two-category frequency table does not provide
much more insight than the table itself.
Note that some reported heights carry suspiciously many decimal places. We
assume that the students converted from 174 centimeters and 175 centimeters
to inches, respectively.
For numerical data, a useful summary is the cumulative distribution function, or
CDF: the proportion of values at or below each possible height. From the plot we
can see, for example, that 84% of our students have heights below 72 inches.
In fact, we can report the proportion of values between any two heights, and if we
send the CDF plot to a colleague, he will have all the information needed to
construct the entire list.
Although the CDF provides all the information we need about a distribution and is
widely used in statistics, it is not very popular in practice: we can decipher facts
about the data from the plot, but it's not that easy.
Histograms are much preferred. To construct one, we divide the range of the data
into non-overlapping bins of the same size. Then for each bin, we count the number
of values that fall in that interval. The histogram plots these counts as bars with
the base of the bar the interval.
Here's a histogram for the height data where we split the range of values into
equally sized intervals. From this plot a viewer immediately learns the range of the
data, but perhaps more importantly, he learns that the large majority, more than
95%, of the heights fall in a narrow band in the middle of the range.
By adding up bin counts, we can also obtain very good approximations of the
proportion of the data in any interval, and the histogram summarizes everything
in the raw data, that is 708 heights, with just 23 bin counts.
So why is it an approximation? Note that all values in each interval are treated as
the same when computing the bin counts, so the histogram loses the exact values
within each bin.
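This bin-count approximation can be sketched in R. The 708 heights below are simulated with rnorm as a stand-in for the reported data, so the exact numbers are illustrative only:

```r
set.seed(1)
x <- rnorm(708, mean = 69, sd = 3)   # simulated stand-in for 708 reported heights

# one-inch bins covering the range of the data
breaks <- seq(floor(min(x)), ceiling(max(x)), by = 1)
counts <- hist(x, breaks = breaks, plot = FALSE)$counts

# estimate the proportion below 67.5 from the bin counts alone:
# whole bins up to 67, plus half of the (67, 68] bin, treating values
# within a bin as if spread evenly (this is why it is an approximation)
full <- sum(counts[breaks[-length(breaks)] < 67])
half <- 0.5 * counts[which(breaks == 67)]
approx_prop <- (full + half) / length(x)

exact_prop <- mean(x < 67.5)
c(approx = approx_prop, exact = exact_prop)   # close, but not identical
```

The two numbers agree well because heights vary smoothly within each one-inch bin; the approximation degrades as bins get wider.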
Key points
For continuous numerical data, reporting the frequency of each unique entry is not
an effective summary as many or most values are unique. Instead, a distribution
function is required.
Code
library(dslabs)
data(heights)
# make a table of category proportions
prop.table(table(heights$sex))
Any continuous dataset has a CDF, not only normal distributions. For example, the
male heights data we used in the previous section has this CDF:
As defined above, this plot of the CDF for male heights has height values a on the
x-axis and the proportion of students with heights of that value or lower, F(a), on
the y-axis.
For datasets that are not normal, the CDF can be calculated manually by defining a
function to compute the probability above. This function can then be applied to a
range of values across the range of the dataset to calculate a CDF. Given a
dataset my_data, the CDF can be calculated and plotted like this:
cdf <- function(x) mean(my_data <= x)   # proportion of values at or below x
a <- seq(min(my_data), max(my_data), length.out = 100)
plot(a, sapply(a, cdf))
The CDF gives the proportion of data below a cutoff a. To define the proportion
of values above a, we compute:
1−F(a)
To define the proportion of values between a and b, we compute:
F(b)−F(a)
Note that the CDF can help compute probabilities. The probability of observing a
randomly chosen value between a and b is equal to the proportion of values
between a and b, which we compute with the CDF.
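A minimal sketch of these computations, using simulated data as a stand-in for the my_data vector referenced above:

```r
# simulated stand-in for my_data (an assumption for illustration)
set.seed(2)
my_data <- rnorm(1000, mean = 69, sd = 3)

cdf <- function(a) mean(my_data <= a)   # F(a): proportion of values <= a

cdf(70)            # proportion at or below 70
1 - cdf(70)        # proportion above 70: 1 - F(a)
cdf(72) - cdf(68)  # proportion between 68 and 72: F(b) - F(a)

# plot the CDF over the range of the data
a <- seq(min(my_data), max(my_data), length.out = 100)
plot(a, sapply(a, cdf), type = "l", xlab = "a", ylab = "F(a)")
```

Because the simulated values are drawn from a normal distribution, the empirical CDF here closely tracks pnorm(a, 69, 3).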
Here's what a smooth density plot looks like for our male heights data. We're going
to learn how to make these plots and what they mean, so that you can use this
useful data visualization tool and know how to interpret it.
To understand where smooth densities come from, imagine a hypothetical situation
in which we have all the heights of all the male students in all the world, measured
very precisely.
This list of values, like any other list of values, has a distribution. Because we're
assuming that we have a million values measured very precisely, we can make a
histogram with very, very small bins. And here's what happens when we make the
bins smaller and smaller: the edges start disappearing and the histogram becomes
smoother. For this hypothetical situation, the smooth density is basically the curve
that goes through the top of the histogram bars.
In practice, we don't have millions of precise measurements. Instead, we start
by making a histogram with our data, compute frequencies rather than counts,
and keep the heights of the histogram bars shown here. We then draw a smooth
curve that goes through the top of these bars. The amount of smoothing is chosen
so that the result looks like a curve that goes up and then comes down smoothly,
like the example on the right rather than the example on the left.
Note that interpreting the y-axis of a smooth density plot is not straightforward.
The y-axis is scaled so that the total area under the curve adds up to 1. So if you
imagine you form a bin with a base that is 1 unit long, the y-axis value tells you the
proportion of values in that bin. For other size intervals, the best way to determine
the proportion of data is to compute the area under the curve in that interval.
Here's an example: if we take the proportion of values between 65 and 68, it'll
equal the area of the graph that is in that blue region. The proportion of this area
is about 0.31, meaning that about 31% of the values fall between 65 and 68.
One advantage of smooth densities over histograms as a summary is that densities
make it easier to compare two distributions. This is in large part because the
jagged edges of the histogram add clutter, so when we're comparing two
histograms, it is a little bit harder to see. Here's an example of what it looks like
when you use density plots instead: it makes it very easy to get an idea of what
the two distributions are.
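The area claim can be checked numerically. Here is a sketch with simulated heights standing in for the class data (an assumption for illustration):

```r
set.seed(3)
x <- rnorm(1000, mean = 69, sd = 3)   # simulated stand-in for male heights

d <- density(x)                        # kernel density estimate
in_region <- d$x >= 65 & d$x <= 68     # the "blue region" between 65 and 68
area <- sum(d$y[in_region]) * diff(d$x)[1]   # Riemann sum over the grid

area                        # area under the curve between 65 and 68
mean(x >= 65 & x <= 68)     # actual proportion: the two should be close
```

The total area under the estimated curve also sums to about 1, which is what makes the area-as-proportion interpretation work.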
Textbook link
This video corresponds to the textbook section on smooth density plots.
Key points
A smooth density plot can be thought of as the curve that goes through the tops of
the bars of a histogram made with very small bins.
The y-axis is scaled so that the total area under the density curve adds up to 1; the
proportion of data in any interval is the area under the curve over that interval.
Smooth densities make it easier to compare two distributions than histograms.
Normal Distribution
RAFAEL IRIZARRY: Histogram and density plots provide excellent summaries of a
distribution. But can we summarize even further? We often see the average and
the standard deviation used as summary statistics, a two-number summary.
To understand what these summaries are and why they are so widely used,
we need to understand the normal distribution.
The normal distribution is defined by a mathematical formula giving the proportion
of values in any interval. The rest of the symbols in the formula represent the
interval ends, a and b, and two parameters: the average and the standard
deviation. This means that if a dataset is well approximated by a normal
distribution, all the information needed to describe its distribution
can be encoded in just two numbers, the average and the standard deviation.
The distribution is symmetric and centered at the average, and most values, about
95%, are within two standard deviations of the average. Here's what the
distribution looks like when the average is zero and the standard deviation is one.
For a list of numbers, the average is the sum of the values divided by how many
there are, and the standard deviation is the square root of the average of the
squared differences between the values and the average. Here's the code that you
use in R to compute both. We're making no assumptions here; these two quantities
can be computed for any list of numbers.
The standard unit of a value tells us how many standard deviations away from the
average it is. For the normal distribution, the maximum of the density is at the
average, which in standard units is z equals 0, and the density is symmetric
around 0. Once converted to standard units, we can quickly know if, for example,
a person is about average height; that would mean z equals 0. Note that it does
not matter what the original units are.
To see how many men are within two standard deviations from the average, we
convert the heights to standard units and count the number of z's that are less
than 2 and bigger than negative 2. So we take the mean of this quantity that you
see here in the code, which gives the proportion of values within two standard
deviations of the average.
Textbook link
For more information, consult this textbook section on the normal distribution.
Correction
At 3:27 and 3:50, the audio gives incorrect values for the average and standard
deviation. The code on screen and the transcript are correct.
Key points
The standard deviation is the average distance between a value and the mean
value.
Standard units describe how many standard deviations a value is away from the
mean. The z-score gives the number of standard deviations an observation x is
away from the mean μ:
z = (x − μ) / σ
Important: to calculate the proportion of values that meet a certain condition, use
the mean() function on a logical vector. Because TRUE is converted to 1
and FALSE is converted to 0, taking the mean of this vector yields the proportion
of TRUE.
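For example, applying mean() to a logical vector:

```r
# TRUE is coerced to 1 and FALSE to 0, so the mean of a logical
# vector is the proportion of TRUE values
x <- c(2, 7, 1, 9, 4)
x > 3          # FALSE TRUE FALSE TRUE TRUE
sum(x > 3)     # 3 values meet the condition
mean(x > 3)    # 0.6: the proportion that meet it
```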
Pr(a < x < b) = ∫ab (1/(σ√2π)) e^(−(1/2)((x−μ)/σ)²) dx
Code
# define x as vector of male heights
library(tidyverse)
library(dslabs)
data(heights)
index <- heights$sex == "Male"
x <- heights$height[index]

# built-in mean and sd functions - note that the audio and printed values disagree
average <- mean(x)
SD <- sd(x)

# calculate standard units
z <- scale(x)

# calculate proportion of values within 2 SD of mean
mean(abs(z) < 2)
Standard units
For data that are approximately normal, standard units describe the number of
standard deviations an observation is from the mean. Standard units are denoted
by the variable z and are also known as z-scores.
z = (x − μ) / σ
Standard units are useful for many reasons. Note that the formula for the normal
distribution is simplified by substituting z in the exponent:
Pr(a < x < b) = ∫ab (1/(σ√2π)) e^(−(1/2)z²) dx
When z=0, the normal distribution is at a maximum, the mean μ. The function is
defined to be symmetric around z=0.
The normal distribution of z-scores is called the standard normal distribution and is
defined by μ=0 and σ=1.
We will learn more about benchmark z-score values and their corresponding
probabilities below.
In R, you can compute standard units with the scale() function:
z <- scale(x)
You can compute the proportion of observations that are within 2 standard
deviations of the mean like this:
mean(abs(z) < 2)
The normal approximation allows us to answer questions such as: what is the
probability that a randomly selected male student is taller than 70.5 inches?
Then we can use this piece of code: 1 - pnorm(70.5, mean(x), sd(x)).
Now, while most students rounded their height to the nearest inch, one student
reported a value of about 69.68 inches with many more decimal places. This is 177
centimeters, so we assume the student converted it to inches and copied and
pasted the result.
The proportion of students reporting exactly this value is 1 in 708. The proportion
reporting exactly 70 inches is much higher than what was reported with this other
value. But does it really make sense to think that the probability of being exactly 70
inches is so much higher than the probability of being 69.68?
For cases like this, it is more useful to consider intervals of values rather than exact
values and, rather than relying on how discretized the data is, to look at the
approximations given by the normal distribution.
Textbook link
Here is the textbook section on the theoretical distribution and the normal
approximation.
Key points
The normal distribution has a mathematically defined CDF which can be computed
in R with the function pnorm().
If we are willing to use the normal approximation for height, we can estimate the
distribution simply from the mean and standard deviation of our values.
If we treat the height data as discrete rather than continuous, we see that the
frequencies of exact values are not very useful, because integer values are far more
common than expected due to rounding. This is called discretization.
With rounded data, the normal approximation is particularly useful when computing
probabilities of intervals of length 1 that include exactly one integer.
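The last key point can be sketched with simulated, rounded heights. The numbers below come from rnorm, not the class data, so they are illustrative only:

```r
set.seed(4)
true_heights <- rnorm(1000, mean = 69, sd = 3)
reported <- round(true_heights)        # discretization: rounded to nearest inch

# an interval of length 1 containing exactly one integer (69) works well
actual <- mean(reported > 68.5 & reported <= 69.5)
approx <- pnorm(69.5, 69, 3) - pnorm(68.5, 69, 3)
c(actual = actual, approx = approx)    # close agreement

# an interval containing no integer does not work at all
actual2 <- mean(reported > 69.1 & reported <= 69.9)   # 0: no rounded value lands here
approx2 <- pnorm(69.9, 69, 3) - pnorm(69.1, 69, 3)    # but the approximation is positive
c(actual = actual2, approx = approx2)
```

With rounded data, all the probability mass piles up on integers, so only intervals that capture exactly one integer are well approximated.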
library(tidyverse)
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
We can estimate the probability that a male is taller than 70.5 inches with:
1 - pnorm(70.5, mean(x), sd(x))
# probabilities in actual data over other ranges don't match normal approx as well
mean(x <= 70.9) - mean(x <= 70.1)
pnorm(70.9, mean(x), sd(x)) - pnorm(70.1, mean(x), sd(x))
Definition of quantiles
Quantiles are cutoff points that divide a dataset into intervals with set probabilities.
The qth quantile is the value at which a proportion q of the observations are equal
to or less than that value.
Using the quantile function
Given a dataset data and desired quantile q, you can find the qth quantile
of data with:
quantile(data, q)
Percentiles
Percentiles are the quantiles that divide a dataset into 100 intervals each with 1%
probability. You can determine all percentiles of a dataset data like this:
p <- seq(0.01, 0.99, 0.01)
quantile(data, p)
Quartiles
Quartiles divide a dataset into 4 parts each with 25% probability. They are equal to
the 25th, 50th and 75th percentiles. The 25th percentile is also known as the 1st
quartile, the 50th percentile is also known as the median, and the 75th percentile is
also known as the 3rd quartile.
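As a quick check of these definitions, the quartiles of the integers 1 through 100 are easy to verify by hand:

```r
x <- 1:100
quartiles <- quantile(x, c(0.25, 0.50, 0.75))
quartiles        # the three quartiles of 1..100

# the 50th percentile is the median
median(x)        # 50.5
```

Note that quantile() interpolates between observations (its default method), so the quartiles need not be values that actually occur in the data.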
Examples
Load the heights dataset from the dslabs package:
library(dslabs)
data(heights)
summary(heights$height)
Confirm that the 25th and 75th percentiles match the 1st and 3rd quartiles. Note
that quantile() returns a named vector. You can access the 25th and 75th
percentiles like this (adapt the code for other percentile values):
p <- seq(0.01, 0.99, 0.01)
percentiles <- quantile(heights$height, p)
percentiles[names(percentiles) == "25%"]
percentiles[names(percentiles) == "75%"]
qnorm
The qnorm() function gives the theoretical value of a quantile with probability p of
observing a value equal to or less than that quantile value, given a normal
distribution with mean mu and standard deviation sigma:
qnorm(p, mu, sigma)
By default, mu = 0 and sigma = 1, so calling qnorm() with just a probability gives
the quantiles of the standard normal distribution:
qnorm(p)
Relation to pnorm
The pnorm() function gives the probability that a value from a standard normal
distribution will be less than or equal to a z-score value z. Consider:
pnorm(-1.96) ≈ 0.025
qnorm(0.025) ≈ -1.96
pnorm(qnorm(0.025)) = 0.025
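These identities are easy to verify in R:

```r
# pnorm() and qnorm() are inverse functions for the standard normal
pnorm(-1.96)           # about 0.025
qnorm(0.025)           # about -1.96
pnorm(qnorm(0.025))    # exactly 0.025

# the same holds for any mean and sd, e.g. mu = 69 and sigma = 3
qnorm(0.5, mean = 69, sd = 3)          # 69: the median equals the mean
pnorm(qnorm(0.975, 69, 3), 69, 3)      # 0.975
```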
Theoretical quantiles
You can use qnorm() to determine the theoretical quantiles of a dataset: that is, the
theoretical value of quantiles assuming that a dataset follows a normal distribution.
Run the qnorm() function with the desired probabilities p, mean mu and standard
deviation sigma.
Suppose male heights follow a normal distribution with a mean of 69 inches and
standard deviation of 3 inches. The theoretical quantiles are:
Quantile-Quantile Plots
If the quantiles for the data match the quantiles for the normal distribution, then
the data must be well approximated by the normal distribution. This is the idea
behind the quantile-quantile plot, or QQ-plot.
To obtain the quantiles for the data, we can use the quantile function in R, as
shown in the code below; to obtain the theoretical normal quantiles, we use qnorm.
We then plot the sample quantiles against the theoretical quantiles: if the points
fall on the identity line, the data is consistent with the normal distribution.
Now, using the histogram, the density plots, and the QQ-plots, we are convinced
that the male heights data is well approximated by a normal distribution.
Textbook link
This video corresponds to the textbook section on quantile-quantile plots.
Key points
Given a proportion p, the quantile q is the value such that the proportion of values
in the data below q is p.
In a QQ-plot, the sample quantiles in the observed data are compared to the
theoretical quantiles expected from the normal distribution. If the data are well-
approximated by the normal distribution, then the points on the QQ-plot will fall
near the identity line (sample = theoretical).
Code
# define x and z
library(tidyverse)
library(dslabs)
data(heights)
x <- heights$height[index]
z <- scale(x)
# make QQ-plot
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)