Data Visualization and Manipulation (R)
Liana Harutyunyan
Programming for Data Science
April 23, 2024
American University of Armenia
liana.harutyunyan@aua.am
Data Visualization
The most popular R package for plotting graphs is ggplot2.
As a reminder, to install a package we write:
install.packages("ggplot2")
Run this in the console. If you put it in an R script or an
Rmd chunk instead, delete or comment out the line once the
installation has completed.
Then you need to "import" (load) the package.
library(ggplot2)
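As a minimal sketch of an alternative to deleting the install line, assuming you want the script to stay runnable on a machine where ggplot2 may be missing, you can install only when the package is not found. The CRAN mirror URL below is an assumption; any mirror works.

```r
# Install ggplot2 only if it is not already available,
# so the package is not reinstalled on every run
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}

# Load the package for the current session
library(ggplot2)
```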
ggplot2
Understanding the grammar of the ggplot2 package is
fundamental to being able to use it effectively.
The goal was to create a highly customizable framework.
ggplot2 basics
• The idea is that you have building blocks and you use
them to construct a graph.
• Possible components/building blocks are:
• data
• aesthetic mapping
• geometric object
• statistical transformations
• scales
• coordinate system
• position adjustments
• faceting
Our first graph
Whatever graph you are printing, first you need a ggplot
object.
ggplot(data=data, aes(x=column1, y=column2))
We can omit the data and aes arguments here if we prefer to
specify them in the graph type (geom) layer, as we will see
shortly.
Exercise: Use the code above with diamonds, a dataset that
ships with ggplot2.
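One possible way to try this, assuming carat and price as the two columns (any two columns of diamonds would do):

```r
library(ggplot2)

# diamonds ships with ggplot2; carat and price are two of its numeric columns
p <- ggplot(data = diamonds, aes(x = carat, y = price))

# Printing p draws only the axes and background:
# no geom layer has been added yet, so no data is shown
print(p)
```

The object already knows the data and the mapping; the following slides add layers on top of it.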
Scatterplot
• Next we specify what type of graph to plot from the given
data and columns.
ggplot(data=data, aes(x=column1, y=column2))+
geom_point()
• geom_point draws a scatterplot, mainly used for
showing the relationship between two continuous
variables.
• Exercise: Plot the relationship between the columns
carat and price. You will notice that the dataset is very
large. For the next step, take a random subset of the data
(1000 rows) and plot this subset.
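A minimal sketch of the subsetting step, using base R's sample(). The seed value and the name diamonds_small are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only so the same rows are drawn on each run

# Draw 1000 row indices without replacement and keep those rows
rows <- sample(nrow(diamonds), 1000)
diamonds_small <- diamonds[rows, ]

# Same scatterplot as before, now on the smaller dataset
p <- ggplot(data = diamonds_small, aes(x = carat, y = price)) +
  geom_point()
print(p)
```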
Scatterplot
The data and aesthetics values can also be specified inside
geom_point, if they should not apply to all layers.
• If data and aes apply to the whole graph and any future
layers:
ggplot(data=data, aes(x=column1, y=column2))+
geom_point()
• If they apply only to this layer:
ggplot()+
geom_point(data=data, aes(x=column1, y=column2))
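Layer-level data becomes useful once a plot has more than one layer. The sketch below is an illustration, not part of the original slides: it draws a random sample of diamonds, then overlays only the "Ideal" cut in red using a second geom_point with its own data.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed for a reproducible sample
small <- diamonds[sample(nrow(diamonds), 500), ]
ideal <- small[small$cut == "Ideal", ]

# ggplot() is left empty: each layer brings its own data and mapping
p <- ggplot() +
  geom_point(data = small, aes(x = carat, y = price)) +
  geom_point(data = ideal, aes(x = carat, y = price), color = "red")
print(p)
```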
ggplot rules
IMPORTANT NOTE: the + must come at the end of a line, never
at the start of the next line.
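A short illustration of the rule. The failing variant is left commented out so that the snippet runs.

```r
library(ggplot2)

# Works: the first line ends with +, so R knows the expression continues
p <- ggplot(data = diamonds, aes(x = cut)) +
  geom_bar()

# Fails: the first line is already a complete expression, so R evaluates it
# on its own and then raises an error on the line that starts with +
# p <- ggplot(data = diamonds, aes(x = cut))
#   + geom_bar()
```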
Line plot
Another geometric object is geom_line, which creates a
line plot by joining the points together.
• Equivalent of plt.plot in Python.
• Exercise: Let's plot the graph of x^2, then of the sigmoid
function, in R.
• Exercise: Create a price vs carat line plot. For this we
only need to change the geom_point part.
• What will happen if we keep both geom_point and
geom_line?
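One way to approach the first exercise, as a sketch: build a small data frame of x values and the two functions, then plot each with geom_line. The range -5 to 5 and the step 0.1 are arbitrary choices.

```r
library(ggplot2)

# Grid of x values and the two functions evaluated on it
x <- seq(-5, 5, by = 0.1)
curves <- data.frame(
  x = x,
  square = x^2,
  sigmoid = 1 / (1 + exp(-x))
)

# geom_line joins consecutive points, so a fine grid gives a smooth curve
p_square <- ggplot(curves, aes(x = x, y = square)) + geom_line()
p_sigmoid <- ggplot(curves, aes(x = x, y = sigmoid)) + geom_line()

print(p_square)
print(p_sigmoid)
```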
Other aesthetics
There are aes values other than x and y, such as
• color
• shape (0-25)
• size
You can set them to constant values or to variable names
(columns).
Exercise: Try different combinations of them in the price vs
carat scatterplot.
Exercise: Set any of the categorical columns as the value of
color.
Exercise: Set any of the continuous columns as the value of
color.
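The difference between constants and columns can be sketched as follows. The specific color, shape, and size values are arbitrary; cut and depth are columns of diamonds chosen as a categorical and a continuous example.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed for a reproducible sample
small <- diamonds[sample(nrow(diamonds), 1000), ]

# Constant values go outside aes(): every point looks the same
p_const <- ggplot(small, aes(x = carat, y = price)) +
  geom_point(color = "steelblue", shape = 17, size = 2)

# Column names go inside aes(): the appearance varies with the data
p_cat <- ggplot(small, aes(x = carat, y = price, color = cut)) +
  geom_point()   # categorical column: one color per category
p_num <- ggplot(small, aes(x = carat, y = price, color = depth)) +
  geom_point()   # continuous column: a color gradient

print(p_const)
print(p_cat)
print(p_num)
```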
Other aesthetics
We can change the colors manually:
scale_color_manual(values = c(color_1, color_2, ...))
The same can be done for shape and size, with
scale_shape_manual and scale_size_manual.
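A small sketch with cut mapped to color. Since cut has five levels, five colors are supplied; the color names themselves are arbitrary.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed for a reproducible sample
small <- diamonds[sample(nrow(diamonds), 1000), ]

# One color per level of cut, in the order of levels(diamonds$cut)
cut_colors <- c("red", "orange", "gold", "forestgreen", "navy")

p <- ggplot(small, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  scale_color_manual(values = cut_colors)
print(p)
```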
Histogram
• A histogram is used to describe the distribution of a
continuous variable.
• We only need to change a few things for the graph to
become a histogram: the beauty of ggplot.
ggplot(data, aes(column1)) + geom_histogram()
• It has a bins or binwidth parameter to control the number
of bins.
• Exercise: Plot a histogram of the price of diamonds.
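A sketch of both ways of controlling the bins. The values 50 and 500 are arbitrary starting points to experiment with.

```r
library(ggplot2)

# bins sets how many bins to use
p_bins <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 50)

# binwidth sets how wide each bin is, in the units of the variable
p_width <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

print(p_bins)
print(p_width)
```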
Boxplot
• Another way to describe the distribution of a
continuous/numerical variable in the data is a boxplot.
• Can you guess what needs to be changed?
ggplot(data, aes(column1)) + geom_boxplot()
• Exercise: Plot a boxplot of the price of diamonds.
• We can also specify both x and y mappings. If x is a
categorical variable, a boxplot will be plotted for each
category of x.
• Exercise: Plot a boxplot of price for each cut of diamond.
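Both variants might look like the sketch below. Note that mapping only x produces a horizontal boxplot, which assumes a reasonably recent ggplot2 (3.3.0 or later).

```r
library(ggplot2)

# Single variable: one horizontal boxplot of price
p_single <- ggplot(diamonds, aes(x = price)) +
  geom_boxplot()

# Categorical x, numerical y: one boxplot of price per cut
p_by_cut <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()

print(p_single)
print(p_by_cut)
```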
Categorical features — Barplot
• As a reminder: categorical features are those that take a
limited set of discrete values.
• These are mainly represented as type factor in R.
• To plot categorical data, we can use a barplot.
ggplot(data, aes(x=column1)) + geom_bar()
• If we only specify x, it plots the count for each possible
value of x.
• Exercise: Plot how many diamonds of each cut there are
in the dataset.
• We can specify fill = c("color1", ..., "colorN") inside
geom_bar to give the bars different colors.
• Why do we not use scale_color_manual here?
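A sketch of the count barplot with manually chosen bar colors. The color names are arbitrary, and the second variant (mapping cut to fill and using scale_fill_manual) is an alternative added here for comparison, not something taken from the slides.

```r
library(ggplot2)

bar_colors <- c("red", "orange", "gold", "forestgreen", "navy")

# Constant fill: one color per bar, given outside aes().
# The interior of a bar is controlled by fill; color only affects its outline.
p_const <- ggplot(diamonds, aes(x = cut)) +
  geom_bar(fill = bar_colors)

# Mapped fill: cut is mapped to fill, so a fill scale can set the colors
p_scale <- ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar() +
  scale_fill_manual(values = bar_colors)

print(p_const)
print(p_scale)
```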
Barplots
What if we want a value other than the count as the height
of our bars?
Exercise: Obtain a dataframe that holds the mean price for
each cut value.
Now, having the aggregated dataset (one number for each
value of the categorical feature), we can plot:
ggplot(data_agg, aes(x=categorical_column, y=agg_val)) +
geom_bar(stat="identity")
Note: if we specify both x and y values for a barplot, we
need stat="identity".
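One possible solution sketch, using base R's aggregate() for the grouping step (other tools, such as dplyr, would work equally well). The name mean_price is a choice made for this example.

```r
library(ggplot2)

# One row per cut, with the mean price of that cut
mean_price <- aggregate(price ~ cut, data = diamonds, FUN = mean)

# stat = "identity" uses the y values as bar heights instead of counting rows
p <- ggplot(mean_price, aes(x = cut, y = price)) +
  geom_bar(stat = "identity")
print(p)

# geom_col() is a shorthand for geom_bar(stat = "identity")
p_col <- ggplot(mean_price, aes(x = cut, y = price)) +
  geom_col()
```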
Barplots
Another aesthetic value for barplots is fill.
You can specify another categorical variable as fill, and the
bars will be colored accordingly.
Exercise: Take the cut-count barplot and fill the bars by
another categorical feature (for example, the color feature).
Exercise: Try geom_bar(position="fill"): we lose the
counts, but get the proportion of each fill value within
every bar.
Exercise: Try geom_bar(position="dodge"): we get a
separate bar for each value of the fill column.
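The three variants can be compared side by side with a sketch like this, which reuses one base object and changes only the position argument:

```r
library(ggplot2)

# cut on the x axis, bars split by the color column of diamonds
base <- ggplot(diamonds, aes(x = cut, fill = color))

p_stack <- base + geom_bar()                    # default: stacked counts
p_fill  <- base + geom_bar(position = "fill")   # each bar rescaled to height 1
p_dodge <- base + geom_bar(position = "dodge")  # bars placed next to each other

print(p_stack)
print(p_fill)
print(p_dodge)
```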
Boxplots and categorical
To combine categorical and continuous variables, we can
plot a boxplot for each categorical value.
As already discussed:
ggplot(data, aes(categorical, numerical)) + geom_boxplot()
Other features of ggplot
• + xlim, ylim - for zooming in/out (can be used as
separate blocks, or as arguments of coord_cartesian)
example: xlim(c(1, 10)) + ylim(c(0, 5))
• + coord_flip() - flips the graph 90 degrees, swapping the
x and y axes.
• + labs - to set the title of the graph and the labels of
the x and y axes
example: labs(title="The title", x="lab for x",
y="lab for y")
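A sketch of these building blocks on the price vs carat scatterplot. The limits and label texts are arbitrary. One difference worth knowing: xlim() and ylim() as separate blocks drop the points outside the range (with a warning), while coord_cartesian() only zooms the view and keeps all the data.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed for a reproducible sample
small <- diamonds[sample(nrow(diamonds), 1000), ]
p <- ggplot(small, aes(x = carat, y = price)) + geom_point()

# Separate blocks: points outside the limits are removed
p_lim <- p + xlim(c(1, 2)) + ylim(c(0, 10000))

# coord_cartesian: pure zoom, no data is removed
p_zoom <- p + coord_cartesian(xlim = c(1, 2), ylim = c(0, 10000))

# Swap the axes
p_flip <- p + coord_flip()

# Title and axis labels
p_labs <- p + labs(title = "Diamond price by weight",
                   x = "Weight (carat)",
                   y = "Price (USD)")
print(p_labs)
```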
Facet grid
We can also split the original graph into subgraphs by a
categorical variable.
For this, we need to add:
+ facet_grid(. ~ cut)
This arranges the subgraphs as "columns".
To arrange the subgraphs as "rows", we can do:
+ facet_grid(cut ~ .)
Facet grid vs Facet wrap
facet_grid() forms a matrix of panels defined by row and
column faceting variables.
It is most useful when you have two discrete variables, and
all combinations of the variables exist in the data. You can
also do:
+ facet_grid(cut ~ color)
If you have only one variable with many levels, try
facet_wrap().
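The faceting options can be sketched on the same scatterplot. Using clarity for facet_wrap is a choice made for this example (it has eight levels); any categorical column would work.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed for a reproducible sample
small <- diamonds[sample(nrow(diamonds), 1000), ]
p <- ggplot(small, aes(x = carat, y = price)) + geom_point()

p_cols <- p + facet_grid(. ~ cut)      # one column of panels per cut
p_rows <- p + facet_grid(cut ~ .)      # one row of panels per cut
p_grid <- p + facet_grid(cut ~ color)  # rows by cut, columns by color

# A single variable with many levels: panels wrap onto several rows
p_wrap <- p + facet_wrap(~ clarity)

print(p_cols)
print(p_wrap)
```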
Design of the graphs
You can edit anything in the ggplot graph.
• With built-in themes: these include theme_classic(),
theme_minimal(), theme_dark(), theme_light(),
theme_grey(), theme_bw(), among others.
• By modifying aspects of the plot individually within
theme().
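Applying a built-in theme is one more building block added with +, as in this short sketch:

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = cut)) + geom_bar()

# Each theme function returns a complete theme that replaces the default look
p_minimal <- p + theme_minimal()
p_bw <- p + theme_bw()

print(p_minimal)
print(p_bw)
```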
Modifying subparts
• Within theme(), write the argument name for the plot
element you want to edit, like:
plot.title =
• Provide an element_*() function to the argument.
• Most often, use element_text()
• element_rect() for canvas background colors
• element_blank() to remove plot elements
• Within the element_*() function, write argument
assignments to make the fine adjustments you desire.
• legend.position = accepts simple values like "bottom",
"top", "left", and "right".
Modifying subparts
Example:
Let's take one of the graphs and make the following
changes:
• Place the legend at the bottom of the graph,
• Change the plot's title to font size 30 and style "italic",
• Change the X axis tick mark labels to have color "red",
font size 15, and be rotated 90 degrees,
• Change the Y axis tick mark labels to have font size 15,
• Change both axis labels to have font size 20,
• Change the plot's background color,
• Delete part of the plot's background grid.
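One way the listed changes could be written, as a sketch. The choice of graph, the title text, the background color, and the decision to remove the minor grid lines (rather than the major ones) are all assumptions made for this example.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed for a reproducible sample
small <- diamonds[sample(nrow(diamonds), 1000), ]

p <- ggplot(small, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  labs(title = "Price vs carat") +
  theme(
    legend.position = "bottom",                             # legend below the plot
    plot.title = element_text(size = 30, face = "italic"),  # title font
    axis.text.x = element_text(color = "red", size = 15, angle = 90),  # X tick labels
    axis.text.y = element_text(size = 15),                  # Y tick labels
    axis.title = element_text(size = 20),                   # both axis labels
    plot.background = element_rect(fill = "lightyellow"),   # background color
    panel.grid.minor = element_blank()                      # drop the minor grid lines
  )
print(p)
```

Each line inside theme() pairs a plot element with an element_*() call, so changes can be added or removed one at a time.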
Summary
Reading
https://epirhandbook.com/en/ggplot-basics.html
Questions?