Chapter: Exploratory Data Analysis and Visualization
Topics to be covered:
1. Importing Data in Various Formats to R Environment
2. Examining a Data Frame
3. Examining for Missing Values in a Data Frame
4. Examining Extreme Values/ Outliers
5. Use of Table and Proportion Table for Qualitative Variables
6. Use of Two-Way Table and Proportion Table for Qualitative Variables
7. Basic Descriptive Statistics for Quantitative Variables
8. Basic Data Visualization: Qualitative Variables
9. Basic Data Visualization: Quantitative Variables
# Importing Data in Various Formats to R Environment
It is important to know how to import data in various formats into the R environment. In this
section, we will learn:
Importing excel data
Importing CSV data
Importing data in other Formats
myData <- readxl::read_excel("file_name.xlsx", sheet = "sheet_name", col_names = TRUE)
The external package readxl is required; the data will be imported as a tibble (a flexible
variant of the data frame).
Class Exercise-1: Import the sheet named Major from the Excel workbook Collected_Data_1,
label it myMajor, and examine the structure of myMajor.
Class Exercise-2: Import the single-sheet Excel workbook Gig, label it myGigData, and
examine the structure of myGigData.
myData <- read.csv("file_name.csv", header = TRUE)
No external package is required; the data will be imported as a data frame.
Class Exercise-3: Import the Gig.csv data file and label it as myData and examine its structure.
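The CSV import step can be tried without any external file. The following is a minimal sketch, assuming a small made-up data frame (the column names and values are hypothetical and only stand in for a real file such as Gig.csv): it writes the data to a temporary CSV file and reads it back with read.csv.

```r
# Hypothetical data standing in for a real file such as Gig.csv
sample_data <- data.frame(
  EmployeeID = 1:4,
  HourlyWage = c(32.5, 41.0, NA, 46.2),
  Industry   = c("Tech", "Construction", "Tech", "Retail")
)

# Write the data to a temporary CSV file, then import it again
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_data, csv_path, row.names = FALSE)

myData <- read.csv(csv_path, header = TRUE)

class(myData)  # read.csv returns a plain data frame
str(myData)    # column names, column types, and the first few values
```

Missing entries written as NA in the file are read back as NA, which is what the later sections on missing values rely on.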
# Examining a Data Frame
A data frame can be systematically examined in R by evaluating the following aspects:
Structure of the Data Frame (str function)
Dimension of the Data Frame (dim function)
Top Six Rows (head function)
Bottom Six Rows (tail function)
Top 4 and Bottom 4 Rows (headTail function from the psych package)
Statistical Summary of Data Frame (summary function)
Names of the variables or columns (names or colnames function)
Names of the rows or observations (rownames function)
Class Exercise-4: Evaluate myData data frame created by importing the Gig.csv file in terms of
dimension, top six rows, bottom six rows, top 4 & bottom 4 rows, and the statistical summary
of the data frame. Print the variables or column names of myData. Print the first 3 row names
of myData.
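The functions listed above can be run one after another on any data frame. The following is a minimal sketch that uses the built-in mtcars data set as a stand-in for myData (an assumption for illustration; the Gig.csv data is not reproduced here). The headTail call is skipped if the psych package is not installed.

```r
df <- mtcars  # built-in data set, used here as a stand-in for myData

str(df)            # structure of the data frame
dim(df)            # number of rows and columns
head(df)           # top six rows
tail(df)           # bottom six rows
summary(df)        # statistical summary of every column
names(df)          # variable (column) names
rownames(df)[1:3]  # first three row names

# headTail() comes from the psych package; run it only if psych is available
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::headTail(df))
}
```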
# Examining for Missing Values in a Data Frame
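The presence or absence of missing values can be examined for the entire data frame by using
the following code in R:
is.na(data_frame) # returns a logical matrix: TRUE where a value is missing
sum(is.na(data_frame)) # returns the total number of missing values in the data frame
sum(!complete.cases(data_frame)) # returns the number of rows with at least one missing value
rows_with_na <- data_frame[!complete.cases(data_frame), ] # extracts the rows with missing values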
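We can determine the variable-wise or column-wise number of missing values by using the
following code in R with the native pipe operator (|>, available in R 4.1.0 and later):
colSums(is.na(data_frame)) |>
t() |>
t() # transposing twice displays the counts as a single column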
Class Exercise-5: Determine the number of rows containing missing values in the data frame
myData. Extract the rows containing missing values and label them rows_with_na. Show the
variable-wise or column-wise counts of missing values. Which columns or variables do not
have any missing values?
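The difference between counting missing values and counting rows with missing values is easy to see on a small example. The following is a minimal sketch on a hypothetical five-row data frame (all names and values are made up for illustration).

```r
# Hypothetical data frame with some missing values
df <- data.frame(
  id       = 1:5,
  wage     = c(30, NA, 45, 50, NA),
  industry = c("Tech", "Tech", NA, "Retail", "Retail"),
  job      = c("Analyst", "Engineer", "Clerk", "Clerk", NA)
)

total_na     <- sum(is.na(df))            # total number of missing cells
n_rows_na    <- sum(!complete.cases(df))  # rows with at least one NA
rows_with_na <- df[!complete.cases(df), ] # the incomplete rows themselves
na_by_column <- colSums(is.na(df))        # missing values per column

total_na      # 4
n_rows_na     # 3 (row 5 has two missing values)
rows_with_na
na_by_column

# Columns without any missing values
names(na_by_column)[na_by_column == 0]
```

Here row 5 contains two missing values, so the total count of missing cells (4) is larger than the number of incomplete rows (3).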
The presence or absence of missing value(s) can be examined for a particular variable in the
data frame by using the following functions:
is.na(data_frame$vector_1) # returns a logical vector
which(is.na(data_frame$vector_1)) # returns the positions (row numbers) with NA in vector_1
sum(is.na(data_frame$vector_1)) # returns the number of missing values in vector_1
na_in_vector_1 <- data_frame[is.na(data_frame$vector_1), ] # extracts the rows with NA in vector_1
Class Exercise-6: Determine the number of missing values in the industry variable in myData
and identify the rows containing missing values in the industry variable.
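The single-variable checks can be sketched in the same way. The example below assumes a hypothetical data frame with one character column named industry; the values are made up for illustration.

```r
# Hypothetical data frame with missing values in one variable
df <- data.frame(
  id       = 1:6,
  industry = c("Tech", NA, "Retail", "Tech", NA, "Retail")
)

na_positions   <- which(is.na(df$industry))  # row positions of the NAs
n_missing      <- sum(is.na(df$industry))    # how many values are missing
na_in_industry <- df[is.na(df$industry), ]   # the affected rows

na_positions   # 2 5
n_missing      # 2
na_in_industry
```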
# Examining Extreme Values/ Outliers
There are several ways to detect extreme values or outliers. The most frequently used
method is the boxplot (also known as the box-and-whisker plot), developed by John Tukey
in 1970.
The boxplot can be constructed by using the following code in R:
boxplot(dataframe$vector, horizontal = TRUE) # plot the boxplot horizontally
The number of small circles indicates the number of distinct outliers we have in the vector.
All the outliers can be identified by using the following code in R:
boxplot(dataframe$vector, horizontal = TRUE, plot = FALSE)$out
The total number of outliers can be identified by using the following code in R:
length(boxplot(dataframe$vector, horizontal = TRUE, plot = FALSE)$out)
All the outliers can be tabulated in terms of their frequency by using the following code in R:
table(boxplot(dataframe$vector, horizontal = TRUE, plot = FALSE)$out)
Class Exercise-7: Import the file stored in onlineshop.csv and label it as onlineshop. Evaluate
the structure of the data frame onlineshop. Use Tukey’s method to identify the number of
distinct outliers in the AGE variable of the data frame. Identify all the outliers of the variable
AGE. How many outliers do we have here? Tabulate the outliers. What is the least frequently
occurring outlier in the AGE variable?
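With its default setting (range = 1.5), boxplot flags values that lie more than 1.5 times the interquartile range beyond the hinges. The following is a minimal sketch on a hypothetical age vector (the values are made up and are not the onlineshop data); it also recomputes the fences by hand with fivenum to show where the outliers come from.

```r
# Hypothetical ages; the last three values are unusually large
age <- c(20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 65, 65, 80)

outliers <- boxplot(age, plot = FALSE)$out

outliers                  # all outliers: 65 65 80
length(outliers)          # total number of outliers: 3
length(unique(outliers))  # number of distinct outliers: 2
table(outliers)           # frequency of each outlier value

# The same result by hand: fences at 1.5 * IQR beyond the hinges
hinges <- fivenum(age)[c(2, 4)]                 # lower and upper hinge
fences <- hinges + c(-1.5, 1.5) * diff(hinges)  # lower and upper fence
manual <- age[age < fences[1] | age > fences[2]]
manual
```

In this example the hinges are 23 and 30, so the fences are 12.5 and 40.5, and only 65, 65, and 80 fall outside them.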
We can also create boxplots across categories by using the formula interface of boxplot in R:
boxplot(vector1 ~ vector2, data = dataframe) # vector1 is numeric and vector2 is categorical
Class Exercise-8: Create a series of boxplots of hourly wage across the categories of industry
from the myData data frame created by importing the Gig.csv file. In which industry or
industries do we not have any outliers?
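Grouped boxplots can be sketched with the formula interface as follows. The data frame below is hypothetical (two made-up industries with six wages each); the second call uses plot = FALSE to list each outlier together with the group it belongs to.

```r
# Hypothetical wages in two industries
df <- data.frame(
  wage     = c(30, 31, 32, 33, 34, 60,  40, 41, 42, 43, 44, 45),
  industry = rep(c("Retail", "Tech"), each = 6)
)

# One boxplot per industry
boxplot(wage ~ industry, data = df,
        main = "Hourly Wage by Industry",
        xlab = "Industry",
        ylab = "Hourly Wage")

# Outliers and the group each one belongs to, without drawing the plot
bp <- boxplot(wage ~ industry, data = df, plot = FALSE)
outlier_table <- data.frame(industry = bp$names[bp$group], outlier = bp$out)
outlier_table
```

In this made-up example only the Retail group has an outlier (60); the Tech group has none.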
# Use of Table and Proportion Table for Qualitative Variables
To get a frequency table for a qualitative variable in a data frame, we may run the
following code in R:
table(dataframe$vector)
To get the proportion table for a qualitative variable in a data frame, we may run the
following code in R:
proportions(table(dataframe$vector))
Class Exercise-9: Load the Gig.csv data into the R environment and store it as myData. Create a
subset of myData by discarding the rows with missing values and store it as myDataComplete.
Create a frequency table of the qualitative variable industry and also create a proportion
table of the same variable.
Solution: Here,
myData <- read.csv("Gig.csv")
myDataComplete <- na.omit(myData)
table(myDataComplete$Industry)
proportions(table(myDataComplete$Industry))
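The same two steps can be tried on a small made-up vector. This is a minimal sketch with hypothetical industry labels; proportions() is available in R 4.0.1 and later, and prop.table() gives the same result in older versions.

```r
# Hypothetical industry labels for eight workers
industry <- c("Tech", "Retail", "Tech", "Construction",
              "Tech", "Retail", "Tech", "Retail")

freq <- table(industry)    # frequency table
prop <- proportions(freq)  # proportion table (sums to 1)

freq
prop
round(100 * prop, 1)       # the same proportions expressed as percentages
```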
# Use of Two-Way Table and Proportion Table for Multiple Qualitative Variables
To create a table of two qualitative variables in the R environment, we pass both variables
to the table function, separated by a comma. The first variable is arranged in the rows and
the second variable is arranged in the columns.
table(dataframe$vector1, dataframe$vector2)
To get the proportion table for multiple qualitative variables, we run
proportions(table(dataframe$vector1, dataframe$vector2), 1) # proportions within each row (each row sums to 1)
proportions(table(dataframe$vector1, dataframe$vector2), 2) # proportions within each column (each column sums to 1)
Class Exercise-10: Create a two-way frequency table for the variables industry and job in the
myDataComplete data frame. Create a proportion table of the same two variables across the
rows.
Solution: Here,
table(myDataComplete$Industry, myDataComplete$Job)
proportions(table(myDataComplete$Industry, myDataComplete$Job), 1)
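The effect of the margin argument is easiest to see side by side. The following is a minimal sketch on a hypothetical data frame with made-up industry and job values.

```r
# Hypothetical industry and job values for eight workers
df <- data.frame(
  industry = c("Tech", "Tech", "Tech", "Retail",
               "Retail", "Retail", "Retail", "Tech"),
  job      = c("Analyst", "Engineer", "Engineer", "Clerk",
               "Clerk", "Analyst", "Clerk", "Analyst")
)

two_way  <- table(df$industry, df$job)  # industry in rows, job in columns
row_prop <- proportions(two_way, 1)     # each row sums to 1
col_prop <- proportions(two_way, 2)     # each column sums to 1

two_way
row_prop
col_prop
```

Reading the two tables answers different questions: row_prop["Retail", "Clerk"] is the share of Retail workers who are clerks, while col_prop["Tech", "Analyst"] is the share of analysts who work in Tech.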
# Basic Descriptive Statistics for Quantitative Variables
There are many ways of calculating descriptive statistics for quantitative variables in R. The
following base R functions have already been introduced for calculating various descriptive
statistics:
min()
max()
mean()
var()
sd()
median()
summary()
The summary function in R, applied to a quantitative variable, returns the minimum, 1st
quartile, median (2nd quartile), mean, 3rd quartile, and maximum. The syntax for running the
summary function on a quantitative variable vector1 in a data frame is given below:
summary(dataframe$vector1)
Class Exercise-11: Determine the summary statistics for the variable hourly wage in the dataframe
myDataComplete.
Solution: Here
summary(myDataComplete$HourlyWage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  24.28   34.55   41.82   40.15   46.02   51.00
But one of the most comprehensive ways of calculating descriptive statistics in the R environment is
to use the describe() function from the psych package. To install and activate the psych package,
we run the following code:
install.packages("psych") # if not already installed
library(psych) # makes the psych package active in this session
To run the describe function from psych package, we run
describe(dataframe$vector1) # where vector1 is a numerical variable
We can calculate the group-wise statistics by using the describeBy function from psych:
describeBy(dataframe$vector1, dataframe$vector2) # where vector2 is the grouping variable.
If we want to apply a specific function across the groups, we can use the tapply function in
base R:
tapply(dataframe$vector1, dataframe$vector2, function_name)
Class Exercise-12: Calculate the descriptive statistics of the variable hourly wage in the
dataframe myDataComplete. Calculate the descriptive statistics across the industry variable.
Calculate the mean values of HourlyWage across the industry variable.
Solution: Here,
describe(myDataComplete$HourlyWage)
describeBy(myDataComplete$HourlyWage, myDataComplete$Industry)
tapply(myDataComplete$HourlyWage, myDataComplete$Industry, mean)
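The group-wise use of tapply can be checked by hand on a small example. The following is a minimal sketch in base R on a hypothetical data frame (made-up wages in two industries); any function that takes a numeric vector can be passed as the third argument.

```r
# Hypothetical wages in two industries
df <- data.frame(
  wage     = c(30, 40, 50, 20, 30, 40),
  industry = rep(c("Tech", "Retail"), each = 3)
)

group_means <- tapply(df$wage, df$industry, mean)  # mean wage per industry
group_sds   <- tapply(df$wage, df$industry, sd)    # standard deviation per industry

group_means  # Retail 30, Tech 40
group_sds    # 10 in both groups
summary(df$wage)
```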
Skewness and kurtosis can be calculated with the e1071 package in the following way:
library(e1071) # run install.packages("e1071") first if not already installed
skewness(dataframe$vector1) # where vector1 is a numeric vector
kurtosis(dataframe$vector1) # where vector1 is a numeric vector
Skewness and kurtosis can also be calculated with the skew() and kurtosi() functions from the
psych package.
Class Exercise-13: Calculate the skewness and kurtosis from the HourlyWage variable in the
myDataComplete dataframe and interpret.
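To help with the interpretation, the moment-based definitions of skewness and excess kurtosis can be written in a few lines of base R. The two helper functions below are hypothetical names introduced for illustration and are not part of e1071 or psych; the package functions may apply small-sample adjustments, so their results can differ slightly from this sketch.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over the squared
# second central moment, minus 3 (the value for a normal distribution)
moment_excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)      # symmetric around 3
right_skewed <- c(1, 1, 2, 2, 3, 10)  # one large value on the right

moment_skewness(symmetric)         # 0 for a symmetric sample
moment_skewness(right_skewed)      # positive: long right tail
moment_excess_kurtosis(symmetric)  # negative: flatter than a normal distribution
```

A skewness near zero suggests a roughly symmetric distribution, a positive value suggests a longer right tail, and a negative value suggests a longer left tail.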
# Basic Data Visualization: Qualitative Variables
The most commonly used methods to visualize qualitative (categorical) variables are the
bar chart and the pie chart.
Bar Chart: We can construct a bar chart in the R environment by using the barplot function and
setting certain parameters:
barplot(table(dataframe$vector1), # where vector1 is a qualitative variable
main = "The Title of the Plot",
xlab = "The label for x-axis",
ylab = "The label for y-axis")
Class Exercise-14: Construct a bar chart from the variable industry in the myDataComplete data
frame by setting the title as Industry Distribution of Workers and labeling x and y axis as you
may deem appropriate without changing the color parameter in the R environment.
Solution: Here,
barplot(table(myDataComplete$Industry),
main = "Industry Distribution of Workers",
xlab = "Industry",
ylab = "Numbers of Employees")
Pie Chart: We can create a pie chart in the R environment by using the pie function and setting
certain parameters:
pie(table(dataframe$vector1), # where vector1 is a qualitative variable
main = "The Title of the Plot")
Class Exercise-15: Construct a pie chart from the variable industry in the myDataComplete data
frame by setting the title as Industry Distribution of Workers without changing the color
parameter in the R environment.
Solution: Here,
pie(table(myDataComplete$Industry),
main = "Industry Distribution of Workers")
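Both charts accept further parameters beyond the title and axis labels. The following is a minimal sketch on hypothetical industry labels (made up for illustration) that sets the bar color with the col parameter and adds percentage labels to the pie slices.

```r
# Hypothetical industry labels for eight workers
industry <- c("Tech", "Retail", "Tech", "Construction",
              "Tech", "Retail", "Tech", "Retail")
counts <- table(industry)

# Bar chart with a custom bar color; barplot returns the bar midpoints
bar_midpoints <- barplot(counts,
                         main = "Industry Distribution of Workers",
                         xlab = "Industry",
                         ylab = "Number of Workers",
                         col  = "yellowgreen")

# Pie chart with the percentage of each category added to its label
pct        <- round(100 * as.vector(proportions(counts)))
pie_labels <- paste0(names(counts), " (", pct, "%)")
pie(counts, labels = pie_labels, main = "Industry Distribution of Workers")
```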
# Basic Data Visualization: Quantitative Variables
The most commonly used methods to visualize quantitative variables are:
Histogram and
Boxplot
Histogram: A basic histogram for a quantitative variable can be constructed by using the hist()
function and setting certain parameters:
hist(dataframe$vector1, # vector1 is a numeric (quantitative) vector
main = "The Title of the Plot",
xlab = "The label for x-axis",
ylab = "The label for y-axis")
Class Exercise-16: Construct a histogram for the hourly wage in the myDataComplete dataframe
by setting title as The Distribution of Hourly Wage and labeling x and y axis as you may deem
appropriate without changing the color parameter in the R environment.
Boxplot: A basic boxplot for a quantitative variable can be constructed by using the boxplot
function and setting certain parameters:
boxplot(dataframe$vector1, # vector1 is a numeric (quantitative) vector
main = "The Title of the Plot",
horizontal = TRUE) # by default, horizontal is set to FALSE
Class Exercise-17: Construct a boxplot for the hourly wage in the myDataComplete data frame by
setting title as The Distribution of Hourly Wage and labeling x and y axis as you may deem
appropriate without changing the color parameter in the R environment.
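The histogram and boxplot calls can be tried on simulated data. The following is a minimal sketch, assuming 200 made-up hourly wages drawn from a normal distribution (not the Gig.csv data). Note that a single number passed to breaks is treated by hist as a suggestion, so the number of bars drawn may differ from the requested value.

```r
set.seed(1)  # make the simulated wages reproducible
wage <- round(rnorm(200, mean = 40, sd = 6), 2)  # hypothetical hourly wages

# Histogram with the default number of bins
hist(wage,
     main = "The Distribution of Hourly Wage",
     xlab = "Hourly Wage",
     ylab = "Frequency")

# Histogram with roughly 30 bins for a finer view
hist(wage, breaks = 30,
     main = "The Distribution of Hourly Wage (breaks = 30)",
     xlab = "Hourly Wage",
     ylab = "Frequency")

# Horizontal boxplot of the same variable
boxplot(wage,
        main = "The Distribution of Hourly Wage",
        xlab = "Hourly Wage",
        horizontal = TRUE)

# The bin counts can be inspected without drawing the plot
h <- hist(wage, plot = FALSE)
sum(h$counts)  # equals the number of observations
```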
Exercises: Exploratory Data Analysis and Visualization
Exercise-1: Import the onlineshop.csv file and label it as onlineshop in the R environment.
Evaluate onlineshop in terms of dimension, top six rows, bottom six rows, top 4 & bottom 4
rows, and the statistical summary of the data frame. Print the variables or column names of
onlineshop. Print the first 3 row names of onlineshop.
Exercise-2: Consider the onlineshop data in Exercise-1. Determine the number of rows
containing missing values in the data frame onlineshop. Extract the rows containing missing
values and label them rows_with_na. Show the variable-wise or column-wise counts of
missing values. Which columns or variables do not have any missing values?
Exercise-3: Consider the onlineshop data in Exercise-1. Determine the number of missing
values in the TYPE variable in onlineshop and identify the rows containing missing values in
the TYPE variable.
Exercise-4: Consider the onlineshop data in Exercise-1. Create a subset of the onlineshop data
frame by discarding all the rows with missing values and label the new data frame
onlineshopComplete. How many observations do we have in this new data frame? How many
of the variables are numeric? How many of them are character?
Exercise-5: Consider the onlineshopComplete data frame created in Exercise-4. Convert the
variable TYPE into a factor, labeling 1 = Manufacturing and 2 = Service. Create a series of
boxplots across the TYPE of industry. How many distinct extreme values do we have in each
type?
Exercise-6: Consider the data frame onlineshopComplete in Exercise-5. Create a frequency
table of the qualitative variable PAYMENT_METHOD and create a proportion table of the same
variable. What percentage of customers are using PayPal?
Exercise-7: Consider the data frame onlineshopComplete in Exercise-5. Create a two-way
frequency table for the variables GENDER and PAYMENT_METHOD in the
onlineshopComplete data frame. Create a proportion table of the same two variables across
the rows. What percentage of PayPal users are male? Did you need to create another
proportion table? Explain.
Exercise-8: Consider the data frame onlineshopComplete in Exercise-5. Determine the
summary statistics for the variable CREDIT_SCORE in the data frame onlineshopComplete.
What is the mean credit score? Compare the median and mean credit scores: which one
is higher? Calculate the mean value of the variable CREDIT_SCORE across the variable GENDER.
Which gender has the higher mean credit score and the lower skewness?
Exercise-9: Consider the data frame onlineshopComplete in Exercise-5. Construct a bar chart
from the variable PAYMENT_METHOD in the onlineshopComplete data frame by setting the
title as Payment Method Distribution of the Users, labeling the x and y axes as you may deem
appropriate, and changing the color parameter to "yellowgreen" in the R environment.
Exercise-10: Consider the data frame onlineshopComplete in Exercise-5. Construct a
histogram for AGE in the onlineshopComplete data frame by setting the title as The
Distribution of User Age and labeling the x and y axes as you may deem appropriate without
changing the color parameter in the R environment. Create another histogram by setting
breaks = 30. Which of these histograms is more informative? Explain.