
QB Samplealllllll Hemu

Uploaded by

movinreddy2016

Sl. No.  Questions
1. What is R? Give the features of R. What are the limitations of R? Explain.
• R is a language and environment for statistical computing and graphics.
• R provides a wide variety of statistical and graphical techniques, and is
highly extensible.
• R programming is a leading tool for machine learning, statistics, and data
analysis.
• The environment in R is a data structure that stores objects such as variables,
functions, and data frames.
• One of R's strengths is the ease with which well-designed publication-quality
plots can be produced, including mathematical symbols and formulae where
needed.
• R is available as Free Software under the terms of the Free Software
Foundation's GNU General Public License in source code form. It compiles and
runs on a wide variety of UNIX platforms and similar systems (including
Linux), Windows and macOS.
• R is a true computer language: it allows users to add functionality by
defining new functions.
• Much of the system is itself written in R. R can be linked to C, C++ and
Fortran code and can call this code at run time. Advanced users can write C
code to manipulate R objects directly.
Features of R are:
• It is an open-source tool
• R supports Object-oriented as well as Procedural programming.
• It provides an environment for statistical computation and software
development.
• It provides extensive packages and libraries.
• Platform independence.
• Integration with other languages.
• Robust community and support: R has a large community where people can share
and learn from experts.
Limitations:
• R has less support for dynamic or 3D graphics.
• R consumes more memory. R objects must generally be stored in physical
memory.
• Its functionality is based on consumer demand and (voluntary) user
contributions. If no one feels like implementing your favorite method, then
it is your job to implement it.
• Processing of very large data sets is slow.

2. How will you enter input and perform evaluation in R?

At the R prompt, expressions are typed in. The <- symbol is the assignment operator.
> x <- 1
> print(x)
[1] 1
> x
[1] 1
> msg <- "hello"
> x <-  ## Incomplete expression
The # character indicates a comment. Anything to the right of the # (including the #
itself) is ignored. R does not support multi-line comments or comment blocks.

EVALUATION:

When a complete expression is entered at the prompt, it is evaluated and the result
of the evaluated expression is returned. The result may be auto-printed.
> x <- 5 ## nothing printed
> x ## auto-printing occurs
[1] 5
> print(x) ## explicit printing
[1] 5
The [1] shown in the output indicates that x is a vector and 5 is its first element.

3. What are the various objects in R?

R has five basic or "atomic" classes of objects:
• character
• numeric (real numbers)
• integer (written with an L suffix, e.g. 1L)
• complex
• logical (TRUE/FALSE)

Other commonly used objects are built from these:
• Vector
• List
• Data frame
• Matrix: matrices are two-dimensional vectors with a dimension attribute.
The dimension attribute is itself an integer vector of length 2 (number of rows,
number of columns).
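The classes listed above can be checked directly at the prompt. The following is a minimal sketch; the variable names (values, classes, m) are chosen only for illustration.

```r
# One value of each atomic class, kept in a list so that no coercion happens
values <- list("a", 3.14, 2L, 1 + 2i, TRUE)
classes <- sapply(values, class)
print(classes)

# A matrix is a vector that carries a dim attribute (rows, columns)
m <- matrix(1:6, nrow = 2, ncol = 3)
print(dim(m))
print(attributes(m))
```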

4. What are vectors? With the help of code, explain how to create a vector in R.
• Vectors are the basic building blocks of R. A vector is a sequence of
elements of the same data type: a one-dimensional, homogeneous data
structure.
• Vectors are usually created with the c() function. Empty vectors can be created
with the vector() function. A vector can only contain objects of the same class.
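A short sketch of the usual ways to create vectors follows; the variable names are illustrative only.

```r
x <- c(0.5, 0.6)                     # numeric vector
y <- c(TRUE, FALSE)                  # logical vector
z <- c("a", "b", "c")                # character vector
w <- 9:12                            # integer sequence
v <- vector("numeric", length = 5)   # empty numeric vector, initialised to 0

print(v)
print(class(w))
print(length(z))
```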

5. Explain implicit and explicit coercion.

Implicit
When different objects are mixed in a vector, coercion occurs so that every element
in the vector is of the same class.
• This is implicit coercion: R tries to find a way to represent all of the
objects in the vector in a reasonable fashion. For example, c(1.7, "a")
produces a character vector.
• The hierarchy of coercion in R is: logical → integer → double →
character.
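The coercion hierarchy can be observed by mixing classes inside c(). This is a small sketch with illustrative variable names; the last line shows that a coercion which makes no sense yields NA (with a warning, suppressed here).

```r
v1 <- c(1.7, "a")    # numeric + character -> character
v2 <- c(TRUE, 2)     # logical + numeric   -> numeric (TRUE becomes 1)
v3 <- c("a", TRUE)   # character + logical -> character

print(class(v1))
print(v2)
print(v3)

# Explicit coercion that cannot succeed produces NA
bad <- suppressWarnings(as.numeric("abc"))
print(bad)
```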
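Explicit
Objects can be explicitly coerced from one class to another with the as.*
functions, such as as.numeric(), as.character() and as.logical().
char_num <- "25"
num <- as.numeric(char_num)
print(num)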
6. How will you create a matrix in R? Explain its parameters and operations.
matrix(data, nrow, ncol, byrow = FALSE, dimnames = NULL)
• data: The vector of values used to fill the matrix.
• nrow, ncol: The number of rows and columns.
• byrow: Logical value. If TRUE, fills the matrix by rows (default is FALSE,
which fills by columns).
• dimnames: Optional list of row and column names.
mat <- matrix(1:9, nrow = 3, ncol = 3)
print(mat)

Matrix Operations
Here mat2 is a second matrix with the same dimensions as mat.
• Addition: mat + mat2
• Subtraction: mat - mat2
• Multiplication (element-wise): mat * mat2
• Division (element-wise): mat / mat2
• Matrix multiplication: use the %*% operator, as in mat %*% mat2.
• Transpose: use the t() function, as in t(mat).
• Indexing: use square brackets [] to access specific elements, rows, or
columns, as in mat[1, 2], mat[1, ] or mat[, 2].
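The difference between element-wise and true matrix multiplication is easiest to see on two small matrices. The following sketch uses 2 x 2 matrices chosen only for illustration.

```r
mat  <- matrix(1:4, nrow = 2)   # filled by column: rows are (1, 3) and (2, 4)
mat2 <- matrix(5:8, nrow = 2)   # rows are (5, 7) and (6, 8)

elementwise <- mat * mat2       # multiplies matching positions
product     <- mat %*% mat2     # linear-algebra matrix product
transposed  <- t(mat)           # rows become columns

print(elementwise)
print(product)
print(transposed)

print(mat[1, 2])   # single element: row 1, column 2
print(mat[2, ])    # whole second row
print(mat[, 1])    # whole first column
```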

7. What are factors? Explain.


• Factors are used to represent categorical data and can be unordered or
ordered.
• One can think of a factor as an integer vector where each integer has a label.
• Factors are important in statistical modeling and are treated specially by
modelling functions like lm() and glm().
• Factors are self-describing. Having a variable that has values “Male” and
“Female” is better than a variable that has values 1 and 2.
• Factor objects can be created with the factor() function.
Syntax:
factor(x, levels = NULL, ordered = FALSE)
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x ## Levels are put in alphabetical order
[1] yes yes no yes no
Levels: no yes
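The levels and ordered arguments from the syntax above change how a factor behaves. This is a minimal sketch; the sizes factor is a made-up example.

```r
# Set the level order explicitly instead of relying on alphabetical order
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
print(levels(x))
print(table(x))        # counts per level
print(as.integer(x))   # the underlying integer codes

# An ordered factor supports comparisons between levels
sizes <- factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)
print(sizes[1] < sizes[2])
```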

8. Illustrate how R handles missing values.

Missing values are denoted by NA, or by NaN for undefined mathematical operations.
• is.na() is used to test objects if they are NA.
• is.nan() is used to test for NaN.
• NA values have a class also, so there are integer NA, character NA, etc.
• A NaN value is also NA, but the converse is not true.
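The relationship between NA and NaN can be checked with the two test functions. The vectors below are illustrative only.

```r
x <- c(1, 2, NA, 10, 3)
print(is.na(x))    # TRUE only at the NA position
print(is.nan(x))   # all FALSE: NA is not NaN

y <- c(1, 2, NaN, NA, 4)
print(is.na(y))    # TRUE for both NaN and NA
print(is.nan(y))   # TRUE only for NaN

undefined <- 0 / 0 # an undefined operation yields NaN
print(undefined)
```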
9. What are data frames? Explain
• Data frames are used to store tabular data in R.
• They are an important type of object in R and are used in a variety of
statistical modeling applications.
• Hadley Wickham’s package dplyr has an optimized set of functions
designed to work efficiently with data frames.
• Data frames are represented as a special type of list where every element of
the list has to have the same length.
• Each element of the list can be thought of as a column and the length of each
element of the list is the number of rows.
• Data frames can store different classes of objects in each column.
• In addition to column names, indicating the names of the variables or
predictors, data frames have a special attribute called row.names, which
indicates information about each row of the data frame.
• Data frames are usually created by reading in a dataset using
read.table() or read.csv().
• Data frames can also be created explicitly with the data.frame() function.
Printing a small data frame a with three columns and row names gives:

> a
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX234       distal       8.0     TRUE
XX987     proximal      10.0    FALSE
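A data frame with the contents printed above could be built with data.frame() roughly as follows. This is a sketch that reuses the values from the printed output; how the original a was actually created is not shown in these notes.

```r
a <- data.frame(
  localisation = c("proximal", "distal", "proximal"),
  tumorsize    = c(6.3, 8.0, 10.0),
  progress     = c(FALSE, TRUE, FALSE),
  row.names    = c("XX348", "XX234", "XX987")
)

print(a)
print(nrow(a))            # number of rows
print(ncol(a))            # number of columns
print(sapply(a, class))   # each column keeps its own class
```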

10. Explain how reading and writing operations of different formats are performed in
R.
READ:
• read.table, read.csv, for reading tabular data
• readLines, for reading lines of a text file
• source, for reading in R code files (inverse of dump)
• dget, for reading in R code files (inverse of dput)
• load, for reading in saved workspaces

WRITE:
• write.table, for writing tabular data to text files (e.g. CSV) or connections
• writeLines, for writing character data line-by-line to a file or connection
• dump, for dumping a textual representation of multiple R objects
• dput, for outputting a textual representation of an R object
• save, for saving an arbitrary number of R objects in binary format to a file
• serialize, for converting an R object into a binary format for outputting to a
connection (or file)
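The read and write functions pair up, so a round trip is a convenient way to see them in action. The sketch below writes to temporary files; the data frame df and the path variables are made up for illustration.

```r
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Tabular text: write.csv() / read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
df_csv <- read.csv(csv_path)

# Textual R representation: dput() / dget()
r_path <- tempfile(fileext = ".R")
dput(df, file = r_path)
df_dget <- dget(r_path)

# Binary format: save() / load() restores the object under its original name
rda_path <- tempfile(fileext = ".rda")
save(df, file = rda_path)
rm(df)
load(rda_path)

print(df_csv)
print(df_dget)
print(df)
```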

11. Explain subsetting a list, matrix and vector in R. How will you subset nested
elements in a list?

Subsetting means extracting subsets from R objects such as lists, vectors and matrices.
There are three operators that can be used to extract subsets of R objects.
• The [ operator always returns an object of the same class as the original. It can be
used to select multiple elements of an object.
• The [[ operator is used to extract elements of a list or a data frame. It can only be
used to extract a single element and the class of the returned object will not
necessarily be a list or data frame.
• The $ operator is used to extract elements of a list or data frame by literal name. Its
semantics are similar to that of [[.
• Matrices are subset with two indices, m[i, j]; leaving an index blank selects the
whole row or column.
• Nested elements of a list can be extracted by passing an integer vector to [[, or by
chaining [[ calls: x[[c(1, 3)]] is the same as x[[1]][[3]].
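The operators described above behave as follows on a vector, a matrix and a nested list. This is a minimal sketch with made-up objects v, m and x.

```r
# Vector: [ with positions or a logical condition
v <- c(10, 20, 30, 40)
print(v[2:3])
print(v[v > 15])

# Matrix: [row, column]; a blank index selects everything along that dimension
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m[1, 2])                 # a single element
print(m[1, ])                  # the first row as a vector
print(m[1, 2, drop = FALSE])   # keep the result as a 1 x 1 matrix

# List: [ keeps a list, [[ and $ extract the element itself
x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
print(class(x["b"]))     # "list"
print(x[["b"]])
print(x$b)

# Nested elements: an integer vector inside [[ or chained [[ calls
print(x[[c(1, 3)]])
print(x[[1]][[3]])
print(x[[c(2, 1)]])
```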

12. How will you extract multiple elements of list? Explain partial matching.
The [ operator can be used to extract multiple elements from a list. For example, if
you wanted to extract the first and third elements of a list, you would do the
following
> x <- list(foo = 1:4, bar = 0.6, baz = "hello")
> x[c(1, 3)]
$foo
[1] 1 2 3 4
$baz
[1] "hello"
Note that x[c(1, 3)] is NOT the same as x[[c(1, 3)]].

• Partial matching of names is allowed with [[ and $. The $ operator matches
partially by default, while [[ does so only when called with exact = FALSE.
This is often very useful during interactive work if the object you're working
with has very long element names. You can just abbreviate those names and R
will figure out what element you're referring to.
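Partial matching can be tried on a list with one long element name. The list below is a made-up example.

```r
x <- list(aardvark = 1:5)

print(x$a)                       # $ matches the abbreviated name
print(x[["a"]])                  # [[ expects an exact match by default, so this is NULL
print(x[["a", exact = FALSE]])   # partial matching switched on for [[
```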
13. How will you remove NA values?
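The notes give no answer for this question. One common approach, sketched below with made-up data, is to build a logical vector with is.na() and subset with its negation; for data frames, complete.cases() or na.omit() keep only rows without missing values.

```r
x <- c(1, 2, NA, 4, NA, 5)
bad <- is.na(x)               # TRUE where a value is missing
clean <- x[!bad]              # keep only the non-missing values
print(clean)
print(mean(x, na.rm = TRUE))  # many functions can skip NAs directly

df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
complete_rows <- df[complete.cases(df), ]   # rows with no NA in any column
print(complete_rows)
print(na.omit(df))                          # same rows, via na.omit()
```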

14. What is the dplyr package in R? Explain select(), filter(), rename(), arrange(),
mutate(), transmute(), group_by() and summarize() of the dplyr package.

The dplyr package, developed by Hadley Wickham of RStudio, provides a grammar of
data manipulation: a consistent set of verbs that help you solve the most common
data manipulation challenges. It has a highly optimized set of functions for
operations such as filtering, reordering and collapsing data frames, and its shared
vocabulary helps to communicate what you are doing to a data frame.
• select(): picks variables (columns) based on their names.
• filter(): picks cases (rows) based on their values.
• rename(): renames variables in a data frame.
• arrange(): reorders the rows of a data frame.
• mutate(): adds new variables that are functions of existing variables.
• transmute(): similar to mutate(), but keeps only the newly created columns. It is
useful when only the result of the transformations is required.
• summarize() / summarise(): reduces multiple values down to a single
summary.
• group_by(): combines naturally with all of the above and allows you to perform
any operation "by group".
• %>%: the "pipe" operator is used to connect multiple verb actions together
into a pipeline.

dplyr is designed to abstract over how the data is stored. That means as well as
working with local data frames, you can also work with remote database tables,
using exactly the same R code. Install the dbplyr package then read
vignette("databases", package = "dbplyr").
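The verbs that are not exercised elsewhere in these notes (rename(), arrange() and transmute()) can be tried on the built-in mtcars data set. This is a sketch that assumes the dplyr package is installed; the result names cars and ratios are illustrative.

```r
library(dplyr)

cars <- mtcars %>%
  select(mpg, cyl, hp) %>%       # keep three columns
  rename(horsepower = hp) %>%    # new_name = old_name
  filter(cyl == 4) %>%           # only four-cylinder cars
  arrange(desc(mpg))             # highest mpg first

# transmute() keeps only the newly computed column
ratios <- mtcars %>%
  transmute(hp_per_cyl = hp / cyl)

print(head(cars))
print(head(ratios))
```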

15. Why is %>% used in dplyr? Explain with example code.

• The %>% "pipe" operator is used to connect multiple verb actions together
into a pipeline.
• It creates readable, sequential data manipulation pipelines: the result of one
operation is passed directly into the next operation, making code cleaner.
• The pipe operator works by taking the output of the left-hand side and
feeding it as the first argument to the function on the right-hand side. This
makes the code flow like a series of steps, which is often easier to read and
understand.
# Load the dplyr package
library(dplyr)

# Sample data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 25, 35),
  Score = c(90, 85, 88, 92)
)

# Data manipulation pipeline with %>%
result <- data %>%
  select(Name, Age, Score) %>%               # Step 1: Select specific columns
  filter(Score > 85) %>%                     # Step 2: Filter rows where Score > 85
  mutate(Adjusted_Score = Score + 5) %>%     # Step 3: Create a new column "Adjusted_Score"
  group_by(Age) %>%                          # Step 4: Group by "Age"
  summarize(Average_Adjusted = mean(Adjusted_Score))  # Step 5: Calculate average adjusted score

# Print the result
print(result)
16. What are the control structures used in R? Explain with examples.

Control structures in R allow you to control the flow of execution of a series of R
expressions. Basically, control structures allow you to put some "logic" into your R
code.
Commonly used control structures are:
• if and else: testing a condition and acting on it
• for: execute a loop a fixed number of times
• while: execute a loop while a condition is true
• repeat: execute an infinite loop (must break out of it to stop)
count <- 1

repeat {
print(count)
count <- count + 1

if (count > 5) {
break
}
}
• break: break the execution of a loop
for (i in 1:10) {
if (i == 5) {
break # Stop the loop when i is equal to 5
}
print(i)
}
• next: skip an iteration of a loop
for (i in 1:5) {
if (i == 3) {
next # Skip the iteration when i is equal to 3
}
print(i)
}
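The if and else structure listed above has no example in these notes. A minimal sketch follows; the function name classify and its thresholds are made up for illustration.

```r
# if / else if / else: test conditions in order and act on the first one that holds
classify <- function(x) {
  if (x > 3) {
    "big"
  } else if (x == 3) {
    "exactly three"
  } else {
    "small"
  }
}

print(classify(5))
print(classify(3))
print(classify(1))
```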

17. What are the different looping constructs in R? Explain with example code.
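R has three looping constructs: for, while and repeat. The repeat loop is shown under the control structures above; the sketch below covers for and while, with variable names chosen only for illustration.

```r
x <- c("a", "b", "c")

# for: loop over positions ...
for (i in seq_along(x)) {
  print(x[i])
}

# ... or directly over the elements
for (letter in x) {
  print(letter)
}

# while: keep adding 1, 2, 3, ... until the running total reaches 10
total <- 0
n <- 0
while (total < 10) {
  n <- n + 1
  total <- total + n
}
print(n)
print(total)
```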

1. Explain how to create bar graphs, scatter plot, line graph, histogram and curve in R
with and without ggplot.

Using Base R
# Plotting a curve
curve(sin, from = -pi, to = pi, main = "Sine Curve", xlab = "X", ylab = "sin(X)", col = "green")

Using ggplot2
library(ggplot2)

# Data for curve
x <- seq(-pi, pi, length.out = 100)
data <- data.frame(x = x, y = sin(x))

# Sine curve
ggplot(data, aes(x = x, y = y)) +
  geom_line(color = "green") +
  ggtitle("Sine Curve")
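The other graph types named in the question can be drawn in base R with barplot(), plot() and hist(). The sketch below uses the built-in BOD and mtcars data sets; the titles and the variable names bar_mids and h are illustrative.

```r
# Bar graph: one bar per row of BOD
bar_mids <- barplot(BOD$demand, names.arg = BOD$Time,
                    main = "Bar Graph", xlab = "Time", ylab = "Demand")

# Scatter plot: plot() with two numeric vectors
plot(mtcars$wt, mtcars$mpg,
     main = "Scatter Plot", xlab = "Weight", ylab = "Miles per Gallon")

# Line graph: the same function with type = "l"
plot(BOD$Time, BOD$demand, type = "l",
     main = "Line Graph", xlab = "Time", ylab = "Demand")

# Histogram: hist() bins a single numeric variable
h <- hist(mtcars$mpg, breaks = 10,
          main = "Histogram", xlab = "Miles per Gallon")
```

The ggplot2 counterparts are geom_col() or geom_bar(), geom_point(), geom_line() and geom_histogram().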

2. What is a bar graph? How will you make a basic bar graph in R? Explain.

Bar graphs are used to display numeric values (on the y-axis) for different categories
(on the x-axis). Sometimes the bar heights represent counts of cases in the data set,
and sometimes they represent values in the data set.

To make a basic bar graph, use barplot() and pass it a vector of values for the height
of each bar and (optionally) a vector of labels for each bar. The names.arg parameter
is a vector of names appearing under each bar in the bar chart.
barplot(BOD$demand, names.arg = BOD$Time)
3. Bar graph: with and without factor, fill colour, outline, grouping, palette, reorder,
dodge, scale_fill_brewer() or scale_fill_manual() (RColorBrewer).

With ggplot2, use geom_col(). Set fill for the colour of the bars and colour for the
outline:
ggplot(BOD, aes(x = Time, y = demand)) + geom_col(fill = "blue", colour = "black")

Time is numeric in BOD, so the x-axis is continuous; wrap it in factor(Time) to treat
it as a categorical variable.

The most basic bar graphs have one categorical variable on the x-axis and one
continuous variable on the y-axis. Sometimes you'll want to use another
categorical variable to divide up the data, in addition to the variable on the x-axis.
For example, with a data set that has Date and Cultivar columns, we'll map Date to
the x position and map Cultivar to the fill color.
You can produce a grouped bar plot by mapping that variable to fill,
which represents the fill color of the bars. You must also use position =
"dodge", which tells the bars to "dodge" each other horizontally; if you don't,
you'll end up with a stacked bar plot.

In the calls below, data stands for a data frame with columns time, demand and group.

# Grouped bars with a ColorBrewer palette
ggplot(data, aes(x = factor(time), y = demand, fill = group)) +
  geom_col(position = "dodge", color = "black") +
  scale_fill_brewer(palette = "Set2")

# Bars reordered by decreasing demand
ggplot(BOD, aes(x = reorder(factor(Time), -demand), y = demand)) +
  geom_col(fill = "steelblue", color = "black")

# Grouped bars with manually chosen colours (fill in one colour per group in values)
ggplot(data, aes(x = factor(time), y = demand, fill = group)) +
  geom_col(position = "dodge", color = "black") +
  scale_fill_manual(values = c("", ""))
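A runnable grouped bar graph needs a data set with two categorical variables. The sketch below is not from the notes: it assumes ggplot2 is installed and uses the built-in warpbreaks data set, averaged per wool type and tension level, in place of the data placeholder.

```r
library(ggplot2)

# Mean number of breaks for each wool/tension combination (6 rows)
avg_breaks <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)

# tension on the x-axis, wool mapped to fill; position = "dodge" puts bars side by side
p <- ggplot(avg_breaks, aes(x = tension, y = breaks, fill = wool)) +
  geom_col(position = "dodge", colour = "black") +
  scale_fill_brewer(palette = "Set2")

print(p)
```

Removing position = "dodge" from geom_col() gives the stacked version of the same plot.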
4. Explain how to create a bar graph of counts and how colors will be used in a bar
graph.

For a bar graph of counts, ggplot2 will automatically count the number of
observations in each category. Set only the x aesthetic and use geom_bar(), which
uses stat = "count" by default to compute the counts.
• Ex: ggplot(diamonds, aes(x = cut)) + geom_bar()

Colors are applied by mapping a variable to fill:
ggplot(mpg, aes(x = class, fill = drv)) + geom_bar()

The default colors are not always the most appealing, so you may want to set them
using scale_fill_brewer() or scale_fill_manual().
5. What is a line graph? How will you add points to a line graph? Explain.

• Line graphs are typically used for visualizing how one continuous variable, on
the y-axis, changes in relation to another continuous variable, on the x-axis.
• Often the x variable represents time, but it may also represent some other
continuous quantity.
ggplot(BOD, aes(x = Time, y = demand)) +
  geom_line()

To add points, add geom_point(); its colour and size arguments control the
appearance of the points:
ggplot(BOD, aes(x = Time, y = demand)) +
  geom_line() +
  geom_point()

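6. Explain how you will make a line graph with multiple lines.

In addition to the variables mapped to the x- and y-axes, map another (discrete)
variable to colour or linetype. BOD has no such grouping variable, so the calls
below use the mean tooth length per supplement and dose from the built-in
ToothGrowth data set:

tg <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)

ggplot(tg, aes(x = dose, y = len, colour = supp)) +
  geom_line()

ggplot(tg, aes(x = dose, y = len, linetype = supp)) +
  geom_line()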

7. How will you change the appearance of points in a line graph, and how will you
make a graph with a shaded area? Explain.

In geom_point(), set the size, shape, colour, and/or fill outside of aes():
ggplot(BOD, aes(x = Time, y = demand)) +
  geom_line() +
  geom_point(size = 4, shape = 22, colour = "darkred", fill = "pink")

Use geom_area() to get a shaded area; colour, fill and alpha set the outline, the
fill colour and the transparency:
ggplot(BOD, aes(x = Time, y = demand)) +
  geom_area(colour = "black", fill = "blue", alpha = .2)

8. What is a scatter plot? How will you make a basic scatter plot in R? Explain.

A scatter plot is a type of data visualization that displays values for two variables for
a set of data. Each point on the plot corresponds to one observation in the dataset.
Scatter plots are particularly useful for identifying relationships, correlations, and
distributions between the two variables.
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

9. How will you group points together using shapes or colors in a scatter plot? Explain.

Color is a common way to differentiate groups in scatter plots. By mapping a
categorical variable to the color aesthetic, you can assign different colors to each
group.
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +  # Points with a specified size
  labs(title = "Scatter Plot of MPG vs Horsepower Colored by Cylinders",
       x = "Horsepower", y = "Miles per Gallon", color = "Cylinders")

You can also differentiate points using shapes. By mapping a categorical variable to
the shape aesthetic, you can use different symbols to represent different groups.
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl), shape = factor(am))) +
  geom_point(size = 3) +
  labs(title = "Scatter Plot of MPG vs Horsepower Colored by Cylinders and Shaped by Transmission",
       x = "Horsepower", y = "Miles per Gallon",
       color = "Cylinders", shape = "Transmission")
10. Explain how to map a continuous variable to color or size.

In ggplot2, mapping a continuous variable to color or size in a scatter plot allows you
to add an extra dimension of information to the visualization. This is particularly
useful when you want to see how a third variable varies with two primary variables
(on the x and y axes) by changing the color gradient or point size.
ggplot(mtcars, aes(x = hp, y = mpg, color = wt)) + geom_point(size = 3)

Mapping a continuous variable to size scales each point's size according to the
variable's value. Larger values will result in larger points, while smaller values will
have smaller points.
ggplot(mtcars, aes(x = hp, y = mpg, size = wt)) + geom_point(color = "blue")
11. How to deal with overplotting in a scatter plot? Explain different methods.

Adjusting Point Transparency (Alpha)

Making the points semi-transparent lets dense regions appear darker than sparse ones.

library(ggplot2)

# Scatter plot with transparency to reduce overplotting


ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point(alpha = 0.5) + # Set transparency
labs(title = "Scatter Plot with Transparency",
x = "Horsepower",
y = "Miles per Gallon") +
theme_minimal()

Using Smaller Point Sizes


Another straightforward approach is to reduce the size of each point. This works well
when points are closely packed but don’t completely overlap.
# Scatter plot with smaller points
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point(size = 1.5) + # Set smaller point size
labs(title = "Scatter Plot with Smaller Point Size",
x = "Horsepower",
y = "Miles per Gallon") +
theme_minimal()

Jittering the Points


The geom_jitter() function adds random noise (jitter) to the points, slightly displacing
them to reduce overlap. This is particularly useful when dealing with discrete data or
datasets where multiple observations have the same or similar values.
# Scatter plot with jitter
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_jitter(width = 0.5, height = 0.5) + # Add jitter
labs(title = "Scatter Plot with Jitter",
x = "Horsepower",
y = "Miles per Gallon") +
theme_minimal()

Using 2D Binning (Hexbin or Rectangular Binning)


Binning groups the data into hexagonal or rectangular bins and assigns a color or
shade to each bin based on the number of points in it. This technique is helpful for
large datasets where high density occurs in certain areas.
• Hexagonal Binning (using geom_hex() from ggplot2, which needs the hexbin package
to be installed):
# Hexagonal binning
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_hex()

• Rectangular Binning (using geom_bin2d()):
# Rectangular binning
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_bin2d(bins = 30)
12. Create a data frame from lists and create different graphs.

# Creating the vectors that become the columns of the data frame
student_names <- c("Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace",
                   "Hannah")
math_scores <- c(85, 90, 78, 92, 88, 76, 95, 89)
science_scores <- c(78, 85, 88, 90, 91, 79, 85, 84)
english_scores <- c(82, 88, 80, 86, 87, 83, 90, 88)

# Creating the data frame
student_df <- data.frame(
  Name = student_names,
  Math = math_scores,
  Science = science_scores,
  English = english_scores
)

# Display the data frame
print(student_df)

library(ggplot2)

# Bar graph of Math scores


ggplot(student_df, aes(x = Name, y = Math)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
labs(title = "Math Scores of Students", x = "Students", y = "Math Score") +
theme_minimal()

# Line graph of Science scores


ggplot(student_df, aes(x = Name, y = Science, group = 1)) +
geom_line(color = "blue") +
geom_point(size = 3, color = "red") +
labs(title = "Science Scores of Students", x = "Students", y = "Science Score") +
theme_minimal()

# Scatter plot of Math vs Science scores


ggplot(student_df, aes(x = Math, y = Science)) +
geom_point(size = 3, color = "purple") +
labs(title = "Math vs Science Scores", x = "Math Score", y = "Science Score") +
theme_minimal()

# Histogram of English scores


ggplot(student_df, aes(x = English)) +
geom_histogram(binwidth = 2, fill = "lightgreen", color = "black") +
labs(title = "Distribution of English Scores", x = "English Score", y = "Frequency") +
theme_minimal()
13. Construct a multiple line graph that chooses colors using a palette from
ColorBrewer and using manual colors, and also show on the line graph how far the
points and lines should move when dodged.

The width argument of position_dodge() controls how far the points and lines move.

# Sample data frame


library(ggplot2)

df <- data.frame(
Year = rep(2010:2015, each = 3),
Value = c(40, 60, 80, 50, 65, 85, 55, 70, 90, 58, 75, 95, 60, 80, 100, 62, 85, 105),
Category = rep(c("A", "B", "C"), times = 6)
)

# Line graph with ColorBrewer palette


ggplot(df, aes(x = Year, y = Value, color = Category, group = Category)) +
geom_line(position = position_dodge(width = 0.2)) + # Dodge lines
geom_point(position = position_dodge(width = 0.2)) + # Dodge points
scale_color_brewer(palette = "Set1") + # Use ColorBrewer palette
labs(title = "Multiple Line Graph with ColorBrewer Palette") +
theme_minimal()

# Line graph with manual colors


ggplot(df, aes(x = Year, y = Value, color = Category, group = Category)) +
geom_line(position = position_dodge(width = 0.2)) + # Dodge lines
geom_point(position = position_dodge(width = 0.2)) + # Dodge points
scale_color_manual(values = c("A" = "blue", "B" = "green", "C" = "purple")) + # Manual colors
labs(title = "Multiple Line Graph with Manual Colors") +
theme_minimal()
14. Explain the reorder() function to sort bars by a variable.

The reorder() function takes a factor and reorders its levels according to the values
of another variable:
reorder(x, X, FUN = mean, ...)
Here x is the factor, X is the variable whose values determine the order, and FUN
summarises X within each level of x.

# Bar plot with reordered bars (scores is a data frame with Subject and Scores columns)
ggplot(scores, aes(x = reorder(Subject, Scores), y = Scores)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(title = "Scores by Subject (Reordered by Scores)",
       x = "Subject", y = "Scores") +
  theme_minimal()
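The effect of reorder() is visible in the factor levels, which ggplot2 uses to order the bars. The scores data frame is not defined in these notes, so the sketch below uses a hypothetical one with made-up values.

```r
# Hypothetical data: the notes do not define `scores`
scores <- data.frame(
  Subject = factor(c("Math", "Science", "English")),
  Scores  = c(88, 92, 79)
)

# Levels ordered by increasing score
ascending <- reorder(scores$Subject, scores$Scores)
print(levels(ascending))

# Negating the ordering variable gives decreasing order
descending <- reorder(scores$Subject, -scores$Scores)
print(levels(descending))
```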
15. Explain the ggplot2 package and its different components, such as aes().

A ggplot2 plot is built by adding layers to a call to ggplot():
ggplot(data = <data_frame>, aes(x = <x_variable>, y = <y_variable>, ...)) +
  <geoms> +
  <scales> +
  <labs> +
  <theme>

1. aes() Function
• The aes() function (short for aesthetics) is used to define how data
variables are mapped to visual properties.

Themes
Themes allow you to customize the non-data elements of the plot, such as
background, grid lines, and text.
Example:
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal()

Labels
The labs() function sets the title and the axis labels.
Example:
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot of Weight vs MPG",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon")
2. Geometric Objects (geom_*)

Geometric objects (geoms) represent the actual data points or lines in the plot.
There are many geoms available in ggplot2, including:
• geom_point(): For scatter plots.
• geom_line(): For line graphs.
• geom_bar(): For bar charts.
• geom_histogram(): For histograms.
• geom_boxplot(): For boxplots.
Example:
# Bar graph
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "lightblue")
