18CSO106T Data Analysis Using Open Source Tool: Question Bank
a. John Tukey
b. William S.
c. Hans Peter Luhn
d. None of the above
Answer: a
2 __________ refers to data which can be ranked, has consistent units and has a true zero e.g.
age. Some statistics software packages may refer to cardinal and ratio data as ‘scale’.
a. Nominal data
b. Ordinal data
c. Cardinal/Interval data
d. Ratio data
Answer: d
a. Factors
b. Matrices
c. Vectors
d. Data Frames
Answer: d
c. if(<condition>) { ## do something}
else if {## do something else}
Answer: a
Answer: c
a. 10
b. 01
c. 00120
d. 0000000000
Answer: d
7 Two vectors having the same initial point are called __________
a. Collinear vectors
b. Unit vectors
c. Equal vectors
d. Coinitial vectors
Answer: d
8 The plot method on series and dataframe is just a simple wrapper around _________
a. gplt.plot()
b. plt.plotgraph()
c. plt.plot()
d. scatter.plot()
Answer: c
9 Missing values in this csv file have been represented by an exclamation mark (“!”) and a
question mark (“?”). Which of the codes below will read the above csv file correctly into R?
a. csv(‘Dataframe.csv’)
b. csv(‘Dataframe.csv’,header=FALSE, sep=’,’,na.strings=c(‘?’))
c. csv2(‘Dataframe.csv’,header=FALSE,sep=’,’,na.strings=c(‘?’,’!’))
d. dataframe(‘Dataframe.csv’)
Answer: c
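The idea behind option c, listing every missing-value marker, can be sketched in Python. This is a dependency-free illustration, not the R call itself: the file contents in RAW_CSV are hypothetical (the original Dataframe.csv is not shown), and read_rows is an illustrative helper name.

```python
import csv
import io

# Hypothetical file contents; the real Dataframe.csv is not shown in the question.
RAW_CSV = "1,?,3\n4,5,!\n7,8,9\n"

# Both markers must be listed, mirroring na.strings=c('?','!') in the R option.
NA_STRINGS = {"?", "!"}


def read_rows(text, na_strings):
    """Parse header-less CSV text, replacing missing-value markers with None."""
    reader = csv.reader(io.StringIO(text))
    return [[None if cell in na_strings else cell for cell in row] for row in reader]


rows = read_rows(RAW_CSV, NA_STRINGS)
print(rows)
```

If only '?' were listed, as in option b, the '!' cells would stay as ordinary strings rather than being treated as missing.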
10 Suppose ABC is a matrix with 3 rows and 4 columns. Choose the correct option(s) to rename
the rows:
a. row_names(ABC)= c(“row1”,”row2”,”row3”)
b. rownames(ABC)=c(“row”,”row2”,”row3”)
c. rownames(ABC)=c(“row1”,”row2”)
d. row(ABC)=c(“row1”,”row2”)
Answer: b
a. inspecting data
b. cleaning data
c. transforming data
d. All of Above
Answer: d
12 Which of the following is not a major data analysis approach?
a. Data Mining
b. Predictive Intelligence
c. Business Intelligence
d. Text Analytics
Answer: b
a. median
b. 90th percentile
c. interquartile range
d. mean
Answer: c
Answer: b
> x <- 3
> switch (6, 2+2, mean(1:10), rnorm(5))
a. 10
b. 1
c. NULL
d. 5
Answer: c
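Why the switch() call above yields NULL can be illustrated with a small Python analogue. This is a sketch under assumptions: switch_by_index is a hypothetical helper, None stands in for R's NULL, and the lambdas mimic R evaluating only the selected alternative.

```python
import random


def switch_by_index(index, *alternatives):
    """Evaluate the alternative at the 1-based index, or return None if out of range."""
    if 1 <= index <= len(alternatives):
        return alternatives[index - 1]()  # only the selected alternative is evaluated
    return None  # plays the role of R's NULL


alternatives = (
    lambda: 2 + 2,                                    # like 2+2
    lambda: sum(range(1, 11)) / 10,                   # like mean(1:10)
    lambda: [random.gauss(0, 1) for _ in range(5)],   # like rnorm(5)
)

out_of_range = switch_by_index(6, *alternatives)  # only 3 alternatives exist
in_range = switch_by_index(2, *alternatives)
print(out_of_range, in_range)
```

Index 6 exceeds the three alternatives supplied, so nothing is selected.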
Unit 1 - Long Answer Type Question (10 marks)
6 Briefly describe the various types of data structures and explain how data can be accessed
within them. Give suitable syntax / code snippets. (10 marks)
7 a. Create a small data frame representing a database of films. It should contain the
fields title, director, year, country, and at least three films.
b. Create a second data frame of the same format as above, but containing just one
new film.
c. Merge the two data frames using rbind().
d. Try sorting the titles using sort(): what happens? (10 marks)
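The steps of question 7 can be sketched in plain Python as an analogue of the R workflow. This is an illustration only: a list of dictionaries stands in for a data frame, list concatenation for rbind(), and the film entries are example data chosen here, not part of the question.

```python
# a. A small "data frame" of films, represented as a list of records.
films = [
    {"title": "Metropolis", "director": "Fritz Lang", "year": 1927, "country": "Germany"},
    {"title": "Seven Samurai", "director": "Akira Kurosawa", "year": 1954, "country": "Japan"},
    {"title": "Casablanca", "director": "Michael Curtiz", "year": 1942, "country": "USA"},
]

# b. A second table with the same fields, containing one new film.
new_film = [
    {"title": "Parasite", "director": "Bong Joon-ho", "year": 2019, "country": "South Korea"},
]

# c. Row-wise merge (the role rbind() plays in R).
all_films = films + new_film

# d. Sorting only the title column returns sorted titles, detached from their rows.
sorted_titles = sorted(film["title"] for film in all_films)

# Reordering whole rows requires sorting the records by the title key instead.
films_by_title = sorted(all_films, key=lambda film: film["title"])

print(sorted_titles)
print([film["year"] for film in films_by_title])
```

The contrast in part d is the point of the exercise: sorting a single column does not reorder the table.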
Answer: d
2 In the linear regression equation Y = β1 + β2X + ϵ, (β1, β2) refers
to __________
a. (X-intercept, Slope)
b. (Slope, X-Intercept)
c. (Y-Intercept, Slope)
d. (slope, Y-Intercept)
Answer: c
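The roles of the Y-intercept and slope can be made concrete with a minimal least-squares sketch. It uses the standard closed-form estimates for a single predictor; the data values and the helper names fit_line and rmse are assumptions made for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (minimises vertical offsets)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(xs, ys, intercept, slope):
    """Root mean squared error of the fitted line on the given points."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(xs))


# Illustrative data, not taken from the question bank.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

intercept, slope = fit_line(xs, ys)
print(intercept, slope, rmse(xs, ys, intercept, slope))
```

On perfectly collinear points the same functions give a training RMSE of zero, which is the situation several later questions rely on.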
a. Only a and b
b. Only a and c
c. Only b and c
d. a, b and c
Answer: a
4 Which of the following metrics can be used for evaluating regression models?
i) R squared
ii) Adjusted R Squared
iii) F-statistic
iv) RMSE/ MSE/ MAE
a. ii and iv
b. i and ii
c. i, ii, iii, iv
d. i , iii and iv
Answer: c
5 If the absolute value of your calculated t-statistic exceeds the critical value from the standard
normal distribution you can:
Answer: b
6 Which of the following offsets do we use in case of least square line fit? Suppose the
horizontal axis is an independent variable and the vertical axis is a dependent variable.
a. Vertical offset
b. Perpendicular offset
c. Both but depend on situation
d. Correlation coefficient
Answer: a
7 A survey is taken from a randomly selected sample of 100 students on whether they had ever
played Cricket. 25% (0.25) of the 100 students said they had played Cricket. Which one of
the following statements about the number 0.25 is correct?
a. It is a sample proportion
b. It is a population proportion
c. It is a random number
d. It is an error
Answer: a
a. x[ordersort(x$B),]
b. x[rev(order(x$B)),]
c. x[order(x$B),]
d. x[rev(ordersort(x$B)),]
Answer: b
Answer: b
10 You are given the following piece of code for forward propagation through a single hidden
layer in a neural network. This layer uses the sigmoid activation. Identify and correct the
error.
a. z = np.matmul(W, a_prev) + b OR z = np.dot(W, a_prev) + b
b. z = np.matmul(W, a_prev) + a OR z = np.dot(W, a_prev) + a
c. z = np.matmul(W, a_prev) + b AND z = np.dot(W, a_prev) + b
d. z = np.matmul(W, b_prev) + b OR z = np.dot(W, b_prev) + b
Answer: a
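The corrected forward step, z = W·a_prev + b followed by the sigmoid, can be sketched without NumPy. This is an illustrative, dependency-free version: forward_layer and the small weight values are assumptions, and the list comprehension plays the role of np.matmul(W, a_prev) + b.

```python
import math


def sigmoid(z):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))


def forward_layer(W, a_prev, b):
    """Compute z = W @ a_prev + b and a = sigmoid(z) for one layer.

    W is a list of rows (n_out x n_in), a_prev has n_in entries, b has n_out entries.
    """
    z = [
        sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
        for row, b_i in zip(W, b)
    ]
    a = [sigmoid(z_i) for z_i in z]
    return z, a


# Illustrative values chosen so the pre-activations are exactly zero.
W = [[0.5, -0.5], [1.0, 1.0]]
a_prev = [1.0, 1.0]
b = [0.0, -2.0]

z, a = forward_layer(W, a_prev, b)
print(z, a)
```

The weight matrix multiplies the previous activations from the left, and the bias b (not the activation a) is added afterwards.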
a. lm(formula, data)
b. lr(formula, data)
c. lrm(formula, data)
d. regression.linear(formula, data)
Answer: a
12 Which of the following counts the number of good cases when doing pairwise analysis?
a. count.pairwise
b. count() +
c. anova.para()
d. count.poly()
Answer: a
13 __________ refers to a group of techniques for fitting and studying the straight-line
relationship between two variables.
a. Linear regression
b. Logistic regression
c. Gradient Descent
d. Greedy algorithms
Answer: a
Answer: c
a. Scatter plot
b. Barchart
c. Histograms
d. None of these
Answer: a
1 Which of the following methods do we use to best fit the data in Logistic Regression?
Answer: b
a. R-square
b. Root Mean Squared Error
c. Residual Sum of Squares
d. Ordinary least squares
Answer: b
3 The________function produces a matrix that contains all of the pairwise correlations among
the predictors in a data set. The first command below gives an error message because
the_______________variable is qualitative.
Answer: c
Answer: b
6 Which of the following is true about the graphs below (A, B, C from left to right) of the cost
function against the number of iterations?
Suppose l1, l2 and l3 are the three learning rates for A,B,C respectively. Which of the following
is true about l1,l2 and l3?
a. l2 < l1 < l3
b. l1 > l2 > l3
c. l1 = l2 = l3
d. l1 < l2 > l3
Answer: a
7 Suppose you have been given the following scenario for training and validation error for Linear
Regression.
Which of the following scenarios would give you the right hyper parameter?
a. 1
b. 2
c. 3
d. 4
Answer: b
8 What would be the root mean square training error for this data if you run a Linear Regression
model of the form (Y = A0+A1X)?
a. Less than 0
b. Greater than zero
c. Equal to 0
d. None of these
Answer: c
a. lda()
b. partimat( )
c. qda()
d. lr(formula, data)
Answer: a
a. Gcc
b. Reshape
c. Reshape2
d. gcc2
Answer: c
Answer: b
12 Which is the number of nearby neighbours to be used to classify the new record?
a. KNN
b. Validation data
c. Euclidean Distance
d. All the above
Answer: a
13 When performing regression or classification, which of the following is the correct way to
preprocess the data?
a. Normalize the data → PCA → training
b. PCA → normalize PCA output → training
c. Normalize the data → PCA → normalize PCA output → training
d. None of the above
Answer: a
14 In which of the following cases will K-means clustering fail to give good results?
1) Data points with outliers
2) Data points with different densities
3) Data points with nonconvex shapes
a. 1 and 2
b. 2 and 3
c. 1, 2, and 3
d. 1 and 3
Answer: c
15 To apply bagging to regression trees, which of the following is/are true in such a case?
a. 1 and 2
b. 2 and 3
c. 1 and 3
d. 1,2 and 3
Answer: d
1 a) How do you choose the regression model that best fits a given data set?
b) Explain Factors Regression in R. (10 marks)
3 Below are two different logistic models with different values for β0 and β1.
Which of the following statement(s) is true about the β0 and β1 values of the two logistic
models (Green, Black)?
Note: consider Y = β0 + β1*X. Here, β0 is the intercept and β1 is the coefficient. (10 marks)
4 a) Explain LDA, QDA, KNN Model Evaluation. (8 marks)
b) How do you do quadratic discriminant analysis in R? (2 marks)
8 Explain K-Nearest Neighbors in R. In what scenarios can the K-nearest neighbors
algorithm be used? (10 marks)
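The K-nearest-neighbours idea in the last question can be sketched in a few lines. This is a minimal Python illustration rather than the R workflow the question asks for: knn_predict is a hypothetical helper, the points and labels are made-up data, and Euclidean distance with majority voting is assumed.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Illustrative two-class data.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 7.0)]
labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(points, labels, (2.0, 2.0), k=3)
near_b = knn_predict(points, labels, (6.0, 7.0), k=3)
print(near_a, near_b)
```

Because the prediction depends directly on distances, features are usually put on comparable scales before the method is applied.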
Answer: (b)
2 Which of the following is true about the tuning parameter in the Lasso model?
(a) Accounts for the amount of expansion of data values about a central point
(b) Results in a trade-off between bias and variance in resulting estimators
(c) Increases with variance
(d) Does Not increase with bias
Answer: (b)
3 Suppose we fit “Lasso Regression” to a data set which has 100 features (X1, X2, …, X100).
Now, we rescale one of these features by multiplying it by 10 (say that feature is X1), and then
refit Lasso regression with the same regularization parameter. Which of the following
options will be correct?
Answer: a
4 Which of the following steps / assumptions in regression modeling has the greatest impact on
the trade-off between under-fitting and overfitting?
Answer: a
5 Let’s say a “Linear regression” model perfectly fits the training data (train error is zero).
Which of the following statements is true?
Answer: c
6 _______ is analyzing the root cause of the difference in performance between the current and
the perfect models.
a. Ablative analysis
b. Error analysis
c. ANOVA
d. Tradeoff analysis
Answer: b
7 The two main types of stepwise procedures in regression are:
a. Backward elimination and Forward selection
b. Prediction and non-prediction
c. step() and regsubsets()
d. Ridge estimator and bayesian estimator
Answer: a
a. Bivariate Data
b. Univariate Data
c. Uniform Distribution
d. Normal Distribution
Answer: c
9 Statement 1: The cost function is altered by adding a penalty equivalent to the square of the
magnitude of the coefficients
Statement 2: Ridge and Lasso regression are some of the simple techniques to reduce model
complexity and prevent overfitting which may result from simple linear regression.
a. Statement 1 is true and statement 2 is false
b. Statement 1 is False and statement 2 is true
c. Both Statement (1 & 2) is wrong
d. Both Statement (1 & 2) is true
Answer: d
a. Ggplot.
b. Glmnet
c. Caret
d. Dplyr
Answer: b
11 Which of the following can be used to create the most common graph types?
a. Qplot
b. Quickplot
c. Plot
d. All of the mentioned
Answer: a
a. True
b. False
Answer: a
13 In Ridge regression, a hyperparameter called “_____________” controls the
weighting of the penalty in the loss function.
a) Alpha.
b) Gamma.
c) Lambda.
d) None of the above
Answer: c
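How the penalty weight shrinks a coefficient can be shown with a one-feature sketch. It assumes a model y ≈ w·x with no intercept, for which minimising Σ(y − w·x)² + λ·w² gives w = Σxy / (Σx² + λ); ridge_slope and the data values are illustrative choices.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for y = w * x (no intercept): sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Illustrative data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in (0.0, 1.0, 14.0, 100.0):
    print(lam, ridge_slope(xs, ys, lam))
```

With a zero penalty the estimate equals the ordinary least-squares slope, and it moves towards zero as the penalty weight grows.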
Answer: a
15 If different models trained on different training datasets drawn from the same population all
have high accuracy, or make similarly accurate predictions, then the models are said to have _______
a. High bias
b. Low bias
Answer: b
1 a) Which of the following statement(s) is / are true for Gradient Descent (GD) and
Stochastic Gradient Descent (SGD)? Justify your choice. (6 marks)
1. In GD and SGD, you update a set of parameters in an iterative manner to
minimize the error function.
2. In SGD, you have to run through all the samples in your training set for a
single update of a parameter in each iteration.
3. In GD, you either use the entire data or a subset of training data to update a
parameter in each iteration.
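The difference between the two update schemes in the statements above can be sketched as follows. This is a minimal illustration under assumptions: a one-parameter model y ≈ w·x with squared error, made-up data, and hypothetical function names batch_gd and sgd.

```python
import random


def batch_gd(xs, ys, lr=0.05, epochs=50):
    """Gradient descent: every update uses the gradient over the whole training set."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
    return w


def sgd(xs, ys, lr=0.05, epochs=50, seed=0):
    """Stochastic gradient descent: every update uses a single training sample."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a random order each epoch
        for x, y in data:
            grad = 2.0 * (w * x - y) * x
            w -= lr * grad
    return w


# Illustrative data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_gd = batch_gd(xs, ys)
w_sgd = sgd(xs, ys)
print(w_gd, w_sgd)
```

Both routines update the parameter iteratively to reduce the error; they differ only in how many samples feed each update.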
b) What are the layout components of Bootstrap? (4 marks)
2 a) Which of the following is/are important step(s) to pre-process the text
in NLP-based projects? Justify. (6 marks)
1. Stemming
2. Stop word removal
3. Object Standardization
Answer: (b)
2 Decision trees can be used if the input and output variables are:
(a) Categorical
(b) Continuous
(c) Both (a) and (b)
(d) None of the above
Answer: (c)
(a) PCA
(b) Bagging
(c) Boosting
(d) Random Forest
Answer: (d)
4 Which of the following need not be tuned using cross-validation to avoid overfitting in a
random forest algorithm?
(a) Minimum size of terminal nodes
(b) Maximum size of terminal nodes
(c) Maximum number of terminal nodes
(d) None of the above
Answer: (b)
5 Observe the code snippet given below and answer: What should be filled in place of method
to fit a linear regression with backward selection?
step.model <- train(Inputdatafile~.,
method = "______________",
tuneGrid = data.frame(nvmax = 1:8),
trControl = train.control
)
(a) leapforward
(b) leapBackward
(c) leapbackward
(d) leapSeq
Answer: (b)
6 What is the importance of using PCA before clustering? Choose the most complete answer.
a. Find the explained variance
b. Avoid bad features
c. Find good features to improve your clustering score
d. Find which dimension of data maximize the features variance
Answer: d
7 Consider the following figure for answering the next few questions. In the figure, X1 and X2
are the two features and the data point is represented by dots (-1 is negative class and +1 is a
positive class). And you first split the data based on feature X1(say splitting point is x11)
which is shown in the figure using vertical line. Every value less than x11 will be predicted as
positive class and every value greater than x11 will be predicted as negative class.
a. 1
b. 2
c. 3
d. 4
Answer: a
8 The most popularly used dimensionality reduction algorithm is Principal Component Analysis
(PCA). Which of the following is/are true about PCA?
a. 1 and 2
b. 1 and 3
c. 2, 3 and 4
d. 1, 2, 3 and 4
Answer: d
9 What will happen when eigenvalues are roughly equal?
Answer: b
10 Suppose you want to apply the AdaBoost algorithm on Data D which has T observations. You
set half the data for training and half for testing initially. Now you want to increase the
number of data points for training T1, T2 … Tn where T1 < T2 … Tn-1 < Tn.
Which of the following is true about training and testing error in such a case?
a. The difference between training error and test error increases as number of
observations increases
b. The difference between training error and test error decreases as number of
observations increases
c. The difference between training error and test error will not change
d. None of These
Answer: b
11 What is the biggest weakness of decision trees compared to logistic regression classifiers?
Answer: a
12 Which of the following is a widely used and effective machine learning algorithm based on
the idea of bagging?
a. Decision Tree
b. Regression
c. Classification
d. Random Forest
Answer: d
13 The most widely used metrics and tools to assess a classification model are:
a. Confusion matrix
b. Cost-sensitive accuracy
c. Area under the ROC curve
d. All of the above
Answer: d
a. Factor analysis
b. Decision trees are robust to outliers
c. Decision trees are prone to be overfit
d. None of the above
Answer: c
15 Which of the following is the most appropriate strategy for data cleaning before performing
clustering analysis, given less than desirable number of data points:
1. Capping and flooring of variables
2. Removal of outliers
a. 1
b. 2
c. 1 and 2
d. none of the mentioned
Answer: a
4 a) List the attribute selection measures used by the ID3 algorithm to
construct a Decision Tree. (6 marks)
b) List the problem domains in which Decision Trees are most suitable. (4 marks)
6 a) If it takes one hour to train a Decision Tree on a training set containing 1 million
instances, roughly how much time will it take to train another Decision Tree on a
training set containing 10 million instances? (6 marks)
b) What are the disadvantages of Classification and Regression Trees (CART)? (4 marks)
***