

Question Bank

18CSO106T Data Analysis using Open Source Tool

Unit 1 - MCQ (1 mark)

1 The term Data Analysis was defined by the statistician ____________

a. John Tukey
b. William S.
c. Hans Peter Luhn
d. None of the above

Answer: a

2 __________ refers to data which can be ranked, has consistent units and has a true zero e.g.
age. Some statistics software packages may refer to cardinal and ratio data as ‘scale’.

a. Nominal data
b. Ordinal data
c. Cardinal/Interval data
d. Ratio data

Answer: c

3 __________ are a more generalized form of a matrix.

a. Factors
b. Matrices
c. Vectors
d. Data Frames

Answer: d

4 Which of the following is valid syntax for if else statement in R?

a. if(<condition>) {## do something} else { ## do something else}
b. if(<condition>) { ## do something} elseif {## do something else}
c. if(<condition>) { ## do something} else if {## do something else}
d. if(<condition>) { ##& do something} else {##@ do something else}

Answer: a
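
For reference, a minimal runnable sketch of option (a)'s structure in R; the condition and messages are illustrative only:

x <- 5
if (x > 3) {
  ## do something
  print("x is greater than 3")
} else {
  ## do something else
  print("x is 3 or less")
}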

5 Point out the correct statement:

a. The value NaN represents undefined value
b. NaN can also be thought of as a missing value
c. Number Inf represents infinity in R
d. “raw” objects are commonly used directly in data analysis

Answer: c

6 What will be the output of the following R code?


> x <- vector("numeric", length = 10)
> x

a. 10
b. 01
c. 00120
d. 0000000000

Answer: d

7 Two vectors having the same initial points are called as__________

a. Collinear vectors
b. Unit vectors
c. Equal vectors
d. Coinitial vectors

Answer: d
8 The plot method on series and dataframe is just a simple wrapper around _________

a. gplt.plot()
b. plt.plotgraph()
c. plt.plot()
d. scatter.plot()

Answer: c

9 Missing values in this csv file have been represented by an exclamation mark (“!”) and a
question mark (“?”). Which of the codes below will read the above csv file correctly into R?

a. csv(‘Dataframe.csv’)
b. csv(‘Dataframe.csv’,header=FALSE, sep=’,’,na.strings=c(‘?’))
c. csv2(‘Dataframe.csv’,header=FALSE,sep=’,’,na.strings=c(‘?’,’!’))
d. dataframe(‘Dataframe.csv’)

Answer: c
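
For reference, a hedged sketch of how base R's read.csv() handles multiple missing-value markers via na.strings (the file name here is hypothetical):

df <- read.csv("Dataframe.csv", header = FALSE, sep = ",",
               na.strings = c("?", "!"))  # both symbols are read in as NA
head(df)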

10 Suppose ABC is a matrix with 3 rows and 4 columns. Choose the correct option to rename
its rows:

a. row_names(ABC)= c(“row1”,”row2”,”row3”)
b. rownames(ABC)=c(“row”,”row2”,”row3”)
c. rownames(ABC)=c(“row1”,”row2”)
d. row(ABC)=c(“row1”,”row2”)

Answer: b
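
A minimal sketch of renaming both dimensions of a 3x4 matrix (the values are arbitrary):

ABC <- matrix(1:12, nrow = 3, ncol = 4)
rownames(ABC) <- c("row1", "row2", "row3")          # one name per row
colnames(ABC) <- c("col1", "col2", "col3", "col4")  # one name per column
ABC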

11 Data Analysis is a process of

a. inspecting data
b. cleaning data
c. transforming data
d. All of the above

Answer: d
12 Which of the following is not a major data analysis approach?

a. Data Mining
b. Predictive Intelligence
c. Business Intelligence
d. Text Analytics

Answer: b

13 Identify which of the following is a measure of dispersion

a. median
b. 90th percentile
c. interquartile range
d. mean

Answer: c

14 Point out the wrong statement:

a. Setting up a workstation to take full advantage of the customizable features of R is a
straightforward thing
b. q() is used to quit the R program
c. R has an inbuilt help facility similar to the man facility of UNIX
d. Windows versions of R have other optional help systems also

Answer: b

15 What will be the output of the following R code?

> x <- 3
> switch (6, 2+2, mean(1:10), rnorm(5))

a. 10
b. 1
c. NULL
d. 5

Answer: c
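
For reference, switch() returns NULL when the integer selector exceeds the number of alternatives; a small sketch contrasting the two cases:

switch(2, 2 + 2, mean(1:10), rnorm(5))        # second alternative: mean(1:10) = 5.5
y <- switch(6, 2 + 2, mean(1:10), rnorm(5))   # only 3 alternatives, so NULL
is.null(y)                                    # TRUE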
Unit 1 - Long Answer Type Question (10 marks)

1 a) How do you change a data frame’s row and column names? 3
b) Explain the types of Data & Measurement Scales: Nominal, Ordinal, Interval
and Ratio. 7

2 a) Differentiate between an Array and a Matrix in R Programming. 4
b) Create three vectors x, y, z with integers, where each vector has 3 elements.
Combine the three vectors to become a 3×3 matrix A where each column
represents a vector. Change the row names to a, b, c. 6
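
A minimal sketch of part (b), with arbitrary integer values:

x <- c(1L, 2L, 3L)
y <- c(4L, 5L, 6L)
z <- c(7L, 8L, 9L)
A <- cbind(x, y, z)              # each column of A is one vector
rownames(A) <- c("a", "b", "c")  # rename the rows
A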

3 a) What is the advantage of using the R programming language for statistical
computing and graphics? 3
b) Justify 3 real world instances / applications where using R Programming would
be advantageous over other programming languages. 7

4 a) Discuss the various data structures in R with suitable syntax. 5
b) Explain the term ‘Data Wrangling’ in Data Analytics. 5

5 a) Give a step-wise explanation of how to visualize sparse matrices in Python
using Matplotlib. 5
b) How do you treat outliers in a dataset? 3
c) What is Time Series analysis? 2

6 Brief on the various types of data structures and explain how data can be accessed
within them. Give suitable syntax / code snippets. 10

7 a. Create a small data frame representing a database of films. It should contain the
fields title, director, year, country, and at least three films. 10
b. Create a second data frame of the same format as above, but containing just one
new film.
c. Merge the two data frames using rbind().
d. Try sorting the titles using sort(): what happens?
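
A minimal sketch of question 7, with made-up film entries for illustration:

films <- data.frame(title = c("Film A", "Film B", "Film C"),
                    director = c("Director X", "Director Y", "Director Z"),
                    year = c(1995, 2003, 2010),
                    country = c("USA", "UK", "India"))
new_film <- data.frame(title = "Film D", director = "Director W",
                       year = 2016, country = "France")
films <- rbind(films, new_film)  # same columns, so rbind() appends the row
sort(films$title)                # titles returned in alphabetical order

In current R versions the title column is character, so sort() simply returns the titles alphabetically; in R releases before 4.0 the column would default to a factor and be sorted by its levels.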

8 Explain control structures in R with examples.

Unit 2 – MCQ (1 mark)

1 In practice, Line of best fit or regression line is found when:


a. Sum of residuals (∑(Y – h(X))) is minimum
b. Sum of the absolute value of residuals (∑|Y – h(X)|) is maximum
c. Sum of the square of residuals (∑(Y – h(X))²) is maximum
d. Sum of the square of residuals (∑(Y – h(X))²) is minimum

Answer: d

2 In the mathematical Equation of Linear Regression Y = β1 + β2X + ϵ, (β1, β2) refers
to__________

a. (X-intercept, Slope)
b. (Slope, X-Intercept)
c. (Y-Intercept, Slope)
d. (slope, Y-Intercept)

Answer: c

3 Consider the following learning algorithms:


a) Logistic regression
b) Back propagation
c) Linear regression
Which of the following options represents classification algorithms?

a. Only a and b
b. Only a and c
c. Only b and c
d. a, b and c

Answer: a
4 Which of the following metrics can be used for evaluating regression models?
i) R squared
ii) Adjusted R Squared
iii) F- Statistics
iv) RMSE/ MSE/ MAE

a. ii and iv
b. i and ii
c. i, ii, iii, iv
d. i , iii and iv

Answer: c

5 If the absolute value of your calculated t-statistic exceeds the critical value from the standard
normal distribution you can:

a. Safely assume that your regression results are significant
b. Reject the null hypothesis
c. Reject the assumption that the error terms are homoskedastic
d. Conclude that most of the actual values are very close to the regression line

Answer: b

6 Which of the following offsets do we use in case of least square line fit? Suppose the
horizontal axis is an independent variable and the vertical axis is a dependent variable.

a. Vertical offset
b. Perpendicular offset
c. Both but depend on situation
d. Correlation coefficient
Answer: a

7 A survey is taken from a randomly selected sample of 100 students on whether they had ever
played Cricket. 25% (0.25) of the 100 students said they had played Cricket. Which one of
the following statements about the number 0.25 is correct?
a. It is a sample proportion
b. It is a population proportion
c. It is a random number
d. It is an error

Answer: a

8 Which of the following sorts the data frame x in decreasing order of the elements in column B?

a. x[ordersort(x$B),]
b. x[rev(order(x$B)),]
c. x[order(x$B),]
d. x[rev(ordersort(x$B)),]

Answer: b
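
A hedged sketch of option (b) versus option (c), assuming x is a data frame with a numeric column B:

x <- data.frame(A = c("p", "q", "r"), B = c(2, 9, 5))
x[order(x$B), ]        # ascending order of B
x[rev(order(x$B)), ]   # descending order of B, as in option (b)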

9 _________ and _________ are types of matrix functions.

a. apply and lapply
b. apply and sapply
c. sapply only
d. lapply only

Answer: b

10 You are given the following piece of code for forward propagation through a single hidden
layer in a neural network. This layer uses the sigmoid activation. Identify and correct the
error.
a. z = np.matmul(W, a_prev) + b OR z = np.dot(W, a_prev) + b
b. z = np.matmul(W, a_prev) + a OR z = np.dot(W, a_prev) + a
c. z = np.matmul(W, a_prev) + b AND z = np.dot(W, a_prev) + b
d. z = np.matmul(W, b_prev) + b OR z = np.dot(W, b_prev) + b

Answer: a

11 Function used for linear regression in R is __________

a. lm(formula, data)
b. lr(formula, data)
c. lrm(formula, data)
d. regression.linear(formula, data)

Answer: a
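
A minimal lm() sketch on made-up data:

df <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))
fit <- lm(y ~ x, data = df)   # lm(formula, data)
summary(fit)                  # coefficients, R-squared, p-values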

12 Which of the following counts the number of good cases when doing pairwise analysis?

a. count.pairwise
b. count()
c. anova.para()
d. count.poly()

Answer: a

13 __________ refers to a group of techniques for fitting and studying the straight-line
relationship between two variables.

a. Linear regression
b. Logistic regression
c. Gradient Descent
d. Greedy algorithms

Answer: a

14 In a linear regression problem, we are using “R-squared” to measure goodness-of-fit. We add
a feature in the linear regression model and retrain the same model. Which of the following
options is true?

a. If R Squared increases, this variable is significant.
b. If R Squared decreases, this variable is not significant.
c. Individually R squared cannot tell about variable importance. We can’t say anything
about it right now.
d. None of these

Answer: c

15 To test the linear relationship of y (dependent) and x (independent) continuous variables,
which of the following plots is best suited?

a. Scatter plot
b. Barchart
c. Histograms
d. None of these

Answer : a

Unit 2 - Long Answer Type Questions (10 marks)

1 a) Mention the library/ies used for linear regression in R. 3
b) What are the assumptions of linear regression? 4
c) With a syntax/code snippet, explain how MGARCH BEKK is performed using the R
programming language. 3

2 a) What are the Null and Alternate Hypotheses? 3
b) Discuss briefly the k-fold cross validation method and its pros and cons. 6

3 a) What is linear regression? Explain its types. 5
b) How do you test if your linear model has a good fit? 5

4 a) Explain categorical predictors in detail. 5
b) What are the applications of KNN? Mention a scenario when it is most suitable to
use KNN. 5

5 a) Explain how the significance of a correlation can be tested. 5


b) Discuss: Multiple Linear Regression Model. 5

6 a) Explain the significance of slope coefficients using: critical value, p-value and
confidence interval. 7
b) Briefly discuss the types of correlation coefficients. 3

7 Explain writing functions for linear regression in R with an example. 10

8 Discuss: Non-linear Transformations of the Predictors in R 10

Unit 3 - MCQ (1 mark)

1 Which of the following methods do we use to best fit the data in Logistic Regression?

a. Least Square Error
b. Maximum Likelihood
c. Jaccard distance
d. Both A and B

Answer: b

2 _______________ measures the model prediction error. It corresponds to the average
difference between the observed known values of the outcome and the value predicted by the
model.

a. R-square
b. Root Mean Squared Error
c. Residual Sum of Squares
d. Ordinary least squares

Answer: b

3 The________function produces a matrix that contains all of the pairwise correlations among
the predictors in a data set. The first command below gives an error message because
the_______________variable is qualitative.

a. Pair() and Direction
b. Predict() and unidirection
c. Cor() and Direction
d. Lda() and cor()
Answer: c

4 If two variables, x and y, have a very strong linear relationship, then:

a. There is evidence that x causes a change in y
b. There is evidence that y causes a change in x
c. There might not be any causal relationship between x and y
d. None of these alternatives is correct

Answer: c

5 Ridge regression takes ________________ value of variables.

a. Squared value of variables
b. Absolute value of variables
c. Cube value of variables
d. Root value of variables

Answer: a

6 Which of the following is true about the graphs below (A, B, C from left to right) of the cost
function versus the number of iterations?

Suppose l1, l2 and l3 are the three learning rates for A, B, C respectively. Which of the following
is true about l1, l2 and l3?

a. l2 < l1 < l3
b. l1 > l2 > l3
c. l1 = l2 = l3
d. l1 < l2 > l3

Answer: a

7 Suppose you have been given the following scenario for training and validation error for Linear
Regression.
Which of the following scenarios would give you the right hyper parameter?

a. 1
b. 2
c. 3
d. 4

Answer: b

8 What would be the root mean square training error for this data if you run a Linear Regression
model of the form (Y = A0+A1X)?

a. Less than 0
b. Greater than zero
c. Equal to 0
d. Less than or equal than of these

Answer: c

9 _______prints discriminant functions based on centered (not standardized) variables.

a. lda()
b. partimat( )
c. qda()
d. lr(formula, data)

Answer: a

10 ________specializes in converting data from wide to long format

a. Gcc
b. Reshape
c. Reshape2
d. gcc2

Answer: c

11 Outcomes such as 0 and 1, pass and fail, or true and false are an example of?

a. Multinomial Logistic Regression
b. Binary Logistic Regression
c. Ordinal Logistic Regression
d. None of the above

Answer: b

12 Which method uses a number of nearby neighbours to classify a new record?

a. KNN
b. Validation data
c. Euclidean Distance
d. All the above

Answer: a

13 When performing regression or classification, which of the following is the correct way to
preprocess the data?
a. Normalize the data → PCA → training
b. PCA → normalize PCA output → training
c. Normalize the data → PCA → normalize PCA output → training
d. None of the above

Answer: a

14 In which of the following cases will K-means clustering fail to give good results?
1) Data points with outliers
2) Data points with different densities
3) Data points with nonconvex shapes

a. 1 and 2
b. 2 and 3
c. 1, 2, and 3
d. 1 and 3
Answer: c

15 To apply bagging to regression trees, which of the following is/are true in such a case?

1. We build N regression trees using N bootstrap samples
2. We take the average of the N regression trees
3. Each tree has a high variance with low bias

a. 1 and 2
b. 2 and 3
c. 1 and 3
d. 1,2 and 3

Answer: d

Unit 3 - Long Answer Type Questions (10 marks)

1 a) How do you choose a regression model that best fits a given data set? 10
b) Explain Factors Regression in R

2 a) Can logistic regression be used for more than 2 classes? 3
b) What is the ROC curve in logistic regression? 4
c) Explain whether and why AUC ROC can be used for regression. If not, why? 3

3 Below are two different logistic models with different values for β0 and β1. 10

Which of the following statement(s) is true about the β0 and β1 values of the two logistic
models (Green, Black)?
Note: consider Y = β0 + β1*X. Here, β0 is the intercept and β1 is the coefficient.
4 a) Explain LDA, QDA, KNN Model Evaluation 8
b) How do you do quadratic discriminant analysis in R? 2

5 a) Explain if AUC ROC can be used for regression. If not, why? 3
b) What is the importance of correlation analysis between stock market valuation and
the economic situation of growing business entities? 7

6 a) In what scenarios can multiple linear regression be used? 3
b) Explain the steps to run a linear regression analysis using R 7

7 Explain the Linear Discriminant Analysis and Quadratic Discriminant Analysis in R 10
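
A hedged sketch using the MASS package; the built-in iris data set is used only for illustration:

library(MASS)
fit_lda <- lda(Species ~ ., data = iris)   # linear discriminant analysis
fit_qda <- qda(Species ~ ., data = iris)   # quadratic discriminant analysis
pred <- predict(fit_lda, iris)$class       # predicted classes from the LDA fit
mean(pred == iris$Species)                 # training accuracy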

8 Explain K-Nearest Neighbors in R and in what scenarios the K-nearest neighbors
algorithm can be used. 10
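
A hedged sketch with the class package; the 70/30 split and k = 3 are arbitrary choices:

library(class)
set.seed(1)
idx <- sample(nrow(iris), 0.7 * nrow(iris))            # random 70% for training
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 3)             # vote among the 3 nearest neighbours
table(pred, iris$Species[-idx])                        # confusion matrix on the test split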

Unit 4 - MCQ (1 mark)

1 Which of the following is not a step involved in Leave-One-Out cross validation?


(a) Leave out one data point and build the model on the rest of the data set
(b) Test the model against the next subset and record the test error associated with the
prediction
(c) Repeat the process for all data points
(d) Compute the overall prediction error by taking the average of all these test error
estimates recorded at step 2.

Answer: (b)

2 Which of the following is true about the tuning parameter in the Lasso model?
(a) Accounts for the amount of expansion of data values about a central point
(b) Results in a trade-off between bias and variance in resulting estimators
(c) Increases with variance
(d) Does Not increase with bias

Answer: (b)

3 Suppose we fit “Lasso Regression” to a data set, which has 100 features (X1,X2…X100).
Now, we rescale one of these feature by multiplying with 10 (say that feature is X1), and then
refit Lasso regression with the same regularization parameter. Now, which of the following
options will be correct?

a. It is more likely for X1 to be included in the model.
b. It is more likely for X1 to be excluded from the model.
c. Can’t say.
d. None of these.

Answer: a

4 Which of the following steps / assumptions in regression modeling impacts the trade-off
between under-fitting and overfitting the most.

a. The polynomial degree
b. Whether we learn the weights by matrix inversion or gradient descent
c. The use of a constant term
d. The non-polynomial degree

Answer: a

5 Let’s say a “Linear regression” model perfectly fits the training data (train error is zero).
Which of the following statements is true?

a. You will always have test error zero
b. You cannot have test error zero
c. None of the above
d. You can have test error zero

Answer: c

6 _______is analyzing the root cause of the difference in performance between the current and
the perfect models.

a. Ablative analysis
b. Error analysis
c. ANOVA
d. Tradeoff analysis

Answer: b
7 The two main types of stepwise procedures in regression are:

a. Backward elimination and Forward selection
b. Prediction and non-prediction
c. step() and regsubsets()
d. Ridge estimator and Bayesian estimator

Answer: a

8 The below graph distributions are:

a. Bivariate Data
b. Univariate Data
c. Uniform Distribution
d. Normal Distribution

Answer: c

9 Statement 1: The cost function is altered by adding a penalty equivalent to the square of the
magnitude of the coefficients
Statement 2: Ridge and Lasso regression are some of the simple techniques to reduce model
complexity and prevent overfitting which may result from simple linear regression.
a. Statement 1 is true and statement 2 is false
b. Statement 1 is false and statement 2 is true
c. Both statements (1 & 2) are wrong
d. Both statements (1 & 2) are true

Answer: d

10 To do Ridge and Lasso Regression in R we will use which library? ___________

a. ggplot
b. glmnet
c. caret
d. dplyr

Answer: b
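
A hedged glmnet sketch on made-up data; alpha = 0 gives ridge and alpha = 1 gives the lasso:

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)     # 100 observations, 5 predictors
y <- x %*% c(2, -1, 0, 0, 1) + rnorm(100)
ridge <- cv.glmnet(x, y, alpha = 0)       # ridge with cross-validated lambda
lasso <- cv.glmnet(x, y, alpha = 1)       # lasso with cross-validated lambda
coef(lasso, s = "lambda.min")             # coefficients at the best lambda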

11 Which of the following can be used to create the most common graph types?

a. Qplot
b. Quickplot
c. Plot
d. All of the mentioned

Answer : a

12 For k-fold cross-validation, a smaller k value implies less variance.

a. True
b. False

Answer: a

13 In Ridge regression, a hyperparameter called “_____________” controls the
weighting of the penalty in the loss function.
a) Alpha.
b) Gamma.
c) Lambda.
d) None of the above

Answer: c

14 Ridge regression takes ________________ value of variables.


a) Squared value of variables.
b) Absolute value of variables.
c) Cube value of variables.
d) Root value of variables.

Answer: a

15 If different models trained using different training datasets derived from the same population
have high accuracy or make similarly accurate predictions, then the models are said to have _______

a. High bias
b. Low bias

Answer: b

Unit 4 - Long Answer Type Questions (10 marks)

1 a) Which of the following statement(s) is / are true for Gradient Descent (GD) and 6
Stochastic Gradient Descent (SGD)? Justify your choice.
1. In GD and SGD, you update a set of parameters in an iterative manner to
minimize the error function.
2. In SGD, you have to run through all the samples in your training set for a
single update of a parameter in each iteration.
3. In GD, you either use the entire data or a subset of training data to update a
parameter in each iteration.
b) What are the layout components of Bootstrap? 4

2 a) Which of the following is/are one of the important step(s) to pre-process the text 6
in NLP based projects? Justify.
1. Stemming
2. Stop word removal
3. Object Standardization

b) Explain if containers can be nested in Bootstrap. If not, why? 4
3 a) Is leave-one-out cross-validation a better method than k-fold cross-validation? 6
Explain with suitable scenarios.
b) Comment on the variance of leave-one-out cross-validation. 4

4 a) Explain why ridge regression is or is not a shrinkage method. 4
b) What is shrinkage in linear regression? 4
c) Define shrinkage analysis. 2

5 a) Explain Ridge Regression and the Lasso in R 5


b) Explain Validation Set Approach 5

6 a) Explain the various dimension reduction methods 5


b) Explain Principal Components Regression 5

7 a) Explain Logistic Regression 5


b) Write notes on LDA 5

8 Explain the Forward and Backward Stepwise Selection in R 10
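
A hedged sketch with the leaps package; mtcars is used only as a convenient built-in data set:

library(leaps)
fwd <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "forward")
bwd <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "backward")
summary(fwd)$adjr2   # adjusted R-squared for each model size
summary(bwd)$bic     # BIC for each model size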

Unit 5 - MCQ (1 mark)

1 Based on the cues, choose the most appropriate answer:
I: It is a set of nested clusters that are arranged as a tree
II: Requires the computation and storage of an n×n distance matrix

(a) K-means clustering
(b) Hierarchical clustering
(c) K-fold cross validation
(d) Regression Tree

Answer: (b)

2 Decision trees can be used if the input and output variables are:

(a) Categorical
(b) Continuous
(c) Both (a) and (b)
(d) None of the above
Answer: (c)

3 _____________ is a special type of bagging applied to decision trees.

(a) PCA
(b) Bagging
(c) Boosting
(d) Random Forest

Answer: (d)

4 Which of the following need not be tuned using cross-validation to avoid overfitting in a
random forest algorithm?
(a) Minimum size of terminal nodes
(b) Maximum size of terminal nodes
(c) Maximum number of terminal nodes
(d) None of the above

Answer: (b)

5 Observe the code snippet given below and answer: What should be filled in place of method
to fit a linear regression with backward selection?
step.model <- train(Inputdatafile ~ .,
                    method = "______________",
                    tuneGrid = data.frame(nvmax = 1:8),
                    trControl = train.control)

(a) leapforward
(b) leapBackward
(c) leapbackward
(d) leapSeq

Answer: (b)
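
A hedged, completed version of the snippet; the swiss data set and the control settings here are assumptions for illustration:

library(caret)
train.control <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
step.model <- train(Fertility ~ ., data = swiss,
                    method = "leapBackward",                # backward selection via leaps
                    tuneGrid = data.frame(nvmax = 1:5),
                    trControl = train.control)
step.model$bestTune                                         # selected number of predictors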

6 What is the importance of using PCA before clustering? Choose the most complete answer.

a. Find the explained variance
b. Avoid bad features
c. Find good features to improve your clustering score
d. Find which dimension of data maximize the features variance

Answer: d

7 Consider the following figure for answering the next few questions. In the figure, X1 and X2
are the two features and the data points are represented by dots (-1 is the negative class and +1 is
the positive class). You first split the data based on feature X1 (say the splitting point is x11),
which is shown in the figure using a vertical line. Every value less than x11 will be predicted as
the positive class and every value greater than x11 will be predicted as the negative class.

How many data points are misclassified in above image?

a. 1
b. 2
c. 3
d. 4

Answer: a

8 The most popularly used dimensionality reduction algorithm is Principal Component Analysis
(PCA). Which of the following is/are true about PCA?

1. PCA is an unsupervised method
2. It searches for the directions in which the data have the largest variance
3. Maximum number of principal components <= number of features
4. All principal components are orthogonal to each other

a. 1 and 2
b. 1 and 3
c. 2, 3 and 4
d. 1, 2, 3 and 4

Answer: d
9 What will happen when eigenvalues are roughly equal?

a. PCA will perform outstandingly
b. PCA will perform badly
c. Can’t Say
d. PCA perform neurally

Answer: b

10 Which of the following is true about training and testing error in such a case?

Suppose you want to apply the AdaBoost algorithm on Data D which has T observations. You
set half the data for training and half for testing initially. Now you want to increase the
number of data points for training T1, T2 … Tn where T1 < T2…. Tn-1 < Tn.

a. The difference between training error and test error increases as number of
observations increases
b. The difference between training error and test error decreases as number of
observations increases
c. The difference between training error and test error will not change
d. None of These

Answer: b

11 What is the biggest weakness of decision trees compared to logistic regression classifiers?

a. Decision trees are more likely to overfit the data
b. Decision trees are more likely to underfit the data
c. Decision trees do not assume independence of the input features
d. None of the mentioned

Answer: a

12 Which of the following is a widely used and effective machine learning algorithm based on
the idea of bagging?

a. Decision Tree
b. Regression
c. Classification
d. Random Forest

Answer: d
13 The most widely used metrics and tools to assess a classification model are:
a. Confusion matrix
b. Cost-sensitive accuracy
c. Area under the ROC curve
d. All of the above

Answer: d

14 Which of the following is a disadvantage of decision trees?

a. Factor analysis
b. Decision trees are robust to outliers
c. Decision trees are prone to overfitting
d. None of the above

Answer: c

15 Which of the following is the most appropriate strategy for data cleaning before performing
clustering analysis, given a less than desirable number of data points?
1. Capping and flooring of variables
2. Removal of outliers

a. 1
b. 2
c. 1 and 2
d. none of the mentioned

Answer: a

Unit 5 - Long Answer Type Questions (10 marks)

1 a) Explain the basics of decision trees: Regression Trees, Classification Trees. 7
b) Why is Euclidean distance preferred over Manhattan distance in the K-means
algorithm? 3

2 a) Explain in detail Fitting Classification Trees in R. 5
b) Explain Linear Models. 3
c) What are the uses of Principal Components? 2
3 a) Explain Bagging - Random Forests 3
b) Define Boosting 3
c) Discuss the advantages and disadvantages of decision trees 4

4 a) List down the attribute selection measures used by the ID3 algorithm to
construct a Decision Tree. 6
b) List down the problem domains in which Decision Trees are most suitable. 4

5 a) Explain Principal Components Analysis in the R programming language. 6
b) Explain Hierarchical Clustering. 4
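
A minimal PCA sketch in R; USArrests is just a convenient built-in data set:

pca <- prcomp(USArrests, scale. = TRUE)   # standardize variables before PCA
summary(pca)                              # proportion of variance explained per component
biplot(pca)                               # scores and loadings in one plot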

6 a) If it takes one hour to train a Decision Tree on a training set containing 1 million
instances, roughly how much time will it take to train another Decision Tree on a
training set containing 10 million instances? 6
b) What are the disadvantages of Classification and Regression Trees (CART)? 4

7 Explain K-Means Clustering and Hierarchical Clustering in R 10
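
A hedged sketch of both approaches on the iris measurements; k = 3 is chosen for illustration:

x <- scale(iris[, 1:4])                      # standardize the features
km <- kmeans(x, centers = 3, nstart = 25)    # K-means with 3 clusters
table(km$cluster, iris$Species)              # compare clusters with the known species

hc <- hclust(dist(x), method = "complete")   # hierarchical clustering on Euclidean distances
plot(hc)                                     # dendrogram
cutree(hc, k = 3)                            # cut the tree into 3 clusters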

8 Explain Fitting Classification Trees in R and Fitting Regression Trees in R 10
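
A hedged sketch using the rpart package; iris (classification) and mtcars (regression) are used only for illustration:

library(rpart)
class_tree <- rpart(Species ~ ., data = iris, method = "class")   # classification tree
reg_tree <- rpart(mpg ~ ., data = mtcars, method = "anova")       # regression tree
predict(class_tree, iris[1:3, ], type = "class")                  # predicted classes
predict(reg_tree, mtcars[1:3, ])                                  # predicted mpg values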

***
