Subjective Questions


Assignment-based Subjective Questions

1. From your analysis of the categorical variables from the dataset, what could you infer about their effect on the
dependent variable?

Ans: The analysis of the categorical variables suggests that bike rentals tend to be higher in the summer and
fall seasons, peak in the months of September and October, are higher on Saturdays, Wednesdays and Thursdays, and
are higher in the year 2019. Bike rentals also appear to be higher on holidays.

2. Why is it important to use drop_first=True during dummy variable creation?

Ans: Setting drop_first=True while creating dummy variables drops the base/reference category. A categorical
variable with k levels needs only k-1 dummy variables; if all k are included, they always sum to 1 and are therefore
perfectly collinear, which adds multicollinearity to the model. The reference category can still be easily deduced:
it is the row in which all the other dummy variables of that category are 0.
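
The effect can be sketched without any library. The helper below is a hand-rolled stand-in for what pandas.get_dummies(..., drop_first=True) does, not the pandas implementation itself; the function name one_hot, the prefix argument, the alphabetical ordering of levels and the season labels are all assumptions made for illustration.

```python
def one_hot(values, prefix, drop_first=False):
    """Encode a list of category labels as rows of 0/1 dummy columns."""
    levels = sorted(set(values))       # assumed ordering: alphabetical
    if drop_first:
        levels = levels[1:]            # drop the base/reference level
    return [{f"{prefix}_{lvl}": int(v == lvl) for lvl in levels} for v in values]


seasons = ["fall", "spring", "summer", "winter", "spring"]  # illustrative labels

full = one_hot(seasons, "season")
reduced = one_hot(seasons, "season", drop_first=True)

# With all k dummies, every row sums to 1, which duplicates the intercept column.
row_sums_full = [sum(row.values()) for row in full]

# With k-1 dummies, the reference level ("fall") is the row where every dummy is 0.
print(row_sums_full)
print(reduced[0])
```

The constant row sum in the full encoding is the exact linear dependence that drop_first=True removes.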

3. Looking at the pair-plot among the numerical variables, which one has the highest correlation with the target
variable?

Ans: The temp variable has the highest correlation with the target variable.

4. How did you validate the assumptions of Linear Regression after building the model on the training set?

Ans: The assumptions of linear regression were validated by checking the VIF of the features (multicollinearity), the
distribution of the residuals (error terms) and the linear relationship between the dependent variable and the
feature variables.

5. Based on the final model, which are the top 3 features contributing significantly towards explaining the demand
of the shared bikes?

Ans: The top 3 variables are:

‘weathersit’:

Temperature is the most significant feature and affects the business positively, whereas the other
environmental conditions such as rain, humidity, wind speed and cloud cover affect the business negatively.

‘Yr’:

The year-on-year growth seems organic given the geographical attributes.

‘season’:

The winter season plays a crucial role in the demand for shared bikes.

General Subjective Questions


1. Explain the linear regression algorithm in detail.

Ans: Linear regression is a supervised machine learning algorithm. It predicts a dependent (target) variable from
one or more given independent variables by establishing a linear relationship between them. There are two types of
linear regression: simple linear regression and multiple linear regression. Simple linear regression uses a single
independent variable to predict the value of the target variable. Multiple linear regression uses multiple
independent variables to predict the numerical value of the target variable. The straight line showing the
relationship between the dependent and independent variables is called the regression line. A positive linear
relationship is one in which the dependent variable on the Y-axis increases as the independent variable on the
X-axis increases. If the dependent variable decreases as the independent variable increases, it is a negative linear
relationship.
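
As a minimal sketch of the simple (single-variable) case, the line y = intercept + slope * x can be fitted in closed form. This assumes ordinary least squares as the fitting criterion, which the answer above does not state explicitly; the function name and the toy data are made up for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x (illustrative only).
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

intercept, slope = fit_simple_linear_regression(x, y)
print(intercept, slope)
```

A positive slope corresponds to the positive linear relationship described above, and a negative slope to the negative one.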

2. Explain the Anscombe’s quartet in detail.

Ans: Anscombe's quartet consists of four data sets that have nearly identical simple descriptive statistics but have
very different distributions and appear very different when presented graphically. Each dataset consists of eleven
points. The primary purpose of Anscombe’s quartet is to illustrate the importance of looking at a set of data
graphically before beginning the analysis process, because summary statistics alone do not give an accurate
representation of the datasets being compared.
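
The point can be checked numerically. The sketch below uses the four data sets as they are commonly published for Anscombe's quartet and computes the mean of y and the correlation of x and y for each; the helper names are illustrative, not part of the assignment.

```python
from math import sqrt

# Anscombe's quartet as commonly published; sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def correlation(x, y):
    """Pearson correlation computed from deviations about the means."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


summary = {name: (mean(y), correlation(x, y)) for name, (x, y) in quartet.items()}
for name, (y_mean, r) in summary.items():
    print(name, round(y_mean, 2), round(r, 3))
```

The summary numbers agree to about two decimal places across the four sets even though scatter plots of the same data look nothing alike.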

3. What is Pearson’s R?

Ans: Pearson’s R (also known as Pearson’s correlation coefficient) measures the strength and direction of the
linear relationship between two variables. Pearson’s R takes values between -1 and 1. The coefficient is
interpreted as follows:

• A coefficient of -1 indicates a perfect negative (inversely proportional) linear relationship.
• A coefficient of 0 indicates no linear relationship.
• A coefficient of 1 indicates a perfect positive (proportional) linear relationship.

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$

Where:

n = the number of pairs of scores

Σxy = the sum of the products of paired scores

Σx = the sum of x scores

Σy = the sum of y scores

Σx² = the sum of squared x scores

Σy² = the sum of squared y scores
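
The formula translates almost term by term into code. The following is a small sketch under that reading; the function name pearson_r and the sample values are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's R from the raw sums n, Σx, Σy, Σxy, Σx² and Σy²."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises in step with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls in step with x
print(r_up, r_down)
```

The two calls illustrate the extreme cases from the interpretation list: a perfectly increasing line and a perfectly decreasing line.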


4. What is scaling? Why is scaling performed? What is the difference between normalized scaling and standardized
scaling?

Ans: Scaling is a pre-processing technique applied while building a machine learning model to bring the
independent feature variables in the dataset onto a comparable range.

A dataset can have several features whose values differ widely in magnitude and units. If no scaling is performed
on this data, the coefficients reflect the units of the features rather than their relative importance and are hard
to compare, which can lead to incorrect modelling.

The difference between normalization and standardization is that normalization (min-max scaling) brings all the
data points into a range between 0 and 1, while standardization replaces the values with their Z scores, giving a
mean of 0 and a standard deviation of 1.
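
Both transformations are short enough to sketch directly. The version below assumes the population standard deviation for the Z score (a sample standard deviation would give slightly different values); the function names and the example values are illustrative.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Normalization: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: replace each value with its Z score."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption here)
    return [(v - mu) / sigma for v in values]


temps = [10.0, 20.0, 30.0, 40.0]  # illustrative feature values

normalized = min_max_normalize(temps)
standardized = standardize(temps)
print(normalized)
print(standardized)
```

Normalized values are bounded by 0 and 1, while standardized values are unbounded but centred on 0.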

5. You might have observed that sometimes the value of VIF is infinite. Why does this happen?

Ans: The value of VIF is infinite when an independent variable is perfectly correlated with another independent
variable (or is an exact linear combination of the others). The R-squared value of regressing that variable on the
others is 1 in this case, and since VIF = 1/(1 - R²), the VIF becomes infinite. This indicates a problem of
multicollinearity, and one of these variables needs to be dropped in order to define a working regression model.
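
A small sketch makes the division by zero visible. It covers only the two-predictor case, where the R² of regressing one variable on the other equals their squared Pearson correlation, and it assumes integer inputs so that exact fractions can be used instead of floating point; the function names and data are illustrative.

```python
import math
from fractions import Fraction


def r_squared(x, y):
    """Exact squared Pearson correlation of two integer-valued variables."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    return Fraction(numerator * numerator, denominator)


def vif(x, other):
    """VIF = 1 / (1 - R²); infinite when R² is exactly 1."""
    r2 = r_squared(x, other)
    if r2 == 1:
        return math.inf  # 1 - R² is zero, so the ratio is undefined/infinite
    return float(1 / (1 - r2))


x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 10]  # exactly 2 * x1: perfect collinearity
x3 = [2, 1, 4, 3, 5]   # correlated with x1, but not perfectly

vif_perfect = vif(x1, x2)
vif_partial = vif(x1, x3)
print(vif_perfect, vif_partial)
```

Dropping either x1 or x2 removes the exact dependence, which is the remedy the answer describes.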

6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear regression.

Ans: A Q-Q plot is a quantile-quantile plot. It is a graphical tool for assessing whether two data sets come from a
common distribution, or whether one data set follows a theoretical distribution such as the normal, exponential
or uniform distribution. In linear regression, Q-Q plots are useful for checking whether the train data set and the
test data set come from populations with the same distribution. They are also another way to check whether data is
normally distributed, in which case the points fall along a straight line. The patterns are interpreted as
explained below.

• Interpretations
o Similar distribution: all the quantile points lie on or close to the straight line at
an angle of 45 degrees from the x-axis.
o Y values < X values: the y-value quantiles are lower than the x-value quantiles.
o X values < Y values: the x-value quantiles are lower than the y-value quantiles.
o Different distributions: the data points lie away from the straight line.
• Advantages
o Distribution aspects such as location shifts, scale shifts, changes in symmetry and outliers can all be
identified from a single plot.
o The plot has a provision to mention the sample size as well.
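
The points of a normal Q-Q plot can be computed without a plotting library. The sketch below pairs each sorted sample value with a standard normal quantile; the plotting position (i + 0.5)/n is one common convention among several, and the function names and samples are assumptions for illustration.

```python
from statistics import NormalDist


def normal_qq_points(sample):
    """Pair standard normal quantiles with the sorted sample values."""
    n = len(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(sample)))


def max_gap(points):
    """Largest vertical distance of any point from the 45-degree line y = x."""
    return max(abs(observed - expected) for expected, observed in points)


# A sample built from exact normal quantiles (given out of order on purpose).
normal_like = [NormalDist().inv_cdf(p) for p in (0.9, 0.1, 0.5, 0.3, 0.7)]
# A sample with one large outlier.
skewed = [0.1, 0.2, 0.3, 0.4, 9.0]

gap_normal = max_gap(normal_qq_points(normal_like))
gap_skewed = max_gap(normal_qq_points(skewed))
print(gap_normal, gap_skewed)
```

Plotting the pairs as a scatter against the line y = x gives the usual picture: the first sample hugs the line, while the outlier in the second pulls its last point far above it.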
