Exploratory Data Analysis for Machine Learning

The document consists of a series of questions and answers related to Exploratory Data Analysis (EDA), covering key concepts such as data visualization techniques, measures of central tendency, handling outliers, and dimensionality reduction. It emphasizes the importance of understanding data distributions, relationships between variables, and appropriate methods for data preprocessing. The content serves as a guide for best practices in EDA to enhance data analysis and modeling decisions.

Uploaded by eyob53834

1. Which of the following best describes Exploratory Data Analysis (EDA)?

a) Applying deep learning models to predict labels directly.
b) Using statistical techniques to summarize and visualize the main characteristics of the
data.
c) Splitting data into training and test sets without examining distributions.
d) Only applying regression models to identify relationships.
2. In EDA, a box plot is most useful for:
a) Displaying the frequency distribution of a categorical variable.
b) Comparing the distributions of numeric variables and identifying outliers.
c) Showing the correlation between two continuous variables.
d) Representing cumulative counts of a variable.
3. Which measure of central tendency is most robust against outliers?
a) Mean
b) Median
c) Mode
d) Geometric Mean
4. Suppose you have a dataset with a highly skewed distribution. Which measure of spread
is typically more appropriate than the standard deviation?
a) Range
b) Variance
c) Interquartile Range (IQR)
d) Mean Absolute Deviation (MAD)
5. When dealing with missing data, which of the following methods might introduce bias if
the data are not missing at random?
a) Dropping all rows with missing values
b) Imputing with the mean of the column
c) Imputing using regression-based methods
d) Multiple imputation by chained equations (MICE)
6. A histogram is best used for:
a) Displaying the relationship between two categorical variables.
b) Showing the distribution of a single continuous variable.
c) Examining correlation between two continuous variables.
d) Visualizing time series data.
7. Correlation between two variables is best visualized using a:
a) Scatter plot
b) Pie chart
c) Stacked bar chart
d) Box plot
8. The Pearson correlation coefficient measures:
a) A linear association between two variables.
b) A nonlinear association between two variables.
c) The rank correlation between two variables.
d) The difference in medians between two groups.
9. If a variable’s distribution is right-skewed, it means:
a) The mass of the distribution is concentrated on the left and the tail extends to the right.
b) The mass is concentrated on the right and the tail extends to the left.
c) The distribution is symmetrical.
d) The distribution is uniform.
10. A QQ (Quantile-Quantile) plot is generally used to:
a) Compare the distributions of two categorical variables.
b) Assess whether a data set is normally distributed.
c) Visualize pairwise correlations for multiple variables.
d) Evaluate model residuals against a target variable.
11. Standardizing a feature typically involves:
a) Shifting it to have mean = 0 and scaling it to have standard deviation = 1.
b) Multiplying all values by a constant factor greater than 1.
c) Taking the logarithm of all values.
d) Replacing all outliers with the median.
12. A violin plot provides more information than a box plot by:
a) Displaying the correlation matrix.
b) Showing the kernel density estimation of the distribution.
c) Highlighting the linear regression line.
d) Visualizing only the median and outliers.
13. The purpose of a pairwise correlation matrix is:
a) To display relationships between more than two categorical variables.
b) To summarize missing values in each feature.
c) To show linear correlation coefficients between multiple continuous variables.
d) To evaluate classification accuracy.
14. When analyzing categorical variables, a bar plot is often more appropriate than a line plot
because:
a) Categorical variables have a natural continuous ordering.
b) Line plots emphasize trends between ordered values, which may not be meaningful for
categories.
c) Bar plots cannot show frequencies.
d) Line plots and bar plots are identical.
15. Identifying and handling outliers is important in EDA because:
a) Outliers never affect the mean.
b) Outliers can distort statistical measures and model performance.
c) Outliers are always due to data entry errors.
d) Removing outliers will always improve accuracy.
16. Which of the following transformations is often used to reduce right-skew in a feature’s
distribution?
a) Exponential transformation
b) Logarithmic transformation
c) Inverse transformation
d) Binarization
17. If a dataset’s features are on very different scales, a common approach before applying
distance-based algorithms is to:
a) Leave the data as is.
b) Encode the data using one-hot encoding.
c) Scale the features, for example using StandardScaler or MinMaxScaler.
d) Only normalize the target variable.
18. The mean and variance are best used as measures of central tendency and spread when:
a) The data are highly skewed.
b) The data have outliers.
c) The data are approximately normally distributed.
d) The data are purely categorical.
19. The Spearman correlation coefficient is preferred over Pearson’s when:
a) Both variables are normally distributed.
b) The relationship is nonlinear but monotonic.
c) Variables are categorical.
d) There are no missing values.
20. If your dataset contains mixed data types (numeric and categorical), which plot is best
suited for an initial overview?
a) A scatter matrix plot of all features
b) A heatmap of correlations
c) A pair plot that includes histograms and scatter plots for numeric variables and
frequency charts for categorical ones
d) A QQ plot for all variables combined
21. Principal Component Analysis (PCA) in EDA is often used to:
a) Cluster data into meaningful groups.
b) Reduce dimensionality while retaining most variation in the data.
c) Replace missing values with principal components.
d) Create a target variable for supervised learning.
22. When dealing with time series data in EDA, a line plot is preferred over a bar plot for:
a) Visualizing frequency counts of a categorical variable.
b) Showing changes in a continuous variable over time.
c) Comparing distributions between two groups.
d) Displaying box-and-whisker statistics.
23. One-hot encoding is used to:
a) Scale numeric features.
b) Encode categorical variables into binary indicator variables.
c) Normalize distributions of skewed data.
d) Identify outliers in continuous data.
24. A heatmap is particularly useful for:
a) Showing the distribution of a single variable.
b) Representing correlations or distances between multiple variables as color-coded
intensities.
c) Highlighting missing values in a 3D graph.
d) Displaying regression lines between pairs of variables.
25. Identifying the shape of the distribution is a crucial step in EDA because it helps in:
a) Selecting the appropriate modeling algorithm.
b) Deciding which scaling method to apply.
c) Understanding the nature of your data and choosing the right statistical measures.
d) Automatically tuning hyperparameters of a model.
26. In EDA, which of the following is a good practice when dealing with a large number of
features?
a) Ignore all features that don’t look important at first glance.
b) Use dimensionality reduction techniques such as PCA to visualize patterns in fewer
dimensions.
c) Only focus on features with the highest variance.
d) Convert all features to a single summary statistic.
27. A scatter plot with a clear nonlinear pattern (e.g., a curved shape) suggests that:
a) Pearson correlation might not fully capture the relationship.
b) The data are normally distributed.
c) There is no relationship between the variables.
d) It is safe to assume linearity.
28. Before building a predictive model, checking for multicollinearity is important because:
a) Multicollinearity inflates the variance of model coefficients, making them unstable.
b) Multicollinearity improves model interpretability.
c) Multicollinearity eliminates outliers automatically.
d) Multicollinearity ensures no missing data remain.
29. A density plot is different from a histogram in that it:
a) Is a discrete representation of counts.
b) Provides a smooth estimate of the distribution of a variable.
c) Cannot show continuous variables.
d) Always has multiple peaks.
30. In EDA, which method is best for visualizing both the median and the spread of data
while retaining information about the underlying distribution shape?
a) Bar plot
b) Box plot
c) Violin plot
d) Pie chart
31. When you detect a potential outlier, your first step should be to:
a) Remove it immediately.
b) Check the data source and domain context to determine if it’s a valid data point.
c) Replace it with the mean.
d) Replace it with the median.
32. Transforming skewed data using a log transform is appropriate when:
a) The data contain zero or negative values.
b) The data are already symmetric.
c) The data are strictly positive and heavily right-skewed.
d) The data are categorical.
33. A pair plot (or scatterplot matrix) is typically used to:
a) Show the temporal evolution of a single feature.
b) Visualize relationships and distributions among multiple numeric variables.
c) Represent correlations as color-coded blocks.
d) Display contingency tables for categorical variables.
34. The primary goal of EDA is to:
a) Confirm a specific hypothesis without bias.
b) Explore data to uncover underlying structure, detect outliers, and suggest hypotheses.
c) Build the final predictive model directly.
d) Implement complex neural networks for classification.
35. The Min-Max scaling transformation maps data to a range:
a) From -1 to 1
b) From 0 to 1
c) From -∞ to ∞
d) From 1 to 10
36. In a correlation matrix, which value indicates the strongest linear relationship?
a) 0
b) 0.5
c) -1 or 1
d) 0.1
37. If a feature has many outliers and heavy tails, a robust scaler might be preferred because
it:
a) Uses the mean and variance.
b) Uses the median and interquartile range, reducing the influence of outliers.
c) Normalizes the data to have a sum of 1.
d) Removes all outliers before scaling.
38. One main reason for employing EDA is to:
a) Automatically select the best machine learning algorithm.
b) Gain insights and guide further data preprocessing and modeling decisions.
c) Avoid visualizing data altogether.
d) Ensure that linear regression is always the best choice.
39. If you find a high correlation between two features in your dataset, one typical approach
before modeling is to:
a) Drop one of the correlated features to reduce redundancy.
b) Keep both features unchanged.
c) Invert one feature’s values.
d) Assign random values to one feature.
40. Kernel density estimation (KDE) in EDA is used to:
a) Estimate a discrete probability distribution’s PMF directly.
b) Smoothly estimate the probability density function of a continuous variable.
c) Check for missing values.
d) Encode categorical variables into numeric form.

Answers

1. b (EDA involves summarizing and visualizing the main characteristics of the
data.)
2. b (A box plot helps visualize distribution and detect outliers.)
3. b (The median is less affected by outliers than the mean.)
4. c (IQR is robust against outliers and skewed distributions.)
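Answers 3 and 4 can be checked numerically. The sketch below uses only Python's standard library and a small made-up sample (the values and the helper name `iqr` are illustrative assumptions, not part of the quiz). It compares the mean, median, standard deviation, and IQR before and after one extreme value is added.

```python
import statistics


def iqr(values):
    """Interquartile range (Q3 - Q1) using linear interpolation between data points."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical sample; 500 stands in for a single extreme observation.
clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = clean + [500]

summary = {
    name: {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),
        "iqr": iqr(data),
    }
    for name, data in (("clean", clean), ("with_outlier", with_outlier))
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

In this sample the mean and standard deviation jump sharply once the extreme value is present, while the median and IQR move only slightly.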
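5. b (Mean imputation can introduce bias if data are not missing at random, and it
also understates the variance. Dropping rows, option a, can bias results under
the same condition.)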
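One side effect of the mean imputation discussed in answer 5 is easy to show: filling gaps with the column mean leaves the mean unchanged but shrinks the spread. The sketch below is a minimal illustration with a made-up column (`raw` and `impute_mean` are hypothetical names). It does not simulate a missingness mechanism, so it shows the variance distortion only, not the bias itself.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column; None marks a missing value.
raw = [2.0, 4.0, None, 8.0, None, 10.0]
observed = [v for v in raw if v is not None]
imputed = impute_mean(raw)

print("mean before/after: ", statistics.fmean(observed), statistics.fmean(imputed))
print("stdev before/after:", statistics.stdev(observed), statistics.stdev(imputed))
```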
6. b (A histogram shows the distribution of a single continuous variable.)
7. a (A scatter plot is used to visualize the relationship between two continuous
variables.)
8. a (Pearson correlation measures linear association.)
9. a (Right-skewed means the tail extends to the right.)
10. b (QQ plots compare sample quantiles with those of a reference distribution,
most commonly the normal distribution.)
11. a (Standardization sets mean to 0 and std. dev. to 1.)
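The standardization in answer 11 is the z-score transform, z = (x - mean) / std. A minimal standard-library sketch follows. The sample values and the function name `standardize` are assumptions for illustration. It divides by the population standard deviation, which is also what scikit-learn's StandardScaler does.

```python
import statistics


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std. dev."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)

print([round(v, 3) for v in z])
print("mean:", round(statistics.fmean(z), 6), "std:", round(statistics.pstdev(z), 6))
```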
12. b (Violin plots add density information to box plot elements.)
13. c (A correlation matrix shows pairwise linear correlations.)
14. b (Bar plots are better for showing counts of categories; line plots imply
continuity.)
15. b (Outliers can distort statistical measures and subsequent modeling.)
16. b (A log transform often reduces right-skew.)
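To see the effect described in answer 16, the sketch below computes a simple moment-based skewness before and after a natural-log transform. The data are invented and strictly positive, which the log requires. The `skewness` helper is an illustrative assumption and uses the uncorrected formula m3 / m2^1.5.

```python
import math
import statistics


def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    mu = statistics.fmean(values)
    m2 = statistics.fmean((v - mu) ** 2 for v in values)
    m3 = statistics.fmean((v - mu) ** 3 for v in values)
    return m3 / m2 ** 1.5


# Hypothetical right-skewed, strictly positive feature (e.g. incomes in thousands).
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]
logged = [math.log(v) for v in incomes]

print("skewness before log:", round(skewness(incomes), 2))
print("skewness after log: ", round(skewness(logged), 2))
```

For this sample the skewness stays positive after the transform but is noticeably smaller. When zeros are present, `math.log1p` is a common alternative.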
17. c (Scaling features before distance-based methods is common practice.)
18. c (Mean and variance are most appropriate for roughly normal distributions.)
19. b (Spearman is used for monotonic but not necessarily linear relationships.)
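The distinction in answer 19 shows up clearly on a monotonic but strongly nonlinear relationship such as y = exp(x). The sketch below is a plain-Python illustration (helper names are assumptions). It computes Spearman's coefficient as Pearson's coefficient applied to ranks, and the simple ranking helper assumes there are no tied values.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Ranks 1..n in ascending order. Assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


x = list(range(1, 11))
y = [math.exp(v) for v in x]  # monotonic increasing, far from linear

print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Spearman reports a perfect monotonic association here, while Pearson is well below 1 because the points do not lie on a straight line.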
20. c (A pair plot can display histograms for numeric features and frequency
charts for categorical features.)
21. b (PCA is used to reduce dimensionality.)
22. b (Line plots are ideal for showing continuous changes over time.)
23. b (One-hot encoding converts categorical variables into binary indicators.)
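As a concrete picture of answer 23, the sketch below encodes a small categorical column by hand. The function name `one_hot` and the sample values are assumptions. In practice a library routine such as pandas `get_dummies` or scikit-learn's `OneHotEncoder` would normally be used.

```python
def one_hot(values):
    """Return (sorted category list, rows of 0/1 indicators in that order)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

print(categories)
for value, row in zip(colors, encoded):
    print(value, row)
```

Each row contains exactly one 1, in the column of the category that row belongs to.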
24. b (Heatmaps represent values, like correlations, as color intensities.)
25. c (Knowing the distribution shape informs choice of statistics and
transformations.)
26. b (Dimensionality reduction can help understand and visualize large feature sets.)
27. a (Nonlinear patterns suggest Pearson correlation alone is insufficient.)
28. a (Multicollinearity causes instability in regression coefficients.)
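A common diagnostic for the problem in answer 28 is the variance inflation factor, VIF = 1 / (1 - R^2), where R^2 comes from regressing one feature on the others. The sketch below covers only the two-feature case, where R^2 equals the squared Pearson correlation. The data, the helper names, and the rule of thumb that a VIF above about 10 signals trouble are all assumptions for illustration.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def vif_two_features(xs, ys):
    """VIF for a pair of features. Undefined (division by zero) if |r| is exactly 1."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r * r)


# Hypothetical features: x2 is roughly 2 * x1, x3 is only loosely related to x1.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
x3 = [5, 1, 4, 2, 8, 3, 7, 6]

print("VIF(x1, x2):", round(vif_two_features(x1, x2), 1))
print("VIF(x1, x3):", round(vif_two_features(x1, x3), 2))
```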
29. b (Density plots provide a smooth estimate of the distribution.)
30. c (Violin plots show distribution shape, median, and spread.)
31. b (Always verify if an outlier is a true data point or an error first.)
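In line with answer 31, a reasonable first step is to flag suspicious points for review rather than delete them. The sketch below uses the common 1.5 x IQR fence rule. The threshold, the sample data, and the function name `iqr_outliers` are assumptions, and other rules such as z-scores or domain limits are equally valid.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical ages; 210 is implausible and should be checked against the source.
ages = [23, 25, 27, 29, 31, 33, 35, 210]
flagged = iqr_outliers(ages)

print("flagged for review:", flagged)
```

The flagged values are returned, not removed, so the decision to keep, correct, or drop them can be made with domain context.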
32. c (Log transforms are appropriate for positive, right-skewed data.)
33. b (Pair plots visualize relationships and distributions among numeric variables.)
34. b (EDA aims to uncover data structure and guide hypothesis generation.)
35. b (Min-Max scaling maps data to the [0, 1] range by default.)
36. c (Correlation of ±1 indicates the strongest linear relationship.)
37. b (A robust scaler uses median and IQR, reducing outlier influence.)
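The contrast between min-max scaling (answer 35) and robust scaling (answer 37) is easiest to see on data with one extreme value. The sketch below is a standard-library approximation of the two ideas, not a reimplementation of any library's scalers. The function names and sample data are assumptions.

```python
import statistics


def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum becomes 1."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("cannot min-max scale a constant feature")
    return [(v - low) / (high - low) for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Hypothetical feature with a single extreme value.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
mm = min_max_scale(data)
rb = robust_scale(data)

print("min-max:", [round(v, 3) for v in mm])
print("robust: ", [round(v, 3) for v in rb])
```

With min-max scaling the five ordinary values are squeezed into a narrow band near 0, because the extreme value defines the upper end of the range. With robust scaling they remain spread across roughly -1 to 1, and the extreme value simply lands far from the rest.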
38. b (EDA guides preprocessing and modeling decisions.)
39. a (Dropping one feature reduces redundancy and multicollinearity issues.)
40. b (KDE is used to smoothly estimate a continuous variable’s probability density
function.)
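
Kernel density estimation, as in answers 29 and 40, can be written in a few lines: the estimate at x is the average of Gaussian bumps centered on each sample. The sketch below is a minimal version with a hand-picked bandwidth. The bandwidth value, the sample data, and the function name `gaussian_kde` are assumptions, and libraries usually choose the bandwidth automatically.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density as an average of Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


# Hypothetical continuous sample with a cluster near 2 and one point at 5.
samples = [1.0, 1.2, 1.9, 2.1, 2.2, 5.0]
density = gaussian_kde(samples, bandwidth=0.5)

# Riemann-sum check that the estimate integrates to roughly 1 over [-5, 10].
step = 0.01
grid = [-5.0 + step * i for i in range(1501)]
area = sum(density(x) for x in grid) * step

print("density at 2.0:", round(density(2.0), 3))
print("density at 3.5:", round(density(3.5), 3))
print("approximate area under the curve:", round(area, 3))
```

A smaller bandwidth follows the samples more closely and produces a bumpier curve. A larger one smooths more and can hide real structure.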
