Exploratory Data Analysis For Machine Learning

The document consists of a series of questions and answers related to Exploratory Data Analysis (EDA), covering key concepts such as data visualization techniques, measures of central tendency, handling outliers, and dimensionality reduction. It emphasizes the importance of understanding data distributions, relationships between variables, and appropriate methods for data preprocessing. The content serves as a guide for best practices in EDA to enhance data analysis and modeling decisions.

Uploaded by

eyob53834

1. Which of the following best describes Exploratory Data Analysis (EDA)?
a) Applying deep learning models to predict labels directly.
b) Using statistical techniques to summarize and visualize the main characteristics of the data.
c) Splitting data into training and test sets without examining distributions.
d) Only applying regression models to identify relationships.
2. In EDA, a box plot is most useful for:
a) Displaying the frequency distribution of a categorical variable.
b) Comparing the distributions of numeric variables and identifying outliers.
c) Showing the correlation between two continuous variables.
d) Representing cumulative counts of a variable.
3. Which measure of central tendency is most robust against outliers?
a) Mean
b) Median
c) Mode
d) Geometric Mean
4. Suppose you have a dataset with a highly skewed distribution. Which measure of spread is typically more appropriate than the standard deviation?
a) Range
b) Variance
c) Interquartile Range (IQR)
d) Mean Absolute Deviation (MAD)
5. When dealing with missing data, which of the following methods might introduce bias if the data are not missing at random?
a) Dropping all rows with missing values
b) Imputing with the mean of the column
c) Imputing using regression-based methods
d) Multiple imputation by chained equations (MICE)
6. A histogram is best used for:
a) Displaying the relationship between two categorical variables.
b) Showing the distribution of a single continuous variable.
c) Examining correlation between two continuous variables.
d) Visualizing time series data.
7. Correlation between two variables is best visualized using a:
a) Scatter plot
b) Pie chart
c) Stacked bar chart
d) Box plot
8. The Pearson correlation coefficient measures:
a) A linear association between two variables.
b) A nonlinear association between two variables.
c) The rank correlation between two variables.
d) The difference in medians between two groups.
9. If a variable’s distribution is right-skewed, it means:
a) The mass of the distribution is concentrated on the left and the tail extends to the right.
b) The mass is concentrated on the right and the tail extends to the left.
c) The distribution is symmetrical.
d) The distribution is uniform.
10. A QQ (Quantile-Quantile) plot is generally used to:
a) Compare the distributions of two categorical variables.
b) Assess whether a data set is normally distributed.
c) Visualize pairwise correlations for multiple variables.
d) Evaluate model residuals against a target variable.
11. Standardizing a feature typically involves:
a) Shifting it to have mean = 0 and scaling it to have standard deviation = 1.
b) Multiplying all values by a constant factor greater than 1.
c) Taking the logarithm of all values.
d) Replacing all outliers with the median.
12. A violin plot provides more information than a box plot by:
a) Displaying the correlation matrix.
b) Showing the kernel density estimation of the distribution.
c) Highlighting the linear regression line.
d) Visualizing only the median and outliers.
13. The purpose of a pairwise correlation matrix is:
a) To display relationships between more than two categorical variables.
b) To summarize missing values in each feature.
c) To show linear correlation coefficients between multiple continuous variables.
d) To evaluate classification accuracy.
14. When analyzing categorical variables, a bar plot is often more appropriate than a line plot because:
a) Categorical variables have a natural continuous ordering.
b) Line plots emphasize trends between ordered values, which may not be meaningful for categories.
c) Bar plots cannot show frequencies.
d) Line plots and bar plots are identical.
15. Identifying and handling outliers is important in EDA because:
a) Outliers never affect the mean.
b) Outliers can distort statistical measures and model performance.
c) Outliers are always due to data entry errors.
d) Removing outliers will always improve accuracy.
16. Which of the following transformations is often used to reduce right-skew in a feature’s distribution?
a) Exponential transformation
b) Logarithmic transformation
c) Inverse transformation
d) Binarization
17. If a dataset’s features are on very different scales, a common approach before applying distance-based algorithms is to:
a) Leave the data as is.
b) Encode the data using one-hot encoding.
c) Scale the features, for example using StandardScaler or MinMaxScaler.
d) Only normalize the target variable.
18. The mean and variance are best used as measures of central tendency and spread when:
a) The data are highly skewed.
b) The data have outliers.
c) The data are approximately normally distributed.
d) The data are purely categorical.
19. The Spearman correlation coefficient is preferred over Pearson’s when:
a) Both variables are normally distributed.
b) The relationship is nonlinear but monotonic.
c) Variables are categorical.
d) There are no missing values.
20. If your dataset contains mixed data types (numeric and categorical), which plot is best suited for an initial overview?
a) A scatter matrix plot of all features
b) A heatmap of correlations
c) A pair plot that includes histograms and scatter plots for numeric variables and frequency charts for categorical ones
d) A QQ plot for all variables combined
21. Principal Component Analysis (PCA) in EDA is often used to:
a) Cluster data into meaningful groups.
b) Reduce dimensionality while retaining most variation in the data.
c) Replace missing values with principal components.
d) Create a target variable for supervised learning.
22. When dealing with time series data in EDA, a line plot is preferred over a bar plot for:
a) Visualizing frequency counts of a categorical variable.
b) Showing changes in a continuous variable over time.
c) Comparing distributions between two groups.
d) Displaying box-and-whisker statistics.
23. One-hot encoding is used to:
a) Scale numeric features.
b) Encode categorical variables into binary indicator variables.
c) Normalize distributions of skewed data.
d) Identify outliers in continuous data.
24. A heatmap is particularly useful for:
a) Showing the distribution of a single variable.
b) Representing correlations or distances between multiple variables as color-coded intensities.
c) Highlighting missing values in a 3D graph.
d) Displaying regression lines between pairs of variables.
25. Identifying the shape of the distribution is a crucial step in EDA because it helps in:
a) Selecting the appropriate modeling algorithm.
b) Deciding which scaling method to apply.
c) Understanding the nature of your data and choosing the right statistical measures.
d) Automatically tuning hyperparameters of a model.
26. In EDA, which of the following is a good practice when dealing with a large number of features?
a) Ignore all features that don’t look important at first glance.
b) Use dimensionality reduction techniques such as PCA to visualize patterns in fewer dimensions.
c) Only focus on features with the highest variance.
d) Convert all features to a single summary statistic.
27. A scatter plot with a clear nonlinear pattern (e.g., a curved shape) suggests that:
a) Pearson correlation might not fully capture the relationship.
b) The data are normally distributed.
c) There is no relationship between the variables.
d) It is safe to assume linearity.
28. Before building a predictive model, checking for multicollinearity is important because:
a) Multicollinearity inflates the variance of model coefficients, making them unstable.
b) Multicollinearity improves model interpretability.
c) Multicollinearity eliminates outliers automatically.
d) Multicollinearity ensures no missing data remain.
29. A density plot is different from a histogram in that it:
a) Is a discrete representation of counts.
b) Provides a smooth estimate of the distribution of a variable.
c) Cannot show continuous variables.
d) Always has multiple peaks.
30. In EDA, which method is best for visualizing both the median and the spread of data while retaining information about the underlying distribution shape?
a) Bar plot
b) Box plot
c) Violin plot
d) Pie chart
31. When you detect a potential outlier, your first step should be to:
a) Remove it immediately.
b) Check the data source and domain context to determine if it’s a valid data point.
c) Replace it with the mean.
d) Replace it with the median.
32. Transforming skewed data using a log transform is appropriate when:
a) The data contain zero or negative values.
b) The data are already symmetric.
c) The data are strictly positive and heavily right-skewed.
d) The data are categorical.
33. A pair plot (or scatterplot matrix) is typically used to:
a) Show the temporal evolution of a single feature.
b) Visualize relationships and distributions among multiple numeric variables.
c) Represent correlations as color-coded blocks.
d) Display contingency tables for categorical variables.
34. The primary goal of EDA is to:
a) Confirm a specific hypothesis without bias.
b) Explore data to uncover underlying structure, detect outliers, and suggest hypotheses.
c) Build the final predictive model directly.
d) Implement complex neural networks for classification.
35. The Min-Max scaling transformation maps data to a range:
a) From -1 to 1
b) From 0 to 1
c) From -∞ to ∞
d) From 1 to 10
36. In a correlation matrix, which value indicates the strongest linear relationship?
a) 0
b) 0.5
c) -1 or 1
d) 0.1
37. If a feature has many outliers and heavy tails, a robust scaler might be preferred because it:
a) Uses the mean and variance.
b) Uses the median and interquartile range, reducing the influence of outliers.
c) Normalizes the data to have a sum of 1.
d) Removes all outliers before scaling.
38. One main reason for employing EDA is to:
a) Automatically select the best machine learning algorithm.
b) Gain insights and guide further data preprocessing and modeling decisions.
c) Avoid visualizing data altogether.
d) Ensure that linear regression is always the best choice.
39. If you find a high correlation between two features in your dataset, one typical approach before modeling is to:
a) Drop one of the correlated features to reduce redundancy.
b) Keep both features unchanged.
c) Invert one feature’s values.
d) Assign random values to one feature.
40. Kernel density estimation (KDE) in EDA is used to:
a) Estimate a discrete probability distribution’s PMF directly.
b) Smoothly estimate the probability density function of a continuous variable.
c) Check for missing values.
d) Encode categorical variables into numeric form.
Answer Key

1. b (EDA involves summarizing and visualizing the main characteristics of the data.)
2. b (A box plot helps visualize distribution and detect outliers.)
3. b (The median is less affected by outliers than the mean.)
4. c (IQR is robust against outliers and skewed distributions.)
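5. b (Mean imputation can introduce bias if data are not missing at random, and it also understates the variance of the imputed column. Dropping rows can be biased under the same conditions, so b is the intended answer rather than the only risky option.)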
6. b (A histogram shows the distribution of a single continuous variable.)
7. a (A scatter plot is used to visualize the relationship between two continuous variables.)
8. a (Pearson correlation measures linear association.)
9. a (Right-skewed means the tail extends to the right.)
10. b (QQ plots compare sample quantiles against those of a reference distribution, most often the normal, to assess how well the data fit it.)
11. a (Standardization sets mean to 0 and std. dev. to 1.)
12. b (Violin plots add density information to box plot elements.)
13. c (A correlation matrix shows pairwise linear correlations.)
14. b (Bar plots are better for showing counts of categories; line plots imply continuity.)
15. b (Outliers can distort statistical measures and subsequent modeling.)
16. b (A log transform often reduces right-skew.)
17. c (Scaling features before distance-based methods is common practice.)
18. c (Mean and variance are most appropriate for roughly normal distributions.)
19. b (Spearman is used for monotonic but not necessarily linear relationships.)
20. c (A pair plot can display histograms for numeric and frequency charts for categorical features.)
21. b (PCA is used to reduce dimensionality.)
22. b (Line plots are ideal for showing continuous changes over time.)
23. b (One-hot encoding converts categorical variables into binary indicators.)
24. b (Heatmaps represent values, like correlations, as color intensities.)
25. c (Knowing the distribution shape informs choice of statistics and transformations.)
26. b (Dimensionality reduction can help understand and visualize large feature sets.)
27. a (Nonlinear patterns suggest Pearson correlation alone is insufficient.)
28. a (Multicollinearity causes instability in regression coefficients.)
29. b (Density plots provide a smooth estimate of the distribution.)
30. c (Violin plots show distribution shape, median, and spread.)
31. b (Always verify if an outlier is a true data point or an error first.)
32. c (Log transforms are appropriate for positive, right-skewed data.)
33. b (Pair plots visualize relationships and distributions among numeric variables.)
34. b (EDA aims to uncover data structure and guide hypothesis generation.)
35. b (Min-Max scaling maps data to the [0, 1] range by default.)
36. c (Correlation of ±1 indicates the strongest linear relationship.)
37. b (A robust scaler uses median and IQR, reducing outlier influence.)
38. b (EDA guides preprocessing and modeling decisions.)
39. a (Dropping one feature reduces redundancy and multicollinearity issues.)
40. b (KDE is used to smoothly estimate a continuous variable’s probability density function.)
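
Several of the answers above (3, 4, 11, 35 and 37) turn on how a single extreme value affects summary statistics and scaling. The sketch below illustrates this on a small made-up sample using only Python's standard library. The sample values and the helper names (standardize, min_max_scale, robust_scale) are assumptions chosen for illustration and do not come from the original document. The quartiles use the "inclusive" method of statistics.quantiles; other quantile conventions give slightly different cut points.

```python
import statistics

# Hypothetical sample: five typical values plus one extreme value (100.0).
values = [10.0, 12.0, 11.0, 13.0, 12.0, 100.0]

# Central tendency: the mean is pulled toward the outlier, the median is not.
mean = statistics.mean(values)      # about 26.33
median = statistics.median(values)  # 12.0

# Spread: the standard deviation is inflated by the outlier, the IQR is not.
stdev = statistics.pstdev(values)
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1                       # 1.5


def standardize(xs):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def min_max_scale(xs):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    """Center on the median and scale by the IQR."""
    med = statistics.median(xs)
    lower, _, upper = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (upper - lower) for x in xs]


z_scores = standardize(values)
min_max = min_max_scale(values)
robust = robust_scale(values)

print(f"mean={mean:.2f} median={median} stdev={stdev:.2f} iqr={iqr}")
print("min-max:", [round(v, 3) for v in min_max])
print("robust: ", [round(v, 3) for v in robust])
```

With min-max scaling the five typical values are squeezed close to 0 because the outlier defines the top of the range, whereas the robust scaling keeps them spread around 0 and leaves the outlier as a large value.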
