Machine Learning - Lec4 - 5
Data Exploration
Data exploration, preprocessing, and visualization are crucial steps in
machine learning that help in understanding the data, preparing it for
modeling, and gaining insights.
Data Exploration
Understanding the Data:
● Check basic statistics: Mean, median, standard deviation, etc., to understand the
distribution of features.
● Investigate data types: Categorical, numerical, text, etc.
● Explore missing values: Identify and handle missing or null values appropriately.
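The checks above can be sketched with pandas. This is a minimal illustration on made-up data; the DataFrame and its column names (`height_cm`, `weight_kg`, `eye_color`) are assumptions for the example, not part of the lecture material.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "height_cm": [160.0, 172.0, np.nan, 181.0, 168.0],
    "weight_kg": [55.0, 70.0, 64.0, np.nan, 62.0],
    "eye_color": ["blue", "brown", "brown", None, "green"],
})

# Basic statistics for numeric columns: count, mean, std, min, quartiles, max.
summary = df.describe()

# The median is not labelled as such in describe(), so compute it directly.
medians = df.median(numeric_only=True)

# Number of missing (NaN/None) entries in each column.
missing_per_column = df.isna().sum()

print(summary)
print(medians)
print(missing_per_column)
```

Note that `describe()` and `median()` skip missing values by default, so the statistics are computed over the observed entries only.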
Visualizing Data:
● Histograms, box plots, or density plots help understand feature distributions.
● Scatter plots show relationships between variables.
● Heatmaps or correlation matrices reveal feature correlations.
● Categorical data: Bar charts, pie charts, or count plots to analyze distributions.
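A heatmap of feature correlations is usually drawn from a correlation matrix. The sketch below computes that matrix with pandas on invented data; the column names are hypothetical, and `hours_slept` is deliberately constructed as `10 - hours_studied` so that one pair is perfectly negatively correlated.

```python
import pandas as pd

# Hypothetical data; hours_slept is built as 10 - hours_studied on purpose.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_slept": [9, 8, 7, 6, 5],
})

# Pairwise Pearson correlation (the default method) between numeric columns.
corr = df.corr()
print(corr.round(2))
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one. The resulting matrix is what a plotting library would color-code to produce the heatmap.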
Data Preprocessing
Handling Missing Values:
● Impute missing values using techniques like mean, median, mode, or advanced imputation methods.
● Delete or interpolate missing values based on the dataset and the impact on analysis.
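The options above can be compared side by side on a small series. This is only a sketch on made-up values; which strategy is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature with two missing entries.
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

mean_filled = s.fillna(s.mean())      # replace NaN with the mean of observed values
median_filled = s.fillna(s.median())  # replace NaN with the median of observed values
interpolated = s.interpolate()        # linear interpolation between neighbours
dropped = s.dropna()                  # delete the missing entries instead

# Hypothetical categorical feature: impute with the most frequent value (mode).
colors = pd.Series(["red", None, "blue", "red"])
mode_filled = colors.fillna(colors.mode()[0])

print(mean_filled.tolist())
print(interpolated.tolist())
print(mode_filled.tolist())
```

Interpolation only makes sense when the row order is meaningful (for example, a time series), whereas mean, median, and mode imputation ignore the ordering.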
Encoding Categorical Variables:
● Convert categorical variables into numerical format using techniques like one-hot encoding or label
encoding.
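Both encodings can be sketched with pandas alone. The `vehicle` column is a hypothetical example; `pd.factorize` is used here as one simple way to obtain integer labels, and other libraries offer equivalent encoders.

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"vehicle": ["sedan", "SUV", "truck", "sedan"]})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["vehicle"], prefix="vehicle")

# Label encoding: each category is mapped to an integer code,
# assigned in order of first appearance.
codes, categories = pd.factorize(df["vehicle"])

print(one_hot)
print(codes, list(categories))
```

Label encoding assigns integers that imply an ordering between categories, which some models may interpret as meaningful; one-hot encoding avoids this at the cost of adding more columns.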
Feature Scaling:
● Normalize or standardize numerical features to bring them to a similar scale, so that features with
larger magnitudes do not dominate the others.
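The two terms are often used as follows: normalization (min-max scaling) maps values to the range [0, 1], while standardization (z-score scaling) rescales them to zero mean and unit standard deviation. A minimal NumPy sketch on made-up values:

```python
import numpy as np

# Hypothetical feature values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Min-max normalization: (x - min) / (max - min), result lies in [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, result has mean 0 and std 1.
standardized = (x - x.mean()) / x.std()

print(normalized)
print(standardized)
```

In practice the minimum, maximum, mean, and standard deviation are computed on the training data only and then reused to transform validation and test data.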
Feature Engineering:
● Create new features based on existing ones that might be more informative for the model.
● Transform variables (log, square root) to make their distributions more Gaussian.
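As an illustration of both ideas, the sketch below derives a new feature from two existing ones and applies log and square-root transforms to a skewed column. The data and column names (`height_m`, `weight_kg`, `income`) are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data; income is strongly right-skewed.
df = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82],
    "weight_kg": [55.0, 70.0, 90.0],
    "income": [20_000.0, 45_000.0, 1_000_000.0],
})

# New feature built from existing ones: body mass index = weight / height^2.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Transforms that compress large values and reduce right skew.
df["log_income"] = np.log1p(df["income"])   # log(1 + x), defined at x = 0
df["sqrt_income"] = np.sqrt(df["income"])

print(df.round(2))
```

After the log transform, the largest income is only a small multiple of the smallest rather than fifty times larger, which is the sense in which the distribution becomes less skewed.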
Data Visualization
● Univariate Visualizations:
○ Histograms, box plots, and kernel density plots for single variables to understand
their distributions and outliers.
● Bivariate and Multivariate Visualizations:
○ Scatter plots, pair plots, heatmaps, or correlation matrices to explore
relationships between variables.
○ Facet grids or conditional plots for comparisons across multiple variables.
● Interactive Visualization:
○ Tools like Plotly or Bokeh for interactive plots that allow zooming, hovering,
and filtering.
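A minimal Matplotlib sketch of the univariate and bivariate plots listed above, using synthetic data generated only for this example. The variable names and the non-interactive `Agg` backend are assumptions made so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: x is roughly normal, y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=200)
y = 2 * x + rng.normal(scale=5, size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Univariate: distribution of x.
axes[0].hist(x, bins=20)
axes[0].set_title("Histogram of x")

# Univariate: median, quartiles, and potential outliers of x.
axes[1].boxplot(x)
axes[1].set_title("Box plot of x")

# Bivariate: relationship between x and y.
axes[2].scatter(x, y, s=10)
axes[2].set_title("Scatter: x vs y")

fig.tight_layout()
```

Call `fig.savefig("plots.png")` to write the figure to a file, or remove the `matplotlib.use("Agg")` line and call `plt.show()` to view it interactively.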
Tools for Data Exploration and Visualization
● Python Libraries: Pandas, Matplotlib, Seaborn, Plotly, Bokeh, Altair.
● R Programming: ggplot2, dplyr, tidyr.
● BI Tools: Tableau, Power BI for interactive and dashboard-style
visualizations.
● Jupyter Notebooks: Ideal for combining code, visualization, and explanatory
text in one document.
Importance
● Understanding the data: It helps in selecting appropriate models, identifying potential
issues, and making informed decisions during model building.
● Quality Improvement: Proper preprocessing and visualization lead to cleaner data,
which often results in more accurate models.
● Insights and Interpretability: Visualization aids in communicating findings and
insights from the data to stakeholders.
Effective data exploration, preprocessing, and visualization are integral parts of the
machine learning pipeline, contributing significantly to the success and interpretability of
the models built on the data.
Data Types
Numeric data refers to data that can be expressed as a number. Examples of
numeric data include height, weight, and temperature.
Categorical data refers to data that falls into a set of distinct categories. When the
categories have no inherent hierarchy or order, the data is called nominal; in that
case there is no mathematical relationship between the categories. (Categories
with a natural order, such as small/medium/large, are called ordinal.) A person's
gender (male/female), eye color (blue, green, brown, etc.), the type of vehicle they
drive (sedan, SUV, truck, etc.), or the kind of fruit they consume (apple, banana,
orange, etc.) are examples of categorical data without an order.
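In pandas, the distinction shows up in column dtypes. The sketch below, on a hypothetical two-column DataFrame, marks one column as categorical and then splits the columns by type.

```python
import pandas as pd

# Hypothetical data with one numeric and one categorical feature.
df = pd.DataFrame({
    "height_cm": [160.0, 172.0, 181.0],
    "eye_color": ["blue", "brown", "green"],
})

# Mark eye_color explicitly as categorical (unordered by default).
df["eye_color"] = df["eye_color"].astype("category")

numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="category").columns.tolist()

print(numeric_cols)
print(categorical_cols)
print(df["eye_color"].cat.ordered)
```

Splitting columns this way is a common first step, since numeric and categorical features typically need different preprocessing (scaling versus encoding).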
Outliers
● Outliers are the values that look different from the other values in the data.
● Outliers are data points that significantly deviate from the majority of the
data. They can be caused by errors, anomalies, or simply rare events.
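"Significantly deviate" needs a concrete threshold in practice. One common convention, used here as an assumption rather than something the notes prescribe, is the interquartile range (IQR) rule: flag any value more than 1.5 × IQR below the first quartile or above the third quartile. A sketch on made-up values:

```python
import numpy as np

# Hypothetical measurements; 120.0 is far from the rest.
values = np.array([48.0, 50.0, 51.0, 52.0, 49.0, 50.0, 47.0, 53.0, 120.0])

# First and third quartiles, and the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# IQR rule: anything outside [q1 - 1.5*iqr, q3 + 1.5*iqr] is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

print(lower, upper)
print(outliers)
print(values.mean(), np.median(values))
```

The last line also shows how a single extreme value pulls the mean well above the median, which stays close to the bulk of the data.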
Reasons for outliers in data
1. Errors during data entry or a faulty measuring device (a faulty sensor may
employees)
Problem caused by outliers
1. Outliers in the data may cause problems during model fitting (especially for
linear models).
2. Outliers may inflate the error metrics which give higher weights to large