



FALL 2024
 Q/A session from the previous lecture.
 Review the assigned homework.
Examples of learning tasks
 Supervised: classification, regression analysis
 Unsupervised: anomaly detection, dimensionality reduction
 Reinforcement: many robots use reinforcement learning algorithms
to learn how to walk; game-playing agents (e.g., chess) are another example
Main Challenges of Machine Learning
 Insufficient Quantity of Training Data
 Non-representative Training Data
 Poor quality data
 Irrelevant features
 Overfitting the Training Data
 Under-fitting the Training Data
Testing and Validating
 Testing a model by putting it straight into production is risky: a bad model is only discovered after it affects real users.
 Better: split the data into two sets, the training set and the test set.
 A common choice is an 80:20 ratio for train and test.
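The 80:20 split described above could be sketched as follows. This is a minimal illustration that assumes scikit-learn is installed; the toy arrays are made up and are not part of the lecture material.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 holds out 20% of the rows for testing and keeps 80% for training.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("train rows:", len(X_train), "test rows:", len(X_test))
```

The test set is touched only once, at the end, to estimate how the model behaves on data it has not seen.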
 Data
 Infrastructure
 Algorithms
 Visualizations
Machine Learning Pipeline
 a series of interconnected data processing and modeling steps
 designed to automate, standardize and streamline the process of building,
training, evaluating and deploying machine learning models.
Stages of a machine learning pipeline
1. Data collection
2. Data preprocessing
3. Feature engineering
4. Model selection
5. Model training
6. Model evaluation
7. Model deployment
8. Monitoring and maintenance
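One possible way to chain several of these stages in code is scikit-learn's Pipeline class. The sketch below is only an illustration under that assumption: the data is made up, and the step names ("impute", "scale", "model") are arbitrary labels chosen here.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric data with one missing value (np.nan), for illustration only.
X = np.array([[22.0, 7.0], [38.0, 71.0], [np.nan, 8.0], [35.0, 53.0],
              [54.0, 52.0], [2.0, 21.0], [27.0, 11.0], [14.0, 30.0]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Each step feeds its output into the next one.
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # preprocessing: fill missing values
    ("scale", StandardScaler()),                   # preprocessing: scale features
    ("model", LogisticRegression()),               # model training
])

pipe.fit(X, y)                 # runs every step in order on the training data
predictions = pipe.predict(X)  # applies the same preprocessing before predicting
print(predictions)
```

Bundling the steps this way keeps the preprocessing applied at prediction time identical to the preprocessing used during training.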
Data Collection
 New data is collected from various data sources, such as
databases, APIs, or files.
 This often involves raw data, which may require preprocessing to be usable.
 Common sources of datasets: Kaggle, UCI Machine Learning Repository
Data preprocessing

 involves cleaning, transforming and preparing input data for modeling.

 Common preprocessing steps include handling missing values, encoding
categorical variables, scaling numerical features and splitting the
data into training and testing sets.
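The preprocessing steps listed above might look like this in pandas. It is a sketch under stated assumptions: the rows are made up, and the column names (Age, Fare, Sex, Survived) are assumed to follow the Kaggle Titanic CSV. The scaling statistics are computed on the training rows only, which is a design choice made here to avoid leaking test information.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumed, not taken from the slides.
df = pd.DataFrame({
    "Age": [22.0, None, 35.0, 58.0, 4.0],
    "Fare": [7.0, 8.0, 53.0, 26.0, 16.0],
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 1],
})

# 1. Handle missing values: fill missing ages with the median age.
df["Age"] = df["Age"].fillna(df["Age"].median())

# 2. Encode categorical variables: one-hot encode the "Sex" column.
df = pd.get_dummies(df, columns=["Sex"])

# 3. Split into training and testing sets (80:20).
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# 4. Scale numerical features using the mean and std of the training set only.
num_cols = ["Age", "Fare"]
mean, std = train[num_cols].mean(), train[num_cols].std()
train_scaled, test_scaled = train.copy(), test.copy()
train_scaled[num_cols] = (train[num_cols] - mean) / std
test_scaled[num_cols] = (test[num_cols] - mean) / std

print(train_scaled)
```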
Feature engineering & Model selection
 Feature engineering
 creating new features or selecting relevant features from the
data that can improve the model's predictive power.
 This step often requires domain knowledge and creativity.
 Model selection
 choose the appropriate machine learning algorithm(s) based on the problem
type (e.g., classification, regression), data characteristics, and performance
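As an illustration of the feature engineering step, the sketch below derives a few new columns and then selects a subset of features. The rows are made up, the column names (SibSp, Parch, Fare, Cabin) are assumed to follow the Kaggle Titanic CSV, and the derived names FamilySize, IsAlone and HasCabin are hypothetical choices for this example.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumed.
df = pd.DataFrame({
    "SibSp": [1, 0, 0, 3],    # siblings/spouses aboard
    "Parch": [0, 0, 2, 1],    # parents/children aboard
    "Fare": [7.0, 8.0, 53.0, 26.0],
    "Cabin": [None, "C85", None, "E46"],
})

# New feature: family size = siblings/spouses + parents/children + the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# New feature: 1 if the passenger travels alone, else 0.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# New feature: 1 if a cabin was recorded, else 0.
df["HasCabin"] = df["Cabin"].notna().astype(int)

# Feature selection: keep only the columns we want to feed to the model.
features = df[["Fare", "FamilySize", "IsAlone", "HasCabin"]]
print(features)
```

Whether such features actually help is an empirical question; domain knowledge suggests candidates, and model evaluation decides which ones to keep.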
Model training & Model evaluation
 Model training
 The selected model(s) are trained on the training dataset using the
chosen algorithm(s).
 This involves learning the underlying patterns and relationships within
the training data.
 Pre-trained models can also be used, rather than training a new model.
 Model evaluation
 The model's performance is assessed using a separate testing
dataset or through cross-validation.
 Common evaluation metrics depend on the specific problem but may
include accuracy, precision, recall, F1-score, mean squared error or
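Training and evaluation as described on this slide could be sketched as below. The example assumes scikit-learn, uses a synthetic dataset generated by make_classification rather than real data, and picks logistic regression purely as an example algorithm.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Model training: learn patterns from the training set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model evaluation on the separate test set.
y_pred = model.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(metrics)

# Alternative: 5-fold cross-validation on the training data.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("mean cross-validation accuracy:", cv_scores.mean())
```

For a regression problem the same structure applies, with a regressor and a metric such as mean squared error in place of the classification metrics.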
Model deployment & Maintenance
 Model deployment
 Once a satisfactory model is developed and evaluated, it can be deployed to
a production environment where it can make predictions on new, unseen data.
 Maintenance
 After deployment, it's important to continuously monitor the model's
performance and retrain it as needed to adapt to changing data patterns.
 This step ensures that the model remains accurate and reliable in a real-
world setting.
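A very small sketch of deployment and monitoring follows. It makes several assumptions not in the lecture: the model is serialized with Python's pickle module, the data is synthetic, and needs_retraining is a hypothetical helper with an arbitrary tolerance of 0.05.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data: the first 200 rows play the role of historical data,
# the last 100 rows stand in for data that arrives after deployment.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_old, y_old = X[:200], y[:200]
X_new, y_new = X[200:], y[200:]

model = LogisticRegression(max_iter=1000).fit(X_old, y_old)
baseline_accuracy = accuracy_score(y_old, model.predict(X_old))

# Deployment: serialize the trained model, then load it where predictions are served.
blob = pickle.dumps(model)
deployed = pickle.loads(blob)


def needs_retraining(live_accuracy, baseline, tolerance=0.05):
    """Hypothetical rule: flag the model if accuracy drops more than `tolerance`."""
    return live_accuracy < baseline - tolerance


# Monitoring: compare accuracy on newly labelled data with the baseline.
live_accuracy = accuracy_score(y_new, deployed.predict(X_new))
print("baseline:", baseline_accuracy, "live:", live_accuracy)
print("retrain?", needs_retraining(live_accuracy, baseline_accuracy))
```

Real monitoring setups typically also track input data drift and latency; the threshold rule here is only the simplest possible trigger.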
 Let's implement some of the pipeline stages discussed above in practice.
Titanic Dataset Collection
 Kaggle hosts a wide range of datasets of various types.
 One of the most popular beginner datasets/competitions is Titanic.
 This dataset is used to predict which passengers survived the Titanic disaster.
 Download the dataset
 Upload it to Google Drive
 Explore it
 https://kaggle.com/c/titanic/data
Pandas and NumPy
 pandas and NumPy are two widely used Python libraries for data work.
 pandas is a very popular library for working with data. DataFrames are at
the center of pandas. A DataFrame is structured like a table or spreadsheet.
The rows and the columns both have indexes, and you can perform
operations on rows or columns separately.
 NumPy is an open-source Python library that facilitates efficient numerical
operations on large quantities of data.
 pandas is built on top of NumPy.
 pip is Python's package installer. In a notebook (Colab, or Jupyter under Anaconda),
install both libraries with: !pip install numpy pandas
 After installation, import numpy and pandas into your code.
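A short example of importing both libraries and working with a DataFrame by row and by column is given below. The values and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: efficient numerical operations on whole arrays at once.
ages = np.array([22, 38, 26, 35])
print("mean age:", ages.mean())

# pandas: a DataFrame is a table with an index for rows and one for columns.
df = pd.DataFrame(
    {"Age": ages, "Fare": [7.0, 71.0, 8.0, 53.0]},
    index=["p1", "p2", "p3", "p4"],  # row labels, chosen arbitrarily here
)

print(df["Age"])      # select one column by name
print(df.loc["p2"])   # select one row by its label
print(df.iloc[0])     # select one row by its position

col_means = df.mean(axis=0)  # operate down each column
row_sums = df.sum(axis=1)    # operate across each row

# pandas is built on NumPy: the underlying values are a NumPy array.
values = df.to_numpy()
print(type(values), values.shape)
```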
Matplotlib and seaborn
 For visualization purposes
 Matplotlib is primarily used for basic chart plotting,
 while Seaborn offers many default themes and a wide variety of schemes for
statistical visualization.
 Import these two libraries in colab
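Importing the two libraries and drawing one chart with each might look like this. The data is made up, the column names Age and Survived are assumptions, and the sketch assumes a seaborn version that provides set_theme (0.11 or later).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_theme()  # apply seaborn's default theme to all following plots

# Made-up values for illustration only.
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14, 4, 58],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Matplotlib: a basic histogram of ages.
ax1.hist(df["Age"], bins=5)
ax1.set_xlabel("Age")
ax1.set_title("Age distribution")

# seaborn: a statistical count plot of the target column.
sns.countplot(data=df, x="Survived", ax=ax2)
ax2.set_title("Survival counts")

fig.tight_layout()
# In a notebook the figure renders inline; in a plain script call plt.show().
```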
Loading dataset
 Read the CSV file using pandas
 Display the dataset in the notebook
 Show the first few rows of the data
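The loading steps above can be sketched with pandas as follows. So that the example runs on its own, a few made-up rows stand in for the downloaded file; the path "train.csv" in the comment is hypothetical and depends on where you saved the dataset.

```python
import io

import pandas as pd

# With the real file you would pass its path instead, for example:
#   df = pd.read_csv("train.csv")   # hypothetical path, adjust to your setup
# Here made-up rows stand in for the file; column names are assumed.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.0
2,1,1,female,38,71.0
3,1,3,female,26,8.0
4,1,1,female,35,53.0
5,0,3,male,,8.0
6,0,2,male,54,13.0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first five rows by default
df.info()         # column types and non-null counts
```

In a notebook, leaving df or df.head() as the last expression of a cell displays it as a formatted table.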
