INTRODUCTION TO MACHINE LEARNING
BY AATIQA BINT E GHAZALI
FALL 2024

Revision
Q/A session on the previous lecture.
Check the homework given.

Learning tasks: examples
Supervised: classification, regression analysis
Unsupervised: anomaly detection, dimensionality reduction
Reinforcement: many robots use reinforcement learning algorithms to learn how to walk; game-playing agents such as chess players are another example

Main Challenges of Machine Learning
Insufficient quantity of training data
Non-representative training data
Poor-quality data
Irrelevant features
Overfitting the training data
Underfitting the training data

Testing and Validating
Testing a model by putting it straight into production is a bad idea.
A better option is to split the data into two sets: the training set and the test set.
A common choice is an 80:20 ratio for train and test.

THE ML TOOLBOX
Data
Infrastructure
Algorithms
Visualizations

Machine Learning Pipeline
A series of interconnected data processing and modeling steps designed to automate, standardize and streamline the process of building, training, evaluating and deploying machine learning models.

Stages of a machine learning pipeline
1. Data collection
2. Data preprocessing
3. Feature engineering
4. Model selection
5. Model training
6. Model evaluation
7. Model deployment
8. Monitoring and maintenance

Data Collection
New data is collected from various sources, such as databases, APIs, or files.
It is often raw data, which may require preprocessing to be useful.
Common sources of data: Kaggle, UCI Machine Learning Repository

Data preprocessing
Involves cleaning, transforming and preparing input data for modeling.
Common preprocessing steps include handling missing values, encoding categorical variables, scaling numerical features and splitting the data into training and testing sets.

Feature engineering & Model selection
Feature engineering: creating new features or selecting relevant features from the data that can improve the model's predictive power. This step often requires domain knowledge and creativity.
Model selection: choosing the appropriate machine learning algorithm(s) based on the problem type (e.g., classification, regression), data characteristics, and performance requirements.

Model training & Model evaluation
Model training: the selected model(s) are trained on the training dataset using the chosen algorithm(s). This involves learning the underlying patterns and relationships within the training data. Pre-trained models can also be used instead of training a new model.
Model evaluation: the model's performance is assessed using a separate testing dataset or through cross-validation. Common evaluation metrics depend on the specific problem but may include accuracy, precision, recall, F1-score, mean squared error or others.

Model deployment & Maintenance
Model deployment: once a satisfactory model is developed and evaluated, it can be deployed to a production environment where it can make predictions on new, unseen data.
Maintenance: after deployment, it's important to continuously monitor the model's performance and retrain it as needed to adapt to changing data patterns. This step ensures that the model remains accurate and reliable in a real-world setting.
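
The preprocessing, training and evaluation stages described above can be sketched in a few lines. This is only an illustration, not the lecture's own code: it assumes scikit-learn is installed (the slides do not name a modeling library), and the data is a small synthetic table generated on the spot, with made-up column names (age, income, group), so that the example runs anywhere.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption, not a real dataset): two numeric
# features, one categorical feature, and a binary label.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(35, 10, n),
    "income": rng.normal(50, 15, n),
    "group": rng.choice(["a", "b", "c"], n),
})
# Blank out 20 ages so there are missing values to handle.
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
# The label depends on income plus noise.
y = (df["income"] + rng.normal(0, 5, n) > 50).astype(int)

# Preprocessing: fill missing values, scale numbers, encode categories.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["group"]),
])

# Model selection: logistic regression is one reasonable choice for
# binary classification; it is picked here only for illustration.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# 80:20 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0, stratify=y
)

# Model training on the training set only.
model.fit(X_train, y_train)

# Model evaluation on the held-out test set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"accuracy={accuracy:.2f}  f1={f1:.2f}")
```

Putting the preprocessing inside the pipeline object means the imputer and scaler are fitted on the training rows only, so nothing from the test set leaks into training.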
Let's implement part of the pipeline discussed above.

Titanic Dataset Collection
Kaggle hosts a wide range of datasets of various types.
One of the most common beginner datasets/competitions is the Titanic dataset.
This dataset is used to predict which passengers survived the Titanic.
Download the dataset
Upload it to Drive
Explore it
https://kaggle.com/c/titanic/data

Pandas and NumPy
pandas and NumPy are very useful Python libraries.
pandas is a very popular library for working with data. DataFrames are at the center of pandas. A DataFrame is structured like a table or spreadsheet. The rows and the columns both have indexes, and you can perform operations on rows or columns separately.
NumPy is an open-source Python library that facilitates efficient numerical operations on large quantities of data.
pandas is built on top of NumPy.
If a library is missing in your notebook (Colab or Anaconda/Jupyter), install it with !pip install numpy pandas
pip is Python's package installer; it works in Colab, Anaconda and other environments.
After installation, import numpy and pandas into your code.

Matplotlib and seaborn
For visualization purposes.
Matplotlib is primarily used for basic chart plotting, while seaborn offers many default themes and a wide variety of schemes for statistical visualization.
Import these two libraries in Colab.

Loading dataset
Read the CSV file using pandas.
Display the dataset in the notebook.
Read the first few rows of the data.
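
A minimal sketch of the loading steps, under stated assumptions: the helper name load_dataset is hypothetical, and the four rows below are made-up stand-ins that reuse a few Titanic column names so the example runs without the downloaded file. In the notebook you would instead pass the path of the CSV you downloaded from Kaggle.

```python
import io

import numpy as np
import pandas as pd


def load_dataset(source):
    """Read a CSV (file path, URL, or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# Made-up stand-in rows (assumption, NOT real passenger records), using a
# subset of the Titanic column names. In the notebook, replace sample_csv
# with the path to your own copy of the downloaded CSV file.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare\n"
    "1,0,3,male,30.0,8.0\n"
    "2,1,1,female,,70.0\n"
    "3,1,3,female,40.0,8.0\n"
    "4,0,2,male,20.0,13.0\n"
)

df = load_dataset(sample_csv)

# Display the first few rows and the table's dimensions.
print(df.head())
print(df.shape)

# Count missing values per column; the empty Age cell is read as NaN.
print(df.isna().sum())

# NumPy works directly on the column's underlying array.
mean_age = np.nanmean(df["Age"].to_numpy())
print(mean_age)
```

In a notebook, leaving df.head() as the last expression in a cell renders the rows as a formatted table, so the print call is not needed there.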