Machine Learning With Python
On
Bachelor of Technology
in
CERTIFICATE
This is to certify that the industrial training entitled “Machine Learning with Python” is the
bona fide work carried out by Shagun Kumari Mangal, student of B. Tech in Computer
Science & Engineering at Jaipur Engineering College and Research Centre, during the year
2022-23 in partial fulfillment of the requirements for the award of the Degree of Bachelor of
Technology in Computer Science & Engineering under my guidance.
Designation : Trainer
Jaipur Engineering College and Research
Academic Year- 2022-2023
Centre, Shri Ram ki Nangal, via Sitapura
VISION OF INSTITUTE
To become a renowned centre of outcome-based learning and work towards academic,
professional, cultural and social enrichment of the lives of individuals and communities.
MISSION OF INSTITUTE
1. Focus on evaluation of learning outcomes and motivate students to inculcate research
aptitude by project-based learning.
2. Identify areas of focus and provide platform to gain knowledge and solutions based on
informed perception of Indian, regional and global needs.
3. Offer opportunities for interaction between academia and industry.
4. Develop human potential to its fullest extent so that intellectually capable and imaginatively
gifted leaders can emerge in a range of professions.
PSO1 Ability to interpret and analyse network specific, cyber security issues, automation in real
world environment.
PSO2 Ability to design and develop mobile and web-based applications under realistic constraints.
3CS7-30 Industrial Training
CO-1 3 3 2 2 2 1 1 2 2 3 3 3
CO-2 3 3 3 3 3 1 1 2 2 3 3 3
ACKNOWLEDGEMENT
It has been a great honour and privilege to undergo training at UPFLAIRS Pvt Ltd, Jaipur. I
am very grateful to MR. SANAM PEEYUSH for giving his valuable time and constructive
guidance in preparing this training report. It would not have been possible to complete this
report in a short period of time without his kind encouragement and valuable guidance.
I wish to express my deep sense of gratitude to my Industrial Training Guide, MR. SUNIL
SHARMA, Assistant Professor, Jaipur Engineering College and Research Centre, Jaipur, for
guiding me from the inception to the completion of the industrial training. I sincerely
thank him for his valuable guidance, support for the literature survey, critical
reviews and comments on my industrial training.
I would also like to thank MR. ARPIT AGRAWAL, Director of JECRC,
for providing such a great infrastructure and environment for our overall development.
I express sincere thanks to DR. V. K. CHANDNA, Principal of JECRC, for his kind
cooperation and extended support towards the completion of my industrial training.
Words are inadequate to express my thanks to DR. SANJAY GAUR, Head of the Department of
Computer Science Engineering, for his consistent encouragement and support in shaping my
industrial training into a presentable form.
Finally, my warm thanks to Jaipur Engineering College and Research Centre, which provided
me this opportunity to carry out this prestigious industrial training and enhance my learning in
various technical fields.
ABSTRACT
Name : Shagun Kumari Mangal
RTU Roll no : 21EJCCS806
Branch : Computer Science and Engineering
Training Industry : Upflairs
Training Industry Address : JECRC Foundation
Trainer : SANAM PEEYUSH
Technology Name/Project Name : Machine Learning with Python
Training mode : Offline
Training Duration : 12th September - 15th October
Training Status : Complete
Description: -
Python is one of the most preferred languages for scientific computing, data science, and
machine learning, boosting both performance and productivity by enabling the use of
low-level libraries. In this training we also learned about tkinter, a Python library for
building graphical user interfaces (GUIs). With the help of this library, we can give our
machine learning model a simple user interface.
For machine learning we use several Python libraries that help us build
accurate models. These libraries are:
• numpy
• pandas
• sklearn
• seaborn
• matplotlib
All the above information is true and correct to the best of my knowledge.
21EJCCS806
HISTORY OF PYTHON:
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde &
Informatica (CWI) in the Netherlands as a successor to the ABC programming language, which
was inspired by SETL, capable of exception handling and interfacing with the Amoeba
operating system. Its implementation began in December 1989.
INTRODUCTION:
Python is a dynamic, interpreted (bytecode-compiled) language. Source code needs no type
declarations for variables, parameters, functions, or methods. This makes the code short and
flexible, but you lose the compile-time type checking of the source code. Python is a widely
used general-purpose, high-level programming language. It was created by Guido van
Rossum, first released in 1991, and is further developed under the Python Software Foundation.
It was designed with an emphasis on code readability, and its syntax allows programmers to
express concepts in fewer lines of code. Python lets you work quickly
and integrate systems more efficiently.
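The dynamic typing described above can be illustrated with a minimal sketch. The names `value` and `add` are illustrative only and do not come from the training material.

```python
# A name can be rebound to objects of different types; nothing is declared.
value = 42
first_type = type(value).__name__     # type of the object currently bound
value = "forty-two"
second_type = type(value).__name__    # the same name now refers to a string


def add(a, b):
    # No parameter types are declared; any operands that support + will work.
    return a + b


int_sum = add(2, 3)           # integer addition
text_sum = add("ab", "cd")    # string concatenation

# Without compile-time checking, a type error appears only when the line runs.
try:
    add(1, "x")
    failed_at_runtime = False
except TypeError:
    failed_at_runtime = True

print(first_type, second_type, int_sum, text_sum, failed_at_runtime)
```

The same function works for integers and strings, but the mismatched call is rejected only at run time, which is the trade-off the paragraph describes.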
Multipurpose language:
Disadvantages:
• Poor Memory Efficiency. To make it simple for the developer, Python needs a lot of
memory space; this can be a tad problematic if you want to develop apps where you
need to optimize memory.
• Slow Speed. ...
• Database Access. ...
• Weak in Mobile Computing. ...
• Runtime Error
Uses of Python:
Python is commonly used for developing websites and software, task automation, data
analysis, data visualization, and machine learning. Games can be developed with pygame,
a third-party library that is installed separately. Google also uses Python for data analysis,
machine learning, and artificial intelligence.
Applications of Python:
• Web Development.
• Game Development.
• Machine Learning and Artificial Intelligence.
• Data Science and Data Visualization.
• Desktop GUI.
• Web Scraping Applications.
• Business Applications.
• Audio and Video Applications.
Data types are the classification or categorization of data items. A data type represents the kind
of value and determines what operations can be performed on that value. Since everything is an
object in Python, data types are actually classes, and variables are instances (objects) of
these classes.
The following are the standard or built-in data types of Python:
• Numeric
• Sequence type
• Boolean
• Set
• Dictionary
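As a small sketch of the categories listed above, the snippet below groups a few example values by category and looks up the class of each one. The dictionary `samples` and its values are illustrative assumptions.

```python
# One or more example values for each built-in category named in the list.
samples = {
    "Numeric": [7, 3.14, 2 + 5j],
    "Sequence type": ["text", [1, 2], (1, 2)],
    "Boolean": [True],
    "Set": [{1, 2, 3}],
    "Dictionary": [{"key": "value"}],
}

# type() returns the class of each object, since data types are classes.
type_names = {
    category: [type(value).__name__ for value in values]
    for category, values in samples.items()
}

for category, names in type_names.items():
    print(f"{category}: {', '.join(names)}")

# Every value, even a plain integer, is an instance of a class.
is_object = isinstance(7, object)
```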
KEYWORDS IN PYTHON:
Every programming language has special reserved words, or keywords, that have specific
meanings and restrictions around how they should be used. Python is no different. Python
keywords are the fundamental building blocks of any Python program.
Keyword Description
as To create an alias
OPERATORS IN PYTHON:
• Arithmetic operators
• Assignment operators
• Comparison operators
• Logical operators
• Identity operators
• Membership operators
• Bitwise operators
Arithmetic operators
Arithmetic operators are used with numeric values to perform common mathematical
operations. Some of them are:
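A short sketch of the arithmetic operators follows. The operands `a` and `b` are arbitrary example values.

```python
a, b = 17, 5

results = {
    "a + b": a + b,     # addition
    "a - b": a - b,     # subtraction
    "a * b": a * b,     # multiplication
    "a / b": a / b,     # true division, always returns a float
    "a // b": a // b,   # floor division, discards the fractional part
    "a % b": a % b,     # modulus, the remainder of the division
    "a ** b": a ** b,   # exponentiation
}

for expression, value in results.items():
    print(f"{expression} = {value}")
```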
Assignment operators
Comparison operators
Comparison operators are used to compare two values. These are listed below:
Logical operators
Logical operators are used to combine conditional statements. These are listed below:
Identity operators
Identity operators are used to compare the objects, not if they are equal, but if they are
actually the same object, with the same memory location. These are listed below:
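The difference between equality and identity can be seen in a minimal sketch. The list names used here are illustrative.

```python
first = [1, 2, 3]
second = [1, 2, 3]   # equal contents, but a separate object in memory
alias = first        # a second name for the same object

equal_values = first == second       # compares contents
same_object = first is second        # compares identity
alias_is_first = alias is first      # both names refer to one object
not_same = first is not second       # negated identity test

# Because alias and first are the same object, a change through one name
# is visible through the other.
alias.append(4)

print(equal_values, same_object, alias_is_first, not_same, first)
```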
Membership operators
Bitwise operators
Bitwise operators act on the individual bits of integers, treating each number as a sequence of binary digits. These are listed below:
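The following sketch applies each bitwise operator to two example integers written in binary. The operands are arbitrary.

```python
a = 0b1100   # 12
b = 0b1010   # 10

bit_and = a & b      # bits set in both operands
bit_or = a | b       # bits set in either operand
bit_xor = a ^ b      # bits set in exactly one operand
bit_not = ~a         # inverts every bit; equals -(a + 1) for Python integers
left_shift = a << 2  # shifts bits left, multiplying by 4
right_shift = a >> 2 # shifts bits right, floor-dividing by 4

print(bin(bit_and), bin(bit_or), bin(bit_xor), bit_not, left_shift, right_shift)
```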
LOOPS IN PYTHON
Python provides the following types of loops to handle looping
requirements. Both offer the same basic functionality but differ in their syntax and in how
the loop condition is expressed; loops of either type can also be nested.
Types of loops:
1. while loop
2. for loop
1.) While loop
Syntax:
while expression:
    statement(s)
2.) For loop
Syntax:
for iterator_var in sequence:
    statement(s)
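Both loop forms are sketched below with illustrative variable names. The while loop re-checks its condition before every pass, while the for loop takes each item of a sequence in turn.

```python
# while loop: runs as long as the condition stays true.
countdown = []
n = 3
while n > 0:
    countdown.append(n)
    n -= 1

# for loop: iterates over the items produced by range(1, 5).
squares = []
for value in range(1, 5):
    squares.append(value * value)

print(countdown, squares)
```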
Advantages:
• Easy to learn, read and write. Python is a high-level programming language that has
English-like syntax.
• Interpreted language.
• Python has very simple syntax.
• User-friendly data structures.
• Supports both object-oriented and procedural programming.
• Dynamically typed language.
MACHINE LEARNING:
Machine learning is an automated process that enables machines to solve
problems with little or no human input, and to take actions based on past observations.
While artificial intelligence and machine learning are often used interchangeably, they are two
different concepts. AI is the broader concept – machines making decisions, learning new skills,
and solving problems in a similar way to humans – whereas machine learning is a subset of AI
that enables intelligent systems to autonomously learn new things from data.
Instead of programming machine learning algorithms to perform tasks, you can feed them
examples of labelled data (known as training data), which helps them make calculations,
process data, and identify patterns automatically.
Put simply, Google’s Chief Decision Scientist describes machine learning as a fancy labelling
machine. After teaching machines to label things like apples and pears, by showing them
examples of fruit, eventually they will start labelling apples and pears without any help –
provided they have learned from appropriate and accurate training examples.
Machine learning can be put to work on massive amounts of data and, on many well-defined
tasks, can perform more accurately than humans. It can help you save time and money on tasks
and analyses, such as resolving customer pain points to improve customer satisfaction,
automating support tickets, and mining data from internal sources and from across the internet.
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning
5. Deep learning
Supervised learning algorithms and supervised learning models make predictions based on
labelled training data. Each training sample includes an input and a desired output. A
supervised learning algorithm analyses this sample data and makes an inference – basically, an
educated guess when determining the labels for unseen data.
Unsupervised learning algorithms uncover insights and relationships in unlabelled data. In this
case, models are fed input data but the desired outcomes are unknown, so they have to make
inferences based on circumstantial evidence, without any labelled guidance. The models are
not trained with the “right answer,” so they must find patterns on their own.
In semi-supervised machine learning, the training data is split into two parts: a small amount
of labelled data and a larger set of unlabelled data.
In this case, the model uses the labelled data to make inferences about the unlabelled
data, which often gives more accurate results than a supervised model trained on the small
labelled set alone.
Reinforcement learning is concerned with how a software agent (or computer program) ought
to act in a situation to maximize the reward. In short, reinforcement learning models
attempt to determine the best possible path they should take in a given situation. They do this
through trial and error. Since there is no labelled training data, machines learn from their own
mistakes and choose the actions that lead to the best solution or maximum reward.
In the machine learning process, data is given to the machine in numerical form; the machine
then runs a learning algorithm on it and produces a trained model as output. Machine learning uses
two main types of techniques: supervised learning, which trains a model on known input and output
data so that it can predict future outputs, and unsupervised learning, which finds hidden patterns
or intrinsic structures in input data. Machine learning means making the computer learn by
studying data and statistics. It is a step in the direction of artificial
intelligence (AI). A machine learning program analyses data and learns to predict the
outcome.
NumPy:
NumPy stands for ‘Numerical Python’. It is an open-source Python library used to perform
various mathematical and scientific tasks. It contains multi-dimensional arrays and matrices,
along with many high-level mathematical functions that operate on these arrays and matrices.
NumPy is the fundamental package for scientific computing in Python. It is a Python library
that provides a multidimensional array object, various derived objects (such as masked arrays
and matrices), and an assortment of routines for fast operations on arrays, including
mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms,
basic linear algebra, basic statistical operations, random simulation and much more.
At the core of the NumPy package, is the ndarray object. This encapsulates n-dimensional
arrays of homogeneous data types, with many operations being performed in compiled code
for performance. There are several important differences between NumPy arrays and the
standard Python sequences:
• NumPy arrays have a fixed size at creation, unlike Python lists (which can grow
dynamically). Changing the size of an ndarray will create a new array and delete the
original.
• The elements in a NumPy array are all required to be of the same data type, and thus
will be the same size in memory. The exception: one can have arrays of (Python,
including NumPy) objects, thereby allowing for arrays of different sized elements.
• NumPy arrays facilitate advanced mathematical and other types of operations on large
numbers of data. Typically, such operations are executed more efficiently and with less
code than is possible using Python’s built-in sequences.
• A growing plethora of scientific and mathematical Python-based packages are using
NumPy arrays; though these typically support Python-sequence input, they convert
such input to NumPy arrays prior to processing, and they often output NumPy arrays.
In other words, in order to efficiently use much (perhaps even most) of today’s
scientific/mathematical Python-based software, just knowing how to use Python’s
built-in sequence types is insufficient - one also needs to know how to use NumPy
arrays.
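The differences listed above can be sketched with a few lines of NumPy. The array contents are arbitrary example values.

```python
import numpy as np

values = np.array([1, 2, 3, 4])           # all elements share one data type

# Arithmetic is applied element by element, without an explicit loop.
doubled = values * 2

# The same operator on a Python list repeats the list instead.
python_list_doubled = [1, 2, 3, 4] * 2

# "Growing" an array returns a new array; the original keeps its size.
grown = np.append(values, 5)

# Shape manipulation and a basic statistical operation on the result.
matrix = values.reshape(2, 2)
column_sums = matrix.sum(axis=0)

print(values.dtype, doubled, python_list_doubled, grown, column_sums)
```

The contrast between `doubled` and `python_list_doubled` is the main reason numerical code is usually written against arrays rather than lists.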
Installing NumPy:
pip install numpy
Importing NumPy:
import numpy as np
Pandas:
Getting Started
The first step in working with pandas is to check whether it is installed in the Python
environment. If it is not, we need to install it using the pip command. Type cmd
in the search box to open a command prompt and, if needed, use the cd command to locate the
folder where pip has been installed. After locating it, type the command:
pip install pandas
[Figures: histogram; boxplot]
Scikit learn:
Installation:
pip install -U scikit-learn
Seaborn:
Seaborn is a visualization library for statistical graphics plotting in Python. It
provides attractive default styles and colour palettes for statistical plots.
It is built on top of the matplotlib library and is closely integrated with the data structures
from pandas.
Seaborn aims to make visualization a central part of exploring and understanding data. It
provides dataset-oriented APIs, so that we can switch between different visual
representations of the same variables for a better understanding of the dataset.
• Relational plots: This plot is used to understand the relation between two
variables.
• Categorical plot: This plot deals with categorical variables and how they can be
visualized.
• Distribution plot: This plot is used for examining univariate and bivariate
distributions
• Regression plot: The regression plots in seaborn are primarily intended to add a
visual guide that helps to emphasize patterns in a dataset during exploratory data
analyses.
• Matrix plot: A matrix plot, such as a heatmap or cluster map, shows matrix-shaped data as
a colour-encoded grid.
• Multi-plot grids: A useful approach is to draw multiple instances of the
same plot on different subsets of the dataset.
Conclusion:
The growth and popularity of machine learning call for efficient tools, and scikit-learn (sklearn)
in Python serves beginners as well as practitioners solving supervised learning problems.
Its efficiency and versatility make scikit-learn one of the prime choices of academic and
industrial organizations for machine learning tasks.
About the dataset: The dataset contains 1339 entries and 7 attributes: age, sex, bmi,
children, smoker, region and charges.
Table: 11 Dataset
Loading the dataset:
Collection Information:
Now to use describe method in pandas, just type the below statement:
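A minimal sketch of loading and summarising such a dataset is given here. The four rows are made-up illustrative values, not taken from the real dataset, and the column name `charges` and the variable name `df` are assumptions; in practice the file would be read from disk with `pd.read_csv` and its actual path.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real dataset is read from a file.
csv_text = """age,sex,bmi,children,smoker,region,charges
25,male,22.5,0,no,southwest,3000.0
40,female,30.1,2,yes,northeast,21000.0
33,male,27.8,1,no,southeast,5200.0
58,female,35.4,3,no,northwest,13500.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# info() prints column names, non-null counts and data types.
df.info()

# describe() returns count, mean, std, min, quartiles and max
# for every numeric column.
summary = df.describe()
print(summary)
```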
Data visualisation:
BMI Distribution:
Charges Distribution:
DATA PRE-PROCESSING
Data pre-processing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and a crucial step in creating a machine learning model.
Real-world data generally contains noise and missing values, and may be in an unusable format
that cannot be used directly by machine learning models. Data pre-processing is the
task of cleaning the data and making it suitable for a machine learning model, which also
increases the accuracy and efficiency of the model.
Steps of encoding:
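The encoding steps used in the training are not preserved in this report, so the following is only one possible sketch. It assumes the categorical columns are sex, smoker and region, maps the two-valued columns to 0 and 1, and one-hot encodes region with `pandas.get_dummies`. The sample rows are made up.

```python
import pandas as pd

# Made-up sample of the categorical columns.
df = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "smoker": ["no", "yes", "no"],
    "region": ["southwest", "northeast", "southeast"],
})

# Step 1: map two-valued columns to 0 and 1.
df["sex"] = df["sex"].map({"female": 0, "male": 1})
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})

# Step 2: one-hot encode the multi-valued column, one new column per category.
encoded = pd.get_dummies(df, columns=["region"], dtype=int)

print(encoded)
```

One-hot encoding avoids implying an order between regions, which a single integer code per region would do.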
The train-test split procedure is used to estimate the performance of machine learning
algorithms when they are used to make predictions on data not used to train the model.
It is a fast and easy procedure to perform, the results of which allow you to compare the
performance of machine learning algorithms for your predictive modelling problem. Although
simple to use and interpret, there are times when the procedure should not be used, such as
when you have a small dataset and situations where additional configuration is required, such
as when it is used for classification and the dataset is not balanced.
Splitting of data:
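A minimal sketch of a split with scikit-learn's `train_test_split` follows. The arrays are synthetic, and the `stratify` argument is used here as one way to handle the unbalanced classification case mentioned above; the split ratio and random seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples with 2 features and unbalanced labels (15 vs 5).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# Hold out 20% of the rows for testing. stratify=y keeps the class
# proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print("class 1 in train:", int(y_train.sum()), "class 1 in test:", int(y_test.sum()))
```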
Model training:
The process involved in training a linear regression model is similar in many ways to how
other machine learning models are trained. We need to work on a training data set and model
the relationship between its variables in a way that does not harm the ability of the model to
predict new data samples. The model is trained by repeatedly improving its prediction equation.
This is done by iteratively looping through the given dataset. On every pass,
the bias and weight values are updated simultaneously in the direction that reduces the cost,
that is, opposite to the gradient of the cost function. Training is complete when an error
threshold is reached or when further iterations no longer reduce the cost.
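The iterative update described above can be sketched as plain gradient descent on a mean squared error cost. The synthetic data, the learning rate and the stopping tolerance are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic, noise-free data generated from y = 2x + 1.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0

weight, bias = 0.0, 0.0
learning_rate = 0.5       # assumed step size
tolerance = 1e-10         # stop when the cost no longer drops by this much
previous_cost = float("inf")

for iteration in range(10_000):
    error = (weight * x + bias) - y
    cost = np.mean(error ** 2)            # mean squared error

    # Stop when another pass brings no meaningful reduction in cost.
    if previous_cost - cost < tolerance:
        break
    previous_cost = cost

    # Gradients of the cost with respect to weight and bias.
    grad_weight = 2.0 * np.mean(error * x)
    grad_bias = 2.0 * np.mean(error)

    # Update both parameters together, against the gradient.
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

print(f"stopped after {iteration} iterations: weight={weight:.3f}, bias={bias:.3f}")
```

The learned weight and bias approach the values used to generate the data, and the loop ends through the cost-based stopping rule rather than the iteration limit.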
REGRESSION:
Regression is a learning technique used to predict continuous output values. In other
words, it is used in situations in which we need to fit data to a numeric value. For example, it
is often used to estimate the price of different items. Regression can be used to predict a
wide range of quantities that are expressed as numbers.
Linear Regression:
It is one of the machine learning techniques that fall under supervised learning. The rise in the
demand and use of machine learning techniques is behind the sudden upsurge in the use of
linear regression in several areas.
When do we use linear regression:
Linear regression is suitable when certain conditions hold. The most important of these is the
existence of a linear relationship between the variables of your data set, which can be checked
by plotting them. The variance of the differences between the predicted and actual values (the
residuals) should be constant. The observations should be independent of one another, and the
correlation between predictors should not be too close
for comfort.
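As a minimal sketch, scikit-learn's `LinearRegression` can be fitted to a small synthetic data set that satisfies the linearity condition exactly. The data values are assumptions chosen so that the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with an exact linear relationship: y = 3x + 2.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = 3.0 * X[:, 0] + 2.0

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
prediction = model.predict(np.array([[6.0]]))[0]

# Residuals are the differences between actual and predicted values;
# on real data their spread should look roughly constant.
residuals = y - model.predict(X)

print(slope, intercept, prediction)
```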
Logistic Regression:
Logistic regression is used to explain the relationship between one dependent binary variable and one or more independent variables.
Training data is the initial dataset you use to teach a machine learning application to
recognize patterns or perform to your criteria, while testing or validation data is used to
evaluate your model’s accuracy. You’ll need a new dataset to validate the model because it
already “knows” the training data.
Splitting of dataset:
Splitting the dataset into train and test sets is an important part of data pre-processing,
because it lets us estimate how well the model will predict on data it has not seen.
To see why a single dataset is split, suppose we trained our model on one dataset and then
tested it on a completely different dataset from another source: the correlations between the
features that the model learned during training might not hold in the test data.
Therefore, training the model on one dataset and testing it on an unrelated one would understate
its performance. Hence it is important to split a single dataset into two parts, i.e., a train set
and a test set.
In this way, we can easily evaluate the performance of our model. For example, if it performs well
on the training data but does not perform well on the test data, the model is probably
overfitted.
For splitting the dataset, we can use the train_test_split function of scikit-learn.
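The overfitting check described above can be sketched by comparing accuracy on the two parts. The data set here is synthetic (generated with `make_classification` and deliberate label noise), and an unrestricted decision tree is used only because it memorises training data easily; both choices are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data with some randomly flipped labels.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A tree with no depth limit can fit the training set perfectly.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
gap = train_accuracy - test_accuracy

print(f"train={train_accuracy:.2f} test={test_accuracy:.2f} gap={gap:.2f}")
```

A large gap between the two scores is the signal of overfitting that the text refers to.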
CLASSIFICATION :
Classification is the process of categorizing a given set of data into classes. It can be performed
on both structured and unstructured data. The process starts with predicting the class of given
data points. The classes are often referred to as targets, labels or categories.
Classification predictive modelling is the task of approximating the mapping function from
input variables to discrete output variables. The main goal is to identify which class/category
the new data will fall into.
• Classifier – It is an algorithm that is used to map the input data to a specific category.
• Classification Model – The model draws conclusions from the input data
given for training and predicts the class or category for new data.
• Feature – A feature is an individual measurable property of the phenomenon being
observed.
• Binary Classification – It is a type of classification with two outcomes, for e.g. – either
true or false.
• Multi-Class Classification – The classification with more than two classes, in multi-
class classification each sample is assigned to one and only one label or target.
• Multi-label Classification – This is a type of classification where each sample is
assigned to a set of labels or targets.
• Initialize – It is to assign the classifier to be used for the
• Train the Classifier – Each classifier in scikit-learn uses the fit(X, y) method to fit the
model to the training data X and training labels y.
• Predict the Target – For an unlabelled observation X, the predict(X) method returns the
predicted label y.
• Evaluate – This means evaluating the model, e.g. with a classification report,
accuracy score, etc.
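The initialize, train, predict and evaluate steps can be sketched end to end with scikit-learn. The iris data set bundled with scikit-learn and the k-nearest-neighbours classifier are assumptions chosen for brevity; any other classifier follows the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Initialize: choose the classifier to be used.
classifier = KNeighborsClassifier(n_neighbors=5)

# Train: fit the model to the training data and labels.
classifier.fit(X_train, y_train)

# Predict: obtain labels for observations the model has not seen.
y_pred = classifier.predict(X_test)

# Evaluate: accuracy score and a per-class classification report.
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"accuracy: {accuracy:.2f}")
print(report)
```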
Example :
A well-known example of an ML classification algorithm is an email spam detector. The main goal
of a classification algorithm is to identify the category of a given data point, and these
algorithms are mainly used to predict the output for categorical data. Classification
algorithms can be better understood using the below diagram.
LOGISTIC REGRESSION:
It is a classification algorithm in machine learning that uses one or more independent variables
to determine an outcome. The outcome is measured with a dichotomous variable, meaning it
has only two possible values.
The goal of logistic regression is to find the best-fitting relationship between the dependent
variable and a set of independent variables. Its advantage over some other binary classification
algorithms, such as nearest neighbour, is that it quantitatively explains the factors leading to
classification.
The main disadvantages of the logistic regression algorithm are that, in its basic form, it only
works when the predicted variable is binary, it assumes that the data is free of missing values,
and it assumes that the predictors are independent of each other.
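A minimal sketch of binary logistic regression with scikit-learn is shown here. The data (hours studied against a pass or fail outcome) is hypothetical, and the manual calculation at the end illustrates how the fitted coefficient and intercept explain the predicted probability through the sigmoid function.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (one predictor) and pass (1) or fail (0).
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

new_points = np.array([[1.0], [4.0]])
labels = model.predict(new_points)                    # predicted classes
probabilities = model.predict_proba(new_points)[:, 1]  # P(class 1)

# The same probability computed by hand: sigmoid(intercept + coefficient * x).
z = model.intercept_[0] + model.coef_[0, 0] * 4.0
manual_probability = 1.0 / (1.0 + np.exp(-z))

print(labels, probabilities, manual_probability)
```

A positive coefficient means that each extra hour raises the estimated odds of passing, which is the quantitative explanation the text mentions.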
Use Cases
DECISION TREE CLASSIFIER:
o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes, the Decision Node and the Leaf
Node. Decision nodes are used to make a decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
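The readable, question-by-question structure of a decision tree can be sketched with scikit-learn's `DecisionTreeClassifier`, whose documentation describes it as an optimised version of CART. The iris data set and the depth limit of 2 are assumptions chosen to keep the printed tree small.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit the depth so that the printed rules stay short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders decision nodes as feature tests and leaves as classes.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

depth = tree.get_depth()
n_leaves = tree.get_n_leaves()
training_accuracy = tree.score(iris.data, iris.target)
```

Each indented line in the printed output is one yes/no test on a feature, and each `class:` line is a leaf node.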
Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that is split into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are called its child nodes.
o It is simple to understand, as it follows the same process that a human follows while
making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
Random forest classifier: Random Forest is a popular machine learning algorithm that
belongs to the supervised learning technique. It can be used for both Classification and
Regression problems in ML. It is based on the concept of ensemble learning, which is a
process of combining multiple classifiers to solve a complex problem and to improve the
performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and predicts the final output based on the majority vote of those
predictions.
A greater number of trees in the forest generally leads to higher accuracy and reduces the
risk of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all
the trees predict the correct output. Therefore, below are two assumptions for a better Random
Forest classifier:
o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
Random Forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions with each tree created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority votes.
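The two phases can be sketched by hand with individual decision trees. This is a minimal
illustration, assuming scikit-learn and NumPy are installed; the Iris dataset, the value N = 25,
sampling with replacement to form each subset, and the max_features="sqrt" setting are all
assumptions made for the example. In practice, scikit-learn's RandomForestClassifier wraps
the same idea in one estimator.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset (an assumption, not the data used in this training).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_trees = 25  # the number N of decision trees to build
trees = []
for i in range(n_trees):
    # Select a random subset of the training points (sampling with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Build the decision tree associated with the selected data points.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# For new data points, collect the prediction of every tree ...
all_votes = np.array([t.predict(X_test) for t in trees])  # shape: (n_trees, n_test)
# ... and assign each point to the category that wins the majority of votes.
forest_pred = np.array([np.bincount(votes).argmax() for votes in all_votes.T])

accuracy = float((forest_pred == y_test).mean())
print(accuracy)
```

Because each tree sees a different random subset, their individual errors tend to differ, which
is what lets the majority vote correct them.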
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random Forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest
classifier predicts the final decision. Consider the below image:
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can
be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
o Although random forest can be used for both classification and regression tasks, it is
less suitable for regression tasks.
Cross-validation is a technique for validating the model efficiency by training it on the subset
of input data and testing on previously unseen subset of the input data. We can also say that it
is a technique to check how a statistical model generalizes to an independent dataset.
In machine learning, there is always a need to test the stability of the model. This means
we cannot judge the model only by how well it fits the training dataset. For this purpose, we
reserve a particular sample of the dataset that was not part of the training dataset. After that,
we test our model on that sample before deployment, and this complete process comes under
cross-validation.
This is something different from the general train-test split.
There are some common methods that are used for cross-validation. These methods are given
below:
o Train/test split: The input data is divided into two parts, a training set and a test
set, in a ratio such as 70:30 or 80:20. The resulting performance estimate has high
variance, because it depends on which observations fall into each part. This is one of its
biggest disadvantages.
o Training Data: The training data is used to train the model, and the dependent
variable is known.
o Test Data: The test data is used to make predictions with the model that is
already trained on the training data. It has the same features as the training data
but is not part of it.
o Cross-Validation dataset: It is used to overcome the disadvantage of the train/test split by
splitting the dataset into groups of train/test splits and averaging the results. It can be
used if we want to optimize a model that has been trained on the training dataset for
the best performance. It uses the data more efficiently than a single train/test split, as
every observation is used for both training and testing.
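The difference between a single train/test split and cross-validation can be sketched as
follows. This is a minimal example, assuming scikit-learn is installed; the Iris dataset, the
decision tree model, the 70:30 ratio and the choice of 5 folds are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset and model (assumptions, not taken from this report).
X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Single 70:30 train/test split: one score that depends on how the data was divided.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
single_score = model.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: every observation is used for testing exactly once,
# and the five scores are averaged.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=folds)

print(single_score, cv_scores, cv_scores.mean())
```

The spread of the individual fold scores gives a rough idea of how much a single split could
vary.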
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
o Under ideal conditions, it provides the optimum output, but with inconsistent data
it may produce drastically different results. This is one of the big disadvantages of
cross-validation, as there is no certainty about the type of data in machine learning.
o In predictive modelling, the data evolves over time, which can cause differences
between the training and validation sets. For example, if we create a model for the
prediction of stock market values and train it on the previous 5 years of stock values,
the actual values for the next 5 years may be drastically different, so it is difficult to
expect the correct output in such situations.
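For data that evolves over time, one common precaution is to validate only on observations that
come after the training period. The sketch below uses scikit-learn's TimeSeriesSplit, which is
not covered in this report and is shown here as a swapped-in illustration; the 12 placeholder
values stand in for a time-ordered series and are not real stock data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder for 12 time-ordered observations (hypothetical, not real prices).
prices = np.arange(12).reshape(-1, 1)

# Each split trains on the past and tests on the period that follows it.
splits = list(TimeSeriesSplit(n_splits=3).split(prices))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

This does not remove the limitation described above, since the future can still differ from the
past, but it avoids testing a model on data older than what it was trained on.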
Applications of Cross-Validation
Future Scope
• Creating the model with additional parameters such as Work Experience, Technical
Papers Written, and Content of Letter of Recommendation etc.
• Creating a model based on the graph of admitted vs enrolled students of previous years
to predict the increase or decrease in cut-off scores among applicants.
• Comparing different universities based on applied vs admitted data.
• Machine Learning is expanding across all fields such as banking and finance, information
technology, media & entertainment, gaming, and the automotive industry.
• As the Machine Learning scope is very high, there are some areas where researchers
are working toward revolutionizing the world for the future.
• Self-driving cars are built using Machine Learning, IoT sensors, high-definition
cameras, voice recognition systems, etc.
• In robotics, inventions were possible with the help of Machine Learning and Artificial
Intelligence.
• The scope of Machine Learning in India, as well as in other parts of the world, is high
in comparison to other career fields when it comes to job opportunities.
• The progress in the field of Artificial Intelligence and Machine Learning has made it
possible to achieve the goal of computer vision faster.
• Machine Learning will accelerate the processing power of the automation system used
in various technologies.
• This gives the benefit to the organization for making effective business strategies as per
the predictions of the ML algorithms.
• One of the most fascinating developments related to Machine Learning is Quantum
Computing. Experts believe that quantum computing has the scope to boost the potential
of Machine Learning manifold. As an interesting fact, Google's quantum processor in 2019
performed a task in 200 seconds that Google estimated would take a classical
supercomputer about 10,000 years.
Conclusion
• After the final submission of the test data, the model's accuracy score was 87%.
• Graphical representation of the data provided useful insights and led to choosing a
better model.
• Linear Regression worked best for this dataset because the data was linearly correlated.
• Machine learning is a powerful tool for making predictions from data.
• The aim of machine learning is to automate analytical model building and enable
computers to learn from data without being explicitly programmed to do so.
• It is important to remember that machine learning is only as good as the data that is
used to train the algorithms. In order to make accurate predictions, it is important to use
high-quality data that is representative of the real-world data that the algorithm will be
used on.
• Machine Learning is a technique of training machines to perform the activities a human
brain can do, albeit a bit faster and better than an average human being.
• Today we have seen that machines can beat human champions in games such as
Chess and Go (for example, AlphaGo), which are considered very complex.
• Machines can be trained to perform human activities in several areas and can aid
humans in living better lives.
• With a smaller amount of clearly labelled training data, opt for Supervised
Learning. Unsupervised Learning would generally give better performance and results
for large data sets.
• We looked at the choices of various development languages, IDEs and platforms.