Final Updated AI
Traditional Programming
Traditional programming is a manual process: a programmer writes a program that uses input data and runs on a computer to produce the output.

Data + Program → Computer → Output
Convert °C To °F
The equation is as follows:
T(°F) = T(°C) × 9/5 + 32
Based on Rule
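A minimal sketch of this rule-based approach in Python: the programmer writes the conversion rule explicitly, and the computer merely executes it.

    def celsius_to_fahrenheit(celsius):
        # Hand-coded rule: T(F) = T(C) * 9/5 + 32
        return celsius * 9 / 5 + 32

    print(celsius_to_fahrenheit(100))  # 212.0
    print(celsius_to_fahrenheit(0))    # 32.0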
Source: https://zeenews.india.com/cricket/on-this-day-in-2007-sachin-tendulkar
Source: https://towardsdatascience.com/
Artificial Intelligence
Artificial intelligence (AI) involves machines that can execute tasks similar to those requiring human intelligence; the term was coined in 1956 by John McCarthy.
What is intelligence?
1) Learning
The simplest is learning by trial and error. For example, a
simple computer program for solving mate-in-one chess
problems might try moves at random until mate is found.
The program might then store the solution with the position
so that the next time the computer encountered the same
position it would recall the solution.
3) Reasoning
To reason is to draw inferences appropriate to the situation.
Inferences are classified as either deductive or inductive.
An example of the former is, “Sachin must be in either the museum or the café. He is not in the café; therefore he is in the museum.”
An example of the latter is, “Previous accidents of this sort were caused by instrument failure; therefore this accident was caused by instrument failure.”
The most significant difference between these forms of
reasoning is that in the deductive case the truth of the
premises guarantees the truth of the conclusion, whereas
in the inductive case the truth of the premise lends support
to the conclusion without giving absolute assurance.
Inductive reasoning is common in science, where data are
collected and tentative models are developed to describe and
predict future behaviour—until the appearance of
anomalous data forces the model to be revised.
Deductive reasoning is common in mathematics and logic, where elaborate structures of irrefutable theorems are built up from a small set of basic axioms and rules.
4) Perception
In perception the environment is scanned by means of
various sensory organs, real or artificial, and the scene is
decomposed into separate objects in various spatial
relationships.
Analysis is complicated by the fact that an object may appear
different depending on the angle from which it is viewed, the
direction and intensity of illumination in the scene, and how
much the object contrasts with the surrounding field.
5) Language
A language is a system of signs having meaning by
convention. In this sense, language need not be confined to
the spoken word.
Traffic signs, for example, form a minilanguage, it being a matter
of convention that ⚠ means “hazard ahead” in some countries.
History of Artificial Intelligence
❖ Maturation of Artificial Intelligence (1943-1952)
• Year 1943: The first work now recognized as AI was done by Warren McCulloch and Walter Pitts in 1943. They proposed a model of artificial neurons.
❖ The golden years-Early enthusiasm (1956-1974)
• Year 1966: Researchers emphasized developing algorithms that could solve mathematical problems. Joseph Weizenbaum created the first chatbot in 1966, named ELIZA.
❖ A boom of AI (1980-1987)
• Year 1980: After the first AI winter, AI came back with the "Expert System": programs that emulate the decision-making ability of a human expert.
• Also in 1980, the first national conference of the American Association for Artificial Intelligence (AAAI) was held at Stanford University.
❖ The emergence of intelligent agents (1993-2011)
• Year 1997: IBM's Deep Blue beat world chess champion Garry Kasparov, becoming the first computer to defeat a reigning world chess champion.
• Year 2002: For the first time, AI entered the home in the form of Roomba, a robotic vacuum cleaner.
• Year 2006: AI entered the business world. Companies like Facebook, Twitter, and Netflix started using AI.
❖ Deep learning, big data and artificial general intelligence (2011-present)
• Year 2011: IBM's Watson won Jeopardy!, a quiz show in which it had to solve complex questions and riddles. Watson proved that it could understand natural language and answer tricky questions quickly.
• Year 2012: Google launched "Google Now", an Android feature that could provide information to the user as predictions.
• Year 2014: The chatbot Eugene Goostman won a competition in the famous "Turing test."
• Year 2018: IBM's "Project Debater" debated complex topics with two master debaters and performed extremely well.
• Google demonstrated an AI program, "Duplex", a virtual assistant that booked a hairdresser appointment over the phone, and the person on the other side didn't notice she was talking to a machine.
Artificial Intelligence
Artificial Intelligence, Machine Learning and Deep Learning
Machine learning
Machine learning is one way of achieving AI. Arthur Samuel coined the phrase in 1959, not long after AI itself, defining it as "the field of study that gives computers the ability to learn without being explicitly programmed".
Deep learning
Deep learning is a subset of machine learning and is so called because it makes use of deep neural networks: software that mimics the network of neurons in a brain.
In deep learning, the learning phase is done through a neural network, an architecture in which layers are stacked on top of each other.
Machine Learning and Deep Learning Performance w.r.t. Data
Machine Learning v/s Deep Learning
When to use ML or DL?
What are the Most Popular Languages for Machine Learning?
⚫ Python
⚫ R
⚫ MATLAB
⚫ SAS (Statistical Analysis System)
⚫ Scala
⚫ Java
⚫ C++
⚫ Hadoop
Expert System
In artificial intelligence, an expert system is a computer system that emulates the decision-making ability of a human expert. The first expert systems were created in the 1970s, and their production increased in the 1980s.
Machine Learning
Machine learning is an application of artificial intelligence
(AI) that provides systems the ability to automatically learn
and improve from experience without being explicitly
programmed.
Input (Data) + Output → Computer → Program
How Do ML Algorithms Work?
Source: https://www.spaceotechnologies.com/machine-learning-app-development-complete-guide/
Convert °C To °F
Source: https://www.spaceotechnologies.com/machine-learning-app-development-complete-guide/
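In the machine-learning version of this example, we do not write the 9/5 + 32 rule ourselves; we give the computer example (input, output) pairs and let it learn the mapping. A minimal sketch using scikit-learn (assuming it is installed):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Example pairs: Celsius inputs and their Fahrenheit outputs
    celsius = np.array([[-40], [0], [20], [37], [100]])
    fahrenheit = np.array([-40.0, 32.0, 68.0, 98.6, 212.0])

    model = LinearRegression().fit(celsius, fahrenheit)
    print(model.coef_[0], model.intercept_)   # ~1.8 and ~32: the learned "rule"
    print(model.predict([[25]]))              # ~[77.]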
Pattern Recognition
Pattern recognition is the process of recognizing patterns by using machine learning algorithms.
Machine Learning v/s Pattern Recognition
How Do Algorithms Find Patterns?
Learning Algorithm
Machine learning provides the machine the ability to learn automatically and improve from experience without being explicitly programmed.
The process of learning begins with observations, in order to find patterns in data and make better decisions in the future based on the examples that we provide.
The primary aim of a learning algorithm is to allow computers to learn automatically, without human intervention.
Machine Learning Algorithms:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning
Learning in the presence of instructor/supervisor/teacher
❖ Ex. Classroom teaching
Training Phase (the model learns the "sum" logic from labeled examples):

Num-1   Num-2   Sum
5       5       10
8       2       10
10      3       13
15      6       21
20      4       24

Testing Phase: new inputs (30, 40) → Trained Model → 70
Source: https://mc.ai/supervised-vs-unsupervised-learning/
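A small illustration of this training/testing split in Python, fitting the table above (a sketch; assumes scikit-learn is available):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Training phase: feature pairs and their labels (the sums)
    X_train = np.array([[5, 5], [8, 2], [10, 3], [15, 6], [20, 4]])
    y_train = np.array([10, 10, 13, 21, 24])

    model = LinearRegression().fit(X_train, y_train)

    # Testing phase: an unseen example
    print(model.predict([[30, 40]]))  # ~[70.]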
Types of Supervised Learning
Machine Learning Algorithms:
• Supervised Learning → Regression, Classification
• Unsupervised Learning
• Reinforcement Learning
Regression
If the output of the model is a continuous value, the task is regression. Regression is used to predict a continuous value.

Num-1   Num-2   Sum
5       5       10
8       2       10
10      3       13
15      6       21
20      4       24

Examples:
❖ House price prediction
❖ Stock market prediction
❖ Predicting the age of a person
❖ Predicting the number of copies of a music album that will be sold next month
Types of ML Classification Algorithms:
⚫ Linear Models
⚫ Logistic Regression
⚫ Support Vector Machines
⚫ Non-linear Models
⚫ K-Nearest Neighbours
⚫ Kernel SVM
⚫ Naïve Bayes
⚫ Decision Tree Classification
⚫ Random Forest Classification
Source: https://www.kdnuggets.com/2019/12/enabling-deep-learning-revolution.html
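As a sketch, one of these classifiers applied to a tiny labeled dataset (scikit-learn assumed; the data here is made up purely for illustration):

    from sklearn.neighbors import KNeighborsClassifier

    # Toy data: [height_cm, weight_kg] labeled 'cat' or 'dog' (illustrative only)
    X = [[25, 4], [30, 5], [28, 4], [60, 25], [55, 20], [65, 30]]
    y = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog']

    clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(clf.predict([[27, 5], [58, 22]]))  # ['cat' 'dog']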
Classification v/s Regression
Source: https://medium.com/deep-math-machine-learning-ai/different-types-of-machine-learning-and-their-types-34760b9128a2
Unsupervised Learning
Unsupervised learning is very much the opposite of supervised learning: its data are not labeled.
Source: https://towardsdatascience.com/what-are-the-types-of-machine-learning-e2b9e5d1756f
Source: https://mc.ai/supervised-vs-unsupervised-learning/
Machine Learning Algorithms:
• Supervised Learning → Regression, Classification
• Unsupervised Learning → Clustering, Dimension Reduction
• Reinforcement Learning
Clustering
Clustering is “the process of organizing objects into groups whose members are similar in some way”.
Source: https://towardsdatascience.com/unsupervised-learning-and-data-clustering-eeecb78b422a
Source: http://bampe08.blogspot.com/2015/03/cluster-analysis-for-business.html
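A minimal clustering sketch with k-means (scikit-learn assumed; the points are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    # Unlabeled points forming two rough groups
    X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
                  [8, 8], [9, 11], [8.5, 9]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)           # e.g. [0 0 0 1 1 1]
    print(kmeans.cluster_centers_)  # one centre per discovered group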
Dimension Reduction
Dimensionality reduction is simply the process of reducing the dimension of your feature set.
❖ Ex. Principal Component Analysis (PCA); see the sketch below.
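A sketch of PCA reducing 3-D data to 2-D (scikit-learn assumed; random data used for illustration):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))          # 100 samples, 3 features

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)                 # (100, 2)
    print(pca.explained_variance_ratio_)   # variance kept by each component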
Why is Unsupervised Learning Needed?
Annotating large datasets is very costly and hence we can
label only a few examples manually.
❖ Example: Speech Recognition
Examples of Unsupervised Learning
Recommender systems: group customers into segments with similar purchasing behavior.
❖ On YouTube or Netflix, you have seen the video recommendation system.
Supervised vs. Unsupervised Learning
Classification v/s Clustering
Source: https://deepcast.ai/article/2017/07/18/going_beyond_k_means_for_effective_clustering_of_production_curves
Classification v/s Clustering v/s Regression
Source: https://www.researchgate.net/publication/314626729_Parallel_Linear_Algebra/figures?lo=1
Reinforcement Learning
Reinforcement learning is about taking suitable actions to maximize reward in a particular situation.
Source: https://www.geeksforgeeks.org/what-is-reinforcement-learning/
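The reward-maximization idea can be sketched with a tiny two-armed bandit agent (a deliberate simplification of reinforcement learning; the payout probabilities and exploration rate below are made up):

    import random

    probs = [0.3, 0.7]          # hidden payout probability of each arm
    values = [0.0, 0.0]         # agent's running reward estimate per arm
    counts = [0, 0]

    for step in range(1000):
        # Explore 10% of the time, otherwise exploit the best-looking arm
        arm = random.randrange(2) if random.random() < 0.1 else values.index(max(values))
        reward = 1 if random.random() < probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

    print(values)  # the estimate for arm 1 should approach 0.7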
Evaluation Metric
Regression
o RMSE
o MSE
o R-Square
o Adjusted R Square
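These regression metrics can be computed directly; a numpy sketch (using MSE = mean((y - ŷ)²), RMSE = √MSE, R² = 1 - SS_res/SS_tot, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1); the data and p below are assumptions for illustration):

    import numpy as np

    y_true = np.array([10, 10, 13, 21, 24])
    y_pred = np.array([11, 9, 13, 20, 25])
    n, p = len(y_true), 2          # p = number of predictors (assumed)

    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

    print(mse, rmse, r2, adj_r2)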
Deep learning
Underfitting and Overfitting
Underfitting: low accuracy on both the training and the test set. Overfitting, by contrast: high training accuracy but low test accuracy.
Source: https://www.analyticsvidhya.com/blog/2020/02/underfitting-overfitting-best-fitting-machine-learning/
Source: https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76/
Agents
An AI system is composed of an agent and its environment. The agents act in their environment, and the environment may contain other agents.
An agent is anything that can perceive its environment through sensors and act upon that environment through effectors.
⚫ Sensor: a device which detects changes in the environment and sends the information to other electronic devices. An agent observes its environment through sensors.
⚫ Actuators: the components of a machine that convert energy into motion. Actuators are responsible for moving and controlling a system.
Examples of Agent
A human agent has sensory organs such as eyes, ears, nose,
tongue and skin parallel to the sensors, and other organs such
as hands, legs, mouth, for effectors.
PEAS Representation
⚫ PEAS is a type of model on which an AI agent works.
• P: Performance measure
• E: Environment
• A: Actuators
• S: Sensors
Types of Agents
❖ Simple Reflex Agents
❖ Model-Based Reflex Agents
❖ Goal-Based Agents
❖ Utility-Based Agents
❖ Learning Agent
Simple reflex agents
Simple reflex agents ignore the rest of the percept history and act only on the basis of the current percept.
The agent function is based on the condition-action rule: if the condition is true, the action is taken, else not.
This agent function succeeds only when the environment is fully observable (i.e., it is not suitable for partially observable environments).
Model-based reflex agents
It works by finding a rule whose condition matches the current situation.
A model-based agent can handle partially observable environments.
The current state is stored inside the agent, which maintains some kind of structure describing the part of the world that cannot be seen.
Goal-based agents
These kinds of agents take decisions based on how far they currently are from their goal (a description of desirable situations). Every action is intended to reduce the distance to the goal.
They usually require search and planning. The goal-based agent's behavior can easily be changed.
Utility-based agents
Agents developed with their end uses (utilities) as building blocks are called utility-based agents.
When there are multiple possible alternatives, utility-based agents are used to decide which one is best. They choose actions based on a preference (utility) for each state. Sometimes achieving the desired goal is not enough; we may look for a quicker, safer, or cheaper trip to reach a destination.
Learning Agent
A learning agent in AI is an agent that can learn from its past experiences, i.e., it has learning capabilities. It starts by acting with basic knowledge and is then able to act and adapt automatically through learning.
A learning agent has mainly four conceptual components,
which are:
1. Learning element: It is responsible for making improvements by learning
from the environment
2. Critic: The learning element takes feedback from the critic, which describes how well the agent is doing with respect to a fixed performance standard.
3. Performance element: It is responsible for selecting external action
4. Problem Generator: This component is responsible for suggesting actions
that will lead to new and informative experiences.
Rational and Intelligent Agent
❖ Intelligent Agent
An intelligent agent is a goal-directed agent. It perceives its environment through its sensors and, using the observations and built-in knowledge, acts upon the environment through its actuators.
❖ Rational Agent
Rationality is the state of being reasonable, sensible, and having good judgment.
Rationality is concerned with expected actions and results, depending on what the agent has perceived.
A rational agent always performs the right action. The problem the agent solves is characterized by Performance measure, Environment, Actuators, and Sensors (PEAS).
Turing Test in AI
In 1950, Alan Turing introduced a test to check whether a machine can think like a human or not; this test is known as the Turing Test.
In this test, Turing proposed that a computer can be said to be intelligent if it can mimic human responses under specific conditions.
General Problem Solving Components
❖ Every problem should be properly formulated in artificial intelligence. Problem formulation is very important before applying any search algorithm.
❖ Problem Limitation
⚫ There are always some limitations while solving problems. All these limitations or constraints must be fulfilled while creating the system. For example: "I have only a few features, some records are missing, and the system must be 90% accurate, otherwise it will be useless."
❖ Goal or Solution
⚫ What is expected from the system? The goal state, final state, or solution of the problem is defined here. This will help us to propose an appropriate solution for the problem.
❖ Solution Space
⚫ A problem can be solved in many ways. Some solutions will be more efficient than others; some will consume fewer resources, some will be simpler, and so on. There are always alternatives. The set of possible ways in which we can solve a problem is known as the solution space. For example: "the price of a house can be predicted using many machine learning algorithms".
❖ Operators
⚫ Operators are the actions taken while solving the problem. The complete problem is solved using tiny steps or actions, and all these consecutive actions lead to the solution of the problem.
Mouse Path Problem
• Problem Statement
• Problem Definition: A mouse is hungry; the mouse is in a maze where there are some pieces of cheese. The mouse will only be satisfied if it eats cheese.
• Problem Limitation: Some paths are closed, i.e., dead ends; the mouse can only travel through open paths.
• Problem Solution: Reach a location where there is cheese and eat at least one piece. There are multiple possible solutions (cheese pieces).
• Solution Space: There are multiple possible paths to reach the cheese.
• Operators: The mouse can move in four possible directions; these directions are the operators or actions, which are UP, DOWN, LEFT and RIGHT.
Water Jug Problem
• Problem Statement
• Problem Definition: You have to measure 4 liters (L) of water by using three buckets of 8L, 5L and 3L.
• Problem Limitation: You can only use these (8L, 5L and 3L) buckets.
• Problem Solution: Measure exactly 4L of water.
• Solution Space: There are multiple ways of doing this.
• Operators: The possible actions are filling water into any bucket and removing water from any bucket.
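A sketch of this formulation in Python, searching the solution space with breadth-first search over (8L, 5L, 3L) bucket states. The jug sizes come from the slide; the pour operator and the assumption that the 8L bucket starts full are mine:

    from collections import deque

    CAP = (8, 5, 3)          # bucket capacities
    start = (8, 0, 0)        # assume the 8L bucket starts full

    def next_states(state):
        # Operator: pour from bucket i into bucket j until full or empty
        for i in range(3):
            for j in range(3):
                if i != j and state[i] > 0:
                    amount = min(state[i], CAP[j] - state[j])
                    s = list(state)
                    s[i] -= amount
                    s[j] += amount
                    yield tuple(s)

    # BFS: find any state in which some bucket holds exactly 4L
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if 4 in path[-1]:
            print(path)      # sequence of states reaching 4L
            break
        for s in next_states(path[-1]):
            if s not in seen:
                seen.add(s)
                queue.append(path + [s])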
8 Puzzle or Slide Puzzle
• States: A state description
specifies the location of each of
the eight tiles and the blank in
one of the nine squares.
• Initial state: Any random
shuffled state can be
designated as initial state
• Actions:
• Slide Left
• Slide Right
• Slide Up
• Slide Down
• Transition model: Given a
state and action, this returns
the resulting state
• Goal test: This checks whether the state matches the goal configuration.
Search Algorithm Terminologies
In artificial intelligence, search techniques are universal problem-solving methods.
❖ Informed Search
Informed search algorithms use domain knowledge. In an informed search, problem information is available which can guide the search.
Informed search strategies can find a solution more efficiently than an uninformed search strategy.
Difference between Informed and Uninformed Search
1. Breadth-first Search
Breadth-first search is the most common search strategy for traversing a tree or graph. This algorithm searches breadthwise in a tree or graph, so it is called breadth-first search.
The BFS algorithm starts searching from the root node of the tree and expands all successor nodes at the current level before moving to nodes of the next level.
Breadth-first search is implemented using a FIFO queue data structure.
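A minimal BFS sketch in Python over an adjacency-list graph (the graph below is a made-up example, not the one in the figure):

    from collections import deque

    graph = {                      # illustrative adjacency list
        'S': ['A', 'B'], 'A': ['C'], 'B': ['D'],
        'C': ['K'], 'D': [], 'K': []
    }

    def bfs(start, goal):
        queue = deque([[start]])   # FIFO queue of paths
        visited = {start}
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == goal:
                return path
            for nxt in graph[node]:
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(path + [nxt])

    print(bfs('S', 'K'))  # ['S', 'A', 'C', 'K']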
BFS algorithm from the root node S to goal node K.
The equivalent search tree for the graph:
2. Depth-first Search
It starts from the root node and follows each path to its greatest depth node before moving to the next path.
DFS uses a stack data structure for its implementation.
The process of the DFS algorithm is similar to the BFS algorithm.
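A matching DFS sketch using an explicit stack (same made-up graph as in the BFS example):

    graph = {
        'S': ['A', 'B'], 'A': ['C'], 'B': ['D'],
        'C': ['K'], 'D': [], 'K': []
    }

    def dfs(start, goal):
        stack = [[start]]          # LIFO stack of paths
        visited = set()
        while stack:
            path = stack.pop()
            node = path[-1]
            if node == goal:
                return path
            if node not in visited:
                visited.add(node)
                for nxt in reversed(graph[node]):
                    stack.append(path + [nxt])

    print(dfs('S', 'K'))  # ['S', 'A', 'C', 'K']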
The equivalent search tree for the graph:
Calculate the order to print all the nodes of the graph, starting from node H to E.
Properties of BFS
Time Complexity: The time complexity of the BFS algorithm is given by the number of nodes traversed in BFS until the shallowest node, where d is the depth of the shallowest solution and b is the branching factor (number of successors at every state):
T(b) = 1 + b + b^2 + b^3 + ... + b^d = O(b^d)
❖ Disadvantages:
It requires lots of memory, since each level of the tree must be saved into memory to expand the next level.
BFS needs lots of time if the solution is far away from the root node.
Properties of DFS
Completeness: The DFS algorithm is complete within finite state space, as it will expand every node within a bounded search tree.
❖ Disadvantages:
There is the possibility that many states keep re-occurring, and there is no guarantee of finding the solution.
DFS goes deep down in the search, and it may sometimes go into an infinite loop.
3. Depth-Limited Search Algorithm
A depth-limited search algorithm is similar to depth-first search with a predetermined limit ℓ. Depth-limited search can solve the drawback of the infinite path in depth-first search. In this algorithm, the node at the depth limit is treated as if it has no further successor nodes.
Disadvantages:
Depth-limited search also has the disadvantage of incompleteness.
It may not be optimal if the problem has more than one solution.
Properties of Depth-Limited Search
Completeness: The DLS algorithm is complete if the solution is above the depth limit.
Time Complexity: The time complexity of the DLS algorithm is O(b^ℓ).
Space Complexity: The space complexity of the DLS algorithm is O(b×ℓ).
Optimal: Depth-limited search can be viewed as a special case of DFS, and it is likewise not optimal, even if ℓ > d.
4. Bidirectional Search Algorithm
In normal graph search using BFS/DFS we begin our search in one direction, usually from the source vertex toward the goal vertex. But what if we start the search from both directions simultaneously?
Bidirectional search is a graph search algorithm which finds the smallest path from source to goal vertex in a directed graph. It runs two simultaneous searches:
1) a forward search from the source/initial vertex toward the goal vertex, and
2) a backward search from the goal vertex toward the source.
Performance measures
Completeness: Bidirectional search is complete if BFS is used in both searches.
Optimality: It is optimal if BFS is used for the searches and paths have uniform cost.
Complexity: Space complexity is the same as BFS and DFS.
The total time complexity would be O(b^(d/2) + b^(d/2)), which is far less than O(b^d).

A -> B -> C -> D
5. Uniform-cost Search Algorithm
UCS is different from BFS and DFS because here costs come into play: traversing different edges might not have the same cost. Uniform-cost search is a searching algorithm used for traversing a weighted tree or graph.
This algorithm comes into play when a different cost is available for each edge. The primary goal of uniform-cost search is to find a path to the goal node which has the lowest cumulative cost.
Uniform-cost search expands nodes according to their path costs from the root node. It can be used to solve any graph/tree where the optimal cost is in demand.
A uniform-cost search algorithm is implemented using a priority queue. It gives maximum priority to the lowest cumulative cost.
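A UCS sketch with Python's heapq as the priority queue (the graph and edge weights below are made up for illustration):

    import heapq

    graph = {                                   # node -> [(neighbor, edge_cost)]
        'S': [('A', 1), ('B', 4)],
        'A': [('C', 2), ('B', 1)],
        'B': [('G', 5)],
        'C': [('G', 3)],
        'G': []
    }

    def ucs(start, goal):
        frontier = [(0, start, [start])]        # (path_cost, node, path)
        explored = set()
        while frontier:
            cost, node, path = heapq.heappop(frontier)  # cheapest path first
            if node == goal:
                return cost, path
            if node in explored:
                continue
            explored.add(node)
            for nxt, w in graph[node]:
                heapq.heappush(frontier, (cost + w, nxt, path + [nxt]))

    print(ucs('S', 'G'))  # (6, ['S', 'A', 'C', 'G'])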
Uniform-cost search is equivalent to the BFS algorithm if the path cost of all edges is the same.
Advantages:
Uniform-cost search is optimal, because at every state the path with the least cost is chosen.
Disadvantages:
It does not care about the number of steps involved in searching and is only concerned with path cost, due to which this algorithm may get stuck in an infinite loop.
Informed Search Algorithms
An informed search algorithm has access to knowledge such as how far we are from the goal, the path cost, how to reach the goal node, etc. This knowledge helps agents explore less of the search space and find the goal node more efficiently.
The informed search algorithm uses the idea of a heuristic, so it is also called heuristic search.
2) A* Search Algorithm
Advantages and Disadvantages of Best First Search
Advantages:
Best-first search can switch between BFS and DFS, gaining the advantages of both algorithms.
This algorithm is more efficient than the BFS and DFS algorithms.
Disadvantages:
It can behave as an unguided depth-first search in the worst-case scenario.
It can get stuck in a loop, as DFS can.
This algorithm is not optimal.
A* Search Algorithm
A* search is the most commonly known form of best-first search.
It uses the heuristic function h(n) and the cost to reach node n from the start state, g(n).
The A* algorithm is similar to UCS except that it uses g(n) + h(n) instead of g(n).
It combines features of UCS and greedy best-first search, by which it solves problems efficiently.
The A* search algorithm finds the shortest path through the search space using the heuristic function. This search algorithm expands fewer nodes of the search tree and provides optimal results faster.
In the A* search algorithm, we use the search heuristic as well as the cost to reach the node. Hence we can combine both costs as follows, and this sum is called the fitness number:
f(n) = g(n) + h(n)
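An A* sketch reusing the weighted-graph idea, with a made-up heuristic table h(n) (both graph and heuristic are illustrative; h should underestimate the true remaining cost for A* to stay optimal):

    import heapq

    graph = {
        'S': [('A', 1), ('B', 4)],
        'A': [('C', 2), ('B', 1)],
        'B': [('G', 5)],
        'C': [('G', 3)],
        'G': []
    }
    h = {'S': 5, 'A': 4, 'B': 4, 'C': 3, 'G': 0}   # assumed admissible heuristic

    def a_star(start, goal):
        frontier = [(h[start], 0, start, [start])]  # (f = g + h, g, node, path)
        while frontier:
            f, g, node, path = heapq.heappop(frontier)
            if node == goal:
                return g, path
            for nxt, w in graph[node]:
                heapq.heappush(frontier, (g + w + h[nxt], g + w, nxt, path + [nxt]))

    print(a_star('S', 'G'))  # (6, ['S', 'A', 'C', 'G'])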
Advantages:
The A* search algorithm performs better than other search algorithms.
The A* search algorithm is optimal and complete.
This algorithm can solve very complex problems.
Disadvantages:
The A* search algorithm has some complexity issues.
The main drawback of A* is its memory requirement: it keeps all generated nodes in memory, so it is not practical for various large-scale problems.
The path highlighted in red shows the path taken by the greedy algorithm, and the path highlighted in green shows the path taken by the heuristic A* algorithm.
8-Puzzle using A* Algorithm.
Hill Climbing
The hill climbing algorithm is a local search algorithm which continuously moves in the direction of increasing elevation/value to find the peak of the mountain or the best solution to the problem.
It terminates when it reaches a peak where no neighbor has a higher value.
Example: the traveling-salesman problem, in which we need to minimize the distance traveled by the salesman.
It is also called greedy local search, as it only looks to its good immediate neighbor state and not beyond that.
Hill climbing is mostly used when a good heuristic is available.
In this algorithm, we don't need to maintain and handle the search tree or graph, as it only keeps a single current state.
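A minimal hill-climbing sketch maximizing a one-dimensional function by comparing only immediate neighbors (the objective function and step size are made up):

    def f(x):
        return -(x - 3) ** 2 + 9          # toy objective with its peak at x = 3

    def hill_climb(x, step=0.1):
        while True:
            neighbors = [x - step, x + step]
            best = max(neighbors, key=f)   # look only at immediate neighbors
            if f(best) <= f(x):            # no neighbor is higher: peak reached
                return x
            x = best

    print(hill_climb(0.0))                 # ~3.0 (a local, here global, maximum)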
Features of Hill Climbing
Greedy approach: the hill-climbing search moves in the direction which optimizes the cost.
Types of Hill Climbing Algorithm
1) Simple hill climbing
2) Steepest-ascent hill climbing
3) Stochastic hill climbing
1. Simple Hill Climbing:
Simple hill climbing is the simplest way to implement a hill climbing algorithm. It evaluates only one neighbor node state at a time and selects the first one which optimizes the current cost, setting it as the current state. It checks only one successor state; if that successor is better than the current state, it moves, else it stays in the same state.
This algorithm has the following features:
Less time consuming
Less optimal; the solution is not guaranteed
3. Stochastic hill climbing
Stochastic hill climbing chooses at random from among the uphill moves.
Stochastic hill climbing does not examine all its neighbors before moving. Rather, it selects one neighbor node at random and decides whether to move to it or examine another.
The probability of selection can vary with the steepness of the uphill move.
Stochastic hill climbing usually converges more slowly than steepest ascent, but in some state landscapes it finds better solutions.
Stochastic hill climbing is NOT complete, but it may be less likely to get stuck.
Hill Climbing Algorithm Example
Local Maximum and Local Minimum
8-puzzle: a solution case
8-puzzle: stuck at local maximum
Blocks World Problem with Local Heuristic Function
+1 for each block that is resting on the thing it is supposed to be resting on.
-1 for each block that is resting on the wrong thing.

Initial stack (top to bottom): A, D, C, B
Goal stack (top to bottom): D, C, B, A
Blocks World Problem with Global Heuristic Function
For each block that has the correct support structure: +1 to every block in the support structure.
For each block that has the wrong support structure: -1 to every block in the support structure.

Initial stack (top to bottom): A, D, C, B
Goal stack (top to bottom): D, C, B, A
Initial stack (top to bottom): A, H, G, F, E, D, C, B
Goal stack (top to bottom): H, G, F, E, D, C, B, A
Solutions to Overcome Problems in Hill Climbing
1. Local maximum:
Utilize the backtracking
technique. Maintain a list of
visited states. If the search
reaches an undesirable state, it
can backtrack to the previous
configuration and explore a new
path.
2. Plateau: Make a big jump.
Randomly select a state far away
from the current state. Chances
are that we will land in a non-
plateau region.
3. Ridge: In this kind of obstacle, move in several directions at once (apply two or more rules before testing a state).
Hill Climbing Search Example with Local Minimum

Game Playing
Both players try to win the game, so both of them try to make the best move possible at each turn. Searching techniques like BFS (breadth-first search) are not accurate for this, as the branching factor is very high.
Components of a Game Playing Problem
Initial state: This defines the initial configuration of the game and identifies the first player to move.
Successor function: This identifies the possible states that can be achieved from the current state. It returns a list of (move, state) pairs, each indicating a legal move and the resulting state.
Goal test: This checks whether a given state is a goal state or not. States where the game ends are called terminal states.
Path cost / utility / payoff function: This gives a numeric value for the terminal states. In chess, the outcome is a win, loss, or draw, with values +1, -1, or 0. Some games have a wider range of possible outcomes.
Characteristics of Game Playing
Unpredictable opponent: Generally we cannot predict the behavior of the opponent. Thus we need to find a solution in the form of a strategy, specifying a move for every possible opponent move or every possible state.
Mini-Max Algorithm
The mini-max algorithm is a recursive or backtracking algorithm used in decision-making and game theory. It provides an optimal move for the player, assuming that the opponent is also playing optimally.
In this algorithm, two players play the game; one is called MAX and the other is called MIN.
Both players fight it out, such that the opponent player gets the minimum benefit while they get the maximum benefit.
Both players are opponents of each other: MAX selects the maximized value and MIN selects the minimized value.
The minimax algorithm proceeds all the way down to the terminal nodes of the tree, then backs the values up through the tree as the recursion unwinds.
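A minimal minimax sketch over a hand-made game tree (the tree shape and leaf payoffs below are illustrative):

    # Leaves are payoff numbers; internal nodes are lists of children.
    tree = [[3, 5], [2, 9], [0, 7]]        # depth-2 game, MAX to move at the root

    def minimax(node, is_max):
        if not isinstance(node, list):     # terminal state: return its utility
            return node
        values = [minimax(child, not is_max) for child in node]
        return max(values) if is_max else min(values)

    print(minimax(tree, True))  # 3: MAX picks the branch whose MIN value is best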
Properties of the Mini-Max Algorithm
Complete – The min-max algorithm is complete: it will definitely find a solution (if one exists) in a finite search tree.
Optimal – The min-max algorithm is optimal if both opponents are playing optimally.
Time complexity – As it performs a DFS of the game tree, the time complexity of min-max is O(b^m), where b is the branching factor of the game tree and m is the maximum depth of the tree.
Space complexity – The space complexity of the mini-max algorithm is also similar to DFS, which is O(b·m).
Techniques used for Knowledge Representation
Logic: It is the basic method used to represent the knowledge of a
machine. The term logic means to apply intelligence over the
stored knowledge.
Semantic Networks: The technique is based on storing the
knowledge into the system in the form of a graph. Nodes of a
graph represent the objects which exist in the real world, and the
arrow represents the relationship between these objects. Such
techniques show the connectivity of one object with another
object.
For example, Consider the given knowledge stored in a machine
⚫ Ram has a cycle.
⚫ Ram is a boy.
⚫ Cycle has a bell.
⚫ Ram is 12 years old.
⚫ Cycle has two paddles.
Frames: In this technique, knowledge is stored via slots and fillers, much as we store an employee's information in a DBMS table.
The slots are the entities and the fillers are their attributes; together they are stored in a frame. Whenever there is a requirement, the machine infers the necessary information from the frame to take the decision.
For example: Tomy is a dog having one tail.
Different Types of Sentences
1. Declarative sentences declare something, like “Gandhinagar is
the capital of Gujarat.”
2. Exclamatory sentences are emotional expressions, such as
“Happy birthday!”
3. Interrogative sentences ask questions, like “what time is it?”
4. Imperative sentences give commands, such as “turn right at the
traffic light.”
Knowledge Representation using Propositional Logic
Propositional logic (PL) is the simplest form of logic where all
the statements are made by propositions.
A proposition is a declarative statement which is either true or
false.
It is a technique of knowledge representation in logical and
mathematical form.
Examples:
a) It is Sunday.
b) The Sun rises from the West. (false proposition)
c) 3 + 3 = 7 (false proposition)
d) 5 is a prime number.
Some basic facts about propositional logic
❖ Propositional logic is also called Boolean logic, as it works on 0 and 1.
❖ In propositional logic, we use symbolic variables to represent the logic, and we can use any symbol to represent a proposition, such as A, B, C, P, Q, R, etc.
❖ Propositions can be either true or false, but not both.
❖ Propositional logic consists of objects, relations or functions, and logical connectives.
❖ A proposition formula which is always true is called a tautology; it is also called a valid sentence.
❖ A proposition formula which is always false is called a contradiction.
❖ Statements which are questions, commands, or opinions, such as "Where is Rohini?" or "How are you?", are not propositions.
❖ There are two types of propositions:
1. Atomic propositions
2. Compound propositions
Logical Connectives
1. Negation: A sentence such as ¬P is called the negation of P. A literal can be either a positive literal or a negative literal.
2. Conjunction: A sentence which has the ∧ connective, such as P ∧ Q, is called a conjunction.
Example: "Rohan is intelligent and hardworking." It can be written as:
P = Rohan is intelligent, Q = Rohan is hardworking → P ∧ Q.
3. Disjunction: A sentence which has the ∨ connective, such as P ∨ Q, is called a disjunction, where P and Q are propositions.
Example: "Ritika is a doctor or an engineer."
Here P = Ritika is a doctor, Q = Ritika is an engineer, so we can write it as P ∨ Q.
4. Implication: A sentence such as P → Q is called an implication. Implications are also known as if-then rules. Example: "If it is raining, then the street is wet."
Let P = It is raining and Q = The street is wet; it is represented as P → Q.
5. Biconditional: A sentence such as P ⇔ Q, read "P if and only if Q".
Precedence of Connectives
Logical Equivalence
Logical equivalence is one of the features of propositional logic. Two propositions are said to be logically equivalent if and only if their columns in the truth table are identical to each other.
Take two propositions A and B; for logical equivalence, we write A⇔B. In the truth table below we can see that the columns for ¬A ∨ B and A → B are identical, hence A → B is equivalent to ¬A ∨ B.
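This column-by-column check is easy to mechanize; a small sketch that enumerates the truth table for A → B and ¬A ∨ B and compares them:

    from itertools import product

    # Verify A → B ≡ ¬A ∨ B by comparing truth-table columns
    for a, b in product([False, True], repeat=2):
        col1 = (not a) or b            # ¬A ∨ B
        col2 = b if a else True        # A → B: false only when A is true and B false
        print(a, b, col1, col2, col1 == col2)   # last column is True in every row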
Properties of Operators
Commutativity:
• P ∧ Q = Q ∧ P
• P ∨ Q = Q ∨ P
Associativity:
• (P ∧ Q) ∧ R = P ∧ (Q ∧ R)
• (P ∨ Q) ∨ R = P ∨ (Q ∨ R)
Identity element:
• P ∧ True = P
• P ∨ True = True
Distributivity:
• P ∧ (Q ∨ R) = (P ∧ Q) ∨ (P ∧ R)
• P ∨ (Q ∧ R) = (P ∨ Q) ∧ (P ∨ R)
De Morgan's laws:
• ¬(P ∧ Q) = (¬P) ∨ (¬Q)
• ¬(P ∨ Q) = (¬P) ∧ (¬Q)
Double-negation elimination:
• ¬(¬P) = P
❑ Which of the following is the contrapositive of ‘if two triangles
are identical, then these are similar’?
A) If two triangles are not similar, then they are not identical.
B) If two triangles are not identical, then these are not similar.
C) If two triangles are not identical, then these are similar.
D) If two triangles are not similar, then these are identical.
Solution:
Consider the following statements
p: Two triangles are identical.
q: Two triangles are similar.
Clearly, the given statement in symbolic form is p ⇒ q.
Therefore, its contrapositive is given by ∼q ⇒ ∼p
Now,
∼p: Two triangles are not identical.
∼q: Two triangles are not similar.
∴ ~q ⇒ ~p: If two triangles are not similar, then they are not identical.
The statement ~(p↔ ~q) is
A) equivalent to p↔q
B) equivalent to ~p↔q
C) a tautology
D) a fallacy
Deduction
In AI we need to create new facts from existing facts. In propositional logic, this process is called deduction.
Given two presumably true statements, we can deduce a new true statement.
The first two sentences are called premises, and the deduced sentence is called the conclusion; the whole is called an argument.
One way to find whether an argument is valid is to create a truth table for the premises and the conclusion.
A conclusion is invalid if we can find a counterexample case: a case in which both premises are true but the conclusion is false.
An argument is valid if no counterexample can be found.
The only row to be checked is the second row. This row does not
show a counterexample, so the argument is valid.
Here row 2 and row 4 need to be checked. Although row 4 is ok,
row 2 shows a counterexample (two true premises result in a false
conclusion).
Limitations of Propositional Logic
❖ Propositional logic has limited expressive power. We cannot represent relations like "all", "some", or "none" with propositional logic. Examples:
• All the girls are intelligent.
• Some apples are sweet.
First-Order logic
First-order logic is another way of representing knowledge in artificial intelligence. It is an extension of propositional logic.
First-order logic is also known as predicate logic or first-order predicate logic.
FOL is sufficiently expressive to represent natural-language statements in a concise way.
First-order logic is a powerful language that describes information about objects easily and can also express the relationships between those objects.
First-order logic (like natural language) does not only assume that the world contains facts, as propositional logic does, but also assumes the following things in the world:
• Objects: A, B, people, numbers, colors, wars, etc.
• Relations: unary relations or n-ary relations
• Functions: father of, best friend, third inning of, end of, ...
First-order logic statements can be divided into
two parts:
As a natural language, first-order logic also has two main parts:
• Syntax
• Semantics
Syntax of First-Order logic:
The basic syntactic elements of first-order logic are symbols.
The syntax of FOL determines which collection of symbols is a
logical expression in first-order logic.
Basic Elements of First-order logic
Atomic sentences:
Atomic sentences are the most basic sentences of first-order logic.
These sentences are formed from a predicate symbol followed
by a parenthesis with a sequence of terms.
We can represent atomic sentences as Predicate(term1, term2, ..., termN).
Example:
Ravi and Ajay are brothers: => Brothers(Ravi, Ajay).
Chinky is a cat: => cat (Chinky).
Complex Sentences:
Complex sentences are made by combining atomic sentences
using connectives.
P1: Rani is Neetu's mother.
P2: Neetu is Meethi's mother.
Universal Quantifier:
Universal quantifier is a symbol of logical representation, which
specifies that the statement within its range is true for everything
or every instance of a particular thing.
The Universal quantifier is represented by a symbol ∀, which
resembles an inverted A.
If x is a variable, then ∀x is read as:
• For all x
• For each x
• For every x.
Example: "All men drink coffee."
∀x man(x) → drink(x, coffee)
It will be read as: for every x, if x is a man, then x drinks coffee.
Existential Quantifier:
The existential quantifier, represented by the symbol ∃, specifies that the statement within its scope is true for at least one instance. If x is a variable, then ∃x is read as:
• There exists an 'x'
• For some 'x'
• For at least one 'x'
Example:
⚫ Some boys are intelligent.
⚫ ∃x: boys(x) ∧ intelligent(x)
⚫ It will be read as: there is some x such that x is a boy and x is intelligent.
1. All birds fly.
In this question the predicate is "fly(bird)", and since it says that all birds fly, it will be represented as follows:
∀x bird(x) → fly(x)
4. Not all students like both Math and Science.
Substitution:
Substitution is a fundamental operation performed on terms and formulas. It is complex in the presence of quantifiers in FOL. If we write F[a/x], it refers to substituting the constant "a" in place of the variable "x".
Inference Rules
FOL Inference Rules for Quantifiers
1) Universal Generalization
2) Universal Instantiation
3) Existential Instantiation
4) Existential introduction or Generalization
1. Universal Generalization:
Universal generalization is a valid inference rule which states that if premise P(c) is true for any arbitrary element c in the universe of discourse, then we can conclude ∀x P(x).
It can be represented as: P(c) ⟹ ∀x P(x).
This rule can be used if we want to show that every element has a similar property.
In this rule, x must not appear as a free variable.
Example: Let P(c) be "A byte contains 8 bits". From this, ∀x P(x), "All bytes contain 8 bits", will also be true.
2. Universal Instantiation:
Universal instantiation, also called universal elimination (UI), is a valid inference rule. It can be applied multiple times to add new sentences.
As per UI, we can infer any sentence obtained by substituting a ground term (a term without variables, such as John in Likes(John, Jay)) for the variable.
The UI rule states that we can infer any sentence P(c) by substituting a ground term c from ∀x P(x), for any object c in the universe of discourse.
It can be represented as: ∀x P(x) ⟹ P(c).
Example: If "every person likes ice-cream", ∀x P(x), we can infer that "John likes ice-cream", P(John).
The new KB is logically equivalent to the previous KB.
Another example: "All kings who are greedy are evil":
∀x king(x) ∧ greedy(x) → Evil(x)
From this information, we can infer, using universal instantiation:
King(John) ∧ Greedy(John) → Evil(John)
3. Existential Instantiation:
Existential instantiation, also called existential elimination, is a valid inference rule in first-order logic.
It can be applied only once to replace the existential sentence.
This rule states that one can infer P(c) from a formula of the form ∃x P(x), for a new constant symbol c.
The restriction with this rule is that the c used must be a new term for which P(c) is true.
It can be represented as: ∃x P(x) ⟹ P(c), for a fresh constant c.
4. Existential introduction
Existential introduction, also known as existential generalization, is a valid inference rule in first-order logic.
This rule states that if some element c in the universe of discourse has a property P, then we can infer that there exists something in the universe which has the property P.
It can be represented as: P(c) ⟹ ∃x P(x).
Example:
"Priyanka got good marks in English."
"Therefore, someone got good marks in English."
❖ If there is any woman who uses any medication then that
may participate in our drugs test. There is a woman who
uses some medication.
❖ Thus, someone may participate in our drug test.
Unification
Unification is used in logical programming where the aim is to
make expressions look identical to each other, and in order to
make them identical, we generally use substitution.
Algorithm for Unification (of two expressions Ψ1 and Ψ2)
Step 1: If Ψ1 or Ψ2 is a variable or constant, then:
a) If Ψ1 and Ψ2 are identical, then return NIL.
b) Else if Ψ1 is a variable: if Ψ1 occurs in Ψ2 then return FAILURE, else return {(Ψ2/Ψ1)}.
c) Else if Ψ2 is a variable: if Ψ2 occurs in Ψ1 then return FAILURE, else return {(Ψ1/Ψ2)}.
d) Else return FAILURE.
Step 2: If the initial predicate symbols in Ψ1 and Ψ2 are not the same, then return FAILURE.
Step 3: If Ψ1 and Ψ2 have a different number of arguments, then return FAILURE.
Step 4: Set the substitution set (SUBST) to NIL.
Step 5: For i = 1 to the number of elements in Ψ1:
a) Call the Unify function with the i-th element of Ψ1 and the i-th element of Ψ2, and put the result into S.
b) If S = FAILURE then return FAILURE.
c) If S ≠ NIL then: apply S to the remainder of both Ψ1 and Ψ2, and set SUBST = APPEND(S, SUBST).
Step 6: Return SUBST.
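A compact Python sketch of this algorithm. The representation choices are mine, not from the slide: predicates and functions are nested tuples, and uppercase strings are variables (matching X, Y, Z in the examples below):

    def is_var(t):
        return isinstance(t, str) and t[0].isupper()   # convention: uppercase = variable

    def substitute(t, s):
        if is_var(t) and t in s:
            return substitute(s[t], s)
        if isinstance(t, tuple):
            return tuple(substitute(a, s) for a in t)
        return t

    def occurs(v, t):
        if v == t:
            return True
        return isinstance(t, tuple) and any(occurs(v, a) for a in t)

    def unify(x, y, s=None):
        s = {} if s is None else s
        x, y = substitute(x, s), substitute(y, s)
        if x == y:
            return s                                    # identical: no new binding
        if is_var(x):
            return None if occurs(x, y) else {**s, x: y}   # bind with occurs check
        if is_var(y):
            return None if occurs(y, x) else {**s, y: x}
        if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
            for a, b in zip(x, y):                      # same symbol/arity: unify args
                s = unify(a, b, s)
                if s is None:
                    return None                         # FAILURE
            return s
        return None                                     # FAILURE

    # Example 1 from the slides: p(b, X, f(g(Z))) and p(Z, f(Y), f(Y))
    print(unify(('p', 'b', 'X', ('f', ('g', 'Z'))),
                ('p', 'Z', ('f', 'Y'), ('f', 'Y'))))
    # -> {'Z': 'b', 'X': ('f', 'Y'), 'Y': ('g', 'b')}, i.e. X = f(g(b))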
1. Find the MGU of {p(b, X, f(g(Z))) and p(Z, f(Y), f(Y))}
Here, Ψ1 = p(b, X, f(g(Z))) and Ψ2 = p(Z, f(Y), f(Y))
S0 => {p(b, X, f(g(Z))); p(Z, f(Y), f(Y))}
SUBST θ = {b/Z}
S1 => {p(b, X, f(g(b))); p(b, f(Y), f(Y))}
SUBST θ = {f(Y)/X}
S2 => {p(b, f(Y), f(g(b))); p(b, f(Y), f(Y))}
SUBST θ = {g(b)/Y}
S3 => {p(b, f(g(b)), f(g(b))); p(b, f(g(b)), f(g(b)))}
Successfully unified. Unifier: {b/Z, f(g(b))/X, g(b)/Y}.
5. Find the MGU of {Q(a, g(x, a), f(y)) and Q(a, g(f(b), a), x)}
Here, Ψ1 = Q(a, g(x, a), f(y)) and Ψ2 = Q(a, g(f(b), a), x)
S0 => {Q(a, g(x, a), f(y)); Q(a, g(f(b), a), x)}
SUBST θ = {f(b)/x}
S1 => {Q(a, g(f(b), a), f(y)); Q(a, g(f(b), a), f(b))}
SUBST θ = {b/y}
S2 => {Q(a, g(f(b), a), f(b)); Q(a, g(f(b), a), f(b))}
Successfully unified. Unifier: {f(b)/x, b/y}.
Example:
1. John likes all kinds of food.
2. Apples and vegetables are food.
3. Anything that anyone eats without being killed is food.
4. Anil eats peanuts and is still alive.
5. Harry eats everything that Anil eats.
Prove by resolution that:
6. John likes peanuts.
Step-1: Conversion of Facts into FOL
In the first step we will convert all the given statements into first-order logic.
Step-2: Conversion of FOL into CNF
• Move negation (¬) inwards and rewrite:
• ∀x ¬food(x) V likes(John, x)
• food(Apple) Λ food(vegetables)
• ∀x ∀y ¬eats(x, y) V killed(x) V food(y)
• eats(Anil, Peanuts) Λ alive(Anil)
• ∀x ¬eats(Anil, x) V eats(Harry, x)
• ∀x killed(x) V alive(x)
• ∀x ¬alive(x) V ¬killed(x)
• likes(John, Peanuts)
1) ¬food(x) V likes(John, x)
2) food(Apple)
3) food(vegetables)
4) ¬eats(y, z) V killed(y) V food(z)
5) eats(Anil, Peanuts)
6) alive(Anil)
7) ¬eats(Anil, w) V eats(Harry, w)
8) killed(g) V alive(g)
9) ¬alive(k) V ¬killed(k)
10) likes(John, Peanuts)
Exercises
Statement-1: "If I am sleepy then I go to bed"
Statement-2: "I am sleepy"
Conclusion: "I go to bed."
Statement-1: "If I am sleepy then I go to bed" ==> P→ Q
Statement-2: "I am sleepy" ==> P
Conclusion: "I go to bed." ==> Q.
Statement-1: Today is Sunday or Monday.
Statement-2: Today is not Sunday.
Conclusion: Today is Monday.
Cats like fish.
Cats eat everything they like.
Mani is a cat.
Prove: Mani eats fish.
⚫ All people who are graduating are happy. All happy
people smile. Someone is graduating.
⚫ Prove: Someone is smiling.
Marcus was a man. Marcus was a Pompeian. All
Pompeians were Romans. Caesar was a ruler. All Romans
were either loyal to Caesar or hated him. Everyone is loyal
to someone. People only try to assassinate rulers they
aren't loyal to. Marcus tried to assassinate Caesar.
Prove: Marcus hates Caesar.
Deep Learning Visualization
https://poloclub.github.io/cnn-explainer/
Unit – III: NEURAL NETWORKS
❖ Characteristics of Neural Networks, Historical Development of
Neural Networks Principles. Artificial Neural Networks:
Terminology, Models of Neuron, Topology, Basic Learning Laws,
Pattern Recognition Problem, Basic Functional Units, Pattern
Recognition Tasks by the Functional Units.
Artificial Intelligence, Machine Learning and Deep Learning
Biological Neural Networks
A nerve cell (neuron) is a special biological cell that processes information.
Dendrites − They are tree-like branches, responsible for receiving information from the other neurons the neuron is connected to. In a sense, they are like the ears of the neuron.
Artificial Neural Networks (ANN)
Artificial neural networks, usually simply called neural networks, are computing systems (mathematical functions) inspired by biological neural networks.
Most artificial neurons have three things –
• an input transformation function
• an activation function
• an output transformation function
https://www.digitaltrends.com/cool-tech/what-is-an-artificial-neural-network/
Biological Neural Network (BNN) and Artificial Neural Network (ANN)
Soma → Node
Dendrites → Input
Synapse → Weights or interconnections
Axon → Output
Historical Development of Neural Networks
Six Jars of Deep Learning
Data - Data Everywhere
Data Curation
Data Augmentation
Data augmentation is an integral process in deep learning: deep learning needs large amounts of data, and in some cases it is not feasible to collect thousands or millions of images, so data augmentation comes to the rescue.
Operations in Data Augmentation
1) Rotation: Rotation operation as the name suggests, just
rotates the image by a certain specified degree.
2) Zooming: Zooming operation allows us to either zoom in or
zoom out.
3) Cropping: Cropping allows us to crop the image or select a
particular area from an image.
4) Flipping: Flipping allows us to flip the orientation of the
image. We can use horizontal or vertical flip.
5) Changing the brightness level: This feature helps us combat illumination changes.
You can encounter a scenario where most of your dataset comprises images with a similar brightness level, e.g., images of employees entering the office. By augmenting the images we make sure that our model is robust and is able to detect the person even in different lighting conditions. (See the sketch below.)
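A sketch of these operations with Keras's ImageDataGenerator (assuming TensorFlow/Keras is installed; the parameter values, folder path, and image size are illustrative):

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    augmenter = ImageDataGenerator(
        rotation_range=30,            # 1) rotate by up to +/-30 degrees
        zoom_range=0.2,               # 2) zoom in/out by up to 20%
        width_shift_range=0.1,        # 3) crop-like translations
        height_shift_range=0.1,
        horizontal_flip=True,         # 4) flip the orientation
        brightness_range=(0.7, 1.3),  # 5) vary the brightness level
    )

    # Usage sketch: stream augmented batches from an image folder
    # batches = augmenter.flow_from_directory("data/train", target_size=(224, 224))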
What do you want to do with the data?
Task
Task (From description to structure specification)
Task (From description + review to write FAQs)
Task (From description + review + FAQs to Question Answering)
Task (From image identify people)
Task (From image identify activity)
Task (Classification)
Task (Regression)
Task (Clustering)
What is the mathematical formulation of a task?
Models
How do we know which model is better?
Loss Function
How do we identify the parameters of the model?
Learning Algorithm
How do we score our model?
Evaluation

Class → Label:
Lion → 1
Tiger → 2
Cat → 3
Giraffe → 4
Dog → 5

How is evaluation different from the loss function?
Artificial Neural Network
Common Terminology in ANN
Sample – A single row of a dataset.
Epoch – The number of times the algorithm runs on the whole training dataset.
Batch – The number of samples taken together to update the model parameters.
Learning rate – A parameter that gives the model a scale of how much the model weights should be updated; its value is between 0 and 1.
Cost function/loss function – A cost function is used to calculate the cost, that is, the difference between the predicted value and the actual value.
Weights/bias – The learnable parameters in a model that control the signal between two neurons.
Weight increases the steepness of the activation function: the weight decides how fast the activation function triggers, whereas bias delays the triggering of the activation function.
In the absence of bias, the model trains only over functions passing through the origin, which is not in accordance with real-world scenarios. With the introduction of bias, the model becomes more flexible.
How does the function behave when we change w and b?
1. w (controls the slope):
a. Negative w: negative slope, mirrored S-shape; the curve becomes steeper (more vertical, less smooth) the more negative w gets.
b. Positive w: positive slope, normal S-shape; the curve becomes steeper the more positive w gets.
Connections
Multi-layer recurrent network: In this network, a processing element's output can be directed to processing elements in the same layer and in the preceding layer, forming a multilayer recurrent network.
They perform the same task for every element of a sequence, with the output depending on the previous computations. Inputs are not needed at each time step. The main feature of a recurrent neural network is its hidden state, which captures some information about the sequence.
Learning Rules
A learning rule or learning process is a method or mathematical logic that improves the artificial neural network's performance; the rule is applied repeatedly over the network.
Learning rules update the weights and bias levels of a network as the network simulates in a specific data environment.
1) Hebbian learning rule
2) Perceptron learning rule
3) Delta learning rule
4) Correlation learning rule
5) Outstar learning rule
Hebbian Learning Rule
The Hebbian rule was the first learning rule. In 1949 Donald Hebb developed it as a learning algorithm for unsupervised neural networks.
The Hebb learning rule assumes that if two neighboring neurons are activated and deactivated at the same time, the weight connecting them should increase.
For neurons operating in opposite phases, the weight between them should decrease.
If there is no signal correlation, the weight should not change.
When the inputs of both nodes are either positive or negative, a strong positive weight exists between the nodes.
If the input of one node is positive and the other's is negative, a strong negative weight exists between the nodes.
At the start, the values of all weights are set to zero. This learning rule can be used for both soft- and hard-activation functions.
Implementation of AND Gate with Hebbian Learning Rule
Set all weights to zero, wi = 0 for i = 1 to n, and bias to zero.
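A sketch of this exercise in Python, training on the bipolar AND-gate patterns with the standard Hebb update Δwi = xi·y and Δb = y:

    # Bipolar AND-gate training pairs: inputs (x1, x2) -> target y
    samples = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, -1)]

    w1 = w2 = b = 0                       # all weights and bias start at zero
    for x1, x2, y in samples:
        w1 += x1 * y                      # Hebb update: dw_i = x_i * y
        w2 += x2 * y
        b += y
        print(w1, w2, b)                  # weights after each pattern

    # Final weights (2, 2, -2) realize AND: sign(2*x1 + 2*x2 - 2)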
Perceptron Learning Rule
This rule is an error-correcting, supervised learning algorithm for single-layer feedforward networks with a linear activation function.
Being supervised in nature, it calculates the error by comparing the desired/target output with the actual output. If a difference is found, a change is made to the weights of the connections.
The perceptron network consists of three units: the sensory unit (input unit), the associator unit (hidden unit), and the response unit (output unit).
The input units are connected to hidden units with fixed weights having values 1, 0, or -1, which are assigned at random. The binary activation function is used in the hidden layer.
When an error occurs, the weights are updated as w(new) = w(old) + α·t·x, where "t" is the target (+1 or -1) and α is the learning rate.
Implement the AND function using perceptron networks for bipolar inputs and targets.
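A sketch of this exercise in Python, applying the update w(new) = w(old) + α·t·x only when the prediction is wrong (α = 1 assumed):

    samples = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, -1)]
    w1 = w2 = b = 0.0
    alpha = 1.0

    def predict(x1, x2):
        net = w1 * x1 + w2 * x2 + b
        return 1 if net > 0 else -1          # bipolar threshold activation

    for epoch in range(10):
        errors = 0
        for x1, x2, t in samples:
            if predict(x1, x2) != t:          # update only on error
                w1 += alpha * t * x1
                w2 += alpha * t * x2
                b += alpha * t
                errors += 1
        if errors == 0:                       # converged: all patterns correct
            break

    print(w1, w2, b)                          # a separating line for AND, e.g. (1, 1, -1)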
Gradient Descent
Gradient descent is an iterative optimization algorithm for finding the minimum of a function; here that function is our loss function.
The gradient of the loss function is the vector of partial derivatives with respect to w and b.
Picture a person walking downhill: he takes large steps when the slope is steep and small steps when the slope is gentle, and he decides his next position based on his current position and the slope at that point.
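A sketch of gradient descent fitting y = w·x + b under mean-squared-error loss, with updates w = w - lr·dL/dw and b = b - lr·dL/db (the data points and learning rate below are made up):

    # Toy data roughly on the line y = 2x + 1
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.1, 4.9, 7.2]

    w, b, lr = 0.0, 0.0, 0.05
    n = len(xs)

    for step in range(2000):
        # Gradients of MSE loss L = (1/n) * sum((w*x + b - y)^2)
        dw = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        db = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * dw          # step downhill, proportional to the slope
        b -= lr * db

    print(w, b)               # approaches 2.0 and 1.0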
Delta Learning Rule (Widrow-Hoff Rule)
The perceptron learning rule originates from the Hebbian assumption, while the delta rule is derived from the gradient-descent method.
The delta rule updates the weights between the connections so as to minimize the difference between the net input to the output unit and the target value.
The major aim is to minimize the error over all training patterns. This is done by reducing the error for each pattern, one at a time.
The delta rule (DR) is similar to the perceptron learning rule (PLR), with some differences:
1. The error (δ) in DR is not restricted to values of 0, 1, or -1 (as in PLR), but may have any value.
2. DR can be derived for any differentiable output/activation function f, whereas PLR only works for a threshold output function.
Back propagation Network
The back-propagation learning algorithm is one of the most important developments in neural networks.
This learning algorithm is applied to multilayer feed-forward networks consisting of processing elements with continuous, differentiable activation functions.
The basic concept behind this weight-update algorithm is simply the gradient-descent method: the error is propagated back to the hidden units. Back-propagation is a training algorithm.
The training of a BPN is done in three stages: the feed-forward of the input training pattern, the calculation and back-propagation of the error, and the updating of weights.
A back-propagation neural network is a multilayer, feed-forward neural network consisting of an input layer, a hidden layer, and an output layer.
The neurons present in the hidden and output layers have biases, which are the connections from units whose activation is always 1. The bias terms also act as weights.
During the back-propagation phase of learning, signals are sent in the reverse direction. The inputs sent to the BPN and the output obtained from the net could be either binary (0, 1) or bipolar (-1, +1). The activation function can be any function which increases monotonically and is differentiable.
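A compact sketch of the three training stages (feed-forward, error back-propagation, weight update) on the XOR problem, using numpy and sigmoid activations (the network size, learning rate, and epoch count are my choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    T = np.array([[0], [1], [1], [0]])               # XOR targets

    W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # input -> hidden
    W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # hidden -> output

    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    lr = 0.5

    for epoch in range(10000):
        # Stage 1: feed-forward
        H = sigmoid(X @ W1 + b1)
        Y = sigmoid(H @ W2 + b2)
        # Stage 2: back-propagate the error (gradients of squared error)
        dY = (Y - T) * Y * (1 - Y)
        dH = (dY @ W2.T) * H * (1 - H)
        # Stage 3: update weights and biases by gradient descent
        W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
        W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

    print(Y.round(2).ravel())   # approaches [0, 1, 1, 0]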
Activation Function
It helps to determine whether the neuron will fire or not.
The purpose of an activation function is to add non-linearity to the neural network.
If a neural network worked without activation functions, every neuron would perform only a linear transformation on the inputs using the weights and biases. It would not matter how many hidden layers we attached: all layers would behave in the same way, because the composition of two linear functions is itself a linear function.
342
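To see this concretely, here is a small hedged demo (the matrix sizes are chosen arbitrarily) showing that two stacked linear layers collapse into a single linear layer:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)                                   # an arbitrary input vector
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)     # "layer 1"
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)     # "layer 2"

# Two linear layers applied in sequence...
deep = W2 @ (W1 @ x + b1) + b2
# ...equal one linear layer with W = W2 W1 and b = W2 b1 + b2.
shallow = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(deep, shallow))                        # True: no depth was gained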
Activation Function Types
1) Linear Function
2) Binary Step Function
3) Non-Linear Function
Linear Function
343
Mathematically it can be represented as f(x) = x (or, more generally, f(x) = cx for some constant c).
345
Binary Step Function
Pros
Best for perceptrons and single-layer networks.
Cons
Real life isn't that black and white, and most situations are not this binary.
Can only be used at the end of a network.
It cannot provide multi-value outputs; for example, it cannot be used for multi-class classification problems.
The gradient of the step function is zero, which hinders the backpropagation process.
346
Non-linear Activation function
347
Sigmoid Function
These are also called S-shaped functions. Logistic and hyperbolic tangent functions are the commonly used sigmoid functions.
There are two types of sigmoid functions.
Binary Sigmoid function (or Logistic function)
Bipolar Sigmoid function (or Hyperbolic Tangent Function or
Tanh)
351
352
Tanh also faces the problem of vanishing gradients, similar to the sigmoid activation function, although the gradient of the tanh function is much steeper than that of the sigmoid.
353
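Since the formula slides are image-only, here is a hedged sketch of both sigmoid functions and their gradients, illustrating the vanishing-gradient behaviour for large |x|:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # binary sigmoid, range (0, 1)

def tanh(x):
    return np.tanh(x)                    # bipolar sigmoid, range (-1, 1)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)                   # peaks at 0.25 when x = 0

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2           # peaks at 1.0 when x = 0 (steeper)

for x in [0.0, 2.0, 5.0]:
    print(x, sigmoid_grad(x), tanh_grad(x))   # both gradients shrink toward 0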
ReLu (Rectified Linear Unit) Activation
Function
The main catch here is that the ReLU function does not activate
all the neurons at the same time.
Neurons are deactivated only if the output of the linear transformation is less than 0.
354
355
Leaky ReLU Function
Leaky ReLU is an improved version of the ReLU function that addresses the dying-ReLU problem: it has a small positive slope in the negative region.
The advantages of Leaky ReLU are the same as those of ReLU, with the addition that it enables backpropagation even for negative input values.
By making this minor modification for negative input values,
the gradient of the left side of the graph comes out to be a non-
zero value. Therefore, we would no longer encounter dead
neurons in that region.
356
357
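A hedged sketch of both functions; the 0.01 negative slope is the commonly used default, assumed here rather than taken from the slides:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)               # 0 for x < 0, identity otherwise

def leaky_relu(x, slope=0.01):
    return np.where(x >= 0, x, slope * x)   # small non-zero slope for x < 0

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))         # [0.  0.  0.  2.]
print(leaky_relu(x))   # [-0.03  -0.005  0.  2.]: no dead region on the left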
Softmax Function
The softmax function is a function that turns a vector of K real
values into a vector of K real values that sum to 1.
The input values can be positive, negative, zero, or greater than
one, but the softmax transforms them into values between 0
and 1, so that they can be interpreted as probabilities.
If one of the inputs is small or negative, the softmax turns it
into a small probability, and if an input is large, then it turns it
into a large probability, but it will always remain between 0
and 1.
358
359
360
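A hedged implementation; subtracting the maximum before exponentiating is a standard numerical-stability trick, not mentioned on the slide:

import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))     # shift for numerical stability
    return e / e.sum()            # normalise so the outputs sum to 1

scores = np.array([2.0, 1.0, -1.0, 0.0])
p = softmax(scores)
print(p, p.sum())                 # probabilities in (0, 1) that sum to 1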
Choosing the Right One
1) Use the ReLU or Leaky ReLU function in hidden layers only.
2) In binary classification, remember to use the Sigmoid function in the output layer.
3) In multi-class classification (when there are more than 2 classes to predict), use the Softmax function in the output layer.
4) Due to the vanishing-gradient problem, the Sigmoid and Tanh activation functions are sometimes avoided in deep neural network architectures.
5) Always remember that you can also invent your own activation function and check its performance.
361
McCulloch-Pitts Neuron
362
It may be divided into two parts. The first part, g, takes the inputs and performs an aggregation; based on the aggregated value, the second part, f, makes a decision.
366
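A hedged sketch of the g/f split for an MP neuron; the threshold value b = 2 is an assumption for illustration:

def g(inputs):                 # aggregation: sum of the boolean inputs
    return sum(inputs)

def f(aggregated, b=2):        # decision: fire iff the sum reaches threshold b
    return 1 if aggregated >= b else 0

def mp_neuron(inputs, b=2):
    return f(g(inputs), b)

print(mp_neuron([1, 1, 0]))    # 1: threshold reached
print(mp_neuron([1, 0, 0]))    # 0: below threshold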
Data and Task
367
Loss Function
368
Learning Algorithm
369
Evaluation
370
Problems with MP
Neuron Model
1) Boolean Inputs.
2) Boolean Outputs.
3) Linear
4) Fixed slope.
5) Threshold b can take only a few possible
values.
6) All the inputs to the model have equal weights.
371
MP Neuron Summary
372
373
Calculate the net input for the network shown in the figure, with the bias included in the network.
374
375
376
Perceptron
The original Perceptron was designed to take a number of binary inputs and produce one binary output (0 or 1). It was later updated to accept real numbers as inputs.
The idea was to use different weights to represent the
importance of each input, and that the sum of the values
should be greater than a threshold value before making a
decision like true or false (0 or 1).
377
Single and Multi Layered
Perceptron Model
1. Single Layer Perceptron - The single layer perceptron is defined by its ability to linearly classify inputs. This kind of model utilizes only a single hyperplane and classifies the inputs according to the weights learned beforehand.
378
Data and Task
379
Model
380
Loss function
382
Perceptron Learning
Algorithm
383
384
Evaluation
385
386
MP Neuron v/s Perceptron
387
How Loss function is
different from Square Loss
388
389
Linearly Separable
Two sets of points (or classes) are called linearly separable if there exists a straight line in the plane such that all the points of one class lie on one side of the line and all the points of the other class lie on the other side.
390
Suppose we have a perceptron whose weights corresponding to the three inputs have the following values:
w1 = 2 ; w2 = −4; and w3 = 1
and the activation of the unit is given by the step-function:
φ(v) = 1 if v≥0 otherwise 0
Calculate the output value y of the given perceptron for each of
the following input patterns:
Pattern P1 P2 P3 P4
X1 1 0 1 1
X2 0 1 0 1
X3 0 1 1 1
391
Pattern P1 P2 P3 P4
w1 = 2 ; w2 = −4; and w3 = 1
X1 1 0 1 1
X2 0 1 0 1
X3 0 1 1 1
392
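A hedged check of this exercise in code, applying v = w1·x1 + w2·x2 + w3·x3 with no bias (the problem statement gives none) and the step activation above:

w = [2, -4, 1]
patterns = {"P1": [1, 0, 0], "P2": [0, 1, 1], "P3": [1, 0, 1], "P4": [1, 1, 1]}

for name, x in patterns.items():
    v = sum(wi * xi for wi, xi in zip(w, x))   # weighted sum of the inputs
    y = 1 if v >= 0 else 0                     # step activation
    print(name, v, y)   # P1: v=2 -> 1, P2: v=-3 -> 0, P3: v=3 -> 1, P4: v=-1 -> 0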
Consider a feed-forward neural network having 2 input nodes (labeled 1 and 2) with fully connected layers and 2 hidden layers:
Hidden layer-1: Nodes labeled as 3 and 4
Hidden layer-2: Nodes labeled as 5 and 6
A weight on the connection between nodes i and j is represented by wij; for example, w24 is the weight on the connection between nodes 2 and 4. The following list contains all the weight values used in the given network:
w13 = −2, w35 = 1, w23 = 3, w45 = −1, w14 = 4, w36 = −1, w24 = −1, w46 = 1
Each of the nodes 3, 4, 5, and 6 use the following activation function:
φ(v) = 1 if v≥0 otherwise 0
where v denotes the weighted sum of a node. Each of the input nodes (1 and
2) can only receive binary values (either 0 or 1). Calculate the output of the
network (y5 and y6) for the input pattern given by (node-1 and node-2 as 0, 0
respectively).
393
w13 = −2, w35 = 1, w23 = 3, w45 = −1, w14 = 4, w36 = −1, w24 = −1, w46 = 1
394
395
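A hedged walk-through of this exercise in code, with the node numbering and weights exactly as given above:

w = {"13": -2, "23": 3, "14": 4, "24": -1,    # inputs -> hidden layer 1
     "35": 1, "45": -1, "36": -1, "46": 1}    # hidden layer 1 -> hidden layer 2

def phi(v):                       # step activation used by nodes 3-6
    return 1 if v >= 0 else 0

x1, x2 = 0, 0                     # the given input pattern
y3 = phi(w["13"] * x1 + w["23"] * x2)   # v = 0 -> y3 = 1
y4 = phi(w["14"] * x1 + w["24"] * x2)   # v = 0 -> y4 = 1
y5 = phi(w["35"] * y3 + w["45"] * y4)   # v = 1 - 1 = 0 -> y5 = 1
y6 = phi(w["36"] * y3 + w["46"] * y4)   # v = -1 + 1 = 0 -> y6 = 1
print(y5, y6)                     # output: 1 1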
XOR with a Hidden Layer
396
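The diagram on this slide is image-only; one common hand-crafted solution (these weights are assumptions for illustration, not read off the slide) combines an OR-like and an AND-like hidden unit:

def step(v):
    return 1 if v >= 0 else 0

def xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # hidden unit 1 computes OR
    h2 = step(x1 + x2 - 1.5)      # hidden unit 2 computes AND
    return step(h1 - h2 - 0.5)    # output fires for OR-but-not-AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))    # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0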
Loss Function
The loss function in a neural network quantifies the difference
between the expected outcome and the outcome produced by
the machine learning model.
From the loss function, we can derive the gradients which are
used to update the weights. The average over all losses
constitutes the cost.
In simple words, the loss is used to calculate the gradients, and the gradients are used to update the weights of the neural net.
398
Binary Crossentropy
Cross-entropy-based loss functions are commonly used in
classification scenarios. Cross entropy is a measure of the
difference between two probability distributions.
If you are using BCE loss function, you just need one output
node to classify the data into two classes. The output value
should be passed through a sigmoid activation function and the
range of output is (0 – 1).
For rain prediction, say, if the output is greater than 0.5 the network classifies it as rain, and if the output is less than 0.5 it classifies it as not rain.
One important point: if you are using the BCE loss function, the output of the node should lie between 0 and 1, so you have to use a sigmoid activation function on your final output, since sigmoid maps any real value into the range (0, 1).
399
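A hedged numpy implementation of binary cross-entropy; the clipping by a small epsilon is an added safeguard against log(0), not mentioned on the slide:

import numpy as np

def binary_crossentropy(t, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)              # avoid log(0)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

targets = np.array([1.0, 0.0, 1.0])           # 1 = rain, 0 = no rain
sigmoid_outputs = np.array([0.9, 0.2, 0.6])   # one output node in (0, 1)
print(binary_crossentropy(targets, sigmoid_outputs))  # low loss: mostly correct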
Categorical Crossentropy
When we have a multi-class classification task, this is one of the loss functions you can choose. If you are using the CCE loss function, there must be the same number of output nodes as there are classes, and the final layer output should be passed through a softmax activation so that each node outputs a probability value between 0 and 1.
For feeding the target value at the time of training, we have to
one-hot encode them.
Basically, the target vector would be of the same size as the
number of classes and the index position corresponding to the
actual class would be 1 and all others would be zero.
400
Sparse Categorical Crossentropy
This loss function is almost the same as CCE except for one change.
When using the SCCE loss function, you do not need to one-hot encode the target vector. If the target image is of a cat you simply pass 0, otherwise 1; basically, whichever the class is, you just pass the index of that class.
401
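A hedged comparison of the two target encodings; the class count and probabilities are invented for illustration:

import numpy as np

probs = np.array([0.7, 0.2, 0.1])     # softmax output for 3 classes

# Categorical crossentropy: the target is one-hot encoded.
one_hot = np.array([1.0, 0.0, 0.0])
cce = -np.sum(one_hot * np.log(probs))

# Sparse categorical crossentropy: the target is just the class index.
index = 0
scce = -np.log(probs[index])

print(cce, scce)                      # identical values, different target formats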
402
403
Learning Algorithm
404
405
406
For more than two variables
407
Optimization Algorithms
Optimizers are algorithms or methods used to change the attributes of your neural network, such as the weights and the learning rate, in order to reduce the losses.
Types of optimizer
1. Gradient Descent
2. Stochastic Gradient Descent
3. Stochastic Gradient descent with momentum
4. Mini-Batch Gradient Descent
5. Adagrad
6. RMSProp
7. AdaDelta
8. Adam
413
Gradient Descent Optimizer
Gradient descent is an iterative optimization algorithm to find
the minimum of a function. Here that function is our Loss
Function.
The gradient of the loss function is its partial derivative with respect to w and b.
Picture a person walking downhill: he goes down the slope taking large steps when the slope is steep and small steps when the slope is gentle, and he decides his next position based on his current position and the slope at that point.
414
Gradient descent is a first-order optimization algorithm which
is dependent on the first order derivative of a loss function.
It calculates which way the weights should be altered so that the function can reach a minimum.
Through backpropagation, the loss is transferred from one
layer to another and the model’s parameters also known as
weights are modified depending on the losses so that the loss
can be minimized.
415
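A hedged sketch of full-batch gradient descent for a squared-error loss with one weight and one bias; the toy data and learning rate are invented for illustration:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
T = np.array([3.0, 5.0, 7.0, 9.0])      # underlying relation: t = 2x + 1
w, b, alpha = 0.0, 0.0, 0.01

for step in range(5000):
    Y = w * X + b                        # predictions on the whole dataset
    dw = np.mean(2 * (Y - T) * X)        # partial derivative of the loss w.r.t. w
    db = np.mean(2 * (Y - T))            # partial derivative of the loss w.r.t. b
    w -= alpha * dw                      # step against the gradient
    b -= alpha * db

print(w, b)                              # approaches w = 2, b = 1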
If we choose α to be very large, gradient descent can overshoot the minimum; it may fail to converge or even diverge. If we choose α to be very small, gradient descent will take small steps and will need a long time to reach the local minimum.
416
Behavior of GD on different
Surface?
417
Advantages:
1. Very simple to implement.
Disadvantages:
1. This algorithm takes the entire dataset of n points at a time to compute the derivative and update the weights, which requires a lot of memory.
2. The minimum is reached only after a long time, or may never be reached.
3. This algorithm can get stuck at local minima.
418
Stochastic Gradient Descent
(SGD)
419
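The body of this slide is image-only; as a hedged sketch, SGD differs from full-batch gradient descent only in computing the derivative from a single randomly chosen point per update:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
T = np.array([3.0, 5.0, 7.0, 9.0])
w, b, alpha = 0.0, 0.0, 0.01
rng = np.random.default_rng(0)

for step in range(20000):
    i = rng.integers(len(X))             # pick one training point at random
    y = w * X[i] + b
    w -= alpha * 2 * (y - T[i]) * X[i]   # noisy update from that point only
    b -= alpha * 2 * (y - T[i])

print(w, b)                              # fluctuates around w = 2, b = 1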
Momentum Based Gradient
Descent
420
421
422
Advantage:
1. The memory requirement is lower than for the GD algorithm, as the derivative is computed taking only one point at a time.
Disadvantages:
1. The time required to complete one epoch is large compared to the GD algorithm.
2. It takes a long time to converge.
3. It may get stuck at local minima.
423
Nesterov Accelerated Gradient
(NAG) Update rules
424
425
Mini Batch Stochastic
Gradient Descent (MB-SGD)
The MB-SGD algorithm is an extension of the SGD algorithm that overcomes the problem of SGD's large time complexity. The MB-SGD algorithm takes a batch, or subset, of points from the dataset to compute the derivative.
It is observed that the derivative of the loss function for MB-SGD is almost the same as the derivative of the loss function for GD after some number of iterations. However, the number of iterations needed to reach the minimum is large for MB-SGD compared to GD, and the cost of computation is also high.
The weight update depends on the derivative of the loss for a batch of points. The updates in the case of MB-SGD are much noisier because the derivative is not always towards the minimum.
426
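A hedged sketch of the mini-batch variant; the batch size of 2 and the toy data are assumptions for illustration:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
T = np.array([3.0, 5.0, 7.0, 9.0])
w, b, alpha, batch = 0.0, 0.0, 0.01, 2
rng = np.random.default_rng(0)

for step in range(10000):
    idx = rng.choice(len(X), size=batch, replace=False)   # sample a mini-batch
    Y = w * X[idx] + b
    w -= alpha * np.mean(2 * (Y - T[idx]) * X[idx])       # batch-averaged gradient
    b -= alpha * np.mean(2 * (Y - T[idx]))

print(w, b)    # settles close to w = 2, b = 1, with milder noise than SGD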
Advantages:
1. Less time complexity to converge compared to
standard SGD algorithm.
Disadvantages:
1. The update of MB-SGD is much noisier than the update of the GD algorithm.
2. May get stuck at local minima.
427
SGD with Momentum
428
Advantages:
1. Has all advantages of the SGD algorithm.
2. Converges faster than the GD algorithm.
Disadvantages:
1. We need to compute one more variable for each
update
429
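The update-rule slides are image-only; the standard momentum update (β = 0.9 is an assumed, commonly used value) keeps a running velocity of past gradients, which is the "one more variable" mentioned above:

def grad(w):
    return 2 * (w - 3)        # toy loss (w - 3)^2, with its minimum at w = 3

w, v = 0.0, 0.0
alpha, beta = 0.1, 0.9        # learning rate and momentum coefficient
for _ in range(100):
    v = beta * v + grad(w)    # accumulate a velocity from past gradients
    w = w - alpha * v         # the extra variable v is the added memory cost
print(w)                      # approaches 3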
Adaptive Learning Rate
430
Adagrad (Adaptive Gradient
Descent)
One of the disadvantages of all the optimizers explained is that the
learning rate is constant for all parameters and for each cycle.
The adaptive gradient descent algorithm is slightly different from
other gradient descent algorithms. This is because it uses different
learning rates for each iteration.
The change in learning rate depends on how the parameters change during training: the more a parameter has been updated, the smaller its learning rate becomes.
This modification is highly beneficial because real-world datasets
contain sparse as well as dense features. So it is unfair to have the
same value of learning rate for all the features.
431
432
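The formula slide is image-only; a hedged sketch of the standard Adagrad update (the epsilon is an added safeguard against division by zero):

import numpy as np

def grad(w):                        # toy gradient for the loss (w - 3)^2
    return 2 * (w - 3)

w, cache = 0.0, 0.0
alpha, eps = 0.5, 1e-8
for _ in range(500):
    g = grad(w)
    cache += g ** 2                 # squared gradients accumulate forever
    w -= alpha * g / (np.sqrt(cache) + eps)   # per-parameter step size shrinks
print(w)                            # approaches 3, but ever more slowly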
Advantages:
1. Learning rate changes for each training parameter.
2. Don’t need to manually tune the learning rate.
3. Able to train on sparse data.
Disadvantages:
1. Computationally expensive, as the squared gradients of every parameter must be accumulated at each step.
2. The learning rate is always decreasing, which results in slow training.
433
RMS Prop(Root Mean Square)
The idea behind this family of algorithms (originally Rprop) is that two successive gradients are first compared for signs. If they have the same sign, we are going in the right direction and hence increase the step size by a small fraction; if they have opposite signs, we decrease the step size. The step size is then bounded, and the weight update is performed. RMSProp adapts this idea to mini-batch training by dividing the learning rate by a moving average of the squared gradients.
434
435
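A hedged sketch of the moving-average form of the RMSProp update; the decay of 0.9 is the commonly cited default, assumed here:

import numpy as np

def grad(w):
    return 2 * (w - 3)              # toy gradient, minimum at w = 3

w, avg = 0.0, 0.0
alpha, decay, eps = 0.1, 0.9, 1e-8
for _ in range(500):
    g = grad(w)
    avg = decay * avg + (1 - decay) * g ** 2      # moving average of squared grads
    w -= alpha * g / (np.sqrt(avg) + eps)         # normalised step size
print(w)                            # approaches 3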
Adam
Adaptive Moment Estimation combines the power of RMSProp
(root-mean-square prop) and momentum-based GD. In Adam
optimizers, the power of momentum GD to hold the history
of updates and the adaptive learning rate provided by
RMSProp makes Adam optimizer a powerful method. It also
introduces two new hyper-parameters, beta1 and beta2, which are usually kept around 0.9 and 0.999 but can be changed according to your use case.
436
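A hedged sketch of the standard Adam update with bias correction; the default hyper-parameter values are assumptions based on common practice:

import numpy as np

def grad(w):
    return 2 * (w - 3)                      # toy gradient, minimum at w = 3

w, m, v = 0.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g         # momentum-style history of gradients
    v = beta2 * v + (1 - beta2) * g ** 2    # RMSProp-style history of squared grads
    m_hat = m / (1 - beta1 ** t)            # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)
print(w)                                    # approaches 3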
Comparison between various
optimizers
437
Parameter Tuning
438
Pattern Recognition
Pattern recognition is the process of finding regularities and similarities in data using machine learning. These similarities can be found based on statistical analysis, historical data, or knowledge the machine has already gained.
In sports, for instance, a description of a type of sport would be a pattern. If a person keeps watching videos related to cricket, YouTube wouldn't recommend them chess tutorial videos.
Examples: speech recognition, speaker identification, multimedia document recognition (MDR), automatic medical diagnosis.
Before searching for a pattern there are certain steps, and the first one is to collect the data from the real world. The collected data needs to be filtered and pre-processed so that the system can extract features from it. Then, based on the type of data, the system chooses the appropriate algorithm among classification, clustering and regression to recognize the pattern. 439
1) Classification. In classification, the algorithm assigns labels
to data based on the predefined features. This is an example
of supervised learning.
2) Clustering. An algorithm splits data into a number of clusters
based on the similarity of features. This is an example of
unsupervised learning.
3) Regression. Regression algorithms try to find a relationship
between variables and predict unknown dependent variables
based on known data. It is based on supervised learning.
440
Components of Pattern
Recognition
Input → Sensing → Segmentation → Feature Extraction → Classification → Post-Processing → Decision
441
1) Sensing
It gathers the information to be classified. The sensors in a system are used to receive the input data.
442
2) Segmentation
After receiving the input data, the different patterns need to be separated.
The measurement data is partitioned into small pieces so that each part represents exactly one object to be classified.
The segmentor isolates sensed objects from the background or from other objects.
443
3) Feature Extraction
Feature selection is the process of selecting a subset of a given set of variables.
Feature selection filters irrelevant or redundant features out of your dataset.
445
Good vs. Bad Features
What Makes a Good Feature?
448
Expert System
In artificial intelligence, an expert system is a computer system
that emulates the decision-making ability of a human expert.
The first expert systems were created in the 1970s, and their production increased in the 1980s.
459
Rule-based Systems – Forward
Chaining
Forward chaining is a method of reasoning in artificial
intelligence in which inference rules are applied to
existing data to extract additional data until an endpoint
(goal) is achieved.
In this type of chaining, the inference engine starts by
evaluating existing facts, derivations, and conditions
before deducing new information. An endpoint (goal) is
achieved through the manipulation of knowledge that
exists in the knowledge base.
460
Tom is running (A)
If a person is running, he will sweat (A->B)
Therefore, Tom is sweating. (B)
461
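A hedged toy inference engine illustrating forward chaining on the running-and-sweating example; the rule representation is invented for illustration:

# Facts and rules for the example above: A = "Tom is running", B = "Tom is sweating".
facts = {"Tom is running"}
rules = [({"Tom is running"}, "Tom is sweating")]   # (premises, conclusion)

# Forward chaining: keep applying rules to known facts until nothing new appears.
changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)       # derive new data from existing data
            changed = True

print(facts)    # {'Tom is running', 'Tom is sweating'}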
Backward Chaining
Forward chaining is not very efficient if the system tries
to prove a conclusion. In this case, it may be more
efficient if backward chaining is used.
Backward chaining is a concept in artificial intelligence
that involves backtracking from the endpoint or goal to
steps that led to the endpoint.
462
Tom is sweating (B).
If a person is running, he will sweat (A->B).
Tom is running (A).
463
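For contrast, a hedged sketch of backward chaining over the same rule set, working from the goal back to supporting facts (a toy version that assumes no cyclic rules):

facts = {"Tom is running"}
rules = [({"Tom is running"}, "Tom is sweating")]

def prove(goal):
    # A goal holds if it is a known fact, or if some rule concludes it
    # and all of that rule's premises can themselves be proved.
    if goal in facts:
        return True
    return any(conclusion == goal and all(prove(p) for p in premises)
               for premises, conclusion in rules)

print(prove("Tom is sweating"))   # True: backtracks from the goal to the fact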
Components of an Expert
System
464
Development of Expert
Systems: General Steps
1) Identify Problem Domain
▪ The problem must be suitable for an expert system to solve it.
▪ Find the experts in task domain for the ES project.
▪ Establish cost-effectiveness of the system.
▪ Design the System
2) Identify the ES Technology
▪ Know and establish the degree of integration with the other
systems and databases.
▪ Realize how the concepts can represent the domain knowledge
best.
3) Develop the Prototype
▪ From Knowledge Base: The knowledge engineer works to −
▪ Acquire domain knowledge from the expert.
▪ Represent it in the form of IF-THEN-ELSE rules.
465
4) Test and Refine the Prototype
▪ The knowledge engineer uses sample cases to test the prototype for
any deficiencies in performance.
▪ End users test the prototypes of the ES.
5) Develop and Complete the ES
▪ Test and ensure the interaction of the ES with all elements of its
environment, including end users, databases, and other
information systems.
▪ Document the ES project well.
▪ Train the user to use ES.
6) Maintain the System
▪ Keep the knowledge base up-to-date by regular review and update.
▪ Cater for new interfaces with other information systems, as those
systems evolve.
466
Characteristics of an Expert
System
Human experts are perishable, but an expert system is
permanent.
It helps to distribute the expertise of a human.
One expert system may contain knowledge from more than one human expert, thus making the solutions more efficient.
It decreases the cost of consulting an expert for various
domains such as medical diagnosis.
They use a knowledge base and inference engine.
Expert systems can solve complex problems by deducing
new facts through existing facts of knowledge,
represented mostly as if-then rules rather than through
conventional procedural code. 467
Limitations
468
Advantages :
Low accessibility cost.
Fast response.
Not affected by emotions, unlike humans.
Low error rate.
Capable of explaining how they reached a solution.
Disadvantages :
The expert system has no emotions.
Common sense is the main issue of the expert system.
It is developed for a specific domain.
It needs to be updated manually; it does not learn by itself.
It may not be able to explain the logic behind a particular decision.
469
Applications
Different types of medical diagnosis, such as internal medicine, blood diseases and so on.
Diagnosis of complex electronic and electromechanical systems.
Diagnosis of a software development project.
Planning experiments in biology, chemistry and molecular genetics.
Forecasting crop damage.
Diagnosis of diesel-electric locomotive systems.
Identification of chemical compound structure.
Scheduling of customer orders, computer resources and various manufacturing tasks.
Assessment of geologic structure from dipmeter logs.
Assessment of space structures through satellites and robots.
The design of VLSI systems.
Teaching students specialized tasks. 470
Prolog
❖ Prolog - PROgramming in LOGic,