Final Updated AI
Traditional Programming
Traditional programming is a manual process: a programmer writes a program that uses input data and runs on a computer to produce the output.

Data + Program → Computer → Output
Convert °C To °F
The equation is as follows:
T(°F) = T(°C) × 9/5 + 32
Based on Rule
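A minimal sketch of this rule-based approach in Python: the programmer writes the conversion rule explicitly, and the computer merely executes it.

    def celsius_to_fahrenheit(celsius):
        # Hand-coded rule: T(F) = T(C) * 9/5 + 32
        return celsius * 9 / 5 + 32

    print(celsius_to_fahrenheit(100))  # 212.0
    print(celsius_to_fahrenheit(0))    # 32.0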
Source: https://zeenews.india.com/cricket/on-this-day-in-2007-sachin-tendulkar
Source: https://towardsdatascience.com/
Artificial Intelligence
Artificial intelligence (AI) involves machines that can execute tasks similar to those requiring human intelligence; the term was coined in 1956 by John McCarthy.
What is intelligence?
1) Learning
The simplest is learning by trial and error. For example, a
simple computer program for solving mate-in-one chess
problems might try moves at random until mate is found.
The program might then store the solution with the position
so that the next time the computer encountered the same
position it would recall the solution.
3) Reasoning
To reason is to draw inferences appropriate to the situation.
Inferences are classified as either deductive or inductive.
An example of the former is, “Sachin must be in either the museum or the café. He is not in the café; therefore he is in the museum.”
An example of the latter is, “Previous accidents of this sort were caused by instrument failure; therefore this accident was caused by instrument failure.”
The most significant difference between these forms of
reasoning is that in the deductive case the truth of the
premises guarantees the truth of the conclusion, whereas
in the inductive case the truth of the premise lends support
to the conclusion without giving absolute assurance.
Inductive reasoning is common in science, where data are
collected and tentative models are developed to describe and
predict future behaviour—until the appearance of
anomalous data forces the model to be revised.
Deductive reasoning is common in mathematics and logic, where elaborate structures of irrefutable theorems are built up from a small set of basic axioms and rules.
4) Perception
In perception the environment is scanned by means of
various sensory organs, real or artificial, and the scene is
decomposed into separate objects in various spatial
relationships.
Analysis is complicated by the fact that an object may appear
different depending on the angle from which it is viewed, the
direction and intensity of illumination in the scene, and how
much the object contrasts with the surrounding field.
5) Language
A language is a system of signs having meaning by
convention. In this sense, language need not be confined to
the spoken word.
Traffic signs, for example, form a minilanguage, it being a matter
of convention that ⚠ means “hazard ahead” in some countries.
History of Artificial Intelligence
❖ Maturation of Artificial Intelligence (1943-1952)
• Year 1943: The first work now recognized as AI was done by Warren McCulloch and Walter Pitts in 1943. They proposed a model of artificial neurons.
❖ The golden years-Early enthusiasm (1956-1974)
• Year 1966: Researchers emphasized developing algorithms that could solve mathematical problems. Joseph Weizenbaum created the first chatbot in 1966, named ELIZA.
❖ A boom of AI (1980-1987)
• Year 1980: After the first AI winter, AI came back with the "Expert System": programs that emulate the decision-making ability of a human expert.
• Also in 1980, the first national conference of the American Association for Artificial Intelligence (AAAI) was held at Stanford University.
❖ The emergence of intelligent agents (1993-2011)
• Year 1997: IBM's Deep Blue beat world chess champion Garry Kasparov, becoming the first computer to defeat a reigning world chess champion.
• Year 2002: For the first time, AI entered the home in the form of Roomba, a robotic vacuum cleaner.
• Year 2006: AI entered the business world. Companies like Facebook, Twitter, and Netflix started using AI.
❖ Deep learning, big data and artificial general intelligence (2011-present)
• Year 2011: IBM's Watson won Jeopardy!, a quiz show in which it had to solve complex questions and riddles. Watson proved that it could understand natural language and answer tricky questions quickly.
• Year 2012: Google launched "Google Now", an Android feature that could provide information to the user as predictions.
• Year 2014: The chatbot Eugene Goostman won a competition in the famous "Turing test."
• Year 2018: IBM's "Project Debater" debated complex topics with two master debaters and performed extremely well.
• Google demonstrated an AI program, "Duplex", a virtual assistant that booked a hairdresser appointment over the phone, and the person on the other side didn't notice she was talking to a machine.
Artificial Intelligence
Artificial Intelligence, Machine Learning and Deep Learning
Machine learning
Machine learning is one way of achieving AI. Arthur Samuel coined the phrase in 1959, not long after AI itself, defining it as "the field of study that gives computers the ability to learn without being explicitly programmed".
Deep learning
Deep learning is a subset of machine learning and is so called because it makes use of deep neural networks: software that mimics the network of neurons in a brain.
In deep learning, the learning phase is done through a neural network, an architecture in which layers are stacked on top of each other.
Machine Learning and Deep Learning Performance w.r.t. Data
Machine Learning v/s Deep Learning
When to use ML or DL?
What are the Most Popular Languages for Machine Learning?
⚫ Python
⚫ R
⚫ MATLAB
⚫ SAS (Statistical Analysis System)
⚫ Scala
⚫ Java
⚫ C++
⚫ Hadoop
Expert System
In artificial intelligence, an expert system is a computer system that emulates the decision-making ability of a human expert. The first expert systems were created in the 1970s, and their production increased in the 1980s.
Machine Learning
Machine learning is an application of artificial intelligence
(AI) that provides systems the ability to automatically learn
and improve from experience without being explicitly
programmed.
Input (Data) + Output → Computer → Program
How Do ML Algorithms Work?
Source: https://www.spaceotechnologies.com/machine-learning-app-development-complete-guide/
Convert °C To °F
Source: https://www.spaceotechnologies.com/machine-learning-app-development-complete-guide/
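In the machine-learning version of this example, we do not write the 9/5 + 32 rule ourselves; we give the computer example (input, output) pairs and let it learn the mapping. A minimal sketch using scikit-learn (assuming it is installed):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Example pairs: Celsius inputs and their Fahrenheit outputs
    celsius = np.array([[-40], [0], [20], [37], [100]])
    fahrenheit = np.array([-40.0, 32.0, 68.0, 98.6, 212.0])

    model = LinearRegression().fit(celsius, fahrenheit)
    print(model.coef_[0], model.intercept_)   # ~1.8 and ~32: the learned "rule"
    print(model.predict([[25]]))              # ~[77.]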
Pattern Recognition
Pattern recognition is the process of recognizing patterns by using machine learning algorithms.
Machine Learning v/s Pattern Recognition
How Do Algorithms Find Patterns?
Learning Algorithm
Machine learning provides the machine the ability to learn automatically and improve from experience without being explicitly programmed.
The process of learning begins with observations, in order to find patterns in data and make better decisions in the future based on the examples that we provide.
The primary aim of a learning algorithm is to allow computers to learn automatically, without human intervention.
Machine Learning Algorithms:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning
Learning in the presence of instructor/supervisor/teacher
❖ Ex. Classroom teaching
Training Phase (the model learns the "sum" logic from labeled examples):

Num-1   Num-2   Sum
5       5       10
8       2       10
10      3       13
15      6       21
20      4       24

Testing Phase: new inputs (30, 40) → Trained Model → 70
Source: https://mc.ai/supervised-vs-unsupervised-learning/
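A small illustration of this training/testing split in Python, fitting the table above (a sketch; assumes scikit-learn is available):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Training phase: feature pairs and their labels (the sums)
    X_train = np.array([[5, 5], [8, 2], [10, 3], [15, 6], [20, 4]])
    y_train = np.array([10, 10, 13, 21, 24])

    model = LinearRegression().fit(X_train, y_train)

    # Testing phase: an unseen example
    print(model.predict([[30, 40]]))  # ~[70.]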
Types of Supervised Learning
Machine Learning Algorithms:
• Supervised Learning → Regression, Classification
• Unsupervised Learning
• Reinforcement Learning
Regression
If the output of the model is a continuous value, the task is regression. Regression is used to predict a continuous value.

Num-1   Num-2   Sum
5       5       10
8       2       10
10      3       13
15      6       21
20      4       24

Examples:
❖ House price prediction
❖ Stock market prediction
❖ Predicting the age of a person
❖ Predicting the number of copies of a music album that will be sold next month
Types of ML Classification Algorithms:
⚫ Linear Models
⚫ Logistic Regression
⚫ Support Vector Machines
⚫ Non-linear Models
⚫ K-Nearest Neighbours
⚫ Kernel SVM
⚫ Naïve Bayes
⚫ Decision Tree Classification
⚫ Random Forest Classification
Source: https://www.kdnuggets.com/2019/12/enabling-deep-learning-revolution.html
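As a sketch, one of these classifiers applied to a tiny labeled dataset (scikit-learn assumed; the data here is made up purely for illustration):

    from sklearn.neighbors import KNeighborsClassifier

    # Toy data: [height_cm, weight_kg] labeled 'cat' or 'dog' (illustrative only)
    X = [[25, 4], [30, 5], [28, 4], [60, 25], [55, 20], [65, 30]]
    y = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog']

    clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(clf.predict([[27, 5], [58, 22]]))  # ['cat' 'dog']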
Classification v/s Regression
Source: https://medium.com/deep-math-machine-learning-ai/different-types-of-machine-learning-and-their-types-34760b9128a2
Unsupervised Learning
Unsupervised learning is very much the opposite of supervised learning: its data are not labeled.
Source: https://towardsdatascience.com/what-are-the-types-of-machine-learning-e2b9e5d1756f
Source: https://mc.ai/supervised-vs-unsupervised-learning/
Machine Learning Algorithms:
• Supervised Learning → Regression, Classification
• Unsupervised Learning → Clustering, Dimension Reduction
• Reinforcement Learning
Clustering
Clustering is “the process of organizing objects into groups whose members are similar in some way”.
Source: https://towardsdatascience.com/unsupervised-learning-and-data-clustering-eeecb78b422a
Source: http://bampe08.blogspot.com/2015/03/cluster-analysis-for-business.html
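A minimal clustering sketch with k-means (scikit-learn assumed; the points are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    # Unlabeled points forming two rough groups
    X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
                  [8, 8], [9, 11], [8.5, 9]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)           # e.g. [0 0 0 1 1 1]
    print(kmeans.cluster_centers_)  # one centre per discovered group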
Dimension Reduction
Dimensionality reduction is simply the process of reducing the dimension of your feature set.
❖ Ex. Principal Component Analysis (PCA); see the sketch below.
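A sketch of PCA reducing 3-D data to 2-D (scikit-learn assumed; random data used for illustration):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))          # 100 samples, 3 features

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)                 # (100, 2)
    print(pca.explained_variance_ratio_)   # variance kept by each component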
Why is Unsupervised Learning Needed?
Annotating large datasets is very costly and hence we can
label only a few examples manually.
❖ Example: Speech Recognition
Examples of Unsupervised Learning
Recommender systems: group customers into segments with similar purchasing behavior.
❖ On YouTube or Netflix, you have seen the video recommendation system.
Supervised vs. Unsupervised Learning
Classification v/s Clustering
Source: https://deepcast.ai/article/2017/07/18/going_beyond_k_means_for_effective_clustering_of_production_curves
Classification v/s Clustering v/s Regression
Source: https://www.researchgate.net/publication/314626729_Parallel_Linear_Algebra/figures?lo=1
Reinforcement Learning
Reinforcement learning is about taking suitable actions to maximize reward in a particular situation.
Source: https://www.geeksforgeeks.org/what-is-reinforcement-learning/
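The reward-maximization idea can be sketched with a tiny two-armed bandit agent (a deliberate simplification of reinforcement learning; the payout probabilities and exploration rate below are made up):

    import random

    probs = [0.3, 0.7]          # hidden payout probability of each arm
    values = [0.0, 0.0]         # agent's running reward estimate per arm
    counts = [0, 0]

    for step in range(1000):
        # Explore 10% of the time, otherwise exploit the best-looking arm
        arm = random.randrange(2) if random.random() < 0.1 else values.index(max(values))
        reward = 1 if random.random() < probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

    print(values)  # the estimate for arm 1 should approach 0.7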
Evaluation Metric
Regression
o RMSE
o MSE
o R-Square
o Adjusted R Square
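These regression metrics can be computed directly; a numpy sketch (using MSE = mean((y - ŷ)²), RMSE = √MSE, R² = 1 - SS_res/SS_tot, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1); the data and p below are assumptions for illustration):

    import numpy as np

    y_true = np.array([10, 10, 13, 21, 24])
    y_pred = np.array([11, 9, 13, 20, 25])
    n, p = len(y_true), 2          # p = number of predictors (assumed)

    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

    print(mse, rmse, r2, adj_r2)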
Deep learning
Underfitting and Overfitting
Underfitting: low accuracy on both the training and the test set. Overfitting, by contrast: high training accuracy but low test accuracy.
Source: https://www.analyticsvidhya.com/blog/2020/02/underfitting-overfitting-best-fitting-machine-learning/
Source: https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76/
Agents
An AI system is composed of an agent and its environment. The agents act in their environment, and the environment may contain other agents.
An agent is anything that can perceive its environment through sensors and act upon that environment through effectors.
⚫ Sensor: a device which detects changes in the environment and sends the information to other electronic devices. An agent observes its environment through sensors.
⚫ Actuators: the components of a machine that convert energy into motion. Actuators are responsible for moving and controlling a system.
Examples of Agent
A human agent has sensory organs such as eyes, ears, nose,
tongue and skin parallel to the sensors, and other organs such
as hands, legs, mouth, for effectors.
PEAS Representation
⚫ PEAS is a type of model on which an AI agent works.
• P: Performance measure
• E: Environment
• A: Actuators
• S: Sensors
Types of Agents
❖ Simple Reflex Agents
❖ Model-Based Reflex Agents
❖ Goal-Based Agents
❖ Utility-Based Agents
❖ Learning Agent
Simple reflex agents
Simple reflex agents ignore the rest of the percept history and act only on the basis of the current percept.
The agent function is based on the condition-action rule: if the condition is true, the action is taken, else not.
This agent function succeeds only when the environment is fully observable (i.e., it is not suitable for partially observable environments).
Model-based reflex agents
It works by finding a rule whose condition matches the current situation.
A model-based agent can handle partially observable environments.
The current state is stored inside the agent, which maintains some kind of structure describing the part of the world that cannot be seen.
Goal-based agents
These kinds of agents take decisions based on how far they currently are from their goal (a description of desirable situations). Every action is intended to reduce the distance to the goal.
They usually require search and planning. The goal-based agent's behavior can easily be changed.
Utility-based agents
Agents developed with their end uses (utilities) as building blocks are called utility-based agents.
When there are multiple possible alternatives, utility-based agents are used to decide which one is best. They choose actions based on a preference (utility) for each state. Sometimes achieving the desired goal is not enough; we may look for a quicker, safer, or cheaper trip to reach a destination.
Learning Agent
A learning agent in AI is an agent that can learn from its past experiences, i.e., it has learning capabilities. It starts by acting with basic knowledge and is then able to act and adapt automatically through learning.
A learning agent has mainly four conceptual components,
which are:
1. Learning element: It is responsible for making improvements by learning
from the environment
2. Critic: The learning element takes feedback from the critic, which describes how well the agent is doing with respect to a fixed performance standard.
3. Performance element: It is responsible for selecting external action
4. Problem Generator: This component is responsible for suggesting actions
that will lead to new and informative experiences.
Rational and Intelligent Agent
❖ Intelligent Agent
An intelligent agent is a goal-directed agent. It perceives its environment through its sensors and, using the observations and built-in knowledge, acts upon the environment through its actuators.
❖ Rational Agent
Rationality is the state of being reasonable, sensible, and having good judgment.
Rationality is concerned with expected actions and results, depending on what the agent has perceived.
A rational agent always performs the right action. The problem the agent solves is characterized by Performance measure, Environment, Actuators, and Sensors (PEAS).
Turing Test in AI
In 1950, Alan Turing introduced a test to check whether a machine can think like a human or not; this test is known as the Turing Test.
In this test, Turing proposed that a computer can be said to be intelligent if it can mimic human responses under specific conditions.
General Problem Solving Components
❖ Every problem should be properly formulated in artificial intelligence. Problem formulation is very important before applying any search algorithm.
❖ Problem Limitation
⚫ There are always some limitations while solving problems. All these limitations or constraints must be fulfilled while creating the system. For example: "I have only a few features, some records are missing, and the system must be 90% accurate, otherwise it will be useless."
❖ Goal or Solution
⚫ What is expected from the system? The goal state, final state, or solution of the problem is defined here. This will help us to propose an appropriate solution for the problem.
❖ Solution Space
⚫ A problem can be solved in many ways. Some solutions will be more efficient than others; some will consume fewer resources, some will be simpler, and so on. There are always alternatives. The set of possible ways in which we can solve a problem is known as the solution space. For example: "the price of a house can be predicted using many machine learning algorithms".
❖ Operators
⚫ Operators are the actions taken while solving the problem. The complete problem is solved using tiny steps or actions, and all these consecutive actions lead to the solution of the problem.
Mouse Path Problem
• Problem Statement
• Problem Definition: A mouse is hungry; the mouse is in a maze where there are some pieces of cheese. The mouse will only be satisfied if it eats cheese.
• Problem Limitation: Some paths are closed, i.e., dead ends; the mouse can only travel through open paths.
• Problem Solution: Reach a location where there is cheese and eat at least one piece. There are multiple possible solutions (cheese pieces).
• Solution Space: There are multiple possible paths to reach the cheese.
• Operators: The mouse can move in four possible directions; these directions are the operators or actions, which are UP, DOWN, LEFT and RIGHT.
Water Jug Problem
• Problem Statement
• Problem Definition: You have to measure 4 liters (L) of water by using three buckets of 8L, 5L and 3L.
• Problem Limitation: You can only use these (8L, 5L and 3L) buckets.
• Problem Solution: Measure exactly 4L of water.
• Solution Space: There are multiple ways of doing this.
• Operators: The possible actions are filling water into any bucket and removing water from any bucket.
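A sketch of this formulation in Python, searching the solution space with breadth-first search over (8L, 5L, 3L) bucket states. The jug sizes come from the slide; the pour operator and the assumption that the 8L bucket starts full are mine:

    from collections import deque

    CAP = (8, 5, 3)          # bucket capacities
    start = (8, 0, 0)        # assume the 8L bucket starts full

    def next_states(state):
        # Operator: pour from bucket i into bucket j until full or empty
        for i in range(3):
            for j in range(3):
                if i != j and state[i] > 0:
                    amount = min(state[i], CAP[j] - state[j])
                    s = list(state)
                    s[i] -= amount
                    s[j] += amount
                    yield tuple(s)

    # BFS: find any state in which some bucket holds exactly 4L
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if 4 in path[-1]:
            print(path)      # sequence of states reaching 4L
            break
        for s in next_states(path[-1]):
            if s not in seen:
                seen.add(s)
                queue.append(path + [s])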
8 Puzzle or Slide Puzzle
• States: A state description
specifies the location of each of
the eight tiles and the blank in
one of the nine squares.
• Initial state: Any random
shuffled state can be
designated as initial state
• Actions:
• Slide Left
• Slide Right
• Slide Up
• Slide Down
• Transition model: Given a
state and action, this returns
the resulting state
• Goal test: This checks whether the state matches the goal configuration.
Search Algorithm Terminologies
In artificial intelligence, search techniques are universal problem-solving methods.
❖ Informed Search
Informed search algorithms use domain knowledge. In an informed search, problem information is available which can guide the search.
Informed search strategies can find a solution more efficiently than an uninformed search strategy.
Difference between Informed and Uninformed Search
1. Breadth-first Search
Breadth-first search is the most common search strategy for traversing a tree or graph. This algorithm searches breadthwise in a tree or graph, so it is called breadth-first search.
The BFS algorithm starts searching from the root node of the tree and expands all successor nodes at the current level before moving to nodes of the next level.
Breadth-first search is implemented using a FIFO queue data structure.
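A minimal BFS sketch in Python over an adjacency-list graph (the graph below is a made-up example, not the one in the figure):

    from collections import deque

    graph = {                      # illustrative adjacency list
        'S': ['A', 'B'], 'A': ['C'], 'B': ['D'],
        'C': ['K'], 'D': [], 'K': []
    }

    def bfs(start, goal):
        queue = deque([[start]])   # FIFO queue of paths
        visited = {start}
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == goal:
                return path
            for nxt in graph[node]:
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(path + [nxt])

    print(bfs('S', 'K'))  # ['S', 'A', 'C', 'K']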
BFS algorithm from the root node S to goal node K.
The equivalent search tree for the graph:
2. Depth-first Search
It starts from the root node and follows each path to its greatest depth node before moving to the next path.
DFS uses a stack data structure for its implementation.
The process of the DFS algorithm is similar to the BFS algorithm.
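A matching DFS sketch using an explicit stack (same made-up graph as in the BFS example):

    graph = {
        'S': ['A', 'B'], 'A': ['C'], 'B': ['D'],
        'C': ['K'], 'D': [], 'K': []
    }

    def dfs(start, goal):
        stack = [[start]]          # LIFO stack of paths
        visited = set()
        while stack:
            path = stack.pop()
            node = path[-1]
            if node == goal:
                return path
            if node not in visited:
                visited.add(node)
                for nxt in reversed(graph[node]):
                    stack.append(path + [nxt])

    print(dfs('S', 'K'))  # ['S', 'A', 'C', 'K']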
The equivalent search tree for the graph:
Calculate the order to print all the nodes of the graph, starting from node H to E.
Properties of BFS
Time Complexity: The time complexity of the BFS algorithm is given by the number of nodes traversed in BFS until the shallowest node, where d is the depth of the shallowest solution and b is the branching factor (number of successors at every state):
T(b) = 1 + b + b^2 + b^3 + ... + b^d = O(b^d)
❖ Disadvantages:
It requires lots of memory, since each level of the tree must be saved into memory to expand the next level.
BFS needs lots of time if the solution is far away from the root node.
Properties of DFS
Completeness: The DFS algorithm is complete within finite state space, as it will expand every node within a bounded search tree.
❖ Disadvantages:
There is the possibility that many states keep re-occurring, and there is no guarantee of finding the solution.
DFS goes deep down in the search, and it may sometimes go into an infinite loop.
3. Depth-Limited Search Algorithm
A depth-limited search algorithm is similar to depth-first search with a predetermined limit ℓ. Depth-limited search can solve the drawback of the infinite path in depth-first search. In this algorithm, the node at the depth limit is treated as if it has no further successor nodes.
Disadvantages:
Depth-limited search also has the disadvantage of incompleteness.
It may not be optimal if the problem has more than one solution.
Properties of Depth-Limited Search
Completeness: The DLS algorithm is complete if the solution is above the depth limit.
Time Complexity: The time complexity of the DLS algorithm is O(b^ℓ).
Space Complexity: The space complexity of the DLS algorithm is O(b×ℓ).
Optimal: Depth-limited search can be viewed as a special case of DFS, and it is likewise not optimal, even if ℓ > d.
4. Bidirectional Search Algorithm
In normal graph search using BFS/DFS we begin our search in one direction, usually from the source vertex toward the goal vertex. But what if we start the search from both directions simultaneously?
Bidirectional search is a graph search algorithm which finds the smallest path from source to goal vertex in a directed graph. It runs two simultaneous searches:
1) a forward search from the source/initial vertex toward the goal vertex, and
2) a backward search from the goal vertex toward the source.
Performance measures
Completeness: Bidirectional search is complete if BFS is used in both searches.
Optimality: It is optimal if BFS is used for the searches and paths have uniform cost.
Complexity: Space complexity is the same as BFS and DFS.
The total time complexity would be O(b^(d/2) + b^(d/2)), which is far less than O(b^d).

A -> B -> C -> D
5. Uniform-cost Search Algorithm
UCS is different from BFS and DFS because here costs come into play: traversing different edges might not have the same cost. Uniform-cost search is a searching algorithm used for traversing a weighted tree or graph.
This algorithm comes into play when a different cost is available for each edge. The primary goal of uniform-cost search is to find a path to the goal node which has the lowest cumulative cost.
Uniform-cost search expands nodes according to their path costs from the root node. It can be used to solve any graph/tree where the optimal cost is in demand.
A uniform-cost search algorithm is implemented using a priority queue. It gives maximum priority to the lowest cumulative cost.
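A UCS sketch with Python's heapq as the priority queue (the graph and edge weights below are made up for illustration):

    import heapq

    graph = {                                   # node -> [(neighbor, edge_cost)]
        'S': [('A', 1), ('B', 4)],
        'A': [('C', 2), ('B', 1)],
        'B': [('G', 5)],
        'C': [('G', 3)],
        'G': []
    }

    def ucs(start, goal):
        frontier = [(0, start, [start])]        # (path_cost, node, path)
        explored = set()
        while frontier:
            cost, node, path = heapq.heappop(frontier)  # cheapest path first
            if node == goal:
                return cost, path
            if node in explored:
                continue
            explored.add(node)
            for nxt, w in graph[node]:
                heapq.heappush(frontier, (cost + w, nxt, path + [nxt]))

    print(ucs('S', 'G'))  # (6, ['S', 'A', 'C', 'G'])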
Uniform-cost search is equivalent to the BFS algorithm if the path cost of all edges is the same.
Advantages:
Uniform-cost search is optimal, because at every state the path with the least cost is chosen.
Disadvantages:
It does not care about the number of steps involved in searching and is only concerned with path cost, due to which this algorithm may get stuck in an infinite loop.
Informed Search Algorithms
An informed search algorithm has access to knowledge such as how far we are from the goal, the path cost, how to reach the goal node, etc. This knowledge helps agents explore less of the search space and find the goal node more efficiently.
The informed search algorithm uses the idea of a heuristic, so it is also called heuristic search.
2) A* Search Algorithm
Advantages and Disadvantages of Best First Search
Advantages:
Best-first search can switch between BFS and DFS, gaining the advantages of both algorithms.
This algorithm is more efficient than the BFS and DFS algorithms.
Disadvantages:
It can behave as an unguided depth-first search in the worst-case scenario.
It can get stuck in a loop, as DFS can.
This algorithm is not optimal.
A* Search Algorithm
A* search is the most commonly known form of best-first search.
It uses the heuristic function h(n) and the cost to reach node n from the start state, g(n).
The A* algorithm is similar to UCS except that it uses g(n) + h(n) instead of g(n).
It combines features of UCS and greedy best-first search, by which it solves problems efficiently.
The A* search algorithm finds the shortest path through the search space using the heuristic function. This search algorithm expands fewer nodes of the search tree and provides optimal results faster.
In the A* search algorithm, we use the search heuristic as well as the cost to reach the node. Hence we can combine both costs as follows, and this sum is called the fitness number:
f(n) = g(n) + h(n)
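An A* sketch reusing the weighted-graph idea, with a made-up heuristic table h(n) (both graph and heuristic are illustrative; h should underestimate the true remaining cost for A* to stay optimal):

    import heapq

    graph = {
        'S': [('A', 1), ('B', 4)],
        'A': [('C', 2), ('B', 1)],
        'B': [('G', 5)],
        'C': [('G', 3)],
        'G': []
    }
    h = {'S': 5, 'A': 4, 'B': 4, 'C': 3, 'G': 0}   # assumed admissible heuristic

    def a_star(start, goal):
        frontier = [(h[start], 0, start, [start])]  # (f = g + h, g, node, path)
        while frontier:
            f, g, node, path = heapq.heappop(frontier)
            if node == goal:
                return g, path
            for nxt, w in graph[node]:
                heapq.heappush(frontier, (g + w + h[nxt], g + w, nxt, path + [nxt]))

    print(a_star('S', 'G'))  # (6, ['S', 'A', 'C', 'G'])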
Advantages:
The A* search algorithm performs better than other search algorithms.
The A* search algorithm is optimal and complete.
This algorithm can solve very complex problems.
Disadvantages:
The A* search algorithm has some complexity issues.
The main drawback of A* is its memory requirement: it keeps all generated nodes in memory, so it is not practical for various large-scale problems.
The path highlighted in red shows the path taken by the greedy algorithm, and the path highlighted in green shows the path taken by the heuristic A* algorithm.
8-Puzzle using A* Algorithm.
Hill Climbing
The hill climbing algorithm is a local search algorithm which continuously moves in the direction of increasing elevation/value to find the peak of the mountain or the best solution to the problem.
It terminates when it reaches a peak where no neighbor has a higher value.
Example: the traveling-salesman problem, in which we need to minimize the distance traveled by the salesman.
It is also called greedy local search, as it only looks to its good immediate neighbor state and not beyond that.
Hill climbing is mostly used when a good heuristic is available.
In this algorithm, we don't need to maintain and handle the search tree or graph, as it only keeps a single current state.
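A minimal hill-climbing sketch maximizing a one-dimensional function by comparing only immediate neighbors (the objective function and step size are made up):

    def f(x):
        return -(x - 3) ** 2 + 9          # toy objective with its peak at x = 3

    def hill_climb(x, step=0.1):
        while True:
            neighbors = [x - step, x + step]
            best = max(neighbors, key=f)   # look only at immediate neighbors
            if f(best) <= f(x):            # no neighbor is higher: peak reached
                return x
            x = best

    print(hill_climb(0.0))                 # ~3.0 (a local, here global, maximum)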
Features of Hill Climbing
Greedy approach: the hill-climbing search moves in the direction which optimizes the cost.
Types of Hill Climbing Algorithm
1) Simple hill climbing
2) Steepest-ascent hill climbing
3) Stochastic hill climbing
1. Simple Hill Climbing:
Simple hill climbing is the simplest way to implement a hill climbing algorithm. It evaluates only one neighbor node state at a time and selects the first one which optimizes the current cost, setting it as the current state. It checks only one successor state; if that successor is better than the current state, it moves, else it stays in the same state.
This algorithm has the following features:
Less time consuming
Less optimal; the solution is not guaranteed
3. Stochastic hill climbing
Stochastic hill climbing chooses at random from among the uphill moves.
Stochastic hill climbing does not examine all its neighbors before moving. Rather, it selects one neighbor node at random and decides whether to move to it or examine another.
The probability of selection can vary with the steepness of the uphill move.
Stochastic hill climbing usually converges more slowly than steepest ascent, but in some state landscapes it finds better solutions.
Stochastic hill climbing is NOT complete, but it may be less likely to get stuck.
Hill Climbing Algorithm Example
Local Maximum and Local Minimum
8-puzzle: a solution case
8-puzzle: stuck at local maximum
Blocks World Problem with Local Heuristic Function
+1 for each block that is resting on the thing it is supposed to be resting on.
-1 for each block that is resting on the wrong thing.

Initial stack (top to bottom): A, D, C, B
Goal stack (top to bottom): D, C, B, A
Blocks World Problem with Global Heuristic Function
For each block that has the correct support structure: +1 to every block in the support structure.
For each block that has the wrong support structure: -1 to every block in the support structure.

Initial stack (top to bottom): A, D, C, B
Goal stack (top to bottom): D, C, B, A
Initial stack (top to bottom): A, H, G, F, E, D, C, B
Goal stack (top to bottom): H, G, F, E, D, C, B, A
Solutions to Overcome Problems in Hill Climbing
1. Local maximum:
Utilize the backtracking
technique. Maintain a list of
visited states. If the search
reaches an undesirable state, it
can backtrack to the previous
configuration and explore a new
path.
2. Plateau: Make a big jump.
Randomly select a state far away
from the current state. Chances
are that we will land in a non-
plateau region.
3. Ridge: In this kind of obstacle, move in several directions at once (apply two or more rules before testing a state).
Hill Climbing Search Example with Local Minimum

Game Playing
Both players try to win the game, so both of them try to make the best move possible at each turn. Searching techniques like BFS (breadth-first search) are not accurate for this, as the branching factor is very high.
Components of a Game Playing Problem
Initial state: This defines the initial configuration of the game and identifies the first player to move.
Successor function: This identifies the possible states that can be achieved from the current state. It returns a list of (move, state) pairs, each indicating a legal move and the resulting state.
Goal test: This checks whether a given state is a goal state or not. States where the game ends are called terminal states.
Path cost / utility / payoff function: This gives a numeric value for the terminal states. In chess, the outcome is a win, loss, or draw, with values +1, -1, or 0. Some games have a wider range of possible outcomes.
Characteristics of Game Playing
Unpredictable opponent: Generally we cannot predict the behavior of the opponent. Thus we need to find a solution in the form of a strategy, specifying a move for every possible opponent move or every possible state.
Mini-Max Algorithm
The mini-max algorithm is a recursive or backtracking algorithm used in decision-making and game theory. It provides an optimal move for the player, assuming that the opponent is also playing optimally.
In this algorithm, two players play the game; one is called MAX and the other is called MIN.
Both players fight it out, such that the opponent player gets the minimum benefit while they get the maximum benefit.
Both players are opponents of each other: MAX selects the maximized value and MIN selects the minimized value.
The minimax algorithm proceeds all the way down to the terminal nodes of the tree, then backs the values up through the tree as the recursion unwinds.
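A minimal minimax sketch over a hand-made game tree (the tree shape and leaf payoffs below are illustrative):

    # Leaves are payoff numbers; internal nodes are lists of children.
    tree = [[3, 5], [2, 9], [0, 7]]        # depth-2 game, MAX to move at the root

    def minimax(node, is_max):
        if not isinstance(node, list):     # terminal state: return its utility
            return node
        values = [minimax(child, not is_max) for child in node]
        return max(values) if is_max else min(values)

    print(minimax(tree, True))  # 3: MAX picks the branch whose MIN value is best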
Properties of the Mini-Max Algorithm
Complete – The min-max algorithm is complete: it will definitely find a solution (if one exists) in a finite search tree.
Optimal – The min-max algorithm is optimal if both opponents are playing optimally.
Time complexity – As it performs a DFS of the game tree, the time complexity of min-max is O(b^m), where b is the branching factor of the game tree and m is the maximum depth of the tree.
Space complexity – The space complexity of the mini-max algorithm is also similar to DFS, which is O(b·m).
Techniques used for Knowledge Representation
Logic: It is the basic method used to represent the knowledge of a
machine. The term logic means to apply intelligence over the
stored knowledge.
Semantic Networks: The technique is based on storing the
knowledge into the system in the form of a graph. Nodes of a
graph represent the objects which exist in the real world, and the
arrow represents the relationship between these objects. Such
techniques show the connectivity of one object with another
object.
For example, Consider the given knowledge stored in a machine
⚫ Ram has a cycle.
⚫ Ram is a boy.
⚫ Cycle has a bell.
⚫ Ram is 12 years old.
⚫ Cycle has two paddles.
Frames: In this technique, knowledge is stored via slots and fillers, much as we store an employee's information in a DBMS table.
The slots are the entities and the fillers are their attributes; together they are stored in a frame. Whenever there is a requirement, the machine infers the necessary information from the frame to take the decision.
For example: Tomy is a dog having one tail.
Different Types of Sentences
1. Declarative sentences declare something, like “Gandhinagar is
the capital of Gujarat.”
2. Exclamatory sentences are emotional expressions, such as
“Happy birthday!”
3. Interrogative sentences ask questions, like “what time is it?”
4. Imperative sentences give commands, such as “turn right at the
traffic light.”
Knowledge Representation using Propositional Logic
Propositional logic (PL) is the simplest form of logic where all
the statements are made by propositions.
A proposition is a declarative statement which is either true or
false.
It is a technique of knowledge representation in logical and
mathematical form.
Examples:
a) It is Sunday.
b) The Sun rises from the West. (false proposition)
c) 3 + 3 = 7 (false proposition)
d) 5 is a prime number.
Some basic facts about propositional logic
❖ Propositional logic is also called Boolean logic, as it works on 0 and 1.
❖ In propositional logic, we use symbolic variables to represent the logic, and we can use any symbol to represent a proposition, such as A, B, C, P, Q, R, etc.
❖ Propositions can be either true or false, but not both.
❖ Propositional logic consists of objects, relations or functions, and logical connectives.
❖ A proposition formula which is always true is called a tautology; it is also called a valid sentence.
❖ A proposition formula which is always false is called a contradiction.
❖ Statements which are questions, commands, or opinions, such as "Where is Rohini?" or "How are you?", are not propositions.
❖ There are two types of propositions:
1. Atomic propositions
2. Compound propositions
Logical Connectives
1. Negation: A sentence such as ¬P is called the negation of P. A literal can be either a positive literal or a negative literal.
2. Conjunction: A sentence which has the ∧ connective, such as P ∧ Q, is called a conjunction.
Example: "Rohan is intelligent and hardworking." It can be written as:
P = Rohan is intelligent, Q = Rohan is hardworking → P ∧ Q.
3. Disjunction: A sentence which has the ∨ connective, such as P ∨ Q, is called a disjunction, where P and Q are propositions.
Example: "Ritika is a doctor or an engineer."
Here P = Ritika is a doctor, Q = Ritika is an engineer, so we can write it as P ∨ Q.
4. Implication: A sentence such as P → Q is called an implication. Implications are also known as if-then rules. Example: "If it is raining, then the street is wet."
Let P = It is raining and Q = The street is wet; it is represented as P → Q.
5. Biconditional: A sentence such as P ⇔ Q, read "P if and only if Q".
Precedence of Connectives
Logical Equivalence
Logical equivalence is one of the features of propositional logic. Two propositions are said to be logically equivalent if and only if their columns in the truth table are identical to each other.
Take two propositions A and B; for logical equivalence, we write A⇔B. In the truth table below we can see that the columns for ¬A ∨ B and A → B are identical, hence A → B is equivalent to ¬A ∨ B.
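This column-by-column check is easy to mechanize; a small sketch that enumerates the truth table for A → B and ¬A ∨ B and compares them:

    from itertools import product

    # Verify A → B ≡ ¬A ∨ B by comparing truth-table columns
    for a, b in product([False, True], repeat=2):
        col1 = (not a) or b            # ¬A ∨ B
        col2 = b if a else True        # A → B: false only when A is true and B false
        print(a, b, col1, col2, col1 == col2)   # last column is True in every row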
Properties of Operators
Commutativity:
• P ∧ Q = Q ∧ P
• P ∨ Q = Q ∨ P
Associativity:
• (P ∧ Q) ∧ R = P ∧ (Q ∧ R)
• (P ∨ Q) ∨ R = P ∨ (Q ∨ R)
Identity element:
• P ∧ True = P
• P ∨ True = True
Distributivity:
• P ∧ (Q ∨ R) = (P ∧ Q) ∨ (P ∧ R)
• P ∨ (Q ∧ R) = (P ∨ Q) ∧ (P ∨ R)
De Morgan's laws:
• ¬(P ∧ Q) = (¬P) ∨ (¬Q)
• ¬(P ∨ Q) = (¬P) ∧ (¬Q)
Double-negation elimination:
• ¬(¬P) = P
❑ Which of the following is the contrapositive of ‘if two triangles
are identical, then these are similar’?
A) If two triangles are not similar, then they are not identical.
B) If two triangles are not identical, then these are not similar.
C) If two triangles are not identical, then these are similar.
D) If two triangles are not similar, then these are identical.
Solution:
Consider the following statements
p: Two triangles are identical.
q: Two triangles are similar.
Clearly, the given statement in symbolic form is p ⇒ q.
Therefore, its contrapositive is given by ∼q ⇒ ∼p
Now,
∼p: Two triangles are not identical.
∼q: Two triangles are not similar.
∴ ~q ⇒ ~p: If two triangles are not similar, then they are not identical.
The statement ~(p↔ ~q) is
A) equivalent to p↔q
B) equivalent to ~p↔q
C) a tautology
D) a fallacy
Deduction
In AI we need to create new facts from existing facts. In propositional logic, this process is called deduction.
Given two presumably true statements, we can deduce a new true statement.
The first two sentences are called premises, and the deduced sentence is called the conclusion; the whole is called an argument.
One way to find whether an argument is valid is to create a truth table for the premises and the conclusion.
A conclusion is invalid if we can find a counterexample case: a case in which both premises are true but the conclusion is false.
An argument is valid if no counterexample can be found.
The only row to be checked is the second row. This row does not
show a counterexample, so the argument is valid.
Here row 2 and row 4 need to be checked. Although row 4 is ok,
row 2 shows a counterexample (two true premises result in a false
conclusion).
Limitations of Propositional Logic
❖ Propositional logic has limited expressive power. We cannot represent relations like "all", "some", or "none" with propositional logic. Examples:
• All the girls are intelligent.
• Some apples are sweet.
First-Order logic
First-order logic is another way of representing knowledge in artificial intelligence. It is an extension of propositional logic.
First-order logic is also known as predicate logic or first-order predicate logic.
FOL is sufficiently expressive to represent natural-language statements in a concise way.
First-order logic is a powerful language that describes information about objects easily and can also express the relationships between those objects.
First-order logic (like natural language) does not only assume that the world contains facts, as propositional logic does, but also assumes the following things in the world:
• Objects: A, B, people, numbers, colors, wars, etc.
• Relations: unary relations or n-ary relations
• Functions: father of, best friend, third inning of, end of, ...
First-order logic statements can be divided into
two parts:
As a natural language, first-order logic also has two main parts:
• Syntax
• Semantics
Syntax of First-Order logic:
The basic syntactic elements of first-order logic are symbols.
The syntax of FOL determines which collection of symbols is a
logical expression in first-order logic.
Basic Elements of First-order logic
Atomic sentences:
Atomic sentences are the most basic sentences of first-order logic.
These sentences are formed from a predicate symbol followed
by a parenthesis with a sequence of terms.
We can represent atomic sentences as Predicate(term1, term2, ..., termN).
Example:
Ravi and Ajay are brothers: => Brothers(Ravi, Ajay).
Chinky is a cat: => cat (Chinky).
Complex Sentences:
Complex sentences are made by combining atomic sentences
using connectives.
P1: Rani is Neetu's mother.
P2: Neetu is Meethi's mother.
Universal Quantifier:
Universal quantifier is a symbol of logical representation, which
specifies that the statement within its range is true for everything
or every instance of a particular thing.
The Universal quantifier is represented by a symbol ∀, which
resembles an inverted A.
If x is a variable, then ∀x is read as:
• For all x
• For each x
• For every x.
Example: "All men drink coffee."
∀x man(x) → drink(x, coffee)
It will be read as: for every x, if x is a man, then x drinks coffee.
Existential Quantifier:
The existential quantifier, represented by the symbol ∃, specifies that the statement within its scope is true for at least one instance. If x is a variable, then ∃x is read as:
• There exists an 'x'
• For some 'x'
• For at least one 'x'
Example:
⚫ Some boys are intelligent.
⚫ ∃x: boys(x) ∧ intelligent(x)
⚫ It will be read as: there is some x such that x is a boy and x is intelligent.
1. All birds fly.
In this question the predicate is "fly(bird)", and since it says that all birds fly, it will be represented as follows:
∀x bird(x) → fly(x)
4. Not all students like both Math and Science.
Substitution:
Substitution is a fundamental operation performed on terms and formulas. It is complex in the presence of quantifiers in FOL. If we write F[a/x], it refers to substituting the constant "a" in place of the variable "x".
Inference Rules
FOL Inference Rules for Quantifiers
1) Universal Generalization
2) Universal Instantiation
3) Existential Instantiation
4) Existential introduction or Generalization
1. Universal Generalization:
Universal generalization is a valid inference rule which states that if premise P(c) is true for any arbitrary element c in the universe of discourse, then we can conclude ∀x P(x).
It can be represented as: P(c) ⟹ ∀x P(x).
This rule can be used if we want to show that every element has a similar property.
In this rule, x must not appear as a free variable.
Example: Let P(c) be "A byte contains 8 bits". From this, ∀x P(x), "All bytes contain 8 bits", will also be true.
2. Universal Instantiation:
Universal instantiation, also called universal elimination (UI), is a valid inference rule. It can be applied multiple times to add new sentences.
As per UI, we can infer any sentence obtained by substituting a ground term (a term without variables, such as John in Likes(John, Jay)) for the variable.
The UI rule states that we can infer any sentence P(c) by substituting a ground term c from ∀x P(x), for any object c in the universe of discourse.
It can be represented as: ∀x P(x) ⟹ P(c).
Example: If "every person likes ice-cream", ∀x P(x), we can infer that "John likes ice-cream", P(John).
The new KB is logically equivalent to the previous KB.
Another example: "All kings who are greedy are evil":
∀x king(x) ∧ greedy(x) → Evil(x)
From this information, we can infer, using universal instantiation:
King(John) ∧ Greedy(John) → Evil(John)
3. Existential Instantiation:
Existential instantiation, also called existential elimination, is a valid inference rule in first-order logic.
It can be applied only once to replace the existential sentence.
This rule states that one can infer P(c) from a formula of the form ∃x P(x), for a new constant symbol c.
The restriction with this rule is that the c used must be a new term for which P(c) is true.
It can be represented as: ∃x P(x) ⟹ P(c), for a fresh constant c.
4. Existential introduction
Existential introduction, also known as existential generalization, is a valid inference rule in first-order logic.
This rule states that if some element c in the universe of discourse has a property P, then we can infer that there exists something in the universe which has the property P.
It can be represented as: P(c) ⟹ ∃x P(x).
Example:
"Priyanka got good marks in English."
"Therefore, someone got good marks in English."
❖ If there is any woman who uses any medication then that
may participate in our drugs test. There is a woman who
uses some medication.
❖ Thus, someone may participate in our drug test.
Unification
Unification is used in logical programming where the aim is to
make expressions look identical to each other, and in order to
make them identical, we generally use substitution.
Algorithm for Unification (of two expressions Ψ1 and Ψ2)
Step 1: If Ψ1 or Ψ2 is a variable or constant, then:
a) If Ψ1 and Ψ2 are identical, then return NIL.
b) Else if Ψ1 is a variable: if Ψ1 occurs in Ψ2 then return FAILURE, else return {(Ψ2/Ψ1)}.
c) Else if Ψ2 is a variable: if Ψ2 occurs in Ψ1 then return FAILURE, else return {(Ψ1/Ψ2)}.
d) Else return FAILURE.
Step 2: If the initial predicate symbols in Ψ1 and Ψ2 are not the same, then return FAILURE.
Step 3: If Ψ1 and Ψ2 have a different number of arguments, then return FAILURE.
Step 4: Set the substitution set (SUBST) to NIL.
Step 5: For i = 1 to the number of elements in Ψ1:
a) Call the Unify function with the i-th element of Ψ1 and the i-th element of Ψ2, and put the result into S.
b) If S = FAILURE then return FAILURE.
c) If S ≠ NIL then: apply S to the remainder of both Ψ1 and Ψ2, and set SUBST = APPEND(S, SUBST).
Step 6: Return SUBST.
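A compact Python sketch of this algorithm. The representation choices are mine, not from the slide: predicates and functions are nested tuples, and uppercase strings are variables (matching X, Y, Z in the examples below):

    def is_var(t):
        return isinstance(t, str) and t[0].isupper()   # convention: uppercase = variable

    def substitute(t, s):
        if is_var(t) and t in s:
            return substitute(s[t], s)
        if isinstance(t, tuple):
            return tuple(substitute(a, s) for a in t)
        return t

    def occurs(v, t):
        if v == t:
            return True
        return isinstance(t, tuple) and any(occurs(v, a) for a in t)

    def unify(x, y, s=None):
        s = {} if s is None else s
        x, y = substitute(x, s), substitute(y, s)
        if x == y:
            return s                                    # identical: no new binding
        if is_var(x):
            return None if occurs(x, y) else {**s, x: y}   # bind with occurs check
        if is_var(y):
            return None if occurs(y, x) else {**s, y: x}
        if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
            for a, b in zip(x, y):                      # same symbol/arity: unify args
                s = unify(a, b, s)
                if s is None:
                    return None                         # FAILURE
            return s
        return None                                     # FAILURE

    # Example 1 from the slides: p(b, X, f(g(Z))) and p(Z, f(Y), f(Y))
    print(unify(('p', 'b', 'X', ('f', ('g', 'Z'))),
                ('p', 'Z', ('f', 'Y'), ('f', 'Y'))))
    # -> {'Z': 'b', 'X': ('f', 'Y'), 'Y': ('g', 'b')}, i.e. X = f(g(b))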
1. Find the MGU of {p(b, X, f(g(Z))) and p(Z, f(Y), f(Y))}
Here, Ψ1 = p(b, X, f(g(Z))) and Ψ2 = p(Z, f(Y), f(Y))
S0 => {p(b, X, f(g(Z))); p(Z, f(Y), f(Y))}
SUBST θ = {b/Z}
S1 => {p(b, X, f(g(b))); p(b, f(Y), f(Y))}
SUBST θ = {f(Y)/X}
S2 => {p(b, f(Y), f(g(b))); p(b, f(Y), f(Y))}
SUBST θ = {g(b)/Y}
S3 => {p(b, f(g(b)), f(g(b))); p(b, f(g(b)), f(g(b)))}
Successfully unified. Unifier: {b/Z, f(g(b))/X, g(b)/Y}.
5. Find the MGU of {Q(a, g(x, a), f(y)) and Q(a, g(f(b), a), x)}
Here, Ψ1 = Q(a, g(x, a), f(y)) and Ψ2 = Q(a, g(f(b), a), x)
S0 => {Q(a, g(x, a), f(y)); Q(a, g(f(b), a), x)}
SUBST θ = {f(b)/x}
S1 => {Q(a, g(f(b), a), f(y)); Q(a, g(f(b), a), f(b))}
SUBST θ = {b/y}
S2 => {Q(a, g(f(b), a), f(b)); Q(a, g(f(b), a), f(b))}
Successfully unified. Unifier: {f(b)/x, b/y}.
Example:
1. John likes all kinds of food.
2. Apples and vegetables are food.
3. Anything that anyone eats without being killed is food.
4. Anil eats peanuts and is still alive.
5. Harry eats everything that Anil eats.
Prove by resolution that:
6. John likes peanuts.
Step-1: Conversion of Facts into FOL
In the first step we will convert all the given statements into first-order logic.
Step-2: Conversion of FOL into CNF
• Move negation (¬) inwards and rewrite:
• ∀x ¬food(x) V likes(John, x)
• food(Apple) Λ food(vegetables)
• ∀x ∀y ¬eats(x, y) V killed(x) V food(y)
• eats(Anil, Peanuts) Λ alive(Anil)
• ∀x ¬eats(Anil, x) V eats(Harry, x)
• ∀x killed(x) V alive(x)
• ∀x ¬alive(x) V ¬killed(x)
• likes(John, Peanuts)
1) ¬food(x) V likes(John, x)
2) food(Apple)
3) food(vegetables)
4) ¬eats(y, z) V killed(y) V food(z)
5) eats(Anil, Peanuts)
6) alive(Anil)
7) ¬eats(Anil, w) V eats(Harry, w)
8) killed(g) V alive(g)
9) ¬alive(k) V ¬killed(k)
10) likes(John, Peanuts)
Exercises
Statement-1: "If I am sleepy then I go to bed"
Statement-2: "I am sleepy"
Conclusion: "I go to bed."
Statement-1: "If I am sleepy then I go to bed" ==> P→ Q
Statement-2: "I am sleepy" ==> P
Conclusion: "I go to bed." ==> Q.
Statement-1: Today is Sunday or Monday.
Statement-2: Today is not Sunday.
Conclusion: Today is Monday.
Cats like fish.
Cats eat everything they like.
Mani is a cat.
Prove: Mani eats fish.
⚫ All people who are graduating are happy. All happy
people smile. Someone is graduating.
⚫ Prove: Someone is smiling.
Marcus was a man. Marcus was a Pompeian. All
Pompeians were Romans. Caesar was a ruler. All Romans
were either loyal to Caesar or hated him. Everyone is loyal
to someone. People only try to assassinate rulers they
aren't loyal to. Marcus tried to assassinate Caesar.
Prove: Marcus hates Caesar.
Deep Learning Visualization
https://poloclub.github.io/cnn-explainer/
Unit – III: NEURAL NETWORKS
❖ Characteristics of Neural Networks, Historical Development of
Neural Networks Principles. Artificial Neural Networks:
Terminology, Models of Neuron, Topology, Basic Learning Laws,
Pattern Recognition Problem, Basic Functional Units, Pattern
Recognition Tasks by the Functional Units.
Artificial Intelligence, Machine Learning and Deep Learning
Biological Neural Networks
A nerve cell (neuron) is a special biological cell that processes information.
Dendrites − They are tree-like branches, responsible for receiving information from the other neurons the neuron is connected to. In a sense, they are like the ears of the neuron.
Artificial Neural Networks (ANN)
Artificial neural networks, usually simply called neural networks, are computing systems (mathematical functions) inspired by biological neural networks.
Most artificial neurons have three things –
• an input transformation function
• an activation function
• an output transformation function
https://www.digitaltrends.com/cool-tech/what-is-an-artificial-neural-network/
Biological Neural Network (BNN) and Artificial Neural Network (ANN)
Soma → Node
Dendrites → Input
Synapse → Weights or interconnections
Axon → Output
Historical Development of Neural Networks
Six Jars of Deep Learning
Data - Data Everywhere
Data Curation
Data Augmentation
Data augmentation is an integral process in deep learning: deep learning needs large amounts of data, and in some cases it is not feasible to collect thousands or millions of images, so data augmentation comes to the rescue.
Operations in Data Augmentation
1) Rotation: Rotation operation as the name suggests, just
rotates the image by a certain specified degree.
2) Zooming: Zooming operation allows us to either zoom in or
zoom out.
3) Cropping: Cropping allows us to crop the image or select a
particular area from an image.
4) Flipping: Flipping allows us to flip the orientation of the
image. We can use horizontal or vertical flip.
5) Changing the brightness level: This feature helps us combat illumination changes.
You can encounter a scenario where most of your dataset comprises images with a similar brightness level, e.g., images of employees entering the office. By augmenting the images we make sure that our model is robust and is able to detect the person even in different lighting conditions. (See the sketch below.)
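A sketch of these operations with Keras's ImageDataGenerator (assuming TensorFlow/Keras is installed; the parameter values, folder path, and image size are illustrative):

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    augmenter = ImageDataGenerator(
        rotation_range=30,            # 1) rotate by up to +/-30 degrees
        zoom_range=0.2,               # 2) zoom in/out by up to 20%
        width_shift_range=0.1,        # 3) crop-like translations
        height_shift_range=0.1,
        horizontal_flip=True,         # 4) flip the orientation
        brightness_range=(0.7, 1.3),  # 5) vary the brightness level
    )

    # Usage sketch: stream augmented batches from an image folder
    # batches = augmenter.flow_from_directory("data/train", target_size=(224, 224))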
What do you want to do with the data?
Task
Task (From description to structure specification)
Task (From description + review to write FAQs)
Task (From description + review + FAQs to Question Answering)
Task (From image identify people)
Task (From image identify activity)
Task (Classification)
Task (Regression)
Task (Clustering)
What is the mathematical formulation of a task?
Models
How do we know which model is better?
Loss Function
How do we identify the parameters of the model?
Learning Algorithm
How do we score our model?
Evaluation

Class → Label:
Lion → 1
Tiger → 2
Cat → 3
Giraffe → 4
Dog → 5

How is evaluation different from the loss function?
Artificial Neural Network
Common Terminology in ANN
Sample – A single row of a dataset.
Epoch – The number of times the algorithm runs on the whole training dataset.
Batch – The number of samples taken together to update the model parameters.
Learning rate – A parameter that gives the model a scale of how much the model weights should be updated; its value is between 0 and 1.
Cost function/loss function – A cost function is used to calculate the cost, that is, the difference between the predicted value and the actual value.
Weights/bias – The learnable parameters in a model that control the signal between two neurons.
Weight increases the steepness of the activation function: the weight decides how fast the activation function triggers, whereas bias delays the triggering of the activation function.
In the absence of bias, the model trains only over functions passing through the origin, which is not in accordance with real-world scenarios. With the introduction of bias, the model becomes more flexible.
How does the function behave when we change w and b?
1. w (controls the slope):
a. Negative w: negative slope, mirrored S-shape; the curve becomes steeper (more vertical, less smooth) the more negative w gets.
b. Positive w: positive slope, normal S-shape; the curve becomes steeper the more positive w gets.
Connections
Multi-layer recurrent network: In this network, a processing element's output can be directed to processing elements in the same layer and in the preceding layer, forming a multilayer recurrent network.
They perform the same task for every element of a sequence, with the output depending on the previous computations. Inputs are not needed at each time step. The main feature of a recurrent neural network is its hidden state, which captures some information about the sequence.
Learning Rules
A learning rule or learning process is a method or mathematical logic that improves the artificial neural network's performance; the rule is applied repeatedly over the network.
Learning rules update the weights and bias levels of a network as the network simulates in a specific data environment.
1) Hebbian learning rule
2) Perceptron learning rule
3) Delta learning rule
4) Correlation learning rule
5) Outstar learning rule
Hebbian Learning Rule
The Hebbian rule was the first learning rule. In 1949 Donald Hebb developed it as a learning algorithm for unsupervised neural networks.
The Hebb learning rule assumes that if two neighboring neurons are activated and deactivated at the same time, the weight connecting them should increase.
For neurons operating in opposite phases, the weight between them should decrease.
If there is no signal correlation, the weight should not change.
When the inputs of both nodes are either positive or negative, a strong positive weight exists between the nodes.
If the input of one node is positive and the other's is negative, a strong negative weight exists between the nodes.
At the start, the values of all weights are set to zero. This learning rule can be used for both soft- and hard-activation functions.
Implementation of AND Gate with Hebbian Learning Rule
Set all weights to zero, wi = 0 for i = 1 to n, and bias to zero.
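A sketch of this exercise in Python, training on the bipolar AND-gate patterns with the standard Hebb update Δwi = xi·y and Δb = y:

    # Bipolar AND-gate training pairs: inputs (x1, x2) -> target y
    samples = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, -1)]

    w1 = w2 = b = 0                       # all weights and bias start at zero
    for x1, x2, y in samples:
        w1 += x1 * y                      # Hebb update: dw_i = x_i * y
        w2 += x2 * y
        b += y
        print(w1, w2, b)                  # weights after each pattern

    # Final weights (2, 2, -2) realize AND: sign(2*x1 + 2*x2 - 2)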
Perceptron Learning Rule
This rule is an error-correcting, supervised learning algorithm for single-layer feedforward networks with a linear activation function.
Being supervised in nature, it calculates the error by comparing the desired/target output with the actual output. If a difference is found, a change is made to the weights of the connections.
The perceptron network consists of three units: the sensory unit (input unit), the associator unit (hidden unit), and the response unit (output unit).
The input units are connected to hidden units with fixed weights having values 1, 0, or -1, which are assigned at random. The binary activation function is used in the hidden layer.
When an error occurs, the weights are updated as w(new) = w(old) + α·t·x, where "t" is the target (+1 or -1) and α is the learning rate.
Implement the AND function using perceptron networks for bipolar inputs and targets.
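A sketch of this exercise in Python, applying the update w(new) = w(old) + α·t·x only when the prediction is wrong (α = 1 assumed):

    samples = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, -1)]
    w1 = w2 = b = 0.0
    alpha = 1.0

    def predict(x1, x2):
        net = w1 * x1 + w2 * x2 + b
        return 1 if net > 0 else -1          # bipolar threshold activation

    for epoch in range(10):
        errors = 0
        for x1, x2, t in samples:
            if predict(x1, x2) != t:          # update only on error
                w1 += alpha * t * x1
                w2 += alpha * t * x2
                b += alpha * t
                errors += 1
        if errors == 0:                       # converged: all patterns correct
            break

    print(w1, w2, b)                          # a separating line for AND, e.g. (1, 1, -1)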
Gradient Descent
Gradient descent is an iterative optimization algorithm for finding the minimum of a function; here that function is our loss function.
The gradient of the loss function is the vector of partial derivatives with respect to w and b.
Picture a person walking downhill: he takes large steps when the slope is steep and small steps when the slope is gentle, and he decides his next position based on his current position and the slope at that point.
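A sketch of gradient descent fitting y = w·x + b under mean-squared-error loss, with updates w = w - lr·dL/dw and b = b - lr·dL/db (the data points and learning rate below are made up):

    # Toy data roughly on the line y = 2x + 1
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.1, 4.9, 7.2]

    w, b, lr = 0.0, 0.0, 0.05
    n = len(xs)

    for step in range(2000):
        # Gradients of MSE loss L = (1/n) * sum((w*x + b - y)^2)
        dw = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        db = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * dw          # step downhill, proportional to the slope
        b -= lr * db

    print(w, b)               # approaches 2.0 and 1.0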
Delta Learning Rule (Widrow-Hoff Rule)
The perceptron learning rule originates from the Hebbian assumption, while the delta rule is derived from the gradient-descent method.
The delta rule updates the weights between the connections so as to minimize the difference between the net input to the output unit and the target value.
The major aim is to minimize the error over all training patterns. This is done by reducing the error for each pattern, one at a time.
The delta rule (DR) is similar to the perceptron learning rule (PLR), with some differences:
1. The error (δ) in DR is not restricted to values of 0, 1, or -1 (as in PLR), but may have any value.
2. DR can be derived for any differentiable output/activation function f, whereas PLR only works for a threshold output function.
Back propagation Network
The back-propagation learning algorithm is one of the most important developments in neural networks.
This learning algorithm is applied to multilayer feed-forward networks consisting of processing elements with continuous, differentiable activation functions.
The basic concept behind this weight-update algorithm is simply the gradient-descent method: the error is propagated back to the hidden units. Back-propagation is a training algorithm.
The training of a BPN is done in three stages: the feed-forward of the input training pattern, the calculation and back-propagation of the error, and the updating of weights.
A back-propagation neural network is a multilayer, feed-forward neural network consisting of an input layer, a hidden layer, and an output layer.
The neurons present in the hidden and output layers have biases, which are the connections from units whose activation is always 1. The bias terms also act as weights.
During the back-propagation phase of learning, signals are sent in the reverse direction. The inputs sent to the BPN and the output obtained from the net could be either binary (0, 1) or bipolar (-1, +1). The activation function can be any function which increases monotonically and is differentiable.
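A compact sketch of the three training stages (feed-forward, error back-propagation, weight update) on the XOR problem, using numpy and sigmoid activations (the network size, learning rate, and epoch count are my choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    T = np.array([[0], [1], [1], [0]])               # XOR targets

    W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # input -> hidden
    W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # hidden -> output

    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    lr = 0.5

    for epoch in range(10000):
        # Stage 1: feed-forward
        H = sigmoid(X @ W1 + b1)
        Y = sigmoid(H @ W2 + b2)
        # Stage 2: back-propagate the error (gradients of squared error)
        dY = (Y - T) * Y * (1 - Y)
        dH = (dY @ W2.T) * H * (1 - H)
        # Stage 3: update weights and biases by gradient descent
        W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
        W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

    print(Y.round(2).ravel())   # approaches [0, 1, 1, 0]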
Activation Function
It helps to determine whether the neuron will fire or not.
The purpose of an activation function is to add non-linearity to the neural network.
If a neural network worked without activation functions, every neuron would perform only a linear transformation on the inputs using the weights and biases. It would not matter how many hidden layers we attached: all layers would behave in the same way, because the composition of two linear functions is itself a linear function.
342
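To see this concretely, here is a small hedged demo (the matrix sizes are chosen arbitrarily) showing that two stacked linear layers collapse into a single linear layer:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)                                   # an arbitrary input vector
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)     # "layer 1"
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)     # "layer 2"

# Two linear layers applied in sequence...
deep = W2 @ (W1 @ x + b1) + b2
# ...equal one linear layer with W = W2 W1 and b = W2 b1 + b2.
shallow = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(deep, shallow))                        # True: no depth was gained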
Activation Function Types
1) Linear Function
2) Binary Step Function
3) Non-Linear Function
Linear Function
343
Mathematically it can be represented as f(x) = x (or, more generally, f(x) = cx for some constant c).
345
Binary Step Function
Pros
Best for perceptrons and single-layer networks.
Cons
Real life isn't that black and white, and most situations are not this binary.
Can only be used at the end of a network.
It cannot provide multi-value outputs; for example, it cannot be used for multi-class classification problems.
The gradient of the step function is zero, which hinders the backpropagation process.
346
Non-linear Activation function
347
Sigmoid Function
These are also called S-shaped functions. Logistic and hyperbolic tangent functions are the commonly used sigmoid functions.
There are two types of sigmoid functions.
Binary Sigmoid function (or Logistic function)
Bipolar Sigmoid function (or Hyperbolic Tangent Function or
Tanh)
351
352
Tanh also faces the problem of vanishing gradients, similar to the sigmoid activation function, although the gradient of the tanh function is much steeper than that of the sigmoid.
353
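Since the formula slides are image-only, here is a hedged sketch of both sigmoid functions and their gradients, illustrating the vanishing-gradient behaviour for large |x|:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # binary sigmoid, range (0, 1)

def tanh(x):
    return np.tanh(x)                    # bipolar sigmoid, range (-1, 1)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)                   # peaks at 0.25 when x = 0

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2           # peaks at 1.0 when x = 0 (steeper)

for x in [0.0, 2.0, 5.0]:
    print(x, sigmoid_grad(x), tanh_grad(x))   # both gradients shrink toward 0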
ReLu (Rectified Linear Unit) Activation
Function
The main catch here is that the ReLU function does not activate
all the neurons at the same time.
Neurons are deactivated only if the output of the linear transformation is less than 0.
354
355
Leaky ReLU Function
Leaky ReLU is an improved version of the ReLU function that addresses the dying-ReLU problem: it has a small positive slope in the negative region.
The advantages of Leaky ReLU are the same as those of ReLU, with the addition that it enables backpropagation even for negative input values.
By making this minor modification for negative input values,
the gradient of the left side of the graph comes out to be a non-
zero value. Therefore, we would no longer encounter dead
neurons in that region.
356
357
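A hedged sketch of both functions; the 0.01 negative slope is the commonly used default, assumed here rather than taken from the slides:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)               # 0 for x < 0, identity otherwise

def leaky_relu(x, slope=0.01):
    return np.where(x >= 0, x, slope * x)   # small non-zero slope for x < 0

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))         # [0.  0.  0.  2.]
print(leaky_relu(x))   # [-0.03  -0.005  0.  2.]: no dead region on the left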
Softmax Function
The softmax function is a function that turns a vector of K real
values into a vector of K real values that sum to 1.
The input values can be positive, negative, zero, or greater than
one, but the softmax transforms them into values between 0
and 1, so that they can be interpreted as probabilities.
If one of the inputs is small or negative, the softmax turns it
into a small probability, and if an input is large, then it turns it
into a large probability, but it will always remain between 0
and 1.
358
359
360
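A hedged implementation; subtracting the maximum before exponentiating is a standard numerical-stability trick, not mentioned on the slide:

import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))     # shift for numerical stability
    return e / e.sum()            # normalise so the outputs sum to 1

scores = np.array([2.0, 1.0, -1.0, 0.0])
p = softmax(scores)
print(p, p.sum())                 # probabilities in (0, 1) that sum to 1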
Choosing the Right One
1) Use the ReLU or Leaky ReLU function in hidden layers only.
2) In binary classification, remember to use the Sigmoid function in the output layer.
3) In multi-class classification (when there are more than 2 classes to predict), use the Softmax function in the output layer.
4) Due to the vanishing-gradient problem, the Sigmoid and Tanh activation functions are sometimes avoided in deep neural network architectures.
5) Always remember that you can also invent your own activation function and check its performance.
361
McCulloch-Pitts Neuron
362
It may be divided into two parts. The first part, g, takes the inputs and performs an aggregation; based on the aggregated value, the second part, f, makes a decision.
366
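A hedged sketch of the g/f split for an MP neuron; the threshold value b = 2 is an assumption for illustration:

def g(inputs):                 # aggregation: sum of the boolean inputs
    return sum(inputs)

def f(aggregated, b=2):        # decision: fire iff the sum reaches threshold b
    return 1 if aggregated >= b else 0

def mp_neuron(inputs, b=2):
    return f(g(inputs), b)

print(mp_neuron([1, 1, 0]))    # 1: threshold reached
print(mp_neuron([1, 0, 0]))    # 0: below threshold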
Data and Task
367
Loss Function
368
Learning Algorithm
369
Evaluation
370
Problems with MP
Neuron Model
1) Boolean Inputs.
2) Boolean Outputs.
3) Linear
4) Fixed slope.
5) Threshold b can take only a few possible
values.
6) All the inputs to the model have equal weights.
371
MP Neuron Summary
372
373
Calculate the net input for the network shown in the figure, with the bias included in the network.
374
375
376
Perceptron
The original Perceptron was designed to take a number of binary inputs and produce one binary output (0 or 1). It was later updated to accept real numbers as inputs.
The idea was to use different weights to represent the
importance of each input, and that the sum of the values
should be greater than a threshold value before making a
decision like true or false (0 or 1).
377
Single and Multi Layered
Perceptron Model
1. Single Layer Perceptron - The single layer perceptron is defined by its ability to linearly classify inputs. This kind of model utilizes only a single hyperplane and classifies the inputs according to the weights learned beforehand.
378
Data and Task
379
Model
380
Loss function
382
Perceptron Learning
Algorithm
383
384
Evaluation
385
386
MP Neuron v/s Perceptron
387
How Loss function is
different from Square Loss
388
389
Linearly Separable
Two sets of points (or classes) are called linearly separable if there exists a straight line in the plane such that all the points of one class lie on one side of the line and all the points of the other class lie on the other side.
390
Suppose we have a perceptron whose weights corresponding to the three inputs have the following values:
w1 = 2 ; w2 = −4; and w3 = 1
and the activation of the unit is given by the step-function:
φ(v) = 1 if v≥0 otherwise 0
Calculate the output value y of the given perceptron for each of
the following input patterns:
Pattern P1 P2 P3 P4
X1 1 0 1 1
X2 0 1 0 1
X3 0 1 1 1
391
Pattern P1 P2 P3 P4
w1 = 2 ; w2 = −4; and w3 = 1
X1 1 0 1 1
X2 0 1 0 1
X3 0 1 1 1
392
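A hedged check of this exercise in code, applying v = w1·x1 + w2·x2 + w3·x3 with no bias (the problem statement gives none) and the step activation above:

w = [2, -4, 1]
patterns = {"P1": [1, 0, 0], "P2": [0, 1, 1], "P3": [1, 0, 1], "P4": [1, 1, 1]}

for name, x in patterns.items():
    v = sum(wi * xi for wi, xi in zip(w, x))   # weighted sum of the inputs
    y = 1 if v >= 0 else 0                     # step activation
    print(name, v, y)   # P1: v=2 -> 1, P2: v=-3 -> 0, P3: v=3 -> 1, P4: v=-1 -> 0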
Consider a feed-forward neural network having 2 input nodes (labeled 1 and 2) with fully connected layers and 2 hidden layers:
Hidden layer-1: Nodes labeled as 3 and 4
Hidden layer-2: Nodes labeled as 5 and 6
A weight on the connection between nodes i and j is represented by wij; for example, w24 is the weight on the connection between nodes 2 and 4. The following list contains all the weight values used in the given network:
w13 = −2, w35 = 1, w23 = 3, w45 = −1, w14 = 4, w36 = −1, w24 = −1, w46 = 1
Each of the nodes 3, 4, 5, and 6 use the following activation function:
φ(v) = 1 if v≥0 otherwise 0
where v denotes the weighted sum of a node. Each of the input nodes (1 and
2) can only receive binary values (either 0 or 1). Calculate the output of the
network (y5 and y6) for the input pattern given by (node-1 and node-2 as 0, 0
respectively).
393
w13 = −2, w35 = 1, w23 = 3, w45 = −1, w14 = 4, w36 = −1, w24 = −1, w46 = 1
394
395
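A hedged walk-through of this exercise in code, with the node numbering and weights exactly as given above:

w = {"13": -2, "23": 3, "14": 4, "24": -1,    # inputs -> hidden layer 1
     "35": 1, "45": -1, "36": -1, "46": 1}    # hidden layer 1 -> hidden layer 2

def phi(v):                       # step activation used by nodes 3-6
    return 1 if v >= 0 else 0

x1, x2 = 0, 0                     # the given input pattern
y3 = phi(w["13"] * x1 + w["23"] * x2)   # v = 0 -> y3 = 1
y4 = phi(w["14"] * x1 + w["24"] * x2)   # v = 0 -> y4 = 1
y5 = phi(w["35"] * y3 + w["45"] * y4)   # v = 1 - 1 = 0 -> y5 = 1
y6 = phi(w["36"] * y3 + w["46"] * y4)   # v = -1 + 1 = 0 -> y6 = 1
print(y5, y6)                     # output: 1 1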
XOR with a Hidden Layer
396
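The diagram on this slide is image-only; one common hand-crafted solution (these weights are assumptions for illustration, not read off the slide) combines an OR-like and an AND-like hidden unit:

def step(v):
    return 1 if v >= 0 else 0

def xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # hidden unit 1 computes OR
    h2 = step(x1 + x2 - 1.5)      # hidden unit 2 computes AND
    return step(h1 - h2 - 0.5)    # output fires for OR-but-not-AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))    # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0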
Loss Function
The loss function in a neural network quantifies the difference
between the expected outcome and the outcome produced by
the machine learning model.
From the loss function, we can derive the gradients which are
used to update the weights. The average over all losses
constitutes the cost.
In simple words, the loss is used to calculate the gradients, and the gradients are used to update the weights of the neural net.
398
Binary Crossentropy
Cross-entropy-based loss functions are commonly used in
classification scenarios. Cross entropy is a measure of the
difference between two probability distributions.
If you are using BCE loss function, you just need one output
node to classify the data into two classes. The output value
should be passed through a sigmoid activation function and the
range of output is (0 – 1).
For rain prediction, say, if the output is greater than 0.5 the network classifies it as rain, and if the output is less than 0.5 it classifies it as not rain.
One important point: if you are using the BCE loss function, the output of the node should lie between 0 and 1, so you have to use a sigmoid activation function on your final output, since sigmoid maps any real value into the range (0, 1).
399
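A hedged numpy implementation of binary cross-entropy; the clipping by a small epsilon is an added safeguard against log(0), not mentioned on the slide:

import numpy as np

def binary_crossentropy(t, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)              # avoid log(0)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

targets = np.array([1.0, 0.0, 1.0])           # 1 = rain, 0 = no rain
sigmoid_outputs = np.array([0.9, 0.2, 0.6])   # one output node in (0, 1)
print(binary_crossentropy(targets, sigmoid_outputs))  # low loss: mostly correct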
Categorical Crossentropy
When we have a multi-class classification task, this is one of the loss functions you can choose. If you are using the CCE loss function, there must be the same number of output nodes as there are classes, and the final layer output should be passed through a softmax activation so that each node outputs a probability value between 0 and 1.
For feeding the target value at the time of training, we have to
one-hot encode them.
Basically, the target vector would be of the same size as the
number of classes and the index position corresponding to the
actual class would be 1 and all others would be zero.
400
Sparse Categorical Crossentropy
This loss function is almost the same as CCE except for one change.
When using the SCCE loss function, you do not need to one-hot encode the target vector. If the target image is of a cat you simply pass 0, otherwise 1; basically, whichever the class is, you just pass the index of that class.
401
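A hedged comparison of the two target encodings; the class count and probabilities are invented for illustration:

import numpy as np

probs = np.array([0.7, 0.2, 0.1])     # softmax output for 3 classes

# Categorical crossentropy: the target is one-hot encoded.
one_hot = np.array([1.0, 0.0, 0.0])
cce = -np.sum(one_hot * np.log(probs))

# Sparse categorical crossentropy: the target is just the class index.
index = 0
scce = -np.log(probs[index])

print(cce, scce)                      # identical values, different target formats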
402
403
Learning Algorithm
404
405
406
For more than two variables
407
Optimization Algorithms
Optimizers are algorithms or methods used to change the attributes of your neural network, such as the weights and the learning rate, in order to reduce the losses.
Types of optimizer
1. Gradient Descent
2. Stochastic Gradient Descent
3. Stochastic Gradient descent with momentum
4. Mini-Batch Gradient Descent
5. Adagrad
6. RMSProp
7. AdaDelta
8. Adam
413
Gradient Descent Optimizer
Gradient descent is an iterative optimization algorithm to find
the minimum of a function. Here that function is our Loss
Function.
The gradient of the loss function is its partial derivative with respect to w and b.
Picture a person walking downhill: he goes down the slope taking large steps when the slope is steep and small steps when the slope is gentle, and he decides his next position based on his current position and the slope at that point.
414
Gradient descent is a first-order optimization algorithm which
is dependent on the first order derivative of a loss function.
It calculates which way the weights should be altered so that the function can reach a minimum.
Through backpropagation, the loss is transferred from one
layer to another and the model’s parameters also known as
weights are modified depending on the losses so that the loss
can be minimized.
415
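A hedged sketch of full-batch gradient descent for a squared-error loss with one weight and one bias; the toy data and learning rate are invented for illustration:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
T = np.array([3.0, 5.0, 7.0, 9.0])      # underlying relation: t = 2x + 1
w, b, alpha = 0.0, 0.0, 0.01

for step in range(5000):
    Y = w * X + b                        # predictions on the whole dataset
    dw = np.mean(2 * (Y - T) * X)        # partial derivative of the loss w.r.t. w
    db = np.mean(2 * (Y - T))            # partial derivative of the loss w.r.t. b
    w -= alpha * dw                      # step against the gradient
    b -= alpha * db

print(w, b)                              # approaches w = 2, b = 1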
If we choose α to be very large, gradient descent can overshoot the minimum; it may fail to converge or even diverge. If we choose α to be very small, gradient descent will take small steps and will need a long time to reach the local minimum.
416
Behavior of GD on different
Surface?
417
Advantages:
1. Very simple to implement.
Disadvantages:
1. This algorithm takes the entire dataset of n points at a time to compute the derivative and update the weights, which requires a lot of memory.
2. The minimum is reached only after a long time, or may never be reached.
3. This algorithm can get stuck at local minima.
418
Stochastic Gradient Descent
(SGD)
419
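The body of this slide is image-only; as a hedged sketch, SGD differs from full-batch gradient descent only in computing the derivative from a single randomly chosen point per update:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
T = np.array([3.0, 5.0, 7.0, 9.0])
w, b, alpha = 0.0, 0.0, 0.01
rng = np.random.default_rng(0)

for step in range(20000):
    i = rng.integers(len(X))             # pick one training point at random
    y = w * X[i] + b
    w -= alpha * 2 * (y - T[i]) * X[i]   # noisy update from that point only
    b -= alpha * 2 * (y - T[i])

print(w, b)                              # fluctuates around w = 2, b = 1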
Momentum Based Gradient
Descent
420
421
422
Advantage:
1. The memory requirement is lower than for the GD algorithm, as the derivative is computed taking only one point at a time.
Disadvantages:
1. The time required to complete one epoch is large compared to the GD algorithm.
2. It takes a long time to converge.
3. It may get stuck at local minima.
423
Nesterov Accelerated Gradient
(NAG) Update rules
424
425
Mini Batch Stochastic
Gradient Descent (MB-SGD)
The MB-SGD algorithm is an extension of the SGD algorithm that overcomes the problem of SGD's large time complexity. The MB-SGD algorithm takes a batch, or subset, of points from the dataset to compute the derivative.
It is observed that the derivative of the loss function for MB-SGD is almost the same as the derivative of the loss function for GD after some number of iterations. However, the number of iterations needed to reach the minimum is large for MB-SGD compared to GD, and the cost of computation is also high.
The weight update depends on the derivative of the loss for a batch of points. The updates in the case of MB-SGD are much noisier because the derivative is not always towards the minimum.
426
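A hedged sketch of the mini-batch variant; the batch size of 2 and the toy data are assumptions for illustration:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
T = np.array([3.0, 5.0, 7.0, 9.0])
w, b, alpha, batch = 0.0, 0.0, 0.01, 2
rng = np.random.default_rng(0)

for step in range(10000):
    idx = rng.choice(len(X), size=batch, replace=False)   # sample a mini-batch
    Y = w * X[idx] + b
    w -= alpha * np.mean(2 * (Y - T[idx]) * X[idx])       # batch-averaged gradient
    b -= alpha * np.mean(2 * (Y - T[idx]))

print(w, b)    # settles close to w = 2, b = 1, with milder noise than SGD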
Advantages:
1. Less time complexity to converge compared to
standard SGD algorithm.
Disadvantages:
1. The update of MB-SGD is much noisier than the update of the GD algorithm.
2. May get stuck at local minima.
427
SGD with Momentum
428
Advantages:
1. Has all advantages of the SGD algorithm.
2. Converges faster than the GD algorithm.
Disadvantages:
1. We need to compute one more variable for each
update
429
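The update-rule slides are image-only; the standard momentum update (β = 0.9 is an assumed, commonly used value) keeps a running velocity of past gradients, which is the "one more variable" mentioned above:

def grad(w):
    return 2 * (w - 3)        # toy loss (w - 3)^2, with its minimum at w = 3

w, v = 0.0, 0.0
alpha, beta = 0.1, 0.9        # learning rate and momentum coefficient
for _ in range(100):
    v = beta * v + grad(w)    # accumulate a velocity from past gradients
    w = w - alpha * v         # the extra variable v is the added memory cost
print(w)                      # approaches 3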
Adaptive Learning Rate
430
Adagrad (Adaptive Gradient
Descent)
One of the disadvantages of all the optimizers explained is that the
learning rate is constant for all parameters and for each cycle.
The adaptive gradient descent algorithm is slightly different from
other gradient descent algorithms. This is because it uses different
learning rates for each iteration.
The change in learning rate depends on how the parameters change during training: the more a parameter has been updated, the smaller its learning rate becomes.
This modification is highly beneficial because real-world datasets
contain sparse as well as dense features. So it is unfair to have the
same value of learning rate for all the features.
431
432
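The formula slide is image-only; a hedged sketch of the standard Adagrad update (the epsilon is an added safeguard against division by zero):

import numpy as np

def grad(w):                        # toy gradient for the loss (w - 3)^2
    return 2 * (w - 3)

w, cache = 0.0, 0.0
alpha, eps = 0.5, 1e-8
for _ in range(500):
    g = grad(w)
    cache += g ** 2                 # squared gradients accumulate forever
    w -= alpha * g / (np.sqrt(cache) + eps)   # per-parameter step size shrinks
print(w)                            # approaches 3, but ever more slowly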
Advantages:
1. Learning rate changes for each training parameter.
2. Don’t need to manually tune the learning rate.
3. Able to train on sparse data.
Disadvantages:
1. Computationally expensive, as the squared gradients of every parameter must be accumulated at each step.
2. The learning rate is always decreasing, which results in slow training.
433
RMS Prop(Root Mean Square)
The idea behind this family of algorithms (originally Rprop) is that two successive gradients are first compared for signs. If they have the same sign, we are going in the right direction and hence increase the step size by a small fraction; if they have opposite signs, we decrease the step size. The step size is then bounded, and the weight update is performed. RMSProp adapts this idea to mini-batch training by dividing the learning rate by a moving average of the squared gradients.
434
435
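A hedged sketch of the moving-average form of the RMSProp update; the decay of 0.9 is the commonly cited default, assumed here:

import numpy as np

def grad(w):
    return 2 * (w - 3)              # toy gradient, minimum at w = 3

w, avg = 0.0, 0.0
alpha, decay, eps = 0.1, 0.9, 1e-8
for _ in range(500):
    g = grad(w)
    avg = decay * avg + (1 - decay) * g ** 2      # moving average of squared grads
    w -= alpha * g / (np.sqrt(avg) + eps)         # normalised step size
print(w)                            # approaches 3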
Adam
Adaptive Moment Estimation combines the power of RMSProp
(root-mean-square prop) and momentum-based GD. In Adam
optimizers, the power of momentum GD to hold the history
of updates and the adaptive learning rate provided by
RMSProp makes Adam optimizer a powerful method. It also
introduces two new hyper-parameters, beta1 and beta2, which are usually kept around 0.9 and 0.999 but can be changed according to your use case.
436
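A hedged sketch of the standard Adam update with bias correction; the default hyper-parameter values are assumptions based on common practice:

import numpy as np

def grad(w):
    return 2 * (w - 3)                      # toy gradient, minimum at w = 3

w, m, v = 0.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g         # momentum-style history of gradients
    v = beta2 * v + (1 - beta2) * g ** 2    # RMSProp-style history of squared grads
    m_hat = m / (1 - beta1 ** t)            # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)
print(w)                                    # approaches 3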
Comparison between various
optimizers
437
Parameter Tuning
438
Pattern Recognition
Pattern recognition is the process of finding regularities and similarities in data using machine learning. These similarities can be found based on statistical analysis, historical data, or knowledge the machine has already gained.
In sports, for instance, a description of a type of sport would be a pattern. If a person keeps watching videos related to cricket, YouTube wouldn't recommend them chess tutorial videos.
Examples: speech recognition, speaker identification, multimedia document recognition (MDR), automatic medical diagnosis.
Before searching for a pattern there are certain steps, and the first one is to collect the data from the real world. The collected data needs to be filtered and pre-processed so that the system can extract features from it. Then, based on the type of data, the system chooses the appropriate algorithm among classification, clustering and regression to recognize the pattern. 439
1) Classification. In classification, the algorithm assigns labels
to data based on the predefined features. This is an example
of supervised learning.
2) Clustering. An algorithm splits data into a number of clusters
based on the similarity of features. This is an example of
unsupervised learning.
3) Regression. Regression algorithms try to find a relationship
between variables and predict unknown dependent variables
based on known data. It is based on supervised learning.
440
Components of Pattern
Recognition
Input → Sensing → Segmentation → Feature Extraction → Classification → Post-Processing → Decision
441
1) Sensing
It gathers the information to be classified. The sensors in a system are used to receive the input data.
442
2) Segmentation
After receiving the input data, the different patterns need to be separated.
The measurement data is partitioned into small pieces so that each part represents exactly one object to be classified.
The segmentor isolates sensed objects from the background or from other objects.
443
3) Feature Extraction
Feature selection is the process of selecting a subset of a given set of variables.
Feature selection filters irrelevant or redundant features out of your dataset.
445
Good vs. Bad Features
What Makes a Good Feature?
448
Expert System
In artificial intelligence, an expert system is a computer system
that emulates the decision-making ability of a human expert.
The first expert systems were created in the 1970s, and their production increased in the 1980s.
459
Rule-based Systems – Forward
Chaining
Forward chaining is a method of reasoning in artificial
intelligence in which inference rules are applied to
existing data to extract additional data until an endpoint
(goal) is achieved.
In this type of chaining, the inference engine starts by
evaluating existing facts, derivations, and conditions
before deducing new information. An endpoint (goal) is
achieved through the manipulation of knowledge that
exists in the knowledge base.
460
Tom is running (A)
If a person is running, he will sweat (A->B)
Therefore, Tom is sweating. (B)
461
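A hedged toy inference engine illustrating forward chaining on the running-and-sweating example; the rule representation is invented for illustration:

# Facts and rules for the example above: A = "Tom is running", B = "Tom is sweating".
facts = {"Tom is running"}
rules = [({"Tom is running"}, "Tom is sweating")]   # (premises, conclusion)

# Forward chaining: keep applying rules to known facts until nothing new appears.
changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)       # derive new data from existing data
            changed = True

print(facts)    # {'Tom is running', 'Tom is sweating'}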
Backward Chaining
Forward chaining is not very efficient if the system tries
to prove a conclusion. In this case, it may be more
efficient if backward chaining is used.
Backward chaining is a concept in artificial intelligence
that involves backtracking from the endpoint or goal to
steps that led to the endpoint.
462
Tom is sweating (B).
If a person is running, he will sweat (A->B).
Tom is running (A).
463
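For contrast, a hedged sketch of backward chaining over the same rule set, working from the goal back to supporting facts (a toy version that assumes no cyclic rules):

facts = {"Tom is running"}
rules = [({"Tom is running"}, "Tom is sweating")]

def prove(goal):
    # A goal holds if it is a known fact, or if some rule concludes it
    # and all of that rule's premises can themselves be proved.
    if goal in facts:
        return True
    return any(conclusion == goal and all(prove(p) for p in premises)
               for premises, conclusion in rules)

print(prove("Tom is sweating"))   # True: backtracks from the goal to the fact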
Components of an Expert
System
464
Development of Expert
Systems: General Steps
1) Identify Problem Domain
▪ The problem must be suitable for an expert system to solve it.
▪ Find the experts in task domain for the ES project.
▪ Establish cost-effectiveness of the system.
▪ Design the System
2) Identify the ES Technology
▪ Know and establish the degree of integration with the other
systems and databases.
▪ Realize how the concepts can represent the domain knowledge
best.
3) Develop the Prototype
▪ From Knowledge Base: The knowledge engineer works to −
▪ Acquire domain knowledge from the expert.
▪ Represent it in the form of IF-THEN-ELSE rules.
465
4) Test and Refine the Prototype
▪ The knowledge engineer uses sample cases to test the prototype for
any deficiencies in performance.
▪ End users test the prototypes of the ES.
5) Develop and Complete the ES
▪ Test and ensure the interaction of the ES with all elements of its
environment, including end users, databases, and other
information systems.
▪ Document the ES project well.
▪ Train the user to use ES.
6) Maintain the System
▪ Keep the knowledge base up-to-date by regular review and update.
▪ Cater for new interfaces with other information systems, as those
systems evolve.
466
Characteristics of an Expert
System
Human experts are perishable, but an expert system is
permanent.
It helps to distribute the expertise of a human.
One expert system may contain knowledge from more than one human expert, thus making the solutions more efficient.
It decreases the cost of consulting an expert for various
domains such as medical diagnosis.
They use a knowledge base and inference engine.
Expert systems can solve complex problems by deducing
new facts through existing facts of knowledge,
represented mostly as if-then rules rather than through
conventional procedural code. 467
Limitations
468
Advantages :
Low accessibility cost.
Fast response.
Not affected by emotions, unlike humans.
Low error rate.
Capable of explaining how they reached a solution.
Disadvantages :
The expert system has no emotions.
Common sense is the main issue of the expert system.
It is developed for a specific domain.
It needs to be updated manually; it does not learn by itself.
It may not be able to explain the logic behind a particular decision.
469
Applications
Different types of medical diagnosis, such as internal medicine, blood diseases and so on.
Diagnosis of complex electronic and electromechanical systems.
Diagnosis of a software development project.
Planning experiments in biology, chemistry and molecular genetics.
Forecasting crop damage.
Diagnosis of diesel-electric locomotive systems.
Identification of chemical compound structure.
Scheduling of customer orders, computer resources and various manufacturing tasks.
Assessment of geologic structure from dipmeter logs.
Assessment of space structures through satellites and robots.
The design of VLSI systems.
Teaching students specialized tasks. 470
Prolog
❖ Prolog - PROgramming in LOGic,