Unit - 5 Natural Language Processing


Unit - 5

Natural Language Processing


Why Natural Language Processing?

 Huge amount of data


 At least 3.6 billion pages
 Applications for processing large amount of texts
 Requires NLP expertise
 Classify text into categories
 Index and search large texts
 Automatic Translation
 Speech understanding
 Understanding phone conversations
 Information extraction
 Extract useful information from resumes
 Automatic summarization
 Question answering
 Knowledge acquisition
 Text generations / Dialogs
Natural?

 Natural Language
 Refers to the languages spoken by people, e.g. English and
Hindi, as opposed to artificial languages like C++, Java,
etc.
 Natural Language Processing
 Applications that deal with natural language in a way or
another
 [Computational Linguistics]
 Doing linguistics on computers
 More on the linguistic side than NLP, but closely related.
[Diagram: where NLP fits within computer science]
Computers → Databases, Algorithms, Networking, Artificial Intelligence
Artificial Intelligence → Natural Language Processing, Search, Robotics
Natural Language Processing → Machine Translation, Information Retrieval, Language Analysis
Language Analysis → Semantics, Parsing (rooted in Linguistics)

Levels of Analysis

 Speech
 Written Language
 Phonology: Sounds / Letters / Pronunciation
 Morphology: the structure of words
 Syntax: How these sequences are structured
 Semantics: meaning of the strings
 Interaction between levels
Issues in Syntax

 “the dog ate my homework” – Who did what?


 1. Identify the part of Speech (POS)
 Dog = noun; ate = verb; homework = noun
 English POS tagging: 95%
 Can be improved!

 2. Identify Collocations – Chunking or local word grouping


 mother in law, hot dog
 Shallow parsing:
 “the dog chased the bear”
 “the dog” “chased the bear”
 Subject – predicate
 Identify basic structures
 NP = [the dog]; VP = [chased the bear]
 Shallow parsing on new languages
 Shallow parsing with little training data
 Anaphora Resolution
 The dog entered my room. It scared me
 Preposition Attachment
 I saw the man in the park with a telescope
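The chunking step described above can be sketched as a naive rule: group an optional determiner, any adjectives, and a noun into an NP. This is a minimal illustration with an invented tag set, not a trained shallow parser:

```python
# A minimal noun-phrase chunker over POS-tagged tokens, illustrating the
# "shallow parsing" idea above. The tag set and grouping rule are
# simplified assumptions, not a production chunker.

def chunk_nps(tagged):
    """Group DET? ADJ* NOUN runs into NP chunks; leave other tokens alone."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DET":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        if j < len(tagged) and tagged[j][1] == "NOUN":
            chunks.append(("NP", [w for w, _ in tagged[i:j + 1]]))
            i = j + 1
        else:
            chunks.append(tagged[i])
            i += 1
    return chunks

sent = [("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"),
        ("the", "DET"), ("bear", "NOUN")]
print(chunk_nps(sent))
# [('NP', ['the', 'dog']), ('chased', 'VERB'), ('NP', ['the', 'bear'])]
```

This mirrors the "the dog" / "chased the bear" grouping on the slide: the subject and object NPs are identified without building a full parse tree.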
Issues in Semantics

 Understand language! How?


 “plant” = industrial plant
 “plant” = living organism
 Words are ambiguous
 Importance of semantics?
 Machine Translation: wrong translations
 Information Retrieval: wrong information
 Anaphora Resolution: wrong referents
Contd..

 How to learn the meaning of words?


 From dictionaries:
 Plant, works, industrial plant – (buildings for carrying on
industrial labor; “they built a large plant to manufacture
automobiles”)
 Plant, flora, plant life – (a living organism lacking the power of
locomotion)
 They are producing about 1000 automobiles in the new plant
 The sea flora consists of 1000 different plant species
 The plant was close to the farm of animals
Contd..

 Learn from annotated examples


 Assume 100 examples containing “plant” previously
tagged by a human
 Train a learning algorithm
 Precision in the range of 60%-70%
 How to choose the learning algorithm?
 How to obtain the 100 tagged examples?
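The learn-from-tagged-examples idea above can be sketched as a toy classifier: count which context words co-occur with each sense in the training data, then pick the sense whose counts best cover a new context. The training examples and sense labels here are invented for illustration:

```python
from collections import Counter, defaultdict

# A toy word-sense disambiguator trained on hand-tagged examples, as
# sketched above: score each sense by how many of its training context
# words appear in the new context. The example data is invented.

def train(examples):
    """examples: list of (context_words, sense). Count word/sense pairs."""
    counts = defaultdict(Counter)
    for words, sense in examples:
        for w in words:
            counts[sense][w] += 1
    return counts

def disambiguate(counts, context):
    """Pick the sense whose training contexts best overlap this context."""
    return max(counts, key=lambda s: sum(counts[s][w] for w in context))

examples = [
    (["automobiles", "manufacture", "workers"], "factory"),
    (["factory", "production", "workers"], "factory"),
    (["flora", "species", "leaves"], "organism"),
    (["garden", "leaves", "flowers"], "organism"),
]
model = train(examples)
print(disambiguate(model, ["workers", "production"]))  # factory
print(disambiguate(model, ["leaves", "garden"]))       # organism
```

Real systems replace the raw counts with a proper learning algorithm (decision trees, naive Bayes, neural networks), but the train/test split of tagged examples is the same.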
Issues in Learning Semantics

 Learning?
 Assume a large amount of annotated data = training
 Assume a new text not annotated = test
 Learn from previous experience to classify new data
 Decision trees, memory based learning, neural networks
 Machine Learning
Issues in Information Extraction

 There was a group of about 8-9 people close to the


entrance on Highway 75
 Who? 8-9 people
 Where? Highway 75

 Extract information
 Detect new patterns
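The slot-filling on this slide can be sketched with pattern matching: pull the "who" and "where" answers out of the example sentence with regular expressions. The patterns are ad hoc illustrations, not a general extraction system:

```python
import re

# A pattern-based extractor for the example above: fill "who" and
# "where" slots with regular expressions. Real IE systems learn or
# hand-craft many such patterns per domain.

text = ("There was a group of about 8-9 people close to the "
        "entrance on Highway 75")

who = re.search(r"(\d+-\d+|\d+)\s+people", text)
where = re.search(r"on\s+(Highway\s+\d+)", text)

print("Who?", who.group(1))      # 8-9
print("Where?", where.group(1))  # Highway 75
```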
Issues in Information Retrieval

 General Model
 A huge collection of texts
 A query
 Task: find documents that are relevant to the given
query
 How? Create an index, like the index in a book
 Examples: Google, Yahoo, Altavista, etc.
Contd…

 Index meaning
 Search for plant(=living organism)
 Should not retrieve texts with plant
 (=industrial plant)
 But should retrieve documents including “flora” or
other related terms
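The "index like the index in a book" mentioned above is an inverted index: a map from each term to the documents containing it. A minimal sketch, with invented document texts:

```python
from collections import defaultdict

# A minimal inverted index for the retrieval model above: map each
# term to the set of document ids that contain it. Note this indexes
# surface words only -- it cannot distinguish plant(=organism) from
# plant(=factory), which is exactly the word-sense problem the slide
# raises.

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "the plant manufactures automobiles",
    2: "sea flora includes many plant species",
    3: "the farm was close to the river",
}
index = build_index(docs)
print(sorted(index["plant"]))  # [1, 2] -- both senses retrieved
print(sorted(index["flora"]))  # [2]
```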
Issues in Machine Translation

 Text to Text Machine Translation


 Speech to Speech Machine Translation
 Most of the work has addressed pairs of widely
spoken languages like English-French and English-Chinese
Steps in NLP Process

 Morphological Analysis
 Syntactic Analysis
 Semantic Analysis
 Discourse Integration
 Pragmatic Analysis
 Morphological Analysis
 Individual words are analyzed into their components and
nonword tokens such as punctuation are separated from
words
 Syntactic Analysis
 Linear sequences of words are transformed into structures
that show how the words relate to each other
 Semantic Analysis
 Structures created by the syntactic analyzer are assigned
meanings
 Discourse Integration
 The meaning of an individual sentence may depend on
the sentences that precede it and may influence the
meanings of the sentences that follow it
 Pragmatic Analysis
 The structure representing what was said is
reinterpreted to determine what was actually meant.
 Morphological Analysis

 Example:
I want to print Bill’s .init file

Want, print, file can all function as more than one syntactic
category
 Syntactic Analysis
 Exploits the results of morphological analysis to
build a structural description of the sentence.

 Converts the flat list of words that forms the


sentence into a structure that defines the units that
are represented by that flat list
 Flat sentence is converted into hierarchical structure
and that structure is designed to correspond to
sentence units that will correspond to meaning units
when semantic analysis is performed.
 - Parsing

 Create a set of entities called Reference Markers


 Semantic Analysis
 Must do the following two important things
 It must map individual words into appropriate objects
in the knowledge base or database

 It must create correct structures to correspond to the


way the meanings of the individual words combine
with each other
Discourse Integration
Pragmatic Analysis

 Discover the intended effect by applying a set of rules


that characterizes the dialogues.
Syntactic Processing

 Flat input sentence is converted into a hierarchical


structure that corresponds to the units of meaning in
sentence. – Parsing
 Parsing constrains the number of constituents that
semantics can consider.
 Syntactic parsing is computationally less expensive
than semantic processing
 Systems use two components given below
1. Declarative Representation called Grammar
– Syntactic facts about the language
2. Procedure called Parser – compares the
grammar against input sentences to produce
parsed structures
Grammars and Parsers

 A set of production rules


 A sentence is composed of noun phrase
followed by a verb phrase
A simple Grammar for a
Fragment of English

 SNP VP  VPV
 NPthe NP1  VPV NP
 NPPRO  Nfile | Printer
 NPPN  PN Bill
 NP NP1  PROI
 NPI ADJS N  ADJ short | long | fast
 ADJS ∈ | ADJ ADJS  V printed | created | want
 Parse tree simply records the rules and how
they are matched
 The parsing process takes the rules of the
grammar and compares them against the
input sentence.
 Each rule that matches add something to the
complete structure
S
├─ NP
│  └─ PN: Bill
└─ VP
   ├─ V: printed
   └─ NP
      └─ the NP1
         ├─ ADJS: ε
         └─ N: file

Bill printed the file – A Parse Tree for the Structure
 Grammar specifies two things about the language
1. Its weak generative capacity – grammatical sentences
2. Its strong generative capacity – structure to be
assigned
 Top-Down Vs Bottom-Up Parsing
 Find a way in which the sentence could have been
generated from the start symbol.

 Top-Down Parsing
 Begin with the start symbol and apply the grammar
rules forward until the symbols at the terminals of the
tree correspond to the components of the sentence
being parsed
 Bottom-Up Parsing
 Begin with a sentence to be parsed and apply the
grammar rules backward until a single tree whose
terminals are the words of the sentence
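The top-down strategy can be sketched as a recursive-descent parser for the fragment grammar given earlier (S → NP VP, NP1 → ADJS N, and so on). Each function tries the rules for one nonterminal and returns a parse tree plus the next input position, or None on failure. This is an illustrative sketch with limited backtracking, not an efficient parser:

```python
# A top-down (recursive-descent) parser for the fragment grammar above.
# The lexical categories come straight from the grammar's terminal rules.

N   = {"file", "printer"}
PN  = {"Bill"}
PRO = {"I"}
ADJ = {"short", "long", "fast"}
V   = {"printed", "created", "want"}

def parse_S(toks, i):                      # S -> NP VP
    np = parse_NP(toks, i)
    if np:
        vp = parse_VP(toks, np[1])
        if vp:
            return ("S", np[0], vp[0]), vp[1]

def parse_NP(toks, i):                     # NP -> the NP1 | PRO | PN | NP1
    if i < len(toks) and toks[i] == "the":
        np1 = parse_NP1(toks, i + 1)
        if np1:
            return ("NP", "the", np1[0]), np1[1]
    if i < len(toks) and toks[i] in PRO | PN:
        return ("NP", toks[i]), i + 1
    return parse_NP1(toks, i)

def parse_NP1(toks, i):                    # NP1 -> ADJS N
    adjs = []
    while i < len(toks) and toks[i] in ADJ:
        adjs.append(toks[i]); i += 1
    if i < len(toks) and toks[i] in N:
        return ("NP1", adjs, toks[i]), i + 1

def parse_VP(toks, i):                     # VP -> V NP | V
    if i < len(toks) and toks[i] in V:
        np = parse_NP(toks, i + 1)
        if np:
            return ("VP", toks[i], np[0]), np[1]
        return ("VP", toks[i]), i + 1

def parse(sentence):
    toks = sentence.split()
    result = parse_S(toks, 0)
    if result and result[1] == len(toks):  # all input consumed
        return result[0]

print(parse("Bill printed the file"))
```

Applying the rules forward from S until the terminals match the sentence, as described above, yields the same tree as the parse-tree slide: S over NP (PN: Bill) and VP (printed, the file).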

 Two approaches can be combined –


-bottom-up parsing with top-down filtering
 Finding One Interpretation or Finding Many
1. All paths
2. Best Path with Backtracking
3. Best Path with Patchup
4. Wait and see
 Example:
 Have the students who missed the exam

 Two paths
1. Have the students who missed the exam take it today
2. Have the students who missed the exam taken it
today?
 Parser
Chart Parsers
 Provides a way of avoiding backup by storing
intermediate constituents

Definite Clause grammars


 Grammar rules are written as PROLOG clauses and
PROLOG interpreter is used to perform top-down and
depth-first parsing
 Augmented Transition Networks
 Parsing process is described as the transition from a
start state to a final state in a transition network
Augmented Transition Network
(ATN)

 An ATN is a top-down parsing procedure that allows


various kinds of knowledge to be incorporated into
the parsing system.
 It is similar to a finite state machine.
 A class of labels can be attached to the arcs
that define transitions between states
 Arcs may be labelled with an arbitrary combination of:
Contd…

 Specific word such as “in”

 Word categories such as “noun”

 Pushes to other networks that recognize significant


components of a sentence

 Procedures that perform arbitrary tests on both the


current input and on sentence components that have
already been identified

 Procedures that build structures that will form part of


the final parse
Example:
The long file has printed
The execution proceeds as follows
1. Begin in state S
2. Push to NP
3. Do a category test to see if “the” is a determiner
4. This test succeeds, so set the DETERMINER register
to DEFINITE and go to state Q6
5. Do a category test to see if “long” is an adjective
6. This test succeeds, so append “long” to the list
contained in the ADJS register. Stay in Q6
7. Do a category test to see if “file” is an adjective. This
test fails
8. Do a category test to see if “file” is a noun. This test
succeeds, so set the NOUN register to “file” and go
to state Q7
9. Push to PP
10. Do a category test to see if “has” is a preposition.
This test fails, so pop and signal failure
11. There is nothing else that can be done from state Q7.
so pop and return the structure
(NP(FILE(LONG) DEFINITE))

The return causes the machine to be in state Q1, with the


SUBJ register set to the structure just returned and the TYPE
register set to DCL
12. Do a category test to see if “has” is a verb. This test
succeeds, so set the AUX register to NIL and set the
V register to “has”, Go to state Q4.
13. Push to state NP. Since the next word “printed” is
not a determiner or a proper noun, NP will pop and
return failure
14. The only other thing to do in state Q4 is to halt. But more
input remains, so a complete parse has not been found.
Backtracking is now required
15. The last choice point was at state Q1, so return there.
The registers AUX and V must be unset
16. Do a category test to see if “has” is an auxiliary. This
test succeeds, so set the AUX register to “has” and
go to state Q3
17. Do a category test to see if “printed” is a verb. This
test succeeds, so set the V register to “printed”. Go
to state Q4
18. Since the input is exhausted, Q4 is an acceptable
final state. Pop and return the structure.

(S DCL (NP (FILE (LONG) DEFINITE))


HAS
(VP PRINTED))
This structure is the output of the parse.
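The NP network traced in steps 1-8 can be sketched as a small procedure that runs category tests on each word and sets registers, in the ATN style. The lexicon and state handling are much reduced from a real ATN (no push/pop to subnetworks, no backtracking):

```python
# A simplified rendition of the NP network traced above: category tests
# on each word set the DETERMINER, ADJS, and NOUN registers, mirroring
# steps 3-8 of the trace. The tiny lexicon is an assumption for the demo.

DETS = {"the": "DEFINITE", "a": "INDEFINITE"}
ADJ_WORDS = {"long", "short", "fast"}
NOUNS = {"file", "printer"}

def parse_np(words):
    """Return (NP structure, remaining words), or (None, words) on failure."""
    regs = {"DETERMINER": None, "ADJS": [], "NOUN": None}
    i = 0
    if i < len(words) and words[i] in DETS:        # category test: determiner
        regs["DETERMINER"] = DETS[words[i]]; i += 1  # -> state Q6
    while i < len(words) and words[i] in ADJ_WORDS:  # adjectives: stay in Q6
        regs["ADJS"].append(words[i]); i += 1
    if i < len(words) and words[i] in NOUNS:       # noun test: Q6 -> Q7
        regs["NOUN"] = words[i]; i += 1
        structure = ("NP", regs["NOUN"], tuple(regs["ADJS"]), regs["DETERMINER"])
        return structure, words[i:]                # pop and return structure
    return None, words                             # pop and signal failure

struct, rest = parse_np(["the", "long", "file", "has", "printed"])
print(struct)  # ('NP', 'file', ('long',), 'DEFINITE')
print(rest)    # ['has', 'printed']
```

The returned structure corresponds to the (NP (FILE (LONG) DEFINITE)) popped in step 11 of the trace, with "has printed" left for the S network to consume.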
Semantic Analysis

 Lexical Processing
 Look up the individual words in a dictionary and
extract their meaning

 Eg: Diamond
 A geometrical shape
 A baseball field
 A valuable gemstone
 The process of determining the correct meaning of an
individual word is called word sense disambiguation
or lexical disambiguation.

 Properties of word senses are called semantic


markers
 Physical Object
 Animate Object
 Abstract Object
Sentence Level Parsing

 Semantic grammars
 Case grammars
 Conceptual Parsing
 Compositional semantic interpretation / Montague
Analysis
Semantic Grammars

 Combines syntactic, semantic and pragmatic


knowledge into a single set of rules in the form of a
grammar
 The result of parsing and applying all associated
semantic actions is the meaning of the sentence
 Advantages
 Once the parse is complete, the result can be used
immediately
 Syntactic issues that do not affect the semantics can
be ignored.
 Disadvantages
 Number of rules required can become very large
 Parsing process may be expensive
Case Grammars

 Provides a different approach to the problem of how


syntactic and semantic interpretation can be
combined
Conceptual Parsing

 It is a strategy for finding both the structure and


meaning of a sentence in one step.
 This is driven by a dictionary that describes the
meaning of words as conceptual dependency (CD).
 Conceptual Dependency originally developed to represent
knowledge acquired from natural language input.
The goals of this theory are:
 To help in the drawing of inference from sentences.
 To be independent of the words used in the original input.
 For any 2 (or more) sentences that are identical in meaning
there should be only one representation of that meaning.
CD provides:
 a structure into which nodes representing
information can be placed
 a specific set of primitives
 at a given level of granularity.
 Sentences are represented as a series of diagrams depicting actions
using both abstract and real physical situations.
 The agent and the objects are represented
 The actions are built up from a set of primitive acts

Examples of Primitive Acts are:


 ATRANS-- Transfer of an abstract relationship. e.g. give.
 PTRANS-- Transfer of the physical location of an object. e.g. go.
 PROPEL-- Application of a physical force to an object. e.g. push.
 MTRANS-- Transfer of mental information. e.g. tell.
 MBUILD-- Construct new information from old. e.g. decide.
 SPEAK-- Utter a sound. e.g. say.
 ATTEND-- Focus a sense on a stimulus. e.g. listen,
watch.
 MOVE-- Movement of a body part by owner. e.g.
punch, kick.
 GRASP-- Actor grasping an object. e.g. clutch.
 INGEST-- Actor ingesting an object. e.g. eat.
 EXPEL-- Actor getting rid of an object from body.
 Six primitive conceptual categories provide building
blocks which are the set of allowable dependencies in the
concepts in a sentence:
 PP-- Real world objects.
 ACT-- Real world actions.
 PA-- Attributes of objects.
 AA-- Attributes of actions.
 T-- Times.
 LOC-- Locations.
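As a concrete illustration, a CD structure can be written as a slot-filler record. Here "John gave Mary a book" is rendered as an ATRANS act; the slot names and example sentence are illustrative choices, not Schank's exact notation:

```python
# A sketch of a conceptual-dependency structure: "John gave Mary a book"
# as an ATRANS act (transfer of an abstract relationship) with actor,
# object, source, and recipient slots.

cd = {
    "ACT": "ATRANS",     # primitive act: transfer of possession (give)
    "ACTOR": "John",     # PP: the agent performing the act
    "OBJECT": "book",    # PP: the thing whose possession is transferred
    "FROM": "John",      # source of the possession relation
    "TO": "Mary",        # recipient
    "TIME": "past",      # T: tense marker
}

# "Mary received a book from John" maps to this same structure,
# illustrating the goal that sentences identical in meaning share
# one representation independent of the words used.
print(cd["ACT"], cd["ACTOR"], "->", cd["TO"])
```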
