NLP Slides 1-55
Introduction to NLP 2
Scope…
Current Limitations and Potential Applications:
• Existing Language Technology (LT) systems are not on par with human abilities but offer various promising
applications.
Objective: Knowledgeable Language Software:
• The goal is to develop software products equipped with a degree of knowledge in human language.
Impact on Human-Machine Interaction:
• Improved human-machine interaction is urgently needed: the primary barrier is communication,
and incorporating human language can enhance software acceptance and user productivity.
Applications
Friendly technology should listen and speak
Machines can also help people communicate with each other
• One of the original aims of language technology has always been fully automatic translation between human
languages.
• Research is still far from achieving the ambitious goal of translating unrestricted texts.
• Nevertheless, researchers have been able to create software systems that simplify the work of human translators and clearly
improve their productivity.
• Less than perfect automatic translations can also be of great help to information seekers who have to search through
large amounts of texts in foreign languages.
• The most serious bottleneck for e-commerce is the volume of communication between business and
customers or among businesses.
• Language technology can help to sort, filter and route incoming email.
• It can also assist the customer relationship agent to look up information and to compose a response.
• In cases where questions have been answered before, language technology can find appropriate earlier replies and
automatically respond.
Language is the fabric of the web
• Although the new media combine text, graphics, sound and movies, the whole world of multimedia
information can only be structured, indexed and navigated through language.
• For browsing, navigating, filtering and processing the information on the web, we need software that can get at the
contents of documents.
• Language technology for content management is a necessary precondition for turning the wealth of digital
information into collective knowledge.
• The increasing multilinguality of the web constitutes an additional challenge for language technology.
• The global web can only be mastered with the help of multilingual tools for indexing and navigating.
• Systems for crosslingual information and knowledge management will surmount language barriers for e-commerce,
education and international cooperation.
Technologies
• Speech recognition
• Spoken language is recognized and transformed
into text as in dictation systems, into commands as
in robot control systems, or into some other
internal representation.
• Speech synthesis
• Utterances in spoken language are produced from
text (text-to-speech systems) or from internal
representations of words or sentences
(concept-to-speech systems).
• Text categorization
• This technology assigns texts to categories. Texts
may belong to more than one category, categories
may contain other categories. Filtering is a special
case of categorization with just two categories.
• Text Summarization
• The most relevant portions of a text are extracted
as a summary. The task depends on the needed
lengths of the summaries. Summarization is harder
if the summary has to be specific to a certain
query.
• Text Indexing
• As a precondition for document retrieval, texts are stored
in an indexed database. Usually a text is indexed for all
word forms or – after lemmatization – for all lemmas.
Sometimes indexing is combined with categorization and
summarization.
• Text Retrieval
• The texts that best match a given query or document are
retrieved from a database. The candidate documents are
ordered with respect to their expected relevance.
Indexing, categorization, summarization and retrieval are
often subsumed under the term information retrieval.
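The indexing and retrieval steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular system: the toy documents, the helper names (tokenize, build_index, retrieve) and the scoring rule (number of query-term occurrences) are all invented for the example, and texts are indexed by word form without lemmatization.

```python
from collections import Counter, defaultdict

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def build_index(docs):
    """Inverted index: map each word form to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def retrieve(query, docs, index):
    """Rank candidate documents by how often the query terms occur in them."""
    query_terms = set(tokenize(query))
    candidates = set()
    for term in query_terms:
        candidates |= index.get(term, set())
    scores = {}
    for doc_id in candidates:
        counts = Counter(tokenize(docs[doc_id]))
        scores[doc_id] = sum(counts[t] for t in query_terms)
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy collection.
docs = {
    "d1": "Speech recognition turns spoken language into text.",
    "d2": "Machine translation translates text between languages.",
    "d3": "Text retrieval ranks text documents for a query.",
}
index = build_index(docs)
ranking = retrieve("text retrieval", docs, index)
print(ranking)
```

Real retrieval systems usually weight terms (for example by how rare they are) instead of using raw counts; the raw count is used here only to keep the sketch short.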
• Information Extraction
• Relevant pieces of information are discovered
and marked for extraction. The extracted pieces can be:
the topic, named entities such as company, place or
person names, simple relations such as prices,
destinations, functions etc. or complex relations
describing accidents, company mergers or football
matches.
• Data Fusion and Text Data Mining
• Extracted pieces of information from several sources are
combined in one database. Previously undetected
relationships may be discovered.
• Question Answering
• Natural language queries are used to access
information in a database. The database may be a
base of structured data or a repository of digital
texts in which certain parts have been marked as
potential answers.
• Report Generation
• A report in natural language is produced that
describes the essential contents or changes of a
database. The report can contain accumulated
numbers, maxima, minima and the most drastic
changes.
• Spoken Dialogue Systems
• The system can carry out a dialogue with a human
user in which the user can solicit information or
conduct purchases, reservations or other
transactions.
• Translation Technologies
• Technologies that translate texts or assist human
translators. Automatic translation is called machine
translation. Translation memories use large
amounts of texts together with existing
translations for efficient look-up of possible
translations for words, phrases and sentences.
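The translation-memory idea above (look up existing translations of similar text) can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the three stored sentence pairs, the function name look_up, and the use of the standard-library difflib.get_close_matches with a similarity cutoff of 0.6 as the fuzzy-matching step. Production translation memories use much larger stores and more refined matching.

```python
import difflib

# Hypothetical toy memory: source sentence -> stored human translation.
MEMORY = {
    "Where is the train station?": "Wo ist der Bahnhof?",
    "How much does this cost?": "Wie viel kostet das?",
    "Thank you very much.": "Vielen Dank.",
}

def look_up(sentence, memory=MEMORY, cutoff=0.6):
    """Return (matched source, stored translation), or None if nothing is close."""
    if sentence in memory:
        # Exact hit: the stored translation can be reused as is.
        return sentence, memory[sentence]
    # Fuzzy hit: offer the closest stored sentence to the translator.
    close = difflib.get_close_matches(sentence, list(memory), n=1, cutoff=cutoff)
    if not close:
        return None
    return close[0], memory[close[0]]

exact = look_up("Thank you very much.")
fuzzy = look_up("Where is the bus station?")
missing = look_up("Goodbye!")
print(exact, fuzzy, missing)
```

A fuzzy hit is only a suggestion: the stored translation of the train-station sentence still has to be adapted by the human translator.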
Methods and Resources
• The methods of language technology come from several disciplines:
• computer science,
• computational and theoretical linguistics,
• mathematics,
• electrical engineering and
• psychology.
• Generic CS Methods
• Programming languages, algorithms for generic data types, and software
engineering methods for structuring and organizing software development and
quality assurance.
• Specialized Algorithms
• Dedicated algorithms have been designed for parsing, generation and translation,
for morphological and syntactic processing with finite state automata/transducers
and many other tasks.
• Nondiscrete Mathematical Methods
• Statistical techniques have become especially successful in speech processing,
information retrieval, and the automatic acquisition of language models. Other
methods in this class are neural networks and powerful techniques for optimization
and search.
• Logical and Linguistic Formalisms
• For deep linguistic processing, constraint based grammar formalisms are
employed. Complex formalisms have been developed for the representation
of semantic content and knowledge.
• Linguistic Knowledge
• Linguistic knowledge resources for many languages are utilized: dictionaries,
morphological and syntactic grammars, rules for semantic interpretation,
pronunciation and intonation.
• Corpora and Corpus Tools
• Large application-specific or generic collections of spoken and
written language are exploited for the acquisition and testing of statistical or
rule-based language models.
Introduction to NLP
From: Chapter 1 of Speech and Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition, by Daniel
Jurafsky and James H. Martin
http://www.cs.colorado.edu/~martin/SLP/slp-ch1.pdf
Chapter 1. Introduction to NLP
Background
• NLP has its roots in the 1950s when researchers first explored
machine translation and language understanding.
• Early efforts, such as the Georgetown-IBM experiment in 1954, laid
the foundation for machine translation systems.
• The 1960s and 1970s saw the development of rule-based systems
and linguistic theories to model language structure.
• The HAL 9000 computer in Stanley Kubrick’s film 2001: A Space Odyssey
(released in 1968)
• HAL is an artificial agent capable of such advanced language processing behavior as
speaking and understanding English, and at a crucial moment in the plot, even
reading lips.
• The language-related parts of HAL
• Speech recognition
• Natural language understanding (and, of course, lip-reading),
• Natural language generation
• Speech synthesis
• Information retrieval
• Information extraction and
• Inference
• In the late 20th century, statistical methods gained prominence in
NLP.
• Corpus-based approaches and statistical models, like Hidden Markov
Models (HMMs) and, later on, Conditional Random Fields (CRFs),
improved performance on language processing tasks.
• This shift marked a departure from rule-based systems, allowing
algorithms to learn patterns from large datasets.
• The 21st century witnessed a paradigm shift with the integration of
machine learning techniques, especially deep learning, in NLP.
• Recurrent Neural Networks (RNNs) and later, Transformer models, like
BERT and GPT, demonstrated significant advancements in language
understanding and generation.
• Pre-trained language models became a cornerstone, enabling transfer
learning for various NLP tasks.
• Today:
• NLP is pervasive in our daily lives, powering applications such as
virtual assistants, chatbots, and sentiment analysis tools.
• Social media platforms use NLP for content recommendation, spam
detection, and language understanding.
• Healthcare, finance, and customer service industries leverage NLP for
information extraction, summarization, and automated response
systems.
1.1 Knowledge in Speech and Language Processing
• Semantics:
• "Bachelor" - unmarried man.
• Understanding the difference between "house" and "home."
• Pragmatics:
• "Can you pass the salt?" (in a meal context).
• Polite, indirect refusal: "No, I’m sorry, I’m afraid I can’t."
• Blunt refusal: "No, I won’t open the door." or "I won’t."
• Discourse:
• A conversation where one introduces a topic, and another responds with
additional information or an opinion.
1.2 Ambiguity
• We say some input is ambiguous
• if there are multiple alternative linguistic structures
that can be built for it.
• The spoken sentence, I made her duck, has five
different meanings.
• (1) I cooked waterfowl for her.
• (2) I cooked waterfowl belonging to her.
• (3) I created the (plaster?) duck she owns.
• (4) I caused her to quickly lower her head or body.
• (5) I waved my magic wand and turned her into
undifferentiated waterfowl.
• These different meanings are caused by a number of ambiguities.
• Duck can be a verb or a noun, while her can be a dative pronoun or a
possessive pronoun.
• The word make can mean create or cook.
• The verb make is also syntactically ambiguous in that it can be transitive, or
it can be ditransitive.
• Finally, make can take a direct object and a verb, meaning that the object
(her) got caused to perform the verbal action (duck).
• In a spoken sentence, there is an even deeper kind of ambiguity; the first
word could have been eye or the second word maid.
1.2 Ambiguity Types
• Lexical Ambiguity: Multiple Meanings of a Word
I heard a bark
• "Bark" can mean the sound a dog makes or the outer covering of a tree.
• Syntactic Ambiguity: Multiple Meanings of a sentence
I saw the man with the telescope
• It's unclear whether the speaker used the telescope or if the man being
observed had a telescope.
• Semantic Ambiguity:
The bank is by the river
The play was performed by the students
• Ambiguity arises from the word ‘by’, which marks a location in the first sentence and the agent in the second.
• Structural Ambiguity:
Flying planes can be dangerous
• Is it the planes themselves that are flying, or is it the act of flying planes that is
dangerous?
• Referential Ambiguity:
He told his brother he would come
• It's unclear who the second "he" refers to: it could be the speaker, his brother, or someone else.
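• Pragmatic Ambiguity:
That was an interesting choice
• Depending on context and tone, this can be sincere praise or polite criticism.
• Morphological Ambiguity:
impatient, important
• The initial "im-" is a negating prefix in "impatient" but not in "important".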
1.3 Models and Algorithms
• The most important models:
• state machines,
• formal rule systems,
• logic,
• probability theory and
• other machine learning tools
• The most important algorithms for these models:
• state space search algorithms and
• dynamic programming algorithms
• State Machine Model:
• A state machine represents a system's behavior through a set of states, transitions,
and actions.
• In NLP, state machines are used to model finite processes, such as dialogues or
language recognition, where the system moves from one state to another based on
input.
• Rule System Model:
• Rule-based systems use predefined rules to analyze and generate linguistic
structures.
• These rules define patterns, conditions, and actions that guide the system's
behavior.
• State machines and rule systems are the main tools used when dealing with
knowledge of phonology, morphology, and syntax.
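The state machine model above (states, transitions, and moving from state to state based on input) can be made concrete with a small finite-state acceptor. The sketch below is only an illustration: the toy language (a "b", then two or more "a"s, then "!") and all names in the code are assumptions chosen for the example, and actions attached to transitions are left out.

```python
# Transition table: (current state, input symbol) -> next state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # loop: any number of further "a"s
    (3, "!"): 4,
}
START_STATE = 0
ACCEPT_STATES = {4}

def accepts(text):
    """Return True if the machine ends in an accepting state after reading text."""
    state = START_STATE
    for symbol in text:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False  # no transition defined for this input: reject
        state = TRANSITIONS[key]
    return state in ACCEPT_STATES

for candidate in ["baa!", "baaaa!", "ba!", "baa"]:
    print(candidate, accepts(candidate))
```

The same language could be written as the regular expression /baa+!/; the table form makes the states and transitions explicit.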
• Logic Model:
• Logic-based models use formal logic to represent and reason about linguistic
structures.
• Probabilistic Model:
• Probabilistic models assign probabilities to different linguistic structures or
interpretations. These models are trained on data and leverage probability theory to
make predictions.
• State-Vector Model:
• State-vector models represent the state of a system using vectors in a
high-dimensional space.
• In NLP, this can be applied to represent the meaning or context of words, phrases, or
sentences.
• Logic-based representations have traditionally been the tool of choice
when dealing with knowledge of semantics, pragmatics, and discourse.
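As an illustration of the probabilistic model described above, the following sketch estimates bigram probabilities from a tiny corpus by relative frequency. It is a minimal example under stated assumptions: the three training sentences, the sentence-boundary markers <s> and </s>, and the function names are invented for the illustration, and no smoothing is applied.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count histories and bigrams over lowercased, whitespace-split sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        history_counts.update(tokens[:-1])            # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return history_counts, bigram_counts

def bigram_prob(prev, word, history_counts, bigram_counts):
    """Relative-frequency estimate of P(word | prev); 0.0 for an unseen history."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]

# Hypothetical toy training data.
corpus = ["I made her duck", "I made dinner", "I saw her duck"]
histories, bigrams = train_bigrams(corpus)

p_made = bigram_prob("i", "made", histories, bigrams)    # 2 of 3 sentences
p_duck = bigram_prob("her", "duck", histories, bigrams)  # 2 of 2 occurrences
print(p_made, p_duck)
```

Such estimates let a system prefer the more probable of several competing interpretations, which is how probabilistic models help with ambiguity.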
• State Space Search Algorithm:
• State space search is a problem-solving paradigm that involves navigating
through a set of states to find a goal state.
• It is commonly used in artificial intelligence and optimization problems.
• The process involves defining the initial state, possible actions, transition
rules, and a goal condition.
• Dynamic Programming:
• Dynamic programming is a technique for solving optimization problems by
breaking them down into simpler sub-problems and solving each sub-problem
only once, storing the results in a table to avoid redundant computations.
• It is particularly useful for problems with overlapping sub-problems and
optimal substructure.
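A standard NLP instance of dynamic programming is minimum edit distance between two strings. The sketch below is one possible implementation, offered as an illustration only: the function name, the default unit costs, and the sub_cost parameter are choices made for this example. Each table cell is computed once from three neighbouring cells, which is exactly the "solve each sub-problem once and store it" idea described above.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum cost of turning source into target with insert/delete (cost 1)
    and substitute (cost sub_cost) operations."""
    n, m = len(source), len(target)
    # table[i][j] = cost of turning source[:i] into target[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i  # delete all i characters
    for j in range(1, m + 1):
        table[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            table[i][j] = min(
                table[i - 1][j] + 1,                            # deletion
                table[i][j - 1] + 1,                            # insertion
                table[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitution
            )
    return table[n][m]

print(min_edit_distance("kitten", "sitting"))
print(min_edit_distance("intention", "execution"))
print(min_edit_distance("intention", "execution", sub_cost=2))
```

The table has (n + 1) × (m + 1) cells, so the running time is proportional to the product of the two string lengths, far less than enumerating every possible edit sequence.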
1.4 Language, Thought, and Understanding
1.7 Summary
• A good way to understand the concerns of speech and language
processing research is to consider what it would take to create an
intelligent agent like HAL from 2001: A Space Odyssey.
• Speech and language technology relies on formal models, or
representations, of knowledge of language at the levels of phonology
and phonetics, morphology, syntax, semantics, pragmatics and
discourse.
• A small number of formal models including state machines, formal rule
systems, logic, and probability theory are used to capture this knowledge.
• The foundations of speech and language technology lie in computer
science, linguistics, mathematics, electrical engineering and
psychology.
• The critical connection between language and thought has placed
speech and language processing technology at the center of debate
over intelligent machines.
• Revolutionary applications of speech and language processing are
currently in use around the world.
• Recent advances in speech recognition and the creation of the World-Wide
Web will lead to many more applications.
In-Class Assignment
1. Define the ambiguities caused by each of the linguistic terms:
Phonetics and Phonology, Morphology, Syntax, Semantics,
Pragmatics and Discourse
2. Provide five examples of each ambiguity.
• Guidelines:
• Write in your notebook and include your name and register number.
• Scan and upload to Google Classroom.
• Hand in by the due time.
• Note: No two assignments should be identical.
Regular Expressions and Automata
From: Chapter 2 of Speech and Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition, by Daniel
Jurafsky and James H. Martin
Regular Expressions
• An RE search function searches through a corpus and returns all texts
that contain the pattern.
• In a Web search engine, the texts might be entire documents or Web pages.
• In a word processor, they might be individual words or lines of a document.
• In an operating system, they might be lines of text files.
• E.g., the UNIX grep command
• grep: Global Regular Expression Print
grep [options] pattern [file(s)]
• Online tool: https://regexr.com/ (will be used here)
• Disjunction
/cat|dog/
• Precedence
/pupp(y|ies)/
• Example: finding the word “the”
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
• “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”
/\$[0-9]+/
/\$[0-9]+\.[0-9][0-9]/
/(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
/\b(Win95|Win98|WinNT|Windows *(NT|95|98|2000)?)\b/
/\b(Mac|Macintosh|Apple)\b/
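The patterns above can be tried out with Python's standard re module. The following is a hedged adaptation, not the slides' own code: the variable names and the second test sentence are assumptions, the dollar sign is written as \$ because an unescaped $ is an end-of-line anchor in Python, the price is preceded by (^|\W) because \b cannot match between a space and a dollar sign, [0-9]+ is used so that multi-digit sizes such as 32 are matched, and non-capturing groups (?:...) are used so that findall returns whole matches.

```python
import re

text = "any PC with more than 500 MHz and 32 Gb of disk space for less than $1000"

# Price: a "$" at the start of the text or after a non-word character,
# digits, and optional cents. The price itself is the one capturing group.
price = re.compile(r"(?:^|\W)(\$[0-9]+(?:\.[0-9][0-9])?)\b")
# Processor speed: digits, optional spaces, then a unit.
speed = re.compile(r"\b[0-9]+ *(?:MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
# Disk space: digits with an optional decimal part, then a unit.
disk = re.compile(r"\b[0-9]+(?:\.[0-9]+)? *(?:Gb|[Gg]igabytes?)\b")
# The word "the" on its own, in either capitalization.
the_word = re.compile(r"\b[tT]he\b")

prices = price.findall(text)   # returns the contents of the capturing group
speeds = speed.findall(text)   # no capturing groups: returns whole matches
disks = disk.findall(text)
articles = the_word.findall("The other day the theology student left there.")

print(prices, speeds, disks, articles)
```

Note that "other", "theology" and "there" are skipped because \b requires a word boundary on both sides of "the".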
Advanced Operators
s/regexp1/regexp2/
• E.g. the 35 boxes → the <35> boxes
s/([0-9]+)/<\1>/
• The following pattern matches “The bigger they were, the bigger they
will be”, not “The bigger they were, the faster they will be”
/[tT]he (.*)er they were, the \1er they will be/
• The following pattern matches “The bigger they were, the bigger they
were”, not “The bigger they were, the bigger they will be”
/[tT]he (.*)er they (.*), the \1er they \2/
• These numbered memories (\1, \2, ...) are called registers.
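Substitution and numbered memories can be checked with Python's re module, where s/regexp1/regexp2/ corresponds to re.sub and \1, \2 refer back to parenthesized groups. This is a small sketch under stated assumptions: the variable names are invented, and the first word of each pattern is written as [tT]he so that the capitalized example sentences match.

```python
import re

# s/([0-9]+)/<\1>/ : wrap every run of digits in angle brackets.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

# One memory: the same adjective stem must appear twice.
one_register = re.compile(r"[tT]he (.*)er they were, the \1er they will be")
same_stem = one_register.search("The bigger they were, the bigger they will be")
other_stem = one_register.search("The bigger they were, the faster they will be")

# Two memories: both the adjective stem and the verb phrase must repeat.
two_registers = re.compile(r"[tT]he (.*)er they (.*), the \1er they \2")
both_repeat = two_registers.search("The bigger they were, the bigger they were")
verb_differs = two_registers.search("The bigger they were, the bigger they will be")

print(bracketed)
print(same_stem is not None, other_stem is not None)
print(both_repeat.groups(), verb_differs is not None)
```

A backreference matches the exact text captured earlier, not the pattern that captured it, which is why "faster" fails to match after "bigg" was stored in \1.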