NLP Slides 1-55


Introduction to NLP

Dr. Mohd Rafi Lone


Scope
• Language technologies are information
technologies that are specialized for dealing with
the most complex information medium in our
world: human language (Human Language
Technology).

Introduction to NLP 2
Scope…
Current Limitations and Potential Applications:
• Existing Language Technology (LT) systems are not on par with human abilities but offer various promising
applications.
Objective: Knowledgeable Language Software:
• The goal is to develop software products equipped with a degree of knowledge in human language.
Impact on Human-Machine Interaction:
• Urgent need for improved human-machine interaction as the primary barrier lies in communication issues,
and incorporating human language can enhance software acceptance and user productivity.

Applications
Friendly technology should listen and speak

• Applications of natural language interfaces


• Database queries, information retrieval from texts, so-called expert systems,
and robot control.
• Spoken language needs to be combined with other modes of
communication such as pointing with mouse or finger.
• If such multimodal communication is finally embedded in an effective general
model of cooperation, we have succeeded in turning the machine into a
partner.
• The ultimate goal of research is the omnipresent access to all kinds of
technology and to the global information structure by natural interaction.

Applications
Machines can also help people communicate with each other

• One of the original aims of language technology has always been fully automatic translation between human
languages.
• Research is still far from achieving the ambitious goal of translating unrestricted texts.
• Nevertheless, researchers have been able to create software systems that simplify the work of human translators and clearly
improve their productivity.
• Less than perfect automatic translations can also be of great help to information seekers who have to search through
large amounts of texts in foreign languages.
• The most serious bottleneck for e-commerce is the volume of communication between business and
customers or among businesses.
• Language technology can help to sort, filter and route incoming email.
• It can also assist the customer relationship agent to look up information and to compose a response.
• In cases where questions have been answered before, language technology can find appropriate earlier replies and
automatically respond.

Applications
Language is the fabric of the web

• Although the new media combine text, graphics, sound and movies, the whole world of multimedia
information can only be structured, indexed and navigated through language.
• For browsing, navigating, filtering and processing the information on the web, we need software that can get at the
contents of documents.
• Language technology for content management is a necessary precondition for turning the wealth of digital
information into collective knowledge.
• The increasing multilinguality of the web constitutes an additional challenge for language technology.
• The global web can only be mastered with the help of multilingual tools for indexing and navigating.
• Systems for crosslingual information and knowledge management will surmount language barriers for e-commerce,
education and international cooperation.

Technologies
• Speech recognition
• Spoken language is recognized and transformed
into text as in dictation systems, into commands as
in robot control systems, or into some other
internal representation.
• Speech synthesis
• Utterances in spoken language are produced from
text (text-to-speech systems) or from internal
representations of words or sentences
(concept-to-speech systems)

Technologies
• Text categorization
• This technology assigns texts to categories. Texts
may belong to more than one category, categories
may contain other categories. Filtering is a special
case of categorization with just two categories.
• Text Summarization
• The most relevant portions of a text are extracted
as a summary. The task depends on the needed
lengths of the summaries. Summarization is harder
if the summary has to be specific to a certain
query.

Technologies
• Text Indexing
• As a precondition for document retrieval, texts are stored
in an indexed database. Usually a text is indexed for all
word forms or – after lemmatization – for all lemmas.
Sometimes indexing is combined with categorization and
summarization.
• Text Retrieval
• The texts that best match a given query or document are
retrieved from a database. The candidate documents are
ordered with respect to their expected relevance.
Indexing, categorization, summarization and retrieval are
often subsumed under the term information retrieval.

Technologies
• Information Extraction
• Relevant pieces of information are discovered
and marked for extraction. The extracted pieces can be:
the topic, named entities such as company, place or
person names, simple relations such as prices,
destinations, functions etc. or complex relations
describing accidents, company mergers or football
matches.
• Data Fusion and Text Data Mining
• Extracted pieces of information from several sources are
combined in one database. Previously undetected
relationships may be discovered.

Technologies
• Question Answering
• Natural language queries are used to access
information in a database. The database may be a
base of structured data or a repository of digital
texts in which certain parts have been marked as
potential answers.
• Report Generation
• A report in natural language is produced that
describes the essential contents or changes of a
database. The report can contain accumulated
numbers, maxima, minima and the most drastic
changes.

Technologies
• Spoken Dialogue Systems
• The system can carry out a dialogue with a human
user in which the user can solicit information or
conduct purchases, reservations or other
transactions.
• Translation Technologies
• Technologies that translate texts or assist human
translators. Automatic translation is called machine
translation. Translation memories use large
amounts of texts together with existing
translations for efficient look-up of possible
translations for words, phrases and sentences.

Methods and Resources
• The methods of language technology come from several disciplines:
• computer science,
• computational and theoretical linguistics,
• mathematics,
• electrical engineering and
• psychology.

Methods and Resources
• Generic CS Methods
• Programming languages, algorithms for generic data types, and software
engineering methods for structuring and organizing software development and
quality assurance.
• Specialized Algorithms
• Dedicated algorithms have been designed for parsing, generation and translation,
for morphological and syntactic processing with finite state automata/transducers
and many other tasks.
• Nondiscrete Mathematical Methods
• Statistical techniques have become especially successful in speech processing,
information retrieval, and the automatic acquisition of language models. Other
methods in this class are neural networks and powerful techniques for optimization
and search.
Methods and Resources
• Logical and Linguistic Formalisms
• For deep linguistic processing, constraint based grammar formalisms are
employed. Complex formalisms have been developed for the representation
of semantic content and knowledge.
• Linguistic Knowledge
• Linguistic knowledge resources for many languages are utilized: dictionaries,
morphological and syntactic grammars, rules for semantic interpretation,
pronunciation and intonation.
• Corpora and Corpus Tools
• Large application-specific or generic collections of spoken and
written language are exploited for the acquisition and testing of statistical or
rule-based language models.

Introduction to NLP
From: Chapter 1 of An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition, By Daniel Jurafsky
and James H. Martin
http://www.cs.colorado.edu/~martin/SLP/slp-ch1.pdf
Chapter 1. Introduction to NLP
Background
• NLP has its roots in the 1950s when researchers first explored
machine translation and language understanding.
• Early efforts, such as the Georgetown-IBM experiment in 1954, laid
the foundation for machine translation systems.
• The 1960s and 1970s saw the development of rule-based systems
and linguistic theories to model language structure.

Background
• The HAL 9000 computer in Stanley Kubrick’s film 2001: A Space Odyssey
(released in 1968)
• HAL is an artificial agent capable of such advanced language processing behavior as
speaking and understanding English, and at a crucial moment in the plot, even
reading lips.
• The language-related parts of HAL
• Speech recognition
• Natural language understanding (and, of course, lip-reading),
• Natural language generation
• Speech synthesis
• Information retrieval
• information extraction and
• inference

Background
• In the late 20th century, statistical methods gained prominence in
NLP.
• Corpora-based approaches and statistical models, like Hidden Markov
Models (HMMs) and later on, Conditional Random Fields (CRFs),
improved language processing tasks.
• This shift marked a departure from rule-based systems, allowing
algorithms to learn patterns from large datasets.

Background
• The 21st century witnessed a paradigm shift with the integration of
machine learning techniques, especially deep learning, in NLP.
• Recurrent Neural Networks (RNNs) and later, Transformer models, like
BERT and GPT, demonstrated significant advancements in language
understanding and generation.
• Pre-trained language models became a cornerstone, enabling transfer
learning for various NLP tasks.

Background
• Today:
• NLP is pervasive in our daily lives, powering applications such as
virtual assistants, chatbots, and sentiment analysis tools.
• Social media platforms use NLP for content recommendation, spam
detection, and language understanding.
• Healthcare, finance, and customer service industries leverage NLP for
information extraction, summarization, and automated response
systems.

1.1 Knowledge in Speech and Language Processing

• What distinguishes language processing applications from other data
processing systems is their use of knowledge of language.
• The Unix word count (wc) program
• When used to count the number of characters, wc is an ordinary data
processing application.
• However, when it is used to count the words in a file it requires knowledge
about what it means to be a word, and thus becomes a language processing
system.
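The difference between the two uses of wc can be sketched in a few lines of Python. This is only an illustration, not the actual wc implementation: the function names, the sample sentence, and the letter-based definition of a word are assumptions made here to show that counting words forces a decision about what a word is.

```python
import re

def count_characters(text: str) -> int:
    # Pure data processing: no knowledge of language is needed.
    return len(text)

def count_words_whitespace(text: str) -> int:
    # One definition of "word": a maximal run of non-whitespace characters.
    return len(text.split())

def count_words_letters(text: str) -> int:
    # Another (assumed) definition: a run of letters that may contain
    # internal apostrophes or hyphens, so "--" is not counted as a word.
    return len(re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text))

sample = "Don't panic -- it's only a well-known test."
print(count_characters(sample))
print(count_words_whitespace(sample))  # counts "--" as a word
print(count_words_letters(sample))     # does not count "--"
```

The two word counts differ on the same input, which is exactly the kind of linguistic decision the slide refers to.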

1.1 Knowledge in Speech and Language Processing

• What does it mean to have knowledge of a language?


• The language processor must know about the following:
• Phonetics and Phonology — knowledge about linguistic sounds
• Morphology — knowledge of the meaningful components of words
• Syntax — knowledge of the structural relationships between words
• Semantics — knowledge of meaning
• Pragmatics — knowledge of the relationship of meaning to the goals and
intentions of the speaker.
• Discourse — knowledge about linguistic units larger than a single utterance

1.1 Knowledge in Speech and Language Processing

• Phonetics and Phonology:
• Phonetics: /k/ in "cat," /æ/ in "bat." (Examines how the lips, tongue, and vocal
cords move to produce specific sounds)
• Phonology: /p/ in "pin" and /b/ in "bin." (Sounds that can change the
meaning of a word)
• Morphology:
• "Unhappiness" - "un-" (prefix), "happy" (root), "-ness" (suffix).
• Syntax:
• "The cat is on the mat" vs. "Mat on the cat the is."

1.1 Knowledge in Speech and Language Processing

• Semantics:
• "Bachelor" - unmarried man.
• Understanding the difference between "house" and "home."
• Pragmatics:
• "Can you pass the salt?" (in a meal context).
No I’m sorry, I’m afraid, I can’t.
No, I won’t open the door. I won’t.

• Discourse:
• A conversation where one introduces a topic, and another responds with
additional information or an opinion.

1.2 Ambiguity
• We say some input is ambiguous
• if there are multiple alternative linguistic structures
that can be built for it.
• The spoken sentence, I made her duck, has five
different meanings.
• (1) I cooked waterfowl for her.
• (2) I cooked waterfowl belonging to her.
• (3) I created the (plaster?) duck she owns.
• (4) I caused her to quickly lower her head or body.
• (5) I waved my magic wand and turned her into
undifferentiated waterfowl.

1.2 Ambiguity
• These different meanings are caused by a number of ambiguities.
• Duck can be a verb or a noun, while her can be a dative pronoun or a
possessive pronoun.
• The word make can mean create or cook.
• The verb make is also syntactically ambiguous in that it can be transitive, or
it can be ditransitive.
• Finally, make can take a direct object and a verb, meaning that the object
(her) got caused to perform the verbal action (duck).
• In a spoken sentence, there is an even deeper kind of ambiguity; the first
word could have been eye or the second word maid.

1.2 Ambiguity Types
• Lexical Ambiguity: Multiple Meanings of a Word
I heard a bark

• "Bark" can mean the sound a dog makes or the outer covering of a tree.
• Syntactic Ambiguity: Multiple Meanings of a Sentence
I saw the man with the telescope

• It's unclear whether the speaker used the telescope or if the man being
observed had a telescope.

1.2 Ambiguity Types
• Semantic Ambiguity:

The bank is by the river
The play was performed by the students
• Ambiguity arises from the word ‘by’
• Structural Ambiguity:
Flying planes can be dangerous
• Is it the planes themselves that are flying, or is it the act of flying planes that is
dangerous?
• Referential Ambiguity:
He told his brother he would come
• It's unclear who the second "he" refers to: it could be the speaker, the brother, or someone else.

1.2 Ambiguity Types
• Pragmatic Ambiguity:
That was an interesting choice

• Is the speaker expressing agreement or disagreement?

• Morphological Ambiguity:
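impatient vs. important
• In "impatient", "im-" is a negating prefix; in "important", the same letters are not a prefix.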

1.3 Models and Algorithms
• The most important models:
• state machines,
• formal rule systems,
• logic,
• probability theory and
• other machine learning tools
• The most important algorithms for these models:
• state space search algorithms and
• dynamic programming algorithms

1.3 Models and Algorithms
• State Machine Model:
• A state machine represents a system's behavior through a set of states, transitions,
and actions.
• In NLP, state machines are used to model finite processes, such as dialogues or
language recognition, where the system moves from one state to another based on
input.
• Rule System Model:
• Rule-based systems use predefined rules to analyze and generate linguistic
structures.
• These rules define patterns, conditions, and actions that guide the system's
behavior.
• State machines and rule systems are the main tools used when dealing with
knowledge of phonology, morphology, and syntax.
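A state machine of the kind described above can be sketched as a transition table plus a loop over the input. The example below is a hypothetical automaton, chosen here for illustration, that accepts strings such as "baa!" and "baaaa!" (a "b", at least two "a"s, then "!"); the state numbering and names are assumptions.

```python
# Transition table of a small deterministic finite-state automaton.
# Keys are (current state, input symbol); values are the next state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # loop: any number of additional a's
    (3, "!"): 4,
}
START_STATE = 0
ACCEPT_STATES = {4}

def accepts(string: str) -> bool:
    """Return True if the automaton ends in an accepting state."""
    state = START_STATE
    for symbol in string:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False  # no legal transition for this symbol: reject
        state = TRANSITIONS[key]
    return state in ACCEPT_STATES

print(accepts("baaa!"))
print(accepts("ba!"))
```

Keeping the transitions in a table separates the model (the table) from the algorithm (the recognition loop), so the same loop can run any deterministic automaton.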
1.3 Models and Algorithms
• Logic Model:
• Logic-based models use formal logic to represent and reason about linguistic
structures.
• Probabilistic Model:
• Probabilistic models assign probabilities to different linguistic structures or
interpretations. These models are trained on data and leverage probability theory to
make predictions.
• State-Vector Model:
• State-vector models represent the state of a system using vectors in a
high-dimensional space.
• In NLP, this can be applied to represent the meaning or context of words, phrases, or
sentences.
• Logic-based representations have traditionally been the tool of choice
when dealing with knowledge of semantics, pragmatics, and discourse.
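As a small illustration of the probabilistic model described on this slide, the sketch below estimates bigram probabilities from a toy token sequence by maximum likelihood. The function name, the toy sentence, and the choice of an unsmoothed bigram model are assumptions made here; real systems would use far more data and some form of smoothing.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Estimate P(w2 | w1) as count(w1 w2) / count(w1 followed by anything)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    first_word_counts = Counter(tokens[:-1])
    return {
        (w1, w2): count / first_word_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }

tokens = "the cat sat on the mat the cat slept".split()
probs = bigram_probabilities(tokens)
print(probs[("the", "cat")])  # "the" is followed by "cat" in 2 of its 3 occurrences
print(probs[("the", "mat")])
```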
1.3 Models and Algorithms
• State Space Search Algorithm:
• State space search is a problem-solving paradigm that involves navigating
through a set of states to find a goal state.
• It is commonly used in artificial intelligence and optimization problems.
• The process involves defining the initial state, possible actions, transition
rules, and a goal condition.
• Dynamic Programming:
• Dynamic programming is a technique for solving optimization problems by
breaking them down into simpler sub-problems and solving each sub-problem
only once, storing the results in a table to avoid redundant computations.
• It is particularly useful for problems with overlapping sub-problems and
optimal substructure.
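A common dynamic programming example in language processing is minimum edit distance between two strings. The sketch below assumes the Levenshtein variant, in which insertion, deletion, and substitution each cost 1; the function name and the cost choice are assumptions, and other cost schemes are equally possible.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Fill a table where table[i][j] is the cost of turning
    source[:i] into target[:j]; each cell is computed only once."""
    n, m = len(source), len(target)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i  # delete all i characters
    for j in range(1, m + 1):
        table[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                 # deletion
                table[i][j - 1] + 1,                 # insertion
                table[i - 1][j - 1] + substitution,  # substitution or match
            )
    return table[n][m]

print(min_edit_distance("kitten", "sitting"))
```

Each cell depends only on three previously filled neighbours, which is the overlapping sub-problem structure the slide mentions.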

1.4 Language, Thought, and Understanding

• The effective use of language is intertwined with our general
cognitive abilities.
• Turing Test by Alan Turing (1950)
• He suggested an empirical test, a game, in which a computer’s use of
language would form the basis for determining if it could think. If the machine
could win the game it would be judged intelligent.

1.4 Language, Thought, and Understanding

• ELIZA program (Weizenbaum, 1966)


• ELIZA was an early natural language processing system capable of
carrying on a limited form of conversation with a user.
• Consider the following session with a version of ELIZA that
imitated the responses of a Rogerian psychotherapist.
User1: You are like my father in some ways.
ELIZA1: WHAT RESEMBLANCE DO YOU SEE
User2: You are not very aggressive but I think you don’t want me to notice that.
ELIZA2: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE
User3: You don’t argue with me.
ELIZA3: WHY DO YOU THINK I DON’T ARGUE WITH YOU
User4: You are afraid of me.
ELIZA4: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU

1.4 Language, Thought, and Understanding

• ELIZA is a remarkably simple program that makes use of
pattern-matching to process the input and translate it into suitable
outputs.
• The success of this simple technique in this domain is due to the fact
that ELIZA doesn’t actually need to know anything to mimic a
Rogerian psychotherapist.
• ELIZA
• A.L.I.C.E. Artificial Intelligence Foundation
• Loebner Prize competition, held since 1991
• An event that has attempted to put various computer programs to the Turing test.
1.5 The State of the Art and the Near-term Future

• Current Applications:
• Virtual Assistants and Chat-bots
• Sentiment Analysis
• Language Translation
• Text Summarization
• Named Entity Recognition (NER)
• Text Classification
• Question Answering Systems
• Speech Recognition
• Healthcare Information Extraction

• Near-Future Applications:
• Explainable AI in NLP
• Multimodal NLP
• Contextual Understanding Improvements
• Customizable NLP Models
• Enhanced Language Generation
• Advancements in Sentiment Analysis
• Ethical and Bias Mitigation

1.7 Summary
• A good way to understand the concerns of speech and language
processing research is to consider what it would take to create an
intelligent agent like HAL from 2001: A Space Odyssey.
• Speech and language technology relies on formal models, or
representations, of knowledge of language at the levels of phonology
and phonetics, morphology, syntax, semantics, pragmatics and
discourse.
• A small number of formal models including state machines, formal rule
systems, logic, and probability theory are used to capture this knowledge.

1.7 Summary
• The foundations of speech and language technology lie in computer
science, linguistics, mathematics, electrical engineering and
psychology.
• The critical connection between language and thought has placed
speech and language processing technology at the center of debate
over intelligent machines.
• Revolutionary applications of speech and language processing are
currently in use around the world.
• Recent advances in speech recognition and the creation of the World-Wide
Web will lead to many more applications.

In-Class Assignment
1. Define the ambiguities caused by each of the linguistic terms:
Phonetics and Phonology, Morphology, Syntax, Semantics,
Pragmatics and Discourse
2. Provide five examples of each ambiguity.

• Guidelines:
• Write in your notebook and put your name and register number on it.
• Scan and upload in Google Classroom.
• Hand-in within the due time.
• Note: No two assignments should be identical.

Regular Expressions and
Automata
From: Chapter 2 of An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition, by Daniel Jurafsky
and James H. Martin
Regular Expressions

• A regular expression is a language for specifying text search
strings.
• A regular expression is a formula in a special language that is used for
specifying simple classes of strings.
• Formally, a regular expression is an algebraic notation for
characterizing a set of strings.
• Regular Expression search requires
• a pattern that we want to search for, and
• a corpus of texts to search through.

Regular Expressions and Automata 44


Regular Expressions

• A RE search function will search through the corpus returning all texts
that contain the pattern.
• In a Web search engine, they might be the entire documents or Web pages.
• In a word-processor, they might be individual words, or lines of a document.
• In an operating system, they may be patterns in text files.
• E.g., the UNIX grep command
• grep: Global Regular Expression Print
grep [options] pattern [file(s)]
• Online tool: https://regexr.com/ (will be used here)
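The pattern-plus-corpus idea behind grep can be sketched in Python with the standard re module. The function name grep and the three-line corpus below are illustrative assumptions; the real grep command has many options that are not modeled here.

```python
import re

def grep(pattern: str, lines):
    """Return the lines that contain at least one match for the pattern."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]

corpus = [
    "The cat sat on the mat.",
    "A dog barked.",
    "Cats and dogs.",
]
print(grep(r"[Cc]at", corpus))
print(grep(r"dog", corpus))
```

As with grep, the unit returned is the whole line, not just the matched substring.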



Regular Expressions
Basic Regular Expression Patterns

• The use of the brackets [] to specify a disjunction of characters.

• The use of the brackets [] plus the dash - to specify a range.
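The bracket and range notations can be tried out with Python's re module, whose syntax for these operators is the same as that shown on regexr.com. The example strings below are assumptions chosen for illustration.

```python
import re

# Disjunction of characters: either "w" or "W" in first position.
either_case = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck")

# Range with a dash: any single digit.
digits = re.findall(r"[0-9]", "plenty of 7 to 5")

# Range with a dash: any single upper-case letter.
capitals = re.findall(r"[A-Z]", "we should call it Drenched Blossoms")

print(either_case)
print(digits)
print(capitals)
```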



Regular Expressions
Basic Regular Expression Patterns

• Uses of the caret ^ for negation or just the symbol ^

• The question-mark ? marks optionality of the previous expression.

• The use of period . to specify any character
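The three operators on this slide can be checked with a short Python sketch. The example strings are assumptions chosen for illustration.

```python
import re

# Caret first inside brackets negates: first character that is NOT upper case.
first_non_capital = re.search(r"[^A-Z]", "Oyfn pripetchik").group()

# Caret elsewhere inside brackets is just the symbol ^ ("e" or "^").
literal_caret = re.findall(r"[e^]", "look up ^ now")

# Question mark: the preceding "s" is optional.
optional_s = re.findall(r"woodchucks?", "woodchuck woodchucks")

# Period: any single character between "beg" and "n".
wildcard = re.findall(r"beg.n", "begin begun beg3n")

print(first_non_capital, literal_caret, optional_s, wildcard)
```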



Regular Expressions
Disjunction, Grouping, and Precedence

• Disjunction
/cat|dog/

• Precedence
/pupp(y|ies)/

• Operator precedence hierarchy
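The effect of grouping on precedence can be seen by comparing the pattern with and without parentheses. The sketch below uses Python's re module; the non-capturing form (?:...) is used only so that findall returns whole matches, and the test strings are assumptions.

```python
import re

# Disjunction: either alternative may match.
pets = re.findall(r"cat|dog", "a cat and a dog")

# With grouping, the disjunction applies only to the suffix.
grouped = re.findall(r"pupp(?:y|ies)", "puppy puppies")

# Without grouping, sequence binds tighter than |, so the pattern
# means "puppy" OR "ies" and finds only the suffix in "puppies".
ungrouped = re.findall(r"puppy|ies", "puppies")

print(pets, grouped, ungrouped)
```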



Regular Expressions
A Simple Example

• To find the English article the

/the/

/[tT]he/

/\b[tT]he\b/

/[^a-zA-Z][tT]he[^a-zA-Z]/

/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
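The successive refinements for finding the article "the" can be compared by counting matches on one sentence. This is a sketch in Python; the test sentence is an assumption, and the last pattern is written with parentheses around each alternation so that the disjunction applies only to the boundary, not to the whole expression.

```python
import re

text = "The other day the theology student said there is the"

def count(pattern: str) -> int:
    # Number of non-overlapping matches of the pattern in the sentence.
    return len(list(re.finditer(pattern, text)))

print(count(r"the"))            # misses "The", finds "the" inside other words
print(count(r"[tT]he"))         # finds "The" too, still matches inside words
print(count(r"\b[tT]he\b"))     # whole words only
print(count(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)"))  # same idea without \b
```

Because the last pattern consumes the neighbouring non-letter characters, two articles separated by a single space would not both be found; the \b version does not have this limitation.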



Regular Expressions
A More Complex Example

• “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”

/\$[0-9]+/
/\$[0-9]+\.[0-9][0-9]/
/(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
/\b(Win95|Win98|WinNT|Windows *(NT|95|98|2000)?)\b/
/\b(Mac|Macintosh|Apple)\b/
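The patterns for prices, processor speed, and disk space can be tried on the example request itself. The sketch below uses Python's re module; the constant names are assumptions, the dollar sign is escaped because an unescaped $ means end of line, and non-capturing groups (?:...) are used so that group() returns the whole match.

```python
import re

ad = "any PC with more than 500 MHz and 32 Gb of disk space for less than $1000"

# A dollar amount with optional cents.
PRICE = r"\$[0-9]+(?:\.[0-9][0-9])?\b"
# A number followed by a speed unit.
SPEED = r"\b[0-9]+ *(?:MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b"
# A number (optionally fractional) followed by a disk-size unit.
DISK = r"\b[0-9]+(?:\.[0-9]+)? *(?:Gb|[Gg]igabytes?)\b"

price = re.search(PRICE, ad).group()
speed = re.search(SPEED, ad).group()
disk = re.search(DISK, ad).group()
print(price, speed, disk)
```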
Regular Expressions
Advanced Operators

Aliases for common sets of characters
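The usual aliases are \d (digit), \D (non-digit), \w (alphanumeric or underscore), \W (non-alphanumeric), \s (whitespace), and \S (non-whitespace). The sketch below shows them in Python's re module with example strings chosen here for illustration; note that in Python 3 these classes are Unicode-aware on str patterns unless the re.ASCII flag is given.

```python
import re

digits = re.findall(r"\d", "Party of 5")
non_digits = re.findall(r"\D", "a1b")
words = re.findall(r"\w+", "Daiyu, in Concord!")
non_word = re.findall(r"\W", "a-b")
spaces = re.findall(r"\s", "in Concord")
non_space_runs = re.findall(r"\S+", "in Concord")

print(digits, non_digits, words, non_word, spaces, non_space_runs)
```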



Regular Expressions
Advanced Operators

Regular expression operators for counting
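The counting operators are * (zero or more), + (one or more), ? (zero or one), {n} (exactly n), {n,m} (from n to m), and {n,} (at least n). A short Python sketch, with example strings chosen here for illustration:

```python
import re

star = re.findall(r"ba*!", "b! ba! baaa!")        # zero or more a's
plus = re.findall(r"ba+!", "b! ba! baaa!")        # one or more a's
exactly_three = re.fullmatch(r"a{3}", "aaa") is not None
two_to_four = re.findall(r"[0-9]{2,4}", "7 42 2024 123456")
at_least_two = re.findall(r"a{2,}", "a aa aaaa")

print(star, plus, exactly_three, two_to_four, at_least_two)
```

The {2,4} example shows that the operators are greedy: the six-digit number is split into a four-digit match followed by a two-digit match.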



Regular Expressions
Advanced Operators

Some characters that need to be backslashed
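Characters such as . * ? and $ have special meanings, so matching them literally requires a backslash. The sketch below shows this in Python, where re.escape can also add the backslashes automatically; the example strings are assumptions.

```python
import re

periods = re.findall(r"\.", "Dr. Livingston.")
stars = re.findall(r"\*", "K*A*P*L*A*N")
question = re.search(r"\?", "Why not?").group()

# re.escape builds a pattern that matches its argument literally.
literal_price = re.escape("$9.99")
found = re.search(literal_price, "only $9.99 today") is not None

# Without the backslash, "." matches any character, so "$9x99" also matches.
too_loose = re.search(r"\$9.99", "only $9x99 today") is not None

print(periods, stars, question, found, too_loose)
```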



Regular Expressions
Regular Expression Substitution, Memory, and ELIZA

s/regexp1/regexp2/
• E.g. the 35 boxes → the <35> boxes
s/([0-9]+)/<\1>/
• The following pattern matches “The bigger they were, the bigger they
will be”, not “The bigger they were, the faster they will be”
/[tT]he (.*)er they were, the \1er they will be/
• The following pattern matches “The bigger they were, the bigger they
were”, not “The bigger they were, the bigger they will be”
/[tT]he (.*)er they (.*), the \1er they \2/
• The numbered memories \1 and \2 are called registers.
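Substitution and numbered memory can be tried in Python, where re.sub plays the role of the s/.../.../ operator and \1, \2 refer back to parenthesized groups. In this sketch the variable names are assumptions, and [tT]he is used at the start of the patterns so that the sentence-initial capital is matched.

```python
import re

# Substitution with memory: wrap every number in angle brackets.
boxed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

# \1 must repeat whatever the first group matched.
SAME_ADJECTIVE = r"[tT]he (.*)er they were, the \1er they will be"
same = re.search(SAME_ADJECTIVE, "The bigger they were, the bigger they will be")
different = re.search(SAME_ADJECTIVE, "The bigger they were, the faster they will be")

# Two registers: both the adjective and the verb phrase must repeat.
SAME_BOTH = r"[tT]he (.*)er they (.*), the \1er they \2"
both = re.search(SAME_BOTH, "The bigger they were, the bigger they were")
verb_differs = re.search(SAME_BOTH, "The bigger they were, the bigger they will be")

print(boxed, same is not None, different is None, both is not None, verb_differs is None)
```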



Regular Expressions
Regular Expression Substitution, Memory, and ELIZA

• Eliza worked by having a cascade of regular expression substitutions that
each matched some part of the input lines and changed them
• my → YOUR, I’m → YOU ARE …

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
User1: Men are all alike.
ELIZA1: IN WHAT WAY
User2: They’re always bugging us about something or other.
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3: Well, my boyfriend made me come here.
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4: He says I’m depressed much of the time.
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED
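The cascade of substitutions can be sketched in Python as a list of (pattern, reply) rules tried in order. This is a simplified illustration rather than Weizenbaum's program: the function name respond, the rule order, and the fallback reply "PLEASE GO ON" are assumptions made here, and only three of the rules above are included.

```python
import re

# Rules are tried in order; the first one that matches produces the reply.
RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance: str) -> str:
    # Step 1: turn first-person forms into second-person ones.
    # Both straight and curly apostrophes are accepted in "I'm".
    text = re.sub(r"\bI['’]m\b", "YOU ARE", utterance)
    text = re.sub(r"\bmy\b", "YOUR", text)
    # Step 2: apply the first rule whose pattern matches the whole input.
    for pattern, reply in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            # expand() fills \1 in the reply with the remembered group.
            return match.expand(reply).upper()
    return "PLEASE GO ON"  # hypothetical fallback, not from the slides

print(respond("Men are all alike."))
print(respond("He says I'm depressed much of the time."))
```

The program has no model of what the user means; each reply is produced purely by matching and rearranging the surface text.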

