1 - Introduction To NLP

Uploaded by

Kola Siri

Introduction:

• Language is a method of communication with the help of which we can speak, read and write.

• Natural Language Processing (NLP) is a subfield of linguistics, computer science and artificial intelligence (AI) that focuses on the interaction between computers and human language.

• Key components and tasks within NLP include:

1. Text Understanding: extracting structured information from unstructured text data.

2. Speech Recognition: converting spoken language into written text.

3. Language Generation: generating text or speech from structured data. Examples include chatbots, language translation, and text summarization.

4. Machine Translation: automatically translating text or speech from one language to another.

5. Question Answering: answering questions posed in natural language. Examples include search engines and virtual assistants.

6. Language Understanding: NLP models aim to understand the meaning, context, and nuances of human language.

7. Text Summarization: automatically summarizing long texts.

8. Language Models: large pre-trained language models, such as GPT-3 (Generative Pre-trained Transformer 3) and its successors, have revolutionized NLP by providing a foundation for a wide range of language-related tasks.

• These models can be fine-tuned for specific applications.

• NLP has numerous real-world applications across various industries, including healthcare (clinical text analysis), finance (automated trading and sentiment analysis), customer service (chatbots), education (language tutoring), and many more.
ORIGINS OF NLP
Alan Turing is often described as the father of natural language processing. In his 1950 paper "Computing Machinery and Intelligence", he described a test for an intelligent machine that could understand and respond to natural human conversation.
• The history of NLP is divided into three phases.

• First Phase (Machine Translation Phase) - Late 1940s to late 1960s

• The work done in this phase focused mainly on machine translation (MT).

• 1948 - The first recognizable NLP application was introduced at Birkbeck College, London.

• 1950s - There were conflicting views between linguistics and computer science.

• Later, in 1957, Chomsky published his first book, Syntactic Structures, and claimed that language is generative in nature.
• Second Phase (Flavored with Artificial Intelligence) - 1960 to 1980

• From 1960 to 1980, the key developments were:

• Augmented Transition Networks (ATN): An ATN extends a finite state machine with recursion and registers, which lets it recognize more than regular languages.

• Case Grammar: Case Grammar was developed by the linguist Charles J. Fillmore in 1968. It describes how languages such as English express the relationship between nouns and verbs, for example through prepositions.

• In Case Grammar, case roles can be defined to link certain kinds of verbs and objects.

• For example: "Ram broke the glass with the hammer". In this example, Case Grammar identifies Ram as the agent, the glass as the theme, and the hammer as the instrument.
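
The case roles in the example can be represented as a small data structure. The following is only a minimal sketch: the `CaseFrame` class and the `frame_from_sentence` helper are hypothetical names, and the helper assumes the fixed pattern "AGENT VERB the THEME with the INSTRUMENT", which real sentences rarely follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseFrame:
    """Case roles attached to one verb (hypothetical structure)."""
    verb: str
    agent: Optional[str] = None
    theme: Optional[str] = None
    instrument: Optional[str] = None


def _drop_articles(words):
    """Join words, ignoring the article 'the'; return None if nothing is left."""
    kept = [w for w in words if w.lower() != "the"]
    return " ".join(kept) or None


def frame_from_sentence(sentence):
    """Fill a CaseFrame, assuming 'AGENT VERB the THEME with the INSTRUMENT'."""
    words = sentence.rstrip(".").split()
    agent, verb, rest = words[0], words[1], words[2:]
    if "with" in rest:
        # The preposition "with" marks the instrument role.
        split_at = rest.index("with")
        theme_words, instrument_words = rest[:split_at], rest[split_at + 1:]
    else:
        theme_words, instrument_words = rest, []
    return CaseFrame(verb, agent, _drop_articles(theme_words),
                     _drop_articles(instrument_words))


frame = frame_from_sentence("Ram broke the glass with the hammer")
print(frame)
```

The point of the sketch is only that the preposition signals the role; a real system would need a parser rather than word positions.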

• Between 1960 and 1980, key systems such as SHRDLU and LUNAR were developed.

• SHRDLU is a program written by Terry Winograd in 1968-70. It lets users communicate with the computer in natural language to move objects.

• It can handle instructions such as "pick up the green ball" and also answer questions like "What is inside the black box?"

• The main importance of SHRDLU is that it shows that syntax, semantics, and reasoning about the world can be combined to produce a system that understands a natural language.

• LUNAR is the classic example of a natural language database interface system; it used ATNs (Augmented Transition Networks) and Woods' Procedural Semantics.

• It was capable of translating elaborate natural language expressions into database queries and handled 78% of requests without errors.

• Third Phase (Machine and Deep Learning Algorithms) - 1980 to present

• Until 1980, natural language processing systems were based on complex sets of hand-written rules.

• After 1980, machine learning algorithms were introduced into NLP.

• In the early 1990s, NLP started growing faster and achieved good processing accuracy, especially for English grammar.

• Also in 1990, electronic text was introduced, which provided a good resource for training and evaluating natural language programs.

• Other factors include the availability of computers with fast CPUs and more memory.

• The major factor behind the advancement of natural language processing was the Internet.

• Modern NLP now consists of various applications, such as speech recognition, machine translation, and machine text reading.

• Amazon Alexa combines these applications, allowing the artificial intelligence to gain knowledge of the world and reply to the questions we ask.
NATURAL LANGUAGE PROCESSING (NLP) CHALLENGES
• NLP is a powerful tool with huge benefits, but it still has a number of limitations and problems:

1. Contextual words, phrases and homonyms

2. Synonyms

3. Irony and sarcasm

4. Ambiguity

5. Errors in text or speech

6. Colloquialisms and slang

7. Domain-specific language

8. Low-resource languages

9. Lack of research and development


1. Contextual words, phrases and homonyms

• The same words and phrases can have different meanings according to the context of a sentence, and many words, especially in English, have exactly the same pronunciation but totally different meanings.

• For example:
• I ran to the store because we ran out of milk.
• Can I run something past you that you may really like?
• The house is looking really run down.

• NLP language models may have learned all of the definitions, but differentiating between them in context can still present problems.

• Homonyms (two or more words that are pronounced the same but have different definitions) can be problematic for question answering and speech-to-text applications.

• Usage of "their" and "there", for example, is a common problem even for humans.

2. Synonyms

• Synonyms can lead to issues similar to contextual understanding because we use many different words to express the same idea.

• Some of these words may convey exactly the same meaning, while others differ in degree or nuance (small, little, tiny, minute).

• So, when building NLP systems, it's important to include all of a word's possible meanings and all possible synonyms.
3. Irony and sarcasm

• Irony and sarcasm present problems for machine learning models because they generally use words and phrases that, taken literally, may be positive or negative but actually convey the opposite.

• For example: A doctor becomes ill on a healthy lifestyle show.

• For example: Saying "Oh, great!" when something bad happens.

• Models can be trained with certain cues that frequently accompany ironic or sarcastic phrases, like "yeah right," "whatever," etc.

• Sarcasm is a specific form of verbal irony that involves making a cutting or mocking remark to convey contempt, criticism, or humor.

• Example: When someone makes a mistake, you might say, "Brilliant move."
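
The idea of cue phrases can be illustrated with a simple lookup. This is a rough sketch, not a trained model: the cue list `SARCASM_CUES` is an assumption built from the phrases mentioned above, and a cue match only flags a sentence as a candidate for sarcasm.

```python
# Cue phrases taken from the examples above; the list itself is an assumption.
SARCASM_CUES = ("yeah right", "whatever", "oh, great")


def has_sarcasm_cue(text):
    """Return True if the text contains any known cue phrase (case-insensitive)."""
    lowered = text.lower()
    return any(cue in lowered for cue in SARCASM_CUES)


print(has_sarcasm_cue("Yeah right, that will work."))
print(has_sarcasm_cue("The train arrived at the station."))
```

A remark such as "Brilliant move." contains no cue at all, which is why cue lists alone are not enough and context is still needed.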

4. Ambiguity: Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible interpretations.

• Lexical ambiguity: a word that could be used as a verb, noun, or adjective.

For example, "train":

Verb: "They will train the new employees."

Noun: "The train arrived at the station."

Adjective (noun used as a modifier): "She took a train trip."

• Semantic ambiguity: Semantic ambiguity arises from the meaning of words or phrases within a sentence.

For example: "He is a smart guy." (Here "smart" may mean intelligent or well dressed.)

• Syntactic ambiguity: Syntactic ambiguity arises from the structure or arrangement of words and phrases within a sentence.

• For example: "I saw the man with the telescope."

• NLP requires additional context or semantic information to disambiguate and determine the intended meaning.

5. Errors in text and speech

• Misspelled or misused words can create problems for text analysis.

• Autocorrect and grammar correction applications can handle common mistakes, but don't always understand the writer's intention.

• With spoken language, mispronunciations, different accents, stutters, etc., can be difficult for a machine to understand.

• For example:

• Grammatical errors:

Incorrect: "He don't like pizza."

Correct: "He doesn't like pizza."

• Spelling errors:

Incorrect: "Recieve my email."

Correct: "Receive my email."
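
A spelling error like the one above can often be repaired by looking for the closest word in a known vocabulary. The sketch below uses `difflib.get_close_matches` from the Python standard library; the tiny `VOCABULARY` list and the `correct_word` helper are assumptions made for illustration.

```python
import difflib

# A toy vocabulary; a real system would use a full dictionary.
VOCABULARY = ["receive", "my", "email", "he", "does", "not", "like", "pizza"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if none is close."""
    lowered = word.lower()
    if lowered in vocabulary:
        return lowered
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else lowered


corrected = " ".join(correct_word(w) for w in "Recieve my email".split())
print(corrected)
```

This approach only fixes words that are missing from the vocabulary. A grammatical error such as "He don't like pizza" consists entirely of valid words, so it needs a different kind of check.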

6. Colloquialisms and slang

• Colloquialisms and slang are informal expressions and words that are commonly used in everyday speech, particularly within specific regions or social groups.

• They often deviate from formal language and may not be suitable for formal writing.

• Examples of colloquialisms:

"Wanna" (short for "want to"): "I wanna go to the movies tonight."

"Gonna" (short for "going to"): "She's gonna be here in a minute."

• Examples of slang:

"Lit" (meaning exciting or excellent): "The party last night was lit!"

"Chill" (meaning relaxed or calm): "Let's just chill at a party."

7. Domain-specific language

• In the context of NLP challenges, domain-specific language usually means the specialized vocabulary of a field such as medicine, law, or finance, which a model trained on general text may handle poorly.

• The same term, abbreviated DSL, is also used for a specialized or application-specific programming or scripting language that is designed for a specific domain.

• Examples of such domain-specific languages:
1. SQL (Structured Query Language)
2. HTML (Hypertext Markup Language)
3. CSS (Cascading Style Sheets)
4. Regular Expressions
5. VHDL (VHSIC (Very High-Speed Integrated Circuit) Hardware Description Language)
6. LaTeX
7. R (Statistical Programming Language)
8. MATLAB (Matrix Laboratory)
9. DSLs in Game Development
• DSLs play a crucial role in simplifying complex tasks within specific domains and enabling domain experts to work more effectively.
8. Low-resource languages

• Low-resource languages are languages that have limited linguistic resources and tools available for natural language processing (NLP).

• These languages often face challenges in terms of data availability, linguistic resources, and technological support.

• For example, several Indian languages can be considered low-resource languages in the context of NLP and language technology development, such as Konkani, Khasi, Maithili, and Bodo.

• Efforts to address the challenges of low-resource languages include collaborative initiatives, data collection projects, and the development of language technologies.
9. Lack of research and development

• Machine learning requires a lot of data to function to its outer limits: billions of pieces of training data.

• The more data NLP models are trained on, the smarter they become.

• All of the problems above will require more research and new techniques in order to improve on them.

• Advanced practices like artificial neural networks and deep learning allow a multitude of NLP techniques, algorithms, and models to work progressively, much like the human mind does.
NLP pipeline
• The NLP pipeline includes the following steps for building an NLP model.

Step 1: Sentence Segmentation

• Sentence segmentation is the first step in building the NLP pipeline. It breaks the paragraph into separate sentences.

• For example, consider the following paragraph:

• Independence Day is one of the important festivals for every Indian citizen. It is celebrated on the 15th of August each year ever since India got independence from the British rule. The day celebrates independence in the true sense.

• Sentence segmentation produces the following result:

• 1. "Independence Day is one of the important festivals for every Indian citizen."

• 2. "It is celebrated on the 15th of August each year ever since India got independence from the British rule."

• 3. "The day celebrates independence in the true sense."
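
Sentence segmentation of this paragraph can be sketched with a regular expression. This is a deliberately naive approach, assuming that every sentence ends in ".", "!" or "?" followed by whitespace; it would wrongly split after abbreviations such as "Dr." The function name `segment_sentences` is hypothetical.

```python
import re


def segment_sentences(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


paragraph = (
    "Independence Day is one of the important festivals for every Indian citizen. "
    "It is celebrated on the 15th of August each year ever since India got "
    "independence from the British rule. "
    "The day celebrates independence in the true sense."
)

sentences = segment_sentences(paragraph)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```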


Step 2: Word Tokenization

• A word tokenizer breaks the sentence into separate words or tokens.

• For example:

• Vardhaman college offers engineering courses for undergraduates and postgraduates.

• The word tokenizer generates the following result:

"Vardhaman", "college", "offers", "engineering", "courses", "for", "undergraduates", "and", "postgraduates"
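
A minimal word tokenizer for this example can be written with a regular expression. The sketch assumes that a token is simply a run of letters, digits, or underscores, so punctuation is dropped, which matches the result listed above; the name `tokenize_words` is hypothetical.

```python
import re


def tokenize_words(sentence):
    """Return runs of word characters; punctuation is discarded."""
    return re.findall(r"\w+", sentence)


tokens = tokenize_words(
    "Vardhaman college offers engineering courses for undergraduates and postgraduates."
)
print(tokens)
```

Library tokenizers usually keep punctuation as separate tokens and handle cases such as "don't", which this pattern would split into two tokens.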

Step 3: Stemming

• Stemming normalizes words into their base or root form.

• For example, celebrates, celebrated and celebrating all originate from the single root word "celebrate".

• The big problem with stemming is that it sometimes produces a root word that has no meaning.

• For example, intelligence, intelligent, and intelligently are all reduced to the single root "intelligen".

• In English, the word "intelligen" does not have any meaning.
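
The behaviour described above can be reproduced with a toy suffix stripper. This is only a sketch under simple assumptions: the `SUFFIXES` tuple and the `naive_stem` function are hypothetical and far cruder than an established algorithm such as the Porter stemmer.

```python
# Suffixes are tried in this order; the list is an illustrative assumption.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {w: naive_stem(w) for w in ("celebrates", "celebrated", "celebrating")}
print(stems)
```

All three forms are grouped under the same stem, "celebrat", which is not an English word. That is the same weakness as the "intelligen" example.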

Step 4: Lemmatization

• Lemmatization is quite similar to stemming. It groups the different inflected forms of a word under a common base form, called the lemma.

• The main difference between stemming and lemmatization is that lemmatization produces a root word that has a meaning.

• For example, in lemmatization the words intelligence, intelligent, and intelligently have the root word "intelligent", which has a meaning.
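
Because a lemma must be a real word, lemmatization is usually driven by a dictionary rather than by suffix rules. The sketch below uses a tiny hand-written lookup table; `LEMMA_TABLE` and its entries are assumptions that mirror the examples in these slides, not the output of any real lemmatizer.

```python
# Hand-written lookup table that mirrors the examples in the text (assumption).
LEMMA_TABLE = {
    "celebrates": "celebrate",
    "celebrated": "celebrate",
    "celebrating": "celebrate",
    "intelligence": "intelligent",
    "intelligently": "intelligent",
}


def lemmatize(word):
    """Return the lemma from the table, or the lowercased word if unknown."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


print([lemmatize(w) for w in ("Celebrating", "intelligently", "boy")])
```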

Step 5: Identifying Stop Words

• In English, there are a lot of words that appear very frequently, like "is", "and", "the", and "a". NLP pipelines flag these words as stop words.

• Stop words might be filtered out before doing any statistical analysis.

• For example, in "He is a good boy", the words "is" and "a" are stop words.
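
Stop word filtering is a simple membership test. The sketch below uses only the four stop words named above; the `STOP_WORDS` set is therefore an assumption, and real stop word lists are much longer (many would also include "he").

```python
# Only the stop words named in the text; real lists are much longer.
STOP_WORDS = {"is", "and", "the", "a"}


def remove_stop_words(tokens):
    """Keep the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


filtered = remove_stop_words("He is a good boy".split())
print(filtered)
```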

Step 6: Dependency Parsing

• Dependency parsing finds how all the words in a sentence are related to each other (it determines the syntactic relationships between words).
Step 7: POS Tagging

• POS stands for parts of speech, which include noun, verb, adverb, and adjective.

• A POS tag indicates how a word functions, both in meaning and grammatically, within the sentence.

• A word can have one or more parts of speech depending on the context in which it is used.

• Example: "Google" something on the Internet.

• Here, "Google" is used as a verb, although it is a proper noun.

Step 8: Named Entity Recognition (NER)

• Named Entity Recognition (NER) is the process of detecting named entities such as a person name, movie name, organization name, or location.

• For example: Steve Jobs introduced the iPhone at the Macworld Conference in San Francisco, California.
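
The simplest form of NER is a lookup in a list of known names (a gazetteer). The sketch below is illustrative only: the `GAZETTEER` entries and their labels are assumptions chosen for this one sentence, whereas practical NER systems learn to recognize names they have never seen.

```python
# Known names and labels for this example only (assumed, not from a real model).
GAZETTEER = {
    "Steve Jobs": "PERSON",
    "iPhone": "PRODUCT",
    "Macworld Conference": "EVENT",
    "San Francisco": "LOCATION",
    "California": "LOCATION",
}


def find_entities(text):
    """Return (name, label) pairs for gazetteer names, in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        position = text.find(name)
        if position != -1:
            found.append((position, name, label))
    return [(name, label) for _, name, label in sorted(found)]


sentence = ("Steve Jobs introduced the iPhone at the Macworld Conference "
            "in San Francisco, California.")
entities = find_entities(sentence)
print(entities)
```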

Step 9: Chunking

• Chunking collects individual pieces of information and groups them into bigger pieces of sentences.
Phases in Natural Language Processing
• Natural Language Processing is separated into five primary stages or phases,
starting with simple text processing and progressing to identifying
complicated phrase meanings.

• Figure 1.1 shows the five phases of NLP.

Figure 1.1 Five phases of Natural Language Processing


1) Lexical Analysis

• Lexical or morphological analysis is the initial step in NLP, which entails recognizing and analyzing word structures.

• The collection of words and phrases in a language is referred to as the lexicon.

• Lexical analysis is the process of breaking down a text file into paragraphs, sentences, phrases, and words.

• The input text is scanned as a stream of characters and converted into meaningful lexemes in this phase.
• It includes the following techniques:

i. Stop word removal (removing ‘and’, ‘of’, ‘the’ etc. from text)

ii. Tokenization (breaking the text into sentences or words)

Word tokenizer

Sentence tokenizer

Tweet tokenizer

iii. Stemming (removing ‘ing’, ‘es’, ‘s’ from the tail of the words)

iv. Lemmatization (converting the words to their base forms)

2) Syntactic Analysis: Syntactic or syntax analysis is a technique for checking grammar, arranging words, and displaying the relationships between them.

• Syntax analysis checks that the structure of a particular piece of text is proper.

• It tries to parse the sentence in order to ensure that the grammar is correct at the sentence level.

• For example, the sentence "Truck is eating oranges" is grammatically well formed but does not make sense.

• Hence there is a need to analyze the intent of the words in a sentence.

• Some of the techniques used in this phase are:

i. Dependency Parsing

ii. Parts of Speech (POS) tagging


3) Semantic Analysis: Semantic analysis is the process of looking for meaning in a statement.

• Semantic analysis concentrates mostly on the literal meaning of words, phrases, and sentences.

• It also deals with how words are put together to form sentences and extracts the text's exact meaning or dictionary definition.

For example: "Truck is eating oranges" will be disregarded as meaningless, because a truck cannot eat.

4) Discourse Integration: Its scope is not limited to a single word or sentence; rather, discourse integration helps in studying the whole text.
For example: "John got ready at 9 AM. Later he took the train to California."

• Here, the machine is able to understand that the word "he" in the second sentence refers to "John".
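
Linking "he" back to "John" can be sketched with a very crude heuristic: resolve each pronoun to the most recently mentioned known name. The `resolve_pronouns` function, the `PRONOUNS` set, and the idea of passing in the known names are all assumptions for illustration; real coreference resolution also has to consider gender, number, and syntax.

```python
import re

PRONOUNS = {"he", "she"}


def resolve_pronouns(text, known_names):
    """Link each pronoun to the most recent known name that precedes it."""
    last_name = None
    links = []
    for token in re.findall(r"\w+", text):
        if token in known_names:
            last_name = token
        elif token.lower() in PRONOUNS and last_name is not None:
            links.append((token, last_name))
    return links


links = resolve_pronouns(
    "John got ready at 9 AM. Later he took the train to California.",
    known_names={"John"},
)
print(links)
```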

5) Pragmatic Analysis: It is a complex phase in which machines need knowledge not only about the provided text but also about the real world.

• Pragmatic analysis refers to the process of extracting the intended meaning of language from the situation in which it is used.

• There can be multiple scenarios where the intent of a sentence can be misunderstood if the machine doesn't have real-world knowledge.

Examples:

"Thank you for coming so late, we have wrapped up the meeting." (contains sarcasm)

"Can you share your screen?" (here the context is a computer's screen share during a remote meeting)


Language and Grammar
• Natural Language

• A natural language (or ordinary language) is a language that is read, spoken, and written by humans for general-purpose communication.

• Example : Hindi, English, French, and Chinese, etc.

• Difference between Natural language and Computer Language

Natural Language | Computer Language
Natural language has a very large vocabulary. | Computer language has a very limited vocabulary.
Natural language is easily understood by humans. | Computer language is easily understood by machines.
Natural language is ambiguous in nature. | Computer language is unambiguous.
• NLP is commonly used for chatbots, virtual assistants, modern spam detection, etc.

• But NLP isn't perfect: although there are over 7000 languages spoken around the globe, most NLP systems support only some of them, such as English, Hindi, Chinese, Urdu, Farsi, Arabic, French, and Spanish.

• Components of NLP

• NLP encompasses everything a computer needs in order to understand natural language (typed or spoken) and also to generate natural language.

• There are two components of Natural Language Processing:

1. Natural Language Understanding (NLU)

2. Natural Language Generation (NLG)


1. Natural Language Understanding (NLU)

• NLU enables machines to understand and interpret human language by extracting metadata from content. It performs the following tasks:

- Helps to analyze different aspects of language.

- Helps to map the input in natural language into valid representations.

• NLU is more difficult than NLG because natural language input can be lexically and syntactically ambiguous.

• Lexical ambiguity: a word could be used as a verb, noun, or adjective, and the meaning of the sentence changes accordingly.
For example, "train":

Verb: "They will train the new employees."

Noun: "The train arrived at the station."

Adjective (noun used as a modifier): "She took a train trip."

• Syntactic ambiguity: Syntactic ambiguity arises from the structure or arrangement of words and phrases within a sentence.

• For example: "I saw the man with the telescope."

Interpretation 1: "I saw the man who had a telescope."

Interpretation 2: "Using the telescope, I saw the man."


2. Natural Language Generation (NLG)

• NLG is a method of creating meaningful phrases and sentences (natural language) from data.

• It comprises three stages: text planning, sentence planning, and text realization.

- Text planning: Retrieving applicable content.

- Sentence planning: Forming meaningful phrases and setting the sentence tone.

- Text realization: Mapping sentence plans to sentence structures.

• Chatbots, machine translation tools, analytics platforms, voice assistants, sentiment analysis platforms, and AI-powered transcription tools are some applications of NLG.
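
The three NLG stages can be sketched as three small functions applied in sequence. This is a template-based illustration only: the weather record, its field names, the 25-degree threshold, and the function names are all assumptions, and modern NLG systems typically use learned models rather than fixed templates.

```python
def plan_text(record):
    """Text planning: select the content worth talking about."""
    return {key: record[key] for key in ("city", "temperature_c") if key in record}


def plan_sentence(content):
    """Sentence planning: choose the wording and tone for the selected content."""
    descriptor = "warm" if content["temperature_c"] >= 25 else "cool"
    return {
        "subject": content["city"],
        "descriptor": descriptor,
        "temperature_c": content["temperature_c"],
    }


def realize(plan):
    """Text realization: map the sentence plan onto a sentence structure."""
    return (f"It is {plan['descriptor']} in {plan['subject']} today, "
            f"at {plan['temperature_c']} degrees Celsius.")


# Hypothetical input data; "station_id" is dropped during text planning.
record = {"city": "Hyderabad", "temperature_c": 31, "station_id": "X1"}
sentence = realize(plan_sentence(plan_text(record)))
print(sentence)
```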
• A language is a system: a set of symbols and a set of rules (or grammar).

- The symbols are combined to convey new information.

- The rules govern the manipulation of symbols.

• Formal Language

• Before defining formal language, we need to define symbols, alphabets, strings and words.

• A symbol is a character, an abstract entity that has no meaning by itself.

e.g., letters, digits and special characters

• An alphabet is a finite set of symbols; an alphabet is often denoted by Σ (sigma).

e.g., B = {0, 1} says B is an alphabet of two symbols, 0 and 1.

C = {a, b, c} says C is an alphabet of three symbols, a, b and c.

• A string (or word) is a finite sequence of symbols from an alphabet.

e.g., 01110 and 111 are strings over the alphabet B above.

aaabccc and b are strings over the alphabet C above.

• A language is a set of strings derived from an alphabet.

• A formal language (or simply language) is a set L of strings over some finite alphabet Σ.

• A formal language is described using a formal grammar.
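
These definitions translate directly into code. The sketch below represents the alphabet B as a set of one-character strings and checks whether a string is over it; the language `ends_in_one` (strings over B that end in 1) is a made-up example, and the helper names are hypothetical.

```python
from itertools import product

B = {"0", "1"}  # the alphabet B from the example above


def is_string_over(string, alphabet):
    """True if every symbol of the string belongs to the alphabet."""
    return all(symbol in alphabet for symbol in string)


def strings_up_to(alphabet, max_length):
    """All strings over the alphabet with length 0..max_length."""
    result = []
    for length in range(max_length + 1):
        for symbols in product(sorted(alphabet), repeat=length):
            result.append("".join(symbols))
    return result


# A made-up formal language: strings over B (up to length 3) that end in "1".
ends_in_one = {s for s in strings_up_to(B, 3) if s.endswith("1")}

print(is_string_over("01110", B))
print(is_string_over("aaabccc", B))
print(sorted(ends_in_one))
```

A formal language is normally infinite, so the length bound here exists only to make the set printable.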

• Grammar

• Grammar refers to the set of rules that govern the structure and syntax of a language.

• These rules define how words and phrases can be combined to create grammatically correct sentences.

• Grammar is crucial in NLP for tasks such as parsing sentences, resolving ambiguity, and ensuring that generated text is linguistically accurate.

• It provides a formal framework for understanding and generating human language in computational systems.

• It is the specification of the legal structures of a language and has three basic components: terminal symbols, non-terminal symbols, and rules (productions).
• In formal language theory, a context-free grammar is a grammar in which every production rule is of the form:

A → α

where A is a single non-terminal symbol and α is a string of terminals and/or non-terminals (possibly empty).

• Mathematically, a grammar G can be written as a 4-tuple (N, T, S, P), where:

• N or V_N = set of non-terminal symbols, or variables.

• T or Σ = set of terminal symbols.

• S = start symbol, where S ∈ N.

• P = production rules for terminals as well as non-terminals.
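
The 4-tuple can be written out concretely for the sentence "I saw the man with the telescope". The grammar below is a small hand-made assumption, not a grammar of English, and the `count_parses` function is a sketch of the CKY chart method restricted to rules with one or two symbols on the right-hand side. It counts how many parse trees the grammar allows.

```python
from collections import defaultdict

# A toy grammar G = (N, T, S, P); all rules are illustrative assumptions.
N = {"S", "NP", "VP", "PP", "Det", "N", "V", "P"}
T = {"I", "saw", "the", "man", "with", "telescope"}
S = "S"
P = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("VP", ("VP", "PP")),   # "with the telescope" attaches to the verb phrase
    ("NP", ("Det", "N")),
    ("NP", ("NP", "PP")),   # "with the telescope" attaches to "the man"
    ("PP", ("P", "NP")),
    ("NP", ("I",)),
    ("V", ("saw",)),
    ("Det", ("the",)),
    ("N", ("man",)),
    ("N", ("telescope",)),
    ("P", ("with",)),
]


def count_parses(words):
    """Count parse trees with a CKY chart: chart[i][j][A] = ways A derives words[i:j]."""
    n = len(words)
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for lhs, rhs in P:
            if rhs == (word,):
                chart[i][i + 1][lhs] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, rhs in P:
                    if len(rhs) == 2:
                        left, right = rhs
                        ways = chart[i][k].get(left, 0) * chart[k][j].get(right, 0)
                        if ways:
                            chart[i][j][lhs] += ways
    return chart[0][n].get(S, 0)


parses = count_parses("I saw the man with the telescope".split())
print(parses)
```

With this grammar the sentence has two parse trees, one for each attachment of "with the telescope", which is syntactic ambiguity expressed in grammar terms.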


Assignment Questions

1. Define NLP. What are the key components or tasks of NLP?

2. Explain the origins of NLP.

3. What are the challenges of NLP? Give details of each challenge.

4. Discuss the NLP pipeline.

5. Explain the phases of NLP.

6. Define natural language. Differentiate between natural language and computer language.

7. What are the components of NLP? Explain with examples.

8. Define language, formal language, and grammar.
