Python: a document for programmers

import nltk
from nltk.tokenize import word_tokenize

# Download the required Natural Language Toolkit (NLTK) resources.
# The punkt package is required for tokenizing sentences and words.
nltk.download('punkt')

# Sample text
text = "Natural Language Processing is a fascinating field of Artificial Intelligence!"

# Tokenize the text into words
tokens = word_tokenize(text)
print("Tokens:", tokens)

Tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'Artificial', 'Intelligence', '!']

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

"Punkt" refers to a pre-trained tokenization model that helps split text into sentences and
words. It doesn't have a specific full form but originates from the German word "Punkt,"
which means "point" or "dot." This is fitting since it relates to punctuation, which is critical
for text tokenization.
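Punkt's job can be approximated, very roughly, with a rule-based splitter. The sketch below splits after sentence-final punctuation and is only an illustration; punkt itself is a trained model that also handles cases like abbreviations, which this sketch gets wrong:

```python
import re

def naive_sent_split(text):
    # Split after '.', '!' or '?' followed by whitespace.
    # Unlike punkt, this misfires on abbreviations such as "Dr. Smith".
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

print(naive_sent_split("NLP is fun! It has many uses. Try it."))
# ['NLP is fun!', 'It has many uses.', 'Try it.']
```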
from nltk import ngrams

# Sample text
text = "I love natural language processing."

# Generate bigrams
bigram_model = list(ngrams(text.split(), 2))
print("Bigrams:", bigram_model)

import nltk
nltk.download('averaged_perceptron_tagger')

# Sample sentence
sentence = "I am learning NLP."

# Tokenize and assign POS tags
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)
print("POS Tags:", tags)

Bigrams: [('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing.')]

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.

POS Tags: [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('.', '.')]

Here’s a line-by-line breakdown of your code, combining both bigram generation and POS
tagging:

Bigram Generation

Line 1:
from nltk import ngrams

• Imports the ngrams function from the nltk library, which is used to generate n-grams
from text.

Line 2:
text = "I love natural language processing."

• Defines a sample sentence to be used for bigram generation.

Line 3:
bigram_model = list(ngrams(text.split(), 2))

• Splits the text into a list of words using split().
• Generates bigrams by passing 2 as the second argument to ngrams().
• Bigrams are pairs of consecutive words (n-grams with n = 2).

Line 4:
print("Bigrams:", bigram_model)

• Prints the generated list of bigrams.

Example Output:
Bigrams: [('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing.')]

Note that split() does not separate punctuation, so the final period stays attached to 'processing.'.
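For intuition, the pairing that ngrams() produces can be reproduced with plain zip() (a sketch of the idea, not NLTK's actual implementation):

```python
def my_ngrams(words, n):
    # Slide a window of size n across the word list; for these inputs the
    # result matches list(nltk.ngrams(words, n)).
    return list(zip(*(words[i:] for i in range(n))))

words = "I love natural language processing.".split()
print(my_ngrams(words, 2))  # bigrams: pairs of consecutive words
print(my_ngrams(words, 3))  # trigrams: triples of consecutive words
```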
POS Tagging (Part-of-Speech Tagging)

Line 6:
import nltk

• Imports the NLTK library.

Line 7:
nltk.download('averaged_perceptron_tagger')

• Downloads the POS tagger model (averaged_perceptron_tagger), which assigns grammatical parts of speech (like noun, verb, adjective) to words.

Line 9:
sentence = "I am learning NLP."

• Defines a new sample sentence for POS tagging.

Line 11:
tokens = nltk.word_tokenize(sentence)

• Tokenizes the sentence into words using nltk.word_tokenize().

Line 12:
tags = nltk.pos_tag(tokens)

• Assigns POS tags to the tokens using nltk.pos_tag().

Line 13:
print("POS Tags:", tags)

• Prints the tokens with their corresponding POS tags.

Example Output:
POS Tags: [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('.', '.')]

POS Tag Explanation
• PRP: Personal pronoun (I)
• VBP: Verb, present tense, not third person (am)
• VBG: Verb, gerund/present participle (learning)
• NNP: Proper noun (NLP)
• .: Punctuation (.)
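These Penn Treebank abbreviations can be turned into readable output with a small lookup table (the meanings below follow the standard Penn Treebank tagset; the table itself is just an illustration):

```python
# Meanings of the Penn Treebank tags used in the example above.
TAG_MEANINGS = {
    'PRP': 'personal pronoun',
    'VBP': 'verb, non-3rd person singular present',
    'VBG': 'verb, gerund or present participle',
    'NNP': 'proper noun, singular',
    '.': 'sentence-final punctuation',
}

tags = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('.', '.')]
for word, tag in tags:
    print(f"{word:10} {tag:4} {TAG_MEANINGS.get(tag, 'unknown tag')}")
```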

import nltk
from nltk.tag import hmm

# Training data: a list of sentences, each a list of (word, POS) pairs
train_data = [[('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD')]]
trainer = hmm.HiddenMarkovModelTrainer()
hmm_model = trainer.train(train_data)

# Test sentence
test_sentence = ['The', 'cat', 'meowed']
tags = hmm_model.tag(test_sentence)
print("Tagged Sentence:", tags)

Tagged Sentence: [('The', 'DT'), ('cat', 'DT'), ('meowed', 'DT')]

C:\ProgramData\anaconda3\Lib\site-packages\nltk\tag\hmm.py:334:
RuntimeWarning: overflow encountered in cast
X[i, j] = self._transitions[si].logprob(self._states[j])
C:\ProgramData\anaconda3\Lib\site-packages\nltk\tag\hmm.py:336:
RuntimeWarning: overflow encountered in cast
O[i, k] = self._output_logprob(si, self._symbols[k])
C:\ProgramData\anaconda3\Lib\site-packages\nltk\tag\hmm.py:332:
RuntimeWarning: overflow encountered in cast
P[i] = self._priors.logprob(si)
C:\ProgramData\anaconda3\Lib\site-packages\nltk\tag\hmm.py:364:
RuntimeWarning: overflow encountered in cast
O[i, k] = self._output_logprob(si, self._symbols[k])

Line-by-Line Explanation

Line 1:
import nltk

• Imports the NLTK (Natural Language Toolkit) library, which provides tools for text
processing and machine learning tasks in NLP.

Line 2:
from nltk.tag import hmm

• Imports the Hidden Markov Model (HMM) module from nltk.tag, which is used for
sequence prediction tasks such as Part of Speech (POS) tagging.

Training the HMM Model

Line 4:
train_data = [[('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD')]]

• Defines a training dataset as a list of tagged sentences, each a list of (word, POS) pairs:
– 'The': Determiner (DT)
– 'dog': Noun (NN)
– 'barked': Verb, past tense (VBD)
• An HMM requires labeled sequences to learn state transitions.

Line 5:
trainer = hmm.HiddenMarkovModelTrainer()

• Creates an instance of HiddenMarkovModelTrainer, which trains HMM models.

Line 6:
hmm_model = trainer.train(train_data)

• Trains the Hidden Markov Model using the provided training data.

Testing the Model

Line 8:
test_sentence = ['The', 'cat', 'meowed']

• Defines a test sentence that the model will tag.

Line 9:
tags = hmm_model.tag(test_sentence)

• Tags the words in the test_sentence using the trained HMM model.
• The model assigns the most probable POS tags based on the training it received.

Line 10:
print("Tagged Sentence:", tags)

• Prints the tagged test sentence.

Actual Output
Tagged Sentence: [('The', 'DT'), ('cat', 'DT'), ('meowed', 'DT')]

How It Works
• The model tags 'The' as DT because it saw that pairing in training.
• 'cat' and 'meowed' never appeared in the one-sentence training set, so the model has no emission probabilities for them; it falls back to the same tag for every unseen word (here DT), and the overflow warnings above come from taking logarithms of these zero probabilities.
• With a realistically sized training corpus, the HMM would generalize from transition and emission probabilities and could tag unseen words more sensibly (e.g. 'cat' as NN, 'meowed' as VBD).

Key Concepts
1. Hidden Markov Model (HMM):
– A statistical model where states (POS tags) are "hidden" but can be inferred
from observed data (words).
– Trained using sequences of labeled data.
2. POS Tags
– DT: Determiner
– NN: Noun
– VBD: Verb (past tense)

import nltk
from nltk import CFG

# Define grammar using a Context-Free Grammar (CFG)
grammar = CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBZ NP
DT -> 'The'
NN -> 'dog' | 'cat'
VBZ -> 'chases'
""")

# Parse the sentence
parser = nltk.ChartParser(grammar)
sentence = ['The', 'dog', 'chases', 'The', 'cat']
for tree in parser.parse(sentence):
    print(tree)

(S (NP (DT The) (NN dog)) (VP (VBZ chases) (NP (DT The) (NN cat))))

Line-by-Line Explanation

Line 1:
import nltk

• Imports the NLTK library, which provides tools for parsing and analyzing natural
language.

Line 2:
from nltk import CFG

• Imports the Context-Free Grammar (CFG) class from NLTK, used to define
grammatical rules for language parsing.

Defining Grammar

Line 4-11:
grammar = CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBZ NP
DT -> 'The'
NN -> 'dog' | 'cat'
VBZ -> 'chases'
""")

• Defines a context-free grammar (CFG) using CFG.fromstring(). The grammar consists of production rules:
– S -> NP VP: A sentence (S) is composed of a noun phrase (NP) followed by a verb phrase (VP).
– NP -> DT NN: A noun phrase (NP) is composed of a determiner (DT) followed by a noun (NN).
– VP -> VBZ NP: A verb phrase (VP) is composed of a verb (VBZ) followed by a noun phrase (NP).
– DT -> 'The': The determiner (DT) can only be 'The'.
– NN -> 'dog' | 'cat': The noun (NN) can be either 'dog' or 'cat'.
– VBZ -> 'chases': The verb (VBZ) is 'chases'.

Parsing the Sentence

Line 13:
parser = nltk.ChartParser(grammar)

• Creates a chart parser using the defined grammar to parse input sentences.

Line 14:
sentence = ['The', 'dog', 'chases', 'The', 'cat']

• Defines a sample sentence as a list of words.

Line 15-17:
for tree in parser.parse(sentence):
    print(tree)

• Iterates over possible parse trees generated by the parser and prints each tree.

Output Example
(S
(NP (DT The) (NN dog))
(VP (VBZ chases)
(NP (DT The) (NN cat))))

Explanation of the Parse Tree
• S: Start symbol representing the entire sentence.
• NP (DT The) (NN dog): The noun phrase "The dog."
• VP (VBZ chases) (NP (DT The) (NN cat)): The verb phrase "chases The cat."

Key Concepts
• Context-Free Grammar (CFG): A formal grammar where each production rule
defines how a symbol can be expanded.
• Chart Parsing: An efficient parsing technique for context-free grammars.
• Parse Tree: A hierarchical structure representing the grammatical composition of a
sentence.
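For intuition about how a parser checks a sentence against such rules, here is a tiny recursive-descent recognizer for the same grammar (a teaching sketch only; nltk.ChartParser handles arbitrary CFGs, builds the parse trees, and is far more efficient):

```python
# Productions of the grammar above; strings not present as keys are terminals.
GRAMMAR = {
    'S':   [['NP', 'VP']],
    'NP':  [['DT', 'NN']],
    'VP':  [['VBZ', 'NP']],
    'DT':  [['The']],
    'NN':  [['dog'], ['cat']],
    'VBZ': [['chases']],
}

def recognize(symbol, words, pos):
    """Return the set of positions reachable after matching `symbol` at `pos`."""
    if symbol not in GRAMMAR:  # terminal: must match the word at pos exactly
        return {pos + 1} if pos < len(words) and words[pos] == symbol else set()
    reachable = set()
    for production in GRAMMAR[symbol]:
        positions = {pos}
        for part in production:  # match each part of the production in sequence
            positions = {q for p in positions for q in recognize(part, words, p)}
        reachable |= positions
    return reachable

sentence = ['The', 'dog', 'chases', 'The', 'cat']
# The sentence is grammatical iff expanding 'S' can consume every word.
print(len(sentence) in recognize('S', sentence, 0))  # True
```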

import nltk
from nltk import word_tokenize, CFG, ChartParser

# Download necessary resources
nltk.download('punkt')

# Step 1: Tokenization
text = "The dog chases the cat."
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Step 2: Syntax Analysis (Parsing)
grammar = CFG.fromstring("""
S -> NP VP '.'
NP -> DT NN
VP -> VBZ NP
DT -> 'The' | 'the'
NN -> 'dog' | 'cat'
VBZ -> 'chases'
""")

parser = ChartParser(grammar)

print("\nParse Trees:")
for tree in parser.parse(tokens):
    print(tree)
    tree.pretty_print()

Tokens: ['The', 'dog', 'chases', 'the', 'cat', '.']

Parse Trees:
(S (NP (DT The) (NN dog)) (VP (VBZ chases) (NP (DT the) (NN cat))) .)
(tree.pretty_print() draws the same tree as an ASCII diagram: S at the root, branching into the NP "The dog", the VP "chases the cat", and the final '.')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
