Murenei - Natural Language Processing With Python and NLTK

This document provides a cheat sheet on natural language processing with Python and the nltk library. It covers topics like text handling, tokenization, part-of-speech tagging, parsing, named entity recognition, and using regular expressions with Pandas.

Natural Language Processing with Python & nltk Cheat Sheet

by RJ Murray (murenei) via cheatography.com/58736/cs/15485/

Handling Text

text='Some words'               Assign string
list(text)                      Split text into character tokens
set(text)                       Unique tokens
len(text)                       Number of characters

Part of Speech (POS) Tagging

nltk.help.upenn_tagset('MD')    Look up definition for a POS tag
nltk.pos_tag(words)             nltk in-built POS tagger
<use an alternative tagger to illustrate ambiguity>
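The sheet leaves the "alternative tagger" as a placeholder. One possible way to illustrate ambiguity, sketched here under assumptions, is a unigram tagger trained on a few hand-tagged sentences: it always assigns a word its most frequent tag, so "book" is tagged as a noun even where it is used as a verb. The training sentences and variable names below are hypothetical and not from the original sheet.

```python
import nltk

# Basic text handling from the table above.
text = 'Some words'
chars = list(text)          # character tokens
unique_chars = set(text)    # unique character tokens
print(len(text), len(unique_chars))

# Tiny hand-tagged training set (hypothetical data, Penn Treebank tags).
train_sents = [
    [("I", "PRP"), ("read", "VBP"), ("a", "DT"), ("book", "NN")],
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("long", "JJ")],
    [("please", "UH"), ("book", "VB"), ("a", "DT"), ("flight", "NN")],
]

# A unigram tagger stores the most frequent tag for each word and ignores context.
tagger = nltk.UnigramTagger(train_sents)

noun_use = tagger.tag(["the", "book", "is", "long"])
verb_use = tagger.tag(["please", "book", "a", "flight"])
print(noun_use)
print(verb_use)  # "book" is still tagged NN although it is a verb here
```

A context-aware tagger such as nltk.pos_tag(words) can often resolve cases like this, but it needs its tagger model downloaded first.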

Accessing corpora and lexical resources

from nltk.corpus import brown            Import CorpusReader object
brown.words(text_id)                     Returns pretokenised document as list of words
brown.fileids()                          Lists docs in Brown corpus
brown.categories()                       Lists categories in Brown corpus

Tokenization

text.split(" ")                          Split by space
nltk.word_tokenize(text)                 nltk in-built word tokenizer
nltk.sent_tokenize(doc)                  nltk in-built sentence tokenizer

Sentence Parsing

g=nltk.data.load('grammar.cfg')          Load a grammar from a file
g=nltk.CFG.fromstring("""...""")         Manually define grammar
parser=nltk.ChartParser(g)               Create a parser out of the grammar
trees=parser.parse_all(text)             Parse a list of tokens
for tree in trees: print(tree)           Print each parse tree
from nltk.corpus import treebank
treebank.parsed_sents('wsj_0001.mrg')    Treebank parsed sentences
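As a rough illustration of the parsing rows, the sketch below defines a toy grammar inline, tokenizes a sentence by splitting on spaces, and collects every parse. The grammar and sentence are assumptions made up for this example; they are not from the original sheet.

```python
import nltk

# Toy context-free grammar (hypothetical, for illustration only).
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'dog' | 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)

# The parser expects a list of tokens, so split the sentence first.
tokens = "the man saw a dog with a telescope".split(" ")

# parse_all returns every tree the grammar allows for the token list.
trees = parser.parse_all(tokens)
for tree in trees:
    print(tree)
```

The sentence is structurally ambiguous: the prepositional phrase can attach to the verb phrase or to the object noun phrase, so more than one tree is printed.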

Lemmatization & Stemming

input="List listed lists listing listings"    Different suffixes
words=input.lower().split(' ')                Normalize (lowercase) words
porter=nltk.PorterStemmer()                   Initialise Stemmer
[porter.stem(t) for t in words]               Create list of stems
WNL=nltk.WordNetLemmatizer()                  Initialise WordNet lemmatizer
[WNL.lemmatize(t) for t in words]             Use the lemmatizer

Text Classification

from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer().fit(X_train)           Fit bag of words
vect.get_feature_names_out()                  Get features (get_feature_names() before scikit-learn 1.0)
vect.transform(X_train)                       Convert to document-term matrix

By RJ Murray (murenei), cheatography.com/murenei/, tutify.com.au
Published 28th May, 2018. Last updated 29th May, 2018.

Entity Recognition (Chunking/Chinking)

g="NP: {<DT>?<JJ>*<NN>}"                     Regex chunk grammar
cp=nltk.RegexpParser(g)                      Parse grammar
ch=cp.parse(pos_sent)                        Parse tagged sent. using grammar
print(ch)                                    Show chunks
ch.draw()                                    Show chunks in IOB tree
cp.evaluate(test_sents)                      Evaluate against test doc
sents=nltk.corpus.treebank.tagged_sents()
print(nltk.ne_chunk(sent))                   Print chunk tree
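A small sketch of the chunking rows, using the same NP grammar. The tagged sentence is written by hand here (an assumption made so that no tagger model is needed); in practice pos_sent would come from a POS tagger.

```python
import nltk

# Hand-tagged sentence (hypothetical input, Penn Treebank tags).
pos_sent = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"),
    ("the", "DT"), ("cat", "NN"),
]

# NP chunk: optional determiner, any number of adjectives, then a noun.
g = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(g)
ch = cp.parse(pos_sent)
print(ch)

# Collect the text of every NP chunk found in the tree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in ch.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```

Tokens that match no chunk rule, such as "barked" and "at", stay as plain leaves directly under the root of the tree.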

RegEx with Pandas & Named Groups

df=pd.DataFrame(time_sents, columns=['text'])
df['text'].str.split().str.len()
df['text'].str.contains('word')
df['text'].str.count(r'\d')
df['text'].str.findall(r'\d')
df['text'].str.replace(r'\w+day\b', '???', regex=True)
df['text'].str.replace(r'(\w)', lambda x: x.groups()[0][:3], regex=True)
df['text'].str.extract(r'(\d?\d):(\d\d)')
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')
df['text'].str.extractall(r'(?P<digits>\d)')

Note: since pandas 2.0, str.replace treats the pattern as a literal string unless regex=True is passed.
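The calls above can be tried on a small frame. The sample sentences and the group names hour and minute below are assumptions chosen for illustration; they do not come from the original sheet.

```python
import pandas as pd

# Hypothetical sample data standing in for time_sents.
time_sents = [
    "Monday: The doctor's appointment is at 2:45pm.",
    "Tuesday: The dentist's appointment is at 11:30 am.",
    "Wednesday: At 7:00pm, there is a basketball game!",
]
df = pd.DataFrame(time_sents, columns=['text'])

# Number of whitespace-separated words and number of digits per row.
word_counts = df['text'].str.split().str.len()
digit_counts = df['text'].str.count(r'\d')

# Mask weekday names; regex=True is required for pattern matching in pandas 2.0+.
masked = df['text'].str.replace(r'\w+day\b', '???', regex=True)

# Named groups become column names in the extracted DataFrame.
times = df['text'].str.extract(r'(?P<hour>\d?\d):(?P<minute>\d\d)')

print(word_counts.tolist())
print(digit_counts.tolist())
print(masked.tolist())
print(times)
```

str.extract keeps only the first match per row, while str.extractall returns one row per match with an extra index level.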
