Natural Language Processing with Python & nltk Cheat Sheet
by RJ Murray (murenei) via cheatography.com/58736/cs/15485/
Handling Text

text='Some words'                        Assign string
list(text)                               Split text into character tokens
set(text)                                Unique tokens
len(text)                                Number of characters

Accessing corpora and lexical resources

from nltk.corpus import brown            Import CorpusReader object
brown.words(text_id)                     Returns pretokenised document as a list of words
brown.fileids()                          Lists docs in Brown corpus
brown.categories()                       Lists categories in Brown corpus

Tokenization

text.split(" ")                          Split by space
nltk.word_tokenize(text)                 nltk in-built word tokenizer
nltk.sent_tokenize(doc)                  nltk in-built sentence tokenizer

Sentence Parsing

g=nltk.data.load('grammar.cfg')          Load a grammar from a file
g=nltk.CFG.fromstring("""...""")         Manually define grammar
parser=nltk.ChartParser(g)               Create a parser from the grammar
trees=parser.parse_all(text)             Parse a tokenised sentence
for tree in trees: print(tree)           Print the resulting parse trees
from nltk.corpus import treebank         Import the Penn Treebank corpus
treebank.parsed_sents('wsj_0001.mrg')    Treebank parsed sentences

Text Classification

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vect=CountVectorizer().fit(X_train)      Fit bag-of-words model to data
vect.get_feature_names()                 Get feature names (get_feature_names_out() in newer scikit-learn)
vect.transform(X_train)                  Convert to document-term matrix
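Under the hood, CountVectorizer just builds a vocabulary during fit and counts vocabulary tokens per document during transform. A minimal dependency-free sketch of that idea (the helper names and toy sentences are illustrative, not part of scikit-learn):

```python
from collections import Counter

def fit_bag_of_words(docs):
    """Build a sorted vocabulary from lowercased, whitespace-split docs."""
    return sorted({tok for d in docs for tok in d.lower().split()})

def transform(docs, vocab):
    """Return a doc-term count matrix as a list of rows, one per doc."""
    matrix = []
    for d in docs:
        counts = Counter(d.lower().split())
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return matrix

X_train = ["the cat sat", "the dog sat on the cat"]
vocab = fit_bag_of_words(X_train)   # ['cat', 'dog', 'on', 'sat', 'the']
dtm = transform(X_train, vocab)     # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

The real CountVectorizer additionally handles its own tokenisation, n-grams, stop words and sparse output, but the fit/transform split is the same.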
Lemmatization & Stemming

input="List listed lists listing listings"   Words with different suffixes
words=input.lower().split(' ')               Normalise (lowercase) and split into words
porter=nltk.PorterStemmer()                  Initialise stemmer
[porter.stem(t) for t in words]              Create list of stems
WNL=nltk.WordNetLemmatizer()                 Initialise WordNet lemmatizer
[WNL.lemmatize(t) for t in words]            Use the lemmatizer

Part of Speech (POS) Tagging

nltk.help.upenn_tagset('MD')                 Look up definition of a POS tag
nltk.pos_tag(words)                          nltk in-built POS tagger
<use an alternative tagger to illustrate ambiguity>

Entity Recognition (Chunking/Chinking)

g="NP: {<DT>?<JJ>*<NN>}"                     Regex chunk grammar
cp=nltk.RegexpParser(g)                      Compile the grammar into a chunk parser
ch=cp.parse(pos_sent)                        Parse a tagged sentence using the grammar
print(ch)                                    Show chunks
ch.draw()                                    Show chunks in IOB tree
cp.evaluate(test_sents)                      Evaluate against a test set
sents=nltk.corpus.treebank.tagged_sents()    Load tagged sentences for evaluation
print(nltk.ne_chunk(sent))                   Named entity recognition: print chunk tree
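The chunk grammar <DT>?<JJ>*<NN> is itself just a regular expression over POS tags. A dependency-free sketch of what nltk.RegexpParser matches, applying the same pattern with plain re over an encoded tag sequence (the toy sentence and helper logic are illustrative; nltk operates on Tree objects instead):

```python
import re

# A POS-tagged sentence as (word, tag) pairs -- toy data, not from a corpus.
pos_sent = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("cats", "NNS")]

# Encode the tag sequence as "<DT><JJ><NN><VBD><IN><NNS>" so the chunk
# pattern <DT>?<JJ>*<NN> can be applied directly with re.
tag_string = "".join("<%s>" % tag for _, tag in pos_sent)
np_pattern = re.compile(r"(<DT>)?(<JJ>)*<NN>")

chunks = []
for m in np_pattern.finditer(tag_string):
    # Map the character span back to token indices: each token occupies
    # exactly one "<TAG>" cell, so count the cells before/inside the match.
    start = tag_string[:m.start()].count("<")
    end = start + m.group(0).count("<")
    chunks.append([w for w, _ in pos_sent[start:end]])

# chunks == [['the', 'little', 'dog']] -- one NP chunk found
```

Note that "<NN>" does not match the "<NNS>" cell, which is why plural "cats" is left outside the chunk under this grammar.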
By RJ Murray (murenei), cheatography.com/murenei/, tutify.com.au
Published 28th May, 2018. Last updated 29th May, 2018.
Sponsored by CrosswordCheats.com (learn to solve cryptic crosswords): http://crosswordcheats.com
RegEx with Pandas & Named Groups

df=pd.DataFrame(time_sents, columns=['text'])                      Build a DataFrame of sentences
df['text'].str.split().str.len()                                   Word count per row
df['text'].str.contains('word')                                    Boolean: does the row contain 'word'
df['text'].str.count(r'\d')                                        Count digits per row
df['text'].str.findall(r'\d')                                      List all digits per row
df['text'].str.replace(r'\w+day\b', '???')                         Replace weekday names with '???'
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3]) Abbreviate weekday names to 3 letters
df['text'].str.extract(r'(\d?\d):(\d\d)')                          Extract first h:mm match into columns
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')            Extract all time matches, one row per match
df['text'].str.extractall(r'(?P<digits>\d)')                       Named group becomes the column name
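The named-group syntax is plain Python re, which pandas applies row-wise; the same patterns work on the standard library directly. A short sketch (the time pattern and sample sentences are illustrative, extending the (\d?\d):(\d\d) ?([ap]m) example above with group names):

```python
import re

time_sents = ["Lunch at 12:30 pm", "Call back at 9:15 am or 5:00 pm"]

# Named groups label the captures, the same way they name the columns
# that .str.extractall() produces in pandas.
pattern = re.compile(r"(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m)")

matches = [m.groupdict() for s in time_sents for m in pattern.finditer(s)]
# [{'hour': '12', 'minute': '30', 'period': 'pm'},
#  {'hour': '9',  'minute': '15', 'period': 'am'},
#  {'hour': '5',  'minute': '00', 'period': 'pm'}]
```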