NLP (Natural Language Processing): What Is A Corpus?
What is a corpus?
A corpus is a collection of texts, e.g. a set of documents, articles, or reviews, that serves as the data for an NLP task.
To feed text to a machine-learning algorithm, we create a bag-of-words representation, where each unique word in the text is represented by a number (typically its count).
Normalize the text first:
o Remove punctuation
o Convert to lower case
o Remove stop words (very common words such as "the", "is", "and")
o Use BeautifulSoup to strip HTML tags
o Apply stemming to reduce words to their root form, e.g. with the Porter Stemmer
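A rough sketch of these normalization steps, assuming NLTK's English stop-word list and Porter stemmer plus BeautifulSoup for the HTML stripping (the sample review text is made up):

    import re
    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords      # requires nltk.download('stopwords')
    from nltk.stem import PorterStemmer

    def normalize(raw_html):
        text = BeautifulSoup(raw_html, "html.parser").get_text()   # strip HTML tags
        text = re.sub(r"[^a-zA-Z]", " ", text).lower()             # drop punctuation, lower-case
        stops = set(stopwords.words("english"))                    # stop words to remove
        stemmer = PorterStemmer()                                  # reduce words to roots
        return [stemmer.stem(w) for w in text.split() if w not in stops]

    print(normalize("<p>The movies were AMAZING, truly great!</p>"))
    # -> ['movi', 'amaz', 'truli', 'great']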
Convert the normalized text to a bag of words using CountVectorizer.
Apply an ML algorithm such as Random Forest or Naïve Bayes.
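A minimal sketch of these last two steps with scikit-learn, using a tiny made-up training set (the documents and labels are invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Tiny made-up training set: 1 = positive review, 0 = negative review
    docs   = ["great movie loved it", "terrible boring film",
              "loved the acting", "boring and terrible"]
    labels = [1, 0, 1, 0]

    vectorizer = CountVectorizer()          # builds the vocabulary, one number per unique word
    X = vectorizer.fit_transform(docs)      # bag-of-words matrix (documents x vocabulary)

    clf = MultinomialNB().fit(X, labels)    # Naive Bayes trained on word counts
    print(clf.predict(vectorizer.transform(["loved this great movie"])))  # -> [1] (positive)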
TF (term frequency) measures how frequently a term occurs in a document. Since documents differ
in length, a term is likely to appear many more times in a long document than in a short one.
The term frequency is therefore usually divided by the document length (i.e. the total number of
terms in the document) as a form of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the
document)
TF-IDF measures how important a term is. When computing TF alone, all terms are treated as
equally important. However, certain terms, such as "is", "of", and "that", may appear many
times yet carry little meaning. We therefore weigh down the frequent terms and scale up the
rare ones by multiplying TF with the inverse document frequency (IDF):
IDF(t) = log(Total number of documents / Number of documents containing term t)
TF-IDF(t) = TF(t) * IDF(t)
Consider a document containing 100 words in which the word "solar" appears 3 times.
The term frequency (TF) for "solar" is then 3 / 100 = 0.03.
Now assume we have 10 million documents and the word "solar" appears in one thousand of these.
The inverse document frequency (IDF), using a base-10 log, is log(10,000,000 / 1,000) = log(10,000) = 4.
The TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12.
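The same arithmetic as a minimal Python sketch (the counts are those from the worked example above):

    import math

    def tf(term_count, total_terms):
        return term_count / total_terms

    def idf(total_docs, docs_with_term):
        return math.log10(total_docs / docs_with_term)

    tf_solar  = tf(3, 100)                 # 0.03
    idf_solar = idf(10_000_000, 1_000)     # log10(10,000) = 4.0
    print(tf_solar * idf_solar)            # 0.12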
What is a collocation and a bigram?
A collocation is a sequence of words that occur together unusually often.
"Red wine" is a collocation, whereas "maroon wine" is not.
A characteristic of collocations is that they resist substitution by words with
similar senses; for example, "maroon wine" sounds distinctly odd.
To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known
as bigrams. This is easily accomplished with the bigrams() function, as in the sketch below:
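A short sketch using NLTK's bigrams() and its collocation finder (assuming the Brown corpus has been downloaded via nltk.download('brown'); any list of tokens works):

    import nltk
    from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

    words = nltk.corpus.brown.words()          # requires nltk.download('brown')

    # All adjacent word pairs (bigrams) from the first few tokens
    print(list(nltk.bigrams(words[:5])))

    # Collocations: bigrams that co-occur far more often than chance,
    # ranked here by pointwise mutual information (PMI)
    finder = BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(5)                # ignore rare pairs
    print(finder.nbest(BigramAssocMeasures().pmi, 10))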
Information Extraction
The raw text of the document is split into sentences using a sentence segmenter.
Each sentence is further subdivided into words using a tokenizer.
Each sentence is tagged with part-of-speech tags, which will prove very helpful
in the next step, named entity detection.
In the named entity detection step, we search for mentions of potentially
interesting entities in each sentence.
o The basic technique for entity detection is chunking, which segments and
labels multi-token sequences (for example, noun phrases). Chunking builds on
the word-level tokenization and part-of-speech tags to form higher-level
groupings; each of these labeled multi-token sequences is called a chunk.
Finally, we use relation detection to search for likely relations between
different entities in the text.
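A minimal sketch of the first four stages of this pipeline with NLTK (relation detection is omitted; the sample sentence is made up, and the models must be downloaded once via nltk.download, e.g. 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', and 'words'):

    import nltk

    text = "Samuel Morse demonstrated the telegraph to Congress in Washington."

    for sent in nltk.sent_tokenize(text):    # 1. sentence segmentation
        tokens = nltk.word_tokenize(sent)    # 2. tokenization
        tagged = nltk.pos_tag(tokens)        # 3. part-of-speech tagging
        tree = nltk.ne_chunk(tagged)         # 4. named entity chunking
        print(tree)                          # chunks appear as labeled subtrees,
                                             # e.g. (PERSON Samuel/NNP Morse/NNP)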