Statistical Language Processing

This document provides an overview of statistical language processing concepts and algorithms. It discusses key natural language processing tasks like automatic summarization, machine translation, named entity recognition, part-of-speech tagging, and sentiment analysis. It also covers text mining techniques including vector space models, latent semantic analysis, probabilistic latent semantic analysis, and latent Dirichlet allocation. Finally, it discusses performance evaluation metrics and references numerous sources for additional information on these topics.

Uploaded by

apostolos1975
Copyright
© Attribution Non-Commercial (BY-NC)

Statistical language processing: Concepts and Algorithms

A. Georgakis, PhD

ToC

Basic definitions
Text mining
Performance evaluation
References

Definitions

SLP is NLP on steroids
Away from rule-based methods
Covers a wide area:
  Automatic summarization
  Machine translation
  Named entity recognition
  Part-of-speech tagging
  Sentence boundary disambiguation
  Sentiment analysis
  Word sense disambiguation
  etc.

Automatic summarization

"...transformation of source text to summary text through content reduction by selection, generalization and transformation" (S. Jones, 1999)
... but there are many more definitions; the term is ambiguous
For additional info go here

Machine translation

Substitution of source text into a target language
Usage of parallel corpora
  Internet is a vast source for such data
Pivot languages

Named entity recognition

Identify proper names and their types
  Peter: person
  Paris: city or person
Capitalization is not always a good tool
  Some languages do not use capitals
  German (all nouns are capitalized)
  Beginning of sentences


Part-of-speech tagging

Determine the part of speech for words
  Well<interjection>, she<pron> and<conj> young<adj> John<noun> walk<verb> to<prep> school<noun> slowly<adverb>
English has 9 parts of speech:
  noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection
  ... but as a linguist you will need to use somewhere between 50 and 150 tags

Sentence boundary disambiguation

Where does a sentence start and stop?
Punctuation marks are problematic
  90% of periods are sentence boundaries (Riley, 1999)
  ~47% of periods in the Wall Street Journal are abbreviations (Stammatos, 2009)
Rule-based method
  Precompiled list of abbreviations
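The rule-based method with a precompiled abbreviation list can be sketched as follows. This is a minimal illustration, not the method of any cited work: the abbreviation list and the function name `split_sentences` are assumptions made for the example.

```python
import re

# Hypothetical, hand-picked list; a real system would use a far larger
# precompiled list of abbreviations.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "vs"}


def split_sentences(text):
    """Split at '.', '!' or '?' unless the period ends a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]", text):
        if match.group() == ".":
            # Token immediately before the period, e.g. "Dr" in "Dr."
            tokens = text[start:match.start()].split()
            last = tokens[-1].lower() if tokens else ""
            if last in ABBREVIATIONS:
                continue  # the period belongs to an abbreviation
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without final punctuation
    return sentences
```

For example, `split_sentences("Dr. Smith arrived. He was late!")` keeps "Dr." inside the first sentence. The sketch still fails when an abbreviation happens to end a sentence, which is one reason punctuation marks are problematic.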

Sentiment analysis

Identify the polarity and emotional state for a given text:
  positive or negative
  angry, sad, unhappy
Rather tough problem to solve due to language ambiguity

Word sense disambiguation


Identify the sense of different words
ML on top of human knowledge
  Thesauri
  Ontologies
  Corpora
  ...
For more info go here

Basic tools I

Corpora
  Balanced and representative collection of documents
Stopping
  Removal of common words
  "I will be at the park tomorrow evening" -> "park tomorrow evening"
Stemming
  Removal of word inflection
  walking -> walk
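Stopping and stemming can be sketched in a few lines. The stop-word set and the suffix list below are small illustrative assumptions, and `naive_stem` is a crude suffix stripper rather than an established stemming algorithm.

```python
# Hypothetical, tiny stop-word list chosen to cover the slide's example.
STOP_WORDS = {"i", "will", "be", "at", "the", "a", "an", "of", "and", "to", "in"}

# Suffixes tried in order; purely illustrative.
SUFFIXES = ("ing", "ed", "s")


def remove_stop_words(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With these assumptions, `remove_stop_words("I will be at the park tomorrow evening")` leaves the three content words, and `naive_stem("walking")` gives `"walk"`. Such a stripper over-stems many words, so practical systems rely on a proper stemming algorithm.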

Basic tools II

N-grams
  Sequences of unigrams
Dimensionality reduction
  PCA, SVD, NMF, ...
Language modelling
  LSA, pLSA, LDA, ...
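An n-gram is a sequence of n consecutive unigrams. A minimal sketch, assuming the text has already been tokenized into a list (the helper name `ngrams` is illustrative):

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For the tokens `["park", "tomorrow", "evening"]`, the bigrams (n = 2) are `("park", "tomorrow")` and `("tomorrow", "evening")`; a sequence shorter than n yields an empty list.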

Language analysis
Figure: processing pipeline.
Source text -> Pre-processing -> Tokenization -> Disambiguation -> Dim. reduction -> Clustering -> Results
Level labels: Syntactic, Semantic, Results

Text mining I

Keyword indexing
  Big, REALLY big table; term-to-document matrix
  Bag-of-words
Use
  IR, search engines, etc.
Unigram -> N-gram transition
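A term-to-document matrix under the bag-of-words view can be built as in the sketch below. Whitespace tokenization and the function name `term_document_matrix` are assumptions made for the example; rows are terms and columns are documents.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix) with matrix[t][d] = count of term t in doc d."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    counts = [Counter(doc) for doc in tokenized]
    # Counter returns 0 for terms that a document does not contain.
    matrix = [[doc_counts[term] for doc_counts in counts] for term in vocabulary]
    return vocabulary, matrix
```

Word order is discarded, which is exactly the bag-of-words simplification. With a realistic vocabulary the table becomes very large and very sparse, so practical systems store it in a sparse format.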

Text mining II

1968, Salton: Vector Space Model (VSM)

Scaling or normalization:
  Term freq. - Inverse Document freq. (TF-IDF)
  Log-entropy scaling

Document similarity:
  cos or Euclidean distance

VSM shortcomings
  Inter- and intra-document context
  N-grams offer a partial solution
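TF-IDF scaling and cosine similarity in the vector space model can be sketched as follows. The weighting variant used here (raw count times log(N / df)) is one common choice and an assumption of this example; the function names are illustrative.

```python
import math


def tfidf(matrix):
    """Weight matrix[t][d] (raw counts, terms x documents) by log(N / df)."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for count in row if count > 0)  # document frequency
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([count * idf for count in row])
    return weighted


def column(matrix, d):
    """Extract the vector of document d from a terms x documents matrix."""
    return [row[d] for row in matrix]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

A term that occurs in every document receives weight zero, so two documents that share only such terms end up with cosine similarity 0.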

Text mining III

1990, Deerwester: Latent Semantic Analysis (LSA)
  SVD on term-by-document matrix
  K-dim subspace (concepts)
    Linear combination of terms
    Analogy: frequencies in Fourier analysis

LSA shortcomings
  Computationally expensive
  Updating is equally expensive
  Concepts are not intuitive
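In symbols, LSA keeps only the k largest singular values of the term-by-document matrix. The notation below is an assumption of this note (t terms, d documents, k concepts), not taken from the slides:

```latex
X \approx X_k = U_k \Sigma_k V_k^{T}, \qquad
U_k \in \mathbb{R}^{t \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{d \times k}
```

Each column of U_k is one concept, expressed as a linear combination of terms, and the columns of \Sigma_k V_k^{T} give the coordinates of the documents in the k-dimensional concept space.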

Text mining IV

1999, Hofmann: Probabilistic LSA (pLSA) or aspect model
  Probabilistic topic models
  Statistical foundation
  Latent variable
    Hidden states in HMM

pLSA. Source: Berry, 2010

pLSA shortcomings
  Overfit
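The aspect model can be written compactly. The symbols are assumptions of this note: d is a document, w a word, z the latent topic and n(d, w) the count of w in d.

```latex
P(d, w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d), \qquad
\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log P(d, w)
```

The parameters are usually fitted by maximizing the log-likelihood with EM. Because P(z | d) is estimated separately for every training document, the number of parameters grows with the size of the corpus, which is one source of the overfitting noted above.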

Text mining V

Source: Blei, 2011



Text mining VI

Source: Blei, 2011



Text mining VII

Probabilistic topic models
  Uncover the relationship between observed and hidden variables
  pLSA
  LDA
    Ando's presentation

LDA. Source: Berry, 2010

LDA extensions
  Relax statistical assumptions
  Use meta data

For an introduction go here
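For orientation, the generative process assumed by LDA can be summarized as below. The notation is an assumption of this note: alpha is the Dirichlet prior, theta_d the topic mixture of document d, z_{dn} the topic of the n-th word and beta_k the word distribution of topic k.

```latex
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
z_{dn} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{dn} \mid z_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})
```

Only the words w_{dn} are observed; the topic mixtures and topic assignments are the hidden variables to be inferred.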

Text mining VIII

Assumptions
  Word order irrelevant; bag-of-words
    Unrealistic but used extensively
    Words are generated conditional on previous words; Markov property
  Order of documents irrelevant; corpus
    Word distribution static over time
  Number of topics: known and fixed

Text mining IX

Meta-data
  Author-topic model; Rosen-Zvi et al., 2004
  Author, title, location, etc.
Hyperlink analysis

Matrix factorization techniques I

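SVD
  $X = W \Sigma V^{T}$
  where the columns of $W$ are the eigenvectors of $X X^{T}$ and $\Sigma$ holds the singular values (square roots of the eigenvalues)

PCA
  $Y = W_L^{T} X$
  where $W_L$ keeps the first $L$ columns of $W$

ICA
  Independence for principal components (neither orthogonal nor in rank order)

NMF
  $X \approx W H$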

Matrix factorization techniques II

SVD, PCA and ICA
  Eigenvalue based
  Fast
  Converge under certain conditions
  Sub-space is not intuitive
  Numerically unstable

NMF
  Converges to local minimum
  Iterative process
  Sub-space is more natural

Source: Lee, 1999

Matrix factorization techniques III

Problems with NMF

Initialization
Convergence speed
Iterative
Local minimum
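The role of initialization and iteration can be seen in a small pure-Python sketch of NMF with multiplicative updates for the squared error, in the style of Lee and Seung. This is an illustration under stated assumptions, not the implementation behind the slides: the helper names, the random initialization and the `eps` guard against division by zero are all choices made for the example.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)  # the result depends on this initialization
    n_rows, n_cols = len(x), len(x[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_cols)]
             for a in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n_rows)]
    return w, h
```

Running `nmf` with different `seed` values can end in different factorizations, since the updates only reach a local minimum, and the number of iterations needed for a good fit is not known in advance.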

Text streams

Detecting changes in sentiment
  Surprise
  Emerging
Text-to-number conversion
Time signatures
Temporal histogram
Teele's work

Source: Berry, 2009

Performance evaluation I

Contingency matrix

                            True output
                            Positive   Negative
System output   Positive    TP         FP
                Negative    FN         TN

Accuracy: $A = \frac{TP + TN}{m}$, where $m = TP + FP + FN + TN$

Recall: $R = \frac{TP}{TP + FN}$

Precision: $P = \frac{TP}{TP + FP}$
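The three measures follow directly from the four contingency counts. A minimal sketch, assuming m is the total number of evaluated items and that a measure with an empty denominator is reported as 0.0 (the function name `evaluate` and that convention are choices made for the example):

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, recall and precision from contingency counts."""
    m = tp + fp + fn + tn  # total number of evaluated items
    return {
        "accuracy": (tp + tn) / m if m else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }
```

Recall divides by the truly positive items (TP + FN), while precision divides by the items the system labelled positive (TP + FP), so the two can differ widely on the same output.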

Performance evaluation II

Precision-Recall curve


Performance evaluation III

F-measure
$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$
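The F-measure combines precision P and recall R into one number. The sketch below assumes the weighted harmonic mean form, F = 1 / (alpha / P + (1 - alpha) / R), with alpha between 0 and 1; the function name and the convention of returning 0.0 when P or R is zero are choices made for the example.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall, alpha in [0, 1]."""
    if precision == 0 or recall == 0:
        return 0.0  # convention chosen here to avoid division by zero
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)
```

With alpha = 0.5 this reduces to the balanced form 2PR / (P + R); alpha = 1 returns precision alone and alpha = 0 returns recall alone.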

References

A. Clark, C. Fox and S. Lappin, eds., The Handbook of Computational Linguistics and Natural Language Processing, Wiley-Blackwell, 2010.
M. W. Berry and J. Kogan, Text Mining: Applications and Theory, Wiley, 2010.
J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.
N. Indurkhya and F. J. Damerau, eds., Handbook of Natural Language Processing, CRC, 2010.
C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 2000.
R. Nisbet, J. Elder and G. Miner, Handbook of Statistical Analysis and Data Mining Applications, Elsevier, 2009.
M. T. Özsu, ed., Methods for Mining and Summarizing Text Conversations, Morgan & Claypool, 2011.
M. Song and Y.-F. B. Wu, Handbook of Research on Text and Web Mining Technologies, IGI, 2009.
D. M. Blei, A. Y. Ng, M. I. Jordan and J. Lafferty, Latent Dirichlet Allocation, J. Machine Learning Research, vol. 3, 2003.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, J. American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
M. Rosen-Zvi, T. Griffiths, M. Steyvers and P. Smyth, The Author-Topic Model for Authors and Documents, Proc. of 20th Conf. on Uncertainty in Artificial Intelligence (UAI '04), 2004.
C. Orăsan, Automatic Summarisation in the Information Age, Int. Conf. on Recent Advances in Natural Language Processing (RANLP'09), 2009.
R. Navigli, Word Sense Disambiguation: A Survey, ACM Comput. Surv., vol. 41, no. 2, 2009.
D. M. Blei, Introduction to Probabilistic Topic Models, ACM Press, pp. 1-16, 2010.
