Statistical Language Processing
Georgakis, PhD
ToC
Definitions
SLP is NLP on steroids
Away from rule-based methods
Covers a wide area:
Automatic summarization, Machine translation, Named entity recognition, Part-of-speech tagging, Sentence boundary disambiguation, Sentiment analysis, Word sense disambiguation, etc.
Automatic summarization
"...transformation of source text to summary text through content reduction by selection, generalization and transformation" (S. Jones, 1999)
... but there are many more definitions; the term itself is ambiguous
Machine translation
Pivot languages
Named entity recognition
Peter: person
Paris: city or person
Some languages do not use capitals
German (capitalizes every noun)
Beginning of sentences
Part-of-speech tagging
Well<interjection>, she<pron> and<conj> young<adj> John<noun> walk<verb> to<prep> school<noun> slowly<adverb>
Basic tags: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection
... but as a linguist you will need to use somewhere between 50 and 150
Sentiment analysis
Basic tools I
Corpora
Balanced and representative collection of documents
Stopping
Removal of common words: "I will be at the park tomorrow evening" -> "park tomorrow evening"
Stemming
Removal of word inflection: "walking" -> "walk"
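As a rough illustration of stopping and stemming, the sketch below uses a tiny hand-picked stop list and a naive suffix stripper. `STOP_WORDS`, `SUFFIXES`, `remove_stop_words` and `naive_stem` are all names assumed for this example; a real system would use a curated stop list and a proper stemming algorithm such as Porter's.

```python
# Illustrative stop list (an assumption; real lists hold hundreds of words).
STOP_WORDS = {"i", "will", "be", "at", "the"}

# Suffixes tried by the naive stemmer, in this order (an assumption).
SUFFIXES = ("ing", "ed", "s")


def remove_stop_words(text):
    """Lower-case the text, split on whitespace and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(remove_stop_words("I will be at the park tomorrow evening"))
# ['park', 'tomorrow', 'evening']
print(naive_stem("walking"))
# walk
```

Blind suffix stripping over-stems many words, which is one reason rule-based stemmers carry extra conditions on what remains after the cut.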
Basic tools
N-grams
Sequences of unigrams
Dimensionality reduction
PCA, SVD, NMF, ...
Language modelling
LSA, pLSA, LDA, ...
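The N-gram idea (contiguous sequences of unigrams) can be sketched in a few lines. The function name `ngrams` and the example sentence are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "she walks to school slowly".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
# [('she', 'walks'), ('walks', 'to'), ('to', 'school'), ('school', 'slowly')]
```

A sequence of k tokens yields k - n + 1 n-grams, so a sequence shorter than n yields none.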
Language analysis
Source text -> Pre-processing -> Tokenization -> Disambiguation -> Dim. reduction -> Clustering -> Results
Text mining I
Keyword indexing
Big, REALLY big table; Term-to-Document matrix
Bag-of-words
Use
IR, search engines, etc.
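A minimal sketch of how a term-to-document matrix can be built under the bag-of-words assumption. The function name `term_document_matrix`, the whitespace tokenizer and the toy corpus are assumptions for illustration; rows are terms, columns are documents, and each cell holds a raw count.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term i in document j."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    counts = [Counter(doc) for doc in tokenized]
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


docs = ["the cat sat", "the dog sat", "the cat saw the dog"]  # toy corpus
vocab, X = term_document_matrix(docs)
for term, row in zip(vocab, X):
    print(f"{term:>4}", row)
```

Word order is discarded entirely, and the table grows with the vocabulary size times the number of documents, which is why it becomes really big on real corpora.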
Text mining II
Scaling or normalization:
Document similarity:
cos or Euclidean distance
Inter- and intra-document context
N-grams offer a partial solution
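To make the two similarity measures concrete, here is a small sketch that compares document vectors with cosine similarity and Euclidean distance. The function names and the toy vectors `d1`, `d2`, `d3` are assumptions for illustration; each vector is taken to be one column of a term-to-document matrix.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two term-count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


d1 = [1, 0, 1, 0, 1]
d2 = [2, 0, 2, 0, 2]  # same term proportions as d1, twice the length
d3 = [0, 1, 0, 1, 0]  # shares no term with d1

print(cosine_similarity(d1, d2), euclidean_distance(d1, d2))
print(cosine_similarity(d1, d3), euclidean_distance(d1, d3))
```

Cosine similarity ignores document length while Euclidean distance does not, so `d1` and `d2` look identical under the first and different under the second. This is one reason scaling or normalization is applied before comparing documents.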
VSM shortcomings
LSA shortcomings
Text mining IV
pLSA shortcomings
Overfit
Text mining V
Text mining VI
Uncover the relationship between observed and hidden variables
pLSA, LDA
LDA extensions
Assumptions
Unrealistic but used extensively
Words are generated conditional on previous words; Markov property
Word distribution static over time
Text mining IX
Meta-data
Hyperlink analysis
SVD
$X = W \Sigma V^{T}$
PCA
$Y = W_{L}^{T} X$
ICA
NMF
$X \approx W H$
Eigenvalue based
Fast
Converge under certain conditions
Sub-space is not intuitive
Numerically unstable
Converges to local minimum
Iterative process
Sub-space is more natural
NMF
Initialization
Convergence speed
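The sketch below illustrates why initialization and convergence speed matter for NMF. The slides do not say which algorithm is meant; this example assumes the multiplicative update rules of Lee and Seung for the squared Frobenius error, written in plain Python. The helper names, the toy matrix `X`, the rank and the iteration count are all assumptions for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize non-negative X (m x n) into W (m x rank) and H (rank x n)."""
    rng = random.Random(seed)
    m, n = len(x), len(x[0])
    # Random positive initialization; the result depends on this starting point.
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n)]
             for k in range(rank)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(m)]
    return w, h


X = [[1, 0, 2], [0, 1, 1], [2, 0, 4], [0, 3, 3]]  # toy non-negative matrix
W0, H0 = nmf(X, rank=2, iterations=0)  # the random starting point
W, H = nmf(X, rank=2)
print(frobenius_error(X, W0, H0), "->", frobenius_error(X, W, H))
```

Changing `seed` changes the starting point and may lead to a different local minimum, and the number of iterations needed for a given error varies with it.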
Text streams
Surprise
Emerging
Performance evaluation I
Contingency matrix
                          True output
                          Positive   Negative
System output  Positive   TP         FP
               Negative   FN         TN
Accuracy
$A = \frac{TP + TN}{TP + FP + FN + TN}$
Recall
$R = \frac{TP}{TP + FN}$
Precision
$P = \frac{TP}{TP + FP}$
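Using the standard definitions of accuracy, precision and recall in terms of the contingency counts TP, FP, FN and TN, a minimal sketch looks like this. The function names, the zero-denominator convention and the example counts are assumptions for illustration, not results from any real system.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all decisions that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def precision(tp, fp):
    """Fraction of the system's positive outputs that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of the truly positive items that the system found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for 100 test items.
tp, fp, fn, tn = 40, 10, 20, 30
print(accuracy(tp, fp, fn, tn), precision(tp, fp), recall(tp, fn))
```

Accuracy can look good on skewed data simply because TN dominates, which is why precision and recall are usually reported alongside it.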
Performance evaluation II
Precision-Recall curve
F-measure
$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$
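A minimal sketch of the F-measure, assuming the weighted harmonic mean form in which a weight `alpha` between 0 and 1 trades precision P against recall R. The function name `f_measure`, the default `alpha=0.5` and the example values are assumptions for illustration.

```python
def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean of precision p and recall r."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if p == 0 or r == 0:
        return 0.0  # convention: no useful output gives F = 0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)


p, r = 0.8, 0.6  # hypothetical precision and recall
print(f_measure(p, r))           # balanced: equals 2PR / (P + R)
print(f_measure(p, r, alpha=1))  # all weight on precision
print(f_measure(p, r, alpha=0))  # all weight on recall
```

With `alpha = 0.5` the expression reduces to the familiar balanced form 2PR / (P + R).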
References
A. Clark, C. Fox and S. Lappin, eds., The Handbook of Computational Linguistics and Natural Language Processing, Wiley-Blackwell, 2010.
M. W. Berry and J. Kogan, Text Mining: Applications and Theory, Wiley, 2010.
J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.
N. Indurkhya and F. J. Damerau, eds., Handbook of Natural Language Processing, CRC, 2010.
C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 2000.
R. Nisbet, J. Elder and G. Miner, Handbook of Statistical Analysis and Data Mining Applications, Elsevier, 2009.
M. T. Özsu, ed., Methods for Mining and Summarizing Text Conversations, Morgan & Claypool, 2011.
M. Song and Y.-F. B. Wu, Handbook of Research on Text and Web Mining Technologies, IGI, 2009.
References
D. M. Blei, A. Y. Ng, M. I. Jordan and J. Lafferty, Latent Dirichlet Allocation, J. Machine Learning Research, vol. 3, 2003.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, J. American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
M. Rosen-Zvi, T. Griffiths, M. Steyvers and P. Smyth, The Author-Topic Model for Authors and Documents, Proc. of 20th Conf. on Uncertainty in Artificial Intelligence (UAI '04), 2004.
C. Orăsan, Automatic Summarisation in the Information Age, Int. Conf. on Recent Advances in Natural Language Processing (RANLP '09), 2009.
R. Navigli, Word Sense Disambiguation: A Survey, ACM Comput. Surv., vol. 41, no. 2, 2009.
D. M. Blei, Introduction to Probabilistic Topic Models, ACM Press, pp. 1-16, 2010.