Statistical Language Processing

This document provides an overview of statistical language processing concepts and algorithms. It discusses key natural language processing tasks like automatic summarization, machine translation, named entity recognition, part-of-speech tagging, and sentiment analysis. It also covers text mining techniques including vector space models, latent semantic analysis, probabilistic latent semantic analysis, and latent Dirichlet allocation. Finally, it discusses performance evaluation metrics and references numerous sources for additional information on these topics.

Uploaded by

apostolos1975


Statistical language processing: Concepts and Algorithms

A. Georgakis, PhD

ToC

Basic definitions
Text mining
Performance evaluation
References

Definitions

SLP is NLP on steroids
Away from rule-based methods
Covers a wide area:
  Automatic summarization, machine translation, named entity recognition, part-of-speech tagging, sentence boundary disambiguation, sentiment analysis, word sense disambiguation, etc.

Automatic summarization

"...transformation of source text to summary text through content reduction by selection, generalization and transformation" (S. Jones, 1999)
  but there are many more definitions; the term is ambiguous
For additional info go here

Machine translation

Substitution of source text into a target language
Usage of parallel corpora
  The Internet is a vast source of such data
Pivot languages

Named entity recognition

Identify proper names and their types
  Peter → person
  Paris → city or person
Capitalization is not always a good tool
  Some languages do not use capitals
  German (capitalizes all nouns)
  Beginning of sentences

Part-of-speech tagging

Determine the part of speech for words
  Well<interjection>, she<pron> and<conj> young<adj> John<noun> walk<verb> to<prep> school<noun> slowly<adverb>
English has 9 parts of speech:
  noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection
  ... but as a linguist you will need to use somewhere between 50 and 150 tags

Sentence boundary disambiguation

Where does a sentence start and stop?
Punctuation marks are problematic
  90% of periods are sentence boundaries (Riley, 1999)
  ~47% of periods in the Wall Street Journal are abbreviations (Stammatos, 2009)
Rule-based method
  Precompiled list of abbreviations
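
The rule-based method with a precompiled abbreviation list can be sketched roughly as below. This is a minimal illustration, not the method of any cited work: the abbreviation list, the function name `split_sentences` and the example text are all assumptions made here.

```python
# Hypothetical, deliberately tiny abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc.", "vs."}

def split_sentences(text):
    """Split at tokens ending in '.', '!' or '?' unless the token is a listed abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_mark = token[-1] in ".!?"
        if ends_with_mark and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing punctuation mark
        sentences.append(" ".join(current))
    return sentences

example = "Dr. Smith arrived late. He met Prof. Jones at noon! Was it useful?"
sentences = split_sentences(example)
```

A sketch like this fails whenever an abbreviation also ends a sentence ("... cats, dogs, etc."), which is one reason punctuation marks are called problematic.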

Sentiment analysis

Identify the polarity and emotional state for a given text:
  positive or negative
  angry, sad, unhappy
Rather tough problem to solve due to language ambiguity

Word sense disambiguation

Identify the sense of different words
ML on top of human knowledge
  Thesauri
  Ontologies
  Corpora
  ...
For more info go here

Basic tools I

Corpora
  Balanced and representative collection of documents
Stopping
  Removal of common words
  I will be at the park tomorrow evening → park tomorrow evening
Stemming
  Removal of word inflection
  walking → walk
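
Stopping and stemming can be illustrated with a few lines of Python. The stop-word list and the suffix rules below are hypothetical and chosen only to reproduce the two examples on this slide; a real stemmer (Porter's, for instance) uses many more rules.

```python
# Hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"i", "will", "be", "at", "the", "a", "an", "of", "to", "is"}
SUFFIXES = ("ing", "ed", "s")

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = "I will be at the park tomorrow evening".split()
content_words = remove_stop_words(tokens)
stems = [naive_stem(w) for w in ["walking", "walked", "walks", "walk"]]
```

Suffix stripping this crude over-stems easily (it would turn "evening" into "even"), so it should be read as a sketch of the idea only.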

Basic tools II

N-grams
  Sequences of unigrams
Dimensionality reduction
  PCA, SVD, NMF, ...
Language modelling
  LSA, pLSA, LDA, ...
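
To make "sequences of unigrams" concrete, here is a small sketch that extracts n-grams and uses bigram counts for a maximum-likelihood estimate of the next word. The toy sentence and the helper names are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat the cat slept".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))

def bigram_probability(prev, word):
    """Maximum-likelihood estimate of P(word | prev) from the toy counts."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

p_cat_given_the = bigram_probability("the", "cat")  # "the" occurs 3 times, "the cat" twice
```

Unseen bigrams get probability zero under this estimate, which is why practical n-gram models add some form of smoothing.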

Language analysis
[Figure: processing pipeline. Source text → Pre-processing → Tokenization → Disambiguation → Dim. reduction → Clustering → Results; stages are grouped under the labels Syntactic, Semantic and Results]

Text mining I

Keyword indexing
  Big, REALLY big table; term-to-document matrix
  Bag-of-words
Use
  IR, search engines, etc.
Unigram → N-gram transition
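
A term-to-document matrix under the bag-of-words view can be built as in the sketch below. The three-document corpus and the variable names are hypothetical; real collections produce the "REALLY big", mostly sparse table mentioned above.

```python
from collections import Counter

docs = {  # hypothetical toy corpus
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "cats and dogs",
}

# Bag-of-words: each document is reduced to term counts, word order is discarded.
counts = {name: Counter(text.split()) for name, text in docs.items()}
vocabulary = sorted(set().union(*counts.values()))

# Rows are terms, columns are documents, entries are raw counts.
term_document = [[counts[name][term] for name in docs] for term in vocabulary]
row_for_the = term_document[vocabulary.index("the")]
```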

Text mining II

1968, Salton: Vector Space Model (VSM)

Scaling or normalization:
  Term freq. - Inverse Document freq. (TF-IDF)
  Log-entropy scaling

Document similarity:
  cos or Euclidean distance

VSM shortcomings
  Inter- and intra-document context is ignored
  N-grams offer a partial solution
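
The sketch below shows one common TF-IDF variant (raw term frequency times log(N / df)) and cosine similarity between the resulting document vectors. The corpus, the function names and this particular weighting formula are assumptions for illustration; several other TF-IDF and log-entropy variants exist.

```python
import math
from collections import Counter

docs = [  # hypothetical toy corpus
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "stock markets fell sharply".split(),
]

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(doc).items()}
        for doc in documents
    ]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = tfidf_vectors(docs)
sim_cat_dog = cosine(vectors[0], vectors[1])    # documents sharing "the", "sat", "on"
sim_cat_stock = cosine(vectors[0], vectors[2])  # documents with no terms in common
```

Because the vectors only record which terms occur, two documents about the same subject with disjoint vocabularies score zero, which is the context problem the VSM is criticized for.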

Text mining III

1990, Deerwester: Latent Semantic Analysis (LSA)
  SVD on term-by-document matrix
  K-dim subspace (concepts)
    Linear combination of terms
    Analogous to frequencies in Fourier analysis

LSA shortcomings
  Computationally expensive
  Updating is equally expensive
  Concepts are not intuitive
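
As a sketch of what "SVD on the term-by-document matrix" followed by a "K-dim subspace" amounts to, the standard rank-K truncation can be written as below. The symbols are assumptions made here for illustration: $X$ is the term-by-document matrix, $W$, $\Sigma$, $V$ are its SVD factors, the subscript $K$ keeps only the $K$ largest singular values and their vectors, and $x_j$ is column $j$ of $X$ (one document).

```latex
X = W \Sigma V^{T}, \qquad
X_K = W_K \Sigma_K V_K^{T}, \qquad
\hat{d}_j = \Sigma_K^{-1} W_K^{T} x_j
```

Here $\hat{d}_j$ is the K-dimensional "concept" representation of document $j$; each concept axis is a linear combination of terms given by a column of $W_K$.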

Text mining IV

1999, Hofmann: Probabilistic LSA (pLSA) or aspect model
  Probabilistic topic models
  Statistical foundation
  Latent variable
    Analogous to hidden states in HMM

[Figure: pLSA. Source: Berry, 2010]

pLSA shortcomings
  Overfitting

Text mining V

Source: Blei, 2011


Text mining VI

Source: Blei, 2011


Text mining VII

Probabilistic topic models
  Uncover the relationship between observed and hidden variables
  pLSA → LDA
    Ando's presentation

LDA extensions
  Relax statistical assumptions
  Use meta data

[Figure: LDA. Source: Berry, 2010]

For an introduction go here

Text mining VIII

Assumptions
  Word order irrelevant; bag-of-words
    Unrealistic but used extensively
    Alternative: words are generated conditioned on previous words; Markov property
  Order of documents irrelevant; corpus
    Word distribution static over time
  Number of topics: known and fixed
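
These assumptions show up directly in LDA's generative story, sketched below using only the standard library. The two topics, the vocabulary, the Dirichlet parameter and all function names are hypothetical; the code generates a document from fixed topics and does not perform inference.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one bag-of-words document from fixed topic-word distributions."""
    vocabulary = sorted(topics[0])
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        # Each word is drawn independently of its neighbours: word order is irrelevant.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        weights = [topics[z][w] for w in vocabulary]
        words.append(rng.choices(vocabulary, weights=weights)[0])
    return theta, words

# Hypothetical topics; the number of topics (2) is known and fixed in advance.
topics = [
    {"gene": 0.5, "dna": 0.4, "ball": 0.05, "team": 0.05},
    {"gene": 0.05, "dna": 0.05, "ball": 0.5, "team": 0.4},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=20, rng=rng)
```

Shuffling `words` leaves the document's probability under this model unchanged, which is exactly the bag-of-words assumption.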

Text mining IX

Meta-data
  Author, title, location, etc.
  Author-topic model; Rosen-Zvi et al. 2004
Hyperlink analysis

Matrix factorization techniques I

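SVD
$X = W \Sigma V^{T}$
  where the columns of $W$ are eigenvectors (of $X X^{T}$) and $\Sigma$ holds the square roots of the corresponding eigenvalues

PCA
$Y = W_L^{T} X$

ICA
  Independence for principal components (neither orthogonal nor in rank order)

NMF
$X \approx W H$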

Matrix factorization techniques II

SVD, PCA and ICA
  Eigenvalue based
  Fast
  Converge under certain conditions
  Sub-space is not intuitive

NMF
  Numerically unstable
  Converges to local minimum
  Iterative process
  Sub-space is more natural

Source: Lee, 1999

Matrix factorization techniques III

Problems with NMF
  Initialization
  Convergence speed
    Iterative
  Local minimum
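
The issues listed here (random initialization, an iterative procedure, convergence to a local minimum) are visible in a small pure-Python sketch of NMF with multiplicative updates for the squared-error objective. The matrix, the rank, the iteration count and all function names are assumptions chosen for illustration; a practical implementation would use a numerical library.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def squared_error(x, w, h):
    """Sum of squared differences between X and the product W H."""
    wh = matmul(w, h)
    return sum((xij - whij) ** 2 for xr, whr in zip(x, wh) for xij, whij in zip(xr, whr))

def nmf(x, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X ~ W H using multiplicative updates."""
    rng = random.Random(seed)  # the result depends on this random initialization
    n, m = len(x), len(x[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    errors = [squared_error(x, w, h)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
        errors.append(squared_error(x, w, h))
    return w, h, errors

x = [[2, 0, 1], [4, 0, 2], [0, 3, 0], [0, 6, 0]]  # hypothetical non-negative matrix
w, h, errors = nmf(x, rank=2)
```

Running `nmf` with different `seed` values may give different factors, since the updates only move toward a nearby local minimum of the error.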

Text streams

Detecting changes in sentiment
  Surprise
  Emerging

Text-to-number conversion
  Time signatures
  Temporal histogram
  Teele's work

Source: Berry, 2009

Performance evaluation I

Contingency matrix

                           True output
                           Positive   Negative
System output   Positive   TP         FP
                Negative   FN         TN

Accuracy
$A = \frac{TP + TN}{m}$, where $m$ is the total number of examples

Recall
$R = \frac{TP}{TP + FN}$

Precision
$P = \frac{TP}{TP + FP}$
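
The three measures can be computed from the four cells of the contingency matrix as in this sketch. The function names and the toy label lists are assumptions; here the total number of examples is TP + FP + FN + TN.

```python
def confusion_counts(system, truth):
    """Count TP, FP, FN, TN from parallel lists of booleans."""
    tp = sum(s and t for s, t in zip(system, truth))
    fp = sum(s and not t for s, t in zip(system, truth))
    fn = sum(not s and t for s, t in zip(system, truth))
    tn = sum(not s and not t for s, t in zip(system, truth))
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """Fraction of system positives that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of true positives that the system found."""
    return tp / (tp + fn) if tp + fn else 0.0

system = [True, True, True, False, False, False, True, False]   # hypothetical system output
truth = [True, True, False, False, False, True, True, False]    # hypothetical true output
tp, fp, fn, tn = confusion_counts(system, truth)
```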

Performance evaluation II

Precision-Recall curve

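
One way to obtain the points of a precision-recall curve is to rank examples by classifier score and compute precision and recall after each rank, as sketched below. The scores, labels and function name are hypothetical, and the sketch assumes distinct scores and at least one positive example.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) pairs, one per cut-off, highest score first."""
    total_positive = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points, tp = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label  # label is 1 for a true positive at this rank, else 0
        points.append((tp / total_positive, tp / rank))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.5]  # hypothetical classifier scores
labels = [1, 0, 1, 1, 0]            # 1 = truly positive
points = precision_recall_points(scores, labels)
```

Lowering the cut-off can only keep or raise recall, while precision may move in either direction, which gives the curve its typical sawtooth shape.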

Performance evaluation III

F-measure
$F_\alpha = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$
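
A short sketch of the F-measure as a weighted harmonic mean of precision P and recall R, F = 1 / (alpha / P + (1 - alpha) / R). The function name and the default alpha = 0.5 are assumptions; with alpha = 0.5 the value reduces to the familiar 2PR / (P + R).

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean: F = 1 / (alpha / P + (1 - alpha) / R)."""
    if precision == 0 or recall == 0:
        return 0.0  # the harmonic mean is taken as 0 when either component is 0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

balanced = f_measure(0.75, 0.6)                    # alpha = 0.5, equal weight
precision_only = f_measure(0.75, 0.6, alpha=1.0)   # all weight on precision
```

Values of alpha above 0.5 weight precision more heavily, values below 0.5 weight recall more heavily.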

References

A. Clark, C. Fox and S. Lappin, eds., The Handbook of Computational Linguistics and Natural Language Processing, Wiley-Blackwell, 2010.
M. W. Berry and J. Kogan, Text Mining: Applications and Theory, Wiley, 2010.
J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.
N. Indurkhya and F. J. Damerau, eds., Handbook of Natural Language Processing, CRC, 2010.
C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 2000.
R. Nisbet, J. Elder and G. Miner, Handbook of Statistical Analysis and Data Mining Applications, Elsevier, 2009.
M. T. Özsu, ed., Methods for Mining and Summarizing Text Conversations, Morgan & Claypool, 2011.
M. Song and Y.-F. B. Wu, Handbook of Research on Text and Web Mining Technologies, IGI, 2009.
D. M. Blei, A. Y. Ng, M. I. Jordan and J. Lafferty, Latent Dirichlet Allocation, J. Machine Learning Research, vol. 3, 2003.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, J. American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
M. Rosen-Zvi, T. Griffiths, M. Steyvers and P. Smyth, The Author-Topic Model for Authors and Documents, Proc. of 20th Conf. on Uncertainty in Artificial Intelligence (UAI '04), 2004.
C. Orăsan, Automatic Summarisation in the Information Age, Int. Conf. on Recent Advances in Natural Language Processing (RANLP'09), 2009.
R. Navigli, Word Sense Disambiguation: A Survey, ACM Comput. Surv., vol. 41, no. 2, 2009.
D. M. Blei, Introduction to Probabilistic Topic Models, ACM Press, pp. 1-16, 2010.
