
Text Classification

Besat Kassaie
2 Outline

 Introduction
 Text classification definition
 Naive Bayes
 Vector Space Classification
 Rocchio Classification
 KNN Classification
 New Applications
3 Spam Detection
4 Authorship identification

 Was Mark Twain the writer of the “Quintus Curtius Snodgrass” (QCS) letters?
 Mark Twain's role in the Civil War has been a subject of dispute for years.
 The evidence of Twain’s possible military connection in New Orleans came
from the content of ten letters published in the New Orleans Daily Crescent.
 In these letters, which have been credited to Twain and were signed
“Quintus Curtius Snodgrass” (QCS), the writer described his military
adventures.
 Brinegar (1963) applied statistical tests to the QCS letters.
 Brinegar used word frequency distributions to characterize Mark Twain’s
writing style.
 Conclusion: Mark Twain was not the author of the disputed letters!
5 Gender Identification

 Determining if an author is male or female


 Men and women tend to use different language styles.
 Samples:
 A large number of determiners (a, the, that, these) and quantifiers (one, two,
more, some) is a male indicator.
 Frequent use of pronouns (I, you, she, her, their, myself, yourself,
herself) is a strong female indicator.

 Applications:
 Marketing, personalization, legal investigation
6 Sentiment Analysis

 Is the attitude of this text positive or negative?


 Rank the attitude of this text from 1 to 5
 Sample Applications:
 Movie: is this review positive or negative?
 Products: what do people think about the new iPhone?
 Public sentiment: how is consumer confidence? Is despair
increasing?
 Politics: what do people think about this candidate or issue?
 Prediction: predict election outcomes or market trends from
sentiment
7 Subject Assignment

 What is the subject of this article:


 Antagonists and inhibitors
 Blood Supply
 Chemistry
 Drug Therapy
 Embryology
 ….
8 Text Classification Definition

Input:
a document d
a fixed set of classes C={c1,c2,…,cj}

Output: a predicted class c ∈ C


9 Approaches to Text Classification

 Manual
 Many classification tasks have traditionally been solved manually.
 e.g. Books in a library are assigned Library of Congress categories by a librarian
 manual classification is expensive to scale

 Hand-crafted rules
 A rule captures a certain combination of keywords that indicates a class.
 e.g. (multicore OR multi-core) AND (chip OR processor OR microprocessor)
 good scaling properties
 creating and maintaining them over time is labor intensive

 Machine learning based methods


10 Supervised Learning

 Input:
 a document d
 a fixed set of classes C={c1,c2,…,cj}
 a training set of m labeled documents (d1,c1), (d2,c2),…,(dm,cm)

 Output: a learned classifier γ :d → c


11 Supervised Learning

 We can use any kind of classifiers:


 Naïve Bayes
 kNN
 Rocchio
 SVM
 Logistic Regression
 ….
12

Naïve Bayes Classification


13 Naïve Bayes Intuition

 It is a simple classification technique based on Bayes’ rule.


 Documents are represented with the “bag of words” model.
 The bag of words model:
 A document d is represented by its set of words together with their weights,
determined by term frequency (TF).
 TF is the number of occurrences of each term in the document.
 The document “Mary is quicker than John” is, in this view, identical to
the document “John is quicker than Mary”.
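As a minimal sketch (not from the slides), the bag of words view can be computed with a simple term-frequency counter; the two example sentences above produce identical representations:

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    # Term frequency (TF): number of occurrences of each term; word order is discarded.
    return Counter(document.lower().split())

d1 = bag_of_words("Mary is quicker than John")
d2 = bag_of_words("John is quicker than Mary")
assert d1 == d2  # identical under the bag of words view
```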
14 The bag of words representation

γ(d) = c

Example document d:
“I love this movie! It's sweet, but with satirical humor. The dialogue is great
and the adventure scenes are fun… It manages to be whimsical and romantic while
laughing at the conventions of the fairy tale genre. I would recommend it to
just about anyone. I've seen it several times, and I'm always happy to see it
again whenever I have a friend who hasn't seen it yet.”
15 The bag of words representation

γ(d) = c, where d is now reduced to its bag of word counts.
16 Features

 Supervised learning classifiers can use any sort of feature


 URL, email address, punctuation, capitalization, social network features
and so on
 In the bag of words view of documents
 We use only word features
17 Bayes’ Rule Applied to Documents and Classes
 For a document d and a class c

$P(c \mid d) = \dfrac{P(d \mid c)\,P(c)}{P(d)}$
18 Naïve Bayes Classifier

$c_{MAP} = \arg\max_{c \in C} P(c \mid d)$        (MAP = “maximum a posteriori”, i.e. the most likely class)

$\;\;\;\;\;\; = \arg\max_{c \in C} \dfrac{P(d \mid c)\,P(c)}{P(d)}$        (Bayes’ rule)

$\;\;\;\;\;\; = \arg\max_{c \in C} P(d \mid c)\,P(c)$        (dropping the denominator, which is the same for every class)

$\;\;\;\;\;\; = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$        (document d represented as features $x_1, \ldots, x_n$)
19 Multinomial NB VS Bernoulli NB

 There are two different approaches for setting up an NB classifier.


 multinomial model: generates one term from the vocabulary in each
position of the document
 multivariate Bernoulli model: generates an indicator for each term of the
vocabulary, either 1 indicating presence of the term in the document or 0
indicating absence

 Difference in estimation strategies


20 Document Features

$c_{MAP} = \arg\max_{c \in C} P(d \mid c)\,P(c)$

 Multinomial:
 $c_{MAP} = \arg\max_{c \in C} P(w_1, w_2, \ldots, w_n \mid c)\,P(c)$
 $w_1, w_2, \ldots, w_n$ is the sequence of terms as it occurs in d.

 Bernoulli:
 $c_{MAP} = \arg\max_{c \in C} P(e_1, e_2, \ldots, e_V \mid c)\,P(c)$
 $e_1, e_2, \ldots, e_V$ is a binary vector of dimensionality V that indicates, for each
vocabulary term, whether it occurs in d or not.
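A minimal sketch of the two representations, using a hypothetical toy vocabulary and document (illustrative only):

```python
vocabulary = ["love", "great", "boring", "movie", "plot"]   # hypothetical toy vocabulary
document = "great movie great plot"
tokens = document.split()

# Multinomial model: the sequence of terms w1, ..., wn as it occurs in d.
multinomial_repr = tokens                                   # ['great', 'movie', 'great', 'plot']

# Bernoulli model: a binary vector e1, ..., eV over the vocabulary (presence/absence).
bernoulli_repr = [1 if term in set(tokens) else 0 for term in vocabulary]   # [0, 1, 0, 1, 1]
print(multinomial_repr, bernoulli_repr)
```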
21 ASSUMPTIONS: Conditional independence

 $P(x_1, x_2, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdots P(x_n \mid c)$
 Multinomial:
 $c_{MAP} = \arg\max_{c \in C} P(w_1, w_2, \ldots, w_n \mid c)\,P(c)$
 $c_{NB} = \arg\max_{c \in C} P(c) \prod_{1 \le k \le n} P(X_k = w_k \mid c)$

 $X_k$ is the random variable for position k in the document and takes
as values terms from the vocabulary.
 $P(X_k = w_k \mid c)$ is the probability that in a document of class c the term
$w_k$ will occur in position k.
22 ASSUMPTIONS: Conditional independence

 Bernoulli:
 $c_{MAP} = \arg\max_{c \in C} P(e_1, e_2, \ldots, e_V \mid c)\,P(c)$
 $c_{NB} = \arg\max_{c \in C} P(c) \prod_{1 \le i \le V} P(U_i = e_i \mid c)$

 $U_i$ is the random variable for vocabulary term i and takes as values
0 (absence) and 1 (presence).
 $P(U_i = 1 \mid c)$ is the probability that in a document of class c the term
$w_i$ will occur.
23 ASSUMPTIONS: Bag of words
assumption
 if we assumed different term distributions for each position k, we would have to
estimate a different set of parameters for each k

 positional independence: The conditional probabilities for a term are the same,
independent of position in the document.
 E.g. P(Xk1 = t|c) = P(Xk2 = t|c)

 we have a single distribution of terms that is valid for all positions k

$c_{MAP} = \arg\max_{c \in C} P(w_1, w_2, \ldots, w_n \mid c)\,P(c)$

$c_{NB} = \arg\max_{c \in C} P(c) \prod_{1 \le k \le n} P(w_k \mid c)$
24 Learning Parameters for Multinomial
Naïve Bayes
 Based on maximum likelihood estimation (MLE):

$\hat{P}(c_j) = \dfrac{Count(C = c_j)}{N_{doc}}$        (the fraction of documents that are labeled with class $c_j$)

$\hat{P}(w_i \mid c_j) = \dfrac{Count(w_i, c_j)}{\sum_{w \in V} Count(w, c_j)}$        (the fraction of times word $w_i$ appears among all word occurrences in documents of class $c_j$)

 In practice: create a “mega document” for class $c_j$ by concatenating all documents
of that class, and use the frequency of $w_i$ in the mega document.
25 Laplace (add-1) smoothing for Naïve
Bayes
 Solving the problem of zero probabilities

$\hat{P}(w_i \mid c_j) = \dfrac{Count(w_i, c_j) + 1}{\sum_{w \in V}\left(Count(w, c_j) + 1\right)}
= \dfrac{Count(w_i, c_j) + 1}{\left(\sum_{w \in V} Count(w, c_j)\right) + |V|}$
26 Laplace (add-1) smoothing: unknown
words
 Add one extra word to the vocabulary, the “unknown word” , wu

$\hat{P}(w_u \mid c_j) = \dfrac{Count(w_u, c_j) + 1}{\left(\sum_{w \in V} Count(w, c_j)\right) + |V| + 1}
= \dfrac{1}{\left(\sum_{w \in V} Count(w, c_j)\right) + |V| + 1}$

This single probability is used for all unknown words.
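Putting slides 24–26 together, here is a minimal sketch of multinomial Naïve Bayes training with add-1 smoothing and an unknown-word entry. The classifier sums log probabilities, a standard implementation detail (not shown on the slides) that avoids floating-point underflow when multiplying many small probabilities. Function and variable names are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs):
    """labeled_docs: list of (tokens, class_label) pairs, e.g. (['great', 'movie'], 'pos')."""
    n_docs = len(labeled_docs)
    docs_per_class = Counter(c for _, c in labeled_docs)
    word_counts = defaultdict(Counter)                 # word_counts[c][w] = Count(w, c)
    vocabulary = set()
    for tokens, c in labeled_docs:
        word_counts[c].update(tokens)                  # the "mega document" counts for class c
        vocabulary.update(tokens)

    prior = {c: docs_per_class[c] / n_docs for c in docs_per_class}   # P(c)

    cond_prob = {}
    denom_extra = len(vocabulary) + 1                  # |V| + 1: vocabulary plus the unknown word
    for c in docs_per_class:
        total = sum(word_counts[c].values())
        cond_prob[c] = {w: (word_counts[c][w] + 1) / (total + denom_extra) for w in vocabulary}
        cond_prob[c]["<UNK>"] = 1 / (total + denom_extra)             # shared by all unknown words
    return prior, cond_prob

def classify(tokens, prior, cond_prob):
    """Return argmax_c [ log P(c) + sum_k log P(w_k | c) ]."""
    best_class, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for w in tokens:
            score += math.log(cond_prob[c].get(w, cond_prob[c]["<UNK>"]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```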


27 Example
28

Vector Space Classification


29 Vector Space Representation

 The document representation in Naive Bayes is a sequence of terms or a binary
vector.
 A different representation for text classification: the vector space model.
 Each document is represented as a vector with one real-valued component,
usually a tf-idf weight, for each term.

 Each vector has a component for every term that occurs in any document of the
collection.

 Terms are the axes of the space.

 The space is therefore very high-dimensional.
30 Vector Space Classification

 Vector space classification is based on the contiguity hypothesis:


 Objects in the same class form a contiguous region, and regions of different
classes do not overlap

 Classification amounts to computing the boundaries in the space that separate the
classes: the decision boundaries.

 Two methods will be discussed:


Rocchio and kNN
31

Vector Space Classification


Rocchio
32 Rocchio Classification

 Basic idea: Rocchio forms a simple representative for each class by using
the centroids.
 To compute the centroid of each class:

$\vec{\mu}(c) = \dfrac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$

 $D_c$ is the set of all documents that belong to class c.
 $\vec{v}(d)$ is the vector space representation of d.


33 Rocchio Classification

 The boundary between two classes in Rocchio classification is the set of points
with equal distance from the two centroids.

 To classify a new object, determine which centroid it is closest to and assign it
to the corresponding class.
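A minimal sketch of Rocchio classification, assuming documents have already been converted to (e.g. tf-idf) vectors; names are illustrative:

```python
import numpy as np

def train_rocchio(vectors, labels):
    """Centroid mu(c) = (1/|D_c|) * sum of v(d) over the documents d of class c."""
    centroids = {}
    for c in set(labels):
        class_vectors = np.array([v for v, y in zip(vectors, labels) if y == c])
        centroids[c] = class_vectors.mean(axis=0)
    return centroids

def rocchio_classify(vector, centroids):
    # Assign the document to the class whose centroid is closest.
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - vector))
```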
34 Problems with Rocchio Classification

 Implicitly assumes that classes are spheres with similar radii.

 Ignores details of the distribution of points within a class and only considers
centroid distances.

 In the figure, the red cube is assigned to the Kenya class, but it is a better fit
for the UK class, since UK is more scattered than Kenya.

 Not requiring similar radii leads to a modified decision rule:
assign d to class c iff $|\vec{\mu}(c) - \vec{v}(d)| \le |\vec{\mu}(c') - \vec{v}(d)| - b$,
where c' is the competing class and b is a positive constant.
35 Problems with Rocchio Classification

 Does not work well for classes that cannot be accurately represented by a
single “center”.

 Class “a” consists of two different clusters.


 Rocchio classification will misclassify “o” as “a” because
it is closer to the centroid A of the “a” class than to the
centroid B of the “b” class.
36

Vector Space Classification


K Nearest Neighbor
37 K Nearest Neighbor Classification

 For kNN we assign each document to the majority class of its k closest
neighbors where k is a parameter.

 The rationale of kNN classification is that, based on the contiguity


hypothesis, we expect a test document d to have the same label as the
training documents located in the local region surrounding d.

 Learning: just store the labeled training examples D


38 K Nearest Neighbor Classification

 Determining the class of a new document:

 For k = 1: Assign each object to the class of its closest neighbor.


 For k > 1: Assign each object to the majority class among its k closest
neighbors.
 The parameter k must be specified in advance. k = 3 and k = 5 are
common choices, but much larger values between 50 and 100 are also
used
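A minimal kNN sketch under the same assumption of precomputed document vectors; Euclidean distance is used here for simplicity, although cosine similarity is common for text:

```python
import numpy as np
from collections import Counter

def knn_classify(query, train_vectors, train_labels, k=3):
    """Assign the query to the majority class among its k closest training documents."""
    distances = [np.linalg.norm(query - v) for v in train_vectors]
    nearest = np.argsort(distances)[:k]                       # indices of the k nearest neighbours
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```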
39 kNN decision boundary

 The decision boundary defined by the Voronoi tessellation


 Assuming k = 1: for a given set of objects in the space, each object defines
a cell consisting of all points that are closer to that object than to other
objects.
 Results in a set of convex polygons;
so-called Voronoi cells.

 Decomposing a space into such cells


gives us the so-called Voronoi tessellation.
 In the case of k > 1, the Voronoi cells
are given by the regions in the space for which
the set of k nearest neighbors is the same.
40

 Decision boundary for 1NN (double line): defined along the regions of
Voronoi cells for the objects in each class. Shows the non-linearity of kNN
41 kNN: Probabilistic version

 There is a probabilistic version of kNN classification algorithm.


 We can estimate the probability of membership in class c as the proportion of
the k nearest neighbors in c.

 P(circle class | star) = 1/3
 P(X class | star) = 2/3
 P(diamond class | star) = 0

 The 3NN estimate: P(circle class | star) = 1/3
 The 1NN estimate: P(circle class | star) = 1
 So 3NN prefers the X class and 1NN prefers the circle class.
42 kNN: distance weighted version

 We can weight the “votes” of the k nearest neighbors by their cosine


similarity:

$score(c, d) = \sum_{d' \in S_k(d)} I_c(d') \cos(\vec{v}(d'), \vec{v}(d))$

 Sk(d) is the set of d’s k nearest neighbors


 Ic(d′) = 1 iff d′ is in class c and 0 otherwise.
 Assign the document to the class with the highest score.
 Weighting by similarities is often more accurate than simple voting.
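A minimal sketch of the cosine-weighted variant (illustrative names, precomputed vectors assumed):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def weighted_knn_classify(query, train_vectors, train_labels, k=3):
    """score(c, d) = sum over the k nearest neighbours d' in class c of cos(v(d'), v(d))."""
    sims = [cosine(query, v) for v in train_vectors]
    nearest = np.argsort(sims)[-k:]                           # S_k(d): the k most similar documents
    scores = {}
    for i in nearest:
        scores[train_labels[i]] = scores.get(train_labels[i], 0.0) + sims[i]
    return max(scores, key=scores.get)                        # class with the highest score
```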
43 kNN Properties

 No training necessary
 Scales well with large number of classes
 Don’t need to train n classifiers for n classes
 May be expensive at test time
 In most cases it’s more accurate than NB or Rocchio
44 Some Recent Applications of Text
Classification
 Social networks such as Twitter are a rich source of text data.
 Many new applications are emerging based on these sources.
 We will discuss three recent works based on Twitter data:
 “Learning Similarity Functions for Topic Detection in Online Reputation
Monitoring”, Damiano Spina et al., SIGIR 2014
 “Target-dependent Churn Classification in Microblogs”, Hadi Amiri et al.,
AAAI 2015
 “Fine-Grained Location Extraction from Tweets with Temporal Awareness”,
Chenliang Li et al., SIGIR 2014
45 Learning Similarity Functions for Topic
Detection in Online Reputation Monitoring

 What are people saying about a given entity (company, brand,
organization, personality, …)?
 Are there any issues that may damage the entity’s reputation?
 To answer such questions, reputation experts have to monitor social networks
such as Twitter daily.
 The paper aims to solve this problem automatically, as a topic detection task.
 Reputation alerts must be detected early, before they explode, so only a small
number of related tweets is available.
 Probabilistic generative approaches are less appropriate here because of
data sparsity.
46 Learning Similarity Functions…

 Paper proposes a hybrid approach: supervised + unsupervised


 Tweets need to be clustered based on their topics
 Clusters change over time based on data stream
 Tweets with similar topics belong to the same cluster
 For clustering we need a similarity function.
 Classification (SVM) is used to learn the similarity function.
 A pair of tweets is classified as either “similar” or “not similar”.

Pipeline: SVM (learned similarity function) → Hierarchical Agglomerative Clustering


47 Learning Similarity Functions…

 The dataset comprises 142,527 manually annotated tweets.


 Features used for classification:
 Term Features:
 Taking into account the similarity between terms in the tweets.
 Tweets sharing a high percentage of vocabulary are likely about the same topic.

 Semantic features:
 Useful for tweets that do not have words in common.
 Used Wikipedia as a knowledge base to find semantically related words, e.g. mexico,
mexicanas

 Metadata features:
 Such as author, URLs, hashtags,…

 Time-aware features:
 Time stamp of tweets
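As an illustration only (this is not the paper’s actual feature set), a single term feature for a tweet pair could be computed as the overlap of their vocabularies; such pairwise features feed the “similar / not similar” classifier:

```python
def term_overlap(tweet_a: str, tweet_b: str) -> float:
    """Jaccard overlap of the two tweets' term sets (hypothetical term feature)."""
    terms_a, terms_b = set(tweet_a.lower().split()), set(tweet_b.lower().split())
    if not terms_a or not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)

# Tweets sharing a high percentage of vocabulary are likely about the same topic.
print(term_overlap("new phone launch event today", "phone launch today in town"))
```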
48 Fine-Grained Location Extraction from
Tweets with Temporal Awareness
 Through tweets, users often casually or implicitly reveal their current
locations and short-term plans for where to visit next, at a fine granularity.

 Such information enables tremendous opportunities for personalization and


location-based services/marketing

 This paper is about extracting fine-grained locations mentioned in tweets


with temporal awareness.
49 Fine-Grained Location Extraction…

 Dataset: a sample of 4000 manually annotated tweets


 Classifier: Conditional Random Fields
 Some features:
 Lexical Features: the word itself, its lowercase form, bag-of-words of the context window,
 Grammatical Features: POS tags (for verb tense), word groups (shudd, shuld,
shoud)
 …
 If a user mentions a point-of-interest (POI) (e.g., restaurant, shopping
mall,…) in her tweet:
 The name of the POI is extracted
 The tweet is assigned to one of these classes: the POI was visited, is currently
being visited, or will soon be visited (temporal awareness).
50 Target-dependent Churn Classification
in Microblogs
 The problem of classifying micro-posts as churny or non-churny with respect
to a given brand (telecom companies) using Twitter data.
 Whether a tweet is an indicator that a user is going to cancel a service
 Sample tweets:
 “One of my main goals for 2013 is to leave BrandName”
 “I cant take it anymore, the unlimited data isn’t even worth it”
 “My days with BrandName are numbered”
51 Target-dependent Churn Classification
in Microblogs
 Dataset : 4800 tweets about three telecommunication brands
 Classification method: linear regression, SVM, logistic regression
 Features:
 Demographic Churn Indicators:
 Activity ratio (if a user is not active with respect to any competitor, then he is less likely
to post churny content about a target brand),
 # followers and friends, …

 Content Churn Indicators:


 Sentiment features, Tense of tweet,…


52

THANK YOU
53 References

https://web.stanford.edu/class/cs124/lec/sentiment.pptx
https://web.stanford.edu/class/cs124/lec/naivebayes.pdf
An Introduction to Information Retrieval, Christopher D. Manning, Prabhakar
Raghavan, Hinrich Schütze, 2009
Learning Similarity Functions for Topic Detection in Online Reputation
Monitoring, Damiano Spina et al., SIGIR 2014
Target-dependent Churn Classification in Microblogs, Hadi Amiri et al., AAAI 2015
Fine-Grained Location Extraction from Tweets with Temporal Awareness,
Chenliang Li et al., SIGIR 2014
