
Topic-based Vector Space Model

Jörg Becker
Dept. of Information Systems
Leonardo-Campus 3
48149 Münster, Germany
becker@wi.uni-muenster.de

Dominik Kuropka
Dept. of Information Systems
Leonardo-Campus 3
48149 Münster, Germany
isdoku@wi.uni-muenster.de

Abstract

This paper motivates and presents the Topic-based Vector Space Model (TVSM), a new vector-based approach for document comparison. The approach does not assume independence between terms and is flexible regarding the specification of term-similarities. Stopword-list, stemming and thesaurus can be fully integrated into the model. The paper further shows how the TVSM can be fully implemented within the context of relational databases, which facilitates the use of this approach by generic applications. Finally, short comparisons with other vector-based approaches, namely the Vector Space Model (VSM) and the Generalized Vector Space Model (GVSM), are presented.

1. Motivation

Information filtering (IF) systems are designed to permanently scan document streams (e. g. news-tickers or Usenet). They identify potentially important or interesting documents for users by classifying them. For designers of IF systems the conceptualization of user profiles is a great challenge. For each user a profile has to be defined. This profile determines the criteria which are utilized for the user-specific classification of documents.

One approach (the explication approach) is the definition of a formalized language which the user utilizes to describe his profile. This approach has been implemented in the IF prototypes Rama [Binkley 1991], Borges V2 [Smeaton 1996] and Sift [Yan 2000]. The main problem with this approach is that users are often not capable of specifying their information demand properly. There are two main reasons for this: Firstly, it is difficult for a user to explicate the required criteria. Secondly, the formalized language has to be powerful enough to deal with the challenges of natural language processing, like inflections of words (1), synonyms (2) and polysemes (3). Additionally, it should be powerful enough to allow complex expressions (e. g. using Boolean operators like "or", "and", "not", etc.). On the one hand, a very powerful language demands a huge amount of time from the user to master it. On the other hand, the filtering results of the IF system will be deficient if the language is easy to use but not powerful enough.

One solution to this dilemma is the use of an adaptive approach. The idea is to present some evaluated documents to the IF system and to let it generate the user profile on its own. As a side-effect, the system can improve the user profile continuously if the user gives feedback on misclassified documents. This approach has already been implemented in the NewsSIEVE [Haneke 2001] and PI-Agent [Kuropka 2001] systems. NewsSIEVE adapts the user profile by using evolutionary algorithms, while the PI-Agent uses neuronal networks. Both approaches have in common that the initial information about user profiles (the training set) is transformed into an internal representation (e. g. neuron weights in the case of a neuronal network), which makes the profile representation difficult for users to understand. The system is therefore not able to explicate its classification rules in a user-friendly way. This leads to the following problems: Firstly, the user has to rely on the classification given by the IF system without knowing how the classification is done in detail. Secondly, in case the user's information demand shifts from one day to another, or the system is unable to adapt to his information demand, it is impossible for him to make reasonable corrections to his profile. Consequently, the user has to wait until the system has corrected his profile automatically. Meanwhile, a lot of documents may be misclassified.

Our intention is the use of a case-based approach for defining user profiles. This means the user defines his profile by presenting some evaluated documents to the system, as in the adaptive approach.

(1) Words may appear in different forms, like singular or plural, depending on their use within a text (e. g. "house" and "houses").
(2) Synonyms are different words with the same meaning (e. g. "automobile" and "car").
(3) Depending on the context, a word can have different meanings (e. g. "mouse" can stand for a small rodent or for a computer input device).


In contrast to the adaptive approach, this initial profile information is not transformed into some kind of internal representation. New documents are classified by adopting the classification of the most similar documents from the user profile. A simple adoption of user profiles can be implemented this way: every time the system has misclassified a document, the user-corrected evaluations of the documents are added to the profile. The benefit of this approach is that the profile remains understandable, because it is just a set of evaluated documents. Reasonable corrections to the profile are possible by removing evaluated documents or by inserting new ones. The following aspects should be considered when implementing case-based approaches:

a) Document similarity: A similarity function is necessary to determine the similarities between old and new cases. The "cases" here are documents. Therefore, some kind of document similarity function is needed. This function will be derived from a theoretical foundation which is presented in detail in section 2.

b) Document classification: For document classification we propose the use of k-Nearest-Neighbour classification (kNN). This is a simple method: when a test document is inserted into the system, the system finds the k nearest neighbours among the profile documents and uses the categories of these k neighbours to weight the category of the test document. According to the examination of Yang and Liu, kNN (using the cosine similarity on document vectors) is one of the best methods for text categorization [Yang 1999]. This paper describes work which is still in progress. At present, we do not have enough qualified data to run optimizations on the parameters of the kNN. Thus, the kNN is future work and will not be discussed in this paper in more detail. (A sketch of such a neighbour query, based on the database model, is given at the end of section 3.1.)

c) Efficient processing of huge amounts of persistent data: The IF system should work reliably for several news sources as well as for a large number of users. Further requirements are: every user should have his own profile, which supports several classification schemes (e. g. classification by importance or by topic). The reliability requirement demands persistency of data. The other requirements entail the possibility of storing and processing a huge amount of data (documents, user profiles, natural-language-dependent data and administrative data). Databases are optimized for processing huge amounts of data, and they can be used effectively if the main part of the data is processed within the database. Section 3 will show how the document similarity function can be fully implemented within the context of a relational database using the Structured Query Language (SQL) [ISO 1992].

d) Optimized adoption of user profiles: The simple adoption method described above adds all documents which have been misclassified by the system (and their affiliation to a class) to the user profile. In the long run this leads to two problems: Firstly, the user profile grows without any limit, which reduces the classification performance. Secondly, the system is not able to "forget" anything it has adopted, which makes an adaptation to changes of the user's information demand impossible. Therefore, it is necessary to develop some kind of reorganisation and garbage-collection methods. Qualified data is advantageous for the development of an optimized adoption method. This point belongs to future work and will not be discussed in this paper in more detail.

2. Development of the model

This section introduces the Topic-based Vector Space Model (TVSM). Relations to other vector-based models, in particular the Vector Space Model (VSM) [Salton 1968; Baeza-Yates 1999, pp. 27-30] and the Generalized Vector Space Model (GVSM) [Wong 1987; Baeza-Yates 1999, pp. 41-44], are discussed in section 4.

2.1 Topic-based Vector Space

The basic premise of the TVSM is the existence of a d-dimensional space R which only has positive axis-intercepts:

    R = \mathbb{R}^d_{\ge 0}  with  d \in \mathbb{N}_{>0}    (1)

Interpretation of R: each dimension of R represents a fundamental topic. It is assumed that all fundamental topics are independent from each other (i.e. orthogonal). The precise number of fundamental topics d has no further relevance for the TVSM, so there is no need to specify this number in more detail.

A term i \in \{1, \ldots, n\} is represented by a term-vector \vec{t}_i and has a term-weight between zero and one. The term-weight is defined as the algebraic length of the term-vector \vec{t}_i:

    \vec{t}_i = (t_{i1}, t_{i2}, \ldots, t_{id}) \in R    (2)
    with  |\vec{t}_i| = \sqrt{t_{i1}^2 + t_{i2}^2 + \ldots + t_{id}^2} \in [0;1]
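As a concrete illustration of definitions (1) and (2) (our own numbers, not from the original paper): in a space with the two fundamental topics "computer" and "literature", a topic-specific term like "software" and an unspecific term like "the" could be represented as

    \vec{t}_{software} = (0.95, 0)     \Rightarrow  |\vec{t}_{software}| = 0.95
    \vec{t}_{the}      = (0.05, 0.05)  \Rightarrow  |\vec{t}_{the}| = \sqrt{0.05^2 + 0.05^2} \approx 0.07

Here "software" points along the computer axis with a term-weight near one, while "the" lies at 45° to both axes with a term-weight near zero; this is exactly the situation depicted in figure 1 below.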


[Figure 1. Interpretation of term-vectors — figure not reproduced; the original shows a two-dimensional space with the axes "topic: computer" and "topic: literature" (each scaled from 0 to 1), the term-vectors of "program" and "software" pointing along the computer axis, "Goethe" pointing along the literature axis, and the unspecific terms "is" and "the" lying near 45° between both axes.]

Figure 1 illustrates the interpretation of term-vectors: topic-specific terms (e. g. "program", "software" and "Goethe") point into the same direction as the fundamental topic to which they are related. In case a term combines two or more fundamental topics, it points into a combined direction of these fundamental topics (due to visualisation limits this case is not shown in figure 1). Terms which are not topic-specific at all (e. g. "is" or "the") have a term-vector with an angle near 45° towards all fundamental topics of the space R. The term-weight represents the relevance of a term concerning its applicability to derive information about its relation to a topic in general. Term-weights are near to value one for topic-specific terms and near to value zero for unspecific terms.

We premise that the following data is known for all terms i, j \in \{1, \ldots, n\}:

• Term-weight of each term i: |\vec{t}_i| \in [0;1]
• Term-angle between each pair of terms i and j: \omega_{ij} \in [0°;90°]

Using this data, the scalar product \vec{t}_i\vec{t}_j can be deduced as:

    \vec{t}_i\vec{t}_j = |\vec{t}_i| \cdot |\vec{t}_j| \cdot \cos(\omega_{ij})    (3)

2.2 Document definition

In the TVSM a document k \in \{1, \ldots, m\} is represented by a document-vector \vec{d}_k \in R whose length is normalized to value one for better comparison:

    \vec{d}_k = \frac{1}{|\vec{\delta}_k|} \vec{\delta}_k  \Rightarrow  |\vec{d}_k| = 1    (4)
    with  \vec{\delta}_k = \sum_{i=1}^{n} e_{ki} \vec{t}_i
    where e_{ki} = number of occurrences of term i in document k.
2.3 Similarity of documents

The similarity between two documents k and l is defined as the scalar product of the document-vectors, which is equal to the cosine of the angle between the document-vectors:

    \vec{d}_k\vec{d}_l = |\vec{d}_k| \cdot |\vec{d}_l| \cdot \cos(\omega_{kl}) = \cos(\omega_{kl})    (5)
    because  |\vec{d}_k| = |\vec{d}_l| = 1

This similarity measure has the following properties:

    \vec{d}_k\vec{d}_l \in [0;1]  and  \vec{d}_k\vec{d}_l = \vec{d}_l\vec{d}_k  and  \vec{d}_k\vec{d}_k = 1    (6)

With the knowledge of term-weights and term-angles, the similarity between two documents can be computed as follows:

    \vec{d}_k\vec{d}_l = \frac{1}{|\vec{\delta}_k| \cdot |\vec{\delta}_l|} \vec{\delta}_k\vec{\delta}_l
                       = \frac{1}{|\vec{\delta}_k| \cdot |\vec{\delta}_l|} \sum_{i=1}^{n}\sum_{j=1}^{n} e_{ki} e_{lj} \vec{t}_i\vec{t}_j    (7)

The length of the unnormed document-vectors \vec{\delta}_k and \vec{\delta}_l can be computed as follows (here only shown for \vec{\delta}_k):

    |\vec{\delta}_k| = \left| \sum_{i=1}^{n} e_{ki} \vec{t}_i \right|
                     = \sqrt{\left( \sum_{i=1}^{n} e_{ki} \vec{t}_i \right)^2}
                     = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} e_{ki} e_{kj} \vec{t}_i\vec{t}_j}    (8)

The computational complexity of this similarity calculation is low enough to enable an implementation in SQL. This facilitates a database-optimized computation of document similarities within a relational database, as shown in section 3.


2.4 Ideal term-weights and -angles

Term-weights |\vec{t}_i| should have the following characteristics in order to ensure good behaviour of the similarity function:

• Terms which can be used as credible topic indicators for documents (e. g. technical terms) should have a term-weight near value one.
• Terms which do not have a relation to a topic (e. g. the so-called stopwords "is", "a", "the", etc.) should have a term-weight near or equal to value zero. Weighting stopwords with value zero allows the integration of a classic stopword-list into the TVSM.

The angle \omega_{ij} between two terms should follow these rules:

• \omega_{ii} = 0°
• \omega_{ij} = \omega_{ji}
• The angle between two terms which have the same stem (e. g. "house" and "houses") should be zero degrees. This integrates the idea of stemming into the TVSM.
• The angle between two synonyms, or between two words from a similar topic, should have a value near zero degrees. This integrates the concept of a thesaurus into the TVSM.
• The angle between two terms from different topics should have a value near 90 degrees.
• The angle between terms which are credible topic indicators and terms without relation to a topic (stopwords) should have a value near 45 degrees.
• The angle between two terms without relation to a topic (stopwords) should have a value near zero degrees.

2.5 Estimation of term-weights and -angles

A manual setting of all term-weights and term-angles is not efficient because of the high number of terms and the even higher number of angles. We used the following estimation for our first experiment:

Assumption: All terms which occur in more than 50% of the documents are not useful for any kind of classification. Thus, the term-weights for these terms are set to value zero, which implies that the scalar product of these terms in relation to other terms is also zero.

Assumption: Terms which occur in less than 1% of the documents do not provide enough data to calculate a reliable estimation of term relations. Therefore, these terms are treated as orthogonal to all other terms. Consequently, \omega_{ij} = 90° for all i \ne j. The term-weights for these terms are set to value one.

The term-weights for all other remaining terms are set to value one, while the angles are set to the following values:

    \omega_{ij} = \begin{cases}
        90° - 90° \cdot \mathrm{Corr}(T_i', T_j')  &  \text{if } \mathrm{Corr}(T_i', T_j') \ge 0 \\
        90°                                        &  \text{else}
    \end{cases}

\mathrm{Corr}(T_i', T_j') is the empirical correlation of terms i and j within a document base.
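The paper leaves the concrete computation of Corr(T_i', T_j') open. As a purely hypothetical sketch, such an estimate could be obtained from the database model of section 3.1 using the corr() aggregate of modern PostgreSQL (not yet available in the 7.2 release used for the experiments):

-- Hypothetical sketch, not from the paper: angle estimation from the
-- empirical correlation of term frequencies, following the rule above.
-- Caveat: the join only covers documents containing both terms; a
-- faithful implementation would also account for zero occurrences.
SELECT a.term AS term1, b.term AS term2,
       CASE WHEN corr(a.quantity, b.quantity) >= 0
            THEN 90 - 90 * corr(a.quantity, b.quantity)
            ELSE 90
       END AS angle
FROM doc_term_ass a
     JOIN doc_term_ass b ON a.document = b.document
GROUP BY a.term, b.term;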
The estimation of term-weights and -angles can be improved by using explicit information about the natural language of the documents and explicit information about the semantic coherence of words. The following improvements should be investigated in the future regarding their influence on classification quality:

• A stopword-list should be used to assign a weight of value zero to all stopwords.
• A stemming-list or a stemming-algorithm should be used to assign an angle of zero degrees to all pairs of words with the same stem.
• Term-angles reflecting the semantic coherence of words could be derived from a thesaurus, a semantic web or a formalized ontology.
3. Database model

3.1 Schema definition

The TVSM has been tested by implementing it in a relational database using the following statements (we assume that an index on all unique or primary-key attributes is automatically created by the database):

CREATE TABLE document (
    id     INTEGER UNIQUE NOT NULL,
    length DOUBLE PRECISION DEFAULT NULL,
    text   TEXT NOT NULL,
    PRIMARY KEY(id));

CREATE TABLE term (
    id     INTEGER UNIQUE NOT NULL,
    text   TEXT UNIQUE NOT NULL,
    weight DOUBLE PRECISION DEFAULT NULL,
    PRIMARY KEY(id));

CREATE TABLE doc_term_ass (
    document INTEGER NOT NULL REFERENCES document(id),
    term     INTEGER NOT NULL REFERENCES term(id),
    quantity INTEGER NOT NULL,
    PRIMARY KEY(document, term));

CREATE INDEX doc_term_ass_term_document_idx
    ON doc_term_ass (term, document);

CREATE TABLE skalarproduct (
    term1 INTEGER NOT NULL REFERENCES term(id),
    term2 INTEGER NOT NULL REFERENCES term(id),
    value DOUBLE PRECISION NOT NULL,
    PRIMARY KEY(term1, term2));

Table "document" stores the text as well as the unnormed document-vector length |\vec{\delta}_k| of each document. Table "term" stores all terms and their term-weights |\vec{t}_i|. Documents and terms are connected by "doc_term_ass", which also stores the number of occurrences of a term within a document. Finally, "skalarproduct" stores the scalar product \vec{t}_i\vec{t}_j between two terms. Each time a document is inserted, "doc_term_ass" has to be updated. After this, the unnormed document-vector length has to be calculated once and, for performance reasons, stored in table "document", using the following view as data source:

CREATE VIEW doc_length AS
    SELECT dta1.document,
           sqrt(sum(dta1.quantity * dta2.quantity * value)) AS length
    FROM doc_term_ass dta1, doc_term_ass dta2, skalarproduct s
    WHERE dta1.document = dta2.document
      AND dta1.term = s.term1
      AND dta2.term = s.term2
    GROUP BY dta1.document;


Two views should be created in order to calculate the similarity between documents (two views are used in order to avoid nested select statements):

CREATE VIEW unnorm_doc_sim AS
    SELECT dta1.document AS document1,
           dta2.document AS document2,
           sum(value * dta1.quantity * dta2.quantity) AS unnorm_sim
    FROM doc_term_ass dta1, doc_term_ass dta2, skalarproduct s
    WHERE s.term1 = dta1.term
      AND s.term2 = dta2.term
    GROUP BY dta1.document, dta2.document;

CREATE VIEW doc_sim AS
    SELECT uds.document1, uds.document2,
           uds.unnorm_sim / n1.length / n2.length AS sim
    FROM unnorm_doc_sim uds, document n1, document n2
    WHERE uds.document1 = n1.id
      AND uds.document2 = n2.id;

The first view calculates the unnormed similarity between two documents:

    \sum_{i=1}^{n}\sum_{j=1}^{n} e_{ki} e_{lj} \vec{t}_i\vec{t}_j

The second view applies the norming factor using the previously stored unnormed document-vector lengths. This results in a virtual table which "holds" the similarity for all combinations of documents:

    \vec{d}_k\vec{d}_l = \frac{1}{|\vec{\delta}_k| \cdot |\vec{\delta}_l|} \sum_{i=1}^{n}\sum_{j=1}^{n} e_{ki} e_{lj} \vec{t}_i\vec{t}_j

The following statement has to be executed in order to obtain the similarity between document 5 and document 7:

SELECT * FROM doc_sim
WHERE document1=5 AND document2=7;

The following statement compares document 5 with all other documents (in descending order of similarity):

SELECT * FROM doc_sim
WHERE document1=5
ORDER BY sim DESC;
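The doc_sim view also makes the k-Nearest-Neighbour classification outlined in section 1 (item b) straightforward to express. The following query is our own sketch, not from the paper: it retrieves the k = 10 profile documents most similar to document 5, whose stored class evaluations would then be used to weight the class of document 5 (the class lookup itself lies outside the schema shown above; LIMIT is PostgreSQL syntax):

-- Sketch: the 10 nearest neighbours of document 5 for kNN
-- classification, excluding the trivial self-match (sim = 1).
SELECT document2, sim
FROM doc_sim
WHERE document1 = 5
  AND document2 <> 5
ORDER BY sim DESC
LIMIT 10;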
3.2 Performance

We implemented the TVSM on a PostgreSQL (http://www.postgresql.org) version 7.2 relational database. For better performance, only entries with a scalar product larger than the scalar-threshold 0.5 are stored in the "skalarproduct" table (this is equivalent to setting all scalar products smaller than the scalar-threshold to value zero). For our tests, we used 7184 news documents from the German Heise-Ticker website (http://www.heise.de/newsticker). 96887 terms were extracted from these documents and stored in the "term" table. From this data basis, term-weights and -angles were derived as described in section 2.5 (with the restriction of the scalar-threshold). Table "skalarproduct" contained 97509 entries. The calculation of the similarity between a typical document (having 164 different terms) and all 7184 documents (including reverse ordering by similarity) needed approximately five seconds on our generic PC (Athlon XP 1600+ processor with 768 MByte RAM and FreeBSD operating system). First performance tests showed that the calculation speed depends highly on the number of entries in table "skalarproduct" and only to a very low degree on the number of terms or documents. This means the scalar-threshold is a good variable for trading calculation speed against the quality of the similarity calculation.

4. Comparison with other vector-based approaches

Both the Vector Space Model (VSM) [Salton 1968; Baeza-Yates 1999, pp. 27-30] and the TVSM assign a document-vector to each document. In contrast to the TVSM, the VSM assumes that all terms are independent of (orthogonal to) each other. This leads to a relatively high performance. The assumption of orthogonal terms is incorrect regarding natural languages, which causes problems with synonyms or strongly related terms. In order to reduce these problems, messages are usually passed through stopword-list, stemming and thesaurus algorithms before they are forwarded to the VSM. This abrogates the assumption of term independence only in part, because two terms can only be treated as either equivalent or not equivalent. Similarity levels between these two extremes are not possible. From the theoretical point of view, the TVSM has the advantage of not assuming independence of terms, which allows a full integration of stopword-list, stemming and thesaurus into the model. Similarity between terms can be gradually defined from "not equivalent" (term-angle: 90°) to "equivalent" (term-angle: 0°).

The Generalized Vector Space Model (GVSM) [Wong 1987; Baeza-Yates 1999, pp. 41-44] assigns a document-vector to each document without the assumption of orthogonal terms. In contrast to the TVSM, the GVSM allows no flexibility regarding the computation of term-angles: in the GVSM, term-angles are based on the computation of the co-occurrence of terms. Because of this limitation, messages have to be pre-processed in a similar way as for the VSM: messages are passed through stopword-list and stemming algorithms before they are forwarded to the GVSM. In contrast to the GVSM, the TVSM specifies only ideal properties of term-angles (refer to section 2.4). Therefore, the TVSM allows more flexibility regarding the calculation of term-angles. Term-angles can be computed using different statistical methods
like co-occurrence or correlation (refer to section 2.5). Further, the TVSM allows the deduction of term-angles from explicit information about the semantic coherence of words (e. g. from semantic networks or ontologies).
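The relation between the TVSM and the VSM can also be read directly off equation (7): under the VSM assumption of pairwise orthogonal terms, \vec{t}_i\vec{t}_j = 0 for all i \ne j, and the TVSM similarity collapses to the classic cosine measure over weighted term frequencies (our own restatement, not from the paper):

    \vec{d}_k\vec{d}_l = \frac{\sum_{i=1}^{n} e_{ki} e_{li} |\vec{t}_i|^2}
                              {\sqrt{\sum_{i=1}^{n} e_{ki}^2 |\vec{t}_i|^2} \cdot \sqrt{\sum_{i=1}^{n} e_{li}^2 |\vec{t}_i|^2}}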

5. Conclusion

This paper presents a new approach, the TVSM, for comparing documents regarding their content. The approach has the following advantages: From the theoretical point of view, the TVSM is an open approach which enables the integration of several natural language processing techniques, such as stopword-list, stemming and thesaurus, into one model. This facilitates the exploration of dependencies between these techniques and provides a potential to optimize natural language processing models in general.

From the practical point of view, the TVSM enables the complete calculation of document similarities within a relational database by using plain SQL. Therefore, reliable processing of huge amounts of data is supported by using database-integrated optimization algorithms for accessing and processing the data.

6. Bibliography

[Baeza-Yates 1999] R. Baeza-Yates, B. Ribeiro-Neto: Modern Information Retrieval. ACM Press, New York, 1999.
[Binkley 1991] J. Binkley, L. Young: Rama: An Architecture for Internet Information Filtering. ftp://ftp.cs.pdx.edu/pub/faculty/jrb/rama/rama_kl.ps, download: 2002-06-10.
[Haneke 2001] E. Haneke: NewsSIEVE: Ein selbstadaptiver Filter für textuelle Informationen. 2001, ftp://ftp3.informatik.uni-bonn.de/pub/paper/tr/IAI-TR-2001-1.pdf.gz, download: 2002-06-11.
[ISO 1992] International Standards Organization: ISO/IEC 9075. 1992.
[Kuropka 2001] D. Kuropka, T. Serries: "Personal Information Agent", Informatik 2001 Conference Proceedings, books@ocg.at, 2001, pp. 940-946.
[Salton 1968] G. Salton, M. Lesk: "Computer evaluation of indexing and text processing", Journal of the ACM, 15(1), 1968, pp. 8-36.
[Smeaton 1996] A. Smeaton: "Lessons learned from developing and delivering the BORGES information filtering tool", Ariadne, 1996, http://www.ariadne.ac.uk/issue6/borges, download: 2002-06-14.
[Wong 1987] S. Wong, W. Ziarko, V. Raghavan, P. Wong: "On Modeling of Information Retrieval Concepts in Vector Spaces", ACM Transactions on Database Systems (TODS), 12(2), June 1987, pp. 299-321.
[Yan 2000] T. Yan, H. Garcia-Molina: "The SIFT Information Dissemination System", ACM TODS, 2000.
[Yang 1999] Y. Yang, X. Liu: "A re-examination of text categorization methods", Proceedings of the 22nd Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, 1999, pp. 42-49.

