Information Filtering Based On User Behavior Analysis and Best Match Text Retrieval
Information Filtering Based On User Behavior Analysis and Best Match Text Retrieval
Information Filtering Based On User Behavior Analysis and Best Match Text Retrieval
Abstract
Information filtering systems have potential power that may provide an efficient means of navigating through large
and diverse data space. However, current information filtering technology heavily depends on a user’s active
participation for describing the user’s interest to information items, forcing the user to accept extra load to overcome
thealready loaded situation. Fumhemo~, because theuser's interests weoften expressed indiscrete fomat such
as a set of keywords sometimes augmented with if-then rules, it is difficult to express ambiguous interests, which
users often want to do. We propose a technique that uses user behavior monitonng to transparently capture the
user’sinterest in information, andatechnique to use this interest to fikerincoming information in avery efficient
way. The proposed techniques are verified to perform very well by having conducted a field experiment and a series
of simulation.
1 Introduction
Recent developments in computers and computer networks interconnecting large numbers of systems have brought
us convenience in many aspects, but have introduced a situation of “information overloading” also. As highly
developed computer networks into our everyday life, the amount of information we create and exchange
disseminate
has grown by an order of magnitude. The amount is far more than a person can understand and manage, and the task
of retrieving a piece of information from the magnitude of information is beginning to take considerable amount
of time and effort.
A concept of information jiltering [1] was introduced as a key technology to overcome this situation, but we
are all aware that technologies such as user profile acquisition should be studied in detail before the information
filtering technologies are put into any practical use.
In this paper, we propose a profile acquisition and user feedback technique to accumulate a user’s preference
for information, based on user behavior monitoring, as well as an information filtering technique using the acquired
profile. The proposed techniques are based on an assumption that a user is performing a quick pattern matching
task upon deciding if a piece of information is relevant to the user or not, before the user attempts to understand the
semantics of the information any further. If this is all true, we can expect that information not of interest to the user
is immediately rejected, and conversely, information of interest requires a noticeable amount of time to process.
In the following, we will briefly review the general architecture of information filtering systems and associated
issues in section 2, followed by an explanation of our assumption in section 3. Section 4 describes the method and
results of the field experiment we conducted with readers of NetNews, to prove the correctness of our assumption
for the profile acquisition. Then, in section 5, the proposed information filtering technique using the acquired profile
is described. A simulation of the proposed information filtering technique showed that the technique can retrieve
relevant information for a user with sufficient precision and recall rates for casual source of information such as
NetNews. Finally, in section 6, summaries of the proposals and results as well as remaining issues and directions
for future research are presented.
Information Seleeted
-&-
Items tteme
Relevance
“’ihlfrl?rij: “ Faadback
!tq’::’
3
counts for humans’ creative work by capturing potentially important information from casual information sources
such as NetNews. From here on, we focus heavily on cognitive filtering systems.
Figure 1 shows a generalized architecture of information filtering systems [4]. This figure explains an infor-
mation filtering system as consisting of seven logical units, namely the information source, the information filter,
the user and four buffers. In this system, the source represents some descriptors of the information items to the
filter, and the filter forwards to the individual user a subset of the items selected based on advance knowledge of
the user’s needs that is stored as the user’s projile. The profile is like a reference database representing information
needs or preferences of the user. Users may have the option of providing the filter with feedback indicating an
extent to which the selection met their current information needs.
There are four buffers between the source, the filter, and the user as shown in the figure. These buffers are
required in order to accommodate diverse information types or usage scenario types. For instance, buffer~ stores
the user’s profile.
In our survey, we concluded that there are three important issues, listed below, that must be studied to realize
any practical information filtering system.
● Accumulation of user’s preference to information
User’s preferences must be accumulated in advance as a profile, and should be updated appropriately.
Most of the previous work on cognitive filtering relied heavily on the user’s active participation in profile
accumulation and feedback. Profiles were often expressed in discrete forms such as a set of keywords possibly
augmented with if-then rules, so that the user can enter and edit the profiles with an editor of some kind.
However, we claim that it is generally difficult to express one’s “interest”, which is a non-discrete quantity,
in the discrete way. Discrete profiles would result in incomplete description of the user’s interest, which
would lead to possible loss of potentially interesting information.
Relying on the user’s active participation in updating the profile database also have a problem in tracking
temporal changes in the user’s interest.
We claim that a cognitive filtering system must be able to handle user’s interest in a non-discrete fashion.
● Representation of the profile
This issue is the direct product of the above issue. If a filtering system must deal with non-discrete informa-
tion, the representation of the profile that stores the information will immediately be an issue.
● Filtering mechanism
This issue, again, is the direct product of the first issue. Because a filtering system must deal with non-discrete
profile and discrete target information, what will the best filtering technique be?
We also point out that a practical cognitive filtering system should do its best not to filter out potentially impor-
tant or stimulating information, to allow users of the system to have opportunity to be exposed to new information.
Foltz took this approach of using similarity among documents for information filtering. He reported in [2] that
using latent semantic indexing (LSI) for filtering news articles using articles pre-selected by a readers as the reader’s
profile can perform significantly better than filtering based on simple keyword matching on filtering performance.
Although their results are interesting, we felt that we need to study further on in following two points before
directly employing the mechanism to judge similarity among documents in an information filtering system.
● Time spent for an article is affected by the readability of the article, i.e., denseness of the article, in some
degree.
● Time spent for an article is affected by a size of a back log that a user have to process, e.g. if the user have
large quantity of unread articles, time spend for each article get shorter.
In the following, we will first describe how the experiment was carried out, followed by result of analyzing the
collected data.
275
1. Read NetNews
as usually do
GNUS
-“’+’”+
x- 4+n
,,,,,,,,
,,,,,:,:,:,,,,
,;:,:,::,,
Mesg-lD
Emacs
Reading lima
Action (Ssve,Follow)
2. Inout immeeaions Postad Masssge
fo; asch’articie J/”
4.1 Experimentation
4.1.1 Subjects of the Experiment
This experimentation is aimed at NetNews (we focused on f j . * newsgroups, which is written mostly in Japanese)
user, which is a popular Internet information service. Subjects of our experiment were recruited from actual user of
NetNews by the NetNews itself, and 8 people volunteered as subjects. These 8 subjects read total of 8,000 articles
during a period of 6 weeks.
Figure 2 illustrates over all setting for the experiment. We modified GNUS, a widely used NetNews reader based
on GNU Emacs, to record following information for each article that were read by subjects of this experimentation.
1. Message identifier of the article.
2. Time spend for the article (How much time was spent for the article).
3. Action for the article (The article was saved or follow-up was posted).
4. Copy of posted messages if the subject post new articles.
After a GNUS session is over, subjects rate each articles they read in the last session using chknews com-
mand, The chknews command uses the record of message identifiers to display contents of each article on users
terminal and interviews subjects about impression of the article. Then it merges the records described above and
the impressions about articles, and send them into a data collecting server using ordinal e-mail messages.
The impressions of article was to be answered in four grade of ‘A’, ‘B’, ‘C’ or ‘F’ where ‘A’ meaning very
interesting, ‘B’ meaning interesting, ‘C’ meaning neither interesting nor not-interesting, and ‘F’ meaning not-
interesting.
We requested subjects to behave as close to usual news reading session as possible during the experimental
GNUS session, but also requested the followings to avoid abnormalities in the resulting data.
● During the GNUS session, subjects must wholeheartedly read articles and should not do other things such as
leaving the terminal a while to get a cup of coffee or read newly arrived e-mail messages.
● During the period of experiment, subjects should read NetNews everyday as possible.
● Subjects should read all articles in registered newsgroups, e.g. subjects should not skip articles without
having a glance at them or issue “catch up” commands.
Note that we separated the reading session and the rating session for the same reason.
Table 1 summarizes collected data over the period of 6 weeks. In following sections, we use the terms ‘useful’
or ‘interesting’ for those article that subject rated ‘A’ or ‘B’, and’ un-useful’ or ‘not-interesting’ for those rated ‘F’.
L
3 40 21 528 1218 27 1794
4 11 13 28 58 201 300
5 25 3 387 208 3 601
6 38 1 19 746 7 773
7 41 35 352 233 14 634
8 30 26 162 697 14 899
Total 290 220 2083 4338 1191 7832
interesting Art!cles —
Nol-mtereslm Articles ------
A~Amd6s ---
.- m
Av tlmelor Unusefularltdes
Av llmefor Allari!cles
s
Av t!mefor Use fulamcles
.
.
““” “’.\.,
. .. . .m
a,a
,
‘\
~,
‘1 103
Reading }!me (see)
those rated not-interesting. Other factors played only aveWlittle role inaffecting reating time ofm~ticle. The
results from our analysis can be summarized as follows.
● There is a strong tendency to spend a long time to read articles those rated interesting.
● There is a tendency not to spend a long time to read articles those rated not-interesting.
● There was very low correlation of less than 0.08 between length of an article and a time to read the article.
This probably indicates that not all articles are completely read, e.g. the decision of whether an article is
interesting or not, especially in the not-interesting articles, are made by looking at first couple of dozens of
lines.
● Readability of an article is not an important factor that affects its reading time. In our analysis, we used
several metrics for denseness of characters in article, including average number of characters per line, ratio
of blank lines in an article. We suspected that reading time become longer in proportion as denseness of
the article become higher, but the results showed very low correlation between reading time and any of the
metrics we used. The highest correlation we observed was 0.25, which was for average number of characters
per line.
● Effect of back 10C [unread articles) was also verv small. Correlation between number of unread articles and
reading time is Ie;s than 0.18. ‘
From the observation made above, we can conclude that preference of a user to an article is the dominating
factor that affects time to spend reading it, meaning that it is possible, to a certain extent, to monitor the time spend
for an article and know if the user is interested in the article or not. Consider for example, with some threshold time
T seconds, and treating articles which took longer than -r seconds to read as interested articles, and use these article
as feedbacks, we can perform user feedback with significant precision. For example, our result indicates that with
T being 20 seconds, 30% of interesting articles can be retrieved with precision of 70%.
277
Table 3. % of Mutual Relation of Reading Time between Initial and Follow up Articles
Sato showed in [10] that best match information retrieval method using this sub-string indexing is very powerful
and efficient. The basic idea of his method is to put Japanese texts in database with its sub-strings as keys, and
a search is carried out using a query key, which also is decomposed into sub-strings, by searching a text that has
highest number of matching sub-string keys. Figure 5 gives an example of score calculation.
+Hf31x$iJi?
Query
(How IS the weather Today?)
- . . ..} . . . . . . . . . . . . . . . . . . . ..-.
Decomposed
---;:;---/:% ------------
+~ ~ ~~ ~jb~ b~v~ b~x x% $F,r: m!3 Eli ii~’ ~’~’ ~’x x% %7!? f:~ 972
Keys ——————— —————————
---- ---- . . . . . . . . . . . ---- . . . . . . . . . . . . . . . . . . . ---- . . . . ---- --
t t
Database +El itblb~x%f: BIEILtb’b’x%f:972
Entries (It
IS a fine day today,) (It was a fine day yesterday.)
Our filtering method is a variation of the best match retrieval method. When a new article is given, we calculate
its score, or occurrence count of all sub-string of the article in a sub-string database containing sub-strings made by
splitting articles determined to be interesting in the past. We then use some threshold score to determine if the new
article is interesting or not. Similarly, to determine if a new atticle is not-interesting, we can carry out the same
scoring procedure using those articles that are determined not-interesting in the past.
1— I
........ _ ........
l—
#!!mE --------
N-1 N N+l N+ M-2 N+ M-l N+M N+M+l
Profile
- ProfJe with Useful article
Profile with Unuseful arlcle
M Sessions
4
Y
+
v
I Scoring each atilcle
~ FILTER
&
‘C3
5.2 Evaluation
We evaluated the proposed information filtering method using the sub-string index method by a simulation using
the collected data.
Figure 6 illustrates show how the simulation was carried out. We took records of M consecutive sessions
starting at session N from a subject’s records of GNUS interaction and rating work, and made a profile database of
articles rated interesting and not-interesting by the subjects. Then, articles in N + M th session was scored using
the profile database. To increase the samples, we shifted the starting session N within the entire recorded session.
279
Ilxl
Interesting Arlicle ----
Not-inlerestinq Arlick
~ ~
!!
Recall: 20,0 % Recak 60.0%
80 -:
!;
Precision: 59.5 %
(Error Rate 190 %)
70 - j /,:
:j’ , Precision: 51.6 %
: : ,.,,,., ,.., (Error Raw 22.8 %)
.,,, ...; Precision: 48,7 %
(Error Rate 2.0 O/.)
60 -)
~~
50 i L.---- . . . . .. . . ...
“.-...
; IL ‘-- . . . . . . . ..........., :
‘-------- . ... . . . ..
40 ~ - .-. ,..,.
-“ ----------- .. ..,’..,
Chance Periormanc=s for Uselul Arlcles 30.1 % .. . . .
-- .---+,...
...
20 .,
i~
o 10 20 30 40 50 60 70 80 80 100
Rwall (%)
Iwo
Mutual Relation ---- :
.
. .
..
.//’”
.. .“
.
:.
100
.
. . . .. . . //’’”
. .*..+
.“ ; ::%+:: 9 ● ,,:...”’ .
.
. . . ; >;:: $$.;j’f;~‘:.”;“
. .
‘“ “
Correlation cceflicient. 0.49
i;”jii?‘ W
.?. “ *’”’”:-’: “
.
, , .::.:,.:;;%5:::
. ...- .“.” “
. ..— m.. ..:” 0 . .
10 -’-’-’% .. ... ..........- . . . .. . . . . .. .. .
..- ....... . ...---------- . . . — .-- ”...... . . . ...?” .
. . .. —.-— . . . . . .
. ------
. . . . . . ..-
.l...:u:-=———.-
— —- . . . . .
1
.. .
I
t
.
““”—–—
. . ..*
—.
..- .... . .
. . .. .
1“ . . . *.. - -—- . .. . . ..
Table 4 shows dimension of data sets used in this evaluation. We varied the value of M in the range of 1,2,5, 10
and 20, and obtained best results with M = 5. Following discussions are based on this value of M.
Figure 7 shows precision-recall curves of predicting interesting and not-interesting articles, obtained by varying
the threshold scores that determines whether the articles are interesting or not. It illustrates that the filtering for both
interesting and not-interesting articles performed considerably better than their chance performance. For example,
at the recall rate of 20’%0,precision are 48.7% for interesting articles and 59.59Z0for not-interesting articles. Filtering
of not-interesting articles perform well even at the high recall rate of 60% with precision of 51.6910against chance
performance of 11.6%.
280
To further justify our hypothesis given at the beginning of this section, we examined the correlation between
estimated usefulness scores and reading time for all articles. Figure 8 shows the correlation between these two
values. This figure shows there is a strong correlation between them (correlation coefficient is 0.49), suggesting
that the score calculated by our method can predict user’s impression with very high precision.
Acknowledgements
Results presented in this report owe a lot to those volunteered to be a subject of our experiment. Without their
patience through tedious task of rating once read articles, this report would never be completed. We would also like
to thank Professor Satoshi Sato of JAIST for his valuable advice in information retrieval using sub-string indexing.
281
Our appreciation also goes to Professor Naohito Chino of Aichi Gakuin University, department of psychology, for
his advice in statistical analysis of the collected data.
References
1. Nicholas J. Belkin and W. Bruce Croft. Information Filtering and Information Retrieval: Two Sides of the
Same Coin? Communications of the ACM, 35(1 2):29–38, December 1992.
2. Peter W, Foltz. Using Latent Semantic Indexing for Information Filtering. In Proceedings of the ACM
Conference on Ojjice Information Systems, pages 40-47, April 1990. Boston.
3. Brewster Kahle. Wide Area Information Server Concepts. Thinking Machines (FTP:/wais/doc/wais-
concepts. txt@quake.think. tom), 1989.
4. Shoshana Loeb. Architecting Personalized Delivery of Multimedia Information. Communications of the
ACM, 35(12):39-48, December 1992.
5. Thomas W. Malone, Kenneth R. Grant, Franldyn A. Turbak, Stephen A. Brobst, and Michael D. Cohen.
Intelligent Information-Shareing System. Communications of the ACM, 30(5):390-402, May 1987.
6. Masahiro Morita. A Survey of Information Filtering Technology (In Japanese). JAIST Research Report IS-
RR–93-91, School of Information Science, Japan Advanced Institute of Science and Technology, 1993. FTP:
/pub/jaist/doc/93/m/is-m-93-9i.ps@f@.jaist.ac.jp.
7. Masahiro Morita. Information Filtering Based on User Behavior Analysis (In Japanese). Master’s thesis,
School of Information Science, Japan Advanced Institute of Science and Technology, February 1994.
8. Ashwin Ram. A Knowledge goals: A Theory of Interestingness. In Proceedings of the 12th annual conference
of the cognitive science society, pages 206–214, July 1990. Cambridge, Mass.
9. Ashwin Ram. Natural Language Understanding for Inforamtion Filtering Systems. Communications of the
ACM, 35(12):80–81, December 1992.
10. Satoshi Sate. CTM2: A Japanese-English Example-Based Translation Aid System using N-gram Index
Method (In Japanese). JAIST Research Report IS–RR–93–61, School of Information Science, Japan Advanced
Institute of Science and Technology, 1993. FTP: /pub/jaist/doc/93/m/is-rr-93-6i.ps@ftp.jaist.ac.jp.
11. Satoshi Sato and Takeshi Kawase. A High-Speed Best Match Retrieval Method for Japanese Text. Submitted
to COLING’94, 1994.
12. Irene Stadnyk and Robert Kass. Modeling Users’ Interests in Information Filters. Communications of the
ACM, 35(12):49–50, December 1992.