Using The Wisdom of The Crowds For Keyword Generation

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

WWW 2008 / Refereed Track: Data Mining - Log Analysis April 21-25, 2008 · Beijing, China

Using the Wisdom of the Crowds for Keyword Generation

Ariel Fuxman Panayiotis Tsaparas


Microsoft Research Microsoft Research
Mountain View, California Mountain View, California
arielf@microsoft.com panayiotis.tsaparas@microsoft.com

Kannan Achan Rakesh Agrawal


Microsoft Research Microsoft Research
Mountain View, California Mountain View, California
kachan@microsoft.com rakesha@microsoft.com

ABSTRACT word research). This problem is important for both the


In the sponsored search model, search engines are paid by search engines and the advertisers. Providing a broad set
businesses that are interested in displaying ads for their site of relevant keywords to an advertiser enables the search en-
alongside the search results. Businesses bid for keywords, gine to tap the long tail of small businesses and organiza-
and their ad is displayed when the keyword is queried to tions. At the same time, such small businesses now have in-
the search engine. An important problem in this process is expensive access to a powerful advertising medium, and they
keyword generation: given a business that is interested in can attract traffic to their site by bidding on the appropri-
launching a campaign, suggest keywords that are related to ate keywords. All major search engines provide services for
that campaign. We address this problem by making use of keyword research (e.g., Google’s AdWords Keyword Tool1 ,
the query logs of the search engine. We identify queries re- Overture/Yahoo! Keyword Selector Tool2 , and Microsoft
lated to a campaign by exploiting the associations between adCenter Labs’ Keyword Group Detection3 ).
queries and URLs as they are captured by the user’s clicks. Previous approaches to the keyword generation problem
These queries form good keyword suggestions since they cap- have exploited the content of either Web pages [10, 20, 26]
ture the “wisdom of the crowd” as to what is related to a or search engine results [14]. In this paper, we tackle the
site. We formulate the problem as a semi-supervised learn- problem by making use of search engine query-click logs, an
ing problem, and propose algorithms within the Markov approach that has received limited attention in the literature
Random Field model. We perform experiments with real [4]. Search engine query-click logs maintain the queries that
query logs, and we demonstrate that our algorithms scale to users pose to the search engine and the documents that are
large query logs and produce meaningful results. clicked in return. Clicks define a strong association between
queries and URLs. The semantics of a query is captured
in the URLs that are clicked as a result of the query, while
Categories and Subject Descriptors: H.3.5 [Informa- the queries that result in a click to a URL provide a short,
tion Systems]: Information Storage and Retrieval — On-line concise description of that URL. We exploit this reinforcing
Information Services; J.0 [Computer Applications]: General relationship between queries and URLs to find queries that
are related to the interests of the advertiser. Our approach
General Terms: Algorithms has the advantage that it exploits the proverbial “wisdom of
the crowds” for keyword generation. Furthermore, the sug-
Keywords: Absorbing Random Walks, Keyword Genera-
gested keywords take into account the click-through traffic
tion, Markov Random Fields,Query Click Logs, Sponsored
that they generate in the search engine, and as a result they
Search
are more directly monetizable.
As an illustrative example, suppose that the owner of the
1. INTRODUCTION shoes.com online store decides to launch an ad campaign.
The main source of income for search engines is web search We can safely assume that most of the queries that end up
advertising, which places relevant advertisements together in the Web site of shoes.com come from users interested
with the search engine results. Given a specific keyword in buying shoes. Furthermore, the click-logs enable us to
(a single word or a short phrase), advertisers bid for the expand this immediate set of queries further by exploiting
keyword, and the winner of the auction has her ads displayed semantic associations between queries and documents. Con-
as sponsored links next to the “algorithmic” results of the sider the sample of a search engine click log shown in Table 1.
search engine. Since some user issued the query “running shoes” and clicked
The problem of identifying an appropriate set of keywords on www.shoes.com, we can assume that this query is about
for a specific advertiser is called keyword generation (or key- shoes. Notice that the query “running shoes” also resulted
Copyright is held by the International World Wide Web Conference Com- 1
mittee (IW3C2). Distribution of these papers is limited to classroom use, http://adwords.google.com/select/KeywordToolExternal
2
and personal use by others. http://inventory.overture.com/d/searchinventory/suggestion
3
WWW 2008, April 21–25, 2008, Beijing, China. http://adlab.msn.com/contextSim/Default.aspx
ACM 978-1-60558-085-2/08/04.

61
WWW 2008 / Refereed Track: Data Mining - Log Analysis April 21-25, 2008 · Beijing, China

random walk algorithm can be modeled within this


Table 1: A sample of a click log framework, and we consider an alternative approach.
Query Clicked Url
running shoes www.shoes.com • We perform an experimental study in the context of a
running shoes www.runningshoes.com large-scale click log. The experiments show that this
reebok shoes www.runningshoes.com approach can be used to produce large, high-quality
rebok shoes www.runningshoes.com lists of keywords with minimal effort. For example, in
one of the experiments we show that by leveraging a
list of 12 popular domains, it is possible to construct a
list of 500,000 keywords, 95.9% of which are relevant
in a click to www.runningshoes.com, which in turn attracted to the desired concept.
clicks from the queries “reebok shoes” and “rebok shoes”. We The rest of the paper is structured as follows. In Section
may then conclude (perhaps with less certainty) that these 2, we define the keyword generation problem. In Section 3,
two queries and the URL www.runningshoes.com are also we present an algorithm for the keyword generation prob-
about shoes. Notice that although the latter query has a lem that is based on random walks with absorbing states.
misspelling, it still represents an interest in buying shoes of In Section 4, we demonstrate how the keyword generation
a specific brand, and is therefore valuable to the advertiser. problem and the random walk algorithm can be formulated
We can now define the keyword generation problem as fol- within the Markov Random Field model, and we consider
lows. Given a concept (such as shoes), a set of elements that an alternative algorithm within the Markov Random Field
represent the concept (such as a set of URLs), and the re- model. In Section 5, we perform an experimental valida-
lationships between the documents and the queries, obtain tion of our techniques using real click logs. In Section 6 we
a set of keywords that best capture the concept. We for- present related work, and in Section 7 we give some conclud-
mulate the problem as a semi-supervised learning problem. ing remarks.
The input consists of (i) a set of labeled objects about a
concept (e.g., the URLs in the shoes.com domain); (ii) a set 2. PROBLEM DEFINITION
of unlabeled objects (the remaining URLs and the queries
in the log); and (iii) a set of constraints between labeled and We now define the click-based keyword generation problem.
unlabeled objects (the click log). The goal is to label some We assume that the following is given as input.
of the unlabeled elements in a meaningful way. 1. A search engine click log L. The search engine click log
There is an extensive literature in the area of semi-super- consists of triples hq, u, fqu i, where q is a query, u is the
vised learning [7]. In this work, we employ Markov Ran- URL of a document, and fqu is the number of times
dom Fields [17] to model the query click graph. Specifically, that the users issued query q to the search engine and
the clicks quantify pairwise relationships between documents clicked on URL u. We use Q and U to denote the set
and queries, and define a bipartite graph which induces a of all queries and all URLs, respectively, in the click
Markov Random Field. Informally, in a Markov Random log L. We have that L ⊆ Q × U × N + .
Field the probability that an object is associated to a label We will consider the click log L as a weighted bipartite
depends only on the labels of its neighbors. In our con- graph G = (Q, U, E), henceforth called the click graph.
text, this means that the probability of a query being as- The URLs U and queries Q constitute the partitions
sociated to a concept is determined by the documents that of the graph, and for every record hq, u, fqu i in the log,
were clicked for this query (and, analogously, for the proba- there is an edge (q, u) ∈ E with weight fqu .
bilities of the documents).
The Markov Random Field model lends itself to differ- 2. A set of concepts C = {c1 , ..., ck }. The concepts repre-
ent inference algorithms for computing the probability dis- sent abstract themes that the advertiser is interested
tribution over the labels for unlabeled objects. The main in. Concepts may be general (e.g., shoes), or specific
algorithm we consider in this paper makes use of absorbing (e.g., running shoes for teenagers). In the simplest
random walks. This is an inference algorithm for Gaussian case, C consists of a single concept provided by the ad-
Markov Random Fields, that computes the probability of a vertiser. In the more complex one, it is a full taxonomy
query to belong to a concept as the average of the probabil- of different classes.
ity of the clicked URLs for this query (similarly for URLs). 3. A seed set S ⊆ U × C of URLs in the click log that are
We also consider a variational inference algorithm where we manually assigned to the concepts in C. The seed set
attempt to maximize the entropy of an approximating label S consists of pairs hu, ci where u ∈ U and c ∈ C is the
distribution while respecting the underlying constraints. label of concept c. The set of URLs assigned to the
The contributions of this paper are the following: concept c can be thought of as a representation of the
concept c in terms of URLs.
• We provide an approach for keyword generation based
on exploiting the query-click graph. Given G, C, and S the goal of the keyword generation
problem is to populate the concepts in C with queries from
• We propose an algorithm for keyword generation based Q. These queries will be used as keyword suggestions to the
on a random walk with absorbing states. The al- advertisers that are interested in the specific concept.
gorithm is intuitively appealing and performs well in Note that in the problem definition we defined the seed
practice. set to be a set of labeled URLs. We could also define the
seed set as a set of labeled queries, or a mix of queries and
• We define a formal framework for keyword generation URLs. For simplicity, we will restrict ourselves to the case
in terms of Markov Random Fields. We show how the that the seed set consists only of URLs.

62
WWW 2008 / Refereed Track: Data Mining - Log Analysis April 21-25, 2008 · Beijing, China

3. A RANDOM WALK ALGORITHM where


We begin the algorithmic exploration of the keyword gen- fqu
eration problem by describing an algorithm that is based on wqu = P
u:(q,u)∈E fqu
a random walk with absorbing states. The algorithm is intu-
itively appealing and can be implemented efficiently. In the is the transition probability from query q to URL u, and
next section, we show the connection between this algorithm P (ℓu = c) is the probability that a random walk that starts
and the theory of Markov Random Fields. from URL u ends up being absorbed in class c. For all URLs
Recall the motivating example presented in the previous u in the seed set, we have that P (ℓu = c) = 1 if the pair
section. The advertiser is interested in the general concept of hu, ci belongs in the seed set, and zero otherwise. For all
“shoes” and the seed set consists of the URL www.shoes.com. other URLs, the probability P (ℓu = c) is again recursively
Starting from this URL, we were able to associate the query computed as
“running shoes” to the concept “shoes” as some user had X
P (ℓu = c) = (1 − α) wuq P (ℓq = c) (2)
posed this query and clicked on the above URL. That is, we
q:(u,q)∈E
assumed that the query is associated to the concept “shoes”
because one of its immediate neighbors (i.e., the URL) was where
also related to this concept. We also made a similar assump- fqu
tion for URLs associated to queries about shoes. Intuitively wuq = P
fqu
the queries (and URLs) are related to the concept of shoes q:(q,u)∈E

because they are connected closely in the graph to the seed is the transition probability from u to q.
set that represents this concept. The iterative process defined by Equations 1 and 2 is
We capture this intuition using a random walk. For some known to converge [9]. The computation has interesting con-
query q ∈ Q, we compute the affinity of q to some seed nections with electrical networks. Consider the click graph
node s ∈ S as the probability that a random walk that as an electrical network, where each edge (q, u) is a wire, and
starts from q ends up at node s. The affinity of q to the the weight fqu of the edge is the conductance of the wire. If
class c ∈ C is the probability that the random walk that we apply a unit of voltage to the nodes in the seed set, and
starts from q ends up in any seed node in the class c. Note we ground the absorbing node ω, then the probabilities that
that in this walk the nodes in the seed set act as absorbing we compute are the voltages on the non-seed nodes.
nodes, that is, sink nodes in the state transition graph from The outline of the Absorbing Random Walk (ARW) algo-
which the random walk cannot escape. Absorbing nodes rithm is shown in Algorithm 1. Note that in lines 5 and 9
naturally model the fact that for the URLs in the seed set of the algorithm we discard probabilities that are below a
we have complete certainty about their class since they were threshold γ. This step is mainly for efficiency purposes. If
manually assigned to that class, and thus their assignment we do not do any pruning, then all nodes that are reachable
should not be changed by the algorithm. from the seed set will get some probability of belonging to
Note that for the case where there exists a single seed node the class regardless of how small that probability is. How-
s in the graph, every query that is reachable from the seed ever, we are not interested in such queries, since they are
node will be absorbed with probability 1 at node s. How- too far from the seed set. Making the probabilities of these
ever, the further away a query is from the seed URL, the less nodes to be zero conceptually means that we temporarily
related it should be to the URL’s class. We model this by make them absorbing nodes and place them in the null class.
introducing an absorbing “null class” node ω to the graph, At convergence, we have a set of null class nodes that define
and connecting this node to every other node in the graph. the boundary of the graph that we have explored.
The effect of this null class node is that long paths are penal-
ized by the algorithm, and the probability of a node being Algorithm 1 The ARW algorithm for a single class
absorbed at the seed node decreases exponentially with the
Input: the seed set S for class c, the click-graph G, the
length of the path.
threshold parameter γ, the transition probability α to ω
Performing a random walk for every query in the graph
Output: P (ℓq = c), for every query q.
is computationally prohibitive for a large graph. However,
1: for u ∈ S do P (ℓu = c) = 1
there is a simple iterative algorithm that can compute the
2: repeat
class probabilities efficiently. We will now describe the al-
3: for all q ∈ Q do
gorithm for the case that we have a single concept c (that P
4: P (ℓq = c) = (1 − α) u:(q,u)∈E wqu P (ℓu = c)
is, C = {c}), and then show how to generalize to the case of
multiple classes. 5: if P (ℓq = c) < γ then P (ℓq = c) = 0
Let ℓq (or ℓu ) denote the random variable pertaining to the 6: end for
concept label for query q (or URL u). We want to compute 7: for all u ∈ U \ S do P
P (ℓq = c), that is, the probability that a random walk that 8: P (ℓu = c) = (1 − α) q:(u,q)∈E wuq P (ℓq = c)
starts from q will be absorbed at some node of the class c. 9: if P (ℓu = c) < γ then P (ℓu = c) = 0
Let α be the probability of making a transition to the null 10: end for
class absorbing node, from any node in the graph. Then we 11: until convergence
have that 12: Output P (ℓq = c), for every query q in Q

The algorithm generalizes naturally to the case where


X
P (ℓq = c) = (1 − α) wqu P (ℓu = c) (1)
u:(q,u)∈E
there are multiple concepts. More specifically, Equations 1
and 2 can still be used for computing the probabilities P (ℓq =
c) and P (ℓu = c) for each concept c ∈ C. We can thus

63
WWW 2008 / Refereed Track: Data Mining - Log Analysis April 21-25, 2008 · Beijing, China

perform the same iterative algorithm for all classes simul- Further optimizations are possible. Once the run for con-
taneously. However, this process is memory intensive since cept ci is completed, we know that with probability Pqi =
it requires to maintain probability vectors for all k classes. P (ℓq = ci ), the query q belongs to class ci . Therefore,
Furthermore, the fraction of the graph that will be explored the probability mass “available” for the remaining classes
will increase substantially even if we prune nodes with low is 1 − Pqi . When considering another class cj , we can create
probabilities, since the size of the seed set may be signifi- a random jump with probability Pqi to the null absorbing
cantly larger. node. This will result in faster convergence for each individ-
This problem can be addressed by considering the con- ual run.
cepts one at the time. However, this should be done care-
fully so as to take into account all the information contained 4. MARKOV RANDOM FIELDS
in the seed set. Let S = {S1 , · · · , Sk } be the partition of the
We now demonstrate how the keyword generation prob-
seed set into the k concepts. Applying Algorithm 1 directly
lem and the ARW algorithm that we described in the previ-
for each set Si is incorrect, since we discard the information
ous section can be formulated within the Markov Random
we have about the seed nodes in the other concepts. For
Field model. In order to compare different Markov Random
these nodes, we know with certainty that they cannot be-
Field formulations, we also present another algorithm for
long to class ci ; thus, they are negative examples for the class
computing the class probabilities. In Section 5, we report
ci . We can easily model this effect in the absorbing random
experiments comparing this algorithm against the ARW al-
walk: when considering class ci we make all the remaining
gorithm.
seed nodes in S \ Si to be in the null class.
The outline of the ARW algorithm for multiple classes 4.1 Model and definitions
is shown in Algorithm 2. The outer loop of the algorithm
A Markov Random Field (MRF) is an undirected graph,
iterates over all classes, and for each class we perform an
where each node in the graph is associated with a random
absorbing random walk. The absorbing random walk for
variable and the edges model the pairwise relationships be-
each individual class ci is very similar to Algorithm 1, albeit
tween the random variables. MRFs define a probability
with two distinct differences. In line 2 of the algorithm, we
model where the value of a random variable in the field de-
fix the probability of the nodes in S \ Si to belong to the
pends on the prior knowledge we may have for that variable,
class ci to zero, thus making them into null absorbing nodes.
and the values of all the adjacent variables in the graph.
Also, when we update the URL probabilities (lines 9 – 12),
The characteristic of the MRFs is the Markovian assump-
we do not update the probabilities for any of the URLs in
tion that the value of a random variable is independent of
S. In effect, the seed set for the absorbing random walk
the rest of the graph, given the values of all its neighbors.
for class ci consists of the entire set S; the nodes in Si have
Given a set of observations about certain variables in the
probability 1, while the nodes in S \Si have probability zero.
MRF, an important problem is to compute the most likely
The effect of this optimization is that the memory foot-
assignment of values for the other unobserved variables.
print is decreased dramatically, since we only need to con-
We model the keyword generation problem using an MRF
sider a single class at the time. Single class runs are also
as follows. Consider the random variables ℓq and ℓu that
fast, since the size of the explored graph is smaller. Further-
represent the concept label for each query q and URL u,
more, the null absorbing nodes in S \ Si “block” the random
respectively. The domain of these variables is the set of
walk, and thus speed up the convergence. Note also that the
concept labels C. The pairwise relationships between the
algorithm is amenable to parallel execution.
nodes are defined by the edges derived from the click graph,
and their strength depends on their weight. The seed set S
defines the observations for the variables. The observations
Algorithm 2 The ARW algorithm for multiple classes in our case are of the form P (ℓu = c) = 1 for all pairs hu, ci
Input: the seed set S = {S1 , · · · , Sk } for concepts C = in the seed set. We are interested in assigning concepts to
{c1 , · · · , ck }, the click-graph G, the threshold parameter the queries and the unlabeled URLs in the graph in such
γ, the transition probability α to ω. a way that we respect the constraints defined by the click
Output: P (ℓq = c), for every query q and every class c. graph.
1: for all ci ∈ C do In the MRF model this is translated to finding the assign-
2: for u ∈ S \ Si do P (ℓu = ci ) = 0 ment of values to the random variables such that the pos-
3: for u ∈ Si do P (ℓu = c) = 1 terior probability P ({ℓq }, {ℓu }|S) is maximized. The main
4: repeat assumption is that given the Markovian assumption we can
5: for all q ∈ Q do P decompose the joint probability distribution to factors de-
6: P (ℓq = c) = (1 − α) u:(q,u)∈E wqu P (ℓu = c) fined over the edges in the graph. For some edge (q, u) in the
7: if P (ℓq = c) < γ then P (ℓq = c) = 0 MRF, we define a compatibility function ψqu (ℓq , ℓu ) (also
8: end for called potential function) that scores the relationship be-
9: for all u ∈ U \ S do P tween q and u. The choice of potential function typically
10: P (ℓu = c) = (1 − α) q:(u,q)∈E wuq P (ℓq = c) reflects the prior knowledge one has about the problem un-
11: if P (ℓu = c) < γ then P (ℓu = c) = 0 der consideration. As noted earlier, in our case a click to a
12: end for URL from a query can be viewed as an endorsement of sim-
13: until convergence ilar concept. Hence, a larger click count implies a stronger
14: Output P (ℓq = c), for every query q and class c agreement of concepts between the connected nodes. The
15: end for definition of ψ will be made to reflect this prior belief about
concept agreement and click counts.
Given the potential function we can now succinctly write

64
WWW 2008 / Refereed Track: Data Mining - Log Analysis April 21-25, 2008 · Beijing, China

the joint probability distribution of the MRF as follows: as to penalize discrepancy between the labels of edges with
Y Y large weight.
P ({ℓq }, {ℓu }, S) ∝ ψqu (ℓq , ℓu ) φs (ℓs ; k) (3) Finding the optimal labeling is now a tractable problem.
(q,u)∈E s∈S Zhu et al. [27] demonstrate that the optimal labeling is har-
monic and the optimal value can be found by iteratively
where φs (ℓs ; k) = P (ℓs = ck ), and P (ℓs = ck ) = 1 if hs, ck i ∈
setting each label value to be the weighted average of the la-
S.
bels of its neighbors. Note that this is exactly the property
The exact definition of the MRF depends on the choice of
satisfied by the absorbing random walk algorithm we de-
the actual potential function, which dictates the way that
scribed in the previous section. The probability of a query
the probabilities are computed. However, given a specific
to belong to a class is computed as the weighted average of
potential function ψqu (ℓq , ℓu ), finding the optimal label as-
the probabilities of its neighbors. The harmonic property
signment is NP-hard for graphs that have cycles, which is the
guarantees that the iterative computation leads to a unique
case with the click-log graph. Thus, approximate inference
solution. Thus the ARW algorithm can be naturally viewed
algorithms are required. In the next sections we describe
as an inference algorithm in the Gaussian Markov Random
two different ways for defining the potential function, and
Field.
performing approximate inference. The first approach leads
to the absorbing random walk algorithm we described in the 4.3 Variational Inference and the Mean Field
previous section. Algorithm
4.2 Gaussian Markov Random Fields For comparison purposes we also consider a different Markov
Random Field formulation, and a different algorithm for
One way to relax the MRF inference problem is to assume computing the posterior distribution. Unlike the Gaussian
that instead of having discrete labels C we have continuous Markov Random Field, in this case we consider the labels
ones. For simplicity assume that C = {0, 1}, that is we have to be discrete. To model the fact that for edges with large
only two classes and we want to assign each query q and weight we want the assigned labels to agree, we set the po-
URL u to one of those classes. The continuous relaxation tential function to be
assumes that C = [0, 1], that is the class labels ℓq and ℓu are 
now real numbers in the [0, 1] interval. These continuous exp(λfqu ), if ℓu = ℓq
ψ(ℓq , ℓu ) =
labels can then be converted to discrete ones by rounding α, if ℓu 6= ℓq
them to the closest discrete value. where α is a constant. Therefore, the potential function
Given this relaxation assumption, we can now work within rewards agreement in class for query-URL pairs when there
the Gaussian Markov Random Field model. The potential is a large number of clicks.
function ψqu (ℓq , ℓu ) is now defined as follows As we have already argued, finding the optimal label con-
figuration that maximizes the posterior probability is in-
ψqu (ℓq , ℓu ) = exp(−fqu (ℓq − ℓu )2 )
tractable. In the previous section, we addressed this prob-
The joint distribution of the MRF can be written as lem by relaxing the labels. Here, we relax our requirements
Y on the posterior distribution, and we approximate it by one
P ({ℓq }, {ℓu }, S) ∝ exp(−fqu (ℓq − ℓu )2 ) that is easier to compute. This approach falls under the
(q,u)∈E general framework of variational inference.
X Variational inference is an important class of approximate
= exp( −fqu (ℓq − ℓu )2 ) inference algorithms [13]. The goal of variational inference is
(q,u)∈E
to approximate the true intractable posterior distribution P
with a tractable distribution P̂ that has a simpler form. The
where ℓu = 1 for all seed URLs u ∈ S. We have that parameters of the approximating P̂ distribution are com-
puted by directly optimizing a similarity measure between
P ({ℓq }, {ℓu }, S)
P ({ℓq }, {ℓu }|S) = the approximating and the true distribution. A common
P (S) choice for the (dis)similarity measure is the Kullback-Liebler
where P (S) is the probability of the observations in the divergence (KL-divergence). The KL-divergence between
seed set. Since P (S) is independent of the labels {ℓq } and two distributions P̂ and P is defined as
{ℓu }, maximizing P ({ℓq }, {ℓu }|S) is the same as maximizing
P ({ℓq }, {ℓu }, S). Finding the label assignment that maxi- h i X P̂ (x)
mizes the probability P ({ℓq }, {ℓu }, S) is the same as mini- K P̂ kP = P̂ (x) log (4)
x P (x)
mizing
X For the variational inference algorithm described here, we
E = − log P ({ℓq }, {ℓu }, S) ∝ fqu (ℓq − ℓu )2 define the approximate distribution to take a fully factored
(q,u)∈E form
Y Y
The quantity E is often referred to as the energy of the P̂ ({ℓq }, {ℓu }|S) = P̂q (ℓq ) P̂u (ℓu ) (5)
Markov Random Field. This minimization criterion requires q u

that for edges (q, u) with large weight fqu the labels ℓq and ℓu This is also known as the Mean Field approximation. The
should be close, so that the value fqu (ℓq − ℓu )2 is minimized. distributions P̂q and P̂u in the above equation model the pos-
This is intuitive; a large number of clicks fqu implies a strong terior over queries and URLs respectively; more specifically,
association between the query q and URL u, and thus their from the point of view of parametrization, P̂q and P̂u are
labels should be similar. This effect is a direct consequence multinomial distributions that define the posterior over cat-
of the choice of the potential function which was chosen so egories. For example, if there are k categories {c1 , c2 , . . . ck },

65
WWW 2008 / Refereed Track: Data Mining - Log Analysis April 21-25, 2008 · Beijing, China

then the parametrization of P̂q will have k numbers, P̂qi = minimum, and we obtain a distribution over the class la-
bels.
P
P̂q (ℓq = ci ) such that i P̂qi = 1.
We now seek to estimate the parameters of the approxi-
mating distributions by minimizing the KL divergence be- 5. EXPERIMENTS
tween it and the true posterior. The minimization will be We now present an experimental evaluation of our ap-
performed with respect to every parameter of the P̂ distri- proach, centered around the ARW algorithm. In Section
bution. The KL-divergence can be computed as follows. 5.1, we present the experimental setup. In Section 5.2, we
study the properties of the ARW algorithm and the qual-
h i X P̂ (ℓq , ℓu ) ity of the results that it returns. Finally, in Section 5.3, we
K P̂ kP = P̂ (ℓq , ℓu ) log compare the ARW algorithm with other algorithms, includ-
P (ℓq , ℓu |S)
ℓq ,ℓu ing the Mean Field algorithm presented in Section 4.3.
P̂ (ℓq , ℓu )P (S) 5.1 Experimental setup
X
= P̂ (ℓq , ℓu ) log
P (ℓq , ℓu |S)P (S)
ℓq ,ℓu
The query click graph. We perform experiments on a
X P̂ (ℓq , ℓu ) click graph constructed from a snapshot of the query log
= P̂ (ℓq , ℓu ) log + log(P (S))
P (ℓq , ℓu , S) obtained from a major search engine. The graph has 41
ℓq ,ℓu
X million queries, 55 million URLs, and 93 million edges. The
= −H(P̂ ) − P̂ (ℓq , ℓu ) log P (ℓq , ℓu , S) + log P (S) URLs correspond to both clicked documents and ads. The
ℓq ,ℓu query-URL pairs account for 490 million clicks in the log.
= E + log P (S) (6) Each edge has a frequency of at least two: we pruned all
query-URL pairs with just one click in the snapshot, since
The first term E in Equation 6 is generally referred to we considered them to be too rare to define a meaningful
as the mean field free energy4 , the intuition being low en- association.
ergy configurations are more stable (and likely). The sec-
ond term log P (S) is the log probability of observations, Seed sets. We use three different seed sets that are mo-
and does not depend on the parameters of the P̂ distribu- tivated by specific scenarios. In the first scenario, the seed
tion; hence we can safely ignore the term while performing set is created by just specifying the domain of the adver-
the minimization and work only with E. To ensure that tiser’s Web site. We consider the scenario in our motivating
P example in Section 1 where an advertiser is interested in
c P̂q (ℓq = c) = 1 for all queries q (and similarly for P̂u ) we
need to introduce appropriate Lagrangian multipliers during promoting the online shoe store www.shoes.com. The goal
the minimization. The minimization is performed by doing of the advertiser is to find queries that are related to her do-
gradient descent on the parameter space {P̂qi }, {P̂qi } main, and could potentially result in clicks to her advertise-
The expansion of the mean-field energy term ment. Therefore, the URLs in the domain shoes.com can
X be thought of as a representation of the abstract concept
E = −H(P̂ ) − P̂ (ℓq , ℓu ) log P (ℓq , ℓu , S) “shoes” that the advertiser is interested in. We construct
ℓq ,ℓu the seed set by obtaining all the URLs in the domain (i.e.,
all URLs that start with www.shoes.com) that appear in the
results in two terms, one that is independent of observations click log, and we label them with the “shoes” concept. We
S, and the other encoding it. The former corresponds to the call this seed set Shoes.
(negative) entropy of the approximating distribution and the In the second scenario, the advertiser leverages publicly-
latter can be associated with the expectation of the joint accessible data to produce the seed set. Consider an adver-
distribution with respect to the approximating distribution. tiser who is interested in promoting a health-specific site,
During the minimization, the first terms favors a P̂ distri- that is, she is interested in the abstract concept of “health”.
bution with high entropy, while the evidence term, which The advertiser is interested in capturing queries related to
encodes S, tends to explain the observations by pulling P̂ her site, but also queries that go to sites which are similar
away from the uniform distribution. The updates for the P̂q to hers and are authoritative in the health domain. In this
and the P̂u distributions will be coupled whenever there is case, the seed set can be obtained by resorting to a collec-
an edge between the particular query and URL. Using the tion of popular health sites. For our experiment, we use the
fact that the distribution P ({ℓq }, {ℓu }, S) for a given config- 12 most popular sites in the health domain obtained from
uration readily factors into a product of pairwise potentials, Nielsen NetRating5 . The sites are the following:
we can now write the update equations as follows.
health.yahoo.com, webmd.com, health.msn.com,
X X kidshealth.org, www.prevention.com, www.cdc.gov,
log P̂q (ℓq ) = P̂u (ℓu = c) log ψqu (ℓq , ℓu ) − 1 + λq www.mayoclinic.com, familydoctor.org,
u:(q,u)∈E c emedicine.com, health.ivillage.com,
X X parenting.ivillage.com, www.medicinenet.com
log P̂u (ℓu ) = P̂q (ℓq = c) log ψqu (ℓu , ℓq ) − 1 + λu
q:(q,u)∈E c Similarly to the Shoes seed set, once we have the domains
we construct the seed set by obtaining all the URLs in the
where λq and λu are the Lagrangian multipliers that nor- click log from those domains. We call this seed set Health.
malize the resulting distributions. Iterating yields a local In the last scenario, we assume that the search engine of-
4 fers the advertiser a set of pre-existing concepts in the form
Different from the Markov Random Field energy we mini-
5
mized before. http://www.nielsen-netratings.com/

66
WWW 2008 / Refereed Track: Data Mining - Log Analysis April 21-25, 2008 · Beijing, China

query set γ |Qik ∩ Q0k | |Qik ∩ Q0k |/|Q0k | query set γ |Qik ∩ Q0k | |Qik ∩ Q0k |/|Q0k |
Q0k 10−5 50,000 100% Q0k 0.0001 500000 100%
Q1k 10−4 45,645 91.3% Q1k 0.001 456892 91.3 %
Q2k 10−3 37,338 74.7% Q2k 0.01 393582 78.7%
Q3k 10−2 15,276 31.2% Q3k 0.05 211391 42.3%

Table 2: Effect of threshold γ for α = 0.001 for Shoes Table 4: Effect of threshold γ for α = 0.001 for Health

query set α |Qik ∩ Q0k | |Qik ∩ Q0k |/|Q0k | query set α |Qik ∩ Q0k | |Qik ∩ Q0k |/|Q0k |
Q0k 0.0001 50000 100 % Q0k 0.0001 500,000 100%
Q1k 0.001 49812 99.6 % Q1k 0.001 493,258 98.6%
Q2k 0.01 48259 96.5% Q2k 0.01 456,344 91.2%
Q3k 0.1 43608 87.2% Q3k 0.1 399,607 80%
Q4k 0.2 41359 82.7% Q4k 0.2 381,747 76.3%
Q5k 0.3 39564 79.1% Q5k 0.3 370,751 74.1%

Table 3: Effect of α for γ = 0.0001 for Shoes Table 5: Effect of α for γ = 0.0001 for Health

of a taxonomy. The advertiser can select the concepts that the fraction of Q0k that it represents. We can observe in
she interested in and choose keywords from these concepts. Table 2 that if we decrease γ by one order of magnitude
Our objective is to populate the taxonomy with keywords (from 10−5 to 10−4 ), we still retain 91.3% of the queries in
by assigning queries to the different concepts. For the seed the intersection. If we decrease it by two orders of magnitude
set, we assume the existence of a directory that maps URLs (from 10−5 to 10−3 ) the intersection still contains 74.7% of
to some fixed taxonomy. In our experiment, we employ a the queries in Q0k . For γ = 10−2 , the intersection drops
directory designed for marketing purposes that is similar in to 31.2%. However, for this high value of the threshold,
spirit to the ODP directory. We focus on the third level of the algorithm retrieves only 15,360 queries, 99.4% of which
the taxonomy, which contains 1,158 categories. We consider intersect with Q0k .
each third-level category as a distinct concept, and we set We perform a similar study for the effect of the param-
the seed set for that concept to be the URLs in the cor- eter α. We set the threshold γ to 10−4 and run the ARW
responding category in the directory. We call this seed set algorithm on the Shoes seed set for different values of α.
Directory. For each run we consider the top k = 50K queries. The
results are shown in Table 3. We denote the different query
5.2 Experiments with the ARW algorithm result sets as Q0k , . . . , Q5k , corresponding to values of α rang-
The goal of the following experiments is to evaluate the ing from 0.0001 to 0.3. As before, for each query set Qik , we
properties of the ARW algorithm and the quality of the re- show the size of the intersection with Q0k , and the fraction
sults that it returns. Given a concept c and a seed set S, of Q0k that it represents. We observe in the table that if
the ARW algorithm returns a set of queries related to the we decrease the threshold by one order of magnitude (from
concept. The queries are then sorted according to the prob- 0.0001 to 0.001), we still retain 99.6% of the queries. Even
ability of belonging to the concept, and the top-k queries Qk if we decrease the threshold by three orders of magnitude
are returned as candidates to the advertiser. We are inter- (from 0.0001 to 0.1), the intersection is still 87.2% of the
ested in evaluating the answer set Qk . We first investigate top 50K queries.
how the parameters of the ARW algorithm affect the set Qk , We observe very similar trends for the Health seed set.
and then the quality and properties of the queries in Qk for The results are shown in Tables 4 and 5. We can thus
different values of k. We use the Shoes and Health seed sets conclude that there is a wide range of values for γ and α
for these experiments. for which the query set Qk returned by the algorithm re-
mains relatively unchanged. The algorithm is robust to the
Setting the ARW parameters. We begin by studying
changes of the parameters. This is not surprising since we
the effect of the ARW parameters α and γ on the result set
expect nodes that are close to the seed set to be in the top-k
Qk . Increasing α causes query probabilities to decrease, so
results. The ranking of these nodes should not be signifi-
very large values of α, in combination with the threshold γ,
cantly affected by moderate changes to α or γ. This has
will cause the size of the set Qk to diminish. However, it is
implications for the efficiency of the algorithm as well, since
not clear what the effect of a small increase of α is on the
we can pick relatively high values for the parameters and
ranking produced by the query probabilities, and thus the
speed-up the convergence of the algorithm, while retaining
contents of Qk . Similarly, increasing γ will cause some more
consistent results.
nodes to be pruned, but it is not clear what effect it will
have on the high-probability queries. Query set evaluation. We now evaluate the quality of the
We study the effect of the threshold value γ for the Shoes query set Qk produced by the ARW algorithm. We consider
seed set. We set α to 0.001, we run the ARW algorithm for the following two metrics for evaluating the results.
different values of γ, and we observe the change in the result
set Qk for k = 50K. The results are presented in Table 2. In • Relevance: We compute the relevance ratio R(Qk ) of
this table, we consider query sets Q0k , . . . , Q3k corresponding the query set Qk as the fraction of relevant queries in
to values of γ ranging from 10−5 to 10−2 . For each query the set S. The evaluation of queries was done by hu-
set Qik , we show the size of the intersection with Q0k , and man evaluators. In each experiment, the judges were

67
WWW 2008 / Refereed Track: Data Mining - Log Analysis April 21-25, 2008 · Beijing, China

Relevance Indirectness Relevance Indirectness


k R(Qk ) N (Qk ) k R(Qk ) N (Qk )
5K 97.6% 64.2% 50K 97.1% 2.0%
10K 97.7% 78.1% 100K 96.9% 23.3%
20K 93.6 % 87.8% 200K 96.2% 56.7%
30K 88.3% 91.6% 300K 96.1% 70.8%
40K 84.7% 93.6% 400K 96.4% 78.1%
50K 82.4% 94.7% 500K 95.9% 82.4%

Table 6: Relevance and indirectness for Shoes Table 7: Relevance and indirectness for Health

100 100 100


100

80 80 80
80

Indirectness
Relevance
Indirectness
Relevance

60 60 60 60

40 40 40 40

20 20 20 20

0 0 0 0
0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5
K 4 K 4 K 5
x 10 K 5
x 10
x 10 x 10

(a) relevance R(Qk ) (b) indirectness N (Qk ) (a) relevance R(Qk ) (b) indirectness N (Qk )

Figure 1: Plots for evaluation metrics for Shoes Figure 2: Plots for evaluation metrics for Health

presented with a description of the concept and a set for k = 500K, the percentage of indirect queries is 82.4%.
of queries, and they were instructed to consider the This increase in indirectness does not come at the expense
queries as “good” if they could find a relationship be- of relevance, which is 97.1% in the former case and 95.9% in
tween the query and the concept, and “bad” otherwise. the latter. These numbers indicate that we can obtain high
quality results even when we consider queries that are not
• Indirectness: Given the seed set S, let Q(S) denote directly connected to the seed set in the click graph.
the set of queries that are directly connected to S. We
say that a query q is indirect, if q 6∈ Q(S). We de- 5.3 Comparison with other algorithms
note with N (Qk ) the fraction of indirect queries in a
result set Qk . Indirect queries are of particular inter- In this section we compare the ARW algorithm with two
est when the seed set contains URLs associated to the other algorithms. The first is the Mean Field algorithm as it
advertiser’s site, such as in the Shoes scenario, since is described in Section 4.3. The second is an algorithm that
the advertiser may be interested in discovering rele- categorizes a query by grouping the snippets of the returned
vant queries that have not been generating traffic to URLs into a snippet document, and classifying the snippet
her site. document. This method was first employed by Broder et
al. [6]. For the classification of the snippet document we
We first study the trade-off between the size of the query use a Rocchio classifier [18], trained using the content of
result set and its relevance. For each seed set, we run a ran- all the documents in the Directory seed set. The snippet
dom walk with α = 0.001 and γ = 10−4 . For each result set, document is constructed by merging the snippets of the top-
we evaluate the relevance of the top-k queries, for different 40 documents, a number that was shown to perform well by
values of k. The relevance results for Shoes and Health are Broder et al. [6]
shown in Tables 6 and 7, respectively. We also plot them Both algorithms we compare against are better defined
in Figures 1(a) and 2(a), respectively. For the Shoes ex- for the case that there are multiple concepts, so for this
periment, the relevance decreases with the value of k, but experiment we use the Directory seed set. We measure
even for the top 50K queries it is reasonably high (82.4%). relevance by sampling from the results and evaluating the
For the Health experiment, the relevance remains relatively query-category pairs. For the evaluation of individual pairs,
constant as we increase k, and it is 95.9% for the largest we consider the third level of the taxonomy. Given the large
result set that we consider (k = 500K). size of the taxonomy, for presentation purposes we aggregate
We then consider how indirectness changes for different the results into the 20 top-level categories of the taxonomy.
values of k. The results for Shoes are given in Table 6, and We present micro-averaged relevance [25] for each category.
plotted in Figure 1(b). Indirectness increases considerably The micro-averaged relevance for a high level category c is
when k goes from 5K to 20K (from 64.2% to 87.8%), and computed using the query-category pairs hq, c′ i, where c′
then continues to increase, but more slowly. Indirectness of ranges over all subcategories of c. The micro-average rel-
87.8% can be achieved with a relevance of 93.6%. Table 7 evance of category c is the fraction of such pairs that are
and Figure 2(b) show the indirectness values for the Health judged as “good”. Notice that even though we aggregate
seed set. Indirectness increases sharply from k = 50K to the values at the top level of the taxonomy, the queries are
k = 500K. Initially, only 2% of the queries are indirect, while actually evaluated at the third-level. Table 8 summarizes

68
WWW 2008 / Refereed Track: Data Mining - Log Analysis April 21-25, 2008 · Beijing, China

Category ARW
Micro-averages
Mean Snippets
6. RELATED WORK
Field Most of the work on keyword generation has focused on
Food and Drinks 99% 84.6% 92% extracting keywords from documents. Turney [20] proposed
Social Sciences and Humanities 91.3% 35% 78% GenEx, a rule-based keyword extraction system tuned us-
Computers and Computing 89.6% 71.8% 70.1% ing a genetic algorithm. Another well-known keyword ex-
Health and Wellness 97.8% 83.8% 84.9%
traction system is KEA [10], which employs a naive Bayes
Vehicles and Transportation 83.3% 90.9% 90.9%
Lawn and Garden 96.6% 66.7% 60% learning algorithm. Later work added web document related
Travel 94.7% 80% 96.3% features to KEA, such as the number of documents returned
Sports and Recreation 91.1% 51.9% 76.7 % by a search engine [21] and link information [15]. The use
Financial Services 96.7% 85.4% 82.3% of natural language techniques for keyword extraction was
Arts and Entertainment 88.1% 78.7% 78.6% initially studied by Hulth [12]. Yih et al. [26] proposed a sys-
Games Puzzles 94.2% 94.7% 77.8%
tem for extracting keywords from Web pages for contextual
Kids and Teens Lifestyle 89.3% 77.3% 72.2%
Society and Culture 67.1% 54.5% 78.7% advertising. Among other features, they use the frequen-
Business 87.2% 72.1% 70.9% cies of candidate keywords in the query log. However, they
Clothing and Shoes 98.4% 95.5% 95.7% do not employ click information (i.e., relationships between
Science 86.2% 57.3% 86.3% queries and documents). These techniques for keyword ex-
Educational Institutions 87.2% 34.4% 61% traction from documents can be viewed as complementary to
Families and Relationships 94.1% 90.9% 80%
ours. In particular, we can consider the keywords extracted
Animals 94.2% 90.9% 100%
Home Improvement 96.8% 84.2% 93.8% from the given document as a seed set (of queries) for our
Micro-average 88.2% 69% 80.2% click-based algorithm.
Some approaches in the literature exploit the results re-
Table 8: Micro-average relevance. Aggregated for turned by a search engine instead of the user clicks [1, 14].
top-level categories of the taxonomy, evaluated for The idea is to start with a seed set of keywords, submit
third-level categories. them to the search engine and then use the retrieved text
snippets or documents to extract relevant keywords. Joshi
and Motwani [14] introduced a notion of non-obviousness
which, though similar in spirit to the notion of indirectness
the results. The first column lists the 20 top-level categories presented in our work, is defined quite differently. In partic-
in the taxonomy. The second column corresponds to the ular, they assume that a set of keywords is given as input,
ARW algorithm with parameters α = 0.01 and γ = 0.001; and a term is considered non-obvious if it does not contain
the third column to the Mean Field algorithm; and the last any of the input keywords. To the best of our knowledge,
column to the snippets-based algorithm. the only previous approach to keyword generation that em-
The ARW algorithm has the highest micro-average rele- ploys the click graph is by Bartz et al. [4]; however, their
vance (88.2%, as opposed to 80.2% for the Snippets algo- techniques are quite different from ours (they employ logistic
rithm). Notice that although the Snippets algorithm has a regression and collaborative filtering techniques).
lower micro-average relevance than the ARW on the entire The techniques presented in this paper can be used to
sample, it has higher micro-average relevance for 5 out of populate queries within a taxonomy if a directory of URLs
the 20 classes that we consider. This suggests that, as fu- is used to represent the concepts. There are a number of
ture work, it may be fruitful to consider an approach that proposals in the literature that obtain a query category by
exploits both the content of documents and their clicks. On classifying the search results of the query (either documents
the other hand, the Mean Field algorithm has a lower micro- or snippets) [6, 19]. In contrast, we use clicks, and we do
average relevance of 69%. It produces the best results for not need to crawl content in order to train a classifier. More
two of the categories, but usually it exhibits the lowest rel- related to our approach is the work of Xue et al. [23], which
evance ratio. It appears that the ARW algorithm benefits considers the combination of the signal coming from docu-
significantly by the introduction of the null category, and ment content and the click logs, and shows an improvement
the exponential decay in the probability of reaching a node over pure content-based classification methods.
in the seed set. This effect is not modeled well in the case The click graph has been used extensively for other ap-
of the Mean Field algorithm, which results in poor perfor- plications. Some related approaches include the following.
mance in the case that there are weak connections between Beeferman and Burger [5] and Wen et al. [22] employ clus-
queries and URLs of non-relevant classes. Understanding tering techniques to determine query-to-query similarity. Xue
fully how the Mean Field algorithm can be optimized is also et al. [24] use the click graph to find document-to-document
an interesting problem for future work. similarities. Craswell and Szummer [8] consider random
This experiment offers also an interesting insight into how walk techniques on the click graph to produce a ranking
the ARW algorithm performs when the seed set contains of documents for image search. Baeza-Yates and Tiberi [3]
multiple concepts, as well as concepts of fine granularity. use the click graph to extract semantic relationships between
The third level of the taxonomy contains 1158 categories, queries.
some of which are fairly narrow. For example, under the top- In our approach, we start with a seed set (of queries or
level category “Arts and Entertainment”, there are third- URLs) and expand it in order to generate keywords. The
level categories such as “Movies/Film Festivals”, “Movies/A- seed set expansion problem has been considered in other con-
wards”, “Movies/Filmaking” and “Movies/Theaters”. The texts such as detecting hubs and authorities [16], community
micro-average relevance that we obtained indicates that the discovery [2], and detecting spam pages [11].
algorithm produces relevant results even for fine-grained cat-
egories of the taxonomy.

69
WWW 2008 / Refereed Track: Data Mining - Log Analysis April 21-25, 2008 · Beijing, China

7. CONCLUSIONS [7] Olivier Chapelle, Bernhard Scholkopf, and Alexander


We have introduced an approach to keyword generation Zien. Semi-Supervised Learning. The MIT Press, 2006.
that leverages the information available in the search engine [8] N. Craswell and M. Szummer. Random walks on the
click logs. Our approach requires minimal effort from the click graph. SIGIR, pages 239–246, 2007.
part of the advertisers. In some cases, it just suffices to [9] P. Doyle and L. Snell. Random Walks and Electrical
provide one domain in order to produce large lists of relevant Networks. Mathematical Association of America, 1984.
keywords. Promising experimental results demonstrate that [10] E. Frank, G. Paynter, I. Witten, C. Gutwin, and
our algorithms can scale to large query logs and produce C. Nevill-Manning. Domain-specific keyphrase
high-quality results. detection. Proc. of IJCAI, pages 668–673, 1999.
There are several directions for future work. One impor- [11] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen.
tant direction is to consider richer click graphs, with addi- Combating web spam with trustrank. VLDB, pages
tional attributes such as dwell times, and position of clicked 576–587, 2004.
documents in the click graph. We can also add other types [12] A. Hulth. Improved automatic keyword extraction
of edges to the graph. For example, the query reformula- given more linguistic knowledge. EMNLP, pages
tions made by users induce edges between queries; and hy- 216–223, 2003.
perlinks induce edges between URLs. In addition to using [13] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and
click information, we could also use the impressions (i.e., the L. K. Saul. An introduction to variational methods for
results that are shown to the users). This may yield valu- graphical models. In M. I. Jordan, editor, Learning in
able negative information: a “non-click” to a URL may be Graphical Models. Kluwer Academic Publishers,
an indication of the fact that it is not relevant to the query. Norwell MA., 1998.
Another interesting direction is to consider the content of
[14] A. Joshi and R. Motwani. Keyword generation for
Web pages. We could use a standard document classifier to search engine advertising. ICDM Workshops, pages
classify some of the documents, and use the output of the 490–496, 2006.
classifier as priors in the click-based algorithm. The query
[15] D. Kelleher and S. Luz. Automatic hypertext
string could be used as well; if a query is not in the output
keyphrase extraction. IJCAI, pages 1608–1609, 2005.
of our algorithm, we could consider approximate matching
techniques to find a similar query which is in the output. [16] J. Kleinberg. Authoritative sources in a hyperlinked
Finally, it would be interesting to study the effect of spam environment. Journal of the ACM, 46(5):604–632,
on the keyword generation techniques. There are two types 1999.
of spam that should be taken into account: spurious clicks [17] S. Li. Markov random field modeling in computer
(usually from bot agents) and spam Web pages, each one vision. Springer-Verlag, 1995.
affecting the algorithm in a different way. [18] J. Rocchio. The SMART Retrieval System:
Experiments in Automatic Document Processing,
chapter Relevance feedback in Information Retrieval,
Acknowledgements: We would like to thank Marc Najork pages 313–323. Prentice Hall, 1971.
for providing the tools that we used for storing and process- [19] D. Shen, R. Pan, J. Sun, J. Pan, K. Wu, J. Yin, and
ing the click logs, and Alan Halverson for his help in using Q. Yang. Q2C@UST: Our winning solution to query
these tools. We are also grateful to Vu Ha and Lingfeng Wu classification in KDDCUP 2005. SIGKDD
for providing the Rocchio classifier used in the Snippets al- Explorations, 7:100–110, 2005.
gorithm, and Alex Ntoulas for his help in setting up a Web [20] P. Turney. Learning algorithms for keyphrase
interface to evaluate the queries. extraction. Information Retrieval, 2(4):303–336, 2000.
[21] P. Turney. Coherent keyphrase extraction via web
8. REFERENCES mining. IJCAI, pages 434–439, 2003.
[22] J. Wen, J. Nie, and H. Zhang. Clustering user queries
[1] V. Abhishek and K Hosanagar. Keyword generation of a search engine. WWW, pages 162–168, 2001.
for search engine advertising using semantic similarity
[23] G. Xue, Y. Yu, D. Shen, Q. Yang, H. Zeng, and
between terms. International Conference on Electronic
Z. Chen. Reinforcing web-object categorization
Commerce, pages 89–94, 2007.
through interrelationships. Data Mining and
[2] R. Andersen and K. Lang. Communities from seed Knowledge Discovery, 12:229–248, 2006.
sets. WWW, pages 223–232, 2006.
[24] G. Xue, H. Zeng, Z. Chen, Y. Yu, W. Ma, W. Xi, and
[3] R. Baeza-Yates and A. Tiberi. Extracting semantic W. Fan. Optimizing web search using web
relations from query logs. KDD, pages 76–85, 2007. click-through data. CIKM, pages 118–126, 2004.
[4] K. Bartz, V. Murthi, and S. Sebastian. Logistic [25] Y. Yang. An evaluation of statistical approaches to
regression and collaborative filtering for sponsored text categorization. Information Retrieval,
search term recommendation. Second Workshop on 1(1-2):69–90, 1999.
Sponsored Search Auctions, 2006.
[26] W. Yih, J. Goodman, and V. Carvalho. Finding
[5] D. Beeferman and A. Berger. Agglomerative clustering advertising keywords on web pages. WWW, pages 213
of a search engine query log. KDD, pages 407–416, – 222, 2006.
2000.
[27] X. Zhu, Z. Ghahramani, and J. Lafferty.
[6] A. Broder, M. Fontoura, E. Gabrilovich, A. Joshi, Semi-supervised learning using gaussian fields and
V. Josifovski, and T. Zhang. Robust classification of harmonic functions. ICML, pages 912–919, 2003.
rare queries using web knowledge. SIGIR, pages
231–238, 2007.

70

You might also like