Using The Wisdom of The Crowds For Keyword Generation

WWW 2008 / Refereed Track: Data Mining - Log Analysis April 21-25, 2008 · Beijing, China
because they are connected closely in the graph to the seed set that represents this concept.

We capture this intuition using a random walk. For some query q ∈ Q, we compute the affinity of q to some seed node s ∈ S as the probability that a random walk that starts from q ends up at node s. The affinity of q to the class c ∈ C is the probability that the random walk that starts from q ends up in any seed node in the class c. Note that in this walk the nodes in the seed set act as absorbing nodes, that is, sink nodes in the state transition graph from which the random walk cannot escape. Absorbing nodes naturally model the fact that for the URLs in the seed set we have complete certainty about their class since they were manually assigned to that class, and thus their assignment should not be changed by the algorithm.

Note that for the case where there exists a single seed node s in the graph, every query that is reachable from the seed node will be absorbed with probability 1 at node s. However, the further away a query is from the seed URL, the less related it should be to the URL's class. We model this by introducing an absorbing "null class" node ω to the graph, and connecting this node to every other node in the graph. The effect of this null class node is that long paths are penalized by the algorithm, and the probability of a node being absorbed at the seed node decreases exponentially with the length of the path.

Performing a random walk for every query in the graph is computationally prohibitive for a large graph. However, there is a simple iterative algorithm that can compute the class probabilities efficiently. We will now describe the algorithm for the case that we have a single concept c (that is, C = {c}), and then show how to generalize to the case of multiple classes.

Let ℓq (or ℓu) denote the random variable pertaining to the concept label for query q (or URL u). We want to compute P(ℓq = c), that is, the probability that a random walk that starts from q will be absorbed at some node of the class c. Let α be the probability of making a transition to the null class absorbing node, from any node in the graph. Then we have that

    P(ℓq = c) = (1 − α) Σ_{u:(q,u)∈E} wqu P(ℓu = c)    (1)
    P(ℓu = c) = (1 − α) Σ_{q:(u,q)∈E} wuq P(ℓq = c)    (2)

where wqu is the transition probability from q to u, and wuq is the transition probability from u to q. The iterative process defined by Equations 1 and 2 is known to converge [9]. The computation has interesting connections with electrical networks. Consider the click graph as an electrical network, where each edge (q, u) is a wire, and the weight fqu of the edge is the conductance of the wire. If we apply a unit of voltage to the nodes in the seed set, and we ground the absorbing node ω, then the probabilities that we compute are the voltages on the non-seed nodes.

The outline of the Absorbing Random Walk (ARW) algorithm is shown in Algorithm 1. Note that in lines 5 and 9 of the algorithm we discard probabilities that are below a threshold γ. This step is mainly for efficiency purposes. If we do not do any pruning, then all nodes that are reachable from the seed set will get some probability of belonging to the class, regardless of how small that probability is. However, we are not interested in such queries, since they are too far from the seed set. Setting the probabilities of these nodes to zero conceptually means that we temporarily make them absorbing nodes and place them in the null class. At convergence, we have a set of null class nodes that define the boundary of the graph that we have explored.

Algorithm 1 The ARW algorithm for a single class
Input: the seed set S for class c, the click-graph G, the threshold parameter γ, the transition probability α to ω
Output: P(ℓq = c), for every query q.
 1: for u ∈ S do P(ℓu = c) = 1
 2: repeat
 3:   for all q ∈ Q do
 4:     P(ℓq = c) = (1 − α) Σ_{u:(q,u)∈E} wqu P(ℓu = c)
 5:     if P(ℓq = c) < γ then P(ℓq = c) = 0
 6:   end for
 7:   for all u ∈ U \ S do
 8:     P(ℓu = c) = (1 − α) Σ_{q:(u,q)∈E} wuq P(ℓq = c)
 9:     if P(ℓu = c) < γ then P(ℓu = c) = 0
10:   end for
11: until convergence
12: Output P(ℓq = c), for every query q in Q
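The iteration in Algorithm 1 can be sketched in a few lines. The click graph, edge weights, and query/URL names below are invented for illustration, and wqu, wuq are hand-set transition probabilities rather than values derived from a real click log:

```python
# Sketch of Algorithm 1 (single-class ARW): alternately update query and URL
# probabilities until convergence, pruning values below gamma.

def arw_single_class(q_edges, u_edges, seeds, alpha=0.001, gamma=1e-4, iters=50):
    """q_edges[q] -> list of (url, w_qu); u_edges[u] -> list of (query, w_uq)."""
    p_u = {u: (1.0 if u in seeds else 0.0) for u in u_edges}   # line 1: seeds fixed at 1
    p_q = {q: 0.0 for q in q_edges}
    for _ in range(iters):                                     # lines 2-11
        for q, nbrs in q_edges.items():                        # lines 3-6
            p = (1 - alpha) * sum(w * p_u[u] for u, w in nbrs)
            p_q[q] = p if p >= gamma else 0.0                  # line 5: prune below gamma
        for u, nbrs in u_edges.items():                        # lines 7-10
            if u in seeds:
                continue                                       # seed URLs stay absorbed
            p = (1 - alpha) * sum(w * p_q[q] for q, w in nbrs)
            p_u[u] = p if p >= gamma else 0.0                  # line 9: prune below gamma
    return p_q

# Two-hop toy chain: "running shoes" clicks the seed URL directly, while
# "marathon training" reaches it only through an intermediate URL.
q_edges = {"running shoes": [("shoes.com/a", 1.0)],
           "marathon training": [("blog.example/run", 1.0)]}
u_edges = {"shoes.com/a": [("running shoes", 1.0)],
           "blog.example/run": [("running shoes", 1.0)]}
print(arw_single_class(q_edges, u_edges, {"shoes.com/a"}))
```

On this chain the direct query converges to (1 − α) ≈ 0.999 while the indirect query converges to (1 − α)³ ≈ 0.997, illustrating how the null-class transition α penalizes longer paths.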
A first approach for handling multiple classes is to perform the same iterative algorithm for all classes simultaneously. However, this process is memory intensive, since it requires maintaining probability vectors for all k classes. Furthermore, the fraction of the graph that will be explored will increase substantially even if we prune nodes with low probabilities, since the size of the seed set may be significantly larger.

This problem can be addressed by considering the concepts one at a time. However, this should be done carefully, so as to take into account all the information contained in the seed set. Let S = {S1, · · · , Sk} be the partition of the seed set into the k concepts. Applying Algorithm 1 directly for each set Si is incorrect, since we discard the information we have about the seed nodes in the other concepts. For these nodes, we know with certainty that they cannot belong to class ci; thus, they are negative examples for the class ci. We can easily model this effect in the absorbing random walk: when considering class ci we make all the remaining seed nodes in S \ Si be in the null class.

The outline of the ARW algorithm for multiple classes is shown in Algorithm 2. The outer loop of the algorithm iterates over all classes, and for each class we perform an absorbing random walk. The absorbing random walk for each individual class ci is very similar to Algorithm 1, albeit with two distinct differences. In line 2 of the algorithm, we fix the probability of the nodes in S \ Si to belong to the class ci to zero, thus making them into null absorbing nodes. Also, when we update the URL probabilities (lines 9 – 12), we do not update the probabilities for any of the URLs in S. In effect, the seed set for the absorbing random walk for class ci consists of the entire set S; the nodes in Si have probability 1, while the nodes in S \ Si have probability zero.

The effect of this optimization is that the memory footprint is decreased dramatically, since we only need to consider a single class at a time. Single class runs are also fast, since the size of the explored graph is smaller. Furthermore, the null absorbing nodes in S \ Si "block" the random walk, and thus speed up the convergence. Note also that the algorithm is amenable to parallel execution.

Algorithm 2 The ARW algorithm for multiple classes
Input: the seed set S = {S1, · · · , Sk} for concepts C = {c1, · · · , ck}, the click-graph G, the threshold parameter γ, the transition probability α to ω.
Output: P(ℓq = c), for every query q and every class c.
 1: for all ci ∈ C do
 2:   for u ∈ S \ Si do P(ℓu = ci) = 0
 3:   for u ∈ Si do P(ℓu = ci) = 1
 4:   repeat
 5:     for all q ∈ Q do
 6:       P(ℓq = ci) = (1 − α) Σ_{u:(q,u)∈E} wqu P(ℓu = ci)
 7:       if P(ℓq = ci) < γ then P(ℓq = ci) = 0
 8:     end for
 9:     for all u ∈ U \ S do
10:       P(ℓu = ci) = (1 − α) Σ_{q:(u,q)∈E} wuq P(ℓq = ci)
11:       if P(ℓu = ci) < γ then P(ℓu = ci) = 0
12:     end for
13:   until convergence
14:   Output P(ℓq = ci), for every query q
15: end for

Further optimizations are possible. Once the run for concept ci is completed, we know that with probability Pqi = P(ℓq = ci), the query q belongs to class ci. Therefore, the probability mass "available" for the remaining classes is 1 − Pqi. When considering another class cj, we can create a random jump with probability Pqi to the null absorbing node. This will result in faster convergence for each individual run.

4. MARKOV RANDOM FIELDS

We now demonstrate how the keyword generation problem and the ARW algorithm that we described in the previous section can be formulated within the Markov Random Field model. In order to compare different Markov Random Field formulations, we also present another algorithm for computing the class probabilities. In Section 5, we report experiments comparing this algorithm against the ARW algorithm.

4.1 Model and definitions

A Markov Random Field (MRF) is an undirected graph, where each node in the graph is associated with a random variable and the edges model the pairwise relationships between the random variables. MRFs define a probability model where the value of a random variable in the field depends on the prior knowledge we may have for that variable, and the values of all the adjacent variables in the graph. The characteristic of MRFs is the Markovian assumption that the value of a random variable is independent of the rest of the graph, given the values of all its neighbors. Given a set of observations about certain variables in the MRF, an important problem is to compute the most likely assignment of values for the other unobserved variables.

We model the keyword generation problem using an MRF as follows. Consider the random variables ℓq and ℓu that represent the concept label for each query q and URL u, respectively. The domain of these variables is the set of concept labels C. The pairwise relationships between the nodes are defined by the edges derived from the click graph, and their strength depends on their weight. The seed set S defines the observations for the variables. The observations in our case are of the form P(ℓu = c) = 1 for all pairs ⟨u, c⟩ in the seed set. We are interested in assigning concepts to the queries and the unlabeled URLs in the graph in such a way that we respect the constraints defined by the click graph.

In the MRF model this translates to finding the assignment of values to the random variables such that the posterior probability P({ℓq}, {ℓu}|S) is maximized. Given the Markovian assumption, we can decompose the joint probability distribution into factors defined over the edges in the graph. For some edge (q, u) in the MRF, we define a compatibility function ψqu(ℓq, ℓu) (also called potential function) that scores the relationship between q and u. The choice of potential function typically reflects the prior knowledge one has about the problem under consideration. As noted earlier, in our case a click to a URL from a query can be viewed as an endorsement of a similar concept. Hence, a larger click count implies a stronger agreement of concepts between the connected nodes. The definition of ψ will be made to reflect this prior belief about concept agreement and click counts.

Given the potential function, we can now succinctly write
the joint probability distribution of the MRF as follows:

    P({ℓq}, {ℓu}, S) ∝ Π_{(q,u)∈E} ψqu(ℓq, ℓu) Π_{s∈S} φs(ℓs; k)    (3)

where φs(ℓs; k) = P(ℓs = ck), and P(ℓs = ck) = 1 if ⟨s, ck⟩ ∈ S.

The exact definition of the MRF depends on the choice of the actual potential function, which dictates the way that the probabilities are computed. However, given a specific potential function ψqu(ℓq, ℓu), finding the optimal label assignment is NP-hard for graphs that have cycles, which is the case with the click-log graph. Thus, approximate inference algorithms are required. In the next sections we describe two different ways for defining the potential function and performing approximate inference. The first approach leads to the absorbing random walk algorithm we described in the previous section.

4.2 Gaussian Markov Random Fields

One way to relax the MRF inference problem is to assume that instead of having discrete labels C we have continuous ones. For simplicity assume that C = {0, 1}, that is, we have only two classes and we want to assign each query q and URL u to one of those classes. The continuous relaxation assumes that C = [0, 1], that is, the class labels ℓq and ℓu are now real numbers in the [0, 1] interval. These continuous labels can then be converted to discrete ones by rounding them to the closest discrete value.

Given this relaxation assumption, we can now work within the Gaussian Markov Random Field model. The potential function ψqu(ℓq, ℓu) is now defined as follows

    ψqu(ℓq, ℓu) = exp(−fqu (ℓq − ℓu)²)

The joint distribution of the MRF can be written as

    P({ℓq}, {ℓu}, S) ∝ Π_{(q,u)∈E} exp(−fqu (ℓq − ℓu)²)
                     = exp(− Σ_{(q,u)∈E} fqu (ℓq − ℓu)²)

where ℓu = 1 for all seed URLs u ∈ S. We have that

    P({ℓq}, {ℓu}|S) = P({ℓq}, {ℓu}, S) / P(S)

where P(S) is the probability of the observations in the seed set. Since P(S) is independent of the labels {ℓq} and {ℓu}, maximizing P({ℓq}, {ℓu}|S) is the same as maximizing P({ℓq}, {ℓu}, S). Finding the label assignment that maximizes the probability P({ℓq}, {ℓu}, S) is the same as minimizing

    E = − log P({ℓq}, {ℓu}, S) ∝ Σ_{(q,u)∈E} fqu (ℓq − ℓu)²

The quantity E is often referred to as the energy of the Markov Random Field. This minimization criterion requires that for edges (q, u) with large weight fqu the labels ℓq and ℓu should be close, so that the value fqu(ℓq − ℓu)² is minimized. This is intuitive; a large number of clicks fqu implies a strong association between the query q and URL u, and thus their labels should be similar. This effect is a direct consequence of the potential function, which was chosen so as to penalize discrepancy between the labels of edges with large weight.

Finding the optimal labeling is now a tractable problem. Zhu et al. [27] demonstrate that the optimal labeling is harmonic, and that the optimal value can be found by iteratively setting each label value to be the weighted average of the labels of its neighbors. Note that this is exactly the property satisfied by the absorbing random walk algorithm we described in the previous section: the probability of a query belonging to a class is computed as the weighted average of the probabilities of its neighbors. The harmonic property guarantees that the iterative computation leads to a unique solution. Thus the ARW algorithm can be naturally viewed as an inference algorithm in the Gaussian Markov Random Field.

4.3 Variational Inference and the Mean Field Algorithm

For comparison purposes we also consider a different Markov Random Field formulation, and a different algorithm for computing the posterior distribution. Unlike the Gaussian Markov Random Field, in this case we consider the labels to be discrete. To model the fact that for edges with large weight we want the assigned labels to agree, we set the potential function to be

    ψ(ℓq, ℓu) = exp(λ fqu)   if ℓu = ℓq
    ψ(ℓq, ℓu) = α            if ℓu ≠ ℓq

where α is a constant. Therefore, the potential function rewards agreement in class for query-URL pairs when there is a large number of clicks.

As we have already argued, finding the optimal label configuration that maximizes the posterior probability is intractable. In the previous section, we addressed this problem by relaxing the labels. Here, we relax our requirements on the posterior distribution, and we approximate it by one that is easier to compute. This approach falls under the general framework of variational inference.

Variational inference is an important class of approximate inference algorithms [13]. The goal of variational inference is to approximate the true intractable posterior distribution P with a tractable distribution P̂ that has a simpler form. The parameters of the approximating P̂ distribution are computed by directly optimizing a similarity measure between the approximating and the true distribution. A common choice for the (dis)similarity measure is the Kullback-Leibler divergence (KL-divergence). The KL-divergence between two distributions P̂ and P is defined as

    K[P̂ ∥ P] = Σ_x P̂(x) log (P̂(x) / P(x))    (4)

For the variational inference algorithm described here, we define the approximate distribution to take a fully factored form

    P̂({ℓq}, {ℓu}|S) = Π_q P̂q(ℓq) Π_u P̂u(ℓu)    (5)

This is also known as the Mean Field approximation. The distributions P̂q and P̂u in the above equation model the posterior over queries and URLs respectively; more specifically, from the point of view of parametrization, P̂q and P̂u are multinomial distributions that define the posterior over categories.
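The KL-divergence in Equation 4 is straightforward to evaluate directly; a toy numeric check, with two made-up discrete distributions over hypothetical categories:

```python
# Toy check of Equation 4: K[P_hat || P] = sum_x P_hat(x) * log(P_hat(x) / P(x)).
import math

def kl_divergence(p_hat, p):
    """KL-divergence between two distributions given as value -> probability dicts."""
    return sum(ph * math.log(ph / p[x]) for x, ph in p_hat.items() if ph > 0)

p_hat = {"c1": 0.7, "c2": 0.3}   # approximating distribution (invented)
p     = {"c1": 0.5, "c2": 0.5}   # "true" distribution (invented)
print(kl_divergence(p_hat, p))       # positive when the distributions differ
print(kl_divergence(p_hat, p_hat))   # 0.0 when they coincide
```

The divergence is zero only when the two distributions match, which is why minimizing it pulls the factored P̂ toward the true posterior.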
For example, if there are k categories {c1, c2, . . . , ck}, then the parametrization of P̂q will have k numbers, P̂qi = P̂q(ℓq = ci), such that Σi P̂qi = 1.

We now seek to estimate the parameters of the approximating distributions by minimizing the KL divergence between them and the true posterior. The minimization will be performed with respect to every parameter of the P̂ distribution. The KL-divergence can be computed as follows.

    K[P̂ ∥ P] = Σ_{ℓq,ℓu} P̂(ℓq, ℓu) log (P̂(ℓq, ℓu) / P(ℓq, ℓu | S))
             = Σ_{ℓq,ℓu} P̂(ℓq, ℓu) log (P̂(ℓq, ℓu) P(S) / (P(ℓq, ℓu | S) P(S)))
             = Σ_{ℓq,ℓu} P̂(ℓq, ℓu) log (P̂(ℓq, ℓu) / P(ℓq, ℓu, S)) + log P(S)
             = −H(P̂) − Σ_{ℓq,ℓu} P̂(ℓq, ℓu) log P(ℓq, ℓu, S) + log P(S)
             = E + log P(S)    (6)

The first term E in Equation 6 is generally referred to as the mean field free energy (different from the Markov Random Field energy we minimized before), the intuition being that low energy configurations are more stable (and likely). The second term log P(S) is the log probability of the observations, and does not depend on the parameters of the P̂ distribution; hence we can safely ignore this term while performing the minimization and work only with E. To ensure that Σc P̂q(ℓq = c) = 1 for all queries q (and similarly for P̂u) we need to introduce appropriate Lagrangian multipliers during the minimization. The minimization is performed by doing gradient descent on the parameter space {P̂qi}, {P̂ui}.

The expansion of the mean-field energy term

    E = −H(P̂) − Σ_{ℓq,ℓu} P̂(ℓq, ℓu) log P(ℓq, ℓu, S)

results in two terms, one that is independent of the observations S, and the other encoding them. The former corresponds to the (negative) entropy of the approximating distribution, and the latter can be associated with the expectation of the joint distribution with respect to the approximating distribution. During the minimization, the first term favors a P̂ distribution with high entropy, while the evidence term, which encodes S, tends to explain the observations by pulling P̂ away from the uniform distribution. The updates for the P̂q and the P̂u distributions will be coupled whenever there is an edge between the particular query and URL. Using the fact that the distribution P({ℓq}, {ℓu}, S) for a given configuration readily factors into a product of pairwise potentials, we can now write the update equations as follows.

    log P̂q(ℓq) = Σ_{u:(q,u)∈E} Σ_c P̂u(ℓu = c) log ψqu(ℓq, ℓu) − 1 + λq
    log P̂u(ℓu) = Σ_{q:(q,u)∈E} Σ_c P̂q(ℓq = c) log ψqu(ℓu, ℓq) − 1 + λu

where λq and λu are the Lagrangian multipliers that normalize the resulting distributions. Iterating yields a local minimum, and we obtain a distribution over the class labels.

5. EXPERIMENTS

We now present an experimental evaluation of our approach, centered around the ARW algorithm. In Section 5.1, we present the experimental setup. In Section 5.2, we study the properties of the ARW algorithm and the quality of the results that it returns. Finally, in Section 5.3, we compare the ARW algorithm with other algorithms, including the Mean Field algorithm presented in Section 4.3.

5.1 Experimental setup

The query click graph. We perform experiments on a click graph constructed from a snapshot of the query log obtained from a major search engine. The graph has 41 million queries, 55 million URLs, and 93 million edges. The URLs correspond to both clicked documents and ads. The query-URL pairs account for 490 million clicks in the log. Each edge has a frequency of at least two: we pruned all query-URL pairs with just one click in the snapshot, since we considered them to be too rare to define a meaningful association.

Seed sets. We use three different seed sets that are motivated by specific scenarios. In the first scenario, the seed set is created by just specifying the domain of the advertiser's Web site. We consider the scenario in our motivating example in Section 1, where an advertiser is interested in promoting the online shoe store www.shoes.com. The goal of the advertiser is to find queries that are related to her domain and could potentially result in clicks to her advertisement. Therefore, the URLs in the domain shoes.com can be thought of as a representation of the abstract concept "shoes" that the advertiser is interested in. We construct the seed set by obtaining all the URLs in the domain (i.e., all URLs that start with www.shoes.com) that appear in the click log, and we label them with the "shoes" concept. We call this seed set Shoes.

In the second scenario, the advertiser leverages publicly-accessible data to produce the seed set. Consider an advertiser who is interested in promoting a health-specific site, that is, she is interested in the abstract concept of "health". The advertiser is interested in capturing queries related to her site, but also queries that go to sites which are similar to hers and are authoritative in the health domain. In this case, the seed set can be obtained by resorting to a collection of popular health sites. For our experiment, we use the 12 most popular sites in the health domain obtained from Nielsen NetRatings (http://www.nielsen-netratings.com/). The sites are the following:

health.yahoo.com, webmd.com, health.msn.com, kidshealth.org, www.prevention.com, www.cdc.gov, www.mayoclinic.com, familydoctor.org, emedicine.com, health.ivillage.com, parenting.ivillage.com, www.medicinenet.com

Similarly to the Shoes seed set, once we have the domains we construct the seed set by obtaining all the URLs in the click log from those domains. We call this seed set Health.
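The seed-set construction described above amounts to a prefix filter over the URLs that appear in the click log; a minimal sketch, with made-up log entries:

```python
# Sketch of seed-set construction: collect all click-log URLs under the given
# domains and label them with the concept. The log entries are hypothetical.

def build_seed_set(click_log_urls, domains, concept):
    """Return {url: concept} for every logged URL under one of the domains."""
    return {url: concept
            for url in click_log_urls
            if any(url == d or url.startswith(d + "/") for d in domains)}

urls = ["www.shoes.com/running", "www.shoes.com/sale/boots",
        "webmd.com/diabetes", "example.org/misc"]
print(build_seed_set(urls, ["www.shoes.com"], "shoes"))
# {'www.shoes.com/running': 'shoes', 'www.shoes.com/sale/boots': 'shoes'}
```

The same filter with a list of domains (as for Health) yields a multi-domain seed set for a single concept.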
Table 2: Effect of threshold γ for α = 0.001 for Shoes
query set   γ       |Qik ∩ Q0k|   |Qik ∩ Q0k|/|Q0k|
Q0k         10−5    50,000        100%
Q1k         10−4    45,645        91.3%
Q2k         10−3    37,338        74.7%
Q3k         10−2    15,276        31.2%

Table 3: Effect of α for γ = 0.0001 for Shoes
query set   α       |Qik ∩ Q0k|   |Qik ∩ Q0k|/|Q0k|
Q0k         0.0001  50,000        100%
Q1k         0.001   49,812        99.6%
Q2k         0.01    48,259        96.5%
Q3k         0.1     43,608        87.2%
Q4k         0.2     41,359        82.7%
Q5k         0.3     39,564        79.1%

Table 4: Effect of threshold γ for α = 0.001 for Health
query set   γ       |Qik ∩ Q0k|   |Qik ∩ Q0k|/|Q0k|
Q0k         0.0001  500,000       100%
Q1k         0.001   456,892       91.3%
Q2k         0.01    393,582       78.7%
Q3k         0.05    211,391       42.3%

Table 5: Effect of α for γ = 0.0001 for Health
query set   α       |Qik ∩ Q0k|   |Qik ∩ Q0k|/|Q0k|
Q0k         0.0001  500,000       100%
Q1k         0.001   493,258       98.6%
Q2k         0.01    456,344       91.2%
Q3k         0.1     399,607       80.0%
Q4k         0.2     381,747       76.3%
Q5k         0.3     370,751       74.1%
In the last scenario, we assume that the search engine offers the advertiser a set of pre-existing concepts in the form of a taxonomy. The advertiser can select the concepts that she is interested in and choose keywords from these concepts. Our objective is to populate the taxonomy with keywords by assigning queries to the different concepts. For the seed set, we assume the existence of a directory that maps URLs to some fixed taxonomy. In our experiment, we employ a directory designed for marketing purposes that is similar in spirit to the ODP directory. We focus on the third level of the taxonomy, which contains 1,158 categories. We consider each third-level category as a distinct concept, and we set the seed set for that concept to be the URLs in the corresponding category in the directory. We call this seed set Directory.

5.2 Experiments with the ARW algorithm

The goal of the following experiments is to evaluate the properties of the ARW algorithm and the quality of the results that it returns. Given a concept c and a seed set S, the ARW algorithm returns a set of queries related to the concept. The queries are then sorted according to the probability of belonging to the concept, and the top-k queries Qk are returned as candidates to the advertiser. We are interested in evaluating the answer set Qk. We first investigate how the parameters of the ARW algorithm affect the set Qk, and then the quality and properties of the queries in Qk for different values of k. We use the Shoes and Health seed sets for these experiments.

Setting the ARW parameters. We begin by studying the effect of the ARW parameters α and γ on the result set Qk. Increasing α causes query probabilities to decrease, so very large values of α, in combination with the threshold γ, will cause the size of the set Qk to diminish. However, it is not clear what the effect of a small increase of α is on the ranking produced by the query probabilities, and thus on the contents of Qk. Similarly, increasing γ will cause some more nodes to be pruned, but it is not clear what effect it will have on the high-probability queries.

We study the effect of the threshold value γ for the Shoes seed set. We set α to 0.001, we run the ARW algorithm for different values of γ, and we observe the change in the result set Qk for k = 50K. The results are presented in Table 2. In this table, we consider query sets Q0k, . . . , Q3k corresponding to values of γ ranging from 10−5 to 10−2. For each query set Qik, we show the size of the intersection with Q0k, and the fraction of Q0k that it represents. We can observe in Table 2 that if we increase γ by one order of magnitude (from 10−5 to 10−4), we still retain 91.3% of the queries in the intersection. If we increase it by two orders of magnitude (from 10−5 to 10−3), the intersection still contains 74.7% of the queries in Q0k. For γ = 10−2, the intersection drops to 31.2%. However, for this high value of the threshold, the algorithm retrieves only 15,360 queries, 99.4% of which intersect with Q0k.

We perform a similar study for the effect of the parameter α. We set the threshold γ to 10−4 and run the ARW algorithm on the Shoes seed set for different values of α. For each run we consider the top k = 50K queries. The results are shown in Table 3. We denote the different query result sets as Q0k, . . . , Q5k, corresponding to values of α ranging from 0.0001 to 0.3. As before, for each query set Qik, we show the size of the intersection with Q0k, and the fraction of Q0k that it represents. We observe in the table that if we increase α by one order of magnitude (from 0.0001 to 0.001), we still retain 99.6% of the queries. Even if we increase α by three orders of magnitude (from 0.0001 to 0.1), the intersection is still 87.2% of the top 50K queries.

We observe very similar trends for the Health seed set. The results are shown in Tables 4 and 5. We can thus conclude that there is a wide range of values for γ and α for which the query set Qk returned by the algorithm remains relatively unchanged: the algorithm is robust to changes of the parameters. This is not surprising, since we expect nodes that are close to the seed set to be in the top-k results, and the ranking of these nodes should not be significantly affected by moderate changes to α or γ. This has implications for the efficiency of the algorithm as well, since we can pick relatively high values for the parameters and speed up the convergence of the algorithm, while retaining consistent results.

Query set evaluation. We now evaluate the quality of the query set Qk produced by the ARW algorithm. We consider the following two metrics for evaluating the results.
Table 6: Relevance and indirectness for Shoes

Table 7: Relevance and indirectness for Health

Figure 1: Plots for evaluation metrics for Shoes: (a) relevance R(Qk) and (b) indirectness N(Qk), as a function of k

Figure 2: Plots for evaluation metrics for Health: (a) relevance R(Qk) and (b) indirectness N(Qk), as a function of k
presented with a description of the concept and a set for k = 500K, the percentage of indirect queries is 82.4%.
of queries, and they were instructed to consider the This increase in indirectness does not come at the expense
queries as “good” if they could find a relationship be- of relevance, which is 97.1% in the former case and 95.9% in
tween the query and the concept, and “bad” otherwise. the latter. These numbers indicate that we can obtain high
quality results even when we consider queries that are not
• Indirectness: Given the seed set S, let Q(S) denote directly connected to the seed set in the click graph.
the set of queries that are directly connected to S. We
say that a query q is indirect, if q 6∈ Q(S). We de- 5.3 Comparison with other algorithms
note with N (Qk ) the fraction of indirect queries in a
result set Qk . Indirect queries are of particular inter- In this section we compare the ARW algorithm with two
est when the seed set contains URLs associated to the other algorithms. The first is the Mean Field algorithm as it
advertiser’s site, such as in the Shoes scenario, since is described in Section 4.3. The second is an algorithm that
the advertiser may be interested in discovering rele- categorizes a query by grouping the snippets of the returned
vant queries that have not been generating traffic to URLs into a snippet document, and classifying the snippet
her site. document. This method was first employed by Broder et
We first study the trade-off between the size of the query result set and its relevance. For each seed set, we run a random walk with α = 0.001 and γ = 10^-4. For each result set, we evaluate the relevance of the top-k queries, for different values of k. The relevance results for Shoes and Health are shown in Tables 6 and 7, respectively, and plotted in Figures 1(a) and 2(a). For the Shoes experiment, the relevance decreases with the value of k, but even for the top 50K queries it is reasonably high (82.4%). For the Health experiment, the relevance remains relatively constant as we increase k, and it is 95.9% for the largest result set that we consider (k = 500K).

We then consider how indirectness changes for different values of k. The results for Shoes are given in Table 6 and plotted in Figure 1(b). Indirectness increases considerably when k goes from 5K to 20K (from 64.2% to 87.8%), and then continues to increase, but more slowly. An indirectness of 87.8% can be achieved with a relevance of 93.6%. Table 7 and Figure 2(b) show the indirectness values for the Health seed set. Indirectness increases sharply from k = 50K to k = 500K. Initially, only 2% of the queries are indirect, while

al. [6]. For the classification of the snippet document we use a Rocchio classifier [18], trained using the content of all the documents in the Directory seed set. The snippet document is constructed by merging the snippets of the top-40 documents, a number that was shown to perform well by Broder et al. [6].

Both algorithms we compare against are better defined for the case where there are multiple concepts, so for this experiment we use the Directory seed set. We measure relevance by sampling from the results and evaluating the query-category pairs. For the evaluation of individual pairs, we consider the third level of the taxonomy. Given the large size of the taxonomy, for presentation purposes we aggregate the results into its 20 top-level categories, and report the micro-averaged relevance [25] for each category. The micro-averaged relevance of a high-level category c is computed using the query-category pairs ⟨q, c′⟩, where c′ ranges over all subcategories of c: it is the fraction of such pairs that are judged as "good". Notice that even though we aggregate the values at the top level of the taxonomy, the queries are actually evaluated at the third level. Table 8 summarizes the results.
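The micro-averaging step described above is a simple aggregation; a minimal sketch follows, with hypothetical category names and an assumed `parent` map from third-level categories to their top-level ancestors (these names are illustrative only, not from the paper's taxonomy).

```python
from collections import defaultdict

def micro_average_relevance(judged_pairs, parent):
    """Aggregate judged (query, third_level_category, is_good) tuples
    into micro-averaged relevance per top-level category: the fraction
    of judged pairs under each top-level category that are "good"."""
    good = defaultdict(int)
    total = defaultdict(int)
    for _query, category, is_good in judged_pairs:
        top = parent[category]  # map third-level category to top level
        total[top] += 1
        good[top] += int(is_good)
    return {top: good[top] / total[top] for top in total}

# Hypothetical judged sample: two of three pairs are judged "good".
pairs = [
    ("running shoes", "Clothing/Footwear/Athletic", True),
    ("leather boots", "Clothing/Footwear/Boots", True),
    ("shoe rack", "Clothing/Footwear/Accessories", False),
]
parent = {
    "Clothing/Footwear/Athletic": "Clothing and Shoes",
    "Clothing/Footwear/Boots": "Clothing and Shoes",
    "Clothing/Footwear/Accessories": "Clothing and Shoes",
}
print(micro_average_relevance(pairs, parent))
# → {'Clothing and Shoes': 0.6666666666666666}
```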
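For concreteness, the snippet-based baseline described above (a Rocchio-style centroid classifier applied to a "snippet document" merged from the top search results) might look like the following sketch. It uses raw term frequencies instead of the tf-idf weighting a real Rocchio implementation would use, and all names and example strings are illustrative assumptions, not the actual system.

```python
import math
from collections import Counter, defaultdict

def tf_vector(text):
    # Simple term-frequency vector; a real system would use tf-idf.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_rocchio(labeled_docs):
    """labeled_docs: iterable of (text, label) pairs, e.g. the contents
    of the documents in a seed set. Returns one centroid per class."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(tf_vector(text))
        counts[label] += 1
    return {
        label: Counter({t: w / counts[label] for t, w in vec.items()})
        for label, vec in sums.items()
    }

def classify_query(query_snippets, centroids):
    # Merge the snippets of the top results into one snippet document,
    # then assign the class whose centroid is closest in cosine distance.
    snippet_doc = tf_vector(" ".join(query_snippets))
    return max(centroids, key=lambda c: cosine(snippet_doc, centroids[c]))

# Toy seed set with two classes (hypothetical data).
seed = [
    ("running shoes sneakers footwear", "Shoes"),
    ("vitamins diet wellness clinic", "Health"),
]
centroids = train_rocchio(seed)
print(classify_query(["buy running sneakers online", "discount footwear"], centroids))
# prints Shoes: the merged snippet document is closest to the Shoes centroid
```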
Category                           ARW      Mean Field   Snippets
-----------------------------------------------------------------
Food and Drinks                    99%      84.6%        92%
Social Sciences and Humanities     91.3%    35%          78%
Computers and Computing            89.6%    71.8%        70.1%
Health and Wellness                97.8%    83.8%        84.9%
Vehicles and Transportation        83.3%    90.9%        90.9%
Lawn and Garden                    96.6%    66.7%        60%
Travel                             94.7%    80%          96.3%
Sports and Recreation              91.1%    51.9%        76.7%
Financial Services                 96.7%    85.4%        82.3%
Arts and Entertainment             88.1%    78.7%        78.6%
Games Puzzles                      94.2%    94.7%        77.8%
Kids and Teens Lifestyle           89.3%    77.3%        72.2%
Society and Culture                67.1%    54.5%        78.7%
Business                           87.2%    72.1%        70.9%
Clothing and Shoes                 98.4%    95.5%        95.7%
Science                            86.2%    57.3%        86.3%
Educational Institutions           87.2%    34.4%        61%
Families and Relationships         94.1%    90.9%        80%
Animals                            94.2%    90.9%        100%
Home Improvement                   96.8%    84.2%        93.8%
-----------------------------------------------------------------
Micro-average                      88.2%    69%          80.2%

Table 8: Micro-average relevance. Aggregated for top-level categories of the taxonomy, evaluated for third-level categories.

The first column lists the 20 top-level categories in the taxonomy. The second column corresponds to the ARW algorithm with parameters α = 0.01 and γ = 0.001; the third column to the Mean Field algorithm; and the last column to the snippets-based algorithm.

The ARW algorithm has the highest micro-average relevance (88.2%, as opposed to 80.2% for the Snippets algorithm). Notice that although the Snippets algorithm has a lower micro-average relevance than ARW on the entire sample, it has a higher micro-average relevance for 5 of the 20 classes that we consider. This suggests that, as future work, it may be fruitful to consider an approach that exploits both the content of documents and their clicks. The Mean Field algorithm, on the other hand, has a lower micro-average relevance of 69%. It produces the best results for two of the categories, but usually exhibits the lowest relevance ratio. It appears that the ARW algorithm benefits significantly from the introduction of the null category, and from the exponential decay in the probability of reaching a node in the seed set. This effect is not modeled well by the Mean Field algorithm, which results in poor performance when there are weak connections between queries and URLs of non-relevant classes. Fully understanding how the Mean Field algorithm can be optimized is an interesting problem for future work.

This experiment also offers an interesting insight into how the ARW algorithm performs when the seed set contains multiple concepts, as well as concepts of fine granularity. The third level of the taxonomy contains 1158 categories, some of which are fairly narrow. For example, under the top-level category "Arts and Entertainment", there are third-level categories such as "Movies/Film Festivals", "Movies/Awards", "Movies/Filmmaking" and "Movies/Theaters". The micro-average relevance that we obtained indicates that the algorithm produces relevant results even for fine-grained categories of the taxonomy.

6. RELATED WORK

Most of the work on keyword generation has focused on extracting keywords from documents. Turney [20] proposed GenEx, a rule-based keyword extraction system tuned using a genetic algorithm. Another well-known keyword extraction system is KEA [10], which employs a naive Bayes learning algorithm. Later work added web document related features to KEA, such as the number of documents returned by a search engine [21] and link information [15]. The use of natural language techniques for keyword extraction was initially studied by Hulth [12]. Yih et al. [26] proposed a system for extracting keywords from Web pages for contextual advertising. Among other features, they use the frequencies of candidate keywords in the query log. However, they do not employ click information (i.e., the relationships between queries and documents). These techniques for keyword extraction from documents can be viewed as complementary to ours. In particular, we can consider the keywords extracted from a given document as a seed set (of queries) for our click-based algorithm.

Some approaches in the literature exploit the results returned by a search engine instead of the user clicks [1, 14]. The idea is to start with a seed set of keywords, submit them to the search engine, and then use the retrieved text snippets or documents to extract relevant keywords. Joshi and Motwani [14] introduced a notion of non-obviousness which, though similar in spirit to the notion of indirectness presented in our work, is defined quite differently. In particular, they assume that a set of keywords is given as input, and a term is considered non-obvious if it does not contain any of the input keywords. To the best of our knowledge, the only previous approach to keyword generation that employs the click graph is that of Bartz et al. [4]; however, their techniques are quite different from ours (they employ logistic regression and collaborative filtering techniques).

The techniques presented in this paper can be used to populate queries within a taxonomy if a directory of URLs is used to represent the concepts. There are a number of proposals in the literature that obtain a query category by classifying the search results of the query (either documents or snippets) [6, 19]. In contrast, we use clicks, and we do not need to crawl content in order to train a classifier. More closely related to our approach is the work of Xue et al. [23], which considers combining the signal coming from document content with the click logs, and shows an improvement over purely content-based classification methods.

The click graph has been used extensively for other applications. Beeferman and Berger [5] and Wen et al. [22] employ clustering techniques to determine query-to-query similarity. Xue et al. [24] use the click graph to find document-to-document similarities. Craswell and Szummer [8] consider random walk techniques on the click graph to produce a ranking of documents for image search. Baeza-Yates and Tiberi [3] use the click graph to extract semantic relationships between queries.

In our approach, we start with a seed set (of queries or URLs) and expand it in order to generate keywords. The seed set expansion problem has been considered in other contexts, such as detecting hubs and authorities [16], community discovery [2], and detecting spam pages [11].
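As a rough illustration of this seed-set expansion view, the following sketch propagates class probabilities over a click graph with absorbing seed nodes, in the spirit of the ARW algorithm described earlier. It is a sketch under assumptions, not the paper's implementation: here α is taken to be a per-step probability of escaping to the null class, γ is the pruning threshold from Algorithm 1, and the graph is a toy example.

```python
from collections import defaultdict

def absorbing_walk(edges, seeds, alpha=0.001, gamma=1e-4, iters=50):
    """Absorbing-random-walk style label propagation on a click graph.

    edges: dict node -> list of (neighbor, weight); queries and URLs alike.
    seeds: dict node -> class label; seed nodes are absorbing, so their
           class assignment is never changed by the iteration.
    alpha: assumed per-step probability of escaping to the null class.
    gamma: probabilities below this threshold are pruned, for efficiency.
    Returns node -> {class: probability of absorption in that class}.
    """
    prob = {s: {c: 1.0} for s, c in seeds.items()}
    for _ in range(iters):
        new = {}
        for v, nbrs in edges.items():
            if v in seeds:
                continue  # absorbing nodes keep their class with certainty
            total = sum(w for _, w in nbrs)
            dist = defaultdict(float)
            for u, w in nbrs:
                for c, p in prob.get(u, {}).items():
                    # (w / total) is the transition probability from v to u
                    dist[c] += (1.0 - alpha) * (w / total) * p
            # prune tiny probabilities, as in lines 5 and 9 of Algorithm 1
            new[v] = {c: p for c, p in dist.items() if p >= gamma}
        prob.update(new)
    return prob

# Toy click graph: a new query connected, via a clicked URL, to a seed query.
edges = {
    "q_new": [("url1", 2.0)],
    "url1": [("q_seed", 1.0), ("q_new", 2.0)],
}
res = absorbing_walk(edges, {"q_seed": "Shoes"}, alpha=0.0)
# With no decay, q_new's probability mass is eventually absorbed by the seed.
print(res["q_new"])
```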