The query-flow graph: Model and applications

2008, International Conference on Information and Knowledge Management, Proceedings

Paolo Boldi (1,*) boldi@dsi.unimi.it
Debora Donato (2) debora@yahoo-inc.com
Francesco Bonchi (2) bonchi@yahoo-inc.com
Aristides Gionis (2) gionis@yahoo-inc.com
Carlos Castillo (2) chato@yahoo-inc.com
Sebastiano Vigna (1,*) vigna@dsi.unimi.it

(1) DSI, Università degli Studi di Milano, Italy
(2) Yahoo! Research Labs, Barcelona, Spain
(*) Part of this work was done while the authors were visiting Yahoo! Research Labs, Barcelona.

Submitted for confidential review. Last updated: June 4, 2008.

ABSTRACT

Query logs record the queries and the actions of the users of search engines, and as such they contain valuable information about the interests, the preferences, and the behavior of the users, as well as their implicit feedback to search-engine results. Mining the wealth of information available in the query logs has many important applications including query-log analysis, user profiling and personalization, advertising, query recommendation, and more. In this paper we introduce the query-flow graph, a graph representation of the interesting knowledge about latent querying behavior. Intuitively, in the query-flow graph a directed edge from query $q_i$ to query $q_j$ means that the two queries are likely to be part of the same "search mission". Any path over the query-flow graph may be seen as a searching behavior, whose likelihood is given by the strength of the edges along the path. The query-flow graph is an outcome of query-log mining and, at the same time, a useful tool for it. We propose a methodology that builds such a graph by mining time and textual information as well as aggregating queries from different users. Using this approach we build a real-world query-flow graph from a large-scale query log, and we demonstrate its utility in two concrete applications, namely finding logical sessions and query recommendation. We believe, however, that the usefulness of the query-flow graph goes beyond these two applications.

1. INTRODUCTION

The huge volume of information recorded daily in query logs contains a wealth of valuable knowledge about how web users interact with search engines, as well as information about the interests and the preferences of those users. Extracting behavioral patterns from this wealth of information is a key step towards improving the service provided by search engines and towards developing innovative web-search paradigms. Unfortunately, mining query logs poses many technical challenges that arise from the very large volume of data, the high level of noise, poorly formulated queries, ambiguity, and sparsity, among others.

In this paper we introduce the concept of the query-flow graph, which is a graph modeling user behavioral patterns and query dependencies. The query-flow graph is an actionable, aggregated representation of the interesting information contained in a large query log. In particular, the phenomenon of interest is the sequentiality of similar queries: the two fundamental dimensions that drive the construction of the query-flow graph are the temporal order of queries and their similarity.

Given a query log, the nodes of the query-flow graph are all the queries contained in the log, and a directed edge between two queries $q_i$, $q_j$ has a weight $w(q_i, q_j)$ representing the probability that the two queries are part of the same search mission and that they appear in the given order. When $w(q_i, q_j)$ is high, we may think of $q_j$ as a typical reformulation of $q_i$, and hence a step ahead towards the successful completion of a possible search mission.

The main contribution of this paper is introducing the query-flow graph and providing a methodology for constructing such a graph based on mining query logs. Besides this, we demonstrate the usefulness of the query-flow graph in two applications: finding logical sessions and query recommendation.
With respect to finding logical sessions, we allow them to be intertwined, thus modeling the behavior of users who have a number of interests or goals and submit queries related to the corresponding information needs in an interleaved fashion. We also address this problem starting from the entire query history of users and not from timeout-driven sessions. To our knowledge, this is the first time that the modeling of the problem of finding query chains allows such complexity. We formulate the problem of finding intertwined query chains as an asymmetric traveling salesman problem (ATSP), which we approximate with a greedy heuristic.

For the problem of query recommendation we propose an algorithm that builds on the concept of the query-flow graph and allows leveraging not only the similarity between queries but also the overall structure of a neighborhood of the graph. Our recommendation algorithm is based on performing a random walk with restart to the original query of the user or to a small set of queries representing the recent querying history.

The rest of this paper is organized as follows. Section 2 is an overview of the related work. In Section 3 we define our notation and concepts, and in Section 4 we discuss our algorithm for constructing the query-flow graph. Then we describe two applications: finding query chains in Section 5, and query recommendation in Section 6. Finally, Section 7 includes a few concluding remarks.

2. RELATED WORK

Query logs are widely considered a very rich source of knowledge on user behavior. The main challenge in analyzing query logs lies in extracting interesting relations from the raw lists of user actions. Many different approaches have been proposed in order to discover essential features or hidden relations in query logs.

Query graphs. One main research line attempts to infer the hidden semantics of user interactions with search engines by projecting the data over different types of graphs. Baeza-Yates [1] identifies five different types of graphs. In all cases, the nodes are queries; a link is introduced between two nodes, respectively, if: (i) the queries contain the same word(s) (word graph), (ii) the queries belong to the same session (session graph), (iii) users clicked on the same urls in the list of their results (url cover graph), (iv) there is a link between the two clicked urls (url link graph), (v) there are l common terms in the content of the two urls (link graph). In [1], it is suggested that one application of these graphs is session segmentation, which is one of the applications we study in this paper.

Baeza-Yates and Tiberi [2] study a weighted version of the cover graph. Their analysis provides information not only about how people query but also about how they behave after a query and the content distribution of what they look at. Moreover, the authors study several characteristics of click graphs, i.e., bipartite graphs of queries and urls, where a query and a url are connected if a user clicked on a url that was an answer for a query. This framework is used to infer semantic relations among queries and to detect multitopical urls, i.e., urls that cover either several topics or a single very general topic.

Query recommendation. Query recommendation is a core task for large industrial search engines. Most of the work on query recommendation is focused on measures of query similarity [20, 10] that can be used for query expansion [3] or query clustering [3, 19]. A first attempt to model the users' sequential search behavior is presented by Zhang and Nasraoui [20]: the arcs between consecutive queries in the same session are weighted by a damping factor d, while the similarity values for non-consecutive queries are calculated by multiplying the values of the arcs that join them. Fonseca et al. [10] instead discover related queries with a method based on association rules. Each transaction in the query log is seen as a session in which a single user submits a sequence of related queries in a time interval. Their notion of session is similar to the one we use in this paper.

Reference [3] studies the problem of suggesting related queries issued by other users and query expansion methods to construct artificial queries. Their method is used to recommend queries that are related to the input query but may search for different issues. The clustering is based on a term-weight vector representation of queries, obtained from the aggregation of the term-weight vectors of the urls clicked after the query. Wen et al. [19] also present a clustering method for query recommendation that is centered around four notions of query distance: the first notion is based on keywords or phrases of the query; the second on string matching of keywords; the third on common clicked urls; and the fourth on the distance of the clicked documents in some pre-defined hierarchy.

Query segmentation. Segmenting the query stream into sets of related information-seeking queries, i.e., logical sessions, has many applications: apart from query recommendation, since logical sessions can help in understanding the relationship between queries given the user intent, they are valuable for user profiling and personalization. He and Göker [11] studied different timeouts to segment user sessions, and later extended their work [12] to consider other features such as the overlap between terms in two consecutive queries. Radlinski and Joachims [16] observe that users often perform a sequence, or chain, of queries with a similar information need; they refer to this sequence of reformulated queries as query chains. Their paper presents a simple method for automatically detecting query chains in query and clickthrough logs and shows how to learn better retrieval functions using evidence of query chains. Recently the problem of query session detection was also considered by Jones and Klinkner [14], where a method for automated segmentation is proposed and evaluated.

Temporal classification. Considering time features might have other applications beyond segmenting the query stream. Jones and Diaz [13] introduce a model to measure the distribution, over the time domain, of documents retrieved in response to a query, in order to create a temporal profile for a query. They show that such a temporal profile can provide valuable information about the likely quality of query results.

Random walk models. Craswell and Szummer [8] describe a Markov random walk model for ranking documents. A backward random walk is performed over the click graph, leading to a method for retrieving relevant documents that have not yet been clicked for a predefined query and ranking them effectively. The random walk we introduce is performed over a completely different graph and with the objective of ranking queries instead of documents. Collins-Thompson and Callan [7] use a Markov random model for query expansion. Their setting is also different from ours: the stationary distribution of the model is used to obtain probability estimates that a potential expansion term reflects aspects of the original query.

3. BASIC CONCEPTS

In this section we provide the basic idea behind the query-flow graph. In summary, the query-flow graph is a usage-oriented, actionable, compact representation of the information contained in a query log, and it is aimed at facilitating the analysis of user behavior.

Query log. A query log records information about the search actions of the users of a search engine. Such information includes the queries submitted by the users, the documents viewed as a result of each query, and the documents clicked by the users.
A typical query log $\mathcal{L}$ is a set of records $\langle q_i, u_i, t_i, V_i, C_i \rangle$, where: $q_i$ is the submitted query, $u_i$ is an anonymized identifier for the user who submitted the query, $t_i$ is a timestamp, $V_i$ is the set of documents returned as results to the query, and $C_i$ is the set of documents clicked by the user. In the above representation, we assume that if $U$ is the set of users of the search engine and $D$ is the set of documents indexed by the search engine, then $u_i \in U$ and $C_i \subseteq V_i \subseteq D$. For the purposes of this paper, we do not use any information from the results of the queries ($C_i$ and $V_i$); we mention them above only for completeness. Thus, subsequently we denote query logs by $\mathcal{L} = \{ \langle q_i, u_i, t_i \rangle \}$.

Sessions. A user query session, or session, is defined as the sequence of queries of one particular user within a specific time limit. More formally, if $t_\theta$ is a timeout threshold, a user query session $S$ is a maximal ordered sequence

$$S = \big\langle \langle q_{i_1}, u_{i_1}, t_{i_1} \rangle, \ldots, \langle q_{i_k}, u_{i_k}, t_{i_k} \rangle \big\rangle,$$

where $u_{i_1} = \cdots = u_{i_k} = u \in U$, $t_{i_1} \le \cdots \le t_{i_k}$, and $t_{i_{j+1}} - t_{i_j} \le t_\theta$ for all $j = 1, 2, \ldots, k - 1$.

Given a query log $\mathcal{L}$, the corresponding set of sessions can be constructed by sorting all records of the query log first by userid $u_i$ and then by timestamp $t_i$, and by performing one additional pass to split sessions of the same user whenever the time difference of two queries exceeds the timeout threshold. Whenever we use a timeout threshold for splitting sessions, we set $t_\theta = 30$ minutes, as this is the typical timeout used in web log analysis [6, 18, 15].

Supersessions. The sequence of all the queries of a user in the query log, ordered by timestamp, is called a supersession. Thus, a supersession is a sequence of sessions in which consecutive sessions have time difference larger than $t_\theta$.

Chains. A chain is a topically coherent sequence of queries of one user. Radlinski and Joachims [16] defined a chain as "a sequence of queries with a similar information need". For instance, a query chain may contain the following sequence of queries [14]: "brake pads"; "auto repair"; "auto body shop"; "batteries"; "car batteries"; "buy car battery online". The concept of chain is also referred to in the literature with the terms mission [14] and logical session [1]. Unlike the concept of session, chains involve relating queries based on the user information need, which is an extremely hard problem, so we do not try to formally define chains here.

We note that for chains we do not impose any timeout constraint. Therefore, as an example, all the queries of a user who is interested in planning a trip to a far-away destination and searches for tickets, hotels, and other tourist information over a period of several weeks should be grouped in the same chain. Additionally, we do not require the queries composing a chain to be consecutive. Following the previous example, the user who is planning the far-away trip may search for tickets on one day, then make some other queries related to a newly released movie, and then return to trip planning the next day by searching for a hotel. Thus, a session may contain queries from many chains, and conversely, a chain may contain queries from many sessions.

The query-flow graph. The final concept we define is the query-flow graph, which is a central contribution of our paper. The query-flow graph $G_{qf}$ is a directed graph $G_{qf} = (V, E, w)$ where:
• the set of nodes is $V = Q \cup \{s, t\}$, i.e., the distinct set of queries $Q$ submitted to the search engine and two special nodes $s$ and $t$, representing a starting state and a terminal state which can be seen as the beginning and the end of a chain;
• $E \subseteq V \times V$ is the set of directed edges;
• $w : E \to (0, 1]$ is a weighting function that assigns to every pair of queries $(q, q') \in E$ a weight $w(q, q')$ representing the probability that $q$ and $q'$ are part of the same chain.

We focus mainly on how to compute the weighting function $w$. We will see how different applications may lead to different weighting schemes.

In our setting, even if a query has been submitted multiple times to the search engine, possibly by many different users, it is represented by a single node in the query-flow graph. The two special nodes $s$ and $t$ are used to capture the beginning and the end of query chains. In other words, the existence of an edge $(s, q_i)$ represents that $q_i$ may potentially be a starting query in a chain, and $w(s, q_i)$ quantifies the probability of this event happening. Similarly, an edge $(q_i, t)$ models the probability of $q_i$ being a terminal query in a chain. The edge weights in the query-flow graph are obtained from the query log by a machine-learning algorithm, as described in the following section.

4. BUILDING THE QUERY-FLOW GRAPH

In this section we describe our approach for building the query-flow graph $G_{qf} = (V, E, w)$. Our algorithm takes as input a set of sessions $S(\mathcal{L}) = \{S_1, \ldots, S_m\}$, which in our case are extracted from a query log $\mathcal{L}$ from the Yahoo! UK search engine in early 2008. As we already mentioned, the set of sessions can be easily constructed by sorting the queries by userid and by timestamp, and splitting them using the timeout threshold.

As stated in the previous section, the set of nodes $V$ in the query-flow graph is the distinct set of queries $Q$ in $\mathcal{L}$ plus the two special nodes $s$ and $t$. The key aspect of the construction of the query-flow graph is to define the weighting function $w : E \to (0, 1]$. For the moment we leave aside the two special nodes $s$ and $t$: we will discuss later how to connect them with the other nodes of the graph.
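The session construction described above (sort by user and timestamp, then split on the timeout) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout `(query, user, timestamp)`, timestamps expressed in seconds, and the name `split_sessions` are all assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

TIMEOUT_SECONDS = 30 * 60  # t_theta = 30 minutes, as in the text


def split_sessions(log, timeout=TIMEOUT_SECONDS):
    """Split (query, user, timestamp) records into per-user sessions.

    Assumed layout: timestamps are numbers in seconds. Records are
    sorted by user id and then by timestamp; a new session starts
    whenever the gap between two consecutive queries of the same
    user exceeds `timeout`.
    """
    sessions = []
    ordered = sorted(log, key=lambda r: (r[1], r[2]))
    for _, records in groupby(ordered, key=itemgetter(1)):
        current = []
        last_time = None
        for query, user, ts in records:
            # A gap larger than the timeout closes the current session.
            if current and ts - last_time > timeout:
                sessions.append(current)
                current = []
            current.append((query, user, ts))
            last_time = ts
        sessions.append(current)
    return sessions
```

Concatenating the sessions of one user in order gives that user's supersession, so the same pass can serve both notions.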
Given two queries $q, q' \in Q$ we tentatively connect them with an edge if there is at least one session in $S(\mathcal{L})$ in which $q$ and $q'$ are consecutive. In other words, we form the set of tentative edges $T$ as:

$$T = \{ (q, q') \mid \exists S_j \in S(\mathcal{L}) \text{ s.t. } q = q_i \in S_j \wedge q' = q_{i+1} \in S_j \}.$$

Next, we want to associate a probability $w(q, q')$ to each edge $(q, q') \in T$. We do this by building a machine learning model. The first step is to build, for each edge $(q, q') \in T$, a set of features associated with the edge. Those features are computed over all sessions in $S(\mathcal{L})$ that contain the queries $q$ and $q'$ appearing in this order and consecutively. The features we use aggregate, among others, information about the time difference between the submission of the two queries [11], the textual similarity of the queries [12, 14], and the number of sessions in which they appear. We describe the features in more detail shortly.

For learning the weighting function from these features, we use training data. This training data is created by picking at random a set of edges $(q, q')$ (excluding the edges where $q = s$ or $q' = t$) and manually assigning them a label "same chain". This label, or target variable, is assigned by human editors and is 0 if $q$ and $q'$ are not part of the same chain, and 1 if they are part of the same chain. The probability of having an edge included in the training set is proportional to the number of times the queries forming that edge occur in that order and consecutively in the query log. We then use this training data to learn the function $w(\cdot, \cdot)$, given the set of features and the label for each edge in $T$.

[Figure 1: The distribution of counts (number of times a given pair of queries appears consecutively in that order in $S(\mathcal{L})$); it is a power law with a spike at 1 (most pairs being hapax). Axes: Count versus Frequency.]

The features. We built a set of 18 features to be used to compute the function $w(\cdot, \cdot)$ for each edge in $T$. Several of these features were shown to be effective for query segmentation [11, 12, 14] and can be summarized as follows:
• Textual features. We compute the textual similarity of queries $q$ and $q'$ using various similarity measures, including cosine similarity, Jaccard coefficient, and size of intersection. Those measures are computed on sets of stemmed words and on character-level 3-grams.
• Session features. We compute the number of sessions in which the pair $(q, q')$ appears. We also compute other statistics of those sessions, such as average session length, average number of clicks in the sessions, average position of the queries in the sessions, etc.
• Time-related features. We compute the average time difference between $q$ and $q'$ in the sessions in which $(q, q')$ appears, and the sum of reciprocals of the time difference over all appearances of the pair $(q, q')$.

The learning function. The next step for constructing the query-flow graph is to train a machine learning model to predict the label "same chain". The training dataset consists of approximately 5,000 labeled examples; the labels were assigned by the authors of this paper. We tested and compared many different machine learning approaches.

As shown in Figure 1, the frequency of query pairs follows a power law with a spike at 1. After experimenting with different settings, we decided to divide the classification problem into two subproblems, and thus the data were also partitioned into two training sets $T_1$ and $T_2$, by distinguishing between pairs of queries appearing together only once (we name this set $T_1$, which contains approximately 50% of the cases) and pairs appearing together more than once (we name this set $T_2$). The distribution of the target variable "same chain" is 66% positive and 34% negative in $T_1$, and 70% positive and 30% negative in $T_2$. After various comparisons we selected the best models for $T_1$ and $T_2$ with respect to classification accuracy and simplicity of the model.
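A few of the features listed above are simple enough to sketch directly. The block below is an illustrative approximation only: lowercase whitespace tokens stand in for stemmed words (no stemmer is applied), sessions are assumed to be plain lists of query strings, and all function names are hypothetical.

```python
from collections import Counter


def word_set(query):
    # Assumption: lowercase whitespace tokens stand in for stemmed words.
    return set(query.lower().split())


def char_ngrams(query, n=3):
    # Character-level n-grams of the lowercased query string.
    text = query.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    # Jaccard coefficient |a & b| / |a | b|, defined as 0 for two empty sets.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_counts(sessions):
    """Count how often q is immediately followed by q' within a session.

    The keys of the returned counter form the tentative edge set T.
    """
    counts = Counter()
    for session in sessions:
        for q, q_next in zip(session, session[1:]):
            counts[(q, q_next)] += 1
    return counts
```

For a pair of queries, `jaccard(word_set(q), word_set(q2))` gives the word-level Jaccard coefficient, and `len(char_ngrams(q) & char_ngrams(q2))` gives the number of 3-grams in common.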
For $T_1$ we adopted a very simple yet accurate logistic regression model using only 3 of the features available, namely (a) the Jaccard coefficient between sets of stemmed words, (b) the number of n-grams in common between the two queries, and (c) the time between the two queries in seconds. For $T_2$ we instead adopted a rule-based model consisting of a total of 8 simple rules (4 for each class).

We use the selected models to assign the weight $w(q, q')$ to each edge $(q, q')$. In particular, we label each edge that has been classified as being in class 1 ("same chain") with the confidence with which the model makes the prediction. All the edges that are classified in class 0 are assigned weight 0, which corresponds to removing the edge from the query-flow graph $G_{qf}$.

The query-flow graph as a stochastic matrix. Next we consider normalizing the weights $w(q, q')$ of the query-flow graph so that the sum of the weights of the edges going out from each node is equal to 1. The result of such a normalization can be viewed as the transition matrix $P$ of a Markov chain [footnote 1] whose states correspond to queries and where $P(q, q')$ is the probability that the query $q'$ immediately follows the query $q$.

Adding the starting and terminal state. Finally, we have to describe how the two special nodes $s$ and $t$ enter the query-flow graph $G_{qf} = (V, E, w)$. For each session $S \in S(\mathcal{L}) = \{S_1, \ldots, S_m\}$, we add an edge $(s, q)$ where $q$ is the first query of the session $S$, and we assign it a weight of $m'/m$, where $m' \le m$ is the number of times $q$ appears at the beginning of a session. Similarly, let now $q$ be the ending query of any session $S \in S(\mathcal{L})$; moreover, suppose that $q$ appears $c$ times across all the sessions, but only $m' \le m$ times at the end of a session; then, we add an edge $(q, t)$ with weight $m'/c$, and we multiply the weights of the other arcs of the form $(q, \cdot)$ by $1 - m'/c$.

In Figure 2 we show a small snapshot of the query-flow graph we produced.
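The normalization and the treatment of the two special nodes described in this section can be sketched as below. This is a minimal illustration under stated assumptions: classifier weights are given as a dictionary keyed by query pairs, sessions are non-empty lists of query strings, and the labels `"<s>"` and `"<t>"` and the function name are hypothetical.

```python
from collections import Counter, defaultdict

S, T = "<s>", "<t>"  # hypothetical labels for the special nodes s and t


def to_transition_matrix(weights, sessions):
    """Turn edge weights {(q, q2): w} into transition probabilities.

    Each row is normalised to sum to 1, then rescaled by 1 - m'/c to
    make room for the edge (q, <t>) of weight m'/c, where c counts the
    occurrences of q and m' those at the end of a session. Edges
    (<s>, q) get weight m'/m, with m the number of sessions.
    """
    row_sum = defaultdict(float)
    for (q, _), w in weights.items():
        row_sum[q] += w

    first = Counter(s[0] for s in sessions)
    last = Counter(s[-1] for s in sessions)
    total = Counter(q for s in sessions for q in s)
    stop = {q: last[q] / total[q] for q in last}

    P = {}
    for (q, q2), w in weights.items():
        P[(q, q2)] = (w / row_sum[q]) * (1.0 - stop.get(q, 0.0))
    for q, p in stop.items():
        P[(q, T)] = p
    for q, count in first.items():
        P[(S, q)] = count / len(sessions)
    return P
```

Keeping `P` as a sparse dictionary of pairs, rather than a dense matrix, reflects that most query pairs never co-occur; a query with no outgoing weight simply has no row.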
The snapshot contains the query "barcelona" and some of its followers up to a depth of 2, selected in decreasing order of count. The terminal node $t$ is also present in the figure. Note that the weights of the outgoing edges of each node do not sum to 1 only because not all outgoing edges (and the corresponding destination nodes) are shown.

[Figure 2: A portion of the query-flow graph using the weighting scheme described in Section 4. Nodes shown: barcelona, barcelona fc, barcelona fc website, barcelona fc fixtures, real madrid, barcelona hotels, cheap barcelona hotels, luxury barcelona hotels, barcelona weather, barcelona weather online, and the terminal node $t$.]

Footnote 1: The matrix itself is only substochastic, due to the presence of queries with no successors; when we need to turn it into a proper stochastic matrix (as we do for recommendation in Section 6), we add an artificial, uniformly weighted set of edges from every query $q$ with no successors to all other queries.

5. FINDING CHAINS

In this section we describe our first application of the query-flow graph: finding chains of queries in user sessions. As we have already mentioned, finding chains is a very important problem, as it allows improving query-log analysis, user profiling, mining user behavior, and more.

The problem we consider is the following. We are given a supersession $S = \langle q_1, q_2, \ldots, q_k \rangle$ of one particular user. We are also given the query-flow graph, which has been computed with the sessions of $S$ as part of its input. The chain-finding problem can also be defined in the case that the sessions of $S$ have not participated in the construction of the query-flow graph. However, in this paper we focus on the former case and we leave the latter for future work.

One of the challenges of the problem we consider arises from our definition of chains: we allow chains not to be consecutive in the supersession $S$; in other words, the supersession $S$ may contain many intertwined chains such as the ones shown in Table 1. Previous work has mostly focused on the case where all chains are consecutive.

Table 1: Two fragments from sessions containing non-consecutive chains.
Chain #1: ... football results january 2nd royal carribean cruises holidays motherwell football club ...
Chain #2: ... pointui forum audi ipswich golfers elbow cox ipswich ...

The chain-finding problem can be formalized as follows: let us define a chain cover of $S = \langle q_1, q_2, \ldots, q_k \rangle$ as a partition of the set $\{1, \ldots, k\}$ into subsets $C_1, \ldots, C_h$; each set $C_u = \{ i^u_1 < \cdots < i^u_{\ell_u} \}$ is thought of as a chain $C_u = \langle s, q_{i^u_1}, \ldots, q_{i^u_{\ell_u}}, t \rangle$, with which we associate the probability

$$P(C_u) = P(s, q_{i^u_1}) \, P(q_{i^u_1}, q_{i^u_2}) \cdots P(q_{i^u_{\ell_u - 1}}, q_{i^u_{\ell_u}}) \, P(q_{i^u_{\ell_u}}, t),$$

and we want to find a chain cover maximizing $P(C_1) \cdots P(C_h)$.

When a query appears more than once, "duplicate" nodes for that query are added to the formulation, which makes the description of the algorithm slightly more complicated than what is presented here. For simplicity of presentation we omit below the details related to queries appearing more than once, which are not fundamental to the understanding of the algorithm.

TSP formulation. We shall now provide an alternative, equivalent formulation of the same problem. Given the session $S = \langle q_1, q_2, \ldots, q_k \rangle$, consider a directed weighted graph $G_S = (V, E, \omega)$ with nodes $V = \{s, q_1, \ldots, q_k, t\}$, and arcs defined as follows:
• for every $i$ and $j$ with $i < j$, there is an arc $(q_i, q_j)$ with
$$\omega(q_i, q_j) = \begin{cases} -\log P(q_i, q_j) & \text{if } i + 1 < j, \\ -\log \max\{ P(q_i, q_j),\ P(q_i, t) P(s, q_j) \} & \text{otherwise;} \end{cases}$$
• for every $i$ and $j$ with $j < i$, there is an arc $(q_i, q_j)$ with $\omega(q_i, q_j) = -\log \big( P(q_i, t) P(s, q_j) \big)$;
• for every $i$, there are arcs $(s, q_i)$ and $(q_i, t)$ with $\omega(s, q_i) = -\log P(s, q_i)$ and $\omega(q_i, t) = -\log P(q_i, t)$.

Intuitively, the graph $G_S$ contains two kinds of arcs between the session queries:
• green arcs: a green arc from $q_i$ to $q_j$ is an arc whose weight is $-\log P(q_i, q_j)$, and it is present only if $i < j$; when we follow a green arc we are "going on" with a chain (possibly skipping some queries, those with indices $i + 1, \ldots, j - 1$, that will be part of some other chain);
• red arcs: a red arc from $q_i$ to $q_j$ is an arc whose weight is $-\log \big( P(q_i, t) P(s, q_j) \big)$, and it is present only if $i + 1 \ge j$; when we follow a red arc we are deciding to stop the current chain; at this point, some other chain will start, and $q_j$ is the starting point of the new chain. Notice that we shall always restart from the query $q_j$ with the smallest possible index that is as yet unassigned to any chain, so either $j < i$ or $j = i + 1$.

We also have arcs leaving $s$ and arcs entering $t$: these are used only for the very first query of the very first chain, say $q_i$, and for the very last query of the very last chain, say $q_j$, to account for the additional probability contribution given by $P(s, q_i)$ and $P(q_j, t)$, respectively.

We will prove that finding a chain cover maximizing $P(C_1) \cdots P(C_h)$ can be reduced to solving min-TSP on $G_S$. For the sake of simplicity, assume for the moment that $P(q_i, q_{i+1}) = P(q_i, t) P(s, q_{i+1})$ for every $i < k$. Take a chain cover $C_1, \ldots, C_h$ (without loss of generality, let $i^1_1 < i^2_1 < \cdots < i^h_1$) and identify each $C_u$ with a path $\hat{C}_u$ as follows:

$$\hat{C}_u = q_{i^u_1} \to \cdots \to q_{i^u_{\ell_u}}.$$

Because of the way we defined the weights of green arcs, we have

$$\omega(\hat{C}_u) = -\log P(C_u) + \log \big( P(s, q_{i^u_1}) \, P(q_{i^u_{\ell_u}}, t) \big).$$

Now, since

$$\omega(q_{i^u_{\ell_u}}, q_{i^{u+1}_1}) = -\log P(q_{i^u_{\ell_u}}, t) - \log P(s, q_{i^{u+1}_1}),$$

we have

$$\omega(s \to \hat{C}_1 \to \cdots \to \hat{C}_h \to t) = -\log \big( P(C_1) \cdots P(C_h) \big),$$

and $s \to \hat{C}_1 \to \cdots \to \hat{C}_h \to t$ is a Hamiltonian path from $s$ to $t$. Conversely, it is easy to see that every Hamiltonian path from $s$ to $t$ can be produced by a chain cover of $S$. So, finding a maximal chain cover can be reduced to finding a Hamiltonian path of minimum weight [footnote 2]. This is a version of the minimum travelling salesman problem (min-TSP) with asymmetric costs (the fact that the cycle is to contain the arc $t \to s$ does not make any difference, because all other arcs going out of $t$ have infinite weight).

Footnote 2: Dealing with the case $P(q_i, q_{i+1}) \ne P(q_i, t) P(s, q_{i+1})$ is easy, because then, even though there can be more than one cover associated with the same Hamiltonian path (one that breaks a chain between $q_i$ and $q_{i+1}$, and one that does not), only one such cover will have maximal probability, and it is precisely the one that is used to assign weights in the graph.

[Figure 3: (a) A sub-graph from an example query graph, containing only three queries; (b) The graph $G_S$ corresponding to a specific session $S = \langle B, A, C \rangle$.]

5.1 Example

In Figure 3, we show an example of a query-flow graph and the graph $G_S$ corresponding to the session $S = \langle B, A, C \rangle$: logarithms have been approximated, and the Markov chain is shown only for the queries involved in the session. Dotted lines are used for red arcs. The solution to min-TSP is $s \to C \to B \to A \to t$ (with cost $0.36 + 1.97 + 2.3 + 0.36 = 4.99$), which corresponds to the chain cover $C_1 = \langle s, C, t \rangle$ and $C_2 = \langle s, B, A, t \rangle$; note that $P(C_1) = 0.7 \cdot 0.7$ and $P(C_2) = 0.2 \cdot 0.1 \cdot 0.7$, with the overall product being $P(C_1) \cdot P(C_2) = 0.00686$ and $-\log(P(C_1) P(C_2)) \approx 4.99$, as expected.

5.2 Approximated greedy solution

It is well known that min-TSP is NP-hard even when weights are symmetric; exact branch-and-bound solutions exist, but they are rather slow and work reasonably only for a few tens of nodes. Instead of trying to produce exact solutions, we content ourselves with a greedy heuristic that simply chooses every time the arc with minimum weight going out of the current node: in the following, we shall refer to this heuristic simply as the ATSP algorithm. The ATSP algorithm works in time $O(k^2)$, where $k$ is the size of the supersession.

It would be interesting to know how far the solution produced by this algorithm is from the exact solution on real data; on a more theoretical side, it would be nice to determine whether our problem is still NP-hard, or whether it is actually simpler, maybe polynomial. Both questions are left for future work.

5.3 Comparison with session timeout

In this section we describe our experiments for evaluating the chain-finding algorithm we propose, and compare it with a simple timeout-based method. The query-flow graph is created as described in Section 4.

To create a data set for evaluating the session-breaking task, we sampled uniformly at random a set of 586 supersessions containing 2 queries or more (if there is only one query the task is trivial). Each of these 586 supersessions is classified by human editors using the following methodology: (i) first, duplicate queries are eliminated; (ii) each query is assigned by the human editors to one chain (possibly non-consecutive); (iii) some queries remain unassigned in this process (when the human editor cannot clearly map a query to one chain). The chains obtained in the above process constitute the "golden standard" with which we compare our algorithm.

We then apply the ATSP algorithm described above for splitting the 586 supersessions into chains. For comparison we also implemented a "baseline" algorithm, which splits each supersession into sessions (using only the timeout threshold $t_\theta$) and considers each resulting session as a chain.
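The arc weights of $G_S$ and the greedy heuristic of Section 5.2 can be sketched as follows. This is an illustrative reading of the description, not the authors' code: it assumes a non-empty session of distinct queries (the duplicate-node handling is omitted, as in the text), transition probabilities stored in a dictionary keyed by pairs, and hypothetical labels `"<s>"` and `"<t>"`. In the usage example, $P(s, B)$, $P(s, C)$, $P(B, A)$, $P(A, t)$ and $P(C, t)$ are the values quoted in Section 5.1, while $P(s, A) = 0.1$ and $P(B, t) = 0.3$ are invented to complete the example.

```python
import math

S, T = "<s>", "<t>"  # hypothetical labels for the special nodes s and t


def neg_log(p):
    return -math.log(p) if p > 0 else math.inf


def arc_probs(P, session, i, j):
    qi, qj = session[i], session[j]
    green = P.get((qi, qj), 0.0)                     # go on with the chain
    red = P.get((qi, T), 0.0) * P.get((S, qj), 0.0)  # stop it, start a new one
    return green, red


def arc_weight(P, session, i, j):
    """omega(q_i, q_j) in G_S for distinct 0-based positions i, j."""
    green, red = arc_probs(P, session, i, j)
    if j > i + 1:
        return neg_log(green)
    if j == i + 1:
        return neg_log(max(green, red))
    return neg_log(red)


def greedy_chains(P, session):
    """Greedy path s -> ... -> t; returns (chains, total cost)."""
    unvisited = list(range(len(session)))
    cur = min(unvisited, key=lambda j: neg_log(P.get((S, session[j]), 0.0)))
    cost = neg_log(P.get((S, session[cur]), 0.0))
    unvisited.remove(cur)
    chains = [[session[cur]]]
    while unvisited:
        # Always follow the cheapest arc leaving the current node.
        nxt = min(unvisited, key=lambda j: arc_weight(P, session, cur, j))
        cost += arc_weight(P, session, cur, nxt)
        green, red = arc_probs(P, session, cur, nxt)
        if nxt < cur or (nxt == cur + 1 and red > green):
            chains.append([session[nxt]])    # red arc: a new chain starts
        else:
            chains[-1].append(session[nxt])  # green arc: same chain
        unvisited.remove(nxt)
        cur = nxt
    cost += neg_log(P.get((session[cur], T), 0.0))
    return chains, cost


# Values from Section 5.1, plus two assumed ones (marked).
P = {(S, "B"): 0.2, (S, "C"): 0.7, ("B", "A"): 0.1,
     ("A", T): 0.7, ("C", T): 0.7,
     (S, "A"): 0.1,    # assumption, not given in the text
     ("B", T): 0.3}    # assumption, not given in the text
chains, cost = greedy_chains(P, ["B", "A", "C"])
```

With these values the greedy walk follows $s \to C \to B \to A \to t$, the same path as in the example, and the cost equals $-\log 0.00686$ (natural logarithm). Each step scans the unvisited queries once, which gives the $O(k^2)$ running time.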
Given a supersession $S$, the chains produced for $S$ by the human evaluation or by the algorithms we test define a partition of $S$. We evaluate the ATSP and the Baseline algorithms by comparing the chains they produce with the chains produced by the human evaluation, using the Rand index [17], a commonly employed measure of similarity between partitions. Notice that the chains produced by the human evaluation do not contain duplicate queries, while the chains produced by the ATSP and the Baseline algorithms might contain duplicates, so before computing the Rand index we remove duplicate queries.

Results. Given a supersession $S$, let $R_A(S)$ be the Rand index obtained by comparing the chains produced for $S$ by the ATSP algorithm with the “golden standard” chains for $S$, and let $R_B(S)$ be the Rand index obtained by comparing the chains produced for $S$ by the Baseline algorithm with the “golden standard” chains for $S$. The distributions of $R_A$ and $R_B$ are summarized in Table 2.

Table 2: Rand index distributions for ATSP and Baseline.

            1st Qu.   Median   Mean     3rd Qu.
ATSP        0.7778    1.0000   0.8687   1.0000
Baseline    0.7521    1.0000   0.8554   1.0000

Table 3: Rand index distributions for ATSP and Baseline, when $R_B(S) < 1$.

            1st Qu.   Median   Mean     3rd Qu.
ATSP        0.5556    0.7778   0.7464   0.9022
Baseline    0.5100    0.7500   0.6984   0.8485

From Table 2, it appears that the two algorithms are almost equivalent. However, a closer look at the results reveals that the seemingly similar performance is caused by many easy supersessions, e.g., supersessions consisting of one or two queries, which the Baseline is able to handle correctly. A more detailed analysis reveals that the ATSP algorithm is better at handling more difficult supersessions. For instance, in 92% of the cases in which $R_B(S) = 1$ we also have $R_A(S) = 1$.
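For reference, the Rand index between two partitions is the fraction of item pairs on which the partitions agree, that is, pairs placed together in both or apart in both. The sketch below is a minimal illustration with hypothetical names (`rand_index`, `gold`, `atsp`, `baseline`) and a made-up supersession; it assumes that duplicates have already been removed and that both partitions cover exactly the same queries, and it does not model the queries left unassigned by the editors.

```python
from itertools import combinations

def rand_index(partition_a, partition_b):
    """Rand index between two partitions of the same set of items.
    Each partition is a list of disjoint collections (here, chains)."""
    label_a = {x: i for i, block in enumerate(partition_a) for x in block}
    label_b = {x: i for i, block in enumerate(partition_b) for x in block}
    if set(label_a) != set(label_b):
        raise ValueError("partitions must cover the same items")
    pairs = list(combinations(sorted(label_a), 2))
    if not pairs:
        return 1.0  # zero or one item: the partitions trivially agree
    agree = sum(
        (label_a[x] == label_a[y]) == (label_b[x] == label_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)

# Hypothetical supersession with four distinct queries.
gold = [["jeep", "jeep cherokee"], ["apple", "apple ipod"]]
atsp = [["jeep", "jeep cherokee"], ["apple", "apple ipod"]]
baseline = [["jeep", "jeep cherokee", "apple"], ["apple ipod"]]

r_a = rand_index(gold, atsp)
r_b = rand_index(gold, baseline)
print(r_a, r_b)
```

In this toy case the first partition reproduces the reference exactly, while the second agrees with it on three of the six query pairs.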
In the cases in which $R_B(S) < 1$ (supersessions difficult for the Baseline), the average $R_B$ score is 0.6984, while the average $R_A$ score is 0.7464, a 6.4% improvement (see Table 3). On the other hand, in the cases in which $R_A(S) < 1$ (supersessions difficult for the ATSP), the average $R_B$ score is 0.7248, while the average $R_A$ score is 0.7140, which is only a 1.4% improvement of the Baseline with respect to the ATSP algorithm. In other words, simple cases are treated comparably well by the ATSP algorithm and the Baseline, while in difficult cases the ATSP algorithm clearly outperforms the Baseline; in Figure 4 we show the situation for the case $R_B(S) < 1$ through a scatter plot.

Figure 4: Every point in this plot corresponds to a supersession $S$ with $R_B(S) < 1$; its coordinates are $(R_B(S), R_A(S))$. The fact that the points in the upper-left corner are denser than in the lower-right corner further supports the evidence that the ATSP algorithm outperforms the Baseline when $R_B(S) < 1$.

We note again that the ATSP algorithm has the ability to find intertwined chains, which, to our knowledge, is a significant novelty with respect to the current state of the art. We also note that, given a supersession, the ATSP algorithm does not use the timestamp information of the queries at all, which is precisely the information exploited by the Baseline algorithm.

6. QUERY RECOMMENDATIONS

Most modern search engines include some form of automatic query recommendation, to suggest new queries that may be relevant to the current user's mission. Using the massive information in query logs for this purpose was suggested in [20]. Here we obtain query recommendations as an application of the query-flow graph. The query recommendation task is different from the session-breaking task described in Section 5; while we can use the same query-flow graph, we find that for the algorithm we propose it is more meaningful to use different weights on the edges. In particular, we define new weights $w'(q, q')$ as follows:

$$w'(q, q') = \begin{cases} \dfrac{\mathrm{count}(q, q')}{d(q)} & \text{if } (w(q, q') > \theta) \lor (q = s) \lor (q = t), \\ 0 & \text{otherwise,} \end{cases}$$

where $\mathrm{count}(q, q')$ is the number of times query $q$ is followed by query $q'$, or the number of times a chain starts with query $q'$ if $q = s$, or the number of times a chain ends with query $q$ if $q' = t$. The factor $d(q) = \sum_{q'} \mathrm{count}(q, q')$ is used for normalization, and $\theta = 0.9$ is the confidence value we require for the edge $(q, q')$ to be composed of two queries in the same chain.

It is worth noting here that, intuitively, the problem of query recommendation may benefit from handling query similarities in a non-symmetric way, and indeed the query-flow graph is strongly non-symmetric. Excluding the $s$ and $t$ nodes, whose arcs are obviously not symmetric, 93% of the arcs in the graph do not have a reciprocal arc. Moreover, even for the few arcs that possess a reciprocal, the weights in the two directions, $w(q, q')$ and $w(q', q)$, are only weakly correlated (Kendall's $\tau$ is about 0.26), and the same is true of $w'$ (Kendall's $\tau$ is 0.16).

6.1 Recommendation by maximum weight

A simple recommendation scheme that uses the query-flow graph is to pick, for an input query $q$, the node $q'$ having the largest $w'(q, q')$. An example output from this scheme is shown in the first column of Table 4 for the queries “apple” and “jeep”. An issue with this method, which we observed for several test queries, is that it tends to “drift” towards queries that are popular in the query log but unrelated to the query at hand.

6.2 Recommendation by random walk

A recommendation algorithm can be built upon a measure of relative importance: when a user submits a query $q$ to the engine, the recommendation that the engine provides should be the most important query $q'$ relative to $q$.
If we look at the problem from this point of view, we are naturally led to apply a form of personalized PageRank [9], where the preference vector is concentrated in a single node. Alternatively, this can be described as a random walk with restart to a single node [5]: a random surfer starts at the initial query $q$; then, at each step, with probability $\alpha < 1$ the surfer follows one of the outlinks from the current node, chosen proportionally to the weights on the arcs, and with probability $1 - \alpha$ (s)he instead jumps back to $q$. This process describes the transition matrix $A$ of a Markov chain that can be more formally defined as

$$A = \alpha P + (1 - \alpha)\,\mathbf{1} e_q^T,$$

where $P$ is the row-normalized weight matrix of the query-flow graph, and $e_j$ is the vector whose entries are all zeroes, except for the $j$-th, whose value is 1. Although $A$ is not ergodic in general, as proven in [5] $A$ is unichain as long as $\alpha \in [0, 1)$, so it has a unique stationary distribution, namely, a unique distribution vector $v$ such that $v^T A = v^T$. Such a distribution (called the random-walk score relative to $q$) can be computed using the power iteration method, and then employed to determine the relevance of all queries with respect to $q$, as explained below. In all our experiments, we chose $\alpha = 0.85$, as is customary in the PageRank literature [4], and used the $\ell_1$-norm of the difference of two successive iterates to decide when to stop.

Recommendations can be deduced from the random-walk score by taking either the single top-scored query, or the best queries up to a certain lower score threshold. Notice that, in particular, if the most relevant query for $q$ is $t$, this means that it is wise for the engine not to give any suggestion, because the query-flow graph is showing that the chain at that point is more likely to end than to continue.

Using just the random-walk score, though, can be misleading, because in many cases a query has a high random-walk score simply because it is a very common query altogether; the situation here is not dissimilar to what happens in the classical weighting schemes used for document retrieval, like tf-idf, where the term frequency within a document needs to be discounted by the absolute importance of the term (the idf part of the formula). Instead of using the pure random-walk score $s_q(q')$ of the query $q'$ with respect to $q$, we can consider the ratio $\hat{s}_q(q') = s_q(q')/r(q')$, where $r(q')$ is the absolute random-walk score of $q'$ (i.e., the one computed using a uniform preference vector). Experiments show that indeed in most cases $\hat{s}_q(q')$ produces rankings that are more reasonable, but it sometimes tends to boost too much the scores of queries having a very low absolute score $r(q')$. To use a bigger denominator, we also tried with $\sqrt{r(q')}$, since $r(q') < 1$; this also corresponds to the geometric mean between $s_q(q')$ and $\hat{s}_q(q')$, that is

$$\bar{s}_q(q') = \sqrt{s_q(q') \cdot \hat{s}_q(q')} = \frac{s_q(q')}{\sqrt{r(q')}}.$$

Table 4 shows the output of the maximum-weight baseline, of the random-walk scoring, and of the adjusted variants discussed above: note that, except for the first few queries, the baseline soon “gets lost” in completely unrelated queries; $s_q$ works well, but as expected popular queries (like “ebay”) pollute the results; on the other hand, $\hat{s}_q$ tends to overpenalize common queries and to produce exotic recommendations (“apple belgium”), whereas $\bar{s}_q$ gives the most pertinent results.

Table 4: Top 10 recommendations for the queries q = “apple” and q = “jeep” according to the baseline and to the various random-walk scores proposed.

q = “apple”
Max. weight       s_q               ŝ_q                 s̄_q
t                 t                 apple               apple
apple ipod        apple             apple fruit         apple ipod
apple store       apple ipod        apple ipod          apple trailers
apple trailers    apple store       apple belgium       apple store
amazon            apple trailers    eating apple        apple mac
apple mac         google            apple.nl            apple fruit
itunes            amazon            apple monitor       apple usa
pc world          argos             apple usa           apple ipod nano
argos             itunes            apple jobs          apple.com/ipod...
currys            pc world          apple movie ...     t

q = “jeep”
Max. weight       s_q               ŝ_q                 s̄_q
t                 t                 jeep                jeep
jeep cherokee     jeep              jeep trails         jeep cherokee
jeep grand ...    jeep cherokee     jeep kinderk...     jeep trails
jeep wrangler     jeep grand ...    jeep compass        jeep compass
land rover        bmw               jeep cherokee       jeep kinderkled...
landrover         jeep wrangler     swain and jon...    jeep grand ...
ebay              land rover        jeep bag            jeep wrangler
chrysler          landrover         country living ...  chryslar
bmw               chrysler          buy range rov...    jeepcj7
nissan            google            craviotto snare     buses to Knowl...

6.3 Recommendation with history

A further step in the same direction is providing recommendations that depend not only on the last query input by the user, but on some of the last queries in the user's history. This approach may help to alleviate the data sparsity problem: the current query may be rare, but among the previous queries there might be queries for which we have enough information in the query-flow graph. Basing the recommendation on the user's query history may also help to resolve ambiguous queries, as we have more informative suggestions based on what the user is doing during the current session. Using the same notation as before, suppose that $q_1, \ldots, q_k$ is the current query chain (ordered starting from the most recent); then, we consider the Markov process whose transition matrix is defined by

$$A = \alpha P + (1 - \alpha)\,\mathbf{1} e_{q_1, \ldots, q_k}^T,$$

where $v = e_{q_1, \ldots, q_k}$ is a vector whose entries are such that $v_{q_1} > v_{q_2} > \cdots > v_{q_k} > 0$. Equivalently, the overall process may be described using the random surfer metaphor, where $v$ is the distribution used to choose the teleportation node, when teleportation is decided. Although other choices are possible, we always fixed $v$ to be such that $v_q = 0$ for all $q \notin \{q_1, \ldots, q_k\}$, and $v_{q_i} \propto \beta^i$ for some $\beta < 1$. Also in this case, we do not use the pure random-walk score $s_{q_1, \ldots, q_k}(q')$ of the query $q'$ with respect to the sequence $q_1, \ldots, q_k$, but the adjusted score $\bar{s}_{q_1, \ldots, q_k}(q')$ instead.
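The random walk with restart, the history-based preference vector, and the adjusted score can be put together in a short sketch. This is an illustration under stated assumptions rather than the implementation used in the experiments: the function names, the dictionary-of-dictionaries representation of the row-normalized matrix $P$, and the tiny graph with its transition probabilities are all hypothetical, and the queries in the history are assumed to be distinct.

```python
import math

def random_walk_scores(P, v, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for x = alpha * x P + (1 - alpha) * v.
    P maps each node to its outgoing transition probabilities (rows sum
    to 1); v maps nodes to teleportation (preference) probabilities."""
    nodes = list(P)
    x = {n: v.get(n, 0.0) for n in nodes}
    for _ in range(max_iter):
        nxt = {n: (1.0 - alpha) * v.get(n, 0.0) for n in nodes}
        for src, row in P.items():
            for dst, p in row.items():
                nxt[dst] += alpha * x[src] * p
        diff = sum(abs(nxt[n] - x[n]) for n in nodes)  # l1-norm of the change
        x = nxt
        if diff < tol:
            break
    return x

def history_preference(history, beta=0.8):
    """Preference proportional to beta**i, with i = 1 for the most recent query."""
    raw = {q: beta ** i for i, q in enumerate(history, start=1)}
    total = sum(raw.values())
    return {q: w / total for q, w in raw.items()}

def adjusted_scores(P, history, alpha=0.85, beta=0.8):
    """Relative score divided by the square root of the absolute score."""
    s = random_walk_scores(P, history_preference(history, beta), alpha)
    uniform = {n: 1.0 / len(P) for n in P}
    r = random_walk_scores(P, uniform, alpha)
    return {n: s[n] / math.sqrt(r[n]) for n in P}

# Hypothetical row-normalized graph; "t" stands for the terminal node.
P = {
    "apple": {"apple ipod": 0.5, "apple store": 0.3, "t": 0.2},
    "apple ipod": {"apple store": 0.4, "t": 0.6},
    "apple store": {"apple": 0.1, "t": 0.9},
    "beatles": {"apple": 0.7, "t": 0.3},
    "t": {"t": 1.0},
}
history = ["apple", "beatles"]  # most recent query first

v = history_preference(history)
s = random_walk_scores(P, v)
scores = adjusted_scores(P, history)
for query in sorted(scores, key=scores.get, reverse=True):
    print(query, round(scores[query], 4))
```

A history of length one reduces to the single-query random walk with restart, and the absolute score is always positive because the uniform preference vector gives every node some teleportation mass, so the division is safe.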
It is interesting to compare the relevance score $\bar{s}_{q_1, \ldots, q_k}(q')$, which can provide recommendations using the whole history, with the score $\bar{s}_{q_1}(q')$, which can only exploit the last query. Table 5 shows the output for two hypothetical chains, in which the query $q'$ = “apple” is preceded either by the query $q$ = “banana” or by the query $q$ = “beatles” (“Apple Records” is a record label founded by The Beatles). The parameter $\beta$ is set to 0.8 and the scoring uses $\bar{s}_q$. In Table 6, two actual query sessions are processed by the algorithm.

Table 5: Recommendations for the query q = “apple”, considering that the previous query was “banana” (top) or “beatles” (bottom).

banana → apple banana apple usb no banana cs giant chocolate bar where is the seed in anut banana shoe fruit banana banana cloths eating bugs banana banana eating bugs banana holiday opening a banana banana shoe fruit banana recipe 22 feb 08 banana jules oliver banana cs banana cloths

beatles → apple beatles apple apple ipod scarring srg peppers artwork ill get you bashles dundee folk songs the beatles love album place lyrics beatles beatles beatles scarring paul mcartney yarns from ireland statutory instrument A55 silver beatles tribute band beatles mp3 GHOST’S ill get you fugees triger finger remix

Table 6: Recommendations for two actual query chains.

music facebook → gabriella → music music music yahoo music gabriella music videos yahoo music music downloads music videos free music music downloads yahoo music videos free music music yahoo gabriella sweet like me free music videos lighting bug rotherham yahoo music launch ccp npa ndf free music downloads gabriela lighting

evening dress evening dress formal evening dress red evening dress myevening dress prom 008 dresses long dressess evening dress uk fashion women dress dresses for the evening 1900evening dress orion → orion dress orion evening dress → evening dress evening dress orion evening dress formal evening dress red evening dress long dressess myevening dress fashion women dress prom 008 dresses evening dress uk 1900evening dress

7. CONCLUSIONS

The query-flow graph summarizes a query log in a compact representation. This representation can be obtained efficiently from the source data and enables several key search and mining operations. The query-flow graph is sparse, and about half of the query pairs appear only once in the query log. Also, the graph is strongly non-symmetric, as 93% of the edges have no reciprocal edge.

In this paper, we have shown two key applications in usage mining that are supported by the query-flow graph. We have shown a method that exploits the information in the query-flow graph for segmenting user sessions into logically coherent query chains. We have also shown several methods for generating query suggestions based on random walks in the query-flow graph.

Extensive evaluation and tuning of these methods are necessary to implement them effectively in practice. So far we have shown that these tasks can be implemented efficiently using the abstraction we have developed here. Specific aspects to look at in future work include: features for the query segmentation model, weighting schemes for the recommendation systems, scoring methods for the output of the random walks, and better evaluation methods.
8. REFERENCES

[1] R. Baeza-Yates. Graphs from search engine queries. In Theory and Practice of Computer Science (SOFSEM), volume 4362 of LNCS, pages 1–8, Harrachov, Czech Republic, January 2007. Springer.
[2] R. Baeza-Yates and A. Tiberi. Extracting semantic relations from query logs. In KDD ’07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 76–85, New York, NY, USA, 2007. ACM Press.
[3] R. A. Baeza-Yates, C. A. Hurtado, and M. Mendoza. Query recommendation using query logs in search engines. In EDBT Workshops, volume 3268 of LNCS, pages 588–596. Springer, 2004.
[4] M. Bianchini, M. Gori, and F. Scarselli. Inside PageRank. ACM Trans. Internet Technol., 5(1):92–128, 2005.
[5] P. Boldi, V. Lonati, M. Santini, and S. Vigna. Graph fibrations, graph isomorphism, and PageRank. RAIRO Inform. Théor., 40:227–253, 2006.
[6] L. Catledge and J. Pitkow. Characterizing browsing behaviors on the world wide web. Computer Networks and ISDN Systems, 6(27), 1995.
[7] K. Collins-Thompson and J. Callan. Query expansion using random walk models. In CIKM ’05: Proceedings of the 14th ACM international conference on Information and knowledge management, pages 704–711, New York, NY, USA, 2005. ACM.
[8] N. Craswell and M. Szummer. Random walks on the click graph. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 239–246, New York, NY, USA, 2007. ACM Press.
[9] K. Csalogány, D. Fogaras, B. Rácz, and T. Sarlós. Towards scaling fully personalized PageRank: Algorithms, lower bounds, and experiments. Internet Math., 2(3):333–358, 2005.
[10] B. M. Fonseca, P. B. Golgher, E. S. de Moura, and N. Ziviani. Using association rules to discover search engines related queries. In LA-WEB ’03: Proceedings of the First Latin American Web Congress, Washington, DC, USA, 2003. IEEE Computer Society.
[11] D. He and A. Göker. Detecting session boundaries from web user logs. In Proceedings of the BCS-IRSG 22nd annual colloquium on information retrieval research, pages 57–66, Cambridge, UK, 2000.
[12] D. He, A. Göker, and D. J. Harper. Combining evidence for automatic web session identification. Inf. Process. Manage., 38(5):727–742, September 2002.
[13] R. Jones and F. Diaz. Temporal profiles of queries. ACM Trans. Inf. Syst., 25(3), July 2007.
[14] R. Jones and K. L. Klinkner. Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. Submitted for publication, 2008.
[15] B. Piwowarski and H. Zaragoza. Predictive user click models based on click-through history. In CIKM ’07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pages 175–182, New York, NY, USA, 2007. ACM.
[16] F. Radlinski and T. Joachims. Query chains: learning to rank from implicit feedback. In KDD ’05: Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 239–248, New York, NY, USA, 2005. ACM Press.
[17] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:622–626, 1971.
[18] J. Teevan, E. Adar, R. Jones, and M. A. S. Potts. Information re-retrieval: repeat queries in Yahoo's logs. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 151–158, New York, NY, USA, 2007. ACM.
[19] J.-R. Wen, J.-Y. Nie, and H.-J. Zhang. Clustering user queries of a search engine. In WWW ’01: Proceedings of the 10th international conference on World Wide Web, pages 162–168, New York, NY, USA, 2001. ACM.
[20] Z. Zhang and O. Nasraoui. Mining search engine query logs for query recommendation. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pages 1039–1040, New York, NY, USA, 2006. ACM.