Model-Directed Web Transactions Under Constrained Modalities
Model-Directed Web Transactions Under Constrained Modalities
Model-Directed Web Transactions Under Constrained Modalities
Online transactions (e.g., buying a book on the Web) typically involve a number of steps spanning
several pages. Conducting such transactions under constrained interaction modalities as exempli-
fied by small screen handhelds or interactive speech interfaces—the primary mode of communica-
tion for visually impaired individuals—is a strenuous, fatigue-inducing activity. But usually one
needs to browse only a small fragment of a Web page to perform a transactional step such as a
form fillout, selecting an item from a search results list, and so on. We exploit this observation to
develop an automata-based process model that delivers only the “relevant” page fragments at each
transactional step, thereby reducing information overload on such narrow interaction bandwidths.
We realize this model by coupling techniques from content analysis of Web documents, automata
learning and statistical classification. The process model and associated techniques have been incor-
porated into Guide-O, a prototype system that facilitates online transactions using speech/keyboard
interface (Guide-O-Speech), or with limited-display size handhelds (Guide-O-Mobile). Performance
of Guide-O and its user experience are reported.
Categories and Subject Descriptors: I.7.5 [Document and Text Processing]: Document Cap-
ture—Document analysis
General Terms: Algorithms, Human Factors
Additional Key Words and Phrases: Web transaction, content adaption, assistive device
This work was supported in part by NSF grants CCR-0311512 and IIS-0534419.
This article is an extended version of our paper “Model-Directed Web Transactions under Con-
strained Modalities,” which appears in the proceedings of International World Wide Web Confer-
ence 2006 (WWW ’06).
Authors’ addresses: Z. Sun, J. Mahmud, and I. V. Ramakrishnan, Department of Computer Sci-
ence, Stony Brook University, Stony Brook, NY 11794; email: {zsun,jmahmud,ram}@cs.sunysb.edu;
S. Mukherjee, Siemens Corporate Research, Princeton, NJ 08540; email: saikat.mukherjee@
siemens.com.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is
granted without fee provided that copies are not made or distributed for profit or direct commercial
advantage and that copies show this notice on the first page or initial screen of a display along
with the full citation. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,
to redistribute to lists, or to use any component of this work in other works requires prior specific
permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn
Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permission@acm.org.
C 2007 ACM 1559-1131/2007/09-ART12 $5.00 DOI 10.1145/1281480.1281482 http://doi.acm.org/
10.1145/1281480.1281482
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
1. INTRODUCTION
The World Wide Web has become the dominant medium for doing e-commerce.
The volume of goods bought from online stores continues to grow dramatically.
A Web transaction such as buying a CD player from an online store involves
a number of user steps spanning several Web pages. As an illustration let us
examine some common steps for buying a CD player from Best Buy (http://
www.bestbuy.com). To begin the user fills out the search form with “electronics”
as the category and “CD Player” as the item. The search result generated in
response is shown in Figure 1(b). The user selects the very first item in the
result list which leads to the page in Figure 1(c) containing product details.
To complete the transaction the user adds this selected item to the cart which
leads to the page in Figure 1(d) wherein the user selects checkout. Observe
that there are two essential components to a Web transaction: (i) locating the
relevant content, such as a search form or the desired item in a Web page, and
(ii) performing a sequence of steps, such as filling out a search form, selecting
an item from the search list and doing a checkout. For completing a transaction
these steps usually span several pages.
Online transactions such as the one described above are usually performed
with graphical Web browsers. The primary mode of interaction with graphical
browsers is visual, an intrinsically spatial modality. Hence, users can quickly
scan through the rich engaging content in Web pages scripted for e-commerce
and locate the objects of interest quite easily. Moreover, the spatial organiza-
tion of content in these pages helps users comprehend the sequence of steps
necessary to complete a transaction.
Now consider scenarios where visual interaction is impossible ( e.g., when the
user is a visually handicapped individual) or the interaction media has small
displays (e.g., mobile handhelds). Speech is the primary modality of interac-
tion in the former situation. Speech interfaces and small displays offer narrow
interaction bandwidths making it cumbersome and tedious to get to the perti-
nent content in a page. For instance, state-of-the-art screen readers and audio
browsers (e.g., JAWS and IBM’s Home Page Reader) provide almost no form of
filtering of the content in a Web page resulting in severe information overload.
This problem is further exacerbated when such an interaction spans several
pages as in an online transaction. In particular the loss of spatially organized
content makes it difficult for users to comprehend the sequence of transactional
steps. While content summarization can compensate somewhat for this loss, it
alone cannot handle the information overload that the user faces.
Thus, there is a need for techniques that facilitate Web transactions using
constrained interaction modalities that are far less cumbersome to do than cur-
rent approaches. This article addresses this problem and describes our solution.
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
page containing a detailed description of this player (Figure 1(c)). In this page
Guide-O extracts three concepts, namely “Search Form,” “Item Detail,” and “Add
To Cart.” Alice is asked if she wishes to hear the product details. When Alice
responds in the affirmative, Guide-O reads out the detailed description of the
CD player she picked earlier. At the end Alice is asked if she wishes to add this
to her shopping cart. Alice responds “yes.” Guide-O follows the link labeled Add
To Cart in Figure 1(c) to the page shown in Figure 1(d). In this page Guide-O
extracts the concepts of “Search Form,” “Shopping Cart,” “Checkout” and “Con-
tinue Shopping.” When presented with these choices, Alice chooses “Checkout.”
To complete the transaction Alice must provide credit card information upon
checkout. Its details have been omitted as they are quite similar to the form
filling step described in the first step of the scenario. At any point Alice can also
say any one of a set of general-purpose navigation commands, such as “Back
to top page,” “Start over,” “Repeat” (last item), or “Stop.” Besides, on a laptop
or desktop computer Alice could also use a keyboard in addition to speech to
interact with Guide-O.
1 http://www.w3.org/TR/voicexml20/
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
in Web pages it classifies these segments to the concepts in the ontology and
labels them with their names. (Section 4 describes the classification technique.)
Table I shows concept names in our shallow ontology for online shopping trans-
action domain (e.g., books, consumer electronics, office supplies). Table II lists
some concept names in a shallow ontology for the utility bill payment trans-
action domain. Associated with each concept is an operation as shown in the
tables. When the user selects a concept in a state, the corresponding opera-
tion is invoked. The ontology also includes information for classification of page
segments to concepts.
The Browser Object Interface fetches pages from Web servers. Special fea-
tures include automatic form fillouts and retrieval of pages pointed to by navi-
gable links, which requires execution of javascript.
The Process Model orchestrates all the actions that take place in Guide-O.
In a state, it takes a URL as the input, calls the Browser Object Interface to
fetch the page and the Content Analyzer to extract from the page the concepts
that it expects. The extracted concepts are organized as a concept tree that is
dispatched by the Process Model to the Interface Manager for presentation to
the user. When the user selects a concept, it is sent to the Process Model as an
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
operation on the concept. A state transition based on the operation is made and
the cycle repeats.
The architecture described is in essence the runtime engine that drives the
Guide-O system. The Process Model and the Concept Extractor in the engine
are learned a priori in the learning component.
3. PROCESS MODEL
Formally, our process model is defined as follows: Let C = {c0 , c1 , . . .} be a set
of concepts, and I (c) denote the set c of concept instances. Let Q = {q0 , q1 , . . .}
be a set of states. With every state qi we associate a set Si ⊆ C of concepts. Let
O = {o0 , o1 , . . .} be a set of operations. An operation oi can take parameters. A
transition δ is a function Q ×O → Q, and a concept operation ρ is also a function
C → O. Operations label transitions; that is, if δ(qi , o) = q j then o is the label
on this transition. An operation o = ρ(c) is enabled in state qi whenever the
user selects an instance of c ∈ Si and when it is enabled a transition is made
from qi to state q j = δ(qi , o).
Technically a concept instance is the occurrence of a concept in a Web page.
For example, the circled items in Figure 1 are all concept instances. But for
brevity we choose not to make this distinction explicitly and use concepts and
concept instances interchangeably when referring to content segments in Web
pages.
Figure 4 illustrates a process model. The concepts associated with the start-
ing state q1 are “Item Taxonomy,” “Item List,” and “Search Form.” This means
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
that if these concept instances are present in the Web page given to q1 as its
input, they will be extracted and presented to the user. User can select any
of these concepts. When the user selects the “Search Form” concept, he is re-
quired to supply the form input upon which the submit searchform operation
is invoked. This amounts to submitting the form with the user-supplied form
input. A Web page consisting of the search results is generated and a transition
is made to q3 . As discussed in the use scenario in Section 2 we have omitted the
payment steps following checkout.
Hence q6 , which is entered upon a check out operation, is deemed as the final
state.
Figure 5 is an example of a process model for utility bill payment correspond-
ing to the use scenario for Guide-O-Mobile described in Section 2.
order. Next we generalize the prefix tree automata by state merging. We choose
state pairs (i, j ), i < j as candidates for merging. The candidate pair (i, j ) is
merged if it results in a consistent automata. For example, merging the pair
(1,2) is consistent whereas (3,4) is not merged as the resulting automata will
accept the string <submit searchform, check out> in S−. The DFA that results
upon termination of this merging process on the above example set is shown
in Figure 6(b).
Since we do at most Q 2 state mergings, where Q is the cardinality of S+,
the time complexity is polynomial. In Section 5 experimental evaluation of the
performance of the learned process model is presented.
4. CONTENT ANALYSIS
Herein we describe the content analyzer module that extracts the relevant con-
cepts. In a nutshell this is achieved by partitioning a Web page into segments
of semantically related items and classifying them against concepts in the on-
tology. In the following we provide the detail of our approach.
The Approach
It is based on our previous work on learning-based semantic analysis of Web
content [Mukherjee et al. 2005]. Briefly, the technique rests on three key steps:
(i) inferring the logical structure of a Web page via structural analysis of its
content, (ii) learning statistical models of semantic concepts using lightweight
features extracted from both the content and its logical structure in a set of
training Web pages, and (iii) applying these models on the logical structures of
new Web pages to automatically identify concept instances.
purpose of partitioning the content, the functional tags are omitted since they
don’t contribute to the presentation style of the content.
Our type system is based on the Document Object Model (DOM) of the HTML
document. A segment of the DOM presentation of the HTML page is shown in
Figure 7(a), where the internal nodes of the tree are labeled with HTML tags
and the displayable content are at the leaf nodes. Formally:
Definition 4.1. Every node in DOM tree has an associated type T , which is
defined as follows.
2 There are many attributes defined in CSS styles. However in our work, we use only a small subset
of the CSS style attributes.
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
Algorithm 1 PartitionTree(n)
Require: n: a node in a DOM tree
1: if n is a leaf node then
2: n.type = the primary type defined above.
3: else if n has only one child node c then
4: PartitionTree(c)
5: Replace n with c and remove n from the DOM tree.
6: else
7: for all child node x of n do
8: PartitionTree(x)
9: end for
10: Analyze(n)
11: end if
Intuitively, the primary type encodes the presentation style (including lo-
cation and visual cues such as font type and size) of a piece of content (text
or image) that corresponds to a leaf node in a DOM tree. For example, the
leaf nodes of the product names (“SONY 5-Disc...,” “SONY CD-R...,” etc.) in
Figure 7(a) have the same primary type T1 =< D1 , L1 >, where D1 ={bold =
true, italic = true, link = true, size = 11px, face = serif, color = #000000, im-
age = false} and L1 = ..FORM.TABLE.TR.TD. Similarly, all the descriptions of
items (“Variable line...,” “Internal storage...,” etc.) share the same primary type
T2 =< D2 , L2 >, where D2 = {bold = false, italic = false, link = false, size =
9px, face = serif, color = #000000, image = false} and L2 = ..FORM.TABLE.
TR.TD.
A compound type essentially summarizes the structural recurrence infor-
mation of a subtree rooted at an internal node. For example, in Figure 7(a)
the subtree rooted at the circled TABLE node are the search results. The prop-
erty of spatial locality combined with consistency in presentation style reveals
structural recurrence information about semantically related items. For exam-
ple, the TR nodes under the circled TABLE node have the same compound type
TT R = T1 T2 T3 T4 T5 where each Ti is the primary type of the ith leaf node under
them. By concatenating all the types of the circled TABLE node’s children, we
get a type sequence TTR TTR TTR ... where the sequential pattern, TTR , exactly
captures the structural recurrence information of each semantic unit (i.e., the
items in the search results). In our type system, such a repeating type sequence
TT R results in a compound type TTABLE = T1 T2 T3 T4 T5 for the circled TABLE
node.
Algorithm 2 Analyze(n)
Require: n: an internal node in a DOM tree
1: finalpattern =null
2: S = the type sequence of all the child nodes of n
3: repeat
4: max = 0; maxpattern =null; maxscheme =null
5: collapse adjacent nodes in S which share the same type
6: = MaximalPatterns(S)
7: for all each maximal pattern p in do
8: scheme = GenerateScheme( p, S)
9: score = EvaluateScheme(scheme)
10: if score > max then
11: max = score; maxpattern = p; finalpattern = p; maxscheme = scheme
12: end if
13: end for
14: if maxpattern =null then
15: partition and update S with scheme maxscheme
16: end if
17: untill maxpattern =null
18: if finalpattern =null then
19: set the type of n to maxpattern
20: else
21: set the type of n to S
22: flatten all the grandchild nodes of n (each child of n will be a 1-level subtree)
23: end if
It is quite common that a sequence contains more than one pattern, and only
one pattern can be used to identify the related items. In our previous work, we
defined a maximal pattern to be the only candidate based on the frequency of
its appearance. A more robust definition is as follows,
Algorithm 3 GenerateScheme( p, S)
Require: p: a maximal pattern. S: a sequence of nodes.
1: Partition S into β0 γβ1 γ . . . γβm where γ = p and βi represents the node sequence
between each pair of patterns. Set the initial partition ρi to the ith γ .
2: For each βi , split it to βi,1 and βi,2 if there is a separator between βi,1 and βi,2 .
3: For each βi , if the nodes in βi have different parent nodes, split it to βi,1 and βi,2 such
that the nodes in each of them come from one parent node.
4: For any sequence ρi βi ρi+1 , if after all the above steps, it becomes ρi βi,1 βi,2 . . . βi,k ρi+1 ,
we merge βi,1 into ρi and βi,k into ρi+1 .
5: The final scheme is {ρ1 , ρ2 , . . . , ρm }.
3A separator denotes a horizontal or vertical blank region in a Web page. Its purpose is to indicate
the boundary between visible content. Examples of separators include stand-alone BR tags, blank
table cells and images used as spacers.
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
Now we take a closer look at the partition scheme generation and its eval-
uation. In the sequel, for simplicity, we will refer to a node by its type unless
otherwise stated. Therefore, the type sequence can also be regarded as the se-
quence of nodes in the DOM tree. Next we generate the partition scheme of S
for a pattern p using GenerateScheme below.
After a partition scheme is created under node n, we assign a “goodness” score
to it. This is done as follows: suppose the scheme λ has nλ partitions under node
n. Let us denote each partition as ρiλ , 1 ≤ i ≤ nλ . Then the score for λ is:
2 × |λ| × nλ 2
− |ρiλ | − |ρ λ | , (1)
|λ| + nλ 1≤i≤n λ
of all the partitions. The combination is based on the observation that the se-
mantically related items usually have similar size of contents (low standard
square deviation), and we prefer the scheme whose coverage and the number
of partitions are both large (high harmonic mean).
We pick the partition scheme with the highest score and do the partition-
ing based on the selected scheme. The partitioning step first removes all the
substructure in the child nodes, resulting a 1-level tree rooted at n. It then
aggregates all the child nodes in partition ρi under a newly created node nρi ,
namely pattern node, whose type is set to the pattern p. nρi is then inserted as
the child of n.
Let us continue the example in Figure 8. When Analyze is invoked
on n1 , the algorithm assigns n1 with the type sequence of its children
T1 T2 T3 T4 T5 T2 T3 T6 T7 T1 T2 T3 . The maximal patterns discovered are p1 = T2 T3
and p2 = T1 T2 T3 . Algorithm GenerateScheme is invoked on each of the
maximal patterns. For p1 , the initial partition scheme is
[T1 ](T2 T3 )[T4 T5 ](T2 T3 )[T6 T7 T1 ](T2 T3 ) (we use parenthesis to indicate the
boundaries of ρi , and bracket to indicate the boundaries of βi ). Since there
is a separator between n11 and n4 , β2 = T6 T7 T1 is split into [T6 T7 ][T1 ].
Also since n4 and n5 have different parent nodes, β1 = T4 T5 is split into
[T4 ][T5 ]. The same holds for β2,1 = T6 T7 . Accordingly the type sequence is
now [T1 ](T2 T3 )[T4 ][T5 ](T2 T3 )[T6 ][T7 ][T1 ](T2 T3 ). GenerateScheme modifies the
scheme to (T1 T2 T3 T4 )(T5 T2 T3 T6 )[T7 ](T1 T2 T3 ). This scheme is assigned the
score 4.05 (|λ| = 11, nλ = 3, ρ1 = ρ2 = 4, ρ3 = 3) by the goodness scoring
formula given in (1).
Similarly the maximal pattern p2 results in the partition scheme
(T1 T2 T3 T4 )[T5 T2 T3 T6 ][T7 ](T1 T2 T3 ), with the score of 2.93 (|λ| = 7, nλ = 2, ρ1 =
4, ρ2 = 3). Therefore, the partition scheme based on p1 has the highest score
and is selected to actually partition the sequence. The partitioning step re-
moves n2 , n3 , and n4 . We insert the new pattern nodes P ’s (see Figure 8(c))
according to the scheme. The newly inserted pattern nodes are assigned the
compound type Tp = T2 T3 . The type sequence of n1 becomes Tp Tp T7 Tp . Since
no more patterns can be found in the new type sequence, the loop ends and
n1 is assigned the type Tp . The final result of Analyze on n1 is shown in
Figure 8(c), where the nodes labeled with P are the newly inserted pattern
nodes.
Finally in the restructured tree, known also as the partition tree, there are
three classes of internal nodes: (i) group—which encapsulates repeating pat-
terns in its immediate children type sequence, (ii) pattern—which captures each
individual occurrence of the repeat pattern, or (iii) block—when it is neither
group nor pattern. Intuitively the subtree of a group node denotes homogenous
content consisting of semantically related items. For example, observe how all
the items in the search results list in Figure 1(a) are rooted under the group
node in the partition tree. The leaf nodes of the partition tree correspond to the
leaf nodes in the original DOM tree and have content associated with them.
The partition tree resulting from structural analysis of the DOM in Figure 7(a)
is shown in Figure 7(b). The partition tree represents the logical organization
of the page’s content.
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
4.2.1 Feature Space. Each node in the partition tree can be represented by
a vector of values representing its attributes, namely features. The set of feature
vectors forms a multidimensional feature space. In our concept identification
problem, given a partition tree node p, the value of feature f i , denoted as n f i , p , is
the frequency of occurrence of feature f i in p. The features of p are categorized
into three types, namely word, p-gram, and t-gram.
(i) Word features—These are features drawn from the text encapsulated
within a partition tree node. Intuitively, an instance of a certain concept usu-
ally contains a set of particular words occurring more often than others. For
example, given the Search Form concept, an instance usually contains words
like “search” and “find.” Such concept-specific words are of great importance in
identifying the instance. Word features are the most widely used features in
text-related machine learning tasks such as text categorization.
For a leaf node in the partition tree, its word features are drawn from its
own text while for an internal partition tree node, the words present in all the
leaves within the subtree rooted at it are aggregated. All the words are case-
insensitive, and stop words (e.g., “the,” “in,” and “please”) are ignored in both
cases. n f i , p is the number of times the word f i occurs in the text of p.
(ii) p-gram features—These are features representing the presentation of
content. We have observed that in Web sites sharing similar content semantics,
the presentation of a semantic concept in those sites exhibits similarity. For
instance, in Figure 1(b), each item is presented as a link with the item name,
followed by a short text description, and ending with miscellaneous text infor-
mation. Similar visual presentation can also be found on other sites. A p-gram
feature captures these presentation similarities. The basic p-gram features are
link, text, and image found in the leaf nodes of a partition tree. Recall that
during structural analysis, pattern nodes aggregate every individual repeat
in a type sequence. Since repeats are typically associated with similar visual
presentation, complex p-gram features are constructed only at pattern nodes
by concatenating the p-gram features of their immediate children nodes. In-
ternal nodes aggregate basic and possibly complex p-grams from the subtrees
rooted at them. Let n f i , p be the number of times f i occurs in the subtree rooted
at p. For instance, in the left tree of Figure 9, the p-gram feature at the pat-
tern node labeled “P” is < text · link >, while in the right tree of Figure 9,
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
the node labeled “B” at the same position has the p-gram features {< text >,
< link >}.
(iii) t-gram features— While p-gram features capture the visual presenta-
tion, t-gram features represent the structure of the partition tree. Recall that
internal partition tree nodes can be either group, pattern, or block while link,
text, and image are the different classes of leaf nodes. The structural arrange-
ment of these classes of nodes also characterizes a concept and this is what
is captured by t-gram features. Given a partition tree node with N nodes in
its subtree, the complete structural arrangement within the node can be de-
scribed in terms of a set of subtrees of k (2 ≤ k ≤ N ) nodes where each subtree
is an arrangement of group, pattern, block, link, text, or image type nodes.
Since enumerating all these subtrees has exponential complexity, we restrict
our analysis to subtrees of 2 nodes. When k = 2 the t-gram is essentially a
parent-child feature. For instance, in Figure 9, when k = 2 the t-gram feature
space of the left tree is {< G, P >, < P, T ext >, < P, Link >}, and the right
tree is {< B, B >, < B, T ext >, < B, Link >}, where G and B are labels of
group and block nodes respectively.
1
i=|F | .
i=1 p∈L n f i , p + |F | + 1
The number of nodes within the subtree of a partition tree node for a concept c j
is modeled as a Gaussian distribution with parameters mean uc j and variance
σc j and is defined as:
p∈L | p| p∈L (| p| − μ c j )2
μc j = , σc j = ,
|L| |L| − 1
where | p| is the number of nodes within the subtree of p, and |L| is the number
of training examples.
For a node p of any new partition tree, we use a modified multinomial dis-
tribution to model the likelihood P ( p|c j ):
|
i=|F
N!
P ( p|c j ) = × P ( f i |c j ) N f i , p
N f 1 , p ! · · · N f |F | , p ! i=1
(| p|−μ )2 /(2σ 2 )
where N = K × e cj cj
, with K being a normalized total feature fre-
quency count, | p| being the total number of partition tree nodes within the
subtree rooted at p, and N f i , p is a scaled value of n f i , p such that i N f i , p = N .
The normalization of the multinomial coefficient ensures that the subtree of p
with the size closer to μc j will have greater likelihood to c j .
Note that the above formulation of the likelihood takes into consideration
both the number of nodes within p as well as the frequencies of the various
features in the content encapsulated within p. This results in a tight coupling
between content analysis and document structure during concept identification.
The partition tree node with the maximum likelihood value is identified as the
concept instance. The concept tree created with the identified concept instances
is shown in Figure 10. Observe that the irrelevant content in the original DOM
tree has been filtered.
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
5. EVALUATION
Herein we report on the experimental evaluation of the learned process model,
concept extraction and the integrated Guide-O system. Thirty Web sites span-
ning the three content domains of books, consumer electronics, and office sup-
plies were used in the evaluation.
To create the concepts in the ontology we did an analysis of the transactional
activities done by users on the Web in each of the three domains. Based on the
analysis we came up with an “inclusive” set of concepts in Table I.
4 Recallfor a process model is the ratio of the number of completed transactions accepted by the
model over the total number of completed transactions. For Precision, this denominator becomes
the total number of accepted transactions (completed and not completed).
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
The second metric we measured was the number of transitions that remained
to be completed when a true trace (completed transaction) in the test set failed
to reach the final state (check out state). Figure 11(b) shows the failure curve.
Analysis of the failure curve indicates that almost half of the failure traces end
one hop away from the final state of the model and only 10% of the failures
end three or more hops away from the final state. This means that fast error
recovery techniques can be designed with such a process model.
5 Recall
value for a concept is the ratio of the number of correctly labeled concept instances in Web
pages over the actual number of concept instances present in them.
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
of consistency of the presentation styles of these concepts across all these Web
sites. The low recall figures for the “Item Detail” (about 65% averaged over
the three domains) and “Shopping Cart” (about 70%) is mainly due to a high
degree of variation in their features across different Web sites. A straightfor-
ward way to improve the recall of such concepts is to use more training data.
However, even this may not help for concepts such as “Add To Cart” that rely
on keywords as the predominant feature. Quite often these are embedded in a
image precluding textual analysis. It appears that in such cases local context
surrounding the concept can be utilized as a feature to improve recall.
We also observed that Web pages with sufficient textual content relevant to
the concept was identified as an instance of that concept with higher accuracy.
Furthermore, concept extraction performed better for schematic Web pages (i.e.,
Web pages generated from a template), because structural feature extraction
is dependent on the presentation style and organization of different elements
of the page.
we installed our own VoiceXML interpreter along with off-the-shelf speech SDK
components. Guide-O-Mobile was architected to run in a client/server mode
with the machine running the core Guide-O system as the server and a 400MHz
iPaq with 64MB RAM as the client. Thirty CS graduate students were used as
evaluators. Prior to evaluation they were trained on how to use the two systems.
We chose 18 Web sites (6 for each domain) to evaluate Guide-O-Mobile and 9
(3 for each domain) to evaluate Guide-O-Speech. We conducted roughly 5 to 6
transactions on each of them and calculated mean (μ) and standard deviation
(σ ) for all the measured metrics. Over 93.33% of those transactions could be
completed. Evaluation of the integrated system discussed below is based on
these completed transactions.
could take the right steps to make progress on their transaction (response
to Question C1). Some evaluators felt that notification of promotional offers,
coupons, etc. was important and that such concepts ought to presented (re-
sponse to Question C2).
Most were able to find the items they were looking for (response to Question
S1). However, at times they were unable to complete the transaction (the “no”
response to Questions C3 and the unfinished transactions in S2). Analysis of
such transactions revealed that in many cases the problem arose because: (a)
the expected concepts in a state were not extracted; (b) the extracted concepts
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
were mislabeled; (c) the model could not make the correct transition. The last
two problems could be addressed by training the concept identifier and the
process model with more examples.
Quite a few evaluators felt that they expected more flexibility on how they can
complete the transaction (response to Question S3). Observe that the number
of possible paths to complete a transaction is limited by the training data and
hence this criticism can be addressed with more training. Overall they all felt
that the system was adequate to do their tasks (response to Question S4).
Evaluators also had general comments. In particular they all felt that the
system requires help utilities to assist users to become familiar with the use
and effect of each concept.
6. RELATED WORK
The work described in this paper has broad connections to research in Web ser-
vices, semantic understanding of Web content, automata learning, non-visual
Web access and browsing with mobile devices.
over Web pages instead of services. A vast majority of transactions on the Web
are still conducted over HTML pages. This focus on Web pages is what sets our
work apart from those in Web services. However using our approach, users can-
not understand the goal of the step because this has been split in several Web
pages. Thus users lose the global vision of the transaction. We are currently
working on an implementation where the local step of the user can be visually
shown in the context of the global task—the process model and the semantics.
Using standard web services constrains the user to a set of pre-defined tasks
(e.g., searching for a product). However, our approach does not confine a user to
whatever service is exposed by the provider. Using our transactional approach,
users are able to create services (e.g., service for buying airline ticket, service
for job search) which can accomplish his personal goal. Thus we are providing
a scalable technique for creating services.
Process Model Learning. Our work on learning process models from user ac-
tivity logs is related to research in mining workflow process models (see [van der
Aalst and Weijters 2004] for a survey). However, our current definition of a
process is simpler than traditional notions of workflows. For instance, we do
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
7. CONCLUSION
The use of complete Web pages for doing online transactions under constrained
interaction modalities can cause severe information overload on the end user.
Model-directed Web transaction can substantially reduce if not eliminate this
problem. Our preliminary experimentation with Guide-O seems to suggest that
it is possible to achieve such reductions in practice. We are corrently conduct-
ing preliminary usability studies of Guide-O’s speech at Helen Keller Services
for the Blind (http://www.helenkeller.org) to get feedback from the visually
handicapped community.
There are several avenues of future research. In this article we have pre-
sented a client-side Web transactional system, where process model resides in
client side. In the future we will also develop server-side Web transactional sys-
tem, where content providers will annotate the Web pages such that concept in-
stances will be labeled with the concept name. Content providers can easily use
XHTML to label concept instances in their Web sites and describe process mod-
els specific to their sites. Most online stores generate their Web pages automat-
ically from templates. So we expect that labeling the content of templates will
not require extensive effort on the part of content providers. At the same time,
the use of XHTML for semantic labeling will imposes no limitation, allowing
content providers to define their own states and actions as they deem necessary.
We build separate ontology and process model for each type of Web transac-
tion (e.g., online shopping, utility bill payment, flight ticket booking). An inter-
esting problem is to develop a generic ontology and process model to support
different types of Web transactions. Currently, we do not consider the security
mechanisms defined in the Web sites to handle sensitive information. In fu-
ture, we will investigate the possibility of handling security mechanisms that
are already present in the Web site.
The Guide-O system works with the assumption that Web transaction is
performed correctly. We plan to extend our system to incorporate failure sce-
narios. To handle failure scenarios, we will maintain a history of transactional
progress. On failure (e.g., when the system does not identify any concept on the
current page), the model will go back to the previous state. Similarly, when a
user chooses to go back, the model will go back to the previous state.
In our current framework, Guide-O-Mobile is set up to run as a client/server
application with the server doing the bulk of the processing. In the future it
should be possible to port all the server steps to the handheld when improve-
ments in handheld hardware and software technologies will provide more mem-
ory and a richer set of browser operations than what are currently available in
handheld.
In addition, displaying concepts compactly in handhelds along the lines sug-
gested in Florins and Vanderdonckt [2004] will further improve the user’s ease
of interaction.
Finally, integration of our framework with Web services standards is an inter-
esting problem. Success here will let us specify the process model in BPEL4WS
which in turn will enable interoperation with sites exposing Web pages as well
as those exposing Web services.
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
REFERENCES
AGARWAL, S., HANDSCHUH, S., AND STAAB, S. 2003. Surfing the service web. In International Seman-
tic Web Conference (ISWC). 211–216.
ANGLUIN, D. 1978. On the complexity of minimum inference of regular sets. Inform. Control 39, 3,
337–350.
ASAKAWA, C. AND ITOH, T. 1998. User interface of a home page reader. In Proceedings of the ACM
International Conference on Assistive Technologies (ASSETS). ACM Press, 149–156.
BALUJA, S. 2006. Browsing on small screens: Recasting web-page segmentation into an efficient
machine learning framework. In Proceedings of the 15th International Conference on World Wide
Web (WWW’06). ACM Press, 33–42.
BUYUKKOTEN, O., GARCIA-MOLINA, H., AND PAEPCKE, A. 2001. Seeing the whole in parts: Text sum-
marization for web browsing on handheld devices. In International World Wide Web Conference
(WWW). ACM Press, 652–662.
BUYUKKOTEN, O., GARCIA-MOLINA, H., PAEPCKE, A., AND WINOGRAD, T. 2000. Power browser: Efficient
web browsing for pdas. In Proceedings of the ACM Conference on Human Factors in Computing
Systems (CHI). ACM Press, 430–437.
CHEN, L., SHADBOLT, N., GOBLE, C. A., TAO, F., COX, S. J., PULESTON, C., AND SMART, P. R. 2003. Towards
a knowledge-based approach to semantic service composition. In Proceedings of the International
Semantic Web Conference (ISWC). 319–334.
CHEN, Y., MA, W.-Y., AND ZHANG, H.-J. 2003. Detecting web page structure for adaptive viewing
on small form factor devices. In Proceedings of the International World Wide Web Conference
(WWW). ACM Press, 225–233.
CHI, E., PIROLLI, P., CHEN, K., AND PITKOW, J. 2001. Using information scent to model user infor-
mation needs and actions on the web. In Proceedings of the ACM Conference on Human Factors
in Computing Systems (CHI). ACM Press, 490–497.
CIMIANO, P., HANDSCHUH, S., AND STAAB, S. 2004. Towards the self-annotating web. In Proceedings
of the International World Wide Web Conference (WWW). ACM Press, 462–471.
DILL, S., EIRON, N., GIBSON, D., GRUHL, D., GUHA, R., JHINGRAN, A., KANUNGO, T., RAJAGOPALAN, S.,
TOMKINS, A., TOMLIN, J., AND YIEN, J. 2003. SemTag and Seeker: Bootstrapping the semantic
web via automated semantic annotation. In Proceedings of the International World Wide Web
Conference (WWW). 831–840.
FLORINS, M. AND VANDERDONCKT, J. 2004. Graceful degradation of user interfaces as a design
method for multiplatform systems. In Proceedings of the 9th International Conference on In-
telligent User Interface (IUI’04). ACM Press, New York, NY, 140–147.
GOLD, E. M. 1978. Complexity of automaton identification from given data. Inform. Control 37, 3,
302–320.
GUSFIELD, D. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Compu-
tational Biology. Cambridge University Press.
HAMMOND, B., SHETH, A., AND KOCHUT, K. 2002. Semantic enhancement engine: A modular docu-
ment enhancement platform for semantic applications over heterogenous content. In Real World
Semantic Applications, V. Kashyap and L. Shklar, Eds. IOS Press.
HARPER, S. AND PATEL, N. 2005. Gist summaries for visually impaired surfers. In Proceedings
of the ACM International Conference on Assistive Technologies (ASSETS). ACM Press, 90–
97.
HEFLIN, J., HENDLER, J. A., AND LUKE, S. 2003. SHOE: A blueprint for the semantic web. In Spin-
ning the Semantic Web, D. Fensel, J. A. Hendler, H. Lieberman, and W. Wahlster, Eds. MIT Press,
29–63.
HESS, A., JOHNSTON, E., AND KUSHMERICK, N. 2004. Assam: A tool for semi-automatically annotating
semantic web services. In Proceedings of the International Semantic Web Conference (ISWC).
320–334.
HUANG, A. AND SUNDARESAN, N. 2000. A semantic transcoding system to adapt web services for
users with disabilities. In Proceedings of the ACM International Conference on Assistive Tech-
nologies (ASSETS). ACM Press, 156–163.
IPhone. http://www.apple.com/iphone/.
JAWS SCREEN READER. http://www.freedomscientific.com.
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
KAASINEN, E., AALTONEN, M., KOLARI, J., MELAKOSKI, S., AND LAAKKO, T. 2000. Two approaches to
bringing internet services to wap devices. In Proceedings of the International World Wide Web
Conference (WWW).
KAHAN, J., KOIVUNEN, M., PRUD’HOMMEAUX, E., AND SWICK, R. 2001. Annotea: An open rdf infrastruc-
ture for shared web annotations. In Proceedings of the International World Wide Web Conference
(WWW). ACM Press, 623–632.
LAENDER, A., RIBEIRO-NETO, B., DA SILVA, A., AND TEIXEIRA, J. 2002. A brief survey of web data
extraction tools. SIGMOD Rec. 31, 2.
MILIC-FRAYLING, N. AND SOMMERER, R. 2002. Smartview: Flexible viewing of web page contents. In
Proceedings of the International World Wide Web Conference (WWW). ACM Press, 172.
MUKHERJEE, S., RAMAKRISHNAN, I., AND SINGH, A. 2005. Bootstrapping semantic annotation for
content-rich html documents. In Proceedings of the International Conference on Data Engineering
(ICDE). ACM Press.
MURPHY, K. 1996. Passively learning finite automata. Tech. rep., Janta Fe Institute.
NetECHO[tm]. http://www.internetspeech.com.
ONCINA, J. AND GARC, P. 1991. Inferring Regular Languages in Polynomial Update Time. World
Scientific Publishing.
PATIL, A., OUNDHAKAR, S., SHETH, A., AND VERMA, K. 2004. Meteor-s web service annotation
framework. In Proceedings of the International World Wide Web Conference (WWW). 553–
562.
PONTELLI, E., XIONG, W., GILLIAN, D., SAAD, E., GUPTA, G., AND KARSHMER, A. 2002. Navigation of
html tables, frames, and xml fragments. In Proceedings of the ACM International Conference on
Assistive Technologies (ASSETS). ACM Press, 25–32.
POPOV, B., KIRYAKOV, A., KIRILOV, A., MANOV, D., OGNYANOFF, D., AND GORANOV, M. 2003. Kim—
semantic annotation platform. In Proceedings of the International Semantic Web Conference
(ISWC).
RAMAKRISHNAN, I., STENT, A., AND YANG, G. 2004. Hearsay: Enabling audio browsing on hypertext
content. In Proceedings of the International World Wide Web Conference (WWW). ACM Press,
80–89.
RAMAN, T. V. 1998. Audio system for technical readings. 1410, xvi + 121.
RAMASWAMY, L., IYENGAR, A., LIU, L., AND DOUGLIS, F. 2004. Automatic detection of fragments in
dynamically generated web pages. In Proceedings of the International World Wide Web Conference
(WWW). ACM Press, 443–454.
RAO, J., KÜNGAS, P., AND MATSKIN, M. 2004. Logic-based web services composition: From service
description to process model. In Proceedings of the International Conference on Web Services
(ICWS).
RICHARDS, J. T. AND HANSON, V. L. 2004. Web accessibility: a broader view. In Proceedings of the
13th International Conference on World Wide Web (WWW’04). ACM Press, New York, NY, 72–
79.
SABOU, M., WROE, C., GOBLE, C., AND MISHNE, G. 2005. Learning domain ontologies for web service
descriptions: An experiment in bioinformatics. In Proceedings of the International World Wide
Web Conference (WWW). 190–198.
SONG, R., LIU, H., WEN, J.-R., AND MA, W.-Y. 2004. Learning block importance models for web
pages. In Proceedings of the International World Wide Web Conference (WWW). ACM Press, 203–
211.
TAKAGI, H., ASAKAWA, C., FUKUDA, K., AND MAEDA, J. 2002. Site-wide annotation: Reconstructing
existing pages to be accessible. In Proceedings of the ACM International Conference on Assistive
Technologies (ASSETS). ACM Press.
THUNDERHAWK. http://www.bitstream.com/wireless/index.html.
TRAVERSO, P. AND PISTORE, M. 2004. Automated composition of semantic web services into exe-
cutable processes. In Proceedings of the International Semantic Web Conference (ISWC).
VAN DER AALST, W. AND WEIJTERS, A. 2004. Process mining: A research agenda. Comput. Industry 53,
231–244.
WOMBACHER, A., FANKHAUSER, P., AND NEUHOLD, E. J. 2004. Transforming bpel into annotated deter-
ministic finite state automata for service discovery. In Proceedings of the International Conference
on Web Services (ICWS).
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.
P1: OJL
ACMJ321-02 ACM-TRANSACTION September 19, 2007 23:48
XIANGYE XIAO, QIONG LUO, D. H. AND FU, H. 2005. Slicing* tree based web page transformation
for small displays. In Proceedings of the Conference on Information and Knowledge Management.
ACM Press, 303–304.
YANG, C. AND WANG, F. L. 2003. Fractal summarization for mobile devices to access large docu-
ments on the web. In Proceedings of the International World Wide Web Conference (WWW). ACM
Press, 215–224.
YI, L. AND LIU, B. 2003. Eliminating noisy information in web pages for data mining. In Proceed-
ings of the ACM Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM Press,
117–118.
YIN, X. AND LEE, W. S. 2004. Using link analysis to improve layout on mobile devices. In Proceed-
ings of the International World Wide Web Conference (WWW). ACM Press, 338–344.
YU, S., CAI, D., WEN, J.-R., AND MA, W.-Y. 2003. Improving pseudo-relevance feedback in web
information retrieval using web page segnmentation. In Proceedings of the International World
Wide Web Conference (WWW).
ZAJICEK, M., POWELL, C., AND REEVES, C. 1999. Web search and orientation with brookestalk. In
Proceedings of Technology and Persons with Disabilities Conference.
ZHANG, Z., HE, B., AND CHANG, K. C.-C. 2004. Understanding web query interfaces: Best-effort
parsing with hidden syntax. In Proceedings of the ACM Conference on Management of Data
(SIGMOD). ACM Press, 107–118.
ACM Transactions on the Web, Vol. 1, No. 3, Article 12, Publication date: September 2007.