Semantic Web 0 (2015) 1–0
A Hierarchical Aggregation Framework for
Efficient Multilevel Visual Exploration and
Analysis 1
Nikos Bikakis a,b,∗, George Papastefanatos b, Melina Skourla a and Timos Sellis c
a National Technical University of Athens, Greece
b ATHENA Research Center, Greece
c Swinburne University of Technology, Australia
Abstract. Data exploration and visualization systems are of great importance in the Big Data era, in which the volume and
heterogeneity of available information make it difficult for humans to manually explore and analyse data. Most traditional systems operate in an offline way, limited to accessing preprocessed (static) sets of data. They also restrict themselves to dealing
with small dataset sizes, which can be easily handled with conventional techniques. However, the Big Data era has realized the
availability of a great amount and variety of big datasets that are dynamic in nature; most of them offer API or query endpoints
for online access, or the data is received in a stream fashion. Therefore, modern systems must address the challenge of on-the-fly
scalable visualizations over large dynamic sets of data, offering efficient exploration techniques, as well as mechanisms for
information abstraction and summarization. Further, they must take into account different user-defined exploration scenarios
and user preferences. In this work, we present a generic model for personalized multilevel exploration and analysis over large
dynamic sets of numeric and temporal data. Our model is built on top of a lightweight tree-based structure which can be efficiently constructed on-the-fly for a given set of data. This tree structure aggregates input objects into a hierarchical multiscale
model. We define two versions of this structure, which adopt different data organization approaches well-suited to the exploration
and analysis context. In the proposed structure, statistical computations can be efficiently performed on-the-fly. Considering
different exploration scenarios over large datasets, the proposed model enables efficient multilevel exploration, offering incremental construction and prefetching via user interaction, and dynamic adaptation of the hierarchies based on user preferences. A
thorough theoretical analysis is presented, illustrating the efficiency of the proposed methods. The presented model is realized in
a web-based prototype tool, called SynopsViz that offers multilevel visual exploration and analysis over Linked Data datasets.
Finally, we provide a performance evaluation and an empirical user study employing real datasets.
Keywords: visual analytics, big data, multiscale, progressive, incremental indexing, linked data, multiresolution, visual
aggregation, binning, adaptive, hierarchical navigation, personalized exploration, data reduction, summarization, SynopsViz
1. Introduction
Exploring, visualizing and analysing data is a core
task for data scientists and analysts in numerous applications. Data exploration and visualization enable
users to identify interesting patterns, infer correlations
1 To appear in Semantic Web Journal (SWJ), 2016
∗ Corresponding author. e-mail: bikakis@dblab.ntua.gr
and causalities, and support sense-making activities
over data that are not always possible with traditional
data mining techniques [54,29]. This is of great importance in the Big Data era, where the volume and
heterogeneity of available information make it difficult for humans to manually explore and analyse large
datasets.
One of the major challenges in visual exploration
is related to the large size that characterizes many
datasets nowadays. Considering the visual information
seeking mantra: “overview first, zoom and filter, then
details on demand” [94], gaining overview is a crucial
task in the visual exploration scenario. However, offering an overview of a large dataset is an extremely challenging task. Information overloading is a common issue in large dataset visualization; a basic requirement
for the proposed approaches is to offer mechanisms for
information abstraction and summarization.
The above challenges can be overcome by adopting hierarchical aggregation approaches (for simplicity we also refer to them as hierarchical) [36]. Hierarchical approaches allow the visual exploration of
very large datasets in a multilevel fashion, offering
overview of a dataset, as well as an intuitive and usable
way for finding specific parts within a dataset. Particularly, in hierarchical approaches, the user first obtains
an overview of the dataset (both structure and a summary of its content) before proceeding to data exploration operations, such as roll-up and drill-down, filtering out a specific part of it and finally retrieving details
about the data. Therefore, hierarchical approaches directly support the visual information seeking mantra.
Also, hierarchical approaches can effectively address
the problem of information overloading, as they provide
information abstraction and summarization.
A second challenge is related to the availability of
APIs and query endpoints (e.g., SPARQL) for online
data access, as well as cases where the data is received in a streaming fashion. The latter pose the challenge of handling large sets of data in a dynamic setting; as a result, a preprocessing phase, such as traditional indexing, is precluded. In this respect, modern techniques must offer scalability and efficient processing for the on-the-fly analysis and visualization of dynamic datasets.
Finally, the requirement for on-the-fly visualization must be coupled with the diversity of preferences
and requirements posed by different users and tasks.
Therefore, the proposed approaches should provide
users with the ability to customize the exploration experience, allowing them to organize data in different
ways according to the type of information or the level
of detail they wish to explore.
Considering the general problem of exploring big
data [95,81,18,54,49,43], most approaches aim at providing appropriate summaries and abstractions over
the enormous number of available data objects. In
this respect, a large number of systems adopt approximation techniques (a.k.a. data reduction techniques)
in which partial results are computed. Existing ap-
proaches are mostly based on: (1) sampling and filtering [39,83,2,67,55,13] and/or (2) aggregation (e.g.,
binning, clustering) [36,59,44,58,78,113,12,77,1,57].
Similarly, some modern database-oriented systems
adopt approximation techniques using query-based
approaches (e.g., query translation, query rewriting)
[13,59,58,108,114]. Recently, incremental approximation techniques have been adopted; in these approaches, approximate answers are computed over progressively
larger samples of the data [39,2,55]. In a different context, an adaptive indexing approach is used in [118],
where the indexes are created incrementally and adaptively throughout exploration. Further, in order to improve performance many systems exploit caching and
prefetching techniques [101,61,56,12,25,66,32]. Finally, in other approaches, parallel architectures are
adopted [35,63,62,55].
Addressing the aforementioned challenges, in this
work, we introduce a generic model that combines
personalized multilevel exploration with online analysis of numeric and temporal data. At the core lies
a lightweight hierarchical aggregation model, constructed on-the-fly for a given set of data. The proposed model is a tree-based structure that aggregates
data objects into multiple levels of hierarchically related groups based on numerical or temporal values of
the objects. Our model also enriches groups (i.e., aggregations/summaries) with statistical information regarding their content, offering richer overviews and insights into the detailed data. An additional feature is
that it allows users to organize data exploration in different ways, by parameterizing the number of groups,
the range and cardinality of their contents, the number of hierarchy levels, and so on. On top of this
model, we propose three user exploration scenarios
and present two methods for efficient exploration over
large datasets: the first one achieves the incremental
construction of the model based on user interaction,
whereas the second one enables dynamic and efficient
adaptation of the model to the user’s preferences. The
efficiency of the proposed model is illustrated through
a thorough theoretical analysis, as well as an experimental evaluation. Finally, the proposed model is realized in a web-based tool, called SynopsViz, which offers a variety of visualization techniques (e.g., charts,
timelines) for multilevel visual exploration and analysis over Linked Data (LD) datasets.
Contributions. The main contributions of this work
are summarized as follows.
− We introduce a generic model for organizing,
exploring, and analysing numeric and temporal
data in a multilevel fashion.
− We implement our model as a lightweight, main
memory tree-based structure, which can be efficiently constructed on-the-fly.
− We propose two tree structure versions, which
adopt different approaches for the data organization.
− We describe a simple method to estimate the tree
construction parameters, when no user preferences are available.
− We define different exploration scenarios assuming various user exploration preferences.
− We introduce a method that incrementally constructs and prefetches the hierarchy tree via user
interaction.
− We propose an efficient method that dynamically
adapts an existing hierarchy to a new one, considering the user's preferences.
− We present a thorough theoretical analysis, illustrating the efficiency of the proposed model.
− We develop a prototype system which implements the presented model, offering multilevel
visual exploration and analysis over LD.
− We conduct a thorough performance evaluation
and an empirical user study, using the DBpedia
2014 dataset.
Outline. The remainder of this paper is organized as
follows. Section 2 presents the proposed hierarchical
model, and Section 3 provides the exploration scenarios and methods for efficient hierarchical exploration. Then, Section 4 presents the SynopsViz tool
and demonstrates its basic functionality. The evaluation of our system is presented in Section 5. Section 6
reviews related work, while Section 7 concludes this
paper.
2. The HETree Model
In this section we present HETree (Hierarchical
Exploration Tree), a generic model for organizing, exploring, and analysing numeric and temporal data in a
multilevel fashion. Particularly, HETree is defined in
the context of multilevel (visual) exploration and analysis. The proposed model hierarchically organizes arbitrary numeric and temporal data, without requiring it
to be described by a hierarchical scheme. We should note that our model is not bound to any specific type
note that, our model is not bound to any specific type
of visualization; rather it can be adopted by several
"flat" visualization techniques (e.g., charts, timeline),
offering scalable and multilevel exploration over nonhierarchical data.
In what follows, we present some basic aspects
of our working scenario (i.e., visual exploration and
analysis scenario) and highlight the main assumptions
and requirements employed in the construction of our
model. First, the input data in our scenario can be retrieved directly from a database, but also produced dynamically; e.g., from a query or from data filtering
(e.g., faceted browsing). Thus, we consider that data
visualization is performed online; i.e., we do not assume an offline preprocessing phase in the construction of the visualization model. Second, users can
specify different requirements or preferences with respect to the data organization. For example, a user
may prefer to organize the data as a deep hierarchy for one
task, while for another a flat hierarchical organization is more appropriate. Therefore, even
if the data is not dynamically produced, the data organization is dynamically adapted to the user preferences. The same also holds for any additional information (e.g., statistical information) that is computed
for each group of objects. This information must be recomputed when the groups of objects (i.e., data organization) are modified.
From the above, a basic requirement is that the
model must be constructed on-the-fly for any given
data and user preferences. Therefore, we implement
our model as a lightweight, main memory tree structure, which can be efficiently constructed on-the-fly.
We define two versions of this tree structure, following data organization approaches well-suited to the visual exploration and analysis context: the first version
considers fixed-range groups of data objects, whereas
the second considers fixed-size groups. Finally, our
structure allows efficient on-the-fly statistical computations, which are extremely valuable for the hierarchical exploration and analysis scenario.
The basic idea of our model is to hierarchically
group data objects based on values of one of their properties. Input data objects are stored at the leaves, while
internal nodes aggregate their child nodes. The root of
the tree represents (i.e., aggregates) the whole dataset.
The basic concepts of our model can be considered
similar to a simplified version of a static 1D R-Tree
[45].
Regarding the visual representation of the model
and data exploration, we consider that both data object sets (leaf nodes' contents) and entities representing groups of objects (leaf or internal nodes) are visually represented, enabling the user to explore the data
in a hierarchical manner. Note that our tree structure
organizes data in a hierarchical model, without setting
any constraints on the way the user interacts with these
hierarchies. As such, it is possible that different strategies can be adopted, regarding the traversal policy, as
well as the nodes of the tree that are rendered in each
visualization stage.
In the rest of this section, preliminaries are presented in Section 2.1. In Section 2.2, we introduce the
proposed tree structure. Sections 2.3 and 2.4 present
the two versions of the structure. Finally, Section 2.5
discusses the specification of the parameters required
for the tree construction, and Section 2.6 presents how
statistics computations can be performed over the tree.
2.1. Preliminaries
In this work we formalize data objects as RDF
triples. However, the presented methods are generic
and can be applied to any data objects with numeric or
temporal attributes. Hence, in the following, the terms
triple and (data) object will be used interchangeably.
We consider an RDF dataset R consisting of a set
of RDF triples. As input data, we assume a set of RDF
triples D, where D ⊆ R and triples in D have as objects either numeric (e.g., integer, decimal) or temporal values (e.g., date, time). Let tr be an RDF triple,
tr.s, tr.p and tr.o represent, respectively, the subject,
predicate and object of the RDF triple tr.
Given input data D, S is an ordered set of RDF
triples, produced from D, where triples are sorted
based on objects’ values, in ascending order. Assume that S[i] denotes the i-th triple, with S[1] the
first triple. Then, for each i < j, we have that
S[i].o ≤ S[j].o. Also, D = S, i.e., for each tr, tr ∈ D
iff tr ∈ S.
Figure 1 presents a set of 10 RDF triples, representing persons and their ages. In Figure 1, we assume that
the subjects p0-p9 are instances of a class Person and
the predicate age is a datatype property with integer
range.
p0 age 35
p1 age 100
p2 age 55
p3 age 37
p4 age 30
p5 age 35
p6 age 45
p7 age 80
p8 age 20
p9 age 50
Fig. 1. Running example input data (data objects)
Example 1. In Figure 1, given the RDF triple tr =
p0 age 35, we have that tr.s = p0, tr.p = age and
tr.o = 35. Also, given that all triples comprise the
input data D and S is the ordered set of D based on
the object values, in ascending order; we have that
S[1] = p8 age 20 and S[10] = p1 age 100.
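For illustration, the running example can be reproduced with a few lines of Python; the `Triple` tuple and the variable names are our own, not part of the paper's implementation. Note that Python lists are 0-based, whereas the paper's notation S[1], …, S[|S|] is 1-based.

```python
from collections import namedtuple

# Hypothetical representation of an RDF triple (subject, predicate, object).
Triple = namedtuple("Triple", ["s", "p", "o"])

# The running example of Figure 1: persons and their ages.
D = [
    Triple("p0", "age", 35), Triple("p1", "age", 100),
    Triple("p2", "age", 55), Triple("p3", "age", 37),
    Triple("p4", "age", 30), Triple("p5", "age", 35),
    Triple("p6", "age", 45), Triple("p7", "age", 80),
    Triple("p8", "age", 20), Triple("p9", "age", 50),
]

# S: the triples of D sorted in ascending order of object value.
S = sorted(D, key=lambda tr: tr.o)

print(S[0])   # S[1] in the paper's 1-based notation: p8 age 20
print(S[-1])  # S[10] in the paper's notation: p1 age 100
```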
Assume an interval I = [a, b], where a, b ∈ R; then,
I = {k ∈ R | a ≤ k ≤ b}. Similarly, for I = [a, b),
we have that I = {k ∈ R | a ≤ k < b}. Let I− and
I+ denote the lower and upper bounds of the interval
I, respectively. That is, given I = [a, b], then I− = a
and I+ = b. The length of an interval I is defined as
|I+ − I−|.
In this work we assume rooted trees. The number of
the children of a node is its degree. Nodes with degree
0 are called leaf nodes. Moreover, any non-leaf node is
called internal node. Sibling nodes are the nodes that
have the same parent. The level of a node is defined by
letting the root node be at level zero. Additionally, the
height of a node is the length of the longest path from
the node to a leaf. A leaf node has a height of 0.
The height of a tree is the maximum level of any
node in the tree. The degree of a tree is the maximum
degree of a node in the tree. An ordered tree is a tree
where the children of each node are ordered. A tree is
called an m-ary tree if every internal node has no more
than m children. A full m-ary tree is a tree where every
internal node has exactly m children. A perfect m-ary
tree is a full m-ary tree in which all leaves are at the
same level.
2.2. The HETree Structure
In this section, we present in more detail the HETree
structure. HETree hierarchically organizes numeric
and temporal data into groups; intervals are used to represent these groups.1 The HETree is defined by the tree degree and the number of leaf nodes.2 Essentially, the number of leaf nodes corresponds to the number of groups into which the input data objects are organized. The tree degree corresponds to the (maximum) number of groups into which a group is split at the lower level.
1 Note that our structure handles numeric and temporal data in a
similar manner. Also, other types of one-dimensional data may be
supported, with the requirement that a total order can be defined over
the data.
2 Note that following a similar approach, the HETree can also be
defined by specifying the tree height instead of degree or number of
leaves.
Given a set of data objects (RDF triples) D, a positive integer ℓ denoting the number of leaf nodes, and a
positive integer d denoting the tree degree, an HETree(D, ℓ, d) is an ordered d-ary tree with the following
basic properties.
− The tree has exactly ℓ number of leaf nodes.
− All leaf nodes appear in the same level.
− Each leaf node contains a set of data objects,
sorted in ascending order based on their values.
Given a leaf node n, n.data denotes the data objects contained in n.
− Each internal node has at most d child nodes.
Let n be an internal node; n.ci denotes the i-th
child of the node n, with n.c1 being the leftmost
child.
− Each node corresponds to an interval. Given a
node n, n.I denotes the interval for the node n.
− At each level, all nodes are sorted based on the
lower bounds of their intervals. That is, let n be
an internal node; for any i < j, we have that
n.ci.I− ≤ n.cj.I−.
− For a leaf node, its interval is bounded by the
values of the objects included in this leaf node.
Let n be the leftmost leaf node, and assume that n
contains x objects from D. Then, we have that
n.I− = S[1].o and n.I+ = S[x].o, where S is
the ordered object set resulting from D.
− For an internal node, its interval is bounded by
the union of the intervals of its children. That
is, let n be an internal node having k child
nodes; then, we have n.I− = n.c1.I− and
n.I+ = n.ck.I+.
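A minimal sketch of a node satisfying the properties above, assuming a main-memory representation; the class and field names are ours, not the paper's.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative HETree node: an interval [lower, upper], an ordered list of
# children (empty for leaves), and the sorted data objects (leaves only).
@dataclass
class HETreeNode:
    lower: float                  # n.I-: lower bound of the node's interval
    upper: float                  # n.I+: upper bound of the node's interval
    children: List["HETreeNode"] = field(default_factory=list)
    data: list = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.children

    def enclose_children(self) -> None:
        # Internal-node property: the interval is bounded by the union of
        # the (ordered) children's intervals.
        self.lower = self.children[0].lower
        self.upper = self.children[-1].upper

# Example: a parent enclosing two leaves inherits [20, 35].
left = HETreeNode(20, 30, data=[20, 30])
right = HETreeNode(35, 35, data=[35, 35])
parent = HETreeNode(0, 0, children=[left, right])
parent.enclose_children()
print(parent.lower, parent.upper)  # 20 35
```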
Furthermore, we present two different approaches
for organizing the data in the HETree. Assume the
scenario in which a user wishes to (visually) explore
and analyse the historic events from DBpedia [8] per
decade. In this case, the user orders historic events by their
date and organizes them into groups of equal ranges
(i.e., decade). In a second scenario, assume that a user
wishes to analyse in the Eurostat dataset the gross domestic product (GDP) organized into fixed groups of
countries. In this case, the user is interested in finding information like: the range and the variance of the
GDP values over the top-10 countries with the highest
GDP factor. In this scenario, the user orders countries
by their GDP and organizes them into groups of equal
sizes (i.e., 10 countries per group).
In the first approach, we organize data objects into
groups, where the object values of each group cover an
equal range of values. In the second approach, we organize objects into groups, where each group contains
the same number of objects. In the following sections,
we present in detail the two approaches for organizing
the data in the HETree.

Fig. 2. A Content-based HETree (HETree-C). Root a = [20, 100]; internal nodes b = [20, 45] with children d, e, f, and c = [50, 100] with children g, h; leaf nodes d = [20, 30] {p8 age 20, p4 age 30}, e = [35, 35] {p0 age 35, p5 age 35}, f = [37, 45] {p3 age 37, p6 age 45}, g = [50, 55] {p9 age 50, p2 age 55}, h = [80, 100] {p7 age 80, p1 age 100}.
2.3. A Content-based HETree (HETree-C)
In this section we introduce a version of the HETree,
named HETree-C (Content-based HETree). This
HETree version organizes data into equally sized
groups. The basic property of the HETree-C is that
each leaf node contains approximately the same number of objects and the content (i.e., objects) of a leaf
node specifies its interval. For the tree construction,
the objects are first assigned to the leaves and then the
intervals are defined.
An HETree-C(D, ℓ, d) is an HETree with the following extra property: each leaf node contains λ or
λ − 1 objects, where λ = ⌈|D|/ℓ⌉.3 Particularly, the
ℓ − (λ·ℓ − |D|) leftmost leaves contain λ objects, while
the rest of the leaves contain λ − 1.4 We can equivalently define the HETree-C by providing the number of objects
per leaf λ, instead of the number of leaves ℓ.
Example 2. Figure 2 presents an HETree-C constructed by considering the set of objects D from
Figure 1, ℓ = 5 and d = 3. As we can observe,
all the leaf nodes contain an equal number of objects.
Particularly, we have that λ = ⌈10/5⌉ = 2. Regarding the leftmost interval, we have d.I− = 20 and
d.I+ = 30.
3 We assume that the number of objects is at least as large as the number
of leaves; i.e., |D| ≥ ℓ.
4 As an alternative, we can construct the HETree-C so that each leaf
contains λ objects, except the rightmost leaf, which will contain between 1 and λ objects.
Algorithm 1. createHETree-C/R(D, ℓ, d)
Input: D: set of objects; ℓ: number of leaf nodes; d: tree degree
Output: r: root node of the HETree
1  S ← sort D based on the objects' values
2  L ← constrLeaves-C/R(S, ℓ)
3  r ← constrtInterNodes(L, d)
4  return r
Procedure 1: constrLeaves-C(S, ℓ)
Input: S: ordered set of objects; ℓ: number of leaf nodes
Output: L: ordered set of leaf nodes
1   λ ← ⌈|S|/ℓ⌉
2   k ← ℓ − (λ · ℓ − |S|)
3   beg ← 1
4   for i ← 1 to ℓ do
5       create an empty leaf node n
6       if i ≤ k then
7           num ← λ
8       else
9           num ← λ − 1
10      end ← beg + num − 1
11      for t ← beg to end do
12          n.data ← S[t]
13      n.I− ← S[beg].o
14      n.I+ ← S[end].o
15      L[i] ← n
16      beg ← end + 1
17  return L
2.3.1. The HETree-C Construction
We construct the HETree-C in a bottom-up way. Algorithm 1 describes the HETree-C construction. Initially, the algorithm sorts the object set D in ascending
order, based on the objects' values (line 1). Then, the algorithm uses two procedures to construct the tree nodes.
Finally, the root node of the constructed tree is returned
(line 4).
The constrLeaves-C procedure (Procedure 1) constructs ℓ leaf nodes (lines 4–16). For the first k leaves, λ
objects are inserted, while for the rest of the leaves, λ − 1 objects are inserted (lines 6–9). Finally, the set of created
leaf nodes is returned (line 17).
The constrtInterNodes procedure (Procedure 2)
builds the internal nodes in a recursive manner. For
the nodes H, their parent nodes P are created (lines
4–16); then, the procedure calls itself using the parent nodes P as input (line 21). The recursion terminates
when the number of created parent nodes is equal to
one (line 17); i.e., the root of the tree has been created.
Computational Analysis. The computational cost for
the HETree-C construction (Algorithm 1) is the sum
of three parts. The first is sorting the input data, which
in the worst case can be done in O(|D| log |D|), employing a linearithmic sorting algorithm (e.g., merge-sort). The second part is the constrLeaves-C procedure, which requires O(|D|) for scanning all data objects. The third part is the constrtInterNodes procedure, which requires O(d · (ℓ/d + ℓ/d² + ℓ/d³ + · · · + 1)),
with the sum being the number of internal nodes in the
tree. Note that the maximum number of internal nodes
in a d-ary tree corresponds to the number of internal
nodes in a perfect d-ary tree of the same height. Also,
note that the number of internal nodes of a perfect d-ary
tree of height h is (dʰ − 1)/(d − 1). In our case, the height of our
tree is h = ⌈log_d ℓ⌉. Hence, the maximum number of
internal nodes is (d^⌈log_d ℓ⌉ − 1)/(d − 1) ≤ (d·ℓ − 1)/(d − 1). Therefore, the
constrtInterNodes procedure in the worst case requires
O((d²·ℓ − d)/(d − 1)). Hence, the overall computational cost
for the HETree-C construction in the worst case is
O(|D| log |D| + |D| + (d²·ℓ − d)/(d − 1)) = O(|D| log |D| +
(d²·ℓ − d)/(d − 1)).5
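As a concrete companion to this analysis, the following Python sketch implements the bottom-up HETree-C construction under our own naming (`make_node` and `build_hetree_c` are illustrative, not the paper's code): it sorts the input, packs it into ℓ leaves of λ or λ − 1 objects, and then repeatedly groups every d consecutive nodes under a parent until a single root remains.

```python
import math

def make_node(lower, upper, children=None, data=None):
    # A node is a plain dict: interval I = (lower, upper), children, data.
    return {"I": (lower, upper), "children": children or [], "data": data or []}

def build_hetree_c(values, num_leaves, degree):
    S = sorted(values)                            # sort D (Algorithm 1, line 1)
    lam = math.ceil(len(S) / num_leaves)          # λ = ceil(|D|/ℓ)
    k = num_leaves - (lam * num_leaves - len(S))  # k leftmost leaves get λ
    leaves, beg = [], 0
    for i in range(num_leaves):                   # constrLeaves-C
        num = lam if i < k else lam - 1
        chunk = S[beg:beg + num]
        leaves.append(make_node(chunk[0], chunk[-1], data=chunk))
        beg += num
    level = leaves
    while len(level) > 1:                         # constrtInterNodes, iteratively
        groups = [level[j:j + degree] for j in range(0, len(level), degree)]
        level = [make_node(g[0]["I"][0], g[-1]["I"][1], children=g)
                 for g in groups]
    return level[0]

# Running example (Figure 2): ℓ = 5, d = 3 over the ten ages of Figure 1.
root = build_hetree_c([35, 100, 55, 37, 30, 35, 45, 80, 20, 50], 5, 3)
print(root["I"])                 # (20, 100)
print(root["children"][0]["I"])  # (20, 45): internal node b
```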
2.4. A Range-based HETree (HETree-R)
The second version of the HETree is called HETree-R
(Range-based HETree). HETree-R organizes data into
equally ranged groups. The basic property of the
HETree-R is that each leaf node covers an equal range
of values. Therefore, in HETree-R, the data space defined by the objects values is equally divided over the
leaves. As opposed to HETree-C, in HETree-R the interval of a leaf specifies its content. Therefore, for the
HETree-R construction, the intervals of all leaves are
first defined and then objects are inserted.
An HETree-R(D, ℓ, d) is an HETree with the following extra property: the interval of each leaf node
has the same length, i.e., it covers an equal range of values.
Formally, let S be the sorted RDF set resulting from
D; the interval of each leaf node has length ρ, where
ρ = |S[1].o − S[|S|].o| / ℓ.6 Therefore, for a leaf node n, we
have that |n.I+ − n.I−| = ρ. For example, for the
leftmost leaf, its interval is [S[1].o, S[1].o + ρ). The
HETree-R is equivalently defined by providing the interval length ρ, instead of the number of leaves ℓ.

5 In the complexity computations presented throughout the paper,
terms that are dominated by others (i.e., having a lower growth rate)
are omitted.
6 We assume here that there is at least one object in D with a different value than the rest of the objects.

Example 3. Figure 3 presents an HETree-R tree constructed by considering the set of objects D (Figure 1), ℓ = 5 and d = 3. As we can observe from
Figure 3, each leaf node covers an equal range of values. Particularly, we have that the interval of each
leaf must have length ρ = |20 − 100| / 5 = 16.

Fig. 3. A Range-based HETree (HETree-R). Root a = [20, 100]; internal nodes b = [20, 68) with children d, e, f, and c = [68, 100] with children g, h; leaf nodes d = [20, 36) {p8 age 20, p4 age 30, p0 age 35, p5 age 35}, e = [36, 52) {p3 age 37, p6 age 45, p9 age 50}, f = [52, 68) {p2 age 55}, g = [68, 84) {p7 age 80}, h = [84, 100] {p1 age 100}.

Procedure 2: constrtInterNodes(H, d)
Input: H: ordered set of nodes; d: tree degree
Output: r: root node for H
Variables: P: ordered set of H's parent nodes
1   pnum ← ⌈|H|/d⌉              //number of parent nodes
2   t ← d − (pnum · d − |H|)    //last parent's number of children
3   cbeg ← 1                    //first child node
4   for p ← 1 to pnum do
5       create an empty internal node n
6       if p = pnum then
7           cnum ← t            //number of children
8       else
9           cnum ← d
10      cend ← cbeg + cnum − 1  //last child node
11      for j ← cbeg to cend do
12          n.c[j] ← H[j]
13      n.I− ← H[cbeg].I−
14      n.I+ ← H[cend].I+
15      P[p] ← n
16      cbeg ← cend + 1
17  if pnum = 1 then
18      r ← P[1]
19      return r
20  else
21      return constrtInterNodes(P, d)

2.4.1. The HETree-R Construction
This section studies the construction of the HETree-R
structure. The HETree-R is also constructed in a
bottom-up fashion.
Similarly to the HETree-C version, Algorithm 1
is used for the HETree-R construction. The only difference is the constrLeaves-R procedure (line 2), which
creates the leaf nodes of the HETree-R and is presented
in Procedure 3.

Procedure 3: constrLeaves-R(S, ℓ)
Input: S: ordered set of objects; ℓ: number of leaf nodes
Output: L: ordered set of leaf nodes
1   ρ ← |S[1].o − S[|S|].o| / ℓ
2   for i ← 1 to ℓ do
3       create an empty leaf node n
4       if i = 1 then
5           n.I− ← S[1].o
6       else
7           n.I− ← L[i − 1].I+
8       n.I+ ← n.I− + ρ
9       L[i] ← n
10  for t ← 1 to |S| do
11      j ← ⌊(S[t].o − S[1].o)/ρ⌋ + 1
12      L[j].data ← S[t]
13  return L

The procedure constructs ℓ leaf nodes (lines 2–9), assigning intervals of equal length to all of
them (lines 4–8). Then, it traverses all objects in S (lines
10–12) and places each one in the appropriate leaf node
(line 12). Finally, it returns the set of created leaves (line
13).
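To make the bucketing step of constrLeaves-R concrete, here is a Python sketch under our own naming (`range_leaves` is illustrative): it computes ρ, derives the ℓ equal-length intervals, and assigns each value to a leaf via the index formula of line 11; the maximum value is clamped into the last leaf, mirroring the closed upper bound of the rightmost interval.

```python
def range_leaves(values, num_leaves):
    S = sorted(values)
    rho = (S[-1] - S[0]) / num_leaves        # ρ = |S[|S|].o − S[1].o| / ℓ
    # ℓ consecutive intervals of length ρ starting at the minimum value.
    intervals = [(S[0] + i * rho, S[0] + (i + 1) * rho)
                 for i in range(num_leaves)]
    buckets = [[] for _ in range(num_leaves)]
    for v in S:
        # 0-based variant of j = floor((S[t].o − S[1].o)/ρ) + 1,
        # clamped so the maximum value falls into the last leaf.
        j = min(int((v - S[0]) // rho), num_leaves - 1)
        buckets[j].append(v)
    return intervals, buckets

# Running example (Figure 3): ten ages, ℓ = 5 → ρ = (100 − 20)/5 = 16.
intervals, buckets = range_leaves([35, 100, 55, 37, 30, 35, 45, 80, 20, 50], 5)
print(intervals[0])  # (20.0, 36.0)
print(buckets[0])    # [20, 30, 35, 35]
```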
Computational Analysis. The computational cost of
the HETree-R construction (Algorithm 1) for sorting the input data (line 1) and creating the internal nodes (line 3) is the same as in the HETree-C
case. The constrLeaves-R procedure (line 2) requires
O(ℓ + |D|) = O(|D|) (since |D| ≥ ℓ). Using the computational costs for the first and the third part from
Section 2.3.1, we have that, in the worst case, the overall computational cost for the HETree-R construction
is O(|D| log |D| + |D| + (d²·ℓ − d)/(d − 1)) = O(|D| log |D| +
(d²·ℓ − d)/(d − 1)).
2.5. Estimating the HETree Parameters
In our working scenario, the user specifies the parameters required for the HETree construction (e.g.,
number of leaves ℓ). In this section, we describe our
approach for automatically calculating the HETree parameters based on the input data, when no user preferences are provided. Our goal is to derive the parameters from the input data, such that the resulting HETree
can address some basic guidelines set by the visualization environment. In what follows, we discuss in detail
the proposed approach.
An important parameter in hierarchical visualizations is the minimum and maximum number of objects
that can be effectively rendered in the most detailed
level.7 In our case, the above numbers correspond to
the number of objects contained in the leaf nodes. The
proper calculation of these numbers is crucial, so that
the resulting tree avoids overloaded visualizations.
Therefore, in HETree construction, our approach
considers the minimum and the maximum number of
objects per leaf node, denoted as λmin and λmax , respectively. Besides the number of objects rendered in
the lowest level, our approach considers perfect m-ary
trees, such that a more "uniform" structure (i.e., all the
internal nodes have exactly m child nodes) results. The
following example illustrates our approach of calculating the HETree parameters.
Example 4. Assume that based on an adopted visualization technique, the ideal number of data objects
to be rendered on a specific screen is between 25 and
50. Hence, we have that λmin = 25 and λmax = 50.
Now, let's assume that we want to visualize the
object set D1, using an HETree-C, where |D1| = 500.
Based on the number of objects and the λ bounds,
we can estimate the bounds for the number of leaves.
Let ℓmin and ℓmax denote the lower and the upper
bound for the number of leaves, respectively. Therefore, we have
that |D1|/λmax ≤ ℓ ≤ |D1|/λmin ⇔ 500/50 ≤ ℓ ≤
500/25 ⇔ 10 ≤ ℓ ≤ 20.
Hence, our HETree-C should have between
ℓmin = 10 and ℓmax = 20 leaf nodes. Since we
consider perfect m-ary trees, from Table 1 we can
identify the tree characteristics that conform to the
number-of-leaves guideline. The candidate setting
(i.e., leaf number and degree) is indicated in Table 1
using dark-grey colour. Note that the settings with
d = 2 are not examined, since visualizing two groups
of objects at each level is considered a small number
under most visualization settings. Hence, in any case
we only assume settings with d ≥ 3 and height ≥ 2.
Therefore, an HETree-C with ℓ = 16 and d = 4 is a
suitable structure for our case.
Now, let's assume that we want to visualize the object set D2, where |D2| = 1000. Following a similar approach, we have that 20 ≤ ℓ ≤ 40.

⁷ Similar bounds can also be defined for other tree levels.
Table 1. Number of leaf nodes for perfect m-ary trees

         Degree
Height     3       4       5       6
   1       3       4       5       6
   2       9      16      25      36
   3      27      64     125     216
   4      81     256     625    1296
   5     243    1024    3125    7776
   6     729    4096   15625   46656
date settings are indicated in Table 1 using light-grey
colour. Hence, we have the following settings that
satisfy the considered guideline: S1: ℓ = 27, d = 3;
S2: ℓ = 25, d = 5; and S3: ℓ = 36, d = 6.
In the case where more than one setting satisfies the considered guideline, we select the preferable one according to the following set of rules. From the candidate settings, we prefer the setting which results in the highest tree (1st Criterion).⁸ In case the highest tree is constructed by more than one setting, we consider the distance c between ℓ and the centre of ℓmin and ℓmax (2nd Criterion); i.e., c = |ℓ − (ℓmin + ℓmax)/2|. The setting with the lowest c value is selected. Note that, based on the visualization context, different criteria and preferences may be followed.
In our example, from the candidate settings, setting S1 is selected, since it will construct the highest tree (i.e., height = 3). On the other hand, settings S2 and S3 will construct trees with lower heights (i.e., height = 2).
Now, assume a scenario where only S2 and S3 are candidates. In this case, since both settings result in trees with equal heights, the 2nd Criterion is considered. Hence, for S2 we have c2 = |25 − (20 + 40)/2| = 5. Similarly, for S3 we have c3 = |36 − (20 + 40)/2| = 6. Therefore, between S2 and S3, the setting S2 is preferable, since c2 < c3.
In the case of HETree-R, a similar approach is followed, assuming a normal distribution over the values of the objects.
⁸ Depending on user preferences and the examined task, the shortest tree may be preferable. For example, starting from the root, the user may wish to access the data objects (i.e., the lowest level) by performing the smallest possible number of drill-down operations.
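The parameter derivation described above lends itself to a direct implementation. The following sketch (function names and the enumeration strategy are ours, not part of the paper) lists the perfect m-ary settings whose leaf count falls within the computed bounds, and then applies the two selection criteria:

```python
import math

def candidate_settings(n, lam_min, lam_max, d_min=3, h_min=2):
    """Enumerate perfect m-ary settings (leaves, degree, height) whose
    leaf count lies in [n/lam_max, n/lam_min], with d >= 3 and height >= 2."""
    l_min, l_max = math.ceil(n / lam_max), n // lam_min
    settings = []
    for d in range(d_min, l_max + 1):
        h = h_min
        while d ** h <= l_max:
            if d ** h >= l_min:
                settings.append((d ** h, d, h))
            h += 1
    return settings

def pick_setting(settings, l_min, l_max):
    """1st Criterion: prefer the highest tree; 2nd Criterion: prefer the
    leaf count closest to the centre of [l_min, l_max]."""
    centre = (l_min + l_max) / 2
    return min(settings, key=lambda s: (-s[2], abs(s[0] - centre)))
```

For Example 4, this yields the single setting (ℓ = 16, d = 4) for D1, and the three settings S1-S3 for D2, with S1 (ℓ = 27, d = 3) chosen by the 1st Criterion.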
2.6. Statistics Computations over HETree
Data statistics are a crucial aspect in the context of hierarchical visual exploration and analysis. Statistical information over groups of objects (i.e., aggregations) offers rich insights into the underlying (i.e., aggregated) data. In this way, useful information regarding different sets of objects with common characteristics is provided. Additionally, this information may also guide the users through their navigation over the hierarchy.
In this section, we present how statistics computation is performed over the nodes of the HETree.
Statistics computations exploit two main aspects of
the HETree structure: (1) the internal nodes aggregate their child nodes; and (2) the tree is constructed in a bottom-up fashion. Statistics computation is performed during the tree construction; for the leaf nodes,
we gather statistics from the objects they contain,
whereas for the internal nodes we aggregate the statistics of their children.
For simplicity, here we assume that each node contains the following extra fields, used for simple statistics computations, although more complex or RDF-related statistics (e.g., most common subject, subject with the minimum value, etc.) can also be computed. Given a node n, we denote by n.N the number of objects covered by n, and by n.µ and n.σ² the mean and the variance of the values of the objects covered by n, respectively. Additionally, we assume the minimum and the maximum values, denoted as n.min and n.max, respectively.
Statistics computations can be easily performed in the construction algorithms (Algorithm 1) without any modifications. The following example illustrates these computations.
Example 5. In this example we assume the HETree-C presented in Figure 2. Figure 4 shows the HETree-C with the computed statistics in each node. When all the leaf nodes have been constructed, the statistics for each leaf are computed. For instance, we can see from Figure 4 that, for the rightmost leaf h, we have: h.N = 2, h.µ = (80 + 100)/2 = 90 and h.σ² = (1/2) · ((80 − 90)² + (100 − 90)²) = 100. Also, we have h.min = 80 and h.max = 100. Following the above process, we compute the statistics for all leaf nodes.
[Fig. 4. Statistics computation over HETree: each node of the HETree-C of Figure 2 is annotated with its value interval and its computed N, µ, and σ².]

Then, for each parent node we construct, we compute its statistics using the computed statistics of its child nodes. Considering the internal node c, with child nodes g and h, we have that
c.min = 50 and c.max = 100. Also, we have that c.N = g.N + h.N = 2 + 2 = 4. Now we compute the mean value by combining the children's mean values: c.µ = (g.N · g.µ + h.N · h.µ)/(g.N + h.N) = (2 · 52.5 + 2 · 90)/(2 + 2) = 71.3. Similarly, for the variance we have c.σ² = (g.N · g.σ² + h.N · h.σ² + g.N · (g.µ − c.µ)² + h.N · (h.µ − c.µ)²)/(g.N + h.N) = (2 · 6.25 + 2 · 100 + 2 · (52.5 − 71.3)² + 2 · (90 − 71.3)²)/(2 + 2) = 404.7.
A similar approach is also followed for the case of HETree-R.
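The bottom-up computation just described can be sketched in a few lines; the Stats container and function names are ours, while the merge formulas are exactly the N, µ and σ² combinations used for node c above:

```python
from dataclasses import dataclass

@dataclass
class Stats:
    n: int        # number of covered objects
    mean: float   # µ
    var: float    # σ² (population variance)
    min: float
    max: float

def leaf_stats(values):
    """Statistics of a leaf, computed directly from its objects' values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return Stats(n, mean, var, min(values), max(values))

def merge_stats(children):
    """Statistics of an internal node, aggregated from its children's stats."""
    n = sum(c.n for c in children)
    mean = sum(c.n * c.mean for c in children) / n
    var = sum(c.n * (c.var + (c.mean - mean) ** 2) for c in children) / n
    return Stats(n, mean, var,
                 min(c.min for c in children),
                 max(c.max for c in children))
```

Applied to the leaves g (values 50, 55) and h (values 80, 100) of Example 5, merge_stats reproduces the statistics of node c (N = 4, µ = 71.25, σ² ≈ 404.7).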
Computational Analysis. Most of the well-known statistics (e.g., mean, variance, skewness, etc.) can be computed linearly w.r.t. the number of elements. Therefore, the computation cost over a set of numeric values S is O(|S|). Assume a leaf node n containing k objects; then the cost of statistics computations for n is O(k), and the cost for all leaf nodes is O(|D|). For an internal node n, the cost is O(d), since the statistics of n are computed by aggregating the statistics of its d child nodes. Considering that (d·ℓ − 1)/(d − 1) is the maximum number of internal nodes (Section 2.3.1), in the worst case the cost for the internal nodes is O((d²·ℓ − d)/(d − 1)). Therefore, the overall cost for statistics computations over an HETree is O(|D| + (d²·ℓ − d)/(d − 1)).
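Treating the asymptotic bound as an exact operation count, the overall cost can be written directly (a sketch; the function name is ours):

```python
def stats_cost_bound(dataset_size, d, num_leaves):
    """Worst-case operation count for statistics over an HETree:
    O(|D|) over the leaves, plus O(d) per internal node, with at most
    (d*l - 1)/(d - 1) internal nodes."""
    internal_nodes = (d * num_leaves - 1) // (d - 1)
    return dataset_size + d * internal_nodes
```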
3. Efficient Multilevel Exploration
In this section, we exploit the HETree structure in
order to efficiently handle different multilevel exploration scenarios. Essentially, we propose two methods for efficient hierarchical exploration over large
datasets. The first method incrementally constructs the
hierarchy via user interaction; the second one achieves
dynamic adaptation of the data organization based on
user’s preferences.
3.1. Exploration Scenarios
In a typical multilevel exploration scenario, referred to here as the Basic exploration scenario (BSC), the user explores a dataset in a top-down fashion. The user first obtains an overview of the data through the root level, and then drills down to more fine-grained contents in order to access the actual data objects at the leaves. In BSC, the root of the hierarchy is the starting point of the exploration and, thus, the first element to be presented (i.e., rendered).
The described scenario offers basic exploration capabilities; however, it does not cover use cases with user-specified starting points other than the root, such as starting the exploration from a specific resource, or from a specific range of values.
Consider the following example, in which the user wishes to explore the DBpedia infoboxes dataset to find places with very large population. Initially, she selects the populationTotal property and starts her exploration from the root node, moves down the right part of the tree and ends up at the rightmost leaf, which contains the most highly populated places. Then, she is interested in viewing the area size (i.e., the areaTotal property) for one of the highly populated places and, also, in exploring places with similar area size. Finally, she decides to explore places based on the water area size (i.e., areaWater) they contain. In this case, she prefers to start her exploration by considering places whose water area size is within a given range of values.
In this example, besides the BSC scenario, we consider two additional exploration scenarios. In the Resource-based exploration scenario (RES), the user specifies a resource of interest (e.g., an IRI) and a specific property; the exploration starts from the leaf containing the specified resource and proceeds in a bottom-up fashion. Thus, in RES the data objects contained in the same leaf as the resource of interest are presented first. We refer to that leaf as the leaf of interest.
The third scenario, named the Range-based exploration scenario (RAN), enables the user to start her exploration from an arbitrary point in the hierarchy by providing a range of values; the user starts from a set of internal nodes and can then move up or down the hierarchy. The RAN scenario begins by rendering all sibling nodes that are children of the node covering the specified range of interest; we refer to these nodes as the nodes of interest.
Note that, regarding the adopted rendering policy, in all scenarios we only consider nodes belonging to the same level. That is, sibling nodes, or data objects contained in the same leaf, are rendered.
Regarding the "navigation-related" operations, the user can move down or up the hierarchy by performing a drill-down or a roll-up operation, respectively. A drill-down operation over a node n enables the user to focus on n and render its child nodes. If n is a leaf node, the set of data objects contained in n is rendered. On the other hand, the user can perform a roll-up operation on a set of sibling nodes S. The parent node of S, along with the parent's sibling nodes, is rendered. Finally, a roll-up operation applied to a set of data objects O renders the leaf node that contains O along with its sibling leaves, whereas a drill-down operation is not applicable to a data object.
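The drill-down and roll-up semantics just described can be illustrated with a minimal in-memory tree (the Node layout is ours, not the paper's HETree data structure):

```python
class Node:
    """Toy hierarchy node: internal nodes hold children, leaves hold objects."""
    def __init__(self, children=None, objects=None):
        self.children = children or []
        self.objects = objects or []
        self.parent = None
        for c in self.children:
            c.parent = self

    def is_leaf(self):
        return not self.children

def drill_down(node):
    """Focus on `node`: render its children, or its data objects for a leaf."""
    return node.objects if node.is_leaf() else node.children

def roll_up(siblings):
    """Render the parent of `siblings` together with the parent's siblings."""
    parent = siblings[0].parent
    if parent is None:          # already at the root level
        return siblings
    return [parent] if parent.parent is None else parent.parent.children
```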
3.2. Incremental HETree Construction
In the Web of Data, the dataset might be dynamically retrieved from a remote site (e.g., via a SPARQL endpoint); as a result, in all exploration scenarios, we have assumed that the HETree is constructed on-the-fly at the time the user starts her exploration. In the previous DBpedia example, the user explores three different properties; although only a small part of their hierarchies is accessed, the whole hierarchies are constructed and the statistics of all nodes are computed. Considering the recommended HETree parameters for the employed properties, this scenario requires that 29.5K nodes be constructed for the populationTotal property, 9.8K nodes for areaTotal and 3.3K nodes for areaWater, amounting to a total of 42.6K nodes. However, the construction of the hierarchies for large datasets poses a time overhead (as shown in the experimental section) and, consequently, increases the response time in user exploration.
In this section, we introduce the ICO (Incremental HETree Construction) method, which incrementally constructs the HETree based on user interaction. The proposed method goes beyond incremental tree construction, aiming at further reducing the response time during the exploration process by "pre-constructing" (i.e., prefetching) the parts of the tree that will be visited by the user in her next roll-up or drill-down operation. Hence, a node n is not constructed when the user visits it for the first time; instead, it has been constructed in a previous exploration step, when the user was on a node from which n can be reached by a roll-up or a drill-down operation. This way, our method offers incremental construction of the tree, tailored to each user's exploration. Finally, we
show that, during an exploration scenario, ICO constructs the minimum number of HETree elements.

[Fig. 5. Incremental HETree construction example. ➊ Resource-based (RES) exploration scenario; ➋ Range-based (RAN) exploration scenario]
Employing the ICO method in the DBpedia example, the populationTotal hierarchy will construct only 76 nodes (the root along with its child nodes, and 9 nodes in each of the lower tree levels), and the areaTotal hierarchy will construct 3 nodes, corresponding to the leaf node containing the requested resource and its siblings. Finally, the areaWater hierarchy will initially contain either 6 or 15 nodes, depending on whether the user's input range corresponds to a set of sibling leaf nodes, or to a set of sibling internal nodes, respectively.
Example 6. We demonstrate the functionality of ICO through the following example. Assume the dataset used in our running examples, describing persons and their ages. Figure 5 presents the incremental construction of the HETree presented in Figure 3 for the RES and RAN exploration scenarios. Blue colour is used to indicate the HETree elements that are presented (rendered) to the user at each exploration stage.
In the RES scenario (upper flow in Figure 5), the user specifies "http://persons.com/p6" as her resource of interest; all data objects contained in the same leaf (i.e., e) as the resource of interest are initially presented to the user. ICO initially constructs the leaf e, along with its siblings, i.e., leaves d and f. These leaves correspond to the nodes that the user can reach in a next (roll-up) step. Next, the user rolls up, and the leaves d, e and f are presented to her. At the same time, the parent node b and its sibling c are constructed. Note that all elements which are accessible to the user by moving either down (i.e., the d, e, f data objects) or up (i.e., the b, c nodes) are already constructed. Finally, when the user rolls up, the b and c nodes are rendered, and the parent node a, along with the children of c, i.e., g and h, are constructed.
In the RAN scenario (lower flow in Figure 5), the user specifies [30, 50] as her range of interest. The nodes covering this range (i.e., d, e) are initially presented, along with their sibling f. Also, ICO constructs the parent node b and its sibling c, because they are accessible within one exploration step. Then, the user performs a roll-up and ICO constructs the a, g and h nodes (as described in the RES scenario above).
At the beginning of each exploration scenario, ICO constructs a set of initial nodes, which are the nodes initially presented, as well as the nodes potentially reached by the user's first operation (i.e., the required HETree elements). The required HETree elements of an exploration step are the nodes that can be reached by the user by performing one exploration operation. Hence, in the RES scenario, the initial nodes are the leaf of interest and its sibling leaves. In RAN, the initial nodes are the nodes of interest, their children, and their parent node along with its siblings. Finally, in the BSC scenario, the initial nodes are the root node and its children.
In what follows, we describe the construction rules adopted by ICO throughout the user exploration process.
These rules provide the correspondences between the
types of elements presented in each exploration step
and the elements that ICO constructs. Note that these
rules are applied after the construction of the initial
nodes, in all three exploration scenarios. The correctness of these rules is verified later in Proposition 1.
Rule 1: If a set of internal sibling nodes C is presented, ICO constructs: (i) the parent node of C along
with the parent’s siblings, and (ii) the children of each
node in C.
Rule 2: If a set of leaf sibling nodes L is presented,
ICO does not construct anything (the required nodes
have been previously constructed).
Rule 3: If a set of data objects O is presented, ICO
does not construct anything (the required nodes have
been previously constructed).
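Rules 1-3 boil down to the observation that construction work is triggered only when internal nodes are rendered. A schematic encoding of what ICO must build for a presented view (our own illustration, not the paper's pseudocode):

```python
def ico_required_constructions(presented):
    """Apply Rules 1-3 to the currently presented elements.
    `presented` is a pair (kind, elements), where kind is one of
    'internal', 'leaf' or 'objects'; returns the parts to construct."""
    kind, elements = presented
    if kind == 'internal':
        # Rule 1: parent + parent's siblings, and each node's children
        return (['parent and parent siblings'] +
                [('children-of', e) for e in elements])
    # Rules 2 and 3: everything reachable in one operation from leaves or
    # data objects was already constructed in a previous step
    return []
```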
The following proposition shows that, in all cases, the required HETree elements have been constructed earlier by ICO.⁹
Proposition 1. In any exploration scenario, the
HETree elements a user can reach by performing
one operation (i.e., required elements), have been
previously constructed by ICO.
Also, the following theorem shows that, in any exploration scenario, ICO constructs only the required HETree elements.
Theorem 1. ICO constructs the minimum number
of HETree elements in any exploration scenario.
3.2.1. ICO Algorithm
In this section, we present the incremental HETree
construction algorithm. Note that here we include the pseudocode only for the HETree-R version, since the only differences from the HETree-C version are the way in which the nodes' intervals are computed and the fact that the dataset is initially sorted. In the analysis of the algorithms, both versions are studied.
Here, we assume that each node n contains the following extra fields: n.p denotes the parent node of n, and n.h denotes the height of n in the hierarchy. Additionally, given a dataset D, D.minv and D.maxv denote the minimum and the maximum value over all objects in D, respectively. The user preferences regarding the exploration's starting point are represented as an interval U. In the RES scenario, given that the value of the explored property for the resource of interest is o, we have U− = U+ = o. In the RAN scenario, given that the range of interest is R, we have U− = max(D.minv, R−) and U+ = min(D.maxv, R+). In the BSC scenario, the user does not provide any preferences regarding the starting point, so we have U− = D.minv and U+ = D.maxv. Finally, according to the definition of the HETree, a node n encloses a data object (i.e., triple) tr if n.I− ≤ tr.o and n.I+ ≥ tr.o.
The algorithm ICO-R (Algorithm 2) implements the incremental method for HETree-R. The algorithm uses two procedures to construct all required nodes (available in Appendix B). The first procedure, constrRollUp-R (Procedure 4), constructs the

⁹ Proofs are included in Appendix A.
Algorithm 2. ICO-R(D, ℓ, d, U, cur, H)
Input: D: set of objects; ℓ: number of leaf nodes; d: tree degree; U: interval representing the user's starting point; cur: currently presented elements; H: currently created HETree-R
Output: H: updated HETree-R
Variables: len: the length of the leaf's interval

1   if cur = null then                                  //first ICO call
2       len ← (D.maxv − D.minv) / ℓ
3       from U compute I0, h0                           //used for constructing initial nodes
4       cur, H ← constrSiblingNodes-R(I0, null, D, h0)
5       if RES then return H
6   if cur[1].p = null and D ≠ ∅ then
7       H ← constrRollUp-R(D, d, cur, H)
8   if cur[1].h > 0 then                                //cur are not leaves
9       H ← constrDrillDown-R(D, d, cur, H)
10  return H
nodes which can be reached by a roll-up operation, whereas constrDrillDown-R (Procedure 5) constructs the nodes which can be reached by a drill-down operation. Additionally, the aforementioned procedures exploit two secondary procedures (Appendix B): computeSiblingInterv-R (Procedure 6) and constrSiblingNodes-R (Procedure 7), which are used for computing nodes' intervals and for constructing nodes.
The ICO-R algorithm is invoked at the beginning of the exploration scenario, in order to construct the initial nodes, as well as every time the user performs an operation. The algorithm takes as input the dataset D, the tree parameters d and ℓ, the starting point U, the currently presented (i.e., rendered) elements cur, and the constructed HETree H. ICO-R begins with the currently presented elements cur equal to null (lines 1-5). Based on the starting point U, the algorithm computes the interval I0 corresponding to the sibling nodes that are first presented to the user, as well as its hierarchy height h0 (line 3). For the sake of simplicity, the details of computing I0 and h0 are omitted. For example, the interval I for the leaf that contains the resource of interest with object value o is computed as I− = D.minv + len · ⌊(o − D.minv)/len⌋ and I+ = min(D.maxv, I− + len). Following a similar approach, we can easily compute I0 and h0.
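Assuming equal-width leaf intervals, the I− and I+ computation for the leaf enclosing a value o can be sketched as follows (our reading of the formulas; the floor snaps o to the leaf grid):

```python
import math

def leaf_interval(minv, maxv, num_leaves, o):
    """Equal-width HETree-R leaf interval [I-, I+] enclosing value o."""
    length = (maxv - minv) / num_leaves
    lower = minv + length * math.floor((o - minv) / length)
    return lower, min(maxv, lower + length)
```

For the running example (ages in [20, 100], five leaves of width 16), the value 37 of resource p3 falls into the leaf interval [36, 52), matching Figure 5.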
Based on I0, the algorithm constructs the sibling nodes that are first presented to the user (line 4). Then, the algorithm constructs the rest of the initial nodes (lines 6-9). In the RES case, as I0 we consider the interval that includes the leaf containing the resource of interest, along with its sibling leaves. Hence, all the initial nodes are constructed in line 4 and the algorithm terminates (line 5) until the user's next operation.
After the first call, in each ICO execution, the algorithm initially checks whether the parent node of the currently presented elements has already been constructed, or whether all the nodes that enclose data objects¹⁰ have been constructed (line 6). Then, the procedure constrRollUp-R (line 7) is used to construct the parent node of cur, as well as the parent's siblings. In the case where the elements of cur are not leaf nodes or data objects (line 8), the procedure constrDrillDown-R (line 9) is used to construct all the children of cur. Finally, the algorithm returns the updated HETree (line 10).
3.2.2. Computational Analysis
Here we analyse the incremental construction for both HETree versions.
Number of Constructed Nodes. Regarding the number of initial nodes constructed in each scenario: in the RES scenario, at most d leaf nodes are constructed; in the RAN scenario, at most 2d + d² nodes are constructed; finally, in the BSC scenario, d + 1 nodes are constructed.
Regarding the maximum number of nodes constructed in each operation in the RES and RAN scenarios: (1) A roll-up operation constructs at most d + d · (d − 1) = d² nodes; the d nodes are constructed in constrRollUp, whereas the d · (d − 1) in constrDrillDown. (2) A drill-down operation constructs at most d² nodes in constrDrillDown. As for the BSC scenario: (1) A roll-up operation does not construct any nodes. (2) A drill-down operation constructs at most d² nodes in constrDrillDown.
Discussion. The worst case for the computational cost is higher in HETree-R than in HETree-C, for all exploration scenarios. Particularly, in the HETree-R worst case, ICO must build leaves that contain the whole dataset, and the computational cost is O(|D| log |D|) for all scenarios. In HETree-C, for the RES and RAN scenarios, the cost is O(d² + ((d − 1)/d) · |D|), and for the BSC scenario the cost is O(d² + |D|). A detailed computational analysis for both HETree-R and HETree-C is included in Appendix C.

¹⁰ Note that in the HETree-R version, we may have nodes that do not enclose any data objects.

[Fig. 6. Adaptive HETree example: an HETree-C over ages with ℓ = 8 and d = 2, together with two reconstructions, one where the degree of the subtree of b is modified to d′ = 4, and one where the number of leaves of the subtree of c is reduced to ℓ′ = 2.]

3.3. Adaptive HETree Construction

In a (visual) exploration scenario, users wish to modify the organization of the data by providing user-specific preferences for the whole hierarchy or part
of it. The user can select a specific subtree and alter
the number of groups presented in each level (i.e., the
tree degree) or the size of the groups (i.e., number of
leaves). In this case, a new tree (or a part of it) pertaining to the new parameters provided by the user should
be constructed on-the-fly.
For example, consider the HETree-C of Figure 6, representing the ages of persons.¹¹ A user may navigate to node b, where she prefers to increase the number of groups presented in each level. Thus, she modifies the degree of b from 2 to 4, and the subtree is adapted to the new parameter, as depicted in the bottom tree of Figure 6. On the other hand, suppose the user prefers exploring the right subtree (starting from node c) with less detail. She chooses to increase the size of the groups by reducing the number of leaves for the subtree of c (from 4 to 2). In both cases, constructing the subtree from scratch based on the user-provided parameters and recomputing statistics entail a significant time overhead, especially when user preferences are applied to a large part of, or the whole, hierarchy.
In this section, we introduce the ADA (Adaptive HETree Construction) method, which dynamically adapts an existing HETree to a new one, considering a set of user-defined parameters. Instead of constructing the tree and computing the nodes' statistics from scratch, our method reconstructs the new part(s) of the hierarchy by exploiting the existing elements (i.e., nodes, statistics) of the tree. In this way, ADA reduces the overall construction cost and enables the on-the-fly reorganization of the visualized data.

¹¹ For simplicity, Figure 6 presents only the values of the objects.
Table 2. Summary of Adaptive HETree Construction⋆

[Table 2 reports, for Full Construction and for each reconstruction case (modify degree: d′ = d^k, d′ = k · d, d′ = ᵏ√d, elsewhere; modify number of leaves: ℓ′ = ℓ/d^k, ℓ′ = ℓ/k, ℓ′ = ℓ − k), the complexity of tree construction and of statistics computations, together with the numbers of leaves and internal nodes constructed from scratch (#leaves0, #internals0) and derived from T (#leaves+, #internals+).]

⋆ m = |D|, and e = (d′ℓ′ − 1)/(d′ − 1) is the maximum number of internal nodes.
In the example of Figure 6, the new subtree of b can be derived
from the old one, just by removing the internal nodes d
and e, while the new subtree of c results from merging
leaves together and aggregating their statistics.
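The leaf-merging reconstruction for c can be sketched as follows (a minimal illustration, with leaves represented as lists of values; the merged leaves' statistics can then be aggregated from the old leaves' statistics as in Section 2.6):

```python
def merge_leaves(leaf_values, k):
    """l' = l/k: each new leaf takes the objects of k consecutive old
    leaves, so no object needs to be re-read from the dataset."""
    return [sum(leaf_values[i:i + k], [])
            for i in range(0, len(leaf_values), k)]
```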
Let T (D, ℓ, d) denote the existing HETree and
T ′ (D, ℓ′ , d′ ) is the new HETree corresponding to the
new user preferences for the tree degree d′ and the
number of leaves ℓ′ . Note that T could also denote a
subtree of an existing HETree (in the scenario where
the user modifies only a part of it). In this case, the
user indicates the reconstruction root of T .
Then, ADA identifies the following elements of T :
(1) The elements of T that also exist in T ′ . For example, consider the following two cases: the leaf nodes
of T ′ are internal nodes of T in level x; the statistics
of T ′ nodes in level x are equal to the statistics of T
nodes in level y. (2) The elements of T that can be
reused (as "building blocks") for constructing elements
in T ′ . For example, consider the following two cases:
each leaf node of T ′ is constructed by merging x leaf
nodes of T ; the statistics for the node n of T ′ can be
computed by aggregating the statistics from the nodes
q and w of T .
Consequently, we consider that an element (i.e., a node or a node's statistics) in T′ can be: (1) constructed/computed from scratch¹²; (2) reused as-is from T; or (3) derived by aggregating elements of T.
Table 2 summarizes the ADA reconstruction process. In particular, the table includes: (1) the computational complexity of constructing T′, denoted as Complexity; (2) the number of leaves and internal nodes of T′ constructed from scratch, denoted as #leaves0 and #internals0, respectively; and (3) the number of leaves and internal nodes of T′ derived from nodes of T, denoted as #leaves+ and #internals+, respectively. The lower part of the table presents the results for the computation of node statistics in T′. Finally, the second table column, denoted as Full Construction, presents the results of constructing T′ from scratch.

¹² Note that it is possible for a node of T′ constructed from scratch to aggregate statistics from nodes of T.
The following example demonstrates the ADA results, considering a DBpedia exploration scenario.
Example 7. The user explores the populationTotal
property of the DBpedia dataset. The default system
organization for this property is a hierarchy with degree 3. The user modifies the tree parameters in order
to fit better visualization results as following. First,
she decides to render more groups in each hierarchy
level and increases the degree from 3 to 9 (1st Modification). Then, she observes that the results overflow the visualization area and that a smaller degree fits better; thus she re-adjusts the tree degree to
a value of 6 (2nd Modification). Finally, she navigates through the data values and decides to increase
the groups’ size by a factor of three (i.e., dividing
by three the number of leaves) (3rd Modification).
Again, she corrects her decision and readjusts the final group size to twice the default size (4th Modification).
Table 3 summarizes the numbers of nodes constructed by a Full Construction and by ADA in each modification, along with the required statistics computations. Considering the whole set of modifications, ADA constructs only 22% (15.4K vs. 70.2K) of the nodes that are created in the case of a full construction. Also, ADA computes statistics for only 8% (5.6K vs. 70.2K) of the nodes.
Table 3. Full Construction vs. ADA over the DBpedia Exploration Scenario (cell values: Full / ADA)

                                   Modify Degree                        Modify Num. of Leaves
                                   1st Modification  2nd Modification   3rd Modification  4th Modification
Tree Construction (#nodes)         22.1K / 0         23.6K / 3.9K       9.8K / 6.6K       14.7K / 4.9K
Statistics Computations (#nodes)   22.1K / 0         23.6K / 659        9.8K / 0          14.7K / 4.9K
In the next sections, we present the reconstruction process in detail through the example trees of Figure 7. Figure 7a presents the initial tree T, which is an HETree-C with ℓ = 8 and d = 2. Figures 7b-7e present several reconstructed trees T′. Blue dashed lines are used to indicate the elements (i.e., nodes, edges) of T′ which do not exist in T. Regarding statistics, we assume that in each node we compute the mean value. In each T′, we present only the mean values that are not known from T. Also, in the mean value computations, the values that are reused from T are highlighted in yellow. All reconstruction details and the computational analysis for each case are included in Appendix D.
3.3.1. The User Modifies the Tree Degree
Regarding the modification of the degree parameter,
we distinguish the following cases:
The user increases the tree degree. We have that
d′ > d; based on the d′ value we have the following
cases:
(1) d′ = d^k, with k ∈ ℕ+ and k > 1: Figure 7a presents T with d = 2 and Figure 7d presents the reconstructed T′ with d′ = 4 (i.e., k = 2). T′ results by simply removing the nodes with height 1 (i.e., d, e, f, g) and connecting the nodes with height 2 (i.e., b, c) with the leaves.
In general, T ′ results from T by simply removing
tree levels from T . Additionally, there is no need for
computing any new statistics, since the statistics for all
nodes of T ′ remain the same as in T .
(2) d′ = k · d, with k ∈ ℕ+, k > 1 and k ≠ d^ν, where ν ∈ ℕ+: An example with k = 3 is presented in Figure 7c, where we have d′ = 6. In this case, the
leaves of T (Figure 7a) remain leaves in T ′ and all
internal nodes up to the reconstruction root of T are
constructed from scratch. As for the node statistics, we
can compute the mean values for T ′ nodes with height
1 (i.e., µb , µc ) by aggregating already computed mean
values (e.g., µd , µe , etc.) from T .
In general, the leaves are retained and all internal nodes are constructed from scratch. For the internal nodes of height 1, we compute their statistics by aggregating the statistics of the T leaves, whereas for the internal nodes of height greater than 1, we compute their statistics from scratch.
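As an illustration, this reuse of precomputed statistics amounts to a weighted aggregation of the children's means; the helper below is a sketch of ours (not code from the paper), and the same idea extends to other decomposable statistics (e.g., min, max, variance):

```python
def merge_means(children):
    """Combine the (mean, count) pairs of child nodes into the
    parent's (mean, count), without revisiting the raw data objects."""
    total = sum(count for _, count in children)
    mean = sum(m * count for m, count in children) / total
    return mean, total

# A height-1 node of T' aggregating three T leaves of 6 objects each,
# in the spirit of Figure 7c: mu = (mu_d*6 + mu_e*6 + mu_f*6) / 18
mu, n = merge_means([(10.0, 6), (20.0, 6), (30.0, 6)])
print(mu, n)  # 20.0 18
```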
(3) elsewhere: In any other case where the user increases the tree degree, all internal nodes in T′ are constructed from scratch. In contrast with the previous case, the leaves’ statistics from T cannot be reused and, thus, the statistics for all internal nodes in T′ are recomputed.
The user decreases the tree degree. Here we have
that d′ < d; based on the d′ value we have the following two cases:
(1) d′ = ᵏ√d, with k ∈ ℕ+ and k > 1 (i.e., d = d′^k): Assume that now Figure 7d depicts T, with d = 4, while Figure 7a presents T′ with d′ = 2. We can observe that T′ contains all nodes of T, as well as a set of extra internal nodes (i.e., d, e, f, g). Hence, T′ results from T by constructing some new internal nodes.
(2) elsewhere: This case is handled in the same way as case (3) of the degree increase.
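The case analysis above can be summarized programmatically; the following classifier is a sketch of ours (the case labels are illustrative, not the paper's terminology):

```python
import math

def degree_change_case(d, d_new):
    """Map a degree modification d -> d_new to the reconstruction
    cases of Section 3.3.1 (labels are ours, for illustration)."""
    if d_new > d:
        k = round(math.log(d_new, d))
        if k > 1 and d ** k == d_new:
            return "power"           # increase (1): drop levels, reuse all statistics
        if d_new % d == 0:
            return "multiple"        # increase (2): rebuild internals, reuse leaf statistics
        return "other-increase"      # increase (3): rebuild and recompute everything
    k = round(math.log(d, d_new))
    if k > 1 and d_new ** k == d:
        return "root"                # decrease (1): only add new internal nodes
    return "other-decrease"          # decrease (2): handled like increase (3)

print(degree_change_case(2, 4))      # power    (Figure 7a -> 7d)
print(degree_change_case(2, 6))      # multiple (Figure 7a -> 7c)
print(degree_change_case(4, 2))      # root     (Figure 7d -> 7a)
```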
3.3.2. The User Modifies the Number of Leaves
Regarding the modification of the number of leaves
parameter, we distinguish the following cases:
The user increases the number of leaves. In this case
we have that ℓ′ > ℓ; hence, each leaf of T is split into
several leaves in T ′ and the data objects contained in
a T leaf must be reallocated to the new leaves in T ′ .
As a result, all nodes (both leaves and internal nodes)
in T ′ have different contents compared to nodes in T
and must be constructed from scratch along with their
statistics.
In this case, constructing T′ requires O(|D| + (d²·ℓ′ − d)/(d − 1)) time (by avoiding the sorting phase).
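For a quick sense of these costs, the number of nodes of a d-ary HETree with a given number of leaves can be computed as follows (a helper of ours, for illustration; e.g., a tree with 27 leaves and d = 3 has 40 nodes):

```python
def tree_nodes(num_leaves, d):
    """Total nodes (leaves + internal) of a d-ary HETree, summing
    level sizes bottom-up with ceiling division at each level."""
    total, level = 0, num_leaves
    while level > 1:
        total += level
        level = -(-level // d)  # ceil(level / d): parents on the next level
    return total + 1            # plus the root

print(tree_nodes(27, 3))     # 40
print(tree_nodes(19683, 3))  # 29524
```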
The user decreases the number of leaves. In this case
we have that ℓ′ < ℓ; based on the ℓ′ value we have the
following three cases:
[Figure 7: five example trees over the same sorted data — (a) the initial tree T with ℓ=8, d=2; (b) ℓ=4, d=2; (c) ℓ=8, d=6; (d) ℓ=8, d=4; (e) ℓ=3, d=2. New nodes and edges are dashed, and the values reused from T in the mean computations are highlighted.]
Fig. 7. Adaptive HETree construction examples
(1) ℓ′ = ℓ/d^k, with k ∈ ℕ+: Consider that Figure 7a presents T with ℓ = 8 and d = 2. A reconstruction example of this case with k = 1 is presented in Figure 7b, where we have T′ with ℓ′ = 4. In Figure 7b, we observe that the leaves in T′ result from merging d^k leaves of T. For example, the leaf d of T′ results from merging the leaves h and i of T. Then, T′ results from T by replacing the T nodes with height k (i.e., d, e, f, g) with the T′ leaves. Finally, the nodes of T with height less than k are not included in T′.
Therefore, in this case, T′ is constructed by merging the leaves of T and removing the internal nodes of T having height less than or equal to k. Also, we do not recompute the statistics of the new leaves of T′, as these coincide with the statistics of the removed height-k nodes.
(2) ℓ′ = ℓ/k, with k ∈ ℕ+, k > 1 and k ≠ d^ν, where ν ∈ ℕ+: As in the previous case, the leaves in T′ are constructed by merging leaves from T, and their statistics are computed based on the statistics of the merged leaves. In this case, however, all internal nodes in T′ have to be constructed from scratch.
(3) ℓ′ = ℓ − k, with k ∈ ℕ+, k > 1 and ℓ′ ≠ ℓ/ν, where ν ∈ ℕ+: In the two previous cases, each leaf in T′ fully contains a whole number of leaves from T. In this case, a leaf in T′ may partially contain leaves from T. A leaf in T′ fully contains a leaf from T when the T′ leaf contains all data objects belonging to the T leaf. Otherwise, a leaf in T′ partially contains a leaf from T when the T′ leaf contains a subset of the data objects from the T leaf.
An example of this case is shown in Figure 7e, which depicts a reconstructed T′ resulting from the T presented in Figure 7a. The leaf d of T′ fully contains the leaves h and i of T, and partially contains the leaf k, whose value 35 belongs to a different T′ leaf (i.e., e).
Due to this partial containment, we have to construct all leaves and internal nodes from scratch and recalculate their statistics. Still, the statistics of the fully contained leaves of T can be reused, by aggregating them with the individual values of the data objects included in the partially contained leaves. For example, as we can see in Figure 7e, the mean value µd of the leaf d is computed by aggregating the mean values µh and µi of the fully contained leaves h and i with the individual values 30 and 32 of the partially contained leaf k.
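This mixed aggregation can be sketched as follows (our helper, for illustration), combining the (mean, count) pairs of the fully contained leaves with the reused individual values:

```python
def leaf_mean_partial(full_leaves, extra_values):
    """Mean of a new leaf that fully contains some old leaves (given
    as (mean, count) pairs) and partially contains others (given as
    the individual data values it takes over)."""
    s = sum(m * c for m, c in full_leaves) + sum(extra_values)
    n = sum(c for _, c in full_leaves) + len(extra_values)
    return s / n

# Figure 7e: mu_d = (mu_h*3 + mu_i*3 + 30 + 32) / 8
mu_h = (20 + 20 + 21) / 3
mu_i = (23 + 24 + 24) / 3
mu_d = leaf_mean_partial([(mu_h, 3), (mu_i, 3)], [30, 32])
print(round(mu_d, 2))  # 24.25
```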
4. The SynopsViz Tool
Based on the proposed hierarchical model, we have
developed a web-based prototype called SynopsViz13 .
The key features of SynopsViz are summarized as follows: (1) It supports the aforementioned hierarchical model for RDF data visualization, browsing and analysis. (2) It offers automatic on-the-fly hierarchy construction, as well as user-defined hierarchy construction based on user preferences. (3) It provides faceted browsing and filtering over classes and properties. (4) It integrates statistics with visualization; visualizations have been enriched with useful statistics and data information. (5) It offers several visualization techniques (e.g., timeline, chart, treemap). (6) It provides a large number of dataset statistics regarding the data level (e.g., number of sameAs triples), the schema level (e.g., most common classes/properties), and the structure level (e.g., entities with the largest in-degree). (7) It provides numerous metadata related to the dataset: licensing, provenance, linking, availability, undesirability, etc. The latter can be considered useful for assessing data quality [115].

13 synopsviz.imis.athena-innovation.gr

[Figure 8: system architecture — the Client UI (hierarchical visual exploration, user preferences, faceted search, statistics; chart, timeline, treemap and metadata views) interacts with the SynopsViz back-end (Data & Schema Handler, Facet Generator, Statistics Generator, Metadata Extractor, Hierarchy Specifier, Hierarchy Constructor, Statistics Processor, Hierarchical Model Module, Data Visualization Module), which consumes the input data (RDF/S – OWL).]
Fig. 8. System architecture
In the rest of this section, Section 4.1 describes the system architecture, Section 4.2 demonstrates the basic functionality of SynopsViz, and Section 4.3 provides technical information about the implementation.
4.1. System Architecture
The architecture of SynopsViz is presented in Figure 8. Our setting involves three main parts: the Client UI, SynopsViz, and the Input data. The Client part corresponds to the system’s front-end, offering several functionalities to the end-users, e.g., hierarchical visual exploration, faceted search, etc. (see Section 4.2 for more details). SynopsViz consumes RDF data as Input data; optionally, RDF/S–OWL vocabularies/ontologies describing the input data can be loaded. Next, we describe the basic components of SynopsViz.
In the preprocessing phase, the Data and Schema Handler parses the input data and infers schema information (e.g., property domain(s)/range(s), class/property hierarchies, types of instances, types of properties, etc.). The Facet Generator generates class and property facets over the input data. The Statistics Generator computes several statistics regarding the schema, the instances and the graph structure of the input dataset. The Metadata Extractor collects dataset metadata. Note that the model construction does not require any preprocessing; it is performed online, according to user interaction.
During runtime, the following components are involved. The Hierarchy Specifier is responsible for managing the configuration parameters of our hierarchy model, e.g., the number of hierarchy levels and the number of nodes per level, and for providing this information to the Hierarchy Constructor. The Hierarchy Constructor implements our tree structure. Based on the selected facets and the hierarchy configuration, it determines the hierarchy of groups and the contained triples. The Statistics Processor computes statistics about the groups included in the hierarchy. The Visualization Module allows the interaction between the user and the back-end, supporting several operations (e.g., navigation, filtering, hierarchy specification) over the visualized data. Finally, the Hierarchical Model Module maintains the in-memory tree structure for our model and communicates with the Hierarchy Constructor for the model construction, the Hierarchy Specifier for the model customization, the Statistics Processor for the statistics computations, and the Visualization Module for the visual representation of the model.
4.2. SynopsViz In-Use
In this section we outline the basic functionality of the SynopsViz prototype. Figure 9 presents the web user interface of the main window. The SynopsViz UI consists of the following main panels: the Facets panel presents and manages facets on classes and properties; the Input data control panel enables the user to import and manage input datasets; the Visualization panel is the main area where interactive charts and statistics are presented; and the Configuration panel handles visualization settings.
Initially, users are able to select a dataset from a number of offered real-world LD datasets (e.g., DBpedia, Eurostat) or to upload their own. Then, for the selected dataset, users are able to examine several of the dataset’s metadata and to explore several of its statistics.
Using the facets panel, users are able to navigate and filter data based on classes, numeric and date properties. In addition, the facets panel provides several pieces of information about the classes and properties (e.g., number of instances, domain(s), range(s), IRI, etc.).
Fig. 9. Web user interface

Users are able to visually explore data by considering properties’ values. Particularly, area charts and timeline-based area charts are used to visualize the resources considering the user’s selected properties.
Classes’ facets can also be used to filter the visualized data. Initially, the top level of the hierarchy is presented, providing an overview of the data organized into top-level groups; the user can interactively drill down (i.e., zoom in) and roll up (i.e., zoom out) over the groups of interest, down to the actual values of the input data (i.e., LD resources). At the same time, statistical information concerning the hierarchy groups, as well as their contents (e.g., mean value, variance, sample data, range), is presented through the UI (Figure 10a). At the most detailed level (i.e., LD resources), several visualization types are offered, i.e., area, column, line, spline and areaspline charts (Figure 10b).
In addition, users are able to visually explore data through the class hierarchy. By selecting one or more classes, users can interactively navigate over the class hierarchy using treemaps (Figure 10c) or pie charts (Figure 10d). Properties’ facets can also be used to filter the visualized data. In SynopsViz, the treemap visualization has been enriched with schema and statistical information. For each class, schema metadata (e.g., number of instances, subclasses, datatype/object properties) and statistical information (e.g., the cardinality of each property, the min and max values for datatype properties) are provided.
Finally, users can interactively modify the hierarchy specifications. Particularly, they are able to increase or decrease the level of abstraction/detail presented, by modifying both the number of hierarchy levels and the number of nodes per level.
A video presenting the basic functionality of our
prototype is available at youtu.be/n2ctdH5PKA0.
Also, a demonstration of SynopsViz tool is presented
in [19].
4.3. Implementation
SynopsViz is implemented on top of several open-source tools and libraries. The back-end of our system is developed in Java; the Jena framework is used for RDF data handling, and Jena TDB for disk-based RDF storage. The front-end prototype is developed using HTML and JavaScript. Regarding visualization libraries, we use Highcharts for the area, column, line, spline, areaspline and timeline-based charts, and Google Charts for the treemap and pie charts.
[Figure 10: (a) groups of numeric RDF data (area chart); (b) numeric RDF data (column chart); (c) class hierarchy (treemap chart); (d) class hierarchy (pie chart)]
Fig. 10. Numeric data & class hierarchy visualization examples
5. Experimental Analysis
In this section we present the evaluation of our approach. Section 5.1 presents the dataset and the experimental setting. Then, Section 5.2 presents the performance results, and Section 5.3 the user evaluation we performed.
5.1. Experimental Setting
In our evaluation, we use the well-known DBpedia 2014 LD dataset. Particularly, we use the Mapping-based Properties (cleaned) dataset14, which contains high-quality data extracted from Wikipedia infoboxes. This dataset contains 33.1M triples and includes a large number of numeric and temporal properties of varying sizes. The largest numeric property in this dataset has 534K triples, whereas the largest temporal property has 762K triples.
Regarding the methods used in our evaluation, we
consider our HETree hierarchical approaches, as well
as a simple non-hierarchical visualization approach,
14 downloads.dbpedia.org/2014/en/mappingbased_properties_cleaned_en.nt.bz2
referred to as FLAT. FLAT is considered a competitor of our hierarchical approaches. It provides single-level visualizations, rendering only the actual data objects; i.e., it is the same as the visualization provided by SynopsViz at the most detailed level. In more detail, the FLAT approach corresponds to a column chart in which the resources are sorted in ascending order based on their object values; the horizontal axis contains the resources’ names (i.e., the triples’ subjects), and the vertical axis corresponds to the object values. By hovering over a resource, a tooltip appears, including the resource’s name and object value.
Regarding the HETree approaches, the tree parameters (i.e., number of leaves, degree and height) are automatically computed following the approach described in Section 2.5. In our experiments, the lower and upper bounds for the objects rendered at the most detailed level have been set to λmin = 10 and λmax = 50, respectively. Considering the visualizations provided by the default Highcharts settings, these numbers are reasonable for our screen size and resolution.
Finally, our back-end system is hosted on a server with a quad-core CPU at 2GHz and 8GB of RAM, running Windows Server 2008. As a client, we used a laptop with an i5 CPU at 2.5GHz and 4GB of RAM, running Windows 7 and Firefox 38.0.1, over an ADSL2+ Internet connection. Additionally, in the user evaluation, the client was equipped with a 24" (1920×1200) screen.
5.2. Performance Evaluation
In this section, we study the performance of the proposed model, as well as the behaviour of our tool, in terms of construction time and response time, respectively.
Section 5.2.1 describes the setting of our performance
evaluation, and Section 5.2.2 presents the evaluation
results.
5.2.1. Setup
In order to study the performance, a number of numeric and temporal properties from the employed dataset are visualized using the two hierarchical approaches (i.e., HETree-C/R), as well as the FLAT approach. We select one set from each type of property; each set contains 15 properties of varying sizes, starting from small properties having 50–100 triples up to the largest properties.
In our experiment, for each of the three approaches,
we measure the tool response time. Additionally, for
the two hierarchical approaches we also measure the
time required for the HETree construction.
Note that in the hierarchical approaches, during user interaction, the server sends to the browser only the data required for rendering the current visualization level (although the whole tree is constructed at the back-end). Hence, when a user requests a visualization, we have the following workflow. Initially, our system constructs the tree. Then, the data regarding the top-level groups (i.e., the root node’s children) are sent to the browser, which renders the result. Afterwards, based on user interactions (i.e., drill-down, roll-up), the server retrieves the required data from the tree and sends it to the browser. Thus, the tree is constructed the first time a visualization is requested for the given input dataset; for any further user navigation over the hierarchy, the response time does not include the construction time. Therefore, in our experiments, for the hierarchical approaches, as response time we measure the time required by our tool to provide the first response (i.e., render the top-level groups), which corresponds to the slowest response in our visual exploration scenario. Thus, we consider the following measures in our experiments:
Construction Time: the time required to build the HETree structure. This time includes (1) the time for sorting the triples; (2) the time for building the tree; and (3) the time for the statistics computations.
Response Time: the time required to render the charts, starting from the time the client sends the request. This time includes (1) the time required by the server to compute and build the response; in the hierarchical approaches, this corresponds to the Construction Time plus the time required by the server to build the JSON object sent to the client, whereas in the FLAT approach, it corresponds to the time spent sorting the triples plus the time for the JSON construction; (2) the time spent in the client–server communication; and (3) the time required by the visualization library to render the charts on the browser.
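The per-level exchange described above can be sketched as follows; the payload fields are illustrative assumptions of ours, not SynopsViz's actual wire format:

```python
import json

def level_payload(node):
    """Serialize only the children of the requested node: the single
    visualization level sent to the browser, not the whole tree."""
    return json.dumps([
        {"id": c["id"], "range": c["range"], "mean": c["mean"]}
        for c in node["children"]
    ])

# Top-level groups (the root's children) rendered on the first response
root = {"id": "a", "children": [
    {"id": "b", "range": [20, 44], "mean": 28.2, "children": []},
    {"id": "c", "range": [46, 100], "mean": 62.8, "children": []},
]}
print(level_payload(root))
```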
5.2.2. Results
Table 4 presents the evaluation results regarding the numeric (upper half) and the temporal (lower half) properties. The properties are sorted in ascending order of their number of triples. For each property, the table contains the number of triples, the characteristics of the constructed HETree structure (i.e., number of leaves, degree, height, and number of nodes), as well as the construction and response times for each approach. The presented time measurements are average values over 50 executions.
Regarding the comparison between the HETree approaches and FLAT, the FLAT approach cannot provide results for properties having more than 305K triples; this is indicated with "—" in the FLAT response time of the last rows for both the numeric and the temporal properties. For the rest of the properties, we can observe that the HETree approaches clearly outperform FLAT in all cases, even for the smallest property (i.e., rankingWins, 50 triples). As the size of the properties increases, the difference between the HETree approaches and FLAT increases as well. In more detail, for large properties having more than 53K triples (i.e., the numeric properties larger than populationDensity (12th row), and the temporal properties larger than added (11th row)), the HETree approaches outperform FLAT by one order of magnitude.
Regarding the time required for the construction of the HETree structure, from Table 4 we can observe the following. The performance of the two HETree structures is very close for most of the examined properties, with HETree-R performing slightly better than HETree-C (especially for the relatively small numeric properties). Furthermore, we can observe that the response time follows a similar trend to the construction time. This is expected, since the communication
Table 4. Performance Results for Numeric & Temporal Properties
(Constr. = Construction Time, Resp. = Response Time; all times in msec)

| Property (#Triples) | #Leaves | Degree | Height | #Nodes | HETree-C Constr. | HETree-C Resp. | HETree-R Constr. | HETree-R Resp. | FLAT Resp. |
|---|---|---|---|---|---|---|---|---|---|
| *Numeric Properties* | | | | | | | | | |
| rankingWins (50) | 9 | 3 | 2 | 13 | 5 | 324 | 1 | 323 | 415 |
| distanceToBelfast (104) | 9 | 3 | 2 | 13 | 7 | 337 | 4 | 329 | 419 |
| waistSize (241) | 16 | 4 | 2 | 21 | 10 | 346 | 9 | 336 | 440 |
| fileSize (492) | 27 | 3 | 3 | 40 | 18 | 347 | 16 | 345 | 575 |
| hsvCoordinateValue (995) | 81 | 3 | 4 | 121 | 74 | 403 | 50 | 383 | 980 |
| lineLength (1,923) | 81 | 3 | 4 | 121 | 77 | 409 | 55 | 391 | 1,463 |
| powerOutput (5,453) | 243 | 3 | 5 | 364 | 234 | 560 | 217 | 540 | 2,583 |
| width (11,049) | 729 | 3 | 6 | 1,093 | 506 | 830 | 467 | 799 | 6,135 |
| numberOfPages (21,743) | 729 | 3 | 6 | 1,093 | 2,888 | 3,219 | 2,403 | 2,722 | 12,669 |
| inseeCode (36,780) | 2,187 | 3 | 7 | 3,280 | 4,632 | 4,962 | 4,105 | 4,436 | 19,119 |
| areaWater (40,564) | 2,187 | 3 | 7 | 3,280 | 4,945 | 5,134 | 5,274 | 5,457 | 29,538 |
| populationDensity (52,572) | 2,187 | 3 | 7 | 3,280 | 6,803 | 7,127 | 6,080 | 6,404 | 44,262 |
| areaTotal (140,408) | 6,561 | 3 | 8 | 9,841 | 16,158 | 16,482 | 13,298 | 13,627 | 219,018 |
| populationTotal (304,522) | 19,683 | 3 | 9 | 29,524 | 31,141 | 31,473 | 25,866 | 26,196 | 1,523,675 |
| lat (533,900) | 19,683 | 3 | 9 | 29,524 | 73,528 | 73,862 | 71,784 | 72,106 | — |
| *Temporal Properties* | | | | | | | | | |
| retired (155) | 9 | 3 | 2 | 13 | 8 | 330 | 4 | 327 | 425 |
| endDate (341) | 27 | 3 | 3 | 40 | 17 | 339 | 16 | 339 | 468 |
| lastAirDate (704) | 64 | 4 | 3 | 85 | 34 | 359 | 30 | 359 | 853 |
| buildingStartDate (1,415) | 81 | 3 | 4 | 121 | 73 | 406 | 53 | 384 | 1,103 |
| latestReleaseDate (2,925) | 243 | 3 | 5 | 364 | 162 | 496 | 146 | 480 | 1,804 |
| orderDate (3,788) | 243 | 3 | 5 | 364 | 210 | 542 | 195 | 523 | 2,011 |
| decommissioningDate (7,082) | 243 | 3 | 5 | 364 | 405 | 735 | 383 | 717 | 3,423 |
| shipLaunch (15,938) | 729 | 3 | 6 | 1,093 | 1,772 | 2,094 | 1,595 | 1,919 | 6,935 |
| completionDate (17,017) | 729 | 3 | 6 | 1,093 | 1,987 | 2,311 | 1,793 | 2,121 | 7,814 |
| foundingDate (19,694) | 729 | 3 | 6 | 1,093 | 2,745 | 3,069 | 2,583 | 2,905 | 8,699 |
| added (44,227) | 2,187 | 3 | 7 | 3,280 | 5,912 | 5,943 | 6,244 | 6,265 | 33,846 |
| activeYearsStartDate (98,160) | 6,561 | 3 | 8 | 9,841 | 10,368 | 10,702 | 8,952 | 9,282 | 107,587 |
| releaseDate (169,156) | 6,561 | 3 | 8 | 9,841 | 19,122 | 19,451 | 16,526 | 16,856 | 950,545 |
| deathDate (321,883) | 19,683 | 3 | 9 | 29,524 | 32,990 | 33,313 | 27,936 | 28,271 | — |
| birthDate (761,830) | 59,049 | 3 | 10 | 88,573 | 85,797 | 86,120 | 83,982 | 84,314 | — |
cost, as well as the times required for constructing and rendering the JSON object, are almost the same in all cases.
Regarding the comparison between the construction and the response time in the HETree approaches, from Table 4 we can observe the following. For properties having up to 5.5K triples (i.e., the numeric properties smaller than width (8th row), and the temporal properties smaller than decommissioningDate (7th row)), the response time is dominated by the communication cost and the time required for the JSON construction and rendering. For properties with only a small number of triples (i.e., waistSize, 241 triples), only 1.5% of the response time is spent on constructing the HETree. Moreover, for a property with a larger number of triples (i.e., buildingStartDate, 1,415 triples), 18% of the time is spent on constructing the HETree. Finally, for the largest property for which the time spent on the communication cost, the JSON construction and the rendering is larger than the construction time (i.e., powerOutput, 5,453 triples), 42% of the time is spent on constructing the HETree.
Figure 11 summarizes the results of Table 4, presenting the response time for all approaches w.r.t. the number of triples. Particularly, Figure 11a includes all property sizes (i.e., 50 to 762K triples). Further, in order to have a more precise view of the small property sizes, for which the difference between FLAT and the HETree approaches is smaller, we report the properties with fewer than 20K triples separately in Figure 11b. Once again, we observe that HETree-R performs slightly better than HETree-C. Additionally, from Figure 11b we can see that, for up to 10K triples, the performance of the two HETree approaches is almost the same. We can also observe the significant difference between FLAT and the HETree approaches.
[Figure 11: response time w.r.t. the number of triples for FLAT, HETree-C and HETree-R — (a) all properties (50 to 762K triples); (b) small properties (50 to 20K triples)]
Fig. 11. Response Time w.r.t. the number of triples
Although our method clearly outperforms the non-hierarchical method, as we can observe from the above results, the construction of the whole hierarchy cannot provide an efficient solution for datasets containing more than 10K objects. As discussed in Section 3.2, for efficient exploration over large datasets, an incremental hierarchy construction is required. In the incremental exploration scenario, the number of hierarchy nodes that have to be processed and constructed is significantly smaller compared to the non-incremental one.
For example, adopting a non-incremental construction for populationTotal (305K triples), 29.6K nodes have to be constructed upfront (along with their statistics). On the other hand, with the incremental approach (as analysed in Section 3.2), at the beginning of each exploration scenario only the initial nodes are constructed. The initial nodes are the nodes initially presented, as well as the nodes potentially reached by the user’s first operation.
In the RES scenario, the initial nodes are the leaf of interest (1 node) and its sibling leaves (at most d − 1 nodes). In the RAN scenario, the initial nodes are the nodes of interest (at most d nodes), their children (at most d² nodes), and their parent node along with its siblings (at most d nodes). Finally, in the BSC scenario, the initial nodes are the root node (1 node) and its children (at most d nodes). Overall, at most d, 2d + d², and d + 1 nodes are initially constructed in the RES, RAN, and BSC scenarios, respectively. Therefore, in the populationTotal case, where d = 3, at most 3, 15 and 4 nodes are initially constructed in the RES, RAN, and BSC scenarios, respectively.
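These bounds can be written down directly; the helper below is our own sketch of the counting argument:

```python
def initial_nodes_bound(d, scenario):
    """Upper bound on the nodes constructed at the start of each
    incremental exploration scenario (Section 3.2)."""
    bounds = {
        "RES": d,             # leaf of interest + at most d-1 siblings
        "RAN": 2 * d + d**2,  # nodes of interest + children + parent and its siblings
        "BSC": d + 1,         # root + at most d children
    }
    return bounds[scenario]

# populationTotal (d = 3): 3, 15 and 4 nodes, versus 29.6K nodes
# for the non-incremental construction
print([initial_nodes_bound(3, s) for s in ("RES", "RAN", "BSC")])  # [3, 15, 4]
```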
5.3. User Study
In this section we present the user evaluation of our tool, in which we have employed three approaches: the two hierarchical ones and FLAT. Section 5.3.1 describes the user tasks, Section 5.3.2 outlines the evaluation procedure and setup, Section 5.3.3 summarizes the evaluation results, and Section 5.3.4 discusses issues related to the evaluation process.
5.3.1. Tasks
In this section we describe the different types of
tasks that are used in the user evaluation process.
Type 1 [Find resources with a specific value]: This type of task requests the resources having a value v (as object). For this task type, we define task T1 by selecting a value v that corresponds to 5 resources. Given this task, the participants are asked to provide the number of resources that pertain to this value. In order to solve this task, the participants first have to find a resource with value v and then check which of the nearby resources also have the same value.
Type 2 [Find resources in a range of values]: This type of task requests the resources having values greater than vmin and less than vmax. We define two tasks of this type by selecting different combinations of vmin and vmax values, such that tasks considering different numbers of resources are defined. Particularly, in the first task, named T2.1, we specify the values vmin and vmax such that a relatively small set of (approximately 10) resources is included, whereas the second task, T2.2, considers a relatively larger set of (approximately 50) resources. Given these tasks, the participants are asked to provide the number of resources included in the given range. These tasks can be solved by first finding a resource with a value included in the given range, and then exploring the nearby resources in order to identify the resources in the given range.
Type 3 [Compare distributions]: This type of task asks the participants to identify whether more resources appear above or below a given value v. For this type, we define task T3 by selecting a value v near the median. Given this task, the participants are asked to provide the number of resources appearing either above or below the value v. Answering this task requires the participants to locate the value v and determine the number of resources appearing either before or after it.
5.3.2. Setup
In order to study the effect of the property size on the selected tasks, we selected two properties of different sizes from the employed dataset (Section 5.1). The hsvCoordinateHue numeric property, containing 970 triples, is referred to as Small, and the maximumElevation numeric property, containing 37,936 triples, is referred to as Large. The former corresponds to a hierarchy of height 4 and degree 3, and the latter to a hierarchy of height 7 and degree 3. We should note here that, throughout the user evaluation, the hierarchy parameters were fixed for all the tasks, and the participants were not allowed to modify them, so that the setting was the same for everyone.
In our evaluation, 10 participants took part; they were computer science graduate students and researchers. At the beginning of the evaluation, each participant was introduced to the system by an instructor, who provided a brief tutorial of the features required for the tasks. After the instructions, the participants familiarized themselves with the system. Note that we have integrated the FLAT approach into SynopsViz, along with the HETree approaches.
During the evaluation, each participant performed the previously described four tasks using all approaches (i.e., HETree-C/R and FLAT), over both the small and the large properties. In order to reduce learning effects and fatigue, we defined three groups: in the first group, the participants started their tasks with the HETree-C approach, in the second with HETree-R, and in the third with FLAT. Finally, the property (i.e., small, large) first used in each task was counterbalanced among the participants and the tasks. The entire evaluation did not exceed 75 minutes.
Furthermore, for each task (e.g., T2.1, T3), three task instances were specified by slightly modifying the task parameters. As a result, given a task, a participant had to solve a different instance of this task with each approach. For example, in task T2.1, for HETree-R the selected v corresponded to a solution of 11 resources; in HETree-C, to 9 resources; whereas for FLAT, v corresponded to a solution of 8 resources. The task instance assigned to each approach varied among the participants.
During the evaluation, the instructor measured the time required for each participant to complete a task, as well as the number of incorrect answers. Table 5 presents the average time required for the participants to complete each task. The table contains the measurements for all approaches and for both properties. Although we acknowledge that the number of participants in our evaluation is small, we have computed the statistical significance of the results. For each property, the p-value of each task is presented in the last column; the p-value is computed using one-way repeated measures ANOVA.
In addition, the results regarding the number of tasks that were not correctly answered are presented in Table 6. In particular, the table presents the percentage of incorrect answers for each task and property, referred to as the error rate. Additionally, for each task and property, the table includes the p-value; here, the p-value has been computed using Fisher's exact test.
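To make the significance computation concrete, a two-sided Fisher's exact test for a 2×2 table can be computed directly from the hypergeometric distribution. The sketch below is our own illustration, not the paper's code; the function name is ours, and the example pairs FLAT against a single HETree approach (the evaluation itself compares three approaches), using the 10 participants and the task T3 error rates reported below.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def p_table(x):
        # probability of the table whose top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)

# Task T3, Small property: 7 of 10 participants answered incorrectly with
# FLAT versus 0 of 10 with an HETree approach (hypothetical pairwise
# comparison for illustration).
print(fisher_exact_2x2(7, 3, 0, 10))  # well below the 0.01 level
```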
5.3.3. Results
Task T1. Regarding the first task, as we can observe from Table 5, the HETree approaches outperform FLAT for both property sizes. Note that the time results on T1 are statistically significant (p < 0.01).
As expected, all approaches require more time for the Large property compared to the Small one. In FLAT, this overhead is caused by the larger number of resources that the participants have to scroll over and examine until they locate the requested resource's value. In HETree, on the other hand, the overhead is caused by the larger number of levels in the Large property hierarchy: the participants have to perform more drill-down operations and examine more groups of objects until they reach the LD resources.
We can also observe that, in this task, HETree-R performs slightly better than HETree-C for both property sizes. This is due to the fact that, in the HETree-R structure, resources having the same value are always contained in the same leaf; as a result, the participants had to inspect only one leaf. In HETree-C, on the other hand, this does not always hold, hence the participants could have explored more than one leaf.

Table 5. Average Task Completion Time (sec)

            Small Property               Large Property
       FLAT  HETree-C  HETree-R  p    FLAT  HETree-C  HETree-R  p
T1       54        29        28  **     85        52        47  **
T2.1     63        57        64  ◆      74        60        69  *
T2.2    120        69        74  **    128        72        77  **
T3      262        41        40  **      –        64        62  –

** (p < 0.01)   * (p < 0.05)   ◆ (p > 0.05)
Finally, as we can observe from Table 6, in all cases only correct answers were provided. However, none of these results are statistically significant (p > 0.05).
Task T2.1. In the next task, where the participants had to identify a small set of resources within a range of values, the FLAT performance is very close to that of HETree, especially for the Small property (Table 5). In addition, we can observe that the HETree-C approach performs slightly better than HETree-R. Finally, regarding the statistical significance of the results, for the Small property we have p > 0.05, while for the Large one we have p < 0.05.
The poor performance of the HETree approaches in this task can be explained by the small set of requested resources and the HETree parameters adopted in the user evaluation. In this setting, the resources contained in the task solution are distributed over more than one leaf. Hence, the participants had to perform several roll-up and drill-down operations in order to find all the resources. On the other hand, in FLAT, once the participants had identified one of the requested resources, it was very easy for them to find the rest of the solution's resources. To sum up, in FLAT most of the time is spent identifying the first of the resources, while in HETree the first resource is identified very quickly. Regarding the difference in performance between the HETree approaches, we note the following: in HETree-C, due to the fixed number of objects in each leaf, the participants had to visit at most one or two leaves in order to solve this task; in HETree-R, on the other hand, the number of objects in each leaf varies, so in most cases the participants had to inspect more than two leaves. Finally, in this case too, only correct answers were given (Table 6).
Task T2.2. In this task the participants had to identify a larger set of resources (compared to the previous task) given a range of values. The HETree approaches noticeably outperform the FLAT approach, with statistical significance (p < 0.01), and similar results are observed for both properties.

Table 6. Error Rate (%)

            Small Property               Large Property
       FLAT  HETree-C  HETree-R  p    FLAT  HETree-C  HETree-R  p
T1        0         0         0  ◆       0         0         0  ◆
T2.1      0         0         0  ◆       0         0         0  ◆
T2.2     20         0         0  ◆      20         0        10  ◆
T3       70         0         0  **      –         0         0  –

** (p < 0.01)   * (p < 0.05)   ◆ (p > 0.05)
In the FLAT approach, considerable time was spent identifying and navigating over a large number of resources. On the other hand, due to the large number of resources involved in the task's solution, there are groups in the hierarchy that contain solution resources exclusively (i.e., they do not contain resources outside the solution). As a result, in HETree the participants could easily identify and compute the whole solution by combining the information related to the groups (i.e., the number of enclosed resources) and to individual resources. For the same reasons stated in the previous task (i.e., T2.1), in T2.2 HETree-C again performs slightly better than HETree-R. Finally, we can observe from Table 6 (though without statistical significance) that it was more difficult for the participants to solve this task correctly with FLAT than with HETree.
Task T3. In the last task, the participants were asked to find which of two ranges contained more resources. As expected, Table 5 shows that the HETree approaches clearly outperform the FLAT approach, with statistical significance for the Small property. This is due to the fact that the participants in FLAT had to overview and navigate over almost half of the dataset. As a result, apart from the long time required for this process, it was also very difficult to find the correct solution; this is also verified by Table 6 at a statistically significant level. On the other hand, in the HETree approaches, the participants could easily find the answer by considering the resources enclosed by several groups.
Regarding the Large property, as expected, it was impossible for the participants to solve this task with FLAT, since this required parsing over and counting about 19K resources. As a result, none of the participants completed this task using FLAT (indicated with "–" in Table 5), given the 5-minute time limit used in this task.
5.3.4. Discussion
The user evaluation showed that the hierarchical approaches can be efficient (i.e., require little time to solve tasks) and effective (i.e., have a lower error rate) in several cases. In more detail, the HETree approaches performed very well at locating specific values in a dataset and, given an appropriate parameter setting, are marginally affected by the dataset size. Also note that, due to the "vertical-based" exploration, the position (e.g., towards the end) of the requested value in the dataset does not affect the efficiency of the approach. Furthermore, it was shown that the hierarchical approaches can efficiently and effectively handle visual exploration tasks that involve large numbers of objects.
At the end of the evaluation, the participants gave us valuable feedback on possible improvements to our tool. Most of the participants criticized several aspects of the interface, since our tool is an early prototype. Also, several participants mentioned difficulties in keeping track of their "position" (e.g., the currently visualized range of values, or the previously visualized one) during the exploration. Finally, some participants mentioned that some hierarchies contained more levels than needed. As previously mentioned, the adopted parameters are not well suited for the evaluation, since hierarchies with a degree larger than 3 (and, as a result, fewer levels) are required.
Finally, additional tasks demonstrating the capabilities of our model can be considered. However, most of these tasks were not selected for this evaluation, because it was not possible for the participants to perform them with the FLAT approach. An indicative set includes: (1) find the number of resources (and/or statistics) in the 1st and 3rd quartiles; (2) find statistics (e.g., mean value, variance) for the top-10 or top-50 resources; (3) find the decade (i.e., for temporal data) in which most events take place.
6. Related Work
This section reviews works related to our approach to visualization and exploration in the Web of Data (WoD). Section 6.1 presents systems and techniques for WoD visualization and exploration, Section 6.2 discusses techniques for WoD statistical analysis, Section 6.3 presents hierarchical data visualization techniques, and finally, Section 6.4 discusses works on data structures and processing related to our HETree data structure.
In Table 7 we provide an overview of, and compare, several visualization systems that offer features similar to our SynopsViz. The WoD column indicates systems that target the Semantic Web and Linked Data area (i.e., RDF, RDF/S, OWL). The Hierarchical column indicates systems that provide hierarchical visualization of non-hierarchical data. The Statistics column captures the provision of statistics about the visualized data. The Recomm. column indicates systems which offer recommendation mechanisms for visualization settings (e.g., appropriate visualization type, visualization parameters, etc.). The Incr. column indicates systems that provide incremental visualizations. Finally, the Preferences column captures the ability of the users to apply data (e.g., aggregate) or visual (e.g., increase abstraction) operations.
6.1. Exploration & Visualization in the Web of Data
A large number of works studying issues related to WoD visual exploration and analysis have been proposed in the literature [30,18,80,3]. In what follows, we classify these works into the following categories: (1) generic visualization systems; (2) domain-, vocabulary- & device-specific visualization systems; and (3) graph-based visualization systems.
6.1.1. Generic Visualization Systems
In the context of WoD visual exploration, there is a large number of generic visualization frameworks that offer a wide range of visualization types and operations. Next, we outline the best-known systems in this category.
Rhizomer [21] provides WoD exploration based on an overview, zoom and filter workflow. Rhizomer offers various types of visualizations, such as maps, timelines, treemaps and charts. VizBoard [109,110] is an information visualization workbench for WoD built on top of a mashup platform. VizBoard presents datasets in a dashboard-like, composite, and interactive visualization. Additionally, the system provides visualization recommendations. Payola [68] is a generic framework for WoD visualization and analysis. The framework offers a variety of domain-specific (e.g., public procurement) analysis plugins (i.e., analyzers), as well as several visualization techniques (e.g., graphs, tables, etc.). In addition, Payola offers collaborative features
Table 7. Visualization Systems Overview

System                Data Types*     Vis. Types**             Domain    App. Type
Rhizomer [21]         N, T, S, H, G   C, M, T, TL              generic   Web
Payola [68]           N, T, S, H, G   C, CI, G, M, T, TL, TR   generic   Web
LDVM [20]             S, H, G         B, M, T, TR              generic   Web
Vis Wizard [105]      N, T, S         B, C, M, P, PC, SG       generic   Web
LDVizWiz [6]          S, H, G         M, P, TR                 generic   Web
LinkDaViz [103]       N, T, S         B, C, S, M, P            generic   Web
VizBoard [109]        N, H            C, S, T                  generic   Web
SemLens [51]          N               S                        generic   Web
LODeX [15]            G               G, M, P                  generic   Web
LODWheel [99]         N, S, G         C, G, M, P               generic   Web
RelFinder [50]        G               G                        generic   Web
Fenfire [48]          G               G                        generic   Desktop
Lodlive [22]          G               G                        generic   Web
IsaViz [87]           G               G                        generic   Desktop
graphVizdb [17,16]    G               G                        generic   Web
ViCoMap [89]          N, T, S         M                        generic   Web
EDT [79]              N, T, H         C, CM, T, SP             OLAP      Desktop
Polaris [98]          N, T, S, H      C, M, S                  OLAP      Desktop
XmdvTool [112]        N               DS, PC, S, ST            generic   Desktop
GrouseFlocks [5]      G               G                        generic   Desktop
GMine [57]            G               G                        generic   Desktop
Gephi [11]            G               G                        generic   Desktop
CGV [104]             G               G                        generic   Desktop
SynopsViz             N, T, H         C, P, T, TL$             generic   Web

* N: Numeric, T: Temporal, S: Spatial, H: Hierarchical (tree), G: Graph (network)
** B: bubble chart, C: chart, CI: circles, CM: colormap, DS: dimensional stacking, G: graph, M: map, P: pie, PC: parallel coordinates, S: scatter, SG: streamgraph, SP: solarplot, ST: star glyphs, T: treemap, TL: timeline, TR: tree
$ The HETree model is not restricted to these visualization types.
for users to create and share analyzers. In Payola, the visualizations can be customized according to the ontologies used in the resulting data.
The Linked Data Visualization Model (LDVM) [20] provides an abstract visualization process for WoD datasets. LDVM enables the connection of different datasets with various kinds of visualizations in a dynamic way. The visualization process follows a four-stage workflow: Source data, Analytical abstraction, Visualization abstraction, and View. A prototype based on LDVM considers several visualization techniques, e.g., circle, sunburst, treemap, etc. Finally, LDVM has been adopted in several use cases [69]. Vis Wizard [105] is a Web-based visualization system which exploits data semantics to simplify the process of setting up visualizations. Vis Wizard is able to analyse multiple datasets using brushing and linking methods. Similarly, the Linked Data Visualization Wizard (LDVizWiz) [6] provides a semi-automatic way to produce possible visualizations for WoD datasets. In the same context, LinkDaViz [103] finds suitable visualizations for a given part of a dataset. The framework uses heuristic data analysis and a visualization model in order to facilitate automatic binding between data and visualization options.
Balloon Synopsis [91] provides a WoD visualizer based on HTML and JavaScript. It adopts a node-centric visualization approach in a tile design. Additionally, it supports automatic information enhancement of the local RDF data by either accessing remote SPARQL endpoints or performing federated queries over endpoints using the Balloon Fusion service. Balloon Synopsis offers customizable filters, namely ontology templates, for the users to handle and transform (e.g., filter, merge) input data. SemLens [51] is a visual system that combines scatter plots and semantic lenses, offering visual discovery of correlations and patterns in data. Objects are arranged in a scatter plot and analysed using user-defined semantic lenses. LODeX [15] is a system that generates a representative summary of a WoD source. The system takes as input a SPARQL endpoint and generates a visual (graph-based) summary of the WoD source, accompanied by statistical and structural information about the source. LODWheel [99] is a Web-based visualization system which combines JavaScript libraries (e.g., MooWheel, JQPlot) in order to visualize RDF data in charts and graphs. Hide the stack [31] proposes an approach for visualizing WoD for mainstream end-users: the underlying Semantic Web technologies (e.g., RDF, SPARQL) are utilized but remain "hidden" from the end-users. In particular, a template-based visualization approach is adopted, where the information for each resource is presented based on its rdf:type.
6.1.2. Domain, Vocabulary & Device-specific
Visualization Systems
In this section, we present systems that target visualization needs for specific types of data and domains, RDF vocabularies, or devices.
Several systems focus on visualizing and exploring geo-spatial data. Map4rdf [74] is a faceted browsing tool that enables RDF datasets to be visualized on an OSM or Google Map. Facete [97] is an exploration and visualization system for SPARQL-accessible data, offering faceted filtering functionalities. SexTant [82] and Spacetime [107] focus on visualizing and exploring time-evolving geo-spatial data. The LinkedGeoData Browser [96] is a faceted browser and editor developed in the context of the LinkedGeoData project. Finally, in the same context, DBpedia Atlas [106] offers exploration over the DBpedia dataset by exploiting the dataset's spatial data. Furthermore, in the context of linked university data, the VISUalization Playground (VISU) [4] is an interactive tool for specifying and creating visualizations using the contents of the linked university data cloud. In particular, VISU offers a novel SPARQL interface for creating data visualizations; query results from selected SPARQL endpoints are visualized with Google Charts.
A variety of systems target multidimensional WoD modelled with the Data Cube vocabulary. CubeViz [37,90] is a faceted browser for exploring statistical data. The system provides data visualizations using different types of charts (i.e., line, bar, column, area and pie). The Payola Data Cube Vocabulary [52] adopts the LDVM stages [20] in order to visualize RDF data described by the Data Cube vocabulary; the same types of charts as in CubeViz are provided in this system. The OpenCube Toolkit [60] offers several systems related to statistical WoD. For example, the OpenCube Browser explores RDF data cubes by presenting a two-dimensional table. Additionally, the OpenCube Map View offers interactive map-based visualizations of RDF data cubes based on their geo-spatial dimension. The Linked Data Cubes Explorer (LDCE) [64] allows users to explore and analyse statistical datasets. Finally, [85] offers several map and chart visualizations of demographic, social and statistical linked cube data.15
Regarding device-specific systems, DBpedia Mobile [14] is a location-aware mobile application for exploring and visualizing DBpedia resources. Who's Who [23] is an application for exploring and visualizing information which focuses on several issues that appear in the mobile environment. For example, the application considers the usability and data-processing challenges related to the small display size and limited resources of mobile devices.
6.1.3. Graph-based Visualization Systems
A large number of systems visualize WoD datasets adopting a graph-based (a.k.a. node-link) approach. RelFinder [50] is a Web-based tool that offers interactive discovery and visualization of relationships (i.e., connections) between selected WoD resources. Fenfire [48] and Lodlive [22] are exploratory systems that allow users to browse WoD using interactive graphs: starting from a given URI, the user can explore WoD by following the links. IsaViz [87] allows users to zoom and navigate over the RDF graph, and also offers several "edit" operations (e.g., delete/add/rename nodes and edges). In the same context, graphVizdb [17,16] is built on top of spatial and database techniques, offering interactive visualization over very large (RDF) graphs. A different approach is adopted in [100], where sampling techniques are exploited. Finally, ZoomRDF [116] employs a space-optimized visualization algorithm in order to increase the number of resources that are displayed.
6.1.4. Discussion
In contrast to the aforementioned approaches, our work does not focus solely on proposing techniques for WoD visualization. Instead, we introduce a generic model for organizing, exploring and analysing numeric and temporal data in a multilevel fashion. The underlying model is not bound to any specific type of visualization (e.g., chart); rather, it can be adopted by several "flat" techniques and offer multilevel visualizations over non-hierarchical data. Also, we present a prototype system that employs the introduced hierarchical model and offers efficient multilevel visual exploration over WoD datasets, using charts and timelines.
15 www.linked-statistics.gr
6.2. Statistical Analysis in the Web of Data
A second area related to the analysis features of the proposed model deals with WoD statistical analysis. RDFStats [72] calculates statistical information about RDF datasets. LODStats [9] is an extensible framework offering scalable statistical analysis of WoD datasets. The RapidMiner LOD Extension [88,84] is an extension of the data mining platform RapidMiner16, offering sophisticated data analysis operations over WoD. SparqlR17 is a package for the R18 statistical analysis platform; SparqlR executes SPARQL queries over SPARQL endpoints and provides statistical analysis and visualization over the SPARQL results. Finally, ViCoMap [89] combines WoD statistical analysis and visualization in a Web-based tool which offers correlation analysis and data visualization on maps.
6.2.1. Discussion
In comparison with these systems, our work does not focus on new techniques for WoD statistics computation and analysis. We are primarily interested in enhancing the visualization and user exploration functionality by providing statistical properties of the visualized datasets and objects, making use of existing computation techniques. Also, we demonstrate how, in the proposed structure, computations can be efficiently performed on-the-fly and enrich our hierarchical model. The presence of statistics provides quantifiable overviews of the underlying WoD resources at each exploration step. This is particularly important in tasks that involve exploring a large number of numeric or temporal data objects: users can examine the characteristics of the next levels at a glance, and in this way are not forced to drill down into lower hierarchy levels. Finally, the statistics over the different hierarchy levels enable analysis at different granularity levels.
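The exact statistics kept per node are not spelled out at this point, but the bottom-up flavour of such computations can be sketched as follows (a minimal illustration with hypothetical field names; each parent's aggregates are derived from its children's, so a single pass over the hierarchy suffices):

```python
def leaf_stats(values):
    """Statistics computed directly from the values enclosed by a leaf."""
    return {"count": len(values), "sum": sum(values),
            "mean": sum(values) / len(values),
            "min": min(values), "max": max(values)}

def node_stats(children):
    """Combine child statistics into the parent's statistics,
    without touching the raw values again."""
    count = sum(c["count"] for c in children)
    total = sum(c["sum"] for c in children)
    return {"count": count, "sum": total, "mean": total / count,
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}

leaves = [leaf_stats([20, 30]), leaf_stats([35, 35, 37])]
print(node_stats(leaves))  # aggregates for the parent of the two leaves
```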
6.3. Hierarchical Visual Exploration
The wider area of data and information visualization
has provided a variety of approaches for hierarchical
analysis and presentation.
Treemaps [93] visualize tree structures using a
space-filling layout algorithm based on recursive subdivision of space. Rectangles are used to represent
16 rapidminer.com
17 cran.r-project.org/web/packages/SPARQL/index.html
18 www.r-project.org
tree nodes; the size of each node is proportional to the cumulative size of its descendant nodes. Finally, a large number of treemap variations have been proposed (e.g., Cushion Treemaps, Squarified Treemaps, Ordered Treemaps, etc.).
Moreover, hierarchical visualization techniques have been extensively employed to visualize very large graphs using the node-link paradigm. In these techniques, the graph is recursively decomposed into smaller sub-graphs that form a hierarchy of abstraction layers. In most cases, the hierarchy is constructed by exploiting clustering and partitioning methods [1,7,11,57,104,75]. In other works, the hierarchy is defined with hub-based [76] and density-based [117] techniques. GrouseFlocks [5] supports ad-hoc hierarchies which are manually defined by the users. Finally, there are also edge bundling techniques, which join graph edges into bundles; the edges are often aggregated based on clustering techniques [41,38,86], a mesh [71,28], or explicitly by a hierarchy [53].
In the context of data warehousing and online analytical processing (OLAP), several approaches provide hierarchical visual exploration by exploiting the predefined hierarchies in the dimension space. [79] proposes a class of OLAP-aware hierarchical visual layouts; similarly, [102] uses OLAP-based hierarchical stacked bars. Polaris [98] offers visual exploratory analysis of data warehouses with rich hierarchical structure.
Several hierarchical techniques have been proposed in the context of ontology visualization and exploration [40,34,46,73]. CropCircles [111] adopts a hierarchical geometric containment approach, representing the class hierarchy as a set of concentric circles. Knoocks [70] combines containment-based and node-link approaches: ontologies are visualized as nested blocks, where each block is depicted as a rectangle containing a sub-branch shown as a treemap. A different approach is followed by OntoTrix [10], which combines graphs with adjacency matrices.
Finally, in the context of hierarchical navigation, [65] organizes query results using the MeSH concept hierarchy. In [24], a hierarchical structure is dynamically constructed to categorize numeric and categorical query results. Similarly, [26] constructs personalized hierarchies by considering diverse user preferences.
6.3.1. Discussion
In contrast to the above approaches, which target graph-based or hierarchically-organized data, our work focuses on handling arbitrary numeric and temporal data, without requiring it to be described by a hierarchical schema. As an example of hierarchically-organized data, consider class hierarchies, or multidimensional data organized in multilevel hierarchical dimensions (e.g., in an OLAP context, temporal data is hierarchically organized based on years, months, etc.). In contrast to the aforementioned approaches, our work dynamically constructs the hierarchies from raw numeric and temporal data. Thus, the proposed model can be combined with "flat" visualization techniques (e.g., charts, timelines) in order to provide multilevel visualizations over non-hierarchical data. In that sense, our approach can be considered more flexible compared to techniques that rely on predefined hierarchies, as it can enable exploratory functionality on dynamically retrieved datasets by (incrementally) constructing hierarchies on-the-fly and allowing users to modify these hierarchies.
6.4. Data Structures & Data Processing
In this section we present the data structures and the data (pre-)processing techniques that are most relevant to our approach.
The R-Tree [45] is a disk-based multi-dimensional indexing structure which has been widely used to efficiently handle spatial queries. The R-Tree adopts the notion of minimum bounding rectangles (MBRs) in order to hierarchically organize multi-dimensional objects.
Data discretization [42,33] is a process in which continuous attributes are transformed into discrete ones. A large number of methods (e.g., supervised, unsupervised, univariate, multivariate) for data discretization have been proposed. Binning is a simple unsupervised discretization method in which a predefined number of bins is created. Widely known binning methods are equal-width and equal-frequency binning. In the equal-width approach, the range of an attribute is divided into intervals of equal width, and each interval represents a bin. In the equal-frequency approach, an equal number of values is placed in each bin.
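As a rough illustration of the two binning schemes (a sketch; the function names are ours, and the age values come from the paper's running example):

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width; bin sizes vary."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # the maximum value is clamped into the last bin
        i = min(int((v - lo) / width), k - 1)
        bins[i].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Place (approximately) the same number of sorted values in each bin;
    interval widths vary."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // k:(i + 1) * n // k] for i in range(k)]

ages = [20, 30, 35, 35, 37, 45, 50, 55, 80, 100]
print(equal_width_bins(ages, 4))       # equal value ranges, varying bin sizes
print(equal_frequency_bins(ages, 4))   # equal bin sizes, varying value ranges
```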
By recursively applying discretization techniques, a hierarchical discretization of an attribute's values can be produced (a.k.a. concept/generalization hierarchies). In [92], a dynamic programming algorithm for generating numeric concept hierarchies is proposed. The algorithm attempts to maximize both the similarity between the objects stored in the same hierarchy node, and the dissimilarity between the objects stored in different nodes. The generated hierarchy is a balanced tree in which different nodes may have different numbers of children. Similarly, [47] constructs hierarchies based on the data distribution. Essentially, both the leaf and the internal nodes are created in such a way that an even distribution is achieved. The hierarchy construction also considers a threshold specifying the maximum number of distinct values enclosed by nodes at each hierarchy level. Finally, binary concept hierarchies (with degree equal to two) are generated in [27]. Starting from the whole dataset, this method performs a recursive binary partitioning over the dataset's values; the recursion terminates when the number of distinct values in the resulting partitions is less than a pre-specified threshold.
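A minimal sketch of such a recursive binary partitioning, in the spirit of [27] (the median-of-distinct-values split point is our own simplification; the actual split criterion in [27] may differ):

```python
def binary_hierarchy(values, threshold):
    """Recursively split a set of values in two, stopping when a partition
    holds fewer distinct values than `threshold`."""
    node = {"range": (min(values), max(values)), "values": sorted(values)}
    distinct = sorted(set(values))
    if len(distinct) < threshold:
        return node  # leaf: few enough distinct values
    mid = distinct[len(distinct) // 2]  # split point (our simplification)
    left = [v for v in values if v < mid]
    right = [v for v in values if v >= mid]
    node["children"] = [binary_hierarchy(left, threshold),
                        binary_hierarchy(right, threshold)]
    return node

ages = [20, 30, 35, 35, 37, 45, 50, 55, 80, 100]
root = binary_hierarchy(ages, 3)
print(root["range"], [c["range"] for c in root["children"]])
```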
Using the data objects from our running example (Figure 1), Figure 12 shows the hierarchies generated by the aforementioned approaches. Figure 12(a) presents the hierarchy resulting from [27], and Figure 12(b) depicts the result using the method from [47]. The parameters of each method are set so that the resulting hierarchies are as similar as possible to our hierarchies (Figures 2 & 3). Hence, the threshold in (a) is set to 3, and in (b) to 2.
6.4.1. Discussion
The basic concepts of the HETree structure can be considered similar to a simplified version of a static 1D R-Tree. However, in order to provide efficient query processing in a disk-based environment, the R-Tree considers a large number of I/O-related issues (e.g., space coverage, node overlaps, fill guarantees, etc.). On the other hand, we introduce a lightweight, main-memory structure that is efficiently constructed on-the-fly. Also, the proposed structure aims at organizing the data in a practical manner for a (visual) exploration scenario, rather than for disk-based indexing and querying efficiency.
Fig. 12. Hierarchies generated from different approaches: (a) based on [27]; (b) based on [47].

Compared to discretization techniques, our tree model exhibits several similarities; namely, the HETree-C version can be considered a hierarchical version of equal-frequency binning, and the HETree-R version a hierarchical version of equal-width binning. However, the goal of data organization in HETree is to enable visualization and hierarchical exploration capabilities over dynamically retrieved non-hierarchical data. Hence, compared to the binning methods, we can consider the following basic differences. First, in contrast with binning methods, which require the user to specify some parameters (e.g., the number/size of the bins, the number of distinct values in each bin, etc.), our approach is able to automatically estimate the hierarchy parameters and adjust the visualization results by considering the visualization environment characteristics. Second, in hierarchical approaches the user is not always
allowed to specify the hierarchy characteristics (e.g.,
degree). For example, the hierarchies in [27] have always degree equal to two (Figure 12(a)), while in [47]
the nodes have varying degrees (Figure 12(b)). On
the other hand, in our approach the hierarchy characteristics can be specified precisely. In addition, when
not specific hierarchy characteristics are requested, our
approach generates perfect trees (Section 2.5), offering a "uniform" hierarchy structure. Third, the computational complexity in some of the hierarchical approaches (e.g., [92]) is prohibitive (i.e., at least cubic) for using them in practise; especially in settings
where the hierarchies have to constructed on-the-fly.
Fourth, the proposed tree structure is exploited in order to allow efficient statistics computations over different groups of data; then, the statistics are used in
order to enhance the overall exploration functionality. Finally, the construction of the model is tailored
to the user interaction and preferences; our model offers incremental construction considering the user interaction, as well as efficiently adaptation to the users
preferences.
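The relation between the two HETree variants and classic binning can be illustrated with a minimal sketch (the function names are ours, not the paper's implementation): equal-frequency splits yield fixed-size groups (HETree-C-like), equal-width splits yield fixed-range groups (HETree-R-like), and applying either split recursively yields a hierarchy.

```python
def equal_frequency_bins(values, k):
    """Split sorted values into k groups of (near-)equal cardinality."""
    values = sorted(values)
    size = -(-len(values) // k)  # ceil division
    return [values[i:i + size] for i in range(0, len(values), size)]

def equal_width_bins(values, k):
    """Split values into k groups covering equal-length sub-ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        j = min(int((v - lo) / width), k - 1)  # clamp max value into last bin
        bins[j].append(v)
    return bins

def build_hierarchy(values, k, split):
    """Recursively apply a split function -> nested lists (a simple tree)."""
    if len(values) <= k:
        return sorted(values)
    return [build_hierarchy(b, k, split) for b in split(values, k) if b]

ages = [20, 30, 35, 35, 37, 45, 50, 55, 80, 100]  # the running example values
print(equal_frequency_bins(ages, 2))  # HETree-C-like: two 5-object groups
print(equal_width_bins(ages, 2))      # HETree-R-like: [20,60) vs [60,100]
print(build_hierarchy(ages, 2, equal_frequency_bins))
```

Note how, as the text observes, the equal-width split can produce heavily unbalanced group cardinalities under skewed data, while the equal-frequency split can produce groups with very different value ranges.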
7. Conclusions
In this paper we have presented HETree, a generic model that combines personalized multilevel exploration with online analysis of numeric and temporal data. Our model is built on top of a lightweight tree-based structure, which can be efficiently constructed on-the-fly for a given set of data. We have presented two variations for constructing our model: the HETree-C structure organizes input data into fixed-size groups, whereas the HETree-R structure organizes input data into fixed-range groups. In this way, users can customize the exploration experience, organizing data in different ways by parameterizing the number of groups, the range and cardinality of their contents, the number of hierarchy levels, and so on. We have also provided a way of efficiently computing statistics over the tree, as well as a method for automatically deriving from the input dataset the best-fit parameters for the construction of the model. Regarding the performance of multilevel exploration over large datasets, our model offers incremental HETree construction and prefetching, as well as efficient HETree adaptation based on user preferences. Based on the introduced model, a Web-based prototype system, called SynopsViz, has been developed. Finally, the efficiency and effectiveness of the presented approach are demonstrated via a thorough performance evaluation and an empirical user study.
Some insights for future work include the support of more sophisticated methods for data organization, in order to effectively handle skewed data distributions and outliers. In particular, we are currently working on hybrid HETree versions that integrate concepts from both the HETree-C and HETree-R versions. For example, a hybrid HETree-C considers a threshold on the maximum range of a group; similarly, a threshold on the maximum number of objects in a group is considered in the hybrid HETree-R version. Regarding the SynopsViz tool, we are planning to redesign and extend the graphical user interface, so that the tool is able to use data resulting from SPARQL endpoints, as well as to offer more sophisticated filtering techniques (e.g., SPARQL-enabled browsing over the data). Finally, we are interested in including more visual techniques and libraries.
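The hybrid idea above can be sketched as follows; this is an illustrative interpretation of the stated design, not the authors' implementation, and all names are hypothetical. A hybrid HETree-C-style grouping keeps fixed-size groups but recursively splits any group whose value range exceeds a threshold, limiting the effect of outliers on group ranges.

```python
def hybrid_fixed_size_groups(values, group_size, max_range):
    """Hybrid HETree-C-style grouping (sketch): fixed-size groups, but any
    group whose value range exceeds max_range is recursively halved."""
    values = sorted(values)
    groups = []
    for i in range(0, len(values), group_size):
        stack = [values[i:i + group_size]]
        while stack:
            cur = stack.pop()
            if len(cur) > 1 and cur[-1] - cur[0] > max_range:
                mid = len(cur) // 2
                stack.append(cur[mid:])   # right half, processed later
                stack.append(cur[:mid])   # left half, processed first
            else:
                groups.append(cur)        # group satisfies the range threshold
    return groups

ages = [20, 30, 35, 35, 37, 45, 50, 55, 80, 100]
print(hybrid_fixed_size_groups(ages, 5, 20))
# the wide second group is split so that no group spans more than 20 units
```

A hybrid HETree-R would apply the symmetric rule: fixed-range groups, with groups exceeding a maximum cardinality split further.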
Acknowledgements. We would like to thank the editors and the three reviewers for their hard work in reviewing our article; their comments helped us significantly improve our work. Further, we thank Giorgos Giannopoulos and Marios Meimaris for many helpful comments on earlier versions of this article.
This work was partially supported by the EU/Greece
funded KRIPIS: MEDA Project and the EU project
"SlideWiki" (688095).
References
[1] J. Abello, F. van Ham, and N. Krishnan. ASK-GraphView:
A Large Scale Graph Visualization System. IEEE Trans. Vis.
Comput. Graph., 12(5), 2006.
[2] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden,
and I. Stoica. BlinkDB: queries with bounded errors and
bounded response times on very large data. In EuroSys, 2013.
[3] F. Alahmari, J. A. Thom, L. Magee, and W. Wong. Evaluating Semantic Browsers for Consuming Linked Data. In Australasian Database Conference (ADC), 2012.
[4] M. Alonen, T. Kauppinen, O. Suominen, and E. Hyvönen. Exploring the Linked University Data with Visualization Tools.
In Extended Semantic Web Conference (ESWC), 2013.
[5] D. Archambault, T. Munzner, and D. Auber. GrouseFlocks:
Steerable Exploration of Graph Hierarchy Space. IEEE Trans.
Vis. Comput. Graph., 14(4), 2008.
[6] G. A. Atemezing and R. Troncy. Towards a linked-data based
visualization wizard. In Workshop on Consuming Linked
Data, 2014.
[7] D. Auber. Tulip - A Huge Graph Visualization Framework.
In Graph Drawing Software. 2004.
[8] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak,
and Z. G. Ives. DBpedia: A Nucleus for a Web of Open Data.
In International Semantic Web Conference (ISWC), 2007.
[9] S. Auer, J. Demter, M. Martin, and J. Lehmann. LODStats - An Extensible Framework for High-Performance Dataset Analytics. In Knowledge Engineering and Knowledge Management, 2012.
[10] B. Bach, E. Pietriga, and I. Liccardi. Visualizing Populated
Ontologies with OntoTrix. Int. J. Semantic Web Inf. Syst.,
9(4), 2013.
[11] M. Bastian, S. Heymann, and M. Jacomy. Gephi: An Open
Source Software for Exploring and Manipulating Networks.
In Conference on Weblogs and Social Media (ICWSM), 2009.
[12] L. Battle, R. Chang, and M. Stonebraker. Dynamic Prefetching of Data Tiles for Interactive Visualization. Technical Report, 2015.
[13] L. Battle, M. Stonebraker, and R. Chang. Dynamic reduction of query result sets for interactive visualizaton. In IEEE
Conference on Big Data, 2013.
[14] C. Becker and C. Bizer. Exploring the Geospatial Semantic
Web with DBpedia Mobile. J. Web Sem., 7(4), 2009.
[15] F. Benedetti, L. Po, and S. Bergamaschi. A Visual Summary
for Linked Open Data sources. In International Semantic Web
Conference (ISWC), 2014.
[16] N. Bikakis, J. Liagouris, M. Krommyda, G. Papastefanatos,
and T. Sellis. Towards Scalable Visual Exploration of Very
Large RDF Graphs. In Extended Semantic Web Conference
(ESWC), 2015.
[17] N. Bikakis, J. Liagouris, M. Krommyda, G. Papastefanatos,
and T. Sellis. graphVizdb: A Scalable Platform for Interactive Large Graph Visualization. In IEEE Intl. Conf. on Data
Engineering (ICDE), 2016.
[18] N. Bikakis and T. Sellis. Exploration and Visualization in the
Web of Big Linked Data: A Survey of the State of the Art.
In International Workshop on Linked Web Data Management
(LWDM), 2016.
[19] N. Bikakis, M. Skourla, and G. Papastefanatos.
rdf:SynopsViz - A Framework for Hierarchical Linked Data
Visual Exploration and Analysis. In Extended Semantic Web
Conference (ESWC), 2014.
[20] J. M. Brunetti, S. Auer, R. García, J. Klímek, and M. Necaský.
Formal Linked Data Visualization Model. In iiWAS, 2013.
[21] J. M. Brunetti, R. Gil, and R. García. Facets and Pivoting for
Flexible and Usable Linked Data Exploration. In Interacting
with Linked Data Workshop, 2012.
[22] D. V. Camarda, S. Mazzini, and A. Antonuccio. LodLive,
exploring the web of data. In Conference on Semantic Systems
(I-SEMANTICS), 2012.
[23] A. E. Cano, A. Dadzie, and M. Hartmann. Who’s Who - A
Linked Data Visualisation Tool for Mobile Environments. In
Extended Semantic Web Conference (ESWC), 2011.
[24] K. Chakrabarti, S. Chaudhuri, and S. Hwang. Automatic Categorization of Query Results. In ACM Conference on Management of Data (SIGMOD), 2004.
[25] S. Chan, L. Xiao, J. Gerth, and P. Hanrahan. Maintaining interactivity while exploring massive time series. In IEEE Symposium on Visual Analytics Science and Technology (VAST),
2008.
[26] Z. Chen and T. Li. Addressing diverse user preferences in
SQL-query-result navigation. In ACM Conference on Management of Data (SIGMOD), 2007.
[27] W. W. Chu and K. Chiang. Abstraction of High Level Concepts from Numerical Values in Databases. In AAAI Workshop on Knowledge Discovery in Databases, 1994.
[28] W. Cui, H. Zhou, H. Qu, P. C. Wong, and X. Li. GeometryBased Edge Clustering for Graph Visualization. IEEE Trans.
Vis. Comput. Graph., 14(6), 2008.
[29] A. Dadzie, V. Lanfranchi, and D. Petrelli. Seeing is believing: Linking data with knowledge. Information Visualization,
8(3), 2009.
[30] A. Dadzie and M. Rowe. Approaches to visualising Linked
Data: A survey. Semantic Web, 2(2), 2011.
[31] A. Dadzie, M. Rowe, and D. Petrelli. Hide the Stack: Toward
Usable Linked Data. In Extended Semantic Web Conference
(ESWC), 2011.
[32] P. R. Doshi, E. A. Rundensteiner, and M. O. Ward. Prefetching for Visual Data Exploration. In Conference on Database
Systems for Advanced Applications (DASFAA), 2003.
[33] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and Unsupervised Discretization of Continuous Features. In International Conference on Machine Learning, 1995.
[34] M. Dudás, O. Zamazal, and V. Svátek. Roadmapping and
Navigating in the Ontology Visualization Landscape. In Conference on Knowledge Engineering and Knowledge Management (EKAW), 2014.
[35] A. Eldawy, M. Mokbel, and C. Jonathan. HadoopViz: A
MapReduce Framework for Extensible Visualization of Big
Spatial Data. In IEEE Intl. Conf. on Data Engineering
(ICDE), 2016.
[36] N. Elmqvist and J. Fekete. Hierarchical Aggregation for Information Visualization: Overview, Techniques, and Design
Guidelines. IEEE Trans. Vis. Comput. Graph., 16(3), 2010.
[37] I. Ermilov, M. Martin, J. Lehmann, and S. Auer. Linked Open
Data Statistics: Collection and Exploitation. In Knowledge
Engineering and the Semantic Web, 2013.
[38] O. Ersoy, C. Hurter, F. V. Paulovich, G. Cantareiro, and
A. Telea. Skeleton-Based Edge Bundling for Graph Visualization. IEEE Trans. Vis. Comput. Graph., 17(12), 2011.
[39] D. Fisher, I. O. Popov, S. M. Drucker, and m. c. schraefel.
Trust me, i’m partially right: incremental visualization lets
analysts explore large datasets faster. In Conference on Human Factors in Computing Systems (CHI), 2012.
[40] B. Fu, N. F. Noy, and M.-A. Storey. Eye Tracking the User Experience - An Evaluation of Ontology Visualization Techniques. Semantic Web Journal (to appear), 2015.
[41] E. R. Gansner, Y. Hu, S. C. North, and C. E. Scheidegger. Multilevel agglomerative edge bundling for visualizing large graphs. In IEEE Pacific Visualization Symposium (PacificVis), 2011.
[42] S. García, J. Luengo, J. A. Sáez, V. López, and F. Herrera. A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Trans. Knowl. Data Eng., 25(4), 2013.
[43] P. Godfrey, J. Gryz, and P. Lasek. Interactive Visualization of Large Data Sets. Technical Report, York University, 2015.
[44] P. Godfrey, J. Gryz, P. Lasek, and N. Razavi. Visualization through inductive aggregation. In Conference on Extending Database Technology (EDBT), 2016.
[45] A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In ACM Conference on Management of Data (SIGMOD), 1984.
[46] F. Haag, S. Lohmann, S. Negru, and T. Ertl. OntoViBe: An Ontology Visualization Benchmark. In Workshop on Visualizations and User Interfaces for Knowledge Engineering and Linked Data Analytics, 2014.
[47] J. Han and Y. Fu. Dynamic Generation and Refinement of Concept Hierarchies for Knowledge Discovery in Databases. In AAAI Workshop on Knowledge Discovery in Databases, 1994.
[48] T. Hastrup, R. Cyganiak, and U. Bojars. Browsing Linked Data with Fenfire. In World Wide Web Conference (WWW), 2008.
[49] J. Heer and S. Kandel. Interactive Analysis of Big Data. ACM Crossroads, 19(1), 2012.
[50] P. Heim, S. Lohmann, and T. Stegemann. Interactive Relationship Discovery via the Semantic Web. In Extended Semantic Web Conference (ESWC), 2010.
[51] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In Conference on Semantic Systems (I-SEMANTICS), 2011.
[52] J. Helmich, J. Klímek, and M. Necaský. Visualizing RDF Data Cubes Using the Linked Data Visualization Model. In Extended Semantic Web Conference (ESWC), 2014.
[53] D. Holten. Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data. IEEE Trans. Vis. Comput. Graph., 12(5), 2006.
[54] S. Idreos, O. Papaemmanouil, and S. Chaudhuri. Overview of Data Exploration Techniques. In ACM Conference on Management of Data (SIGMOD), 2015.
[55] J. Im, F. G. Villegas, and M. J. McGuffin. VisReduce: Fast and responsive incremental information visualization of large datasets. In IEEE Conference on Big Data, 2013.
[56] P. Jayachandran, K. Tunga, N. Kamat, and A. Nandi. Combining User Interaction, Speculative Query Execution and Sampling in the DICE System. Proc. of the VLDB Endowment (PVLDB), 7(13), 2014.
[57] J. F. R. Jr., H. Tong, J. Pan, A. J. M. Traina, C. T. Jr., and C. Faloutsos. Large Graph Analysis in the GMine System. IEEE Trans. Knowl. Data Eng., 25(1), 2013.
[58] U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl. Faster Visual Analytics through Pixel-Perfect Aggregation. Proc. of the VLDB Endowment (PVLDB), 7(13), 2014.
[59] U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl. VDDA: automatic visualization-driven data aggregation in relational databases. Journal on Very Large Data Bases (VLDBJ), 2015.
[60] E. Kalampokis, A. Nikolov, P. Haase, R. Cyganiak, A. Stasiewicz, A. Karamanou, M. Zotou, D. Zeginis, E. Tambouris, and K. A. Tarabanis. Exploiting Linked Data Cubes with OpenCube Toolkit. In International Semantic Web Conference (ISWC), 2014.
[61] A. Kalinin, U. Çetintemel, and S. B. Zdonik. Interactive Data Exploration Using Semantic Windows. In ACM Conference on Management of Data (SIGMOD), 2014.
[62] A. Kalinin, U. Çetintemel, and S. B. Zdonik. Searchlight: Enabling Integrated Search and Exploration over Large Multidimensional Data. Proc. of the VLDB Endowment (PVLDB), 8(10), 2015.
[63] N. Kamat, P. Jayachandran, K. Tunga, and A. Nandi. Distributed and interactive cube exploration. In IEEE Conference on Data Engineering (ICDE), 2014.
[64] B. Kämpgen and A. Harth. OLAP4LD - A Framework for Building Analysis Applications Over Governmental Statistics. In Extended Semantic Web Conference (ESWC), 2014.
[65] A. Kashyap, V. Hristidis, M. Petropoulos, and S. Tavoulari. Effective Navigation of Query Results Based on Concept Hierarchies. IEEE Trans. Knowl. Data Eng., 23(4), 2011.
[66] H. A. Khan, M. A. Sharaf, and A. Albarrak. DivIDE: efficient diversification for interactive data exploration. In Conference on Scientific and Statistical Database Management (SSDBM), 2014.
[67] A. Kim, E. Blais, A. G. Parameswaran, P. Indyk, S. Madden, and R. Rubinfeld. Rapid sampling for visualizations with ordering guarantees. Proc. of the VLDB Endowment (PVLDB), 8(5), 2015.
[68] J. Klímek, J. Helmich, and M. Necaský. Payola: Collaborative Linked Data Analysis and Visualization Framework. In Extended Semantic Web Conference (ESWC), 2013.
[69] J. Klímek, J. Helmich, and M. Necaský. Use Cases for Linked Data Visualization Model. In Workshop on Linked Data on the Web (LDOW), 2015.
[70] S. Kriglstein and R. Motschnig-Pitrik. Knoocks: New Visualization Approach for Ontologies. In Conference on Information Visualisation, 2008.
[71] A. Lambert, R. Bourqui, and D. Auber. Winding Roads: Routing edges into bundles. Comput. Graph. Forum, 29(3), 2010.
[72] A. Langegger and W. Wöß. RDFStats - An Extensible RDF Statistics Generator and Library. In Database and Expert Systems Applications, 2009.
[73] M. Lanzenberger, J. Sampson, and M. Rester. Visualization in Ontology Tools. In International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), 2009.
[74] A. d. Leon, F. Wisniewki, B. Villazón-Terrazas, and O. Corcho. Map4rdf - Faceted Browser for Geospatial Datasets. In Using Open Data: policy modeling, citizen empowerment, data journalism, 2012.
[75] C. Li, G. Baciu, and Y. Wang. ModulGraph: Modularity-based Visualization of Massive Graphs. In Visualization in High Performance Computing, 2015.
[76] Z. Lin, N. Cao, H. Tong, F. Wang, U. Kang, and D. H. P. Chau. Demonstrating Interactive Multi-resolution Large Graph Exploration. In IEEE Conference on Data Mining Workshops, 2013.
[77] L. D. Lins, J. T. Klosowski, and C. E. Scheidegger.
Nanocubes for Real-Time Exploration of Spatiotemporal
Datasets. IEEE Trans. Vis. Comput. Graph., 19(12), 2013.
[78] Z. Liu, B. Jiang, and J. Heer. imMens: Real-time Visual
Querying of Big Data. Comput. Graph. Forum, 32(3):421–
430, 2013.
[79] S. Mansmann and M. H. Scholl. Exploring OLAP aggregates
with hierarchical visualization techniques. In ACM Symposium on Applied Computing (SAC), 2007.
[80] N. Marie and F. L. Gandon. Survey of Linked Data Based
Exploration Systems. In Workshop on Intelligent Exploration
of Semantic Data (IESD), 2014.
[81] K. Morton, M. Balazinska, D. Grossman, and J. D. Mackinlay. Support the Data Enthusiast: Challenges for NextGeneration Data-Analysis Systems. Proc. of the VLDB Endowment (PVLDB), 7(6), 2014.
[82] C. Nikolaou, K. Dogani, K. Bereta, G. Garbis, M. Karpathiotakis, K. Kyzirakos, and M. Koubarakis. SexTant: Visualizing Time-Evolving Linked Geospatial Data. In International
Semantic Web Conference (ISWC), 2013.
[83] Y. Park, M. J. Cafarella, and B. Mozafari. VisualizationAware Sampling for Very Large Databases. In IEEE Intl.
Conf. on Data Engineering (ICDE), 2016.
[84] H. Paulheim. Generating Possible Interpretations for Statistics from Linked Open Data. In Extended Semantic Web Conference (ESWC), 2012.
[85] I. Petrou, M. Meimaris, and G. Papastefanatos. Towards a
methodology for publishing Linked Open Statistical Data.
eJournal of eDemocracy & Open Government, 6(1), 2014.
[86] D. Phan, L. Xiao, R. B. Yeh, P. Hanrahan, and T. Winograd.
Flow Map Layout. In IEEE Symposium on Information Visualization (InfoVis), 2005.
[87] E. Pietriga. IsaViz: a Visual Environment for Browsing and
Authoring RDF Models. In World Wide Web Conference
(WWW), 2002.
[88] P. Ristoski, C. Bizer, and H. Paulheim. Mining the Web of
Linked Data with RapidMiner. In International Semantic Web
Conference (ISWC), 2014.
[89] P. Ristoski and H. Paulheim. Visual Analysis of Statistical
Data on Maps using Linked Open Data. In Extended Semantic
Web Conference (ESWC), 2015.
[90] P. E. R. Salas, F. M. D. Mota, K. K. Breitman, M. A.
Casanova, M. Martin, and S. Auer. Publishing Statistical Data
on the Web. Int. J. Semantic Computing, 6(4), 2012.
[91] K. Schlegel, T. Weißgerber, F. Stegmaier, C. Seifert, M. Granitzer, and H. Kosch. Balloon Synopsis: A Modern NodeCentric RDF Viewer and Browser for the Web. In Extended
Semantic Web Conference (ESWC), 2014.
[92] C. Shen and Y. Chen. A dynamic-programming algorithm for
hierarchical discretization of continuous attributes. European
Journal of Operational Research, 184(2), 2008.
[93] B. Shneiderman. Tree Visualization with Tree-Maps: 2-d
Space-Filling Approach. ACM Trans. Graph., 11(1), 1992.
[94] B. Shneiderman. The Eyes Have It: A Task by Data Type
Taxonomy for Information Visualizations. In IEEE Symposium on Visual Languages, 1996.
[95] B. Shneiderman. Extreme visualization: squeezing a billion
records into a million pixels. In ACM Conference on Management of Data (SIGMOD), 2008.
[96] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web, 3(4), 2012.
[97] C. Stadler, M. Martin, and S. Auer. Exploring the web of spatial data with facete. In World Wide Web Conference (WWW), 2014.
[98] C. Stolte, D. Tang, and P. Hanrahan. Query, analysis, and visualization of hierarchically structured data using Polaris. In ACM Conference on Knowledge Discovery and Data Mining (SIGKDD), 2002.
[99] M. Stuhr, D. Roman, and D. Norheim. LODWheel - JavaScript-based Visualization of RDF Data. In Workshop on Consuming Linked Data, 2011.
[100] S. Sundara, M. Atre, V. Kolovski, S. Das, Z. Wu, E. I. Chong, and J. Srinivasan. Visualizing large-scale RDF data using Subsets, Summaries, and Sampling in Oracle. In ICDE, 2010.
[101] F. Tauheed, T. Heinis, F. Schürmann, H. Markram, and A. Ailamaki. SCOUT: Prefetching for Latent Feature Following Queries. Proc. of the VLDB Endowment (PVLDB), 5(11), 2012.
[102] K. Techapichetvanich and A. Datta. Interactive Visualization for OLAP. In Computational Science and Its Applications (ICCSA), 2005.
[103] K. Thellmann, M. Galkin, F. Orlandi, and S. Auer. LinkDaViz - Automatic Binding of Linked Data to Visualizations. In International Semantic Web Conference (ISWC), 2015.
[104] C. Tominski, J. Abello, and H. Schumann. CGV - An interactive graph visualization system. Computers & Graphics, 33(6), 2009.
[105] G. Tschinkel, E. E. Veas, B. Mutlu, and V. Sabol. Using Semantics for Interactive Visual Analysis of Linked Open Data. In International Semantic Web Conference (ISWC), 2014.
[106] F. Valsecchi, M. Abrate, C. Bacciu, M. Tesconi, and A. Marchetti. DBpedia Atlas: Mapping the Uncharted Lands of Linked Data. In Workshop on Linked Data on the Web (LDOW), 2015.
[107] F. Valsecchi and M. Ronchetti. Spacetime: a Two Dimensions Search and Visualisation Engine Based on Linked Data. In Conference on Advances in Semantic Processing (SEMAPRO), 2014.
[108] M. Vartak, S. Madden, A. G. Parameswaran, and N. Polyzotis. SEEDB: Automatically Generating Query Visualizations. Proc. of the VLDB Endowment (PVLDB), 7(13), 2014.
[109] M. Voigt, S. Pietschmann, L. Grammel, and K. Meißner. Context-aware Recommendation of Visualization Components. In Conference on Information, Process, and Knowledge Management (eKNOW), 2012.
[110] M. Voigt, S. Pietschmann, and K. Meißner. A Semantics-Based, End-User-Centered Information Visualization Process for Semantic Web Data. In Semantic Models for Adaptive Interactive Systems, 2013.
[111] T. D. Wang and B. Parsia. CropCircles: Topology Sensitive Visualization of OWL Class Hierarchies. In International Semantic Web Conference (ISWC), 2006.
[112] M. O. Ward. XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data. In IEEE Visualization, 1994.
[113] H. Wickham. Bin-summarise-smooth: a framework for visualising large data. Technical report, 2013.
[114] E. Wu, L. Battle, and S. R. Madden. The Case for Data Visualization Management Systems. Proc. of the VLDB Endowment (PVLDB), 7(10), 2014.
[115] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality assessment methodologies for linked open data. Semantic Web Journal (to appear), 2015.
[116] K. Zhang, H. Wang, D. T. Tran, and Y. Yu. ZoomRDF: semantic fisheye zooming on RDF data. In World Wide Web
Conference (WWW), 2010.
[117] M. Zinsmaier, U. Brandes, O. Deussen, and H. Strobelt. Interactive Level-of-Detail Rendering of Large Graphs. IEEE
Trans. Vis. Comput. Graph., 18(12), 2012.
[118] K. Zoumpatianos, S. Idreos, and T. Palpanas. Indexing for
interactive exploration of big data series. In ACM Conference
on Management of Data (SIGMOD), 2014.
Appendix
A. Incremental HETree Construction
Remark 1. Each time ICO constructs a node (either
as part of initial nodes or due to a construction rule),
it also constructs all of its sibling nodes.
Proof of Proposition 1. Considering the different cases
of currently presented HETree elements and the available exploration operations, we have the following.
(1) A set of (internal or leaf) sibling nodes S are
presented and the user performs a roll-up action. Here,
the roll-up action will render the parent node of S
along with parent’s sibling nodes. In the case that S
are the nodes of interest (RAN scenario), the rendered
nodes have been constructed in the beginning of the
exploration (as part of RAN initial nodes). Otherwise,
the presented nodes have been previously constructed
due to construction Rule 1(i) (see Section 3.2).
(2) A set of internal sibling nodes C are presented
and the user performs a drill-down action over a node
c ∈ C. In this case, the drill-down will render c child
nodes. If C are the nodes of interest (RAN scenario),
then the child nodes of c have been constructed at the
beginning of the exploration (as part of RAN initial
nodes). Else, if C is the root node (BSC scenario), then
again the child nodes of c have been constructed at
the beginning of the exploration (as part of BSC initial nodes). Otherwise, the children of c, have been
constructed before due to construction Rule 1(ii) (see
Section 3.2).
(3) A set of leaf sibling nodes L are presented
and the user performs a drill-down action over a leaf
l ∈ L. In this case the drill-down action will render
data objects contained in l. Since a leaf is constructed
together with its data objects, all data objects here have
been previously constructed along with l.
(4) A set of data objects O are presented and the user performs a roll-up action. Here, the roll-up action will render the leaf that contains O along with the leaf's siblings. In the RAN and BSC exploration scenarios, data objects are reachable only via a drill-down action over the leaf that contains them, whereas in the RES scenario, the data objects contained in the leaf of interest are the first elements presented to the user.
In the general case, since O are reached only via a drill-down, their parent leaf has already been constructed. Based on Remark 1, all sibling nodes of this leaf have also been constructed. In the case of the RES scenario, where O includes the resource of interest, the leaf that contains O along with the leaf's siblings have been constructed at the beginning of the exploration.
Thus, it is shown that, in all cases, the HETree elements that a user can reach by performing one operation have been previously constructed by ICO. This concludes the proof of Proposition 1.
Proof of Theorem 1. We will show that, at any step of an exploration scenario, ICO constructs only the required HETree elements. In an exploration scenario, ICO constructs nodes either as initial nodes or via the construction rules. The initial nodes are constructed once, at the beginning of the exploration process; by the definition of the initial nodes, these are the required HETree elements for the first user operation.
During the exploration process, ICO constructs nodes only via the construction rules. Among the construction rules, only Rule 1 constructs new nodes. Considering the part of the tree rendered when Rule 1 (Section 3.2) is applied, it is apparent that the nodes constructed by Rule 1 are only the required HETree elements.
Therefore, at any exploration step, ICO constructs only the required HETree elements. Considering all the steps comprising a user exploration scenario, the overall number of constructed elements is the minimum. This concludes the proof of Theorem 1.
Procedure 4: constrRollUp-R(D, d, cur, H)
Input: D: set of objects; d: tree degree; cur: currently presented elements; H: currently created HETree-R
Output: H: updated HETree-R
//Computed in ICO-R: len: the length of the leaf's interval
1   create an empty node par                    //cur parent node
2   par.h ← cur[1].h + 1
3   par.I− ← cur[1].I−
4   par.I+ ← cur[|cur|].I+
5   for i ← 1 to |cur| do                       //create parent-child relations
6       par.c[i] ← cur[i]
7       cur[i].p ← par
8   insert par into H
9   lp ← par.I+ − par.I−                        //par interval length
    //compute interval for par's parent, Ippar
10  Ippar− ← D.minv + d · lp · ⌊(par.I− − D.minv) / (d · lp)⌋
11  Ippar+ ← min(D.maxv, Ippar− + d · lp)
12  lsp ← len · d^(cur[1].h)                    //interval length for a par sibling node
13  Ispar ← computeSiblingInterv-R(Ippar−, Ippar+, lsp, d)   //compute intervals for all par sibling nodes
14  remove par.I from Ispar                     //remove par interval, par already constructed
15  S ← constrSiblingNodes-R(Ispar, null, D, cur[1].h + 1)
16  insert S into H
17  return H

Procedure 5: constrDrillDown-R(D, d, cur, H)
Input: D: set of objects; d: tree degree; cur: currently presented elements; H: currently created HETree-R
Output: H: updated HETree-R
//Computed in ICO-R: len: the length of the leaf's interval
1   lc ← len · d^(cur[1].h − 1)                 //length of the children's intervals
2   for i ← 1 to |cur| do
3       if cur[i].c[0] ≠ null then continue     //children previously constructed
4       Ich ← computeSiblingInterv-R(cur[i].I−, cur[i].I+, lc, d)   //compute intervals for cur[i] children
5       S ← constrSiblingNodes-R(Ich, cur[i], cur[i].data, cur[1].h − 1)
6       for k ← 1 to |S| do
7           cur[i].c[k] ← S[k]
8       insert S into H
9   return H

Procedure 6: computeSiblingInterv-R(low, up, len, n)
Input: low: intervals' lower bound; up: intervals' upper bound; len: intervals' length; n: number of siblings
Output: I: an ordered set with at most n equal-length intervals
1   It−, It+ ← low
2   for i ← 1 to n do
3       It− ← It+
4       It+ ← min(up, len + It−)
5       append It to I
6       if It+ = up then break
7   return I
B. ICO Algorithm
The constrRollUp-R procedure (Procedure 4) initially constructs the parent node par of cur (lines 1-7). Next, it computes the interval Ippar corresponding to the parent of par (lines 10-11). Using Ippar, it computes the intervals for each of the sibling nodes of par (line 13). Finally, the computed sibling intervals Ispar are used for the construction of the sibling nodes (line 15).
In constrDrillDown-R (Procedure 5), for each node in cur, its children are constructed as follows (line 2). First, the procedure computes the intervals Ich of each child (line 4), and then it constructs all children (line 5). Finally, the child relations for the parent node cur[i] are set (lines 6-7).
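The interval tiling used throughout these procedures is straightforward; the following Python sketch (our own transcription, with hypothetical names) mirrors the logic of computeSiblingInterv-R: it tiles [low, up] with at most n consecutive equal-length intervals, clipping the last one at up.

```python
def compute_sibling_intervals(low, up, length, n):
    """Tile [low, up] with at most n consecutive intervals of equal length
    (the last one is clipped at up), following Procedure 6."""
    intervals = []
    hi = low
    for _ in range(n):
        lo = hi
        hi = min(up, lo + length)   # clip the last interval at the upper bound
        intervals.append((lo, hi))
        if hi == up:                # the range is fully covered; stop early
            break
    return intervals

print(compute_sibling_intervals(20, 100, 30, 3))
# three sibling ranges: (20, 50), (50, 80), (80, 100)
```

Note that fewer than n intervals are produced whenever n · length exceeds up − low, which matches the "at most n" guarantee stated in Procedure 6.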
C. Incremental HETree Construction Analysis
In this section, we analyse in detail the worst case of the ICO algorithm, i.e., the case in which the construction cost is maximized.
C.1. The HETree-R Version
The worst case in HETree-R occurs when the whole
dataset D is contained in a set of sibling leaf nodes L,
where |L| ≤ d.
Considering the above setting, in the RES scenario the cost is maximized when ICO-R constructs L (as initial nodes). In this case, the cost is O(|D| + |D|log|D|) = O(|D|log|D|).
In a RAN scenario, the cost is maximized when the parent node p of L, along with p's sibling nodes, is considered as the nodes of interest. First, let us note that in this case p has no sibling nodes, since all of them would be empty (i.e., they do not enclose data). Hence, p has to be constructed in ICO-R as an initial node, L in constrDrillDown-R, and the parent of p in constrRollUp-R. The construction of p in ICO-R requires O(|D|). The construction of L in constrDrillDown-R requires O(d + |D| + |D|log|D| + d). Finally, the construction of the parent of p in constrRollUp-R requires O(1). Therefore, in RAN the overall worst-case cost is O(|D| + d + |D| + |D|log|D| + d) = O(|D|log|D|).
Finally, in the BSC scenario, the cost is maximized when L has to be constructed by constrDrillDown-R, which requires O(d + |D| + |D|log|D| + d) = O(|D|log|D|).
Procedure 7: constrSiblingNodes-R(I, p, A, h)
Input: I: an ordered set with equal-length intervals; p: nodes' parent node; A: available data objects; h: nodes' height
Output: S: a set of HETree-R sibling nodes
1   l ← I[1]+ − I[1]−   //intervals' length
2   T[ ] ← ∅   //indicates enclosed data for each node
3   foreach tr ∈ A do
4       j ← ⌊(tr.o − I[1]−)/l⌋ + 1
5       if j ≥ 1 and j ≤ |I| then
6           insert object tr into T[j]
7           remove object tr from A
8   for i ← 1 to |I| do   //construct nodes
9       if T[i] = ∅ then continue
10      create a new node n
11      n.I− ← I[i]−
12      n.I+ ← I[i]+
13      n.p ← p
14      n.c ← null
15      n.data ← T[i]
16      n.h ← h
17      if h = 0 then   //node is a leaf
18          sort n.data based on objects' values
19      append n to S
20  return S
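The interval-bucketing idea of Procedure 7 can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: nodes are plain dicts with assumed field names (lo, hi, parent, children, data, height), indexing is 0-based (so the pseudocode's +1 offset is dropped), and the removal of bucketed objects from A is omitted.

```python
def constr_sibling_nodes_r(intervals, parent, objects, height):
    """intervals: ordered list of (lo, hi) pairs of equal length;
    objects: numeric values to be enclosed by the sibling nodes."""
    length = intervals[0][1] - intervals[0][0]        # intervals' length
    buckets = {}                                      # enclosed data per interval
    for value in objects:
        j = int((value - intervals[0][0]) // length)  # index of enclosing interval
        if 0 <= j < len(intervals):
            buckets.setdefault(j, []).append(value)
    siblings = []
    for i, (lo, hi) in enumerate(intervals):
        if i not in buckets:                          # skip empty intervals
            continue
        node = {"lo": lo, "hi": hi, "parent": parent,
                "children": None, "data": buckets[i], "height": height}
        if height == 0:                               # leaves keep their data sorted
            node["data"].sort()
        siblings.append(node)
    return siblings

# toy usage: three equal-length intervals, four data objects
leaves = constr_sibling_nodes_r([(0, 10), (10, 20), (20, 30)], None, [5, 25, 1, 12], 0)
```

Each object is placed in its bucket with one division, so the scan is linear in |A|, matching the procedure's single pass over the available data.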
C.2. The HETree-C Version
First, note that in the HETree-C version the dataset is sorted at the beginning of the exploration, and the leaves contain equal numbers of data objects. As a result, during a node construction, the data objects enclosed by the node can be identified directly by computing its position over the dataset, without scanning the dataset or the enclosed data values. However, in ICO we assume that a node's statistics are computed each time the node is constructed. Hence, in each node construction, we scan the data objects that are enclosed by this node.
In the RES scenario, the worst case occurs when the user rolls up for the first time to the nodes at level 2 (i.e., two levels below the root). In this case, ICO has to construct the d nodes at level 1, as well as the children of the d − 1 nodes at level 2. Note that the construction of the parent of the nodes at level 2 does not require processing any data objects or constructing children, since these nodes are already constructed. Regarding the construction of the remaining d − 1 nodes at level 1, ICO will process at most (d−1)/d of all data objects.19 Thus, the cost for constrRollUp-C is O(d + (d−1)/d·|D| + d − 1). Finally, for constructing the child nodes of the d − 1 nodes at level 2, we are required to process at most (d−1)/d² of all data objects. Hence, the cost for constrDrillDown-C is O(d² + (d−1)/d²·|D| + d²). Therefore, in RES the worst-case cost is O(d + (d−1)/d·|D| + d − 1 + d² + (d−1)/d²·|D| + d²) = O(d² + (d−1)/d·|D|).
19 This number can be easily computed by considering the number of leaves enclosed by these nodes.
In the RAN scenario, the worst case occurs when the user starts from any set of sibling nodes at level 2. Hence, the cost is maximized at the beginning of the exploration. In this case, ICO has to construct the d initial nodes at level 2, the d nodes at level 1, and the children of all the d nodes at level 2. First, the d initial nodes at level 2 are constructed by ICO-C, which can be done in O(d + |D| + d). Then, the d nodes at level 1 are constructed by constrRollUp-C. As in the RES scenario, this can be done in O(d + (d−1)/d·|D| + d − 1). Finally, the construction of the child nodes of all the d nodes at level 2 requires processing |D|/d data objects. Hence, the cost for constrDrillDown-C is O(d² + |D|/d + d²). Therefore, in RAN the cost, in the worst case, is O(|D| + d + d + d + (d−1)/d·|D| + d − 1 + d² + |D|/d + d²) = O(d² + (d−1)/d·|D|).
Finally, in the BSC scenario, the worst case occurs when the user visits for the first time any of the nodes at level 1. In this case, ICO has to construct the children of the d nodes at level 1. Hence, constrDrillDown-C has to process |D| data objects in order to construct the d² child nodes. Therefore, in BSC the worst-case cost is O(d² + |D| + d²) = O(d² + |D|).
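The (d−1)/d·|D| term in the analysis above follows from the equal-size leaves of the HETree-C: each of the d subtrees directly below the root encloses exactly |D|/d consecutive objects of the sorted dataset. A toy numeric check of this fraction (illustrative values only, not part of the framework):

```python
# In an HETree-C the dataset is sorted and leaves hold equal numbers of objects,
# so each of the d subtrees under the root encloses exactly |D|/d consecutive
# objects; constructing d - 1 of them therefore scans (d-1)/d of the dataset.
d = 4                       # tree degree (illustrative)
D = list(range(256))        # sorted dataset, |D| = 256
per_subtree = len(D) // d   # objects enclosed by one subtree under the root
scanned = (d - 1) * per_subtree   # objects processed for d - 1 of the d subtrees
```

With these values, 192 of the 256 objects are scanned, i.e., exactly (d−1)/d of the dataset.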
D. Adaptive HETree Construction
D.1. Preliminaries
In order to traverse the levels of the HETree (i.e., level-order traversal), we use an array H of pointers to the ordered set of nodes at each level, with H[0] referring to the set of leaf nodes and H[k] referring to the set of nodes at height k. Moreover, we consider the following simple procedures, which are used in the ADA implementation:
– mergeLeaves(L, m), where L is an ordered set of leaf nodes and m ∈ N+, with m > 1. This procedure returns an ordered set of ⌈|L|/m⌉ new leaf nodes, i.e., each new leaf merges m leaf nodes from L. The procedure traverses L, constructs a new leaf for every m nodes in L, and appends the data items of the m nodes to the new leaf. This procedure requires O(|L|).
– replaceNode(n1, n2) replaces the node n1 with the node n2; it removes n1 and updates the parent
of n1 to refer to n2 . This procedure requires constant
time, hence O(1).
– createEdges(P, C, d), where P , C are ordered sets
of nodes and d is the tree degree. It creates the edges
(i.e., parent-child relations) from the parent nodes P
to the child nodes C, with degree d. The procedure
traverses over P and connects each node P [i] with the
nodes from C[(i − 1)d + 1] to C[(i − 1)d + d]. This
procedure requires O(|C|).
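The three helper procedures above can be sketched in Python as follows. This is a minimal sketch: the Node class and its field names (data, parent, children) are illustrative assumptions, not the paper's data structures, and replaceNode here locates the child slot by search rather than via a direct pointer.

```python
class Node:
    def __init__(self, data=None):
        self.data = list(data) if data else []
        self.parent = None
        self.children = []

def merge_leaves(L, m):
    """mergeLeaves(L, m): merge every m consecutive leaves of L into one new leaf; O(|L|)."""
    merged = []
    for i in range(0, len(L), m):
        leaf = Node()
        for old in L[i:i + m]:
            leaf.data.extend(old.data)     # append the data items of the m old leaves
        merged.append(leaf)
    return merged

def replace_node(n1, n2):
    """replaceNode(n1, n2): remove n1 and make n1's parent refer to n2."""
    n2.parent = n1.parent
    if n1.parent is not None:
        slot = n1.parent.children.index(n1)  # with a direct slot pointer this is O(1)
        n1.parent.children[slot] = n2

def create_edges(P, C, d):
    """createEdges(P, C, d): connect parent P[i] to children C[i*d .. i*d+d-1]; O(|C|)."""
    for i, p in enumerate(P):
        p.children = C[i * d:(i + 1) * d]
        for c in p.children:
            c.parent = p

# toy usage: 4 leaves -> 2 merged leaves connected under one root
old_leaves = [Node([1]), Node([2]), Node([3]), Node([4])]
new_leaves = merge_leaves(old_leaves, 2)
root = Node()
create_edges([root], new_leaves, 2)
spare = Node([9, 10])
replace_node(new_leaves[0], spare)         # root's first child is now `spare`
```

Each procedure performs a single pass over its input set, matching the stated O(|L|), O(1), and O(|C|) bounds.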
D.2. The User Modifies the Tree Degree
D.2.1. The user increases the tree degree
(1) d′ = d^k, with k ∈ N+ and k > 1
Tree Construction. For the T′ construction, we perform a reverse level-order traversal over T, using the H vector. Starting from the leaves (H[0]), we skip (i.e., remove) k − 1 levels of nodes. Then, for the nodes of the level above (H[k]), we create child relations with the (non-skipped) nodes in the level below. This process continues until we reach the root node of T. Hence, in this case all nodes in T′ are obtained "directly" from T. Particularly, T′ is constructed using the root node of T, as well as the T nodes from H[j · k], j ∈ N0.
The T′ construction requires executing the createEdges procedure j times. For computing j, we have that j · k ≤ |H| ⇔ j · k ≤ log_d ℓ. Considering that d′ = d^k, we have that k = log_d d′. Hence, j · log_d d′ ≤ log_d ℓ ⇔ j ≤ log_d′ ℓ. So, considering that the worst-case complexity of createEdges is O(ℓ), the overall complexity is O(ℓ · log_d′ ℓ). Since ℓ ≤ |D|, in the worst case T′ can be constructed in O(|D| · log_d |D|) = O(|D| · log_(d′^(1/k)) |D|).
Statistics Computations. In this case there is no need for computing any new statistics.
(2) d′ = k · d, with k ∈ N+, k > 1 and k ≠ d^ν, where ν ∈ N+
Tree Construction. As the leaves in T′ remain the same as in T, we only use constrInterNodes (Procedure 2) to build the rest of the tree. Therefore, in the worst case, the complexity for constructing T′ is O((d′²·ℓ − d′)/(d′ − 1)).
Statistics Computations. The statistics for the T′ nodes of height 1 can be computed by aggregating statistics from T. Particularly, in T′ the statistics computations for each internal node of height 1 require O(k) instead of O(d′), where k = d′/d. Hence, considering that there are ℓ/d′ internal nodes of height 1 in T′, the cost for their statistics is O(k · ℓ/d′ + k).
Regarding the cost of recomputing the remaining statistics from scratch, consider that there are (ℓ − 1)/(d′ − 1) internal nodes with heights greater than 1;20 the statistics computations for these nodes require O((d′·ℓ − d′)/(d′ − 1)). Therefore, the overall cost for statistics computations is O(k·ℓ/d′ + k + (d′·ℓ − d′)/(d′ − 1)).
(3) elsewhere
Tree Construction. Similar to the previous case, the T′ construction requires O((d′²·ℓ − d′)/(d′ − 1)).
Statistics Computations. In this case, the statistics should be computed from scratch for all internal nodes in T′. Therefore, the complexity is O((d′²·ℓ − d′)/(d′ − 1)).
D.2.2. The user decreases the tree degree
(1) d′ = d^(1/k) (i.e., d = d′^k), with k ∈ N+ and k > 1
Tree Construction. For the T′ construction we perform a reverse level-order traversal over T using the H vector, starting from the nodes having height 1. In each level, for each node n we call constrInterNodes (Procedure 2), using as input the d child nodes of n and the new degree d′. Note that, in this reconstruction case, constrInterNodes does not require constructing the root node; the root node here always corresponds to the node n. Hence, the complexity of constrInterNodes for one call is O(d). Considering that we perform a procedure call for all the internal nodes, and that the maximum number of internal nodes is (d·ℓ − 1)/(d − 1), in the worst case T′ can be constructed in O((d²·ℓ − d)/(d − 1)) = O((d′^(2k)·ℓ − d′^k)/(d′^k − 1)).
Regarding the number of internal nodes that we have to construct from scratch: since T′ has all the nodes of T, for T′ we have to construct from scratch (d′·ℓ − 1)/(d′ − 1) − (d·ℓ − 1)/(d − 1) new internal nodes, where the first part corresponds to the number of internal nodes of T′, and the second part corresponds to T. Considering that d′ = d^(1/k), we have to build (d′·ℓ − 1)/(d′ − 1) − (d′^k·ℓ − 1)/(d′^k − 1) internal nodes.
Statistics Computations. Statistics should be computed only for the new internal nodes of T′. Hence, the cost here is O(d′ · ((d′·ℓ − 1)/(d′ − 1) − (d′^k·ℓ − 1)/(d′^k − 1))).
20 Take into account that the maximum number of internal nodes (considering all levels) is (d^⌈log_d ℓ⌉ − 1)/(d − 1).
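The level-skipping rebuild for the degree-increase case d′ = d^k of Section D.2.1 can be sketched in Python as follows. This is a hedged illustration, not the paper's implementation: H is assumed to be a list of levels (each an ordered list of node dicts with assumed "children"/"parent" fields), and, for simplicity, the root's height is assumed to be a multiple of k (a perfect tree whose height k divides).

```python
# H is the level array from Section D.1: H[0] = the leaves, H[i] = nodes at height i.
def increase_degree(H, d, k):
    """Keep only the levels H[0], H[k], H[2k], ... of the old tree and rewire
    parent/child edges with the new degree d' = d**k (cf. createEdges)."""
    d_new = d ** k
    kept = [H[j] for j in range(0, len(H), k)]       # the non-skipped levels
    for parents, children in zip(kept[1:], kept[:-1]):
        for i, p in enumerate(parents):              # createEdges with degree d'
            p["children"] = children[i * d_new:(i + 1) * d_new]
            for c in p["children"]:
                c["parent"] = p
    return kept

# toy perfect tree: d = 2, 4 leaves, height 2; increase to d' = 4 with k = 2
leaves = [{"children": [], "parent": None} for _ in range(4)]
mid = [{"children": leaves[:2], "parent": None},
       {"children": leaves[2:], "parent": None}]
root = {"children": mid, "parent": None}
new_H = increase_degree([leaves, mid, [root]], 2, 2)
```

In the toy run the intermediate level is dropped, and the old root becomes the direct parent of all four leaves, i.e., every surviving node is obtained "directly" from T, as the analysis states.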
D.3. The User Modifies the Number of Leaves
D.3.1. The user decreases the number of leaves
(1) ℓ′ = ℓ/d^k, with k ∈ N+
Tree Construction. In this case, each leaf in T′ results from merging d^k leaves of T. Hence, the T′ leaves are constructed by calling mergeLeaves(L, d^k). So, considering the mergeLeaves complexity, in the worst case the new leaves' construction requires O(|D|). Then, each leaf of T′ replaces an internal node of T having height k. Therefore, in the worst case (k = 1), we call the replaceNode procedure ℓ/d times, which requires O(ℓ/d). Hence, the overall cost for constructing T′ in the worst case is O(|D| + ℓ/d) = O(|D|). Note that in this, as well as in the following case, we assume that T is a perfect tree. In case T is not perfect, we can use as T the perfect tree initially proposed by our system.
(2) ℓ′ = ℓ/k, with k ∈ N+, k > 1 and k ≠ d^ν, where ν ∈ N+
Tree Construction. In this case, each leaf in T′ results from merging k leaves of T. Hence, the T′ leaves are constructed by calling mergeLeaves(L, k), which in the worst case requires O(|D|). Then, the rest of the tree is constructed from scratch using constrInterNodes. Therefore, the overall cost for the T′ construction is O(|D| + (d²·ℓ′ − d)/(d − 1)).
Statistics Computations. The statistics for all internal nodes have to be computed from scratch. Regarding the leaves, the statistics for each leaf in T′ are computed by aggregating the statistics of the T leaves it includes. Essentially, for computing the statistics of each leaf in an HETree-C, we have to process k values instead of |D|/ℓ′. However, in the worst case (i.e., ℓ = |D|), we have that k = |D|/ℓ′. Therefore, in the worst case (for both HETree versions) the complexity is the same as computing the statistics from scratch.
(3) ℓ′ = ℓ − k, with k ∈ N+, k > 1 and ℓ′ ≠ ℓ/ν, where ν ∈ N+
Tree Construction. In order to construct T′ we have to construct all nodes from scratch, which in the worst case requires O(|D|log|D| + (d²·ℓ′ − d)/(d − 1)).
Statistics Computations. The statistics of the T leaves that are fully contained in T′ can be used for calculating the statistics of the new leaves. The worst case occurs when the number of leaves that are fully contained in T′ is minimized. For HETree-C (resp. HETree-R), this occurs when the size of the leaves in T′ is λ′ = λ + 1 (resp. their length is ρ′ = ρ + ρ/ℓ). In this case, for every λ (resp. ρ) leaves of T that are used to construct the T′ leaves, at least one leaf is fully contained. Hence, when we process all ℓ leaves, at least ℓ²/|D| leaves are fully contained in T′.
Hence, in HETree-C, in the statistics computations over the leaves, instead of processing |D| values, we process at most |D| − (ℓ²/|D|)·λ = |D| − ℓ values. The same also holds in HETree-R, if we assume a normal distribution over the values in D. Therefore, the cost for computing the leaves' statistics is O(|D| − ℓ) = O(|D| − ℓ′ − k).
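The leaf-level statistics aggregation used in case (2) above (ℓ′ = ℓ/k) can be sketched in Python as follows. The stats tuple (count, sum, min, max) is an illustrative choice of decomposable statistics, not necessarily the paper's exact statistics set; the point is that each new leaf combines the k precomputed tuples of the old leaves it merges instead of rescanning the raw values.

```python
def leaf_stats(values):
    """Per-leaf statistics computed once from the raw values."""
    return (len(values), sum(values), min(values), max(values))

def merge_stats(stats_list):
    """Aggregate k per-leaf stats tuples in O(k), without touching raw data."""
    counts, sums, mins, maxs = zip(*stats_list)
    return (sum(counts), sum(sums), min(mins), max(maxs))

# toy usage: l = 4 old leaves merged pairwise into l' = 2 new leaves
old_leaves = [[1, 2], [3, 4], [5, 6], [7, 8]]
old_stats = [leaf_stats(v) for v in old_leaves]
k = 2
new_stats = [merge_stats(old_stats[i:i + k])
             for i in range(0, len(old_stats), k)]
```

Only statistics that decompose over a partition (counts, sums, extrema, and quantities derived from them) can be merged this way; holistic statistics such as medians would still require the raw values.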