
Semantic Web 0 (2015) 1–0

A Hierarchical Aggregation Framework for Efficient Multilevel Visual Exploration and Analysis

Nikos Bikakis a,b,*, George Papastefanatos b, Melina Skourla a and Timos Sellis c
a National Technical University of Athens, Greece
b ATHENA Research Center, Greece
c Swinburne University of Technology, Australia
* Corresponding author. E-mail: bikakis@dblab.ntua.gr
To appear in Semantic Web Journal (SWJ), 2016.

Abstract. Data exploration and visualization systems are of great importance in the Big Data era, in which the volume and heterogeneity of available information make it difficult for humans to manually explore and analyse data. Most traditional systems operate in an offline way, limited to accessing preprocessed (static) sets of data. They also restrict themselves to small dataset sizes, which can be easily handled with conventional techniques. However, the Big Data era has brought about a great amount and variety of big datasets that are dynamic in nature; most of them offer API or query endpoints for online access, or the data is received in a stream fashion. Therefore, modern systems must address the challenge of on-the-fly scalable visualizations over large dynamic sets of data, offering efficient exploration techniques, as well as mechanisms for information abstraction and summarization. Further, they must take into account different user-defined exploration scenarios and user preferences. In this work, we present a generic model for personalized multilevel exploration and analysis over large dynamic sets of numeric and temporal data. Our model is built on top of a lightweight tree-based structure which can be efficiently constructed on-the-fly for a given set of data. This tree structure aggregates input objects into a hierarchical multiscale model. We define two versions of this structure, which adopt different data organization approaches, each well-suited to the exploration and analysis context. In the proposed structure, statistical computations can be efficiently performed on-the-fly. Considering different exploration scenarios over large datasets, the proposed model enables efficient multilevel exploration, offering incremental construction and prefetching via user interaction, as well as dynamic adaptation of the hierarchies based on user preferences. A thorough theoretical analysis is presented, illustrating the efficiency of the proposed methods. The presented model is realized in a web-based prototype tool, called SynopsViz, that offers multilevel visual exploration and analysis over Linked Data datasets. Finally, we provide a performance evaluation and an empirical user study employing real datasets.

Keywords: visual analytics, big data, multiscale, progressive, incremental indexing, linked data, multiresolution, visual aggregation, binning, adaptive, hierarchical navigation, personalized exploration, data reduction, summarization, SynopsViz

1. Introduction

Exploring, visualizing and analysing data is a core task for data scientists and analysts in numerous applications. Data exploration and visualization enable users to identify interesting patterns, infer correlations and causalities, and support sense-making activities over data, in ways that are not always possible with traditional data mining techniques [54,29]. This is of great importance in the Big Data era, where the volume and heterogeneity of available information make it difficult for humans to manually explore and analyse large datasets.
One of the major challenges in visual exploration is related to the large size that characterizes many datasets nowadays. Considering the visual information seeking mantra, "overview first, zoom and filter, then details on demand" [94], gaining an overview is a crucial task in the visual exploration scenario. However, offering an overview of a large dataset is an extremely challenging task. Information overloading is a common issue in large dataset visualization; hence, a basic requirement for the proposed approaches is to offer mechanisms for information abstraction and summarization.

The above challenges can be overcome by adopting hierarchical aggregation approaches (for simplicity, we also refer to them as hierarchical approaches) [36]. Hierarchical approaches allow the visual exploration of very large datasets in a multilevel fashion, offering an overview of a dataset, as well as an intuitive and usable way for finding specific parts within it. Particularly, in hierarchical approaches, the user first obtains an overview of the dataset (both its structure and a summary of its content) before proceeding to data exploration operations, such as roll-up and drill-down, filtering out a specific part of it, and finally retrieving details about the data. Therefore, hierarchical approaches directly support the visual information seeking mantra. Also, hierarchical approaches can effectively address the problem of information overloading, as they provide information abstraction and summarization.

A second challenge is related to the availability of APIs and query endpoints (e.g., SPARQL) for online data access, as well as to the cases where data is received in a stream fashion. These settings pose the challenge of handling large sets of data in a dynamic setting; as a result, a preprocessing phase, such as traditional indexing, is not feasible. In this respect, modern techniques must offer scalability and efficient processing for on-the-fly analysis and visualization of dynamic datasets.

Finally, the requirement for on-the-fly visualization must be coupled with the diversity of preferences and requirements posed by different users and tasks. Therefore, the proposed approaches should provide the ability to customize the exploration experience, allowing users to organize data in different ways, according to the type of information or the level of detail they wish to explore.

Considering the general problem of exploring big data [95,81,18,54,49,43], most approaches aim at providing appropriate summaries and abstractions over the enormous number of available data objects. In this respect, a large number of systems adopt approximation techniques (a.k.a. data reduction techniques) in which partial results are computed. Existing approaches are mostly based on: (1) sampling and filtering [39,83,2,67,55,13] and/or (2) aggregation (e.g., binning, clustering) [36,59,44,58,78,113,12,77,1,57]. Similarly, some modern database-oriented systems adopt approximation techniques using query-based approaches (e.g., query translation, query rewriting) [13,59,58,108,114]. Recently, incremental approximation techniques have also been adopted; in these approaches, approximate answers are computed over progressively larger samples of the data [39,2,55]. In a different context, an adaptive indexing approach is used in [118], where the indexes are created incrementally and adaptively throughout exploration.
Further, in order to improve performance, many systems exploit caching and prefetching techniques [101,61,56,12,25,66,32]. Finally, other approaches adopt parallel architectures [35,63,62,55].

Addressing the aforementioned challenges, in this work we introduce a generic model that combines personalized multilevel exploration with online analysis of numeric and temporal data. At its core lies a lightweight hierarchical aggregation model, constructed on-the-fly for a given set of data. The proposed model is a tree-based structure that aggregates data objects into multiple levels of hierarchically related groups, based on the numeric or temporal values of the objects. Our model also enriches groups (i.e., aggregations/summaries) with statistical information regarding their content, offering richer overviews and insights into the detailed data. An additional feature is that it allows users to organize data exploration in different ways, by parameterizing the number of groups, the range and cardinality of their contents, the number of hierarchy levels, and so on. On top of this model, we propose three user exploration scenarios and present two methods for efficient exploration over large datasets: the first achieves the incremental construction of the model based on user interaction, whereas the second enables dynamic and efficient adaptation of the model to the user's preferences. The efficiency of the proposed model is illustrated through a thorough theoretical analysis, as well as an experimental evaluation. Finally, the proposed model is realized in a web-based tool, called SynopsViz, that offers a variety of visualization techniques (e.g., charts, timelines) for multilevel visual exploration and analysis over Linked Data (LD) datasets.

Contributions. The main contributions of this work are summarized as follows.
− We introduce a generic model for organizing, exploring, and analysing numeric and temporal data in a multilevel fashion.
− We implement our model as a lightweight, main-memory, tree-based structure, which can be efficiently constructed on-the-fly.
− We propose two tree structure versions, which adopt different approaches for data organization.
− We describe a simple method to estimate the tree construction parameters, when no user preferences are available.
− We define different exploration scenarios, assuming various user exploration preferences.
− We introduce a method that incrementally constructs and prefetches the hierarchy tree via user interaction.
− We propose an efficient method that dynamically adapts an existing hierarchy to a new one, considering the user's preferences.
− We present a thorough theoretical analysis, illustrating the efficiency of the proposed model.
− We develop a prototype system which implements the presented model, offering multilevel visual exploration and analysis over LD.
− We conduct a thorough performance evaluation and an empirical user study, using the DBpedia 2014 dataset.

Outline. The remainder of this paper is organized as follows. Section 2 presents the proposed hierarchical model, and Section 3 provides the exploration scenarios and methods for efficient hierarchical exploration. Then, Section 4 presents the SynopsViz tool and demonstrates its basic functionality. The evaluation of our system is presented in Section 5. Section 6 reviews related work, while Section 7 concludes this paper.
2. The HETree Model

In this section we present HETree (Hierarchical Exploration Tree), a generic model for organizing, exploring, and analysing numeric and temporal data in a multilevel fashion. Particularly, HETree is defined in the context of multilevel (visual) exploration and analysis. The proposed model hierarchically organizes arbitrary numeric and temporal data, without requiring that the data be described by a hierarchical schema. We should note that our model is not bound to any specific type of visualization; rather, it can be adopted by several "flat" visualization techniques (e.g., charts, timelines), offering scalable and multilevel exploration over non-hierarchical data.

In what follows, we present some basic aspects of our working scenario (i.e., the visual exploration and analysis scenario) and highlight the main assumptions and requirements employed in the construction of our model. First, the input data in our scenario can be retrieved directly from a database, but may also be produced dynamically, e.g., from a query or from data filtering (e.g., faceted browsing). Thus, we consider that data visualization is performed online; i.e., we do not assume an offline preprocessing phase in the construction of the visualization model. Second, users can specify different requirements or preferences with respect to the data organization. For example, a user may prefer to organize the data as a deep hierarchy for a specific task, while for another task a flat hierarchical organization is more appropriate. Therefore, even if the data is not dynamically produced, the data organization is dynamically adapted to the user preferences. The same also holds for any additional information (e.g., statistical information) that is computed for each group of objects. This information must be recomputed when the groups of objects (i.e., the data organization) are modified. From the above, a basic requirement is that the model must be constructed on-the-fly for any given data and user preferences. Therefore, we implement our model as a lightweight, main-memory tree structure, which can be efficiently constructed on-the-fly. We define two versions of this tree structure, following data organization approaches well-suited to the visual exploration and analysis context: the first version considers fixed-range groups of data objects, whereas the second considers fixed-size groups. Finally, our structure allows efficient on-the-fly statistical computations, which are extremely valuable for the hierarchical exploration and analysis scenario.

The basic idea of our model is to hierarchically group data objects based on the values of one of their properties. Input data objects are stored at the leaves, while internal nodes aggregate their child nodes. The root of the tree represents (i.e., aggregates) the whole dataset. The basic concepts of our model can be considered similar to a simplified version of a static 1D R-Tree [45]. Regarding the visual representation of the model and data exploration, we consider that both sets of data objects (leaf node contents) and entities representing groups of objects (leaf or internal nodes) are visually represented, enabling the user to explore the data in a hierarchical manner. Note that our tree structure organizes data in a hierarchical model, without setting any constraints on the way the user interacts with these hierarchies.
As such, different strategies can be adopted, regarding both the traversal policy and the nodes of the tree that are rendered in each visualization stage.

In the rest of this section, preliminaries are presented in Section 2.1. In Section 2.2, we introduce the proposed tree structure. Sections 2.3 and 2.4 present the two versions of the structure. Finally, Section 2.5 discusses the specification of the parameters required for the tree construction, and Section 2.6 presents how statistics computations can be performed over the tree.

2.1. Preliminaries

In this work we formalize data objects as RDF triples. However, the presented methods are generic and can be applied to any data objects with numeric or temporal attributes. Hence, in the following, the terms triple and (data) object are used interchangeably. We consider an RDF dataset R consisting of a set of RDF triples. As input data, we assume a set of RDF triples D, where D ⊆ R and the triples in D have as objects either numeric (e.g., integer, decimal) or temporal values (e.g., date, time). Let tr be an RDF triple; tr.s, tr.p and tr.o denote, respectively, the subject, predicate and object of tr. Given input data D, S is an ordered set of RDF triples, produced from D, where triples are sorted based on their objects' values, in ascending order. S[i] denotes the i-th triple, with S[1] being the first triple. Then, for each i < j, we have that S[i].o ≤ S[j].o. Also, D = S as sets; i.e., for each tr, tr ∈ D iff tr ∈ S.

Figure 1 presents a set of 10 RDF triples, representing persons and their ages. In Figure 1, we assume that the subjects p0–p9 are instances of a class Person and the predicate age is a datatype property with integer range.

Fig. 1. Running example input data (data objects): p0 age 35; p1 age 100; p2 age 55; p3 age 37; p4 age 30; p5 age 35; p6 age 45; p7 age 80; p8 age 20; p9 age 50

Example 1. In Figure 1, given the RDF triple tr = p0 age 35, we have that tr.s = p0, tr.p = age and tr.o = 35. Also, given that all triples comprise the input data D, and S is the ordered set of D based on the object values in ascending order, we have that S[1] = p8 age 20 and S[10] = p1 age 100.

Assume an interval I = [a, b], where a, b ∈ R; then, I = {k ∈ R | a ≤ k ≤ b}. Similarly, for I = [a, b), we have that I = {k ∈ R | a ≤ k < b}. Let I− and I+ denote the lower and upper bound of the interval I, respectively. That is, given I = [a, b], then I− = a and I+ = b. The length of an interval I is defined as |I+ − I−|.

In this work we assume rooted trees. The number of children of a node is its degree. Nodes with degree 0 are called leaf nodes. Moreover, any non-leaf node is called an internal node. Sibling nodes are nodes that have the same parent. The level of a node is defined by letting the root node be at level zero. Additionally, the height of a node is the length of the longest path from the node to a leaf; a leaf node has a height of 0. The height of a tree is the maximum level of any node in the tree. The degree of a tree is the maximum degree of a node in the tree. An ordered tree is a tree where the children of each node are ordered. A tree is called an m-ary tree if every internal node has no more than m children. A full m-ary tree is a tree where every internal node has exactly m children. A perfect m-ary tree is a full m-ary tree in which all leaves are at the same level.
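For concreteness, the following minimal Python sketch (our own illustration, not the paper's implementation) shows how the ordered set S of Example 1 can be produced; the Triple type and its field names are assumptions of the sketch, and Python lists are 0-based whereas the paper indexes S from 1.

from typing import NamedTuple, List

class Triple(NamedTuple):
    s: str    # subject
    p: str    # predicate
    o: float  # numeric object value (temporal values can be mapped to numbers)

# Running example data of Figure 1
D: List[Triple] = [
    Triple("p0", "age", 35), Triple("p1", "age", 100), Triple("p2", "age", 55),
    Triple("p3", "age", 37), Triple("p4", "age", 30), Triple("p5", "age", 35),
    Triple("p6", "age", 45), Triple("p7", "age", 80), Triple("p8", "age", 20),
    Triple("p9", "age", 50),
]

S = sorted(D, key=lambda tr: tr.o)  # ordered set S (ascending object values)
assert S[0] == Triple("p8", "age", 20) and S[-1] == Triple("p1", "age", 100)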
2.2. The HETree Structure

In this section, we present the HETree structure in more detail. HETree hierarchically organizes numeric and temporal data into groups; intervals are used to represent these groups.1 HETree is defined by the tree degree and the number of leaf nodes.2 Essentially, the number of leaf nodes corresponds to the number of groups into which the input data objects are organized, while the tree degree corresponds to the (maximum) number of groups into which a group is split at the lower level.

Given a set of data objects (RDF triples) D, a positive integer ℓ denoting the number of leaf nodes, and a positive integer d denoting the tree degree, an HETree(D, ℓ, d) is an ordered d-ary tree, with the following basic properties.
− The tree has exactly ℓ leaf nodes.
− All leaf nodes appear at the same level.
− Each leaf node contains a set of data objects, sorted in ascending order based on their values. Given a leaf node n, n.data denotes the data objects contained in n.
− Each internal node has at most d child nodes. Let n be an internal node; n.ci denotes the i-th child of n, with n.c1 being the leftmost child.
− Each node corresponds to an interval. Given a node n, n.I denotes the interval of n.
− At each level, all nodes are sorted based on the lower bounds of their intervals. That is, let n be an internal node; for any i < j, we have that n.ci.I− ≤ n.cj.I−.
− For a leaf node, its interval is bounded by the values of the objects included in it. Let n be the leftmost leaf node, and assume that n contains x objects from D. Then, n.I− = S[1].o and n.I+ = S[x].o, where S is the ordered object set resulting from D.
− For an internal node, its interval is bounded by the union of the intervals of its children. That is, let n be an internal node with k child nodes; then, n.I− = n.c1.I− and n.I+ = n.ck.I+.

1 Note that our structure handles numeric and temporal data in a similar manner. Also, other types of one-dimensional data may be supported, with the requirement that a total order can be defined over the data.
2 Note that, following a similar approach, the HETree can also be defined by specifying the tree height instead of the degree or the number of leaves.

Furthermore, we present two different approaches for organizing the data in the HETree. Assume a scenario in which a user wishes to (visually) explore and analyse the historic events from DBpedia [8], per decade. In this case, the user orders historic events by their date and organizes them into groups of equal ranges (i.e., decades). In a second scenario, assume that a user wishes to analyse, in the Eurostat dataset, the gross domestic product (GDP) organized into fixed-size groups of countries. In this case, the user is interested in finding information such as the range and the variance of the GDP values over the top-10 countries with the highest GDP. In this scenario, the user orders countries by their GDP and organizes them into groups of equal size (i.e., 10 countries per group).

In the first approach, we organize data objects into groups, where the object values of each group cover an equal range of values. In the second approach, we organize objects into groups, where each group contains the same number of objects. In the following sections, we present in detail the two approaches for organizing the data in the HETree.

Fig. 2. A Content-based HETree (HETree-C): root a [20, 100]; internal nodes b [20, 45] and c [50, 100]; leaves d [20, 30], e [35, 35], f [37, 45], g [50, 55], h [80, 100]
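To make the structure concrete, here is a minimal sketch of one possible in-memory node representation, continuing the earlier Python sketch; the class and field names (Node, lo, hi, children, data) are our own and not taken from the paper.

from typing import List

class Node:
    def __init__(self, lo: float, hi: float):
        self.lo = lo            # n.I-  (lower bound of the node's interval)
        self.hi = hi            # n.I+  (upper bound of the node's interval)
        self.children: List["Node"] = []  # ordered child nodes (internal nodes)
        self.data: list = []              # sorted data objects (leaf nodes)

    @property
    def is_leaf(self) -> bool:
        return not self.children

def make_parent(children: List[Node]) -> Node:
    # An internal node's interval is bounded by its children's intervals:
    # n.I- = n.c1.I- and n.I+ = n.ck.I+.
    parent = Node(children[0].lo, children[-1].hi)
    parent.children = children
    return parent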
2.3. A Content-based HETree (HETree-C)

In this section we introduce the first version of the HETree, named HETree-C (Content-based HETree). This HETree version organizes data into equally sized groups. The basic property of the HETree-C is that each leaf node contains approximately the same number of objects, and the content (i.e., the objects) of a leaf node specifies its interval. For the tree construction, the objects are first assigned to the leaves and then the intervals are defined.

An HETree-C(D, ℓ, d) is an HETree with the following extra property. Each leaf node contains either λ or λ − 1 objects, where λ = ⌈|D|/ℓ⌉.3 Particularly, the ℓ − (λ·ℓ − |D|) leftmost leaves contain λ objects each, while the rest of the leaves contain λ − 1 objects each.4 We can equivalently define the HETree-C by providing the number of objects per leaf λ, instead of the number of leaves ℓ.

Example 2. Figure 2 presents an HETree-C constructed by considering the set of objects D from Figure 1, with ℓ = 5 and d = 3. As we can observe, all leaf nodes contain an equal number of objects. Particularly, we have that λ = ⌈10/5⌉ = 2. Regarding the leftmost interval, we have d.I− = 20 and d.I+ = 30.

3 We assume that the number of objects is at least the number of leaves; i.e., |D| ≥ ℓ.
4 Alternatively, we can construct the HETree-C so that each leaf contains λ objects, except the rightmost leaf, which contains between 1 and λ objects.

Algorithm 1. createHETree-C/R(D, ℓ, d)
Input: D: set of objects; ℓ: number of leaf nodes; d: tree degree
Output: r: root node of the HETree
1  S ← sort D based on objects' values
2  L ← constrLeaves-C/R(S, ℓ)
3  r ← constrInterNodes(L, d)
4  return r

Procedure 1: constrLeaves-C(S, ℓ)
Input: S: ordered set of objects; ℓ: number of leaf nodes
Output: L: ordered set of leaf nodes
1  λ ← ⌈|S|/ℓ⌉
2  k ← ℓ − (λ·ℓ − |S|)
3  beg ← 1
4  for i ← 1 to ℓ do
5      create an empty leaf node n
6      if i ≤ k then
7          num ← λ
8      else
9          num ← λ − 1
10     end ← beg + num − 1
11     for t ← beg to end do
12         n.data ← S[t]
13     n.I− ← S[beg].o
14     n.I+ ← S[end].o
15     L[i] ← n
16     beg ← end + 1
17 return L

2.3.1. The HETree-C Construction

We construct the HETree-C in a bottom-up way. Algorithm 1 describes the HETree-C construction. Initially, the algorithm sorts the object set D in ascending order, based on the objects' values (line 1). Then, the algorithm uses two procedures to construct the tree nodes. Finally, the root node of the constructed tree is returned (line 4).

The constrLeaves-C procedure (Procedure 1) constructs ℓ leaf nodes (lines 4–16). For the first k leaves, λ objects are inserted, while for the rest of the leaves, λ − 1 objects are inserted (lines 6–9). Finally, the set of created leaf nodes is returned (line 17).

The constrInterNodes procedure (Procedure 2) builds the internal nodes in a recursive manner. For the nodes H, their parent nodes P are created (lines 4–16); then, the procedure calls itself using as input the parent nodes P (line 21). The recursion terminates when the number of created parent nodes is equal to one (line 17); i.e., when the root of the tree has been created.
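The leaf-construction step of Procedure 1 is easy to express in code. The following sketch (our own, reusing the illustrative Triple and Node classes and the ordered set S from the earlier snippets) chunks the sorted objects into the k leftmost leaves of size λ and the remaining leaves of size λ − 1.

import math
from typing import List

def constr_leaves_c(S: List[Triple], num_leaves: int) -> List[Node]:
    # Sketch of Procedure 1 (constrLeaves-C): equally sized groups.
    lam = math.ceil(len(S) / num_leaves)           # objects per "large" leaf
    k = num_leaves - (lam * num_leaves - len(S))   # number of leaves holding lam objects
    leaves, beg = [], 0
    for i in range(num_leaves):
        num = lam if i < k else lam - 1
        chunk = S[beg:beg + num]
        leaf = Node(chunk[0].o, chunk[-1].o)  # leaf interval bounded by its contents
        leaf.data = list(chunk)
        leaves.append(leaf)
        beg += num
    return leaves

# Example 2: l = 5 leaves over the 10 objects of Figure 1 gives lambda = 2;
# the leftmost leaf covers [20, 30].
leaves = constr_leaves_c(S, 5)
assert (leaves[0].lo, leaves[0].hi) == (20, 30)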
Computational Analysis. The computational cost for the HETree-C construction (Algorithm 1) is the sum of three parts. The first is sorting the input data, which in the worst case can be done in O(|D|·log|D|), employing a linearithmic sorting algorithm (e.g., mergesort). The second part is the constrLeaves-C procedure, which requires O(|D|) for scanning all data objects. The third part is the constrInterNodes procedure, which requires d·(ℓ/d + ℓ/d² + ℓ/d³ + … + 1), with the sum being the number of internal nodes in the tree. Note that the maximum number of internal nodes in a d-ary tree corresponds to the number of internal nodes in a perfect d-ary tree of the same height. Also, note that the number of internal nodes of a perfect d-ary tree of height h is (dʰ − 1)/(d − 1). In our case, the height of the tree is h = ⌈log_d ℓ⌉. Hence, the maximum number of internal nodes is (d^⌈log_d ℓ⌉ − 1)/(d − 1) ≤ (d·ℓ − 1)/(d − 1). Therefore, the constrInterNodes procedure in the worst case requires O((d²·ℓ − d)/(d − 1)). Hence, the overall computational cost for the HETree-C construction in the worst case is O(|D|·log|D| + |D| + (d²·ℓ − d)/(d − 1)) = O(|D|·log|D| + (d²·ℓ − d)/(d − 1)).5

5 In the complexity computations presented throughout the paper, terms that are dominated by others (i.e., having a lower growth rate) are omitted.

Procedure 2: constrInterNodes(H, d)
Input: H: ordered set of nodes; d: tree degree
Output: r: root node for H
Variables: P: ordered set of H's parent nodes
1  pnum ← ⌈|H|/d⌉   //number of parent nodes
2  t ← d − (pnum·d − |H|)   //last parent's number of children
3  cbeg ← 1   //first child node
4  for p ← 1 to pnum do
5      create an empty internal node n
6      if p = pnum then
7          cnum ← t
8      else
9          cnum ← d   //number of children
10     cend ← cbeg + cnum − 1   //last child node
11     for j ← cbeg to cend do
12         n.c[j] ← H[j]
13     n.I− ← H[cbeg].I−
14     n.I+ ← H[cend].I+
15     P[p] ← n
16     cbeg ← cend + 1
17 if pnum = 1 then
18     r ← P[1]
19     return r
20 else
21     return constrInterNodes(P, d)

2.4. A Range-based HETree (HETree-R)

The second version of the HETree is called HETree-R (Range-based HETree). HETree-R organizes data into equally ranged groups. The basic property of the HETree-R is that each leaf node covers an equal range of values. Therefore, in HETree-R, the data space defined by the objects' values is equally divided over the leaves. As opposed to HETree-C, in HETree-R the interval of a leaf specifies its content. Therefore, for the HETree-R construction, the intervals of all leaves are first defined and then the objects are inserted.

An HETree-R(D, ℓ, d) is an HETree with the following extra property. The interval of each leaf node has the same length; i.e., it covers an equal range of values. Formally, let S be the sorted set resulting from D; the interval of each leaf node has length ρ, where ρ = |S[|S|].o − S[1].o| / ℓ.6 Therefore, for a leaf node n, we have that |n.I+ − n.I−| = ρ. For example, for the leftmost leaf, the interval is [S[1].o, S[1].o + ρ). The HETree-R is equivalently defined by providing the interval length ρ, instead of the number of leaves ℓ.

6 We assume here that there is at least one object in D with a different value than the rest of the objects.

Example 3. Figure 3 presents an HETree-R constructed by considering the set of objects D (Figure 1), ℓ = 5 and d = 3. As we can observe from Figure 3, each leaf node covers an equal range of values. Particularly, the interval of each leaf must have length ρ = |100 − 20|/5 = 16.

Fig. 3. A Range-based HETree (HETree-R): root a [20, 100]; internal nodes b [20, 68) and c [68, 100]; leaves d [20, 36), e [36, 52), f [52, 68), g [68, 84), h [84, 100]

2.4.1. The HETree-R Construction

This section studies the construction of the HETree-R structure; HETree-R is also constructed in a bottom-up fashion.

Procedure 3: constrLeaves-R(S, ℓ)
Input: S: ordered set of objects; ℓ: number of leaf nodes
Output: L: ordered set of leaf nodes
1  ρ ← |S[|S|].o − S[1].o| / ℓ
2  for i ← 1 to ℓ do
3      create an empty leaf node n
4      if i = 1 then
5          n.I− ← S[1].o
6      else
7          n.I− ← L[i − 1].I+
8      n.I+ ← n.I− + ρ
9      L[i] ← n
10 for t ← 1 to |S| do
11     j ← min(ℓ, ⌊(S[t].o − S[1].o)/ρ⌋ + 1)
12     L[j].data ← S[t]
13 return L

Similarly to the HETree-C version, Algorithm 1 is used for the HETree-R construction. The only difference is the constrLeaves-R procedure (line 2), which creates the leaf nodes of the HETree-R and is presented in Procedure 3. The procedure constructs ℓ leaf nodes and assigns equal-length intervals to them (lines 2–9); it then traverses all objects in S (lines 10–12) and places each one in the appropriate leaf node (line 12). Finally, it returns the set of created leaves (line 13).

Computational Analysis. The computational cost for the HETree-R construction (Algorithm 1), for sorting the input data (line 1) and creating the internal nodes (line 3), is the same as in the HETree-C case. The constrLeaves-R procedure (line 2) requires O(ℓ + |D|) = O(|D|) (since |D| ≥ ℓ). Using the computational costs for the first and the third part from Section 2.3.1, we have that, in the worst case, the overall computational cost for the HETree-R construction is O(|D|·log|D| + |D| + (d²·ℓ − d)/(d − 1)) = O(|D|·log|D| + (d²·ℓ − d)/(d − 1)).
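A corresponding Python sketch of Procedure 3, again reusing the illustrative Triple and Node classes and the ordered set S (0-based lists), might look as follows.

def constr_leaves_r(S: List[Triple], num_leaves: int) -> List[Node]:
    # Sketch of Procedure 3 (constrLeaves-R): equally ranged groups.
    rho = abs(S[-1].o - S[0].o) / num_leaves     # leaf interval length
    leaves, lo = [], S[0].o
    for _ in range(num_leaves):
        leaves.append(Node(lo, lo + rho))        # interval [lo, lo + rho)
        lo += rho
    for tr in S:
        # Bin each object; the maximum value falls into the last (closed) leaf.
        j = min(num_leaves - 1, int((tr.o - S[0].o) // rho))
        leaves[j].data.append(tr)
    return leaves

# Example 3: l = 5 over the data of Figure 1 gives rho = (100 - 20)/5 = 16.
r_leaves = constr_leaves_r(S, 5)
assert (r_leaves[0].lo, r_leaves[0].hi) == (20, 36)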
2.5. Estimating the HETree Parameters

In our working scenario, the user specifies the parameters required for the HETree construction (e.g., the number of leaves ℓ). In this section, we describe our approach for automatically calculating the HETree parameters based on the input data, when no user preferences are provided. Our goal is to derive the parameters from the input data, such that the resulting HETree addresses some basic guidelines set by the visualization environment. In what follows, we discuss the proposed approach in detail.

An important parameter in hierarchical visualizations is the minimum and maximum number of objects that can be effectively rendered at the most detailed level.7 In our case, these numbers correspond to the number of objects contained in the leaf nodes. The proper calculation of these numbers is crucial, so that the resulting tree avoids overloaded visualizations. Therefore, in the HETree construction, our approach considers the minimum and the maximum number of objects per leaf node, denoted as λmin and λmax, respectively. Besides the number of objects rendered at the lowest level, our approach considers perfect m-ary trees, such that a more "uniform" structure results (i.e., all internal nodes have exactly m child nodes). The following example illustrates our approach for calculating the HETree parameters.

7 Similar bounds can also be defined for other tree levels.

Example 4. Assume that, based on an adopted visualization technique, the ideal number of data objects to be rendered on a specific screen is between 25 and 50. Hence, we have that λmin = 25 and λmax = 50. Now, let us assume that we want to visualize the object set D1, using an HETree-C, where |D1| = 500. Based on the number of objects and the λ bounds, we can estimate the bounds for the number of leaves. Let ℓmin and ℓmax denote the lower and the upper bound for the number of leaves. Therefore, we have that ⌈|D1|/λmax⌉ ≤ ℓ ≤ ⌊|D1|/λmin⌋ ⇔ ⌈500/50⌉ ≤ ℓ ≤ ⌊500/25⌋ ⇔ 10 ≤ ℓ ≤ 20.

Hence, our HETree-C should have between ℓmin = 10 and ℓmax = 20 leaf nodes. Since we consider perfect m-ary trees, from Table 1 we can identify the tree settings that conform to this guideline on the number of leaves. The candidate setting (i.e., number of leaves and degree) is indicated in Table 1 using dark-grey colour. Note that settings with d = 2 are not examined, since visualizing two groups of objects at each level is considered too small a number under most visualization settings. Hence, in any case we only assume settings with d ≥ 3 and height ≥ 2. Therefore, an HETree-C with ℓ = 16 and d = 4 is a suitable structure for our case.

Now, let us assume that we want to visualize the object set D2, where |D2| = 1000. Following a similar approach, we have that 20 ≤ ℓ ≤ 40. The candidate settings are indicated in Table 1 using light-grey colour. Hence, the following settings satisfy the considered guideline: S1: ℓ = 27, d = 3; S2: ℓ = 25, d = 5; and S3: ℓ = 36, d = 6.

Table 1. Number of leaf nodes (ℓ = dʰ) for perfect m-ary trees

Height   Degree 3   Degree 4   Degree 5   Degree 6
1        3          4          5          6
2        9          16         25         36
3        27         64         125        216
4        81         256        625        1296
5        243        1024       3125       7776
6        729        4096       15625      46656

In the case where more than one setting satisfies the considered guideline, we select the preferable one according to the following set of rules. From the candidate settings, we prefer the setting which results in the highest tree (1st Criterion).8 In case the highest tree is constructed by more than one setting, we consider the distance c between ℓ and the centre of ℓmin and ℓmax (2nd Criterion); i.e., c = |ℓ − (ℓmin + ℓmax)/2|. The setting with the lowest c value is selected. Note that, based on the visualization context, different criteria and preferences may be followed.

8 Depending on user preferences and the examined task, the shortest tree may be preferable. For example, starting from the root, the user may wish to reach the data objects (i.e., the lowest level) by performing the smallest possible number of drill-down operations.

In our example, from the candidate settings, setting S1 is selected, since it constructs the highest tree (i.e., height = 3). On the other hand, settings S2 and S3 construct trees with lower heights (i.e., height = 2). Now, assume a scenario where only S2 and S3 are candidates. In this case, since both settings result in trees of equal height, the 2nd Criterion is considered. For S2 we have c2 = |25 − (20 + 40)/2| = 5; similarly, for S3, c3 = |36 − (20 + 40)/2| = 6. Therefore, between S2 and S3, setting S2 is preferable, since c2 < c3.

In the case of HETree-R, a similar approach is followed, assuming a normal distribution over the values of the objects.
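The parameter selection of Example 4 maps directly to code. The following sketch (our own; the function name and the bound on the enumerated degrees are assumptions) enumerates perfect d-ary settings with d ≥ 3 and height ≥ 2 and applies the two criteria.

import math

def estimate_parameters(n_objects: int, lam_min: int, lam_max: int):
    # Sketch of the parameter estimation of Section 2.5.
    l_min = math.ceil(n_objects / lam_max)   # lower bound on number of leaves
    l_max = n_objects // lam_min             # upper bound on number of leaves
    candidates = []
    for d in range(3, 11):                   # d >= 3; the upper limit is arbitrary
        h = 2                                # height >= 2
        while d ** h <= l_max:               # perfect d-ary tree: l = d^h
            if d ** h >= l_min:
                candidates.append((d ** h, d, h))
            h += 1
    # 1st Criterion: highest tree; 2nd Criterion: l closest to the
    # centre of [l_min, l_max].
    centre = (l_min + l_max) / 2
    return min(candidates, key=lambda c: (-c[2], abs(c[0] - centre)))

assert estimate_parameters(500, 25, 50) == (16, 4, 2)    # Example 4, set D1
assert estimate_parameters(1000, 25, 50) == (27, 3, 3)   # Example 4, set D2 (S1)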
2.6. Statistics Computations over HETree

Data statistics are a crucial aspect in the context of hierarchical visual exploration and analysis. Statistical information over groups of objects (i.e., aggregations) offers rich insights into the underlying (i.e., aggregated) data. In this way, useful information regarding different sets of objects with common characteristics is provided. Additionally, this information may also guide the users through their navigation over the hierarchy.

In this section, we present how statistics computation is performed over the nodes of the HETree. Statistics computations exploit two main aspects of the HETree structure: (1) the internal nodes aggregate their child nodes; and (2) the tree is constructed in a bottom-up fashion. Statistics computation is performed during the tree construction; for the leaf nodes, we gather statistics from the objects they contain, whereas for the internal nodes we aggregate the statistics of their children.

For simplicity, we assume here that each node contains the following extra fields, used for simple statistics computations, although more complex or RDF-related statistics (e.g., most common subject, subject with the minimum value, etc.) can be computed. Given a node n, n.N denotes the number of objects covered by n; n.µ and n.σ² denote the mean and the variance of the objects' values covered by n, respectively. Additionally, we assume the minimum and the maximum values, denoted as n.min and n.max, respectively. Statistics computations can be easily performed within the construction algorithms (Algorithm 1) without any modifications. The following example illustrates these computations.

Example 5. In this example we assume the HETree-C presented in Figure 2. Figure 4 shows the HETree-C with the computed statistics in each node. When all the leaf nodes have been constructed, the statistics for each leaf are computed. For instance, we can see from Figure 4 that for the rightmost leaf h we have: h.N = 2, h.µ = (80 + 100)/2 = 90 and h.σ² = (1/2)·((80 − 90)² + (100 − 90)²) = 100. Also, we have h.min = 80 and h.max = 100. Following the above process, we compute the statistics for all leaf nodes.

Then, for each parent node we construct, we compute its statistics using the already computed statistics of its child nodes. Consider the internal node c, with child nodes g and h; we have that c.min = 50 and c.max = 100. Also, c.N = g.N + h.N = 2 + 2 = 4. We compute the mean value by combining the children's mean values: c.µ = (g.N·g.µ + h.N·h.µ)/(g.N + h.N) = (2·52.5 + 2·90)/(2 + 2) = 71.3. Similarly, for the variance we have c.σ² = (g.N·g.σ² + h.N·h.σ² + g.N·(g.µ − c.µ)² + h.N·(h.µ − c.µ)²)/(g.N + h.N) = (2·6.25 + 2·100 + 2·(52.5 − 71.3)² + 2·(90 − 71.3)²)/(2 + 2) = 404.7. A similar approach is also followed in the case of HETree-R.

Fig. 4. Statistics computation over HETree: root a (N = 10, µ = 48.7, σ² = 535.2); internal nodes b (N = 6, µ = 33.7, σ² = 57.2) and c (N = 4, µ = 71.3, σ² = 404.7); leaves d (N = 2, µ = 25, σ² = 25), e (N = 2, µ = 35, σ² = 0), f (N = 2, µ = 41, σ² = 16), g (N = 2, µ = 52.5, σ² = 6.25), h (N = 2, µ = 90, σ² = 100)
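The bottom-up aggregation of Example 5 corresponds to the standard formulas for combining group means and (population) variances; a small sketch in Python (names our own):

from dataclasses import dataclass
from typing import List

@dataclass
class Stats:
    N: int        # number of covered objects
    mu: float     # mean
    var: float    # (population) variance
    min: float
    max: float

def merge(children: List[Stats]) -> Stats:
    # Aggregate child statistics into the parent, as in Example 5: the parent
    # mean is the weighted mean of the children, and the parent variance
    # combines the child variances with the squared mean offsets.
    N = sum(c.N for c in children)
    mu = sum(c.N * c.mu for c in children) / N
    var = sum(c.N * c.var + c.N * (c.mu - mu) ** 2 for c in children) / N
    return Stats(N, mu, var,
                 min(c.min for c in children),
                 max(c.max for c in children))

g = Stats(2, 52.5, 6.25, 50, 55)
h = Stats(2, 90.0, 100.0, 80, 100)
c = merge([g, h])   # node c of Example 5: N = 4, mu = 71.25 (~71.3), var ~ 404.7
assert c.N == 4 and abs(c.mu - 71.25) < 1e-9 and abs(c.var - 404.6875) < 1e-9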
Computational Analysis. Most well-known statistics (e.g., mean, variance, skewness) can be computed in time linear in the number of elements. Therefore, the computation cost over a set of numeric values S is O(|S|). Assume a leaf node n containing k objects; then the cost of the statistics computations for n is O(k), and the cost for all leaf nodes is O(|D|). Let n be an internal node; then the cost for n is O(d), since the statistics of n are computed by aggregating the statistics of its d child nodes. Considering that (d·ℓ − 1)/(d − 1) is the maximum number of internal nodes (Section 2.3.1), in the worst case the cost for the internal nodes is O((d²·ℓ − d)/(d − 1)). Therefore, the overall cost for statistics computations over an HETree is O(|D| + (d²·ℓ − d)/(d − 1)).

3. Efficient Multilevel Exploration

In this section, we exploit the HETree structure in order to efficiently handle different multilevel exploration scenarios. Essentially, we propose two methods for efficient hierarchical exploration over large datasets. The first method incrementally constructs the hierarchy via user interaction; the second achieves dynamic adaptation of the data organization based on the user's preferences.

3.1. Exploration Scenarios

In a typical multilevel exploration scenario, referred to here as the Basic exploration scenario (BSC), the user explores a dataset in a top-down fashion. The user first obtains an overview of the data through the root level, and then drills down to more fine-grained contents for accessing the actual data objects at the leaves. In BSC, the root of the hierarchy is the starting point of the exploration and, thus, the first element to be presented (i.e., rendered). The described scenario offers basic exploration capabilities; however, it does not cover use cases with user-specified starting points other than the root, such as starting the exploration from a specific resource, or from a specific range of values.

Consider the following example, in which the user wishes to explore the DBpedia infoboxes dataset to find places with very large population. Initially, she selects the populationTotal property and starts her exploration from the root node, moves down the right part of the tree and ends up at the rightmost leaf, which contains the most highly populated places. Then, she is interested in viewing the area size (i.e., the areaTotal property) for one of the highly populated places and, also, in exploring places with similar area size. Finally, she decides to explore places based on the water area size they contain (i.e., the areaWater property). In this case, she prefers to start her exploration from places whose water area size is within a given range of values.

In this example, besides the BSC scenario, we consider two additional exploration scenarios. In the Resource-based exploration scenario (RES), the user specifies a resource of interest (e.g., an IRI) and a specific property; the exploration starts from the leaf containing this resource and proceeds in a bottom-up fashion. Thus, in RES, the data objects contained in the same leaf as the resource of interest are presented first. We refer to that leaf as the leaf of interest. The third scenario, named Range-based exploration scenario (RAN), enables the user to start her exploration from an arbitrary point in the hierarchy by providing a range of values; the user starts from a set of internal nodes and can then move up or down the hierarchy. The RAN scenario begins by rendering all sibling nodes that are children of the node covering the specified range of interest; we refer to these nodes as nodes of interest. Note that, regarding the adopted rendering policy, in all scenarios we only consider nodes belonging to the same level.
That is, sibling nodes, or data objects contained in the same leaf, are rendered together. Regarding the navigation-related operations, the user can move down or up the hierarchy by performing a drill-down or a roll-up operation, respectively. A drill-down operation over a node n enables the user to focus on n and render its child nodes; if n is a leaf node, the set of data objects contained in n is rendered. Conversely, the user can perform a roll-up operation on a set of sibling nodes S; the parent node of S, along with the parent's sibling nodes, is rendered. Finally, a roll-up operation applied to a set of data objects O renders the leaf node that contains O along with its sibling leaves, whereas a drill-down operation is not applicable to a data object.

3.2. Incremental HETree Construction

In the Web of Data, a dataset might be dynamically retrieved from a remote site (e.g., via a SPARQL endpoint); as a result, in all exploration scenarios, we have assumed that the HETree is constructed on-the-fly at the time the user starts her exploration. In the previous DBpedia example, the user explores three different properties; although only a small part of each hierarchy is accessed, the whole hierarchies are constructed and the statistics of all nodes are computed. Considering the recommended HETree parameters for the employed properties, this scenario requires that 29.5K nodes be constructed for the populationTotal property, 9.8K nodes for areaTotal and 3.3K nodes for areaWater, amounting to a total of 42.6K nodes. However, the construction of the hierarchies for large datasets poses a time overhead (as shown in the experimental section) and, consequently, increases the response time during user exploration.

In this section, we introduce the ICO (Incremental HETree Construction) method, which incrementally constructs the HETree, based on user interaction. The proposed method goes beyond incremental tree construction, aiming at further reducing the response time during the exploration process by "preconstructing" (i.e., prefetching) the parts of the tree that will be visited by the user in her next roll-up or drill-down operation. Hence, a node n is not constructed when the user visits it for the first time; instead, it has been constructed in a previous exploration step, at which the user was on a node from which n can be reached by a roll-up or a drill-down operation. In this way, our method offers incremental construction of the tree, tailored to each user's exploration. Finally, we show that, during an exploration scenario, ICO constructs the minimum number of HETree elements.

Fig. 5. Incremental HETree construction example: (1) Resource-based (RES) exploration scenario; (2) Range-based (RAN) exploration scenario
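The two navigation operations are straightforward to express over the node representation sketched in Section 2; the following is our own illustration, assuming that each Node additionally carries a parent link (None for the root), maintained during construction.

def drill_down(n: Node):
    # Focus on n: render its children, or its data objects if n is a leaf.
    return n.data if n.is_leaf else n.children

def roll_up(n: Node) -> List[Node]:
    # From any currently rendered node, render its parent together with the
    # parent's siblings (i.e., the children of the grandparent).
    if n.parent is None:
        return [n]                # n is the root; nothing above
    if n.parent.parent is None:
        return [n.parent]         # the parent is the root level
    return n.parent.parent.children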
Employing the ICO method in the DBpedia example, the populationTotal hierarchy requires the construction of only 76 nodes (the root along with its child nodes, plus 9 nodes in each of the lower tree levels), and the areaTotal hierarchy the construction of 3 nodes, corresponding to the leaf node containing the requested resource and its siblings. Finally, the areaWater hierarchy initially contains either 6 or 15 nodes, depending on whether the user's input range corresponds to a set of sibling leaf nodes or to a set of sibling internal nodes, respectively.

Example 6. We demonstrate the functionality of ICO through the following example. Assume the dataset used in our running examples, describing persons and their ages. Figure 5 presents the incremental construction of the HETree of Figure 3 for the RES and RAN exploration scenarios. Blue colour is used in Figure 5 to indicate the HETree elements that are presented (rendered) to the user at each exploration stage.

In the RES scenario (upper flow in Figure 5), the user specifies "http://persons.com/p6" as her resource of interest; all data objects contained in the same leaf (i.e., e) as the resource of interest are initially presented to the user. ICO initially constructs the leaf e, along with its siblings, i.e., leaves d and f. These leaves correspond to the nodes that the user can reach in a next (roll-up) step. Next, the user rolls up, and the leaves d, e and f are presented to her. At the same time, the parent node b and its sibling c are constructed. Note that all elements which are accessible to the user by moving either down (i.e., the data objects of d, e, f) or up (i.e., nodes b, c) are already constructed. Finally, when the user rolls up again, nodes b and c are rendered, and the parent node a, along with the children of c (i.e., g and h), are constructed.

In the RAN scenario (lower flow in Figure 5), the user specifies [30, 50] as her range of interest. The nodes covering this range (i.e., d, e) are initially presented, along with their sibling f. Also, ICO constructs the parent node b and its sibling c, because they are accessible in one exploration step. Then, the user performs a roll-up, and ICO constructs the nodes a, g and h (as described in the RES scenario above).

At the beginning of each exploration scenario, ICO constructs a set of initial nodes, which are the nodes initially presented, as well as the nodes potentially reached by the user's first operation (i.e., the required HETree elements). The required HETree elements of an exploration step are the nodes that the user can reach by performing one exploration operation. Hence, in the RES scenario, the initial nodes are the leaf of interest and its sibling leaves. In RAN, the initial nodes are the nodes of interest, their children, and their parent node along with its siblings. Finally, in the BSC scenario, the initial nodes are the root node and its children.

In what follows, we describe the construction rules adopted by ICO throughout the user exploration process; the rules are listed next. These rules provide the correspondence between the types of elements presented in each exploration step and the elements that ICO constructs. Note that these rules are applied after the construction of the initial nodes, in all three exploration scenarios. The correctness of these rules is verified later in Proposition 1.
Rule 1: If a set of internal sibling nodes C is presented, ICO constructs: (i) the parent node of C, along with the parent's siblings, and (ii) the children of each node in C.

Rule 2: If a set of leaf sibling nodes L is presented, ICO does not construct anything (the required nodes have been previously constructed).

Rule 3: If a set of data objects O is presented, ICO does not construct anything (the required nodes have been previously constructed).

The following proposition shows that, in all cases, the required HETree elements have been constructed earlier by ICO.9

9 Proofs are included in Appendix A.

Proposition 1. In any exploration scenario, the HETree elements a user can reach by performing one operation (i.e., the required elements) have been previously constructed by ICO.

Also, the following theorem shows that, over any exploration scenario, ICO constructs only the required HETree elements.

Theorem 1. ICO constructs the minimum number of HETree elements in any exploration scenario.

3.2.1. ICO Algorithm

In this section, we present the incremental HETree construction algorithm. Note that we include the pseudocode only for the HETree-R version, since the only differences in the HETree-C version are the way in which the nodes' intervals are computed and the fact that the dataset is initially sorted. In the analysis of the algorithms, both versions are studied.

Here, we assume that each node n contains the following extra fields: n.p denotes the parent node of n, and n.h denotes the height of n in the hierarchy. Additionally, given a dataset D, D.minv and D.maxv denote the minimum and the maximum value over all objects in D, respectively. The user preferences regarding the exploration's starting point are represented as an interval U. In the RES scenario, given that the value of the explored property for the resource of interest is o, we have U− = U+ = o. In the RAN scenario, given that the range of interest is R, we have that U− = max(D.minv, R−) and U+ = min(D.maxv, R+). In the BSC scenario, the user does not provide any preferences regarding the starting point, so we have U− = D.minv and U+ = D.maxv. Finally, according to the definition of the HETree, a node n encloses a data object (i.e., triple) tr if n.I− ≤ tr.o and n.I+ ≥ tr.o.

Algorithm 2. ICO-R(D, ℓ, d, U, cur, H)
Input: D: set of objects; ℓ: number of leaf nodes; d: tree degree; U: interval representing the user's starting point; cur: currently presented elements; H: currently created HETree-R
Output: H: updated HETree-R
Variables: len: the length of a leaf's interval
1  if cur = null then   //first ICO call
2      len ← (D.maxv − D.minv)/ℓ
3      compute I0, h0 from U   //used for constructing the initial nodes
4      cur, H ← constrSiblingNodes-R(I0, null, D, h0)
5      if RES then return H
6  if cur[1].p = null and D ≠ ∅ then
7      H ← constrRollUp-R(D, d, cur, H)
8  if cur[1].h > 0 then   //cur are not leaves
9      H ← constrDrillDown-R(D, d, cur, H)
10 return H

The ICO-R algorithm (Algorithm 2) implements the incremental method for HETree-R. The algorithm uses two procedures to construct all required nodes (available in Appendix B). The first procedure, constrRollUp-R (Procedure 4), constructs the nodes which can be reached by a roll-up operation, whereas constrDrillDown-R (Procedure 5) constructs the nodes which can be reached by a drill-down operation. Additionally, the aforementioned procedures exploit two secondary procedures (Appendix B): computeSiblingInterv-R (Procedure 6) and constrSiblingNodes-R (Procedure 7), which are used for computing nodes' intervals and for constructing nodes.
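The starting interval U for the three scenarios maps directly to code; a small sketch under the same assumptions as before (function and parameter names are our own):

def starting_interval(scenario: str, D_minv: float, D_maxv: float,
                      o: float = None, R: tuple = None) -> tuple:
    # U for each exploration scenario (Section 3.2.1):
    if scenario == "RES":                    # resource of interest with value o
        return (o, o)                        # U- = U+ = o
    if scenario == "RAN":                    # range of interest R
        return (max(D_minv, R[0]), min(D_maxv, R[1]))
    return (D_minv, D_maxv)                  # BSC: the whole value range

assert starting_interval("RES", 20, 100, o=45) == (45, 45)
assert starting_interval("RAN", 20, 100, R=(30, 50)) == (30, 50)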
The ICO-R algorithm is invoked at the beginning of the exploration scenario, in order to construct the initial nodes, as well as every time the user performs an operation. The algorithm takes as input the dataset D, the tree parameters d and ℓ, the starting point U, the currently presented (i.e., rendered) elements cur, and the constructed HETree H.

ICO-R begins with the currently presented elements cur equal to null (lines 1–5). Based on the starting point U, the algorithm computes the interval I0 corresponding to the sibling nodes that are first presented to the user, as well as their hierarchy height h0 (line 3). For the sake of simplicity, the details of computing I0 and h0 are omitted. For example, the interval I of the leaf that contains the resource of interest with object value o is computed as I− = D.minv + len·⌊(o − D.minv)/len⌋ and I+ = min(D.maxv, I− + len). Following a similar approach, we can easily compute I0 and h0. Based on I0, the algorithm constructs the sibling nodes that are first presented to the user (line 4). Then, the algorithm constructs the rest of the initial nodes (lines 6–9). In the RES case, I0 is the interval that includes the leaf containing the resource of interest along with its sibling leaves. Hence, all the initial nodes are constructed in line 4, and the algorithm terminates (line 5) until the next user operation.

After the first call, in each ICO execution, the algorithm initially checks whether the parent node of the currently presented elements is already constructed, or whether all the nodes that enclose data objects10 have been constructed (line 6). Then, procedure constrRollUp-R (line 7) is used to construct the parent of cur, as well as the parent's siblings. In the case that cur are not leaf nodes or data objects (line 8), procedure constrDrillDown-R (line 9) is used to construct all children of cur. Finally, the algorithm returns the updated HETree (line 10).

10 Note that in the HETree-R version, we may have nodes that do not enclose any data objects.

Discussion. The worst-case computational cost is higher in HETree-R than in HETree-C, for all exploration scenarios. Particularly, in the HETree-R worst case, ICO must build leaves that contain the whole dataset, and the computational cost is O(|D|·log|D|) for all scenarios. In HETree-C, for the RES and RAN scenarios the cost is O(d² + ((d − 1)/d)·|D|), and for the BSC scenario the cost is O(d² + |D|). A detailed computational analysis for both HETree-R and HETree-C is included in Appendix C.
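As an illustration of the starting-point computation quoted above, the leaf interval enclosing a given object value can be sketched as follows (our own code; minv, maxv and num_leaves are assumed inputs corresponding to D.minv, D.maxv and ℓ).

def leaf_interval(o: float, minv: float, maxv: float, num_leaves: int):
    # Interval of the HETree-R leaf enclosing object value o:
    # I- = minv + len * floor((o - minv)/len), I+ = min(maxv, I- + len)
    length = (maxv - minv) / num_leaves
    lo = minv + length * int((o - minv) // length)
    return lo, min(maxv, lo + length)

# Running example (Figure 3): the resource p6 (age 45) falls in leaf e [36, 52).
assert leaf_interval(45, 20, 100, 5) == (36.0, 52.0)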
3.2.2. Computational Analysis

Here we analyse the incremental construction for both HETree versions.

Number of Constructed Nodes. Regarding the number of initial nodes constructed in each scenario: in the RES scenario, at most d leaf nodes are constructed; in the RAN scenario, at most 2d + d² nodes are constructed; finally, in the BSC scenario, d + 1 nodes are constructed.

Regarding the maximum number of nodes constructed in each operation in the RES and RAN scenarios: (1) A roll-up operation constructs at most d + d·(d − 1) = d² nodes; the d nodes are constructed in constrRollUp, whereas the d·(d − 1) in constrDrillDown. (2) A drill-down operation constructs at most d² nodes in constrDrillDown. As for the BSC scenario: (1) A roll-up operation does not construct any nodes. (2) A drill-down operation constructs at most d² nodes in constrDrillDown.

Discussion. The worst case for the computational cost is higher in HETree-R than in HETree-C, for all exploration scenarios. Particularly, in the HETree-R worst case, ICO must build leaves that contain the whole dataset, and the computational cost is O(|D|log|D|) for all scenarios. In HETree-C, for the RES and RAN scenarios, the cost is O(d² + ((d − 1)/d)·|D|), and for the BSC scenario the cost is O(d² + |D|). A detailed computational analysis for both HETree-R and HETree-C is included in Appendix C.

3.3. Adaptive HETree Construction

In a (visual) exploration scenario, users wish to modify the organization of the data by providing user-specific preferences for the whole hierarchy or part of it. The user can select a specific subtree and alter the number of groups presented in each level (i.e., the tree degree) or the size of the groups (i.e., the number of leaves). In this case, a new tree (or a part of it) pertaining to the new parameters provided by the user should be constructed on-the-fly. For example, consider the HETree-C of Figure 6 representing ages of persons.11 A user may navigate to node b, where she prefers to increase the number of groups presented in each level. Thus, she modifies the degree of b from 2 to 4, and the subtree is adapted to the new parameter as depicted in the bottom tree of Figure 6. On the other hand, the user prefers exploring the right subtree (starting from node c) with less detail. She chooses to increase the size of the groups by reducing (from 4 to 2) the number of leaves for the subtree of c.

Fig. 6. Adaptive HETree example

11 For simplicity, Figure 6 presents only the values of the objects.

In both cases, constructing the subtree from scratch based on the user-provided parameters and recomputing statistics entails a significant time overhead, especially when the user preferences are applied to a large part of, or the whole, hierarchy. In this section, we introduce the ADA (Adaptive HETree Construction) method, which dynamically adapts an existing HETree to a new one, considering a set of user-defined parameters. Instead of both constructing the tree and computing the nodes' statistics from scratch, our method reconstructs the new part(s) of the hierarchy by exploiting the existing elements (i.e., nodes, statistics) of the tree. In this way, ADA reduces the overall construction cost and enables the on-the-fly reorganization of the visualized data. In the example of Figure 6, the new subtree of b can be derived from the old one just by removing the internal nodes d and e, while the new subtree of c results from merging leaves together and aggregating their statistics.
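The aggregation of statistics mentioned above is straightforward for additive measures; e.g., for the mean value, with µi and ci denoting the mean and the number of objects of the i-th merged node, the merged mean is the count-weighted average:

    \mu_{merged} = \frac{\sum_{i=1}^{k} c_i\,\mu_i}{\sum_{i=1}^{k} c_i}

The mean-value annotations of Figure 7 are instances of this formula.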
Let T(D, ℓ, d) denote the existing HETree and T′(D, ℓ′, d′) the new HETree corresponding to the new user preferences for the tree degree d′ and the number of leaves ℓ′. Note that T could also denote a subtree of an existing HETree (in the scenario where the user modifies only a part of it); in this case, the user indicates the reconstruction root of T. Then, ADA identifies the following elements of T: (1) The elements of T that also exist in T′. For example, consider the following two cases: the leaf nodes of T′ are internal nodes of T in level x; the statistics of T′ nodes in level x are equal to the statistics of T nodes in level y. (2) The elements of T that can be reused (as "building blocks") for constructing elements of T′. For example, consider the following two cases: each leaf node of T′ is constructed by merging x leaf nodes of T; the statistics for the node n of T′ can be computed by aggregating the statistics from the nodes q and w of T.

Consequently, we consider that an element (i.e., a node or a node's statistics) in T′ can be: (1) constructed/computed from scratch12, (2) reused as is from T, or (3) derived by aggregating elements from T.

12 Note that it is possible for a from-scratch constructed node in T′ to aggregate statistics from nodes in T.

Table 2 summarizes the ADA reconstruction process. Particularly, the table includes: (1) the computational complexity for constructing T′, denoted as Complexity; (2) the number of leaves and internal nodes of T′ constructed from scratch, denoted as #leaves0 and #internals0, respectively; and (3) the number of leaves and internal nodes of T′ derived from nodes of T, denoted as #leaves+ and #internals+, respectively. The lower part of the table presents the results for the computation of node statistics in T′. Finally, the second table column, denoted as Full Construction, presents the results of constructing T′ from scratch.

Table 2. Summary of Adaptive HETree Construction

The following example demonstrates the ADA results, considering a DBpedia exploration scenario.

Example 7. The user explores the populationTotal property of the DBpedia dataset. The default system organization for this property is a hierarchy with degree 3. The user modifies the tree parameters in order to better fit the visualization results, as follows. First, she decides to render more groups in each hierarchy level and increases the degree from 3 to 9 (1st Modification). Then, she observes that the results overflow the visualization area and that a smaller degree fits better; thus, she re-adjusts the tree degree to a value of 6 (2nd Modification). Afterwards, she navigates through the data values and decides to increase the groups' size by a factor of three (i.e., dividing the number of leaves by three) (3rd Modification). Again, she corrects her decision and re-adjusts the final group size to twice the default size (4th Modification). Table 3 summarizes the number of nodes constructed by a Full Construction and by ADA in each modification, along with the required statistics computations. Considering the whole set of modifications, ADA constructs only 22% (15.4K vs. 70.2K) of the nodes that are created in the case of the full construction. Also, ADA computes the statistics for only 8% (5.6K vs. 70.2K) of the nodes.

Table 3. Full Construction vs. ADA over the DBpedia Exploration Scenario (cell values: Full / ADA)

                                       Modify Degree                        Modify Num. of Leaves
                              1st Modification   2nd Modification   3rd Modification   4th Modification
Tree Construction (#nodes)      22.1K / 0          23.6K / 3.9K       9.8K / 6.6K        14.7K / 4.9K
Statistics Comp. (#nodes)       22.1K / 0          23.6K / 659        9.8K / 0           14.7K / 4.9K
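The reported savings follow directly from the column totals of Table 3:

    \frac{0 + 3.9 + 6.6 + 4.9}{22.1 + 23.6 + 9.8 + 14.7} = \frac{15.4\mathrm{K}}{70.2\mathrm{K}} \approx 22\%, \qquad
    \frac{0 + 0.659 + 0 + 4.9}{70.2\mathrm{K}} = \frac{5.6\mathrm{K}}{70.2\mathrm{K}} \approx 8\%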
In the next sections, we present in detail the reconstruction process through the example trees of Figure 7. Figure 7a presents the initial tree T, an HETree-C with ℓ = 8 and d = 2. Figures 7b-7e present several reconstructed trees T′. Blue dashed lines are used to indicate the elements (i.e., nodes, edges) of T′ which do not exist in T. Regarding statistics, we assume that in each node we compute the mean value. In each T′, we present only the mean values that are not known from T. Also, in the mean value computations, the values that are reused from T are highlighted in yellow. All reconstruction details and the computational analysis for each case are included in Appendix D.

3.3.1. The User Modifies the Tree Degree

Regarding the modification of the degree parameter, we distinguish the following cases:

The user increases the tree degree. We have that d′ > d; based on the d′ value, we have the following cases:

(1) d′ = dᵏ, with k ∈ N⁺ and k > 1: Figure 7a presents T with d = 2 and Figure 7d presents the reconstructed T′ with d′ = 4 (i.e., k = 2). T′ results by simply removing the nodes with height 1 (i.e., d, e, f, g) and connecting the nodes with height 2 (i.e., b, c) with the leaves. In general, T′ results from T by simply removing tree levels from T. Additionally, there is no need to compute any new statistics, since the statistics for all nodes of T′ remain the same as in T.

(2) d′ = k·d, with k ∈ N⁺, k > 1 and k ≠ dⁿ, where n ∈ N⁺: An example with k = 3 is presented in Figure 7c, where we have d′ = 6. In this case, the leaves of T (Figure 7a) remain leaves in T′, and all internal nodes up to the reconstruction root of T are constructed from scratch. As for the node statistics, we can compute the mean values for the T′ nodes with height 1 (i.e., µb, µc) by aggregating already computed mean values (e.g., µd, µe, etc.) from T. In general, except for the leaves, we construct all internal nodes from scratch. For the internal nodes of height 1, we compute their statistics by aggregating the statistics of the T leaves, whereas for internal nodes of height greater than 1, we compute their statistics from scratch.

(3) elsewhere: In any other case where the user increases the tree degree, all nodes in T′ except for the leaves are constructed from scratch. In contrast with the previous case, the leaves' statistics from T cannot be reused and, thus, the statistics are recomputed for all internal nodes in T′.

The user decreases the tree degree. Here we have that d′ < d; based on the d′ value, we have the following two cases:

(1) d′ = ᵏ√d, with k ∈ N⁺ and k > 1: Assume that now Figure 7d depicts T, with d = 4, while Figure 7a presents T′ with d′ = 2. We can observe that T′ contains all nodes of T, as well as a set of extra internal nodes (i.e., d, e, f, g). Hence, T′ results from T by constructing some new internal nodes.

(2) elsewhere: This case is the same as case (3) above, where the user increases the tree degree.
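In the d′ = dᵏ case, the adaptation amounts to re-linking child pointers; the following sketch (reusing the hypothetical Node class from the Section 3.2.1 sketch and the heights of the original tree; not the actual ADA implementation) drops the intermediate levels:

    // Adapt a subtree from degree d to d' = d^k by dropping the intermediate
    // levels: every kept node adopts its descendants at the next kept level
    // (heights that are multiples of k) as direct children. Node statistics
    // are untouched, since each kept node still aggregates the same objects.
    static void raiseDegree(Node n, int k) {
        if (n.height == 0) return;                        // leaves stay as they are
        int step = n.height - k * ((n.height - 1) / k);   // distance to next kept level
        java.util.List<Node> kept = new java.util.ArrayList<>();
        collect(n, step, kept);
        n.children = kept;
        for (Node c : kept) { c.parent = n; raiseDegree(c, k); }
    }

    // Gather the descendants of n lying exactly 'depth' levels below n.
    static void collect(Node n, int depth, java.util.List<Node> out) {
        if (depth == 0) { out.add(n); return; }
        for (Node c : n.children) collect(c, depth - 1, out);
    }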
3.3.2. The User Modifies the Number of Leaves

Regarding the modification of the number-of-leaves parameter, we distinguish the following cases:

The user increases the number of leaves. In this case we have that ℓ′ > ℓ; hence, each leaf of T is split into several leaves in T′, and the data objects contained in a T leaf must be reallocated to the new leaves of T′. As a result, all nodes (both leaves and internal nodes) in T′ have different contents compared to the nodes of T and must be constructed from scratch, along with their statistics. In this case, constructing T′ requires O(|D| + (d′²ℓ′ − d′)/(d′ − 1)) (by avoiding the sorting phase).

Fig. 7. Adaptive HETree construction examples: (a) ℓ=8, d=2; (b) ℓ=4, d=2; (c) ℓ=8, d=6; (d) ℓ=8, d=4; (e) ℓ=3, d=2

The user decreases the number of leaves. In this case we have that ℓ′ < ℓ; based on the ℓ′ value, we have the following three cases:

(1) ℓ′ = ℓ/dᵏ, with k ∈ N⁺: Consider that Figure 7a presents T with ℓ = 8 and d = 2. A reconstruction example of this case, with k = 1, is presented in Figure 7b, where we have T′ with ℓ′ = 4. In Figure 7b, we observe that the leaves of T′ result from merging dᵏ leaves of T. For example, the leaf d of T′ results from merging the leaves h and i of T. Then, T′ results from T by replacing the T nodes with height k (i.e., d, e, f, g) with the T′ leaves. Finally, the nodes of T with height less than k are not included in T′. Therefore, in this case, T′ is constructed by merging the leaves of T and removing the internal nodes of T having height less than or equal to k. Also, we do not recompute the statistics of the new leaves of T′, as these are derived from the statistics of the removed nodes with height k.

(2) ℓ′ = ℓ/k, with k ∈ N⁺, k > 1 and k ≠ dⁿ, where n ∈ N⁺: As in the previous case, the leaves of T′ are constructed by merging leaves from T, and their statistics are computed based on the statistics of the merged leaves. In this case, however, all internal nodes in T′ have to be constructed from scratch.
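A sketch of the leaf-merging step shared by cases (1) and (2): groups of k consecutive leaves are merged, and the statistics of each merged leaf are derived from the old means and counts via the count-weighted average shown earlier (illustrative code; Leaf is a hypothetical record, not the actual implementation):

    // Hypothetical leaf record: interval, statistics, and enclosed values.
    class Leaf {
        double lo, hi, mean;
        int count;
        java.util.List<Double> objects = new java.util.ArrayList<>();
    }

    // Merge groups of k consecutive leaves; the merged mean is derived from
    // the old means and counts, without re-scanning the raw values.
    static java.util.List<Leaf> mergeLeaves(java.util.List<Leaf> leaves, int k) {
        java.util.List<Leaf> merged = new java.util.ArrayList<>();
        for (int i = 0; i < leaves.size(); i += k) {
            int end = Math.min(i + k, leaves.size());
            Leaf m = new Leaf();
            m.lo = leaves.get(i).lo;              // interval spans the merged leaves
            m.hi = leaves.get(end - 1).hi;
            double weightedSum = 0;
            for (int j = i; j < end; j++) {
                Leaf l = leaves.get(j);
                weightedSum += l.mean * l.count;  // reuse the existing statistics
                m.count += l.count;
                m.objects.addAll(l.objects);
            }
            m.mean = weightedSum / m.count;
            merged.add(m);
        }
        return merged;
    }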
(3) ℓ′ = ℓ − k, with k ∈ N⁺, k > 1 and ℓ′ ≠ ℓ/n, where n ∈ N⁺: The two previous cases describe settings in which each leaf of T′ fully contains leaves from T. In this case, a leaf of T′ may partially contain leaves from T. A leaf of T′ fully contains a leaf of T when the T′ leaf contains all data objects belonging to the T leaf; otherwise, a leaf of T′ partially contains a leaf of T when the T′ leaf contains only a subset of the data objects of the T leaf. An example of this case is shown in Figure 7e, which depicts a reconstructed T′ resulting from the T presented in Figure 7a. The leaf d of T′ fully contains the leaves h, i of T and partially contains the leaf k, whose value 35 belongs to a different leaf of T′ (i.e., e). Due to this partial containment, we have to construct all leaves and internal nodes from scratch and recalculate their statistics. Still, the statistics of the fully contained leaves of T can be reused, by aggregating them with the individual values of the data objects included in the partially contained leaves. For example, as we can see in Figure 7e, the mean value µd of the leaf d is computed by aggregating the mean values µh and µi, corresponding to the fully contained leaves h and i, with the individual values 30, 32 of the partially contained leaf k.

4. The SynopsViz Tool

Based on the proposed hierarchical model, we have developed a web-based prototype called SynopsViz.13 The key features of SynopsViz are summarized as follows: (1) It supports the aforementioned hierarchical model for RDF data visualization, browsing and analysis. (2) It offers automatic on-the-fly hierarchy construction, as well as user-defined hierarchy construction based on user preferences. (3) It provides faceted browsing and filtering over classes and properties. (4) It integrates statistics with visualization; visualizations have been enriched with useful statistics and data information. (5) It offers several visualization techniques (e.g., timeline, chart, treemap). (6) It provides a large number of dataset statistics regarding the data level (e.g., number of sameAs triples), the schema level (e.g., most common classes/properties), and the structure level (e.g., entities with the largest in-degree). (7) It provides numerous metadata related to the dataset: licensing, provenance, linking, availability, understandability, etc. The latter can be considered useful for assessing data quality [115].

13 synopsviz.imis.athena-innovation.gr

In the rest of this section, Section 4.1 describes the system architecture, Section 4.2 demonstrates the basic functionality of SynopsViz, and Section 4.3 provides technical information about the implementation.

4.1. System Architecture

The architecture of SynopsViz is presented in Figure 8. Our scenario involves three main parts: the Client UI, SynopsViz, and the Input data. The Client part corresponds to the system's front-end, offering several functionalities to the end-users, e.g., hierarchical visual exploration, faceted search, etc. (see Section 4.2 for more details). SynopsViz consumes RDF data as Input data; optionally, RDF/S-OWL vocabularies/ontologies describing the input data can be loaded.

Fig. 8. System architecture

Next, we describe the basic components of SynopsViz.
In the preprocessing phase, the Data and Schema Handler parses the input data and infers schema information (e.g., property domain(s)/range(s), class/property hierarchy, type of instances, type of properties, etc.). The Facet Generator generates class and property facets over the input data. The Statistics Generator computes several statistics regarding the schema, the instances and the graph structure of the input dataset. The Metadata Extractor collects dataset metadata. Note that the model construction does not require any preprocessing; it is performed online, according to user interaction.

During runtime, the following components are involved. The Hierarchy Specifier is responsible for managing the configuration parameters of our hierarchy model, e.g., the number of hierarchy levels and the number of nodes per level, and for providing this information to the Hierarchy Constructor. The Hierarchy Constructor implements our tree structure. Based on the selected facets and the hierarchy configuration, it determines the hierarchy of groups and the contained triples. The Statistics Processor computes statistics about the groups included in the hierarchy. The Visualization Module handles the interaction between the user and the back-end, allowing several operations (e.g., navigation, filtering, hierarchy specification) over the visualized data. Finally, the Hierarchical Model Module maintains the in-memory tree structure of our model and communicates with the Hierarchy Constructor for the model construction, the Hierarchy Specifier for the model customization, the Statistics Processor for the statistics computations, and the Visualization Module for the visual representation of the model.

4.2. SynopsViz In-Use

In this section, we outline the basic functionality of the SynopsViz prototype. Figure 9 presents the web user interface of the main window. The SynopsViz UI consists of the following main panels: the Facets panel, which presents and manages facets on classes and properties; the Input data control panel, which enables the user to import and manage input datasets; the Visualization panel, which is the main area where interactive charts and statistics are presented; and the Configuration panel, which handles visualization settings.

Fig. 9. Web user interface

Initially, users are able to select a dataset from a number of offered real-world LD datasets (e.g., DBpedia, Eurostat) or upload their own. Then, for the selected dataset, users are able to examine several of the dataset's metadata and explore several of the dataset's statistics. Using the facets panel, users are able to navigate and filter data based on classes, numeric and date properties. In addition, the facets panel provides users with several pieces of information about the classes and properties (e.g., number of instances, domain(s), range(s), IRI, etc.).

Users are able to visually explore data by considering the properties' values. Particularly, area charts and timeline-based area charts are used to visualize the resources considering the user's selected properties. Class facets can also be used to filter the visualized data. Initially, the top level of the hierarchy is presented, providing an overview of the data organized into top-level groups; the user can interactively drill down (i.e., zoom in) and roll up (i.e., zoom out) over the groups of interest, down to the actual values of the input data (i.e., LD resources).
At the same time, statistical information concerning the hierarchy groups, as well as their contents (e.g., mean value, variance, sample data, range), is presented through the UI (Figure 10a). At the most detailed level (i.e., LD resources), several visualization types are offered, i.e., area, column, line, spline and areaspline charts (Figure 10b). In addition, users are able to visually explore data through the class hierarchy. Selecting one or more classes, users can interactively navigate over the class hierarchy using treemaps (Figure 10c) or pie charts (Figure 10d). Property facets can also be used to filter the visualized data. In SynopsViz, the treemap visualization has been enriched with schema and statistical information. For each class, schema metadata (e.g., number of instances, subclasses, datatype/object properties) and statistical information (e.g., the cardinality of each property, min and max values for datatype properties) are provided.

Finally, users can interactively modify the hierarchy specifications. Particularly, they are able to increase or decrease the level of abstraction/detail presented, by modifying both the number of hierarchy levels and the number of nodes per level. A video presenting the basic functionality of our prototype is available at youtu.be/n2ctdH5PKA0. Also, a demonstration of the SynopsViz tool is presented in [19].

Fig. 10. Numeric data & class hierarchy visualization examples: (a) groups of numeric RDF data (area chart); (b) numeric RDF data (column chart); (c) class hierarchy (treemap chart); (d) class hierarchy (pie chart)

4.3. Implementation

SynopsViz is implemented on top of several open source tools and libraries. The back-end of our system is developed in Java; the Jena framework is used for RDF data handling, and Jena TDB is used for disk-based RDF storage. The front-end prototype is developed using HTML and JavaScript. Regarding visualization libraries, we use Highcharts for the area, column, line, spline, areaspline and timeline-based charts, and Google Charts for the treemap and pie charts.
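As an indication of how the input is obtained at the data-handling layer, the following sketch loads a dataset and collects the numeric values of a selected property with the Jena API (a minimal example under current Jena package names; the class, method and file names are ours, not SynopsViz's):

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.riot.RDFDataMgr;
    import java.util.*;

    public class PropertyValues {
        // Collect the numeric object values of a given property from an RDF file.
        public static List<Double> values(String file, String propertyUri) {
            Model model = RDFDataMgr.loadModel(file);
            Property p = model.getProperty(propertyUri);
            List<Double> vals = new ArrayList<>();
            StmtIterator it = model.listStatements(null, p, (RDFNode) null);
            while (it.hasNext()) {
                RDFNode o = it.next().getObject();
                if (o.isLiteral()) vals.add(o.asLiteral().getDouble());
            }
            Collections.sort(vals);   // HETree-C assumes sorted input
            return vals;
        }
    }

The sorted value list is exactly the input assumed by the HETree construction: HETree-C groups consecutive values into leaves with a fixed number of objects, whereas HETree-R splits the value range into equal-length intervals.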
5. Experimental Analysis

In this section, we present the evaluation of our approach. Section 5.1 presents the dataset and the experimental setting. Then, Section 5.2 presents the performance results, and Section 5.3 the user evaluation we performed.

5.1. Experimental Setting

In our evaluation, we use the well-known DBpedia 2014 LD dataset. Particularly, we use the Mapping-based Properties (cleaned) dataset14, which contains high-quality data extracted from Wikipedia infoboxes. This dataset contains 33.1M triples and includes a large number of numeric and temporal properties of varying sizes. The largest numeric property in this dataset has 534K triples, whereas the largest temporal property has 762K.

14 downloads.dbpedia.org/2014/en/mappingbased_properties_cleaned_en.nt.bz2

Regarding the methods used in our evaluation, we consider our HETree hierarchical approaches, as well as a simple non-hierarchical visualization approach, referred to as FLAT. FLAT is considered a competitive method against our hierarchical approaches. It provides single-level visualizations, rendering only the actual data objects; i.e., it is the same as the visualization provided by SynopsViz at the most detailed level. In more detail, the FLAT approach corresponds to a column chart in which the resources are sorted in ascending order based on their object values; the horizontal axis contains the resources' names (i.e., the triples' subjects), and the vertical axis corresponds to the objects' values. By hovering over a resource, a tooltip appears, including the resource's name and object value.

Regarding the HETree approaches, the tree parameters (i.e., number of leaves, degree and height) are automatically computed following the approach described in Section 2.5. In our experiments, the lower and upper bounds for the objects rendered at the most detailed level have been set to λmin = 10 and λmax = 50, respectively. Considering the visualizations provided by the default Highcharts settings, these numbers are reasonable for our screen size and resolution. Finally, our back-end system is hosted on a server with a quad-core CPU at 2GHz and 8GB of RAM, running Windows Server 2008. As a client, we used a laptop with an i5 CPU at 2.5GHz and 4GB of RAM, running Windows 7 and Firefox 38.0.1, over an ADSL2+ Internet connection. Additionally, in the user evaluation, the client is equipped with a 24″ (1920×1200) screen.

5.2. Performance Evaluation

In this section, we study the performance of the proposed model, as well as the behaviour of our tool, in terms of construction time and response time, respectively. Section 5.2.1 describes the setting of our performance evaluation, and Section 5.2.2 presents the evaluation results.

5.2.1. Setup

In order to study the performance, a number of numeric and temporal properties from the employed dataset are visualized using the two hierarchical approaches (i.e., HETree-C/R), as well as the FLAT approach. We select one set from each type of property; each set contains 15 properties of varying sizes, starting from small properties having 50-100 triples, up to the largest properties. In our experiment, for each of the three approaches, we measure the tool's response time. Additionally, for the two hierarchical approaches, we also measure the time required for the HETree construction.

Note that, in the hierarchical approaches, through user interaction the server sends to the browser only the data required for rendering the current visualization level (although the whole tree is constructed at the back-end). Hence, when a user requests to generate a visualization, we have the following workflow. Initially, our system constructs the tree. Then, the data regarding the top-level groups (i.e., the root node's children) are sent to the browser, which renders the result. Afterwards, based on user interactions (i.e., drill-down, roll-up), the server retrieves the required data from the tree and sends them to the browser. Thus, the tree is constructed the first time a visualization is requested for the given input dataset; for any further user navigation over the hierarchy, the response time does not include the construction time. Therefore, in our experiments, in the hierarchical approaches, as response time we measure the time required by our tool to provide the first response (i.e., render the top-level groups), which corresponds to the slowest response in our visual exploration scenario. Thus, we consider the following measures in our experiments:
Construction Time: the time required to build the HETree structure. This time includes (1) the time for sorting the triples; (2) the time for building the tree; and (3) the time for the statistics computations.

Response Time: the time required to render the charts, starting from the time the client sends the request. This time includes (1) the time required by the server to compute and build the response; in the hierarchical approaches, this corresponds to the Construction Time plus the time required by the server to build the JSON object sent to the client, whereas in the FLAT approach, it corresponds to the time spent in sorting the triples plus the time for the JSON construction; (2) the time spent in the client-server communication; and (3) the time required by the visualization library to render the charts on the browser.

5.2.2. Results

Table 4 presents the evaluation results regarding the numeric (upper half) and the temporal properties (lower half). The properties are sorted in ascending order of the number of triples. For each property, the table contains the number of triples, the characteristics of the constructed HETree structures (i.e., number of leaves, degree, height, and number of nodes), as well as the construction and response times for each approach. The presented time measurements are average values over 50 executions.

Regarding the comparison between HETree and FLAT, the FLAT approach cannot provide results for properties having more than 305K triples, indicated in the last rows for both numeric and temporal properties with "–" in the FLAT response time. For the rest of the properties, we can observe that the HETree approaches clearly outperform FLAT in all cases, even for the smallest property (i.e., rankingWins, 50 triples). As the size of the properties increases, the difference between the HETree approaches and FLAT increases as well. In more detail, for large properties having more than 53K triples (i.e., the numeric properties larger than populationDensity (12th row), and the temporal properties larger than added (11th row)), the HETree approaches outperform FLAT by one order of magnitude.

Regarding the time required for the construction of the HETree structure, from Table 4 we can observe the following. The performance of both HETree structures is very close for most of the examined properties, with HETree-R performing slightly better than HETree-C (especially for the relatively small numeric properties). Furthermore, we can observe that the response time follows a similar trend as the construction time. This is expected, since the communication cost, as well as the times required for constructing and rendering the JSON object, are almost the same in all cases.
Table 4. Performance Results for Numeric & Temporal Properties (times in msec)

                                                  Tree Characteristics                 HETree-C           HETree-R          FLAT
Property (#Triples)              #Leaves  Degree  Height  #Nodes    Constr.   Resp.    Constr.   Resp.    Resp.

Numeric Properties
rankingWins (50)                       9     3       2        13         5      324         1      323       415
distanceToBelfast (104)                9     3       2        13         7      337         4      329       419
waistSize (241)                       16     4       2        21        10      346         9      336       440
fileSize (492)                        27     3       3        40        18      347        16      345       575
hsvCoordinateValue (995)              81     3       4       121        74      403        50      383       980
lineLength (1,923)                    81     3       4       121        77      409        55      391     1,463
powerOutput (5,453)                  243     3       5       364       234      560       217      540     2,583
width (11,049)                       729     3       6     1,093       506      830       467      799     6,135
numberOfPages (21,743)               729     3       6     1,093     2,888    3,219     2,403    2,722    12,669
inseeCode (36,780)                 2,187     3       7     3,280     4,632    4,962     4,105    4,436    19,119
areaWater (40,564)                 2,187     3       7     3,280     4,945    5,134     5,274    5,457    29,538
populationDensity (52,572)         2,187     3       7     3,280     6,803    7,127     6,080    6,404    44,262
areaTotal (140,408)                6,561     3       8     9,841    16,158   16,482    13,298   13,627   219,018
populationTotal (304,522)         19,683     3       9    29,524    31,141   31,473    25,866   26,196 1,523,675
lat (533,900)                     19,683     3       9    29,524    73,528   73,862    71,784   72,106         –

Temporal Properties
retired (155)                          9     3       2        13         8      330         4      327       425
endDate (341)                         27     3       3        40        17      339        16      339       468
lastAirDate (704)                     64     4       3        85        34      359        30      359       853
buildingStartDate (1,415)             81     3       4       121        73      406        53      384     1,103
latestReleaseDate (2,925)            243     3       5       364       162      496       146      480     1,804
orderDate (3,788)                    243     3       5       364       210      542       195      523     2,011
decommissioningDate (7,082)          243     3       5       364       405      735       383      717     3,423
shipLaunch (15,938)                  729     3       6     1,093     1,772    2,094     1,595    1,919     6,935
completionDate (17,017)              729     3       6     1,093     1,987    2,311     1,793    2,121     7,814
foundingDate (19,694)                729     3       6     1,093     2,745    3,069     2,583    2,905     8,699
added (44,227)                     2,187     3       7     3,280     5,912    5,943     6,244    6,265    33,846
activeYearsStartDate (98,160)      6,561     3       8     9,841    10,368   10,702     8,952    9,282   107,587
releaseDate (169,156)              6,561     3       8     9,841    19,122   19,451    16,526   16,856   950,545
deathDate (321,883)               19,683     3       9    29,524    32,990   33,313    27,936   28,271         –
birthDate (761,830)               59,049     3      10    88,573    85,797   86,120    83,982   84,314         –

Regarding the comparison between the construction and response times in the HETree approaches, from Table 4 we can observe the following. For properties having up to 5.5K triples (i.e., the numeric properties smaller than width (8th row), and the temporal properties smaller than decommissioningDate (7th row)), the response time is dominated by the communication cost and the time required for the JSON construction and rendering. For properties with only a small number of triples (i.e., waistSize, 241 triples), only 1.5% of the response time is spent on constructing the HETree. Moreover, for a property with a larger number of triples (i.e., buildingStartDate, 1,415 triples), 18% of the time is spent on constructing the HETree. Finally, for the largest property for which the time spent in communication cost, JSON construction and rendering is larger than the construction time (i.e., powerOutput, 5,453 triples), 42% of the time is spent on constructing the HETree.

Figure 11 summarizes the results of Table 4, presenting the response time for all approaches w.r.t. the number of triples. Particularly, Figure 11a includes all property sizes (i.e., 50 to 762K).
Further, in order to allow a more precise observation over the small property sizes, for which the difference between FLAT and the HETree approaches is smaller, we report the properties with fewer than 20K triples separately in Figure 11b. Once again, we observe that HETree-R performs slightly better than HETree-C. Additionally, from Figure 11b we can see that, for up to 10K triples, the performance of the two HETree approaches is almost the same. We can also observe the significant difference between the FLAT and the HETree approaches.

Fig. 11. Response time w.r.t. the number of triples: (a) all properties (50 to 762K triples); (b) small properties (50 to 20K triples)

Although our method clearly outperforms the non-hierarchical one, we can observe from the above results that the construction of the whole hierarchy cannot provide an efficient solution for datasets containing more than 10K objects. As discussed in Section 3.2, for efficient exploration over large datasets, an incremental hierarchy construction is required. In the incremental exploration scenario, the number of hierarchy nodes that have to be processed and constructed is significantly smaller compared to the non-incremental one. For example, adopting a non-incremental construction for populationTotal (305K triples), 29.6K nodes have to be initially constructed (along with their statistics). On the other hand, with the incremental approach (as analysed in Section 3.2), at the beginning of each exploration scenario only the initial nodes are constructed. The initial nodes are the nodes initially presented, as well as the nodes potentially reached by the user's first operation. In the RES scenario, the initial nodes are the leaf of interest (1 node) and its sibling leaves (at most d − 1 nodes). In the RAN scenario, the initial nodes are the nodes of interest (at most d nodes), their children (at most d² nodes), and their parent node along with its siblings (at most d nodes). Finally, in the BSC scenario, the initial nodes are the root node (1 node) and its children (at most d nodes). Overall, at most d, 2d + d², and d + 1 nodes are initially constructed in the RES, RAN and BSC scenarios, respectively. Therefore, in the populationTotal case, where d = 3, at most 3, 15 and 4 nodes are initially constructed in the RES, RAN and BSC scenarios, respectively.

5.3. User Study

In this section, we present the user evaluation of our tool, in which we have employed three approaches: the two hierarchical ones and FLAT. Section 5.3.1 describes the user tasks, Section 5.3.2 outlines the evaluation procedure and setup, Section 5.3.3 summarizes the evaluation results, and Section 5.3.4 discusses issues related to the evaluation process.

5.3.1. Tasks

In this section, we describe the different types of tasks that are used in the user evaluation process.

Type 1 [Find resources with specific value]: This type of task requests the resources having a given value v (as object). For this task type, we define task T1 by selecting a value v that corresponds to 5 resources. Given this task, the participants are asked to provide the number of resources that pertain to this value. In order to solve this task, the participants first have to find a resource with value v and then check which of the nearby resources also have the same value.

Type 2 [Find resources in a range of values]: This type of task requests the resources having a value greater than vmin and less than vmax. We define two tasks of this type by selecting different combinations of vmin and vmax values, such that the tasks consider different numbers of resources: the first considers a relatively small number (we select 10), while the second considers a larger one (we select 50). Particularly, in the first task, named T2.1, we specify the values vmin and vmax such that a relatively small set of (approximately 10) resources is included, whereas the second task, T2.2, considers a relatively larger set of (approximately 50) resources. Given these tasks, the participants are asked to provide the number of resources included in the given range. These tasks can be solved by first finding a resource with a value included in the given range, and then exploring the nearby resources in order to identify the remaining ones.
In order to solve this task, the participants first have to find a resource with value v and then check which of the nearby resources also have the same value. Type 2 [Find resources in a range of values]: This type of tasks requests the resources having value greater than vmin and less than vmax . We define two tasks of this type, by selecting different combinations of vmin and vmax values, such that tasks which consider different numbers of resources are defined. We define two tasks, the first task considers a relative small number of resources while the second a larger. In our experiments we select 10 as a small number of resources, while as a large number we select 50. Particularly, in the first task, named T2.1, we specify the values vmin and vmax such that a relatively small set of (approximately 10) resources are included, whereas the second task, T2.2, considers a relatively larger set of (approximately 50) resources. Given these tasks, the participants are asked to provide the number of resources included in the given range. This task can be solved by first finding a resource with a value in- N. Bikakis et al. / A Hierarchical Aggregation Framework for Efficient Multilevel Visual Exploration and Analysis cluded in the given range, and then explore the nearby resources in order to identify the resources in the given range. Type 3 [Compare distributions]: This type of tasks requests from the participant to identify whether more resources appear above or below a given value v. For this type, we define task T3, by selecting the value v near to the median. Given this task, the participants are asked to provide the number of resources appearing either above or below the value v. The answer for this tasks requires from the participants to indicate the value v and determine the number or resources appearing either before or after this value. 5.3.2. Setup In order to study the effect of the property size in the selected tasks, we have selected two properties of different sizes from the employed dataset (Section 5.1). The hsvCoordinateHue numeric property containing 970 triples, is referred to as Small, and the maximumElevation numeric property, containing 37.936 triples, is referred to as Large. The first one corresponds to a hierarchy of height 4 and degree 3, and the latter corresponds to a hierarchy of height 7 and degree 3. We should note here that through the user evaluation, the hierarchy parameters were fixed for all the tasks, and the participants were not allowed to modify them, such that the setting has been the same for everyone. In our evaluation, 10 participants took part. The participants were computer science graduate students and researchers. At the beginning of the evaluation, each participant has introduced to the system by an instructor who provided a brief tutorial over the features required for the tasks. After the instructions, the participants familiarized themselves with the system. Note that we have integrated in the SynopsViz the FLAT approach along with the HETree approaches. During the evaluation, each participant performed the previously described four tasks, using all approaches (i.e., HETree-C/R and FLAT), over both the small and large properties. In order to reduce the learning effects and fatigue we defined three groups. In the first group, the participants start their tasks with the HETree-C approach, in the second with HETree-R, and in the third with FLAT. 
Finally, the property (i.e., Small, Large) first used in each task was counterbalanced among the participants and the tasks. The entire evaluation did not exceed 75 minutes. Furthermore, for each task (e.g., T2.1, T3), three task instances were specified by slightly modifying the task parameters. As a result, given a task, a participant had to solve a different instance of this task in each approach. For example, in task T2.1, for HETree-R the selected v corresponded to a solution of 11 resources, in HETree-C to 9 resources, whereas for FLAT v corresponded to a solution of 8 resources. The task instance assigned to each approach varied among the participants. During the evaluation, the instructor measured the time required for each participant to complete a task, as well as the number of incorrect answers.

Table 5 presents the average time required for the participants to complete each task. The table contains the measurements for all approaches, and for both properties. Although we acknowledge that the number of participants in our evaluation is small, we have computed the statistical significance of the results. Particularly, for each property, the p-value of each task is presented in the last column. The p-value is computed using one-way repeated measures ANOVA. In addition, the results regarding the number of tasks that were not correctly answered are presented in Table 6. Particularly, the table presents the percentage of incorrect answers for each task and property, referred to as the error rate. Additionally, for each task and property, the table includes the p-value. Here, the p-value has been computed using Fisher's exact test.

Table 5. Average Task Completion Time (sec)

                  Small Property                        Large Property
        FLAT   HETree-C   HETree-R    p       FLAT   HETree-C   HETree-R    p
T1        54       29         28      ⋆⋆        85       52         47      ⋆⋆
T2.1      63       57         64      ◆         74       60         69      ⋆
T2.2     120       69         74      ⋆⋆       128       72         77      ⋆⋆
T3       262       41         40      ⋆⋆         –       64         62       –

⋆⋆ (p < 0.01)   ⋆ (p < 0.05)   ◆ (p > 0.05)

Table 6. Error Rate (%)

                  Small Property                        Large Property
        FLAT   HETree-C   HETree-R    p       FLAT   HETree-C   HETree-R    p
T1         0        0          0      ◆          0        0          0      ◆
T2.1       0        0          0      ◆          0        0          0      ◆
T2.2      20        0          0      ◆         20        0         10      ◆
T3        70        0          0      ⋆⋆         –        0          0       –

⋆⋆ (p < 0.01)   ⋆ (p < 0.05)   ◆ (p > 0.05)

5.3.3. Results

Task T1. Regarding the first task, as we can observe from Table 5, the HETree approaches outperform FLAT for both property sizes. Note that the time results on T1 are statistically significant (p < 0.01). As expected, all approaches require more time for the Large property compared to the Small one. In FLAT, this overhead is caused by the larger number of resources that the participants have to scroll over and examine until they locate the requested resource value. On the other hand, in HETree, the overhead is caused by the larger number of levels in the Large property's hierarchy. Hence, the participants have to perform more drill-down operations and examine more groups of objects until they reach the LD resources.

We can also observe that, in this task, HETree-R performs slightly better than HETree-C for both property sizes. This is due to the fact that, in the HETree-R structure, resources having the same value are always contained in the same leaf. As a result, the participants had to inspect only one leaf. On the other hand, in HETree-C this does not always hold; hence, the participants may have had to explore more than one leaf. Finally, as we can observe from Table 6, in all cases only correct answers were provided. However, none of these results are statistically significant (p > 0.05).

Task T2.1.
In the next task, where the participants had to indicate a small set of resources in a range of values, the FLAT performance is very close to that of HETree, especially for the Small property (Table 5). In addition, we can observe that HETree-C performs slightly better than HETree-R. Regarding the statistical significance of the results, for the Small property we have p > 0.05, while for the Large property we have p < 0.05.

The poor performance of the HETree approaches in this task can be explained by the small number of resources requested and by the HETree parameters adopted in the user evaluation. In this setting, the resources contained in the task solution are distributed over more than one leaf. Hence, the participants had to perform several roll-up and drill-down operations in order to find all the resources. On the other hand, in FLAT, once the participants had located one of the requested resources, it was very easy for them to find the rest of the solution's resources. To sum up, in FLAT most of the time is spent on identifying the first of the resources, while in HETree the first resource is identified very quickly. Regarding the difference in performance between the HETree approaches, we have the following. In HETree-C, due to the fixed number of objects in each leaf, the participants had to visit at most one or two leaves in order to solve this task. On the other hand, in HETree-R, the number of objects in each leaf varies, so most of the time the participants had to inspect more than two leaves in order to solve the task. Finally, in this case too, only correct answers were given (Table 6).

Task T2.2. In this task, the participants had to indicate a larger set of resources (compared to the previous task), given a range of values. The HETree approaches noticeably outperform the FLAT approach, with statistical significance (p < 0.01), while similar results are observed for both properties. In the FLAT approach, considerable time was spent identifying and navigating over a large number of resources. On the other hand, due to the large number of resources involved in the task's solution, there are groups in the hierarchy that exclusively contain resources of the solution (i.e., they do not contain resources not included in the solution). As a result, in HETree the participants could easily indicate and compute the whole solution by combining the information related to the groups (i.e., the number of enclosed resources) and the individual resources. For the same reasons stated in the previous task (i.e., T2.1), in T2.2 HETree-C again performs slightly better than HETree-R. Finally, we can observe from Table 6 (though without statistical significance) that it was more difficult for the participants to solve this task correctly with FLAT than with HETree.

Task T3. In the last task, the participants were requested to find which of two ranges contained more resources. As expected, Table 5 shows that the HETree approaches clearly outperform the FLAT approach, with statistical significance for the Small property. This is due to the fact that, in FLAT, the participants had to overview and navigate over almost half of the dataset. As a result, apart from the long time required for this process, it was also very difficult to find the correct solution; this is also verified by Table 6 at a statistically significant level.
On the other hand, in the HETree approaches, the participants could easily find the answer by considering the resources enclosed by several groups. Regarding the Large property, as expected, it was impossible for the participants to solve this task with FLAT, since this required parsing over and counting about 19K resources. As a result, none of the participants completed this task using FLAT (indicated with "–" in Table 5), considering the 5-minute time limit used in this task.

5.3.4. Discussion

The user evaluation showed that the hierarchical approaches can be efficient (i.e., require a short time for solving tasks) and effective (i.e., have a lower error rate) in several cases. In more detail, the HETree approaches performed very well at locating specific values over a dataset and, given an appropriate parameter setting, are marginally affected by the dataset size. Also note that, due to the "vertical-based" exploration, the position (e.g., towards the end) of the requested value in the dataset does not affect the efficiency of the approach. Furthermore, it is shown that the hierarchical approaches can efficiently and effectively handle visual exploration tasks that involve large numbers of objects.

At the end of the evaluation, the participants gave us valuable feedback on possible improvements of our tool. Most of the participants criticized several aspects of the interface, since our tool is an early prototype. Also, several participants mentioned difficulties in keeping track of their "position" (e.g., which is the currently visualized range of values, or the previously visualized range of values) during the exploration. Finally, some participants mentioned that some hierarchies contained more levels than needed. As previously mentioned, the adopted parameters are not well suited to the evaluation, since hierarchies with a degree larger than 3 (and, as a result, fewer levels) are required.

Finally, additional tasks for demonstrating the capabilities of our model can be considered. However, most of these tasks were not selected in this evaluation, because it was not possible for the participants to perform them with the FLAT approach. An indicative set includes: (1) find the number of resources (and/or statistics) in the 1st and 3rd quartile; (2) find statistics (e.g., mean value, variance) for the top-10 or top-50 resources; (3) find the decade (i.e., temporal data) in which most events take place.

6. Related Work

This section reviews works related to our approach on visualization and exploration in the Web of Data (WoD). Section 6.1 presents systems and techniques for WoD visualization and exploration, Section 6.2 discusses techniques for WoD statistical analysis, Section 6.3 presents hierarchical data visualization techniques, and, finally, Section 6.4 discusses works on data structures and processing related to our HETree data structure.

In Table 7, we provide an overview of, and compare, several visualization systems that offer features similar to those of our SynopsViz. The WoD column indicates systems that target the Semantic Web and Linked Data area (i.e., RDF, RDF/S, OWL). The Hierarchical column indicates systems that provide hierarchical visualization of non-hierarchical data. The Statistics column captures the provision of statistics about the visualized data. The Recomm. column indicates systems which offer recommendation mechanisms for visualization settings (e.g., appropriate visualization type, visualization parameters, etc.). The Incr. column indicates systems that provide incremental visualizations. Finally, the Preferences column captures the ability of the users to apply data (e.g., aggregate) or visual (e.g., increase abstraction) operations.
Table 7. Visualization Systems Overview

6.1. Exploration & Visualization in the Web of Data

A large number of works studying issues related to WoD visual exploration and analysis have been proposed in the literature [30,18,80,3]. In what follows, we classify these works into the following categories: (1) generic visualization systems, (2) domain, vocabulary & device-specific visualization systems, and (3) graph-based visualization systems.

6.1.1. Generic Visualization Systems

In the context of WoD visual exploration, there is a large number of generic visualization frameworks that offer a wide range of visualization types and operations. Next, we outline the best-known systems in this category. Rhizomer [21] provides WoD exploration based on an overview, zoom and filter workflow. Rhizomer offers various types of visualizations, such as maps, timelines, treemaps and charts. VizBoard [109,110] is an information visualization workbench for WoD, built on top of a mashup platform. VizBoard presents datasets in a dashboard-like, composite, and interactive visualization. Additionally, the system provides visualization recommendations. Payola [68] is a generic framework for WoD visualization and analysis. The framework offers a variety of domain-specific (e.g., public procurement) analysis plugins (i.e., analyzers), as well as several visualization techniques (e.g., graphs, tables, etc.). In addition, Payola offers collaborative features for users to create and share analyzers.
In Payola, the visualizations can be customized according to the ontologies used in the resulting data. The Linked Data Visualization Model (LDVM) [20] provides an abstract visualization process for WoD datasets. LDVM enables the connection of different datasets with various kinds of visualizations in a dynamic way. The visualization process follows a four-stage workflow: Source data, Analytical abstraction, Visualization abstraction, and View. A prototype based on LDVM considers several visualization techniques, e.g., circle, sunburst, treemap, etc. Finally, LDVM has been adopted in several use cases [69]. Vis Wizard [105] is a Web-based visualization system which exploits data semantics to simplify the process of setting up visualizations. Vis Wizard is able to analyse multiple datasets using brushing and linking methods. Similarly, the Linked Data Visualization Wizard (LDVizWiz) [6] provides a semi-automatic way of producing possible visualizations for WoD datasets. In the same context, LinkDaViz [103] finds suitable visualizations for a given part of a dataset. The framework uses heuristic data analysis and a visualization model in order to facilitate automatic binding between the data and the visualization options.

Balloon Synopsis [91] provides a WoD visualizer based on HTML and JavaScript. It adopts a node-centric visualization approach in a tile design. Additionally, it supports automatic information enhancement of the local RDF data by accessing remote SPARQL endpoints or performing federated queries over endpoints using the Balloon Fusion service. Balloon Synopsis offers customizable filters, namely ontology templates, for the users to handle and transform (e.g., filter, merge) the input data. SemLens [51] is a visual system that combines scatter plots and semantic lenses, offering visual discovery of correlations and patterns in data. Objects are arranged in a scatter plot and are analysed using user-defined semantic lenses. LODeX [15] is a system that generates a representative summary of a WoD source. The system takes a SPARQL endpoint as input and generates a visual (graph-based) summary of the WoD source, accompanied by statistical and structural information about the source. LODWheel [99] is a Web-based visualization system which combines JavaScript libraries (e.g., MooWheel, JQPlot) in order to visualize RDF data in charts and graphs. Hide the stack [31] proposes an approach for visualizing WoD for mainstream end-users. The underlying Semantic Web technologies (e.g., RDF, SPARQL) are utilized, but are "hidden" from the end-users. Particularly, a template-based visualization approach is adopted, where the information for each resource is presented based on its rdf:type.

6.1.2. Domain, Vocabulary & Device-specific Visualization Systems

In this section, we present systems that target the visualization needs of specific types of data and domains, RDF vocabularies, or devices. Several systems focus on visualizing and exploring geo-spatial data. Map4rdf [74] is a faceted browsing tool that enables RDF datasets to be visualized on an OSM or Google Map. Facete [97] is an exploration and visualization system for SPARQL-accessible data, offering faceted filtering functionalities. SexTant [82] and Spacetime [107] focus on visualizing and exploring time-evolving geo-spatial data.
The LinkedGeoData Browser [96] is a faceted browser and editor which is developed in the context of the LinkedGeoData project. Finally, in the same context, DBpedia Atlas [106] offers exploration over the DBpedia dataset by exploiting the dataset's spatial data. Furthermore, in the context of linked university data, VISUalization Playground (VISU) [4] is an interactive tool for specifying and creating visualizations using the contents of the linked university data cloud. Particularly, VISU offers a novel SPARQL interface for creating data visualizations. Query results from selected SPARQL endpoints are visualized with Google Charts.
A variety of systems target multidimensional WoD modelled with the Data Cube vocabulary. CubeViz [37,90] is a faceted browser for exploring statistical data. The system provides data visualizations using different types of charts (i.e., line, bar, column, area and pie). The Payola Data Cube Vocabulary [52] adopts the LDVM stages [20] in order to visualize RDF data described by the Data Cube vocabulary. The same types of charts as in CubeViz are provided in this system. The OpenCube Toolkit [60] offers several systems related to statistical WoD. For example, the OpenCube Browser explores RDF data cubes by presenting a two-dimensional table. Additionally, the OpenCube Map View offers interactive map-based visualizations of RDF data cubes based on their geo-spatial dimension. The Linked Data Cubes Explorer (LDCE) [64] allows users to explore and analyse statistical datasets. Finally, [85] offers several map and chart visualizations of demographic, social and statistical linked cube data.15
Regarding device-specific systems, DBpedia Mobile [14] is a location-aware mobile application for exploring and visualizing DBpedia resources. Who's Who [23] is an application for exploring and visualizing information, focusing on several issues that appear in the mobile environment. For example, the application considers the usability and data processing challenges related to the small display size and limited resources of mobile devices.

15 www.linked-statistics.gr

6.1.3. Graph-based Visualization Systems

A large number of systems visualize WoD datasets adopting a graph-based (a.k.a. node-link) approach. RelFinder [50] is a Web-based tool that offers interactive discovery and visualization of relationships (i.e., connections) between selected WoD resources. Fenfire [48] and Lodlive [22] are exploratory systems that allow users to browse WoD using interactive graphs. Starting from a given URI, the user can explore WoD by following the links. IsaViz [87] allows users to zoom and navigate over the RDF graph, and also offers several "edit" operations (e.g., delete/add/rename nodes and edges). In the same context, graphVizdb [17,16] is built on top of spatial and database techniques, offering interactive visualization over very large (RDF) graphs. A different approach has been adopted in [100], where sampling techniques have been exploited. Finally, ZoomRDF [116] employs a space-optimized visualization algorithm in order to increase the number of resources which are displayed.

6.1.4. Discussion

In contrast to the aforementioned approaches, our work does not focus solely on proposing techniques for WoD visualization. Instead, we introduce a generic model for organizing, exploring and analysing numeric and temporal data in a multilevel fashion.
The underlying model is not bound to any specific type of visualization (e.g., chart); rather, it can be adopted by several "flat" techniques and offer multilevel visualizations over non-hierarchical data. Also, we present a prototype system that employs the introduced hierarchical model and offers efficient multilevel visual exploration over WoD datasets, using charts and timelines.

6.2. Statistical Analysis in the Web of Data

A second area related to the analysis features of the proposed model deals with WoD statistical analysis. RDFStats [72] calculates statistical information about RDF datasets. LODstats [9] is an extensible framework, offering scalable statistical analysis of WoD datasets. RapidMiner LOD Extension [88,84] is an extension of the data mining platform RapidMiner16, offering sophisticated data analysis operations over WoD. SparqlR17 is a package of the R18 statistical analysis platform. SparqlR executes SPARQL queries over SPARQL endpoints and provides statistical analysis and visualization over SPARQL results. Finally, ViCoMap [89] combines WoD statistical analysis and visualization in a Web-based tool, which offers correlation analysis and data visualization on maps.

16 rapidminer.com
17 cran.r-project.org/web/packages/SPARQL/index.html
18 www.r-project.org

6.2.1. Discussion

In comparison with these systems, our work does not focus on new techniques for WoD statistics computation and analysis. We are primarily interested in enhancing the visualization and user exploration functionality by providing statistical properties of the visualized datasets and objects, making use of existing computation techniques. Also, we demonstrate how, in the proposed structure, computations can be efficiently performed on-the-fly and enrich our hierarchical model. The presence of statistics provides quantifiable overviews of the underlying WoD resources at each exploration step. This is particularly important in tasks that involve exploring a large number of numeric or temporal data objects. Users can examine the characteristics of the next levels at a glance, and are thus not forced to drill down into lower hierarchy levels. Finally, the statistics over the different hierarchy levels enable analysis over different granularity levels.
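To make the above concrete, the following minimal Python sketch (our own illustrative names, not actual SynopsViz code) shows how count/min/max/mean statistics can be aggregated bottom-up over a node hierarchy, so that each hierarchy level carries a quantifiable summary of the data beneath it:

class Node:
    def __init__(self, children=None, values=None):
        self.children = children or []   # internal node: list of child nodes
        self.values = values or []       # leaf node: raw data values
        self.stats = None                # (count, min, max, mean)

def aggregate_stats(node):
    if not node.children:                # leaf: compute statistics from raw values
        count = len(node.values)
        lo, hi = min(node.values), max(node.values)
        mean = sum(node.values) / count
    else:                                # internal node: combine children's statistics
        child_stats = [aggregate_stats(c) for c in node.children]
        count = sum(s[0] for s in child_stats)
        lo = min(s[1] for s in child_stats)
        hi = max(s[2] for s in child_stats)
        mean = sum(s[0] * s[3] for s in child_stats) / count   # count-weighted mean
    node.stats = (count, lo, hi, mean)
    return node.stats

root = Node(children=[Node(values=[20, 30, 35]), Node(values=[37, 45, 50])])
print(aggregate_stats(root))             # (6, 20, 50, 36.17), computed in one pass

In this way, every node is processed once, and higher-level summaries are derived from the children's statistics rather than by rescanning the raw data.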
6.3. Hierarchical Visual Exploration

The wider area of data and information visualization has provided a variety of approaches for hierarchical analysis and presentation.
Treemaps [93] visualize tree structures using a space-filling layout algorithm based on recursive subdivision of space. Rectangles are used to represent tree nodes; the size of each node is proportional to the cumulative size of its descendant nodes. Finally, a large number of treemap variations have been proposed (e.g., Cushion Treemaps, Squarified Treemaps, Ordered Treemaps, etc.).
Moreover, hierarchical visualization techniques have been extensively employed to visualize very large graphs using the node-link paradigm. In these techniques the graph is recursively decomposed into smaller sub-graphs that form a hierarchy of abstraction layers. In most cases, the hierarchy is constructed by exploiting clustering and partitioning methods [1,7,11,57,104,75]. In other works, the hierarchy is defined with hub-based [76] and density-based [117] techniques. GrouseFlocks [5] supports ad-hoc hierarchies which are manually defined by the users. Finally, there are also edge bundling techniques, which join graph edges into bundles. The edges are often aggregated based on clustering techniques [41,38,86], a mesh [71,28], or explicitly by a hierarchy [53].
In the context of data warehousing and online analytical processing (OLAP), several approaches provide hierarchical visual exploration by exploiting the predefined hierarchies of the dimension space. [79] proposes a class of OLAP-aware hierarchical visual layouts; similarly, [102] uses OLAP-based hierarchical stacked bars. Polaris [98] offers visual exploratory analysis of data warehouses with rich hierarchical structure.
Several hierarchical techniques have been proposed in the context of ontology visualization and exploration [40,34,46,73]. CropCircles [111] adopts a hierarchical geometric containment approach, representing the class hierarchy as a set of concentric circles. Knoocks [70] combines containment-based and node-link approaches. In this work, ontologies are visualized as nested blocks, where each block is depicted as a rectangle containing a sub-branch shown as a treemap. A different approach is followed by OntoTrix [10], which combines graphs with adjacency matrices.
Finally, in the context of hierarchical navigation, [65] organizes query results using the MeSH concept hierarchy. In [24] a hierarchical structure is dynamically constructed to categorize numeric and categorical query results. Similarly, [26] constructs personalized hierarchies by considering diverse user preferences.

6.3.1. Discussion

In contrast to the above approaches that target graph-based or hierarchically-organized data, our work focuses on handling arbitrary numeric and temporal data, without requiring it to be described by a hierarchical schema. As an example of hierarchically-organized data, consider class hierarchies or multidimensional data organized in multilevel hierarchical dimensions (e.g., in the OLAP context, temporal data is hierarchically organized based on years, months, etc.). In contrast to the aforementioned approaches, our work dynamically constructs the hierarchies from raw numeric and temporal data. Thus, the proposed model can be combined with "flat" visualization techniques (e.g., chart, timeline), in order to provide multilevel visualizations over non-hierarchical data. In that sense, our approach can be considered more flexible compared to techniques that rely on predefined hierarchies, as it can enable exploratory functionality over dynamically retrieved datasets, by (incrementally) constructing hierarchies on-the-fly, and allowing users to modify these hierarchies.

6.4. Data Structures & Data Processing

In this section we present the data structures and the data (pre-)processing techniques which are most relevant to our approach.
R-Tree [45] is a disk-based multi-dimensional indexing structure, which has been widely used to efficiently handle spatial queries. R-Tree adopts the notion of minimum bounding rectangles (MBRs) in order to hierarchically organize multi-dimensional objects.
Data discretization [42,33] is a process in which continuous attributes are transformed into discrete ones. A large number of methods (e.g., supervised, unsupervised, univariate, multivariate) for data discretization have been proposed. Binning is a simple unsupervised discretization method in which a predefined number of bins is created. Widely known binning methods are equal-width and equal-frequency binning. In the equal-width approach, the range of an attribute is divided into intervals of equal width, and each interval represents a bin. In the equal-frequency approach, an equal number of values is placed in each bin.
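As an illustration, the following Python sketch (the function names are ours) contrasts the two methods on the ages of our running example (Figure 1):

def equal_width_bins(values, k):
    # split the value range into k intervals of equal width
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        i = min(int((v - lo) / width), k - 1)   # clamp the maximum into the last bin
        bins[i].append(v)
    return bins

def equal_frequency_bins(values, k):
    # place (roughly) the same number of sorted values into each bin
    s, size = sorted(values), len(values) // k
    return [s[i * size:(i + 1) * size] if i < k - 1 else s[i * size:]
            for i in range(k)]

ages = [35, 100, 55, 37, 30, 35, 45, 80, 20, 50]
print(equal_width_bins(ages, 4))        # equal ranges, varying counts per bin
print(equal_frequency_bins(ages, 5))    # varying ranges, two values per bin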
By recursively applying discretization techniques, a hierarchical discretization of an attribute's values can be produced (a.k.a. concept/generalization hierarchies). In [92] a dynamic programming algorithm for generating numeric concept hierarchies is proposed. The algorithm attempts to maximize both the similarity between the objects stored in the same hierarchy node and the dissimilarity between the objects stored in different nodes. The generated hierarchy is a balanced tree, where different nodes may have different numbers of children. Similarly, [47] constructs hierarchies based on the data distribution. Essentially, both the leaf and the internal nodes are created in such a way that an even distribution is achieved. The hierarchy construction also considers a threshold specifying the maximum number of distinct values enclosed by nodes at each hierarchy level. Finally, binary concept hierarchies (with degree equal to two) are generated in [27]. Starting from the whole dataset, it performs a recursive binary partitioning over the dataset's values; the recursion terminates when the number of distinct values in the resulting partitions is less than a pre-specified threshold.
Using the data objects from our running example (Figure 1), Figure 12 shows the hierarchies generated by the aforementioned approaches. Figure 12(a) presents the hierarchy resulting from [27] and Figure 12(b) depicts the result using the method from [47]. The parameters in each method are set so that the resulting hierarchies are as similar as possible to our hierarchies (Figures 2 & 3). Hence, the threshold in (a) is set to 3, and in (b) to 2.

Fig. 12. Hierarchies generated from different approaches: (a) based on [27]; (b) based on [47].

6.4.1. Discussion

The basic concepts of the HETree structure can be considered similar to a simplified version of a static 1D R-Tree. However, in order to provide efficient query processing in a disk-based environment, the R-Tree considers a large number of I/O-related issues (e.g., space coverage, node overlaps, fill guarantees, etc.). On the other hand, we introduce a lightweight, main-memory structure that is efficiently constructed on-the-fly. Also, the proposed structure aims at organizing the data in a practical manner for a (visual) exploration scenario, rather than for disk-based indexing and querying efficiency.
Compared to discretization techniques, our tree model exhibits several similarities; namely, the HETree-C version can be considered a hierarchical version of equal-frequency binning, and the HETree-R a hierarchical version of equal-width binning. However, the goal of data organization in HETree is to enable visualization and hierarchical exploration capabilities over dynamically retrieved non-hierarchical data. Hence, compared to the binning methods, we can consider the following basic differences. First, in contrast with binning methods, which require the user to specify certain parameters (e.g., the number/size of the bins, the number of distinct values in each bin, etc.), our approach is able to automatically estimate the hierarchy parameters and adjust the visualization results by considering the characteristics of the visualization environment.
Second, in hierarchical approaches the user is not always allowed to specify the hierarchy characteristics (e.g., degree). For example, the hierarchies in [27] always have a degree equal to two (Figure 12(a)), while in [47] the nodes have varying degrees (Figure 12(b)). On the other hand, in our approach the hierarchy characteristics can be specified precisely. In addition, when no specific hierarchy characteristics are requested, our approach generates perfect trees (Section 2.5), offering a "uniform" hierarchy structure. Third, the computational complexity of some of the hierarchical approaches (e.g., [92]) is prohibitive (i.e., at least cubic) for using them in practice, especially in settings where the hierarchies have to be constructed on-the-fly. Fourth, the proposed tree structure is exploited in order to allow efficient statistics computations over different groups of data; the statistics are then used to enhance the overall exploration functionality. Finally, the construction of the model is tailored to the user interaction and preferences; our model offers incremental construction driven by the user interaction, as well as efficient adaptation to the user preferences.

7. Conclusions

In this paper we have presented HETree, a generic model that combines personalized multilevel exploration with online analysis of numeric and temporal data. Our model is built on top of a lightweight tree-based structure, which can be efficiently constructed on-the-fly for a given set of data. We have presented two variations for constructing our model: the HETree-C structure organizes input data into fixed-size groups, whereas the HETree-R structure organizes input data into fixed-range groups. In that way, users can customize the exploration experience, allowing them to organize data in different ways, by parameterizing the number of groups, the range and cardinality of their contents, the number of hierarchy levels, and so on. We have also provided a way for efficiently computing statistics over the tree, as well as a method for automatically deriving from the input dataset the best-fit parameters for the construction of the model. Regarding the performance of multilevel exploration over large datasets, our model offers incremental HETree construction and prefetching, as well as efficient HETree adaptation based on user preferences. Based on the introduced model, a Web-based prototype system, called SynopsViz, has been developed. Finally, the efficiency and the effectiveness of the presented approach are demonstrated via a thorough performance evaluation and an empirical user study.
Some insights for future work include the support of sophisticated methods for data organization in order to effectively handle skewed data distributions and outliers.
Particularly, we are currently working on hybrid HETree versions that integrate concepts from both the HETree-C and HETree-R versions. For example, a hybrid HETree-C considers a threshold on the maximum range of a group; similarly, a threshold on the maximum number of objects in a group is considered in the hybrid HETree-R version. Regarding the SynopsViz tool, we are planning to redesign and extend the graphical user interface, so that our tool can use data resulting from SPARQL endpoints, as well as offer more sophisticated filtering techniques (e.g., SPARQL-enabled browsing over the data). Finally, we are interested in including more visualization techniques and libraries.

Acknowledgements. We would like to thank the editors and the three reviewers for their hard work in reviewing our article; their comments helped us to significantly improve our work. Further, we thank Giorgos Giannopoulos and Marios Meimaris for many helpful comments on earlier versions of this article. This work was partially supported by the EU/Greece funded KRIPIS: MEDA Project and the EU project "SlideWiki" (688095).

References

[1] J. Abello, F. van Ham, and N. Krishnan. ASK-GraphView: A Large Scale Graph Visualization System. IEEE Trans. Vis. Comput. Graph., 12(5), 2006.
[2] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, 2013.
[3] F. Alahmari, J. A. Thom, L. Magee, and W. Wong. Evaluating Semantic Browsers for Consuming Linked Data. In Australasian Database Conference (ADC), 2012.
[4] M. Alonen, T. Kauppinen, O. Suominen, and E. Hyvönen. Exploring the Linked University Data with Visualization Tools. In Extended Semantic Web Conference (ESWC), 2013.
[5] D. Archambault, T. Munzner, and D. Auber. GrouseFlocks: Steerable Exploration of Graph Hierarchy Space. IEEE Trans. Vis. Comput. Graph., 14(4), 2008.
[6] G. A. Atemezing and R. Troncy. Towards a linked-data based visualization wizard. In Workshop on Consuming Linked Data, 2014.
[7] D. Auber. Tulip - A Huge Graph Visualization Framework. In Graph Drawing Software, 2004.
[8] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. G. Ives. DBpedia: A Nucleus for a Web of Open Data. In International Semantic Web Conference (ISWC), 2007.
[9] S. Auer, J. Demter, M. Martin, and J. Lehmann. LODStats - An Extensible Framework for High-Performance Dataset Analytics. In Knowledge Engineering and Knowledge Management, 2012.
[10] B. Bach, E. Pietriga, and I. Liccardi. Visualizing Populated Ontologies with OntoTrix. Int. J. Semantic Web Inf. Syst., 9(4), 2013.
[11] M. Bastian, S. Heymann, and M. Jacomy. Gephi: An Open Source Software for Exploring and Manipulating Networks. In Conference on Weblogs and Social Media (ICWSM), 2009.
[12] L. Battle, R. Chang, and M. Stonebraker. Dynamic Prefetching of Data Tiles for Interactive Visualization. Technical Report, 2015.
[13] L. Battle, M. Stonebraker, and R. Chang. Dynamic reduction of query result sets for interactive visualization. In IEEE Conference on Big Data, 2013.
[14] C. Becker and C. Bizer. Exploring the Geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4), 2009.
[15] F. Benedetti, L. Po, and S. Bergamaschi. A Visual Summary for Linked Open Data sources. In International Semantic Web Conference (ISWC), 2014.
[16] N. Bikakis, J. Liagouris, M. Krommyda, G. Papastefanatos, and T. Sellis. Towards Scalable Visual Exploration of Very Large RDF Graphs. In Extended Semantic Web Conference (ESWC), 2015.
[17] N. Bikakis, J. Liagouris, M. Krommyda, G. Papastefanatos, and T. Sellis. graphVizdb: A Scalable Platform for Interactive Large Graph Visualization. In IEEE Intl. Conf. on Data Engineering (ICDE), 2016.
[18] N. Bikakis and T. Sellis. Exploration and Visualization in the Web of Big Linked Data: A Survey of the State of the Art. In International Workshop on Linked Web Data Management (LWDM), 2016.
[19] N. Bikakis, M. Skourla, and G. Papastefanatos. rdf:SynopsViz - A Framework for Hierarchical Linked Data Visual Exploration and Analysis. In Extended Semantic Web Conference (ESWC), 2014.
[20] J. M. Brunetti, S. Auer, R. García, J. Klímek, and M. Necaský. Formal Linked Data Visualization Model. In iiWAS, 2013.
[21] J. M. Brunetti, R. Gil, and R. García. Facets and Pivoting for Flexible and Usable Linked Data Exploration. In Interacting with Linked Data Workshop, 2012.
[22] D. V. Camarda, S. Mazzini, and A. Antonuccio. LodLive, exploring the web of data. In Conference on Semantic Systems (I-SEMANTICS), 2012.
[23] A. E. Cano, A. Dadzie, and M. Hartmann. Who's Who - A Linked Data Visualisation Tool for Mobile Environments. In Extended Semantic Web Conference (ESWC), 2011.
[24] K. Chakrabarti, S. Chaudhuri, and S. Hwang. Automatic Categorization of Query Results. In ACM Conference on Management of Data (SIGMOD), 2004.
[25] S. Chan, L. Xiao, J. Gerth, and P. Hanrahan. Maintaining interactivity while exploring massive time series. In IEEE Symposium on Visual Analytics Science and Technology (VAST), 2008.
[26] Z. Chen and T. Li. Addressing diverse user preferences in SQL-query-result navigation. In ACM Conference on Management of Data (SIGMOD), 2007.
[27] W. W. Chu and K. Chiang. Abstraction of High Level Concepts from Numerical Values in Databases. In AAAI Workshop on Knowledge Discovery in Databases, 1994.
[28] W. Cui, H. Zhou, H. Qu, P. C. Wong, and X. Li. Geometry-Based Edge Clustering for Graph Visualization. IEEE Trans. Vis. Comput. Graph., 14(6), 2008.
[29] A. Dadzie, V. Lanfranchi, and D. Petrelli. Seeing is believing: Linking data with knowledge. Information Visualization, 8(3), 2009.
[30] A. Dadzie and M. Rowe. Approaches to visualising Linked Data: A survey. Semantic Web, 2(2), 2011.
[31] A. Dadzie, M. Rowe, and D. Petrelli. Hide the Stack: Toward Usable Linked Data. In Extended Semantic Web Conference (ESWC), 2011.
[32] P. R. Doshi, E. A. Rundensteiner, and M. O. Ward. Prefetching for Visual Data Exploration. In Conference on Database Systems for Advanced Applications (DASFAA), 2003.
[33] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and Unsupervised Discretization of Continuous Features. In International Conference on Machine Learning, 1995.
[34] M. Dudás, O. Zamazal, and V. Svátek. Roadmapping and Navigating in the Ontology Visualization Landscape. In Conference on Knowledge Engineering and Knowledge Management (EKAW), 2014.
[35] A. Eldawy, M. Mokbel, and C. Jonathan. HadoopViz: A MapReduce Framework for Extensible Visualization of Big Spatial Data. In IEEE Intl. Conf. on Data Engineering (ICDE), 2016.
[36] N. Elmqvist and J. Fekete. Hierarchical Aggregation for Information Visualization: Overview, Techniques, and Design Guidelines. IEEE Trans. Vis. Comput. Graph., 16(3), 2010.
[37] I. Ermilov, M. Martin, J. Lehmann, and S. Auer. Linked Open Data Statistics: Collection and Exploitation. In Knowledge Engineering and the Semantic Web, 2013.
[38] O. Ersoy, C. Hurter, F. V. Paulovich, G. Cantareiro, and A. Telea. Skeleton-Based Edge Bundling for Graph Visualization. IEEE Trans. Vis. Comput. Graph., 17(12), 2011.
[39] D. Fisher, I. O. Popov, S. M. Drucker, and m. c. schraefel. Trust me, i'm partially right: incremental visualization lets analysts explore large datasets faster. In Conference on Human Factors in Computing Systems (CHI), 2012.
[40] B. Fu, N. F. Noy, and M.-A. Storey. Eye Tracking the User Experience - An Evaluation of Ontology Visualization Techniques. Semantic Web Journal (to appear), 2015.
[41] E. R. Gansner, Y. Hu, S. C. North, and C. E. Scheidegger. Multilevel agglomerative edge bundling for visualizing large graphs. In IEEE Pacific Visualization Symposium (PacificVis), 2011.
[42] S. García, J. Luengo, J. A. Sáez, V. López, and F. Herrera. A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Trans. Knowl. Data Eng., 25(4), 2013.
[43] P. Godfrey, J. Gryz, and P. Lasek. Interactive Visualization of Large Data Sets. Technical Report, York University, 2015.
[44] P. Godfrey, J. Gryz, P. Lasek, and N. Razavi. Visualization through inductive aggregation. In Conference on Extending Database Technology (EDBT), 2016.
[45] A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In ACM Conference on Management of Data (SIGMOD), 1984.
[46] F. Haag, S. Lohmann, S. Negru, and T. Ertl. OntoViBe: An Ontology Visualization Benchmark. In Workshop on Visualizations and User Interfaces for Knowledge Engineering and Linked Data Analytics, 2014.
[47] J. Han and Y. Fu. Dynamic Generation and Refinement of Concept Hierarchies for Knowledge Discovery in Databases. In AAAI Workshop on Knowledge Discovery in Databases, 1994.
[48] T. Hastrup, R. Cyganiak, and U. Bojars. Browsing Linked Data with Fenfire. In World Wide Web Conference (WWW), 2008.
[49] J. Heer and S. Kandel. Interactive Analysis of Big Data. ACM Crossroads, 19(1), 2012.
[50] P. Heim, S. Lohmann, and T. Stegemann. Interactive Relationship Discovery via the Semantic Web. In Extended Semantic Web Conference (ESWC), 2010.
[51] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In Conference on Semantic Systems (I-SEMANTICS), 2011.
[52] J. Helmich, J. Klímek, and M. Necaský. Visualizing RDF Data Cubes Using the Linked Data Visualization Model. In Extended Semantic Web Conference (ESWC), 2014.
[53] D. Holten. Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data. IEEE Trans. Vis. Comput. Graph., 12(5), 2006.
[54] S. Idreos, O. Papaemmanouil, and S. Chaudhuri. Overview of Data Exploration Techniques. In ACM Conference on Management of Data (SIGMOD), 2015.
[55] J. Im, F. G. Villegas, and M. J. McGuffin. VisReduce: Fast and responsive incremental information visualization of large datasets. In IEEE Conference on Big Data, 2013.
[56] P. Jayachandran, K. Tunga, N. Kamat, and A. Nandi. Combining User Interaction, Speculative Query Execution and Sampling in the DICE System. Proc. of the VLDB Endowment (PVLDB), 7(13), 2014.
[57] J. F. R. Jr., H. Tong, J. Pan, A. J. M. Traina, C. T. Jr., and C. Faloutsos. Large Graph Analysis in the GMine System. IEEE Trans. Knowl. Data Eng., 25(1), 2013.
[58] U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl. Faster Visual Analytics through Pixel-Perfect Aggregation. Proc. of the VLDB Endowment (PVLDB), 7(13), 2014.
[59] U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl. VDDA: automatic visualization-driven data aggregation in relational databases. Journal on Very Large Data Bases (VLDBJ), 2015.
[60] E. Kalampokis, A. Nikolov, P. Haase, R. Cyganiak, A. Stasiewicz, A. Karamanou, M. Zotou, D. Zeginis, E. Tambouris, and K. A. Tarabanis. Exploiting Linked Data Cubes with OpenCube Toolkit. In International Semantic Web Conference (ISWC), 2014.
[61] A. Kalinin, U. Çetintemel, and S. B. Zdonik. Interactive Data Exploration Using Semantic Windows. In ACM Conference on Management of Data (SIGMOD), 2014.
[62] A. Kalinin, U. Çetintemel, and S. B. Zdonik. Searchlight: Enabling Integrated Search and Exploration over Large Multidimensional Data. Proc. of the VLDB Endowment (PVLDB), 8(10), 2015.
[63] N. Kamat, P. Jayachandran, K. Tunga, and A. Nandi. Distributed and interactive cube exploration. In IEEE Conference on Data Engineering (ICDE), 2014.
[64] B. Kämpgen and A. Harth. OLAP4LD - A Framework for Building Analysis Applications Over Governmental Statistics. In Extended Semantic Web Conference (ESWC), 2014.
[65] A. Kashyap, V. Hristidis, M. Petropoulos, and S. Tavoulari. Effective Navigation of Query Results Based on Concept Hierarchies. IEEE Trans. Knowl. Data Eng., 23(4), 2011.
[66] H. A. Khan, M. A. Sharaf, and A. Albarrak. DivIDE: efficient diversification for interactive data exploration. In Conference on Scientific and Statistical Database Management (SSDBM), 2014.
[67] A. Kim, E. Blais, A. G. Parameswaran, P. Indyk, S. Madden, and R. Rubinfeld. Rapid sampling for visualizations with ordering guarantees. Proc. of the VLDB Endowment (PVLDB), 8(5), 2015.
[68] J. Klímek, J. Helmich, and M. Necaský. Payola: Collaborative Linked Data Analysis and Visualization Framework. In Extended Semantic Web Conference (ESWC), 2013.
[69] J. Klímek, J. Helmich, and M. Necaský. Use Cases for Linked Data Visualization Model. In Workshop on Linked Data on the Web (LDOW), 2015.
[70] S. Kriglstein and R. Motschnig-Pitrik. Knoocks: New Visualization Approach for Ontologies. In Conference on Information Visualisation, 2008.
[71] A. Lambert, R. Bourqui, and D. Auber. Winding Roads: Routing edges into bundles. Comput. Graph. Forum, 29(3), 2010.
[72] A. Langegger and W. Wöß. RDFStats - An Extensible RDF Statistics Generator and Library. In Database and Expert Systems Applications, 2009.
[73] M. Lanzenberger, J. Sampson, and M. Rester. Visualization in Ontology Tools. In International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), 2009.
[74] A. d. Leon, F. Wisniewki, B. Villazón-Terrazas, and O. Corcho. Map4rdf - Faceted Browser for Geospatial Datasets. In Using Open Data: policy modeling, citizen empowerment, data journalism, 2012.
[75] C. Li, G. Baciu, and Y. Wang. ModulGraph: Modularity-based Visualization of Massive Graphs. In Visualization in High Performance Computing, 2015.
[76] Z. Lin, N. Cao, H. Tong, F. Wang, U. Kang, and D. H. P. Chau. Demonstrating Interactive Multi-resolution Large Graph Exploration. In IEEE Conference on Data Mining Workshops, 2013.
[77] L. D. Lins, J. T. Klosowski, and C. E. Scheidegger. Nanocubes for Real-Time Exploration of Spatiotemporal Datasets. IEEE Trans. Vis. Comput. Graph., 19(12), 2013.
[78] Z. Liu, B. Jiang, and J. Heer. imMens: Real-time Visual Querying of Big Data. Comput. Graph. Forum, 32(3), 2013.
[79] S. Mansmann and M. H. Scholl. Exploring OLAP aggregates with hierarchical visualization techniques. In ACM Symposium on Applied Computing (SAC), 2007.
[80] N. Marie and F. L. Gandon. Survey of Linked Data Based Exploration Systems. In Workshop on Intelligent Exploration of Semantic Data (IESD), 2014.
[81] K. Morton, M. Balazinska, D. Grossman, and J. D. Mackinlay. Support the Data Enthusiast: Challenges for Next-Generation Data-Analysis Systems. Proc. of the VLDB Endowment (PVLDB), 7(6), 2014.
[82] C. Nikolaou, K. Dogani, K. Bereta, G. Garbis, M. Karpathiotakis, K. Kyzirakos, and M. Koubarakis. SexTant: Visualizing Time-Evolving Linked Geospatial Data. In International Semantic Web Conference (ISWC), 2013.
[83] Y. Park, M. J. Cafarella, and B. Mozafari. Visualization-Aware Sampling for Very Large Databases. In IEEE Intl. Conf. on Data Engineering (ICDE), 2016.
[84] H. Paulheim. Generating Possible Interpretations for Statistics from Linked Open Data. In Extended Semantic Web Conference (ESWC), 2012.
[85] I. Petrou, M. Meimaris, and G. Papastefanatos. Towards a methodology for publishing Linked Open Statistical Data. eJournal of eDemocracy & Open Government, 6(1), 2014.
[86] D. Phan, L. Xiao, R. B. Yeh, P. Hanrahan, and T. Winograd. Flow Map Layout. In IEEE Symposium on Information Visualization (InfoVis), 2005.
[87] E. Pietriga. IsaViz: a Visual Environment for Browsing and Authoring RDF Models. In World Wide Web Conference (WWW), 2002.
[88] P. Ristoski, C. Bizer, and H. Paulheim. Mining the Web of Linked Data with RapidMiner. In International Semantic Web Conference (ISWC), 2014.
[89] P. Ristoski and H. Paulheim. Visual Analysis of Statistical Data on Maps using Linked Open Data. In Extended Semantic Web Conference (ESWC), 2015.
[90] P. E. R. Salas, F. M. D. Mota, K. K. Breitman, M. A. Casanova, M. Martin, and S. Auer. Publishing Statistical Data on the Web. Int. J. Semantic Computing, 6(4), 2012.
[91] K. Schlegel, T. Weißgerber, F. Stegmaier, C. Seifert, M. Granitzer, and H. Kosch. Balloon Synopsis: A Modern Node-Centric RDF Viewer and Browser for the Web. In Extended Semantic Web Conference (ESWC), 2014.
[92] C. Shen and Y. Chen. A dynamic-programming algorithm for hierarchical discretization of continuous attributes. European Journal of Operational Research, 184(2), 2008.
[93] B. Shneiderman. Tree Visualization with Tree-Maps: 2-d Space-Filling Approach. ACM Trans. Graph., 11(1), 1992.
[94] B. Shneiderman. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In IEEE Symposium on Visual Languages, 1996.
[95] B. Shneiderman. Extreme visualization: squeezing a billion records into a million pixels. In ACM Conference on Management of Data (SIGMOD), 2008.
[96] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web, 3(4), 2012.
[97] C. Stadler, M. Martin, and S. Auer. Exploring the web of spatial data with facete. In World Wide Web Conference (WWW), 2014.
[98] C. Stolte, D. Tang, and P. Hanrahan. Query, analysis, and visualization of hierarchically structured data using Polaris. In ACM Conference on Knowledge Discovery and Data Mining (SIGKDD), 2002.
[99] M. Stuhr, D. Roman, and D. Norheim. LODWheel - JavaScript-based Visualization of RDF Data. In Workshop on Consuming Linked Data, 2011.
[100] S. Sundara, M. Atre, V. Kolovski, S. Das, Z. Wu, E. I. Chong, and J. Srinivasan. Visualizing large-scale RDF data using Subsets, Summaries, and Sampling in Oracle. In IEEE Intl. Conf. on Data Engineering (ICDE), 2010.
[101] F. Tauheed, T. Heinis, F. Schürmann, H. Markram, and A. Ailamaki. SCOUT: Prefetching for Latent Feature Following Queries. Proc. of the VLDB Endowment (PVLDB), 5(11), 2012.
[102] K. Techapichetvanich and A. Datta. Interactive Visualization for OLAP. In Computational Science and Its Applications (ICCSA), 2005.
[103] K. Thellmann, M. Galkin, F. Orlandi, and S. Auer. LinkDaViz - Automatic Binding of Linked Data to Visualizations. In International Semantic Web Conference (ISWC), 2015.
[104] C. Tominski, J. Abello, and H. Schumann. CGV - An interactive graph visualization system. Computers & Graphics, 33(6), 2009.
[105] G. Tschinkel, E. E. Veas, B. Mutlu, and V. Sabol. Using Semantics for Interactive Visual Analysis of Linked Open Data. In International Semantic Web Conference (ISWC), 2014.
[106] F. Valsecchi, M. Abrate, C. Bacciu, M. Tesconi, and A. Marchetti. DBpedia Atlas: Mapping the Uncharted Lands of Linked Data. In Workshop on Linked Data on the Web (LDOW), 2015.
[107] F. Valsecchi and M. Ronchetti. Spacetime: a Two Dimensions Search and Visualisation Engine Based on Linked Data. In Conference on Advances in Semantic Processing (SEMAPRO), 2014.
[108] M. Vartak, S. Madden, A. G. Parameswaran, and N. Polyzotis. SEEDB: Automatically Generating Query Visualizations. Proc. of the VLDB Endowment (PVLDB), 7(13), 2014.
[109] M. Voigt, S. Pietschmann, L. Grammel, and K. Meißner. Context-aware Recommendation of Visualization Components. In Conference on Information, Process, and Knowledge Management (eKNOW), 2012.
[110] M. Voigt, S. Pietschmann, and K. Meißner. A Semantics-Based, End-User-Centered Information Visualization Process for Semantic Web Data. In Semantic Models for Adaptive Interactive Systems, 2013.
[111] T. D. Wang and B. Parsia. CropCircles: Topology Sensitive Visualization of OWL Class Hierarchies. In International Semantic Web Conference (ISWC), 2006.
[112] M. O. Ward. XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data. In IEEE Visualization, 1994.
[113] H. Wickham. Bin-summarise-smooth: a framework for visualising large data. Technical report, 2013.
[114] E. Wu, L. Battle, and S. R. Madden. The Case for Data Visualization Management Systems. Proc. of the VLDB Endowment (PVLDB), 7(10), 2014.
[115] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality assessment methodologies for linked open data. Semantic Web Journal (to appear), 2015.
[116] K. Zhang, H. Wang, D. T. Tran, and Y. Yu. ZoomRDF: semantic fisheye zooming on RDF data. In World Wide Web Conference (WWW), 2010.
[117] M. Zinsmaier, U. Brandes, O. Deussen, and H. Strobelt. Interactive Level-of-Detail Rendering of Large Graphs. IEEE Trans. Vis. Comput. Graph., 18(12), 2012.
[118] K. Zoumpatianos, S. Idreos, and T. Palpanas. Indexing for interactive exploration of big data series. In ACM Conference on Management of Data (SIGMOD), 2014.

Appendix

A. Incremental HETree Construction

Remark 1. Each time ICO constructs a node (either as part of the initial nodes or due to a construction rule), it also constructs all of its sibling nodes.

Proof of Proposition 1. Considering the different cases of the currently presented HETree elements and the available exploration operations, we have the following.
(1) A set of (internal or leaf) sibling nodes S is presented and the user performs a roll-up action.
Here, the roll-up action will render the parent node of S along with the parent's sibling nodes. In the case that S are the nodes of interest (RAN scenario), the rendered nodes have been constructed at the beginning of the exploration (as part of the RAN initial nodes). Otherwise, the presented nodes have been previously constructed due to construction Rule 1(i) (see Section 3.2).
(2) A set of internal sibling nodes C is presented and the user performs a drill-down action over a node c ∈ C. In this case, the drill-down will render the child nodes of c. If C are the nodes of interest (RAN scenario), then the child nodes of c have been constructed at the beginning of the exploration (as part of the RAN initial nodes). Else, if C is the root node (BSC scenario), then again the child nodes of c have been constructed at the beginning of the exploration (as part of the BSC initial nodes). Otherwise, the children of c have been constructed before, due to construction Rule 1(ii) (see Section 3.2).
(3) A set of leaf sibling nodes L is presented and the user performs a drill-down action over a leaf l ∈ L. In this case, the drill-down action will render the data objects contained in l. Since a leaf is constructed together with its data objects, all data objects here have been previously constructed along with l.
(4) A set of data objects O is presented and the user performs a roll-up action. Here, the roll-up action will render the leaf that contains O, along with the leaf's siblings. In the RAN and BSC exploration scenarios, data objects are reachable only via a drill-down action over the leaf that contains them, whereas in the RES scenario, the data objects contained in the leaf of interest are the first elements presented to the user. In the general case, since O is reached only via a drill-down, its parent leaf has already been constructed. Based on Remark 1, all sibling nodes of this leaf have also been constructed. In the case of the RES scenario, where O includes the resource of interest, the leaf that contains O, along with the leaf's siblings, have been constructed at the beginning of the exploration.
Thus, it is shown that, in all cases, the HETree elements that a user can reach by performing one operation have been previously constructed by ICO. This concludes the proof of Proposition 1. □

Proof of Theorem 1. We will show that, during an exploration scenario, in any exploration step, ICO constructs only the required HETree elements. Considering an exploration scenario, ICO constructs nodes only either as initial nodes, or via the construction rules. The initial nodes are constructed once, at the beginning of the exploration process; based on the definition of the initial nodes, these nodes are the required HETree elements for the first user operation. During the exploration process, ICO constructs nodes only via the construction rules. Essentially, among the construction rules, only Rule 1 constructs new nodes. Considering the part of the tree that is rendered when Rule 1 (Section 3.2) is applied, it is apparent that the nodes constructed by Rule 1 are only the required HETree elements. Therefore, in any exploration step, ICO constructs only the required HETree elements. By considering all the steps comprising a user exploration scenario, the overall number of constructed elements is the minimum. This concludes the proof of Theorem 1. □
Procedure 4: constrRollUp-R(D, d, cur, H)
Input: D: set of objects; d: tree degree; cur: currently presented elements; H: currently created HETree-R
Output: H: updated HETree-R
//Computed in ICO-R: len: the length of the leaf's interval
1   create an empty node par                                //cur parent node
2   par.h ← cur[1].h + 1
3   par.I− ← cur[1].I−
4   par.I+ ← cur[|cur|].I+
5   for i ← 1 to |cur| do                                   //create parent-child relations
6       par.c[i] ← cur[i]
7       cur[i].p ← par
8   insert par into H
9   lp ← par.I+ − par.I−                                    //par interval length
10  I−ppar ← D.minv + d·lp·⌊(par.I− − D.minv)/(d·lp)⌋       //compute interval for par's parent, Ippar
11  I+ppar ← min(D.maxv, I−ppar + d·lp)
12  lsp ← len · d^(cur[1].h)                                //interval length for a par sibling node
13  Ispar ← computeSiblingInterv-R(I−ppar, I+ppar, lsp, d)  //compute intervals for all par sibling nodes
14  remove par.I from Ispar                                 //remove par interval, par already constructed
15  S ← constrSiblingNodes-R(Ispar, null, D, cur[1].h + 1)
16  insert S into H
17  return H

Procedure 5: constrDrillDown-R(D, d, cur, H)
Input: D: set of objects; d: tree degree; cur: currently presented elements; H: currently created HETree-R
Output: H: updated HETree-R
//Computed in ICO-R: len: the length of the leaf's interval
1   lc ← len · d^(cur[1].h − 1)                             //length of the children's intervals
2   for i ← 1 to |cur| do
3       if cur[i].c[0] ≠ null then continue                 //nodes previously constructed
4       Ich ← computeSiblingInterv-R(cur[i].I−, cur[i].I+, lc, d)   //compute intervals for cur[i] children
5       S ← constrSiblingNodes-R(Ich, cur[i], cur[i].data, cur[1].h − 1)
6       for k ← 1 to |S| do                                 //create parent-child relations
7           cur[i].c[k] ← S[k]
8       insert S into H
9   return H

Procedure 6: computeSiblingInterv-R(low, up, len, n)
Input: low: intervals' lower bound; up: intervals' upper bound; len: intervals' length; n: number of siblings
Output: I: an ordered set with at most n equal-length intervals
1   I−t, I+t ← low
2   for i ← 1 to n do
3       I−t ← I+t
4       I+t ← min(up, len + I−t)
5       append It to I
6       if I+t = up then break
7   return I

B. ICO Algorithm

The constrRollUp-R (Procedure 4) initially constructs par, the parent node of cur (lines 1-7). Next, it computes the interval Ippar corresponding to the interval of par's parent node (lines 10-11). Using Ippar, it computes the intervals for each of par's sibling nodes (line 13). Finally, the computed sibling intervals Ispar are used for the construction of the nodes (line 15). In constrDrillDown-R (Procedure 5), for each node in cur, its children are constructed as follows (line 2). First, the procedure computes the intervals Ich of the children (line 4) and then constructs all children (line 5). Finally, the child relations for the parent node cur[i] are created (lines 6-7).
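As an illustration of the interval alignment performed in lines 10-11 of Procedure 4, the following Python sketch (the function name is ours) traces the computation over the [20, 100] age domain of the running example:

import math

def parent_interval(minv, maxv, par_low, lp, d):
    # lines 10-11: align the parent's parent interval on a multiple of d*lp above D.minv
    low = minv + d * lp * math.floor((par_low - minv) / (d * lp))
    high = min(maxv, low + d * lp)
    return low, high

# par covers [35, 50), so lp = 15; with degree d = 3 and domain [20, 100]:
print(parent_interval(20, 100, 35, 15, 3))   # -> (20, 65), i.e., Ippar = [20, 65)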
C. Incremental HETree Construction Analysis

In this section, we analyse in detail the worst case of the ICO algorithm, i.e., the case where the construction cost is maximized.

C.1. The HETree-R Version

The worst case in HETree-R occurs when the whole dataset D is contained in a set of sibling leaf nodes L, where |L| ≤ d. Considering the above setting, in the RES scenario the cost is maximized when ICO-R constructs L (as initial nodes). In this case, the cost is O(|D| + |D|log|D|) = O(|D|log|D|).
In a RAN scenario, the cost is maximized when the parent node p of L, along with p's sibling nodes, is considered as the nodes of interest. First, note that in this case p has no sibling nodes, since all the sibling nodes are empty (i.e., they do not enclose data). Hence, p has to be constructed by ICO-R (as an initial node), L by constrDrillDown-R, and the parent of p by constrRollUp-R. The construction of p in ICO-R requires O(|D|). Also, the construction of L in constrDrillDown-R requires O(d + |D| + |D|log|D| + d). Finally, the construction of the parent of p in constrRollUp-R requires O(1). Therefore, in RAN the overall worst-case cost is O(|D| + d + |D| + |D|log|D| + d) = O(|D|log|D|).
Finally, in the BSC scenario, the cost is maximized when L has to be constructed by constrDrillDown-R, which requires O(d + |D| + |D|log|D| + d) = O(|D|log|D|).

Procedure 7: constrSiblingNodes-R(I, p, A, h)
Input: I: an ordered set with equal-length intervals; p: nodes' parent node; A: available data objects; h: nodes' height
Output: S: a set of HETree-R sibling nodes
1   l ← I[1]+ − I[1]−                          //intervals' length
2   T[ ] ← ∅                                   //indicates enclosed data for each node
3   foreach tr ∈ A do
4       j ← ⌊(tr.o − I[1]−)/l⌋ + 1
5       if j ≥ 0 and j ≤ |I| then
6           insert object tr into T[j]
7           remove object tr from A
8   for i ← 1 to |I| do                        //construct nodes
9       if T[i] = ∅ then continue
10      create a new node n
11      n.I− ← I[i]−
12      n.I+ ← I[i]+
13      n.p ← p
14      n.c ← null
15      n.data ← T[i]
16      n.h ← h
17      if h = 0 then                           //node is a leaf
18          sort n.data based on objects' values
19      append n to S
20  return S

C.2. The HETree-C Version

First, let us note that in the HETree-C version, the dataset is sorted at the beginning of the exploration and the leaves contain equal numbers of data objects. As a result, during a node construction, the data objects enclosed by the node can be directly identified by computing the node's position over the dataset, without scanning the dataset or the enclosed data values. However, in ICO we assume that a node's statistics are computed each time the node is constructed. Hence, in each node construction, we scan the data objects that are enclosed by this node.
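For illustration, a minimal Python sketch of this index arithmetic over the sorted ages of the running example (the names are ours, assuming leaves of size λ = 2):

def leaf_objects(sorted_data, leaf_index, leaf_size):
    # HETree-C: the dataset is sorted, so a leaf's objects are located by position alone
    start = leaf_index * leaf_size
    return sorted_data[start:start + leaf_size]

ages = [20, 30, 35, 35, 37, 45, 50, 55, 80, 100]
print(leaf_objects(ages, 2, 2))   # third leaf (index 2) encloses [37, 45]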
In the RES scenario, the worst case occurs when the user rolls up for the first time to the nodes at level 2 (i.e., two levels below the root). In this case, ICO has to construct the d nodes at level 1, as well as the children of the d − 1 nodes at level 2. Note that the construction of the parent of the nodes at level 2 does not require processing any data objects or constructing children, since these nodes are already constructed. Now, regarding the construction of the rest d − 1 nodes at level 1, ICO will process at most (d−1)/d of all data objects.19 Thus, the cost for constrRollUp-C is O(d + ((d−1)/d)·|D| + d − 1). Finally, for constructing the child nodes of the d − 1 nodes at level 2, we are required to process at most (d−1)/d² of all data objects. Hence, the cost for constrDrillDown-C is O(d² + ((d−1)/d²)·|D| + d²). Therefore, in RES the worst-case cost is O(d + ((d−1)/d)·|D| + d − 1 + d² + ((d−1)/d²)·|D| + d²) = O(d² + ((d−1)/d)·|D|).

19 This number can be easily computed by considering the number of leaves enclosed by these nodes.

In the RAN scenario, the worst case occurs when the user starts from any set of sibling nodes at level 2. Hence, the cost is maximized at the beginning of the exploration. In this case, ICO has to construct the d initial nodes at level 2, the d nodes at level 1, and the children of all the d nodes at level 2. First, the d initial nodes at level 2 are constructed by ICO-C, which can be done in O(d + |D| + d). Then, the d nodes at level 1 are constructed by constrRollUp-C. Similarly to the RES scenario, this can be done in O(d + ((d−1)/d)·|D| + d − 1). Finally, the construction of the child nodes of all the d nodes at level 2 requires processing |D|/d data objects. Hence, the cost for constrDrillDown-C is O(d² + |D|/d + d²). Therefore, in RAN the worst-case cost is O(|D| + d + d + ((d−1)/d)·|D| + d − 1 + d² + |D|/d + d²) = O(d² + ((d−1)/d)·|D|).
Finally, in the BSC scenario, the worst case occurs when the user visits for the first time any of the nodes at level 1. In this case, ICO has to construct the children of the d nodes at level 1. Hence, constrDrillDown-C has to process |D| data objects in order to construct the d² child nodes. Therefore, in BSC the worst-case cost is O(d² + |D| + d²) = O(d² + |D|).

D. Adaptive HETree Construction

D.1. Preliminaries

In order to perform traversals over the levels of the HETree (i.e., level-order traversals), we use an array H of pointers to the ordered set of nodes at each level, with H[0] referring to the set of leaf nodes and H[k] referring to the set of nodes at height k. Moreover, we consider the following simple procedures that are used in the ADA implementation (two of them are sketched in Python after the list):
– mergeLeaves(L, m), where L is an ordered set of leaf nodes and m ∈ N⁺, with m > 1. This procedure returns an ordered set of ⌈|L|/m⌉ new leaf nodes; i.e., each new leaf merges m leaf nodes from L. The procedure traverses L, constructs a new leaf for every m nodes in L, and appends the data items from the m nodes to the new leaf. This procedure requires O(|L|).
– replaceNode(n1, n2), replaces the node n1 with the node n2; it removes n1 and updates the parent of n1 to refer to n2. This procedure requires constant time, hence O(1).
– createEdges(P, C, d), where P, C are ordered sets of nodes and d is the tree degree. It creates the edges (i.e., parent-child relations) from the parent nodes P to the child nodes C, with degree d. The procedure traverses over P and connects each node P[i] with the nodes from C[(i−1)·d + 1] to C[(i−1)·d + d]. This procedure requires O(|C|).
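For concreteness, the following Python sketch mirrors the mergeLeaves and createEdges procedures as described above (0-indexed Python lists emulate the 1-indexed sets used in the text; the Node fields are illustrative):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    data: List[float] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

def merge_leaves(L, m):
    # one pass over L: every m consecutive leaves become one new leaf, hence O(|L|)
    return [Node(data=[x for leaf in L[i:i + m] for x in leaf.data])
            for i in range(0, len(L), m)]

def create_edges(P, C, d):
    # connect parent P[i] with children C[(i-1)d+1] .. C[(i-1)d+d]; O(|C|) overall
    for i, parent in enumerate(P):
        parent.children = C[i * d:i * d + d]
        for child in parent.children:
            child.parent = parent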
D.2. The User Modifies the Tree Degree

D.2.1. The user increases the tree degree

(1) d′ = d^k, with k ∈ N⁺ and k > 1
Tree Construction. For the T′ construction, we perform a reverse level-order traversal over T, using the H vector. Starting from the leaves (H[0]), we skip (i.e., remove) k − 1 levels of nodes. Then, for the nodes of the level above (H[k]), we create child relations with the (non-skipped) nodes in the level below. The above process continues until we reach the root node of T. Hence, in this case all nodes in T′ are obtained "directly" from T. Particularly, T′ is constructed using the root node of T, as well as the T nodes from H[j·k], j ∈ N₀. The T′ construction requires the execution of the createEdges procedure j times. For computing j, we have that j·k ≤ |H| ⇔ j·k ≤ log_d(ℓ). Considering that d′ = d^k, we have that k = log_d(d′). Hence, j·log_d(d′) ≤ log_d(ℓ) ⇔ j ≤ log_{d′}(ℓ). So, considering that the worst-case complexity of createEdges is O(ℓ), the overall complexity is O(ℓ·log_{d′}(ℓ)). Since ℓ ≤ |D|, in the worst case T′ can be constructed in O(|D|·log_d(|D|)) = O(|D|·log_{ᵏ√d′}(|D|)).
Statistics Computations. In this case there is no need for computing any new statistics.

(2) d′ = k·d, with k ∈ N⁺, k > 1 and k ≠ d^ν, where ν ∈ N⁺
Tree Construction. As in T′ the leaves remain the same as in T, we only use constrInterNodes (Procedure 2) to build the rest of the tree. Therefore, in the worst case, the complexity for constructing T′ is O((d′²·ℓ − d′)/(d′ − 1)).
Statistics Computations. The statistics for the T′ nodes of height 1 can be computed by aggregating statistics from T. Particularly, in T′ the statistics computations for each internal node of height 1 require O(k) instead of O(d′), where k = d′/d. Hence, considering that there are ⌈ℓ/d′⌉ internal nodes of height 1 in T′, the cost for their statistics is O(k·⌈ℓ/d′⌉) = O(k·ℓ/d′ + k). Regarding the cost of recomputing the remaining statistics from scratch, consider that there are (ℓ − 1)/(d′ − 1) internal nodes20 with heights greater than 1; the statistics computations for these nodes require O((d′·ℓ − d′)/(d′ − 1)). Therefore, the overall cost for statistics computations is O(k·ℓ/d′ + k + (d′·ℓ − d′)/(d′ − 1)).

20 Take into account that the maximum number of internal nodes (considering all levels) is (d^⌈log_d ℓ⌉ − 1)/(d − 1).

(3) elsewhere
Tree Construction. Similar to the previous case, the T′ construction requires O((d′²·ℓ − d′)/(d′ − 1)).
Statistics Computations. In this case, the statistics should be computed from scratch for all internal nodes in T′. Therefore, the complexity is O((d′²·ℓ − d′)/(d′ − 1)).

D.2.2. The user decreases the tree degree

(1) d′ = ᵏ√d, with k ∈ N⁺ and k > 1
Tree Construction. For the T′ construction we perform a reverse level-order traversal over T, using the H vector and starting from the nodes having height 1. In each level, for each node n, we call constrInterNodes (Procedure 2) using as input the d child nodes of n and the new degree d′. Note that, in this reconstruction case, constrInterNodes does not require constructing the root node; the root node here always corresponds to the node n. Hence, the complexity of constrInterNodes for one call is O(d). Considering that we perform a procedure call for all the internal nodes, and that the maximum number of internal nodes is (d·ℓ − 1)/(d − 1), in the worst case T′ can be constructed in O((d²·ℓ − d)/(d − 1)) = O((d′^2k·ℓ − d′^k)/(d′^k − 1)).
Regarding the number of internal nodes that we have to construct from scratch: since T′ has all the nodes of T, for T′ we have to construct from scratch (d′·ℓ − 1)/(d′ − 1) − (d·ℓ − 1)/(d − 1) new internal nodes, where the first part corresponds to the number of internal nodes of T′, and the second part to those of T. Considering that d′ = ᵏ√d, we have to build (d′·ℓ − 1)/(d′ − 1) − (d′^k·ℓ − 1)/(d′^k − 1) internal nodes.
Statistics Computations. Statistics should be computed only for the new internal nodes of T′. Hence, the cost here is O(d′·((d′·ℓ − 1)/(d′ − 1) − (d′^k·ℓ − 1)/(d′^k − 1))).

D.3. The User Modifies the Number of Leaves

D.3.1. The user decreases the number of leaves

(1) ℓ′ = ℓ/d^k, with k ∈ N⁺
Tree Construction. In this case, each leaf in T′ results from merging d^k leaves of T. Hence, the T′ leaves are constructed by calling mergeLeaves(ℓ, d^k). So, considering the mergeLeaves complexity, in the worst case the new leaves' construction requires O(|D|). Then, each leaf of T′ replaces an internal node of T having height k. Therefore, in the worst case (k = 1), we call the replaceNode procedure ⌈ℓ/d⌉ times, which requires O(⌈ℓ/d⌉). Therefore, the overall cost for constructing T′ in the worst case is O(|D| + ⌈ℓ/d⌉) = O(|D|).
Note that in this, as well as in the following case, we assume that T is a perfect tree. In the case where T is not perfect, we can use as T the perfect tree initially proposed by our system.

(2) ℓ′ = ℓ/k, with k ∈ N⁺, k > 1 and k ≠ d^ν, where ν ∈ N⁺
Tree Construction. In this case, each leaf in T′ results from merging k leaves of T. Hence, the T′ leaves are constructed by calling mergeLeaves(ℓ, k), which in the worst case requires O(|D|). Then, the rest of the tree is constructed from scratch using constrInterNodes. Therefore, the overall cost for the T′ construction is O(|D| + (d²·ℓ′ − d)/(d − 1)).
Statistics Computations. The statistics for all internal nodes have to be computed from scratch. Regarding the leaves, the statistics for each leaf in T′ are computed by aggregating the statistics of the T leaves it includes. Essentially, for computing the statistics of each leaf in an HETree-C, we have to process k values instead of |D|/ℓ′. However, in the worst case (i.e., ℓ = |D|), we have that k = |D|/ℓ′. Therefore, in the worst case (for both HETree versions) the complexity is the same as computing the statistics from scratch.

(3) ℓ′ = ℓ − k, with k ∈ N⁺, k > 1 and ℓ′ ≠ ℓ/ν, where ν ∈ N⁺
Tree Construction. In order to construct T′ we have to construct all nodes from scratch, which in the worst case requires O(|D|log|D| + (d²·ℓ′ − d)/(d − 1)).
Statistics Computations. The statistics of the T leaves that are fully contained in T′ can be used for calculating the statistics of the new leaves. The worst case is when the number of leaves that are fully contained in T′ is minimized. For HETree-C (resp. HETree-R), this occurs when the size of the leaves in T′ is λ′ = λ + 1 (resp. their length is ρ′ = ρ + ρ/ℓ). In this case, for every λ (resp. ρ) leaves from T that are used to construct the T′ leaves, at least 1 leaf is fully contained. Hence, when we process all ℓ leaves, at least ℓ²/|D| leaves are fully contained in T′. Hence, in HETree-C, in the statistics computations over the leaves, instead of processing |D| values, we process at most |D| − (ℓ²/|D|)·λ = |D| − ℓ values. The same also holds for HETree-R, if we assume a normal distribution over the values in D. Therefore, the cost for computing the leaves' statistics is O(|D| − ℓ) = O(|D| − ℓ′ − k).