Universal Data Model As A Way To Build Multi-Paradigm Data Lakes
to raise the issue of producing a single data structure uniting different data models and to conduct the experiment on large data volumes, including the loading of unstructured data. These issues will be the focus of the next experiment. The conducted experiment has laid down prerequisites that allow us to propose a single universal data model for building a data lake, which will provide processing of all enterprise data in both OLAP and OLTP modes.

TABLE I. AVERAGE QUERY EXECUTION TIME TO GRAPH DATABASES

| Database | Simple, s | Medium, s | Difficult, s |
|---|---|---|---|
| Apache Spark | 22.45458 | 37.75869 | 36.19566 |
| Neo4j | 1.203 | 31.1444 | 82.5117 |

TABLE II. AVERAGE QUERY EXECUTION TIME TO MULTIDIMENSIONAL DATABASES

| Database | Simple, s | Medium, s | Difficult, s |
|---|---|---|---|
| Apache Spark | 47.50 | 26.07 | 41.79 |
| Pentaho BI | 44.8 | 23.4 | 35.6 |

TABLE III. AVERAGE QUERY EXECUTION TIME TO RELATIONAL DATABASES

| Database | Simple, s | Medium, s | Difficult, s |
|---|---|---|---|
| Apache Spark | 3.04766 | 16.01681 | 17.23437 |
| PostgreSQL | 0.49267 | 13.64606 | 42.41046 |

III. THE METAGRAPH MODEL

Since a data lake can contain a very large amount of data, the universal data model used to structure it should provide an unlimited number of data nesting levels and encapsulation of nested data. These capabilities make it possible to work with data at the top generalized levels and to access the details only when needed. Metagraphs are the only type of data model that has this capability. In [17] it was proposed to use metagraphs to describe the structure of a data lake.

The term "metagraph" was first introduced in the monograph by A. Basu and R. Blanning [18]. Their definition of a metagraph included:

• Ability to combine vertices into arbitrary groups and to have nested groups of vertices inside these groups.
• Ability to connect by edges both individual vertices and groups of vertices, including edges that penetrate group boundaries at any nesting level.
• The presence of variables on the edges to which values can be assigned.

Thereafter, various modifications of the metagraph model appeared: a model with metavertices [19], a hierarchical model with metavertices and metaedges [20], and an annotated model [21], [22]. The annotated metagraph, an extension of the Basu-Blanning model, is the most universal of the models presented and is therefore of interest for use within a universal data model. This model was named annotated because metavertices or metaedges containing the same internal objects as other metavertices or metaedges annotate these objects, allowing some additional attributes to be added to them in the new representation.

Let us describe the main elements of the annotated metagraph. The metagraph itself is defined as: $MG = \langle V, MV, E, ME \rangle$, where $MG$ is the metagraph, $V$ is the set of metagraph vertices, $MV$ is the set of metagraph metavertices, $E$ is the set of metagraph edges, and $ME$ is the set of metagraph metaedges.
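The nesting and encapsulation requirement can be made concrete with a minimal Python sketch of a metagraph built from the four sets V, MV, E and ME. The class names, the use of string names as element identifiers, the dictionary representation of attributes, and the `expand` helper are assumptions made for illustration only; they are not taken from the cited works, and the tiny example graph is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set, Union


@dataclass
class Vertex:
    name: str
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    name: str
    begin: str                 # name of the start vertex or metavertex
    end: str                   # name of the end vertex or metavertex
    directed: bool = False     # the direction flag eo
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class MetaVertex:
    name: str
    fragment: Set[str] = field(default_factory=set)  # names of contained elements
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class MetaEdge(Edge):
    fragment: Set[str] = field(default_factory=set)  # names of contained elements


Element = Union[Vertex, Edge, MetaVertex, MetaEdge]


@dataclass
class Metagraph:
    V: Dict[str, Vertex] = field(default_factory=dict)
    MV: Dict[str, MetaVertex] = field(default_factory=dict)
    E: Dict[str, Edge] = field(default_factory=dict)
    ME: Dict[str, MetaEdge] = field(default_factory=dict)

    def add(self, el: Element) -> None:
        # MetaEdge is tested before Edge because it subclasses Edge.
        if isinstance(el, MetaEdge):
            self.ME[el.name] = el
        elif isinstance(el, Edge):
            self.E[el.name] = el
        elif isinstance(el, MetaVertex):
            self.MV[el.name] = el
        else:
            self.V[el.name] = el

    def expand(self, name: str) -> Set[str]:
        """Names of all elements nested inside `name`, at any depth."""
        container = self.MV.get(name) or self.ME.get(name)
        if container is None:
            return set()
        found: Set[str] = set()
        stack = list(container.fragment)
        while stack:
            child = stack.pop()
            if child in found:
                continue  # already visited; also guards against cycles
            found.add(child)
            nested = self.MV.get(child) or self.ME.get(child)
            if nested is not None:
                stack.extend(nested.fragment)
        return found


# Hypothetical example: mv_b encapsulates mv_a, which encapsulates v1, v2, e1.
mg = Metagraph()
for v in ("v1", "v2", "v3"):
    mg.add(Vertex(v))
mg.add(Edge("e1", "v1", "v2"))
mg.add(MetaVertex("mv_a", {"v1", "v2", "e1"}))
mg.add(MetaVertex("mv_b", {"mv_a", "v3"}))
mg.add(Edge("e2", "v3", "mv_a", directed=True))  # edge from a vertex to a metavertex

top_level = mg.MV["mv_b"].fragment   # generalized view: direct content only
details = mg.expand("mv_b")          # full detail, fetched only on request
```

In this sketch, reading `fragment` corresponds to working at the top generalized level, while `expand` descends into the nested content only when the details are actually needed.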
A vertex is defined as: $v = \langle \{atr_1, \ldots, atr_k\} \rangle$, $v \in V$, where $v$ is a metagraph vertex and $atr_1, \ldots, atr_k$ are the vertex attributes.

An edge of a metagraph is described as: $e = \langle v_{begin}, v_{end}, eo, \{atr_1, \ldots, atr_k\} \rangle$, $e \in E \wedge eo \in \{true, false\}$, where $e$ is a metagraph edge, $v_{begin}$ is the start vertex (metavertex) of the edge, $v_{end}$ is the end vertex (metavertex) of the edge, $atr_1, \ldots, atr_k$ are the edge attributes, and $eo$ is the edge direction flag ($eo = true$ for a directed edge; $eo = false$ for an undirected edge).

A fragment of a metagraph is defined in general form as: $EV = \langle \{ev \mid ev \in (V \cup E \cup MV \cup ME)\} \rangle$, where $EV$ is a fragment of the metagraph and $ev$ is an element that is either an edge, a metaedge, a vertex, or a metavertex.

A metavertex is defined as: $mv = \langle EV, \{atr_1, \ldots, atr_k\} \rangle$, $mv \in MV$, where $mv$ is a metagraph metavertex, $atr_1, \ldots, atr_k$ are the metavertex attributes, and $EV$ is a fragment of the metagraph.

A metaedge is defined as: $me = \langle v_{begin}, v_{end}, eo, \{atr_1, \ldots, atr_k\}, EV \rangle$, $me \in ME \wedge eo \in \{true, false\}$, where $me$ is a metagraph metaedge, $v_{begin}$ is the start vertex (metavertex) of the metaedge, $v_{end}$ is the end vertex (metavertex) of the metaedge, $atr_1, \ldots, atr_k$ are the metaedge attributes, $eo$ is the direction flag ($eo = true$ for a directed metaedge; $eo = false$ for an undirected metaedge), and $EV$ is a fragment of the metagraph.

Thus, the metagraph of the annotated model includes edges, vertices, metaedges and metavertices. Each element has its own set of attributes, where each attribute has a name and a value. Edges and metaedges of such a metagraph can penetrate through the boundaries of metavertices and metaedges to any nesting depth.

An example of an annotated metagraph is shown in Fig. 2. The metagraph contains three metavertices: $mv_1$, $mv_2$, $mv_3$. The metavertex $mv_1$ contains vertices $v_1$, $v_2$, $v_3$ and edges $e_1$, $e_2$, $e_3$ connecting them. The metavertex $mv_2$ contains vertices $v_4$, $v_5$ and the edge $e_6$ connecting them. The edges $e_4$, $e_5$ are examples of edges connecting vertices $v_2$-$v_4$ and $v_3$-$v_5$, which are contained in different metavertices $mv_1$ and $mv_2$. The edge $e_7$ is an example of an edge connecting the metavertices $mv_1$ and $mv_2$. The edge $e_8$ is an example of an edge connecting vertex $v_2$ and metavertex $mv_2$. The metavertex $mv_3$ contains metavertex $mv_2$, vertices $v_2$, $v_3$ and edge $e_2$ from metavertex $mv_1$, as well as edges $e_4$, $e_5$, $e_8$, which suggests a holonic aspect of the metagraph structure.

Fig. 2. Example of an annotated metagraph

The metagraph model has a wide range of applications, but in the context of data lakes it is not sufficient to solve all problems. Lakes can contain relational data, NoSQL database data, multidimensional cubes, and/or text documents for search indexes, so describing them with the metagraph model alone is quite difficult. Consequently, it is necessary to extend this model. For this purpose, let us turn to the notions of protograph and archigraph proposed in [23], [24].

IV. PROTOGRAPH AND ARCHIGRAPH

In [23], [24] a concept was considered that allows various generalizations of graphs (metagraphs, hypergraphs, multigraphs and others) to be described through an archigraph and a protograph.

An archigraph is a collection of sets between whose elements there exists an incidence relation. Formally, an archigraph is defined as: $G_n = \langle V_1, V_2, \ldots, V_n \rangle$, where $G_n$ is the archigraph, $V_i$ is a set of elements, and $n$ is the number of sets. Thus, an archigraph consists of some number of classes, where $V_i$ contains the set of elements of the $i$-th class. An example of an archigraph of degree 2, $G_2 = \langle V_1, V_2 \rangle$, is a regular graph given as: $G = \langle E, V \rangle$, where $E$ is the set of edges and $V$ is the set of vertices.

A protograph is a set of elements $P = \{p_1, p_2, \ldots, p_n\}$ together with their neighborhood matrix $M = \|m_{i,j}\|_{n \times n}$, $m_{i,j} \in \{0, 1\}$, where 1 means that element $p_i$ neighbors element $p_j$, and 0 means that it does not. A protograph can be considered as a graph with no edges; the role of edges is played by the adjacency of vertices to each other. Examples of protographs are a stack, a queue, and a map. Examples of infinite protographs are a Turing machine tape and a parquet. A protograph can be either undirected or directed. An example of each is shown in Fig. 3a and Fig. 3b, respectively.

Fig. 3. Example of a protograph: (a) undirected protograph; (b) directed protograph

A protograph is a minimal model, and by selecting subsets of its elements it is possible to form a graph, a metagraph, or an archigraph. An archigraph $G_n$ can be defined as a protograph $P$ whose elements are partitioned into $n$ classes. Also, [23] describes in detail the representation of various generalizations of the graph as a protograph, including the metagraph.

In this way, the previously described metagraph models can be systematized through the concept of archigraph. The first model, proposed by A. Basu and R. Blanning [18], is an archigraph of degree 4 and can be represented as a protograph of 4 classes: vertices, vertex groups, edges and variables. The annotated model is an archigraph of degree 5 and can be represented as a protograph of 5 classes: vertices, metavertices, edges, metaedges and attributes.

Therefore, it is possible to extend the archigraph representation of the annotated metagraph model to an archigraph of higher degree.

V. ARCHIGRAPH OF THE UNIVERSAL DATA MODEL

As previously mentioned, a data lake can store relational data, data from NoSQL databases, multidimensional cubes, text documents, search indexes, graphs, and more. To describe a universal data model of such a lake, an archigraph can be used.
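The protograph and archigraph notions introduced in Section IV can be illustrated with a short Python sketch: a protograph as a list of elements with a 0/1 neighborhood matrix, and an archigraph as a protograph whose elements are partitioned into named classes. The class and method names are hypothetical and chosen for illustration, and the dense list-of-lists matrix is an assumption that suits only small examples.

```python
from typing import Dict, List


class Protograph:
    """Elements p_1..p_n with an n x n neighborhood matrix M over {0, 1}."""

    def __init__(self, elements: List[str], directed: bool = False):
        self.elements = list(elements)
        self.index = {p: i for i, p in enumerate(self.elements)}
        n = len(self.elements)
        self.M = [[0] * n for _ in range(n)]
        self.directed = directed

    def set_neighbors(self, p: str, q: str) -> None:
        i, j = self.index[p], self.index[q]
        self.M[i][j] = 1
        if not self.directed:
            self.M[j][i] = 1  # an undirected neighborhood is symmetric

    def neighbors(self, p: str) -> List[str]:
        i = self.index[p]
        return [q for j, q in enumerate(self.elements) if self.M[i][j]]


class Archigraph(Protograph):
    """A protograph whose elements are partitioned into n classes V_1..V_n."""

    def __init__(self, classes: Dict[str, List[str]], directed: bool = False):
        elements = [p for members in classes.values() for p in members]
        if len(set(elements)) != len(elements):
            raise ValueError("classes must be pairwise disjoint")
        super().__init__(elements, directed)
        self.classes = {c: list(members) for c, members in classes.items()}
        self.class_of = {p: c for c, members in classes.items() for p in members}

    @property
    def degree(self) -> int:
        return len(self.classes)


# An ordinary graph as an archigraph of degree 2: classes V (vertices) and E (edges).
g = Archigraph({"V": ["a", "b", "c"], "E": ["ab", "bc"]})
for edge, (u, w) in {"ab": ("a", "b"), "bc": ("b", "c")}.items():
    g.set_neighbors(edge, u)  # incidence of the edge with its first endpoint
    g.set_neighbors(edge, w)  # incidence of the edge with its second endpoint
```

Raising the degree then amounts to passing more classes to the constructor, which is the mechanism by which an archigraph can accommodate additional kinds of data lake objects.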
Such an archigraph would be based on the archigraph of the annotated metagraph model, with new classes added to describe formats that are not native to the metagraph.

Next, we will consider a universal data model that supports graph, tabular, and multidimensional data representations, as well as a search index.

Let us start with the tabular representation. In [25] it was proposed to represent both the table and all its elements in the archigraph. This approach complicates reading the table in comparison with its representation as a set of consecutive bytes, because table elements allocated as separate elements of the archigraph require additional resources for their search and reading. Therefore, it is enough to add a single "tables" class to the archigraph.

For the multidimensional representation, we will likewise distinguish a single "multidimensional cube" class.

For a search index, it is enough to allocate an "index" class to describe the index itself and a "document" class to describe the documents associated with the index.

Thus, describing a universal data lake data model with support for graph, tabular and multidimensional data and search indexes requires an archigraph with 9 classes:

• vertices;
• edges;
• metavertices;
• metaedges;
• multidimensional cubes;
• tables;
• indexes;
• documents;
• attributes.

It is also necessary to specify a formalized system of adjacency rules for elements of the specified classes in the protograph corresponding to the archigraph:

• Each edge can be adjacent to an element of one of the following classes: vertices, metavertices, tables, multidimensional cubes, indexes, documents.
• Each metaedge can be adjacent to an element of one of the following classes: vertices, metavertices, tables, multidimensional cubes, indexes, documents.
• Metavertices can contain: vertices, metavertices, edges, metaedges, tables, multidimensional cubes, documents, indexes.
• Metaedges can contain: vertices, metavertices, edges, metaedges, tables, multidimensional cubes, documents, indexes.
• Attributes can be adjacent to an element of one of the following classes: vertices, metavertices, edges, metaedges, tables, multidimensional cubes, indexes, documents.

Additionally, to increase the capabilities and usability of the described universal model, the archigraph can be expanded in the future to support the following features:

1) Virtual tables and virtual multidimensional cubes as objects in addition to tables and multidimensional cubes.

2) Remotely located vertices, metavertices, edges, metaedges, tables and multidimensional cubes. These are objects that are located outside a particular data lake (possibly in another data lake) but are visible from it, and whose data is readable.

3) Mechanisms to support temporality in the form of additional timestamps and states similar to the ones presented in [26]. For tables and multidimensional cubes, such timestamps and states can be added relatively easily as extra columns or extra cube axes. Metagraph structures, by contrast, require additional system attributes for all elements, whose presence and change make it possible to track the appearance, change and deletion of all model elements over time.

The proposed universal data model based on the archigraph will allow the use of complex metagraph structures and the linking of tables, multidimensional cubes and search indexes to them. This will enable all major types of applications to work on a single data structure: transactional systems that currently use relational, graph or NoSQL databases, analytical systems that use multidimensional data structures, text search tools, master data management systems and Internet of Things applications. To implement the proposed model, technology platforms supporting it should be developed. Such platforms should provide a unified enterprise-wide data storage environment that operates on data stored not only on data center clusters, but also on edge computers directly involved in technological operations and in the operation of production equipment.

VI. DATA LAKE BASED ON THE UNIVERSAL DATA MODEL

It is proposed to implement the system for creating and maintaining data lakes on the basis of a universal data model according to the architecture presented in Fig. 4.

Fig. 4. Schema of the system for creating and maintaining data lakes on the basis of universal data model

To store the archigraph, a special metagraph DBMS is used, which is implemented as part of a separate subproject. Examples of such DBMS were proposed in [26]–[28]. A metagraph DBMS can also be built on the basis of a columnar DBMS using the concept described in [29]. The choice of a concrete implementation of a metagraph DBMS is a separate task and is beyond the scope of this paper.

The core architecture of the system for creating and maintaining data lakes will contain 3 main levels: the data storage level, the universal data model representation level, and the analytical query processing level.

• The data storage level is responsible for storing all lake data in the metagraph DBMS, deployed on top of HDFS file storage. The level is closed, i.e. it can be accessed only through the next level of the system.