Universal Data Model As A Way To Build Multi-Paradigm Data Lakes


Artem A. Sukhobokov
SAP America, Inc.
Newtown Square, USA
ORCID: 0000-0002-1370-6905
artem.sukhobokov@gmail.com

Yury E. Gapanyuk
Dep. Information processing and management systems
Bauman Moscow State Technical University
Moscow, Russia
ORCID: 0000-0001-9005-8174
gapyu@bmstu.ru

Artem A. Vetoshkin*
Dep. Information processing and management systems
Bauman Moscow State Technical University
Moscow, Russia
ORCID: 0009-0005-3510-6942
vetart1941@gmail.com

Alexandra R. Mironova
Dep. Information processing and management systems
Bauman Moscow State Technical University
Moscow, Russia
ORCID: 0009-0000-0286-1160
alexandra.miralex@gmail.com

Daniil R. Nikolskiy
Dep. of Automated and Computer Systems
Voronezh State Technical University
Voronej, Russia
ORCID: 0009-0002-4940-9838
nikolsky.dan@gmail.com

Maria A. Morozevich
Dep. Information processing and management systems
Bauman Moscow State Technical University
Moscow, Russia
ORCID: 0009-0009-7494-6273
mrsmoroz@mail.ru

Nikita A. Klyukin
Dep. Information processing and management systems
Bauman Moscow State Technical University
Moscow, Russia
ORCID: 0009-0001-9261-4805
nikitaklyukin365@gmail.com

Rodion A. Afanasev
OZON Marketplace Kazakhstan LLP
Almaty, Kazakhstan
ORCID: 0009-0007-4591-6281
roafanasev@gmail.com

Dmitriy S. Lakhvich
Uzum Market
Toshkent, Uzbekistan
ORCID: 0000-0002-2267-4204
frostball@gmail.com

*Corresponding author

Abstract – The paper focuses on building data lakes that combine all data from different models stored and processed across the enterprise in both OLAP and OLTP modes. As a test of this idea, an experiment was conducted to evaluate the performance of a multi-paradigm data lake built on a single SparkSQL-HDFS platform against 4 specialized DBMSs that contained the same data. The experiment showed that such a solution is possible, but further experimentation on a larger amount of data, including unstructured data, is needed to confirm this. As a further development of the idea of creating multi-paradigm data lakes on a single technological platform, the article proposes the concept of a universal data model. It is based on the archigraph structure supporting graph, tabular and multidimensional data representation, text documents, and a search engine index. Unlike the first multi-paradigm data lake experiment, the second one uses a metagraph DBMS as the unified technology platform. The architecture of a data lake management system based on the universal model is developed, and a variant of representing the archigraph data that describes the lake structure in the metagraph DBMS is given. Finally, the paper briefly describes an ongoing project to build a data lake management system using a universal data model, and a planned experiment to evaluate the performance and scalability of the implemented data lake construction method.

Keywords – data lake, universal data model, multidimensional cube, table, archigraph, search index, metagraph

I. INTRODUCTION

A data lake (Database) represents a way of organizing large amounts of data coming from various sources. Data lakes first appeared and began to be used as an economical alternative to data warehouses [1], [2]. This was due to the lack of a rigid data schema, faster adoption, higher adaptability to various demands and accessibility to a wider range of users [1], [2]. The concept began to develop rapidly. Hundreds of scientific publications and dozens of books are currently devoted to the architecture of data lakes. As a further development of the data lake concept, the following forms and varieties of data organization emerged:

• The Data Lakehouse – combining a data lake and a data warehouse [3].

• The Intelligent Data Lake contains structural and semantic metadata for data retrieval, provides means for enriching metadata via methods of schema matching and merging, and offers a unified interface for query processing [4].

• The Knowledge Lake represents a contextualized data lake and a set of algorithms for converting raw data (stored in a data lake) into contextualized data and knowledge using extraction, enrichment, annotation, linking and generalization methods [5].
• The enterprise big data lake [6].

• Various mixed forms, for example, an enterprise-scale intelligent data lake [7].

Due to the increasing volume and variety of data stored, multi-paradigm data lakes are becoming increasingly common. These are lakes containing data represented in different data models, e.g., relational, multidimensional, graph [9]. The reasons for creating such data lakes and the history of their emergence are discussed in the review [10].

Modern data warehousing systems are characterized by extensive use of Internet of Things, fog computing, mist computing and blockchain technologies. Connecting a large number of Edge computers and mobile devices increases the amount of data to exabytes [11]. Such volumes naturally preclude the possibility of duplicating all data for use in analytics.

In order to eliminate data fragmentation and excessive duplication, enterprises need a layer where all data is managed wherever it resides. The senior-level structure in the data management organization is now the data lake, so it is natural to assign these functions to it, expanding its role.

Combining all enterprise data at the lake level will positively affect the solution of several problems in enterprise data organization. First, it concerns multi-cloud data processing [12]. If all enterprise data is merged into a single lake, a clear boundary will be drawn: the data belonging to the enterprise on different clouds is part of the lake, and other data used in the course of work on these clouds is external data. Secondly, all enterprise data located in different geographically distributed data centers will be merged into a single lake [13]. This will allow, if necessary, data exchange between them by means of the data lake management system. Thirdly, data access and backup management will be improved, and existing data redundancy will possibly be eliminated. All of the above shows significant advantages if the lake becomes a unified structure and a way to organize the storage of all enterprise data.

II. THE FIRST EXPERIMENT

To test the advisability of using a single lake structure to organize the storage of all enterprise data, an initial experiment was conducted to evaluate ways to construct multi-paradigm data lakes [14].

During the experiment, two variants of the implementation of a multi-paradigm data lake were modeled:

• using several specialized DBMSs, each of which supported its own data model (graph, multidimensional and relational);

• using a single integrated big data processing platform, where all data from different models (also graph, multidimensional and relational) are collected together.

Apache Spark was used as the single integrated data processing platform, while the following specialized DBMSs were used:

• PostgreSQL – for processing relational data.

• Neo4j – for processing graph data.

• Pentaho BI – for processing multidimensional data.

To model a large data lake, it was decided to use the Microsoft Academic Knowledge Graph [15], [16] as a data source. It contains details about publications, the references contained in them, the authors of these publications and their affiliations, as well as about scientific journals, conferences, books, patents and the fields of science to which the publications relate.

To assess the speed, it was proposed to use pairwise equivalent queries of three levels of complexity and, with their help, to compare for each model the data processing speed on the integrated platform and on a specialized DBMS. To carry out the first experiment, a test stand was created.

All servers of the stand are VK Cloud virtual cloud servers with the following characteristics:

• 2xvCPU Intel Xeon Gold 6238R CPU @ 2.20GHz.

• 8 GB RAM DDR4, Synchronous, 2400 MHz.

• 30 GB ceph-ssd with triple replication to different storage servers.

• OS Ubuntu 22.04.

At first, 600 thousand publication records and all related records of other types were transferred to the cloud server with PostgreSQL. The 600,000 publication records included 100,000 records of each of the existing types (journal articles, conference articles, books, book chapters, patents and publications without a specified type). In total, more than 3 million records were migrated. During the transfer, tables that could not be linked to the main database and columns with service information were deleted. The final database retained 12 tables, which are shown in Fig. 1.

Fig. 1. Data schema after transformations

Based on the relational database structure presented in Fig. 1, the structure of the graph database and the structure of the multidimensional database were developed. Next, the databases were created and populated, which had significant specifics for each DBMS.

The results of testing on graph, multidimensional and relational databases are shown in Tables I, II, III, respectively.

The results for relational and graph databases lead to the conclusion that relational and graph structures in data lakes can feasibly be implemented by means of Apache Spark, and that it outperforms PostgreSQL and Neo4j on complex data queries.

For multidimensional data, the test result was ambiguous. We assume that the reason for the ambiguity is a suboptimal model of the multidimensional data architecture with hierarchies for Apache Spark. When executing a query with aggregation of cubes on the axes that had hierarchies, a large exhaustive scan of the data was performed. However, the maximum difference in query execution time between the single platform and the specialized DBMS is 17% when executing a difficult query, which in many cases will be acceptable when using a universal platform for the lake.

These results confirm that a single data organization in the form of a data lake can be used for storing all enterprise data in different models. The conducted experiment allows us
to raise the issue of producing a single data structure uniting different data models and to conduct the experiment on large data volumes, including the loading of unstructured data. These issues will be the focus of the next experiment. The conducted experiment has laid down prerequisites that allow us to propose a single universal data model for building a data lake, which will provide processing of all enterprise data both in OLAP and OLTP mode.

TABLE I. AVERAGE QUERY EXECUTION TIME TO GRAPH DATABASES

Database        Simple, sec.    Medium, sec.    Difficult, sec.
Apache Spark    22.45458        37.75869        36.19566
Neo4j           1.203           31.1444         82.5117

TABLE II. AVERAGE QUERY EXECUTION TIME TO MULTIDIMENSIONAL DATABASES

Database        Simple, sec.    Medium, sec.    Difficult, sec.
Apache Spark    47.50           26.07           41.79
Pentaho BI      44.8            23.4            35.6

TABLE III. AVERAGE QUERY EXECUTION TIME TO RELATIONAL DATABASES

Database        Simple, sec.    Medium, sec.    Difficult, sec.
Apache Spark    3.04766         16.01681        17.23437
PostgreSQL      0.49267         13.64606        42.41046

III. THE METAGRAPH MODEL

Since a data lake can contain a very large amount of data, the universal data model used to structure it should provide an unlimited number of data nesting levels and encapsulation of nested data. These capabilities make it possible to work with data at the top generalized levels and only access the details when needed. Metagraphs are the only type of data model that has this capability. In [17] it was proposed to use metagraphs to describe the structure of a data lake.

The term "metagraph" was first mentioned in the monograph by A. Basu and R. Blanning [18]. Their definition of a metagraph included:

• Ability to combine vertices into arbitrary groups and to have nested groups of vertices inside these groups.

• Ability to connect by edges both individual vertices and groups of vertices, at any nesting level, crossing group boundaries.

• The presence of variables on the edges to which values can be assigned.

Thereafter, various modifications of the metagraph model appeared: a model with metavertices [19], a hierarchical model with metavertices and metaedges [20], and an annotated model [21], [22]. The extension of the Basu-Blanning model proposed in the annotated metagraphs is the most universal model presented and is of interest for use within a universal data model. This model was named annotated because metavertices or metaedges containing the same internal objects as other metavertices or metaedges annotate these objects, allowing some additional attributes to be added to them in the new representation.

Let us describe the main elements of the annotated metagraph. The metagraph itself is defined as: MG = ⟨V, MV, E, ME⟩, where MG – metagraph, V – set of metagraph vertices, MV – set of metagraph metavertices, E – set of metagraph edges, ME – set of metagraph metaedges.
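
As an illustration, the four-set structure MG = ⟨V, MV, E, ME⟩ and the element definitions given in this section can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the class names, the `name` field used to identify elements, and the `nesting_depth` helper are all introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union

Attrs = Dict[str, Any]  # attribute name -> attribute value


@dataclass
class Vertex:
    name: str  # identifier used only in this sketch (assumption)
    attrs: Attrs = field(default_factory=dict)


@dataclass
class Edge:
    name: str
    begin: "Node"          # start vertex or metavertex
    end: "Node"            # end vertex or metavertex
    directed: bool = True  # plays the role of the "eo" flag
    attrs: Attrs = field(default_factory=dict)


@dataclass
class MetaVertex:
    name: str
    fragment: List["Element"] = field(default_factory=list)  # nested elements
    attrs: Attrs = field(default_factory=dict)


@dataclass
class MetaEdge:
    name: str
    begin: "Node"
    end: "Node"
    directed: bool = True
    attrs: Attrs = field(default_factory=dict)
    fragment: List["Element"] = field(default_factory=list)


Node = Union[Vertex, MetaVertex]
Element = Union[Vertex, Edge, MetaVertex, MetaEdge]


@dataclass
class Metagraph:
    V: List[Vertex] = field(default_factory=list)
    MV: List[MetaVertex] = field(default_factory=list)
    E: List[Edge] = field(default_factory=list)
    ME: List[MetaEdge] = field(default_factory=list)


def nesting_depth(element: Element) -> int:
    """Count how many metavertex/metaedge levels an element spans."""
    if isinstance(element, (MetaVertex, MetaEdge)):
        return 1 + max((nesting_depth(x) for x in element.fragment), default=0)
    return 0


# Two vertices and an undirected edge wrapped in two nested metavertices.
v1, v2 = Vertex("v1", {"label": "a"}), Vertex("v2")
e1 = Edge("e1", v1, v2, directed=False)
inner = MetaVertex("mv_inner", [v1, v2, e1])
outer = MetaVertex("mv_outer", [inner], {"note": "wraps mv_inner"})
mg = Metagraph(V=[v1, v2], MV=[inner, outer], E=[e1])
print(nesting_depth(outer))
```

Because a fragment is just a list of elements, a metavertex can hold other metavertices to any depth, which is the encapsulation property the model is chosen for.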
A vertex is defined as: v = ⟨{atr_1, …, atr_k}⟩, v ∈ V, where v – metagraph vertex, atr_1, …, atr_k – vertex attributes.

An edge of a metagraph is described as: e = ⟨v_begin, v_end, eo, {atr_1, …, atr_k}⟩, e ∈ E ∧ eo ∈ {true, false}, where e – metagraph edge, v_begin – start vertex (metavertex) of the edge, v_end – end vertex (metavertex) of the edge, atr_1, …, atr_k – edge attributes, eo – edge direction flag (eo = true – directed edge; eo = false – undirected edge).

A fragment of a metagraph in general form is defined as: EV = ⟨{ev | ev ∈ (V ∪ E ∪ MV ∪ ME)}⟩, where EV – fragment of metagraph, ev – element that is either an edge, a metaedge, a vertex, or a metavertex.

A metavertex is defined as: mv = ⟨EV, {atr_1, …, atr_k}⟩, mv ∈ MV, where mv – metagraph metavertex, atr_1, …, atr_k – metavertex attributes, EV – fragment of metagraph.

A metaedge is defined as: me = ⟨v_begin, v_end, eo, {atr_1, …, atr_k}, EV⟩, me ∈ ME ∧ eo ∈ {true, false}, where me – metagraph metaedge, v_begin – start vertex (metavertex) of the edge, v_end – end vertex (metavertex) of the edge, atr_1, …, atr_k – edge attributes, eo – edge direction flag (eo = true – directed edge; eo = false – undirected edge), EV – fragment of metagraph.

Thus, the metagraph of the annotated model includes edges, vertices, metaedges and metavertices. Each element has its own set of attributes, where each attribute has a name and a value. Edges and metaedges of such a metagraph can cross the boundaries of metavertices and metaedges to any nesting depth.

An example of an annotated metagraph is shown in Fig. 2. The metagraph contains three metavertices: mv1, mv2, mv3. The metavertex mv1 contains vertices v1, v2, v3 and edges e1, e2, e3 connecting them. The metavertex mv2 contains vertices v4, v5 and edge e6 connecting them. The edges e4, e5 are examples of edges connecting vertices v2-v4 and v3-v5, which are contained in different metavertices mv1 and mv2. The edge e7 is an example of an edge connecting the metavertices mv1 and mv2. The edge e8 is an example of an edge connecting vertex v2 and metavertex mv2. The metavertex mv3 contains metavertex mv2, vertices v2, v3 and edge e2 from metavertex mv1, as well as edges e4, e5, e8, which suggests a holonic aspect of the metagraph structure.

Fig. 2. Example of an annotated metagraph

The metagraph model has a wide range of applications, but in the context of lake data, such a model is not sufficient to solve all problems. Lakes can contain relational data, NoSQL database data, multidimensional cubes, and/or text documents for search indexes, so describing them with the metagraph model is quite a complex challenge. Consequently, it is necessary to extend this model. For this purpose, let us address the notions of protograph and archigraph proposed in [23], [24].

IV. PROTOGRAPH AND ARCHIGRAPH

In [23], [24] a concept was considered that allows describing various generalizations of graphs (metagraphs, hypergraphs, multigraphs and others) through an archigraph and a protograph.

An archigraph is a collection of sets between whose elements there exists an incidence relation. Formally, an archigraph is defined as: G_n = ⟨V_1, V_2, …, V_n⟩, where G_n – archigraph, V_i – set of elements, n – number of sets. Thus, it can be said that an archigraph consists of some number of classes, where V_i contains the set of elements of the i-th class. An example of an archigraph of degree 2, G_2 = ⟨V_1, V_2⟩, is a regular graph given as: G = ⟨E, V⟩, where E – set of edges, V – set of vertices.

A protograph is a set of elements P = {p_1, p_2, …, p_n} together with their neighborhood matrix M = ∥m_{i,j}∥_{n×n}, m_{i,j} ∈ {0, 1}, where 1 means that element p_i neighbors element p_j, and 0 means that it does not. A protograph can be considered as a graph with no edges; the role of edges is played by the adjacency of vertices to each other. Examples of protographs are: stack, queue, map. Examples of infinite protographs are a Turing machine tape and a parquet. A protograph can be either undirected or directed. An example of each protograph is shown in Fig. 3a and Fig. 3b, respectively.

Fig. 3. Example of a protograph: (a) undirected protograph; (b) directed protograph

A protograph is a minimal model, and by selecting subsets it is possible to form a graph, a metagraph, an archigraph. An archigraph G_n can be defined as a protograph P whose elements are partitioned into n classes. Also, [23] describes in detail the representation of various generalizations of the graph as a protograph, including the metagraph.

In this way, the previously described metagraph models can be systematized through the concept of archigraph. Thus, the first model, proposed by A. Basu and R. Blanning [18], is an archigraph of degree 4 and can be represented as a protograph of 4 classes: vertices, vertex groups, edges and variables. The annotated model is an archigraph of degree 5 and can be represented as a protograph of 5 classes: vertices, metavertices, edges, metaedges and attributes.

Therefore, it is possible to extend the archigraph representation of the annotated metagraph model to an archigraph of higher degree.

V. ARCHIGRAPH OF THE UNIVERSAL DATA MODEL

As previously mentioned, a data lake can store relational data, data from NoSQL databases, multidimensional cubes, text documents, search indexes, graphs, and more. To describe a universal data model of such a lake, an archigraph can be used.
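
To make the notion concrete, the sketch below models a protograph as a list of elements with a 0/1 neighborhood matrix, and an archigraph as a protograph whose elements are partitioned into named classes. It is only an illustration of the definitions from Section IV: the class and method names (`Protograph`, `Archigraph`, `connect`, `neighbors`, `degree`) are assumptions made for this example and do not come from [23], [24].

```python
from typing import Dict, List


class Protograph:
    """Elements plus a 0/1 neighborhood matrix (no explicit edge objects)."""

    def __init__(self, elements: List[str]) -> None:
        self.elements = list(elements)  # element names are assumed unique
        self.index = {p: i for i, p in enumerate(self.elements)}
        n = len(self.elements)
        self.m = [[0] * n for _ in range(n)]

    def connect(self, a: str, b: str, directed: bool = False) -> None:
        """Mark a as a neighbor of b; symmetric unless directed."""
        i, j = self.index[a], self.index[b]
        self.m[i][j] = 1
        if not directed:
            self.m[j][i] = 1

    def neighbors(self, a: str) -> List[str]:
        row = self.m[self.index[a]]
        return [self.elements[j] for j, flag in enumerate(row) if flag]


class Archigraph(Protograph):
    """A protograph whose elements are partitioned into n classes."""

    def __init__(self, classes: Dict[str, List[str]]) -> None:
        super().__init__([p for members in classes.values() for p in members])
        self.classes = list(classes)
        self.class_of = {p: c for c, members in classes.items() for p in members}

    @property
    def degree(self) -> int:
        return len(self.classes)


# A regular graph as an archigraph of degree 2: vertices and edges are both
# elements, and incidence is expressed through the neighborhood matrix.
g = Archigraph({"vertices": ["v1", "v2"], "edges": ["e1"]})
g.connect("e1", "v1")
g.connect("e1", "v2")
print(g.degree, g.neighbors("e1"))
```

Adding a new kind of object to the model then amounts to adding one more class to the partition, which is how the universal data model is built up in this section.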
Such an archigraph would be based on the archigraph of the annotated metagraph model with the addition of new classes to describe formats unnatural to the metagraph.

Next, we will consider a universal data model that supports graph, tabular, and multidimensional data representations, as well as a search index.

Let us start with the tabular representation. In [25] it was proposed to consider both the table and all its elements in the archigraph. This approach complicates reading the table compared with its representation as a set of consecutive bytes, because table elements allocated as separate elements of the archigraph will require additional resources for their search and reading. Therefore, it will be enough to add one class of "tables" to the archigraph.

For the multidimensional representation, we will also distinguish one class, "multidimensional cube".

For a search index, it is enough to allocate an "index" class to describe the index itself and a "document" class to describe the documents associated with the index.

Thus, describing a universal data lake data model with support for graph, tabular and multidimensional data and search indexes would require an archigraph with 9 classes:

• vertices;

• edges;

• metavertices;

• metaedges;

• multidimensional cubes;

• tables;

• indexes;

• documents;

• attributes.

It is also necessary to specify a formalized system of rules of adjacency of elements of the specified classes in the protograph corresponding to the archigraph:

• Each edge can be adjacent to one of the elements of the following classes: vertices, metavertices, tables, multidimensional cubes, indexes, documents.

• Each metaedge can be adjacent to one of the elements of the following classes: vertices, metavertices, tables, multidimensional cubes, indexes, documents.

• Metavertices can contain within them: vertices, metavertices, edges, metaedges, tables, multidimensional cubes, documents, indexes.

• Metaedges can contain within them: vertices, metavertices, edges, metaedges, tables, multidimensional cubes, documents, indexes.

• Attributes can be adjacent to an element of one of the following classes: vertices, metavertices, edges, metaedges, tables, multidimensional cubes, indexes, documents.

Additionally, in order to increase the capabilities and usability of the described universal model, it is possible in the future to expand the archigraph to support the following features:

1) Virtual tables and virtual multidimensional cubes as objects in addition to tables and multidimensional cubes.

2) Remotely located vertices, metavertices, edges, metaedges, tables and multidimensional cubes. These are objects that are located outside a particular data lake (possibly in another data lake) but are visible, and their data is readable.

3) Mechanisms to support temporality in the form of additional timestamps and states similar to the ones presented in [26]. While for tables and multidimensional cubes the additional labels and states can be implemented relatively easily by adding columns or cube axes, metagraph structures require additional system attributes for all elements, whose presence and changes make it possible to track the appearance, change and deletion of all model elements over time.

The proposed universal data model based on the archigraph will allow the use of complex metagraph structures and the linking of tables, multidimensional cubes and search indexes to them. This will enable all major types of applications to work on a single data structure: transactional systems that currently use relational, graph or NoSQL databases, analytical systems that use multidimensional data structures, text search tools, master data management systems and Internet of Things applications. To implement the proposed model, technology platforms supporting it should be developed. They should make it possible to organize a unified enterprise-wide data storage environment that operates with data stored not only on data center clusters, but also on Edge computers directly involved in technological operations and in the operation of production equipment.

VI. DATA LAKE BASED ON THE UNIVERSAL DATA MODEL

It is proposed to implement the system for creating and maintaining data lakes on the basis of a universal data model according to the architecture presented in Fig. 4.

To store the archigraph, a special metagraph DBMS is used, which is implemented as part of a separate subproject. Examples of such DBMSs were proposed in [26]–[28]. A metagraph DBMS can also be built on the basis of a columnar DBMS using the concept described in [29]. However, the choice of a concrete implementation of a metagraph DBMS is an additional task to be solved and is beyond the scope of this paper.

The core architecture of the system for creating and maintaining data lakes will contain 3 main levels: the data storage level, the level of data representation (the universal model), and the level of analytical query processing.

• The data storage level is responsible for storing all lake data in the metagraph DBMS, deployed on the basis of HDFS file storage. The level is closed, i.e. access to it is possible only through the next level of the system.
• The level of data representation is the core of the whole system. It is assigned such tasks as: organizing the data storage structure in a metagraph DBMS, building an archigraph based on the stored data (i.e., a universal model) and providing interfaces for accessing it. This level is open, i.e. it can be accessed from any place in the architecture.

• The level of analytical query processing makes it possible to run various analytical queries on the data lake data and provides interpretation of popular options for accessing data representations: SQL query, MDX query, text sample search query and Cypher query. This layer is the entrance to the data lake system, otherwise known as the interface for working with the lake data. It is also closed.

Fig. 4. Schema of the system for creating and maintaining data lakes on the basis of universal data model

The core of the proposed system is an archigraph DBMS. In addition to the three levels describing the core of the system, it makes sense to implement a user level responsible for sending queries, either through a universal query language to the archigraph DBMS or through a user interface that allows visualization of the query results, and a data loading system responsible for extracting and loading the necessary data into the system.

When implementing the analytical query processing layer, it may be necessary to allocate additional adapter services to coordinate the external interface of the data representation layer and the data request formats of the query handlers.

In order to work with the entire set of query handlers, it is planned to develop a single language for addressing the archigraph DBMS with internal sections for a specific data handler.

This architecture assumes that data is stored exclusively in the data storage layer; all other layers should not contain any data storage, except for caching to speed up operations. This approach establishes a single source of truth and avoids data duplication, which is an important cost-saving feature when storing data for a huge enterprise.

VII. STORING A UNIVERSAL DATA MODEL IN A METAGRAPH DBMS

In accordance with the previously described architecture, it is proposed to store data for an archigraph in a metagraph. Let us describe a variant of storing the previously described archigraph classes in the metagraph, except for the classes that are already elements of the metagraph: "metavertices", "vertices", "edges", "metaedges", "attributes".

Elements of the class "tables" can be represented by ordinary metagraph vertices, with information about the table name, field names and types, keys, and table data specified in the attributes. Links between tables can be implemented through relational relations with the help of primary and foreign keys or through a metagraph edge linking tables to each other. An example of table representation is shown in Fig. 5a.

Elements of the class "documents" can be represented by ordinary metagraph vertices, with information about the document name and document data specified in the attributes. An example of document representation is shown in Fig. 5b.

Elements of the class "indexes" can also be represented as vertices of a metagraph, with information about the index name and index data specified in the attributes. An example of index representation is shown in Fig. 5c.

Elements of the class "multidimensional cubes" can be represented as a metavertex containing a set of technical vertices describing the cube axes. The technical vertices of cube axes will contain in their attributes the axis name, data with axis values and data with the hierarchy description, and the cube metavertex will contain in its attributes the cube name, data with measures and a description of the fields in measures. An example representation of a cube with 3 axes is shown in Fig. 5d.
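
The storage variant described above can be sketched with plain Python dictionaries. This is a hypothetical illustration, not the format of any particular metagraph DBMS: the helper names `vertex` and `metavertex`, the attribute keys, the sample values, and the type label "cube axis" are assumptions. The `type` attribute follows the paper's convention of tagging each element with the archigraph class it represents.

```python
from typing import Any, Dict, List

Element = Dict[str, Any]


def vertex(**attrs: Any) -> Element:
    """An ordinary metagraph vertex: everything lives in its attributes."""
    return {"attrs": dict(attrs)}


def metavertex(fragment: List[Element], **attrs: Any) -> Element:
    """A metavertex: attributes plus a fragment of nested elements."""
    return {"attrs": dict(attrs), "fragment": list(fragment)}


# Table: name, field names and types, keys and the table data as attributes.
table = vertex(
    type="table",
    name="authors",
    fields={"id": "int", "full_name": "text"},
    keys=["id"],
    data=[(1, "A. Author")],
)

# Document and index: name and data as attributes.
document = vertex(type="document", name="paper_1.txt", data="full text ...")
index = vertex(type="index", name="publication_texts",
               data={"lake": ["paper_1.txt"]})

# Cube with 3 axes: a metavertex holding one technical vertex per axis.
axes = [
    vertex(type="cube axis", name="year", values=[2019, 2020], hierarchy=None),
    vertex(type="cube axis", name="kind", values=["journal", "patent"],
           hierarchy=None),
    vertex(type="cube axis", name="field", values=["physics"], hierarchy=None),
]
cube = metavertex(
    axes,
    type="multidimensional cube",
    name="publication_counts",
    measures={(2019, "journal", "physics"): 12},
    measure_fields={"count": "int"},
)


def of_type(elements: List[Element], type_name: str) -> List[Element]:
    """Select elements by the archigraph class stored in their attributes."""
    return [e for e in elements if e["attrs"].get("type") == type_name]


lake = [table, document, index, cube]
print([e["attrs"]["name"] for e in of_type(lake, "table")])
```

Keeping the table and cube data inside attributes, rather than decomposing it into separate graph elements, mirrors the paper's choice to keep such data readable as a single block.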
An alternative variant of representing a multidimensional
cube in a metagraph was considered in [30], [31]. In these papers
(a) tables (b) documents (c) indexes
proposed to describe all elements of a multidimensional cube
through elements of a metagraph. This approach allows to sup-
port complexly structured data as cube measures, as well as to
describe hierarchies of cube axes with the help of a model aimed
at hierarchical data.
But the description of all cube elements through metagraph
elements will lead to the separation of measures and values of
axes into remote from each other blocks of physical memory.
This feature complicates the integration of the system with ex-
isting analytical systems aimed at working with ROLAP,
HOLAP, MOLAP technologies [32], which are designed for
storing cube data in single blocks in physical memory. Also be-
cause of this feature, searching and reading the necessary data (d) multidimensional cubes
from the metagraph will require more time and resources than
searching and reading the data stored as a single block in physi- Fig. 5. Variant of representation of archigraph classes in metagraph
cal memory.
corresponding independent DBMS. A variable number of meas-
In the result, it was decided not to follow the alternative op- urements will be taken for each degree of complexity. Since we
tion because of the performance degradation compared to the are working with DBMSs and not real-time systems, the values
option that was described, as well as the need to implement ad- obtained may vary significantly. To deal with such data, we will
ditional functionality to integrate with existing analytical sys- consider the resulting value as a random variable.
tems.
Then, to determine a representative number of measure-
In order to distinguish elements of archigraph classes stored ments, we will iteratively perform measurements and use Stu-
in the similar metagraph elements, the attribute ”type” is intro- dent's t-criterion (1) as a criterion for stopping them.
duced for each element, indicating the type of the corresponding
element. In Fig. 5a, Fig. 5b, Fig. 5c and Fig. 5d, all elements
have this attribute specified.
Therefore, a metagraph DBMS can be used to store universal
data model data. This way also allows within a metagraph to link
different views with each other or with any metadata about them.
VIII. THE SECOND EXPERIMENT
Earlier was described a variant of building a data lake man-
agement system based on a universal data model. Now it is nec-
essary to implement the proposed architecture and conduct an
experiment evaluating the performance and scalability of the
system for creating and maintaining data lakes compared to a
multi-paradigm data lake based on 4 independent DBMS: graph,
relational, multidimensional and search. It is envisioned to use
as such systems: Neo4j [33] for graph representation, Pentaho Describe this process in more detail. Let ni be the number of
ecosystem [34] for multidimensional, PostgreSQL [35] for rela- elements at the i-th iteration. Then at each iteration step we will
tional and ElasticSearch [36] for search queries. test the hypothesis of equality of the sample mean obtained at
the i-th iteration and (i – 1)-th iteration. We will set the signifi-
Microsoft Academic Knowledge Graph (MAKG) [15], [16] cance level equal to 0.05. And the number of degrees of freedom
will be used as the data for the experiment. This data will be will be, respectively, ni + ni – 1 – 2. Then we will compare the
divided into three equal parts for representation in graph, rela- calculated value of the t-criterion according to formula (1) with
tional and multidimensional formats. Publication texts from the tabular value obtained on the basis of the number of freedom
MAKG will be loaded and indexed into the data lake. In parallel, degrees and significance level. If the obtained value is less than
each part of the data will be loaded into the corresponding inde- the tabulated value of the criterion, it means that there are strong
pendent DBMS and into the corresponding section of the lake fluctuations in the data and additional measurement is necessary
based on the universal data representation. (next iteration), otherwise we finish the measurements.
To evaluate performance, pairwise testing of processing speed will be performed on queries of five degrees of complexity in each particular data model in the archigraph data lake and in the corresponding independent DBMS.
Also, each measurement will be performed on a "warmed-up" system: the query will first be run two or three times without recording the execution time, so that the system has time to acquire all the necessary resources. After that, the query processing time is measured.
Using this approach, performance evaluations of the particular data models in the archigraph data lake and in the corresponding independent DBMSs will be obtained.
Further, the obtained average performance values of the two mentioned data management systems will be compared. The statistical significance of the results will also be evaluated by testing the statistical hypothesis of inequality of the obtained averages based on the same Student's t-criterion (1).
Scalability will be evaluated on five different data volumes. During testing, new data will be added to the system and the query execution time will be measured for the new volume. The query execution time will be measured in the same way as in the performance evaluation. The nature of the resulting graphs will allow extrapolating performance indicators for each of the particular models with a further increase in the volume of such data in the lake.
The work in progress comprises 10 directions for developing the data lake support system and conducting the proposed experiment:
• Implementation of a metagraph DBMS for the data storage layer in a data lake using a universal data model.
• Creation of a multi-paradigm data lake based on independent data storage and management systems.
• Design of the API and navigation language of the archigraph DBMS implementing the universal data model.
• Integration of SQL, MDX and MetaCypher query handlers into the archigraph DBMS.
• Integration of text-based search query handlers into the archigraph DBMS.
• Design of archigraph DBMS software components for connecting SQL, MDX, MetaCypher and search query handlers.
• Design of a user-level system for data lake support using a universal data model.
• Evaluation of the scalability of the proposed variants of data lake construction.
• Performance testing of the proposed variants of data lake construction.
• Design and implementation of MetaCypher, a declarative query language for the archigraph DBMS.
The overwhelming majority of those carrying out the listed work are first-year master's students. Completion of the work in the stated directions is planned by the middle of 2025, when they will defend their diploma projects. As a result, it is planned to have by this time a complete system for creating and maintaining a data lake based on a universal data model, as well as the results of the described experiment. Completing all tasks is expected to take about 3000 man-days.
IX. CONCLUSION
At the moment, the proposed architecture of the data lake maintenance system and the universal data model have not been implemented. It is planned to implement them and carry out an experiment proving their viability. Positive results will show that instead of using several different DBMSs at one enterprise, each of which uses its own data model, it is possible to combine all the necessary models into one and organize the data lake on the basis of the unified model.
If acceptable performance, scalability and availability metrics are obtained, it will be possible to:
• initiate pilot projects to utilize the universal data model in efficiently used data lakes;
• further develop the universal data model by varying or adapting the data types included in it.
The transition to the universal data model will simplify the creation and maintenance of multi-paradigm data lakes and will lead to the merging of separate independent lakes and databases into large unified lakes, the typology and evolution of which are discussed in [7]. This, in turn, will increase the need and create prerequisites for raising the level of data lake intelligence: the emergence in them of mechanisms of adaptive self-regulation in accordance with the established KPIs, independent search for and replenishment of data, and the possibility of intelligent data retrieval by declarative user requests [7].
REFERENCES
[1] P. Pasupuleti and B. S. Purra, Data lake development with big data, UK, Birmingham: Packt Publishing Ltd, 2015.
[2] N. Miloslavskaya and A. Tolstoy, “Big data, fast data and data lake concepts,” Procedia Computer Science, 2016, vol. 88, pp. 300-305, doi: 10.1016/j.procs.2016.07.439.
[3] M. Armbrust, A. Ghodsi, R. Xin, and M. Zaharia, “Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics,” Proceedings of 11th Annual Conference on Innovative Data Systems Research (CIDR ’21), 2021, [online] Available: http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf.
[4] R. Hai, S. Geisler, and C. Quix, “Constance: An intelligent data lake system,” SIGMOD ’16: Proceedings of the 2016 International Conference on Management of Data, 2016, pp. 2097-2100, doi: 10.1145/2882903.2899389.
[5] A. Beheshti, B. Benatallah, Q. Z. Sheng, and F. Schiliro, “Intelligent knowledge lakes: The age of artificial intelligence and big data,” Web Information Systems Engineering, 2020, pp. 24-34, doi: 10.1007/978-981-15-3281-8_3.
[6] A. Gorelik, The Enterprise Big Data Lake, USA, CA, Sebastopol: O’Reilly Media, Inc., 2019.
[7] A. A. Sukhobokov, Y. E. Gapanyuk, A. S. Zenger, and A. K. Tsvetkova, “The concept of an intelligent data lake management system: machine consciousness and a universal data model,” Procedia Computer Science, 2022, vol. 213, pp. 407-414, doi: 10.1016/j.procs.2022.11.085.
[8] Data Lake Market, SNS Insider. Report Code: SNS/ICT/1541. June 2022. 125 p, [online] Available: https://www.snsinsider.com/reports/data-lake-market-1541#.
[9] P. N. Sawadogo and J. Darmont, “On data lake architectures and metadata management,” Journal of Intelligent Information Systems, 2021, vol. 56, no. 1, pp. 97-120, doi: 10.1007/s10844-020-00608-7.
[10] R. Hai, C. Koutras, C. Quix, and M. Jarke, “Data Lakes: A Survey of Functions and Systems,” IEEE Transactions on Knowledge and Data Engineering (Early Access), 2023, pp. 1-20, doi: 10.1109/TKDE.2023.3270101.
[11] SAP SE, IOT100. Internet of Things Fundamentals, 2017, Course Version: 10, Material Number: 50139413.
[12] Y. Fu, X. Qiu, and J. Wang, “F2MC: Enhancing data storage services with fog-toMultiCloud hybrid computing,” IEEE 38th International Performance Computing and Communications Conference (IPCCC), 2019, pp. 1-6, doi: 10.1109/IPCCC47392.2019.8958748.
[13] M. Bergui, S. Najah, and N. S. Nikolov, “A survey on bandwidth-aware geo-distributed frameworks for big-data analytics,” Journal of Big Data, 2021, vol. 8, article no. 40, doi: 10.1186/s40537-021-00427-9.
[14] A. A. Sukhobokov, R. A. Afanasev, A. G. Balabas, A. A. Vetoshkin, A. S. Zenger, S. A. Konovalikova, M. A. Kucherenko, A. P. Larionova, A. R. Mironova, S. V. Ocheretnaya, and A. D. Rybina, “The first stage of the experiment to evaluate the performance of multi-paradigm data lakes,” Journal of Natural and Technical sciences, 2023, no. 7 (182), pp. 124-133, doi: 10.25633/ETN.2023.07.08.
[15] M. Färber, “The Microsoft Academic Knowledge Graph: A Linked Data Source with 8 Billion Triples of Scholarly Data,” Proceedings of the 18th International Semantic Web Conference (ISWC’19), 2019, pp. 113-129, doi: 10.5281/zenodo.3936556.
[16] Microsoft Academic Knowledge Graph (MAKG), [online] Available: https://makg.org.
[17] A. S. Zenger, A. K. Tsvetkova, Y. E. Gapanyuk, and A. A. Sukhobokov, “Description of the data lake metagraph and development of an algorithm for searching the way to the top,” Artificial Intelligence in Management, Control, and Data Processing Systems. Proceedings of the All-Russian Scientific Conference IIASU’22, vol. 1, pp. 352-358, Moscow, BMSTU Publishing, 2022.
[18] A. Basu and R. W. Blanning, Metagraphs and their applications, USA, New York: Springer New York, 2007, doi: 10.1007/978-0-387-37234-1.
[19] L. S. Globa, M. Y. Ternovoy, and O. S. Shtogrina, “Metagraph based representation and processing of the fuzzy knowledge bases,” Open Semantic Technologies for Intelligent Systems, 2015, vol. 5, pp. 237-240.
[20] S. V. Astanin, N. V. Dragnish, and N. K. Zhukovskaya, “Nested metagraphs as models of complex objects,” Engineering journal of Don, 2012, no. 4-2, pp. 76-80.
[21] E. N. Samokhvalov, G. I. Revunkov, and Y. E. Gapanyuk, “Metagraphs for Information Systems Semantics and Pragmatics Definition,” Herald of the Bauman Moscow State Technical University. Series Instrument Engineering, 2015, no. 1 (100), pp. 83-99, doi: 10.18698/0236-3933-2015-1-83-99.
[22] V. B. Tarassov and Y. E. Gapanyuk, “Complex Graphs in the Modeling of Multi-agent Systems: From Goal-Resource Networks to Fuzzy Metagraphs,” Russian Conference on Artificial Intelligence, S. O. Kuznetsov, A. I. Panov, and K. S. Yakovlev, Eds., 2020, pp. 177-198, doi: 10.1007/978-3-030-59535-7_13.
[23] S. V. Kruchinin, “On some generalizations of graphs: multigraphs, hypergraphs, metagraphs, flow and port graphs, protographs, archigraphs,” Science issues, 2017, no. 3, pp. 48-67.
[24] S. V. Kruchinin, “Protographs and Archigraphs as a Graphs Generalization,” Journal of Scientific Research Publications, 2017, no. 3 (41), pp. 23-33.
[25] A. A. Sukhobokov, “Metagraph-tabular data model for asset management systems,” Artificial Intelligence in Management, Control, and Data Processing Systems. Proceedings of the All-Russian Scientific Conference IIASU’22, vol. 1, pp. 93-99, Moscow, BMSTU Publishing, 2022.
[26] I. A. Erokhin, N. S. Grunin, A. V. Molchanov, E. A. Belousov, and Y. E. Gapanyuk, “Method of storing metagraph data model in postgresql DBMS,” Artificial Intelligence in Management, Control, and Data Processing Systems. Proceedings of the All-Russian Scientific Conference IIASU’22, vol. 2, pp. 348-351, Moscow, BMSTU Publishing, 2022.
[27] V. M. Chernenkiy, Y. E. Gapanyuk, Y. T. Kaganov, I. V. Dunin, M. A. Lyaskovsky, and V. Larionov, “Storing Metagraph Model in Relational, Document-Oriented, and Graph Databases,” Selected Papers of the XX International Conference on Data Analytics and Management in Data Intensive Domains DAMDID/RCDL 2018, CEUR WORKSHOP PROCEEDINGS, L. Kalinichenko, Y. Manolopoulos, S. Stupnikov, N. Skvortsov, V. Sukhomlin, Eds., 2018, vol. 2277, pp. 82-89, [online] Available: https://ceur-ws.org/Vol-2277/paper17.pdf.
[28] A. A. Sukhobokov, V. A. Trufanov, Y. A. Stolyarov, M. R. Sadykov, and O. O. Elizarov, “Distributed meta graph DBMS based on Blockchain technology,” Natural and technical sciences, 2021, no. 7, pp. 201-209, doi: 10.25633/ETN.2021.07.15.
[29] M. Massri, P. Raipin, and P. Meye, “GDBAlive: A Temporal Graph Database Built on Top of a Columnar Data Store,” Journal of Advances in Information Technology, August 2021, vol. 12, no. 3, pp. 169-178, doi: 10.12720/jait.12.3.169-178.
[30] Y. E. Gapanyuk, “The main provisions of the multidimensional metagraph model of data and knowledge Integrated models and Soft computing,” Artificial Intelligence IMSC-2022: Proceedings of the XI International Scientific and Practical Conference. In 2 volumes, 2022, vol. 2, pp. 28-38.
[31] V. M. Chernenkiy, Y. E. Gapanyuk, A. N. Nardid, A. V. Gushcha, and Y. S. Fedorenko, “The Hybrid Multidimensional-Ontological Data Model Based on Metagraph Approach,” Perspectives of System Informatics. PSI 2017. Lecture Notes in Computer Science, A. Petrenko and A. Voronkov, Eds., 2018, vol. 10742, pp. 72-87, doi: 10.1007/978-3-319-74313-4_6.
[32] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. 3rd ed., Elsevier, Morgan Kaufmann, 2012.
[33] Neo4j Graph Database & Analytics - Graph Database Management System, [online] Available: https://neo4j.com.
[34] Pentaho, [online] Available: https://github.com/pentaho.
[35] PostgreSQL: The world’s most advanced open-source database, [online] Available: https://www.postgresql.org.
[36] Elasticsearch: The Official Distributed Search & Analytics Engine - Elastic, [online] Available: https://www.elastic.co/elasticsearch.
