Information Systems in Management (2012) Vol. 1 (1) 25-37
JERZY DUDA
Department of Applied Computer Science, Faculty of Management, AGH University of Science and Technology
NoSQL databases are becoming more and more popular, and not only in typical Internet applications. They make it possible to store large volumes of data (so-called big data) while ensuring fast retrieval and fast appending. The main disadvantage of NoSQL databases is that they do not use the relational data model and usually do not offer any declarative query language similar to SQL. This raises the question of how NoSQL databases can be used for OLAP processing and other Business Intelligence tasks. In the paper the author presents the most common types of NoSQL databases, describes the MapReduce paradigm, and discusses models of OLAP processing for such databases. Finally, some preliminary results on aggregation performance in a non-relational environment are presented.
1. Introduction
concepts in this field. Finally, the author presents some results of experiments with aggregation queries typical of OLAP and discusses possible ways of improving them.
The term “NoSQL” was first used by C. Strozzi in 1998 for his database system that did not provide an SQL interface. According to his definition, “NoSQL is a fast, portable, relational database management system without arbitrary limits, (other than memory and processor speed) that runs under, and interacts with, the UNIX Operating System” [1]. Despite not utilizing the SQL language, this system was in fact a relational database system, but data were stored in ASCII files and could be manipulated with regular Unix tools instead of SQL. The “NoSQL” term was rediscovered eleven years later by, among others, E. Evans [2], who used it in his blog while discussing open source databases of types other than relational. In this sense NoSQL databases can be seen as “non-relational, distributed, open-source and horizontally scalable” [3]. Thus the original sense of the “NoSQL” term, meaning a relational database system without the SQL language, has changed to a non-relational database system that is not like a typical SQL database, so the term should now be read as “Not only SQL”.
There is no single model of a NoSQL database. In fact, any database system that satisfies the conditions presented above (at least most of them, primarily being non-relational) can be classified into the NoSQL family. The nosql-database.org portal lists the following categories of NoSQL databases [3]:
- column store,
- document store,
- key-value store,
- graph databases,
- object oriented databases,
- XML databases,
- others like multimodel, multidimensional, and multivalue databases.
However, many sources regard only the first four as true NoSQL databases, as they are based on new concepts and are the most popular ones. In this paper the author will focus mainly on the column store and document store models, as the databases utilizing these models are the most important from a business point of view and the most widely used in various applications. Finally, the MapReduce programming model for effective processing of large databases will be discussed.
A column-oriented database (or simply a column store) writes each table attribute in a separate column. For example, a simple table of customer tuples presented in Fig. 1 could be written in the format presented in Fig. 2.
ID: 1, 2, 3, 4
Name: A Glass Enterprise, A-1 Mobile, Abitz, Agnew Lumber Co.
Address: 212 S Tower, 389 Raubuck Rd, W 181 Fredson Rd, PO Box 579
City: Centralia, Winlock, Shelton, Centralia
Phone: 1 360-736-1141, 1 360-262-0216, 360-426-1928, 360-736-8211
Figure 2. A simple column store for customers’ data
The idea of column-oriented data storage is not new and dates back to the 1970s, when the research on transposed files and vertical partitioning was first published [4]. The first widely used system that utilized such a model was the RAPID system built for Statistics Canada in 1976. A decade later the advantage of the decomposed storage model (DSM) over row-wise storage (NSM) was shown [5]. Nevertheless, for many years the only commercially available column-oriented database system was Sybase IQ. The situation has changed in the last few years, as many open source as well as commercial products have been released. The most notable systems that are able to store data in columns (many of them are also capable of storing data in rows) include:
- Apache Cassandra - initially developed at Facebook,
- C-Store - the first comprehensive design of a column store, described by researchers from a few American universities, including MIT,
- HBase - running on top of Hadoop; currently used for the Facebook messaging system,
- Google Bigtable - data are indexed by triples <row, column, timestamp>, but tables are stored on the basis of column families.
The main advantages of column store systems over a traditional RDBMS include first of all [6]:
- improved bandwidth utilization, as only the columns holding the attributes required by a query need to be accessed,
- improved data compression, as attribute values from the same domain compress better than values of mixed types,
- improved code pipelining (instructions per cycle), as data can be accessed directly in an iterative way, not through a tuple-based interface,
- improved cache utilization, as caching only the chosen attribute columns requires less space than caching whole tuples.
The main disadvantage of column stores is the cost of tuple reconstruction: the more attributes a query involves, the slower the answer. This, however, can be mitigated by using an in-memory cache or by utilizing SSD drives in database servers. It is also worth noticing that OLAP processing and other Business Intelligence methods usually require only a limited number of attributes at a time.
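The compression advantage listed above can be illustrated with simple run-length encoding (a toy sketch, not the codec of any particular product): values from a single domain form long runs once sorted, which a row-wise mix of ids, names and phone numbers never would.

```python
# Run-length encoding of a single column: each run of equal values
# is replaced by a [value, count] pair.

def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

# A sorted city column compresses to three runs instead of five values.
city_column = ["Centralia", "Centralia", "Centralia", "Shelton", "Winlock"]
encoded = rle_encode(city_column)
```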
{ "id": 1,
  "name": "A Glass Enterprise",
  "address": { "street": "212 S Tower", "city": "Centralia" },
  "phone": "1 360-736-1141" }
Figure 3. JSON record containing customer data. Author’s own preparation
In this case the data are stored in tuples, like in a relational database, but in a less rigid format. As values are always stored together with their keys, the structure of rows can change from one row to the next. Such a “schema-free” model makes it easy to adjust a database to the changing information needs of business analytics.
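This flexibility can be sketched in plain Python, with a list standing in for a document collection (an illustration of the idea only, not the API of any particular document store):

```python
# A "collection" of schema-free documents: successive documents may
# carry different fields, so no schema migration is needed when the
# analytics require a new attribute.

customers = []
customers.append({"id": 1, "name": "A Glass Enterprise",
                  "address": {"street": "212 S Tower", "city": "Centralia"}})
# A later document simply adds a field ("segment") the earlier ones lack:
customers.append({"id": 2, "name": "A-1 Mobile", "segment": "retail"})

# Queries over such a collection must tolerate missing keys:
retail = [c for c in customers if c.get("segment") == "retail"]
```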
The other advantages of document stores include:
- rich data structures - a document in a store can represent objects or arrays, so no ORM (object-relational mapping) is required,
- better performance - data are usually stored in a single table or joined in memory, so traditional, time-consuming JOIN operations are unnecessary and usually not supported,
- horizontal scalability - a document store works similarly to a distributed hash table, and it can be scaled easily to a large number of nodes.
The main disadvantage of document stores is that they usually do not provide any declarative language for data manipulation (only OrientDB offers SQL, but without support for JOIN clauses). Data processing requires the use of a procedural language, so programming skills are needed in order to process data collected in such databases. On the other hand, this makes it possible to write more complex queries that operate on a single row (like cursors in an RDBMS). This ability can be applied in Business Intelligence tools for developing new kinds of analyses or improving existing ones.
way. Business analysts can gain almost immediate access to large volumes of very accurate data without the need to use traditional, time-consuming ETL processes. In this sense such platforms (together with a database like HBase or any other NoSQL database that supports Hadoop) can be an alternative, or at least an important addition, to the traditional data warehouses used in today’s Business Intelligence systems.
(usually an RDBMS), and, if necessary, from any other source like spreadsheet files or flat files. Data in the warehouse are organized in a special form that will be presented later in the paper. Once the data warehouse is loaded and updated, business analysts can perform OLAP analyses, use data mining tools, or simply build reports tailored to their needs. ETL tools, the data warehouse, and analytical and reporting tools together constitute a BI platform (often called a BI suite).
plications than ROLAP [14], but it requires pre-processing (a load from the OLTP database or the data warehouse) that can take a long time. Load times in ROLAP are usually much shorter than in MOLAP and the server is more scalable; however, ROLAP is evidently slower, especially with respect to aggregated data. To overcome this problem, aggregations are stored in the RDBMS as materialized views or as dedicated tables. This in turn requires a careful design of the ETL process, because aggregate operations that were not previously cached can be very time-consuming. The advantages of both the ROLAP and MOLAP models have been combined in the HOLAP (Hybrid OLAP) model.
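The pre-aggregation idea behind such materialized views can be sketched as follows (a toy example with hypothetical city/year facts; real systems materialize aggregates inside the RDBMS, not in application code):

```python
# A group-by over the fact table is computed once, at load time, and
# later queries are answered from the cached result instead of
# scanning the fact table again.

facts = [
    ("Centralia", 2011, 100.0),
    ("Centralia", 2012, 150.0),
    ("Winlock",   2012,  40.0),
]

def build_aggregate(facts):
    totals = {}
    for city, year, amount in facts:
        totals[(city, year)] = totals.get((city, year), 0.0) + amount
    return totals

materialized = build_aggregate(facts)       # done once, during ETL
answer = materialized[("Centralia", 2012)]  # answered without a scan
```

The trade-off described above is visible even here: the lookup is instant, but `build_aggregate` must be refreshed whenever the facts change, which is exactly why the ETL process needs careful design.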
According to some business analysts, current Business Intelligence platforms and data warehouses do not cover all the data necessary for decision making in today’s complex economic environment. Big volumes of data (so-called “big data”) cannot be processed by traditional warehouses and OLAP servers that are based on RDBMS solutions. Instead, solutions originating from the NoSQL movement, like MapReduce (Hadoop), Hive, Pig or Jaql, should be applied.
[Figure: Architecture of the Mondrian BI platform - the Mondrian OLAP engine communicates with the warehouse via SQL and with the JPivot analytical tools (on a web server) via MDX; cube schemas are stored in XML files]
The Mondrian OLAP server stores cube schemas in XML files that can be modeled with client tools like Pentaho Schema Workbench, and its engine is based on the ROLAP model. This engine can communicate only in the SQL language, so data from NoSQL databases cannot be processed directly. First they need to be loaded into a data warehouse based on an RDBMS, like in traditional BI platforms. There are plans, expressed by some Mondrian community members, to give users a possibility to replace the SQL database in Mondrian with HBase or MongoDB.
Basically the same architecture as in Mondrian is used in much more lightweight solutions, e.g. an OLAP framework written in Python (Cubes) [16]. The architecture of such a lightweight server is shown in Fig. 7.
[Figure 7: Architecture of the Cubes lightweight OLAP server - an application requests a cube cell from the Slicer server, which computes it with an SQL aggregation browser over the database]
A client sends a request for some cell of the cube to the server (called in this case a slicer), the server uses an appropriate aggregation browser backend to compute the request, and the result is sent back to the client in JSON format. Although many different aggregation browsers were originally planned, including a MongoDB browser, they have not been implemented yet. The only aggregation browser currently available is “SQL denormalized”, which stores the data from a converted star schema in the form of denormalized views and tables.
Thus, the idea is to use the MapReduce paradigm not only on the OLTP side, but also for OLAP processing. Map and reduce functions can be programmed relatively easily using the HQL language provided by Hive or the Pig Latin language.
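The MapReduce flow for an aggregation query can be sketched in pure Python (a single-process illustration of the paradigm itself, not of Hadoop's Java API; the per-city sales example is hypothetical):

```python
from collections import defaultdict

# Map emits (key, value) pairs, the framework groups them by key
# (the "shuffle"), and reduce folds each group into a single result.

def map_fn(record):
    # emit (city, amount) - i.e. total sales per city
    yield record["city"], record["amount"]

def reduce_fn(key, values):
    return key, sum(values)

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:                # map phase
        for key, value in map_fn(record):
            groups[key].append(value)     # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce

sales = [{"city": "Centralia", "amount": 10},
         {"city": "Winlock",   "amount": 5},
         {"city": "Centralia", "amount": 7}]
totals = map_reduce(sales, map_fn, reduce_fn)
```

In a real Hadoop cluster the map and reduce phases run in parallel on many nodes, and tools like Hive or Pig generate such functions from a higher-level query.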
Recently, new projects focusing strictly on OLAP processing, like e.g. olap4cloud [17], have also emerged. Olap4cloud is based on HBase and Hadoop, but besides MapReduce it uses data defragmentation, intensive indexing and pre-aggregations. The developers of olap4cloud performed a test in which an aggregation query was executed on 300 million rows with 3 randomly generated dimensions, with values in the [0..10], [0..100] and [0..10000] ranges, and 3 measures, with values in [0..100], [0..1000] and [0..10000] [17]. The goal of the query was to sum a few facts across one dimension while filtering them on the basis of the other two dimensions. The query used in the tests is shown in Fig. 8.
The test was run on a few machines with an average processor speed of 2 GHz and an average RAM size of 1.7 GB. Unfortunately, the number of nodes was not specified by the authors. Nevertheless, the obtained results are interesting. On the Hive platform the query executed in 23 minutes, while for olap4cloud it took only 26 seconds when no pre-aggregations had been used, and an impressive 5 seconds with pre-aggregations [17].
To compare those results with a traditional RDBMS, the author imported the same 300 million rows into an MS SQL Server 2008 instance run on a single machine with 8 GB of RAM and a 2.0 GHz quad-core processor. Although the import process itself took 30 minutes (it was done with SQL Server Integration Services, the ETL tool for MS SQL Server), the query executed in 135 seconds for the first run and only 12 seconds for the subsequent runs (SQL Server cached the data). This setup can simulate a simple ROLAP.
Additionally, the author has also tested the new aggregation framework introduced in MongoDB 2.2, which does not require the map and reduce functions to be written explicitly. Fig. 9 presents the code for the MongoDB expression that is equivalent to the query presented in Fig. 8.
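For orientation, a pipeline of the general shape used in this kind of query might look as follows. This is an illustrative sketch only, not the exact code of Fig. 9: the field names (d1, d2, d3 for dimensions, m1 for a measure) and the filter thresholds are assumptions following the test data description above. The pipeline is built as the Python structure a pymongo client would pass to collection.aggregate(); here it is applied to in-memory documents by a tiny interpreter so the example is self-contained.

```python
# $match filters on two dimensions, $group sums a measure across d1.
pipeline = [
    {"$match": {"d2": {"$lt": 50}, "d3": {"$lt": 5000}}},
    {"$group": {"_id": "$d1", "total": {"$sum": "$m1"}}},
]

def run_in_memory(docs, pipeline):
    """Minimal stand-in for the server: handles only the two stages
    above, with $lt as the sole match operator (an assumption)."""
    match, group = pipeline[0]["$match"], pipeline[1]["$group"]
    docs = [d for d in docs
            if all(d[f] < cond["$lt"] for f, cond in match.items())]
    key_field = group["_id"].lstrip("$")
    sum_field = group["total"]["$sum"].lstrip("$")
    totals = {}
    for d in docs:
        totals[d[key_field]] = totals.get(d[key_field], 0) + d[sum_field]
    return totals

docs = [{"d1": 1, "d2": 10, "d3": 100, "m1": 5},
        {"d1": 1, "d2": 99, "d3": 100, "m1": 7},   # filtered out by d2
        {"d1": 2, "d2": 20, "d3": 200, "m1": 3}]
result = run_in_memory(docs, pipeline)
```

Against a real server the same `pipeline` list would simply be passed to `aggregate()`, which is what makes the framework attractive compared with hand-written map and reduce functions.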
The query initially took 10 minutes to complete, but after indexes for “d1” and “d2” were created, the query time was reduced to only 23 seconds. In the latter case it was only about two times slower than MS SQL Server. It can be further improved by using the MongoDB port for Hadoop to speed up the map and reduce functions, which are then performed on many nodes.
4. Conclusions
REFERENCES
[4] Turner M.J., Hammond R., Cotton P. (1979) A DBMS for large statistical databases, VLDB '79: Proceedings of the 5th International Conference on Very Large Data Bases.
[5] Abadi D.J., Boncz P.A., Harizopoulos S. (2009) Column-oriented Database Systems, PVLDB 2(2), 1664-1665.
[6] Bajaj P., Dhindsa S.K. (2012) A Comparative Study of Database Systems, International Journal of Engineering and Innovative Technology, Volume 1, Issue 6.
[7] Dean J., Ghemawat S. (2008) MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, Volume 51, Issue 1, 107-113.
[8] Duarte de Souza R.G. (2010) MapReduce "Easy distributed computing", http://girlincomputerscience.blogspot.com/2010/12/mapreduce.html, Retrieved 8 September 2012.
[9] Luhn H.P. (1958) A Business Intelligence System, IBM Journal 2(4), 314-319.
[10] Chee T., Chan L., Chuah M., Tan Ch., Wong S., Yeoh W. (2009) Business Intelligence Systems: State-Of-The-Art Review And Contemporary Applications, Symposium on Progress in Information & Communication Technology 2009.
[11] Turban E., Sharda R., Delen D., King D. (2010) Business Intelligence, 2nd edition, Prentice Hall.
[12] http://etl-tools.info/en/bi/datawarehouse_concepts.htm, Retrieved 10 September 2012.
[13] Chaudhuri S., Dayal U. (1997) An Overview of Data Warehousing and OLAP Technology, SIGMOD Record 26(1), 65-74.
[14] Bach Pedersen T., Jensen C. (2001) Multidimensional Database Technology, Distributed Systems Online (IEEE), 40-46.
[15] Cubes OLAP avec Mondrian, http://www.osbi.fr, Retrieved 11 September 2012.
[16] Cubes - OLAP Framework, http://packages.python.org/cubes, Retrieved 11 September 2012.
[17] olap4cloud User Guide, http://code.google.com/p/olap4cloud/wiki/UserGuide, Retrieved 12 September 2012.