NoSQL Database Topics
This page should be general concepts not specifics of a single NoSQL technology.
Originally "No SQL" was used in the sense of "database technologies that are nonrelational, and/or do not offer an SQL compatible or SQL-like query language", that
were either Column oriented, Key-Value Store oriented, or Document oriented KeyValue Stores. With Couchbase 4.0 set to add an SQL-like language and query service
to its technology, while retaining its document oriented non-relational, and nonschema-enforced design, some are suggesting the term "noSQL" be understood as
an acronym "Not Only SQL".
An introductory whitepaper is attached to this page as a PDF: Why No SQL?
A seminal presentation on Polyglot Persistence, by Martin Fowler, considered a key
NoSQL luminary:
http://martinfowler.com/articles/nosql-intro-original.pdf
Some Ramsoft notes:
The fundamental approach in a Relational database is to give schema
enforcement job, and the determination of your query plans, as jobs
belonging to the database engine. A worklist feature is engineered via
relational models. While users might tune their SQL for performance after
looking at query plans, it's more common in the NoSQL world for developers
to engineer their systems via distributed techniques like Map and Reduce. It's
a fundamental mental shift.
In a NoSQL document database or key-value store, or big table database, the
system is schemaless. Or rather your application still needs a schema but
that schema exists only as a perceived schema, enforced only by whatever
code you build to enforce schema, within your web-services layers.
The system is designed from the ground up to be scalable, and to be
distributed.
Memory based caching is commonly used, making a NoSQL database very
fast for certain kinds of data fetches, even when a lot of rows are returned.
Clustering, sharding, replication, and other scalability and uptime enhancing
features are primary design drivers, instead of features you bolt on
afterwards.
Working with a cloud based architecture requires a smarter client. Clients for
NoSQL are often much smarter than conventional SQL clients.
Working with a NoSQL database.
o
In a key-value store system, there is a single key used to access any
object. The normal and lowest cost way to access any object is by
knowing its key, which is its object unique identifier.
In a document database, it is possible to very quickly find a particular
entity, even without any query engine, by walking a table, and because
this walk is all done without disk access, it can be much faster than in
a conventional SQL database. Sometimes, such as when you need full
text search on a document, or search on a wide set of fields, the end
results are dramatically faster than any SQL RDBMS.
Some NoSQL databases may have an expression language or even an
SQL-like mini language to make writing queries easier. It might be the
wrong approach, or it might be the right approach, to continue to be
"query-oriented" in our development. We need to establish rules for
when to use queries and when to use other techniques. Just because
queries were your only tool in the SQL world, don't let your thinking be
constrained. When all you have is a hammer, everything looks like a
nail.
Because of the schemaless nature of the system, everything in the
database is effectively optional, over time. This needs to be reflected
in our webservice design by having schema validation occur as part of
our webservice. For example, if a user provides a value equivalent to a
Foreign Key, the Foreign Key checking is done by the webservice,
which should raise an exception back to the client if an invalid foreign
key is provided. Strictly speaking the concept of a Foreign Key does not
exist in NoSQL, but it could exist as an element of our WebService
Application Service Design.
TODO: Schema design tips for Document databases.
Couchbase
This page is a bootstrap page to help developers understand Couchbase and start
using it. Couchbase is a document oriented NoSQL Database utilizing JSON as its
native document format.
Couchbase has slightly different capabilities and features than other products that
are also considered NoSQL, such as Hadoop or MongoDB. To understand the
differences, it helps to understand couchbase's architecture, and the problems it is
meant to solve, and how that differs from other tools like Hadoop or MongoDB.
See notes on Couchbase-to-RAMSOFT communications at the
Couchbase_Meeting_May27_2015 page.
First of all, do not confuse Apache CouchDB with Couchbase. They are
different systems, with some historical links, and some of the original
CouchDB developers are now working on Couchbase. Both use an underlying
memory cache server called memcached as one of their cluster node's
available services. See CouchBase vs CouchDB
Current Version of Couchbase as of May 2015, is version 3.0. Version 4.0 is
planned for release mid 2015. Version 1 of couchbase was a merged product
based on Membase and CouchOne Inc. JSON store was introduced in 2.0.
Features in 4.0, including N1QL, which is pronounced Nickel
o
N1QL Tutorial: http://query.pub.couchbase.com/tutorial/
N1QL Cheatsheet: http://docs.couchbase.com/files/Couchbase-N1QLCheatSheet.pdf
Developer Preview 4 of 4.0 is being used by RAMSOFT to evaluate Couchbase
technology. Currently the DP4 release is only available on Linux and MacOS.
No Windows Server exists yet.
Database Schema Design, Replication and Document Oriented Databases.
o
https://www.youtube.com/watch?v=Mj4PKKidREU : Couchbase 103
Data Modeling
TODO: COuchbase Server Setup
TODO: Couchbase Client SDK Setups
TODO: Webservice design with COuchbase; How to move the schema
enforcement into our webservice layers.
TODO: Explain VIEWS, INDEXING and COLLATION.
Using Couchbase Admin Console
Admin consoles are accessed via your web browser, for example:
http://couchbase1.ramsoft.biz:8091
Login is Administrator. Ask any developer for the current Couchbase cluster Admin
password.
Optimistic Locking with CAS
Optimistic locking via Compare and Swap is one technique for consistency, without
transactional locking. This is the shipping 3.0 docs on the subject:
http://docs.couchbase.com/developer/dev-guide-3.0/update-info.html
get, cas, set handles success
one possible failure case we could code up involves short random delay, retry
(get, set again).
The underlying CAS model is the same in 4.0 but the client has been upgraded to
involve less client side checking, that is to say, when doing a replace, I believe in
4.0 you simply perform your operations, and must handle Exceptions raised in the
client when inconsistency is detected.
Using the N1QL Client (cbq) to interactively query and create indexes.
You need to be logged into the couchbase server machine to use cbq, at least as of
DP4. That means you need an ssh client on your windows PC, you can get putty
from the network at \\mainstorage\shared\Downloads\putty
You can ssh into couchbase1.ramsoft.biz, use devuser account as logon, ask Jasmine
or Warren for current password.
The primary unit of document aggregation is the bucket. To query a bucket it must
have at least one index. After you have read the tutorials and documents above,
you will know how to create an index. That is a command executed from cbq.
Using Couchbase From Visual Studio in C#
Demos in svn: http://svn.ramsoft.biz/core/misc/demos/couchbase%20c
%23%20demos
C#/.Net Couchbase SDK Documentation
Hello World Example (Visual Studio 2013 or higher)
o
Create new C# console application
Use NuGet package manager console and type Install-Package
CouchbaseNetClient
Use this code in your Program.cs:
o
o
using System;
using Couchbase.Core;
using Couchbase.Configuration.Client;
o
o
namespace Couchbase.HelloCouchbase
class Program
o
o
static void Main(string[] args)
o
o
o
ClientConfiguration _config = new ClientConfiguration();
_config.Servers.Add(new
Uri("http://couchbase1.ramsoft.biz:8091/pools"));
//_config.BucketConfigs.Add(...);
var _cluster = new Cluster(_config);
o
o
using (var bucket = _cluster.OpenBucket() )
var document = new Document<dynamic>
Id = "Hello",
Content = new
{
Name = "Couchbase"
o
o
};
o
o
var upsert = bucket.Upsert(document);
if (upsert.Success)
var get = bucket.GetDocument<dynamic>(document.Id);
document = get.Document;
var msg = string.Format("{0} {1}!", document.Id,
document.Content.Name);
Console.WriteLine(msg);
o
o
o
Console.WriteLine("couchbase communications success.
press enter to continue.");
} else {
Console.WriteLine("upsert failed. press enter to
continue.");
}
o
o
Console.Read();
o
o
Schema Requirements and Performance Requirements for NoSQL
Definitions
RIS (Radiology Information System) = A software system enabling a team of
office workers to manage a Radiology office's daily operations, most
importantly the scheduling of appointments, and the tracking of all study
information, billing and insurance information, and any other information
related to the operation of a medical office.
PACS (Picture Archiving and Communication System) = A software system
enabling a physician (whether a Radiologist or a general practitioner) to view
a Diagnostic Image (such as an X-Ray), and to view related previous images
for the same patient ('relevant priors') and to view diagnostic reports, or to
dictate or transcribe a diagnostic report.
Worklist = The heart of a RIS and the heart of a PACS. A query consisting of at
least, Study date and time, Patient identification, Workflow Status of a study.
Some example worklists:
o
Dave has a scheduler worklist that lets him see referrals and put them
onto a calendar.
Steve has a worklist that lets him see studies that need updates to
their insurance information.
Janet has a worklist that shows which X-Ray studies need to be read by
her today.
Architecture Diagram and User Types
Performance and Hardware Requirements
We don't necessarily want to deploy a 3 node cluster on every site. How is
the performance of 1, 2 or 3 node cluster using any given NoSQL Solution
(say Couchbase) compare to the same hardware running MS SQL Server? (1
node should be okay, but you need to deploy backup.)
Do you need a 3 node cluster (say 3x some 2015 model of dell rackmount
server) to equal the single node performance of MS SQL Server or some other
conventional SQL solution that has features that Firebird lacks, like
clustering? (No but you lose durability, and you have backup/data loss
concerns now.)
Architecture and Features
Is N1QL useful for updating or just for querying? Answer: 4.0 release is readonly (SELECT). DML features (update/modify) post 4.0, planned.
Are there transactional update features in N1QL in 4.0, or planned? What are
lightweight transactions? Answer: Post 4.0.
How does the introduction of N1QL change how we would design our keys or
our document data? (Not sure yet.)
Is .NET platform client SDK a first class SDK? Why didn't the .net client
support the N1QL in DP4? (Will be updated soon.)
Is there any reason why we can't have hundreds of buckets? We read that's
not recommended, why? (Server Memory usage. Plan to use less than 10
buckets.)
Schema Requirements
audit log
A simple log which is appended to, and read, the table can grow very large
over time, no transactional or locking complications. Up to 40% of our largest
site's data, over time, might be composed of audit log entries.
Exam Image datasets
Exam image datasets consist of medical images that are composed of two
parts
o
Header that contains information about the patient i.e. name, date of
birth, exam description etc. The size of the header is typically < 10 KB
Pixel data that contains the actual image. Compressed Pixel data can
vary from 10 KB to > 4 GB, especially for multi-frame pixel data.
Couchbase as a document size limit of 20 MB and best-practices state to
keep it < 1 MB. This should be enough to store the header of an image minus
pixel data and any large tags that would not be useful for querying anyways.
Currently, we store selected header fields within our RDBMS database along
with a file name pointer to refer to the image on disk.
Image datasets are immutable. They are inserted only and never updated,
though they can be deleted.
scheduler
atomic reservation of time slots demands certain transactional or atomic or
ACID-like capabilities. User A and User B are going to click "Reserve 8 AM slot
on Machine 123" at the same time. One of them needs to win. One of them
needs to get an error message.
schedules are a kind of worklist.
schedules have a calendar.
schedules have a set of resources, that are grouped into rooms, which are
part of facilities. (a natural relational master-detail, three levels, in a
relational model).
resources have rules and other properties like
what times they are available
what times they are unavailable
blocked time can be specified
o
in single block time entries
as part of recurrences or series groups, which affect multiple resources,
or all resources on a particular site or location.
appointment objects are linked to studies, but are additional state for a study.
they connect a study to a machine, at a particular time, and are considered a
"contract" with a client.
eventual consistency issues would be a deal-breaker in the scheduler, you
can't have two sites scheduling the same resource, and not seeing each
other's data instantly (ideally) or at very least they can create provisional
bookings, and must then receive some confirmation or some error that their
booking succeeded at the root level where-ever the booking becomes final.
(round trip ACID, in a distributed environment).
worklist and studies
define worklist : a worklist is a view into the pending, in-progress, and
completed studies
a study is the single most important object in any PACS DB or RISPACS DB.
a worklist is a filtered view of the studies a particular user of our system
needs to see. Currently relies heavily on relational database capabilities to
create our worklist queries, and our worklist queries define the worklist
element of the system.
transforming our studies, and our worklist into a document architecture,
represents the largest sea-change in our design, and it is also the most
complex and multi-layered set of requirements of any area of our database.
the rest of RIS - patients, doctors, insurers, providers, referrers, etc
define our insurance and study and patient models.
define all referring and other healthcare provider models.
discuss an approach to the RIS side of the database that maintains all prior
states of every object, having auditing and undo, and the inability to destroy
or overwrite data, but rather to have it be superceded by newer copies of the
same data, as a design element.
design for audit, design for immutability; A patient doesn't so much have a
current set of insurance information, but rather has a latest set of information
(almost like the tip of a version control system) and all historical states for
that information, whether it is insurance, allergies, or whatever, is still
retrievable.