CassandraTraining v3.3.4
By
Nirmallya Mukherjee
Index

Part 1 - Warmup
Part 2 - Introducing C*
Part 3 - Setup & Installation
Part 4 - Prelude to modelling
Part 5 - Write and Read paths
Part 6 - Modelling
Part 7 - Applications with C*
Part 8 - Administration
Part 9 - Future releases
Part 1
Warm up!
eBay
Netflix
Instagram
Comcast
Safeway
Sky
CERN
Travelocity
Spotify
Intuit
GE
Types of NoSQL
Key/Value - These NoSQL databases are some of the least complex: all of the data consists of an indexed key and a value. Examples include Amazon DynamoDB, Riak, and Oracle NoSQL Database
CAP Theorem
ACID
Atomicity - Atomicity requires that each transaction be "all or nothing"
Consistency - The consistency property ensures that any transaction
will bring the database from one valid state to another
Isolation - The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially
Durability - Durability means that once a transaction has been
committed, it will remain so under all circumstances
[CAP triangle: an RDBMS favours Consistency + Availability; Cassandra favours Availability + Partition Tolerance; Google Big Table favours Consistency + Partition Tolerance]
Sensor data
Logs
Events
Online usage impressions
Part 2
Introducing Cassandra
Database or Datastore?
Cassandra takes its data model from Google BigTable and its distribution design from Amazon's Dynamo
Masterless architecture
Why do large malls have more than one entrance?
[Diagram: Client 1, Client 2 and Client 3 each connect to a different replica (R1, R2, R3) in the ring - any node can serve any client]
Seed node(s)
It is like the recruitment department
Helps a new node come on board
Having more than one seed is a very good practice
One seed per rack can be a good choice
Starts the gossip process in new nodes
If you have more than one DC then include seed nodes from each DC
No other purpose
Ring without VNodes
[Diagram: each node (Node 1 - Node 6) owns a single contiguous token range, and data items map to whole nodes]
Ring with VNodes
[Diagram: each node (Node 1 - Node 6) owns many small virtual-node token ranges scattered around the ring]
Gossip
Partitioner
How do you all organize your cubicle/office space? Where will your stuff
be? Assume the floor you are on is about 25,000 sq ft.
A partitioner determines how to distribute the data across the nodes in the
cluster
Murmur3Partitioner - recommended for most purposes (and the default strategy as well)
Once a partitioner is set for a cluster, it cannot be changed without a data reload
The primary key is hashed using the Murmur3 hash to determine which node it goes to
There is no "master replica" - all replicas are identical
The token range it can produce is -2^63 to +2^63 - 1
Wikipedia details
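A minimal sketch (not part of the original deck) of how the hashing plays out in practice: the Java driver can report which replicas own a given partition key, using the cluster's configured partitioner (Murmur3 by default). The contact point and the meterdata keyspace are borrowed from other examples in this deck; imports from com.datastax.driver.core and java.nio are omitted as in the other snippets.

// Ask the driver which nodes own the partition key "meter_id_001"
Cluster cluster = Cluster.builder().addContactPoint("10.24.37.1").build();
ByteBuffer pk = ByteBuffer.wrap("meter_id_001".getBytes(StandardCharsets.UTF_8));
Set<Host> replicas = cluster.getMetadata().getReplicas("meterdata", pk);
for (Host host : replicas) {
    System.out.println("Replica for meter_id_001: " + host.getAddress());
}
cluster.close();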
Replication
How many of you have multiple copies of your most critical files (password file, IT return confirmation, etc.) on external drives?
Replication determines how many copies of the data will be
maintained in the cluster across nodes
There is no single magic formula to determine the correct number
The widely accepted number is 3 but your use case can be
different
This has an impact on the number of nodes in the cluster - you cannot have a high replication factor with only a few nodes
Replication factor <= number of nodes
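Replication is declared per keyspace. A minimal sketch (the data center names DC1 and DC2 are assumptions; the Session is obtained as in the driver snippet shown later in the deck):

// RF = 3 per data center - the widely accepted starting point mentioned above
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS meterdata WITH replication = " +
    "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}");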
[Diagram: token ranges around the ring - Data Range 1 (1-25), Data Range 2 (26-50), Data Range 3 (51-75), Data Range 4 (76+ wrapping range)]
Snitch
What can you typically find at the entrance of a very large mall? What's the need for the "Can I help you?" desk?
The objective is to group machines into racks and data centers to set up a DC/DR
Informs C* about the rack and DC locations, which determines which nodes the replicas go to
Identifies which DC and rack a node belongs to
Routes requests efficiently (replication)
Tries not to have more than one replica in the same rack (may not be a physical grouping)
Allows for a truly distributed fault tolerant cluster
Seamless synchronization of data across DC/DR
Selected at the time of C* installation in the cassandra.yaml file
Changing a snitch is a long drawn-out process, especially if data exists in the keyspace - you have to run a full repair in your cluster
Vertical vs horizontal scalability

Vertical scalability:
1. Large CAPEX
2. Wasted/Idle resource
3. Failure takes out a large chunk
4. Expensive redundancy model
5. One shoe fitting all model
6. Too much co-existence

Horizontal scalability (this is what Google, LinkedIn and Facebook do; the norm is now being adopted by large corporations as well):
Just in time expansion; stay in tune with the load. No need to build ahead in time in anticipation
[Diagram: scaling out horizontally - rings of 4, 8 and 10 nodes (n1 ... n10)]
Deployment - 4 dimensions

1. Distributed deployment
//These are the C* nodes that the DAO will look to connect to
public static final String[] cassandraNodes = { "10.24.37.1", "10.24.37.2", "10.24.37.3" };
public enum clusterName { events, counter }
//You can build with policies like withRetryPolicy(), .withLoadBalancingPolicy(Round Robin)
cluster = Cluster.builder().addContactPoints(Config.cassandraNodes).build();
session1 = cluster.connect(Constants.clusterName.events.toString());
[Diagram: Cluster SKL-Platform spans Region 1 (DC) with RACK 1, RACK 2 and RACK 3, and Region 2 (DR) with RACK A and RACK B]
[Diagram: Data Center1 and Data Center2, each with Rack1 and Rack2 holding Node1 - Node12; replicas R1 - R4 are spread across the racks and data centers]
Part 3
Setup & Installation
Data directory
Commit Log directory
Cache directory
System log configuration
Creating a Cluster
Adding Nodes to a Cluster
Multiple Seed Nodes, Ring management
Sizing is specific to the application - discussion
Clock Synchronization and its implications
Cassandra.yaml - an introduction
There are more configuration parameters, but we will cover those as we move along ...
Cassandra-env.sh
show version
show host
describe keyspaces
describe keyspace <name>
describe tables
describe table <name>
Part 4
Prelude to Modeling
Database
Admin/System keyspaces
SELECT * FROM
system.schema_keyspaces
system.peers
system.schema_columnfamilies
system.schema_columns
system.local (info about itself)
system.hints
system.<many others>
Database
create table
meterdata.bill_date (...)
pk, references, engine,
charset etc
PK is mandatory in C*
Insert into bill_data () values ();
Update bill_data set=.. where ..
Delete from bill_data where ..
Simple key
Composite key
The PK is hashed
A partition is fetched from the disk in one disk seek; every partition requires a seek

PK -> Hash
meter_id_001 -> 558a666df55
meter_id_002 -> 88yt543edfhj
meter_id_003 -> aad543rgf54l
Partition key + clustering column(s) = a unique row

Partition key Meter_id_001 with clustering columns 26-Jan-2014, 27-Jan-2014, 28-Jan-2014, ..., 15-Dec-2014, and a reading value stored against each date (147, 159, 165, ..., 183, 58, 86, 91, ..., 302)
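A sketch of a table definition that would produce the layout illustrated above (the table and column names are assumptions, not from the deck):

session.execute(
    "CREATE TABLE IF NOT EXISTS meterdata.meter_readings (" +
    "  meter_id text," +                        // partition key
    "  reading_date timestamp," +               // clustering column
    "  reading int," +
    "  PRIMARY KEY (meter_id, reading_date)" +
    ") WITH CLUSTERING ORDER BY (reading_date ASC)");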
private static final String allEventCql = "insert into all_events (event_type, date, created_hh, "
    + "created_min, created_sec, created_nn, data) values (?, ?, ?, ?, ?, ?, ?)";

Session session = CassandraDAO.getEventSession();
BoundStatement boundStatement = getWritableStatement(session, allEventCql);
session.execute(boundStatement.bind(.., .., ..));
PK - data partitioning
Every node in the cluster is an owner of a range of tokens
that are calculated from the PK
After the client sends the write request, the coordinator uses the configured partitioner to determine the token value
Then it looks for the node that owns that token range (the primary range) and puts the first replica there
A node can have other ranges as well but has one primary range
Subsequent replicas are placed with respect to the node holding the first copy/replica
Data with the same partition key will reside on the same physical node
Clustering columns (in case of a compound PK) do not impact the choice of node
Default sort based on clustering columns is ASC
Coordinator Node
Incoming requests (read/write) are handled by a coordinator
[Diagram: a client request arrives at one node of the ring (the coordinator), which forwards it to the replica nodes n1 - n8]
Consistency
Tunable at runtime
ONE (default)
QUORUM (strict majority w.r.t RF)
ALL
Applies to both reads and writes
protected static BoundStatement getWritableStatement(Session session, String cql,
boolean setAnyConsistencyLevel) {
PreparedStatement statement = session.prepare(cql);
if(setAnyConsistencyLevel) {
statement.setConsistencyLevel(ConsistencyLevel.ONE);
}
BoundStatement boundStatement = new BoundStatement(statement);
return boundStatement;
}
What is a Quorum?
quorum = RoundDown(sum_of_replication_factors / 2) + 1
sum_of_replication_factors = datacenter1_RF + datacenter2_RF + ... + datacenterN_RF
Using a replication factor of 3, a quorum is 2 nodes. The cluster can tolerate 1
replica down.
Using a replication factor of 6, a quorum is 4. The cluster can tolerate 2
replicas down.
In a two data center cluster where each data center has a replication factor of
3, a quorum is 4 nodes. The cluster can tolerate 2 replica nodes down.
In a five data center cluster where two data centers have a replication factor
of 3 and three data centers have a replication factor of 2, a quorum is 6 nodes.
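The formula in code - a tiny sketch (plain Java, no driver needed):

int[] dcReplicationFactors = { 3, 3 };           // e.g. two data centers, RF = 3 each
int sum = 0;
for (int rf : dcReplicationFactors) {
    sum += rf;                                   // sum_of_replication_factors
}
int quorum = (sum / 2) + 1;                      // integer division = RoundDown, so quorum = 4 here
System.out.println("QUORUM needs " + quorum + " replica acks");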
Write Consistency - Write ONE
Send requests to all replicas in the cluster applicable to the PK
Wait for ONE ack before returning to the client
[Diagram: the coordinator forwards the write to all replicas; the fastest replica acks quickly (5) while slower replicas answer later (10, 120) - the client gets its response after the first ack]
Write Consistency - Write QUORUM
Send requests to all replicas
Wait for QUORUM acks before returning to the client
[Diagram: the coordinator returns once a majority of the replicas have acked, without waiting for the slowest one]
Read Consistency - Read ONE
Read from one node among all replicas
Contact the fastest node (based on the stats the coordinator keeps)
[Diagram: the coordinator sends the read to the single replica it believes will answer fastest]
Read Consistency - Read QUORUM
Read the data from the fastest node
AND request digests from the other replicas to reach QUORUM
Return the most up-to-date data to the client
[Diagram: the coordinator gets the full data from one replica and digests from the others, compares them, and returns the newest value]
Consistency in Action (RF = 3)
Write ONE, Read ONE: write B, a read may still return A (stale read possible)
Write ONE, Read QUORUM: write B, a read may still return A
Write ONE, Read ALL: write B, read returns B
Write QUORUM, Read ONE: write B, a read may still return A
Write QUORUM, Read QUORUM: write B, read returns B
Consistency Level
ONE
Fast write, may not read latest
written value
QUORUM / LOCAL_QUORUM
Strict majority w.r.t. Replication Factor
Good balance
ALL
Not the best choice
Slow, no high availability
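A sketch of tuning the consistency level at runtime with the Java driver (the mailbox table from Part 6 is used purely for illustration; imports omitted as elsewhere):

PreparedStatement ps = session.prepare("SELECT * FROM mailbox WHERE login = ?");
ps.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);   // the "good balance" above
ResultSet rs = session.execute(ps.bind("jdoe"));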
Hinted Handoff
A write request is received by the coordinator
Coordinator finds a node is down/offline
It stores "Hints" on behalf of the offline node if the coordinator knows in advance that the node is down. Handoff is not taken if the node goes down at the time of the write.
Read repair / nodetool repair will fix inconsistencies.
[Diagram: the coordinator keeps hints for the offline replica while the writes to the live replicas succeed]
Hinted Handoff
When the offline node comes up, the coordinator forwards the stored hints
The node synchs up its state with the hints
The coordinator cannot perpetually hold the hints
[Diagram: the stored hints are replayed from the coordinator to the recovered node]
[Diagram: the replicas return different versions to the coordinator - "Anderson" with ts = 268, "Anderson" with ts = 521 and "Neo" with ts = 851; the cell with the highest timestamp ("Neo") wins and is returned]
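Every cell carries the write timestamp used for this conflict resolution, and CQL exposes it via writetime(). A sketch against a hypothetical users table (id int PRIMARY KEY, name text):

Row row = session.execute("SELECT name, writetime(name) FROM users WHERE id = 1").one();
System.out.println(row.getString("name")
    + " was written at " + row.getLong(1) + " (microseconds since epoch)");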
Part 5
Write and Read path
Storage
Writes are harder than reads to scale
Spinning disks aren't good with
random I/O
Goal: minimize random I/O
Log Structured Merge Tree (LSM)
Push changes from a memory-based
component to one or more disk
components
[Diagram: the write path - the coordinator appends every write to the commit log on disk (durable_writes = TRUE) and applies it to the table's MemTable in memory; MemTables are later flushed from memory to SSTables on disk (Commit log1..n, SSTable1, SSTable2)]
[Diagram: each table (Table1, Table2, Table3) has its own MemTable in memory and its own SSTables on disk, alongside the commit log segments]
These are multiple generations of SSTables which are compacted into one SSTable
select * from system.sstable_activity;
A table's data = its MemTable + all of its SSTables that have been flushed
Offheap memtables
Off heap buffers - moves the cell/column name and value to DirectBuffer objects. This has the lowest impact on reads - the values are still live Java buffers - but it only reduces the heap significantly when you are storing large strings or blobs
Off heap objects - moves the entire cell off heap, leaving only the NativeCell reference containing a pointer to the native (off-heap) data. This makes it effective for small values like ints or uuids as well, at the cost of having to copy it back on-heap temporarily when reading from it (likely to become the default in C* 3.0)
Writes are about 5% faster with offheap_objects enabled, primarily because Cassandra doesn't need to flush as frequently. Bigger SSTables mean less compaction is needed. Reads are more or less the same
Commit log
It is replayed in case a node went down and wants to come back up
This replay recreates the MemTables for that node
The commit log comprises pieces (files) whose size can be controlled (commitlog_segment_size_in_mb)
The total commit log space is a controlled parameter as well (commitlog_total_space_in_mb)
The commit log itself is also buffered in memory. It is then written to disk in two ways
Batch - all acks to requests wait until the commit log is flushed to disk
Periodic - the request is immediately acked but the commit log is flushed only after some time. If a node were to go down in this period then data can be lost if RF=1
Debate - the "periodic" setting gives better performance and we should use it. How to avoid the chances of data loss?
An SSTable's components can be
CompressionInfo - compression info metadata
Data - PK, data size, column idx, row level tombstone info, column count, column list in sorted order by name
Filter - Bloom filter
Index - index, also includes Bloom filter info, tombstones
Statistics - histograms for row size, gen numbers of the files from which this SST was compacted
Summary - index summary (that is loaded in memory for read optimizations)
TOC - list of files
Digest - text file with a digest
Format - internal C* format, e.g. "jb" is the C* 2.0 format, 2.1 may have "ka"
[Diagram: read path, step 1 - check the row cache before touching SSTable1/2/3]
Read path, step 2 - Bloom filter
One Bloom filter exists per SSTable and Memtable that the node is serving
A probabilistic filter that answers either "maybe the partition key exists" in its corresponding SSTable or a definite NO
If any of the Bloom filters return a possible "YES" then the partition key may exist in one or more of the SSTables
Proceed to look at the partition key cache for the SSTables for which there is a probability of finding the partition key, ignore the other SST key caches
Physically a single key cache is maintained ...
Partition key cache (partition key -> offset into the SSTable data file):
# Partition001 -> 0x0
# Partition002 -> 0x153
# Partition350 -> 0x5464321
The partition key cache stores the offset positions of the partition keys that have been read recently.
If the partition key is not found in the key cache then the read proceeds to check the "Key index sample" or "Partition summary" (in memory). It is a subset of the "Partition Index", which is the full index of the SSTable
Next, a bit more on the key index sample ...
Key index sample (sampled key -> offset):
# Partition001 -> 0x0
# Partition128 -> 0x4500
# Partition256 -> 0x851513
# Partition512 -> 0x5464321
Think of the offset as an absolute disk address that fseek in C can use
The ratio of the key index sample is 1 per 128 keys
[Diagram: the full read path - row cache, MemTable, Bloom filter, partition key cache, key index sample, partition index, and finally the SSTable on disk]
Row caching
Caches are also written to disk so that they come alive after a restart
Global settings are also possible in the cassandra.yaml file
Key caching
Eager retry
If a node is slow in responding to a request, the coordinator forwards it to another node holding a replica of the requested partition
Valid if RF > 1
C* 2.0+ feature
[Diagram: the client/driver reads partition 91; when Node 1 is slow, the read is retried on another replica (Node 4)]
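Eager retry (also called rapid read protection) is configured per table; a sketch using the hypothetical meter_readings table from earlier:

session.execute(
    "ALTER TABLE meterdata.meter_readings WITH speculative_retry = '99percentile'");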
Updates, tombstones and compaction - an example

Partition "Jdoe" initially holds Age(t1) = 33 and Name(t1) = John DOE; every cell carries a write timestamp.
An update (Age(t2) = 34) goes into a newer SSTable (SSTable2); the old cell in SSTable1 is never modified in place.
A later delete of Age writes a tombstone (Age(t3) = X, "RIP") into SSTable3.
On a read for "Jdoe", C* reconciles the cells found in SSTable1, SSTable2 and SSTable3 by timestamp - the tombstone at t3 is newer than Age(t2) = 34, so Age is gone.

Compaction
Compaction merges the SSTable generations into a new SSTable: for "Jdoe" only Name(t1) = John DOE survives, because the Age cell was deleted.
Compaction
Three strategies
SizeTiered (default)
Leveled (needs about 50% more I/O than size tiered but the number of SSTables visited for a read will be less)
DateTiered
Size - it is best to use this strategy when you have insert-heavy, read-light workloads
Level - it is best suited for column families with read-heavy workloads that have frequent updates to existing rows
Date - for time series data along with TTL
Leveled compaction
Leveled compaction creates SSTables of a fixed, relatively small size (5MB by default in Cassandra's implementation) that are grouped into levels.
Within each level, SSTables are guaranteed to be non-overlapping. Each level is ten times as large as the previous
Leveled compaction guarantees that 90% of all reads will be satisfied from a single SSTable (assuming nearly-uniform row size). The worst case is bounded by the total number of levels, e.g. 7 for 10TB of data
DateTiered compaction
This particularly applies to time series data where the data lives for a specific time
Use the DateTiered compaction strategy
C* looks at the min and max timestamp of the SSTable and finds out if anything in it is really live
If not, that SSTable is simply unlinked and compaction with another SSTable is avoided entirely
Set the TTL in the CF definition so that if bad code inserts without a TTL it does not cause unnecessary compaction
Lightweight transactions
Transactions are a bit different
"Compare and Set" model
Triggers
The actual trigger logic resides
outside in a Java POJO
Feature may still be experimental
Has write performance impact
Better to have application logic do
any pre-processing
CREATE TRIGGER myTrigger
ON myTable
USING 'Java class name';
Debate - Locking in C*
[Diagram: several application instances (Instance 1, 2, 3) coordinate through an external store (Memcached / Redis) that holds keys k=_0_0, k=_0_a, ..., k=_z_z with String values. Step 2: get an available ID that has flag=0 and starts with 0. The other instances work exactly like instance 1 - they acquire locks on other keys so that no two instances step on each other]
Part 6
Modeling
[Diagram: the RDBMS way starts from the structure, the model and the data (keys, storage, data-types); the C* way starts from the end user, the report and the business question, and works back to the model]
C* modelling is QDD (query driven)
Denormalization
Unlike RDBMS you cannot use any foreign
key relationships
FK and joins are not supported anyway!
Embed details, it will cause duplication but
that is alright
Helps in getting the complete data in a
single read
More efficient, less random I/O
May have to create more than one
denormalized table to serve different
queries
UDT is a great way (UDT a bit later...)
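A sketch of what this looks like in practice (the table names are hypothetical): the same reading is written to two query-specific tables so that each business question becomes a single-partition read.

session.execute(
    "CREATE TABLE IF NOT EXISTS meterdata.readings_by_meter (" +
    "meter_id text, reading_date timestamp, reading int, " +
    "PRIMARY KEY (meter_id, reading_date))");
session.execute(
    "CREATE TABLE IF NOT EXISTS meterdata.readings_by_day (" +
    "reading_date timestamp, meter_id text, reading int, " +
    "PRIMARY KEY (reading_date, meter_id))");
// The application writes every reading to both tables - duplication is the price of fast reads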
Datatypes
Timestamp, timeuuid
Counter
User defined types (UDT)
Static modifier - the value is stored once per partition and shared by its rows; saves storage space BUT a very specific use case
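Two quick sketches (the names are assumed, not from the deck): a user defined type, and a counter column that can only be incremented or decremented:

session.execute(
    "CREATE TYPE IF NOT EXISTS address_book.fullname (firstname text, lastname text)");
session.execute(
    "CREATE TABLE IF NOT EXISTS address_book.page_views (page text PRIMARY KEY, views counter)");
session.execute(
    "UPDATE address_book.page_views SET views = views + 1 WHERE page = '/home'");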
Time series: the partition Meter_id_001 stores one clustering column per reading time T = (t1, t2, t3, ..., tn) - 26-Jan-2014, 27-Jan-2014, 28-Jan-2014, ..., 15-Dec-2014 - with a reading value (147, 159, 165, ..., 183, 58, 86, 91, ..., 302) stored against each
Modeling notation
Chen's notation for conceptual modeling
helps in design
[Residue of a worked modeling example: several denormalized query tables, each keyed by (..., vehicle_id, lot_id)]
YES, there will be more writes and more data BUT that is an acceptable
fact in big data modeling that looks to optimize the query and user
experience more than data normalization
// a collection map
INSERT INTO address_book.users (id, name) VALUES (1, {firstname: 'Dennis', lastname: 'Ritchie'});

UPDATE address_book.users SET addresses = addresses + {'home': {street: '9779 Forest Lane', city: 'Dallas', zip_code: 75015, phones: {'001 972 555 6666'}}} WHERE id = 1;

SELECT name.firstname FROM users WHERE id = 1;

CREATE INDEX ON address_book.users (name);

SELECT id FROM address_book.users WHERE name = {firstname: 'Dennis', lastname: 'Ritchie'};
Queries

CREATE TABLE mailbox (
    login text,
    message_id timeuuid,
    interlocutor text,
    message text,
    PRIMARY KEY (login, message_id)
);

Get message by user and message_id (date)
(for a timeuuid column an exact lookup uses the actual timeuuid value; date boundaries use minTimeuuid/maxTimeuuid as in the interval query below)
SELECT * FROM mailbox
WHERE login = 'jdoe'
  AND message_id = '2014-07-12 16:00:00';

Get messages by user and date interval
SELECT * FROM mailbox
WHERE login = 'jdoe'
  AND message_id <= maxTimeuuid('2014-07-12 16:00:00')
  AND message_id >= minTimeuuid('2014-01-12 16:00:00');
"Allow filtering" can override the default behaviour of cross partition access but it is
not a good practice at all (incorrect data model)
select .. from meter_reading where record_date > '2014-12-15'
(assuming there is a secondary index on record_date)
Only exact match (=) predicate on #partition, range queries (<,<=,<=,>) not
allowed (means there are no multi row updates)
In case of a compound partition key IN is allowed in the last column of the compound
key (if partition key has one only column then it works on that column)
On clustering columns, exact match and range query predicates (<,<=,<=,>, IN) are
allowed
Order of the filters must match the order of primary key definition otherwise create
secondary index (anti-pattern)
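Two sketches against the mailbox table above (PRIMARY KEY (login, message_id)) showing what these rules allow:

// Allowed: exact match on the partition key plus a range on the clustering column
session.execute("SELECT * FROM mailbox WHERE login = 'jdoe' " +
    "AND message_id >= minTimeuuid('2014-01-12 16:00:00')");

// Not allowed: a range predicate on the partition key itself
// session.execute("SELECT * FROM mailbox WHERE login > 'a'");   // rejected by C*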
Order by restrictions
If the primary key is (industry_id,
exchange_id, stock_symbol) then the
following sort orders are valid
order by exchange_id desc, stock_symbol desc
order by exchange_id asc, stock_symbol asc
Secondary Index
Consider the earlier example of the table all_events, what will happen if we
try to get the records based on minute?
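Without an index on created_min such a query is rejected. A sketch of adding a secondary index (the column type is an assumption based on the insert shown in Part 4), keeping the anti-pattern warning above in mind - the query becomes a scatter-gather across nodes:

session.execute("CREATE INDEX ON all_events (created_min)");
ResultSet byMinute = session.execute("SELECT * FROM all_events WHERE created_min = 30");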
CQL Limits
Limits (as per Datastax documentation)
Clustering column value, length of: 65535
Collection item, value of: 2GB (Cassandra 2.1 v3
protocol), 64K (Cassandra 2.0.x and earlier)
Collection item, number of: 2B (Cassandra 2.1 v3
protocol), 64K (Cassandra 2.0.x and earlier)
Columns in a partition: 2B
Fields in a tuple: 32768, but try not to have 1000's of fields
Key length: 65535
Query parameters in a query: 65535
Single column, value of: 2GB, xMB are
recommended
Statements in a batch: 65535
Tip: batches work well ONLY if all records have the same partition key
In this case all records go to the same node.
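A sketch of such a batch with the Java driver, reusing the allEventCql insert from Part 4; the partition key is assumed to be (event_type, date) and the bound value types are assumptions about the all_events schema:

BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
PreparedStatement ps = session.prepare(allEventCql);
batch.add(ps.bind("pressure", "2014-12-15", 10, 30, 15, 1, "payload-1"));
batch.add(ps.bind("pressure", "2014-12-15", 10, 30, 15, 2, "payload-2"));
session.execute(batch);   // all rows share the partition key, so one node takes the whole batch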
Part 7
Applications with Cassandra
http://www.datastax.com/documentation/developer/java-driver/2.1/javadriver/fourSimpleRules.html
C* client sample
http://www.datastax.com/drivers/java/2.1/index.html
Part 8
Administration
Monitoring - Opscenter
Workload modeling
Workload characterization
Performance characteristics of the
cluster
Latency analysis of the cluster
Performance of the OS
Disk utilization
Read / Write operations per second
OS Memory utilization
Heap utilization
Dropped MUTATION messages: this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti-Entropy Repair (perhaps because of load C* is defending itself by dropping messages).
Node down
Node tool
Very important tool to manage C* in
production for day to day admin
Nodetool has over 40 commands
Can be found in the
$CASSANDRA_HOME/bin folder
Try a few commands
./nodetool -h 10.21.24.11 -p 7199 status
./nodetool -h 10.21.24.11 -p 7199 info
Node repair
How to compare two large data sets?
Assume two arrays
Each containing 1 million numbers
Further assume the order of storage is fixed
How to compare the two arrays?
Potential solution
Loop 1 and check the other?
How long will it take to loop 1 million?
What happens if the data gets into Billions?
Lots of inefficiencies!
Merkle tree is a tree in which every non-leaf node is labelled with the
hash of the labels of its children nodes. Hash trees are useful because
they allow efficient and secure verification of the contents of large
data structures
Currently the main use of hash trees is to make sure that data blocks
received from other peers in a peer-to-peer network are received
undamaged and unaltered, and even to check that the other peers do
not lie and send fake blocks
The repair coordinator node compares the Merkle trees and finds
all the sub token ranges that differ between the replicas and
repairs data in those ranges
Schema disagreements
Perform schema changes one at a time, at a steady pace,
and from the same node
Do not make multiple schema changes at the same time
If NTP is not in place it is possible that the schemas may not
be in synch (usual problem in most cases)
Check if the schema is in agreement
http://www.datastax.com/documentation/cassandra/2.0/cas
sandra/dml/dml_handle_schema_disagree_t.html
./nodetool describecluster
Nodetool cfhistograms
Will tell how many SSTables were looked at to
satisfy a read
With level compaction should never go > 3
With size tiered compaction should never go > 12
If the above are not in place then compaction is falling
behind
check with ./nodetool compactionstats
it should say "pending tasks: 0"
Slow queries
Use the DataStax Enterprise Performance Service
to automatically capture long-running queries
(based on response time thresholds you specify)
and then query the performance table that holds
those cql statements
cqlsh:dse_perf> select * from node_slow_log;
Concurrency
concurrent_reads
concurrent_writes
Application
Consistency levels affect read/write
JNA
Java Native Access allows for more efficient communication between the JVM and the OS
Ensure JNA is enabled and if not do
the following
sudo apt-get install libjna-java
About me
That's it!
www.linkedin.com/in/nirmallya
nirmallya.mukherjee@gmail.com