CassandraTraining v3.3.4
By
Nirmallya Mukherjee
Index

Part 1 - Warmup
Part 2 - Introducing C*
Part 3 - Setup & Installation
Part 4 - Prelude to modelling
Part 5 - Write and Read paths
Part 6 - Modelling
Part 7 - Applications with C*
Part 8 - Administration
Part 9 - Future releases
Part 1
Warm up!
eBay
Netflix
Instagram
Comcast
Safeway
Sky
CERN
Travelocity
Spotify
Intuit
GE
Types of NoSQL
Key/Value - These NoSQL databases are some of the least complex: all of the data consists of an indexed key and a value. Examples include Amazon DynamoDB, Riak, and Oracle NoSQL Database
CAP Theorem
ACID
Atomicity - Atomicity requires that each transaction be "all or nothing"
Consistency - The consistency property ensures that any transaction
will bring the database from one valid state to another
Isolation - The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially
Durability - Durability means that once a transaction has been
committed, it will remain so under all circumstances
[CAP triangle: an RDBMS favours Consistency + Availability; Cassandra favours Availability + Partition Tolerance; Google Big Table favours Consistency + Partition Tolerance]
Sensor data
Logs
Events
Online usage impressions
Part 2
Introducing Cassandra
Database or Datastore?
Cassandra takes its data model from Google BigTable and its distribution design from Amazon's Dynamo
Masterless architecture
Why do large malls have more than one entrance?
[Diagram: Client 1, Client 2 and Client 3 each connect to a different replica (R1, R2, R3) in the ring - any node can serve any client]
Seed node(s)
It is like the recruitment department
Helps a new node come on board
Having more than one seed is a very good practice
One seed per rack can be a good choice
Starts the gossip process in new nodes
If you have more than one DC then include seed nodes from each DC
No other purpose
Ring without VNodes
[Diagram: each node (Node 1 - Node 6) owns a single contiguous token range, and data items map to whole nodes]
Ring with VNodes
[Diagram: each node (Node 1 - Node 6) owns many small virtual-node token ranges scattered around the ring]
Gossip
Partitioner
How do you all organize your cubicle/office space? Where will your stuff
be? Assume the floor you are on is about 25,000 sq ft.
A partitioner determines how to distribute the data across the nodes in the
cluster
Murmur3Partitioner - recommended for most purposes (and the default strategy as well)
Once a partitioner is set for a cluster, it cannot be changed without a data reload
The primary key is hashed using the Murmur3 hash to determine which node it goes to
There is no "master replica" - all replicas are identical
The token range it can produce is -2^63 to +2^63 - 1
Wikipedia details
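A minimal sketch (not part of the original deck) of how the hashing plays out in practice: the Java driver can report which replicas own a given partition key, using the cluster's configured partitioner (Murmur3 by default). The contact point and the meterdata keyspace are borrowed from other examples in this deck; imports from com.datastax.driver.core and java.nio are omitted as in the other snippets.

// Ask the driver which nodes own the partition key "meter_id_001"
Cluster cluster = Cluster.builder().addContactPoint("10.24.37.1").build();
ByteBuffer pk = ByteBuffer.wrap("meter_id_001".getBytes(StandardCharsets.UTF_8));
Set<Host> replicas = cluster.getMetadata().getReplicas("meterdata", pk);
for (Host host : replicas) {
    System.out.println("Replica for meter_id_001: " + host.getAddress());
}
cluster.close();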
Replication
How many of you have multiple copies of your most critical files (password file, IT return confirmation, etc.) on external drives?
Replication determines how many copies of the data will be
maintained in the cluster across nodes
There is no single magic formula to determine the correct number
The widely accepted number is 3 but your use case can be
different
This has an impact on the number of nodes in the cluster - you cannot have a high replication factor with only a few nodes
Replication factor <= number of nodes
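Replication is declared per keyspace. A minimal sketch (the data center names DC1 and DC2 are assumptions; the Session is obtained as in the driver snippet shown later in the deck):

// RF = 3 per data center - the widely accepted starting point mentioned above
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS meterdata WITH replication = " +
    "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}");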
[Diagram: token ranges around the ring - Data Range 1 (1-25), Data Range 2 (26-50), Data Range 3 (51-75), Data Range 4 (76+ wrapping range)]
Snitch
What can you typically find at the entrance of a very large mall? What's the need for the "Can I help you?" desk?
The objective is to group machines into racks and data centers to set up a DC/DR
Informs C* about the rack and DC locations, which determines which nodes the replicas go to
Identifies which DC and rack a node belongs to
Routes requests efficiently (replication)
Tries not to have more than one replica in the same rack (may not be a physical grouping)
Allows for a truly distributed fault tolerant cluster
Seamless synchronization of data across DC/DR
Selected at the time of C* installation in the cassandra.yaml file
Changing a snitch is a long drawn-out process, especially if data exists in the keyspace - you have to run a full repair in your cluster
Vertical vs horizontal scalability

Vertical scalability:
1. Large CAPEX
2. Wasted/Idle resource
3. Failure takes out a large chunk
4. Expensive redundancy model
5. One shoe fitting all model
6. Too much co-existence

Horizontal scalability (this is what Google, LinkedIn and Facebook do; the norm is now being adopted by large corporations as well):
Just in time expansion; stay in tune with the load. No need to build ahead in time in anticipation
[Diagram: scaling out horizontally - rings of 4, 8 and 10 nodes (n1 ... n10)]
Deployment - 4 dimensions

1. Distributed deployment
//These are the C* nodes that the DAO will look to connect to
public static final String[] cassandraNodes = { "10.24.37.1", "10.24.37.2", "10.24.37.3" };
public enum clusterName { events, counter }
//You can build with policies like withRetryPolicy(), .withLoadBalancingPolicy(Round Robin)
cluster = Cluster.builder().addContactPoints(Config.cassandraNodes).build();
session1 = cluster.connect(Constants.clusterName.events.toString());
[Diagram: Cluster SKL-Platform spans Region 1 (DC) with RACK 1, RACK 2 and RACK 3, and Region 2 (DR) with RACK A and RACK B]
[Diagram: Data Center1 and Data Center2, each with Rack1 and Rack2 holding Node1 - Node12; replicas R1 - R4 are spread across the racks and data centers]
Part 3
Setup & Installation
Data directory
Commit Log directory
Cache directory
System log configuration
Creating a Cluster
Adding Nodes to a Cluster
Multiple Seed Nodes, Ring management
Sizing is specific to the application - discussion
Clock Synchronization and its implications
Cassandra.yaml - an introduction
There are more configuration parameters, but we will cover those as we move along ...
Cassandra-env.sh
show version
show host
describe keyspaces
describe keyspace <name>
describe tables
describe table <name>
Part 4
Prelude to Modeling
Database
Admin/System keyspaces
SELECT * FROM
system.schema_keyspaces
system.peers
system.schema_columnfamilies
system.schema_columns
system.local (info about itself)
system.hints
system.<many others>
Database
create table
meterdata.bill_date (...)
pk, references, engine,
charset etc
PK is mandatory in C*
Insert into bill_data () values ();
Update bill_data set=.. where ..
Delete from bill_data where ..
Simple key
Composite key
The PK is hashed
A partition is fetched from the disk in one disk seek; every partition requires a seek

PK -> Hash
meter_id_001 -> 558a666df55
meter_id_002 -> 88yt543edfhj
meter_id_003 -> aad543rgf54l
Partition key + clustering column(s) = a unique row

Partition key Meter_id_001 with clustering columns 26-Jan-2014, 27-Jan-2014, 28-Jan-2014, ..., 15-Dec-2014, and a reading value stored against each date (147, 159, 165, ..., 183, 58, 86, 91, ..., 302)
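A sketch of a table definition that would produce the layout illustrated above (the table and column names are assumptions, not from the deck):

session.execute(
    "CREATE TABLE IF NOT EXISTS meterdata.meter_readings (" +
    "  meter_id text," +                        // partition key
    "  reading_date timestamp," +               // clustering column
    "  reading int," +
    "  PRIMARY KEY (meter_id, reading_date)" +
    ") WITH CLUSTERING ORDER BY (reading_date ASC)");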
private static final String allEventCql = "insert into all_events (event_type, date, created_hh, "
    + "created_min, created_sec, created_nn, data) values (?, ?, ?, ?, ?, ?, ?)";

Session session = CassandraDAO.getEventSession();
BoundStatement boundStatement = getWritableStatement(session, allEventCql);
session.execute(boundStatement.bind(.., .., ..));
PK - data partitioning
Every node in the cluster is an owner of a range of tokens
that are calculated from the PK
After the client sends the write request, the coordinator uses the configured partitioner to determine the token value
Then it looks for the node that owns that token range (the primary range) and puts the first replica there
A node can have other ranges as well but has one primary range
Subsequent replicas are placed with respect to the node holding the first copy/replica
Data with the same partition key will reside on the same physical node
Clustering columns (in case of a compound PK) do not impact the choice of node
Default sort based on clustering columns is ASC
Coordinator Node
Incoming requests (read/write) are handled by a coordinator
[Diagram: a client request arrives at one node of the ring (the coordinator), which forwards it to the replica nodes n1 - n8]
Consistency
Tunable at runtime
ONE (default)
QUORUM (strict majority w.r.t RF)
ALL
Applies to both reads and writes
protected static BoundStatement getWritableStatement(Session session, String cql,
boolean setAnyConsistencyLevel) {
PreparedStatement statement = session.prepare(cql);
if(setAnyConsistencyLevel) {
statement.setConsistencyLevel(ConsistencyLevel.ONE);
}
BoundStatement boundStatement = new BoundStatement(statement);
return boundStatement;
}
What is a Quorum?
quorum = RoundDown(sum_of_replication_factors / 2) + 1
sum_of_replication_factors = datacenter1_RF + datacenter2_RF + ... + datacenterN_RF
Using a replication factor of 3, a quorum is 2 nodes. The cluster can tolerate 1
replica down.
Using a replication factor of 6, a quorum is 4. The cluster can tolerate 2
replicas down.
In a two data center cluster where each data center has a replication factor of
3, a quorum is 4 nodes. The cluster can tolerate 2 replica nodes down.
In a five data center cluster where two data centers have a replication factor
of 3 and three data centers have a replication factor of 2, a quorum is 6 nodes.
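The formula in code - a tiny sketch (plain Java, no driver needed):

int[] dcReplicationFactors = { 3, 3 };           // e.g. two data centers, RF = 3 each
int sum = 0;
for (int rf : dcReplicationFactors) {
    sum += rf;                                   // sum_of_replication_factors
}
int quorum = (sum / 2) + 1;                      // integer division = RoundDown, so quorum = 4 here
System.out.println("QUORUM needs " + quorum + " replica acks");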
Write Consistency - Write ONE
Send requests to all replicas in the cluster applicable to the PK
Wait for ONE ack before returning to the client
[Diagram: the coordinator forwards the write to all replicas; the fastest replica acks quickly (5) while slower replicas answer later (10, 120) - the client gets its response after the first ack]
Write Consistency - Write QUORUM
Send requests to all replicas
Wait for QUORUM acks before returning to the client
[Diagram: the coordinator returns once a majority of the replicas have acked, without waiting for the slowest one]
Read Consistency - Read ONE
Read from one node among all replicas
Contact the fastest node (based on the stats the coordinator keeps)
[Diagram: the coordinator sends the read to the single replica it believes will answer fastest]
Read Consistency - Read QUORUM
Read the data from the fastest node
AND request digests from the other replicas to reach QUORUM
Return the most up-to-date data to the client
[Diagram: the coordinator gets the full data from one replica and digests from the others, compares them, and returns the newest value]
Consistency in Action (RF = 3)
Write ONE, Read ONE: write B, a read may still return A (stale read possible)
Write ONE, Read QUORUM: write B, a read may still return A
Write ONE, Read ALL: write B, read returns B
Write QUORUM, Read ONE: write B, a read may still return A
Write QUORUM, Read QUORUM: write B, read returns B
Consistency Level
ONE
Fast write, may not read latest
written value
QUORUM / LOCAL_QUORUM
Strict majority w.r.t. Replication Factor
Good balance
ALL
Not the best choice
Slow, no high availability
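A sketch of tuning the consistency level at runtime with the Java driver (the mailbox table from Part 6 is used purely for illustration; imports omitted as elsewhere):

PreparedStatement ps = session.prepare("SELECT * FROM mailbox WHERE login = ?");
ps.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);   // the "good balance" above
ResultSet rs = session.execute(ps.bind("jdoe"));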
Hinted Handoff
A write request is received by the coordinator
Coordinator finds a node is down/offline
It stores "Hints" on behalf of the offline node if the coordinator knows in advance that the node is down. Handoff is not taken if the node goes down at the time of the write.
Read repair / nodetool repair will fix inconsistencies.
[Diagram: the coordinator keeps hints for the offline replica while the writes to the live replicas succeed]
Hinted Handoff
When the offline node comes up, the coordinator forwards the stored hints
The node synchs up its state with the hints
The coordinator cannot perpetually hold the hints
[Diagram: the stored hints are replayed from the coordinator to the recovered node]
[Diagram: the replicas return different versions to the coordinator - "Anderson" with ts = 268, "Anderson" with ts = 521 and "Neo" with ts = 851; the cell with the highest timestamp ("Neo") wins and is returned]
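Every cell carries the write timestamp used for this conflict resolution, and CQL exposes it via writetime(). A sketch against a hypothetical users table (id int PRIMARY KEY, name text):

Row row = session.execute("SELECT name, writetime(name) FROM users WHERE id = 1").one();
System.out.println(row.getString("name")
    + " was written at " + row.getLong(1) + " (microseconds since epoch)");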
Part 5
Write and Read path
Storage
Writes are harder than reads to scale
Spinning disks aren't good with
random I/O
Goal: minimize random I/O
Log Structured Merge Tree (LSM)
Push changes from a memory-based
component to one or more disk
components
[Diagram: the write path - the coordinator appends every write to the commit log on disk (durable_writes = TRUE) and applies it to the table's MemTable in memory; MemTables are later flushed from memory to SSTables on disk (Commit log1..n, SSTable1, SSTable2)]
[Diagram: each table (Table1, Table2, Table3) has its own MemTable in memory and its own SSTables on disk, alongside the commit log segments]
These are multiple generations of SSTables which are compacted into one SSTable
select * from system.sstable_activity;
A table's data = its MemTable + all of its SSTables that have been flushed
Offheap memtables
Off heap buffers - moves the cell/column name and value to DirectBuffer objects. This has the lowest impact on reads - the values are still live Java buffers - but it only reduces the heap significantly when you are storing large strings or blobs
Off heap objects - moves the entire cell off heap, leaving only the NativeCell reference containing a pointer to the native (off-heap) data. This makes it effective for small values like ints or uuids as well, at the cost of having to copy it back on-heap temporarily when reading from it (likely to become the default in C* 3.0)
Writes are about 5% faster with offheap_objects enabled, primarily because Cassandra doesn't need to flush as frequently. Bigger SSTables mean less compaction is needed. Reads are more or less the same
Commit log
It is replayed in case a node went down and wants to come back up
This replay recreates the MemTables for that node
The commit log comprises pieces (files) whose size can be controlled (commitlog_segment_size_in_mb)
The total commit log space is a controlled parameter as well (commitlog_total_space_in_mb)
The commit log itself is also buffered in memory. It is then written to disk in two ways
Batch - all acks to requests wait until the commit log is flushed to disk
Periodic - the request is immediately acked but the commit log is flushed only after some time. If a node were to go down in this period then data can be lost if RF=1
Debate - the "periodic" setting gives better performance and we should use it. How to avoid the chances of data loss?
An SSTable's components can be
CompressionInfo - compression info metadata
Data - PK, data size, column idx, row level tombstone info, column count, column list in sorted order by name
Filter - Bloom filter
Index - index, also includes Bloom filter info, tombstones
Statistics - histograms for row size, gen numbers of the files from which this SST was compacted
Summary - index summary (that is loaded in memory for read optimizations)
TOC - list of files
Digest - text file with a digest
Format - internal C* format, e.g. "jb" is the C* 2.0 format, 2.1 may have "ka"
[Diagram: read path, step 1 - check the row cache before touching SSTable1/2/3]
Read path, step 2 - Bloom filter
One Bloom filter exists per SSTable and Memtable that the node is serving
A probabilistic filter that answers either "maybe the partition key exists" in its corresponding SSTable or a definite NO
If any of the Bloom filters return a possible "YES" then the partition key may exist in one or more of the SSTables
Proceed to look at the partition key cache for the SSTables for which there is a probability of finding the partition key, ignore the other SST key caches
Physically a single key cache is maintained ...
Partition key cache (partition key -> offset into the SSTable data file):
# Partition001 -> 0x0
# Partition002 -> 0x153
# Partition350 -> 0x5464321
The partition key cache stores the offset positions of the partition keys that have been read recently.
If the partition key is not found in the key cache then the read proceeds to check the "Key index sample" or "Partition summary" (in memory). It is a subset of the "Partition Index", which is the full index of the SSTable
Next, a bit more on the key index sample ...
Key index sample (sampled key -> offset):
# Partition001 -> 0x0
# Partition128 -> 0x4500
# Partition256 -> 0x851513
# Partition512 -> 0x5464321
Think of the offset as an absolute disk address that fseek in C can use
The ratio of the key index sample is 1 per 128 keys
[Diagram: the full read path - row cache, MemTable, Bloom filter, partition key cache, key index sample, partition index, and finally the SSTable on disk]
Row caching
Caches are also written to disk so that they come alive after a restart
Global settings are also possible in the cassandra.yaml file
Key caching
Eager retry
If a node is slow in responding to a request, the coordinator forwards it to another node holding a replica of the requested partition
Valid if RF > 1
C* 2.0+ feature
[Diagram: the client/driver reads partition 91; when Node 1 is slow, the read is retried on another replica (Node 4)]
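Eager retry (also called rapid read protection) is configured per table; a sketch using the hypothetical meter_readings table from earlier:

session.execute(
    "ALTER TABLE meterdata.meter_readings WITH speculative_retry = '99percentile'");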
Updates, tombstones and compaction - an example

Partition "Jdoe" initially holds Age(t1) = 33 and Name(t1) = John DOE; every cell carries a write timestamp.
An update (Age(t2) = 34) goes into a newer SSTable (SSTable2); the old cell in SSTable1 is never modified in place.
A later delete of Age writes a tombstone (Age(t3) = X, "RIP") into SSTable3.
On a read for "Jdoe", C* reconciles the cells found in SSTable1, SSTable2 and SSTable3 by timestamp - the tombstone at t3 is newer than Age(t2) = 34, so Age is gone.

Compaction
Compaction merges the SSTable generations into a new SSTable: for "Jdoe" only Name(t1) = John DOE survives, because the Age cell was deleted.
Compaction
Three strategies
SizeTiered (default)
Leveled (needs about 50% more I/O than size tiered but the number of SSTables visited for a read will be less)
DateTiered
Size - it is best to use this strategy when you have insert-heavy, read-light workloads
Level - it is best suited for column families with read-heavy workloads that have frequent updates to existing rows
Date - for time series data along with TTL
Leveled compaction
Leveled compaction creates SSTables of a fixed, relatively small size (5MB by default in Cassandra's implementation) that are grouped into levels.
Within each level, SSTables are guaranteed to be non-overlapping. Each level is ten times as large as the previous
Leveled compaction guarantees that 90% of all reads will be satisfied from a single SSTable (assuming nearly-uniform row size). The worst case is bounded by the total number of levels, e.g. 7 for 10TB of data
DateTiered compaction
This particularly applies to time series data where the data lives for a specific time
Use the DateTiered compaction strategy
C* looks at the min and max timestamp of the SSTable and finds out if anything in it is really live
If not, that SSTable is simply unlinked and compaction with another SSTable is avoided entirely
Set the TTL in the CF definition so that if bad code inserts without a TTL it does not cause unnecessary compaction
Lightweight transactions
Transactions are a bit different
"Compare and Set" model
Triggers
The actual trigger logic resides
outside in a Java POJO
Feature may still be experimental
Has write performance impact
Better to have application logic do
any pre-processing
CREATE TRIGGER myTrigger
ON myTable
USING 'Java class name';
Debate - Locking in C*
[Diagram: several application instances (Instance 1, 2, 3) coordinate through an external store (Memcached / Redis) that holds keys k=_0_0, k=_0_a, ..., k=_z_z with String values. Step 2: get an available ID that has flag=0 and starts with 0. The other instances work exactly like instance 1 - they acquire locks on other keys so that no two instances step on each other]
Part 6
Modeling
[Diagram: the RDBMS way starts from the structure, the model and the data (keys, storage, data-types); the C* way starts from the end user, the report and the business question, and works back to the model]
C* modelling is QDD (query driven)
Denormalization
Unlike RDBMS you cannot use any foreign
key relationships
FK and joins are not supported anyway!
Embed details, it will cause duplication but
that is alright
Helps in getting the complete data in a
single read
More efficient, less random I/O
May have to create more than one
denormalized table to serve different
queries
UDT is a great way (UDT a bit later...)
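A sketch of what this looks like in practice (the table names are hypothetical): the same reading is written to two query-specific tables so that each business question becomes a single-partition read.

session.execute(
    "CREATE TABLE IF NOT EXISTS meterdata.readings_by_meter (" +
    "meter_id text, reading_date timestamp, reading int, " +
    "PRIMARY KEY (meter_id, reading_date))");
session.execute(
    "CREATE TABLE IF NOT EXISTS meterdata.readings_by_day (" +
    "reading_date timestamp, meter_id text, reading int, " +
    "PRIMARY KEY (reading_date, meter_id))");
// The application writes every reading to both tables - duplication is the price of fast reads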
Datatypes
Timestamp, timeuuid
Counter
User defined types (UDT)
Static modifier - the value is stored once per partition and shared by its rows; saves storage space BUT a very specific use case
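Two quick sketches (the names are assumed, not from the deck): a user defined type, and a counter column that can only be incremented or decremented:

session.execute(
    "CREATE TYPE IF NOT EXISTS address_book.fullname (firstname text, lastname text)");
session.execute(
    "CREATE TABLE IF NOT EXISTS address_book.page_views (page text PRIMARY KEY, views counter)");
session.execute(
    "UPDATE address_book.page_views SET views = views + 1 WHERE page = '/home'");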
Time series: the partition Meter_id_001 stores one clustering column per reading time T = (t1, t2, t3, ..., tn) - 26-Jan-2014, 27-Jan-2014, 28-Jan-2014, ..., 15-Dec-2014 - with a reading value (147, 159, 165, ..., 183, 58, 86, 91, ..., 302) stored against each
Modeling notation
Chen's notation for conceptual modeling
helps in design
[Residue of a worked modeling example: several denormalized query tables, each keyed by (..., vehicle_id, lot_id)]
YES, there will be more writes and more data BUT that is an acceptable
fact in big data modeling that looks to optimize the query and user
experience more than data normalization
// a collection map
INSERT INTO address_book.users (id, name) VALUES (1, {firstname: 'Dennis', lastname: 'Ritchie'});

UPDATE address_book.users SET addresses = addresses + {'home': {street: '9779 Forest Lane', city: 'Dallas', zip_code: 75015, phones: {'001 972 555 6666'}}} WHERE id = 1;

SELECT name.firstname FROM users WHERE id = 1;

CREATE INDEX ON address_book.users (name);

SELECT id FROM address_book.users WHERE name = {firstname: 'Dennis', lastname: 'Ritchie'};
Queries

CREATE TABLE mailbox (
    login text,
    message_id timeuuid,
    interlocutor text,
    message text,
    PRIMARY KEY (login, message_id)
);

Get message by user and message_id (date)
(for a timeuuid column an exact lookup uses the actual timeuuid value; date boundaries use minTimeuuid/maxTimeuuid as in the interval query below)
SELECT * FROM mailbox
WHERE login = 'jdoe'
  AND message_id = '2014-07-12 16:00:00';

Get messages by user and date interval
SELECT * FROM mailbox
WHERE login = 'jdoe'
  AND message_id <= maxTimeuuid('2014-07-12 16:00:00')
  AND message_id >= minTimeuuid('2014-01-12 16:00:00');
"Allow filtering" can override the default behaviour of cross partition access but it is
not a good practice at all (incorrect data model)
select .. from meter_reading where record_date > '2014-12-15'
(assuming there is a secondary index on record_date)
Only exact match (=) predicate on #partition, range queries (<,<=,<=,>) not
allowed (means there are no multi row updates)
In case of a compound partition key IN is allowed in the last column of the compound
key (if partition key has one only column then it works on that column)
On clustering columns, exact match and range query predicates (<,<=,<=,>, IN) are
allowed
Order of the filters must match the order of primary key definition otherwise create
secondary index (anti-pattern)
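Two sketches against the mailbox table above (PRIMARY KEY (login, message_id)) showing what these rules allow:

// Allowed: exact match on the partition key plus a range on the clustering column
session.execute("SELECT * FROM mailbox WHERE login = 'jdoe' " +
    "AND message_id >= minTimeuuid('2014-01-12 16:00:00')");

// Not allowed: a range predicate on the partition key itself
// session.execute("SELECT * FROM mailbox WHERE login > 'a'");   // rejected by C*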
Order by restrictions
If the primary key is (industry_id,
exchange_id, stock_symbol) then the
following sort orders are valid
order by exchange_id desc, stock_symbol desc
order by exchange_id asc, stock_symbol asc
Secondary Index
Consider the earlier example of the table all_events, what will happen if we
try to get the records based on minute?
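Without an index on created_min such a query is rejected. A sketch of adding a secondary index (the column type is an assumption based on the insert shown in Part 4), keeping the anti-pattern warning above in mind - the query becomes a scatter-gather across nodes:

session.execute("CREATE INDEX ON all_events (created_min)");
ResultSet byMinute = session.execute("SELECT * FROM all_events WHERE created_min = 30");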
CQL Limits
Limits (as per Datastax documentation)
Clustering column value, length of: 65535
Collection item, value of: 2GB (Cassandra 2.1 v3
protocol), 64K (Cassandra 2.0.x and earlier)
Collection item, number of: 2B (Cassandra 2.1 v3
protocol), 64K (Cassandra 2.0.x and earlier)
Columns in a partition: 2B
Fields in a tuple: 32768, but try not to have 1000's of fields
Key length: 65535
Query parameters in a query: 65535
Single column, value of: 2GB, xMB are
recommended
Statements in a batch: 65535
Tip: batches work well ONLY if all records have the same partition key
In this case all records go to the same node.
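A sketch of such a batch with the Java driver, reusing the allEventCql insert from Part 4; the partition key is assumed to be (event_type, date) and the bound value types are assumptions about the all_events schema:

BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
PreparedStatement ps = session.prepare(allEventCql);
batch.add(ps.bind("pressure", "2014-12-15", 10, 30, 15, 1, "payload-1"));
batch.add(ps.bind("pressure", "2014-12-15", 10, 30, 15, 2, "payload-2"));
session.execute(batch);   // all rows share the partition key, so one node takes the whole batch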
Part 7
Applications with Cassandra
http://www.datastax.com/documentation/developer/java-driver/2.1/javadriver/fourSimpleRules.html
C* client sample
http://www.datastax.com/drivers/java/2.1/index.html
Part 8
Administration
Monitoring - Opscenter
Workload modeling
Workload characterization
Performance characteristics of the
cluster
Latency analysis of the cluster
Performance of the OS
Disk utilization
Read / Write operations per second
OS Memory utilization
Heap utilization
Dropped MUTATION messages: this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti-Entropy Repair (perhaps because of load C* is defending itself by dropping messages).
Node down
Node tool
Very important tool to manage C* in
production for day to day admin
Nodetool has over 40 commands
Can be found in the
$CASSANDRA_HOME/bin folder
Try a few commands
./nodetool -h 10.21.24.11 -p 7199 status
./nodetool -h 10.21.24.11 -p 7199 info
Node repair
How to compare two large data sets?
Assume two arrays
Each containing 1 million numbers
Further assume the order of storage is fixed
How to compare the two arrays?
Potential solution
Loop 1 and check the other?
How long will it take to loop 1 million?
What happens if the data gets into Billions?
Lots of inefficiencies!
Merkle tree is a tree in which every non-leaf node is labelled with the
hash of the labels of its children nodes. Hash trees are useful because
they allow efficient and secure verification of the contents of large
data structures
Currently the main use of hash trees is to make sure that data blocks
received from other peers in a peer-to-peer network are received
undamaged and unaltered, and even to check that the other peers do
not lie and send fake blocks
The repair coordinator node compares the Merkle trees and finds
all the sub token ranges that differ between the replicas and
repairs data in those ranges
Schema disagreements
Perform schema changes one at a time, at a steady pace,
and from the same node
Do not make multiple schema changes at the same time
If NTP is not in place it is possible that the schemas may not
be in synch (usual problem in most cases)
Check if the schema is in agreement
http://www.datastax.com/documentation/cassandra/2.0/cas
sandra/dml/dml_handle_schema_disagree_t.html
./nodetool describecluster
Nodetool cfhistograms
Will tell how many SSTables were looked at to
satisfy a read
With level compaction should never go > 3
With size tiered compaction should never go > 12
If the above are not in place then compaction is falling
behind
check with ./nodetool compactionstats
it should say "pending tasks: 0"
Slow queries
Use the DataStax Enterprise Performance Service
to automatically capture long-running queries
(based on response time thresholds you specify)
and then query the performance table that holds
those cql statements
cqlsh:dse_perf> select * from node_slow_log;
Concurrency
concurrent_reads
concurrent_writes
Application
Consistency levels affect read/write
JNA
Java Native Access allows for more efficient communication between the JVM and the OS
Ensure JNA is enabled and if not do
the following
sudo apt-get install libjna-java
About me
That's it!
www.linkedin.com/in/nirmallya
nirmallya.mukherjee@gmail.com