Durability with sstables, memtables, commit logs and hinted handoff

Certify and Increase Opportunity.
Be
Govt. Certified Apache Cassandra Professional

Durability with sstables, memtables, commit logs and hinted handoff

Durability

Durability is the property that writes, once completed, will survive permanently, even if the server is killed or crashes or loses power. This requires calling fsync to tell the OS to flush its write-behind cache to disk.

The naive way to provide durability is to fsync your data files with each write, but this is prohibitively slow in practice because the disk needs to do random seeks to write the data to the write location on the physical platters. (Remember that each seek costs 5-10ms on rotational media.)

Instead, like other modern systems, Cassandra provides durability by appending writes to a commitlog first. This means that only the commitlog needs to be fsync’d, which, if the commitlog is on its own volume, obviates the need for seeking since the commitlog is append-only. Implementation details are in ArchitectureCommitLog.

Cassandra’s default configuration sets the commitlog_sync mode to periodic, causing the commitlog to be synced every commitlog_sync_period_in_ms milliseconds, so you can potentially lose up to that much data if all replicas crash within that window of time. This default behavior is decently performant even when the commitlog shares a disk with data directories. You can also select batch mode, where Cassandra will guarantee that it syncs before acknowledging writes. To avoid syncing after every write, Cassandra groups the mutations into batches and syncs every commitlog_batch_window_in_ms. When using this mode, we strongly recommend putting your commitlog on a separate, dedicated device.

Users familiar with PostgreSQL may note that commitlog_sync_period_in_ms and commitlog_batch_window_in_ms correspond to the PostgreSQL settings of wal_writer_delay and commit_delay, respectively.

Write Phase

Each Cassandra node that receives data (through insert, update or delete columns) first writes it to its local commitlog sequentially
- commitlog acts as a crash recovery log for data
- Write operation will never consider successful at least until the changed data is appended to commitlog
- The sequential write is fast since there is no disk seek time
Cassandra syncs the write-behind cache data to file in two modes: periodic and batch
- periodic – sync the write-behind cache for commitlog to disk every commitlog_sync_period_in_ms (Default: 10000ms)
- batch – In batch mode, Cassandra won’t ack writes until the commit log has been fsynced to disk. It will wait up to CommitLogSyncBatchWindowInMS milliseconds for other writes, before performing the sync.
- Put commitlog in a separate drive to reduce I/O contention with SSTable reads/writes
- Not easy to achieve in today’s cloud computing offerings
- Data will not be lost once commitlog is flushed out to file
- Cassandra replay commitLog log after Cassandra restart to recover potential data lost within 1 second before the crash
After writing to commitlog, Cassandra writes the data to a in-memory structure called Memtable
- Memtable is an in-memory cache with content stored as key/column
- Memtable data are sorted by key
- Each ColumnFamily has a separate Memtable and retrieve column data from the key
Flushing: While flushing, the sorted data is written out sequentially to disk as SSTables (Sorted String Table)
- Memtables are flushed to disk when it runs out of space or, when the number of keys exceed certain limit (128 is default) or, when it reaches the time duration (client provided – no cluster clock)
- Flushed SSTable files are immutable and no changes cane be done
- Numerous SSTables will be created on disk for a column family
- Future changes to the data of same key after flushing will be written to a different SSTables. This makes the writes faster.
- Row read therefore requires reading all existing SSTables for a Column Family to locate the latest value
- SStables will be merged once it reaches some threshold to reduce read overhead
Each SSTable composes of 2 files – Index file and Data file
- Index file contains – Bloom filter and Key-Offset pairs
  - Bloom Filter: A Bloom filter, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don’t exist almost free
  - (Key, offset) pairs (points into data file)
- Data file contains the actual column data

Unlike relational database, data changes does not write back to the original data files. It does not involve data read or random disk access. Both commitlog and SSTable are flushed to disk as a new file with sequential write. Hence Cassandra writes data very fast.

Cassandra Compaction – This process merges SSTables and limits the number of SSTables to be read for a row read request. This process merges changes for the same key on different SSTables and combines columns. It also reclaims space for delete row requests (Discard tombstones) and creates a new index for the merged SSTable file.

Minor Compaction takes place when:

- Triggered when at least N SSTables (default to 4) have been flushed to disk
- Four similar-sized SSTables are merged into a single one
- Perform regularly and automatically for each column family
  - Start compaction somewhere beteen min and max threshold on how many SSTables files accumulated
    - min_compaction_threshold (default is 4)
    - max_compaction_threshold (default is 32)
- Obsolete SSTables are deleted asynchronously when the JVM performs a GC

Major Compaction takes place when:

- Done manually through nodetool compact for each keyspace
- Merge all sstables in a given ColumnFamily

Read Phase

To process a key/column read request, Cassandra checks if the in-memory memtable cache still contain the data
- memtable is an in memory read/write cache for each Column Family
If not found, Cassandra will read all the SSTables for that Column Family
For read optimization,
- Cassandra use bloom Filter for each SSTable to determine whether this SSTable contains the key
- Cassandra use index in SSTable to locate the data fast
- Cassandra compaction merges SSTables when the number of SSTables reaches certain threshold. This restricts the total number of SSTable for each Column Famoly
- Cassandra read is slower than write but yet still very fast
Cassandra depends on OS to cache SSTable files
- Do not configure Cassandra to use up most physical memory
- Some deployment configure Cassandra to use 50% of the physical memory so the rest can be used for file cache
- However, memory configuration is sensitive to data access pattern and volume

Cassandra Read Repair

When a query is made against a given key,

Cassandra performs a Read Repair
Read Repair perform a digest query on all replicas for that key
- A digest query ask a replica to return a hash digest value and the timestamp for the key’s data
- Digest query verify whether replica possess the same data without sending the data over the network
Cassandra push the most recent data to any out-of-date replicas to make the queried data consistence again
- Next query will therefore return the a consistent data

How Cassandra delete data (Cassandra Tombstone)

Since SSTables are not mutable, Cassandra cannot simply remove a row from the SSTables. Also, if a replica is down when we delete a row, we need a mechanism to identify its local copy is out of date when it is back online

When a user request a row to be deleted, Cassandra updates the column value with a special value called Tombstone. A read query will consider a tombstoned column as deleted. Cassandra will keep the column for at least gc_grace_seconds (Default: 10 days). After that, it is ready for removal after a compaction. If a replica is down longer than 10 days, we need to do a manual data sync before brining it back online.

Cassandra Consistency

Data is replicated across a cluster for data availability. Nevertheless, for a very short interval, some replicas may have the latest data while other replica are still saving/synchronizing the new value. The consistency gap widen when a dead node is bring back online and data is not in sync. Cassandra provides choices of requesting different level of consistency on each read or write operation at the cost of latency.

Low consistency request may return older copy of data (or stale data) but promise a faster response. Higher consistency request will reduce the chance of stale data but will have longer latency and more vulnerable to replica outage. The consistency setting allows developers to decide how many replicas in the cluster must acknowledge a write operation or respond to a read operation before considering the operation successful. Cassandra also runs a read repair for the queried key to bring the key/column back to consistency. For a low consistency level, read repair will run in background. Otherwise, it is done before returning the data.

Cassandra therefore provides eventual consistency rather than the much stricter consistency in the relational DB

Consistency For Cassandra Read Operations

With a replication factor of 3, data will be stored in 3 replica. Setting the consistency level to ONE, Cassandra will return the result to the requester after receiving response from 1 replica. On the contrary, QUORUM setting requires receiving a majority of replicas (2 replicas) to respond before returning the result to a client. Cassandra uses the timestamp in the column to determine which copy is the latest and only return the latest one to client. Hence, ONE have the fastest response time but with the possibility that the data is not the most updated one.

ONE: Returns the response from the first replica that respond. It triggers a read repair on the background to make sure all replicas has the latest data for that key
QUORUM: Returns the record with the most recent timestamp once a quorum (51%+) of replicas has responded. It also triggers a read repair
LOCAL_QUORUM: Returns the record with the most recent timestamp once a quorum of replicas in the datacenter local to the coordinator has responded
EACH_QUORUM: Returns the record with the most recent timestamp once a quorum of replicas in each datacenter has responded
ALL: Queries all replicas and returns the value with the most recent timestamp. If one node does not response, the whole operation is blocked/failed

LOCAL_QUORUM and EACH_QUORUM requires rack-aware replica placement strategies such as NetworkTopologyStrategy

ALL will provide highest consistency but usually not practical since it is vulnerably to a single node outage

Consistency For Cassandra Write operations

ANY: Guarantee the write operation is successful on at least one node (counting those with Hinted Handoff)
ONE: Guarantee the write operation is successful on at least one node including its commit log and memtable
QUORUM: Data has been written to a quorum of nodes
LOCAL_QUORUM: A QUORUM of nodes in the same data center local to the coordinator has written the data successfully
EACH_QUORUM: A QUORUM of nodes in each data center has written the data
ALL: Every replica node must successfully write the data. If one replica does not response, the write operation is blocked/failed

HintedHandoff

Cassandra Hinted hand off is an optimization technique for data write on replicas

When a write is made and a replica node for the key is down

Cassandra write a hint to a live replica node
That replica node will remind the downed node of changes once it is back on line
- HintedHandoff reduce write latency when a replica is temporarily down
- HintedHandoff provides high write availability at the cost of consistency
- A hinted write does NOT count towards ConsistencyLevel requirements for ONE, QUORUM, or ALL
If no replica nodes are alive for this key and ConsistencyLevel.ANY was specified, the coordinating node will write the hint locally

MemTable

Cassandra writes are first written to the CommitLog, and then to a per-ColumnFamily structure called a Memtable. When a Memtable is full, it is written to disk as an SSTable.

A Memtable is basically a write-back cache of data rows that can be looked up by key — that is, unlike a write-through cache, writes are batched up in the Memtable until it is full, when it is flushed.

Flushing

The process of turning a Memtable into a SSTable is called flushing. You can manually trigger flush via jmx (e.g. with bin/nodetool), which you may want to do before restarting nodes since it will reduce CommitLog replay time. Memtables are sorted by key and then written out sequentially. Thus, writes are extremely fast, costing only a commitlog append and an amortized sequential write for the flush!

Once flushed, SSTable files are immutable; no further writes may be done. So, on the read path, the server must (potentially, although it uses tricks like bloom filters to avoid doing so unnecessarily) combine row fragments from all the SSTables on disk, as well as any unflushed Memtables, to produce the requested data.

Compaction

To bound the number of SSTable files that must be consulted on reads, and to reclaim space taken by unused data, Cassandra performs compactions: merging multiple old SSTable files into a single new one. Compactions are triggered when at least N SStables have been flushed to disk, where N is tunable and defaults to 4. Four similar-sized SSTables are merged into a single one. They start out being the same size as your memtable flush size, and then form a hierarchy with each one doubling in size. So you’ll have up to N of the same size as your memtable, then up to N double that size, then up to N double that size, etc.

“Minor” only compactions merge sstables of similar size; “major” compactions merge all sstables in a given ColumnFamily. Prior to Cassandra 0.6.6/0.7.0, only major compactions can clean out obsolete tombstones.

Since the input SSTables are all sorted by key, merging can be done efficiently, still requiring no random i/o. Once compaction is finished, the old SSTable files may be deleted: note that in the worst case (a workload consisting of no overwrites or deletes) this will temporarily require 2x your existing on-disk space used. In today’s world of multi-TB disks this is usually not a problem but it is good to keep in mind when you are setting alert thresholds.

SSTables that are obsoleted by a compaction are deleted asynchronously when the JVM performs a GC. If a GC does not occur before Cassandra is shut down, Cassandra will remove them when it restarts. You can force a GC from jconsole if necessary, but Cassandra will force one itself if it detects that it is low on space. A compaction marker is also added to obsolete sstables so they can be deleted on startup if the server does not perform a GC before being restarted.

ColumnFamilyStoreMBean exposes sstable space used as getLiveDiskSpaceUsed (only includes size of non-obsolete files) and getTotalDiskSpaceUsed (includes everything).

assandra write operation life cycle is divided in these steps

commitlog write
memtable write
sstable write

Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. A write is said to successful once it is written to the commit log and memory, so there is very minimal disk I/O at the time of write. When ever the memtable runs out of space, i.e when the number of keys exceed certain limit (128 is default) or when it reaches the time duration (cluster clock), it is being stored into sstable, immutable space (This mechanism is called Flushing). Once writes are done on SSTable, then you can see the corresponding datas in the data folder, in your case its S:\Apache Cassandra\apache-cassandra-1.2.3\storage\data. Each SSTable composes of mainly 2 files – Index file and Data file

Index file contains – Bloom filter and Key-Offset pairs
- Bloom Filter: A Bloom filter, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don’t exist almost free
- (Key, offset) pairs (points into data file)
Data file contains the actual column data

And regarding commitlog files, these are encrypted files maintained intrinsically by Cassandra, for which you are not able to see anything properly.

Hinted Handoff

If a write is made and a replica node for the key is down (and hinted_handoff_enabled == true), Cassandra will write a hint to:

versions prior to 1.0: a live replica node

version 1.0: the coordinator node

indicating that the write needs to be replayed to the unavailable node. If ConsistencyLevel.ANY was specified, the hint will count towards consistency if no replica nodes are alive for this key. Cassandra uses hinted handoff as a way to (1) reduce the time required for a temporarily failed node to become consistent again with live ones and (2) provide extreme write availability when consistency is not required.

A hinted write does NOT count towards ConsistencyLevel requirements of ONE, QUORUM, or ALL. Take the simple example of a cluster of two nodes, A and B, and a replication factor of 1 (each key is stored on one node). Suppose node A is down while we write key K to it with ConsistencyLevel.ONE. Then we must fail the write: recall from the API page that “if W + R > ReplicationFactor, where W is the number of nodes to block for on write, and R the number to block for on reads, you will have strongly consistent behavior; that is, readers will always see the most recent write.”

Thus if we write a hint to B and call the write good because it is written “somewhere,” there is no way to read the data at any ConsistencyLevel until A comes back up and B forwards the data to him. Historically, only the lowest ConsistencyLevel of ZERO would accept writes in this situation; for 0.6, we added ConsistencyLevel.ANY, meaning, “wait for a write to succeed anywhere, even a hinted write that isn’t immediately readable.”