Site icon Tutorial

Durability with sstables, memtables, commit logs and hinted handoff

Certify and Increase Opportunity.
Be
Govt. Certified Apache Cassandra Professional

Durability with sstables, memtables, commit logs and hinted handoff

Durability

Durability is the property that writes, once completed, will survive permanently, even if the server is killed or crashes or loses power. This requires calling fsync to tell the OS to flush its write-behind cache to disk.

The naive way to provide durability is to fsync your data files with each write, but this is prohibitively slow in practice because the disk needs to do random seeks to write the data to the write location on the physical platters. (Remember that each seek costs 5-10ms on rotational media.)

Instead, like other modern systems, Cassandra provides durability by appending writes to a commitlog first. This means that only the commitlog needs to be fsync’d, which, if the commitlog is on its own volume, obviates the need for seeking since the commitlog is append-only. Implementation details are in ArchitectureCommitLog.

Cassandra’s default configuration sets the commitlog_sync mode to periodic, causing the commitlog to be synced every commitlog_sync_period_in_ms milliseconds, so you can potentially lose up to that much data if all replicas crash within that window of time. This default behavior is decently performant even when the commitlog shares a disk with data directories. You can also select batch mode, where Cassandra will guarantee that it syncs before acknowledging writes. To avoid syncing after every write, Cassandra groups the mutations into batches and syncs every commitlog_batch_window_in_ms. When using this mode, we strongly recommend putting your commitlog on a separate, dedicated device.

Users familiar with PostgreSQL may note that commitlog_sync_period_in_ms and commitlog_batch_window_in_ms correspond to the PostgreSQL settings of wal_writer_delay and commit_delay, respectively.

Write Phase


Read Phase

Cassandra Read Repair

When a query is made against a given key,

How Cassandra delete data (Cassandra Tombstone)

Since SSTables are not mutable, Cassandra cannot simply remove a row from the SSTables. Also, if a replica is down when we delete a row, we need a mechanism to identify its local copy is out of date when it is back online

When a user request a row to be deleted, Cassandra updates the column value with a special value called Tombstone. A read query will consider a tombstoned column as deleted. Cassandra will keep the column for at least gc_grace_seconds (Default: 10 days). After that, it is ready for removal after a compaction. If a replica is down longer than 10 days, we need to do a manual data sync before brining it back online.

Cassandra Consistency

Data is replicated across a cluster for data availability. Nevertheless, for a very short interval, some replicas may have the latest data while other replica are still saving/synchronizing the new value. The consistency gap widen when a dead node is bring back online and data is not in sync. Cassandra provides choices of requesting different level of consistency on each read or write operation at the cost of latency.

Low consistency request may return older copy of data (or stale data) but promise a faster response. Higher consistency request will reduce the chance of stale data but will have longer latency and more vulnerable to replica outage. The consistency setting allows developers to decide how many replicas in the cluster must acknowledge a write operation or respond to a read operation before considering the operation successful. Cassandra also runs a read repair for the queried key to bring the key/column back to consistency. For a low consistency level, read repair will run in background. Otherwise, it is done before returning the data.

Cassandra therefore provides eventual consistency rather than the much stricter consistency in the relational DB

Consistency For Cassandra Read Operations

With a replication factor of 3, data will be stored in 3 replica. Setting the consistency level to ONE, Cassandra will return the result to the requester after receiving response from 1 replica. On the contrary, QUORUM setting requires receiving a majority of replicas (2 replicas) to respond before returning the result to a client. Cassandra uses the timestamp in the column to determine which copy is the latest and only return the latest one to client. Hence, ONE have the fastest response time but with the possibility that the data is not the most updated one.

 LOCAL_QUORUM and EACH_QUORUM requires rack-aware replica placement strategies such as NetworkTopologyStrategy

ALL will provide highest consistency but usually not practical since it is vulnerably to a single node outage

Consistency For Cassandra Write operations

HintedHandoff

Cassandra Hinted hand off is an optimization technique for data write on replicas

When a write is made and a replica node for the key is down

MemTable

Cassandra writes are first written to the CommitLog, and then to a per-ColumnFamily structure called a Memtable. When a Memtable is full, it is written to disk as an SSTable.

A Memtable is basically a write-back cache of data rows that can be looked up by key — that is, unlike a write-through cache, writes are batched up in the Memtable until it is full, when it is flushed.

Flushing

The process of turning a Memtable into a SSTable is called flushing. You can manually trigger flush via jmx (e.g. with bin/nodetool), which you may want to do before restarting nodes since it will reduce CommitLog replay time. Memtables are sorted by key and then written out sequentially. Thus, writes are extremely fast, costing only a commitlog append and an amortized sequential write for the flush!

Once flushed, SSTable files are immutable; no further writes may be done. So, on the read path, the server must (potentially, although it uses tricks like bloom filters to avoid doing so unnecessarily) combine row fragments from all the SSTables on disk, as well as any unflushed Memtables, to produce the requested data.

Compaction

To bound the number of SSTable files that must be consulted on reads, and to reclaim space taken by unused data, Cassandra performs compactions: merging multiple old SSTable files into a single new one. Compactions are triggered when at least N SStables have been flushed to disk, where N is tunable and defaults to 4. Four similar-sized SSTables are merged into a single one. They start out being the same size as your memtable flush size, and then form a hierarchy with each one doubling in size. So you’ll have up to N of the same size as your memtable, then up to N double that size, then up to N double that size, etc.

“Minor” only compactions merge sstables of similar size; “major” compactions merge all sstables in a given ColumnFamily. Prior to Cassandra 0.6.6/0.7.0, only major compactions can clean out obsolete tombstones.

Since the input SSTables are all sorted by key, merging can be done efficiently, still requiring no random i/o. Once compaction is finished, the old SSTable files may be deleted: note that in the worst case (a workload consisting of no overwrites or deletes) this will temporarily require 2x your existing on-disk space used. In today’s world of multi-TB disks this is usually not a problem but it is good to keep in mind when you are setting alert thresholds.

SSTables that are obsoleted by a compaction are deleted asynchronously when the JVM performs a GC. If a GC does not occur before Cassandra is shut down, Cassandra will remove them when it restarts. You can force a GC from jconsole if necessary, but Cassandra will force one itself if it detects that it is low on space. A compaction marker is also added to obsolete sstables so they can be deleted on startup if the server does not perform a GC before being restarted.

ColumnFamilyStoreMBean exposes sstable space used as getLiveDiskSpaceUsed (only includes size of non-obsolete files) and getTotalDiskSpaceUsed (includes everything).

assandra write operation life cycle is divided in these steps

Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. A write is said to successful once it is written to the commit log and memory, so there is very minimal disk I/O at the time of write. When ever the memtable runs out of space, i.e when the number of keys exceed certain limit (128 is default) or when it reaches the time duration (cluster clock), it is being stored into sstable, immutable space (This mechanism is called Flushing). Once writes are done on SSTable, then you can see the corresponding datas in the data folder, in your case its S:\Apache Cassandra\apache-cassandra-1.2.3\storage\data. Each SSTable composes of mainly 2 files – Index file and Data file

And regarding commitlog files, these are encrypted files maintained intrinsically by Cassandra, for which you are not able to see anything properly.

Hinted Handoff

If a write is made and a replica node for the key is down (and hinted_handoff_enabled == true), Cassandra will write a hint to:

versions prior to 1.0: a live replica node

version 1.0: the coordinator node

indicating that the write needs to be replayed to the unavailable node. If ConsistencyLevel.ANY was specified, the hint will count towards consistency if no replica nodes are alive for this key. Cassandra uses hinted handoff as a way to (1) reduce the time required for a temporarily failed node to become consistent again with live ones and (2) provide extreme write availability when consistency is not required.

A hinted write does NOT count towards ConsistencyLevel requirements of ONE, QUORUM, or ALL. Take the simple example of a cluster of two nodes, A and B, and a replication factor of 1 (each key is stored on one node). Suppose node A is down while we write key K to it with ConsistencyLevel.ONE. Then we must fail the write: recall from the API page that “if W + R > ReplicationFactor, where W is the number of nodes to block for on write, and R the number to block for on reads, you will have strongly consistent behavior; that is, readers will always see the most recent write.”

Thus if we write a hint to B and call the write good because it is written “somewhere,” there is no way to read the data at any ConsistencyLevel until A comes back up and B forwards the data to him. Historically, only the lowest ConsistencyLevel of ZERO would accept writes in this situation; for 0.6, we added ConsistencyLevel.ANY, meaning, “wait for a write to succeed anywhere, even a hinted write that isn’t immediately readable.”

The article is based on Cassandra 1.1
Exit mobile version