Replica and their placement
Replication is the process of storing copies of data on multiple nodes to ensure reliability and fault tolerance.
Cassandra stores copies, called replicas, of each row based on the row key. You set the number of replicas when you create a keyspace using the replica placement strategy. In addition to setting the number of replicas, this strategy sets the distribution of the replicas across the nodes in the cluster depending on the cluster’s topology.
The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes afterwards. When replication factor exceeds the number of nodes, writes are rejected, but reads are served as long as the desired consistency level can be met.
Choosing the right replication strategy is important because the strategy determines which nodes are responsible for which key ranges. The implication is that you’re also determining which nodes should receive which write operations, which can have a big impact on efficiency in different scenarios. If you set up your cluster such that all writes are going to two data centers—one in Australia and one in Reston, Virginia—you will see a matching performance degradation. The variety of pluggable strategies allows you greater flexibility, so that you can tune Cassandra according to your network topology and needs.
the names of the strategies are changing. The new names in 0.7 will be SimpleStrategy (formerly known as RackUnawareStrategy), OldNetworkTopologyStrategy (formerly known as RackAwareStrategy), and NetworkTopologyStrategy (formerly known as DatacenterShardStrategy).
Simple Strategy is the new name for Rack-Unaware Strategy.
The strategy used by default in the configuration file is org.apache.cassandra.locator.RackUnawareStrategy. This strategy only overrides the calculateNaturalEndpoints method from the abstract parent implementation. This strategy places replicas in a single data center, in a manner that is not aware of their placement on a data center rack. This means that the implementation is theoretically fast, but not if the next node that has the given key is in a different rack than others.
Old Network Topology Strategy
The second strategy for replica placement that Cassandra provides out of the box is org.apache.cassandra.locator.RackAwareStrategy, now called Old Network Topology Strategy. It’s mainly used to distribute data across different racks in the same data center. Like RackUnawareStrategy, this strategy only overrides the calculateNaturalEndpoints method from the abstract parent implementation. This class, as the original name suggests, is indeed aware of the placement in data center racks.
Network Topology Strategy
This strategy, included with the 0.7 release of Cassandra, allows you to specify more evenly than the RackAwareStrategy how replicas should be placed across data centers. To use it, you supply parameters in which you indicate the desired replication strategy for each data center.