Hadoop Data Backup

There is no classic backup and recovery functionality in Hadoop. There are several reasons for this:

  • HDFS uses block-level replication to protect data through redundancy.
  • HDFS scales out massively in size, and it is becoming more economical to back up to disk rather than tape.
  • The sheer size of “Big Data” doesn’t lend itself to being easily backed up.

Instead of backups, Hadoop relies on data replication. Internally, it creates multiple copies of each block of data (three copies by default). It also provides a tool called ‘distcp’, which lets you replicate data between clusters. This is what most Hadoop operators typically do for “backups”.
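
As a sketch, an inter-cluster copy with distcp might look like the following; the namenode hostnames and paths here are placeholders, not fixed names:

# nn1, nn2, and the paths below are example values
hadoop distcp hdfs://nn1:8020/data/sales hdfs://nn2:8020/backups/sales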

Some vendors, such as Cloudera, have incorporated the distcp tool into a ‘backup’ or ‘replication’ service for their distributions of Hadoop. The service operates against a specific directory in HDFS and replicates it to another cluster.

If you really want a traditional backup service for Hadoop, you can build one yourself. You need some mechanism for accessing the data (the NFS gateway, WebHDFS, etc.), and can then use tape libraries, VTLs, and the like to create backups.
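
As a rough sketch of the NFS route: once the HDFS NFS gateway is running, you can mount HDFS on a backup host and point conventional backup tooling at the mount. The gateway host and mount point below are placeholders:

# <nfs_gateway_host> and /hdfs_mount are example values
mount -t nfs -o vers=3,proto=tcp,nolock,sync <nfs_gateway_host>:/ /hdfs_mount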

Namenode Backup

If the namenode’s persistent metadata is lost or damaged, the entire filesystem is rendered unusable, so it is critical that backups are made of these files. You should keep multiple copies of different ages (one hour, one day, one week, and one month, say) to protect against corruption, either in the copies themselves or in the live files running on the namenode. To make a backup, use the dfsadmin command to download a copy of the namenode’s most recent fsimage:

% hdfs dfsadmin -fetchImage fsimage.backup
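
To keep copies of different ages, one illustrative approach is a small script run from cron that fetches a timestamped image and prunes old ones; the backup directory and retention period below are assumptions, not fixed requirements:

#!/bin/sh
# Illustrative fsimage rotation; /backups/namenode is an example location.
DEST=/backups/namenode/$(date +%Y%m%d-%H%M)
mkdir -p "$DEST"
hdfs dfsadmin -fetchImage "$DEST"
# Example retention: remove copies older than 30 days
find /backups/namenode -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +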

The distcp tool is ideal for making backups to other HDFS clusters (preferably running on a different version of the software, to guard against loss due to bugs in HDFS) or other Hadoop filesystems (such as S3) because it can copy files in parallel.
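
For instance, a copy to S3 via the s3a connector might look like this; the bucket name and paths are placeholders, and the connector must be configured with credentials separately:

# backup-bucket and the paths are example values
hadoop distcp hdfs://namenode:8020/data/sales s3a://backup-bucket/data/sales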

HDFS Snapshots

HDFS Snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or the entire file system. Some common use cases of snapshots are data backup, protection against user errors and disaster recovery.

The implementation of HDFS Snapshots is efficient:

  • Snapshot creation is instantaneous: the cost is O(1) excluding the inode lookup time.
  • Additional memory is used only when modifications are made relative to a snapshot: memory usage is O(M), where M is the number of modified files/directories.
  • Blocks in datanodes are not copied: the snapshot files record the block list and the file size. There is no data copying.
  • Snapshots do not adversely affect regular HDFS operations: modifications are recorded in reverse chronological order so that the current data can be accessed directly. The snapshot data is computed by subtracting the modifications from the current data.

Snapshots can be taken on any directory once the directory has been set as snapshottable. A snapshottable directory is able to accommodate 65,536 simultaneous snapshots. There is no limit on the number of snapshottable directories. Administrators may set any directory to be snapshottable. If there are snapshots in a snapshottable directory, the directory can be neither deleted nor renamed before all the snapshots are deleted. Nested snapshottable directories are currently not allowed. In other words, a directory cannot be set to snapshottable if one of its ancestors/descendants is a snapshottable directory.

Allow Snapshots – Allows snapshots of a directory to be created. If the operation completes successfully, the directory becomes snapshottable. Usage:

hdfs dfsadmin -allowSnapshot <path>

Disallow Snapshots – Disallows snapshots of a directory from being created. All snapshots of the directory must be deleted before snapshots can be disallowed. Usage:

hdfs dfsadmin -disallowSnapshot <path>

Create Snapshots – Creates a snapshot of a snapshottable directory. This operation requires owner privilege on the snapshottable directory. Usage:

hdfs dfs -createSnapshot <path> [<snapshotName>]

The snapshot name is an optional argument. When it is omitted, a default name is generated using a timestamp with the format “‘s’yyyyMMdd-HHmmss.SSS”, e.g. “s20130412-151029.033”.

Delete Snapshot – Deletes a snapshot from a snapshottable directory. This operation requires owner privilege on the snapshottable directory. Usage:

hdfs dfs -deleteSnapshot <path> <snapshotName>
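
Putting these commands together, a typical session might look like the following; the directory /user/data and the snapshot name before-etl are placeholders. Because a snapshot’s contents are exposed under the read-only .snapshot subdirectory, restoring a file is just a copy:

# /user/data and before-etl are example values
hdfs dfsadmin -allowSnapshot /user/data
hdfs dfs -createSnapshot /user/data before-etl
hdfs lsSnapshottableDir
# Restore an accidentally deleted file from the snapshot
hdfs dfs -cp /user/data/.snapshot/before-etl/part-00000 /user/data/
hdfs dfs -deleteSnapshot /user/data before-etl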

Dual Load

This is an option for backing up data between two Hadoop clusters: a primary and a secondary. With this approach you load the data into both clusters at the same time, ensuring that both contain the same data. It can be problematic if you run complex ETL on top, because even small issues with the source data can cause large differences in the processing results, making your secondary cluster useless. To avoid this, plan for dual data loading from the very beginning and use a distributed queue such as Kafka or RabbitMQ to guarantee data delivery to both of your clusters with no data loss. You might also consider using Flume to move data from the source systems into HDFS.
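
As a minimal sketch of the Flume route, a single agent can replicate every event into two channels, each drained by an HDFS sink pointing at a different cluster. All hosts, ports, and paths in this configuration are illustrative assumptions:

# All hosts and paths below are example values
agent.sources = src
agent.channels = ch1 ch2
agent.sinks = sink1 sink2

# Replicate each event from the source into both channels
agent.sources.src.type = netcat
agent.sources.src.bind = 0.0.0.0
agent.sources.src.port = 44444
agent.sources.src.selector.type = replicating
agent.sources.src.channels = ch1 ch2

agent.channels.ch1.type = memory
agent.channels.ch2.type = memory

# One HDFS sink per cluster
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.channel = ch1
agent.sinks.sink1.hdfs.path = hdfs://primary-nn:8020/ingest/events
agent.sinks.sink2.type = hdfs
agent.sinks.sink2.channel = ch2
agent.sinks.sink2.hdfs.path = hdfs://secondary-nn:8020/ingest/events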
