dfsadmin, fsck and balancer

dfsadmin

Runs an HDFS dfsadmin client. The hadoop dfsadmin command supports a number of HDFS administration operations. The bin/hadoop dfsadmin -help command lists all the commands currently supported. For example:

  • -report : reports basic statistics of HDFS. Some of this information is also available on the NameNode front page.
  • -safemode : though usually not required, an administrator can manually enter or leave safe mode.
  • -finalizeUpgrade : removes the backup of the cluster made during the last upgrade.
  • -refreshNodes : updates the set of hosts allowed to connect to the Namenode. Re-reads the config file to update the values defined by dfs.hosts and dfs.hosts.exclude and reads the entries (hostnames) in those files. Each entry not defined in dfs.hosts but present in dfs.hosts.exclude is decommissioned. Each entry defined in dfs.hosts and also in dfs.hosts.exclude is stopped from decommissioning if it has already been marked for decommission. Entries present in neither list are decommissioned.
  • -printTopology : prints the topology of the cluster: a tree of racks and the datanodes attached to those racks, as viewed by the NameNode.
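The read-only commands just listed can be tried safely against a running cluster (the output naturally depends on your cluster's state):

```shell
# List all dfsadmin commands supported by this Hadoop build
bin/hadoop dfsadmin -help

# Basic cluster statistics: capacity, DFS used, live/dead datanodes
bin/hadoop dfsadmin -report

# Check whether the NameNode is currently in safe mode
bin/hadoop dfsadmin -safemode get

# Print the rack / datanode tree as the NameNode sees it
bin/hadoop dfsadmin -printTopology
```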

In Hadoop 1,

Usage: hadoop dfsadmin [GENERIC_OPTIONS] [-report] [-safemode enter | leave | get | wait] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-setQuota <quota> <dirname>…<dirname>] [-clrQuota <dirname>…<dirname>] [-help [cmd]]

COMMAND_OPTION : Description

  -report : Reports basic filesystem information and statistics.
  -safemode enter | leave | get | wait : Safe mode maintenance command. Safe mode is a Namenode state in which it
    1. does not accept changes to the name space (read-only), and
    2. does not replicate or delete blocks.
    Safe mode is entered automatically at Namenode startup, and is left automatically when the configured minimum percentage of blocks satisfies the minimum replication condition. Safe mode can also be entered manually, but then it can only be turned off manually as well.
  -refreshNodes : Re-reads the hosts and exclude files to update the set of Datanodes that are allowed to connect to the Namenode and those that should be decommissioned or recommissioned.
  -finalizeUpgrade : Finalizes an upgrade of HDFS. Datanodes delete their previous-version working directories, followed by the Namenode doing the same. This completes the upgrade process.
  -upgradeProgress status | details | force : Requests the current distributed upgrade status (basic or detailed), or forces the upgrade to proceed.
  -metasave <filename> : Saves the Namenode's primary data structures to <filename> in the directory specified by the hadoop.log.dir property. <filename> will contain one line for each of the following:
    1. Datanodes heartbeating with the Namenode
    2. Blocks waiting to be replicated
    3. Blocks currently being replicated
    4. Blocks waiting to be deleted
  -setQuota <quota> <dirname>…<dirname> : Sets the quota <quota> for each directory <dirname>. The directory quota is a long integer that puts a hard limit on the number of names in the directory tree. Best effort for each directory, with faults reported if
    1. <quota> is not a positive integer, or
    2. the user is not an administrator, or
    3. the directory does not exist or is a file, or
    4. the directory would immediately exceed the new quota.
  -clrQuota <dirname>…<dirname> : Clears the quota for each directory <dirname>. Best effort for each directory, with a fault reported if
    1. the directory does not exist or is a file, or
    2. the user is not an administrator.
    It does not fault if the directory has no quota.
  -help [cmd] : Displays help for the given command, or for all commands if none is specified.
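As an illustration, a few of the Hadoop 1 operations above might be combined like this. The directory paths and quota value are made-up examples, not defaults:

```shell
# Limit each (hypothetical) user tree to at most 10000 names
hadoop dfsadmin -setQuota 10000 /user/alice /user/bob

# Remove those quotas again
hadoop dfsadmin -clrQuota /user/alice /user/bob

# Dump the Namenode's primary data structures to
# ${hadoop.log.dir}/meta.log for offline inspection
hadoop dfsadmin -metasave meta.log
```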

In Hadoop 2,

COMMAND_OPTION : Description

  -report [-live] [-dead] [-decommissioning] : Reports basic filesystem information and statistics. Optional flags may be used to filter the list of displayed DataNodes.
  -safemode enter | leave | get | wait : Safe mode maintenance command. Safe mode is a Namenode state in which it
    1. does not accept changes to the name space (read-only), and
    2. does not replicate or delete blocks.
    Safe mode is entered automatically at Namenode startup, and is left automatically when the configured minimum percentage of blocks satisfies the minimum replication condition. Safe mode can also be entered manually, but then it can only be turned off manually as well.
  -saveNamespace : Saves the current namespace into the storage directories and resets the edit log. Requires safe mode.
  -rollEdits : Rolls the edit log on the active NameNode.
  -restoreFailedStorage true | false | check : Turns on/off the automatic attempt to restore failed storage replicas. If a failed storage becomes available again, the system will attempt to restore edits and/or the fsimage during checkpoint. The 'check' option returns the current setting.
  -refreshNodes : Re-reads the hosts and exclude files to update the set of Datanodes that are allowed to connect to the Namenode and those that should be decommissioned or recommissioned.
  -setStoragePolicy <path> <policyName> : Sets a storage policy on a file or a directory.
  -getStoragePolicy <path> : Gets the storage policy of a file or a directory.
  -finalizeUpgrade : Finalizes an upgrade of HDFS. Datanodes delete their previous-version working directories, followed by the Namenode doing the same. This completes the upgrade process.
  -metasave <filename> : Saves the Namenode's primary data structures to <filename> in the directory specified by the hadoop.log.dir property. <filename> is overwritten if it exists, and will contain one line for each of the following:
    1. Datanodes heartbeating with the Namenode
    2. Blocks waiting to be replicated
    3. Blocks currently being replicated
    4. Blocks waiting to be deleted
  -refreshServiceAcl : Reloads the service-level authorization policy file.
  -refreshUserToGroupsMappings : Refreshes user-to-groups mappings.
  -refreshSuperUserGroupsConfiguration : Refreshes superuser proxy groups mappings.
  -refreshCallQueue : Reloads the call queue from config.
  -refresh <host:ipc_port> <key> [arg1..argn] : Triggers a runtime refresh of the resource specified by <key> on <host:ipc_port>. All further arguments are sent to the host.
  -reconfig <datanode |…> <host:ipc_port> <start|status> : Starts a reconfiguration or gets the status of an ongoing reconfiguration. The second parameter specifies the node type. Currently, only reloading a DataNode's configuration is supported.
  -printTopology : Prints a tree of the racks and their nodes as reported by the Namenode.
  -refreshNamenodes datanodehost:port : For the given datanode, reloads the configuration files, stops serving the removed block pools and starts serving new block pools.
  -deleteBlockPool datanode-host:port blockpoolId [force] : If force is passed, the block pool directory for the given block pool id on the given datanode is deleted along with its contents; otherwise the directory is deleted only if it is empty. The command fails if the datanode is still serving the block pool. Refer to refreshNamenodes to shut down a block pool service on a datanode.
  -setBalancerBandwidth <bandwidth in bytes per second> : Changes the network bandwidth used by each datanode during HDFS block balancing. <bandwidth> is the maximum number of bytes per second that will be used by each datanode. This value overrides the dfs.balance.bandwidthPerSec parameter. NOTE: The new value is not persistent on the DataNode.
  -allowSnapshot <snapshotDir> : Allows snapshots of a directory to be created. If the operation completes successfully, the directory becomes snapshottable.
  -disallowSnapshot <snapshotDir> : Disallows snapshots of a directory from being created. All snapshots of the directory must be deleted before disallowing snapshots.
  -fetchImage <local directory> : Downloads the most recent fsimage from the NameNode and saves it in the specified local directory.
  -shutdownDatanode <datanode_host:ipc_port> [upgrade] : Submits a shutdown request for the given datanode.
  -getDatanodeInfo <datanode_host:ipc_port> : Gets information about the given datanode.
  -triggerBlockReport [-incremental] <datanode_host:ipc_port> : Triggers a block report for the given datanode. If -incremental is specified, it will be an incremental block report; otherwise, it will be a full block report.
  -help [cmd] : Displays help for the given command, or for all commands if none is specified.
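A typical Hadoop 2 maintenance sequence, sketched under the assumption of an already-running cluster (the local backup directory is illustrative):

```shell
# Checkpoint the namespace by hand: enter safe mode, persist the
# current namespace, then leave safe mode again
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave

# Keep an off-node copy of the latest fsimage (directory is an example)
hdfs dfsadmin -fetchImage /tmp/fsimage-backup

# Cap balancer traffic at 10 MB/s per datanode (not persisted on restart)
hdfs dfsadmin -setBalancerBandwidth 10485760
```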

fsck

Runs the HDFS filesystem checking utility. HDFS supports the fsck command to check for various inconsistencies. It is designed to report problems with files, for example missing or under-replicated blocks. Unlike a traditional fsck utility for native file systems, this command does not correct the errors it detects; the NameNode automatically corrects most recoverable failures on its own. fsck is not a Hadoop shell command: it is run as 'bin/hadoop fsck', either on the whole file system or on a subset of files. By default, fsck does not operate on files still open for write by another client, but it provides an option to include them during reporting.

In Hadoop 1,

Usage: hadoop fsck [GENERIC_OPTIONS] <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]

COMMAND_OPTION : Description

  <path> : Start checking from this path.
  -move : Move corrupted files to /lost+found.
  -delete : Delete corrupted files.
  -openforwrite : Print out files opened for write.
  -files : Print out files being checked.
  -blocks : Print out the block report.
  -locations : Print out locations for every block.
  -racks : Print out network topology for data-node locations.
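For example, the options above can be combined; as the usage line shows, -blocks, -locations and -racks only take effect together with -files:

```shell
# Check the whole filesystem, listing each file with its blocks
# and the locations of the datanodes holding them
bin/hadoop fsck / -files -blocks -locations

# List files currently open for write under a hypothetical user tree
bin/hadoop fsck /user/alice -openforwrite
```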

In Hadoop 2,

COMMAND_OPTION : Description

  <path> : Start checking from this path.
  -delete : Delete corrupted files.
  -files : Print out files being checked.
  -files -blocks : Print out the block report.
  -files -blocks -locations : Print out locations for every block.
  -files -blocks -racks : Print out network topology for data-node locations.
  -includeSnapshots : Include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it.
  -list-corruptfileblocks : Print out a list of missing blocks and the files they belong to.
  -move : Move corrupted files to /lost+found.
  -openforwrite : Print out files opened for write.
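In Hadoop 2 the same checks run through the hdfs entry point; for instance (the /data path is an example):

```shell
# Quickly list missing blocks and the files they belong to
hdfs fsck / -list-corruptfileblocks

# Deep check of one subtree, following snapshots as well
hdfs fsck /data -includeSnapshots -files -blocks -racks
```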

balancer

Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process.

HDFS data might not always be placed uniformly across the DataNodes. One common reason is the addition of new DataNodes to an existing cluster. While placing new blocks (data for a file is stored as a series of blocks), the NameNode considers various parameters before choosing the DataNodes to receive these blocks. Some of the considerations are:

  • Policy to keep one of the replicas of a block on the same node as the node that is writing the block.
  • Need to spread different replicas of a block across the racks so that the cluster can survive the loss of a whole rack.
  • One of the replicas is usually placed on the same rack as the node writing to the file so that cross-rack network I/O is reduced.
  • Spread HDFS data uniformly across the DataNodes in the cluster.

Due to multiple competing considerations, data might not be uniformly placed across the DataNodes. HDFS provides a tool for administrators that analyzes block placement and rebalances data across the DataNodes.

In Hadoop 1,

Usage: hadoop balancer [-threshold <threshold>]

COMMAND_OPTION : Description

  -threshold <threshold> : Percentage of disk capacity. This overwrites the default threshold.
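For example, to consider the cluster balanced only when every datanode's utilization is within 5 percentage points of the cluster average (10 is the usual default):

```shell
# Run until balanced to within 5%; press Ctrl-C to stop early
hadoop balancer -threshold 5
```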

In Hadoop 2,

Usage: hdfs balancer [-threshold <threshold>] [-policy <policy>] [-exclude [-f <hosts-file> | <comma-separated list of hosts>]] [-include [-f <hosts-file> | <comma-separated list of hosts>]] [-idleiterations <idleiterations>]

COMMAND_OPTION : Description

  -policy <policy> : datanode (default): the cluster is balanced if each datanode is balanced. blockpool: the cluster is balanced if each block pool in each datanode is balanced.
  -threshold <threshold> : Percentage of disk capacity. This overwrites the default threshold.
  -exclude [-f <hosts-file> | <comma-separated list of hosts>] : Excludes the specified datanodes from being balanced by the balancer.
  -include [-f <hosts-file> | <comma-separated list of hosts>] : Includes only the specified datanodes to be balanced by the balancer.
  -idleiterations <iterations> : Maximum number of idle iterations before exit. This overwrites the default of 5.
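Putting the Hadoop 2 options together (the hosts-file path is a made-up example):

```shell
# Balance each block pool to within 10%, skipping the datanodes
# listed in the (hypothetical) exclude file
hdfs balancer -threshold 10 -policy blockpool -exclude -f /tmp/excluded-hosts

# Stop after 3 consecutive iterations that move no data
hdfs balancer -idleiterations 3
```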