Hadoop & Mapreduce Tutorial | distcp (Distributed Copy)

distcp

DistCp is short for Distributed Copy in the context of Apache Hadoop. It is a tool for copying large amounts of data/files in inter- or intra-cluster setups. In the background, DistCp uses MapReduce to distribute the copy work, which means the operation is spread across the available nodes in the cluster. This makes it a more efficient and effective copy tool.

DistCp takes a list of files (in the case of multiple files) and distributes the data between multiple map tasks; these map tasks then copy the portion of the data assigned to them to the destination. The number of map tasks can be tuned, as shown below.
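
For example, the -m option caps the number of map tasks DistCp launches (a minimal sketch; the map count and paths are illustrative, not taken from the tutorial's examples):

bash$ hadoop distcp2 -m 20 hdfs://nn1:8020/foo/bar \
        hdfs://nn2:8020/bar/foo

Here at most 20 map tasks share the copy work; fewer maps mean larger per-task file lists, while more maps mean finer-grained distribution.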

DistCp Version 2 (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

Usage

The most common invocation of DistCp is an inter-cluster copy:

bash$ hadoop distcp2 hdfs://nn1:8020/foo/bar \
        hdfs://nn2:8020/bar/foo

This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2.

One can also specify multiple source directories on the command line:

bash$ hadoop distcp2 hdfs://nn1:8020/foo/a \
        hdfs://nn1:8020/foo/b \
        hdfs://nn2:8020/bar/foo

Or, equivalently, from a file using the -f option:

bash$ hadoop distcp2 -f hdfs://nn1:8020/srclist \
        hdfs://nn2:8020/bar/foo

Where srclist contains:

hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b

When copying from multiple sources, DistCp will abort the copy with an error message if two sources collide, but collisions at the destination are resolved per the options specified. By default, files already existing at the destination are skipped (i.e. not replaced by the source file). A count of skipped files is reported at the end of each job, but it may be inaccurate if a copier failed for some subset of its files yet succeeded on a later attempt.
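
The default skip behaviour can be changed with the -update and -overwrite options (a brief sketch; the paths reuse the earlier examples, and the flag semantics are summarized from DistCp's standard option set):

bash$ hadoop distcp2 -update hdfs://nn1:8020/foo/bar \
        hdfs://nn2:8020/bar/foo

With -update, a destination file is rewritten only when it is missing or differs from the source (e.g. in size or checksum); -overwrite instead unconditionally replaces existing destination files.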

It is important that each TaskTracker can reach and communicate with both the source and destination file systems. For HDFS, both the source and destination must be running the same version of the protocol or use a backwards-compatible protocol.

After a copy, it is recommended that one generates and cross-checks a listing of the source and destination to verify that the copy was truly successful. Since DistCp employs both Map/Reduce and the FileSystem API, issues in or between any of the three could adversely and silently affect the copy. Some have had success running with -update enabled to perform a second pass, but users should be acquainted with its semantics before attempting this.
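
A quick sanity check along these lines (one illustrative approach, not the only one) is to compare directory, file and byte counts on both sides:

bash$ hadoop fs -count hdfs://nn1:8020/foo/bar
bash$ hadoop fs -count hdfs://nn2:8020/bar/foo

Each command prints the directory count, file count and total content size for the given tree; matching numbers on both sides are a good, though not conclusive, indication that the copy succeeded.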

It’s also worth noting that if another client is still writing to a source file, the copy will likely fail. Attempting to overwrite a file being written at the destination should also fail on HDFS. If a source file is (re)moved before it is copied, the copy will fail with a FileNotFoundException.

Components

The components of the new DistCp may be classified into the following categories:

DistCp Driver – The DistCp Driver components are responsible for:

  • Parsing the arguments passed to the DistCp command on the command line, via OptionsParser and DistCpOptionsSwitch
  • Assembling the command arguments into an appropriate DistCpOptions object, and initializing DistCp. These arguments include the source paths, the target location, and copy options (e.g. whether to update-copy or overwrite, which file attributes to preserve, etc.)
  • Orchestrating the copy operation by:
      • Invoking the copy-listing generator to create the list of files to be copied.
      • Setting up and launching the Hadoop MapReduce job to carry out the copy.
      • Based on the options, either returning a handle to the Hadoop MR job immediately, or waiting till completion.
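
A typical invocation exercising several of these driver-parsed options might look like this (a hedged sketch: -update, -p and -async are standard DistCp v2 switches, but consult your version's usage output; paths are illustrative):

bash$ hadoop distcp2 -update -pugp -async \
        hdfs://nn1:8020/foo/bar \
        hdfs://nn2:8020/bar/foo

Here -update requests update-copy semantics, -pugp asks to preserve user, group and permissions, and -async tells the driver to return a handle to the MR job immediately instead of blocking until completion.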
