HDFS Interfaces

Most Hadoop filesystem interactions are mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class.

HTTP

Hortonworks developed an additional API to support requirements based on standard REST functionality. The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages to interact with HDFS. The HTTP interface is slower than the native Java client, so it should be avoided for very large data transfers if possible. There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons serve HTTP requests to clients; and via a proxy (or proxies), which accesses HDFS on the client’s behalf using the usual DistributedFileSystem API. Both use the WebHDFS protocol.

The WebHDFS concept is based on the standard HTTP operations GET, PUT, POST and DELETE. Operations such as OPEN, GETFILESTATUS and LISTSTATUS use HTTP GET; others such as CREATE, MKDIRS, RENAME and SETPERMISSION rely on HTTP PUT; APPEND is based on HTTP POST, while DELETE uses HTTP DELETE. Authentication can be based on the user.name query parameter (as part of the HTTP query string) or, if security is turned on, on Kerberos. WebHDFS requires the client to have direct connections to the namenode and the datanodes on their predefined ports. The standard URL format is http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=..., and the interface is enabled with the following property in hdfs-site.xml:

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
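
As a sketch of the URL format and the user.name query parameter, the call below fetches the status of a path (the host, port, path, and user are placeholders, and the query-string form of authentication assumes security is not enabled):

curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILESTATUS&user.name=<USER>"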

WebHDFS Advantages

  • Calls are much quicker than a regular “hadoop fs” command, since there is no JVM startup overhead; you can easily see the difference on a cluster with terabytes of data.
  • It gives non-Java clients a way to access HDFS.

Enable WebHDFS

  • Enable WebHDFS in the HDFS configuration file (hdfs-site.xml).
  • Set dfs.webhdfs.enabled to true.
  • Restart the HDFS daemons.
  • You can now access HDFS with the WebHDFS API using curl calls.

Some example calls

List the contents of a directory

curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"

Get the content summary of a path

curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETCONTENTSUMMARY"

Read a file

curl -i -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN"
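
Create a file

Write operations such as CREATE go through HTTP PUT in two steps: the namenode first answers with a redirect to a datanode, and the file data is then sent to that datanode. A minimal sketch (the host, port, paths, and redirect URL are placeholders):

curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"

curl -i -X PUT -T <LOCAL_FILE> "<DATANODE_REDIRECT_URL>"

The Location header of the first response contains the datanode URL to use in the second call.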

The major difference between WebHDFS and HttpFS is that WebHDFS needs access to all nodes of the cluster, and when data is read it is transmitted directly from the node that holds it, whereas with HttpFS a single node acts as a “gateway” and is the single point of data transfer to the client node. HttpFS can therefore become a bottleneck during a large file transfer, but the good thing is that it minimizes the footprint required to access HDFS.
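
Because HttpFS exposes the same REST API, the calls look the same; only the host and port point at the gateway instead of the namenode. A hedged sketch, assuming an HttpFS server running on its default port 14000 (host, path, and user are placeholders):

curl -i "http://<HTTPFS_HOST>:14000/webhdfs/v1/<PATH>?op=LISTSTATUS&user.name=<USER>"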

Other Related Components

  • HFTP – this was the first mechanism that provided HTTP access to HDFS. It was designed to facilitate data copying between clusters running different Hadoop versions (see the distcp sketch after this list). HFTP is part of HDFS. It redirects clients to the datanode containing the data to provide data locality, but it supports only read operations. The HFTP HTTP API is neither curl/wget friendly nor RESTful. WebHDFS is a rewrite of HFTP and is intended to replace it.
  • HdfsProxy – an HDFS contrib project. It runs as external servers (outside HDFS) to provide a proxy service. Common use cases of HdfsProxy are firewall tunneling and user authentication mapping.
  • HdfsProxy V3 – Yahoo!’s internal version, a dramatic improvement over HdfsProxy. It has an HTTP REST API and other features such as bandwidth control. Nonetheless, it is not yet publicly available.
  • Hoop – a rewrite of HdfsProxy that aims to replace it. Hoop has an HTTP REST API. Like HdfsProxy, it runs as external servers to provide a proxy service. Because it is a proxy running outside HDFS, it cannot take advantage of some features, such as redirecting clients to the corresponding datanodes to provide data locality. It has advantages, however, in that it can be extended to control and limit bandwidth like HdfsProxy V3, or to carry out authentication translation from one mechanism to HDFS’s native Kerberos authentication. It can also provide a proxy service to other filesystems, such as Amazon S3, via the Hadoop FileSystem API. At the time of writing this blog post, Hoop is in the process of being committed to Hadoop as an HDFS contrib project.
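
As a sketch of the cross-version copy use case HFTP was designed for, distcp can read from the source cluster over HFTP and write to the destination over HDFS (namenode hosts, ports, and paths are placeholders; the job is run on the destination cluster, since HFTP only supports reads):

hadoop distcp hftp://<SOURCE_NAMENODE>:<HTTP_PORT>/<SRC_PATH> hdfs://<DEST_NAMENODE>/<DEST_PATH>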

C

Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface. It works by using the Java Native Interface (JNI) to call a Java filesystem client. There is also a libwebhdfs library that uses the WebHDFS interface.

The C API is very similar to the Java one, but it typically lags the Java one, so some newer features may not be supported. You can find the header file, hdfs.h, in the include directory of the Apache Hadoop binary tarball distribution.

NFS

It is possible to mount HDFS on a local client’s filesystem using Hadoop’s NFSv3 gateway. You can then use Unix utilities (such as ls and cat) to interact with the filesystem, upload files, and, in general, use POSIX libraries to access the filesystem from any programming language. Appending to a file works, but random modifications of a file do not, since HDFS can only write to the end of a file.
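
A minimal sketch of mounting the gateway, assuming the NFS gateway is already running on <NFS_GATEWAY_HOST> and the mount point /mnt/hdfs exists (the mount options follow the Hadoop NFS gateway documentation):

mount -t nfs -o vers=3,proto=tcp,nolock,sync <NFS_GATEWAY_HOST>:/ /mnt/hdfs

ls /mnt/hdfs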
