
Cluster Planning

A computer cluster consists of a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.

Planning the cluster is a complex task, and it raises many questions about hardware sizing, storage, CPU, memory and network design.

Workload Patterns

Disk space, I/O bandwidth (required by Hadoop), and computational power (required for the MapReduce processes) are the most important parameters for accurate hardware sizing. Additionally, if you are installing HBase, you also need to analyze your application and its memory requirements, because HBase is a memory-intensive component. Based on the typical use cases for Hadoop, the workload patterns commonly observed in production range from I/O intensive to CPU intensive; both profiles are detailed in the CPU section below.

Planning a Hadoop cluster requires a minimum knowledge of the Hadoop architecture, which covers the two layers described below: the distributed computation layer (MapReduce) and the distributed storage layer (HDFS).

The Distributed Computation

At its heart, Hadoop is a distributed computation platform. This platform's programming model is MapReduce. In order to be efficient, MapReduce has two prerequisites: the input data must be splittable into chunks that can be processed in parallel, and the code must be able to run where the data is stored.

The first prerequisite depends on both the type of input data which feeds the cluster and what we want to do with it. The second prerequisite involves having a distributed storage system which exposes where exactly data is stored and allows the execution of code on any storage node. This is where HDFS is useful. Hadoop follows a Master / Slave architecture: the JobTracker (or the ResourceManager with YARN) is the master, and the TaskTrackers (or NodeManagers) running on the worker nodes are the slaves.

The critical component in this architecture is the JobTracker/ResourceManager.

The Distributed Storage

HDFS is a distributed storage filesystem. It runs on top of another filesystem, like ext3 or ext4. In order to be efficient, HDFS must handle very large files and expose the location of their blocks, so that computation can be scheduled close to the data.

HDFS follows a Master / Slave architecture: the NameNode is the master, and the DataNodes, which store the blocks, are the slaves.

The critical components in this architecture are the NameNode and the Secondary NameNode. These are two distinct but complementary roles. It is also possible not to use HDFS with Hadoop: Amazon's Elastic MapReduce, for example, relies on their own storage offering, S3, and a desktop tool like KarmaSphere Analyst embeds Hadoop with a local directory instead of HDFS.

HDFS File Management

HDFS is optimized for the storage of large files. You write the file once and access it many times. In HDFS, a file is split into several blocks. Each block is asynchronously replicated in the cluster. Therefore, the client sends its files once and the cluster takes care of replicating its blocks in the background.

A block is a contiguous area, a blob of data, on the underlying filesystem. Its default size is 64MB, but it can be raised to 128MB or even 256MB, depending on your needs. The block replication, which has a default factor of 3, is useful for two reasons: it protects data against the loss of a disk or a node, and it gives the scheduler more nodes on which a task can run close to its data.
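As an illustration of how a file maps to blocks and raw storage, here is a minimal Python sketch; the file size, block size and replication factor used are purely illustrative assumptions:

import math

block_size_mb = 128     # default is 64MB, often raised to 128MB or 256MB
replication = 3         # default HDFS replication factor
file_size_mb = 700      # hypothetical file

blocks = math.ceil(file_size_mb / block_size_mb)   # 6 blocks; the last one only occupies the space it needs
raw_storage_mb = file_size_mb * replication        # ~2100 MB of raw storage across the cluster

print(f"{blocks} blocks, ~{raw_storage_mb} MB of raw storage")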

From a network standpoint, the bandwidth is used at two moments: when the client first sends a file to the cluster, and when the DataNodes replicate its blocks among themselves in the background.

NameNode and the HDFS cluster – The NameNode manages the meta information of the HDFS cluster. This includes meta information (filenames, directories, …) and the location of the blocks of a file. The filesystem structure is entirely mapped into memory. In order to have persistence over restarts, two files are also used: fsimage, a snapshot of the filesystem metadata, and edits, a journal of the changes applied since that snapshot.

The in-memory image is the merge of those two files. When the NameNode starts, it first loads fsimage and then applies the content of edits on top of it to recover the latest state of the filesystem. The issue is that, over time, the edits file keeps growing indefinitely, ends up consuming a lot of disk space, and makes NameNode restarts very long, since the whole journal has to be replayed at startup.

The Secondary NameNode's role is to avoid this issue by regularly merging edits into fsimage, thus pushing a new fsimage and resetting the content of edits. The trigger for this compaction process is configurable: it can be a time interval (by default, every hour) or a size threshold on the edits file.
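To see why this compaction matters, the following sketch estimates how much the edits journal can grow between two checkpoints; the transaction rate and record size are hypothetical assumptions, not measured values:

checkpoint_period_s = 3600   # hypothetical trigger: one compaction per hour
tx_per_second = 200          # hypothetical rate of metadata operations
bytes_per_tx = 200           # hypothetical average size of one edits record

edits_growth_mb = checkpoint_period_s * tx_per_second * bytes_per_tx / (1024 * 1024)
print(f"edits grows by ~{edits_growth_mb:.0f} MB between two checkpoints")
# Without the Secondary NameNode, this growth would never be reset and the whole
# journal would have to be replayed at the next NameNode restart.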

The following formula can be applied to know how much memory a NameNode needs:

<needed memory in GB> = <total storage size in the cluster in MB> / <size of a block in MB> / 1000000

In other words, a rule of thumb is to consider that a NameNode needs about 1GB / 1 million blocks.
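As a worked example of this formula, here is a small sketch assuming an illustrative 1PB of cluster storage and 128MB blocks:

total_storage_mb = 1024 * 1024 * 1024   # 1PB of storage in the cluster, expressed in MB
block_size_mb = 128                     # hypothetical block size

needed_memory_gb = total_storage_mb / block_size_mb / 1000000
print(f"~{needed_memory_gb:.1f} GB of NameNode memory")   # ~8.4 GB, i.e. roughly 8.4 million blocks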

Determine Storage Needs

Storage needs are split into three parts – shared needs, NameNode and Secondary NameNode specific needs, and DataNode specific needs.

Shared needs – Shared needs are already known, since they cover the OS partition and the OS logs partition. Those two partitions can be set up as usual.

NameNode and Secondary NameNode specific needs – The Secondary NameNode must be identical to the NameNode. Same hardware, same configuration. A 1TB partition should be dedicated to files written by both the NameNode and the Secondary NameNode. This is large enough so you won’t have to worry about disk space as the cluster grows.

If you want to be closer to the actual occupied size, you need to take into account the NameNode parameters explained above (a combination of the compaction trigger, the maximum fsimage size and the edits size) and multiply this result by the number of checkpoints you want to retain.
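A minimal sketch of that estimate, with purely illustrative sizes for fsimage and edits and a hypothetical number of retained checkpoints:

fsimage_mb = 2048           # hypothetical maximum size of fsimage
edits_mb = 256              # hypothetical size of edits when the compaction triggers
retained_checkpoints = 24   # hypothetical number of checkpoints kept

needed_gb = retained_checkpoints * (fsimage_mb + edits_mb) / 1024
print(f"~{needed_gb:.0f} GB for NameNode metadata")   # ~54 GB, far below the 1TB partition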

In any case, the NameNode must have an NFS mount point to secured storage configured as one of its fsimage and edits directories. This mount point has the same size as the local partition for fsimage and edits mentioned above. The storage of the NameNode and the Secondary NameNode is typically backed by a RAID configuration.

DataNode specific needs – The hardware requirements for DataNode storage are as follows.

Do not use RAID on a DataNode: HDFS provides its own replication mechanism. The number of hard drives can vary depending on the total desired storage capacity. A good way to determine the latter is to start from the planned data input of the cluster. It is also important to note that, for every disk, 30% of its capacity is reserved for non-HDFS use. Let's consider a set of hypotheses covering the daily data input, the retention period, the replication factor and the disks available per node.

With these hypotheses, we are able to determine the storage needed and the number of DataNodes.
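Since the original hypotheses are not reproduced here, the sketch below uses purely illustrative numbers for the daily input, retention, replication and disk layout to show how the computation goes:

import math

daily_input_tb = 1.0     # hypothetical data ingested per day
retention_days = 365     # hypothetical retention period
replication = 3          # HDFS replication factor
non_hdfs_reserve = 0.30  # 30% of every disk is reserved for non-HDFS use
disks_per_node = 12      # hypothetical DataNode hardware
disk_size_tb = 4.0

raw_storage_tb = daily_input_tb * retention_days * replication               # 1095 TB of raw storage
usable_per_node_tb = disks_per_node * disk_size_tb * (1 - non_hdfs_reserve)  # 33.6 TB per DataNode
datanodes = math.ceil(raw_storage_tb / usable_per_node_tb)

print(f"{raw_storage_tb:.0f} TB raw -> {datanodes} DataNodes")   # -> 33 DataNodes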

Two important elements are not included in this computation.

They depend on the needs of your business units and must be taken into account in order to determine storage needs.

Determine Your CPU Needs

On both the NameNode and the Secondary NameNode, 4 physical cores running at 2GHz will be enough. For DataNodes, two elements help you determine your CPU needs: the profile of the jobs that will run on them, which drives the clock speed, and the number of tasks you want to run in parallel, which drives the number of physical cores.

Roughly, we consider that a DataNode can perform two kinds of jobs: I/O intensive and CPU intensive.

I/O intensive jobs – These jobs are I/O bound. For example: indexing, search, clustering, decompression and data import/export. Here, a CPU running between 2GHz and 2.4GHz is enough.

CPU intensive jobs – These jobs are CPU bound. For example: machine learning, statistics, semantic analysis and language analysis. Here, a CPU running between 2.6GHz and 3GHz is enough.

The number of physical cores determines the maximum number of tasks that can run in parallel on a DataNode. It is also important to keep in mind that there is a distribution between Map and Reduce tasks on DataNodes (typically 2/3 Maps and 1/3 Reduces). To determine your needs, you can use the following formula

(<number of physical cores> – 1) * 1.5 = <maximum number of tasks>

or, if you prefer to start from the number of tasks and adjust the number of cores according to it:

(<maximum number of tasks> / 1.5) + 1 = <number of physical cores>

The number 1 keeps one core aside for the TaskTracker (MapReduce) and DataNode (HDFS) processes. The factor 1.5 indicates that a physical core, thanks to hyperthreading, can process more than one task at the same time.
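A small sketch of both formulas, assuming a hypothetical 12-core DataNode and the typical 2/3 maps / 1/3 reduces split:

import math

physical_cores = 12                            # hypothetical DataNode CPU
max_tasks = int((physical_cores - 1) * 1.5)    # one core kept for the TaskTracker and DataNode daemons

map_slots = round(max_tasks * 2 / 3)           # typical distribution: 2/3 maps...
reduce_slots = max_tasks - map_slots           # ...and 1/3 reduces

cores_needed = math.ceil(20 / 1.5 + 1)         # inverse formula: cores needed to run 20 tasks in parallel

print(f"{max_tasks} tasks ({map_slots} maps, {reduce_slots} reduces); 20 tasks need {cores_needed} cores")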

Determine Memory Needs

This is a two-step process: first determine the memory of the NameNode and the Secondary NameNode, then the memory of the DataNodes.

In both cases, you should use DDR3 ECC memory.

Memory of the NameNode and Secondary NameNode – As explained above, the NameNode manages the HDFS cluster metadata in memory. The memory needed for the NameNode process and the memory needed for the OS must be added to it. The Secondary NameNode must be identical to the NameNode. Given this information, we have the following formula:

<Secondary NameNode memory> = <NameNode memory> = <HDFS cluster management memory> + <2GB for the NameNode process> + <4GB for the OS>

Memory of DataNodes – The memory needed for a DataNode depends on the profile of the jobs which will run on it: for I/O bound jobs, count between 2GB and 4GB per physical core; for CPU bound jobs, between 6GB and 8GB per physical core. In both cases, the memory needed for the DataNode and TaskTracker processes and the memory needed for the OS must be added, as for the NameNode.
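A sketch covering both memory estimates; the 8GB of HDFS metadata memory stands in for the result of the block-count formula above, and the core count and per-process figures are illustrative assumptions:

hdfs_metadata_gb = 8      # from the block-count formula above (illustrative)
namenode_gb = hdfs_metadata_gb + 2 + 4    # + 2GB for the NameNode process + 4GB for the OS
secondary_namenode_gb = namenode_gb       # the Secondary NameNode mirrors the NameNode

physical_cores = 12       # hypothetical DataNode CPU
gb_per_core = 6           # CPU bound profile; use 2GB-4GB per core for I/O bound jobs
datanode_gb = physical_cores * gb_per_core + 2 + 4   # + illustrative process and OS memory

print(f"NameNode: {namenode_gb} GB, DataNode: {datanode_gb} GB")   # NameNode: 14 GB, DataNode: 78 GB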

Determine Network Needs

The two reasons for which Hadoop generates the most network traffic are the replication of HDFS blocks between DataNodes and the shuffle phase of MapReduce jobs, during which intermediate map outputs are transferred to the reducers.

Despite this, network transfers in Hadoop follow an East/West pattern, which means that even though orders come from the NameNode, most of the transfers are performed directly between DataNodes. As long as these transfers do not cross the rack boundary, this is not a big issue, and Hadoop does its best to perform only such transfers. However, inter-rack transfers are sometimes needed, for example for the second replica of an HDFS block. This is a complex subject, but as a rule of thumb, you should make the network topology known to Hadoop (rack awareness) so that most transfers stay rack-local, and provision the inter-rack uplinks generously, since they carry the replication and shuffle traffic that crosses the rack boundary.
