Need for Hadoop

Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. It has become the technology of choice for applications that require petabyte-scale analytics across large numbers of computing nodes. It is:

  • Reliable – The software is fault tolerant; it expects and handles hardware and software failures
  • Scalable – Designed to scale to massive numbers of processors, large amounts of memory, and locally attached storage
  • Distributed – Handles data replication and offers the massively parallel MapReduce programming model (see the sketch after this list)

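To make the MapReduce programming model concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API. The mapper runs in parallel on each input split and emits a (word, 1) pair for every token; the reducer sums the counts for each word. The class names and input/output paths are illustrative assumptions, not a tuned production job.

// Minimal word-count sketch with the Hadoop MapReduce Java API (illustrative, not production-tuned).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split, emitting (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for a given word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because each map task works on its own block of the input and each reduce task handles a disjoint set of keys, the same job scales from a single machine to thousands of nodes without code changes.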
Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing (a short sketch after the list below shows how these blocks can be inspected). Hadoop is particularly useful when:

  • Complex information processing is needed
  • Unstructured data needs to be turned into structured data
  • Queries cannot be reasonably expressed using SQL
  • Heavily recursive algorithms are involved
  • Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing
  • Machine learning is to be applied
  • Data sets are too large to fit into database RAM or disks, or would require too many cores (terabytes up to petabytes)
  • The data's value does not justify the expense of constant real-time availability; archives or special-interest information can be moved to Hadoop and remain available at lower cost
  • Results are not needed in real time
  • Fault tolerance is critical
  • Significant custom coding would be required to handle job scheduling
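The block distribution described above can be observed directly once a file is stored in HDFS. The sketch below, which assumes a hypothetical file path, uses the HDFS FileSystem Java API to list the byte range of each block of a file and the DataNodes holding its replicas.

// Sketch: list the blocks of an HDFS file and the hosts storing each replica.
// The file path is a hypothetical example; cluster settings come from the default configuration files.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();              // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/user/demo/big-input.log")); // assumed path
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Each block reports its byte range and the hosts holding a replica of it.
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
    fs.close();
  }
}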
