History of Hadoop Project

Hadoop was created by Doug Cutting and Mike Cafarella. Cutting, who was working at Yahoo! at the time, named it after his son’s toy elephant. It was originally developed to support distribution for the Nutch search engine project.

  • 2002 – Nutch was started in 2002, and a working crawler and search system quickly emerged. However Mike Cafarella and Doug Cutting realized that their architecture would not scale to the billions of pages on the Web.
  • 2003 – One publication of a paper in 2003 that described the architecture of Googles distributed filesystem, called GFS, which was being used in production at Google. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004, they set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS), Google published the paper that introduced MapReduce to the world.
  • 2005 – The Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.
  • 2006 – NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and they moved out of Nutch to form an independent subproject of Lucene called Hadoop.
  • 2008 – Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster. Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times.
  • 2009 – Hadoop Core is renamed Hadoop Common. MapReduce and the Hadoop Distributed File System (HDFS) are now separate subprojects. Avro and Chukwa are new Hadoop subprojects.
  • 2010 – Hadoop’s Hive, Pig, Avro and HBase subprojects have graduated to become top-level Apache projects.
  • 2011 – Hadoop’s ZooKeeper subproject has graduated to become a top-level Apache project. Hadoop reaches 1.0.0.
  • 2012 – Yahoo!’s Hadoop cluster counts 42 000 nodes.
  • 2013 – Apache Hadoop release 2.2.0 available. Apache Hadoop 2.x reaches GA milestone.
  • 2015 – Apache Hadoop release 2.6.2 available.

The Apache Hadoop releases followed a straight line till 0.20. It always had a single major release , and there was no forking of the code into other branches. At release 0.20, there was a fan out of the project into three major branches. The 0.20.2 branch is often termed MapReduce v1.0, MRv1, or simply Hadoop 1.0.0. The 0.21 branch is termed MapReduce v2.0, MRv2, or Hadoop 2.0. A few older distributions are derived from 0.20.1.

Share this post
[social_warfare]
Advantages & Disadvantages
Need for Hadoop

Get industry recognized certification – Contact us

keyboard_arrow_up