Big Data and Apache Hadoop

Big data is a term that describes large volumes of data, both structured and unstructured. But it is not the amount of data that is important; it is what organizations do with the data that matters. Big data is one of the fastest-growing fields, and there is a huge demand for skilled professionals. We have interviewed a number of professionals, and here are the frequently asked questions that can help you land a job.

Q.1 Explain Big data and its characteristics.
Big Data is basically a huge amount of data that exceeds the processing capacity of conventional database systems and needs a special parallel processing mechanism. The characteristics of Big Data are volume, velocity, variety, value and veracity.
Q.2 What is Hadoop and list its components?
Hadoop refers to an open-source framework that is used for storing large data sets and running applications across clusters of commodity hardware. The core components of Hadoop are HDFS (the storage unit), YARN (the resource management framework), and MapReduce (the processing framework).
Q.3 What are the Hadoop daemons?
In general, a daemon is simply a process that runs in the background. Hadoop 1.x consists of five such daemons, namely: NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.
Q.4 What is Avro Serialization in Hadoop?
Avro Serialization is the process of translating the state of objects or data structures into binary or textual form. Avro describes data with a language-independent schema. It also offers AvroMapper and AvroReducer for running MapReduce programs.
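For illustration, here is a minimal Java sketch (the Employee record and its fields are hypothetical) that parses a language-independent Avro schema and builds a record with the generic API:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    // Parse a JSON schema describing a hypothetical Employee record
    String schemaJson = "{\"type\":\"record\",\"name\":\"Employee\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"id\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    // Build a record conforming to the schema; it can then be serialized
    // to binary with an Avro DatumWriter/DataFileWriter
    GenericRecord record = new GenericData.Record(schema);
    record.put("name", "alice");
    record.put("id", 1);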
Q.5 What is YARN?
YARN stands for Yet Another Resource Negotiator. This is one of the major components of Hadoop and is responsible for managing resources for the different applications operating in a Hadoop cluster. Moreover, it schedules tasks on various cluster nodes.
Q.6 What are the features of HDFS?
The features of HDFS include support for the storage of very large datasets, a write-once, read-many access model, streaming data access, replication on commodity hardware, high fault tolerance and distributed storage.
Q.7 How can you skip the bad records in Hadoop?
Well, Hadoop provides the SkipBadRecords class, which enables skipping bad records while processing map inputs.
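A minimal sketch of how this could be configured with the old (mapred) API; the limits used here are only example values:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    JobConf conf = new JobConf();
    // Allow up to 10 bad records to be skipped around a failing map input record
    SkipBadRecords.setMapperMaxSkipRecords(conf, 10);
    // Allow up to 10 bad key groups to be skipped on the reduce side
    SkipBadRecords.setReducerMaxSkipGroups(conf, 10);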
Q.8 Why can’t we perform “aggregation” in mapper and need the “reducer” for this?
Well, we cannot perform “aggregation” in mapper because sorting doesn’t occur in the “mapper” function. Sorting occurs on the reducer side only and without sorting aggregation cannot be done. Moreover, if one tries to aggregate data at mapper, it needs communication between all mapper functions which may be running on different machines. Therefore, it will consume high network bandwidth and might cause network bottlenecking.
Q.9 What happens if a DataNode fails while data is being written?
Any packets in the ack queue are added to the front of the data queue so they are not lost, and the failed DataNode is removed from the write pipeline so the write can continue on the remaining DataNodes.
Q.10 Explain HDFS.
HDFS stands for Hadoop Distributed File System which is the primary data storage unit of Hadoop. It stores different types of data as blocks in a distributed environment and follows master and slave topology.
Q.11 What services are provided by ZooKeeper?
1. Maintaining configuration information.
2. Providing distributed synchronization.
3. Providing group services.
Q.12 What are the main configuration parameters in a “MapReduce” program?
The main configuration parameters that users need to specify in the “MapReduce” framework include (see the driver sketch below):
• the job’s input locations in the distributed file system
• the input format of the data
• the output format of the data
• the job’s output location in the distributed file system
• the class containing the map function
• the class containing the reduce function
• the JAR file containing the mapper, reducer and driver classes
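As a sketch, a typical driver class setting these parameters might look like the following (WordCountMapper and WordCountReducer are hypothetical classes):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);              // JAR with mapper, reducer and driver
            job.setMapperClass(WordCountMapper.class);             // class containing the map function
            job.setReducerClass(WordCountReducer.class);           // class containing the reduce function
            job.setInputFormatClass(TextInputFormat.class);        // input format of the data
            job.setOutputFormatClass(TextOutputFormat.class);      // output format of the data
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // job's input location in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // job's output location in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }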
Q.13 What is split?
An input split is a fixed-size chunk of the job's input data that is processed by a single map task. Hadoop divides the input into splits and assigns each split to one mapper.
Q.14 How do “reducers” communicate with each other? 
The MapReduce programming model doesn’t allow “reducers” to communicate with each other. Hence, “reducers” run in isolation.
Q.15 Enabling the compression of map output is configured by which property?
mapred.compress.map.output
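The same setting can be made programmatically; a small sketch (in Hadoop 2.x the property was renamed mapreduce.map.output.compress):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    conf.setBoolean("mapred.compress.map.output", true);       // Hadoop 1.x property name
    // conf.setBoolean("mapreduce.map.output.compress", true); // Hadoop 2.x equivalent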
Q.16 Explain “Distributed Cache” in a “MapReduce Framework”.
Distributed Cache is a facility offered by the MapReduce framework to cache files required by applications. Once a file has been cached for a job, the Hadoop framework makes it available on every data node where map/reduce tasks are running. The cached file can then be accessed as a local file inside the Mapper or Reducer.
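A minimal sketch using the Hadoop 2.x Job API (the file path and symlink name are hypothetical):

    import java.net.URI;
    import org.apache.hadoop.mapreduce.Job;

    // Cache a lookup file for the job; "#zips" creates a symlink named "zips"
    // in each task's working directory
    job.addCacheFile(new URI("/apps/lookup/zip_codes.txt#zips"));

    // Inside the Mapper or Reducer setup() method, the cached file can then be
    // read as a local file via the symlink name "zips".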
Q.17 What is the default value for HADOOP_HEAPSIZE?
HADOOP_HEAPSIZE sets the maximum heap size, in MB, for the Hadoop daemons (e.g. 1000 means 1000 MB). By default, the value is 1000 MB.
Q.18 What does a “MapReduce Partitioner” do?
The role of MapReduce Partitioner is to ensure that all the values of a single key go to the same reducer, hence, allowing even distribution of the map output over the reducers. Further, it redirects the mapper output to the reducer by determining the reducer responsible for that particular key.
Q.19 What is the purpose of “RecordReader” in Hadoop?
An InputSplit describes a slice of work, but doesn’t describe how to access it. The RecordReader class loads the data from its source and converts it into key–value pairs suitable for reading by the Mapper task.
Q.20 How will you write a custom partitioner?
For writing a custom partitioner for a Hadoop job we should (see the sketch below):
• First, create a new class extending the Partitioner class.
• Second, override its getPartition method in the wrapper that runs in MapReduce.
• Then, add the custom partitioner to the job with the setPartitionerClass method, or add it to the job as a config file.
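A minimal sketch of such a custom partitioner (the routing rule chosen here is purely illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route keys to reducers based on their first character
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask the sign bit so the partition number is never negative
            return (Character.toLowerCase(key.toString().charAt(0)) & Integer.MAX_VALUE) % numPartitions;
        }
    }

In the driver, the partitioner would then be registered with job.setPartitionerClass(FirstLetterPartitioner.class).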
Q.21 What are the benefits of Apache Pig over MapReduce? 
Apache Pig, developed by Yahoo, is a platform for analyzing large data sets by representing them as data flows. It is designed to provide an abstraction over MapReduce, reducing the complexity of writing MapReduce programs.
Q.22 What is a “Combiner”? 
A Combiner is basically a mini reducer responsible for performing the local reduce task. It receives the input from the mapper on a specific node and sends the output to the reducer. Combiners thus help improve the efficiency of MapReduce by decreasing the amount of data that needs to be sent to the reducers.
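In the driver, a combiner is registered with a single call; a sketch, assuming the reduce function is commutative and associative (e.g. summing counts) so the reducer class can be reused:

    // WordCountReducer is a hypothetical reducer that sums counts per key
    job.setCombinerClass(WordCountReducer.class);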
Q.23 What do you know about “SequenceFileInputFormat”?
SequenceFileInputFormat is an input format for reading sequence files: a compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
Q.24 What is a UDF?
UDF stands for User Defined Function. If some functionality is not available in the built-in operators, we can programmatically create UDFs in other languages such as Java, Python or Ruby, and embed them in the script file to bring in that functionality.
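A minimal sketch of a Pig UDF written in Java (the ToUpper class and its behaviour are hypothetical):

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // UDF that upper-cases its first input field
    public class ToUpper extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }

The compiled JAR would then be REGISTERed in the Pig script and the function called inside a FOREACH ... GENERATE statement.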
Q.25 What does a container consist of?
A container consists of a collection of resources such as CPU, RAM and network bandwidth. It lets applications use a predefined amount of resources.
Q.26 What do you mean by WAL in HBase?
WAL refers to Write Ahead Log. This file is attached to each Region Server that is present inside the distributed environment. Also, it stores the new data which is yet to be kept in permanent storage. Frequently, WAL is used to recover data sets in case of any failure.
Q.27 What are the different data types in Pig Latin?
Pig Latin is capable of handling both atomic data types such as int, float, long, double etc. as well as complex data types including tuple, bag and map.
Q.28 What are the different relational operations in “Pig Latin” you worked with?

The various relational operators are:

  • DISTINCT
  • FOREACH
  • ORDER BY
  • GROUP
  • FILTER
  • LIMIT
  • JOIN
Q.29 Mention two differences between Sqoop and Flume?
There are several differences between Sqoop and Flume. First, data loading in Flume is event-driven, whereas in Sqoop it is not. Second, Sqoop takes data from an RDBMS, imports it into HDFS, and can export it back to the RDBMS, whereas Flume collects data from multiple sources and streams it into HDFS.
Q.30 What are the components of the architecture of Hive?
The components of the architecture of Hive are:
• User Interface
• Metastore
• Compiler
• Execution Engine
Q.31 What is the role of a JobTracker in Hadoop?
The basic role of a JobTracker is resource management and tracking resource availability, along with task life-cycle management, which includes tracking the tasks’ progress and fault tolerance.
Q.32 What are the significant components in the execution environment of Pig?
The major components of a Pig execution environment are:
• Parser
• Compiler
• Pig Scripts
• Optimizer
• Execution Engine
Q.33 Is it possible to import or export tables in HBase?
Yes, we can import and export tables between HBase clusters using the following commands:
For import:
create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}
hbase org.apache.hadoop.hbase.mapreduce.Import "table name" "target import location"
For export:
hbase org.apache.hadoop.hbase.mapreduce.Export "table name" "target export location"
Q.34 Why does Hive not store metadata in HDFS?
Hive does not store metadata in HDFS because read/write operations in HDFS take a large amount of time. Therefore, Hive stores this metadata in an RDBMS, the metastore, instead of HDFS. This makes the process faster and allows us to achieve low latency.
Q.35 What are the components of HBase?

The main components of HBase are:

• Region Server

• HMaster

• ZooKeeper

Q.36 What is Speculative Execution in Hadoop?
Well, there are several reasons for the slow performance of tasks, which are sometimes not easy to detect. So, instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when the task runs slower than expected and then launches the equivalent tasks as backup. This backup mechanism in Hadoop is speculative execution.
Q.37 What is the command used to open a connection in HBase?
The code used to open a connection in HBase is:
Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");
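HTableInterface and the HTable constructor are deprecated in recent HBase releases; a sketch of the equivalent using the newer Connection API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Table;

    Configuration myConf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(myConf);
         Table usersTable = connection.getTable(TableName.valueOf("users"))) {
        // issue Get/Put/Scan operations against usersTable here
    }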
Q.38 How does Sqoop import or export data between HDFS and RDBMS?

You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle or a mainframe into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Q.39 What is the use of RecordReader in Hadoop?
The RecordReader takes the byte-oriented data from its source and then converts it into record-oriented key–value pairs in a way that it is fit for the Mapper task to read it. Meanwhile, InputFormat defines this Hadoop RecordReader instance.
Q.40 What is the OutputCommitter class?
OutputCommitter describes how the output of tasks is committed for a MapReduce job, for example org.apache.hadoop.mapreduce.OutputCommitter. It is responsible for actions such as setting up and cleaning up the job and committing or aborting task output.
Q.41 Define Oozie workflow.
An Oozie workflow is a set of actions (for example MapReduce, Pig or Hive jobs) arranged in a sequence that needs to be executed.
Q.42 Can we write the output of MapReduce in different formats?

Yes, we can write the output of MapReduce in different formats. Hadoop supports several input and output file formats (see the driver snippet after this list), such as:

• TextOutputFormat

• DBOutputFormat

• MapFileOutputFormat

• SequenceFileOutputFormat

• SequenceFileAsBinaryOutputFormat
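
For example, a driver could select one of these formats with a single call; a minimal sketch:

    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    // Write the job output as a binary SequenceFile instead of plain text
    job.setOutputFormatClass(SequenceFileOutputFormat.class);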

Q.43 What happens when a node running a map task fails before sending the output to the reducer?
In such a situation, the map task is assigned to a new node, and the whole task is rerun to re-create the map output. In Hadoop v2, the YARN framework has a temporary daemon called the application master. If a task on a particular node fails because the node becomes unavailable, it is the role of the application master to have this task scheduled on another node.
Q.44 Explain the process of spilling in MapReduce.
Spilling is the process of copying data from memory buffer to disk when the buffer usage reaches a threshold size. This usually happens when there is not enough memory to fit all of the mapper output.
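The buffer size and spill threshold are configurable; a sketch using the Hadoop 2.x property names (the values shown are only examples and should be tuned per cluster):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 256);            // in-memory sort buffer size in MB
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // start spilling at 80% buffer usage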
Q.45 How can you set the mappers and reducers for a MapReduce job?
We can set the number of mappers and reducers on the command line using: -D mapred.map.tasks=5 -D mapred.reduce.tasks=2
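The same can be done from the driver code; a sketch (note that the number of map tasks is only a hint, since the actual count is determined by the number of input splits):

    job.setNumReduceTasks(2);                                // number of reducers
    job.getConfiguration().setInt("mapreduce.job.maps", 5);  // hint for the number of mappers (Hadoop 2.x name)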
Q.46 What are the features of Hadoop v2?

The primary features of Hadoop v2 are -

• Scalability

• Compatibility

• Resource utilization

• Multitenancy

Q.47 Mention Hadoop core components?

Hadoop core components include:

• HDFS

• MapReduce

Q.48 Which process has replaced JobTracker from MapReduce v1?
The ResourceManager has replaced the JobTracker from MapReduce v1.
Q.49 Explain how JobTracker schedules a task?
Well, the TaskTracker sends heartbeat messages to the JobTracker, usually every few seconds, to signal that it is alive. The message also informs the JobTracker about the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
Q.50 Explain what is Hadoop?
Hadoop is basically an open-source software framework used for storing data and running applications on clusters of commodity hardware. It provides enormous processing power and massive storage for any type of data.
Q.51 Explain the role of Sequencefileinputformat?
The major role of SequenceFileInputFormat is to read sequence files. A sequence file is a compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
Q.52 Explain what does the conf.setMapper Class do?
conf.setMapperClass is responsible for setting the mapper class and everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.
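A sketch with the old (mapred) API, where MyDriver and MyMapper are hypothetical classes:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyDriver.class);
    conf.setMapperClass(MyMapper.class);            // MyMapper implements the old-API Mapper interface
    conf.setMapOutputKeyClass(Text.class);          // key type emitted by the mapper
    conf.setMapOutputValueClass(IntWritable.class); // value type emitted by the mapper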
Q.53 Can we have more than one ResourceManager in a YARN-based cluster?
Yes, Hadoop v2 enables us to have more than one ResourceManager. We can have a high availability YARN cluster where one can have an active ResourceManager and a standby ResourceManager, where the ZooKeeper handles the coordination.
Q.54 What are the different schedulers available in YARN?
The various schedulers that are available in YARN are:
• FIFO scheduler
• Capacity scheduler
• Fair scheduler