- What are real-time industry applications of Hadoop?
Hadoop is an open-source framework for scalable and distributed computing over large volumes of data. It provides fast, reliable and rich analytics of both structured and unstructured data generated on digital platforms and within the enterprise. It is used across almost all departments and sectors today. Some of the scenarios where Hadoop is used:
- Managing traffic on streets.
- Stream processing.
- Content Management and Archiving Emails.
- Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster.
- Managing content, images, posts and videos on social media platforms.
- Fraud detection and prevention.
- Public sector fields such as intelligence, defense, cyber security and scientific research.
- Advertising platforms use Hadoop to capture and analyze click-stream, transaction and social media data.
- Analyzing customer data in real-time for improving business performance.
- Getting access to unstructured data such as output from medical devices, doctors' notes, imaging reports, lab results, medical correspondence, clinical data, and financial data.
- What is Big Data?
Big data is a huge amount of structured, unstructured or semi-structured data that has vast potential for mining but is so large that it cannot be managed with traditional database systems. Big data is characterized by its high velocity, volume and variety, which require cost-effective and innovative methods of information processing to draw meaningful business insights.
- How is Hadoop different from other parallel computing systems?
Hadoop provides HDFS, a distributed file system that lets you store and process huge amounts of data on a cluster of machines while handling data redundancy. The main benefit is that, since data is stored on several nodes, it is better to process it in a distributed manner: each node can process the data stored on it instead of spending time moving it over the network.
In a relational database management system, by contrast, you can query data in real time, but it is not efficient to store data in tables, records and columns when the data is huge.
- What are the different types of modes Hadoop can be run in?
Hadoop can run in three modes:
- Standalone Mode: This is the default mode; it uses the local file system for both input and output. It is generally used for debugging, and it does not support the use of HDFS. In this mode, no custom configuration is required for the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.
- Pseudo-Distributed Mode: Here you need to configure all three files mentioned above. All daemons run on a single node, so the Master and the Slave node are the same machine (a minimal configuration sketch follows this list).
- Fully Distributed Mode: This is the production mode of Hadoop, where data is stored and processed across several nodes of a Hadoop cluster. Separate nodes are assigned as Master and Slaves.
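As a rough illustration, the pseudo-distributed setup mostly comes down to a couple of properties. The sketch below sets them programmatically on a Hadoop Configuration object purely for illustration; in practice they live in core-site.xml and hdfs-site.xml, and the hostname, port and replication value shown here are assumptions rather than fixed defaults:
Configuration conf = new Configuration();              // org.apache.hadoop.conf.Configuration
conf.set("fs.defaultFS", "hdfs://localhost:9000");     // core-site.xml: single-node HDFS endpoint (assumed address)
conf.set("dfs.replication", "1");                      // hdfs-site.xml: one replica, since there is only one DataNode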
- What is distributed cache and what are its benefits?
Distributed Cache is a facility provided by the MapReduce framework to cache files when required by applications. Once a file is cached for a specific job, Hadoop makes it available on every DataNode, both on disk and in memory, where the map and reduce tasks are executing; you can then easily read the cached file and populate any collection in your code (a usage sketch follows the list of benefits). Benefits of the distributed cache are:
- It distributes simple, read-only text/data files as well as complex types such as jars and archives. The archives are then un-archived at the slave node.
- The distributed cache tracks the timestamps of the cache files, which ensures that the files are not modified while a job is executing.
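A minimal usage sketch, assuming the usual org.apache.hadoop imports, a Configuration named conf, and a lookup file already present in HDFS (the path /user/hadoop/lookup.txt is made up for illustration):
// In the driver:
Job job = Job.getInstance(conf, "cache example");
job.addCacheFile(new URI("/user/hadoop/lookup.txt"));   // distribute the HDFS file to every task node

// In the Mapper or Reducer setup() method:
URI[] cached = context.getCacheFiles();                 // local copies are available to every running task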
- What are the most common Input Formats in Hadoop?
The three most common input formats in Hadoop are (a driver snippet follows the list):
- Text Input Format: The default input format in Hadoop; each line of the file is treated as a record.
- Key Value Input Format: Used for plain text files where each line is split into a key and a value (on a separator character, a tab by default).
- Sequence File Input Format: Used for reading files stored in Hadoop's binary SequenceFile format.
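In the newer MapReduce API the input format is selected in the job driver. A minimal sketch, assuming a Job object named job has already been created:
job.setInputFormatClass(KeyValueTextInputFormat.class);   // the default would be TextInputFormat
// SequenceFileInputFormat.class is chosen the same way for sequence files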
- Define DataNode. How does the NameNode tackle DataNode failures?
A DataNode stores data in HDFS; it is the node where the actual data of the file system resides. Each DataNode sends a heartbeat message to the NameNode to signal that it is alive. If the NameNode does not receive a heartbeat from a DataNode for 10 minutes, it considers that DataNode to be dead or out of service and starts replicating the blocks that were hosted on it onto other DataNodes. A Block Report contains the list of all blocks on a DataNode.
The NameNode orchestrates the replication of data blocks from one DataNode to another. During this process, the replicated data is transferred directly between DataNodes and never passes through the NameNode.
- What are the core methods of a Reducer?
- setup(): used to configure parameters for the task, such as the size of the input data and the distributed cache.
- reduce(): the heart of the reducer; called once per key with the list of values associated with that key.
- cleanup(): called only once, at the end of the task, to clean up temporary files and other resources (a skeleton reducer is sketched below).
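A skeleton reducer showing all three methods; the word-count style types and the summing logic are illustrative assumptions, not anything required by the API:
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) {
        // read configuration values, open distributed cache files, etc.
    }
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();    // invoked once per key with all of its values
        context.write(key, new IntWritable(sum));
    }
    @Override
    protected void cleanup(Context context) {
        // delete temporary files and close handles opened in setup()
    }
}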
- What is SequenceFile in Hadoop?
SequenceFile is a flat file containing binary key/value pairs. Map outputs are stored internally as SequenceFiles. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are (a short writer example follows the list):
- Uncompressed key/value records.
- Block compressed key/value records – both keys and values are collected in 'blocks' separately and compressed; the size of the 'block' is configurable.
- Record compressed key/value records – only the 'values' are compressed here.
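A short sketch of writing a block-compressed SequenceFile through the Writer class; the output path and the Text/IntWritable key and value types are assumptions chosen for illustration:
Path path = new Path("/tmp/data.seq");
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(IntWritable.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
writer.append(new Text("sample-key"), new IntWritable(1));
writer.close();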
- What is the role of the Job Tracker in Hadoop?
The primary role of the Job Tracker is resource management: managing the Task Trackers, tracking resource availability, and tracking task progress and fault tolerance.
- It is a daemon that runs on a separate node, usually not on a DataNode.
- It communicates with the NameNode to identify the data location.
- It finds the most suitable Task Tracker node to execute the task on or near the node holding the data.
- It monitors the individual Task Trackers and submits the overall job status back to the client.
- It tracks the execution of MapReduce workloads, which run locally on the slave nodes, through the Task Trackers' status reports.
- What is the use of RecordReader in Hadoop?
Since Hadoop splits data into blocks, the RecordReader is used to read the split data into a single record. For instance, if the input data is split like:
Row1: Welcome to
Row2: Hadoop
It will be read as “Welcome to Hadoop” using RecordReader.
- What is Speculative Execution in Hadoop?
One drawback of Hadoop is that, by distributing the tasks over several nodes, a few slow nodes can hold back the rest of the program. There are various reasons why tasks run slowly, and they are sometimes hard to detect. Instead of identifying and fixing the slow-running tasks, Hadoop detects when a task is running slower than expected and launches an equivalent task on another node as a backup. This backup mechanism is known as Speculative Execution.
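Speculative execution is enabled by default; if the extra attempts waste cluster resources, it can be switched off per job, for example:
conf.setBoolean("mapreduce.map.speculative", false);      // no speculative map attempts
conf.setBoolean("mapreduce.reduce.speculative", false);   // no speculative reduce attempts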
- What happens if you try to run a Hadoop job with an output directory that is already present?
It will throw an exception saying that the output directory already exists. To run a MapReduce job, you need to make sure that the output directory does not already exist in HDFS.
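A common pattern is to check for (and, where appropriate, delete) the directory in the driver before submitting the job. A sketch, assuming the output path /user/hadoop/output and a Job named job:
FileSystem fs = FileSystem.get(conf);
Path out = new Path("/user/hadoop/output");
if (fs.exists(out)) {
    fs.delete(out, true);                     // remove the old output recursively
}
FileOutputFormat.setOutputPath(job, out);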
- How can you debug Hadoop code?
First, check the list of MapReduce jobs currently running. Next, make sure that no orphaned jobs are running; if there are, you need to determine the location of the ResourceManager (RM) logs.
Run: "ps -ef | grep -i ResourceManager"
Then look for the log directory in the displayed result. Find the job-id in the displayed list and check whether there is any error message associated with that job.
On the basis of the RM logs, identify the worker node that was involved in executing the task.
Now, log in to that node and run: "ps -ef | grep -i NodeManager"
Examine the NodeManager log. The majority of errors come from the user-level logs of each MapReduce job.
- How to configure Replication Factor in HDFS?
hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml changes the default replication factor for all files placed in HDFS.
You can also modify the replication factor on a per-file basis using the Hadoop FS shell:
[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Conversely, you can also modify the replication factor of all the files inside a directory:
[training@localhost ~]$ hadoop fs -setrep -w 3 -R /my/dir
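The same per-file change can also be made from Java through the FileSystem API; a minimal sketch, assuming a Configuration named conf:
FileSystem fs = FileSystem.get(conf);
fs.setReplication(new Path("/my/file"), (short) 3);   // equivalent to the -setrep shell command above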
- How to compress mapper output but not the reducer output?
For this compression, you should set:
conf.set("mapreduce.map.output.compress", true)
conf.set("mapreduce.output.fileoutputformat.compress", false)
- What is the difference between Map Side join and Reduce Side Join?
Map-side join is performed before the data reaches the map phase and requires a strict structure (for example, sorted and identically partitioned inputs) for the datasets being joined. Reduce-side join is simpler than map-side join because the input datasets need not be structured; however, it is less efficient, as it has to go through the sort and shuffle phases, which incurs network overhead.
- How can you transfer data from Hive to HDFS?
By writing the query:
hive> insert overwrite directory '/' select * from emp;
You can write your query for the data you want to transfer from Hive to HDFS. The output you receive will be stored in part files in the specified HDFS path.
- What are the companies using Hadoop?
Hadoop has changed the rules of the game for many companies. Yahoo! (among the biggest contributors to the creation of Hadoop) uses it in its search engine; Facebook developed Hive on top of it for analytics; other users include Amazon, Netflix, Adobe, eBay, Spotify and Twitter.
- Explain about the indexing process in HDFS.
The indexing process in HDFS is based on the block size. HDFS stores the last part of the data, which in turn points to the address where the next part of the data chunk is stored.