HDFS (Hadoop Distributed File System)
HDFS is still a relatively new technology, and many professionals are trying to step into this field. To help candidates preparing for a career around the Hadoop Distributed File System (HDFS), here are some questions and answers that commonly come up in interviews:
- What are the essential features of HDFS?
The Hadoop Distributed File System (HDFS) is a highly fault-tolerant file system designed to run on low-cost commodity hardware. It provides high-throughput access to application data and is well suited to applications with large data sets. The processing logic runs in close proximity to the data, and HDFS is portable across heterogeneous hardware and operating systems. It is also highly scalable, which makes it a strong fit for processing large volumes of data.
- What is meant by streaming access in HDFS?
Streaming access is one of the most important aspects of HDFS, which is built around the principle of 'Write Once, Read Many'. HDFS is not primarily concerned with how data is stored; rather, it focuses on retrieving records as quickly as possible through sequential, high-throughput reads.
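For illustration, here is a minimal sketch of a streaming read using the HDFS Java API (covered again under access methods below). The file path /data/sample.txt is a placeholder, and the snippet assumes a configured Hadoop client on the classpath:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() returns a stream; HDFS favors reading it sequentially
        // from start to finish ("write once, read many").
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```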
- What is the NameNode?
The NameNode is the master node of HDFS. It holds the file-system metadata and uses it to manage the blocks stored on the DataNodes.
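As a rough sketch of what that metadata looks like from a client's point of view, the Java API can ask the NameNode for the block locations of a file (the path below is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));

        // The NameNode's metadata maps each block of the file to the
        // DataNodes that hold a replica of it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```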
- What is a DataNode?
DataNodes are also referred to as slaves. They run on the individual worker machines and store the actual data, and they serve read/write requests from clients.
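A short sketch of the write path: the client creates a file through the NameNode, but the bytes themselves are streamed to DataNodes (the output path here is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() opens a write pipeline; the data blocks land on
        // DataNodes, while the NameNode records only the metadata.
        try (FSDataOutputStream out = fs.create(new Path("/data/out.txt"))) {
            out.writeBytes("hello hdfs\n");
        }
    }
}
```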
- What is a heartbeat?
A heartbeat is a signal indicating that a node is alive and functioning. Each DataNode sends heartbeats to the NameNode, and likewise each TaskTracker sends heartbeats to the JobTracker. If at any point the NameNode or JobTracker stops receiving heartbeats, it concludes that there is a problem with the DataNode or that the TaskTracker can no longer perform its tasks.
- Do the NameNode and JobTracker run on the same host?
It is possible to run all the daemons on a single machine, but in a production environment the NameNode and JobTracker run on separate hosts.
- Explain the term ‘block’ in HDFS.
A block is the minimum amount of data that can be read or written. The default block size in Hadoop is 64 MB, compared with 8192 bytes in Unix/Linux. In HDFS, files are broken into block-sized chunks that are stored as independent units.
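To make the block concept concrete, here is a hedged sketch that reads a file's block size and creates a file with a non-default one; the 128 MB value and the paths are illustrative, not prescribed by anything above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Block size is a per-file property recorded in the NameNode metadata.
        Path file = new Path("/data/sample.txt");
        System.out.println("block size: " + fs.getFileStatus(file).getBlockSize());

        // Files can also be created with a non-default block size (128 MB here).
        long blockSize = 128L * 1024 * 1024;
        fs.create(new Path("/data/big.dat"), true,
                conf.getInt("io.file.buffer.size", 4096),
                fs.getDefaultReplication(), blockSize).close();
    }
}
```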
- If a file is 20 MB and the block size is 64 MB, will HDFS consume the full 64 MB like other file systems?
No. 64 MB is simply the upper limit of a block. In this case the 20 MB file occupies only 20 MB of disk space; the remaining 44 MB is not reserved and can be used to store other data.
- How can you identify whether a DataNode is full?
Whenever data is stored on a DataNode, the metadata about that data is updated in the NameNode. It is therefore the NameNode that identifies whether a DataNode is full.
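One practical way to check this from the FS shell is the dfsadmin report, which prints capacity and usage per DataNode (the exact output format varies by Hadoop version):
bin/hadoop dfsadmin -report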
- What are the different methods to access HDFS?
There are several ways to access HDFS. For native access, HDFS offers a Java API for applications, and a C-language wrapper (libhdfs) is also available. In addition, the files of an HDFS instance can be browsed through an HTTP browser.
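As a small example of the native route, the sketch below connects to a cluster by URI and lists the root directory; the address hdfs://namenode:9000/ is a placeholder for your cluster's fs.default.name:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListRoot {
    public static void main(String[] args) throws Exception {
        // The URI is a placeholder; by default it is taken from fs.default.name.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:9000/"), new Configuration());

        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}
```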
- Is there any way to recover a deleted file in HDFS?
HDFS does not remove a file completely as soon as it is deleted. The file is first renamed and moved into the /trash directory, and it can be recovered from there for as long as it remains in that folder. Once the deleted file's time in trash expires, however, the NameNode removes it from the HDFS namespace.
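Assuming the /trash layout described above, recovery is simply a move back to the original location via the FS shell (the placeholder paths follow the same convention as the mkdir example below):
bin/hadoop dfs -mv /trash/<path_to_file> /<restore_path>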
- What is the command to create a directory in HDFS via the FS shell?
bin/hadoop dfs -mkdir /<directory_name>
- What is a snapshot, and how does it help when the NameNode fails?
A snapshot stores a copy of the data as it existed at a particular point in time. If the NameNode machine fails, the snapshot can be rolled back to recover the last known good state of the file system.