1. What is rack awareness? And why is it necessary?
Answer:
Rack awareness is about how HDFS distributes data blocks across multiple racks. HDFS follows the rack awareness algorithm to place the data blocks. A rack holds multiple servers, and a cluster can span multiple racks. Say a Hadoop cluster is set up with 12 nodes: there could be 3 racks with 4 servers each, all connected so that the 12 nodes form one cluster. While deciding on the rack count, the important point to consider is the replication factor. If 100GB of data is going to flow in every day with a replication factor of 3, then 300GB of data will have to reside on the cluster. It is a better option to have the data replicated across the racks: even if a node (or a whole rack) goes down, a replica will still be available in another rack.
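As an illustration, rack awareness is typically enabled by pointing the NameNode at a topology script in core-site.xml. This is only a sketch: the script path is a hypothetical example, and the script itself must map a host or IP to a rack ID such as /rack1.
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/rack-topology.sh</value> <!-- hypothetical path to a script that prints the rack ID for a given host -->
</property>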
2. What is the default block size and how is it defined?
Answer:
128MB, and it is defined in hdfs-site.xml; it is also customizable depending on the volume of the data and the level of access. Say 100GB of data flows in per day; the data gets segregated and stored across the cluster. What will be the number of blocks? 800 blocks (100*1024/128, where 1024 converts GB to MB). There are two ways to customize the data block size:
hadoop fs -D dfs.blocksize=134217728 -put <local file> <hdfs path> (the value is in bytes; 134217728 bytes = 128MB)
In hdfs-site.xml, set the dfs.blocksize property to the size in bytes.
If you change the default size to 512MB because the data volume is huge, then the number of blocks generated will be 200 (100*1024/512).
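A minimal hdfs-site.xml entry for this, assuming a 128MB block size, would look like the following:
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128MB in bytes; for 512MB use 536870912 -->
</property>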
3. How do you get the report of hdfs file system? About disk availability and no.of active nodes?
Answer:
Command: sudo -u hdfs hdfs dfsadmin -report
It displays the following information:
Configured Capacity – total capacity available to HDFS
Present Capacity – the amount of space actually available to HDFS for data, after excluding the space used for non-DFS storage such as metadata and the fsimage
DFS Remaining – the amount of storage space still available to HDFS to store more files
DFS Used – the storage space that has already been used up by HDFS
DFS Used% – the same figure, as a percentage
Under replicated blocks – number of blocks that have fewer replicas than the replication factor
Blocks with corrupt replicas – number of blocks that have at least one corrupt replica
Missing blocks
Missing blocks (with replication factor 1)
4. What is Hadoop balancer and why is it necessary?
Answer:
The data spread across the nodes is not always distributed in the right proportion, meaning the utilization of each node may not be balanced: one node might be over-utilized while another is under-utilized. This has a high cost impact when running processes, because those processes end up hammering the heavily used nodes. To solve this, the Hadoop balancer is used to even out the utilization of data across the nodes. Whenever the balancer is executed, data gets moved around so that under-utilized nodes get filled up and over-utilized nodes are freed up.
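For example, the balancer is normally run from the command line; the threshold below is only an illustrative value, meaning each DataNode's utilization should end up within 10% of the cluster average:
hdfs balancer -threshold 10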
5. Difference between Cloudera and Ambari?
Answer:
Cloudera Manager: administration tool for the Cloudera distribution; monitors and manages the entire cluster and reports usage and any issues; comes with Cloudera's paid service.
Ambari: administration tool for the Hortonworks distribution; monitors and manages the entire cluster and reports usage and any issues; open source.
6. What are the main actions performed by the Hadoop admin?
Answer:
Monitor the health of the cluster – there are several application pages that have to be monitored whenever processes run (Job History Server, YARN ResourceManager, and Cloudera Manager or Ambari depending on the distribution)
Turn on security – SSL or Kerberos
Tune performance – Hadoop balancer
Add new data nodes as needed – infrastructure changes and configuration
Optionally turn on the MapReduce Job History Server – sometimes restarting the services helps release cached memory; this is done when the cluster has no processes running
7. What is Kerberos?
Answer:
It is an authentication protocol that each service must use in order to sync up and run processes securely, and it is recommended to enable Kerberos. Since we are dealing with distributed computing, it is always good practice to have encryption while accessing and processing data, because the nodes are connected and any information passes over the network. When Hadoop uses Kerberos, passwords are not sent across the network; instead, passwords are used to compute encryption keys, and encrypted messages are exchanged between the client and the server. In simple terms, Kerberos lets the nodes prove their identity to each other in a secure manner using encryption.
Configuration in core-site.xml:
hadoop.security.authentication: kerberos
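A minimal core-site.xml sketch of that setting (the authorization property is commonly enabled alongside it, but check your distribution's documentation):
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>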
8. What is the important list of hdfs commands?
Answer:
Command – Purpose
hdfs dfs -ls <hdfs path> – list the files in the HDFS filesystem
hdfs dfs -put <local file> <hdfs folder> – copy a file from the local filesystem to the HDFS filesystem
hdfs dfs -chmod 777 <hdfs file> – give read, write, and execute permissions on the file
hdfs dfs -get <hdfs folder/file> <local filesystem> – copy a file from the HDFS filesystem to the local filesystem
hdfs dfs -cat <hdfs file> – view the file content from the HDFS filesystem
hdfs dfs -rm <hdfs file> – remove the file from the HDFS filesystem; it is moved to the trash path (like the Recycle Bin in Windows)
hdfs dfs -rm -skipTrash <hdfs file> – remove the file permanently from the cluster
hdfs dfs -touchz <hdfs file> – create an empty file in the HDFS filesystem
9. How to check the logs of a Hadoop job submitted in the cluster and how to terminate the already running process?
Answer:
yarn logs -applicationId <application_id> – the ApplicationMaster generates logs on its container, appended with the application id it generates. This is helpful for monitoring the running status of the process and its log output.
yarn application -kill <application_id> – if a process already running in the cluster needs to be terminated, the kill command is used, where the application id identifies the job to terminate.
What is HDFS?
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
What is a filesystem?
In computing, a file system (or filesystem) is used to control how data is stored and retrieved. Without a file system, information placed in a storage medium would be one large body of data with no way to tell where one piece of information stops and the next begins. By separating the data into pieces and giving each piece a name, the information is easily isolated and identified.
Why is HDFS required as another filesystem?
Filesystems like NTFS, FAT, FAT32, Ext2, Ext3, and Ext4 are local to a particular node or machine. Information stored in one node's NTFS or Ext filesystem carries no knowledge of what is stored in another node's NTFS or Ext filesystem. Apache Hadoop is an open-source software framework that allows storing and processing big data in a distributed environment across clusters of computers. For Hadoop to work seamlessly in a distributed environment, HDFS was introduced, and it works on top of your local filesystem.
What are the key features of HDFS?
HDFS is highly fault-tolerant, offers high throughput, is suitable for applications with large data sets, provides streaming access to file system data, and can be built out of commodity hardware.
What is Fault Tolerance?
Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting the data back present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also. So even if one or two of the systems collapse, the file is still available on the third system.
What is a heartbeat in HDFS?
In general, a heartbeat is a signal indicating that a node is alive. A DataNode sends heartbeats to the NameNode; if the NameNode stops receiving heartbeats from a DataNode, it concludes that there is some problem with that DataNode.
What is a ‘block’ in HDFS?
A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB in Hadoop 1 (128 MB in Hadoop 2), in contrast to a block size of about 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, particularly to minimize the cost of seeks. If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size? No, not at all! 64 MB is just the unit in which the data is stored. In this particular situation, only 50 MB will be consumed by the HDFS block and 14 MB will be free to store something else. It is the NameNode that does data allocation in an efficient manner.
What is a block and block scanner in HDFS?
Block - The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default size of a block in HDFS is 64MB (Hadoop 1). Block Scanner - The block scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block scanners use a throttling mechanism to conserve disk bandwidth on the DataNode.
How do you define “block” in HDFS? What is the block size in Hadoop 1 and in Hadoop 2? Can it be changed?
A “block” is the minimum amount of data that can be read or written. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. Hadoop 1 default block size: 64 MB. Hadoop 2 default block size: 128 MB. Yes, blocks can be configured: the dfs.block.size parameter (dfs.blocksize in newer releases) can be set in the hdfs-site.xml file to define the size of a block in a Hadoop environment.
What are the benefits of block transfer?
A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
Why do we use HDFS for applications having large data sets and not when there are lots of small files?
HDFS is more suitable for a large amount of data in a single file than for small amounts of data spread across multiple files. This is because the NameNode is a very expensive, high-performance system, so it is not prudent to fill its memory with the unnecessarily large amount of metadata generated for many small files. When a large amount of data is in a single file, the NameNode occupies less space. Hence, for optimized performance, HDFS favors large data sets over multiple small files.
Replication causes data redundancy, then why is it pursued in HDFS?
HDFS works with commodity hardware (systems with average configurations) that has a high chance of crashing at any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places. Any data on HDFS gets stored in at least 3 different locations by default. So, even if one of them is corrupted and another is unavailable for some time for any reason, the data can still be accessed from the third one. Hence, there is very little chance of losing the data. This replication factor helps us attain the Hadoop feature called fault tolerance.
Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?
No, calculations will be done only on the original data. The master node knows which node has that particular data. If one of the nodes is not responding, it is assumed to have failed; only then will the required calculation be done on the second replica.
If we want to copy 10 blocks from one machine to another, but the other machine can hold only 8.5 blocks, can the blocks be broken at the time of replication?
In HDFS, blocks cannot be broken down. Before copying blocks from one machine to another, the master node figures out the actual amount of space required, how many blocks are being used, and how much space is available, and it allocates the blocks accordingly.
Explain how indexing is done in HDFS.
Hadoop has its own way of indexing data. Depending on the block size, HDFS keeps storing data until a block is filled, and the last part of each block points to the address where the next chunk of data is stored.
How do you do a file system check in HDFS?
The fsck command is used to do a file system check in HDFS. It is a very useful command to check the health of files, block names, and block locations.
hdfs fsck /dir/hadoop-test -files -blocks -locations
What is commodity hardware?
Commodity hardware is an inexpensive system which is not of high quality or high availability. Commodity hardware includes enough RAM, because some services will be running in RAM. Hadoop can be installed on any average commodity hardware; we don't need supercomputers or high-end hardware to work on Hadoop and execute jobs.
What is a rack?
A rack is a physical collection of DataNodes kept together at a single location. There can be multiple racks in a single location, and the racks themselves can be physically located at different places.
What is rack awareness?
Rack awareness is the way in which the NameNode decides how to place blocks based on rack definitions. Hadoop will try to minimize the network traffic between DataNodes within the same rack and will only contact remote racks if it has to. The NameNode is able to control this thanks to rack awareness.
On what basis will data be stored on a rack?
When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the NameNode and gets 3 DataNodes for every block of the file, indicating where each block should be stored. While placing blocks on DataNodes, the key rule followed is “for every block of data, two copies will exist in one rack, and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
How do you define “rack awareness” in Hadoop?
It is the manner in which the “Namenode” decides how blocks are placed, based on rack definitions to minimize network traffic between “DataNodes” within the same rack. Let’s say we consider replication factor 3 (default), the policy is that “for every block of data, two copies will exist in one rack, third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
Explain the difference between NAS and HDFS.
NAS runs on a single machine and thus there is no probability of data redundancy whereas HDFS runs on a cluster of different machines thus there is data redundancy because of the replication protocol.
HDFS data blocks are distributed across local drives of all machines in a cluster while NAS data is stored on dedicated hardware.
In NAS data is stored independent of the computation and hence Hadoop MapReduce cannot be used for processing whereas HDFS works with Hadoop MapReduce as the computations in HDFS are moved to data.
Is HDFS a replacement for your local filesystem?
HDFS is by no means a replacement for your local filesystem. The OS still relies on the local filesystem, and HDFS still goes through the local filesystem to save each block. HDFS is layered on top of the local filesystem.
How is a small file stored in HDFS?
A 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB. In HDFS the block size is more about how a single file is split up / partitioned, not about some reserved part of the file system.
If it stores a 1 MB file as 1 MB, why does the concept of block size come into the picture in Hadoop?
Because a 1 GB file will be stored in blocks of 128 MB (for example), with each block being replicated on 3 (typically) nodes in the cluster. The idea is that it's quicker for 8 tasks to process 128 MB each than for a single task to process the whole 1 GB.
What happens to the memory not allocated in HDFS when the file size is not a multiple of 64 MB or the default file storage block size?
For a 100 MB file, the block sizes are going to be 64 MB and 36 MB; the second block is not of 64 MB size (assuming the default block size is 64 MB). HDFS isn't going to waste disk space by allocating a 64 MB block for a smaller chunk.
The NameNode does not care about the size of the DataNodes; a disk file of the appropriate size is created on each DataNode. An easier way to think of this is to consider writing a file of 65 MB. HDFS wouldn't allocate two full 64 MB chunks just to store the extra 1 MB of data; it would create a new block of 1 MB size.
What happens when a 1 MB file is stored in HDFS?
When a 1 MB file is stored in HDFS, 1 MB of disk space is consumed (per replica), even though the HDFS block size is 128 MB (assuming a 128 MB block size is configured).
What happens when a 1 KB file is stored in HDFS?
When a 1 KB file is stored in HDFS, 4 KB of disk space is consumed even though the HDFS block size is 128 MB (assuming a 128 MB block size is configured). Why 4 KB and not 1 KB? Because HDFS stores data on the underlying local filesystem. Assuming ext4 is the local filesystem on Linux, its block size is 4 KB, so a full 4 KB block is consumed, leaving 3 KB of it unused.
What happens when a 130 MB file is stored in HDFS?
When a 130 MB file is stored in HDFS, 130 MB of disk space is consumed, consisting of 2 HDFS blocks of size 128 MB and 2 MB (assuming a 128 MB block size is configured). The remaining 126 MB of the second block is not wasted, unlike in a local filesystem with fixed-size block allocation.
Why is the block size 128 MB in HDFS, and why not 4 KB?
Why is a block in HDFS so large?
Block size is just an indication to HDFS of how to split up and distribute files across the cluster. The OS will lay blocks out in contiguous locations, so reads and writes in HDFS are faster because blocks are laid out next to each other; the disk head doesn't have to seek and reposition itself over and over again between blocks. This is a huge benefit of the HDFS design, so to reduce seek time the HDFS block size is kept this high.
What is throughput? How does HDFS get a good throughput?
Throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the system and it is usually used to measure performance of the system. In HDFS, when we want to perform a task or an action, then the work is divided and shared among different systems. So all the systems will be executing the tasks assigned to them independently and in parallel. So the work will be completed in a very short period of time. In this way, the HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.
What is streaming access?
As HDFS works on the principle of ‘Write Once, Read Many‘, the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.
Do we need to place the 2nd and 3rd replicas in rack 2 only?
Yes; this is to guard against DataNode (and rack) failure.
What if rack 2 and a DataNode fail?
If both rack 2 and the DataNode present in rack 1 fail, then there is no chance of getting the data from them. In order to avoid such situations, we need to replicate the data more times instead of replicating only thrice. This can be done by changing the replication factor, which is set to 3 by default.
What is a ‘key value pair’ in HDFS?
A key value pair is the intermediate data generated by the maps and sent to the reducers for generating the final output.
What happens when two clients try to access the same file on HDFS?
HDFS supports exclusive writes only. When the first client contacts the “Namenode” to open the file for writing, the “Namenode” grants a lease to that client to create this file. When the second client tries to open the same file for writing, the “Namenode” will notice that the lease for the file has already been granted to another client and will reject the open request for the second client.
Why do we sometimes get a “file could only be replicated to 0 nodes, instead of 1” error?
This happens because the “Namenode” does not have any available DataNodes.
How does one switch off “safe mode” in HDFS?
Use the command: hadoop dfsadmin -safemode leave
Why is it that in HDFS, ‘reading’ is done in parallel but ‘writing’ is not?
Using the MapReduce program, a file can be read by splitting its blocks. But while writing, the incoming values are not yet known to the system, so MapReduce cannot be applied and parallel writing is not possible.
How do you copy a directory from one node in the cluster to another?
Use the ‘distcp’ command to copy:
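A typical invocation, with hypothetical cluster and path names, would look like this:
hadoop distcp hdfs://namenodeA:8020/source/dir hdfs://namenodeB:8020/target/dir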
Is there an HDFS command to see the available free space in HDFS?
hadoop dfsadmin -report
What does "file could only be replicated to 0 nodes, instead of 1" mean?
The namenode does not have any available DataNodes.
What are the problems with small files and HDFS?
HDFS is not good at handling a large number of small files, because every file, directory, and block in HDFS is represented as an object in the NameNode's memory, each of which occupies approximately 150 bytes. So 10 million files, each using a block, would use about 3 gigabytes of memory, and when we go to a billion files the memory requirement in the NameNode cannot be met.
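The arithmetic behind that estimate, roughly: each small file contributes one file object plus one block object.
10,000,000 files x 2 objects (file + block) x ~150 bytes ≈ 3,000,000,000 bytes ≈ 3 GB of NameNode heap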
How can you overwrite the replication factor in HDFS?
The replication factor in HDFS can be modified in 2 ways.
Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the command below:
$ hadoop fs -setrep -w 2 /my/test_file
Note: test_file is the file whose replication factor will be set to 2.
Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the command below:
$ hadoop fs -setrep -w 5 /my/test_dir
Note: test_dir is the name of the directory; all the files in this directory will have their replication factor set to 5.
Explain what happens if, during the PUT operation, an HDFS block is assigned a replication factor of 1 instead of the default value of 3.
Replication factor is an HDFS property that can be set for the entire cluster to adjust the number of times blocks are replicated, to ensure high data availability. For a replication factor of n, the cluster keeps n-1 duplicate copies of every block stored in HDFS. So, if the replication factor during the PUT operation is set to 1 instead of the default value of 3, there will be only a single copy of the data, and if the DataNode holding it crashes for any reason, that single copy of the data is lost.
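For illustration, a per-operation replication factor can be passed on the command line; the file and path below are just placeholders:
hadoop fs -D dfs.replication=1 -put localfile.csv /user/data/localfile.csv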
What is the process to change files at arbitrary locations in HDFS?
HDFS does not support modifications at arbitrary offsets in a file or multiple writers; files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.
What happens to a NameNode that has no data?
There does not exist any NameNode without data. If it is a NameNode then it should have some sort of data in it.
Explain the difference between an Input Split and an HDFS Block.
A logical division of the data is known as an Input Split, while a physical division of the data is known as an HDFS Block.
Mention the best way to copy files between HDFS clusters.
The best way to copy files between HDFS clusters is by using multiple nodes and the distcp command, so the workload is shared.
How do you overwrite the replication factor?
There are a few ways to do this; see the illustrations below.
hadoop fs -setrep -w 5 -R hadoop-test
hadoop fs -Ddfs.replication=5 -cp hadoop-test/test.csv hadoop-test/test_with_rep5.csv
How do you configure the replication factor in HDFS?
hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all files placed in HDFS. You can also modify the replication factor on a per-file basis using the Hadoop FS shell:
[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Conversely, you can also change the replication factor of all the files under a directory:
[corejavaguru@localhost ~]$ hadoop fs -setrep -w 3 -R /my/dir
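The corresponding hdfs-site.xml entry for the cluster-wide default is sketched below:
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- default replication factor applied to newly created files -->
</property>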
YARN interview
What is YARN?
Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.
YARN provides APIs for requesting and working with cluster resources, but these APIs are not typically used directly by user code. Instead, users write to higher-level APIs provided by distributed computing frameworks, which themselves are built on YARN and hide the resource management details from the user.
[Figure: distributed computing frameworks (MapReduce, Spark, and so on) running as YARN applications on top of YARN.]
Is YARN a replacement for Hadoop MapReduce?
YARN is not a replacement for Hadoop MapReduce; it is a more powerful and efficient layer that supports MapReduce as well as other distributed computing frameworks, and it is also referred to as MapReduce version 2.
Why YARN?
With older versions of Hadoop, you were limited to executing MapReduce jobs only. This was great if the type of work you were performing fit well into the MapReduce processing model, but it was restrictive for those wanting to perform graph processing, iterative computing, or any other type of work.
In Hadoop 2 the scheduling pieces of MapReduce were separated and reworked into a new component called YARN. The YARN doesn’t know or care about the type of applications that are running, nor does it care about keeping any historical information about what has executed on the cluster. Because of this design YARN can scale beyond the levels of MapReduce.
What are YARN's responsibilities?
YARN is responsible for two activities:
Responding to a client's request to create a container (a container is in essence a process, with a contract governing the physical resources that it is permitted to use).
Monitoring containers that are running, and terminating them if needed (containers can be terminated if a YARN scheduler wants to free up resources so that containers from other applications can run, or if a container is using more than its allocated resources).
What are the benefits YARN brings to Hadoop?
YARN makes efficient use of resources: there are no more fixed map and reduce slots, and YARN provides a central resource manager. With YARN, you can now run multiple applications in Hadoop, all sharing a common pool of resources.
YARN is backward compatible. This means that existing MapReduce job can run on Hadoop 2.0 without any change.
YARN fixed the old mapreduce scalability issue and can now run on larger clusters than MapReduce 1.
The biggest benefit of YARN is that it opens up Hadoop to other types of distributed application beyond MapReduce. MapReduce is just one YARN application among many.
Explain how the scalability issue is fixed in YARN.
In YARN (in contrast to the jobtracker in MapReduce 1), each instance of an application, say a MapReduce job, has a dedicated application master, which runs for the duration of the application. This model is actually closer to the original Google MapReduce paper, which describes how a master process is started to coordinate map and reduce tasks running on a set of workers.
MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes and 40,000 tasks, because of the reason that the jobtracker has to manage both jobs and tasks. YARN overcomes these limitations by virtue of its split resource manager/application master architecture which is designed to scale up to 10,000 nodes and 100,000 tasks.
What are the key components of YARN?
The basic idea of YARN is to split the functionality of resource management and job scheduling/monitoring into separate daemons. YARN consists of the following components:
ResourceManager - The ResourceManager is the YARN master process, and its sole function is to arbitrate resources on a Hadoop cluster. It responds to client requests to create containers, and a scheduler determines when and where a container can be created
NodeManager - The NodeManager is the slave process that runs on every node in a cluster. Its job is to create, monitor, and kill containers. It services requests from the ResourceManager and ApplicationMaster to create containers, and it reports on the status of the containers to the ResourceManager.
ApplicationMaster - ApplicationMaster is a per-application component which doesn’t perform any application-specific work, as these functions are delegated to the containers. Instead, it is responsible for negotiating resource requirements for the resource manager and working with NodeManagers to execute and monitor the tasks.
The ApplicationMaster is also responsible for the specific fault-tolerance behavior of the application. It receives status messages from the ResourceManager when its containers fail, and it can decide to take action based on these events (by asking the ResourceManager to create a new container), or to ignore these events.
Container - A container is an application-specific process that’s created by a NodeManager on behalf of an ApplicationMaster with a constrained set of resources (Memory, CPU, etc.)
YARN child - After the application is submitted, the application master dynamically launches YARN child processes to run the MapReduce tasks.
What is the ResourceManager in YARN?
The ResourceManager is the YARN master process, and its only function is to arbitrate resources on a Hadoop cluster. It responds to client requests to create containers, and a scheduler determines when and where a container can be created.
The ResourceManager has two main components - the Scheduler and the ApplicationsManager.
Scheduler - The Scheduler is responsible for allocating resources to the running applications.
ApplicationsManager - The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
What is the ApplicationMaster in YARN?
ApplicationMaster is a per-application component which doesn’t perform any application-specific work, as these functions are delegated to the containers. Instead, it is responsible for negotiating resource requirements for the resource manager and working with NodeManagers to execute and monitor the tasks.
The ApplicationMaster is also responsible for the specific fault-tolerance behavior of the application. It receives status messages from the ResourceManager when its containers fail, and it can decide to take action based on these events (by asking the ResourceManager to create a new container), or to ignore these events.
What are the scheduling policies available in YARN?
YARN scheduler is responsible for scheduling resources to user applications based on a defined scheduling policy. YARN provides three scheduling options-
FIFO Scheduler - FIFO scheduler puts application requests in queue and runs them in the order of submission (first in, first out). Requests for the first application in the queue are allocated first; once its requests have been satisfied, the next application in the queue is served, and so on.
Capacity Scheduler - Capacity scheduler has a separate dedicated queue for smaller jobs and starts them as soon as they are submitted.
Fair Scheduler - Fair scheduler dynamically balances and allocates resources between all the running jobs. Just after the first (large) job starts, it is the only job running, so it gets all the resources in the cluster. When the second (small) job starts, it is allocated half of the cluster resources so that each job is using its fair share of resources.
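As a rough sketch of how the Capacity Scheduler's dedicated-queue idea is expressed, capacity-scheduler.xml defines the queues and their shares; the queue names and percentages here are purely illustrative:
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,small</value> <!-- illustrative queue names -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>80</value> <!-- percent of cluster capacity for the main queue -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.small.capacity</name>
  <value>20</value> <!-- a smaller queue reserved for short jobs -->
</property>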
How do you set up the ResourceManager to use the CapacityScheduler?
You can configure the ResourceManager to use CapacityScheduler by setting the value of property yarn.resourcemanager.scheduler.class to org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler in the file conf/yarn-site.xml.
How do you set up the ResourceManager to use the FairScheduler?
You can configure the ResourceManager to use FairScheduler by setting the value of property yarn.resourcemanager.scheduler.class to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler in the file conf/yarn-site.xml.
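In either case, the yarn-site.xml entry has the following shape (shown here for the CapacityScheduler):
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>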