### HDFS Re-balancer
The balancer redistributes data across multiple DataNodes, moving blocks from over-utilized nodes to under-utilized ones.
Whenever the NameNode writes data to the DataNodes, it relies on the heartbeat mechanism: whichever DataNode sends its heartbeat first, the NN assigns the data to that DataNode.
Why do we need the balancer? If any DN loses contact with the NN, for example due to a network glitch, the NN will not write data to that particular DN. Over time, some of the DNs become highly utilized while others stay under-utilized.
To overcome this scenario, HDFS came up with a concept called rebalancing.
The Balancer is one of the daemons that runs in HDFS, along with the NN, DN, and SNN.
It helps us balance the data by moving blocks.
Blocks are moved from one DN to another (DNs communicate with each other over the HTTP protocol), so it requires network bandwidth.
It makes sure that the blocks are distributed evenly across all the DNs in the cluster.
We need to run the balancer during non-business hours, whenever no data ingestion is happening.
If you run the balancer during business hours it will degrade performance and sometimes jobs may fail; to avoid this, run the balancer during non-business hours and over the weekends.
How frequently do we need to run the balancer?
At least once a week. In our environment we run it every Friday night.
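As a minimal sketch (the install path and log file below are assumptions), a Friday-night run can be scheduled with a cron entry on the node where the balancer is started:
0 23 * * 5 /usr/local/hadoop/sbin/start-balancer.sh >> /var/log/hadoop/balancer.log 2>&1 (runs every Friday at 11 PM)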
Suppose there are 10 DNs in my cluster and the disks on all 10 are full, and I now want to add a few more DNs to the cluster. I then need to balance the data across all the DNs; in this scenario the balancer helps move data from the existing DNs to the new DNs so that all the DNs are equally balanced.
How does the balancer know how much data needs to be balanced?
By default the balancer runs with a threshold of 10%. It considers the average disk utilization of the entire Hadoop cluster and makes sure that every DN's disk utilization stays within +/- 10% of that average, across all the servers. (The lower the threshold value, the more balancing work is done.)
Eg. If one server is at 50%, other servers may be at 48% or 52%.
We need to assign bandwidth to the balancer because DNs communicate over the HTTP protocol.
***You can run the balancer with a different threshold; for example:
start-balancer.sh
hdfs balancer -threshold 10 (percentage of disk capacity, relative to the AVG cluster utilization)
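For instance, a tighter threshold (5 here is just an illustrative value) moves more blocks and takes longer to finish:
hdfs balancer -threshold 5 (every DN ends up within +/- 5% of the cluster average)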
***Adjust the network bandwidth used by the balancer before you run it; for example:
hadoop dfsadmin -setBalancerBandwidth 200000000 (value is in bytes per second, per DN)
Normally all the servers are connected over a 10-Gbps network,
so you need to assign at least 1 GB or 2 GB per second of bandwidth to the balancer.
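As a worked example, 1 GB/s = 1024 x 1024 x 1024 = 1073741824 bytes per second, so capping each DN at roughly 1 GB/s looks like:
hadoop dfsadmin -setBalancerBandwidth 1073741824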
We can run the balancer only within the cluster, across the DNs.
*** We can exclude/include specific DNs from being balanced by the balancer:
hdfs balancer [-exclude | -include] [-f <hosts-file> | <comma-separated list of hosts>]
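For example (the hostnames and the file path are placeholders):
hdfs balancer -exclude dn05.example.com,dn06.example.com (skip these two DNs)
hdfs balancer -include -f /tmp/include_hosts.txt (balance only the DNs listed in the file)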
*** Intra-node balancer
Available from Hadoop 3.x (as hdfs diskbalancer), it balances the HDFS data across the disks within a single server, rather than across servers.
If the disk capacity is getting full, rather than adding a new DN to the cluster, the client may say: I don't have enough budget to add a new server, but I can provide more disks for the existing servers. In that case we add 2 or 3 additional disks to each server and register those disks in the DN configuration in the hdfs-site.xml file.
The config property is dfs.datanode.data.dir (formerly dfs.data.dir); add the 2 or 3 additional disk directories under it and then run the disk balancer.
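A minimal sketch of that change (the mount points and hostname are assumptions; dfs.disk.balancer.enabled must also be true in hdfs-site.xml):
dfs.datanode.data.dir = /data1/dfs/dn,/data2/dfs/dn,/data3/dfs/dn (new mounts appended to the existing value)
hdfs diskbalancer -plan dn07.example.com (generates a JSON plan of block moves for that DN)
hdfs diskbalancer -execute dn07.example.com.plan.json (executes the plan)
hdfs diskbalancer -query dn07.example.com (checks progress)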
We don't normally do this because Hadoop favors horizontal scaling (adding more servers to the existing cluster) over vertical scaling (if you keep growing each server's disk capacity, performance will degrade).
How do we know the data is not distributed equally? Use the df -h command through a cluster shell utility (you can also see it in the NN web UI).
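You can also check per-DN utilization from HDFS itself:
hdfs dfsadmin -report (prints each DN's Configured Capacity, DFS Used, and DFS Used%)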
### How to change the replication factor
If you want to change the replication factor of a particular file or directory, change it from the command-line terminal; we should not change dfs.replication globally in the HDFS configuration.
*** to see the replication factor of a file
hdfs dfs -stat "%n %o" /path/to/file (display file name and block size)
hdfs dfs -stat "%r" /filename (display replication factor)
*** to change the replication of a file (increase/decrease the replication factor using the setrep cmd)
hdfs dfs -setrep 2 /path/to/file (e.g. if it is 3, this sets it to 2)
Eg. The project manager says to set the replication factor to 2 at the entire cluster level. If you change it in the HDFS config, it will affect only the new data; the old data's replication factor will stay at 3 because its metadata is already recorded in the NN.
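To bring the existing data down to 2 as well, setrep can be run over the old paths (the path here is a placeholder; -w waits until re-replication completes):
hdfs dfs -setrep -w 2 /user/project/data (applies to every file under the directory, recursively)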
### How to change block size
By default the block size is 128 MB. If you want to change it to a greater or lesser value, you can change it in the HDFS config; that will impact all data/files stored from then on (existing files keep their old block size).
hdfs dfs -Ddfs.block.size=XXX -cp /<path of file with old block size> /<path to new file with larger block size> (copies the file to a different location with the new block size; XXX is in bytes, and dfs.block.size is the older name of dfs.blocksize)
If you want to save a particular file with a different block size (100 MB), run the above cmd with XXX = 104857600 (100 MB expressed in bytes).
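A quick sketch (the paths are placeholders):
hdfs dfs -Ddfs.blocksize=104857600 -cp /data/in/file.csv /data/in/file_100mb.csv (re-copy with a 100 MB block size)
hdfs dfs -stat "%o" /data/in/file_100mb.csv (verify: prints the new block size in bytes)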