Performance Tuning

*************** Performance Tuning --------------Aug 18


HDFS and YARN

We can do performance tuning at multiple levels: OS level, MySQL level, HDFS level, YARN level (queues, schedulers, memory allocation to containers), and Hive level (partitioning and bucketing).


OS-Tuning

1. Disable transparent huge pages, firewalld, SELinux, and power saver mode.

2. Enable NTPD, HTTPD, and the fastest mirror.

3. Set vm.swappiness=1.


Number of open files allowed per user: /etc/security/limits.conf

Number of processes a user can run: /etc/security/limits.d (see the sketch below)
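
A minimal sketch of these OS-level settings on a RHEL/CentOS-style system (exact paths and service names vary by distro; the hadoop user below is an example):

# Disable transparent huge pages (runtime only; persist via rc.local or a tuned profile)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Disable firewalld and SELinux
systemctl stop firewalld && systemctl disable firewalld
setenforce 0        # runtime only; set SELINUX=disabled in /etc/selinux/config to persist

# Reduce swapping and keep clocks in sync
sysctl -w vm.swappiness=1       # persist by adding vm.swappiness=1 to /etc/sysctl.conf
systemctl start ntpd && systemctl enable ntpd

# /etc/security/limits.conf -- raise the limits for the (example) hadoop user
#   hadoop  -  nofile  65536
#   hadoop  -  nproc   65536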


*** if you have 256 GB of RAM in a worker node ***

DataNode (DN) = 10 GB

NodeManager (NM) = 4 GB

HBase RegionServer (HRS) = 36 GB

YARN containers = 206 GB
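
The container share above maps to the NodeManager resource settings in yarn-site.xml; a sketch with the example numbers (the vcore value is illustrative, since the core count is not given above):

yarn.nodemanager.resource.memory-mb = 210944      # 206 GB handed to YARN containers
yarn.nodemanager.resource.cpu-vcores = 40         # illustrative; size to the node's actual cores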


Overcommit memory?

With overcommit restricted (vm.overcommit_memory = 2), the kernel will not allow processes to allocate more memory than is actually available.


Disable power saver mode so that the server will not get turned off.


Storage layer ... data disk scaling

For Hadoop it is preferable to have a larger number of smaller disks (e.g. 20 x 2 TB = 40 TB) instead of a smaller number of bigger disks, since more spindles give more parallel I/O.

LVM = Logical Volume Manager, RAID = Redundant Array of Independent Disks, JBOD = Just a Bunch Of Disks (in a JBOD configuration every disk is mounted on its own directory).

Better to use 10k-15k RPM disks (a higher spindle speed gives more IOPS).

In production, SSDs are suggested for the NN data dirs, JN data dirs, and ZK data dirs because of the heavier I/O (use RAID 1 mirroring here, so everything is replicated to a second disk). The NN holds the metadata, the JN holds the edits files, and ZK holds all the service state info.

For the DN data dirs, SATA disks configured as JBOD are suggested in production.

For Kafka we can use JBOD or RAID 10 (Kafka also stores its data on the local file system). We do have replication at the partition level, but the replicas live on different servers, so RAID 10 (mirroring + striping) protects the local disks: with 10 x 2 TB disks you get n/2 x 2 TB = 10 TB usable, and the remaining capacity goes to the mirroring.

Mount the data disk volumes with the noatime option (skips the access-time write on every read).
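
A sketch of a matching /etc/fstab entry (device name, mount point, and filesystem are examples):

/dev/sdb1   /data/1   ext4   defaults,noatime   0 0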


*****Quotas in HDFS

Quotas = allocating space to a particular directory, or limiting the number of files in a directory, in HDFS.

Where are they used?

Mainly in clusters shared by multiple customers, i.e. multi-tenant environments where one cluster is shared by multiple teams or customers.


Let's take a simple example: I have a 100 TB cluster being used by the QA team, dev team, BI team, and admin team.

The devs are using 30 TB of space (while planning the cluster we need to decide how much of it each team may use):

QA - 10%    (multi-tenant environment: one cluster used by different teams or customers)

dev - 40%

ML - 10%

AI - 20%

Admin - 10%

Finance - 10% of the cluster size


How can I limit the usage of space in an HDFS cluster? This is where quotas come in.

With the help of HDFS quotas we can limit the size of a directory or the number of files in it.

Name quota: a hard limit on the number of files and directories under a directory.

Space quota: the user can store any number of files, but the total size (in bytes, counting replication) is limited.


Example: we have a directory /user/devteam that may store a max of 1 million files; if the team stores more than 1 million files, HDFS throws a quota-exceeded error.


hdfs dfsadmin -setQuota <N> <directory>          (set the name quota; -clrQuota clears it)
hdfs dfsadmin -setSpaceQuota <N> <directory>     (set the space quota; -clrSpaceQuota clears it)

hdfs dfs -count -q -h <directory>    (shows the quota details)
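
A quick sketch tying these together (the /user/devteam path and the numbers are illustrative):

hdfs dfsadmin -setQuota 1000000 /user/devteam      # name quota: max 1M files + dirs
hdfs dfsadmin -setSpaceQuota 40t /user/devteam     # space quota: 40 TB raw, i.e. including replicas
hdfs dfs -count -q -h /user/devteam                # verify both quotas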

Example: the dev team has 40%, so they get to use 40 TB including replication.

But within that 40 TB they may create a huge number of ~1 MB files. As the number of files keeps increasing, it impacts the cluster: the NameNode's performance degrades, because the NN stores the metadata both in memory and on disk. If one user stores 10 million files, the NN has to track all 10 million of them; the name quota guards against this.


################## Processing layer --- queues: limit the usage of resources ######### Aug 19

HDFS level: quotas.

YARN level: queues (allowing multiple tenants to share the cluster) and schedulers.

Queue name property: "mapreduce.job.queuename" (see the submission sketch below)
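
A sketch of submitting a job to a specific queue (the queue name "dev" and the jar/class names are examples; the -D form works when the MR driver uses ToolRunner):

hadoop jar app.jar MyJob -Dmapreduce.job.queuename=dev <args>
spark-submit --queue dev --class MyApp app.jar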


The total YARN resource capacity can be read from the RM web UI (port 8088), for example 600 GB and 200 cores.
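
The same capacity numbers can also be pulled per node from the CLI; a sketch:

yarn node -list -all          # lists the NodeManagers and their state
yarn node -status <nodeId>    # per-node memory and vcore capacity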

I am working on a Hadoop cluster with 2 masters and 4 slaves.

Each slave has 256 GB RAM and 46 cores; out of the 256 GB we assign DN = 4 GB, NM container memory = 160 GB, and HRS = 26 GB, with 40 cores per node going to YARN.

RM (total YARN resources): 640 GB / 160 cores.

If multiple teams (QA, admin, dev, ML) run in the environment, whenever they submit jobs they draw from these resources. If the QA team's job takes all 160 cores (a container is a combination of memory and cores), then when another user submits a job it will not run (it stays in an unassigned state).

So how can we limit the YARN resources available to a particular team or user?

By limiting each team's usage with queues and schedulers.


How do queues work?

root queue ---- dev (can use a max of 10% of resources) -- default (20%) --- prod (70%)
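
A sketch of the matching settings in capacity-scheduler.xml, written here as property = value pairs (queue names and percentages are the example values above):

yarn.scheduler.capacity.root.queues = dev,default,prod
yarn.scheduler.capacity.root.dev.capacity = 10
yarn.scheduler.capacity.root.default.capacity = 20
yarn.scheduler.capacity.root.prod.capacity = 70
yarn.scheduler.capacity.root.dev.maximum-capacity = 20    # optional cap on elastic growth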


Example: I have given 60 GB + 20 cores to the dev team (users A, B, C, D). If one user takes all the cores, then when B submits a job it goes to an unassigned state.

Preemption (with a specified timeout, e.g. 2.5 min): a queue that needs more resources can borrow them from an underutilized queue (e.g. the QA queue, 30% = 180 GB + 60 cores, users X, Y, Z). When QA needs its resources back, the borrower must return them within 2.5 minutes, otherwise YARN kills containers and hands the resources back to the QA queue.


####### Schedulers: control how the resources are shared among the users and jobs inside the queues

QA, Admin, ML, AI (10%, 10%, 40%, 20%): every team has jobs to deliver (process the data and submit the deliverables).

If the dev queue (60 GB + 20 cores) is shared by users A, B, and C, and user A runs a Spark job with (3 GB + 15 cores), nearly all the resources in dev are occupied by this one user; if user B submits a job at the same time, B has to wait for A's job to complete.

How can we make sure the resources are distributed equally among all the team members?

Schedulers: FIFO, Fair, and Capacity (FFC).

FIFO (the default in Apache Hadoop): first in, first out. Within a queue, whichever user submits a job first gets the resources first; the first user can take all the resources, and when the remaining users submit jobs they stay in an unassigned state. None of the prod customers use this.

Fair (the default scheduler in CDH and MapR): used by the majority of Hadoop clusters. If multiple jobs are running in a queue, the resources are divided equally among all the jobs. In our prod, the default preemption timeout is 2.5 min.
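
A sketch of the Fair scheduler settings behind that preemption behaviour (values are examples):

yarn.scheduler.fair.preemption = true                               # yarn-site.xml
yarn.scheduler.fair.preemption.cluster-utilization-threshold = 0.8
# per-queue timeouts (e.g. fairSharePreemptionTimeout, in seconds) live in the
# Fair scheduler allocation file (fair-scheduler.xml)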

Capacity (the default in Hortonworks HDP) -- covered below.


How the Fair scheduler works:

Example: take a queue with assigned resources (min 60 GB + 20 cores, max 300 GB + 100 cores) and users A, B, C. User A submits a job with 15 executors, B submits one with 10 executors, and C submits one with 4 executors; the resources are assigned fairly/equally across all the users' jobs (multiple jobs within one queue).
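
A worked sketch of the instantaneous fair share at the queue minimum (numbers from the example above; in practice the scheduler rebalances continuously as containers finish):

# 3 users' jobs sharing 60 GB + 20 cores:
#   per-job fair share ≈ 60 GB / 3 = 20 GB and 20 cores / 3 ≈ 6 cores
# a job holding more than its share can be preempted, if preemption is enabled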

Drawback: suppose we have 4 NodeManagers (one NM gives 160 GB RAM + 40 cores, so 4 NM = 640 GB + 160 cores) and the queues' min/max values are defined against that total.

If one of the 4 NMs goes down, the remaining 3 can only give 480 GB + 120 cores, while the queues' min/max are still defined against the old total.

Analogy: I have 10 rupees and give 2 to user A and 4 each to users B and C; if 2 rupees are then taken back so I only have 8, and one of the queues falls short of resources, I can no longer fulfil all the queues' allocations.


Capacity scheduler:

Assigns resources on a percentage basis, so even if one NM goes down the queues keep their percentage shares of whatever capacity remains (preemption is supported).

We can also specify a weight (priority) for each queue.
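
A worked sketch of why percentage shares survive a node failure (numbers from the example cluster above):

# prod queue at 70%: 0.70 x 640 GB = 448 GB with all 4 NMs up
# after one NM fails:  0.70 x 480 GB = 336 GB -- the share still holds,
# only the absolute capacity shrinks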


Tomorrow: quotas, creation of queues and schedulers, and tuning the YARN configurations.

Pending: Impala, Spark, encryption (security), Ansible ------------ resume preparation -- interview preparation ---- data flow of the project


