MR - YARN

*** Cloudera installation pre-requisites (July 18)
Cloudera Distribution of Hadoop - CDH
CDP - Cloudera Data Platform - CDP 7.0 (CDP Public Cloud and CDP Data Center)
Current prod version - CDH 5.16.2 (end of support by Dec 2020)
Now upgrading to 6.x

*** Basic pre-requisites to set up a Hadoop cluster (across all the servers) - LJMMSTFNHFP (commands sketched after this list)
1. Same Linux version should be installed - 7.x (7.7 or 7.8)
2. Same Java version - 1.8
3. MySQL server installed on the master server
4. MySQL connector JAR on all servers (/usr/java and /usr/share/java)
5. SELinux should be disabled
6. Transparent Huge Pages (THP) - should be disabled
7. firewalld - disabled
8. NTPD - enabled
9. httpd - enabled
10. Fastest mirror - enabled
11. Powersave mode - disabled
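A rough sketch of how these are usually checked/applied on each server (assuming RHEL/CentOS 7; paths and service names are the stock ones):

    # SELinux off (permanent in the config file, immediate with setenforce)
    sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
    setenforce 0
    sestatus                                  # verify: disabled/permissive
    # Transparent Huge Pages off for the current boot
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    # firewalld off, NTPD/httpd on
    systemctl stop firewalld && systemctl disable firewalld
    systemctl enable ntpd && systemctl start ntpd
    systemctl enable httpd && systemctl start httpd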

Once this is done, we need to install CM (Cloudera Manager) and CDH.
CM - web UI

** Cloudera architecture
1. Building a Hadoop cluster - 3 masters, 5 slave servers
2. Choose one master server to install CM (install on a master server or on a separate small VM)
3. Once CM is installed, CM installs CM agents on all the servers that are part of the cluster - agents send heartbeats to the CM server (web UI on port 7180; agent communication on port 7182)
4. Command to start the CM server: service cloudera-scm-server start/stop/restart (see the sketch after this list)
5. Restart the agents on all servers: service cloudera-scm-agent start/stop/restart - CM agents are deployed on all the servers in the cluster; agents report the status of all services (h/w utilization: memory, CPU, disk) to the master, and the CM server displays it in the web UI
6. Services (HDFS, YARN, Hive, ZooKeeper..) - start and stop them from the UI
7. The CM server sends the request to the Cloudera Manager agent, and the agent stops the respective DataNode
8. If CM is down it only impacts the web UI - all the services keep running fine
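Sketch of the start/verify commands from points 4-5 (service names as in the notes; the port check is just a quick sanity test):

    # On the CM master
    service cloudera-scm-server restart
    service cloudera-scm-server status
    # On every node of the cluster
    service cloudera-scm-agent restart
    # Confirm the web UI (7180) and agent communication (7182) ports are listening
    ss -tlnp | grep -E ':7180|:7182'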

** How did you install the Cloudera cluster
1. Installed CM first
2. Installed CDH packages
3. Ran role distribution

** Importance of MySQL in the Cloudera installation
CM (master) -- CM (agents) 1,...4
Port number of the MySQL service - 3306 (the MySQL service runs on this particular port number)
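Minimal check that MySQL is really serving on 3306 (assumes the mysql client tools are installed on the master):

    ss -tlnp | grep ':3306'          # is anything listening on the MySQL port?
    mysqladmin -u root -p ping       # prints "mysqld is alive" when healthy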

Cloudera Management Services (Reports Manager, Service Monitor, Navigator, Alert Publisher)

All the agents send heartbeats to the CM server.
If the Cloudera Management Services are down, you can't see anything in the web UI.

** Where do the Cloudera Management Services store data
They store data in the MySQL DB.
1. Whenever we install a cluster, we create DB schemas called (creation sketched after this list):
RMAN - Reports Manager
AMON - Activity Monitor
NAV - Navigator
SCM - used by the Cloudera Manager server
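A hedged sketch of creating those DBs before installing CM (MySQL 5.x syntax; the user names and passwords here are placeholders):

    mysql -u root -p <<'SQL'
    CREATE DATABASE scm  DEFAULT CHARACTER SET utf8;
    CREATE DATABASE amon DEFAULT CHARACTER SET utf8;
    CREATE DATABASE rman DEFAULT CHARACTER SET utf8;
    CREATE DATABASE nav  DEFAULT CHARACTER SET utf8;
    GRANT ALL ON scm.*  TO 'scm'@'%'  IDENTIFIED BY 'scm_pass';
    GRANT ALL ON amon.* TO 'amon'@'%' IDENTIFIED BY 'amon_pass';
    GRANT ALL ON rman.* TO 'rman'@'%' IDENTIFIED BY 'rman_pass';
    GRANT ALL ON nav.*  TO 'nav'@'%'  IDENTIFIED BY 'nav_pass';
    SQL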

** If CM is not coming up (not getting started) - you also need to verify whether MySQL is running or not (because the CM server fetches all its details from the SCM DB present in the MySQL server).
If MySQL is down, the SCM DB is unreachable, so CM cannot start.
The services store data in the SQL DB (just config details).
Tools like Hive and Oozie also store their configs in the MySQL DB.
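Troubleshooting sketch when CM won't come up (the log path is the usual default and an assumption here):

    service mysqld status                         # 1. is MySQL up?
    mysql -u scm -p scm -e 'SELECT 1;'            # 2. can the SCM DB be reached? (prompts for password)
    tail -n 100 /var/log/cloudera-scm-server/cloudera-scm-server.log   # 3. look for DB connection errors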

****** YARN ----- Processing layer
Hadoop is divided into 2 layers:
1. Storage layer (HDFS) - HA, scalability, performance while writing/reading
2. Processing layer (MR)

If you want to query data in HDFS, the processing layer helps.
The processing layer is mainly used for processing batch data (historical data) and also helps us process streaming data (live data). E.g. banking: huge amounts of generated data are pushed into HDFS (YARN + Spark is used to process streaming data and provide output in real time).

To process data we use the MapReduce processing layer:
1.x - MapReduce (MRv1)
2.x - YARN (MRv2)
3.x - YARN

*** MRv1 - MapReduce framework to process data (Java program)
I have a file called txs - Flipkart transactions.
The client requested the SUM of txs that happened on 12th June 2020 (10 AM to 1 PM) and the number of txs.
Now the developer needs to check the txs data and provide the output.
cat or vi the txs file.txt (responsibility of the Hadoop dev - write the MR Java code)
If it is only HDFS, it contains: NN, DN, Balancer, SNN
If it is HA: NN(A), NN(S), DN, Balancer, JNs (3), ZK (2 installed in the NNs)
In MR (MRv1) the services are JT (JobTracker - master) and TT (TaskTracker - slave)

JT is installed on master servers.
TTs are installed on slave servers.
TTs send heartbeats to the JT.
Wherever you installed a DN, there you need to install a TT (because of data locality, i.e. process the data on the server where we have our data) - TTs process the data on the same server itself.

Whenever you process any data, the request goes to the JT, and the JT assigns tasks to the TTs (the TTs are responsible for processing the data and sending the output to the JT).
If the JT service is down, then we can't process any data (because the JT is responsible for accepting all the requests). TTs process the data in containers (slots) - a slot is a combination of logical memory and CPU. By default the size of a slot is 1 GB memory + 1 core processor.

E.g. a server having a 10 TB HDD, 256 GB RAM and 48 vCPU cores (the DN service uses the 10 TB, and the memory and cores are used by the TTs to process the data).

I am assigning, out of the 48 cores, 30 cores to the TT, plus 180 GB RAM (the amount of memory varies depending on the size of the data).

** Core - computation unit of a CPU
   A CPU has many cores to perform tasks.
Total no. of tasks performed in parallel by a CPU = no. of cores * no. of HW threads per core
Eg: 1 CPU with 2 cores and hyper-threading of 2 => 2 * 2 = 4 tasks in parallel
If a laptop goes into a hung state, you might have opened more applications than the CPU can handle - open Task Manager and kill some apps. In cores we have a concept called multi-threading (one core can be assigned two different tasks).
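To see this math on a real Linux box:

    # total logical CPUs = cores per socket * threads per core (* sockets)
    lscpu | grep -E '^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket'
    nproc    # e.g. 2 cores * 2 threads = 4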

E.g.: a TT assigned 180 GB RAM and 30 cores can launch 30 slots (containers) in parallel (the TT creates the slots inside the server and assigns memory and cores; arithmetic sketched below).
Heap will occupy 80% of the memory.
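The slot count is just min(slots by memory, slots by cores); quick arithmetic with the numbers above (assuming the 1 GB + 1 core default slot from these notes):

    RAM_GB=180; CORES=30
    SLOTS_BY_RAM=$(( RAM_GB / 1 ))      # 180 slots if memory were the only limit
    SLOTS_BY_CORES=$(( CORES / 1 ))     # 30 slots if cores were the only limit
    # whichever resource runs out first wins: min(180, 30) = 30
    echo $(( SLOTS_BY_RAM < SLOTS_BY_CORES ? SLOTS_BY_RAM : SLOTS_BY_CORES ))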

(Where does the NN store metadata?) The NN stores it in the heap memory of a JVM, or Java process. (Heap memory occupies 80% of the Java space.)
If you give 1 GB of Java memory: 800 MB heap memory and 200 MB JVM components.

Here the JT assigns work to a TT, which launches a JVM with memory and cores (the dev specifies how much memory needs to be launched for a specific job). Once the task completes, the TT frees up the memory that was assigned to the slot (the same memory is reused for different JVM processes).
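In MRv1 the per-task JVM memory is usually set with mapred.child.java.opts; a hedged example of passing it for one job (the jar name, class and paths are placeholders, and -D is only parsed if the job uses ToolRunner):

    hadoop jar my-job.jar MyJob \
      -D mapred.child.java.opts=-Xmx800m \
      /input /output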

** MapReduce architecture
client --> JT --> TT (divides tasks in the form of slots)

2 phases in the MR processing layer:
1. Mapper phase (Map)
2. Reducer phase (Reduce)

JT - roles and responsibilities
1. Master daemon in the MapReduce framework
2. The JT is the first point of contact for all processing requests
3. Assigns tasks to TaskTrackers
4. Keeps track of all TTs and the tasks assigned to the TTs for a specific job
5. The JT is responsible for end-to-end completion of the job
6. The job status is acked back to the client
7. The JT is responsible for resource management, job scheduling and health monitoring

TT - roles and responsibilities
1. Slave daemon in the MapReduce framework
2. TTs are installed on every server where DataNodes are installed
3. TTs execute the tasks assigned by the JT in the form of slots
4. Runs each map/reduce task in a separate JVM
5. The TT updates the status of each map/reduce task to the JT

Simple MapReduce program example (word count) - we want to find the number of words and how many times each is repeated (example run below).
1. Whenever we try to store a file in HDFS it divides the file into blocks (64 MB block in MRv1) - the blocks will be present on different DNs. Based on the data locality concept, the TTs will provide the output. If you want to get accurate output you should not process the data based on blocks; that's why they came up with the concept of splits.
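The stock word-count job can be run without writing code, using the examples jar that ships with Hadoop (the jar path varies by version/distro, so treat it as a placeholder):

    hadoop fs -mkdir -p /user/demo/input
    hadoop fs -put words.txt /user/demo/input/
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount \
        /user/demo/input /user/demo/output
    hadoop fs -cat '/user/demo/output/part-*'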

DN1-TT1, DN2-TT2, DN3-TT3
Block - 64 MB - physical boundary (if the file is more than 64 MB it is stored in a different block)
Split - logical boundary of a particular file (the split size can be greater than the block size; e.g. a 512 MB split size)

Data -- divided into blocks -- blocks are stored in HDFS
The JT connects to the NN and gets the metadata from the NN.
Storing: file -- divided -- blocks -- stored on the DNs (metadata in the NN)
Processing: JT (NN metadata) -- file -- splits -- TT
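To see the physical blocks (and which DNs hold them) behind this metadata, fsck can list them per file (hadoop fsck on 1.x, hdfs fsck on 2.x+; the path is a placeholder):

    hdfs fsck /user/demo/input/words.txt -files -blocks -locations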
