Hadoop Cluster Capacity Planning: achieving maximum efficiency while considering all the requirements.
What is a Hadoop Cluster?
A cluster is a collection of computers interconnected with each other over a
network. Similarly, a Hadoop Cluster is a
collection of computational systems designed and
deployed to store, optimise, and analyse petabytes of Big Data.
Factors deciding the Hadoop Cluster Capacity
Since the introduction of Hadoop, the volume of data has also increased
exponentially.
Capacity planning must also account for data retention, which is when the user removes outdated, invalid, and unnecessary data
from Hadoop storage to save space and improve cluster computation speed.
Data is never stored exactly as it is obtained; it goes through a
process called Data Compression.
Data is encrypted and compressed using various Data Encryption and Data
Compression algorithms so that data security is achieved and the
space consumed to store the data is kept as small as possible.
The workload on the processor can be classified into three types: intensive, normal, and low.
Some jobs, like data storage, cause a low workload on the processor,
while jobs like data querying place intensive workloads on both
the processor and the storage units of the Hadoop
Cluster.
Hardware Requirements for Hadoop Cluster
A Hadoop Cluster architecture basically has the following components:
- NameNode
- Job Tracker
- DataNode
- Task Tracker
NameNode/Secondary NameNode/Job Tracker.
The NameNode and Secondary NameNode servers are dedicated to storing the
metadata. Their hardware requirements are:
| Component | Requirement |
| --- | --- |
| Operating System | 1 Terabyte hard disk space |
| FS-Image | 2 Terabytes hard disk space |
| Other software (ZooKeeper) | 1 Terabyte hard disk space |
| Processor | Octa-core processor, 2.5 GHz |
| RAM | 128 GB |
| Internet | 10 Gbps |
DataNode/Task Tracker
The actual data is stored in the DataNodes, and the Hadoop jobs are executed by the
Task Trackers. Hardware requirements for the DataNode and Task Tracker:
| Component | Requirement |
| --- | --- |
| Number of Nodes | 24 nodes (4 Terabytes each) |
| Processor | Octa-core processor, 2.5 GHz |
| RAM | 128 GB |
| Internet | 10 Gbps |
Operating System Requirement
When it comes to
software, the operating system is the most important choice. You can set up your
Hadoop cluster using the operating system of your choice. A few of the most
recommended operating systems for setting up a Hadoop Cluster are:
- Solaris
- Ubuntu
- Fedora
- RedHat
- CentOS
Sample Hadoop Cluster Plan
Let us assume that we have to deal with a minimum of 10
TB of data and that the data grows gradually, say 25% every
3 months. Looking further ahead, assume that the data also grows every
year and that the data in year 1 is 10,000 TB;
by the end of 5 years, let us assume that it may grow
to 25,000 TB. If we instead assume 25% year-on-year growth plus
10,000 TB of new data per year, then after 5 years the resultant data is nearly 100,000
TB.
So, how exactly can we estimate the number of data nodes that we
might require to handle all this data? The answer is simple: by using the formula
mentioned below.
Hadoop Storage (HS) = (C * R * S) / (1 - i)

Where:
- C = Compression Ratio
- R = Replication Factor
- S = Size of the data to be moved into Hadoop
- i = Intermediate Factor
Calculating the number of nodes required: assuming that we will not be using any
sort of Data Compression, C is 1. The standard replication factor for Hadoop is 3,
and the Intermediate Factor is 0.25. The calculation for Hadoop, in this case,
works out as follows:

HS = (1 * 3 * S) / (1 - 0.25)
HS = 4S
The expected Hadoop Storage, in this case, is 4 times the initial storage. The following
formula can be used to estimate the number of data nodes:

N = HS / D = (C * R * S / (1 - i)) / D

Where D is the disk space available per node.
Let us assume that 25 TB is the available disk space per single node, with each
node comprising 27 disks of 1 TB each (2 TB is dedicated to the Operating System).
Also assume the required Hadoop Storage (HS) to be 5,000 TB.

N = 5000 / 25 = 200

Hence, we need 200 nodes in this scenario.
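As a quick sanity check, here is a minimal Python sketch of the two formulas above. The function and variable names (hadoop_storage, data_nodes, and so on) are illustrative choices, not part of any Hadoop API.

```python
def hadoop_storage(c, r, s, i):
    """HS = (C * R * S) / (1 - i): Hadoop storage needed for data of size S."""
    return (c * r * s) / (1 - i)

def data_nodes(hs, d):
    """N = HS / D: number of data nodes, given usable disk space D per node."""
    return hs / d

# Values from the worked example: no compression (C=1), replication 3, intermediate factor 0.25.
multiplier = hadoop_storage(c=1, r=3, s=1, i=0.25)
print(multiplier)                   # 4.0 -> HS is 4x the incoming data size

# 5,000 TB of required Hadoop storage, 25 TB of usable disk per node.
print(data_nodes(hs=5000, d=25))    # 200.0 -> 200 nodes
```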
----------------------------
YARN schedulers (FIFO, Fair, Capacity) --- assign the resources to a (queue) job
Oozie - job scheduling -- run a job at a particular time
Tell me about yourself? -- no personal things; speak more about technical skills, years of experience, roles and responsibilities
Cluster architecture? -- masters, slaves, total capacity, how much data, h/w configs
Data flow in your project? How the data comes in (Kafka), where you are getting data from, where you are exporting data to for visualization (important question)
HDFS, YARN, ZK, Hive, HBase, Kafka, Sqoop, Oozie, Spark
As an admin -- Oozie, job types, commands
**** cluster planning, upgrade, Spark (Impala only supported in CDH) ------- Aug 8th evening
Cluster planning and scaling - EMR cluster (How to plan a Hadoop cluster?)
How to plan a cluster?
Who will plan the cluster?
Prerequisites while planning a cluster?
Example: a retail company - Flipkart, Amazon, Walmart, Myntra
10,000 products in Flipkart; millions of users access the website (purchases, window shopping)
Flipkart needs to store all the details of its customers (either purchases or window shopping)
If Flipkart is not able to store the data of 1000 customers, it requires infrastructure -- the solution (storing and processing huge data)
Flipkart reaches out to a vendor, TCS -- TCS will assign architects to plan the cluster --- the cluster is planned based on the customer requirements
Prerequisites the architects take into consideration ----- e.g. 500 TB of data in SQL
1. Is there any existing data to move from SQL to Hadoop? (if 500 TB -- 500 x 3 = 1500 TB = 1.5 PB)
2. Daily data growth (how much data gets into the Hadoop cluster). Example: 30 GB of data every day
3. What is the AVG file size (block size)? Example: 30 GB x 30 days = approx 900 GB per month
4. Replication factor
5. Data retention period (how long you want to store data in the Hadoop cluster). Example: 2 years (900 GB x 12 = 10,800 GB, approx 10 TB per year, so 10 TB x 2 years = 20 TB; finally 20 TB x 3 replication factor = 60 TB)
6. How many users access the cluster
7. How many jobs are submitted by the users in a day, OR the AVG utilization of resources
8. Based on the budget, decide the license and software
9. Type of data: structured or unstructured (either batch data or streaming data); based on that we can decide the tools we need to use.
How much actual HDFS space do we require? It should be about 20% more than the calculated HDFS space (a buffer value to maintain additional disk space).
HDFS space = the sum of all the disk space in the Linux servers (compute/slave Linux servers).
If the Linux disk space is full you can't handle new data, so plan additional space, e.g. 30% more (2000 + 600 = 2600 TB = 2.6 PB), as sketched below.
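A minimal Python sketch of this sizing arithmetic, using the example figures from the notes above (30 GB/day ingest, 2-year retention, replication factor 3) plus a configurable buffer; the function and parameter names are illustrative only, not from any Hadoop tool.

```python
def hdfs_capacity_tb(daily_growth_gb, retention_years, replication, buffer_pct):
    """Estimate required HDFS capacity in TB from daily ingest, retention, and replication."""
    yearly_gb = daily_growth_gb * 30 * 12         # ~900 GB/month -> 10,800 GB/year
    retained_tb = yearly_gb * retention_years / 1000
    replicated_tb = retained_tb * replication     # the notes round this to 20 TB x 3 = 60 TB
    return replicated_tb * (1 + buffer_pct)       # extra headroom on top of the raw need

# 30 GB/day, 2 years of retention, replication 3, 20% buffer.
print(hdfs_capacity_tb(30, 2, 3, 0.20))   # ~77.8 TB (the notes round down to ~60 TB before the buffer)
```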
All intermediate mapper outputs are stored in the local file system.
MapReduce job == map (intermediate output stored in the local FS) + reduce
Container logs are stored in the local FS (30% for the local FS, the remaining 70% goes to HDFS)
Building the Hadoop cluster on top of Linux servers (each server will have CPU, memory, and HDD)
We require 2.6 PB of space
2600 TB
Server count?
How many disks per server?
What should be the disk capacity?
What should be the type of disk?
Thumb rule of CDH -- in a single server, 40 to 50 TB of disk space for ideal performance
2600 TB / 40 = 65 servers (20 disks x 2 TB each = 40 TB)
More disks will give you better performance (there will be one header for them to collect blocks)
256 GB RAM, 48 cores for a compute server
Network between the Hadoop cluster servers should be 10 Gbps
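A small Python sketch of the server-count estimate above; the 40 TB-per-server thumb rule and the 20 x 2 TB disk layout come straight from the notes, while the function name and defaults are just illustrative.

```python
import math

def compute_servers(required_tb, disks_per_server=20, disk_tb=2):
    """Servers needed when each compute server offers disks_per_server * disk_tb of raw storage."""
    per_server_tb = disks_per_server * disk_tb   # 20 x 2 TB = 40 TB, inside the 40-50 TB thumb rule
    return math.ceil(required_tb / per_server_tb)

# 2600 TB of required space -> 65 compute servers.
print(compute_servers(2600))   # 65
```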
What is your cluster architecture?
For any project you will have 3 different clusters: a dev cluster, a pre-prod/testing cluster (QA, UAT), and a prod cluster
POC -- proof of concept -- testing needs to be done first, before planning the cluster (e.g. if they want to move from one framework to another framework)
Dev cluster -- if developers write any code, it will not move to the prod Hadoop cluster directly (they test it in the dev cluster)
The QA team tests and verifies the code (and decides on enhancements) -- once QA gives the green signal, it is moved to prod
There will be minimal downtime
What is your prod cluster architecture?
H/W config of your cluster
DELL rack server
JBOD - just a bunch of disks
LVM
RAID -- never use RAID on a Hadoop compute server (only use JBOD)
Why don't we use RAID?
RAID is used to increase throughput or to provide fault tolerance;
it provides high availability of data.
RAID 1 is used for installing the OS in a Hadoop cluster
DEV, QA, Pre-Prod
3 masters (256 GB RAM, 32 cores, 2 TB HDD configured with RAID 1, and the OS on a 100 GB HDD with RAID 1)
3 servers for ZK and Journal Nodes ---- 32 GB RAM, 8 cores, 1 TB HDD - RAID 1
3 servers for Kafka (data disks 1...5, 2 TB x 5 = 10 TB of HDD)
51 servers as compute servers with hardware of 256 GB RAM, 48 cores --- 20 x 2 TB = 40 TB
How many edge nodes (gateway node / connectivity node / application node) do we require per cluster? 100 users -- will be having 1 edge node
How do they configure edge nodes in real time? The user will do SSH to 10.0.0.2 (they will connect to the other server); they will create a virtual IP assigned to a load balancer
Interview questions?
Cluster capacity -- 4 PB
Number of nodes
Total number of disks in the cluster
2 master servers
2 edge servers
5 Kafka servers
80 compute nodes
Hadoop will give good performance when we do auto scaling
What type of disks do we need to use?
Compute servers --- normal SATA, 7.5k to 12k RPM disks
ZK, JN, master --- SSD -- 12k to 15k RPM
Kafka --- SSD -- 15k RPM
S/W: Linux OS RHEL 7.6 version
Java version -- 1.8
MySQL -- 5.7.32
CDH -- 6.x and CDP -- 7.x
Network between servers -- 10 Gbps
How do you plan?
100 TB of data ==== 300 TB of HDFS required + additionally 20% of 300 TB, which is 60 TB == 360 TB
468 TB -- Linux ---- 115
YARN ---- 10 jobs require 100 containers
110 containers are required to process 10 jobs per day
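A hedged Python sketch of this last plan, assuming the figures above: 100 TB of raw data, replication factor 3, a 20% HDFS buffer, the 30% local-FS allowance mentioned earlier (which would make 360 TB of HDFS roughly 468 TB of raw Linux disk), and roughly 10% headroom on containers (10 jobs x 10 containers -> ~110). The helper name and parameters are illustrative only.

```python
def plan(raw_tb=100, replication=3, hdfs_buffer=0.20, local_fs_share=0.30,
         jobs_per_day=10, containers_per_job=10, container_headroom=0.10):
    """Rough end-to-end sizing: HDFS need, raw Linux disk need, and daily YARN containers."""
    hdfs_tb = raw_tb * replication                              # 100 TB x 3 = 300 TB
    hdfs_tb *= (1 + hdfs_buffer)                                # +20% buffer -> 360 TB
    linux_tb = hdfs_tb * (1 + local_fs_share)                   # +30% for local FS/logs -> 468 TB
    containers = jobs_per_day * containers_per_job              # 100 containers
    containers = round(containers * (1 + container_headroom))   # ~110 containers per day
    return hdfs_tb, linux_tb, containers

print(plan())   # (360.0, 468.0, 110)
```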