Capacity Plan






Hadoop cluster capacity planning: sizing a cluster for maximum efficiency while taking all requirements into account.

What is a Hadoop Cluster?

A cluster is a collection of computers interconnected over a network. Similarly, a Hadoop cluster is a collection of computational systems designed and deployed to store, optimise, and analyse petabytes of Big Data.

Factors deciding the Hadoop Cluster Capacity

  • Data Volume 

Since the introduction of Hadoop, the volume of data being handled has grown exponentially, so the expected data volume is the first input to capacity planning.

  • Data Retention

Data retention determines when outdated, invalid, and unnecessary data can be removed from Hadoop storage to save space and improve cluster computation speed.

  • Data Storage

Data is rarely stored exactly as it is obtained; it usually goes through a compression step.

Data is encrypted and compressed using various encryption and compression algorithms so that data security is achieved and the space consumed to store the data is as small as possible.

  • Type of Work Load

Workload on the processor can be classified into three types: intensive, normal, and low.

Some jobs, such as data storage, place a low workload on the processor. Jobs such as data querying place an intense workload on both the processors and the storage units of the Hadoop cluster.

Hardware Requirements for Hadoop Cluster

The Hadoop architecture basically has the following components:

  • NameNode
  • Job Tracker
  • DataNode
  • Task Tracker

NameNode / Secondary NameNode / Job Tracker

The NameNode and Secondary NameNode servers are dedicated to storing the metadata of the cluster. Their hardware requirements are as follows:

Component                      Requirement
Operating System               1 Terabyte hard disk space
FS-Image                       2 Terabytes hard disk space
Other software (ZooKeeper)     1 Terabyte hard disk space
Processor                      Octa-core processor, 2.5 GHz
RAM                            128 GB
Network                        10 Gbps


DataNode/Task Tracker

The actual data is stored on the DataNodes and the Hadoop jobs are executed by the Task Trackers. The hardware requirements for the DataNode/Task Tracker servers are as follows:

Component              Requirement
Number of nodes        24 nodes (4 Terabytes each)
Processor              Octa-core processor, 2.5 GHz
RAM                    128 GB
Network                10 Gbps

Operating System Requirement

When it comes to software, the operating system is the most important choice. You can set up your Hadoop cluster using the operating system of your choice; a few of the most recommended operating systems for a Hadoop cluster are:

  • Solaris
  • Ubuntu
  • Fedora
  • RedHat
  • CentOS 

Sample Hadoop Cluster Plan

Let us assume that the data ingested in year 1 is 10,000 TB and that it grows gradually, say by 25% year over year.

By the end of year 5, the annual intake grows to roughly 25,000 TB, and the total data accumulated over the five years approaches 100,000 TB.
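
As a rough sanity check, here is a minimal sketch of that projection, assuming a flat 25% year-over-year growth on a 10,000 TB year-1 intake (the figures are the illustrative ones above, not recommendations):

# Rough growth projection: 10,000 TB ingested in year 1,
# growing a flat 25% year over year (illustrative figures only).
YEAR1_INTAKE_TB = 10_000
GROWTH_RATE = 0.25
YEARS = 5

cumulative_tb = 0
for year in range(1, YEARS + 1):
    intake_tb = YEAR1_INTAKE_TB * (1 + GROWTH_RATE) ** (year - 1)
    cumulative_tb += intake_tb
    print(f"Year {year}: intake ~ {intake_tb:,.0f} TB, cumulative ~ {cumulative_tb:,.0f} TB")

# Year 5 intake comes out near 24,400 TB (roughly the 25,000 TB quoted above),
# and the cumulative raw data is around 82,000 TB, which the plan rounds up
# toward 100,000 TB before replication.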

So, how exactly can we estimate the number of DataNodes we might require to handle all this data? The answer is simple: by using the formula below.

Hadoop Storage (HS) = (C * R * S) / (1 - i)

Where

  • C= Compression Ratio
  • R= Replication Factor
  • S= Size of the data to be moved into Hadoop
  • i= Intermediate Factor

Calculating the number of nodes required.

Assuming that we will not be using any data compression, C is 1.

The standard replication factor for Hadoop is 3.

With an intermediate factor of 0.25, the calculation in this case works out as follows:

HS = (1 * 3 * S) / (1 - 1/4)

HS = 4S

The required Hadoop storage, in this case, is therefore 4 times the size of the initial data. The following formula can be used to estimate the number of DataNodes:

N = HS / D = (C * R * S / (1 - i)) / D

where D is the disk space available per node.

Let us assume that 25 TB of disk space is available per node: each node comprises 27 disks of 1 TB each, of which 2 TB is set aside for the operating system. Also assume that the required Hadoop storage (HS) works out to 5,000 TB.

N = 5000 / 25 = 200

Hence, we need 200 DataNodes in this scenario.
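
For completeness, here is a minimal sketch of the two formulas, using the same illustrative values (C = 1, R = 3, i = 0.25, 25 TB of usable disk per node); the 1,250 TB input is a hypothetical figure chosen only so that HS comes out to the 5,000 TB used above.

import math

def hadoop_storage_tb(size_tb, compression_ratio=1.0, replication=3, intermediate_factor=0.25):
    # HS = (C * R * S) / (1 - i)
    return (compression_ratio * replication * size_tb) / (1 - intermediate_factor)

def datanodes_required(hs_tb, disk_per_node_tb):
    # N = HS / D, rounded up to whole nodes
    return math.ceil(hs_tb / disk_per_node_tb)

hs = hadoop_storage_tb(size_tb=1250)                 # 4 * 1,250 = 5,000 TB
print(hs)                                            # 5000.0
print(datanodes_required(hs, disk_per_node_tb=25))   # 200 DataNodes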

----------------------------
YARN schedulers (FIFO, Fair, Capacity) -- assign resources to jobs/queues.
Oozie -- job scheduling; runs a job at a particular time.

Tell me about yourself? -- no personal details; speak more about technical skills, years of experience, roles and responsibilities.
Cluster architecture? -- masters, slaves, total capacity, how much data, hardware configs.
Data flow in your project? -- how the data comes in (e.g. Kafka), where you are getting data from, and where you are exporting data for visualization (important question).
HDFS, YARN, ZK, Hive, HBase, Kafka, Sqoop, Oozie, Spark.
As an admin -- Oozie: job types, commands.

**** Cluster planning, upgrade, Spark (Impala is only supported in CDH) ------- Aug 8th evening
Cluster planning and scaling -- EMR cluster (how to plan a Hadoop cluster?)
How to plan a cluster?
Who will plan the cluster?
What are the prerequisites while planning a cluster?
Example: a retail company -- Flipkart, Amazon, Walmart, Myntra.
10,000 products in Flipkart; millions of users access the website (purchases, window shopping).
Flipkart needs to store all the details of customers (either purchases or window shopping).
If Flipkart is not able to store data for thousands of customers, it requires infrastructure -- a solution for storing and processing huge data.
Flipkart reaches out to a vendor, e.g. TCS -- TCS assigns architects to plan the cluster -- the cluster is planned based on the customer's requirements.
Prerequisites the architects take into consideration -- e.g. 500 TB of data currently in SQL (worked through in the sketch after this list):
1. Is there any existing data to move from SQL to Hadoop? (If 500 TB, then 500 x 3 = 1,500 TB = 1.5 PB with replication.)
2. Daily data growth (how much data comes into the Hadoop cluster), e.g. 30 GB of data every day.
3. What is the average file size (block size)? Example: 3. (With ~30 GB/day, the monthly intake works out to roughly 900 GB.)
4. Replication factor.
5. Data retention period (how long you want to keep data in the Hadoop cluster), e.g. 2 years: 900 GB x 12 = 10,800 GB per year, approximately 10 TB, so 10 TB x 2 years = 20 TB, and finally 60 TB after replication (20 TB x 3 = 60 TB).
6. How many users access the cluster?
7. How many jobs are submitted by the users in a day, or the average utilization of resources?
8. Based on the budget, decide on licenses and software.
9. Type of data -- structured or unstructured (batch data or streaming data); based on that we can decide which tools to use.
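
A small sketch of the retention arithmetic above, assuming 500 TB of existing SQL data, 30 GB/day of growth, a 2-year retention period, and replication factor 3 (all figures taken from these notes):

# Rough HDFS sizing from the prerequisites above (illustrative figures).
REPLICATION = 3

existing_tb = 500                                   # existing SQL data to migrate
existing_hdfs_tb = existing_tb * REPLICATION        # 500 x 3 = 1,500 TB = 1.5 PB

daily_growth_gb = 30
monthly_growth_gb = daily_growth_gb * 30            # ~900 GB per month
yearly_growth_tb = monthly_growth_gb * 12 / 1000    # ~10.8 TB per year (notes round to 10 TB)

retention_years = 2
retained_hdfs_tb = yearly_growth_tb * retention_years * REPLICATION

print(existing_hdfs_tb, round(retained_hdfs_tb, 1))  # 1500  64.8 (~60 TB in the notes)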

How much actual HDFS space do we require? It should be about 20% more than the calculated HDFS space (a buffer to maintain additional disk space).
HDFS space = the sum of all the disk space in the Linux servers (the compute/slave Linux servers).
If the Linux disk space is full you can't handle new data, so plan roughly 30% additional local space on top of HDFS (e.g. 2,000 TB + 600 TB = 2,600 TB = 2.6 PB).
All intermediate mapper outputs are stored on the local file system:
MapReduce job = map (intermediate output stored on the local FS) + reduce.
Container logs are also stored on the local FS (about 30% of each server's disk goes to the local FS, the remaining 70% to HDFS).
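
A quick sketch of that split, assuming a 2,000 TB HDFS requirement and the 30% local-filesystem allowance for intermediate map outputs and container logs (figures from the notes above):

# Raw disk = HDFS requirement + local-FS allowance (illustrative split).
hdfs_required_tb = 2000        # calculated HDFS space
local_fs_fraction = 0.30       # intermediate mapper outputs + container logs

local_fs_tb = hdfs_required_tb * local_fs_fraction   # 600 TB
total_raw_tb = hdfs_required_tb + local_fs_tb        # 2,600 TB = 2.6 PB

print(total_raw_tb)   # 2600.0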

Building a Hadoop cluster on top of Linux servers (each server has CPU, memory, and HDDs).
We require 2.6 PB of space:
2,600 TB.
Server count?
How many disks per server?
What should be the disk capacity?
What should be the type of disk?
Thumb rule for CDH -- in a single server, 40 to 50 TB of disk space for ideal performance.
2,600 TB / 40 TB = 65 servers (20 disks x 2 TB each = 40 TB per server).
More disks per server give better performance (each disk has its own head serving blocks, so I/O is spread across spindles).
256 GB RAM, 48 cores for a compute server.
The network between the Hadoop cluster nodes should be 10 Gbps.
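
A sketch of the server-count arithmetic, using the 40 TB-per-server thumb rule from these notes:

import math

# Server count from total raw capacity and per-server disk.
total_raw_tb = 2600
disks_per_server = 20
disk_size_tb = 2
per_server_tb = disks_per_server * disk_size_tb   # 40 TB per server (thumb rule: 40-50 TB)

servers = math.ceil(total_raw_tb / per_server_tb)
print(servers)   # 65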
What is your cluster architecture?
For any project you will have 3 different clusters: a dev cluster, a pre-prod/testing cluster (QA, UAT), and a prod cluster.
POC -- proof of concept: testing is done first, before planning the cluster (for example, when moving from one framework to another).
Dev cluster -- if developers write any code, it does not move to the prod Hadoop cluster directly; they test it in the dev cluster first.
The QA team tests and verifies the code (and decides on enhancements); once QA gives the green signal, it moves to prod.
This keeps downtime to a minimum.
What is your prod cluster architecture?
Hardware config of your cluster?

DELL rack servers.

JBOD -- just a bunch of disks.
LVM.
RAID -- never use RAID on Hadoop compute (data) servers; only use JBOD.
Why don't we use RAID? RAID is meant to increase throughput or provide fault tolerance / high availability of data, but HDFS already replicates blocks across nodes, so JBOD gives better throughput on the data disks.
RAID 1 is used only for installing the OS on Hadoop cluster nodes.

DEV / QA / Pre-Prod cluster layout:
3 masters -- 256 GB RAM, 32 cores, 2 TB HDD configured with RAID 1; the OS sits on a 100 GB HDD, also RAID 1.
3 servers for ZooKeeper and JournalNodes -- 32 GB RAM, 8 cores, 1 TB HDD, RAID 1.
3 servers for Kafka (data disks 1..5, 2 TB x 5 = 10 TB HDD).
51 compute servers -- 256 GB RAM, 48 cores, 20 x 2 TB = 40 TB each.
How many edge nodes (gateway / connectivity / application nodes) do we require per cluster? Roughly 1 edge node per 100 users.
How are edge nodes configured in real time? The user does ssh 10.0.0.2 (connecting to the server); a virtual IP is created and assigned to a load balancer so users land on an available edge node.

Interview questions:
Cluster capacity -- 4 PB.
Number of nodes?
Total number of disks in the cluster?
2 master servers.
2 edge servers.
5 Kafka servers.
80 compute nodes.

Hadoop gives good performance when we do auto scaling.
What type of disk do we need to use?
Compute servers -- normal SATA, 7.2k to 12k RPM disks.
ZK, JN, master servers -- SSD, or 12k to 15k RPM.
Kafka -- SSD, 15k RPM.
Software: Linux OS RHEL 7.6.
Java version -- 1.8.
MySQL -- 5.7.32.
CDH -- 6.x, and CDP -- 7.x.
Network between servers -- 10 Gbps.


How do you plan?
100 TB of data => 300 TB of HDFS required (3x replication) + an additional 20% of 300 TB (60 TB) => 360 TB.
With the local-FS allowance on top, roughly 468 TB of raw Linux disk space -- 115.

YARN -- 10 jobs require 100 containers;
plan for around 110 containers to process 10 jobs per day.
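
A last sketch tying the two buffers together for the 100 TB example above (replication 3, 20% HDFS buffer, 30% local-FS allowance, as in these notes):

# End-to-end sizing for the 100 TB example (illustrative figures).
raw_data_tb = 100
replication = 3

hdfs_tb = raw_data_tb * replication            # 300 TB
hdfs_with_buffer_tb = hdfs_tb * 1.20           # +20% buffer -> 360 TB
linux_raw_tb = hdfs_with_buffer_tb * 1.30      # +30% local FS -> 468 TB

print(hdfs_with_buffer_tb, linux_raw_tb)       # 360.0 468.0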

