Capacity Plan






Hadoop cluster capacity planning: sizing a cluster for maximum efficiency while taking all requirements into account.

What is a Hadoop Cluster?

A cluster is a collection of computers interconnected over a network. Similarly, a Hadoop cluster is a collection of computational systems designed and deployed to store, optimise, and analyse petabytes of Big Data.

Factors deciding the Hadoop Cluster Capacity

  • Data Volume 

Since the introduction of Hadoop, the volume of data being handled has grown exponentially, so the expected data volume is the first input to capacity planning.

  • Data Retention

Data retention determines when outdated, invalid, and unnecessary data can be removed from Hadoop storage to save space and improve cluster computation speed.

  • Data Storage

Data is rarely stored exactly as it is obtained; it usually goes through a compression step.

Data is encrypted and compressed using various encryption and compression algorithms so that data security is achieved and the space consumed to store the data is as small as possible.

  • Type of Work Load

Workload on the processor can be classified into three types: intensive, normal, and low.

Some jobs, such as data storage, place a low workload on the processor. Jobs such as data querying place an intense workload on both the processors and the storage units of the Hadoop cluster.

Hardware Requirements for Hadoop Cluster

The Hadoop architecture basically has the following components:

  • NameNode
  • Job Tracker
  • DataNode
  • Task Tracker

NameNode / Secondary NameNode / Job Tracker

The NameNode and Secondary NameNode servers are dedicated to storing the metadata of the cluster. Their hardware requirements are as follows:

Component                      Requirement
Operating System               1 Terabyte hard disk space
FS-Image                       2 Terabytes hard disk space
Other software (ZooKeeper)     1 Terabyte hard disk space
Processor                      Octa-core processor, 2.5 GHz
RAM                            128 GB
Network                        10 Gbps


DataNode/Task Tracker

The actual data is stored on the DataNodes and the Hadoop jobs are executed by the Task Trackers. The hardware requirements for the DataNode/Task Tracker servers are as follows:

Component              Requirement
Number of nodes        24 nodes (4 Terabytes each)
Processor              Octa-core processor, 2.5 GHz
RAM                    128 GB
Network                10 Gbps

Operating System Requirement

When it comes to software, the operating system is the most important choice. You can set up your Hadoop cluster using the operating system of your choice; a few of the most recommended operating systems for a Hadoop cluster are:

  • Solaris
  • Ubuntu
  • Fedora
  • RedHat
  • CentOS 

Sample Hadoop Cluster Plan

Let us assume that the data ingested in year 1 is 10,000 TB and that it grows gradually, say by 25% year over year.

By the end of year 5, the annual intake grows to roughly 25,000 TB, and the total data accumulated over the five years approaches 100,000 TB.
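
As a rough sanity check, here is a minimal sketch of that projection, assuming a flat 25% year-over-year growth on a 10,000 TB year-1 intake (the figures are the illustrative ones above, not recommendations):

# Rough growth projection: 10,000 TB ingested in year 1,
# growing a flat 25% year over year (illustrative figures only).
YEAR1_INTAKE_TB = 10_000
GROWTH_RATE = 0.25
YEARS = 5

cumulative_tb = 0
for year in range(1, YEARS + 1):
    intake_tb = YEAR1_INTAKE_TB * (1 + GROWTH_RATE) ** (year - 1)
    cumulative_tb += intake_tb
    print(f"Year {year}: intake ~ {intake_tb:,.0f} TB, cumulative ~ {cumulative_tb:,.0f} TB")

# Year 5 intake comes out near 24,400 TB (roughly the 25,000 TB quoted above),
# and the cumulative raw data is around 82,000 TB, which the plan rounds up
# toward 100,000 TB before replication.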

So, how exactly can we estimate the number of DataNodes we might require to handle all this data? The answer is simple: by using the formula below.

Hadoop Storage (HS) = (C * R * S) / (1 - i)

Where

  • C= Compression Ratio
  • R= Replication Factor
  • S= Size of the data to be moved into Hadoop
  • i= Intermediate Factor

Calculating the number of nodes required.

Assuming that we will not be using any data compression, C is 1.

The standard replication factor for Hadoop is 3.

With an intermediate factor of 0.25, the calculation in this case works out as follows:

HS = (1 * 3 * S) / (1 - 1/4)

HS = 4S

The required Hadoop storage, in this case, is therefore 4 times the size of the initial data. The following formula can be used to estimate the number of DataNodes:

N = HS / D = (C * R * S / (1 - i)) / D

where D is the disk space available per node.

Let us assume that 25 TB of disk space is available per node: each node comprises 27 disks of 1 TB each, of which 2 TB is set aside for the operating system. Also assume that the required Hadoop storage (HS) works out to 5,000 TB.

N = 5000 / 25 = 200

Hence, we need 200 DataNodes in this scenario.
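
For completeness, here is a minimal sketch of the two formulas, using the same illustrative values (C = 1, R = 3, i = 0.25, 25 TB of usable disk per node); the 1,250 TB input is a hypothetical figure chosen only so that HS comes out to the 5,000 TB used above.

import math

def hadoop_storage_tb(size_tb, compression_ratio=1.0, replication=3, intermediate_factor=0.25):
    # HS = (C * R * S) / (1 - i)
    return (compression_ratio * replication * size_tb) / (1 - intermediate_factor)

def datanodes_required(hs_tb, disk_per_node_tb):
    # N = HS / D, rounded up to whole nodes
    return math.ceil(hs_tb / disk_per_node_tb)

hs = hadoop_storage_tb(size_tb=1250)                 # 4 * 1,250 = 5,000 TB
print(hs)                                            # 5000.0
print(datanodes_required(hs, disk_per_node_tb=25))   # 200 DataNodes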

----------------------------
YARN schedulers (FIFO, Fair, Capacity) -- assign resources to jobs/queues.
Oozie -- job scheduling; runs a job at a particular time.

Tell me about yourself? -- no personal details; speak more about technical skills, years of experience, roles and responsibilities.
Cluster architecture? -- masters, slaves, total capacity, how much data, hardware configs.
Data flow in your project? -- how the data comes in (e.g. Kafka), where you are getting data from, and where you are exporting data for visualization (important question).
HDFS, YARN, ZK, Hive, HBase, Kafka, Sqoop, Oozie, Spark.
As an admin -- Oozie: job types, commands.

**** Cluster planning, upgrade, Spark (Impala is only supported in CDH) ------- Aug 8th evening
Cluster planning and scaling -- EMR cluster (how to plan a Hadoop cluster?)
How to plan a cluster?
Who will plan the cluster?
What are the prerequisites while planning a cluster?
Example: a retail company -- Flipkart, Amazon, Walmart, Myntra.
10,000 products in Flipkart; millions of users access the website (purchases, window shopping).
Flipkart needs to store all the details of customers (either purchases or window shopping).
If Flipkart is not able to store data for thousands of customers, it requires infrastructure -- a solution for storing and processing huge data.
Flipkart reaches out to a vendor, e.g. TCS -- TCS assigns architects to plan the cluster -- the cluster is planned based on the customer's requirements.
Prerequisites the architects take into consideration -- e.g. 500 TB of data currently in SQL (worked through in the sketch after this list):
1. Is there any existing data to move from SQL to Hadoop? (If 500 TB, then 500 x 3 = 1,500 TB = 1.5 PB with replication.)
2. Daily data growth (how much data comes into the Hadoop cluster), e.g. 30 GB of data every day.
3. What is the average file size (block size)? Example: 3. (With ~30 GB/day, the monthly intake works out to roughly 900 GB.)
4. Replication factor.
5. Data retention period (how long you want to keep data in the Hadoop cluster), e.g. 2 years: 900 GB x 12 = 10,800 GB per year, approximately 10 TB, so 10 TB x 2 years = 20 TB, and finally 60 TB after replication (20 TB x 3 = 60 TB).
6. How many users access the cluster?
7. How many jobs are submitted by the users in a day, or the average utilization of resources?
8. Based on the budget, decide on licenses and software.
9. Type of data -- structured or unstructured (batch data or streaming data); based on that we can decide which tools to use.
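
A small sketch of the retention arithmetic above, assuming 500 TB of existing SQL data, 30 GB/day of growth, a 2-year retention period, and replication factor 3 (all figures taken from these notes):

# Rough HDFS sizing from the prerequisites above (illustrative figures).
REPLICATION = 3

existing_tb = 500                                   # existing SQL data to migrate
existing_hdfs_tb = existing_tb * REPLICATION        # 500 x 3 = 1,500 TB = 1.5 PB

daily_growth_gb = 30
monthly_growth_gb = daily_growth_gb * 30            # ~900 GB per month
yearly_growth_tb = monthly_growth_gb * 12 / 1000    # ~10.8 TB per year (notes round to 10 TB)

retention_years = 2
retained_hdfs_tb = yearly_growth_tb * retention_years * REPLICATION

print(existing_hdfs_tb, round(retained_hdfs_tb, 1))  # 1500  64.8 (~60 TB in the notes)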

How much actual HDFS space do we require? It should be about 20% more than the calculated HDFS space (a buffer to maintain additional disk space).
HDFS space = the sum of all the disk space in the Linux servers (the compute/slave Linux servers).
If the Linux disk space is full you can't handle new data, so plan roughly 30% additional local space on top of HDFS (e.g. 2,000 TB + 600 TB = 2,600 TB = 2.6 PB).
All intermediate mapper outputs are stored on the local file system:
MapReduce job = map (intermediate output stored on the local FS) + reduce.
Container logs are also stored on the local FS (about 30% of each server's disk goes to the local FS, the remaining 70% to HDFS).
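
A quick sketch of that split, assuming a 2,000 TB HDFS requirement and the 30% local-filesystem allowance for intermediate map outputs and container logs (figures from the notes above):

# Raw disk = HDFS requirement + local-FS allowance (illustrative split).
hdfs_required_tb = 2000        # calculated HDFS space
local_fs_fraction = 0.30       # intermediate mapper outputs + container logs

local_fs_tb = hdfs_required_tb * local_fs_fraction   # 600 TB
total_raw_tb = hdfs_required_tb + local_fs_tb        # 2,600 TB = 2.6 PB

print(total_raw_tb)   # 2600.0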

Building a Hadoop cluster on top of Linux servers (each server has CPU, memory, and HDDs).
We require 2.6 PB of space:
2,600 TB.
Server count?
How many disks per server?
What should be the disk capacity?
What should be the type of disk?
Thumb rule for CDH -- in a single server, 40 to 50 TB of disk space for ideal performance.
2,600 TB / 40 TB = 65 servers (20 disks x 2 TB each = 40 TB per server).
More disks per server give better performance (each disk has its own head serving blocks, so I/O is spread across spindles).
256 GB RAM, 48 cores for a compute server.
The network between the Hadoop cluster nodes should be 10 Gbps.
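
A sketch of the server-count arithmetic, using the 40 TB-per-server thumb rule from these notes:

import math

# Server count from total raw capacity and per-server disk.
total_raw_tb = 2600
disks_per_server = 20
disk_size_tb = 2
per_server_tb = disks_per_server * disk_size_tb   # 40 TB per server (thumb rule: 40-50 TB)

servers = math.ceil(total_raw_tb / per_server_tb)
print(servers)   # 65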
What is your cluster architecture?
For any project you will have 3 different clusters: a dev cluster, a pre-prod/testing cluster (QA, UAT), and a prod cluster.
POC -- proof of concept: testing is done first, before planning the cluster (for example, when moving from one framework to another).
Dev cluster -- if developers write any code, it does not move to the prod Hadoop cluster directly; they test it in the dev cluster first.
The QA team tests and verifies the code (and decides on enhancements); once QA gives the green signal, it moves to prod.
This keeps downtime to a minimum.
What is your prod cluster architecture?
Hardware config of your cluster?

DELL rack servers.

JBOD -- just a bunch of disks.
LVM.
RAID -- never use RAID on Hadoop compute (data) servers; only use JBOD.
Why don't we use RAID? RAID is meant to increase throughput or provide fault tolerance / high availability of data, but HDFS already replicates blocks across nodes, so JBOD gives better throughput on the data disks.
RAID 1 is used only for installing the OS on Hadoop cluster nodes.

DEV / QA / Pre-Prod cluster layout:
3 masters -- 256 GB RAM, 32 cores, 2 TB HDD configured with RAID 1; the OS sits on a 100 GB HDD, also RAID 1.
3 servers for ZooKeeper and JournalNodes -- 32 GB RAM, 8 cores, 1 TB HDD, RAID 1.
3 servers for Kafka (data disks 1..5, 2 TB x 5 = 10 TB HDD).
51 compute servers -- 256 GB RAM, 48 cores, 20 x 2 TB = 40 TB each.
How many edge nodes (gateway / connectivity / application nodes) do we require per cluster? Roughly 1 edge node per 100 users.
How are edge nodes configured in real time? The user does ssh 10.0.0.2 (connecting to the server); a virtual IP is created and assigned to a load balancer so users land on an available edge node.

Interview questions:
Cluster capacity -- 4 PB.
Number of nodes?
Total number of disks in the cluster?
2 master servers.
2 edge servers.
5 Kafka servers.
80 compute nodes.

Hadoop gives good performance when we do auto scaling.
What type of disk do we need to use?
Compute servers -- normal SATA, 7.2k to 12k RPM disks.
ZK, JN, master servers -- SSD, or 12k to 15k RPM.
Kafka -- SSD, 15k RPM.
Software: Linux OS RHEL 7.6.
Java version -- 1.8.
MySQL -- 5.7.32.
CDH -- 6.x, and CDP -- 7.x.
Network between servers -- 10 Gbps.


How do you plan?
100 TB of data => 300 TB of HDFS required (3x replication) + an additional 20% of 300 TB (60 TB) => 360 TB.
With the local-FS allowance on top, roughly 468 TB of raw Linux disk space -- 115.

YARN -- 10 jobs require 100 containers;
plan for around 110 containers to process 10 jobs per day.
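
A last sketch tying the two buffers together for the 100 TB example above (replication 3, 20% HDFS buffer, 30% local-FS allowance, as in these notes):

# End-to-end sizing for the 100 TB example (illustrative figures).
raw_data_tb = 100
replication = 3

hdfs_tb = raw_data_tb * replication            # 300 TB
hdfs_with_buffer_tb = hdfs_tb * 1.20           # +20% buffer -> 360 TB
linux_raw_tb = hdfs_with_buffer_tb * 1.30      # +30% local FS -> 468 TB

print(hdfs_with_buffer_tb, linux_raw_tb)       # 360.0 468.0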

