Tuesday, May 26, 2020

Hadoop Administration Interview Questions

### Basic Questions ###
San Interview list
https://1drv.ms/w/s!AuUEONyMlo6omX17xYykNwhZkdKK?e=x3rhSu
Number of CPU cores in your node system?
Each node has 16 cores with hyperthreading.
RAM size in your node system?
Normally the NameNode has higher RAM, 256 GB.
DataNodes have lower RAM but high hard-disk capacity.
Hard disk size in your node system?
13.8 TB of hard-disk capacity per DataNode.
The NameNode hard disk is only 1.8 TB because the NameNode doesn't store data; everything it processes goes through RAM only.
How many nodes in your cluster? What is the total cluster size?
47 nodes, so RAM = 47 * 256 GB, HDD = 47 * 13.8 TB, CPU = 47 * 16 cores with hyperthreading.
How many racks do you maintain? How many nodes per rack?
3 racks, with roughly 15, 16, and 18 nodes each.
How much data comes in daily?
Not sure exactly, because multiple teams connect. We monitor against an 80% disk-usage threshold (after 3 years, more than 80% of the disk space is still free).
What is your backup policy? How frequently do you take backups?
We have Cloudera BDR (Backup and Disaster Recovery). As per Cloudera's suggestion, we run BDR once a week. We have 2 BDRs; using BDR, the whole cluster's data can be transferred to other clusters.
I am aware of it, but I never worked on the backup part. Generally, I took care of Dev-related issues.
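Under the hood, BDR's HDFS replication is built on DistCp-style copies between clusters. A minimal hedged sketch (the NameNode hosts and paths below are examples, not our actual cluster):
hadoop distcp -update -p hdfs://prod-nn:8020/data/warehouse hdfs://dr-nn:8020/backup/warehouse   # copy changed files, preserving attributes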
How many users will connect?
We have 3 teams (30 to 40 developers) logging in to our cluster. Users are configured in AD (by the Wintel team). They connect through gateway nodes (we have 3 gateway nodes).
How many edge nodes?
The edge node is the gateway node (we have 3 edge nodes). To prevent direct access to the cluster, users go through the edge/gateway nodes.
What are the default tables you see once you log in to the server?
Activity Monitor, Host Monitor, Reports Manager (HCM).
How do you raise a ticket to Cloudera?
You can raise a ticket on the Cloudera support page:
go to the Support tab --> select component --> set severity --> select bundle (cluster-wise or component-wise) --> we select a cluster-wide bundle.
Examples:
1. Impala jobs were failing (the Impala daemon was not configured on 3 of the nodes).
2. The active NameNode was frequently going down and the standby NameNode was taking over as active. Normally we would restart the service, but it kept going down, so we raised a ticket with Cloudera. Their explanation: the NameNode and JournalNode were configured on a single mount point; when both read/write from that one mount, network congestion meant the JournalNode did not get a timely response from the active NameNode, so it marked the active NameNode as dead. We raised a CRQ with the Linux team, they created 3 separate mount points, and the issue was resolved.
Frequent errors you faced in Hive?
At times jobs fail because they cannot update table stats.
In a fresh Hadoop cluster, is the edit log or the fsimage generated first?
The fsimage file is generated first, then the edit logs. The EditLog is a transaction log that records changes to the HDFS file system, or any action performed on the HDFS cluster such as adding a new block, replication, deletion, etc. It records the changes since the last fsimage was created; those changes are then merged into the fsimage to create a new fsimage file. Checkpointing plays a vital role in HDFS: it is the process of merging the fsimage with the latest edit log to create a new fsimage, so the NameNode holds the latest metadata of the HDFS namespace.
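A minimal sketch of inspecting these files and forcing a checkpoint manually (the directory path is illustrative; check dfs.namenode.name.dir for the real location on your cluster):
ls /dfs/nn/current/              # fsimage_* and edits_* files live here
hdfs dfsadmin -safemode enter    # saveNamespace requires safe mode
hdfs dfsadmin -saveNamespace     # merge the edit log into a new fsimage
hdfs dfsadmin -safemode leave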
A dev is trying to run a job and it says the file is not available. What will you do?
That normally can't happen in a Hadoop environment: with a replication factor of 3, how would the file be unavailable?
The file may have been corrupted, which is why he gets that error (still unlikely, because there is a replication factor of 3).
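As a first check, one option is to run fsck on the reported path to see block and replica health (the path is an example):
hdfs fsck /user/dev/path/to/file -files -blocks -locations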
How many RegionServers have you configured?
It is equal to the total number of DataNodes we have.
How does data encryption work in your cluster? If a dev asks for an encrypted file, how will you handle it?
I am not sure about this part.
Difference between Cloudera and Hortonworks?
Cloudera Manager vs. Ambari, and Sentry vs. Ranger.
What is the Load Balancer in Hadoop?
It comes into the picture when we add a new node to an existing cluster. HDFS data might not always be distributed uniformly across DataNodes; the main reason for non-uniform distribution is the addition of new DataNodes to an existing cluster. The HDFS Balancer re-balances data across the DataNodes, moving blocks from over-utilized to under-utilized nodes (e.g., bringing every node to within about 10% of the cluster average).
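A minimal example of triggering it (the 10% threshold matches the example above; adjust for your cluster):
hdfs balancer -threshold 10    # move blocks until each DataNode is within 10% of average utilization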
Production Versions:
Hadoop - 2.7.2
Linux - 7.7
Java - OpenJDK 1.8
Cloudera - CDM 5.14
------------------------------------------------------------------------------------------------------------------------
##### JPM ##### Real-time work
1) If any Hive job or query fails, the devs raise a ticket on the JIRA dashboard; we pick up the JIRA ticket, get on a call with them, and ask for a screen share to troubleshoot.
2) Audit checking: tenants (users) will be there; check tenant HDFS space; if they haven't worked for a long period, remove them (onboarding and off-boarding of users).
3) If there are permission-denied errors, log in to the Sentry UI, check the error, and grant privileges (Hive web UI --> grant).
4) We have dev cluster access (not prod access); it is an 80-node cluster with total storage in petabytes.
5) Data per day: we are not sure.
6) We don't have HBase.
7) We have ecosystem tools like HDFS, Hive, Sqoop, Sentry, and Kerberos.
8) Sqoop imports and exports are done by the devs. If an issue comes up, like an import job failing, the admin checks the logs (typical causes: lack of resources, no JVM found, no container found) and then checks the user's disk quota in YARN / the ResourceManager web UI (there will be a quota; see the sketch below).
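A hedged sketch of the log and quota checks in point 8 (the application ID and tenant path are examples):
yarn logs -applicationId application_1590000000000_0001 | less   # pull the logs of the failed job
hdfs dfs -count -q -h /user/tenant1                               # check the tenant's quota and remaining space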

----------------------------------------------------------------------------------------------------------------------
##### Hadoop CTS - Real-time #####
1) Auditing team / stakeholders: disk access, Linux-level permissions, disk quota used by users. Tools used: Cloudera (Hive, Sqoop, HBase, Kerberos, and Sentry).
2) Hive back-end issues (query logs).
3) Checking YARN logs.
4) If the NameNode is down, restart it (sometimes place it in maintenance mode first).
5) If a Hive query failed, there will be an application ID in the YARN web UI; check the httpd error logs (/var/log/httpd-access.log), or
6) go to the Hue browser and check the query: code or syntax issues, missing resources, and the input/output data size (does the JVM have enough space?). Then go to the dev and sit with them; sometimes the issue is not resolved immediately and takes 2-3 days.
7) In the Dev environment there won't be many issues.
8) We have environments like DEV, UAT, and PROD; I worked on Dev security-level issues.
9) kinit / keytab issues (failures; tail -f the KDC logs). If the KDC server is down, check the KDC mapping and whether the JDBC connection works (check the port number). Kerberos = authentication, Sentry = authorization (Hortonworks uses Ranger for checking read/write permissions), e.g. GRANT SELECT, INSERT, UPDATE, DELETE ON employees TO Kumar; Kerberos only gets the user into the cluster (the user can enter Hive); if the user wants to access a particular table inside Hive, that is Sentry level, and there is a SQL prompt to grant privileges (also cross-check HDFS cluster access).
10) Frequently we get YARN issues: if the devs don't have proper resources (CPU cores, RAM), the job won't run properly, so based on the issue we increase CPU cores or RAM (it's in the YARN settings). Sometimes there is a permission-denied error for a particular user; then we go to the YARN web UI and add that user to the ACL (access control list): there are groups, but we may need to give access to one particular user in a group. At the HDFS level this is setfacl and getfacl (real-time example site: http://www.hadooplessons.info/ ). Space quota: if a user's quota is 50 GB and his job goes beyond 50 GB, he gets an exception that the threshold is reached (the JVM exception is an out-of-memory exception), so we need to increase the quota. The dev raises a JIRA ticket to my team; we take it on a call, with a one-day turnaround, and allocate the ACL/quota to that user (see the sketch below).
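A hedged sketch of the ACL and quota steps in point 10 (user names, paths, and sizes are examples):
hdfs dfs -setfacl -m user:kumar:rwx /data/project1   # grant one user access inside a group-owned directory
hdfs dfs -getfacl /data/project1                     # verify the ACL entries
hdfs dfsadmin -setSpaceQuota 100g /user/kumar        # raise the space quota after the JIRA is approved
hdfs dfs -count -q -h /user/kumar                    # confirm the new quota and remaining space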
-----------------------------------------------------------------------------------------------
Hadoop Real-time Work - Intrinsic Solutions
I have 3 years of experience as a Hadoop admin. I have worked with multiple Hadoop services, and with both Cloudera Manager and Hortonworks. Hadoop, Hive, Spark, YARN, ZooKeeper, and Oozie are the services used in my org (CDH services/apps). Day-to-day activities: processing data, checking the health of the file system, and monitoring the cluster. We got alerts many times; I acknowledged the alerts and found the RCA. If a user is facing an issue, we need to find the root cause for it. I was also part of the upgrade of the Hadoop cluster as well as the deployment of the Hadoop cluster.
What is your deployment process?
First we create a VPC (virtual private cloud) for the Cloudera platform and create public and private subnets in it. We create a security group accordingly, configure the edge node on the public subnet and the rest of the nodes in the private subnet, and configure the Cloudera-recommended prerequisites on the edge node. Once we are done with the edge node, we automate the process by creating an AMI and then launch the cluster in the private subnet. We configure CDH on the private subnet, configure MySQL, JDBC drivers, and the Cloudera Manager repository, install the JDK, and after that start the Cloudera Manager server.
Streaming data?
We have multiple APIs, so we get multiple kinds of data; for that we use Kafka.
How do you get the issue?
Normally the issue comes to our mailbox. Once we get it, we create a JIRA for the issue, then check the logs, do the RCA based on the logs, and give the appropriate permission to the user.
Common issues related to Impala and HBase?
I have not worked on Impala; we have worked only with Hive (though I know the process Impala follows). Common issues: sometimes the coordinator memory is not established properly, and because of resource shortages a query takes more time to execute (Hive). These are the types of common issues we face.
What is your team size?
8 members: 2 senior admins, 2 junior admins, 2 devs, 1 manager, and 1 lead.
Setting up new Hadoop users in a secure cluster?
We add the user to the secure cluster using the adduser command, create their principal, change their ownership, and give appropriate permissions to that user.
Commissioning and decommissioning of nodes?
If data is not being processed on a particular node, we decommission it; if a disk is failing on some nodes, we decommission those nodes as well. The process is: decommission the DataNode and decommission the NodeManager.
If RCA takes time, what will you do?
If the issue is not resolved by us, we raise a ticket with Cloudera (we collect the logs and give them to Cloudera) and ask why we are getting the error (e.g., why both NameNodes are going into safe mode, a recently faced issue).
Importing and exporting data in HDFS?
For that we used Sqoop: sqoop import with a JDBC connection, username and password, and the table name (see the sketch below).
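A hedged sketch of the Sqoop import described above (the JDBC URL, credentials, table, and target directory are examples):
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table customers \
  --target-dir /data/raw/customers \
  --num-mappers 4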

--------------------------------------------------------------------------------------------------------------------
  • Tell me something about Impala and the issues you faced with Impala
  • What new features do you know of in Hadoop 3.0?
  • How do you clear locks in the Hadoop/Hive system?
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/managing-hive/content/hive_view_transaction_locks.html
https://mapr.com/community/s/question/0D50L00006BIt0HSAT/mapr-file-locks-and-hive-queries

How do you find the total number of nodes on which Hadoop is installed?
hdfs dfsadmin -report
The above command shows active and dead nodes.
Why do we use multiple data nodes to store the information in HDFS?
Data replication is an essential part of the HDFS format. 
Since it is hosted on the commodity hardware, it is expected that nodes can go down without any warning.
So, the data is stored in a redundant manner (multiple data nodes) in order to access it at any time. 
HDFS stores a file in a sequence of blocks.
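A couple of hedged commands to inspect and change a file's replication (the path is an example):
hdfs dfs -stat %r /data/raw/customers/part-00000   # show the file's current replication factor
hdfs dfs -setrep -w 3 /data/raw/customers          # set replication to 3 and wait until it completes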

  • --------------------------------------------------------------------------------------------------------
  • What is the file size you've used?
  • How long does it take to run your script in the production cluster? How did you optimize the timings? What challenges did you face?
  • What was the file size in the production environment?
  • Are you planning anything to improve the performance?
  • What did you do to increase performance (Hive, Pig)?
  • What is your cluster size?
  • What are the challenges you have faced in your project? Give 2 examples.
  • How do you debug a production issue? (logs, script counters, JVM)
  • How do you select the ecosystem tools for your project?
  • How many nodes are you using currently?
  • What is the job scheduler you use in the production cluster?
  • --------------------------------------------------------------------------------------------------------
######################### Ericsson  ####################################
Adding HDFS DataNodes to a Cluster?
You can use Cloudera Director to increase the number of HDFS DataNodes in a cluster:

1. Log in to Cloudera Director at http://director-server-hostname:7189.
    Cloudera Director opens with a list of clusters.
2. If the cluster has a status of Ready, click the Actions list box to the right of the target cluster and select Modify Cluster.
The Modify Cluster page appears, displaying the number of gateways, workers, and masters.
3. On the Modify Cluster page, click Edit and increase the number of workers and gateways to the desired size.
Which certificate do you add for a DataNode?
Cloudera recommends obtaining certificates from one of the trusted public certificate authorities (CA) such as Symantec or Comodo for TLS/SSL encryption for the cluster
How do you check the Hadoop version on your system?
$ hadoop version  
Hadoop 2.4.1 
Which package will you install for heartbeat updates?
curl -L -O https://artifacts.elastic.co/downloads/beats/heartbeat/heartbeat-7.7.0-x86_64.rpm
sudo rpm -vi heartbeat-7.7.0-x86_64.rpm
Is the certificate a truststore or keystore certificate, or some other third-party certificate?
Truststore is used for the storage of certificates from the trusted Certificate Authority (CA), which is used in the verification of the certificate provided by the server in an SSL connection. On the other hand, a Keystore is used to store the private key and own identity certificate to be identified for verification.
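A hedged illustration with Java keytool (aliases, file names, and paths are examples; keytool will prompt for passwords and the certificate DN):
keytool -genkeypair -alias node1.example.com -keyalg RSA -keystore node1-keystore.jks   # keystore: the node's own key and identity certificate
keytool -importcert -alias rootCA -file ca.pem -keystore truststore.jks                 # truststore: the CA certificates the node trusts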
What is Beeline in Hive?
Beeline is a Hive client that is included on the head nodes of your HDInsight cluster. To connect to the Beeline client installed on your HDInsight cluster, or install Beeline locally, see Connect to or install Apache Beeline. Beeline uses JDBC to connect to HiveServer2, a service hosted on your HDInsight cluster. You can also use Beeline to access Hive on HDInsight remotely over the internet.
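A minimal hedged example of a Beeline connection string on a Kerberized cluster (the host and principal are examples):
beeline -u "jdbc:hive2://hs2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM"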
Explain Sentry?
Apache Sentry is a granular, role-based authorization module for Hadoop. Sentry provides the ability to control and enforce precise levels of privileges on data for authenticated users and applications on a Hadoop cluster. ... Sentry is designed to be a pluggable authorization engine for Hadoop components.
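A hedged sketch of Sentry-style grants issued from Beeline (the role, group, and table names are examples):
# Sentry authorization is managed through SQL statements in HiveServer2
beeline -u "jdbc:hive2://hs2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "
  CREATE ROLE analyst_role;
  GRANT ROLE analyst_role TO GROUP analysts;
  GRANT SELECT ON TABLE sales.orders TO ROLE analyst_role;"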
Where is the TGT stored in Kerberos?
The client stores the TGT in its local credential cache (on Linux this is typically a file such as /tmp/krb5cc_<uid>); when it expires, the local session manager requests another TGT. This process is transparent to the user.
How do you troubleshoot Kerberos user connectivity issues?
  • Verifying Kerberos Configuration
  • Authenticate to Kerberos using the kinit command-line tool (see the sketch after this list)
  • Troubleshooting using service keytabs maintained by Cloudera Manager
  • Examining Kerberos credentials with klist
  • Reviewing Service Ticket Credentials in Cross Realm Deployments
  • Enabling Debugging in Cloudera Manager for CDH Services
  • Enabling Debugging for Command Line Troubleshooting
  • Failure of the Key Distribution Center (KDC)
  • Missing Kerberos or OS packages or libraries
  • Incorrect mapping of Kerberos REALMs for cross-realm authentication
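A hedged sketch of the first two checks above (the principal and keytab path are examples):
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@EXAMPLE.COM   # obtain a TGT from a keytab
klist                                                                    # confirm the TGT sits in the credential cache
klist -kt /etc/security/keytabs/hdfs.headless.keytab                     # list the principals inside the keytab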
What is data locality?
In Hadoop, Data locality is the process of moving the computation close to where the actual data resides on the node, instead of moving large data to computation.

We can say that Data locality improves the overall execution of the system and makes Hadoop faster. It reduces network congestion.
There are two benefits of data Locality in Hadoop.
i. Faster Execution
ii. High Throughput

Have you done dynamic resource pool allocation for your users? Give a scenario.
Dynamic resource pools allow you to schedule and allocate resources to YARN applications and Impala queries based on a user's access to specific pools and the resources available to those pools. If a pool's allocation is not in use, it can be preempted and distributed to other pools. Otherwise, a pool receives a share of resources according to the pool's weight. Access control lists (ACLs) restrict who can submit work to dynamic resource pools and administer them.
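The pools themselves are defined in Cloudera Manager's Dynamic Resource Pool configuration; a hedged sketch of routing a user's Hive-on-MapReduce job into a specific pool (the pool name, host, and table are examples):
# route this session's MapReduce work into the root.analytics pool
beeline -u "jdbc:hive2://hs2-host:10000/default" \
        -e "SET mapreduce.job.queuename=root.analytics; SELECT COUNT(*) FROM sales.orders;"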

What different issues do you face from users?

  • Misconfiguration
  • Resource exhaustion
  • Network partitions
  • Inefficient cluster utilization
  • Difficulty finding the root cause of problems
  • The business impact of Hadoop inefficiencies

  • HDFS fencing
  • Hive version
  • Load balancing? How will you trigger load balancing? What is the process?
  • How do you troubleshoot long-running jobs?
  • File formats: which is best, and which file formats are used more?
  • I want to explain my cluster setup configuration, so please give me one example to explain in interviews
  • If you have any Sentry video, please share it with me
==================================================================
Capgemini 
1. What will happen when a DataNode fails during a read or write operation?
What is the scenario when a NameNode fails?

2. What is the "shoot the other node in the head" (fencing) concept in HDFS?

3. One of my MapReduce jobs is running and one of my nodes (NodeManager or DataNode) goes down. Will that operation fail?

4. What are the common issues that we get from developers regarding Hive?

5. How will you know that an ingestion in Sqoop has failed, or that a partial ingestion happened? How do you resolve that?

6. In Hive, what does it mean if the metastore server hangs or goes down frequently? How will you look for resolutions, and what do you do first?

7. Let's say a Hive query ran for 5 hours and then failed. What is the role of the admin in handling that scenario?

8. What tips do you give to developers to optimize Hive performance?

===================================================================
Mind Tree



1. What is the default block size in HDFS?

2. What is the benefit of a large block size in HDFS?

3. What are the overheads of maintaining too large Block size?

4. If a file of size 10 MB is copied on to HDFS of block size 256MB, then how much storage will be allocated to the file on HDFS?

5. What are the benefits of the block structure concept in HDFS?

6. What if we upgrade to a Hadoop version in which the default block size is higher than the current version's default, say from 128 MB (Hadoop 0.20.2) to 256 MB (Hadoop 2.4.0)?

7. What is Block replication?

8. What is the default replication factor and how to configure it?

9. What is HDFS distributed copy (distcp)?

10. What is the use of the fsck command in HDFS?


=============================================================