HIVE





***** HIVE (31st July) ***** port for the HIVE server (HS2) is 10000
Example: a project where the data sits in a SQL DB. They migrated the data using Sqoop:
migrated the SQL data to the Hadoop cluster (HDFS)
If you want to process data, you need knowledge of YARN.
To process structured data, HIVE came up with HQL (similar to SQL), so there is no need to write YARN MapReduce code.
HIVE = used to process structured data (data warehousing tool) (Apache open source project) (developed by Facebook) (can be integrated with CDH or HDP)
Mainly used for online analytical processing (OLAP), i.e. batch/historical data (not suitable for OLTP, i.e. transactional data or streaming data)
Storage pattern is the same (SQL/HQL server ---> DBs (API, SMS) ---> tables ---> rows and cols)
Where does HIVE store the data? (HDFS directory /user/hive/warehouse/<db name>.db)
What daemons run for HIVE? (HiveServer2 (HS2) and the Hive metastore (HMS)) both services are installed on the same (master) server; there is no slave daemon for HIVE
Where do we need to install HIVE gateways? (edge nodes / gateway nodes where devs will connect)
How exactly does HIVE work? (it processes only structured data (data should have a proper schema))
Dev-->select * from emp where sal>1000;
Whenever a user runs a HIVE query, YARN initiates a job to process the query (it means HIVE stores data in HDFS and processes it using YARN)
How to execute HIVE queries? (using the HIVE shell -- just type the cmd [hive]) (in real time we will not use the HIVE shell because of security reasons)
We can connect to HIVE using beeline (a cmd-line utility to connect to HIVE) (users will be using their own credentials)
Devs will use beeline for executing HIVE queries (CLI) (just type beeline -u 'jdbc:hive2://<ip of server>:10000', or just type the cmd beeline and then run !connect jdbc:hive2://192.168.0.11:10000)
The other way to execute HIVE queries is HUE -- a GUI to execute queries related to HIVE, Impala, and HBase
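A minimal beeline session sketch (the IP is the one from the note above; the username is a placeholder, and the prompts are roughly what beeline shows):
$ beeline
beeline> !connect jdbc:hive2://192.168.0.11:10000
Enter username for jdbc:hive2://192.168.0.11:10000: dev_user
Enter password: ********
0: jdbc:hive2://192.168.0.11:10000> show databases;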
*** HIVE Architecture
user submits a HIVE query --> authentication and authorization --> HiveServer2 --> checks syntax (parser) --> planner (plans for resources) --> execution (YARN) --> MS (metastore) client --> metastore (stores table schema info belonging to HIVE) --> HDFS
All Hadoop services are written in Java (how does it understand HQL code? -- the Thrift service is used to connect the SQL code with the Java code)
What is the metastore client (MS client)? Whenever you store a table in HIVE -- say 10 GB in size -- the set of rows and cols belonging to that table is the table schema. HIVE stores the table schema in a MySQL DB, and the actual content of the data is stored in HDFS; that is why we need the metastore.
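A sketch of peeking at that split in the metastore, assuming the metastore DB in MySQL is named 'hive' as in this setup (TBLS and DBS are standard metastore schema tables):
$ mysql -u root -p
mysql> use hive;
mysql> select TBL_NAME, TBL_TYPE from TBLS;      -- table schemas live here, not in HDFS
mysql> select NAME, DB_LOCATION_URI from DBS;    -- each DB's HDFS warehouse location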
*** HIVE tables -- 2 different table types (whenever creating a HIVE table we can create a managed or an external table) -- whenever you create a new table, the table data is stored in HDFS and the table schema is stored in the MySQL HIVE DB
managed tables -- if you drop a managed table, it deletes both the table content and the table schema -- means permanently deleted
external tables -- if you drop an external table, only the schema is deleted from the MySQL DB -- the actual content of the data stays in HDFS (we can re-use the data)
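A minimal HQL sketch of the difference (table and column names are made up for illustration):
hive> create table emp_managed (id int, name string);        -- managed: drop deletes schema + data
hive> create external table emp_ext (id int, name string)
    > location '/data/emp';                                  -- external: drop deletes only the schema
hive> drop table emp_ext;                                    -- the files under /data/emp stay in HDFS, re-usable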
Hive is session based -- if the session is closed then everything in it is closed
*** HIVE metastore -- central repository for HIVE metadata.

HIVE installation --HA
HIVE cmds/queries
connecting to HIVE

Installation of HIVE--August 2 
Add service in CDH -- select HIVE --- it asks for the gateway --- Hive DB (hive-hive-pwd) -- test connection -- port number -- after installing, click on Instances and check the status
For testing: # mysql -u root -p <pwd> ---- show databases; --- you can see the hive DB here
*** HTTP port numbers
NN-50070
DN-50075
RM-8088
NM-8042
JHS-19888
zk-2181
HS2--10000
Oozie server -11000
Hue master-8888
Hive metastore (HMS)-9083
Kafka--9092

*** Testing whether it has created the hive directories or not:
hive user directory and hive warehouse dir --- (checked with the commands below)
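A quick check (paths as per the default warehouse location):
$ hdfs dfs -ls /user | grep hive        # hive user directory
$ hdfs dfs -ls /user/hive/warehouse     # hive warehouse directory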
Where does HIVE store its configurations? --- hive-site.xml
HDFS stores all its configs in --- hdfs-site.xml
YARN stores all its configs in --- yarn-site.xml ---------- HDFS and YARN common configs --- core-site.xml
/etc/hadoop/conf --- this dir holds all the config files
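A quick look (on a CDH node, hive-site.xml typically sits under /etc/hive/conf):
$ ls /etc/hadoop/conf
core-site.xml  hdfs-site.xml  yarn-site.xml ...
$ ls /etc/hive/conf
hive-site.xml ...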
How to connect to the HIVE shell? -- 3 ways: the hive cmd, using beeline, or the HUE service web interface (in real time the HIVE shell will be blocked, so mostly beeline is installed)
If you want to install HUE, first you have to install the Oozie service --- add service in CDH -- username, pwd -- test connection (/var/lib/oozie) -- destination path
Now install the HUE service (run queries and scripts) -- add service -- DB setup -- test connection --- if successful, click on continue -- then it will install (finish)
If you want to connect to HIVE -- open HUE -- Instances -- the Hue server is running --- login to the Hue web UI (hue / pwd: hue) -- now you can see the Hue web UI -- it is already connected to HIVE here
Select -- Hue -- Query -- Hive --- run the cmd ---- show databases; -- you can run the same cmd in the backend:
Open the server -- $hive -- type hive> show databases; -- you can see 'default' --- create a test db for testing --- create database test; --- create database retail_hive;
use retail_hive; -- desc <table name>;
Create a table for storing transactions: create table <table name> (...) location '/user/hive/warehouse/...';
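A minimal sketch of those steps in the hive shell (the txns table and its columns are made up for illustration):
hive> create database retail_hive;
hive> use retail_hive;
hive> create table txns (txn_id int, amount double, category string)
    > row format delimited fields terminated by ','
    > location '/user/hive/warehouse/retail_hive.db/txns';
hive> desc txns;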
******* What are the Hadoop admin tasks in HIVE?
1. HIVE installation -- one-time task
2. Troubleshooting HIVE issues (whenever a user submits a HIVE query, it executes using YARN)
3. Monitoring the HIVE service
4. Performance tuning, HA, and upgrades of HIVE
*** Types of tables in HIVE? -- managed (schema and original content are deleted on drop) and external (only the schema is deleted; the data stays in HDFS)
1. When you create any table, by default it is a managed table
2. Any time you can convert a managed table to an external table: alter table <table name> set tblproperties('EXTERNAL'='TRUE');

$hdfs dfs -ls /user/hive/warehouse ----hive data dir
$hdfs dfs -ls /user/hive/warehouse/retail_hive.db/tables..
Login to hive: $hive -----> show databases; -- use <db>; --- show tables; describe formatted <tablename>; (you can see here which user created this table) -- the create time is an epoch timestamp -- use an epoch converter to see the time in a human-readable format
select count(*) from emp; --- counts the number of records -- it won't run a MapReduce job
select * from <table name> -- it is processing data ---- this will run a MapReduce (YARN) job

Example -- the file is already there in HDFS --- how to create a hive table on it:
Create the schema --- create database smsv1 ---- load cmds will also be there -- load data local inpath '/home/gggg' into table <table name>;
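A minimal load sketch (DB name from the note; file paths and columns are made up for illustration):
hive> create database smsv1;
hive> use smsv1;
hive> create table msgs (id int, body string)
    > row format delimited fields terminated by ',';
hive> load data local inpath '/home/user/msgs.csv' into table msgs;   -- copies from the local FS
hive> load data inpath '/data/msgs.csv' into table msgs;              -- for a file already in HDFS, drop LOCAL (the file is moved into the table dir)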

If you want to connect via beeline -- enter $beeline -- then connect to the hive shell: !connect jdbc:hive2://<server ip>:10000 (it will ask for username and pwd)

How to troubleshoot HIVE issues? How to check the HIVE logs -- when the HIVE service goes down:
Go to CDH -- Diagnostics and check the logs
By default on a Linux machine all service logs are stored under the /var/log directory -- /var/log/hadoop-hdfs -- /var/log/hadoop-yarn -- /var/log/hive -- /var/log/kafka

# cd /var/log/hive -- press enter --- ll -- you can see the logs -- just by looking at the log file you can determine the basic things
$ vi hadoop-cmf-hive-hiveserver2........ (check the log in vi)
Open Hue -- create database hadoop -- in parallel open -- tail -f <log> -- it will append the logs ---- the same you can check in the CDH UI -- Hive -- Diagnostics -- filter services -- logs
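A sketch of watching the HS2 log on the server (the exact log file name varies per host; the pattern below follows the note above):
# cd /var/log/hive
# ls -l
# tail -f hadoop-cmf-hive-HIVESERVER2-<hostname>.log.out   # new lines get appended while queries run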

***** HIVE table partitions and buckets -- how do they work?
Example -- I have a table consisting of student details in India -- India has states -- districts -- cities, towns, municipalities -- so here the table has state and caste columns
Client request -- get the details of all OC students who belong to Kerala (there are 50 lakh people across India) -- we can't get it with ctrl+f
select * from students where state='andhra' ---- scanning the entire table takes a long time -- we can't get good performance -- so partitions and bucketing came into the picture
If you store the data in one file and search it, it will search the entire file
Partitions: sub-dirs under a table: separate the table based on state (divide tables based on a primary key) -- to get more performance, create buckets under partitions -- OC candidate details in one bucket, SC candidate details in another bucket, for more performance
/user/hive/warehouse/student.db/student_table -- hive warehouse dir, db name, table name
Under a single partition you can store a max of 256 buckets -- distributes the execution load horizontally -- query performance will be faster by avoiding a full table scan
*** Whenever a dev is trying to scan a table and the table size is big, we need to suggest creating partitions
Example: txs table -- stored in /user/hive/warehouse/test.db/txs_table -- here create one more table tcs_table partitioned by (year string)
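A minimal sketch of a partitioned + bucketed table (table and column names are made up for illustration):
hive> create table student_details (id int, name string, caste string)
    > partitioned by (state string)
    > clustered by (caste) into 8 buckets;
-- each state becomes a sub-dir, e.g. .../student_details/state=kerala/
hive> select * from student_details where state='kerala';   -- scans only that partition, not the full table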
*** HIVE table locks?
Ex: there is a table and a user is trying to update it (another user who is trying to scan the table won't get proper data because the other user is updating it)
This is the reason we use locks (shared and exclusive)
shared -- for any kind of select queries
exclusive -- for any kind of insert, drop, and a few alter table cases
** Debug locks?
show locks <table name>;
show locks <table name> extended;
lock table <table name> shared; -- in HUE -- results -- 'explicit' means we have locked it manually
When and where do we use HIVE?
Not meant for real-time reporting purposes; used to store and process structured data
We can also perform updates on HIVE data
*** Sentry was removed from CDH --- no need to learn it (Ranger is the replacement in the upgraded one)
HIVE performance tuning? -- if a user is passing a limited number of mappers and reducers, then we can increase it:
> set mapreduce.job.maps=20;
Every query run is a MapReduce job -- you can see it in the Resource Manager web UI
*** Common issues in HIVE?
1. Lock issues
2. HIVE server login issues
3. Huge number of HIVE queries running in the cluster -- HS2 -- HMS -- MySQL (tuning)
4. We need to tune the MAX connections (2k, 5k; in our env 50k is configured)
5. Query taking a long time to scan a table (check the size of the table they are querying, how many resources they are using, resource utilization in the cluster) -- ask for the app ID and check in RM -- click on Counters -- for the no. of mappers and reducers (if 40 maps and 20 resources, it will be slow)
6. Sometimes users may not be able to execute HIVE queries because of package installation issues
7. If a HIVE query fails -- app ID -- RM -- troubleshoot the issue

*** For real-time issues? --- go to the Cloudera community and search there -- filter with labels (hive) -- here you will get real-time issues
You can also get free training -- register at cloudera.com -- training.html

*** HiveServer2 and Hive metastore HA and load balancing -- Aug 2 evening -- practicals
How ZooKeeper handles HIVE HA -- refer to the Word document
Whenever we set up a Hadoop cluster: NN (active) and SNN (standby)
YARN HA: RM active and RM standby -- elector: ZK
ZK HA (3 ZooKeeper servers) -- together they form a quorum -- only one ZK failure is allowed -- failover is handled by the ZKFC service
Kafka -- 3 brokers -- rep=3
Sqoop has no daemon -- it's a client -- no HA
Hive metastore (stores table schema info in MySQL/Oracle) and HS2 -- we need to enable HA for HMS and HS2
How to enable HA and load balancing of HIVE queries between multiple HS2s:
Login to CM --- click on HIVE -- under it you see Instances -- only one HMS and HS2 -- now click on Add Role Instances -- select HMS and HS2 -- select the slave servers -- continue -- now go back and click on HIVE -- you will see a stale configuration until you save it === select rolling restart -- restart now
After the restart, click on HIVE -- Instances -- check that multiple HMS and HS2 services got installed

*********** Load balancing between HS2s
# beeline -u "jdbc:hive2://slave2.hadoop.com:10000........" enter ------- it is connected to slave 1
Now try connecting to slave2 -- open 2 terminals -- this is how ZooKeeper balances the connections
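With HS2 HA, clients usually connect through ZooKeeper service discovery instead of a fixed HS2 host, so ZK hands out a live HS2 -- a sketch (the ZK hostnames are placeholders for this cluster):
# beeline -u "jdbc:hive2://master1.hadoop.com:2181,slave1.hadoop.com:2181,slave2.hadoop.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"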
# mysql -u root -p   -- pwd
> show databases; > use hive; > show tables; -- check the metastore DB properties tables
If you want to tune the MySQL connections ---- edit the /usr/my.cnf file -- once you tune it, run service mysqld restart
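A minimal tuning sketch (the max_connections value is just an example; pick it per your env):
# vi /usr/my.cnf
    [mysqld]
    max_connections = 550
# service mysqld restart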


********* MYSQL HA ********
CM--SCM DB
Hive--Hive DB
CMS (Cloudera Management Service) --- Reports Manager uses the RMAN DB --- Navigator uses the NAV DB ----- Activity Monitor uses the AMON DB
The Hue service uses the Hue DB
Oozie will use the Oozie DB
This is the reason MySQL should be HA ----- the DB team will take care of this
If you want to restart MySQL, you need to take care of all these particular services.
HIVE completed 


Tomorrow: HBase (2 classes), then Oozie and Impala (similar to HIVE) and Spark ---

HDFS
YARN
Zoo keeper
HIVE
HBase
Kafka --- data getting generated in the Hadoop cluster
spark
impala
oozie 
Sqoop

security, disaster recovery, performance tuning


Cloudera installation

CM pkgs
CDH pkgs

5.x: Kafka and Spark pkgs

6.x: CM and CDH


Linux server
ip/hostname
prerequisites (see the command sketch after this list)
SELinux disabled
firewalld disabled
mysql enabled
httpd enabled
ntpd enabled
vm.swappiness = 1
fastest mirror
power saving
mysql connector
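A minimal command sketch of these prerequisites on a RHEL/CentOS node (assumes systemd; adjust service names to your env):
# setenforce 0                                                    # SELinux off for the session
# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
# systemctl disable --now firewalld
# systemctl enable --now mysqld httpd ntpd
# sysctl -w vm.swappiness=1
# echo 'vm.swappiness = 1' >> /etc/sysctl.conf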


CM --- dedicated master server
ip:7180 ---- next -- next --- agents and daemons on all the servers
CDH pkgs
If there is no internet --- download the CM pkgs and copy them to /var/www/html/CM/
create a repo
/etc/yum.repos.d/...
vi cloudera.repo ---- copy it here
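A sketch of what that local repo file can look like (the baseurl host is a placeholder pointing at the /var/www/html/CM copy above):
# vi /etc/yum.repos.d/cloudera.repo
    [cloudera-manager]
    name=Cloudera Manager local repo
    baseurl=http://<webserver-host>/CM/
    enabled=1
    gpgcheck=0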
/opt/cloudera/parcel-repo/ (CDH parcels)
copy CDH and restart the Cloudera server again
passwordless SSH


