Sqoop






****** July 25 *******
All ecosystem tools use HDFS and YARN in the backend

HDFS--storage
YARN--processing
ZooKeeper--config mgmt tool---------80% of Hadoop

HA (high availability)
good performance while processing data
scalability--add/remove servers

All of the above make up a basic Hadoop cluster.

Hadoop cluster--2PB--200TB

How do we get data into the Hadoop cluster?
sqoop for --historical/batch data
kafka for --streaming data into the cluster

Banking project running MySQL, Oracle (past 4 years of data---1000TB in the current env)
Now they want to migrate to Hadoop (DB--to--Hadoop cluster)
Q. MySQL stores only structured data (copy 500TB of data)--to copy data from SQL to hadoop/hive/hbase they came up with a tool called sqoop (sql+hadoop)
sqoop--imports/exports data from RDBMS to Hadoop and vice versa (migrating structured data from Hadoop to SQL and SQL to Hadoop)

sqoop import---copy data from SQL to Hadoop (only for structured data), see the sketch below
sqoop export----Hadoop to SQL
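A minimal sketch of the two directions, assuming a hypothetical orders table in a sales DB on the MySQL host used later in these notes:

$ sqoop import --connect jdbc:mysql://sam.hadoop.com/sales --username root -P --table orders
  (RDBMS -> HDFS, default 4 mappers)
$ sqoop export --connect jdbc:mysql://sam.hadoop.com/sales --username root -P --table orders --export-dir /user/hdfs/orders
  (HDFS -> RDBMS, reads the part files back into the table)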

Processing data in Hadoop--YARN---MR program      (Hadoop--mapper class, reducer class)
Hive----used to process structured data---HQL language--similar to SQL
HBase--NoSQL DB--columnar storage of data---Hive and HBase store data in HDFS
Hadoop stores different varieties of data---structured, unstructured, semi-structured

1. Open-source project of Apache Hadoop
2. Shipped with CDH
3. Supports importing from SQL to HDFS, HBase, Hive (and exporting from HDFS to SQL)
4. Uses a JDBC connector (Java Database Connectivity) to copy data from source to destination
5. Whenever you run sqoop import/export, by default it uses only 4 mappers (sqoop will not use any reducers)--it launches a map-only MapReduce job (YARN job) that performs the import/export
 Why does it not use any reducers?
 Mapper phase---data is divided into splits and processed as key/value pairs; the reducer phase does aggregation.
 Here we are just copying data from source to destination (not doing any operations on the data), so no reducers are needed.

Architecture of sqoop (importing data from DBs to HDFS)
1. Once the sqoop job is executed:

SQL DB (1GB file) (JDBC connector)---sqoop import---Hadoop (HDFS)
The data is divided based on the number of mappers.
All mapper classes execute and produce output files---these output files are stored in HDFS.

sqoop--stores data in HDFS---processes using YARN
Hive--stores data in HDFS---processes using YARN
HBase--stores data in HDFS---processes using YARN


If you get batch/historical data in an RDBMS and you want to get this data into Hadoop, you use sqoop.

Features:
SQL server--under it we have multiple DBs----each DB has multiple tables---each table has data (in the form of rows and cols)
With sqoop you can import an entire DB, a particular table or all tables, or particular rows/cols (using where/select clauses)--see the example below.
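A sketch of the table/row/column options, assuming the same hypothetical orders table and the connection file created later in these notes (--columns, --where and import-all-tables are standard sqoop options):

$ sqoop import --options-file /var/lib/sqoopconnection --table orders --columns "order_id,amount,order_date" --where "amount > 10000"
  (particular columns and rows only)
$ sqoop import-all-tables --options-file /var/lib/sqoopconnection
  (entire DB: every table in the database named in the connect string)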

All sqoop cmds are executed in the Hadoop cluster.

How to set up the sqoop service in Cloudera:
CM--Add Service--sqoop tool---select the server where we need to install the sqoop client (the gateway connects the Hadoop cluster to the DB)
It is not a master/slave architecture.
The sqoop client needs to be installed on edge nodes (where devs will do import/export). Edge nodes = gateway nodes = connectivity nodes = application nodes.

In real time devs will run the import/export cmds.
Admins need to analyse why a sqoop job failed.

Troubleshooting sqoop jobs (sqoop runs a YARN job for importing and exporting data)--see the consolidated commands below:
1. Ask the dev for the application ID---e.g. application_0001
2. Log in to the RM web UI---open the history---check the size of the data (how many mappers and reducers were launched)
3. Get the container logs using the yarn cmd (yarn logs -applicationId application_0001 > /tmp/app.log) and redirect the logs to a file
vi /tmp/app.log (search for errors/exceptions)--type /error or /exception --type n for the next match
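The same steps as a short command sequence (the application ID and log path are placeholders):

$ yarn logs -applicationId application_0001 > /tmp/app.log
$ grep -inE "error|exception" /tmp/app.log | head
  (quick scan; or open the file in vi and use /error, then n for the next match)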

***** all sqoop cmds 

List all DBs in MySQL from the Hadoop cluster----sqoop list-databases --connect jdbc:mysql://sam.hadoop.com --username root -P
List tables belonging to a particular DB (sales here)---sqoop list-tables --connect jdbc:mysql://sam.hadoop.com/sales --username root -P
Create a connection (options) file: sudo vi /var/lib/sqoopconnection--see the sketch below for its contents

sqoop list-databases --options-file /var/lib/sqoopconnection
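A minimal sketch of what /var/lib/sqoopconnection could contain--in a sqoop options file each option and each value goes on its own line (the password is a placeholder):

--connect
jdbc:mysql://sam.hadoop.com
--username
root
--password
root@123#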

In a maximum of about 2 minutes you can copy 100GB of data.
If you copy 100GB of data on a laptop it will take ~3 hours--here it takes ~2 minutes because the job launches many mappers (e.g. 100) that copy in parallel.

Whenever you run a sqoop import, where/how does it store the data?
I ran the sqoop cmd as the hdfs user---/user/<username>/<table name>/part-m files (mapper outputs)---this is how sqoop stores data in HDFS (with 4 mappers you get part-m-00000, part-m-00001, part-m-00002, part-m-00003), as shown below.
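A quick way to check, assuming a table named CONFIGS was imported as the hdfs user (as in the walkthrough later in these notes):

$ hdfs dfs -ls /user/hdfs/CONFIGS
  (one part-m-* file per mapper, plus a _SUCCESS marker)
$ hdfs dfs -cat /user/hdfs/CONFIGS/part-m-00000 | head
  (comma-separated rows)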

You should not give 777--give r/w/x to user/group/others only as needed.
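For example, to tighten permissions on an imported directory (a sketch using the CONFIGS directory from above; the analytics group is hypothetical):

$ hdfs dfs -chmod -R 750 /user/hdfs/CONFIGS
  (owner full, group read/execute, others nothing)
$ hdfs dfs -chown -R hdfs:analytics /user/hdfs/CONFIGS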
SQL stores data as rows and cols (fields).

Note
In real time you will not run any of these cmds--all cmds are run by devs (source--destination--format--location of output).
Example: a 100GB transaction file--imported from the DB--whenever the client asks for a requirement.
The client wants to send coupons to everyone who purchased more than 10000 on July 27 (7am to 7pm)--only the dev does this (queries the data)--during this process, if a particular job fails, he will reach out to the admin.
Import the data from the DB--run a Hive query--get the output--the analytics team sends the coupon notification to the users.
Sqoop also has the facility to load the imported data straight into Hive, where the dev runs the Hive query (see the sketch below).
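A sketch of a Hive-bound import, assuming a hypothetical transactions table and the connection file from earlier (--hive-import and --hive-table are standard sqoop options):

$ sqoop import --options-file /var/lib/sqoopconnection --table transactions --hive-import --hive-table sales.transactions
  (data lands in the Hive warehouse instead of a plain /user/<username> directory)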

As an Admin
1. How to install sqoop and run basic testing using sqoop cmds
2. Performance tuning in sqoop
3. Debugging sqoop job failures by working with the developer

***** Performance tuning in sqoop  (write the remaining points in the sheet)
Performance mainly depends on--data volume, source system performance, network performance, Hadoop cluster performance.
The higher the number of mappers, the higher the performance (in which scenarios do we use more mappers?).
If you import 50GB of data using 4 mappers it will take a long time--that's why, depending on the data size, we need to increase the number of mappers--see the sketch below.
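A sketch of raising the parallelism, assuming a hypothetical large transactions table with a numeric key column txn_id (--split-by and -m/--num-mappers are standard sqoop options):

$ sqoop import --options-file /var/lib/sqoopconnection --table transactions --split-by txn_id -m 16
  (16 parallel mappers instead of the default 4; --split-by picks the column used to divide the data)

More mappers also means more parallel connections to the source DB, so the source system must be able to handle the extra load.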


************** 26 July
1. Delete MapReduce in CDH because we use YARN.
2. An agent is installed on each and every server; to see its config (see the config.ini excerpt after this list):
cd /etc/cloudera-scm-agent
vi config.ini
3. Install sqoop from CDH.
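The part of config.ini an admin usually checks--server_host points the agent at the Cloudera Manager server (the hostname here is a placeholder; 7182 is the default CM agent port):

[General]
server_host=san.hadoop.com
server_port=7182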

now run sqoop cmds

hdfs@master root $ sqoop
$ sqoop help
$ sqoop list-databases --connect jdbc:mysql://san.hadoop.com --username root -P
enter the password
Log in to MySQL:
mysql -u root -p
enter the password
show databases;
$ sqoop list-tables --connect jdbc:mysql://san.hadoop.com/scm --username root -P   (list tables belonging to the SCM DB)
mysql> use scm;
show tables;
If you want to import tables to HDFS:
vi /var/lib/hadoop-hdfs/sqoopconnection   (works like an alias)
Paste the connection details into the above file (--connect jdbc:mysql://san.hadoop.com/scm --username root --password root@123#)--it stores all the MySQL connectivity details.
Now use a cmd like: $ sqoop list-tables --options-file /var/lib/hadoop-hdfs/sqoopconnection
$ sqoop import --options-file /var/lib/hadoop-hdfs/sqoopconnection --table CONFIGS -m 1   (1 mapper here; default is 4)
Go to CDH and check the job status under YARN--Applications.
If the job failed--ask the dev for the application ID of the job.
$ yarn logs -applicationId <application ID>
You will get the container logs (redirect the logs to a particular file):
$ yarn logs -applicationId <application ID> >> test.txt
vi test.txt
/Error or /exception   (the error here: access denied for user root)
Provide permissions on the particular DB.
Check the CDH installation document for how to provide permissions.
mysql> grant all privileges on the scm DB to the sqoop user (see the sketch below)--------now run sqoop import again
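A sketch of the grant, with the user and host left as placeholders (the exact values come from the CDH installation document):

mysql> GRANT ALL PRIVILEGES ON scm.* TO '<user>'@'<host>';
mysql> FLUSH PRIVILEGES;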
$ sqoop import --options-file /var/lib/hadoop-hdfs/sqoopconnection --table CONFIGS -m 1  (again issues)
In HDFS, if the output directory is already present, the job will fail (so remove it):
$ hdfs dfs -rm -r -f -skipTrash /user/hdfs/CONFIGS
$ sqoop import --options-file /var/lib/hadoop-hdfs/sqoopconnection --table CONFIGS -m 1
Here the number of splits = 1; the job ran as application_43543_534543.
Now in the CDH UI you can see the job is completed.
$ hdfs dfs -ls /user   (the hdfs user ran the job here)
$ hdfs dfs -ls /user/hdfs/
$ hdfs dfs -ls /user/hdfs/CONFIGS
$ hdfs dfs -cat /user/hdfs/CONFIGS/part-m-00000
Each and every field is separated by a comma.
mysql> select * from CONFIGS;   (cross-checking)----check the sqooppactice.txt document for more cmds
