Sqoop






****** July 25 *******
All ecosystem tools use HDFS and YARN in the backend

HDFS--storage
YARN--processing
ZooKeeper--config mgmt tool---------80% of Hadoop

HA (high availability)
good performance while processing data
scalability--add/remove servers

All of the above make up a basic Hadoop cluster.

Hadoop cluster--2PB--200TB

How do we get data into the Hadoop cluster?
sqoop for --historical/batch data
kafka for --streaming data into the cluster

Banking project running MySQL, Oracle (past 4 years of data---1000TB in the current env)
Now they want to migrate to Hadoop (DB--to--Hadoop cluster)
Q. MySQL stores only structured data (copy 500TB of data)--to copy data from SQL to hadoop/hive/hbase they came up with a tool called sqoop (sql+hadoop)
sqoop--imports/exports data from RDBMS to Hadoop and vice versa (migrating structured data from Hadoop to SQL and SQL to Hadoop)

sqoop import---copy data from SQL to Hadoop (only for structured data), see the sketch below
sqoop export----Hadoop to SQL
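A minimal sketch of the two directions, assuming a hypothetical orders table in a sales DB on the MySQL host used later in these notes:

$ sqoop import --connect jdbc:mysql://sam.hadoop.com/sales --username root -P --table orders
  (RDBMS -> HDFS, default 4 mappers)
$ sqoop export --connect jdbc:mysql://sam.hadoop.com/sales --username root -P --table orders --export-dir /user/hdfs/orders
  (HDFS -> RDBMS, reads the part files back into the table)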

Processing data in Hadoop--YARN---MR program      (Hadoop--mapper class, reducer class)
Hive----used to process structured data---HQL language--similar to SQL
HBase--NoSQL DB--columnar storage of data---Hive and HBase store data in HDFS
Hadoop stores different varieties of data---structured, unstructured, semi-structured

1. Open-source project of Apache Hadoop
2. Shipped with CDH
3. Supports importing from SQL to HDFS, HBase, Hive (and exporting from HDFS to SQL)
4. Uses a JDBC connector (Java Database Connectivity) to copy data from source to destination
5. Whenever you run sqoop import/export, by default it uses only 4 mappers (sqoop will not use any reducers)--it launches a map-only MapReduce job (YARN job) that performs the import/export
 Why does it not use any reducers?
 Mapper phase---data is divided into splits and processed as key/value pairs; the reducer phase does aggregation.
 Here we are just copying data from source to destination (not doing any operations on the data), so no reducers are needed.

Architecture of sqoop (importing data from DBs to HDFS)
1. Once the sqoop job is executed:

SQL DB (1GB file) (JDBC connector)---sqoop import---Hadoop (HDFS)
The data is divided based on the number of mappers.
All mapper classes execute and produce output files---these output files are stored in HDFS.

sqoop--stores data in HDFS---processes using YARN
Hive--stores data in HDFS---processes using YARN
HBase--stores data in HDFS---processes using YARN


If you get batch/historical data in an RDBMS and you want to get this data into Hadoop, you use sqoop.

Features:
SQL server--under it we have multiple DBs----each DB has multiple tables---each table has data (in the form of rows and cols)
With sqoop you can import an entire DB, a particular table or all tables, or particular rows/cols (using where/select clauses)--see the example below.
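A sketch of the table/row/column options, assuming the same hypothetical orders table and the connection file created later in these notes (--columns, --where and import-all-tables are standard sqoop options):

$ sqoop import --options-file /var/lib/sqoopconnection --table orders --columns "order_id,amount,order_date" --where "amount > 10000"
  (particular columns and rows only)
$ sqoop import-all-tables --options-file /var/lib/sqoopconnection
  (entire DB: every table in the database named in the connect string)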

All sqoop cmds are executed in the Hadoop cluster.

How to set up the sqoop service in Cloudera:
CM--Add Service--sqoop tool---select the server where we need to install the sqoop client (the gateway connects the Hadoop cluster to the DB)
It is not a master/slave architecture.
The sqoop client needs to be installed on edge nodes (where devs will do import/export). Edge nodes = gateway nodes = connectivity nodes = application nodes.

In real time devs will run the import/export cmds.
Admins need to analyse why a sqoop job failed.

Troubleshooting sqoop jobs (sqoop runs a YARN job for importing and exporting data)--see the consolidated commands below:
1. Ask the dev for the application ID---e.g. application_0001
2. Log in to the RM web UI---open the history---check the size of the data (how many mappers and reducers were launched)
3. Get the container logs using the yarn cmd (yarn logs -applicationId application_0001 > /tmp/app.log) and redirect the logs to a file
vi /tmp/app.log (search for errors/exceptions)--type /error or /exception --type n for the next match
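The same steps as a short command sequence (the application ID and log path are placeholders):

$ yarn logs -applicationId application_0001 > /tmp/app.log
$ grep -inE "error|exception" /tmp/app.log | head
  (quick scan; or open the file in vi and use /error, then n for the next match)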

***** all sqoop cmds 

List all DBs in MySQL from the Hadoop cluster----sqoop list-databases --connect jdbc:mysql://sam.hadoop.com --username root -P
List tables belonging to a particular DB (sales here)---sqoop list-tables --connect jdbc:mysql://sam.hadoop.com/sales --username root -P
Create a connection (options) file: sudo vi /var/lib/sqoopconnection--see the sketch below for its contents

sqoop list-databases --options-file /var/lib/sqoopconnection
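A minimal sketch of what /var/lib/sqoopconnection could contain--in a sqoop options file each option and each value goes on its own line (the password is a placeholder):

--connect
jdbc:mysql://sam.hadoop.com
--username
root
--password
root@123#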

In a maximum of about 2 minutes you can copy 100GB of data.
If you copy 100GB of data on a laptop it will take ~3 hours--here it takes ~2 minutes because the job launches many mappers (e.g. 100) that copy in parallel.

Whenever you run a sqoop import, where/how does it store the data?
I ran the sqoop cmd as the hdfs user---/user/<username>/<table name>/part-m files (mapper outputs)---this is how sqoop stores data in HDFS (with 4 mappers you get part-m-00000, part-m-00001, part-m-00002, part-m-00003), as shown below.
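A quick way to check, assuming a table named CONFIGS was imported as the hdfs user (as in the walkthrough later in these notes):

$ hdfs dfs -ls /user/hdfs/CONFIGS
  (one part-m-* file per mapper, plus a _SUCCESS marker)
$ hdfs dfs -cat /user/hdfs/CONFIGS/part-m-00000 | head
  (comma-separated rows)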

You should not give 777--give r/w/x to user/group/others only as needed.
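For example, to tighten permissions on an imported directory (a sketch using the CONFIGS directory from above; the analytics group is hypothetical):

$ hdfs dfs -chmod -R 750 /user/hdfs/CONFIGS
  (owner full, group read/execute, others nothing)
$ hdfs dfs -chown -R hdfs:analytics /user/hdfs/CONFIGS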
SQL stores data as rows and cols (fields).

Note
In real time you will not run any of these cmds--all cmds are run by devs (source--destination--format--location of output).
Example: a 100GB transaction file--imported from the DB--whenever the client asks for a requirement.
The client wants to send coupons to everyone who purchased more than 10000 on July 27 (7am to 7pm)--only the dev does this (queries the data)--during this process, if a particular job fails, he will reach out to the admin.
Import the data from the DB--run a Hive query--get the output--the analytics team sends the coupon notification to the users.
Sqoop also has the facility to load the imported data straight into Hive, where the dev runs the Hive query (see the sketch below).
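A sketch of a Hive-bound import, assuming a hypothetical transactions table and the connection file from earlier (--hive-import and --hive-table are standard sqoop options):

$ sqoop import --options-file /var/lib/sqoopconnection --table transactions --hive-import --hive-table sales.transactions
  (data lands in the Hive warehouse instead of a plain /user/<username> directory)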

As an Admin
1. How to install sqoop and run basic testing using sqoop cmds
2. Performance tuning in sqoop
3. Debugging sqoop job failures by working with the developer

***** Performance tuning in sqoop  (write the remaining points in the sheet)
Performance mainly depends on--data volume, source system performance, network performance, Hadoop cluster performance.
The higher the number of mappers, the higher the performance (in which scenarios do we use more mappers?).
If you import 50GB of data using 4 mappers it will take a long time--that's why, depending on the data size, we need to increase the number of mappers--see the sketch below.
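A sketch of raising the parallelism, assuming a hypothetical large transactions table with a numeric key column txn_id (--split-by and -m/--num-mappers are standard sqoop options):

$ sqoop import --options-file /var/lib/sqoopconnection --table transactions --split-by txn_id -m 16
  (16 parallel mappers instead of the default 4; --split-by picks the column used to divide the data)

More mappers also means more parallel connections to the source DB, so the source system must be able to handle the extra load.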


************** 26 July
1. Delete MapReduce in CDH because we use YARN.
2. An agent is installed on each and every server; to see its config (see the config.ini excerpt after this list):
cd /etc/cloudera-scm-agent
vi config.ini
3. Install sqoop from CDH.
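The part of config.ini an admin usually checks--server_host points the agent at the Cloudera Manager server (the hostname here is a placeholder; 7182 is the default CM agent port):

[General]
server_host=san.hadoop.com
server_port=7182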

now run sqoop cmds

hdfs@master root $ sqoop
$ sqoop help
$ sqoop list-databases --connect jdbc:mysql://san.hadoop.com --username root -P
enter the password
Log in to MySQL:
mysql -u root -p
enter the password
show databases;
$ sqoop list-tables --connect jdbc:mysql://san.hadoop.com/scm --username root -P   (list tables belonging to the SCM DB)
mysql> use scm;
show tables;
If you want to import tables to HDFS:
vi /var/lib/hadoop-hdfs/sqoopconnection   (works like an alias)
Paste the connection details into the above file (--connect jdbc:mysql://san.hadoop.com/scm --username root --password root@123#)--it stores all the MySQL connectivity details.
Now use a cmd like: $ sqoop list-tables --options-file /var/lib/hadoop-hdfs/sqoopconnection
$ sqoop import --options-file /var/lib/hadoop-hdfs/sqoopconnection --table CONFIGS -m 1   (1 mapper here; default is 4)
Go to CDH and check the job status under YARN--Applications.
If the job failed--ask the dev for the application ID of the job.
$ yarn logs -applicationId <application ID>
You will get the container logs (redirect the logs to a particular file):
$ yarn logs -applicationId <application ID> >> test.txt
vi test.txt
/Error or /exception   (the error here: access denied for user root)
Provide permissions on the particular DB.
Check the CDH installation document for how to provide permissions.
mysql> grant all privileges on the scm DB to the sqoop user (see the sketch below)--------now run sqoop import again
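A sketch of the grant, with the user and host left as placeholders (the exact values come from the CDH installation document):

mysql> GRANT ALL PRIVILEGES ON scm.* TO '<user>'@'<host>';
mysql> FLUSH PRIVILEGES;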
$ sqoop import --options-file /var/lib/hadoop-hdfs/sqoopconnection --table CONFIGS -m 1  (again issues)
In HDFS, if the output directory is already present, the job will fail (so remove it):
$ hdfs dfs -rm -r -f -skipTrash /user/hdfs/CONFIGS
$ sqoop import --options-file /var/lib/hadoop-hdfs/sqoopconnection --table CONFIGS -m 1
Here the number of splits = 1; the job ran as application_43543_534543.
Now in the CDH UI you can see the job is completed.
$ hdfs dfs -ls /user   (the hdfs user ran the job here)
$ hdfs dfs -ls /user/hdfs/
$ hdfs dfs -ls /user/hdfs/CONFIGS
$ hdfs dfs -cat /user/hdfs/CONFIGS/part-m-00000
Each and every field is separated by a comma.
mysql> select * from CONFIGS;   (cross-checking)----check the sqooppactice.txt document for more cmds
