Hadoop Basics

What is Big Data?
Massive volumes of structured and unstructured data, so large that they are difficult to process using traditional database systems.
What is Apache Hadoop?
An open-source software framework for storing data and running applications on clusters of commodity hardware.
It provides massive storage, enormous processing power, and the ability to handle virtually limitless concurrent jobs.
It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
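To make the storage side concrete, here is a minimal sketch that writes a file into HDFS and reads it back through the Hadoop Java API. The NameNode address (`hdfs://localhost:9000`) and the path `/demo/hello.txt` are assumptions for illustration, not anything prescribed by Hadoop itself.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/demo/hello.txt");
      // Write a small file into HDFS
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
      }
      // Read the same file back
      try (FSDataInputStream in = fs.open(path)) {
        byte[] buf = new byte[(int) fs.getFileStatus(path).getLen()];
        in.readFully(buf);
        System.out.println(new String(buf, StandardCharsets.UTF_8));
      }
    }
  }
}
```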
What is Apache Sqoop?
A command-line application used for transferring data between relational databases and Hadoop.
It is used to import data from relational databases into HDFS and to export data from the Hadoop file system back to relational databases.
What is Apache Pig?
A tool used to analyze large data sets by representing them as data flows. It provides a simple scripting language called Pig Latin.
Using Pig, we can perform data manipulation operations in Hadoop.
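As a rough illustration, Pig Latin statements can also be run programmatically through the `PigServer` Java API. The input file `input.txt` and the relation names below are made up for the example; this is a sketch, not a complete application.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // Run Pig locally; ExecType.MAPREDUCE would run on a cluster instead
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Each registered statement is ordinary Pig Latin
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "wordcount_out"); // writes the result to an output directory
  }
}
```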
What is Apache Hive?
Hive is a data warehouse system (often used for ETL) for data summarization and analysis.
Hive is a database layer in the Hadoop ecosystem that performs DDL and DML operations,
and it provides a flexible, SQL-like query language called HQL for querying and processing data.
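A minimal sketch of running an HQL query over JDBC against HiveServer2. The connection URL, credentials, and the `employees` table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Assumed HiveServer2 endpoint and login
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // HQL looks and feels like SQL
      ResultSet rs = stmt.executeQuery(
          "SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept");
      while (rs.next()) {
        System.out.println(rs.getString("dept") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
```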
What is Apache HBase?
HBase is a column-oriented NoSQL database that runs on top of HDFS.
HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases.
It is well suited for real-time data processing and random read/write access to large volumes of data.
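To show what random read/write access looks like, here is a minimal sketch using the HBase Java client to write and read a single cell. The table `users` and column family `info` are assumptions and would need to exist beforehand.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // Random write: one cell in column family "info"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);
      // Random read of the same cell
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```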
What is Apache Flume?
A tool for moving data from a source to a destination.
It is used to collect, aggregate, and move large amounts of streaming data (such as logs) into HDFS (for example, streaming Twitter data).
What is Apache Oozie?
A workflow scheduler for managing Hadoop jobs. It is a system that runs workflows of dependent jobs.
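For illustration, a workflow (defined separately as a `workflow.xml` on HDFS) can be submitted through the Oozie Java client. The server URL, HDFS paths, and property values below are assumptions specific to this sketch.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
  public static void main(String[] args) throws Exception {
    // Assumed Oozie server endpoint
    OozieClient oozie = new OozieClient("http://localhost:11000/oozie");
    Properties props = oozie.createConfiguration();
    // Assumed HDFS directory containing workflow.xml
    props.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/demo/wordcount-wf");
    props.setProperty("nameNode", "hdfs://localhost:9000");
    props.setProperty("jobTracker", "localhost:8032");
    // Submit and start the workflow; Oozie returns a job id for tracking
    String jobId = oozie.run(props);
    System.out.println("Workflow job submitted: " + jobId);
  }
}
```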
What is Apache ZooKeeper?
It acts as a centralized service and is used to maintain naming and configuration data and to provide flexible and robust synchronization within distributed systems. Think of ZooKeeper as the team leader that watches the activities and health of all the nodes.
It is often used as fault-tolerant storage for metadata in large-scale distributed systems.
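A minimal sketch of storing and reading a piece of configuration data as a znode with the ZooKeeper Java client. The ensemble address and the znode path `/app-config` are assumptions for the example.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfig {
  public static void main(String[] args) throws Exception {
    // Assumed ZooKeeper ensemble address; the lambda is a no-op watcher
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
    // Store a small piece of configuration data in a znode
    // (throws if the znode already exists; fine for a one-shot sketch)
    zk.create("/app-config", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
        CreateMode.PERSISTENT);
    // Any process in the cluster can now read the same value
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}
```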
Hadoop Skills?
  1. Cluster deployment
  2. Adding and removing nodes
  3. Keeping track of jobs
  4. Monitoring critical parts of the cluster
  5. Configuring high availability
  6. Backup and disaster recovery
  7. Integration of ecosystem components

 
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: programming model for data processing (see the word-count sketch after this list)
Spark: in-memory data processing
Pig, Hive: query-based processing of data
HBase: NoSQL database
Mahout, Spark MLlib: machine-learning algorithm libraries
Solr, Lucene: searching and indexing
ZooKeeper: cluster management and coordination
Oozie: job scheduling
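To make the MapReduce entry concrete, here is the classic word-count job in Java using the standard Hadoop MapReduce API. The input and output paths come from the command line; nothing else is assumed.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every token in the input line
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```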
