Hadoop Basics

What is Big Data?
Massive volumes of structured and unstructured data, so large that they are difficult to process using traditional database systems.
What is Apache Hadoop?
An open-source software framework for storing data and running applications on clusters of commodity hardware.
It provides massive storage, enormous processing power, and the ability to handle virtually limitless concurrent jobs.
It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
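To make the storage side concrete, here is a minimal sketch that writes a file into HDFS and reads it back through the Hadoop Java API. The NameNode address (`hdfs://localhost:9000`) and the path `/demo/hello.txt` are assumptions for illustration, not anything prescribed by Hadoop itself.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/demo/hello.txt");
      // Write a small file into HDFS
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
      }
      // Read the same file back
      try (FSDataInputStream in = fs.open(path)) {
        byte[] buf = new byte[(int) fs.getFileStatus(path).getLen()];
        in.readFully(buf);
        System.out.println(new String(buf, StandardCharsets.UTF_8));
      }
    }
  }
}
```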
What is Apache Sqoop?
A command-line application used for transferring data between relational databases and Hadoop.
It is used to import data from relational databases into HDFS and to export data from the Hadoop file system back to relational databases.
What is Apache Pig?
A tool used to analyze large data sets by representing them as data flows. It provides a simple scripting language called Pig Latin.
Using Pig, we can perform data manipulation operations in Hadoop.
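As a rough illustration, Pig Latin statements can also be run programmatically through the `PigServer` Java API. The input file `input.txt` and the relation names below are made up for the example; this is a sketch, not a complete application.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // Run Pig locally; ExecType.MAPREDUCE would run on a cluster instead
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Each registered statement is ordinary Pig Latin
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "wordcount_out"); // writes the result to an output directory
  }
}
```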
What is Apache Hive?
Hive is a data warehouse system (often used for ETL) for data summarization and analysis.
Hive is a database layer in the Hadoop ecosystem that performs DDL and DML operations,
and it provides a flexible, SQL-like query language called HQL for querying and processing data.
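A minimal sketch of running an HQL query over JDBC against HiveServer2. The connection URL, credentials, and the `employees` table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Assumed HiveServer2 endpoint and login
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // HQL looks and feels like SQL
      ResultSet rs = stmt.executeQuery(
          "SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept");
      while (rs.next()) {
        System.out.println(rs.getString("dept") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
```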
What is Apache HBase?
HBase is a column-oriented NoSQL database that runs on top of HDFS.
HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases.
It is well suited for real-time data processing and random read/write access to large volumes of data.
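To show what random read/write access looks like, here is a minimal sketch using the HBase Java client to write and read a single cell. The table `users` and column family `info` are assumptions and would need to exist beforehand.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // Random write: one cell in column family "info"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);
      // Random read of the same cell
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```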
What is Apache Flume?
A tool for moving data from a source to a destination.
It is used to collect, aggregate, and move large amounts of streaming data (such as logs) into HDFS (for example, streaming Twitter data).
What is Apache Oozie?
A workflow scheduler for managing Hadoop jobs. It is a system that runs workflows of dependent jobs.
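For illustration, a workflow (defined separately as a `workflow.xml` on HDFS) can be submitted through the Oozie Java client. The server URL, HDFS paths, and property values below are assumptions specific to this sketch.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
  public static void main(String[] args) throws Exception {
    // Assumed Oozie server endpoint
    OozieClient oozie = new OozieClient("http://localhost:11000/oozie");
    Properties props = oozie.createConfiguration();
    // Assumed HDFS directory containing workflow.xml
    props.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/demo/wordcount-wf");
    props.setProperty("nameNode", "hdfs://localhost:9000");
    props.setProperty("jobTracker", "localhost:8032");
    // Submit and start the workflow; Oozie returns a job id for tracking
    String jobId = oozie.run(props);
    System.out.println("Workflow job submitted: " + jobId);
  }
}
```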
What is Apache ZooKeeper?
It acts as a centralized service and is used to maintain naming and configuration data and to provide flexible and robust synchronization within distributed systems. Think of ZooKeeper as the team leader that watches the activities and health of all the nodes.
It is often used as fault-tolerant storage for metadata in large-scale distributed systems.
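A minimal sketch of storing and reading a piece of configuration data as a znode with the ZooKeeper Java client. The ensemble address and the znode path `/app-config` are assumptions for the example.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfig {
  public static void main(String[] args) throws Exception {
    // Assumed ZooKeeper ensemble address; the lambda is a no-op watcher
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
    // Store a small piece of configuration data in a znode
    // (throws if the znode already exists; fine for a one-shot sketch)
    zk.create("/app-config", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
        CreateMode.PERSISTENT);
    // Any process in the cluster can now read the same value
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}
```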
Hadoop Skills?
  1. Cluster deployment
  2. Adding and removing nodes
  3. Keeping track of jobs
  4. Monitoring critical parts of the cluster
  5. Configuring high availability
  6. Backup and disaster recovery
  7. Integration of ecosystem components

 
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: programming model for data processing (see the word-count sketch after this list)
Spark: in-memory data processing
Pig, Hive: query-based processing of data
HBase: NoSQL database
Mahout, Spark MLlib: machine-learning algorithm libraries
Solr, Lucene: searching and indexing
ZooKeeper: cluster management and coordination
Oozie: job scheduling
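To make the MapReduce entry concrete, here is the classic word-count job in Java using the standard Hadoop MapReduce API. The input and output paths come from the command line; nothing else is assumed.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every token in the input line
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```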
