
Big Data and Hadoop

Posted on: September 23, 2022 at 03:22 PM

Big data Hadoop

HDFS: Hadoop Distributed File System.

YARN: Yet Another Resource Negotiator. It sits on top of HDFS and manages the cluster's compute resources. MapReduce runs on top of it.

MapReduce: A programming model that lets you distribute data processing over a cluster. It consists of mappers and reducers.
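To make the mapper/reducer split concrete, here is a minimal single-process sketch of a word count in plain Python. It only imitates the model on one machine. It does not use Hadoop's API, and the function names (mapper, shuffle, reducer, run_job) are made up for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the step a framework runs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Combine all values collected for one key into a single result."""
    return word, sum(counts)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


word_counts = run_job(["big data hadoop", "hadoop big cluster"])
print(word_counts)
```

On a real cluster the mappers and reducers would run in parallel on different nodes. Here they run one after another, which keeps the data flow easy to follow.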

Note:

Pig: A high-level scripting language (Pig Latin) with a SQL-like feel. It can be used in place of writing MapReduce code in Java or Python. It sits on top of MapReduce.

Hive: Like Pig, it sits on top of MapReduce. It lets you query the data with SQL-like statements, so the cluster looks much like a SQL database.

Apache Ambari: Web view of the Hadoop cluster.

Ambari alternatives: Cloudera, MapR etc.

Mesos: A YARN alternative.

Spark: Sits at the same level as MapReduce.

Tez: Used in conjunction with Hive. It plans work as a directed acyclic graph (DAG), similar to Spark.
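As a rough illustration of what working on a directed acyclic graph means, the sketch below orders a few query stages so that each one runs only after the stages it depends on. This is not the Tez or Spark API. The stage names are hypothetical, and the ordering uses Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical query stages. Each key maps to the set of stages it depends on.
stages = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}

# static_order() yields every stage after all of its dependencies.
execution_order = list(TopologicalSorter(stages).static_order())
print(execution_order)
```

The two scan stages have no dependency on each other, so an engine is free to run them in parallel. Only the join has to wait for both.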

HBase: Exposes the data on the cluster to transactional platforms. HBase is a NoSQL database, a columnar datastore.

Apache Storm: A way of processing streaming data, e.g. from sensors or web logs. It is similar to Spark Streaming.

Oozie: A way to schedule jobs on the cluster.

ZooKeeper: A technology for coordinating everything on your cluster, e.g. tracking which nodes are up or down and which node is the master.

Data ingestion:

External data store:

Note: Cassandra/MongoDB sit between a real-time app and the Hadoop cluster.

Query engines (Hive is built into Hadoop)

HDFS

HDFS Architecture

Reading a file

Writing a file

Handling failure of Name node

HDFS Federation

HDFS high availability

Using HDFS