Big Data: Hadoop
HDFS: Hadoop Distributed File System.
YARN: Yet Another Resource Negotiator. Sits on top of HDFS; MapReduce sits on top of it.
- Manages the resources on the Hadoop cluster.
MapReduce: A programming model that lets you distribute data processing over a cluster. It consists of mappers and reducers.
Note:
- Mappers transform the data in parallel across the cluster; reducers aggregate the transformed data together (see the word-count sketch below).
- Earlier, MapReduce and YARN were a single module.
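A minimal sketch of the mapper/reducer idea as a word count, written in the Hadoop Streaming style (data piped through stdin/stdout); the file names are illustrative:

```python
#!/usr/bin/env python3
# mapper.py -- emits (word, 1) for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts per word. Hadoop Streaming sorts the
# mapper output by key, so all lines for a given word arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The same pipeline can be tested locally with `cat input.txt | python3 mapper.py | sort | python3 reducer.py`.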
Pig: A SQL-style scripting language that can be used in place of writing Java or Python MapReduce code. It sits on top of MapReduce.
Hive: Just like Pig, it sits on top of MapReduce and looks very much like a SQL database (see the sketch below).
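A hedged sketch of querying Hive from Python via PyHive; it assumes HiveServer2 is running on localhost:10000, and the `weblogs` table is hypothetical:

```python
# Query Hive over HiveServer2 using PyHive (pip install 'pyhive[hive]').
# Host, port, user, and the `weblogs` table are assumptions for illustration.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, username="hadoop")
cur = conn.cursor()
# HiveQL looks like SQL, but Hive compiles it into cluster jobs.
cur.execute("SELECT status, COUNT(*) FROM weblogs GROUP BY status")
for status, count in cur.fetchall():
    print(status, count)
```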
Apache Ambari: A web view of the Hadoop cluster for management and monitoring.
Ambari alternatives: Cloudera, MapR, etc.
Mesos: A YARN alternative.
Spark: Sits at the same level as MapReduce, as an alternative processing engine (see the sketch below).
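For comparison, the same word count as above expressed in Spark (a sketch; assumes `pip install pyspark` and an input file at the hypothetical HDFS path):

```python
# Word count in PySpark: the mapper/reducer steps become chained
# transformations that Spark schedules across the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
counts = (spark.sparkContext.textFile("hdfs:///user/hadoop/input.txt")
          .flatMap(lambda line: line.split())   # mapper-like step
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))     # reducer-like step
for word, n in counts.take(10):
    print(word, n)
spark.stop()
```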
Tez: Used in conjunction with Hive. Works on directed acyclic graphs, similar to Spark.
HBase: Exposes the data on the cluster to transactional platforms. HBase is a NoSQL, columnar datastore.
Apache Storm: A way of processing streaming data, e.g. from sensors or weblogs; similar to Spark Streaming.
Oozie: A way to schedule jobs on the cluster.
Zookeeper: A technology for coordinating everything on your cluster, e.g. which nodes are up or down, and keeping track of which node is the master (see the sketch below).
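A sketch of this kind of coordination using the kazoo client: each worker registers an ephemeral znode, which vanishes automatically if the worker dies, so watchers always see which nodes are up. The ensemble address and paths are assumptions:

```python
# Track live nodes with ZooKeeper via kazoo (pip install kazoo).
from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")  # assumed ZooKeeper address
zk.start()
zk.ensure_path("/cluster/nodes")
# Ephemeral znode: disappears when this client's session dies.
zk.create("/cluster/nodes/worker-1", b"alive", ephemeral=True)

@zk.ChildrenWatch("/cluster/nodes")
def on_change(children):
    print("live nodes:", children)  # fires whenever membership changes
```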
Data ingestion:
- Sqoop: Ties the cluster to relational databases.
- Flume: Transfers weblogs into the cluster.
- Kafka: General-purpose data ingestion for the Hadoop cluster (see the sketch below).
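A sketch of Kafka ingestion with the kafka-python client; the broker address, topic name, and log file are assumptions:

```python
# Stream a weblog into a Kafka topic (pip install kafka-python).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
with open("access.log", "rb") as logfile:        # hypothetical log file
    for line in logfile:
        producer.send("weblogs", line.rstrip())  # one log line per message
producer.flush()  # block until buffered messages are delivered
```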
External data stores:
- Cassandra
- MongoDB
- MySQL
Note: Cassandra/MongoDB sit between the real-time app and the Hadoop cluster (see the sketch below).
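A sketch of that serving-layer role with pymongo: a batch job on the cluster writes a summary into MongoDB, and the real-time app reads it back with fast lookups. The database, collection, and values are hypothetical:

```python
# Publish batch results to MongoDB for a real-time app (pip install pymongo).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed address
ratings = client["movielens"]["avg_ratings"]       # hypothetical db/collection
# Upsert a precomputed aggregate so the app never queries the cluster directly.
ratings.replace_one({"movie_id": 50}, {"movie_id": 50, "avg": 4.4}, upsert=True)
print(ratings.find_one({"movie_id": 50}))
```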
Query engines (Hive is built into Hadoop):
- Apache Drill: SQL-like queries across Hadoop and NoSQL data sources.
- Hue: Similar to Ambari.
- Apache Phoenix: Similar to Drill; SQL on top of HBase.
- Presto: Similar idea, open-sourced by Facebook (see the sketch below).
- Apache Zeppelin: Notebook-style interface.
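A sketch of querying through Presto with PyHive's presto module; the coordinator host/port, catalog, and the `weblogs` table are assumptions:

```python
# SQL against the cluster via a Presto coordinator (pip install 'pyhive[presto]').
from pyhive import presto

cur = presto.connect(host="localhost", port=8080,   # assumed coordinator
                     catalog="hive", schema="default").cursor()
cur.execute("SELECT COUNT(*) FROM weblogs")  # hypothetical table
print(cur.fetchone())
```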
HDFS
- Used for handling very large files across a cluster.
- Breaks large files into smaller blocks (e.g. 128 MB) stored on different computers/nodes.
- Each node can process its blocks in parallel.
- More than one copy of each block is stored across nodes for better resilience (see the arithmetic below).
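Quick arithmetic on what blocking and replication mean for a single file, using the defaults mentioned above:

```python
# How a 1 GB file maps onto HDFS blocks (default sizes assumed).
import math

file_size = 1 * 1024**3       # 1 GB
block_size = 128 * 1024**2    # 128 MB block size
replication = 3               # typical replication factor

blocks = math.ceil(file_size / block_size)
print(blocks)                 # 8 blocks -> up to 8-way parallel processing
print(blocks * replication)   # 24 block copies spread across the nodes
```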
HDFS Architecture
- Name node and Data nodes.
- The name node keeps track of where all the blocks of data are stored; it also keeps an edit log of changes.
- Data nodes store the actual blocks and talk to client applications.
- Data nodes communicate with each other to keep data blocks in sync.
Reading a file
- The client app first interacts with the name node to locate the data.
- The name node then tells the client app which data nodes contain the requested blocks, and the client reads from those data nodes directly (see the sketch below).
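A sketch of the read path using the WebHDFS-based HdfsCLI library (`pip install hdfs`); the name node lookup and data node fetch happen behind `client.read()`. The host, port (9870 is the Hadoop 3 WebHDFS default), user, and path are assumptions:

```python
# Read a file from HDFS over WebHDFS.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")  # assumed
with client.read("/user/hadoop/input.txt") as reader:  # hypothetical path
    data = reader.read()
print(data[:100])  # first 100 bytes, served by a data node
```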
Writing a file
- If a client app wants to create a new file, it first contacts the name node.
- The name node responds with the list of data nodes to be used.
- The app then sends the data to a single data node.
- That data node talks to the other data nodes to replicate the data.
- The data node then reports the status back to the client app, which in turn informs the name node (see the sketch below).
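The matching write-path sketch, with the same HdfsCLI assumptions as above; `client.write()` hides the name node negotiation and the data node replication pipeline just described:

```python
# Write a file to HDFS over WebHDFS.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")  # assumed
client.write("/user/hadoop/output.txt", data=b"hello hdfs",     # hypothetical path
             overwrite=True, replication=3)
```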
Handling failure of Name node
- Back up the data and metadata (edit log) on local storage or NFS.
- Secondary name node: maintains a merged copy of the edit log you can restore from.
HDFS Federation
- Scale beyond a single name node.
- Each name node manages a specific namespace volume.
- Useful when there is a large number of small files (each file costs name node memory).
- Also helps with resilience.
HDFS high availability
- Prevents downtime on cluster in case of failure.
- Previous approaches still led to a small downtime.
- Hot standby name node using a shared edit log (kept outside the name nodes, e.g. on NFS or a quorum of journal nodes).
- The hot standby name node can take over right after a name node failure.
- There is no downtime, unlike the earlier methods.
- Zookeeper keeps track of which name node is active (see the election sketch below).
- Uses extreme measures (fencing, e.g. forcibly killing the old active name node) to ensure only one name node is active at a time.
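A sketch of ZooKeeper-style leader election with kazoo's Election recipe, the same pattern used to keep exactly one name node active; the paths and identifiers are hypothetical:

```python
# Leader election: standbys block until the current leader's session dies.
from kazoo.client import KazooClient

def serve_as_active():
    print("this node is now the active name node")

zk = KazooClient(hosts="localhost:2181")  # assumed ensemble
zk.start()
election = zk.Election("/hdfs/active-nn", identifier="namenode-1")
election.run(serve_as_active)  # blocks until this node wins, then calls func
```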
Using HDFS
- Ambari Web UI.
- CLI (see the sketch below).
- HTTP/HDFS proxies.
- Java interface.
- NFS gateway.
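A sketch of driving the CLI route from Python via subprocess (run on a machine where the Hadoop client is configured; paths are hypothetical):

```python
# Basic HDFS CLI operations: make a directory, upload a file, list it.
import subprocess

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/hadoop/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "input.txt", "/user/hadoop/demo/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/hadoop/demo"], check=True)
```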