STUDENT ID: C0727079
STUDENT NAME: ALAN SALO
PROJECTS BASED ON BIG DATA
The growing demand for Big Data across industries has paved the way for the development of numerous Big Data projects built around Hadoop. Many of them exist to strengthen the Hadoop environment, while others have standalone capabilities developed for their own needs.
Some of the Big Data projects include:
1. Apache Ambari: It is an open-source project developed to manage, secure, and monitor Hadoop clusters. As Hadoop matured, more technologies were added to the framework, which made it difficult to manage and keep track of multiple components on a cluster simultaneously. Ambari emerged to address this problem and make distributed computing easier. Ambari provides a highly interactive web UI backed by a collection of RESTful APIs, which allow the management and monitoring of every application on a Hadoop cluster. Ambari includes two major components: Server and Agent.
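The RESTful interface mentioned above can be illustrated with a minimal Python sketch. It only builds a request that would list the clusters known to an Ambari server and does not send it. The host name, the credentials, and the helper name build_cluster_list_request are assumptions made for this illustration; port 8080 and the /api/v1/clusters path are the usual defaults, but a real deployment may differ.

```python
import base64
import urllib.request


def build_cluster_list_request(host, user, password, port=8080):
    """Build (but do not send) a GET request for the Ambari cluster list.

    Hypothetical helper for illustration only; host, user and password
    are placeholders supplied by the caller.
    """
    url = f"http://{host}:{port}/api/v1/clusters"
    # Ambari's REST API accepts HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # ambari.example.com is a placeholder host, not a real server.
    req = build_cluster_list_request("ambari.example.com", "admin", "admin")
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would perform the call against a live server; the web UI is served by the same Ambari server that exposes this API.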
The Server interacts with the agents on each node and collects metadata about their processes. Each Agent reports the status of its node along with a range of operational metrics. Some of the features of Ambari are: platform independence, pluggable components, version management and upgrade, extensibility, failure recovery, and security.
2. Apache Flume: It is a project designed to handle the ingestion of huge streaming datasets into HDFS. Its simple, flexible architecture and its distributed, reliable services, which operate through robust and fault-tolerant mechanisms, make it well suited to online analytic applications.
Specifically, Flume allows the user to stream data, insulate systems, and scale horizontally; in addition, it provides guaranteed data delivery because it uses channel-based transactions.
3. Apache Sqoop: Sqoop acts as a mediator between Hadoop and relational databases for transferring data, and it provides an efficient command-line interface. Sqoop uses MapReduce to import and export data between relational databases and HDFS, which provides parallel operation in a fault-tolerant way.
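The parallel transfer described above can be pictured with a small Python sketch. Sqoop divides the value range of a split column among its map tasks (the --split-by and --num-mappers options control this); the code below is only a simplified model of that idea, not Sqoop's own implementation. The function names, the employees table, and the id column are assumptions made for this illustration.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous slices,
    one per mapper, with sizes differing by at most one key."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    count = min(num_mappers, total)      # never create empty slices
    base, extra = divmod(total, count)
    ranges = []
    start = min_id
    for i in range(count):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def slice_queries(table, column, ranges):
    """Build one bounded SELECT per slice; each would be run by one mapper."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} <= {hi}"
        for lo, hi in ranges
    ]


if __name__ == "__main__":
    # 'employees' and 'id' are hypothetical names used only for this demo.
    for query in slice_queries("employees", "id", split_ranges(1, 100, 4)):
        print(query)
```

Because each slice is read by a separate task, a failed slice can be retried on its own without repeating the whole transfer.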
The import source can be a database table, which is read into HDFS row by row. Because the import runs in parallel, it produces multiple output files. The functions of Sqoop include: load balancing, efficient data analysis, fast data copies, parallel data transfer, data imports, direct import of ORC files, and import of sequential datasets from mainframes.
4. Apache Oozie: Oozie is a workflow scheduler designed to schedule Apache Hadoop jobs.
It is a server-based program that organizes jobs as Directed Acyclic Graphs (DAGs). Its workflow is managed by control flow nodes and action nodes. Control flow nodes define the start, end, and failure points of a workflow, as well as its execution path.
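The way control flow nodes steer a workflow can be shown with a toy Python model. It mimics the start, end, and kill nodes and the ok/error transitions of a workflow definition, but it is only a sketch and not how Oozie itself is implemented. The node names (import-data, clean-data, fail) and the run_workflow function are assumptions made for this illustration.

```python
# A hypothetical two-step workflow: every node names its successor(s).
WORKFLOW = {
    "start": {"type": "start", "to": "import-data"},
    "import-data": {"type": "action", "ok": "clean-data", "error": "fail"},
    "clean-data": {"type": "action", "ok": "end", "error": "fail"},
    "fail": {"type": "kill"},
    "end": {"type": "end"},
}


def run_workflow(workflow, handlers):
    """Walk the graph from 'start' and return (final_status, visited_nodes).

    handlers maps each action node name to a callable; a raised exception
    sends the workflow down that node's 'error' transition.
    """
    name = "start"
    visited = []
    while True:
        if name in visited:
            # A DAG never revisits a node, so this indicates a bad definition.
            raise ValueError(f"cycle detected at node {name!r}")
        visited.append(name)
        node = workflow[name]
        kind = node["type"]
        if kind == "start":
            name = node["to"]
        elif kind == "action":
            try:
                handlers[name]()
                name = node["ok"]
            except Exception:
                name = node["error"]
        elif kind == "end":
            return "SUCCEEDED", visited
        elif kind == "kill":
            return "KILLED", visited
        else:
            raise ValueError(f"unknown node type {kind!r}")


if __name__ == "__main__":
    ok_handlers = {"import-data": lambda: None, "clean-data": lambda: None}
    print(run_workflow(WORKFLOW, ok_handlers))
```

If one of the handlers raises an exception, the same call returns the KILLED status and a path that ends at the fail node, which is the failure definition the control flow nodes are responsible for.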
Action nodes trigger the execution of the workflow's processing tasks. Oozie workflows can be recurrent, triggered by time and by data availability.
5. Apache Kafka: It is an open-source distributed streaming platform that aims to handle real-time data feeds efficiently. It stores real-time records in a fault-tolerant way.
It acts as a message queue that lets us publish and subscribe to real-time data and process it. Kafka runs as a cluster on one or more servers and stores streaming data as records, each consisting of a key, a value, and a timestamp. Additionally, it communicates with external systems for import and export through Kafka Connect, and it provides a library (Kafka Streams) that enables stream processing in Java.
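The record format and the publish/subscribe behaviour described above can be modelled with a short in-memory Python sketch. It is not the Kafka client API and needs no broker; the Record and ToyTopic classes and the subscriber names are assumptions made purely for illustration.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One stored message: a key, a value, and a timestamp."""
    key: str
    value: str
    timestamp: float


class ToyTopic:
    """Append-only log with an independent read position per subscriber."""

    def __init__(self):
        self._log = []        # records in arrival order
        self._offsets = {}    # subscriber name -> next offset to read

    def publish(self, key, value, timestamp=None):
        """Append a record and return its offset in the log."""
        ts = time.time() if timestamp is None else timestamp
        self._log.append(Record(key, value, ts))
        return len(self._log) - 1

    def poll(self, subscriber, max_records=10):
        """Return the next unread records for this subscriber."""
        start = self._offsets.get(subscriber, 0)
        batch = self._log[start:start + max_records]
        self._offsets[subscriber] = start + len(batch)
        return batch


if __name__ == "__main__":
    topic = ToyTopic()
    topic.publish("sensor-1", "21.5")
    topic.publish("sensor-2", "19.0")
    for record in topic.poll("dashboard"):
        print(record.key, record.value, record.timestamp)
```

Since records stay in the log after being read, a second subscriber polling later still receives every record from the beginning, which is what separates this model from a queue that deletes messages on delivery.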