STUDENT ID: C0727079
STUDENT NAME: ALAN SALO

PROJECTS BASED ON BIG DATA

The growing demand for Big Data across industries has paved the way for the development of numerous Big Data projects alongside Hadoop. Many of them emerged to sharpen the Hadoop environment, while some have standalone capabilities developed for their respective needs.

Some of the Big Data projects include:

1. Apache Ambari: An open-source project developed to manage, authenticate, and monitor Hadoop clusters. As Hadoop matured, more technologies were added to the framework, which made it difficult to manage the cluster and keep track of multiple tools simultaneously. This is where Ambari emerged to make distributed computing easier. Ambari provides a highly interactive web UI supported by a collection of RESTful APIs, which allows the management and monitoring of every application on Hadoop clusters. Ambari includes two major components: Server and Agent.
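As a sketch of that RESTful interface, the snippet below builds (without sending) a GET request against Ambari's v1 API to list a cluster's services. The host, cluster name, and default admin credentials are illustrative assumptions, not values from any real deployment.

```python
import base64
import urllib.request

AMBARI_HOST = "ambari.example.com:8080"  # assumption: default Ambari web port
CLUSTER = "MyCluster"                    # assumption: illustrative cluster name


def build_service_request(host: str, cluster: str,
                          user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request listing a cluster's services."""
    url = f"http://{host}/api/v1/clusters/{cluster}/services"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "X-Requested-By": "ambari",  # header Ambari expects on API calls
    })


req = build_service_request(AMBARI_HOST, CLUSTER, "admin", "admin")
print(req.full_url)
```

Sending the same request with `urllib.request.urlopen(req)` would return a JSON listing of services (HDFS, YARN, and so on) on a live Ambari server.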

The Server interacts with the Agents on the nodes and collects the metadata of their processes, while each Agent updates the status of its node along with diverse operational metrics. Some of the features of Ambari are: platform independence, pluggable components, version management and upgrade, extensibility, failure recovery, and security.

2. Apache Flume: A project designed to handle the streaming of huge datasets into HDFS. Its simple and flexible architecture, along with distributed, reliable data-handling services that operate in a robust and fault-tolerant manner, makes it apt for online analytic applications.
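A minimal single-agent configuration gives a feel for that architecture: a source feeds events through a channel to a sink. The agent name `a1`, the netcat source, and the HDFS path below are illustrative assumptions, not required values.

```properties
# One agent (a1) with a netcat source, a memory channel, and an HDFS sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream

# Wire source and sink to the channel (the channel carries Flume's transactions).
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```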

Specifically, Flume allows the user to stream data, insulate systems, and scale data handling horizontally; in addition, it provides guaranteed data delivery because it uses channel-based transactions.

3. Apache Sqoop: Sqoop acts as a mediator between Hadoop and relational databases for transferring data, and provides an efficient interactive command-line interface. Sqoop uses MapReduce to import and export data between relational databases and HDFS, which provides parallel operation in a fault-tolerant way.
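A typical import from that command-line interface might look as follows; the host, database, table, column, and target path are illustrative assumptions.

```shell
# Import one table from MySQL into HDFS with four parallel map tasks.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --split-by order_id \
  --num-mappers 4 \
  --target-dir /user/analyst/orders
```

The `--split-by` column lets Sqoop divide the table's key range among the map tasks, which is where the parallelism comes from.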

The import source can be a database table, which will be read into HDFS row by row, producing an output of multiple files. The functions of Sqoop include: load balancing, efficient data analysis, fast data copies, parallel data transfer, data imports, direct import of ORC files, and importing sequential datasets from mainframes.

4. Apache Oozie: Oozie is a workflow scheduler designed to schedule Apache Hadoop jobs.

It is a server-based program which organizes jobs as Directed Acyclic Graphs (DAGs). Its workflows are managed by control flow nodes and action nodes. Control flow nodes take care of the start, end, and failure definitions, and the path of the workflow.
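Such a workflow is defined in an XML file. The sketch below shows a minimal definition with start, end, and kill control nodes around a single MapReduce action; the workflow name, paths, and property values are illustrative assumptions.

```xml
<!-- Minimal Oozie workflow: control nodes around one MapReduce action. -->
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="mr-step"/>
  <action name="mr-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>/user/demo/input</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/user/demo/output</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce step failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```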

Action nodes are defined to trigger the processes of a workflow. Oozie workflows can be recurrent, triggered by time and data availability.

5. Apache Kafka: An open-source distributed streaming platform which aims to handle real-time data feeds in an efficient manner. It manages the storage of real-time records in a fault-tolerant way.

It acts as a message queue which lets us subscribe to and publish real-time data and process it. Kafka runs as a cluster of servers that store streaming data in the form of key-value pairs with a timestamp. Additionally, it communicates with external systems for import/export through Kafka Connect, and provides a library that enables stream processing in Java.
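The core abstraction behind this can be sketched as an append-only log of timestamped key-value records, with consumers tracking their own read offsets. The toy in-memory model below illustrates that idea only; it is not Kafka's actual API, and the class and record names are invented for this sketch.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Record:
    """One key-value record with a timestamp, as stored in a Kafka-like log."""
    key: str
    value: str
    timestamp: float


@dataclass
class TopicLog:
    """Toy sketch of a topic: an append-only log addressed by offsets."""
    records: list = field(default_factory=list)

    def publish(self, key: str, value: str) -> int:
        """Append a record and return its offset, as a producer would see."""
        self.records.append(Record(key, value, time.time()))
        return len(self.records) - 1

    def consume(self, offset: int) -> list:
        """Return every record at or after `offset`; consumers keep their own offsets."""
        return self.records[offset:]


log = TopicLog()
log.publish("sensor-1", "21.5")
log.publish("sensor-2", "19.8")
print([r.key for r in log.consume(0)])
```

Because records are never removed on read, any consumer can replay the stream from an earlier offset, which is what makes Kafka suitable for both queuing and fault-tolerant storage of real-time feeds.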

