MapReduce Framework
One way to speed up the mining of big data is to distribute the training process across several machines in parallel. The MapReduce framework is organized in a master-slave configuration, with the JobTracker acting as the master. It is designed to process extremely large data in parallel by splitting a job into several independent tasks. A MapReduce program is, in general, a combination of two tasks: Map and Reduce. In the map phase, the data is filtered and sorted into key-value pairs. In the reduce phase, these pairs are aggregated to produce the final results.
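As an illustration, the following is a minimal, single-machine Python sketch of these two phases using a word-count job; the function names (map_phase, shuffle, reduce_phase) are illustrative assumptions, not the API of any particular MapReduce framework.

from collections import defaultdict

# Minimal, single-machine sketch of the two MapReduce phases,
# using word count as the illustrative job. Function names are
# illustrative and not taken from a specific framework API.

def map_phase(record):
    """Map: filter the input record and emit (key, value) pairs."""
    for word in record.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Group mapped pairs by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate all values that share the same key."""
    return (key, sum(values))

if __name__ == "__main__":
    records = ["big data needs parallel mining",
               "mapreduce splits a big job into tasks"]
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    results = [reduce_phase(k, v) for k, v in grouped.items()]
    print(sorted(results))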
Its advantages are as follows [10]:
A. Simplicity: programming jobs to run with MapReduce is simple; an understanding of the system infrastructure is not required.
B. Fault-tolerance: In an environment with thousands of data nodes, failures are expected to occur. MapReduce can handle them without loss of results or interruption of work.
C. Flexibility: MapReduce does not require data to be organized in a specific format.
D. Scalability: MapReduce can scale to larger clusters.
MapReduce parallel programming has been applied to many data mining algorithms. Data mining algorithms usually need to scan the training data to obtain the statistics required to solve or optimize a model. To mine information from big data, parallel computing-based approaches such as MapReduce are used. In such approaches, large data sets are divided into a number of subsets, and the mining algorithms are applied to those subsets. Finally, summation algorithms are applied to the results of the mining algorithms to meet the goal of big data mining. In this way, data mining algorithms can be converted into big data MapReduce algorithms based on parallel computing [11].
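The following Python fragment is a hedged sketch of this divide-mine-summarize pattern: it splits a data set into subsets, applies a per-subset mining step in parallel, and combines the partial results with a summation step. The statistic computed (a global mean from partial sums and counts) and all function names are illustrative assumptions, not a specific algorithm from [11].

from concurrent.futures import ProcessPoolExecutor

# Sketch of the divide-mine-summarize pattern described above.
# The per-subset "mining" step returns partial statistics, and a
# final summation step combines them; the chosen statistic (a
# global mean) is only an illustration.

def mine_subset(subset):
    """Per-subset mining step: return partial statistics (sum, count)."""
    return sum(subset), len(subset)

def summarize(partials):
    """Summation step: combine partial statistics into a global result."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

def split(data, n_subsets):
    """Divide the data set into roughly equal subsets."""
    size = (len(data) + n_subsets - 1) // n_subsets
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1, 1001))          # stand-in for a large training set
    subsets = split(data, n_subsets=4)
    with ProcessPoolExecutor() as pool:  # mine each subset in parallel
        partials = list(pool.map(mine_subset, subsets))
    print(summarize(partials))           # 500.5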