MapReduce Framework
One way to speed up the mining of big data is to distribute the training process across several machines in parallel. The MapReduce framework is organized in a master-slave configuration, with the JobTracker acting as the master. It is designed to process extremely large data in parallel by splitting a job into several independent tasks. A MapReduce program is, in general, a combination of two tasks: Map and Reduce. In the map phase, the data is filtered and sorted into key-value pairs. In the reduce phase, these pairs are aggregated to produce the final results.
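As an illustration, the following is a minimal, single-machine Python sketch of these two phases using a word-count job; the function names (map_phase, shuffle, reduce_phase) are illustrative assumptions, not the API of any particular MapReduce framework.

from collections import defaultdict

# Minimal, single-machine sketch of the two MapReduce phases,
# using word count as the illustrative job. Function names are
# illustrative and not taken from a specific framework API.

def map_phase(record):
    """Map: filter the input record and emit (key, value) pairs."""
    for word in record.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Group mapped pairs by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate all values that share the same key."""
    return (key, sum(values))

if __name__ == "__main__":
    records = ["big data needs parallel mining",
               "mapreduce splits a big job into tasks"]
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    results = [reduce_phase(k, v) for k, v in grouped.items()]
    print(sorted(results))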
Its advantages are as follows [10]:
A. Simplicity: programming jobs to run with MapReduce is simple; an understanding of the system infrastructure is not required.
B. Fault-tolerance: In an environment with thousands of data nodes, failures are expected to occur. MapReduce can handle them without loss of results or interruption of work.
C. Flexibility: MapReduce does not require data to be organized in a specific format.
D. Scalability: MapReduce can scale to larger clusters.
MapReduce parallel programming has been applied to many data mining algorithms. Data mining algorithms usually need to scan the training data to obtain the statistics required to solve or optimize a model. To mine information from big data, parallel computing-based approaches such as MapReduce are used. In such approaches, large data sets are divided into a number of subsets, and the mining algorithms are applied to those subsets. Finally, summation algorithms are applied to the results of the mining algorithms to meet the goal of big data mining. In this way, data mining algorithms can be converted into big data MapReduce algorithms based on parallel computing [11].
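The following Python fragment is a hedged sketch of this divide-mine-summarize pattern: it splits a data set into subsets, applies a per-subset mining step in parallel, and combines the partial results with a summation step. The statistic computed (a global mean from partial sums and counts) and all function names are illustrative assumptions, not a specific algorithm from [11].

from concurrent.futures import ProcessPoolExecutor

# Sketch of the divide-mine-summarize pattern described above.
# The per-subset "mining" step returns partial statistics, and a
# final summation step combines them; the chosen statistic (a
# global mean) is only an illustration.

def mine_subset(subset):
    """Per-subset mining step: return partial statistics (sum, count)."""
    return sum(subset), len(subset)

def summarize(partials):
    """Summation step: combine partial statistics into a global result."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

def split(data, n_subsets):
    """Divide the data set into roughly equal subsets."""
    size = (len(data) + n_subsets - 1) // n_subsets
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1, 1001))          # stand-in for a large training set
    subsets = split(data, n_subsets=4)
    with ProcessPoolExecutor() as pool:  # mine each subset in parallel
        partials = list(pool.map(mine_subset, subsets))
    print(summarize(partials))           # 500.5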