MapReduce Framework
Article · October 2022 · DOI: 10.17303/jcssd





    Hadoop Framework
    Hadoop is an open-source project of the Apache Software Foundation for developing distributed applications that process huge volumes of data. It was introduced in 2005. It works in an environment that provides distributed storage and computation across various clusters [7]. Hadoop is developed to scale up from a single server to many servers. It is designed to process big data efficiently and is used to support the processing of big data in a distributed computing environment.
    The Hadoop ecosystem consists of multiple components, explained below [6]:
    • Hadoop Distributed File System (HDFS): responsible for splitting files into blocks and storing them across compute nodes.
    • MapReduce Framework: responsible for processing each data block in parallel [8].
    • HBase: a column-oriented distributed NoSQL database for random read/write access.
    • Pig: a high-level data programming language for analyzing data in Hadoop computations.
    • Hive: a data warehousing application that provides SQL-like access and a relational model.
    • Sqoop: a project for transferring/importing data between relational databases and Hadoop.
    • Oozie: an orchestration and workflow manager for dependent Hadoop jobs [9].
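To illustrate the splitting step that HDFS performs, the following minimal sketch (plain Python, not Hadoop's actual implementation; the file contents and block size are invented for illustration) divides a byte stream into fixed-size blocks:

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split a byte stream into fixed-size blocks, HDFS-style.

    HDFS uses a default block size of 128 MB; a tiny size is used
    here purely to keep the example readable.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Hypothetical example: a 10-byte "file" split into 4-byte blocks.
blocks = split_into_blocks(b"0123456789", block_size=4)
print(blocks)  # [b'0123', b'4567', b'89']
```

In the real system, each such block would then be replicated and placed on different data nodes, which is what lets MapReduce process the blocks in parallel.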


    JScholar Publishers
    J Comput Sci Software Dev 2022 | Vol 2: 303
    MapReduce Framework
    A way to speed up the mining of big data is to distribute the training process across several machines in parallel. The MapReduce framework is organized in a master-slave fashion, with a JobTracker coordinating the work. It is designed to process extremely large data in parallel by splitting a job into various independent tasks. A MapReduce program is, in general, a combination of two tasks: Map and Reduce. In the map phase, the data is filtered and sorted into key-value pairs; in the reduce phase, those pairs are aggregated to produce the final results.
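As a minimal sketch of these two phases (using plain Python rather than the Hadoop API, with a made-up input), the classic word count example emits one key-value pair per word in the map phase and sums the counts per key in the reduce phase:

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) key-value pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the list of values for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Hypothetical input split into "lines", as mappers would receive them.
lines = ["big data big", "data mining"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'mining': 1}
```

In Hadoop itself, the map and reduce functions run on different nodes and the shuffle step moves data between them over the network; the sketch only mimics the data flow on one machine.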
    Its main advantages are as follows [10]:
    A. Simplicity: programming jobs with MapReduce is simple; an understanding of the underlying system infrastructure is not required.
    B. Fault tolerance: in an environment with thousands of data nodes, failures are expected to occur. MapReduce handles this problem, so no loss of results or interruption of work occurs.
    C. Flexibility: MapReduce does not require data to be organized in a specific format.
    D. Scalability: MapReduce can scale to larger clusters.
    MapReduce parallel programming has been applied to many data mining algorithms. Data mining algorithms usually need to scan the training data to obtain the statistics required to solve or optimize a model. To mine information from big data, parallel computing-based algorithms such as MapReduce are used. In such algorithms, a large data set is divided into a number of subsets, and the mining algorithm is applied to each subset. Finally, summation algorithms are applied to the results of the per-subset runs to meet the goal of big data mining. In this way, data mining algorithms can be converted into big data MapReduce algorithms built on a parallel computing basis [11].
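The divide-mine-combine pattern described above can be sketched as follows (a toy illustration with invented data, computing a global mean from per-subset partial statistics rather than running a real mining algorithm):

```python
def split(data, n_subsets):
    """Divide the data set into roughly equal subsets."""
    size = (len(data) + n_subsets - 1) // n_subsets
    return [data[i:i + size] for i in range(0, len(data), size)]


def mine(subset):
    """'Mining' on one subset: return partial statistics (sum, count)."""
    return (sum(subset), len(subset))


def combine(partials):
    """Summation step: merge partial statistics into a global result."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


data = [1, 2, 3, 4, 5, 6, 7, 8]                # invented data
partials = [mine(s) for s in split(data, n_subsets=3)]
print(combine(partials))                        # 4.5
```

The key property is that the `mine` step only needs its own subset, so each subset can be processed on a different node; only the small partial statistics travel to the combining step.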
