Computer Science

Hadoop

Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers. It provides a reliable, scalable platform for big data analytics and supports the processing of structured and unstructured data. Hadoop's key components include the Hadoop Distributed File System (HDFS) for storage and MapReduce for data processing.

Written by Perlego with AI-assistance

7 Key excerpts on "Hadoop"

Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), with each excerpt adding context and meaning to key research topics.
  • Blockchain

    From birth of Internet to future of Blockchain

    ...Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking. The base Apache Hadoop framework is composed of the following modules: • Hadoop Common - contains libraries and utilities needed by other Hadoop modules; • Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster; • Hadoop YARN - introduced in 2012, a platform responsible for managing computing resources in clusters and using them for scheduling users' applications; and • Hadoop MapReduce - an implementation of the MapReduce programming model for large-scale data processing. The term Hadoop has come to refer not just to the aforementioned base modules and sub-modules, but also to the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache Zookeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm. Apache Hadoop's MapReduce and HDFS components were inspired by Google's papers on MapReduce and the Google File System. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program...
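
    To make the "map" part of the programming model concrete, here is a minimal sketch of a word-count mapper written against the standard org.apache.hadoop.mapreduce Java API mentioned in the excerpt. The class name TokenCountMapper is illustrative rather than taken from any of the books quoted here.

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Illustrative mapper: emits (word, 1) for every token in an input line.
        public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(line.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);   // intermediate <key, value> pair
                }
            }
        }

    With Hadoop Streaming, the same map logic could equally be written as a stand-alone script in another language; the Java form is shown only because, as the excerpt notes, MapReduce Java code is the most common case.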

  • Big Data, Big Analytics

    Emerging Business Intelligence and Analytic Trends for Today’s Businesses

    • Michael Minelli, Michele Chambers, Ambiga Dhiraj (Authors)
    • 2012 (Publication Date)
    • Wiley (Publisher)

    ...At one-tenth the cost of traditional solutions, Hadoop excels at supporting complex analyses—including detailed, special-purpose computation—across large collections of data. Hadoop handles a variety of workloads, including search, log processing, recommendation systems, data warehousing, and video/image analysis. Today’s explosion of data types and volumes means that Big Data equals big opportunities and Apache Hadoop empowers organizations to work on the most modern scale-out architectures using a clean-sheet design data framework, without vendor lock-in. Apache Hadoop is an open-source project administered by the Apache Software Foundation. The software was originally developed by the world’s largest Internet companies to capture and analyze the data that they generate. Unlike traditional, structured platforms, Hadoop is able to store any kind of data in its native format and to perform a wide variety of analyses and transformations on that data. Hadoop stores terabytes, and even petabytes, of data inexpensively. It is robust and reliable and handles hardware and system failures automatically, without losing data or interrupting data analyses. Hadoop runs on clusters of commodity servers and each of those servers has local CPUs and disk storage that can be leveraged by the system. The two critical components of Hadoop are: The Hadoop Distributed File System (HDFS). HDFS is the storage system for a Hadoop cluster. When data lands in the cluster, HDFS breaks it into pieces and distributes those pieces among the different servers participating in the cluster. Each server stores just a small fragment of the complete data set, and each piece of data is replicated on more than one server. MapReduce. Because Hadoop stores the entire dataset in small pieces across a collection of servers, analytical jobs can be distributed, in parallel, to each of the servers storing part of the data...
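
    As a rough illustration of how a client program talks to HDFS as the storage layer described above, the sketch below uses the standard org.apache.hadoop.fs API to write a small file and then read back its block size and replication factor, the mechanism by which each piece of data ends up on more than one server. The path /tmp/example.txt is hypothetical, and the connection details are assumed to come from the cluster's core-site.xml.

        import java.nio.charset.StandardCharsets;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsWriteSketch {
            public static void main(String[] args) throws Exception {
                // Picks up fs.defaultFS (and therefore the HDFS cluster) from core-site.xml.
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                Path path = new Path("/tmp/example.txt");   // hypothetical path
                try (FSDataOutputStream out = fs.create(path)) {
                    out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
                }

                // The file system reports how the data was broken up and replicated.
                FileStatus status = fs.getFileStatus(path);
                System.out.println("block size:  " + status.getBlockSize());
                System.out.println("replication: " + status.getReplication());
            }
        }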

  • Intelligent Data Analysis for e-Learning

    Enhancing Security and Trustworthiness in Online Learning Systems

    • Jorge Miguel, Santi Caballé, Fatos Xhafa (Authors)
    • 2016 (Publication Date)
    • Academic Press (Publisher)

    ...The run-time framework takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required intermachine communication. This allows programmers without any experience with parallel and distributed systems to utilize the resources of a large distributed system [167]. In other words, the MapReduce model arises as a reaction to the complexity of parallel computing programming models, which require the programmer to consider the specific parallel factors involved in the software development process. 5.3.3 Hadoop Architecture: The Apache Hadoop software library is a framework devoted to processing large data sets across distributed clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage [170]. Although the Apache Hadoop project includes many Hadoop-related projects, the main modules are Hadoop MapReduce and the Hadoop Distributed File System (HDFS). HDFS supports large data sets across multiple hosts to achieve parallel processing. HDFS is a block-structured file system based on splitting input data into small blocks of fixed size, which are delivered to each node in the cluster. The Apache Hadoop NextGen MapReduce, also known as Apache Hadoop Yet Another Resource Negotiator (YARN), or MapReduce 2.0 (MRv2), is a cluster management technology. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker into resource management and job scheduling. The idea is to have a global resource manager and a per-application master. The resource manager and the per-node slave manager (i.e., the node manager) form the data-computation framework. The resource manager is the authority that arbitrates resources between all the applications in the system...
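
    The division of labour described above, where the run-time framework handles partitioning, scheduling, and failures, shows up in how little a job driver has to say. The sketch below assumes the standard org.apache.hadoop.mapreduce Job API and the illustrative TokenCountMapper and TokenCountReducer classes sketched elsewhere on this page; everything about splitting the input, placing tasks on nodes, and rerunning failed tasks is left to YARN/MRv2.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class TokenCountDriver {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = Job.getInstance(conf, "token count");
                job.setJarByClass(TokenCountDriver.class);
                job.setMapperClass(TokenCountMapper.class);      // illustrative mapper
                job.setCombinerClass(TokenCountReducer.class);   // local pre-aggregation
                job.setReducerClass(TokenCountReducer.class);    // illustrative reducer
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
                FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
                // Submission, scheduling across the cluster, and failure handling are done by the framework.
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }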

  • Predictive Analytics, Data Mining and Big Data

    Myths, Misconceptions and Methods

    ...Just as Microsoft Office is made up of Excel, Word, PowerPoint, Outlook and so on, so the Hadoop family encompasses a number of tools. The second thing to say about Hadoop is that you should not think about it as a database in the traditional sense, where the data is nicely prepared and formatted, processed and indexed to make it fit the requirements of the relational model. Instead you should view it as a way of storing massive amounts of diverse data and processing it quickly. The foundation software you can’t do without is the Hadoop core tools (Hadoop Common) and the Hadoop Distributed File System (HDFS). The HDFS manages the storage of data across any number of commodity servers. A commodity server simply means that it’s an off-the-shelf computer that typically costs anything from a few hundred to a few thousand dollars. There is nothing special about the hardware. If you want to increase your storage and data processing capabilities you just buy a few more servers. A key feature of the HDFS is that you shouldn’t think about any particular piece of data being stored on a particular server. One of the jobs of the HDFS is to maintain multiple copies of the data on different servers so that if one server fails, the data is not lost and a replacement copy can be generated and placed on another server. This is a very different way of protecting against server failures and data loss compared to the standard RAID/regular backup model used with most relational databases. The second core element of Hadoop is MapReduce. MapReduce was developed by Google in 2004 as a way to split very large processing jobs into a number of smaller bits. Each bit is then processed by a separate server and the results combined together to deliver the final result...
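
    To see the "pieces on different servers" idea directly, the sketch below (again assuming the standard HDFS FileSystem API, with the file path supplied as a command-line argument) lists, for each block of a file already stored in HDFS, the hosts that hold a replica of that block.

        import java.util.Arrays;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class BlockLocationSketch {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                Path path = new Path(args[0]);   // a file already stored in HDFS
                FileStatus status = fs.getFileStatus(path);

                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                for (int i = 0; i < blocks.length; i++) {
                    // Each block is held by several servers: the replicas HDFS maintains
                    // so that a failed server does not lose data.
                    System.out.printf("block %d offset=%d hosts=%s%n",
                            i, blocks[i].getOffset(), Arrays.toString(blocks[i].getHosts()));
                }
            }
        }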

  • Building Big Data Applications

    ...Within a few months they reached an architecture scalability of 20 nodes running Nutch without missing a heartbeat. This prompted Yahoo's next move: to hire Cutting and adopt Hadoop as one of its core platforms. Yahoo kept the platform moving with its constant innovation and research. Soon many committers and volunteer developers/testers started contributing to the growth of a healthy ecosystem around Hadoop. At this time of writing (2018), we have seen two leading distributors of Hadoop with management tools and professional services emerge—Cloudera and HortonWorks. We have also seen the emergence of Hadoop-based solutions from MapR, IBM, Teradata, Oracle, and Microsoft. Vertica, SAP, and others are also announcing their own solutions in multiple partnerships with other providers and distributors. The most current list at Apache's website for Hadoop lists the top-level stable projects and releases and also the incubated projects which are evolving (Fig. 2.5). Hadoop core components: At the heart of the Hadoop framework or architecture there are components that can be called the foundational core. These components include the following (Fig. 2.6). Let us take a quick look at these components and further understand the ecosystem evolution and recent changes. [Figure 2.5: Apache top-level Hadoop projects.] HDFS: The biggest problem experienced by the early developers of large-scale data processing was the ability to break down the files across multiple systems and process each piece of the file independently of the other pieces, yet consolidate the results together in a single result set. The secondary problem that remained unsolved was fault tolerance, both at the file-processing level and at the overall system level in distributed processing systems. With GFS the problem of scaling out processing across multiple systems was largely solved. HDFS, which is derived from NDFS, was designed to solve the large distributed data processing problem...

  • Data Processing and Modeling with Hadoop

    Mastering Hadoop Ecosystem Including ETL, Data Vault, DMBok, GDPR, and Various Data-Centric Tools

    • Vinicius Aquino do Vale (Author)
    • 2021 (Publication Date)
    • BPB Publications (Publisher)

    ...With these profiles in hand, it is already possible to understand the audience and target the products that best fit each profile, creating a unique experience for the consumer. But how can one do this with billions, even trillions, of data points? Google, understanding this need and seeing a way to absorb more and more knowledge, developed several tools, among them the Google File System (GFS) and MapReduce. The papers describing them were the most important released up to that moment, essentially starting the era of what we know as big data. Those papers gave rise to one of the main tools: Hadoop. Within the most current versions of Hadoop we have YARN, which provides one of the main characteristics of parallel processing. Before we talk about processing, we first need to separate the layers so that we can see the environment in a simpler way. Let's separate the storage environment, where the data is stored, from the processing environment. Imagine that the processing layer is formed by several computers, spread over several data centers. Now, imagine that it is necessary to process a file. If there are several computers, how can the programs run in such a way that the processors and memory they use do not exceed the computational limits? Someone needs to control how each resource is used by the programs. It is the resource manager that governs everything that happens in the environment. YARN is the resource manager that knows how each process is running and can restart processes that fail. This is how parallel processing ensures that your job, even if part of it fails, will still return results to the end user. Because this capability is so important to the ecosystem, it is possible to plug in other types of massively parallel processing engines, which concern themselves with processing and delegate resource management to YARN. We have MapReduce itself, with its characteristics, and Spark, with its more evolved capabilities...
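
    As a rough illustration of YARN acting as the resource manager that tracks how each process is running, the sketch below uses the org.apache.hadoop.yarn.client.api.YarnClient API to list the applications the ResourceManager currently knows about. Connection settings are assumed to come from the cluster's yarn-site.xml; the class name is illustrative.

        import org.apache.hadoop.yarn.api.records.ApplicationReport;
        import org.apache.hadoop.yarn.client.api.YarnClient;
        import org.apache.hadoop.yarn.conf.YarnConfiguration;

        public class YarnApplicationListSketch {
            public static void main(String[] args) throws Exception {
                YarnClient yarn = YarnClient.createYarnClient();
                yarn.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
                yarn.start();
                try {
                    // Ask the ResourceManager for every application it is tracking.
                    for (ApplicationReport app : yarn.getApplications()) {
                        System.out.printf("%s  %s  state=%s%n",
                                app.getApplicationId(), app.getName(), app.getYarnApplicationState());
                    }
                } finally {
                    yarn.stop();
                }
            }
        }

    Whether the processing engine on top is MapReduce or Spark, it is this resource manager that decides where work runs and restarts it when a node fails.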

  • Intelligent Data Analysis

    From Data Gathering to Data Comprehension

    • Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, Kalpna Sagar (Authors)
    • 2020 (Publication Date)
    • Wiley (Publisher)

    ...A Hadoop distributed file system stores the input data set shown in Table 7.3 for processing using the MapReduce framework. Using the framework, data will be processed in parallel on multiple clusters. Hadoop splits the task into two phases. The two phases used for parallel processing are as follows: Map phase: in the Map phase, the output is in the form of <key, value> pairs. Reduce phase: in the Reduce phase, the output from the Map phase is combined together, and the Reduce phase processes each of the <key, value> pairs to generate the output. The final output obtained from the Reducer will be stored in HDFS. The working principle of the MapReduce framework is shown in Figure 7.9, which depicts the methodology adopted, whereby the input data file obtained from Table 7.3 is split into different chunks on cluster machines and passes through different phases to get the summarized results. The unstructured data stored in HDFS is shuffled and then refined to obtain an output in what is called the "Reduce phase." The MapReduce technique uses two components when a request for information comes in, i.e., a JobTracker and a TaskTracker. The JobTracker works on the Hadoop master node, which is the NameNode, and the TaskTracker works on each node within the Hadoop network. The advantage of Hadoop is the mechanism it follows for the distribution of data on multiple clusters, which results in the faster processing of data and the best utilization of storage space and the processing power of machines. 7.6.3.2 Stage 2 Apriori Algorithm: In stage 2 of the methodology, association rule mining is implemented on the synthesized educational data set shown in Table 7.3. The association rules will be discovered through the Apriori algorithm, where support and confidence are calculated for frequent itemsets. The support for each item is calculated in the data set and checked against a minimum threshold value. The items not fulfilling the minimum threshold value will be eliminated...
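
    The Reduce phase described above, in which the <key, value> pairs emitted by the Map phase are combined per key, can be sketched with the standard org.apache.hadoop.mapreduce Reducer API as follows. TokenCountReducer is an illustrative name; it simply sums the counts received for each key and writes the result, which Hadoop then stores in HDFS.

        import java.io.IOException;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // Illustrative reducer: receives (word, [1, 1, ...]) after the shuffle and emits (word, total).
        public class TokenCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable total = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();   // combine all values that share this key
                }
                total.set(sum);
                context.write(key, total);   // final <key, value> pair written to HDFS
            }
        }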