
Map Reduce and Filter

Map, reduce, and filter are fundamental higher-order functions used in functional programming and data processing. Map applies a function to each element in a collection, producing a new collection. Reduce combines elements of a collection into a single value using a specified operation. Filter selects elements from a collection based on a given condition, creating a new collection containing only the matching elements.
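
To make these definitions concrete, here is a minimal Python sketch (the names numbers, squares, evens, and total are only illustrative; note that in Python 3 reduce lives in the functools module):

    from functools import reduce

    numbers = [1, 2, 3, 4, 5]

    # map: apply a function to every element, producing a new collection
    squares = list(map(lambda x: x * x, numbers))        # [1, 4, 9, 16, 25]

    # filter: keep only the elements that satisfy a condition
    evens = list(filter(lambda x: x % 2 == 0, numbers))  # [2, 4]

    # reduce: fold all elements into a single value with a binary operation
    total = reduce(lambda acc, x: acc + x, numbers, 0)   # 15

    print(squares, evens, total)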

Written by Perlego with AI-assistance

5 Key excerpts on "Map Reduce and Filter"

Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.
  • Intelligent Data Analysis for e-Learning

    Enhancing Security and Trustworthiness in Online Learning Systems

    • Jorge Miguel, Santi Caballé, Fatos Xhafa (Authors)
    • 2016 (Publication Date)
    • Academic Press (Publisher)

    ...[ 167 ], MapReduce is a programming model and an associated implementation for processing and generating large data sets. The model is based on specifying a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key. The Map/Reduce functions are as follows [ 167 ]:
    • The Map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values and passes them to the Reduce function.
    • The Reduce function accepts an intermediate key (produced by the Map function) and a set of values for that key. It merges together these values to form a smaller set of values. Typically, just zero or one output value is produced per Reduce invocation.
    Fig. 5.2 shows a schema that illustrates the MapReduce main features and basic flow process. [Fig. 5.2: MapReduce basic schema]
    5.3.2 MapReduce in Parallel Computing Classifications
    Regarding the parallel computing model and classification discussed in Section 5.1, MapReduce programs are automatically executed in a parallel cluster-based computing environment [ 167 ]. The run-time framework takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required intermachine communication. This allows programmers, without any experience with parallel and distributed systems, to utilize the resources of a large distributed system [ 167 ]...
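
    As a rough single-process illustration of the flow described above (not the distributed implementation the authors refer to), the grouping of intermediate values by key and the per-key Reduce call might look like the following Python sketch; run_mapreduce, map_fn, and reduce_fn are hypothetical names introduced here, not part of any MapReduce library:

        from collections import defaultdict

        def run_mapreduce(inputs, map_fn, reduce_fn):
            # Map phase: each input pair yields intermediate (key, value) pairs.
            intermediate = defaultdict(list)
            for in_key, in_value in inputs:
                for out_key, value in map_fn(in_key, in_value):
                    intermediate[out_key].append(value)
            # Reduce phase: merge all values that share an intermediate key.
            return {key: reduce_fn(key, values) for key, values in intermediate.items()}

        # Word count is the classic example.
        def map_fn(doc_id, text):
            return [(word, 1) for word in text.split()]

        def reduce_fn(word, counts):
            return sum(counts)

        docs = [("d1", "map reduce map"), ("d2", "reduce filter")]
        print(run_mapreduce(docs, map_fn, reduce_fn))
        # {'map': 2, 'reduce': 2, 'filter': 1}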

  • Building Big Data Applications

    ...Technology startups like Google, RankDex (now known as Baidu), and Yahoo, and open source projects like Nutch, were all figuring out how to increase the performance of the search query to scale infinitely. Out of these efforts came the technologies, which are now the foundation of big data processing.
    MapReduce
    MapReduce is a programming model for processing extremely large sets of data. Google originally developed it for solving the scalability of search computation. Its foundations are based on principles of parallel and distributed processing without any database dependency. The flexibility of MapReduce lies in the ability to process distributed computations on large amounts of data on clusters of commodity servers, with simple task-based models for managing them. The key features of MapReduce that make it the interface on Hadoop or Cassandra include the following:
    • Automatic parallelization
    • Automatic distribution
    • Fault tolerance
    • Status and monitoring tools
    • Easy abstraction for programmers
    • Programming language flexibility
    • Extensibility
    MapReduce programming model
    MapReduce is based on functional programming models, largely from Lisp. Typically, users will implement two functions:
    • Map (in_key, in_value) -> (out_key, intermediate_value) list
    • The Map function, written by the user, receives an input pair of key and value and, after its computation cycles, produces a set of intermediate key/value pairs.
    • Library functions are then used to group together all intermediate values associated with an intermediate key I and pass them to the Reduce function.
    • Reduce (out_key, intermediate_value list) -> out_value list
    • The Reduce function, written by the user, accepts an intermediate key I and the set of values for that key.
    • It merges together these values to form a possibly smaller set of values.
    • Reducer outputs are just zero or one output value per invocation.
    • The intermediate values are supplied to the Reduce function via an iterator...
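
    A hedged sketch of the two user-written functions listed above, with Python's multiprocessing.Pool standing in for the framework's automatic parallelization (map_words and reduce_counts are illustrative names; a real Hadoop or Cassandra deployment would distribute these tasks across machines rather than local processes):

        from collections import defaultdict
        from multiprocessing import Pool

        # User-written Map: (in_key, in_value) -> list of (out_key, intermediate_value)
        def map_words(pair):
            doc_id, text = pair
            return [(word, 1) for word in text.split()]

        # User-written Reduce: merges the intermediate values for one key;
        # the (key, values) tuple packing is only for pool.map convenience.
        def reduce_counts(key_and_values):
            word, counts = key_and_values
            return word, sum(counts)

        if __name__ == "__main__":
            docs = [("d1", "big data big clusters"), ("d2", "data pipelines")]
            with Pool() as pool:
                # Map tasks run in parallel worker processes.
                mapped = pool.map(map_words, docs)
                # Group intermediate values by key (the "library functions" step).
                groups = defaultdict(list)
                for pairs in mapped:
                    for key, value in pairs:
                        groups[key].append(value)
                # Reduce each key's value list; this step could be parallelized too.
                print(dict(pool.map(reduce_counts, groups.items())))
                # {'big': 2, 'data': 2, 'clusters': 1, 'pipelines': 1}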

  • Data Processing and Modeling with Hadoop

    Mastering Hadoop Ecosystem Including ETL, Data Vault, DMBok, GDPR, and Various Data-Centric Tools

    • Vinicius Aquino do Vale (Author)
    • 2021 (Publication Date)
    • BPB Publications (Publisher)

    ...We have MapReduce itself, with its characteristics, and Spark, with its more evolved capabilities, in addition to Impala and even Apache Mesos, which offers features very similar to YARN.
    What is MapReduce?
    As we enter the 21st century, we begin a period of drastic change on our planet: we consume more and more data. When we start working with big data, our main challenge is to understand how data is processed in parallel and how the system does this task transparently. We have to assume that some action may fail, and the system must be able to recover the data and continue processing without user interference. When we go deeper into big data, the first difficulty we encounter is how the data is processed in parallel and how the tools do this task in a transparent way; despite being a much-discussed and common subject in the daily life of companies, it is still poorly understood by professionals. MapReduce is one of the oldest and best-known frameworks for cluster computing. It follows the functional programming model and performs explicit synchronization through computational steps. According to Wikipedia, MapReduce is a programming model and framework introduced by Google to support parallel computations over large collections of data on computer clusters. MapReduce is data oriented and processes data in two primary phases: Map and Reduce. The MapReduce framework is, in essence, a large distributed system. The points that define it are:
    • An input reader: divides the data into appropriately sized 'blocks', and the framework assigns a block to each mapping function...
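
    To make the input reader and the failure-recovery point concrete, here is a toy single-process Python sketch (input_reader, flaky_map, and run_with_retry are hypothetical names; in Hadoop the splitting and task retries happen inside the framework, not in user code):

        def input_reader(text, block_size=32):
            # Input reader: divide the data into appropriately sized blocks;
            # the framework would assign one block to each map task.
            # A real reader splits on record boundaries, so unlike this toy it
            # would never cut a word in half.
            return [text[i:i + block_size] for i in range(0, len(text), block_size)]

        failures = {"remaining": 2}   # simulate two transient task failures

        def flaky_map(block):
            if failures["remaining"] > 0:
                failures["remaining"] -= 1
                raise RuntimeError("simulated lost task")
            return len(block.split())   # words seen in this block

        def run_with_retry(task, block, attempts=3):
            # Toy stand-in for the framework's recovery: rerun a failed task on
            # the same block, with no interference from the user's code.
            last_error = None
            for _ in range(attempts):
                try:
                    return task(block)
                except RuntimeError as error:
                    last_error = error
            raise last_error

        text = "mapreduce splits the input into blocks and processes them in parallel"
        print(sum(run_with_retry(flaky_map, block) for block in input_reader(text)))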

  • Big Data Computing

    A Guide for Business and Technology Managers

    ...MapReduce was the software framework invented by Google to address this issue; the name MapReduce was inspired by the map and reduce functions of the functional programming language Lisp. Parallel programming on a massive scale has the potential not only to address the issue of reliability but also to deliver a huge boost in performance. This is opportune because, given the problems with the large data sets of the web, without massive parallelism the processing itself, let alone reliability, may not be achievable. Instituting strategies and mechanisms that accommodate, instead of avoiding, the possibility of routine faults and failures is not new. The transmission control protocol (TCP) and Internet protocol (IP), grouped together as TCP/IP, are two of the core technology standards on which the Internet is based. TCP/IP is a low-level protocol suite that ensures that signals can be moved from one place to another. IP moves packets of data from one point to another, with routers helping those packets find their way across networks; this transmission of packets is unreliable. TCP builds reliable connections on top of IP, accepting that not all packets will complete their journeys and resending them as necessary.
    1.7 Brewer’s CAP Theorem
    Techniques for achieving Atomicity, Consistency, Isolation, and Durability (ACID) properties in a database system are explained in the note below. However, applying these techniques in large-scale scenarios such as data services in the cloud leads to scalability problems: the amount of data to be stored and processed and the transaction and query load to be managed are usually too large to run the database services on a single machine. To overcome this data storage bottleneck, the database must be stored on multiple nodes, for which horizontal scaling is the typically chosen approach. The database is partitioned across the different nodes: either tablewise or by sharding (see Chapter 9 Sub-section 9.2.3, “Row Partitioning or Sharding”)...
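
    The sharding step mentioned at the end of the excerpt, partitioning rows across nodes for horizontal scaling, can be sketched with simple hash-based routing (NODES and shard_for are illustrative names, not the API of any particular database; production systems typically use consistent hashing so that adding a node moves only a fraction of the keys):

        import hashlib

        NODES = ["node-0", "node-1", "node-2"]   # hypothetical database nodes

        def shard_for(key, nodes=NODES):
            # Hash the row key and map it to one of the nodes (hash sharding).
            digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
            return nodes[int(digest, 16) % len(nodes)]

        rows = {"user:1001": {"name": "Ada"}, "user:1002": {"name": "Grace"}}
        for key, row in rows.items():
            print(key, "->", shard_for(key))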

  • Intelligent Data Analysis

    From Data Gathering to Data Comprehension

    • Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, Kalpna Sagar (Authors)
    • 2020 (Publication Date)
    • Wiley (Publisher)

    ...In such a scenario, where data is being generated at an alarming rate from many sources such as media files, social media websites, e-mails, Google searches, instant messaging, mobile users, the internet of things (IoT), etc., data grows enormously and may not fit on a single node, which is why the term “big data” was coined. It is a term for data sets that are so large or complex that it becomes arduous to manage them with customary data processing application software. Using big data techniques, unstructured data is gathered and analyzed to reveal informative data for operations; this includes gathering data for storage and analysis so as to gain control over operations such as experimentation, fact-finding, allocation, data visualization, updating, and maintaining the confidentiality of information. The problem of an enormously large data set can be solved by Hadoop using the MapReduce model, which works as a Hadoop layer and allows parallel processing of the data stored in the Hadoop distributed file system, which in turn allows dumping any kind of data across the cluster. MapReduce tasks run over Hadoop clusters by splitting the big data, i.e., the input file, into small pieces and processing the data on parallel distributed clusters. It is an open-source programming prototype that performs parallel processing of applications on clusters, and with the distribution of data, computation becomes faster. A MapReduce programming framework executes its operations in three stages, i.e., the map phase, the shuffle phase, and the reduce phase. In data mining, association rule learning is a method for discovering interesting relations among variables in large databases [ 5 ]. A big data approach using association rule mining can help colleges, institutions, and universities get a comprehensive perspective of their students...
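
    Tying the excerpt's two threads together, the map, shuffle, and reduce stages can be used to compute support counts for item pairs, the first step of association rule mining; this is a hedged single-machine Python sketch, not the distributed Hadoop job the authors describe:

        from collections import defaultdict
        from itertools import combinations

        transactions = [
            ["milk", "bread", "eggs"],
            ["milk", "bread"],
            ["bread", "eggs"],
        ]

        # Map phase: emit ((item_a, item_b), 1) for every item pair in a transaction.
        mapped = []
        for basket in transactions:
            for pair in combinations(sorted(set(basket)), 2):
                mapped.append((pair, 1))

        # Shuffle phase: group the intermediate values by key (the item pair).
        groups = defaultdict(list)
        for pair, count in mapped:
            groups[pair].append(count)

        # Reduce phase: sum the counts to get the support of each pair.
        support = {pair: sum(counts) for pair, counts in groups.items()}
        print(support)
        # {('bread', 'eggs'): 2, ('bread', 'milk'): 2, ('eggs', 'milk'): 1}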