In this article by John Zablocki, author of the book Couchbase Essentials, you will be introduced to MapReduce and learn how to use it to create secondary indexes for your documents.

At its simplest, MapReduce is a programming pattern used to process large amounts of data in parallel, typically data that is distributed across several nodes. In the NoSQL world, MapReduce implementations may be found on many platforms from MongoDB to Hadoop, and of course, Couchbase.

Even if you're new to the NoSQL landscape, it's quite possible that you've already worked with a form of MapReduce. The inspiration for MapReduce in distributed NoSQL systems was drawn from the functional programming concepts of map and reduce. While purely functional programming languages haven't quite reached mainstream status, languages such as Python, C#, and JavaScript all support map and reduce operations.

Map functions

Consider the following Python snippet:

numbers = [1, 2, 3, 4, 5]
doubled = list(map(lambda n: n * 2, numbers))  # list() is needed in Python 3, where map() returns an iterator
# doubled == [2, 4, 6, 8, 10]

These two lines of code demonstrate a very simple use of a map() function. In the first line, the numbers variable is created as a list of integers. The second line applies a function to the list to create a new mapped list. In this case, the function passed to map() is a Python lambda, which is just an inline, unnamed function. The body of the lambda multiplies each number by two.

This map() function can be made slightly more complex by doubling only odd numbers, as shown in this code:

numbers = [1, 2, 3, 4, 5]

def double_odd(num):
    # Leave even numbers alone and double the odd ones
    if num % 2 == 0:
        return num
    else:
        return num * 2

doubled = list(map(double_odd, numbers))
# doubled == [2, 2, 6, 4, 10]

Map functions are implemented differently in each language or platform that supports them, but all follow the same pattern. An iterable collection of objects is passed to a map function. The map function is then applied to each item of the collection in turn. The final result is a new collection in which each of the original items has been transformed by the map.
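
The pattern just described can be sketched in a few lines of Python. The helper name simple_map is hypothetical and exists only for illustration; in practice you would use the built-in map() function.

```python
def simple_map(func, items):
    """Apply func to each item and collect the results in a new list."""
    result = []
    for item in items:             # iterate over the input collection
        result.append(func(item))  # transform one item at a time
    return result


numbers = [1, 2, 3, 4, 5]
doubled = simple_map(lambda n: n * 2, numbers)
print(doubled)  # [2, 4, 6, 8, 10]
print(numbers)  # [1, 2, 3, 4, 5]
```

Note that the original list is left untouched; the map produces a new collection rather than modifying its input.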

Reduce functions

Like map functions, reduce functions work by applying a provided function to an iterable data structure. The key difference between the two is that a reduce function produces a single value from the input iterable. Using Python's reduce() function (a built-in in Python 2, imported from functools in Python 3), we can see how to produce a sum of integers, as follows:

from functools import reduce

numbers = [1, 2, 3, 4, 5]
total = reduce(lambda x, y: x + y, numbers)
# total == 15

You probably noticed that unlike our map operation, the reduce lambda has two parameters (x and y in this case). The argument passed to x will be the accumulated value of all applications of the function so far, and y will receive the next value to be added to the accumulation.

Parenthetically, the order of operations can be seen as ((((1 + 2) + 3) + 4) + 5). Alternatively, the steps are shown in the following list:

x = 1, y = 2
x = 3, y = 3
x = 6, y = 4
x = 10, y = 5
x = 15

As this list demonstrates, the value of x is the cumulative sum of previous x and y values. As such, reduce functions are sometimes termed accumulate or fold functions. Regardless of their name, reduce functions serve the common purpose of combining pieces of a recursive data structure to produce a single value.
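
To make the accumulation explicit, here is a minimal sketch of a fold written by hand. The name simple_reduce is hypothetical, and unlike the steps listed above, this version takes an explicit starting value, so the first application is x = 0, y = 1.

```python
def simple_reduce(func, items, initial):
    """Fold items into a single value, starting from initial."""
    accumulated = initial
    for item in items:
        # accumulated plays the role of x, item plays the role of y
        accumulated = func(accumulated, item)
    return accumulated


numbers = [1, 2, 3, 4, 5]
total = simple_reduce(lambda x, y: x + y, numbers, 0)
largest = simple_reduce(lambda x, y: x if x > y else y, numbers, numbers[0])
print(total)    # 15
print(largest)  # 5
```

The second call shows that the same fold can compute something other than a sum, simply by swapping the combining function.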

Couchbase MapReduce

Creating an index (or view) in Couchbase requires creating a map function written in JavaScript. When the view is created for the first time, the map function is applied to each document in the bucket containing the view. When you update a view, only new or modified documents are indexed. This behavior is known as incremental MapReduce.
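
The idea of incremental indexing can be illustrated with a toy model in Python. This is only a sketch of the concept, not a description of how Couchbase implements it: the IncrementalIndex class, the rev field, and the sample documents are all assumptions made up for this example.

```python
def map_author(doc):
    """Hypothetical map function: index each document by its author."""
    return (doc["author"], doc["id"])


class IncrementalIndex:
    def __init__(self, map_func):
        self.map_func = map_func
        self.entries = {}    # document id -> mapped entry
        self.seen_revs = {}  # document id -> revision last indexed

    def update(self, docs):
        """Re-map only documents that are new or whose revision changed."""
        processed = 0
        for doc in docs:
            if self.seen_revs.get(doc["id"]) != doc["rev"]:
                self.entries[doc["id"]] = self.map_func(doc)
                self.seen_revs[doc["id"]] = doc["rev"]
                processed += 1
        return processed


docs = [
    {"id": 1, "rev": 1, "author": "A. Writer"},
    {"id": 2, "rev": 1, "author": "B. Author"},
]
index = IncrementalIndex(map_author)
first_pass = index.update(docs)   # both documents are new

docs[1] = {"id": 2, "rev": 2, "author": "B. Author Jr."}  # modified
docs.append({"id": 3, "rev": 1, "author": "C. Scribe"})   # new
second_pass = index.update(docs)  # document 1 is skipped
print(first_pass, second_pass)    # 2 2
```

On the second pass, only the modified and the newly added document are mapped again, which is the behavior the term incremental MapReduce refers to.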

You can think of a basic map function in Couchbase as being similar to a SQL CREATE INDEX statement. Effectively, you are defining a column or a set of columns to be indexed by the server. Of course, these are not columns, but rather properties of the documents to be indexed.

Basic mapping

To illustrate the process of creating a view, first imagine that we have a set of JSON documents as shown here:

var books = [
    {
        "id": 1,
        "title": "The Bourne Identity",
        "author": "Robert Ludlow"
    },
    {
        "id": 2,
        "title": "The Godfather",
        "author": "Mario Puzzo"
    },
    {
        "id": 3,
        "title": "Wiseguy",
        "author": "Nicholas Pileggi"
    }
];

Each document contains title and author properties. In Couchbase, to query these documents by either title or author, we'd first need to write a map function. Without considering how map functions are written in Couchbase, we're able to understand the process with vanilla JavaScript:

books.map(function(book) {
  return book.author;
});

In the preceding snippet, we're making use of the built-in JavaScript array's map() function. Similar to the Python snippets we saw earlier, JavaScript's map() function takes a function as a parameter and returns a new array with mapped objects. In this case, we'll have an array with each book's author, as follows:

["Robert Ludlow", "Mario Puzzo", "Nicholas Pileggi"]

At this point, we have a mapped collection that will be the basis for our author index. However, we haven't provided a means for the index to be able to refer back to its original document. If we were using a relational database, we'd have effectively created an index on the Author column with no way to get back to the row that contained it.

With a slight modification to our map function, we are able to provide the key (the id property) of the document as well in our index:

books.map(function(book) {
  return [book.author, book.id];
});

In this slightly modified version, we're including the ID with the output of each author. In this way, the index has its document's key stored with its author.

[["Robert Ludlow", 1], ["Mario Puzzo", 2], ["Nicholas Pileggi", 3]]

We'll soon see how this structure more closely resembles the values stored in a Couchbase index.
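
To see why storing the document key alongside the indexed value matters, consider this Python sketch of a lookup. The dictionary stands in for a key/value store, and find_by_author is a hypothetical helper; the index is sorted here only to make the output easy to read.

```python
# A stand-in for the key/value store: document key -> document.
books = {
    1: {"title": "The Bourne Identity", "author": "Robert Ludlow"},
    2: {"title": "The Godfather", "author": "Mario Puzzo"},
    3: {"title": "Wiseguy", "author": "Nicholas Pileggi"},
}

# Map step: one (author, id) pair per document.
author_index = sorted(
    (book["author"], book_id) for book_id, book in books.items()
)


def find_by_author(author):
    """Use the index to find matching ids, then fetch the documents."""
    ids = [book_id for key, book_id in author_index if key == author]
    return [books[book_id] for book_id in ids]


print(author_index)
print(find_by_author("Mario Puzzo"))
```

The lookup never scans the documents themselves. It consults the index for matching keys and then retrieves each document by its ID.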

Basic reducing

Not every Couchbase index requires a reduce component. In fact, we'll see that Couchbase already comes with built-in reduce functions that will provide you with most of the reduce behavior you need. However, before relying on only those functions, it's important to understand why you'd use a reduce function in the first place.

Returning to the preceding example of the map, let's imagine we have a few more documents in our set, as follows:

var books = [
    {
        "id": 1,
        "title": "The Bourne Identity",
        "author": "Robert Ludlow"
    },
    {
        "id": 2,
        "title": "The Bourne Ultimatum",
        "author": "Robert Ludlow"
    },
    {
        "id": 3,
        "title": "The Godfather",
        "author": "Mario Puzzo"
    },
    {
        "id": 4,
        "title": "The Bourne Supremacy",
        "author": "Robert Ludlow"
    },
    {
        "id": 5,
        "title": "The Family",
        "author": "Mario Puzzo"
    },
    {
        "id": 6,
        "title": "Wiseguy",
        "author": "Nicholas Pileggi"
    }
];

We'll still create our index using the same map function because it provides a way of accessing a book by its author. Now imagine that we want to know how many books an author has written, or (assuming we had more data) the average number of pages written by an author.

A map function alone cannot answer these questions. Each application of the map function knows nothing about the previous application. In other words, there is no way for you to compare or accumulate information across several books by the same author.

Fortunately, there is a solution to this problem. As you've probably guessed, it's the use of a reduce function. As a somewhat contrived example, consider this JavaScript:

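var mapped = books.map(function (book) {
    return [book.author, book.id];
});

// Count how many times each key (author) appears in the mapped output
var counts = mapped.reduce(function (acc, cur) {
    var key = cur[0];
    acc[key] = (acc[key] || 0) + 1;
    return acc;
}, {});
// counts == { "Robert Ludlow": 3, "Mario Puzzo": 2, "Nicholas Pileggi": 1 }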

This code doesn't quite accurately reflect the way you would count books with Couchbase but it illustrates the basic idea. You look for each occurrence of a key (author) and increment a counter when it is found. With Couchbase MapReduce, the mapped structure is supplied to the reduce() function in a better format. You won't need to keep track of items in a dictionary.

Couchbase views

At this point, you should have a general sense of what MapReduce is, where it came from, and how it will affect the creation of a Couchbase Server view. So without further ado, let's see how to write our first Couchbase view.

When you installed Couchbase Server, you had the option of installing sample buckets. In fact, there were two to choose from. The bucket we'll use is beer-sample. If you didn't install it, don't worry. You can add it by opening the Couchbase Console and navigating to the Settings tab. Here, you'll find the option to install the bucket, as shown next:

[Screenshot: the option to install sample buckets on the Settings tab of the Couchbase Console]

First, you need to understand the document structures with which you're working. The following JSON object is a beer document (abbreviated):

{
 "name": "Sundog",
 "type": "beer",
 "brewery_id": "new_holland_brewing_company",
 "description": "Sundog is an amber ale...",
 "style": "American-Style Amber/Red Ale",
 "category": "North American Ale"
}

As you can see, the beer documents have several properties. We're going to create an index to let us query these documents by name. In SQL, the query would look like this:

SELECT Id FROM Beers WHERE Name = ?

You might be wondering why the SQL example includes only the Id column in its projection. For now, just know that to query a document using a view with Couchbase, the property by which you're querying must be included in an index.

To create that index, we'll write a map function. The simplest example of a map function to query beer documents by name is as follows:

function(doc) {
  emit(doc.name);
}

The body of this map function has only one line. It calls the built-in Couchbase emit() function, which signals that a value should be indexed. The output of this map function will be an array of names.

The beer-sample bucket includes brewery data as well. These documents look like the following (abbreviated):

{
  "name": "Thomas Hooker Brewing",
  "city": "Bloomfield",
  "state": "Connecticut",
  "website": "http://www.hookerbeer.com/",
  "type": "brewery"
}

If we reexamine our map function, we'll see an obvious problem: both the brewery and beer documents have a name property. When this map function is applied to the documents in the bucket, it will create an index that mixes entries from brewery and beer documents.

The problem is that Couchbase documents exist in a single container: the bucket. There is no namespace for a set of related documents. The solution has typically involved including a type or docType property on each document. The value of this property is used to distinguish one kind of document from another.

In the case of the beer-sample bucket, beer documents have type = "beer" and brewery documents have type = "brewery". Therefore, we are easily able to modify our map function to create an index only on beer documents:

function(doc) {
  if (doc.type == "beer") {
    emit(doc.name);
  }
}
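
The effect of the type check can be simulated in Python. This is a rough sketch only: build_index and the emit stand-in are hypothetical helpers written for this example, not part of Couchbase, and the two documents are trimmed versions of the samples shown earlier.

```python
docs = [
    {"name": "Sundog", "type": "beer",
     "brewery_id": "new_holland_brewing_company"},
    {"name": "Thomas Hooker Brewing", "type": "brewery",
     "city": "Bloomfield"},
]


def build_index(map_func, documents):
    """Run map_func over every document and collect what it emits."""
    rows = []

    def emit(key, value=None):
        # Stand-in for the Couchbase built-in of the same name
        rows.append((key, value))

    for doc in documents:
        map_func(doc, emit)
    return rows


def map_all_names(doc, emit):
    emit(doc["name"])


def map_beer_names(doc, emit):
    if doc["type"] == "beer":
        emit(doc["name"])


unfiltered = build_index(map_all_names, docs)
beers_only = build_index(map_beer_names, docs)
print(unfiltered)  # both the beer and the brewery appear
print(beers_only)  # only the beer appears
```

Without the check, the brewery's name lands in the same index as the beer names. With it, the map function simply emits nothing for non-beer documents.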

The emit() function actually takes two arguments. The first, as we've seen, is the key to be indexed. The second is an optional value that is passed to the reduce function. Imagine that we want to count the number of beers in each category. In SQL, we would write the following query:

SELECT Category, COUNT(*) FROM Beers GROUP BY Category

To achieve the same functionality with Couchbase Server, we'll need to use both map and reduce functions. First, let's write the map. It will create an index on the category property:

function(doc) {
  if (doc.type == "beer") {
    emit(doc.category, 1);
  }
}

The only real difference between our category index and our name index is that we're including an argument for the value parameter of the emit() function. All we'll do with those values is count them. This counting will be done in our reduce function:

function(keys, values) {
  return values.length;
}

In this example, the values parameter will be given to the reduce function as a list of all values associated with a particular key. In our case, for each beer category, there will be a list of ones (that is, [1, 1, 1, 1, 1, 1]). Couchbase also provides a built-in _count function. It can be used in place of the entire reduce function in the preceding example.
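
The whole pipeline, from map through grouping to reduce, can be approximated in Python as follows. This is a simplified model under stated assumptions: map_category and reduce_count are Python stand-ins for the JavaScript functions above, the grouping step imitates what the server does before calling reduce, and every document other than Sundog and Thomas Hooker Brewing is invented for the example.

```python
from itertools import groupby

beers = [
    {"name": "Sundog", "type": "beer", "category": "North American Ale"},
    {"name": "Hypothetical Stout", "type": "beer", "category": "Irish Ale"},
    {"name": "Hypothetical Red", "type": "beer", "category": "North American Ale"},
    {"name": "Thomas Hooker Brewing", "type": "brewery"},
]


def map_category(doc):
    """Stand-in for the map function: yield (key, value) pairs."""
    if doc["type"] == "beer":
        yield (doc["category"], 1)


def reduce_count(key, values):
    """Stand-in for the reduce function: count the values."""
    return len(values)


# Map every document, then sort so equal keys sit next to each other.
rows = sorted(pair for doc in beers for pair in map_category(doc))

# Group the sorted rows by key and reduce each group to one number.
counts = {
    key: reduce_count(key, [value for _, value in group])
    for key, group in groupby(rows, key=lambda row: row[0])
}
print(counts)  # {'Irish Ale': 1, 'North American Ale': 2}
```

The result has one row per category, which mirrors the output of the SQL GROUP BY query shown earlier.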

Now that we've seen the basic requirements when creating an actual Couchbase view, it's time to add a view to our bucket. The easiest way to do so is to use the Couchbase Console.

Summary

In this article, you learned the purpose of secondary indexes in a key/value store. We dug deep into MapReduce, both in terms of its history in functional languages and as a tool for NoSQL and big data systems.