In this article by Wilson da Rocha França, author of the book, MongoDB Data Modeling, we will cover documents and collections used in data modeling with MongoDB.


Data modeling is a very important process during the conception of an application since this step will help you to define the necessary requirements for the database's construction. This definition is precisely the result of the data understanding acquired during the data modeling process.

This process, regardless of the chosen data model, is commonly divided into two phases: one that is very close to the user's view and another that translates this view into a conceptual schema. In relational database modeling, the main challenge is to build a robust database from these two phases, so that it can be updated with as little impact as possible throughout the application's lifecycle.

A big advantage of NoSQL compared to relational databases is that NoSQL databases are more flexible at this point, due to the possibility of a schema-less model that, in theory, can cause less impact on the user's view if a modification in the data model is needed.

Despite the flexibility NoSQL offers, it is important to know in advance how we will use the data in order to model a NoSQL database. It is not a good idea to leave the format of the persisted data unplanned, even in a NoSQL database. At first sight, this is also the point where database administrators who are used to the relational world become more uncomfortable.

Relational database standards, such as SQL, brought us a sense of security and stability by setting up rules, norms, and criteria. On the other hand, we will dare to state that this security distanced database designers from the domain from which the data to be stored is drawn.

The same thing happened with application developers. There is a notable divergence of interests between them and database administrators, especially regarding data models.

NoSQL databases practically require database professionals to move closer to the applications, and developers to move closer to the databases.

For that reason, even if you are a data modeler, designer, or database administrator, don't be scared if from now on we address subjects that are outside your comfort zone. Be prepared to start using terms that are common among application developers, and add them to your vocabulary.

This article will cover the following:

Introducing your documents and collections
The document's characteristics and structure

Introducing documents and collections

MongoDB uses the document as its basic unit of data. Documents in MongoDB are represented in JavaScript Object Notation (JSON).

Collections are groups of documents. By analogy, a collection is similar to a table in a relational model, and a document is similar to a record in that table. Finally, collections belong to a database in MongoDB.

The documents are serialized on disk in a format known as Binary JSON (BSON), a binary representation of a JSON document.

An example of a document is:
{
   "_id": 123456,
   "firstName": "John",
   "lastName": "Clay",
   "age": 25,
   "address": {
     "streetAddress": "131 GEN. Almério de Moura Street",
     "city": "Rio de Janeiro",
     "state": "RJ",
     "postalCode": "20921060"
   },
   "phoneNumber":[
     {
         "type": "home",
         "number": "+5521 2222-3333"
     },
     {
         "type": "mobile",
         "number": "+5521 9888-7777"
     }
   ]
}

Unlike the relational model, where you must declare a table structure, a collection doesn't enforce a certain structure for a document. It is possible that a collection contains documents with completely different structures.

We can have, for instance, the following document in a users collection:

{
   "_id": "123456",
   "username": "johnclay",
   "age": 25,
   "friends":[
     {"username": "joelsant"},
     {"username": "adilsonbat"}
   ],
   "active": true,
   "gender": "male"
}

We can also have:

{
   "_id": "654321",
   "username": "santymonty",
   "age": 25,
   "active": true,
   "gender": "male",
   "eyeColor": "brown"
}
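
Because a collection does not enforce a structure, code that reads documents usually has to tolerate missing fields. The following is a minimal sketch in plain JavaScript, assuming an in-memory array as a stand-in for the users collection (no database is involved, and the describeUser helper is a hypothetical name introduced only for illustration):

```javascript
// In-memory stand-in for the users collection shown above (no MongoDB involved).
const users = [
  {
    _id: "123456",
    username: "johnclay",
    age: 25,
    friends: [{ username: "joelsant" }, { username: "adilsonbat" }],
    active: true,
    gender: "male"
  },
  {
    _id: "654321",
    username: "santymonty",
    age: 25,
    active: true,
    gender: "male",
    eyeColor: "brown"
  }
];

// Hypothetical helper: reads optional fields defensively, because
// nothing guarantees that every document in the collection has them.
function describeUser(user) {
  const friendCount = Array.isArray(user.friends) ? user.friends.length : 0;
  const eyeColor = user.eyeColor !== undefined ? user.eyeColor : "unknown";
  return user.username + ": " + friendCount + " friend(s), eye color " + eyeColor;
}

const descriptions = users.map(describeUser);
```

The first document has no eyeColor field and the second has no friends field, yet both can be processed by the same code as long as it does not assume a fixed structure.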

In addition to this, another interesting feature of MongoDB is that not just data is represented by documents. Basically, all user interactions with MongoDB are made through documents. Besides data recording, documents are a means to:

Define what data can be read, written, and/or updated in queries
Define which fields will be updated
Create indexes
Configure replication
Query the information from the database
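
As a rough illustration of the list above, the sketch below builds three such documents as plain JavaScript objects. The field names are taken from the earlier users examples, the variable names are hypothetical, and the shell calls appear only in comments, so nothing here talks to a database:

```javascript
// Each of these is an ordinary document (a JavaScript object in the shell).

// Query document: selects active users aged 25.
const query = { active: true, age: 25 };

// Update document: names the fields to change, here with the $set operator.
const update = { $set: { eyeColor: "brown" } };

// Index specification document: an ascending index on username.
const indexSpec = { username: 1 };

// In the mongo shell these would be passed to collection methods, for example:
//   db.users.find(query)
//   db.users.updateOne(query, update)
//   db.users.createIndex(indexSpec)
```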

Before we go deep into the technical details of documents, let's explore their structure.

JSON

JSON is an open-standard text format for representing data, and it is well suited to data interchange. To explore the JSON format in more depth, you can check ECMA-404, The JSON Data Interchange Standard, where the format is fully described.

JSON is described by two standards: ECMA-404 and RFC 7159 (since superseded by RFC 8259). The first one puts more focus on the JSON grammar and syntax, while the second provides semantic and security considerations.

As the name suggests, JSON arose from the JavaScript language. It came about as a solution for transferring object state between the web server and the browser. Despite its JavaScript origins, generators and parsers for JSON are available in almost all popular programming languages, such as C, Java, and Python.

The JSON format is also considered highly friendly and human-readable. JSON does not depend on the platform chosen, and its specification is based on two data structures:

A set or group of key/value pairs
An ordered list of values

To make this concrete, let's start with objects. An object is an unordered collection of key/value pairs, represented by the following pattern:

{
   "key" : "value"
}

An ordered list of values, or array, is represented as follows:

["value1", "value2", "value3"]

In the JSON specification, a value can be:

A string delimited with " "
A number, with or without a sign, on a decimal base (base 10). This number can have a fractional part, delimited by a period (.), or an exponential part followed by e or E
Boolean values (true or false)
A null value
Another object
An ordered array of values
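
The value kinds listed above can be checked with the standard JSON.parse function. The following is a minimal sketch; the key names and the kinds variable are made up for illustration:

```javascript
// One JSON text exercising every kind of value from the list above.
const text = `{
  "string": "hello",
  "number": -1.5e3,
  "boolean": true,
  "nothing": null,
  "object": { "key": "value" },
  "array": ["value1", "value2", "value3"]
}`;

const parsed = JSON.parse(text);

// Record which kind of value each key was parsed into.
// typeof reports "object" for both null and arrays, so handle them first.
const kinds = {};
for (const [key, value] of Object.entries(parsed)) {
  if (value === null) kinds[key] = "null";
  else if (Array.isArray(value)) kinds[key] = "array";
  else kinds[key] = typeof value;
}
```

Note that the number is written with a sign, a fractional part, and an exponent, which is the full form the specification allows.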

The following diagram shows us the JSON value structure:

[Figure: JSON value structure]

Here is an example of JSON code that describes a person:

{
   "name" : "Han",
   "lastname" : "Solo",
   "position" : "Captain of the Millennium Falcon",
   "species" : "human",
   "gender":"male",
   "height" : 1.8
}

BSON

BSON stands for Binary JSON. In other words, it is a binary-encoded serialization of JSON documents.

If you are seeking more knowledge on BSON, I suggest you take a look at the BSON specification on http://bsonspec.org/.

Compared to other binary formats, BSON has the advantage of being a more flexible model. It is also lightweight, a feature that is very important for data transport on the Web.

The BSON format was designed to be easily navigable and both encoded and decoded in a very efficient way for most of the programming languages that are based on C. This is the reason why BSON was chosen as the data format for MongoDB disk persistence.

The types of data representation in BSON are:

String UTF-8 (string)
Integer 32-bit (int32)
Integer 64-bit (int64)
Floating point (double)
Document (document)
Array (document)
Binary data (binary)
Boolean false (x00 or byte 0000 0000)
Boolean true (x01 or byte 0000 0001)
UTC datetime (int64)—the int64 is UTC milliseconds since the Unix epoch
Timestamp (int64)—this is the special internal type used by MongoDB replication and sharding; the first 4 bytes are an increment, and the last 4 are a timestamp
Null value ()
Regular expression (cstring)
JavaScript code (string)
JavaScript code w/scope (code_w_s)
Min key()—the special type that compares a lower value than all other possible BSON element values
Max key()—the special type that compares a higher value than all other possible BSON element values
ObjectId (byte*12)
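
To make the encoding concrete, here is a minimal sketch of a BSON encoder in plain JavaScript. It is not the implementation used by MongoDB or its drivers: encodeBson and int32le are hypothetical helper names, only string and int32 values are supported, and the code assumes the standard TextEncoder, DataView, and Uint8Array globals. Each element is written as a type byte, the field name as a null-terminated string, and then the value. The document is prefixed with its total size as a little-endian int32 and closed by a 0x00 byte.

```javascript
// Writes a 32-bit integer as 4 little-endian bytes.
function int32le(n) {
  const buf = new ArrayBuffer(4);
  new DataView(buf).setInt32(0, n, true);
  return new Uint8Array(buf);
}

// Illustrative encoder: supports only string (0x02) and int32 (0x10) values.
function encodeBson(doc) {
  const utf8 = new TextEncoder();
  const parts = [];
  for (const [key, value] of Object.entries(doc)) {
    const name = [...utf8.encode(key), 0x00]; // field name as a cstring
    if (typeof value === "string") {
      const bytes = utf8.encode(value);
      // string: int32 length (including the trailing null), bytes, null
      parts.push(0x02, ...name, ...int32le(bytes.length + 1), ...bytes, 0x00);
    } else if (Number.isInteger(value) && value >= -(2 ** 31) && value < 2 ** 31) {
      parts.push(0x10, ...name, ...int32le(value));
    } else {
      throw new TypeError("unsupported value for key " + key);
    }
  }
  // total size = 4 size bytes + elements + 1 terminating null byte
  const total = 4 + parts.length + 1;
  return Uint8Array.from([...int32le(total), ...parts, 0x00]);
}

const encoded = encodeBson({ hello: "world" });
```

Encoding { hello: "world" } with this sketch yields 22 bytes, beginning with the size bytes 0x16 0x00 0x00 0x00.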

Characteristics of documents

Before we go into detail about how we must model documents, we need a better understanding of some of their characteristics. These characteristics can influence your decisions about how a document should be modeled.

The document size

We must keep in mind that the maximum size of a BSON document in MongoDB is 16 MB. This limit helps ensure that a single document cannot use an excessive amount of RAM or, during transmission, an excessive amount of bandwidth. It is a hard limit for a single document rather than a recommendation. To store content larger than 16 MB, MongoDB provides GridFS.

GridFS allows us to store files in MongoDB that are larger than the BSON maximum size by dividing them into parts, or chunks. Each chunk is stored as a separate document, with a default chunk size of 255 KB.
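
The arithmetic behind chunking can be sketched as follows. This is only an illustration in plain JavaScript, not GridFS itself: planChunks is a hypothetical helper, and the constants simply restate the 16 MB and 255 KB figures given above in bytes.

```javascript
// Assumption: the 255 KB chunk size and 16 MB limit from the text, in bytes.
const CHUNK_SIZE_BYTES = 255 * 1024;
const MAX_BSON_DOCUMENT_BYTES = 16 * 1024 * 1024;

// Hypothetical helper: describes how content of a given size would be split.
function planChunks(totalBytes, chunkSize = CHUNK_SIZE_BYTES) {
  const chunkCount = Math.ceil(totalBytes / chunkSize);
  const lastChunkBytes =
    totalBytes === 0 ? 0 : totalBytes - (chunkCount - 1) * chunkSize;
  return {
    // Content larger than the limit cannot be stored in a single document.
    exceedsDocumentLimit: totalBytes > MAX_BSON_DOCUMENT_BYTES,
    chunkCount,
    lastChunkBytes
  };
}

const plan = planChunks(20 * 1024 * 1024); // a 20 MB file
```

For a 20 MB file this gives 81 chunks: 80 full chunks of 261,120 bytes and a final chunk of 81,920 bytes.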

Names and values for a field in a document

There are a few things that you must know about names and values for fields in a document. First of all, any field's name in a document is a string. As usual, we have some restrictions on field names. They are:

The _id field is reserved for a primary key
The name cannot start with the $ character
The name cannot contain the null character or a period (.)

Additionally, documents that have indexed fields must respect the size limit for an indexed field. The indexed values cannot exceed the maximum size of 1,024 bytes (this limit applies to MongoDB versions earlier than 4.2).
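
The naming restrictions listed above can be sketched as a small check in plain JavaScript. The checkFieldName helper is a hypothetical name; it only mirrors the rules stated in this section and is not how MongoDB itself validates names.

```javascript
// Hypothetical helper that mirrors the restrictions listed above.
// Returns a list of problems; an empty list means the name passes these checks.
function checkFieldName(name) {
  if (typeof name !== "string") {
    return ["field names must be strings"];
  }
  const problems = [];
  if (name.startsWith("$")) problems.push("must not start with $");
  if (name.includes("\0")) problems.push("must not contain the null character");
  if (name.includes(".")) problems.push("must not contain a period (.)");
  return problems;
}

const okName = checkFieldName("firstName"); // no problems reported
const badName = checkFieldName("$price.usd"); // two problems reported
```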

The document primary key

As seen in the preceding section, the _id field is reserved for the primary key. This field is always stored as the first one in the document: if it is not the first field in the inserted document, MongoDB moves it to the first position. MongoDB also creates a unique index on this field automatically.

The _id field can hold a value of any BSON type except an array. If a document is inserted without an _id field, MongoDB automatically creates one of the ObjectId type. However, this is not the only option: you can use any value you want to identify your document, as long as it is unique. Another option is to generate an auto-incrementing value, based either on a support collection or on an optimistic loop.
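
An automatically generated ObjectId is a 12-byte value whose first 4 bytes hold its creation time in seconds since the Unix epoch. As a minimal sketch in plain JavaScript (objectIdTimestampSeconds is a hypothetical helper, and the hexadecimal string is only a sample value), the creation time can be recovered from the 24-character hexadecimal form:

```javascript
// Hypothetical helper: reads the leading 4 bytes (8 hex digits) of an
// ObjectId string as a big-endian count of seconds since the Unix epoch.
function objectIdTimestampSeconds(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new TypeError("expected a 24-character hexadecimal string");
  }
  return parseInt(hex.slice(0, 8), 16);
}

const seconds = objectIdTimestampSeconds("507f1f77bcf86cd799439011");
const createdAt = new Date(seconds * 1000); // Date expects milliseconds
```

In the mongo shell, ObjectId values expose the same information through their getTimestamp() method.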

Support collections

In this method, we use a separate collection that keeps the last used value of the sequence. To get the next value, we increment the stored value with the $inc operator and read the result back. Doing both in a single findAndModify operation keeps the step atomic, so two clients cannot obtain the same value.

There is a special collection called system.js that can store JavaScript code for reuse. Be careful not to put application logic in this collection.

Let's see an example for this method:

db.counters.insert(
   {
     _id: "userid",
     seq: 0
   }
)
 
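function getNextSequence(name) {
   // Atomically increment the counter and fetch the resulting document
   var ret = db.counters.findAndModify(
         {
           query: { _id: name },
           update: { $inc: { seq: 1 } },
           new: true   // return the document as it is after the update
         }
   );
   return ret.seq;
}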
 
db.users.insert(
   {
     _id: getNextSequence("userid"),
     name: "Sarah C."
   }
)
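
The role of the new: true option can be illustrated without a database. The sketch below is plain JavaScript with an in-memory Map standing in for the counters collection; incrementCounter is a hypothetical function that imitates only the increment-and-return behavior used above, not the real findAndModify.

```javascript
// In-memory stand-in for the counters collection (no MongoDB involved).
const counters = new Map([["userid", { _id: "userid", seq: 0 }]]);

// Imitates the part of findAndModify used above: increment seq and return
// either the updated document (returnNew = true) or the previous one.
function incrementCounter(name, returnNew) {
  const current = counters.get(name);
  if (!current) return null; // no matching counter document
  const before = { ...current };
  current.seq += 1;
  return returnNew ? { ...current } : before;
}

const first = incrementCounter("userid", true).seq; // value after the increment
const second = incrementCounter("userid", true).seq;
const stale = incrementCounter("userid", false).seq; // value before the increment
```

With new: true, each call sees the value it has just produced. Without it, findAndModify returns the document as it was before the update, so the caller would get the previous value. Note also that when no counter document matches, the result is null, so a function like getNextSequence fails unless the counter has been created first.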

The optimistic loop

The optimistic loop generates the _id field by computing a candidate value on each iteration and then attempting to insert a new document with it:

function insertDocument(doc, targetCollection) {
   while (1) {
       // Find the document with the highest _id currently in the collection
       var cursor = targetCollection.find( {}, { _id: 1 } ).sort( { _id: -1 } ).limit(1);
       // Next candidate: highest _id plus one, or 1 for an empty collection
       var seq = cursor.hasNext() ? cursor.next()._id + 1 : 1;
       doc._id = seq;
       var results = targetCollection.insert(doc);
       if( results.hasWriteError() ) {
           if( results.writeError.code == 11000 /* dup key */ )
               continue; // another writer took this _id first, so try again
           else
               print( "unexpected error inserting data: " + tojson( results ) );
       }
       break;
   }
}

In this function, the iteration does the following:

Searches in targetCollection for the maximum value for _id.
Computes the next value for _id.
Sets the value on the document to be inserted.
Inserts the document.
In the case of errors due to duplicated _id fields, the loop repeats itself, or else the iteration ends.
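
The retry behavior described in these steps can be simulated without a database. The following is a sketch in plain JavaScript under stated assumptions: createFakeCollection is a hypothetical in-memory stand-in for a collection with a unique _id index, insertWithNextId is a hypothetical rewrite of the loop against that stand-in, the onBeforeInsert hook exists only to simulate a competing writer, and the names in the sample documents are made up.

```javascript
// In-memory stand-in for a collection with a unique _id index.
function createFakeCollection() {
  const docs = new Map();
  return {
    maxId() {
      return docs.size === 0 ? 0 : Math.max(...docs.keys());
    },
    // Returns a result object instead of throwing, loosely imitating
    // a duplicate key error (code 11000).
    insert(doc) {
      if (docs.has(doc._id)) return { ok: false, code: 11000 };
      docs.set(doc._id, { ...doc });
      return { ok: true };
    },
    ids() {
      return [...docs.keys()];
    }
  };
}

// Optimistic loop: read the highest _id, try the next one, retry on a duplicate.
function insertWithNextId(doc, collection, onBeforeInsert = () => {}) {
  let attempts = 0;
  while (true) {
    attempts += 1;
    const candidate = { ...doc, _id: collection.maxId() + 1 };
    onBeforeInsert(attempts); // hook used only to simulate a competing writer
    const result = collection.insert(candidate);
    if (result.ok) return { _id: candidate._id, attempts };
    if (result.code !== 11000) throw new Error("unexpected insert error");
    // Duplicate key: another writer took this _id first, so loop and try again.
  }
}

const people = createFakeCollection();
const a = insertWithNextId({ name: "Sarah C." }, people);
// Simulate another writer taking _id 2 between our read and our insert.
const b = insertWithNextId({ name: "Bob D." }, people, (attempt) => {
  if (attempt === 1) people.insert({ _id: 2, name: "competing writer" });
});
```

The second insert loses the race for _id 2 on its first attempt and succeeds with _id 3 on the second. Under heavy concurrent writes, these retries multiply.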

The points demonstrated here are the basics for understanding the possibilities and approaches that MongoDB offers. However, although auto-incrementing fields can be implemented in MongoDB, we should avoid them, because this approach does not scale well to very large volumes of data.