Data Modeling

When constructing a data model for a MongoDB collection, there are various options you can choose from, each of which has its strengths and weaknesses. The following sections guide you through key design decisions and detail various considerations for choosing the best data model for your application's needs.

  • Data Model Design – presents the different strategies you can choose from when determining your data model, along with their strengths and weaknesses.
  • Operational Factors and Data Models – details features to keep in mind when designing your data model, such as lifecycle management, indexing, horizontal scalability, and document growth.
  • GridFS – GridFS is a specification for storing documents that exceed the BSON document size limit of 16MB.

Each of these approaches is discussed below.

Data Model Design – Effective data models support your application's needs. The key consideration for the structure of documents is the decision to embed related data or to use references.

Embedded Data Models – With MongoDB, you may embed related data in a single structure or document. These schemas are generally known as "de-normalized" models, and take advantage of MongoDB's rich documents. Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations. In general, use embedded data models when:

  • you have "contains" relationships between entities.
  • you have one-to-many relationships between entities. In these relationships the "many" or child documents always appear with, or are viewed in the context of, the "one" or parent documents.

In general, embedding provides better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedded data models make it possible to update related data in a single atomic write operation.

However, embedding related data in documents may lead to situations where documents grow after creation. Document growth can impact write performance and lead to data fragmentation. Furthermore, documents in MongoDB must be smaller than the maximum BSON document size. For bulk binary data, consider GridFS. To interact with embedded documents, use dot notation to “reach into” embedded documents.
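As a minimal sketch of the embedded approach in the mongo shell (the patrons collection and its fields are hypothetical, and a running mongod is assumed), related address data lives inside the parent document, and dot notation reaches into it:

// embed the related address inside the patron document ("de-normalized")
db.patrons.insert({
   _id: "joe",
   name: "Joe Bookreader",
   address: { street: "123 Fake Street", city: "Faketon", zip: "12345" }
})

// dot notation "reaches into" the embedded document
db.patrons.find({ "address.city": "Faketon" })

A single write to the patron document can modify both the name and the embedded address, illustrating the atomic-update benefit described above.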

Normalized Data Models – Normalized data models describe relationships using references between documents. In general, use normalized data models when:

  • embedding would result in duplication of data but would not provide sufficient read performance advantages to outweigh the implications of the duplication.
  • you need to represent more complex many-to-many relationships.
  • you need to model large hierarchical data sets.
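A minimal sketch of the normalized approach in the mongo shell (collection and field names are hypothetical): the book stores only a reference to its publisher, so resolving the relationship takes a second query:

// the publisher is stored once, in its own collection
db.publishers.insert({ _id: "oreilly", name: "O'Reilly Media" })

// the book references the publisher by _id instead of embedding it
db.books.insert({ _id: 123456789, title: "MongoDB: The Definitive Guide", publisher_id: "oreilly" })

// resolving the reference requires a separate read operation
var book = db.books.findOne({ _id: 123456789 })
db.publishers.findOne({ _id: book.publisher_id })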

Operational Factors and Data Models – Modeling application data for MongoDB depends on both the data itself and the characteristics of MongoDB. For example, different data models may allow applications to use more efficient queries, increase the throughput of insert and update operations, or distribute activity to a sharded cluster more effectively.

These factors are operational or address requirements that arise outside of the application but impact the performance of MongoDB based applications. When developing a data model, analyze all of your application’s read operations and write operations in conjunction with the following considerations.

  • Document Growth – Some updates to documents can increase the size of a document. These updates include pushing elements to an array (i.e. $push) and adding new fields to a document. If the document size exceeds the allocated space for that document, MongoDB will relocate the document on disk. Relocating documents takes longer than in-place updates and can lead to fragmented storage. Although MongoDB automatically adds padding to document allocations to minimize the likelihood of relocation, data models should avoid document growth when possible. For instance, if your application requires updates that will cause document growth, you may want to re-factor your data model to use references between data in distinct documents rather than a de-normalized data model. MongoDB adaptively adjusts the amount of automatic padding to reduce occurrences of relocation.
  • Atomicity – In MongoDB, operations are atomic at the document level. No single write operation can change more than one document. Operations that modify more than a single document in a collection still operate on one document at a time. Ensure that your application stores all fields with atomic dependency requirements in the same document. If the application can tolerate non-atomic updates for two pieces of data, you can store these data in separate documents. A data model that embeds related data in a single document facilitates these kinds of atomic operations. For data models that store references between related pieces of data, the application must issue separate read and write operations to retrieve and modify these related pieces of data.
  • Sharding – MongoDB uses sharding to provide horizontal scaling. These clusters support deployments with large data sets and high-throughput operations. Sharding allows users to partition a collection within a database to distribute the collection’s documents across a number of mongod instances or shards. To distribute data and application traffic in a sharded collection, MongoDB uses the shard key. Selecting the proper shard key has significant implications for performance, and can enable or prevent query isolation and increased write capacity. It is important to consider carefully the field or fields to use as the shard key.
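Choosing a shard key might look like the following in the mongo shell (the records database, people collection, and zipcode key are hypothetical, and the deployment must already be a sharded cluster):

// enable sharding for the database, then shard the collection on zipcode
sh.enableSharding("records")
sh.shardCollection("records.people", { zipcode: 1 })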

Other operational factors include indexes, large numbers of collections, and data lifecycle management, which are discussed below.

Indexes – Use indexes to improve performance for common queries. Build indexes on fields that appear often in queries and for all operations that return sorted results. MongoDB automatically creates a unique index on the _id field. As you create indexes, consider the following behaviors of indexes:

  • Each index requires at least 8KB of data space.
  • Adding an index has some negative performance impact for write operations. For collections with a high write-to-read ratio, indexes are expensive since each insert must also update any indexes.
  • Collections with a high read-to-write ratio often benefit from additional indexes. Indexes do not affect un-indexed read operations.
  • When active, each index consumes disk space and memory. This usage can be significant and should be tracked for capacity planning, especially for concerns over working set size.
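A typical index for a common query might be built as follows (the orders collection and its fields are hypothetical; ensureIndex is the shell helper used elsewhere in this tutorial):

// index fields that appear often in query predicates and sorted results
db.orders.ensureIndex({ customer_id: 1, ts: -1 })

// this query's filter and sort can both be satisfied by the index above
db.orders.find({ customer_id: 42 }).sort({ ts: -1 })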

Large Number of Collections – In certain situations, you might choose to store related information in several collections rather than in a single collection. Consider a sample collection logs that stores log documents for various environments and applications. The logs collection contains documents of the following form:

{ log: "dev", ts: …, info: … }

{ log: "debug", ts: …, info: … }

If the total number of documents is low, you may group documents into collections by type. For logs, consider maintaining distinct log collections, such as logs.dev and logs.debug. The logs.dev collection would contain only the documents related to the dev environment.
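In the mongo shell this split might look as follows (the document fields are hypothetical); each environment writes to, and reads from, its own collection:

// write to per-environment collections instead of one large logs collection
db.logs.dev.insert({ ts: new Date(), info: "dev message" })
db.logs.debug.insert({ ts: new Date(), info: "debug message" })

// reads for the dev environment touch only the logs.dev collection
db.logs.dev.find()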

Generally, having a large number of collections has no significant performance penalty and can result in very good performance; distinct collections are very important for high-throughput batch processing. When using models that have a large number of collections, consider the following behaviors:

  • Each collection has a certain minimum overhead of a few kilobytes.
  • Each index, including the index on _id, requires at least 8KB of data space.
  • For each database, a single namespace file (i.e. <database>.ns) stores all meta-data for that database, and each index and collection has its own entry in the namespace file. MongoDB places limits on the size of namespace files.
  • MongoDB has limits on the number of namespaces. You may wish to know the current number of namespaces in order to determine how many additional namespaces the database can support. To get the current number of namespaces, run the following in the mongo shell:

db.system.namespaces.count()

Data Lifecycle Management – Data modeling decisions should take data lifecycle management into consideration. The Time to Live or TTL feature of collections expires documents after a period of time. Consider using the TTL feature if your application requires some data to persist in the database for a limited period of time. Additionally, if your application only uses recently inserted documents, consider Capped Collections. Capped collections provide first-in-first-out (FIFO) management of inserted documents and efficiently support operations that insert and read documents based on insertion order.
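Both features can be sketched in the mongo shell (the collection names, the createdAt field, and the sizes are hypothetical):

// TTL index: documents expire 3600 seconds after their createdAt time
db.log_events.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })

// capped collection: a fixed 100 MB of space with FIFO eviction,
// preserving insertion order for reads
db.createCollection("recent_events", { capped: true, size: 104857600 })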

GridFS – GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16MB. Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document. By default GridFS limits chunk size to 256k. GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata.

When you query a GridFS store for a file, the driver or client will reassemble the chunks as needed. You can perform range queries on files stored through GridFS. You can also access information from arbitrary sections of files, which allows you to "skip" into the middle of a video or audio file.

GridFS is useful not only for storing files that exceed 16MB but also for storing any files that you want to access without having to load the entire file into memory.

To store and retrieve files using GridFS, use either of the following:

  • A MongoDB driver.
  • The mongofiles command-line tool.
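As a sketch of the command-line route (the files database and the file path are hypothetical; mongofiles must be able to reach a running mongod):

# store a local file into the default fs bucket of the files database
mongofiles -d files put /tmp/largething.mpg

# list stored files, then retrieve one back to the local file system
mongofiles -d files list
mongofiles -d files get /tmp/largething.mpg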

GridFS Collections – GridFS stores files in two collections:

  • chunks – stores the binary chunks.
  • files – stores the file’s metadata.

GridFS places the collections in a common bucket by prefixing each with the bucket name. By default, GridFS uses a bucket named fs, giving two collections:

  • fs.files
  • fs.chunks

You can choose a different bucket name than fs, and create multiple buckets in a single database. Each document in the chunks collection represents a distinct chunk of a file as represented in the GridFS store. Each chunk is identified by its unique ObjectID stored in its _id field.

GridFS Index – GridFS uses a unique, compound index on the chunks collection for the files_id and n fields. The files_id field contains the _id of the chunk's "parent" document. The n field contains the sequence number of the chunk. GridFS numbers all chunks, starting with 0. The GridFS index allows efficient retrieval of chunks using the files_id and n values, as shown in the following example:

cursor = db.fs.chunks.find({files_id: myFileID}).sort({n:1});

If your driver does not create this index, issue the following operation using the mongo shell:

db.fs.chunks.ensureIndex( { files_id: 1, n: 1 }, { unique: true } );

The following is an example of the GridFS interface in Java, for demonstration purposes only. By default, the interface supports the default GridFS bucket, named fs, as in the following:

// returns default GridFS bucket (i.e. "fs" collection)
GridFS myFS = new GridFS(myDatabase);

// saves the file to "fs" GridFS bucket
myFS.createFile(new File("/tmp/largething.mpg"));

Optionally, interfaces may support additional GridFS buckets as in the following example:

// returns GridFS bucket named "contracts"
GridFS myContracts = new GridFS(myDatabase, "contracts");

// retrieves GridFS object "smithco"
GridFSDBFile file = myContracts.findOne("smithco");

// saves the GridFS file to the file system
file.writeTo(new File("/tmp/smithco.pdf"));
