
Data Modeling

When constructing a data model for a MongoDB collection, there are various options to choose from, each of which has its strengths and weaknesses. The following sections guide you through the key design decisions and detail the considerations for choosing the best data model for your application's needs.

Each of these approaches is discussed below.

Data Model Design – Effective data models support application needs. The key consideration for the structure of documents is the decision to embed or to use references.

Embedded Data Models – With MongoDB, you may embed related data in a single structure or document. These schemas are generally known as "de-normalized" models, and take advantage of MongoDB's rich documents. Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations. In general, use embedded data models when you have "contains" relationships between entities, or one-to-many relationships where the child documents always appear with, and are viewed in the context of, the parent document.

In general, embedding provides better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedded data models make it possible to update related data in a single atomic write operation.

However, embedding related data in documents may lead to situations where documents grow after creation. Document growth can impact write performance and lead to data fragmentation. Furthermore, documents in MongoDB must be smaller than the maximum BSON document size. For bulk binary data, consider GridFS. To interact with embedded documents, use dot notation to “reach into” embedded documents.
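As a minimal sketch in the mongo shell, assuming a hypothetical patrons collection, the following embeds an address subdocument, queries it with dot notation, and updates both levels in one atomic write:

// a "de-normalized" document with an embedded address subdocument
db.patrons.insert({
   _id: "joe",
   name: "Joe Bookreader",
   address: { street: "123 Fake Street", city: "Faketon", zip: "12345" }
})

// dot notation "reaches into" the embedded document
db.patrons.find({ "address.city": "Faketon" })

// one atomic write updates a top-level field and an embedded field together
db.patrons.update(
   { _id: "joe" },
   { $set: { name: "Joseph Bookreader", "address.zip": "54321" } }
)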

Normalized Data Models – Normalized data models describe relationships using references between documents. In general, use normalized data models when embedding would result in duplication of data without providing sufficient read performance advantages to outweigh the duplication, when you need to represent more complex many-to-many relationships, or when modeling large hierarchical data sets.
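For contrast, a sketch of a normalized model using hypothetical publishers and books collections, where each book stores a reference rather than an embedded copy of the publisher:

// the referenced document lives in its own collection
db.publishers.insert({
   _id: "oreilly",
   name: "O'Reilly Media",
   founded: 1980
})

// each book stores only the publisher's _id
db.books.insert({
   _id: 123456789,
   title: "MongoDB: The Definitive Guide",
   publisher_id: "oreilly"
})

// resolving the reference takes a second query
var book = db.books.findOne({ _id: 123456789 });
db.publishers.findOne({ _id: book.publisher_id })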

Operational Factors and Data Models – Modeling application data for MongoDB depends on both the data itself and the characteristics of MongoDB. For example, different data models may allow applications to use more efficient queries, increase the throughput of insert and update operations, or distribute activity to a sharded cluster more effectively.

These factors are operational or address requirements that arise outside of the application but impact the performance of MongoDB-based applications. When developing a data model, analyze all of your application's read and write operations in conjunction with the following considerations.

Other operational factors include indexes, a large number of collections, and data lifecycle management, which are discussed below.

Indexes – Use indexes to improve performance for common queries. Build indexes on fields that appear often in queries and for all operations that return sorted results. MongoDB automatically creates a unique index on the _id field. As you create indexes, consider the following behaviors of indexes: each index requires at least 8 kB of data space; adding an index has some negative performance impact on write operations, since each insert must also update every index; collections with a high read-to-write ratio often benefit from additional indexes; and, when active, each index consumes disk space and memory.
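For example, a sketch in the mongo shell, assuming a hypothetical orders collection that is frequently matched on customer_id and sorted by order_date:

// a compound index supports matching on customer_id and sorting by order_date
db.orders.ensureIndex({ customer_id: 1, order_date: -1 })

// this query can use the index for both the match and the sort
db.orders.find({ customer_id: 42 }).sort({ order_date: -1 })

// every insert now also updates the index, the write-side cost noted above
db.orders.insert({ customer_id: 42, order_date: new Date(), total: 99.95 })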

Large Number of Collections – In certain situations, you might choose to store related information in several collections rather than in a single collection. Consider a sample collection logs that stores log documents for various environments and applications. The logs collection contains documents of the following form:

{ log: "dev", ts: …, info: … }

{ log: "debug", ts: …, info: … }

If the total number of documents is low, you may group documents into collections by type. For logs, consider maintaining distinct log collections, such as logs.dev and logs.debug. The logs.dev collection would contain only the documents related to the dev environment.
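For example (a sketch, continuing the logs collections above):

// write each environment's log entries to its own collection
db.logs.dev.insert({ ts: new Date(), info: "dev build started" })
db.logs.debug.insert({ ts: new Date(), info: "cache miss on startup" })

// queries against logs.dev never touch the debug documents
db.logs.dev.find({ info: /build/ })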

Generally, having a large number of collections has no significant performance penalty and results in very good performance. Distinct collections are very important for high-throughput batch processing. When using models that have a large number of collections, consider the following behaviors: each collection has a certain minimum overhead of a few kilobytes; each index, including the index on _id, requires at least 8 kB of data space; and each database has a single namespace file that stores all of its metadata, with each collection and index having its own entry in it. To check the current number of namespaces, run the following in the mongo shell:

db.system.namespaces.count()

Data Lifecycle Management – Data modeling decisions should take data lifecycle management into consideration. The Time to Live (TTL) feature of collections expires documents after a period of time. Consider using the TTL feature if your application requires some data to persist in the database for a limited period of time. Additionally, if your application only uses recently inserted documents, consider capped collections. Capped collections provide first-in-first-out (FIFO) management of inserted documents and efficiently support operations that insert and read documents based on insertion order.
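Both features are enabled from the mongo shell; a sketch, assuming hypothetical collection names:

// TTL index: documents in log_events expire 3600 seconds after their createdAt time
db.log_events.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })

// capped collection: a fixed 100 MB of space with FIFO eviction of the oldest documents
db.createCollection("recent_events", { capped: true, size: 104857600 })

// capped collections return documents in insertion order without an index
db.recent_events.find().sort({ $natural: 1 })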

GridFS – GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16 MB. Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document. By default, GridFS limits chunk size to 256 kB. GridFS uses two collections to store files: one collection stores the file chunks, and the other stores file metadata.

When you query a GridFS store for a file, the driver or client will reassemble the chunks as needed. You can perform range queries on files stored through GridFS, and you can access information from arbitrary sections of files, which allows you to "skip" into the middle of a video or audio file.
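At the collection level, skipping ahead amounts to a range query on the chunk sequence number; a sketch, where myFileID, byteOffset, and chunkSize are assumed to be known:

// the starting chunk for a given byte offset follows from the chunk size
var startChunk = Math.floor(byteOffset / chunkSize);

// fetch only the chunks from that point onward, in order
db.fs.chunks.find({ files_id: myFileID, n: { $gte: startChunk } }).sort({ n: 1 })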

GridFS is useful not only for storing files that exceed 16 MB but also for storing any files that you want to access without loading the entire file into memory. For more information on when to use GridFS, see the MongoDB documentation.

To store and retrieve files using GridFS, use either of the following: a MongoDB driver that implements the GridFS specification, or the mongofiles command-line tool.
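For instance, a sketch of the mongofiles tool from the system shell, assuming a database named records and a local file largething.mpg:

# store a local file into GridFS
mongofiles -d records put largething.mpg

# list the files stored in the records database
mongofiles -d records list

# copy the file back out to the local file system
mongofiles -d records get largething.mpg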

GridFS Collections – GridFS stores files in two collections: chunks, which stores the binary chunks, and files, which stores the file's metadata.

GridFS places the collections in a common bucket by prefixing each with the bucket name. By default, GridFS uses two collections with names prefixed by the fs bucket: fs.files and fs.chunks.

You can choose a different bucket name than fs, and create multiple buckets in a single database. Each document in the chunks collection represents a distinct chunk of a file as represented in the GridFS store. Each chunk is identified by its unique ObjectId stored in its _id field.
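To make the layout concrete, the documents in the two collections look roughly like the following (values are illustrative, with the chunk's binary data field omitted):

// a document in fs.files holds the file metadata
{
   _id: ObjectId("…"),
   length: 1048576,
   chunkSize: 262144,
   uploadDate: ISODate("…"),
   md5: "…",
   filename: "largething.mpg"
}

// each document in fs.chunks holds one chunk, linked to its file by files_id
{
   _id: ObjectId("…"),
   files_id: ObjectId("…"),
   n: 0
}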

GridFS Index – GridFS uses a unique, compound index on the chunks collection for the files_id and n fields. The files_id field contains the _id of the chunk’s “parent” document. The n field contains the sequence number of the chunk. GridFS numbers all chunks, starting with 0. The GridFS index allows efficient retrieval of chunks using the files_id and n values, as shown in the following example

cursor = db.fs.chunks.find({files_id: myFileID}).sort({n:1});

If your driver does not create this index, issue the following operation using the mongo shell:

db.fs.chunks.ensureIndex( { files_id: 1, n: 1 }, { unique: true } );

The following is an example of the GridFS interface in Java. The example is for demonstration purposes only. By default, the interface must support the default GridFS bucket, named fs, as in the following:

// returns default GridFS bucket (i.e. "fs" collection)
GridFS myFS = new GridFS(myDatabase);

// saves the file to "fs" GridFS bucket
myFS.createFile(new File("/tmp/largething.mpg"));

Optionally, interfaces may support additional GridFS buckets, as in the following example:

// returns GridFS bucket named "contracts"
GridFS myContracts = new GridFS(myDatabase, "contracts");

// retrieve GridFS object "smithco"
GridFSDBFile file = myContracts.findOne("smithco");

// saves the GridFS file to the file system
file.writeTo(new File("/tmp/smithco.pdf"));
