MongoDB Indexing Strategies Explained

The document outlines the MongoDB Certification Program, specifically focusing on indexing strategies, query plans, and data retrieval methods. It discusses the Plan Cache, various index types (like TTL, unique, and wildcard indexes), and the importance of optimizing queries and data models. Additionally, it covers the Database Profiler and aggregation pipelines for advanced data processing.


MongoDB Certification Program (MCP)

MCP300
MongoDB for
Developers
Nishit Rao
Solutions Architect
MongoDB India
Follow along!

Indexing
Advanced
Choosing Query Plans
MongoDB checks the Plan Cache for an optimal Query Plan used previously.

If an entry is not found (Missing):

● Candidate Plans are generated & evaluated based on the Query Shape
● A winner is chosen & added to the cache as an Inactive entry

If an entry is found:

● It may be Active or Inactive
● Inactive follows the same process as Missing, and must pass a performance test to become Active
● Active does not require regenerating candidate plans (best path)
7
Plan Cache
Least Recently Used (LRU) Plan Cache entries are automatically evicted when:

● The server is restarted
● An index is added or removed in that collection
● The collection is dropped
● The following command is issued manually:
  db.<collectionName>.getPlanCache().clear()

To force the use of an index:

db.<collectionName>.find().hint( { age: 1 } ) OR
db.<collectionName>.find().hint( "age_1" )

8
How MongoDB retrieves data
Covered Queries

Select: MongoDB selects the best index.
Find: Query terms are found in the index.
Walk: The index is walked for matching terms.
Fetch: Identified documents are fetched from the collection.
Filter: Additional filtering is applied.
Project: Document fields are removed or added.

If no appropriate index is found, the whole collection is scanned, i.e. COLLSCAN.
9
Find the key in the index

Query: name = "John"
db.collection.find(query)

Key (name)    Value (identity)
"Andrew"      4
"Daniel"      5
"John"        1
"Norm"        2
"Rick"        3

Relative cost: 1 unit per lookup
Cost if not in cache: +20 units
10
Walk the index keys

Query: score = 25
db.collection.find(query)

Key (score)   Value (identity)
5             1
20            4
25            2
25            5
35            3

Relative cost: 0.05 units per step
Cost if not in cache: +20 units per block!
11
How to reduce the cost?
We can utilize various indexing strategies:

● Periodically monitor for unused & redundant indexes
  ○ Atlas users can use Performance Advisor!
● Periodically analyze slow queries and identify indexes to create
  ○ Atlas users can use Performance Advisor & Query Profiler!
● Always remember the ESR rule (Equality-Sort-Range)
● Try to use Covered Queries wherever possible, but don't create too many indexes
● Optimize your Data Model to the Queries, not the other way around.

12
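As a sketch of the ESR rule (the orders collection and its field names here are hypothetical, for illustration only): with an equality predicate on status, a sort on order_date, and a range predicate on total, the compound index lists the fields in exactly that order.

```javascript
// Query: Equality on status, Sort on order_date, Range on total
db.orders.find({ status: "shipped", total: { $gt: 100 } })
         .sort({ order_date: -1 })

// ESR ordering: Equality field first, then the Sort field, then the Range field
db.orders.createIndex({ status: 1, order_date: -1, total: 1 })
```

Run against a live cluster; putting the range field before the sort field would force an in-memory sort.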
How to reduce the cost?
… and MongoDB features:

● Wildcard Index ● TTL Index ● Clustered Index

● Partial Index ● Unique Index

● Sparse Index ● (Text) Search Index

13
Index Statistics
To get index usage statistics for a collection:
db.collection.aggregate( [ { $indexStats : { } } ] )

14
Index Sizes
To get index sizes for a collection:
db.collection.stats().indexSizes

15
Database Profiler
Database profiler records information about slow operations:

● Definition of slow is configurable. Default slowness threshold = 100 ms

● Recorded in a collection called system.profile, size capped at 10 MB

● Can be enabled globally or for any database

● Creates extra writes


○ Enabling profiler for every operation is bad - creates extra writes
○ Always turn off Profiling unless required - Affects performance

16
db.setProfilingLevel( level, slowms )

level:
0 - Profiler is off
1 - Store only slow queries
2 - Store all queries

slowms: threshold in milliseconds for slow operations
default = 100 ms

Syntax:
db.setProfilingLevel(<level>, <slowms>) OR
db.setProfilingLevel(<level>, <options>)

17
Query Profile Data
To see the 5 most recent profiled events:
show profile

Query the system.profile collection directly:

db.system.profile.find( { op: 'query',
  ns: 'sample_mflix.movies' } ).sort({ ts: -1 })

To get the current profiling setting:

db.getProfilingStatus()

18
Index
Special Features
Partial indexes

db.collection.createIndex(
  { keywords: 1 },
  { partialFilterExpression: { score: { $gte: 4 } } }
)

Sample documents:
{ score: 5.0, keywords: [ 'friendly', 'clean', 'bright' ] }       // identity 1
{ score: 3.0, keywords: [ 'cold', 'near beach', 'dark' ] }        // identity 2
{ score: 4.0, keywords: [ 'friendly', 'near beach', 'cheap' ] }   // identity 3

Resulting index (SORTED), containing only documents with score >= 4:

Key (keywords)   Value (identity)
bright           1
cheap            3
clean            1
friendly         1
friendly         3
near beach       3
20
Sparse indexes

db.collection.createIndex(
  { avatar_img: 1 },
  { sparse: true }
)

Sample documents:
{ _id: 12, username: 'testUser', avatar_img: '[Link]' }      // identity 1
{ _id: 34, username: 'anotherUser', avatar_img: '[Link]' }   // identity 2
{ _id: 56, username: 'test', avatar_img: null }              // identity 3
{ _id: 78, username: 'coolUser' }                            // identity 4

Resulting index (SORTED):

Key (avatar_img)   Value (identity)
[Link]             1
[Link]             2
null               3

(For a non-sparse index, an additional null entry for identity 4 would also exist.)
21
Partial vs Sparse Indexes
● Both index types help in reducing storage requirement and reducing
memory footprint.
● However, partial indexes are recommended as they are a superset over
sparse indexes, and can handle complex criteria for indexing.
● Sparse indexes are useful when you are only interested in whether a field is
present or not, regardless of its value.
● Some index types are sparse by default:
○ 2d & 2dsphere
○ text
○ wildcard
22
Unique Indexes
● Indexes can enforce a unique constraint
● Syntax:
db.collection.createIndex({ account_id: 1 }, { unique: true })
● Uniqueness is applied on the combination of the fields.
● A unique index cannot be created on a collection that already violates it.
● Unique + Sparse Index:
○ If index is only unique, and the field is not included in a document, that
document is indexed with null value for that field. So, only one such record
can exist.
○ Unique + Sparse index solves this – multiple documents can omit the field,
but those that do must have unique values in them.
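A minimal sketch of that combination (the users collection and ssn field are hypothetical):

```javascript
// Unique alone: a missing ssn is indexed as null, so only one document
// may omit the field. Unique + sparse: documents without ssn are skipped
// by the index, so any number may omit it, while documents that do carry
// ssn must each have a distinct value.
db.users.createIndex({ ssn: 1 }, { unique: true, sparse: true })
```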

23
TTL Indexes
● Special single-field indexes that help to automatically remove documents
from a collection after a certain amount of time, or at a specific clock time.
● Expiration threshold = indexed field value + specified no. of seconds
● Syntax:
db.collection.createIndex({ ts: 1 }, { expireAfterSeconds: 60 })
● A background task removes expired documents every 60 seconds, but it
may take more than 60 seconds depending on the workload on your cluster.
● Ways to use TTL index:
○ Put the current timestamp and specify age at which to delete
○ Put an expiration timestamp and specify expireAfterSeconds = 0

24
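The two usage patterns can be sketched in plain JavaScript (a simplified model of the TTL monitor's expiry check, not MongoDB's actual implementation):

```javascript
// Simplified model: a document expires once
// indexed field value + expireAfterSeconds <= current time.
function isExpired(doc, field, expireAfterSeconds, now) {
  const expiry = doc[field].getTime() + expireAfterSeconds * 1000;
  return expiry <= now.getTime();
}

// Pattern 1: store the current timestamp, expire after a fixed age (60 s)
const logDoc = { ts: new Date("2024-01-01T00:00:00Z") };
isExpired(logDoc, "ts", 60, new Date("2024-01-01T00:02:00Z")); // → true

// Pattern 2: store an explicit expiration timestamp, expireAfterSeconds = 0
const sessionDoc = { expiresAt: new Date("2024-01-01T01:00:00Z") };
isExpired(sessionDoc, "expiresAt", 0, new Date("2024-01-01T00:30:00Z")); // → false
```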
TTL Indexes
● TTL is the only index type which you can modify after creation.
To change the expireAfterSeconds, or to convert a non-TTL index to TTL
index:
db.runCommand({
  collMod: 'sensordata',
  index: {
    keyPattern: { ts: 1 },
    expireAfterSeconds: 3600
  }
})
25
TTL Index Restrictions
● A TTL index must have exactly 1 time field, else documents will not be expired.

● TTL index is not supported on the _id field.

● It supports queries just like any other index. So, if there is already a non-TTL

single-field index on a time field, you cannot create a new TTL index on that

same field.

● Documents deleted by TTL process cannot be accessed again. If you want

to keep documents accessible after archival, consider Atlas Online Archive.

26
Wildcard Indexes
● Dynamic schemas make it hard to index all fields, e.g. IoT, metadata
● Wildcard indexes index all fields, or a subtree (sub-documents & arrays)
● If created on an embedded document:
○ It descends into the embedded document and indexes all its fields.
○ It descends all levels till it finds a primitive/scalar value.
● If created on an embedded array, it indexes all its elements
○ If element is an embedded document, it descends into that document.
○ If element is an array, it indexes that element array as a value.
● Not advisable as the index can become very big and inefficient.
Consider using the Attribute Pattern instead!
27
Wildcard Indexes
Syntax:
db.collection.createIndex( { "$**" : 1 } ) [Note: you must use ""]
db.collection.createIndex( { "tomatoes.$**" : 1 } )
db.collection.createIndex( { year : 1, "tomatoes.$**" : 1 } )
db.collection.createIndex( { "$**" : 1 },
  { wildcardProjection : { _id : 1, year : 0 } } ) [Note: _id excluded by default]

Restrictions for compound wildcard index:


● Cannot use unique or TTL options
● Non-wildcard fields cannot be arrays
● Only 1 wildcard term allowed per compound wildcard index
● Non-wildcard fields cannot be at the same level as wildcard. Else, use
wildcardProjection to exclude those fields from the wildcard.
28
Clustered index vs. Regular index
[Diagram: a clustered index points to a block of data; a regular index points to an individual record.]

Clustered index: similar to a dictionary. The data and index are sorted and stored together. You can have only one clustered index.
Regular index: similar to the index of a book. The data and index are stored separately. You can have multiple non-clustered indexes.

29
Clustered index
Great at… quickly accessing ranges of pre-sorted data
Poor at… random data access patterns typical in operational apps

Regular index
Great at… speeding up many different types of operational queries that have random access patterns to the data
Poor at… providing really fast access to ranges of data that is always written and read by time
30
Clustered Indexes
● Collections created with a clustered index are called clustered collections.
○ Clustered index can only be created at the time of creating the collection.
○ A clustered collection can have only 1 clustered index, that too on _id field.
● Faster queries on the clustered index field, such as queries with range scans and
equality comparisons on the clustered index key.
● Clustered collections have a lower storage size, which improves performance for
queries and bulk inserts.
● Clustered collections can eliminate the need for a secondary TTL index.
○ A clustered index is also a TTL index if you specify the expireAfterSeconds field.
○ If you use a clustered index as a TTL index, it improves document delete
performance and reduces the clustered collection storage size.
● The primary use-case of clustered indexes is Time-Series Collections.
31
Clustered Indexes
Syntax:

db.createCollection(
"stocks",
{
clusteredIndex: {
"key": { _id: 1 },
"unique": true,
"name": "stocks clustered key"
}
}
)

32
Quiz Time!
#1. What things might prevent an index
being used to 'cover' a query?

A. Projecting the _id field
B. Having a multikey index
C. Sorting query results by one of the index fields
D. Retrieving an array field
E. Having more than one index on the same field

Answer in the next slide.


34
#2. Select three actions TTL Indexes
can perform:

A. Delete data at a specific time
B. Delete data at a preset time after the field value
C. Place unexpected write load on a server
D. Automatically move data to an archive server
E. Automatically move data to another collection

Answer in the next slide.


36
#3. Select TRUE statements about
Database Profiler

A. Adds write overhead
B. Default slowness threshold is 500ms
C. Logs only slow read commands
D. Slow queries are logged in system.profile
E. level = 0 disables profiling

Answer in the next slide.


38
#4. Which of these are valid options
when creating an index?

A expireAfterSeconds B name C sparse

D unique E partialFilterExpression

Answer in the next slide.


40
Follow along!

MongoDB
Aggregations
The Power of MQL
For the sample_mflix.movies collection, write MQL queries for the following SQL queries:

SELECT * FROM movies WHERE year > 1990;

db.movies.find({ year: { $gt: 1990 } })

SELECT title, year, countries FROM movies WHERE year > 1990;

db.movies.find({ year: { $gt: 1990 } },
  { _id: 0, title: 1, year: 1, countries: 1 })

SELECT countries, count(*) AS movie_count FROM movies WHERE year > 1990
GROUP BY countries HAVING count(*) > 100 ORDER BY movie_count DESC
LIMIT 10;
44
Programmatic Approach to the Solution

1. Find movies where year > 1990
2. Break down countries array
3. Group by country
4. Count the size of each bucket
5. Find records where count > 100
6. Sort by count in descending order
7. Limit to 10 results
8. Rename fields
45
MongoDB Approach to the Solution

db.movies.aggregate([
  { $match: { year: { $gt: 1990 } } },
  { $unwind: "$countries" },
  { $group: {
      _id: "$countries",
      movies_count: { $count: {} }
  } },
  { $match: { movies_count: { $gt: 100 } } },
  { $sort: { movies_count: -1 } },
  { $limit: 10 },
  { $project: { _id: 0, countries: "$_id", movies_count: 1 } }
])

46
MongoDB Approach to the Solution

Pipeline
Stages

47
MongoDB Aggregation Pipelines

● Advanced data processing pipeline for transformations and analytics
● Multiple stages
● Similar to a unix pipe
  • Construct modular, composable processing pipelines
● Rich Expressions

Example Aggregation Command on the Orders Collection:

db.orders.aggregate( [
  {$match: { status: "A" } },                                  // $match stage
  {$group: { _id: "$cust_id", total: { $sum: "$amount" } } }   // $group stage
] )

48
SQL Joins & Aggregation Syntax

SQL queries have a nested structure. Understanding the outer layers requires understanding the inner ones, so SQL has to be read "inside-out":

SELECT
  city,
  SUM(annual_spend) Total_Spend,
  AVG(annual_spend) Average_Spend,
  MAX(annual_spend) Max_Spend,
  COUNT(annual_spend) customers
FROM (
  SELECT t1.city, customer.annual_spend
  FROM customer
  LEFT JOIN (
    SELECT address.address_id, address.address_type,
           address.customer_id, city.city
    FROM address LEFT JOIN city
    ON address.city_id = city.city_id
  ) AS t1
  ON
  (customer.customer_id = t1.customer_id AND
   t1.address_type = "home")
) AS t2
GROUP BY city;

49
MongoDB Aggregation Pipeline Syntax
db.customer.aggregate([
  {
    $unwind: "$address"              // These "phases" are distinct and easy to understand
  },
  {
    $match: {"address.address_type": "home"}
  },
  {
    $group: {                        // They can be thought about in order… no nesting
      _id: "$address.city",
      totalSpend: {$sum: "$annualSpend"},
      averageSpend: {$avg: "$annualSpend"},
      maximumSpend: {$max: "$annualSpend"},
      customers: {$sum: 1}
    }
  }
])

MongoDB aggregation syntax makes it easier to understand, debug, rewrite and optimize.

For pipelines with only a single stage, the square braces [] can be omitted.

50
Aggregation features
A feature rich framework for data transformation and Analytics

Pipeline Stages:
• $match • $group • $facet • $geoNear • $graphLookup • $setWindowFields • $unionWith • $unwind • $limit
• $lookup • $out • $project • $search • $sort • ...and more

Operators:
• Mathematical: $add, $abs, $subtract, $multiply, $divide, $log, $log10, $stdDevPop, $stdDevSamp, $avg, $sqrt, $pow, $sum, $convert, $round, etc.
• Conditionals: $and, $or, $eq, $lt, $lte, $gt, $gte, $cmp, $cond, $switch, $in, etc.
• String: $toUpper, $toLower, $substr, $strcasecmp, $concat, $split, etc.
• Literals: $exp, $let, $literal, $map, $type, etc.
• Temporal: $dateAdd, $dateDiff, $dateSubtract, $dateTrunc, $dateFromParts, $dateToParts, $dateFromString, $dateToString, $dayOfMonth, $isoWeek, $minute, $month, $year, etc.
• Array: $push, $reduce, $reverseArray, $addToSet, $arrayElemAt, $slice, $zip, etc.
• Window Functions
• Regex: $regexFind, $regexMatch, etc.
• Trigonometry: $sin, $cos, $degreesToRadians, etc.
• Custom Aggregation Expressions

PRACTICAL MONGODB AGGREGATIONS EBOOK

51
Basic Aggregation Stages
Stage Syntax
$match equivalent to find(query)
$project equivalent to find({},projection)
$sort equivalent to find().sort(order)
$limit equivalent to find().limit(num)
$skip equivalent to find().skip(num)
$count similar to countDocuments()
**Specify new field name which contains count; Filter is not specified

52
Dollar ($) overloading
{$match: {a: 5}}
Dollar on left means a stage name - in this case a $match stage

{$set: {b: "$a"}}


Dollar on right of colon "$a" refers to the value of field a

{$set: {area: {$multiply: [5,10]}}}


$multiply is an expression name left of colon

{$set: {displayprice: {$literal: "$12"}}}


Use $literal when you want either a string with a $ or to $project an explicit number

53
Aggregation Expressions
Expressions can be a literal value:
{$set: {answer: 42}}

Value of another field:


{$set: {answer: "$ultimate_answer"}}

Simple calculation:
{$set: {answer: { $multiply: [6, 7]}}}

Nested calculation:
{$set: {simple_interest:
{$divide: [
{ $multiply: ["$principal","$interest_rate", "$time"] },
100
] }
} }

54
Using $project
● $project specifies the output document shape
● Just like in .find(), use 0 to exclude fields and 1 to include fields (not
together)
● New fields can also be added using $project
● Expressions define the field values
{
$project: {
_id: 0,
name: "$fullname",
average_speed: { $divide: ["$distance","$time"] }
}
} 55
Using $set or $addFields
● To add in additional fields rather than specify the whole output
use $set, not $project
● $set and $addFields are aliases of each other, and can be used
interchangeably.
● It also lets you replace existing values
{
$set: {
average_speed: { $divide : ["$distance","$time"]
}
}
}
56
Using $group
Syntax:
{
  $group: {
    _id: <expression>, // group key
    field1: {<accumulator>: <expression>}, …
  }
}

● Groups documents by a "group key"
● Each unique "group key" value represents one "group"
● Output is one document per group
● Additional fields can contain results of accumulator expressions

Example:
{
  $group: {
    _id: "$country", // group key
    urban_population: {$sum: "$city_population" }
  }
}

Output:
[
  {_id: "South Africa", urban_population: 40928486},
  {_id: "Germany", urban_population: 65286292},
  …
]
57
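What $group with a $sum accumulator does can be sketched in plain JavaScript (a simplified in-memory model of the stage, with made-up sample data):

```javascript
// Group docs by keyField, summing sumField into outField per bucket.
function group(docs, keyField, outField, sumField) {
  const buckets = new Map();
  for (const d of docs) {
    const key = d[keyField];
    buckets.set(key, (buckets.get(key) || 0) + d[sumField]);
  }
  return [...buckets].map(([k, v]) => ({ _id: k, [outField]: v }));
}

const cities = [
  { country: "ZA", city_population: 4000000 },
  { country: "DE", city_population: 3600000 },
  { country: "ZA", city_population: 450000 },
];
group(cities, "country", "urban_population", "city_population");
// → [ { _id: "ZA", urban_population: 4450000 },
//     { _id: "DE", urban_population: 3600000 } ]
```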
Common $group Accumulators

$avg average of numerical values


$first/$last value from the first or last document for each group
$max/$min highest or lowest expression value for each group
$count number of documents in each group
$push array of expression values for each group
$sum sum of numerical values

58
Using $unwind

● Opposite of $group
● Applied on array fields only
● Generates 1 document per element in array.
● Optionally, compute position in array, and also preserve documents without that array.

Syntax:
{ $unwind: "$<field-name>" } OR
{
  $unwind: {
    path: "$<field-name>",
    includeArrayIndex: "<array-index-field-name>",
    preserveNullAndEmptyArrays: < true | false >
  }
}

Example: document is { a: 1, b: [ 2, 3, 4 ] }

{ $unwind: "$b" }
Output:
[ { a: 1, b: 2}, { a: 1, b: 3}, { a: 1, b: 4 } ]

{ $unwind: { path: "$b", includeArrayIndex: "pos" } }
Output:
[ { a: 1, b: 2, pos: 0}, { a: 1, b: 3, pos: 1}, … ]

59
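The same behavior can be sketched in plain JavaScript (a simplified model, not MongoDB internals):

```javascript
// Emit one copy of doc per element of doc[field]; optionally record
// the element's position under the includeArrayIndex field name.
function unwind(doc, field, includeArrayIndex) {
  return (doc[field] || []).map((el, i) => {
    const out = { ...doc, [field]: el };
    if (includeArrayIndex) out[includeArrayIndex] = i;
    return out;
  });
}

unwind({ a: 1, b: [2, 3, 4] }, "b");
// → [ { a: 1, b: 2 }, { a: 1, b: 3 }, { a: 1, b: 4 } ]

unwind({ a: 1, b: [2, 3] }, "b", "pos");
// → [ { a: 1, b: 2, pos: 0 }, { a: 1, b: 3, pos: 1 } ]
```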
Using $lookup
● Like Left Outer Join or Nested Select
● Only on collections in the same database!
● Embeds results as array in parent document
● Needs indexing and tuning.

Syntax:
{
  $lookup: {
    from: <collection to join>,
    localField: <field from the input documents>,
    foreignField: <field from "from" collection>,
    as: <field name of output array>
  }
}

60
Using $lookup - Example
Let’s fetch all comments on the movie "The Hunger Games", along with the movie title.

61
Using $out
● Helps to preserve results of a complex computation without having to recompute
the data.
● Creates a new collection in a specified database from the pipeline output.
● If the target collection already exists, its contents will be replaced entirely.
○ The target collection can be the same as the source collection.
● This stage must be the very last stage of the pipeline.
● Syntax:
○ In the same database:
{ $out: "<output-collection>" }
○ In a different database:
{ $out: { db: "<output-db>", coll: "<output-collection>" } }

62
Revisiting our Aggregation Pipeline

63
Aggregation Pipeline - Tips & Tricks
● Place the $match, $limit & $skip stages as early in the pipeline as possible. It reduces
the number of documents passed on, which reduces the processing required.

● $match or $match + $sort at the start of the pipeline is equivalent to a single query
(with a sort), and can use an index.

● Perform $sort before $limit, else the results may be inaccurate.

● MongoDB works out the required early projection; do not $project to optimize.

● Document size inside the pipeline can be up to 64 MB. Document size in the final output
must be at most 16 MB, since it must be BSON.

● The aggregation pipeline merges and reorders stages as needed.

● Avoid using $lookup as much as possible. Data modeling should be optimized for the
workload.

64
Try it at home!
Use the sample_analytics database and collections customers, accounts & transactions.

● Find the top 5 customers who have the most number of accounts associated with them.
Output only the username, email and number_of_accounts of each customer.
Hint: to get the size of accounts array, use { $size : "$accounts"}

● Find the 3 most popular products among customers who are born on or after 1st Jan
1996. The output should contain only the fields product_name and product_count.
Store the output in a collection called popular_products.

● Challenging:
For all customers born on or after 1st Jan 1997, output their username, age, and
last_transaction_date.
Hint: 1 year = 31556952000 milliseconds

65
Quiz Time!
#1. What stage is used to compute an
average of a range of values?

A $match B $project C $group

D $lookup E $facet

Answer in the next slide.


67
#2. Which one of these aggregation stages
should be used last in the pipeline?

A $project B $sort C $out

D $addFields E $redact

Answer in the next slide.


69
#3. What stage is the opposite of
$unwind in an aggregation pipeline?

A $facet B $group C $bucket

D $merge E $set

Answer in the next slide.


71
Multi-Document
ACID Transactions
Database basics:

● No partial updates
● Follow the rules
● Respect other users
● Don't lose data

What we need from any database.

74
Atomic

Consistent

Database Isolated

basics Durable

ACID

75
How MongoDB updates data

Find: Identify a document that meets the filter supplied.
Lock: Create a blocking lock that prevents any writes, or wait for an existing lock to be released.
Check: Ensure the document still meets the criteria.
Modify: Update the data and all relevant indexes.
Unlock: Release the lock.
Repeat: If updateMany(), then start on the next matching document.

76
Single document updates (ACID):
Atomic ✔
Consistent
Isolated ✔
Durable

77
Multi document updates (ACID):
Atomic
Consistent
Isolated
Durable

78
Sessions & Transactions

● Session is a group of database operations related to each other, and should be run together.
● These operations are run within a singular transaction within a session.
● Operations are committed only when the transaction is committed.
● Upon error or certain conditions, transaction can be aborted & operations will be rolled back.
● By default, transaction timeout = 60 seconds
● Session object must be passed in the operation's method call.

async function postReview(review)
{
  const session = client.startSession();
  session.startTransaction();
  await reviews.insertOne(review, { session });
  await movies.updateOne(
    { _id: review.movie_id },
    { $inc: { reviews: 1,
              score: review.score },
      $push: { latest_reviews: {
        $each: [review],
        $slice: 10 }}},
    { session });
  try {
    await session.commitTransaction();
  } catch (e) {
    await session.abortTransaction();
    throw e;
  } finally {
    await session.endSession();
  }
}
79


When you commit - all changes are made.
When you abort - no changes are made.
→ Atomic ✔

● Schema Validation & constraints still apply, just as before.
→ Consistent ✔

● After the first operation in a transaction you see no changes to the DB except your own (Read Your Own Writes).
● If you try to modify a record that has changed since you began, it will fail.
● No one sees any of your changes until you commit.
→ Isolated ✔

● Due to this, Durable applies only on commit.
→ Durable ✔
80
Callback API

● The previous example showed Core API.
  ○ In case of transient or network errors, transactions must be retried.
  ○ Retry logic must be written in the application layer.
● This example shows Callback API
  ○ Retry logic is built into the withTransaction() method.
  ○ Retries will be done as many times as needed, until 2 minutes from the start of withTransaction().

async function postReview(review)
{
  const session = client.startSession();
  try {
    await session.withTransaction(async () => {
      await reviews.insertOne(review, { session });
      await movies.updateOne(
        { _id: review.movie_id },
        { $inc: { reviews: 1,
                  score: review.score },
          $push: { latest_reviews: {
            $each: [review],
            $slice: 10 }}},
        { session });
    });
  } finally {
    await session.endSession();
  }
}
81
Data Modeling
Basics
A common misconception

84
With MongoDB you are designing an
optimal schema for your use cases - not
a generic schema for every possible use
case.

85
Considerations for Data Modeling
Schema is defined at the application-level. Its design should focus on the
application, and not the abstract nature of the data.

● What does my application do?

● What data will I store?

● How will users access this data?

● What data will be most valuable?

86
A good data model leads to:
● Easier manageability
● More efficient queries
● Lower memory & CPU
● Lower costs

Data that is accessed together should be stored together!
87
A document in MongoDB

Referencing

Embedding

88
Embedding:
+ Single query to retrieve data
+ Single operation to update/delete data
- Data duplication
- Large documents

Referencing:
+ No duplication
+ Smaller documents
- Need to join data from multiple documents
- Need to update data in multiple documents
Types of Relationships

Relationship Type       Examples

1. One-to-One           Book ISBN - Book Name; Aadhaar Number - Person Name
2. One-to-Few           Customer ID - Customer Address; Product ID - Available Colors
3. One-to-Many          Car Model - Car Parts; TV Show - Episodes
4. One-to-Squillions    Twitter ID - Followers; Server - Log Messages
5. Many-to-Many         Books - Authors; Students - Courses

90
Types of Relationships - MongoDB Examples

One-to-One

One-to-Few

91
Types of Relationships - MongoDB Examples
One-to-Many One-to-Many
(Embedding) (Referencing on “One” side)

92
Types of Relationships - MongoDB Examples
One-to-Squillions
(Referencing on
“Many” side)

93
Types of Relationships - MongoDB Examples
Many-to-Many
(Referencing on both sides)

94
Things to Avoid ‒ Schema Anti-Patterns
● Massive or Unbounded Arrays

● Massive Number of Collections

● Bloated Documents

● Unnecessary Indexes

● Queries without Indexes

● Data that is accessed together but stored separately

95
Step-by-step iteration
Inputs: business domain expertise, current and predicted scenarios, production logs and stats.

1. Evaluate the application workload
   Output: a list of operations ranked by importance; data size
2. Map out entities and their relationships
   Output: Relationship Diagram (link or embed?)
3. Finalize the data model for each collection
   CRD: identify and apply relevant design patterns, informed by data size, database queries
   and indexes, current operations assumptions, and collection growth projections
   Output: collections with documents, fields and shapes for each
Quiz Time!
#1. Which of these are relevant for
MongoDB Schema Design?

A. App Functionality
B. Go Live Date
C. Relationship between Entities
D. Data Access Patterns
E. Server Location

Answer in the next slide.


98
#2. Which of these are NOT data
relationship types?

A One-to-One B One-to-Squillions C Many-to-One

D Many-to-Many E One-to-Few

Answer in the next slide.


100
#3. Which of these are benefits of
Embedding?

A. Atomicity
B. Avoid $lookup
C. Richer documents
D. Large documents
E. Data duplication

Answer in the next slide.


102
#4. Which of these are Schema Design
Anti-Patterns?

A. Average Document Size = 10 MB
B. Small Array Size
C. Normalization of data accessed together
D. 10000 Collections in a Database
E. Unnecessary Indexes

Answer in the next slide.


104
Follow along!

Atlas Search
What is Search?
"With a database query - you know exactly what you want.
When you use search you are open to suggestions." - unknown

db.movies.find({ title : "Raising Arizona" })

Title IS EXACTLY "Raising Arizona"

db.movies.find({ title: /Raising Arizona/ })

Title CONTAINS the phrase "Raising Arizona."

db.movies.aggregate( [ { $search :
{ text : { query : "Raising Arizona", path: "title" } } } ] )
Title mentions some variations of "Raising," "Arizona," or both

108
What is Search?

Search, or “Full-Text Search” is the ability to search across all of your data
(JSON Docs, CSVs, PDFs, Web Pages … etc) and efficiently return a list of results,
ranked based on how well they matched to the search term

Usually defined as a Search bar. Results are ranked based on a score, e.g. for the search "Javascript Book":

Title                              Score
Javascript the Definitive Guide    75
Javascript the Good Parts          60
The Beginners Guide to [Link]      35
Angular for Dummies                25
Introducing MongoDB                10

109
Atlas Search
● Powered by Apache Lucene, which is the world’s leading text search engine.

● It only runs on MongoDB Atlas, not on Enterprise or Community Edition.

● Requires the creation of a “Search Index” on the searchable fields.


This is different from the database indexes we studied earlier.

● Requires the use of the Aggregation Framework, through the $search and
$searchMeta stages.

● Makes search workloads more efficient and scalable via Search Nodes.

110
● Facets
● Fuzzy Search - Typo tolerance
● Autocompletion
● Geo search
● Filters
● Custom Score
● Highlight
111
Hands-On Exercise
Run your first Atlas Search query
● Go to Atlas Search on the left panel and Create Search Index
○ Select the sample_mflix database & movies collection.
○ Proceed with the defaults & create the Search Index
○ Allow it to build and reach the Ready status.

● When it’s Ready, click the Query button to run Search Queries.

● Type a search word in the search box and observe the results
○ Try different variations of the same keywords and observe the results.
○ e.g. “Tom Cruise”, “tom cruise”, “tam cruise” etc.
○ Try searching with the same terms using .find(), or .aggregate() with $match

● Click Edit Query and observe the shape of the query.

● Click Index Overview on the left panel and observe the index configuration.
114
Index Configuration

Edit Index Definition with JSON Editor

115
Analyzers

Analyzers apply parsing and language rules to the query.

Tokens are the individual terms that are queried; think of these as the words that matter in a search query:

"Lions and tigers and bears, oh my!"
→ lions and tigers and bears, oh my!
→ lions tigers bears oh my

Analyzers also support language rules. A search for a particular word can return results for related words:

"shard" → ["shard", "shards", "sharded", "sharding", ...]
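A much-simplified analyzer pipeline can be sketched in plain JavaScript (lowercase, strip punctuation, drop stop words; the stop-word list is made up for illustration):

```javascript
// Hypothetical mini stop-word list; real analyzers use per-language lists.
const STOP_WORDS = new Set(["and", "the", "a", "an", "of"]);

function analyze(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")   // strip punctuation
    .split(/\s+/)
    .filter((t) => t && !STOP_WORDS.has(t));
}

analyze("Lions and tigers and bears, oh my!");
// → [ "lions", "tigers", "bears", "oh", "my" ]
```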
Lucene Inverted Indexes

1 = { txt: "dark clouds gather storm" }
2 = { txt: "storm clouds gather today" }
3 = { txt: "today clouds are not dark clouds" }

Term Dictionary (reached via a hash, e.g. hash("gather") = 5) with Postings lists:

are      → 3
today    → 2, 3
clouds   → 1, 2, 3
gather   → 1, 2
dark     → 1, 3
storm    → 1, 2
not      → 3

● Lucene uses a hash table to get to a compact list of documents for a term, stored as multiple segments.
● Deletion and merging/compacting of these lists is done asynchronously.
● Slower to write, but faster for multi-index queries, counting, and facets.
118
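Building such a term-to-postings map can be sketched in plain JavaScript (a simplified model; real Lucene segments, hashing and compression are far more involved):

```javascript
// Build an inverted index: term -> sorted postings list of 1-based doc ids.
function buildInvertedIndex(docs) {
  const index = new Map();
  docs.forEach((text, i) => {
    for (const term of new Set(text.split(/\s+/))) {
      if (!index.has(term)) index.set(term, []);
      index.get(term).push(i + 1);
    }
  });
  return index;
}

const idx = buildInvertedIndex([
  "dark clouds gather storm",
  "storm clouds gather today",
  "today clouds are not dark clouds",
]);
idx.get("clouds"); // → [1, 2, 3]
idx.get("storm");  // → [1, 2]
```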
Dynamic Mapping
● If it is set to true, all fields in the document are indexed, except timestamps.
○ Any field-level definitions will override the defaults for those fields alone.

● If it is set to false, field mappings for the desired fields must be specified.

● Default = true on Atlas Web Console; Default = false on mongosh

● Typically, Search indexes with dynamic field mappings can be significantly larger in size
and much less performant.

119
Static Mapping
● To create a Search index with only the desired fields mapped, static field mappings can
be specified.

● Unlike a regular index, the field data type must be specified.


○ Atlas Search data types: string, number, date, boolean, objectId, token, etc.

● You can add other field-level specifications like analyzer & searchAnalyzer. If
specified, these override the dynamic field mapping specification.

● For mapping fields within embedded documents, the parent field must also be mapped.
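The bullets above can be sketched as a static-mapping definition (a Python dict; the field names, the analyzer choice, and the embedded `awards` sub-document are illustrative assumptions):

```python
# Sketch of a static-mapping Search index definition (dynamic mapping off).
# Field names and the embedded-document example are assumptions; note that
# each field's data type must be specified explicitly.
static_definition = {
    "mappings": {
        "dynamic": False,
        "fields": {
            "title": {"type": "string", "analyzer": "lucene.standard"},
            "year":  {"type": "number"},
            # embedded documents: the parent field must also be mapped
            "awards": {
                "type": "document",
                "fields": {"wins": {"type": "number"}}
            }
        }
    }
}
```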

120
Static Mapping - Try It Yourself!
● Create a fresh Search Index on sample_mfl[Link]
with a different name.
○ Before creating, click on Refine Your Index button
on the top.
○ Disable Dynamic Mapping.
○ Click on Add Field Mapping button and select Field
Name as title. Click Add.
○ Also add field mapping for cast and [Link]

● Save changes and create the Search Index.

● Once ready, compare its size to that of the default index.

● Click the ••• menu and select Edit with JSON Editor.
Observe the syntax.

121
Running a Search Query
● Uses $search stage in .aggregate()
○ It must be the very first stage!
○ Following stages can be as usual.

● Unlike regular queries, index name must be specified.

● Use operators to define the type of search


○ text performs full-text search using the analyzer
■ Most commonly used operator.
■ Specify the path and query
■ Optionally specify fuzzy

○ equals checks for an exact match.


■ Requires token index for strings
■ Specify the path and value

● Use the sort option to avoid having to $sort later.
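Putting the above together, a minimal $search pipeline might look like this, shown as a plain Python list (the index name "default" and the field names are assumptions; with PyMongo you would run it via `coll.aggregate(pipeline)`):

```python
# Sketch of a $search aggregation pipeline. $search must be the very first
# stage; later stages work as usual. Index and field names are assumptions.
pipeline = [
    {"$search": {
        "index": "default",       # unlike regular queries, name the index
        "text": {
            "path": "title",      # field(s) to search
            "query": "storm",     # the search terms
            "fuzzy": {}           # optional: tolerate small typos
        },
        "sort": {"year": -1}      # sort here to avoid a later $sort stage
    }},
    {"$limit": 5}                 # following stages can be as usual
]
# coll.aggregate(pipeline)       # would execute it against the cluster
```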


122
Search Result Scores
● Search results are sorted by their relevance score,
which is influenced by:
○ Position of the search query in the document
○ Frequency of occurrence
○ Type of operator & analyzer used

● By default, the score is not visible in the result.


○ It is part of the search results metadata
○ It has to be included using $set or $project
{$project: { score: { $meta: "searchScore" }}}

● Scores for a given field can be modified


○ given a boost (multiplied), or
○ set to a constant
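For example, surfacing the score and boosting one clause's contribution could be sketched as follows (field names, index name, and the boost value are assumptions):

```python
# Sketch: project the relevance score and boost one field's contribution.
pipeline = [
    {"$search": {
        "index": "default",
        "text": {
            "path": "title",
            "query": "storm",
            # multiply this clause's score by 2...
            "score": {"boost": {"value": 2}}
            # ...or pin it: "score": {"constant": {"value": 5}}
        }
    }},
    # the score lives in the search metadata; include it explicitly
    {"$project": {"title": 1, "score": {"$meta": "searchScore"}}}
]
```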

123
Compound Search
● To search with multiple conditions, use the compound operator
● Use one or more of these clauses
○ must
■ logical AND
■ Only include results matching this clause
○ mustNot
■ logical AND NOT
■ Negation of must; doesn’t impact score
○ should
■ logical OR
■ Results may match the clause(s)
■ Can specify minimumShouldMatch (default = 0)
○ filter
■ equivalent to must; doesn’t impact score
● Can be nested to multiple levels.
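A compound query combining these clauses might be sketched like this (index name, field names, and values are illustrative):

```python
# Sketch of a compound $search stage. must/mustNot/should/filter take arrays;
# minimumShouldMatch is an integer (default 0).
compound_stage = {
    "$search": {
        "index": "default",
        "compound": {
            "must":    [{"text": {"path": "title",  "query": "storm"}}],
            "mustNot": [{"text": {"path": "genres", "query": "Comedy"}}],
            "should":  [{"text": {"path": "plot", "query": "ocean"}},
                        {"text": {"path": "plot", "query": "ship"}}],
            "filter":  [{"text": {"path": "countries", "query": "USA"}}],
            "minimumShouldMatch": 1   # at least one should clause must match
        }
    }
}
```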
124
Compound Search - Example

Note the syntax:

● compound has object {} value

● must, mustNot, should etc. have array [] value

● minimumShouldMatch has integer value

In this example, what happens if we set

● minimumShouldMatch = 1 ?
● minimumShouldMatch = 2 ?

125
Autocomplete
● Gives the ability to “search as you type” and to perform partial matches.

● Only works with string fields; use with data type autocomplete.

● Optionally, index only the range of character lengths required, using
minGrams (default = 2) & maxGrams (default = 15).
○ The smaller the minGrams, the larger the index.
○ The larger the maxGrams, the larger the index.

● To use in a $search query, write it like the text operator, except with the
autocomplete operator instead.

Try it out:

Edit the static mapping index, and add the field title as an autocomplete type.
Query >> Edit Query >> Create Query From Template >> autocomplete (Insert)
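The query from the steps above could be sketched as follows (the index name "static-mapping" is an assumption; the field must be mapped with the autocomplete data type):

```python
# Sketch: an autocomplete $search stage for "search as you type".
# The index name is an assumption; "title" must be an autocomplete-typed
# string field in that index.
stage = {
    "$search": {
        "index": "static-mapping",
        "autocomplete": {
            "path": "title",    # the autocomplete-mapped field
            "query": "inter"    # partial input typed so far
        }
    }
}
```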
126
Tokenization

String: "Quick fox"

● When a string is broken down into matchable tokens for autocomplete, we use
one of these methods (examples below use size 3-4; "_" marks a space).

● nGram tokens are all sets of consecutive letters and can include spaces.

● edgeGram tokens are anchored at the beginning of a word.

● rightEdgeGram tokens are anchored at the end of a word.

nGram (size 3-4):          Qui, Quic, uic, uick, ick, ick_, ck_, ck_f, ...
edgeGram (size 3-4):       Qui, Quic, fox
rightEdgeGram (size 3-4):  fox, _fox, ick, uick

127
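The three methods can be sketched in a few lines of plain Python (a simplified model: it splits on spaces, whereas the real tokenizer also emits grams spanning word boundaries, like "ick_" and "_fox" above):

```python
def ngrams(s, lo=3, hi=4):
    """All consecutive substrings of length lo..hi (spaces included)."""
    return [s[i:i + n] for n in range(lo, hi + 1) for i in range(len(s) - n + 1)]

def edge_grams(s, lo=3, hi=4):
    """Prefixes of each word, length lo..hi."""
    return [w[:n] for w in s.split() for n in range(lo, min(hi, len(w)) + 1)]

def right_edge_grams(s, lo=3, hi=4):
    """Suffixes of each word, length lo..hi."""
    return [w[-n:] for w in s.split() for n in range(lo, min(hi, len(w)) + 1)]

print(edge_grams("Quick fox"))        # ['Qui', 'Quic', 'fox']
print(right_edge_grams("Quick fox"))  # ['ick', 'uick', 'fox']
```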
Grouping Results using Facets
● Facets are buckets into which we group our search results.

● Facets can be used only with numbers, dates & strings.

● Special data types: stringFacet numberFacet dateFacet

● For queries, use the $searchMeta stage instead of $search

○ Facets are part of the search metadata, not the search results
themselves

○ $searchMeta can also return the count of documents in each bucket.

● In the query:

○ For number and date facets, specify boundaries as an array.

○ For string facets, specify numBuckets as an integer.


128
Facets - Try it out!
● Create a new Search index
“movie-facet”

● Dynamic mapping OFF

● Map the fields as follows:

○ countries : string

○ genres : stringFacet

○ [Link] : numberFacet

○ released : dateFacet

129
Facets - Try it out!

130
Facets - Try it out!

131
Quiz Time!
#1. Which of these are valid Search
Index definitions?

A  { mappings: { dynamic: true } }

B  { mappings: { dynamic: false } }

C  { analyzer: "[Link]",
     mappings: { dynamic: true } }

D  { analyzer: "[Link]" }

E  { searchAnalyzer: "[Link]",
     mappings: { dynamic: true } }

Answer in the next slide.

133
#2. Which two stages cannot exist in
the same aggregation pipeline?

A $match B $search C $searchMeta

D $limit E $group

Answer in the next slide.


135
#3. Which of the following search
queries returns {city: "New Delhi"}
when searched with fuzzy: {} option?

A Deli B delhi C dilli

D New Delhi E ncr

Answer in the next slide.


137
#4. What buckets will we get if
boundaries: [ 5, 10, 15]

A 0-5 B 5-10 C 10-15

D 5-15 E 15+

Answer in the next slide.


139
#5. Select two true statements about the
autocomplete feature in Atlas Search

A  Autocomplete is automatically set for all searches

B  The index must include a minGrams parameter (set to 4 or more)

C  It is only applicable to autocomplete indexes

D  It only works if the field has a string value

E  Tokenization method must be specified to edgeGram or nGram in the search

Answer in the next slide.

141
Language-specific
Drivers
MongoDB Drivers
● Idiomatic clients in popular programming languages that allow your
application to connect to your MongoDB cluster.

● Supported languages:
C, C++, C#, Go, Java, [Link], PHP, Python, Ruby, Rust, Scala, Swift,
TypeScript, Kotlin

● MongoDB Developer Center ([Link]/developer) has in-depth


tutorials and full sample apps for major programming languages.

145
MongoDB Drivers
Language MongoClient Library / Package

PyMongo (synchronous apps)


Python Motor (asynchronous apps)
PyMongoArrow (Pandas, Numpy, Arrow)

[Link] mongodb

Java mongodb-driver-sync

C# [Link]

PHP mongodb

146
MongoClient
● Connection between the cluster and the driver is established via the
MongoClient instance.

● It is initialized using the connection string.


e.g. client = MongoClient("mongodb+srv://………")

● Connection Pools

○ Helps reduce application latency by reusing connections from the pool

○ Limits the number of connections by avoiding wasteful creation

○ Maintained automatically through the MongoClient instance

○ Use only one MongoClient instance per cluster per application
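Pool sizing is typically controlled through connection-string options; a sketch (the host and credentials are placeholders):

```python
# Sketch: tune the connection pool via URI options. Host and credentials
# below are placeholders, not a real cluster.
uri = ("mongodb+srv://user:pass@cluster0.example.mongodb.net/"
       "?maxPoolSize=50&minPoolSize=5")
# client = MongoClient(uri)  # create ONE MongoClient per cluster per app;
# reuse it everywhere — the driver maintains the pool automatically.
```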

147
Syntax for Python Driver (PyMongo)
Client client = MongoClient("mongodb+srv://…")
db = client.sample_mflix
coll = [Link]
[Link]()

Create coll.insert_one(doc) coll.insert_many([doc,doc,doc])

Read coll.find_one(query, projection) [Link](query, projection)

Update coll.update_one(query, change) coll.update_many(query, change)

Delete coll.delete_one(query) coll.delete_many(query)

Aggr. [Link](pipeline)

Txns. with client.start_session() as session:


session.with_transaction(<function>) 148
Syntax for [Link] Driver
Client const client = new MongoClient("mongodb+srv://…")
[Link]()
const db = [Link]("sample_mflix")
const coll = [Link]("movies")
[Link]()

Create [Link](doc) [Link]([doc,doc,doc])

Read [Link](query, projection) [Link](query, projection)

Update [Link](query, change) [Link](query, change)

Delete [Link](query) [Link](query)

Aggr. [Link](pipeline)

Txns. const session = [Link]()


const txnResults = await [Link]( async() => {...} )
149
Syntax for Java Driver
Client MongoClient client = [Link]("mongodb+srv://…");
MongoDatabase db = [Link]("sample_mflix");
MongoCollection<Document> coll = [Link]("movies");
Document movie = new Document("_id", new ObjectId())
.append("title","Interstellar");
[Link]()

Create [Link](doc) [Link](List<Document>)

Read [Link](query, projection).first() [Link](query, projection)

Update [Link](query, change) [Link](query, change)

Delete [Link](query) [Link](query)

Aggr. [Link](asList([Link](...),[Link](...)));

Txns. final ClientSession session = [Link]();


[Link](new TransactionBody<String>(){
public String execute(){ ... }
});
150
Syntax for C# Driver
Client var client = new MongoClient("mongodb+srv://…");
var db = [Link]("sample_mflix");
var coll = [Link]<BsonDocument>("movies");
var doc = new BsonDocument{ {"title", "Interstellar"}, ... };

Create [Link](doc) [Link](new[]{doc,doc,..})

Read [Link](query).FirstOrDefault() [Link](query)

Update [Link](query, change) [Link](query, change)

Delete [Link](query) [Link](query)

Aggr. [Link]().Match(...).Group(...).Project(...);

Txns. using( var session = [Link]() ){


[Link]( (s,ct) => { ... } };

151
Syntax for PHP Driver
Client $client = new MongoDB\Client('mongodb+srv://…');
$db = $client->sample_mflix;
$coll = $db->movies;
$doc = [ 'title' => 'Interstellar' ];

Create $coll->insertOne($doc) $coll->insertMany([[...], [...]])

Read $coll->findOne($filter) $coll->find($filter)

Update $coll->updateOne($query,$change) $coll->updateMany($query,$change)

Delete $coll->deleteOne($query) $coll->deleteMany($query)

Aggr. $coll->aggregate([ ['$match'=> ['year'=> ['$gt'=>2000] ] ] ]);

Txns. $callback = function(\MongoDB\Driver\Session $session)


$session = $client->startSession();
MongoDB\with_transaction($session, $callback);
152
Quiz Time!
#1. How many MongoClients should be
used per cluster per application?

A One B Two C Ten

D Zero E As many as needed

Answer in the next slide.


154
#2. Which of these helps with
controlling Connection Pool size?

A minSizeOfPool B poolSizeRange C maxSizeOfPool

D minPoolSize E maxPoolSize

Answer in the next slide.


156
Thank you
