MongoDB Indexing Strategies Explained
MCP300: MongoDB for Developers
Nishit Rao
Solutions Architect
MongoDB India
Follow along!
Indexing: Advanced
Choosing Query Plans
MongoDB checks the Plan Cache for an optimal query plan that was used previously.
If an entry is found:
Plan Cache
Least Recently Used (LRU) Plan Cache entries are automatically evicted when:
How MongoDB retrieves data
Covered Queries
[Diagram: index entries sorted by key, mapped to record ids: 5 → 1, 20 → 4, 25 → 2, 25 → 5, 35 → 3]
● Try to use Covered Queries wherever possible, but don’t create too many indexes.
● Optimize your Data Model to the Queries, not the other way around.
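The idea of a covered query can be sketched in plain Python: the index stores (key, record-id) pairs, so a query that filters on the indexed field and projects only that field never has to fetch the documents. The collection contents and field names below are invented for illustration.

```python
# Toy model of a covered query: the index stores (key, record-id) pairs.
documents = {
    1: {"_id": 1, "zip": "10001", "city": "New York"},
    2: {"_id": 2, "zip": "94105", "city": "San Francisco"},
}
index_on_zip = sorted((doc["zip"], rid) for rid, doc in documents.items())

def covered_zip_lookup(zip_code):
    # Touches only index entries -- never the documents dict.
    return [{"zip": key} for key, _rid in index_on_zip if key == zip_code]

print(covered_zip_lookup("10001"))  # [{'zip': '10001'}]
```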
How to reduce the cost?
… and MongoDB features:
Index Statistics
To get index usage statistics for a collection:
db.collection.aggregate( [ { $indexStats : { } } ] )
Index Sizes
To get index sizes for a collection:
db.collection.stats().indexSizes
Database Profiler
Database profiler records information about slow operations:
db.setProfilingLevel( level, slowms )

Levels:
0 - Profiler is off
1 - Store only slow queries
2 - Store all queries

Syntax:
db.setProfilingLevel(<level>, <slowms>)
OR
db.setProfilingLevel(<level>, <options>)
Query Profile Data
To see the 5 most recent profiled events:
show profile
Indexes: Special Features
Partial Indexes

db.collection.createIndex(
  { keywords: 1 },
  { partialFilterExpression: { score: { $gte: 4 } } }
)

Documents:
1: { score: 5.0, keywords: [ 'friendly', 'clean', 'bright' ] }
2: { score: 3.0, keywords: [ 'cold', 'near beach', 'dark' ] }
3: { score: 4.0, keywords: [ 'friendly', 'near beach', 'cheap' ] }

Index entries, SORTED (only documents matching the filter are indexed):
Key (keywords)   Value (identity)
bright           1
cheap            3
clean            1
friendly         1
friendly         3
near beach       3
Sparse Indexes

db.collection.createIndex(
  { avatar_img: 1 },
  { sparse: true }
)

Documents:
1: { _id: 12, username: 'testUser', avatar_img: '[Link]' }
2: { _id: 34, username: 'anotherUser', avatar_img: '[Link]' }
3: { _id: 56, username: 'test', avatar_img: null }
4: { _id: 78, username: 'coolUser' }

Index entries, SORTED:
Key (avatar_img)   Value (identity)
[Link]             1
[Link]             2
null               3
null               4
TTL Indexes
● Special single-field indexes that help to automatically remove documents
from a collection after a certain amount of time, or at a specific clock time.
● Expiration threshold = indexed field value + specified no. of seconds
● Syntax:
db.collection.createIndex({ ts: 1 }, { expireAfterSeconds: 60 })
● A background task removes expired documents every 60 seconds, but it
may take more than 60 seconds depending on the workload on your cluster.
● Ways to use TTL index:
○ Put the current timestamp and specify age at which to delete
○ Put an expiration timestamp and specify expireAfterSeconds = 0
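The expiration rule above (indexed field value + expireAfterSeconds) can be sketched in Python; the 60-second value mirrors the syntax example, and the field name ts is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Toy model of TTL expiry: a document is eligible for removal once
# (indexed field value + expireAfterSeconds) is in the past.
expire_after_seconds = 60

def is_expired(doc, now):
    return doc["ts"] + timedelta(seconds=expire_after_seconds) <= now

now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
old_doc = {"ts": now - timedelta(seconds=120)}  # 2 minutes old -> expired
new_doc = {"ts": now - timedelta(seconds=30)}   # 30 seconds old -> kept
print(is_expired(old_doc, now), is_expired(new_doc, now))  # True False
```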
TTL Indexes
● TTL is the only index type which you can modify after creation.
To change the expireAfterSeconds, or to convert a non-TTL index to TTL
index:
db.runCommand({
  collMod: 'sensordata',
  index: {
    keyPattern: { ts: 1 },
    expireAfterSeconds: 3600
  }
})
TTL Index Restrictions
● A TTL index must be a single-field index on a date-valued field; otherwise, documents will not expire.
● It supports queries just like any other index. So, if there is already a non-TTL single-field index on a time field, you cannot create a new TTL index on that same field.
Wildcard Indexes
● A dynamic schema makes it hard to index all fields, e.g. IoT, metadata
● Wildcard indexes index all fields, or a subtree (sub-documents & arrays)
● If created on an embedded document:
○ It descends into the embedded document and indexes all its fields.
○ It descends all levels till it finds a primitive/scalar value.
● If created on an embedded array, it indexes all its elements
○ If element is an embedded document, it descends into that document.
○ If element is an array, it indexes that element array as a value.
● Not advisable as the index can become very big and inefficient.
Consider using the Attribute Pattern instead!
Wildcard Indexes
Syntax:
[Link]( { "$**" : 1 } ) [Note: you must use ""]
[Link]( { "tomatoes.$**" : 1 } )
[Link]( { year : 1, "tomatoes.$**" : 1 } )
[Link]( { "$**" : 1 },
{ wildcardProjection : { _id : 1, year : 0 } } ) [Note: _id excluded by default]
Clustered index: great at quickly accessing ranges of pre-sorted data.
Regular index: great at speeding up many different types of operational queries that have random access patterns to the data.
Clustered Indexes
● Collections created with a clustered index are called clustered collections.
○ Clustered index can only be created at the time of creating the collection.
○ A clustered collection can have only 1 clustered index, and only on the _id field.
● Faster queries on the clustered index field, such as queries with range scans and
equality comparisons on the clustered index key.
● Clustered collections have a lower storage size, which improves performance for
queries and bulk inserts.
● Clustered collections can eliminate the need for a secondary TTL index.
○ A clustered index is also a TTL index if you specify the expireAfterSeconds field.
○ If you use a clustered index as a TTL index, it improves document delete
performance and reduces the clustered collection storage size.
● The primary use-case of clustered indexes is Time-Series Collections.
Clustered Indexes
Syntax:
db.createCollection(
"stocks",
{
clusteredIndex: {
"key": { _id: 1 },
"unique": true,
"name": "stocks clustered key"
}
}
)
Quiz Time!
#1. What things might prevent an index being used to 'cover' a query?
A: Projecting the _id field
B: Having a multikey index
C: Sorting query results by one of the index fields
D: Retrieving an array field
E: Having more than one index on the same field
D: Automatically move data to an archive server
E: Automatically move data to another collection
A: Adds write overhead
B: Default slowness threshold is 500ms
C: Logs only slow read commands

D: unique
E: partialFilterExpression
MongoDB
Aggregations
The Power of MQL
For the sample_mflix.movies collection, write MQL queries for the following SQL queries:
SELECT title, year, countries FROM movies WHERE year > 1990;
SELECT countries, count(*) AS movie_count FROM movies WHERE year > 1990
GROUP BY countries HAVING count(*) > 100 ORDER BY movie_count DESC
LIMIT 10;
Programmatic Approach to the Solution
MongoDB Approach to the Solution
Pipeline Stages
MongoDB Aggregation Pipelines
Rich Expressions
db.orders.aggregate( [
  { $match: { status: "A" } },                                  // $match stage
  { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }   // $group stage
] )
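What the two stages compute can be sketched in plain Python; the order documents below are invented in the spirit of the example.

```python
from collections import defaultdict

# Pure-Python sketch of a $match -> $group pipeline over sample orders.
orders = [
    {"cust_id": "A123", "amount": 500, "status": "A"},
    {"cust_id": "A123", "amount": 250, "status": "A"},
    {"cust_id": "B212", "amount": 200, "status": "A"},
    {"cust_id": "A123", "amount": 300, "status": "D"},
]
# $match: { status: "A" }
matched = [o for o in orders if o["status"] == "A"]
# $group: { _id: "$cust_id", total: { $sum: "$amount" } }
totals = defaultdict(int)
for o in matched:
    totals[o["cust_id"]] += o["amount"]
result = [{"_id": k, "total": v} for k, v in totals.items()]
print(result)  # [{'_id': 'A123', 'total': 750}, {'_id': 'B212', 'total': 200}]
```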
SQL Joins & Aggregation Syntax

SQL queries have a nested structure: understanding the outer layers requires understanding the inner ones, so SQL has to be read "inside-out".

SELECT
  city,
  SUM(annual_spend) Total_Spend,
  AVG(annual_spend) Average_Spend,
  MAX(annual_spend) Max_Spend,
  COUNT(annual_spend) customers
FROM (
  SELECT t1.city, customer.annual_spend
  FROM customer
  LEFT JOIN (
    SELECT address.address_id, address.type,
           address.customer_id, city.city
    FROM address LEFT JOIN city
    ON address.city_id = city.city_id
  ) AS t1
  ON (customer.customer_id = t1.customer_id AND
      t1.type = "home")
) AS t2
GROUP BY city;
MongoDB Aggregation Pipeline Syntax

db.customers.aggregate([
  {
    $unwind: "$address"
  },
  {
    $match: { "address.type": "home" }
  },
  {
    $group: {
      _id: "$address.city",
      totalSpend: { $sum: "$annualSpend" },
      averageSpend: { $avg: "$annualSpend" },
      maximumSpend: { $max: "$annualSpend" },
      customers: { $sum: 1 }
    }
  }
])

These "phases" are distinct and easy to understand, and they can be thought about in order, with no nesting.
MongoDB aggregation syntax makes it easier to understand, debug, rewrite and optimize.
For pipelines with only a single stage, the square braces [] can be omitted.
Aggregation features
A feature-rich framework for data transformation and analytics
Basic Aggregation Stages
Stage Syntax
$match equivalent to find(query)
$project equivalent to find({},projection)
$sort equivalent to find().sort(order)
$limit equivalent to find().limit(num)
$skip equivalent to find().skip(num)
$count similar to countDocuments()
** Specify the new field name that will contain the count; a filter is not specified.
Dollar ($) overloading
{$match: {a: 5}}
Dollar on left means a stage name - in this case a $match stage
Aggregation Expressions
Expressions can be a literal value:
{$set: {answer: 42}}
Simple calculation:
{$set: {answer: { $multiply: [6, 7]}}}
Nested calculation:
{$set: {simple_interest:
{$divide: [
{ $multiply: ["$principal","$interest_rate", "$time"] },
100
] }
} }
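Evaluated against sample values, the nested expression above is ordinary arithmetic; the concrete numbers are invented for illustration.

```python
# The nested $set expression computes principal * interest_rate * time / 100.
principal, interest_rate, time = 1000, 5, 2
simple_interest = (principal * interest_rate * time) / 100
print(simple_interest)  # 100.0
```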
Using $project
● $project specifies the output document shape
● Just like in .find(), use 0 to exclude fields and 1 to include fields (not
together)
● New fields can also be added using $project
● Expressions define the field values
{
$project: {
_id: 0,
name: "$fullname",
average_speed: { $divide: ["$distance","$time"] }
}
}
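Applied to a single sample document, the $project stage above behaves like this sketch (the document contents are invented):

```python
# Sketch of the $project stage applied to one document.
doc = {"_id": 7, "fullname": "Asha", "distance": 120.0, "time": 2.0}
projected = {
    "name": doc["fullname"],                         # renamed via "$fullname"
    "average_speed": doc["distance"] / doc["time"],  # computed field
}  # _id: 0 -> _id is excluded
print(projected)  # {'name': 'Asha', 'average_speed': 60.0}
```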
Using $set or $addFields
● To add in additional fields rather than specify the whole output
use $set, not $project
● $set and $addFields are aliases of each other, and can be used
interchangeably.
● It also lets you replace existing values
{
  $set: {
    average_speed: { $divide: ["$distance", "$time"] }
  }
}
Using $group

● Groups documents by a "group key"
● Each unique "group key" value represents one "group"
● Output is one document per group
● Additional fields can contain results of accumulator expressions

Syntax:
{
  $group: {
    _id: <expression>,  // group key
    field1: { <accumulator>: <expression> }, …
  }
}

Example:
{
  $group: {
    _id: "$country",  // group key
    urban_population: { $sum: "$city_population" }
  }
}

Output:
[
  { _id: "South Africa", urban_population: 40928486 },
  { _id: "Germany", urban_population: 65286292 },
  …
]

Common $group Accumulators
Using $unwind

● Opposite of $group
● Applied on array fields only
● Generates 1 document per element in the array

Syntax:
{ $unwind: "$<field-name>" }
OR
{
  $unwind: {
    path: "$<field-name>",
    includeArrayIndex: "<array-index-field-name>",
    preserveNullAndEmptyArrays: <true | false>
  }
}

Example: document is { a: 1, b: [ 2, 3, 4 ] }
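The effect of $unwind on that example document can be sketched in Python:

```python
# What { $unwind: "$b" } does to { a: 1, b: [2, 3, 4] }:
# one output document per array element.
doc = {"a": 1, "b": [2, 3, 4]}
unwound = [{**doc, "b": elem} for elem in doc["b"]]
print(unwound)
# [{'a': 1, 'b': 2}, {'a': 1, 'b': 3}, {'a': 1, 'b': 4}]
```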
Using $lookup

● Like a Left Outer Join or a Nested Select
● Only on collections in the same database!
● Embeds results as an array in the parent document

Syntax:
{
  $lookup: {
    from: <collection to join>,
    localField: <field from the input documents>,
    foreignField: <field from the "from" collection>,
    as: <field name of output array>
  }
}
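The left-outer-join behavior can be sketched in Python; the movie and comment documents, and the field names, are invented for illustration.

```python
# Left-outer-join sketch of $lookup: for each movie, embed matching
# comments as an array (the "as" field).
movies = [{"_id": 1, "title": "The Hunger Games"}, {"_id": 2, "title": "Up"}]
comments = [
    {"movie_id": 1, "text": "Loved it"},
    {"movie_id": 1, "text": "Too long"},
]
joined = [
    {**m, "movie_comments": [c for c in comments if c["movie_id"] == m["_id"]]}
    for m in movies
]
print(joined[0]["movie_comments"])  # two embedded comments
print(joined[1]["movie_comments"])  # [] -- left outer join keeps the movie
```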
Using $lookup - Example
Let’s fetch all comments on the movie "The Hunger Games", along with the movie title.
Using $out
● Helps to preserve results of a complex computation without having to recompute
the data.
● Creates a new collection in a specified database from the pipeline output.
● If the target collection already exists, its contents will be replaced entirely.
○ The target collection can be the same as the source collection.
● This stage must be the very last stage of the pipeline.
● Syntax:
○ In the same database:
{ $out: "<output-collection>" }
○ In a different database:
{ $out: { db: "<output-db>", coll: "<output-collection>" } }
Revisiting our Aggregation Pipeline
Aggregation Pipeline - Tips & Tricks
● Place the $match, $limit & $skip stages as early in the pipeline as possible. It reduces
the number of documents passed on, which reduces the processing required.
● $match or $match + $sort at the start of the pipeline is equivalent to a single query
(with a sort), and can use an index.
● MongoDB works out the required early projection; do not $project to optimize.
● Document size inside the pipeline can be up to 64 MB. Document size in the final output must be at most 16 MB, since it must be BSON.
● Avoid using $lookup as much as possible. Data modeling should be optimized for the
workload.
Try it at home!
Use the sample_analytics database and collections customers, accounts & transactions.
● Find the top 5 customers who have the most number of accounts associated with them.
Output only the username, email and number_of_accounts of each customer.
Hint: to get the size of accounts array, use { $size : "$accounts"}
● Find the 3 most popular products among customers who are born on or after 1st Jan
1996. The output should contain only the fields product_name and product_count.
Store the output in a collection called popular_products.
● Challenging:
For all customers born on or after 1st Jan 1997, output their username, age, and
last_transaction_date.
Hint: 1 year = 31556952000 milliseconds
Quiz Time!
#1. What stage is used to compute an
average of a range of values?
D: $lookup   E: $facet
D: $addFields   E: $redact
D: $merge   E: $set
Database basics: ACID
● Atomic
● Consistent
● Isolated
● Durable
How MongoDB updates data
Single document updates: ACID
Atomic ✔   Consistent ✔   Isolated ✔   Durable ✔
Multi document updates: ACID
Atomic ❌   Consistent ✔   Isolated ❌   Durable ✔
async function postReview(review) { … }

● When you commit, all changes are made.
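The all-or-nothing commit behavior can be sketched with an in-memory "database"; this is a toy model of the idea, not the driver's transaction API, and the collection and document contents are invented.

```python
import copy

# Toy all-or-nothing commit: stage every change on a copy and swap it in
# only if all writes succeed, so readers never observe a half-applied batch.
def commit(db, changes):
    staged = copy.deepcopy(db)
    for collection, doc in changes:
        staged.setdefault(collection, []).append(doc)  # any step may raise
    db.clear()
    db.update(staged)  # all changes become visible together

db = {"reviews": []}
commit(db, [("reviews", {"stars": 5}), ("movies", {"title": "Up"})])
print(len(db["reviews"]), len(db["movies"]))  # 1 1
```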
With MongoDB you are designing an
optimal schema for your use cases - not
a generic schema for every possible use
case.
Considerations for Data Modeling
Schema is defined at the application-level. Its design should focus on the
application, and not the abstract nature of the data.
Referencing vs Embedding
● Referencing: easier manageability
Embedding: single query to retrieve data; single operation to update/delete data.
Referencing: no duplication; smaller documents.
Types of Relationships - MongoDB Examples

● One-to-One
● One-to-Few
● One-to-Many (Embedding)
● One-to-Many (Referencing on "One" side)
● One-to-Squillions (Referencing on "Many" side)
● Many-to-Many (Referencing on both sides)
Things to Avoid ‒ Schema Anti-Patterns
● Massive or Unbounded Arrays
● Bloated Documents
● Unnecessary Indexes
Step-by-step iteration

1. Evaluate the application workload
   Inputs: business domain expertise, current and predicted scenarios, production logs and stats
   Output: a list of operations ranked by importance
2. Map out entities and their relationships
   Output: Relationship Diagram (link or embed?)
3. Finalize the data model for each collection
   CRD: identify and apply relevant design patterns
   Inputs: data size, database queries and indexes, current operations, assumptions, and collection growth projections
   Output: collections with document fields and shapes for each
Quiz Time!
#1. Which of these are relevant for MongoDB Schema Design?
A: App Functionality
B: Go Live Date
C: Relationship between Entities
D: Data Access Patterns
E: Server Location
D: Many-to-Many   E: One-to-Few
A: Atomicity
B: Avoid $lookup
C: Richer documents
D: Large documents
E: Data duplication
A: Average Document Size = 10 MB
B: Small Array Size
C: Normalization of data accessed together
D: 10000 Collections in a Database
E: Unnecessary Indexes
Atlas Search
What is Search?
"With a database query - you know exactly what you want.
When you use search you are open to suggestions." - unknown
db.movies.aggregate( [
  { $search: { text: { query: "Raising Arizona", path: "title" } } }
] )
Title mentions some variation of "Raising," "Arizona," or both.
What is Search?
Search, or "Full-Text Search", is the ability to search across all of your data (JSON docs, CSVs, PDFs, web pages, etc.) and efficiently return a list of results, ranked by how well they match the search term.

Title                                Score
JavaScript: The Definitive Guide     75
Introducing MongoDB                  10
Atlas Search
● Powered by Apache Lucene, which is the world’s leading text search engine.
● Requires the use of the Aggregation Framework, through the $search and
$searchMeta stages.
● Makes search workloads more efficient and scalable via Search Nodes.
● Facets
● Fuzzy Search (typo tolerance)
● Autocompletion
● Geo search
● Filters
● Custom Score
● Highlight
Hands-On Exercise
Run your first Atlas Search query
● Go to Atlas Search on the left panel and Create Search Index
○ Select the sample_mflix database & movies collection.
○ Proceed with the defaults & create the Search Index
○ Allow it to build and reach the Ready status.
● When it’s Ready, click the Query button to run Search Queries.
● Type a search word in the search box and observe the results
○ Try different variations of the same keywords and observe the results.
○ e.g. “Tom Cruise”, “tom cruise”, “tam cruise” etc.
○ Try searching with the same terms using .find(), or .aggregate() with $match
● Click Index Overview on the left panel and observe the index configuration.
Index Configuration
Analyzers

Analyzers apply parsing and language rules to the query.

Tokens are the individual terms that are queried; think of these as the words that matter in a search query.

  "Lions and tigers and bears, oh my!"
  → lowercase: "lions and tigers and bears, oh my!"
  → tokens: lions | tigers | bears | oh | my

Analyzers also support language rules. A search for a particular word can return results for related words.

  "shard" → [ "shard", "shards", "sharded", "sharding", ... ]
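The tokenization step can be sketched minimally in Python; real Lucene analyzers apply far richer rules, and the one-word stopword list here is a stand-in chosen to reproduce the slide's example.

```python
# Minimal analyzer sketch: lowercase, split on non-letters, drop stopwords.
STOPWORDS = {"and"}  # stand-in; real analyzers use per-language stopword lists

def analyze(text):
    lowered = "".join(c.lower() if c.isalpha() else " " for c in text)
    return [t for t in lowered.split() if t not in STOPWORDS]

print(analyze("Lions and tigers and bears, oh my!"))
# ['lions', 'tigers', 'bears', 'oh', 'my']
```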
Lucene Inverted Indexes

1 = { txt: "dark clouds gather storm" }
2 = { txt: "storm clouds gather today" }
3 = { txt: "today clouds are not dark clouds" }

Lucene uses a hash table to get to a compact list of documents for a term. These lists can be stored as multiple segments; deletion and merging/compacting of the lists is done asynchronously.
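Building an inverted index over those three documents can be sketched in Python: each term maps to a sorted postings list of document ids.

```python
# Tiny inverted index: term -> sorted list of document ids (postings list).
docs = {
    1: "dark clouds gather storm",
    2: "storm clouds gather today",
    3: "today clouds are not dark clouds",
}
inverted = {}
for doc_id, text in docs.items():
    for term in set(text.split()):  # set(): count each term once per doc
        inverted.setdefault(term, []).append(doc_id)
postings = {term: sorted(ids) for term, ids in inverted.items()}
print(postings["clouds"])  # [1, 2, 3]
print(postings["storm"])   # [1, 2]
```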
Dynamic Mapping
● If dynamic mapping is set to false, field mappings for the desired fields must be specified.
● Typically, Search indexes with dynamic field mappings can be significantly larger in size and much less performant.
Static Mapping
● To create a Search index with only the desired fields mapped, static field mappings can
be specified.
● You can add other field-level specifications like analyzer & searchAnalyzer. If specified, it
overrides the dynamic field mapping specification.
● For mapping fields within embedded documents, the parent field must also be mapped.
Static Mapping - Try It Yourself!
● Create a fresh Search Index on sample_mflix.movies with a different name.
○ Before creating, click on Refine Your Index button
on the top.
○ Disable Dynamic Mapping.
○ Click on Add Field Mapping button and select Field
Name as title. Click Add.
○ Also add field mapping for cast and [Link]
● Click the ••• menu and select Edit with JSON Editor.
Observe the syntax.
Running a Search Query
● Uses $search stage in .aggregate()
○ It must be the very first stage!
○ Following stages can be as usual.
Compound Search
● To search with multiple conditions, use the compound operator
● Use one or more of these clauses
○ must
■ logical AND
■ Only include results matching this clause
○ mustNot
■ logical AND NOT
■ Negation of must; doesn’t impact score
○ should
■ logical OR
■ Results may match the clause(s)
■ Can specify minimumShouldMatch (default = 0)
○ filter
■ equivalent to must; doesn’t impact score
● Can be nested to multiple levels.
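The clause semantics above can be sketched in Python: must and mustNot gate whether a document matches at all, while should clauses are optional unless minimumShouldMatch requires some of them. The document terms below are invented.

```python
# Sketch of compound-clause evaluation with minimumShouldMatch.
def matches(doc_terms, must=(), must_not=(), should=(), minimum_should_match=0):
    if not all(t in doc_terms for t in must):       # logical AND
        return False
    if any(t in doc_terms for t in must_not):       # logical AND NOT
        return False
    # should: results may match; require at least minimum_should_match hits
    return sum(t in doc_terms for t in should) >= minimum_should_match

doc = {"comedy", "romance", "2004"}
print(matches(doc, must=["comedy"], should=["romance", "noir"],
              minimum_should_match=1))  # True  (1 should-clause hit)
print(matches(doc, must=["comedy"], should=["western", "noir"],
              minimum_should_match=1))  # False (0 should-clause hits)
```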
Compound Search - Example
Note the syntax:
● minimumShouldMatch takes an integer value
● What changes with minimumShouldMatch = 1?
● What changes with minimumShouldMatch = 2?
Autocomplete
● Gives the ability to “search as you type” and to perform partial matches.
● Only works with string fields; use with data type autocomplete.
● To use in a $search query, simply use it similar to the text operator, except
use the autocomplete operator.
Try it out:
Edit the static mapping index, and add the field title as an autocomplete type.
Query >> Edit Query >> Create Query From Template >> autocomplete (Insert)
Tokenization
String: "Quick fox"
Grouping Results using Facets
● Facets are buckets into which we group our search results.
○ Facets are part of the search metadata, not the search results
themselves
● In the query:
○ countries : string
○ genres : stringFacet
○ [Link] : numberFacet
○ released : dateFacet
Facets - Try it out!
Quiz Time!
#1. Which of these are valid Search Index definitions?
A: { mappings: { dynamic: true } }
B: { mappings: { dynamic: false } }
C: { analyzer: "[Link]", mappings: { dynamic: true } }
D: { analyzer: "[Link]" }
E: { searchAnalyzer: "[Link]", mappings: { dynamic: true } }
D: $limit   E: $group
D: 5-15   E: 15+
D: It only works if the field has a string value
E: Tokenization method must be specified as edgeGram or nGram in the search index
● Supported languages:
C, C++, C#, Go, Java, Node.js, PHP, Python, Ruby, Rust, Scala, Swift, TypeScript, Kotlin
MongoDB Drivers
Language   Library / Package
Node.js    mongodb
Java       mongodb-driver-sync
C#         MongoDB.Driver
PHP        mongodb
MongoClient
● Connection between the cluster and the driver is established via the
MongoClient instance.
● Connection Pools
Syntax for Python Driver (PyMongo)

client = MongoClient("mongodb+srv://…")
db = client.sample_mflix
coll = db.movies
coll.find()

Aggregation syntax across drivers:
coll.aggregate(pipeline)                                                      # Python
collection.aggregate(pipeline)                                                # Node.js
collection.aggregate(asList(Aggregates.match(...), Aggregates.group(...)));   # Java
collection.Aggregate().Match(...).Group(...).Project(...);                    # C#
Syntax for PHP Driver
Client $client = new MongoDB\Client('mongodb+srv://…');
$db = $client->sample_mflix;
$coll = $db->movies;
$doc = [ 'title' => 'Interstellar' ];
D: minPoolSize   E: maxPoolSize