MongoDB Atlas Vector Search
Overview
MongoDB Atlas Vector Search is a vector search capability built into MongoDB, the document-oriented NoSQL database. It is offered as part of MongoDB Atlas, the managed cloud service, so vector search runs inside existing MongoDB infrastructure. Combining the flexibility of a document database with vector search makes it easier to build RAG applications and recommendation systems.
Details
MongoDB was first released in 2009 by 10gen (now MongoDB Inc.) and has become one of the most popular NoSQL databases. MongoDB Atlas Vector Search was introduced in 2023 and adds storage and search of vector embeddings. It uses an Apache Lucene-based search engine to provide similarity search.
Key features of MongoDB Atlas Vector Search:
- Integration of document database and vector search
- Approximate Nearest Neighbor (ANN) search using the HNSW algorithm
- Support for high-dimensional vectors (up to 4096 dimensions)
- Similarity calculations using Euclidean distance, cosine similarity, and dot product
- Pre-filtering and post-filtering
- Automatic index management
- Advanced search features through Atlas Search integration
- Vector search across globally distributed clusters
- Multi-tenant support
- Enterprise-grade security
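The three similarity functions named in the list above can be illustrated in a few lines of plain Python. This is only a sketch of the underlying math, not Atlas's internal implementation; the function names and sample vectors are made up for illustration.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores magnitude."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot_product(a, b) / (norm_a * norm_b)


# Hypothetical two-dimensional vectors, chosen so the arithmetic is easy to check
query = [1.0, 0.0]
document = [3.0, 4.0]

print(dot_product(query, document))         # 3.0
print(cosine_similarity(query, document))   # 0.6
print(euclidean_distance(query, document))  # sqrt(20), about 4.47
```

Cosine similarity depends only on direction, so it is a common choice when embeddings are not normalized. Dot product is usually paired with embeddings normalized to unit length.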
Architecture Features
- Apache Lucene-based search engine
- Distributed replica set architecture
- Automatic sharding and load balancing
- Real-time data synchronization
Pros and Cons
Pros
- Integrated solution: Manage structured, unstructured, and vector data on one platform
- Ease of development: Leverage existing MongoDB APIs and toolchain
- Flexible data model: Combination of document flexibility and vector search
- Managed service: No infrastructure management required
- Global scale: Easy deployment across regions worldwide
- Enterprise features: Comprehensive security, auditing, and compliance capabilities
- Rich ecosystem: Integration with many frameworks and tools
Cons
- Cost: Managed-service pricing can make operating costs high
- Vendor lock-in: Atlas-only features make migration difficult
- Performance: May lag behind dedicated vector databases
- Feature limitations: Limited features compared to vector-specific databases
- Learning curve: Requires knowledge of both MongoDB and vector search
Usage Examples
Setup and Index Creation
// MongoDB connection (Node.js)
const { MongoClient } = require('mongodb');
// Replace the placeholders with your Atlas credentials and cluster host
const client = new MongoClient('mongodb+srv://<username>:<password>@<cluster-host>');

// Vector search index definition
const vectorSearchIndex = {
  name: "vector_index",
  type: "vectorSearch",
  definition: {
    fields: [
      {
        type: "vector",
        path: "embedding",
        numDimensions: 768,
        similarity: "cosine"
      },
      {
        type: "filter",
        path: "category"
      },
      {
        type: "filter",
        path: "metadata.year"
      }
    ]
  }
};
// Create index (execute via Atlas UI or API)
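The same index can also be created from a driver instead of the Atlas UI. The Python sketch below builds an equivalent definition and passes it to PyMongo's `create_search_index`. It assumes PyMongo 4.7 or later (for the `type` argument of `SearchIndexModel`) and a live Atlas cluster. The helper names `build_vector_index_definition` and `create_vector_index` are hypothetical.

```python
def build_vector_index_definition(path="embedding", num_dimensions=768,
                                  similarity="cosine", filter_paths=()):
    """Build the `definition` document for a vectorSearch index."""
    if similarity not in ("euclidean", "cosine", "dotProduct"):
        raise ValueError(f"unsupported similarity: {similarity}")
    fields = [{
        "type": "vector",
        "path": path,
        "numDimensions": num_dimensions,
        "similarity": similarity,
    }]
    # Every field used in a $vectorSearch filter must be indexed as "filter"
    fields.extend({"type": "filter", "path": p} for p in filter_paths)
    return {"fields": fields}


def create_vector_index(collection, name="vector_index"):
    """Create the index on an Atlas collection (requires a live cluster)."""
    # Imported lazily so the builder above works without PyMongo installed
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=build_vector_index_definition(
            filter_paths=("category", "metadata.year")
        ),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model)
```

Index builds are asynchronous, so queries issued immediately after creation may return no results until the index is ready.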
Document Insertion
async function insertDocuments() {
  const db = client.db('vectordb');
  const collection = db.collection('documents');
  // Insert documents
  const documents = [
    {
      title: "MongoDB Vector Search",
      content: "How to implement vector search with MongoDB",
      embedding: Array(768).fill(0).map(() => Math.random()),
      category: "database",
      metadata: {
        year: 2024,
        author: "MongoDB Team"
      }
    }
  ];
  await collection.insertMany(documents);
}
Vector Search Execution
async function vectorSearch(queryVector) {
  const db = client.db('vectordb');
  const collection = db.collection('documents');
  // Vector search pipeline
  const pipeline = [
    {
      $vectorSearch: {
        index: "vector_index",
        path: "embedding",
        queryVector: queryVector,
        numCandidates: 100,
        limit: 10,
        filter: {
          category: "database"
        }
      }
    },
    {
      $project: {
        title: 1,
        content: 1,
        score: { $meta: "vectorSearchScore" }
      }
    }
  ];
  const results = await collection.aggregate(pipeline).toArray();
  return results;
}
Python Implementation
from pymongo import MongoClient
import numpy as np
from datetime import datetime

# Connection (replace the placeholders with your Atlas credentials and cluster host)
client = MongoClient("mongodb+srv://<username>:<password>@<cluster-host>")
db = client.vectordb
collection = db.documents

# Insert documents
documents = [
    {
        "title": "MongoDB Atlas Vector Search",
        "content": "Enabling large-scale vector search",
        "embedding": np.random.rand(768).tolist(),
        "category": "database",
        "metadata": {
            "year": 2024,
            "tags": ["nosql", "vector", "search"]
        },
        "created_at": datetime.now()
    }
]
collection.insert_many(documents)

# Vector search
def vector_search(query_vector, filters=None):
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": 200,
                "limit": 10
            }
        }
    ]
    # Add filters
    if filters:
        pipeline[0]["$vectorSearch"]["filter"] = filters
    # Projection
    pipeline.append({
        "$project": {
            "title": 1,
            "content": 1,
            "category": 1,
            "score": {"$meta": "vectorSearchScore"}
        }
    })
    results = list(collection.aggregate(pipeline))
    return results

# Execute search
query_embedding = np.random.rand(768).tolist()
results = vector_search(
    query_embedding,
    filters={"category": "database", "metadata.year": {"$gte": 2023}}
)
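Pre-filtering and post-filtering, both listed among the key features, differ in where the condition is applied. A pre-filter goes inside `$vectorSearch` and narrows the candidates before the nearest-neighbor search. A post-filter is an ordinary `$match` stage that runs on the results afterwards. The sketch below only builds the pipeline. The helper name `build_search_pipeline` and the score threshold in the usage note are illustrative assumptions.

```python
def build_search_pipeline(query_vector, limit=10, num_candidates=200,
                          pre_filter=None, post_filter=None):
    """Build a $vectorSearch pipeline with optional pre- and post-filters."""
    # numCandidates may not be smaller than limit
    if num_candidates < limit:
        raise ValueError("num_candidates must be at least as large as limit")

    stage = {
        "index": "vector_index",
        "path": "embedding",
        "queryVector": query_vector,
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if pre_filter:
        # Applied before the search; the fields must be indexed as "filter"
        stage["filter"] = pre_filter

    pipeline = [
        {"$vectorSearch": stage},
        {"$project": {
            "title": 1,
            "content": 1,
            "score": {"$meta": "vectorSearchScore"},
        }},
    ]
    if post_filter:
        # Applied after the search; may leave fewer than `limit` results
        pipeline.append({"$match": post_filter})
    return pipeline
```

For example, `build_search_pipeline(vec, pre_filter={"category": "database"}, post_filter={"score": {"$gte": 0.8}})` restricts the search to one category and then drops weak matches. Because the post-filter runs after `limit` has been applied, it can return fewer than `limit` documents.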
Hybrid Search
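# Combined text and vector search
# A $match with $text must be the first stage of a pipeline, so it cannot
# follow $vectorSearch. Run the two queries separately and fuse the scores
# client-side. Assumes a text index exists on the collection, for example
# collection.create_index([("content", "text")]).
def hybrid_search(text_query, query_vector, vector_weight=0.7, text_weight=0.3):
    vector_hits = collection.aggregate([
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": 100,
                "limit": 20
            }
        },
        {
            "$project": {
                "title": 1,
                "content": 1,
                "vectorScore": {"$meta": "vectorSearchScore"}
            }
        }
    ])
    text_hits = collection.aggregate([
        {"$match": {"$text": {"$search": text_query}}},
        {"$addFields": {"textScore": {"$meta": "textScore"}}},
        {"$sort": {"textScore": -1}},
        {"$limit": 20},
        {"$project": {"title": 1, "content": 1, "textScore": 1}}
    ])
    # Merge by _id; a document missing from one result list scores 0 there
    merged = {}
    for doc in vector_hits:
        merged[doc["_id"]] = {**doc, "textScore": 0.0}
    for doc in text_hits:
        entry = merged.setdefault(doc["_id"], {**doc, "vectorScore": 0.0})
        entry["textScore"] = doc["textScore"]
    # Note: the two scores are on different scales, so tune the weights
    for doc in merged.values():
        doc["combinedScore"] = (
            vector_weight * doc["vectorScore"] + text_weight * doc["textScore"]
        )
    ranked = sorted(merged.values(), key=lambda d: d["combinedScore"], reverse=True)
    return ranked[:10]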
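A weighted sum mixes vector and text scores that are on different scales. Reciprocal rank fusion (RRF) is a commonly used alternative that combines result lists by rank position instead of raw score. The sketch below swaps in RRF for the weighted sum and is not taken from the original example. The function name, the constant `k=60`, and the sample document ids are assumptions chosen for illustration.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=10):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Higher totals rank first.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_n]


# Hypothetical ids, ordered best-first by each search
vector_ids = ["a", "b", "c"]
text_ids = ["b", "d", "a"]

for doc_id, score in reciprocal_rank_fusion([vector_ids, text_ids]):
    print(doc_id, round(score, 5))
```

Here "b" ranks first because it sits near the top of both lists, while "c" and "d" each appear in only one list. Since only ranks are used, no score normalization is needed.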
Batch Processing and Optimization
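# Batch embedding processing
def batch_embed_and_insert(texts, embeddings):
    # One batch_id shared by every document in this call
    batch_id = datetime.now().isoformat()
    documents = []
    for i, (text, embedding) in enumerate(zip(texts, embeddings)):
        doc = {
            "content": text,
            "embedding": embedding.tolist(),  # expects NumPy arrays
            "metadata": {
                "batch_id": batch_id,
                "index": i
            }
        }
        documents.append(doc)
    # Bulk insert; ordered=False keeps inserting if one document fails
    if documents:
        collection.insert_many(documents, ordered=False)

# Get collection and index statistics
# Note: collStats reports regular indexes only; Atlas Vector Search indexes
# are listed with collection.list_search_indexes() (PyMongo 4.5+).
def get_index_stats():
    stats = db.command("collStats", "documents", indexDetails=True)
    return stats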
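For large collections, a single `insert_many` call over all documents can hold a lot of data in memory at once. One option is to split the input into fixed-size chunks, as sketched below. The helper names `chunked` and `insert_in_batches` and the default batch size of 1000 are assumptions, not values from the original.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_batches(collection, documents, batch_size=1000):
    """Insert documents chunk by chunk and return how many were inserted.

    `collection` is expected to be a PyMongo collection (or any object
    with a compatible insert_many method).
    """
    inserted = 0
    for batch in chunked(documents, batch_size):
        result = collection.insert_many(batch, ordered=False)
        inserted += len(result.inserted_ids)
    return inserted
```

Because `chunked` accepts any iterable, documents can be produced by a generator and never need to be held in memory all at once.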