lancedb/lancedb
Developer-friendly, embedded retrieval engine for multimodal AI. Search More; Manage Less.
What is LanceDB
LanceDB is a developer-friendly, multimodal retrieval engine built on the philosophy of "Search More; Manage Less." It is an open-source vector database for AI applications that stores, manages, queries, and retrieves embeddings over large-scale multimodal data. It ships both as an embedded library and as a serverless cloud service, so it adapts to a range of scales and architectures.
Key Features
Flexible Deployment Options
- Embedded Mode (OSS): Runs in-process, providing low-latency access and simplified self-hosting
- Serverless Cloud: SaaS solution with compute-storage separation (up to 100x cost savings)
- No Servers, No Hassle: Minimizes operational overhead
Multimodal Data Support
- Actual Data Storage: Stores images, videos, text, audio files alongside embeddings and metadata
- Lance Format: Provides automatic data versioning and blazing fast retrieval and filtering
- Unified Management: Simplifies complex data pipelines by managing multiple data types together
High-Performance Architecture
- Rust Core: Built on Lance columnar data format, optimized for performant ML workloads and fast random access
- Zero-Copy Access: True zero-copy access in shared memory via Apache Arrow integration
- SIMD & GPU Acceleration: Leverages latest hardware capabilities for speedup
Comprehensive Search Capabilities
- Vector Similarity Search: Search billions of vectors in milliseconds using state-of-the-art indexing
- Full-Text Search: Keyword search with BM25 ranking over indexed text columns
- Hybrid Search: Combines vector and text search
- SQL Queries: SQL interface via DataFusion
Latest Features (2025)
- LanceDB Cloud: Serverless cloud service launched June 1st, 2025
- Automatic Data Versioning: Manages data versions without additional infrastructure
- GPU Support: Build vector indices on GPU (Python SDK)
Pros and Cons
Pros
- Designed specifically for performant ML workloads
- Unified management of multimodal data
- Flexible deployment from embedded to cloud
- Tight integration with Apache Arrow ecosystem
- Random access up to 1000x faster than Parquet (project benchmark claim)
- Automatic data versioning capabilities
Cons
- Relatively new project with limited production experience
- Learning resources more limited compared to other vector DBs
- Enterprise features may require paid version
Installation
Python
pip install lancedb
JavaScript/TypeScript
npm install @lancedb/lancedb
Rust
[dependencies]
lancedb = "0.11.0"
Code Examples
Basic Usage (Python)
import lancedb

# Connect to (or create) a database at the given path
uri = "./sample-lancedb"
db = lancedb.connect(uri)

# Prepare data
data = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
]
# Create table and insert data
table = db.create_table("my_table", data=data)
# Vector search
result = table.search([100, 100]).limit(2).to_pandas()
print(result)
Multimodal Data Example
import lancedb

# Connect to database
db = lancedb.connect("./multimodal-db")

# Create table with image metadata and embeddings
data = [
    {
        "image_url": "https://example.com/image1.jpg",
        "description": "Beautiful landscape photo",
        "vector": [0.1, 0.2, 0.3, 0.4],  # embedding vector
    },
    {
        "image_url": "https://example.com/image2.jpg",
        "description": "City nightscape",
        "vector": [0.5, 0.6, 0.7, 0.8],
    },
]
table = db.create_table("images", data=data)
# Similar image search
query_vector = [0.15, 0.25, 0.35, 0.45]
results = table.search(query_vector).limit(5).to_pandas()
print(results)
Full-Text and Hybrid Search
import lancedb

db = lancedb.connect("./hybrid-search-db")

# Create table with text and vector columns
data = [
    {
        "text": "Machine Learning and AI Basics",
        "content": "This article explains the basic concepts of machine learning",
        "vector": [1.0, 2.0, 3.0],
    },
    {
        "text": "Deep Learning Applications",
        "content": "How to apply deep learning to real-world problems",
        "vector": [4.0, 5.0, 6.0],
    },
]
table = db.create_table("articles", data=data)
table.create_fts_index("content")  # build a full-text (BM25) index

# Full-text search over the indexed column
fts_results = table.search("machine learning", query_type="fts").limit(10).to_pandas()
print(fts_results)

# Vector search combined with a SQL filter
# (DataFusion SQL syntax; MATCH is not a supported predicate)
results = (
    table.search([1.5, 2.5, 3.5])
    .where("content LIKE '%machine learning%'")
    .limit(10)
    .to_pandas()
)
print(results)

# True hybrid search (query_type="hybrid") additionally requires a
# registered embedding function so the text query can be vectorized.
JavaScript/TypeScript Example
import * as lancedb from '@lancedb/lancedb';

async function main() {
  // Connect to database
  const uri = './sample-lancedb';
  const db = await lancedb.connect(uri);

  // Prepare data
  const data = [
    { vector: [3.1, 4.1], item: 'foo', price: 10.0 },
    { vector: [5.9, 26.5], item: 'bar', price: 20.0 }
  ];

  // Create table
  const table = await db.createTable('my_table', data);

  // Perform search
  const results = await table.search([100, 100]).limit(2).toArray();
  console.log(results);
}

main().catch(console.error);
Integrations and Ecosystem
Apache Arrow Ecosystem
- Pandas: Direct DataFrame integration
- Polars: High-performance data frame operations
- DuckDB: Fast analytical queries
- Pydantic: Data validation and serialization
AI Framework Integrations
- LangChain: RAG application building
- LlamaIndex: Document search and QA systems
- Haystack: NLP pipelines
Cloud Storage Support
- AWS S3
- Google Cloud Storage
- Azure Blob Storage
- Local file systems
Summary
LanceDB is a next-generation vector database designed for multimodal AI applications. With high performance through the Lance columnar format, unified multimodal data management, and flexible deployment options, it addresses the complex requirements of modern AI development. Its flexibility to adapt from embedded to serverless cloud deployments across various scales and architectures is its greatest strength.