GitHub Overview

lancedb/lancedb

Developer-friendly, embedded retrieval engine for multimodal AI. Search More; Manage Less.

Stars: 7,182
Watchers: 36
Forks: 559
Created: February 28, 2023
Language: Python
License: Apache License 2.0

Topics

approximate-nearest-neighbor-search, image-search, nearest-neighbor-search, recommender-system, search-engine, semantic-search, similarity-search, vector-database

Star History

Star history chart for lancedb/lancedb (data as of July 30, 2025, 03:04 AM).

What is LanceDB

LanceDB is a developer-friendly, multimodal retrieval engine built on the philosophy of "Search More; Manage Less." It's an open-source vector database designed for AI applications, providing storage, management, querying, and retrieval of embeddings on large-scale multimodal data. Offering both embedded and serverless versions, it adapts to various scales and architectures.

Key Features

Flexible Deployment Options

  • Embedded Mode (OSS): Runs in-process, providing low-latency access and simplified self-hosting
  • Serverless Cloud: SaaS solution with compute-storage separation (up to 100x cost savings)
  • No Servers, No Hassle: Minimizes operational overhead

Multimodal Data Support

  • Actual Data Storage: Stores images, videos, text, audio files alongside embeddings and metadata
  • Lance Format: Provides automatic data versioning and blazing fast retrieval and filtering
  • Unified Management: Simplifies complex data pipelines by managing multiple data types together

High-Performance Architecture

  • Rust Core: Built on Lance columnar data format, optimized for performant ML workloads and fast random access
  • Zero-Copy Access: True zero-copy access in shared memory via Apache Arrow integration
  • SIMD & GPU Acceleration: Leverages latest hardware capabilities for speedup

Comprehensive Search Capabilities

  • Vector Similarity Search: Search billions of vectors in milliseconds using state-of-the-art indexing
  • Full-Text Search: Text-based search capabilities
  • Hybrid Search: Combines vector and text search
  • SQL Queries: SQL interface via DataFusion

Latest Features (2025)

  • LanceDB Cloud: Serverless cloud service launched June 1st, 2025
  • Automatic Data Versioning: Manages data versions without additional infrastructure
  • GPU Acceleration: Support for building vector indices on GPUs (Python SDK)

Pros and Cons

Pros

  • Designed specifically for performant ML workloads
  • Unified management of multimodal data
  • Flexible deployment from embedded to cloud
  • Tight integration with Apache Arrow ecosystem
  • Dramatically faster random access than Parquet (the Lance project reports up to 100x)
  • Automatic data versioning capabilities

Cons

  • Relatively new project with limited production experience
  • Learning resources more limited compared to other vector DBs
  • Enterprise features may require paid version

Installation

Python

pip install lancedb

JavaScript/TypeScript

npm install @lancedb/lancedb

Rust

[dependencies]
lancedb = "0.11.0"

Code Examples

Basic Usage (Python)

import lancedb
import numpy as np

# Connect to database
uri = "./sample-lancedb"
db = lancedb.connect(uri)

# Prepare data
data = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}
]

# Create table and insert data
table = db.create_table("my_table", data=data)

# Vector search
result = table.search([100, 100]).limit(2).to_pandas()
print(result)

Multimodal Data Example

import lancedb

# Connect to database
db = lancedb.connect("./multimodal-db")

# Create table with image data
data = [
    {
        "image_url": "https://example.com/image1.jpg",
        "description": "Beautiful landscape photo",
        "vector": [0.1, 0.2, 0.3, 0.4]  # embedding vector
    },
    {
        "image_url": "https://example.com/image2.jpg", 
        "description": "City nightscape",
        "vector": [0.5, 0.6, 0.7, 0.8]
    }
]

table = db.create_table("images", data=data)

# Similar image search
query_vector = [0.15, 0.25, 0.35, 0.45]
results = table.search(query_vector).limit(5).to_pandas()
print(results)

Full-Text and Hybrid Search

import lancedb

db = lancedb.connect("./hybrid-search-db")

# Create table with full-text index
data = [
    {
        "text": "Machine Learning and AI Basics",
        "content": "This article explains the basic concepts of machine learning",
        "vector": [1.0, 2.0, 3.0]
    },
    {
        "text": "Deep Learning Applications", 
        "content": "How to apply deep learning to real-world problems",
        "vector": [4.0, 5.0, 6.0]
    }
]

table = db.create_table("articles", data=data)
table.create_fts_index("content")  # Create full-text index

# Full-text search over the indexed column
fts_results = table.search("machine learning", query_type="fts").limit(10).to_pandas()
print(fts_results)

# Hybrid search (vector + full-text) requires an embedding function registered
# on the table; with one configured, a text query can be run as:
# results = table.search("machine learning", query_type="hybrid").limit(10).to_pandas()

JavaScript/TypeScript Example

import * as lancedb from '@lancedb/lancedb';

async function main() {
  // Connect to database
  const uri = './sample-lancedb';
  const db = await lancedb.connect(uri);
  
  // Prepare data
  const data = [
    { vector: [3.1, 4.1], item: 'foo', price: 10.0 },
    { vector: [5.9, 26.5], item: 'bar', price: 20.0 }
  ];
  
  // Create table
  const table = await db.createTable('my_table', data);
  
  // Perform search
  const results = await table.search([100, 100]).limit(2).toArray();
  console.log(results);
}

main().catch(console.error);

Integrations and Ecosystem

Apache Arrow Ecosystem

  • Pandas: Direct DataFrame integration
  • Polars: High-performance data frame operations
  • DuckDB: Fast analytical queries
  • Pydantic: Data validation and serialization

AI Framework Integrations

  • LangChain: RAG application building
  • LlamaIndex: Document search and QA systems
  • Haystack: NLP pipelines

Cloud Storage Support

  • AWS S3
  • Google Cloud Storage
  • Azure Blob Storage
  • Local file systems

Summary

LanceDB is a next-generation vector database designed for multimodal AI applications. With high performance through the Lance columnar format, unified multimodal data management, and flexible deployment options, it addresses the complex requirements of modern AI development. Its flexibility to adapt from embedded to serverless cloud deployments across various scales and architectures is its greatest strength.