GitHub Overview

chroma-core/chroma

the AI-native open-source embedding database

Stars: 21,359
Watchers: 114
Forks: 1,693
Created: October 5, 2022
Language: Rust
License: Apache License 2.0

Topics

document-retrieval, embeddings, llms

Data as of: 7/30/2025, 12:10 AM

Database

Chroma

Overview

Chroma is an AI-native, open-source embedding database designed to simplify the development of LLM applications. It is a vector database with a simple, developer-friendly API that supports a quick move from prototype to production. It is well suited to RAG (Retrieval-Augmented Generation) development: by making knowledge, facts, and skills pluggable for LLMs, it grounds model answers in retrieved data and helps reduce hallucinations.
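
To make the "pluggable knowledge" idea concrete, here is a minimal sketch of how retrieved documents might be placed into an LLM prompt. It is plain Python and does not call Chroma; the function name `build_rag_prompt` and the prompt wording are assumptions made for illustration, not part of any library.

```python
def build_rag_prompt(question, retrieved_documents):
    """Assemble a prompt that asks the model to answer from the given context.

    Hypothetical helper: in a real application, `retrieved_documents` would be
    the documents returned by a vector search for the question.
    """
    # Number each retrieved document so the model can refer to it
    context = "\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(retrieved_documents, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_rag_prompt(
    "What is Chroma?",
    [
        "Chroma is an AI-native embedding database",
        "Perfect for building RAG systems",
    ],
)
print(prompt)
```

Restricting the model to numbered context is one common design choice; other prompt layouts work equally well.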

Details

Chroma is designed so that the same API that runs in your Python notebook scales to your cluster. In its in-memory mode it keeps data in memory, which avoids disk latency and suits prototyping and testing; the local-disk, server, and cloud modes persist data. Because the API is the same in development, testing, and production, moving from local development to cloud deployment requires few code changes.

Key features of Chroma:

  • Batteries Included: Everything needed for storage, embedding, and querying
  • Simple API: Intuitive and easy-to-learn interface
  • Multi-modal Support: Supports various data types including text, images, and audio
  • Filtering Capabilities: Advanced filtering with metadata
  • Full-text Search: Integration of vector search and full-text search
  • Document Storage: Manages original documents alongside embeddings
  • Multiple Language Support: Python and JavaScript/TypeScript clients
  • Flexible Deployment: In-memory, local disk, server, and cloud options
  • Open Source: Provides transparency and customizability
  • Active Community: Continuous improvements and rich ecosystem
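
The storage, metadata filtering, and querying features listed above can be pictured with a toy sketch. This is plain Python written for illustration only and is not how Chroma is implemented: the record layout, the `query` function, and the hand-made 2-D vectors are all assumptions, and real embeddings would come from an embedding model.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def query(records, query_embedding, n_results, where=None):
    """Rank records by distance to the query, keeping only metadata matches."""
    where = where or {}
    # Keep records whose metadata equals every key/value pair in `where`
    candidates = [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_distance(r["embedding"], query_embedding),
    )
    return [(r["id"], r["document"]) for r in ranked[:n_results]]


# Hypothetical records: each keeps the original document next to its embedding
records = [
    {"id": "doc1", "document": "Vector databases",
     "metadata": {"source": "docs"}, "embedding": [1.0, 0.0]},
    {"id": "doc2", "document": "Cooking pasta",
     "metadata": {"source": "blog"}, "embedding": [0.0, 1.0]},
    {"id": "doc3", "document": "Embedding search",
     "metadata": {"source": "docs"}, "embedding": [0.9, 0.1]},
]

hits = query(records, query_embedding=[1.0, 0.0], n_results=2,
             where={"source": "docs"})
print(hits)
```

The sketch scans every record, which is fine for a handful of items; a vector database uses an index so that it does not have to compare the query against every stored vector.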

Architecture

Chroma adopts a service-oriented architecture:

  1. Frontend Services: Handle API requests, authentication, authorization, and routing
  2. Query Service: Process vector similarity searches and filtering operations
  3. Compaction Service: Manage data organization and storage optimization
  4. System Database (SysDB): Maintain metadata about collections, segments, tenants, and databases
  5. Storage Layer: Persist vector data, metadata, and documents

Advantages and Disadvantages

Advantages

  • Developer-Friendly: Simple API with a gentle learning curve
  • Fast Prototyping: Start immediately with pip install
  • Cost-Effective: Free and open source; experiment without cloud costs
  • Flexible Deployment: Supports various environments from local to cloud
  • Easy Integration: Works with LangChain, LlamaIndex, Hugging Face
  • Efficient Similarity Search: Fast searches even on large datasets
  • Customizable: Open-source with high transparency, extensible as needed

Disadvantages

  • Scalability Limitations: Upper limit of about one million vector points
  • Limited Enterprise Features: No RBAC support
  • Data Management Considerations: Requires unique document IDs, matching embedding dimensions
  • Operational Complexity: Manual management needed for large-scale environments
  • No Full ACID Support: Limited transaction capabilities
  • Limited Managed Services: Primarily self-hosted
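
The data management constraints above (unique IDs, matching embedding dimensions) can be checked before data is sent to the database. The following is a minimal pre-flight sketch in plain Python; `validate_batch` and its parameters are hypothetical names for illustration and are not part of the Chroma API.

```python
from collections import Counter


def validate_batch(ids, embeddings, expected_dim=None):
    """Return a list of problems found in a batch; an empty list means it looks OK."""
    problems = []

    # Every embedding needs exactly one ID
    if len(ids) != len(embeddings):
        problems.append(f"{len(ids)} ids but {len(embeddings)} embeddings")

    # IDs must be unique within the batch
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    if duplicates:
        problems.append(f"duplicate ids: {duplicates}")

    # All vectors must share one dimension (the first vector's, unless given)
    if expected_dim is not None:
        target = expected_dim
    else:
        target = len(embeddings[0]) if embeddings else None
    bad = [i for i, e in zip(ids, embeddings) if len(e) != target]
    if bad:
        problems.append(f"embeddings with dimension != {target}: {bad}")

    return problems


print(validate_batch(["a", "b"], [[0.1, 0.2], [0.3, 0.4]]))
print(validate_batch(["a", "a"], [[0.1, 0.2], [0.3]]))
```

This only checks a single batch; it does not detect IDs that already exist in a collection.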

Code Examples

Installation and Setup

# Install Python SDK
pip install chromadb

# Install JavaScript/TypeScript SDK
npm install chromadb
# or
yarn add chromadb

# Run with Docker
docker pull chromadb/chroma
docker run -p 8000:8000 chromadb/chroma

# Start development server
chroma run --path ./chroma_data

Basic Operations (Python)

import chromadb

# Initialize a client. Pick ONE of the options below; each assignment
# would replace the previous one, so the alternatives are commented out.

# In-memory (for development/testing)
client = chromadb.EphemeralClient()

# Persistent local disk
# client = chromadb.PersistentClient(path="./chroma_db")

# HTTP client (requires a running Chroma server)
# client = chromadb.HttpClient(host="localhost", port=8000)

# Async HTTP client (also requires a running server)
import asyncio

async def main():
    async_client = await chromadb.AsyncHttpClient()
    collection = await async_client.create_collection(name="async_collection")

# asyncio.run(main())  # uncomment when a server is running

# Create collection
collection = client.create_collection(
    name="my_collection",
    metadata={"description": "My first collection"}
)

# Get or create existing collection
collection = client.get_or_create_collection(name="my_collection")

# Add documents
collection.add(
    documents=[
        "Chroma is an AI-native embedding database",
        "It makes LLM app development easy",
        "Perfect for building RAG systems"
    ],
    metadatas=[
        {"source": "docs", "category": "database"},
        {"source": "blog", "category": "ai"},
        {"source": "tutorial", "category": "rag"}
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Execute query
results = collection.query(
    query_texts=["Tell me about vector databases"],
    n_results=2,
    where={"category": "database"}
)

print(results)
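
Query results come back grouped per query text: each field such as "ids" or "documents" holds one inner list per query. The sketch below flattens that shape into one row per hit. `flatten_results` is a hypothetical helper, and the `sample` dictionary, including its distance value, is hand-written to mimic the result shape rather than real output.

```python
def flatten_results(results, query_texts):
    """Turn per-query nested lists into a flat list of row dictionaries."""
    rows = []
    for q_index, query_text in enumerate(query_texts):
        # results["ids"][q_index] holds the hits for this query, best first
        for rank, doc_id in enumerate(results["ids"][q_index]):
            rows.append({
                "query": query_text,
                "id": doc_id,
                "document": results["documents"][q_index][rank],
                "distance": results["distances"][q_index][rank],
            })
    return rows


# Illustrative data only: shaped like a query result, values are made up
sample = {
    "ids": [["doc1"]],
    "documents": [["Chroma is an AI-native embedding database"]],
    "distances": [[0.42]],
}

rows = flatten_results(sample, ["Tell me about vector databases"])
for row in rows:
    print(row["id"], row["distance"], row["document"])
```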

Using Embedding Functions

from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# OpenAI embeddings
openai_ef = OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)

# Sentence Transformers (local)
sentence_ef = SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create collection with custom embedding function
collection = client.create_collection(
    name="custom_embeddings",
    embedding_function=openai_ef
)

# Automatically generate embeddings from text
collection.add(
    documents=["How to use Chroma", "Implementing vector search"],
    ids=["id1", "id2"]
)

Advanced Search Features

# Metadata filtering
results = collection.query(
    query_texts=["about machine learning"],
    n_results=5,
    where={
        "$and": [
            {"source": "docs"},
            {"year": {"$gte": 2023}}
        ]
    }
)

# Multiple query texts (batch processing)
results = collection.query(
    query_texts=[
        "What is a vector database",
        "How to implement RAG",
        "LLM use cases"
    ],
    n_results=3
)

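# Use embedding vectors directly
# The vector length must match the collection's embedding dimension;
# 384 and the random values below are placeholders only
import numpy as np
query_embeddings = [np.random.rand(384).tolist()]
results = collection.query(
    query_embeddings=query_embeddings,
    n_results=5
)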

# Specify fields to include in results
results = collection.query(
    query_texts=["ChromaDB"],
    n_results=10,
    include=["metadatas", "documents", "distances"]
)

JavaScript/TypeScript Usage

import { ChromaClient } from 'chromadb';

// Initialize client
const client = new ChromaClient({
    path: "http://localhost:8000"
});

// Create collection
const collection = await client.createCollection({
    name: "js_collection",
    metadata: { description: "JavaScript collection" }
});

// Add documents
await collection.add({
    ids: ["id1", "id2"],
    documents: [
        "Chroma works with JavaScript too",
        "Full TypeScript support included"
    ],
    metadatas: [
        { lang: "en", type: "intro" },
        { lang: "en", type: "feature" }
    ]
});

// Execute query
const results = await collection.query({
    queryTexts: ["How to use with JavaScript"],
    nResults: 2,
    where: { type: "intro" }
});

console.log(results);

RAG Application Implementation

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Prepare documents
documents = [...] # Your documents
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(documents)

# Create Chroma vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Build RAG chain
llm = ChatOpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Question answering (invoke replaces the deprecated direct call on the chain)
query = "What are the main features of Chroma?"
result = qa_chain.invoke({"query": query})
print(result["result"])

Data Management

# Update data (upsert)
collection.upsert(
    ids=["doc1", "doc2"],
    documents=[
        "Updated document 1",
        "Updated document 2"
    ],
    metadatas=[
        {"version": "2.0"},
        {"version": "2.0"}
    ]
)

# Retrieve data
results = collection.get(
    ids=["doc1", "doc2"],
    include=["documents", "metadatas", "embeddings"]
)

# Delete data
# When both ids and where are given, only records matching both are deleted
collection.delete(
    ids=["doc3"],
    where={"category": "outdated"}
)

# Get collection information
print(f"Count: {collection.count()}")
print(f"Collection name: {collection.name}")

# List collections
collections = client.list_collections()
for col in collections:
    print(col.name)

# Delete collection
client.delete_collection(name="old_collection")

Multi-modal Search Example

import numpy as np
from PIL import Image
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction

# OpenCLIP embedding function (supports images and text)
# Requires the open-clip-torch package
embedding_function = OpenCLIPEmbeddingFunction()

# Create multi-modal collection
collection = client.create_collection(
    name="multimodal",
    embedding_function=embedding_function
)

# Add an image. Images are passed as NumPy arrays, not file paths.
image = np.array(Image.open("path/to/image.jpg"))
collection.add(
    ids=["img1"],
    images=[image],
    metadatas=[{"type": "image", "location": "beach"}]
)

# Add text in a separate call so each list has one entry per ID
collection.add(
    ids=["txt1"],
    documents=["Beautiful sunset photo"],
    metadatas=[{"type": "text", "topic": "landscape"}]
)

# Search images with text
results = collection.query(
    query_texts=["sunset"],
    n_results=5
)

# Search similar images with image
query_image = np.array(Image.open("path/to/query_image.jpg"))
results = collection.query(
    query_images=[query_image],
    n_results=5
)

Performance Optimization

# Speed up with batch processing
batch_size = 100
all_documents = [...]  # Large number of documents

for i in range(0, len(all_documents), batch_size):
    batch = all_documents[i:i+batch_size]
    collection.add(
        documents=[doc["text"] for doc in batch],
        ids=[doc["id"] for doc in batch],
        metadatas=[doc["metadata"] for doc in batch]
    )

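# Configure the persistent client
# These settings control telemetry and reset behavior; they do not tune speed
client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=chromadb.Settings(
        anonymized_telemetry=False,  # disable anonymous usage telemetry
        allow_reset=True,            # permit client.reset(); avoid in production
        is_persistent=True           # already implied by PersistentClient
    )
)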

# Memory-efficient queries
# Retrieve only necessary fields
results = collection.query(
    query_texts=["search query"],
    n_results=100,
    include=["documents", "distances"]  # Exclude embeddings
)