CockroachDB + pgvector
Overview
CockroachDB is a PostgreSQL-compatible distributed SQL database designed for globally distributed applications. Its pgvector-compatible vector support adds vector search, enabling storage and retrieval of embeddings in a distributed environment while preserving strong consistency and high availability for AI workloads.
Details
CockroachDB has been developed by Cockroach Labs since 2015, with a design inspired by Google Spanner. Because it speaks the PostgreSQL wire protocol and implements pgvector-compatible syntax, existing PostgreSQL drivers and vector tooling largely work unchanged. Distributed transactions, automatic replication, and geographic distribution make it possible to build vector search applications at global scale.
Key features of CockroachDB + pgvector:
- Vector search in distributed environments
- pgvector support through PostgreSQL compatibility
- Strong consistency (ACID compliant)
- Automatic replication and sharding
- Geographically distributed clusters
- Zero-downtime scaling
- Automatic failure recovery
- Multi-region support
- Familiar SQL interface (see the connectivity sketch after this list)
- Enterprise-grade security
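Because the interface is plain SQL over the PostgreSQL wire protocol, any PostgreSQL driver can reach a cluster. A minimal connectivity check with psycopg2 (the connection string is an assumption; adjust host, credentials, and database to your cluster):
import psycopg2
# Connect to a CockroachDB node (26257 is the default SQL port)
conn = psycopg2.connect(
    "postgresql://username:password@localhost:26257/vectordb?sslmode=require"
)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])  # reports the CockroachDB build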
Architecture Features
- Raft consensus algorithm
- Distributed transactions (2PC)
- Range-based data distribution (see the SHOW RANGES sketch after this list)
- Multi-version concurrency control (MVCC)
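Range-based distribution can be observed directly: SHOW RANGES reports how a table's key space is split and where each range lives. A sketch reusing the conn from the previous example, assuming a documents table already exists:
with conn.cursor() as cur:
    # Each row describes one range: its key span, replicas, and leaseholder
    # (the exact column set varies by CockroachDB version)
    cur.execute("SHOW RANGES FROM TABLE documents")
    for row in cur.fetchall():
        print(row)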
Pros and Cons
Pros
- Geographic distribution: Globally distributed data and vector search
- Strong consistency: Full ACID transaction support
- High availability: Automatic failover and zero downtime
- PostgreSQL compatible: Leverage existing tools and libraries
- Automatic management: Automated sharding and replication
- Horizontal scaling: Easy scale-out by adding nodes
Cons
- Latency: Raft consensus adds write latency across replicas, and cross-region reads pay network round trips (see the follower-reads sketch after this list)
- Complexity: Managing and optimizing distributed systems
- Resource requirements: Minimum 3-node configuration
- pgvector limitations: compatibility-level support rather than the upstream extension, so some index types and parameters may be unavailable
- Cost: Licensing fees for enterprise features
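One standard way to trade read freshness for latency is follower reads, which serve slightly stale data from the nearest replica instead of routing every read to the leaseholder (historically an enterprise feature). A sketch, reusing conn from above:
with conn.cursor() as cur:
    # follower_read_timestamp() picks a timestamp old enough that
    # any replica, not just the leaseholder, can serve the read
    cur.execute("""
        SELECT count(*) FROM documents
        AS OF SYSTEM TIME follower_read_timestamp()
    """)
    print(cur.fetchone()[0])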
Usage Examples
Setup and Table Creation
-- Enable pgvector-style vector support
-- (recent CockroachDB versions build the vector type in; CREATE EXTENSION
--  is kept here for PostgreSQL compatibility)
CREATE EXTENSION IF NOT EXISTS vector;
-- Create vector data table
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
title TEXT NOT NULL,
content TEXT,
embedding vector(768),
category TEXT,
metadata JSONB,
created_at TIMESTAMP DEFAULT now()
);
-- Create vector index
-- (ivfflat follows pgvector syntax; supported vector index types and
--  parameters depend on the CockroachDB version)
CREATE INDEX documents_embedding_idx ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
-- Geographic partitioning (multi-region)
-- Prerequisites: the database needs a primary region (plus any added regions),
-- and REGIONAL BY ROW AS "region" needs a matching region column.
-- Region names must match your cluster's locality flags.
ALTER DATABASE vectordb SET PRIMARY REGION "us-east1";
ALTER DATABASE vectordb ADD REGION "europe-west1";
ALTER TABLE documents
    ADD COLUMN region crdb_internal_region NOT NULL
    DEFAULT default_to_database_primary_region(gateway_region());
ALTER TABLE documents
    SET LOCALITY REGIONAL BY ROW AS "region";
Basic Operations with Python
import numpy as np
import psycopg2
from psycopg2.extras import Json, RealDictCursor, execute_values
from typing import Dict, List, Optional
# Connection
conn = psycopg2.connect(
"postgresql://username:password@localhost:26257/vectordb?sslmode=require"
)
# Insert document
def insert_document(title: str, content: str, embedding: np.ndarray) -> str:
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO documents (title, content, embedding, category, metadata)
            VALUES (%s, %s, %s, %s, %s)
            RETURNING id
        """, (
            title,
            content,
            embedding.tolist(),  # pgvector casts a float array to vector
            'technology',
            Json({'source': 'manual', 'version': 1}),  # dicts need Json() for JSONB
        ))
        doc_id = cur.fetchone()[0]
    conn.commit()
    return doc_id
# Vector search
def vector_search(query_vector: np.ndarray, limit: int = 10) -> List[Dict]:
with conn.cursor(cursor_factory=RealDictCursor) as cur:
cur.execute("""
SELECT id, title, content,
1 - (embedding <=> %s::vector) as similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (query_vector.tolist(), query_vector.tolist(), limit))
return cur.fetchall()
# Usage example
embedding = np.random.rand(768).astype(np.float32)
doc_id = insert_document(
"CockroachDB Vector Search",
"Implementing vector search in distributed SQL database",
embedding
)
# Execute search
query_embedding = np.random.rand(768).astype(np.float32)
results = vector_search(query_embedding)
for result in results:
print(f"Title: {result['title']}, Similarity: {result['similarity']:.4f}")
Hybrid Search and Filtering
# Combined text and vector search
def hybrid_search(
query_vector: np.ndarray,
text_query: str,
    category: Optional[str] = None,
limit: int = 10
) -> List[Dict]:
with conn.cursor(cursor_factory=RealDictCursor) as cur:
query = """
SELECT id, title, content,
1 - (embedding <=> %s::vector) as vector_score,
ts_rank(to_tsvector('english', content),
plainto_tsquery('english', %s)) as text_score,
(1 - (embedding <=> %s::vector)) * 0.7 +
ts_rank(to_tsvector('english', content),
plainto_tsquery('english', %s)) * 0.3 as combined_score
FROM documents
WHERE (%s IS NULL OR category = %s)
ORDER BY combined_score DESC
LIMIT %s
"""
cur.execute(query, (
query_vector.tolist(),
text_query,
query_vector.tolist(),
text_query,
category,
category,
limit
))
return cur.fetchall()
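For example, the hybrid query can be exercised like this (the 0.7/0.3 weighting is fixed inside the SQL above):
# Blend semantic similarity with keyword relevance
results = hybrid_search(query_embedding, "distributed vector search", category="technology")
for r in results:
    print(f"{r['title']}: combined={r['combined_score']:.4f}")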
# Geographically optimized query
def region_optimized_search(
query_vector: np.ndarray,
region: str,
limit: int = 10
) -> List[Dict]:
with conn.cursor(cursor_factory=RealDictCursor) as cur:
cur.execute("""
SELECT id, title, content,
1 - (embedding <=> %s::vector) as similarity
FROM documents
WHERE region = %s
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (
query_vector.tolist(),
region,
query_vector.tolist(),
limit
))
return cur.fetchall()
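Scoping a query to one region keeps reads local when data placement is known up front; the region value must be one of the database's regions (the name below is an assumption):
# Only consider rows homed in the caller's region
local_results = region_optimized_search(query_embedding, "us-east1", limit=5)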
Batch Processing and Optimization
# Batch insert
def batch_insert_documents(documents: List[Dict]):
    with conn.cursor() as cur:
        # execute_values folds many rows into a single INSERT statement,
        # avoiding one network round trip per document
        execute_values(cur, """
            INSERT INTO documents (title, content, embedding, category, metadata)
            VALUES %s
        """, [
            (
                doc['title'],
                doc['content'],
                doc['embedding'].tolist(),
                doc.get('category', 'general'),
                Json(doc.get('metadata', {})),  # dicts need Json() for JSONB
            )
            for doc in documents
        ])
    conn.commit()
# Rebuild index
def rebuild_vector_index():
with conn.cursor() as cur:
# Drop existing index
cur.execute("DROP INDEX IF EXISTS documents_embedding_idx")
# Create new index with optimized parameters
cur.execute("""
CREATE INDEX documents_embedding_idx ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 200)
""")
conn.commit()
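The lists parameter trades recall for speed. pgvector's rule of thumb is roughly rows / 1000 up to about a million rows and sqrt(rows) beyond that; a small helper as a sketch:
def suggest_lists(row_count: int) -> int:
    # pgvector guideline: rows/1000 up to ~1M rows, sqrt(rows) above that
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return int(row_count ** 0.5)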
# Get cluster statistics
def get_cluster_stats():
with conn.cursor(cursor_factory=RealDictCursor) as cur:
cur.execute("""
            SELECT
                count(*) AS total_documents,
                -- text-format size, a rough proxy rather than on-disk bytes
                avg(octet_length(embedding::text)) AS avg_embedding_size,
                count(DISTINCT category) AS categories
            FROM documents
""")
return cur.fetchone()
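Putting the maintenance helpers together:
# Periodic maintenance: refresh the index, then report table statistics
rebuild_vector_index()
stats = get_cluster_stats()
print(f"documents={stats['total_documents']}, categories={stats['categories']}")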