GitHub Overview

cockroachdb/cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.

Stars31,127
Watchers678
Forks3,940
Created:February 6, 2014
Language:Go
License:Other

Topics

cockroachdbdatabasedistributed-databasegohacktoberfestsql

Star History

cockroachdb/cockroach Star History
Data as of: 7/30/2025, 02:37 AM

Database

CockroachDB + pgvector

Overview

CockroachDB is a PostgreSQL-compatible distributed SQL database designed for globally distributed applications. By using the pgvector extension, vector search capabilities can be added, enabling storage and search of vector data in distributed environments. It allows running AI workloads while maintaining strong consistency and high availability.

Details

CockroachDB was developed by Cockroach Labs in 2015, with a design inspired by Google Spanner. Being PostgreSQL-compatible, many PostgreSQL extensions including pgvector are available. With features like distributed transactions, automatic replication, and geographic distribution, it enables building vector search applications at global scale.

Key features of CockroachDB + pgvector:

  • Vector search in distributed environments
  • pgvector support through PostgreSQL compatibility
  • Strong consistency (ACID compliant)
  • Automatic replication and sharding
  • Geographically distributed clusters
  • Zero-downtime scaling
  • Automatic failure recovery
  • Multi-region support
  • Familiar SQL interface
  • Enterprise-grade security

Architecture Features

  • Raft consensus algorithm
  • Distributed transactions (2PC)
  • Range-based data distribution
  • Multi-version concurrency control (MVCC)

Pros and Cons

Pros

  • Geographic distribution: Globally distributed data and vector search
  • Strong consistency: Full ACID transaction support
  • High availability: Automatic failover and zero downtime
  • PostgreSQL compatible: Leverage existing tools and libraries
  • Automatic management: Automated sharding and replication
  • Horizontal scaling: Easy scale-out by adding nodes

Cons

  • Latency: Delays due to consensus in distributed environments
  • Complexity: Managing and optimizing distributed systems
  • Resource requirements: Minimum 3-node configuration
  • pgvector limitations: Not native support
  • Cost: Licensing fees for enterprise features

Key Links

Usage Examples

Setup and Table Creation

-- Install pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create vector data table
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title TEXT NOT NULL,
    content TEXT,
    embedding vector(768),
    category TEXT,
    metadata JSONB,
    created_at TIMESTAMP DEFAULT now()
);

-- Create vector index
CREATE INDEX documents_embedding_idx ON documents 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Geographic partitioning (multi-region)
ALTER TABLE documents 
SET LOCALITY REGIONAL BY ROW AS "region";

Basic Operations with Python

import psycopg2
from psycopg2.extras import RealDictCursor
import numpy as np
from typing import List, Dict

# Connection
conn = psycopg2.connect(
    "postgresql://username:password@localhost:26257/vectordb?sslmode=require"
)

# Insert document
def insert_document(title: str, content: str, embedding: np.ndarray):
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO documents (title, content, embedding, category, metadata)
            VALUES (%s, %s, %s, %s, %s)
            RETURNING id
        """, (
            title,
            content,
            embedding.tolist(),
            'technology',
            {'source': 'manual', 'version': 1}
        ))
        doc_id = cur.fetchone()[0]
        conn.commit()
        return doc_id

# Vector search
def vector_search(query_vector: np.ndarray, limit: int = 10) -> List[Dict]:
    with conn.cursor(cursor_factory=RealDictCursor) as cur:
        cur.execute("""
            SELECT id, title, content, 
                   1 - (embedding <=> %s::vector) as similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_vector.tolist(), query_vector.tolist(), limit))
        
        return cur.fetchall()

# Usage example
embedding = np.random.rand(768).astype(np.float32)
doc_id = insert_document(
    "CockroachDB Vector Search",
    "Implementing vector search in distributed SQL database",
    embedding
)

# Execute search
query_embedding = np.random.rand(768).astype(np.float32)
results = vector_search(query_embedding)

for result in results:
    print(f"Title: {result['title']}, Similarity: {result['similarity']:.4f}")

Hybrid Search and Filtering

# Combined text and vector search
def hybrid_search(
    query_vector: np.ndarray, 
    text_query: str, 
    category: str = None,
    limit: int = 10
) -> List[Dict]:
    with conn.cursor(cursor_factory=RealDictCursor) as cur:
        query = """
            SELECT id, title, content,
                   1 - (embedding <=> %s::vector) as vector_score,
                   ts_rank(to_tsvector('english', content), 
                          plainto_tsquery('english', %s)) as text_score,
                   (1 - (embedding <=> %s::vector)) * 0.7 + 
                   ts_rank(to_tsvector('english', content), 
                          plainto_tsquery('english', %s)) * 0.3 as combined_score
            FROM documents
            WHERE (%s IS NULL OR category = %s)
            ORDER BY combined_score DESC
            LIMIT %s
        """
        
        cur.execute(query, (
            query_vector.tolist(),
            text_query,
            query_vector.tolist(),
            text_query,
            category,
            category,
            limit
        ))
        
        return cur.fetchall()

# Geographically optimized query
def region_optimized_search(
    query_vector: np.ndarray,
    region: str,
    limit: int = 10
) -> List[Dict]:
    with conn.cursor(cursor_factory=RealDictCursor) as cur:
        cur.execute("""
            SELECT id, title, content,
                   1 - (embedding <=> %s::vector) as similarity
            FROM documents
            WHERE region = %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (
            query_vector.tolist(),
            region,
            query_vector.tolist(),
            limit
        ))
        
        return cur.fetchall()

Batch Processing and Optimization

# Batch insert
def batch_insert_documents(documents: List[Dict]):
    with conn.cursor() as cur:
        # Fast insertion using COPY statement
        cur.execute("BEGIN")
        
        for doc in documents:
            cur.execute("""
                INSERT INTO documents (title, content, embedding, category, metadata)
                VALUES (%s, %s, %s, %s, %s)
            """, (
                doc['title'],
                doc['content'],
                doc['embedding'].tolist(),
                doc.get('category', 'general'),
                doc.get('metadata', {})
            ))
        
        cur.execute("COMMIT")

# Rebuild index
def rebuild_vector_index():
    with conn.cursor() as cur:
        # Drop existing index
        cur.execute("DROP INDEX IF EXISTS documents_embedding_idx")
        
        # Create new index with optimized parameters
        cur.execute("""
            CREATE INDEX documents_embedding_idx ON documents 
            USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 200)
        """)
        
        conn.commit()

# Get cluster statistics
def get_cluster_stats():
    with conn.cursor(cursor_factory=RealDictCursor) as cur:
        cur.execute("""
            SELECT 
                count(*) as total_documents,
                avg(octet_length(embedding::text)) as avg_embedding_size,
                count(DISTINCT category) as categories
            FROM documents
        """)
        
        return cur.fetchone()