GitHub Overview

apache/cassandra

Apache Cassandra®

Stars: 9,296
Watchers: 443
Forks: 3,723
Created: May 21, 2009
Language: Java
License: Apache License 2.0

Topics

cassandra, database, java

Star History

[Chart: apache/cassandra star history. Data as of: 7/30/2025, 02:37 AM]

Database

Apache Cassandra + Vector Search

Overview

Apache Cassandra is a distributed NoSQL database known for high scalability and availability. Vector search, developed largely by DataStax and available in its products as well as in Apache Cassandra 5.0 and later, lets Cassandra store and search large-scale vector data. Because the vector index is distributed across nodes together with the data, search capacity grows with the cluster, up to petabyte-scale datasets.

Details

Cassandra was developed at Facebook, open-sourced in 2008, and later donated to the Apache Software Foundation. DataStax Astra DB and DataStax Enterprise (DSE) integrate vector search functionality, allowing AI workloads to run on existing Cassandra infrastructure.

Key features of Cassandra vector search:

  • Vector search in large-scale distributed environments
  • Approximate nearest neighbor (ANN) search
  • Support for high-dimensional vectors (up to 8192 dimensions)
  • Vector operations via CQL
  • Linear scalability
  • Multi-datacenter replication
  • Tunable and eventual consistency
  • High availability (no single point of failure)
  • Integration with time-series data
  • Managed service via DataStax Astra DB
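
To make the ANN and similarity items above concrete, here is a minimal sketch of the exact computation that an approximate index tries to reproduce cheaply: score every stored vector against the query and keep the best k. The function names and sample data are illustrative assumptions, not part of Cassandra, and the raw cosine computed here may be scaled differently from the score a database function returns.

```python
import math

def cosine_similarity(a, b):
    """Raw cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbors: score every vector, keep the k best."""
    scored = [(cosine_similarity(query, v), key) for key, v in vectors.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]

# Hypothetical 2-dimensional "embeddings" keyed by document id
docs = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
nearest = exact_top_k([1.0, 0.1], docs, k=2)
print(nearest)
```

The brute-force scan costs one comparison per stored vector, which is why large collections rely on an ANN index that trades a small amount of recall for far fewer comparisons.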

Architecture Features

  • Peer-to-peer distributed system
  • Data distribution via consistent hashing
  • Node communication via gossip protocol
  • Storage using SSTables (Sorted String Tables)
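
The consistent hashing item above can be sketched as a token ring: each node owns several tokens, and a partition key is stored on the first nodes found walking clockwise from the key's token. This is a simplified illustration under stated assumptions, not Cassandra's implementation: the `HashRing` class, the node names, and the use of MD5 as the hash are all stand-ins chosen for the example.

```python
import bisect
import hashlib

class HashRing:
    """Toy token ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=8):
        self._ring = []  # sorted list of (token, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._token(f"{node}#{i}"), node))
        self._ring.sort()
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(key):
        # MD5 is an assumption for the sketch; any uniform hash would do.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def replicas(self, partition_key, rf=1):
        """Walk clockwise from the key's token, collecting rf distinct nodes."""
        start = bisect.bisect_left(self._tokens, self._token(partition_key))
        owners = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == rf:
                break
        return owners

ring = HashRing(["node1", "node2", "node3"])
print(ring.replicas("user:42", rf=2))
```

The property that matters for scaling is that adding a node only takes ownership of some token ranges; keys that do not fall in those ranges stay where they were.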

Pros and Cons

Pros

  • High scalability: Linear scaling for large datasets
  • High availability: No single point of failure; the cluster keeps serving requests when individual nodes fail
  • Geographic distribution: Multi-datacenter replication
  • Write performance: Fast write operations
  • Flexible consistency: Consistency levels based on application requirements
  • Production proven: Adopted by many large-scale systems

Cons

  • Complex operations: Requires cluster management and tuning
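  • Eventual consistency: Latency increases when strong consistency is required, because more replicas must acknowledge each read and write
  • Query limitations: Queries are modeled around the partition key, and secondary indexes come with constraints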
  • Memory usage: Requires significant memory
  • Learning curve: Need to understand CQL and data modeling
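
The consistency trade-off listed above comes down to simple arithmetic. With replication factor RF, a read is guaranteed to see the latest acknowledged write when the number of replicas written (W) plus the number read (R) exceeds RF, because the two sets must then overlap. The helper names below are illustrative assumptions, not driver API.

```python
def quorum(replication_factor):
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def read_sees_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > RF)."""
    return write_replicas + read_replicas > replication_factor

rf = 3
q = quorum(rf)
print(q)                                  # replicas needed for QUORUM at RF=3
print(read_sees_latest_write(rf, q, q))   # QUORUM writes + QUORUM reads
print(read_sees_latest_write(rf, 1, 1))   # ONE writes + ONE reads
```

Writing and reading at quorum gives the overlap guarantee at the cost of waiting for more replicas; using a single replica for both is faster but only eventually consistent.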

Usage Examples

Setup and Table Creation

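-- Create the keyspace used by the Python examples
-- (replication settings here are illustrative for a single-node test cluster)
CREATE KEYSPACE IF NOT EXISTS vector_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE vector_keyspace;

-- Create vector search enabled table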
CREATE TABLE IF NOT EXISTS vector_data (
    id UUID PRIMARY KEY,
    title TEXT,
    content TEXT,
    embedding VECTOR<FLOAT, 768>,
    category TEXT,
    created_at TIMESTAMP
);

-- Create vector index
CREATE CUSTOM INDEX IF NOT EXISTS embedding_index 
ON vector_data(embedding) 
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {
    'similarity_function': 'cosine'
};
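
Because the embedding column is declared as VECTOR<FLOAT, 768>, every value written to it must have exactly 768 components. A small client-side check, sketched below, can catch mismatched embeddings before they reach the database. The helper name `to_cql_vector` and the constant are hypothetical, introduced only for this example.

```python
import math

EMBEDDING_DIM = 768  # assumed to match VECTOR<FLOAT, 768> in the table definition

def to_cql_vector(values, dim=EMBEDDING_DIM):
    """Return a plain list of floats, or raise if it cannot fit the column."""
    vector = [float(v) for v in values]
    if len(vector) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vector)}")
    if not all(math.isfinite(v) for v in vector):
        raise ValueError("embedding contains NaN or infinite values")
    return vector

checked = to_cql_vector([0.5] * 768)
print(len(checked))
```

The returned list can be bound to the embedding parameter of an insert in place of a raw array.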

Basic Operations with Python

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import numpy as np
import uuid
from datetime import datetime

# Connection setup
auth_provider = PlainTextAuthProvider(
    username='cassandra', 
    password='cassandra'
)
cluster = Cluster(
    ['127.0.0.1'],
    auth_provider=auth_provider
)
session = cluster.connect('vector_keyspace')

# Insert document
# '?' placeholders are only valid in prepared statements, so prepare the query once
insert_stmt = session.prepare("""
    INSERT INTO vector_data (id, title, content, embedding, category, created_at)
    VALUES (?, ?, ?, ?, ?, ?)
""")

def insert_vector_data(title, content, embedding):
    session.execute(insert_stmt, (
        uuid.uuid4(),
        title,
        content,
        embedding.tolist(),
        'technology',
        datetime.now()
    ))

# Insert vector
embedding = np.random.rand(768).astype(np.float32)
insert_vector_data(
    "Cassandra Vector Search", 
    "Implementing vector search in distributed environments",
    embedding
)

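# Vector search
# '?' placeholders are only valid in prepared statements, so prepare the query once
search_stmt = session.prepare("""
    SELECT id, title, content, 
           similarity_cosine(embedding, ?) as similarity
    FROM vector_data
    ORDER BY embedding ANN OF ?
    LIMIT ?
""")

def vector_search(query_vector, limit=10):
    # The query vector is bound twice: once for the score, once for the ANN ordering
    results = session.execute(search_stmt, (
        query_vector.tolist(),
        query_vector.tolist(),
        limit
    ))
    
    return results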

# Execute search
query_embedding = np.random.rand(768).astype(np.float32)
results = vector_search(query_embedding)

for row in results:
    print(f"Title: {row.title}, Similarity: {row.similarity}")

Batch Processing and Optimization

from cassandra.concurrent import execute_concurrent_with_args

# Bulk insert (concurrent individual writes, not a CQL BATCH)
def batch_insert_vectors(documents):
    query = """
    INSERT INTO vector_data (id, title, content, embedding, category, created_at)
    VALUES (?, ?, ?, ?, ?, ?)
    """
    
    # '?' placeholders are only valid in prepared statements
    prepared = session.prepare(query)
    
    # Prepare data
    parameters = []
    for doc in documents:
        parameters.append((
            uuid.uuid4(),
            doc['title'],
            doc['content'],
            doc['embedding'].tolist(),
            doc['category'],
            datetime.now()
        ))
    
    # Concurrent execution: up to 50 requests in flight at a time
    execute_concurrent_with_args(
        session, 
        prepared, 
        parameters,
        concurrency=50
    )

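# Search with filtering
# Assumes an SAI index also exists on the category column; ANN queries
# generally need restricted columns to be indexed, so ALLOW FILTERING is not used.
def filtered_vector_search(query_vector, category, limit=10):
    query = """
    SELECT id, title, content,
           similarity_cosine(embedding, ?) as similarity
    FROM vector_data
    WHERE category = ?
    ORDER BY embedding ANN OF ?
    LIMIT ?
    """
    
    # '?' placeholders are only valid in prepared statements
    prepared = session.prepare(query)
    results = session.execute(prepared, (
        query_vector.tolist(),
        category,
        query_vector.tolist(),
        limit
    ))
    
    return results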

DataStax Astra DB Implementation

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import json

# Astra DB connection
cloud_config = {
    'secure_connect_bundle': '/path/to/secure-connect-bundle.zip'
}

with open('/path/to/token.json') as f:
    secrets = json.load(f)

auth_provider = PlainTextAuthProvider(
    secrets['clientId'],
    secrets['secret']
)

cluster = Cluster(
    cloud=cloud_config,
    auth_provider=auth_provider
)
session = cluster.connect()

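# Vector search (Astra DB)
# collection_name is interpolated into the CQL text, so it must come from
# trusted code (never user input) and be keyspace-qualified, because
# connect() above selects no keyspace.
def astra_vector_search(query_vector, collection_name):
    query = f"""
    SELECT * FROM {collection_name}
    ORDER BY vector_embedding ANN OF ?
    LIMIT 10
    """
    
    # ANN OF belongs in ORDER BY, and '?' requires a prepared statement
    prepared = session.prepare(query)
    results = session.execute(prepared, [query_vector.tolist()])
    return results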