GitHub Overview

apache/cassandra

Apache Cassandra®

Repository: https://github.com/apache/cassandra

Homepage: https://cassandra.apache.org/

Stars: 9,436

Watchers: 439

Forks: 3,777

Created: May 21, 2009

Language: Java

License: Apache License 2.0

Topics: cassandra, database, java

Data as of: 10/22/2025, 04:10 AM

Database

Apache Cassandra + Vector Search

Overview

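Apache Cassandra is a distributed NoSQL database known for high scalability and availability. Vector search, contributed largely by DataStax and available in Apache Cassandra 5.0 and later, lets Cassandra store vector embeddings and run approximate nearest neighbor queries over them. Because vectors are partitioned and replicated like any other column, the distributed architecture is designed to extend vector search to very large, potentially petabyte-scale, datasets.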

Details

Cassandra was developed at Facebook, open-sourced in 2008, and later donated to the Apache Software Foundation. DataStax Astra DB and DataStax Enterprise (DSE) also integrate vector search, so AI workloads can run on existing Cassandra infrastructure.

Key features of Cassandra vector search:

Vector search in large-scale distributed environments
Approximate nearest neighbor (ANN) search
Support for high-dimensional vectors (up to 8192 dimensions)
Vector operations via CQL
Linear scalability
Multi-datacenter replication
Tunable and eventual consistency
High availability (no single point of failure)
Integration with time-series data
Managed service via DataStax Astra DB
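
The "tunable consistency" item above comes down to simple arithmetic over the replication factor. The following is a minimal sketch of that arithmetic, not driver code; the function names are hypothetical and chosen only for illustration. It shows why QUORUM writes paired with QUORUM reads overlap on at least one replica, while ONE/ONE does not.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority for a given replication factor."""
    return replication_factor // 2 + 1


def reads_overlap_writes(rf: int, write_acks: int, read_acks: int) -> bool:
    """True when any set of read replicas must share a member with any set of write replicas."""
    return write_acks + read_acks > rf


rf = 3
q = quorum(rf)                              # majority of 3 replicas
strong = reads_overlap_writes(rf, q, q)     # QUORUM write + QUORUM read
weak = reads_overlap_writes(rf, 1, 1)       # ONE write + ONE read
print(q, strong, weak)
```

With a replication factor of 3, a quorum is 2 replicas, so a quorum write and a quorum read always share a replica. Lower levels trade that guarantee for latency and availability.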

Architecture Features

Peer-to-peer distributed system
Data distribution via consistent hashing
Node communication via gossip protocol
Storage using SSTables (Sorted String Tables)
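
To make "data distribution via consistent hashing" more concrete, here is a small self-contained sketch of a token ring. It is an illustration under stated assumptions, not Cassandra's implementation: the `HashRing` class, the `token` helper, the use of MD5, and the node names are all hypothetical (Cassandra's default partitioner uses a different hash function).

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a ring position (MD5 is used here purely for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=8):
        # Each node owns several positions on the ring ("virtual nodes").
        self._ring = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key, rf=3):
        """Walk clockwise from the key's token, collecting up to rf distinct nodes."""
        start = bisect.bisect_right(self._tokens, token(key))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
            if len(found) == rf:
                break
        return found


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("user:42", rf=3)
print(owners)
```

Because placement depends only on the hash of the key and the ring layout, any node can compute which peers own a key without consulting a central coordinator.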

Pros and Cons

Pros

High scalability: Linear scaling for large datasets
High availability: No single point of failure; the cluster keeps serving requests when individual nodes fail
Geographic distribution: Multi-datacenter replication
Write performance: Fast write operations
Flexible consistency: Consistency levels based on application requirements
Production proven: Adopted by many large-scale systems

Cons

Complex operations: Requires cluster management and tuning
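Consistency trade-off: Consistency is eventual by default; stronger levels such as QUORUM or ALL increase latency and reduce availability
Query limitations: Queries must follow the partition key or an index, there are no joins, and secondary indexes carry their own constraints
Memory usage: Requires significant memory
Learning curve: Requires understanding of CQL and query-driven data modeling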

Usage Examples

Setup and Table Creation

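-- Create the keyspace assumed by the Python examples
-- (these replication settings suit a single-node test cluster only)
CREATE KEYSPACE IF NOT EXISTS vector_keyspace
WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE vector_keyspace;

-- Create vector search enabled table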
CREATE TABLE IF NOT EXISTS vector_data (
    id UUID PRIMARY KEY,
    title TEXT,
    content TEXT,
    embedding VECTOR<FLOAT, 768>,
    category TEXT,
    created_at TIMESTAMP
);

-- Create vector index
CREATE CUSTOM INDEX IF NOT EXISTS embedding_index 
ON vector_data(embedding) 
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {
    'similarity_function': 'cosine'
};
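
The 'similarity_function': 'cosine' option above tells the index how to compare vectors. As a rough illustration of what that ranking means, here is a brute-force sketch in plain Python: it computes the exact nearest neighbors that an ANN index approximates. The helper names and sample rows are hypothetical, and the score Cassandra returns from its own similarity functions may be scaled differently from the raw cosine shown here.

```python
import math


def cosine_similarity(a, b):
    """Raw cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def exact_top_k(query, rows, k=2):
    """Score every row and keep the k best: the exact answer that ANN approximates."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in rows.items()]
    return sorted(scored, reverse=True)[:k]


rows = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = exact_top_k([1.0, 0.1], rows, k=2)
for score, key in top:
    print(key, round(score, 3))
```

A full scan like this costs time proportional to the table size on every query, which is the cost an ANN index is meant to avoid at the price of occasionally missing a true neighbor.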

Basic Operations with Python

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import numpy as np
import uuid
from datetime import datetime

# Connection setup
auth_provider = PlainTextAuthProvider(
    username='cassandra', 
    password='cassandra'
)
cluster = Cluster(
    ['127.0.0.1'],
    auth_provider=auth_provider
)
session = cluster.connect('vector_keyspace')

# Insert document
# '?' bind markers are only valid in prepared statements, so prepare once and reuse
insert_stmt = session.prepare("""
    INSERT INTO vector_data (id, title, content, embedding, category, created_at)
    VALUES (?, ?, ?, ?, ?, ?)
""")

def insert_vector_data(title, content, embedding):
    session.execute(insert_stmt, (
        uuid.uuid4(),
        title,
        content,
        embedding.tolist(),
        'technology',
        datetime.now()
    ))

# Insert vector
embedding = np.random.rand(768).astype(np.float32)
insert_vector_data(
    "Cassandra Vector Search", 
    "Implementing vector search in distributed environments",
    embedding
)

# Vector search
# Prepared once; the query vector is bound twice (for the score and for the ordering)
search_stmt = session.prepare("""
    SELECT id, title, content,
           similarity_cosine(embedding, ?) as similarity
    FROM vector_data
    ORDER BY embedding ANN OF ?
    LIMIT ?
""")

def vector_search(query_vector, limit=10):
    results = session.execute(search_stmt, (
        query_vector.tolist(),
        query_vector.tolist(),
        limit
    ))

    return results

# Execute search
query_embedding = np.random.rand(768).astype(np.float32)
results = vector_search(query_embedding)

for row in results:
    print(f"Title: {row.title}, Similarity: {row.similarity}")

Batch Processing and Optimization

from cassandra.concurrent import execute_concurrent_with_args

# Batch insert
def batch_insert_vectors(documents):
    query = """
    INSERT INTO vector_data (id, title, content, embedding, category, created_at)
    VALUES (?, ?, ?, ?, ?, ?)
    """
    
    # Prepare data
    parameters = []
    for doc in documents:
        parameters.append((
            uuid.uuid4(),
            doc['title'],
            doc['content'],
            doc['embedding'].tolist(),
            doc['category'],
            datetime.now()
        ))
    
    # Concurrent execution ('?' bind markers require a prepared statement)
    prepared = session.prepare(query)
    execute_concurrent_with_args(
        session,
        prepared,
        parameters,
        concurrency=50
    )

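# Search with filtering
# Note: combining ANN ordering with a WHERE restriction generally requires the
# filtered column (category here) to have its own SAI index; ALLOW FILTERING
# is not a substitute for that index.
def filtered_vector_search(query_vector, category, limit=10):
    query = """
    SELECT id, title, content,
           similarity_cosine(embedding, ?) as similarity
    FROM vector_data
    WHERE category = ?
    ORDER BY embedding ANN OF ?
    LIMIT ?
    """

    # '?' bind markers require a prepared statement
    prepared = session.prepare(query)
    results = session.execute(prepared, (
        query_vector.tolist(),
        category,
        query_vector.tolist(),
        limit
    ))

    return results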
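
batch_insert_vectors builds the full parameter list in memory before executing. For a large corpus it may help to feed it documents in fixed-size slices. The `chunked` helper below is a hypothetical utility sketched for that purpose, and the slice size of 1000 is an arbitrary assumption rather than a tuned value.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Stand-in data: 2500 integers instead of real documents
sizes = [len(chunk) for chunk in chunked(range(2500), 1000)]
print(sizes)
```

In use, each slice would be passed on in turn, for example by looping over chunked(documents, 1000) and calling batch_insert_vectors on every chunk, so that memory use is bounded by the slice size rather than the corpus size.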

DataStax Astra DB Implementation

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import json

# Astra DB connection
cloud_config = {
    'secure_connect_bundle': '/path/to/secure-connect-bundle.zip'
}

with open('/path/to/token.json') as f:
    secrets = json.load(f)

auth_provider = PlainTextAuthProvider(
    secrets['clientId'],
    secrets['secret']
)

cluster = Cluster(
    cloud=cloud_config,
    auth_provider=auth_provider
)
session = cluster.connect()

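# Vector search (Astra DB)
def astra_vector_search(query_vector, collection_name):
    # ANN search is expressed with ORDER BY ... ANN OF, not in the WHERE clause.
    # Table names cannot be bound as parameters, so collection_name must come
    # from trusted code, never from user input.
    query = f"""
    SELECT * FROM {collection_name}
    ORDER BY vector_embedding ANN OF ?
    LIMIT 10
    """

    # '?' bind markers require a prepared statement
    prepared = session.prepare(query)
    results = session.execute(prepared, [query_vector.tolist()])
    return results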
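
Because astra_vector_search interpolates the table name directly into the CQL string, a name check before interpolation is a sensible guard. The sketch below is one possible approach: `safe_table_name` is a hypothetical helper, and its pattern (a leading letter followed by letters, digits, or underscores, 48 characters at most) is a deliberately conservative assumption rather than the full CQL identifier grammar.

```python
import re

# Conservative pattern for unquoted identifiers (an assumption, see above)
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,47}")


def safe_table_name(name: str) -> str:
    """Return the name unchanged if it looks like a plain identifier, else raise."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe table name: {name!r}")
    return name


print(safe_table_name("documents"))

try:
    safe_table_name("documents; DROP TABLE users")
except ValueError as exc:
    print(exc)
```

Calling safe_table_name(collection_name) before building the query string rejects anything that is not a plain identifier, while the vector itself continues to travel as a bound parameter.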