GitHub Overview
activeloopai/deeplake
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Deep Lake
Overview
Deep Lake is a Database for AI developed by Activeloop, built on a storage format optimized for deep-learning applications. As a data lake specialized for vector embeddings, it can query more than 35 million embeddings in under a second. It unifies management of multimodal data, including images, videos, text, and audio, simplifying the data pipelines required for LLM applications and deep learning model training.
Details
Deep Lake takes a different approach from traditional vector databases, combining the flexibility of a data lake with vector-database-level search performance. Its serverless architecture works the same way against local, cloud, or in-memory storage, with all computation running client-side, so lightweight production applications can start in seconds.
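Because computation runs client-side, switching between environments is mostly a matter of changing the dataset path. Below is a minimal sketch of the path conventions; the bucket name, organization name, and credential values are placeholders.

import deeplake

# The same API targets different storage backends; only the path changes.
ds_local  = deeplake.empty('./my_local_dataset', overwrite=True)   # local disk
ds_memory = deeplake.empty('mem://scratch_dataset')                # in-memory, discarded on exit
ds_s3     = deeplake.empty(
    's3://my-bucket/my_dataset',                                   # S3 or any S3-compatible store
    creds={'aws_access_key_id': '...', 'aws_secret_access_key': '...'},
)
ds_cloud  = deeplake.empty('hub://my_org/my_dataset')              # Activeloop-managed storage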
Key Features
Optimized HNSW Algorithm Implementation
- Deep Lake 3.7.1 introduces a unique implementation of the HNSW (Hierarchical Navigable Small World) ANN search algorithm
- Accelerated index creation through intelligent memory utilization and multithreading
- Based on the Hnswlib package with Deep Lake's proprietary optimizations
- Integrated fast filtering based on metadata, text, and other attributes
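In practice the index is used through the high-level VectorStore API rather than configured by hand. The sketch below assumes the default tensor names (text, metadata, embedding) and a placeholder local path, and shows an ANN search combined with a metadata filter.

import numpy as np
from deeplake import VectorStore

vs = VectorStore(path='./hnsw_demo_store')

# Populate the default tensors
vs.add(
    text=[f"Document {i}" for i in range(1000)],
    metadata=[{"category": f"cat_{i % 5}"} for i in range(1000)],
    embedding=np.random.randn(1000, 768).astype(np.float32),
)

# ANN search restricted to one metadata category
results = vs.search(
    embedding=np.random.randn(768).astype(np.float32),
    k=10,
    filter={"metadata": {"category": "cat_1"}},
)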
Performance and Scalability
- Achieves sub-second vector search for up to 35 million embeddings
- Automatically switches between exact (linear) search for datasets under 100,000 embeddings and ANN search for larger ones
- Distributes data across object storage, attached storage (on-disk), and RAM
- 80% cost reduction compared to other vector databases
Data Management Capabilities
- Multimodal Support: Handles diverse data types including embeddings, images, videos, audio, text, DICOM, PDFs, and more
- Native Compression: Stores data in formats like JPEG, PNG, MP4 while providing NumPy-like array access
- Version Control: Built-in Git-like version control system for data
- Lazy Loading: Loads only required data on-demand, optimizing memory efficiency
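Version control and lazy loading are both exposed directly on the dataset object. The following is a short sketch (paths, tensor names, and commit messages are illustrative): commits, branches, and checkouts behave much like Git, and indexing a tensor reads only the requested sample from storage.

import deeplake
import numpy as np

ds = deeplake.empty('./versioned_dataset', overwrite=True)
ds.create_tensor('embeddings', htype='embedding', dtype=np.float32)

# First version of the data
ds.extend({'embeddings': np.random.randn(100, 768).astype(np.float32)})
first_commit = ds.commit('initial 100 embeddings')

# Second version on a separate branch
ds.checkout('experiment', create=True)
ds.extend({'embeddings': np.random.randn(50, 768).astype(np.float32)})
ds.commit('add 50 more embeddings')

# Inspect history and return to the first version
ds.log()
ds.checkout(first_commit)

# Lazy loading: only this one sample is read from storage
single_embedding = ds.embeddings[10].numpy()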
Integration and Ecosystem
- ML Frameworks: Built-in dataloaders for PyTorch and TensorFlow
- LLM Tools: Seamless integration with LangChain and LlamaIndex
- Cloud Storage: Compatible with S3, GCP, Azure, MinIO and other S3-compatible storage
- Visualization: Instant dataset visualization through Deep Lake App
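Cloud storage and framework integration go through the same dataset object. Below is a sketch of opening a dataset from S3-compatible storage and handing it to PyTorch or TensorFlow; the bucket name, credentials, and endpoint are placeholders.

import deeplake

ds = deeplake.load(
    's3://my-bucket/my_dataset',
    creds={
        'aws_access_key_id': '...',
        'aws_secret_access_key': '...',
        # 'endpoint_url': 'http://localhost:9000',  # MinIO or other S3-compatible stores
    },
    read_only=True,
)

# Built-in dataloaders for both major frameworks
torch_loader = ds.pytorch(batch_size=32, shuffle=True, num_workers=2)
tf_dataset = ds.tensorflow()  # returns a tf.data.Dataset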
Tensor Query Language (TQL)
SQL-like query language optimized for machine learning datasets:
- Vector similarity search using COSINE_SIMILARITY and L2_NORM
- Text semantic search with BM25_SIMILARITY
- Complex filtering, joins, sorting, and pagination
- Support for synchronous/asynchronous execution
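The sketch below assumes a dataset ds with embeddings, text, and metadata tensors like the one built in the Basic Usage example further down; note that TQL execution may require Deep Lake's compute engine, and BM25_SIMILARITY availability depends on the Deep Lake version.

import numpy as np

query_vec = np.random.randn(768).astype(np.float32)

# Combine a metadata filter, vector ranking, and pagination in one TQL statement
view = ds.query(f"""
    SELECT *
    WHERE metadata['category'] == 'cat_1'
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY{query_vec.tolist()}) DESC
    LIMIT 10
""")

# Keyword-style ranking over the text tensor
top_docs = ds.query(
    "SELECT * ORDER BY BM25_SIMILARITY(text, 'vector database performance') DESC LIMIT 5"
)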
Pros and Cons
Pros
- Cost Efficient: 80% cost reduction compared to other vector databases
- High Performance: Sub-second search speed even with 35 million vectors
- Multimodal Support: Unified management of various data types
- Serverless: No infrastructure management, instant deployment
- Version Control: Track data change history
- Flexible Storage: Supports local, cloud, and in-memory storage
- Rich Integrations: Native integration with major ML tools
Cons
- HNSW Details Not Public: Implementation details not fully documented
- Learning Curve: Requires learning TQL and Deep Lake-specific concepts
- Real-time Requirements: Not suitable for applications requiring millisecond-level responses
- Enterprise Features: Some advanced features only available in paid versions
Key Links
Code Examples
Basic Usage
import deeplake
import numpy as np

# Create a local dataset
ds = deeplake.empty('my_dataset', overwrite=True)

# Define tensors
ds.create_tensor('embeddings', htype='embedding', dtype=np.float32)
ds.create_tensor('text', htype='text')
ds.create_tensor('metadata', htype='json')

# Add data (extend() appends multiple rows at once; append() takes a single sample)
embeddings = np.random.randn(1000, 768).astype(np.float32)
texts = [f"Document {i}" for i in range(1000)]
metadata = [{"category": f"cat_{i%5}", "id": i} for i in range(1000)]
ds.extend({
    'embeddings': embeddings,
    'text': texts,
    'metadata': metadata
})

# Vector search via TQL
query_embedding = np.random.randn(768).astype(np.float32)
results = ds.query(
    f"SELECT * ORDER BY COSINE_SIMILARITY(embeddings, ARRAY{query_embedding.tolist()}) DESC LIMIT 10"
)
LangChain Integration
# For newer LangChain releases these classes live in langchain_community / langchain_openai
from langchain.vectorstores import DeepLake
from langchain.embeddings import OpenAIEmbeddings

# Initialize the Deep Lake vector store
embeddings = OpenAIEmbeddings()
vector_store = DeepLake(
    dataset_path="./my_deeplake_db",  # use "hub://<org>/<dataset>" plus runtime={"tensor_db": True} for the managed Tensor Database
    embedding=embeddings,
)

# Add documents
texts = ["Deep Lake is a high-performance vector database",
         "It achieves fast search through the HNSW algorithm"]
metadata = [{"source": "doc1"}, {"source": "doc2"}]
vector_store.add_texts(texts=texts, metadatas=metadata)

# Similarity search
query = "About vector database performance"
docs = vector_store.similarity_search(query, k=5)

# Search with metadata filtering (filter keys map to tensor names)
docs_filtered = vector_store.similarity_search(
    query,
    k=5,
    filter={"metadata": {"source": "doc1"}}
)
Managing Multimodal Data
import deeplake
import torch
import torchvision
from PIL import Image

# Create a multimodal dataset
ds = deeplake.empty('multimodal_dataset', overwrite=True)
ds.create_tensor('images', htype='image', sample_compression='jpeg')
ds.create_tensor('image_embeddings', htype='embedding')
ds.create_tensor('captions', htype='text')
ds.create_tensor('audio', htype='audio', sample_compression='mp3')  # audio tensors require a sample compression

# Add an image, its embedding, and a caption
# ('model' is assumed to be a pre-trained image encoder, e.g. a CLIP-style model)
image_path = 'example.jpg'
image = Image.open(image_path)
image_embedding = model.encode_image(image)
caption = "Beautiful landscape photo"
ds.append({
    'images': deeplake.read(image_path),  # reads the file without recompressing
    'image_embeddings': image_embedding,
    'captions': caption
}, skip_ok=True)  # skip tensors (here: audio) not present in this sample

# Create a PyTorch dataloader; tensors listed in the transform dict are returned in each batch
dataloader = ds.pytorch(
    batch_size=32,
    shuffle=True,
    num_workers=4,
    transform={
        'images': torchvision.transforms.Compose([
            torchvision.transforms.ToTensor(),  # Deep Lake yields NumPy arrays, so convert before resizing
            torchvision.transforms.Resize(224),
        ]),
        'captions': None,  # pass captions through unchanged
    }
)

# Train model
for batch in dataloader:
    images = batch['images']
    captions = batch['captions']
    # Training process...