GitHub Overview

activeloopai/deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

Stars: 8,742
Watchers: 95
Forks: 669
Created: August 9, 2019
Language: Python
License: Apache License 2.0

Topics

ai, computer-vision, cv, data-science, datalake, datasets, deep-learning, image-processing, langchain, large-language-models, llm, machine-learning, ml, mlops, multi-modal, python, pytorch, tensorflow, vector-database, vector-search

Star History

[Star history chart for activeloopai/deeplake; data as of July 30, 2025]

Database

Deep Lake

Overview

Deep Lake is a Database for AI developed by Activeloop, built on a storage format optimized for deep-learning applications. As a data lake specialized for vector embeddings, it can query more than 35 million embeddings in under a second. It unifies the management of multimodal data, including images, videos, text, and audio, simplifying the data pipelines required for LLM applications and deep-learning model training.

Details

Deep Lake takes a different approach from traditional vector databases, combining the flexibility of data lakes with the search performance of vector databases. Its serverless architecture works seamlessly in local, cloud, or in-memory environments, with computations running client-side, enabling lightweight production applications to start in seconds.
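Because the storage layer sits behind dataset paths, the same code can target local disk, memory, or object storage. A minimal sketch, where the bucket name and credentials are placeholders:

import deeplake

# Local disk
ds_local = deeplake.empty('./local_dataset', overwrite=True)

# In-memory (handy for tests; contents are discarded when the process exits)
ds_mem = deeplake.empty('mem://scratch_dataset')

# S3 object storage (placeholder bucket and credentials)
ds_s3 = deeplake.empty(
    's3://my-bucket/my_dataset',
    creds={'aws_access_key_id': '...', 'aws_secret_access_key': '...'}
)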

Key Features

Optimized HNSW Algorithm Implementation

  • Deep Lake 3.7.1 introduces its own implementation of the HNSW (Hierarchical Navigable Small World) ANN search algorithm (see the sketch after this list)
  • Accelerated index creation through intelligent memory utilization and multithreading
  • Based on the Hnswlib package with Deep Lake's proprietary optimizations
  • Integrated fast filtering based on metadata, text, and other attributes
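As a rough sketch of how the index surfaces in code: Deep Lake 3.7+ exposes index creation on the embedding tensor via create_vdb_index. The exact method name, arguments, and availability are assumptions to verify against your installed version:

import deeplake
import numpy as np

ds = deeplake.empty('./index_demo', overwrite=True)
ds.create_tensor('embeddings', htype='embedding', dtype=np.float32)
ds.extend({'embeddings': np.random.randn(10_000, 768).astype(np.float32)})

# Build an HNSW index over the embedding tensor (assumed API, Deep Lake 3.7+);
# creation is multithreaded and memory-aware per the notes above
ds.embeddings.create_vdb_index('hnsw_1', distance='cosine_similarity')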

Performance and Scalability

  • Achieves sub-second vector search for up to 35 million embeddings
  • Automatically switches between linear search for datasets under 100,000 rows and ANN search for larger ones
  • Distributes data across object storage, attached storage (on-disk), and RAM
  • Up to 80% lower cost than comparable vector databases, per Activeloop's own benchmarks

Data Management Capabilities

  • Multimodal Support: Handles diverse data types including embeddings, images, videos, audio, text, DICOM, PDFs, and more
  • Native Compression: Stores data in formats like JPEG, PNG, MP4 while providing NumPy-like array access
  • Version Control: Built-in Git-like version control system for data (see the sketch after this list)
  • Lazy Loading: Loads only required data on-demand, optimizing memory efficiency
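The version-control bullet above maps onto a Git-like commit/branch workflow. A short sketch, with placeholder paths and messages:

import deeplake

ds = deeplake.empty('./versioned_dataset', overwrite=True)
ds.create_tensor('labels', htype='class_label')
ds.labels.append(0)

# Commit the current state, like `git commit`
first_commit = ds.commit('initial labels')

# Branch off and make changes without touching main
ds.checkout('relabeling', create=True)
ds.labels.append(1)
ds.commit('add second label')

# Inspect history, then return to the original commit
ds.log()
ds.checkout(first_commit)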

Integration and Ecosystem

  • ML Frameworks: Built-in dataloaders for PyTorch and TensorFlow
  • LLM Tools: Seamless integration with LangChain and LlamaIndex
  • Cloud Storage: Compatible with S3, GCP, Azure, MinIO and other S3-compatible storage
  • Visualization: Instant dataset visualization through Deep Lake App

Tensor Query Language (TQL)

SQL-like query language optimized for machine-learning datasets; examples follow the list:

  • Vector similarity search using COSINE_SIMILARITY, L2_NORM
  • Text semantic search with BM25_SIMILARITY
  • Complex filtering, joins, sorting, and pagination
  • Support for synchronous/asynchronous execution
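A few illustrative TQL queries, assuming a dataset with embeddings, text, and metadata tensors like the one built in Basic Usage below; the JSON-field and pagination syntax is an assumption to check against the TQL documentation:

import deeplake
import numpy as np

ds = deeplake.load('./my_dataset')  # placeholder path
q = np.random.randn(768).astype(np.float32)
q_str = ', '.join(str(x) for x in q)

# Vector similarity search (uses the ANN index when one exists)
top_k = ds.query(
    f"SELECT * ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[{q_str}]) DESC LIMIT 10"
)

# Text relevance scoring with BM25
matches = ds.query(
    "SELECT * ORDER BY BM25_SIMILARITY(text, 'vector database') DESC LIMIT 10"
)

# Metadata filtering with pagination (assumed JSON-access syntax)
page = ds.query(
    "SELECT * WHERE metadata['category'] == 'cat_1' LIMIT 10 OFFSET 20"
)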

Pros and Cons

Pros

  • Cost Efficient: Up to 80% lower cost than other vector databases (vendor claim)
  • High Performance: Sub-second search speed even with 35 million vectors
  • Multimodal Support: Unified management of various data types
  • Serverless: No infrastructure management, instant deployment
  • Version Control: Track data change history
  • Flexible Storage: Supports local, cloud, and in-memory storage
  • Rich Integrations: Native integration with major ML tools

Cons

  • HNSW Details Not Public: Implementation details not fully documented
  • Learning Curve: Requires learning TQL and Deep Lake-specific concepts
  • Real-time Requirements: Not suitable for applications requiring millisecond-level responses
  • Enterprise Features: Some advanced features only available in paid versions

Code Examples

Basic Usage

import deeplake
import numpy as np

# Create dataset
ds = deeplake.empty('my_dataset', overwrite=True)

# Add tensors
ds.create_tensor('embeddings', htype='embedding', dtype=np.float32)
ds.create_tensor('text', htype='text')
ds.create_tensor('metadata', htype='json')

# Add data (extend writes many samples at once; append is for a single sample)
embeddings = np.random.randn(1000, 768).astype(np.float32)
texts = [f"Document {i}" for i in range(1000)]
metadata = [{"category": f"cat_{i%5}", "id": i} for i in range(1000)]

ds.extend({
    'embeddings': embeddings,
    'text': texts,
    'metadata': metadata
})

# Vector search via TQL: rank every row by cosine similarity, keep the top 10
query_embedding = np.random.randn(768).astype(np.float32)
query_str = ', '.join(str(x) for x in query_embedding.tolist())
results = ds.query(
    f"SELECT * ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[{query_str}]) DESC LIMIT 10"
)

LangChain Integration

from langchain.vectorstores import DeepLake
from langchain.embeddings import OpenAIEmbeddings

# Initialize Deep Lake vector store (assumes OPENAI_API_KEY is set)
embeddings = OpenAIEmbeddings()
vector_store = DeepLake(
    dataset_path="./my_deeplake_db",
    embedding=embeddings,
    # runtime={"tensor_db": True} enables the managed tensor DB, but it only
    # applies to hub:// cloud paths, so it is omitted for this local dataset
)

# Add documents
texts = ["Deep Lake is a high-performance vector database", 
         "It achieves fast search through HNSW algorithm"]
metadata = [{"source": "doc1"}, {"source": "doc2"}]
vector_store.add_texts(texts=texts, metadatas=metadata)

# Similarity search
query = "About vector database performance"
docs = vector_store.similarity_search(query, k=5)

# Search with metadata filtering (filter keys are nested under "metadata")
docs_filtered = vector_store.similarity_search(
    query,
    k=5,
    filter={"metadata": {"source": "doc1"}}
)

Managing Multimodal Data

import deeplake
import torchvision
from PIL import Image

# Create multimodal dataset
ds = deeplake.empty('multimodal_dataset', overwrite=True)
ds.create_tensor('images', htype='image', sample_compression='jpeg')
ds.create_tensor('image_embeddings', htype='embedding')
ds.create_tensor('captions', htype='text')
ds.create_tensor('audio', htype='audio')

# Add an image and caption; 'model' is a placeholder for any pre-trained
# image encoder (e.g. CLIP) loaded elsewhere
image_path = 'example.jpg'
image_embedding = model.encode_image(Image.open(image_path))
caption = "Beautiful landscape photo"

# skip_ok=True lets this sample omit the audio tensor
ds.append({
    'images': deeplake.read(image_path),  # stored with native JPEG compression
    'image_embeddings': image_embedding,
    'captions': caption
}, skip_ok=True)

# Create a PyTorch dataloader; the transform dict selects the tensors to
# return and applies a per-tensor transform (None means pass through)
dataloader = ds.pytorch(
    batch_size=32,
    shuffle=True,
    num_workers=4,
    transform={
        'images': torchvision.transforms.Compose([
            torchvision.transforms.ToPILImage(),
            torchvision.transforms.Resize((224, 224)),
            torchvision.transforms.ToTensor(),
        ]),
        'captions': None,
    }
)

# Train model
for batch in dataloader:
    images = batch['images']
    captions = batch['captions']
    # Training process...