GitHub Overview
activeloopai/deeplake
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Deep Lake
Overview
Deep Lake is a Database for AI developed by Activeloop, built on a storage format optimized for deep-learning applications. As a data lake specialized for vector embeddings, it can query more than 35 million embeddings in under a second. It unifies management of multimodal data, including images, videos, text, and audio, simplifying the data pipelines required for LLM applications and deep learning model training.
Details
Deep Lake takes a different approach from traditional vector databases, combining the flexibility of a data lake with vector-database-level search performance. Its serverless architecture works the same way against local, cloud, or in-memory storage, with all computation running client-side, so lightweight production applications can start in seconds.
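Because computation runs client-side, switching between environments is mostly a matter of changing the dataset path. Below is a minimal sketch of the path conventions; the bucket name, organization name, and credential values are placeholders.

import deeplake

# The same API targets different storage backends; only the path changes.
ds_local  = deeplake.empty('./my_local_dataset', overwrite=True)   # local disk
ds_memory = deeplake.empty('mem://scratch_dataset')                # in-memory, discarded on exit
ds_s3     = deeplake.empty(
    's3://my-bucket/my_dataset',                                   # S3 or any S3-compatible store
    creds={'aws_access_key_id': '...', 'aws_secret_access_key': '...'},
)
ds_cloud  = deeplake.empty('hub://my_org/my_dataset')              # Activeloop-managed storage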
Key Features
Optimized HNSW Algorithm Implementation
- Deep Lake 3.7.1 introduces a unique implementation of the HNSW (Hierarchical Navigable Small World) ANN search algorithm
- Accelerated index creation through intelligent memory utilization and multithreading
- Based on the Hnswlib package with Deep Lake's proprietary optimizations
- Integrated fast filtering based on metadata, text, and other attributes
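In practice the index is used through the high-level VectorStore API rather than configured by hand. The sketch below assumes the default tensor names (text, metadata, embedding) and a placeholder local path, and shows an ANN search combined with a metadata filter.

import numpy as np
from deeplake import VectorStore

vs = VectorStore(path='./hnsw_demo_store')

# Populate the default tensors
vs.add(
    text=[f"Document {i}" for i in range(1000)],
    metadata=[{"category": f"cat_{i % 5}"} for i in range(1000)],
    embedding=np.random.randn(1000, 768).astype(np.float32),
)

# ANN search restricted to one metadata category
results = vs.search(
    embedding=np.random.randn(768).astype(np.float32),
    k=10,
    filter={"metadata": {"category": "cat_1"}},
)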
Performance and Scalability
- Achieves sub-second vector search for up to 35 million embeddings
- Automatically switches between exact (linear) search for datasets under 100,000 embeddings and ANN search for larger ones
- Distributes data across object storage, attached storage (on-disk), and RAM
- 80% cost reduction compared to other vector databases
Data Management Capabilities
- Multimodal Support: Handles diverse data types including embeddings, images, videos, audio, text, DICOM, PDFs, and more
- Native Compression: Stores data in formats like JPEG, PNG, MP4 while providing NumPy-like array access
- Version Control: Built-in Git-like version control system for data
- Lazy Loading: Loads only required data on-demand, optimizing memory efficiency
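Version control and lazy loading are both exposed directly on the dataset object. The following is a short sketch (paths, tensor names, and commit messages are illustrative): commits, branches, and checkouts behave much like Git, and indexing a tensor reads only the requested sample from storage.

import deeplake
import numpy as np

ds = deeplake.empty('./versioned_dataset', overwrite=True)
ds.create_tensor('embeddings', htype='embedding', dtype=np.float32)

# First version of the data
ds.extend({'embeddings': np.random.randn(100, 768).astype(np.float32)})
first_commit = ds.commit('initial 100 embeddings')

# Second version on a separate branch
ds.checkout('experiment', create=True)
ds.extend({'embeddings': np.random.randn(50, 768).astype(np.float32)})
ds.commit('add 50 more embeddings')

# Inspect history and return to the first version
ds.log()
ds.checkout(first_commit)

# Lazy loading: only this one sample is read from storage
single_embedding = ds.embeddings[10].numpy()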
Integration and Ecosystem
- ML Frameworks: Built-in dataloaders for PyTorch and TensorFlow
- LLM Tools: Seamless integration with LangChain and LlamaIndex
- Cloud Storage: Compatible with S3, GCP, Azure, MinIO and other S3-compatible storage
- Visualization: Instant dataset visualization through Deep Lake App
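Cloud storage and framework integration go through the same dataset object. Below is a sketch of opening a dataset from S3-compatible storage and handing it to PyTorch or TensorFlow; the bucket name, credentials, and endpoint are placeholders.

import deeplake

ds = deeplake.load(
    's3://my-bucket/my_dataset',
    creds={
        'aws_access_key_id': '...',
        'aws_secret_access_key': '...',
        # 'endpoint_url': 'http://localhost:9000',  # MinIO or other S3-compatible stores
    },
    read_only=True,
)

# Built-in dataloaders for both major frameworks
torch_loader = ds.pytorch(batch_size=32, shuffle=True, num_workers=2)
tf_dataset = ds.tensorflow()  # returns a tf.data.Dataset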
Tensor Query Language (TQL)
SQL-like query language optimized for machine learning datasets:
- Vector similarity search using COSINE_SIMILARITY and L2_NORM
- Text semantic search with BM25_SIMILARITY
- Complex filtering, joins, sorting, and pagination
- Support for synchronous/asynchronous execution
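The sketch below assumes a dataset ds with embeddings, text, and metadata tensors like the one built in the Basic Usage example further down; note that TQL execution may require Deep Lake's compute engine, and BM25_SIMILARITY availability depends on the Deep Lake version.

import numpy as np

query_vec = np.random.randn(768).astype(np.float32)

# Combine a metadata filter, vector ranking, and pagination in one TQL statement
view = ds.query(f"""
    SELECT *
    WHERE metadata['category'] == 'cat_1'
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY{query_vec.tolist()}) DESC
    LIMIT 10
""")

# Keyword-style ranking over the text tensor
top_docs = ds.query(
    "SELECT * ORDER BY BM25_SIMILARITY(text, 'vector database performance') DESC LIMIT 5"
)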
Pros and Cons
Pros
- Cost Efficient: 80% cost reduction compared to other vector databases
- High Performance: Sub-second search speed even with 35 million vectors
- Multimodal Support: Unified management of various data types
- Serverless: No infrastructure management, instant deployment
- Version Control: Track data change history
- Flexible Storage: Supports local, cloud, and in-memory storage
- Rich Integrations: Native integration with major ML tools
Cons
- HNSW Details Not Public: Implementation details not fully documented
- Learning Curve: Requires learning TQL and Deep Lake-specific concepts
- Real-time Requirements: Not suitable for applications requiring millisecond-level responses
- Enterprise Features: Some advanced features only available in paid versions
Key Links
Code Examples
Basic Usage
import deeplake
import numpy as np

# Create a local dataset
ds = deeplake.empty('my_dataset', overwrite=True)

# Define tensors
ds.create_tensor('embeddings', htype='embedding', dtype=np.float32)
ds.create_tensor('text', htype='text')
ds.create_tensor('metadata', htype='json')

# Add data (extend() appends multiple rows at once; append() takes a single sample)
embeddings = np.random.randn(1000, 768).astype(np.float32)
texts = [f"Document {i}" for i in range(1000)]
metadata = [{"category": f"cat_{i%5}", "id": i} for i in range(1000)]
ds.extend({
    'embeddings': embeddings,
    'text': texts,
    'metadata': metadata
})

# Vector search via TQL
query_embedding = np.random.randn(768).astype(np.float32)
results = ds.query(
    f"SELECT * ORDER BY COSINE_SIMILARITY(embeddings, ARRAY{query_embedding.tolist()}) DESC LIMIT 10"
)
LangChain Integration
# For newer LangChain releases these classes live in langchain_community / langchain_openai
from langchain.vectorstores import DeepLake
from langchain.embeddings import OpenAIEmbeddings

# Initialize the Deep Lake vector store
embeddings = OpenAIEmbeddings()
vector_store = DeepLake(
    dataset_path="./my_deeplake_db",  # use "hub://<org>/<dataset>" plus runtime={"tensor_db": True} for the managed Tensor Database
    embedding=embeddings,
)

# Add documents
texts = ["Deep Lake is a high-performance vector database",
         "It achieves fast search through the HNSW algorithm"]
metadata = [{"source": "doc1"}, {"source": "doc2"}]
vector_store.add_texts(texts=texts, metadatas=metadata)

# Similarity search
query = "About vector database performance"
docs = vector_store.similarity_search(query, k=5)

# Search with metadata filtering (filter keys map to tensor names)
docs_filtered = vector_store.similarity_search(
    query,
    k=5,
    filter={"metadata": {"source": "doc1"}}
)
Managing Multimodal Data
import deeplake
import torch
import torchvision
from PIL import Image

# Create a multimodal dataset
ds = deeplake.empty('multimodal_dataset', overwrite=True)
ds.create_tensor('images', htype='image', sample_compression='jpeg')
ds.create_tensor('image_embeddings', htype='embedding')
ds.create_tensor('captions', htype='text')
ds.create_tensor('audio', htype='audio', sample_compression='mp3')  # audio tensors require a sample compression

# Add an image, its embedding, and a caption
# ('model' is assumed to be a pre-trained image encoder, e.g. a CLIP-style model)
image_path = 'example.jpg'
image = Image.open(image_path)
image_embedding = model.encode_image(image)
caption = "Beautiful landscape photo"
ds.append({
    'images': deeplake.read(image_path),  # reads the file without recompressing
    'image_embeddings': image_embedding,
    'captions': caption
}, skip_ok=True)  # skip tensors (here: audio) not present in this sample

# Create a PyTorch dataloader; tensors listed in the transform dict are returned in each batch
dataloader = ds.pytorch(
    batch_size=32,
    shuffle=True,
    num_workers=4,
    transform={
        'images': torchvision.transforms.Compose([
            torchvision.transforms.ToTensor(),  # Deep Lake yields NumPy arrays, so convert before resizing
            torchvision.transforms.Resize(224),
        ]),
        'captions': None,  # pass captions through unchanged
    }
)

# Train model
for batch in dataloader:
    images = batch['images']
    captions = batch['captions']
    # Training process...