Apache Solr

Overview

Apache Solr is a high-performance, open-source search platform built on Apache Lucene. It provides enterprise-grade full-text, faceted, and distributed search, and it is widely used on large-scale websites and in enterprise applications.

Details

Key Features

  • Lucene-Based: High-performance search engine built on the powerful Apache Lucene search library
  • SolrCloud: Distributed architecture with horizontal scaling, coordinated by Apache ZooKeeper
  • Rich Search Capabilities: Full-text search, faceted search, proximity search, fuzzy search, range search
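  • Multi-Format Support: Indexes JSON, XML, and CSV directly, and rich documents such as PDF and Word through the Apache Tika-based extraction handler (Solr Cell)
  • Near Real-Time (NRT) Indexing: Documents become searchable shortly after indexing, once a soft commit opens a new searcher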
  • Machine Learning Integration: Learning to Rank (LTR), classification, and clustering features
  • Vector Search: Dense vector fields for semantic search capabilities
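
The proximity, fuzzy, and range searches listed above map onto the standard Lucene query parser syntax. The following is a minimal sketch of how such query strings are formed; the helper names are hypothetical, and escaping of special characters is left out.

```javascript
// Hypothetical helpers that format query strings in standard Lucene query syntax.
// They only build strings; special characters in the input are not escaped.

// Proximity: terms within `slop` positions of each other, e.g. "apache solr"~5
function proximity(phrase, slop) {
  return `"${phrase}"~${slop}`;
}

// Fuzzy: match terms within an edit distance of 0, 1, or 2, e.g. roam~1
function fuzzy(term, maxEdits = 2) {
  return `${term}~${maxEdits}`;
}

// Range: square brackets mark inclusive bounds, e.g. price:[0 TO 100]
function range(field, from, to) {
  return `${field}:[${from} TO ${to}]`;
}

console.log(proximity('apache solr', 5)); // "apache solr"~5
console.log(fuzzy('roam', 1));            // roam~1
console.log(range('price', 0, 100));      // price:[0 TO 100]
```

Each string can be passed as the q parameter of a search request after URL encoding.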

Architecture

Solr consists of the following main components:

  • Solr Cores: Basic units that manage search indexes and configurations
  • Schema: Structure definition including field definitions, analyzers, and index settings
  • SolrCloud: Cluster management and inter-node coordination via Apache ZooKeeper
  • Request Handlers: Processing search, update, and admin API calls
  • Analyzer Chains: Combinations of tokenizers and filters for text processing
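
To make the analyzer chain idea concrete, here is a conceptual sketch of one tokenizer followed by a sequence of filters. It is a simplified model for illustration only, not Solr's or Lucene's implementation, and every function name in it is hypothetical.

```javascript
// Conceptual model of an analyzer chain: one tokenizer followed by filters.
// This is NOT Solr's implementation; all names here are hypothetical.

// Tokenizer: split text on whitespace and drop empty tokens
const whitespaceTokenizer = (text) => text.split(/\s+/).filter(Boolean);

// Filters: each takes a token array and returns a new token array
const lowercaseFilter = (tokens) => tokens.map((t) => t.toLowerCase());
const stopFilter = (stopWords) => (tokens) =>
  tokens.filter((t) => !stopWords.has(t));

// Compose a tokenizer and an ordered list of filters into one analyze() function
function buildAnalyzer(tokenizer, filters) {
  return (text) => filters.reduce((tokens, f) => f(tokens), tokenizer(text));
}

const analyze = buildAnalyzer(whitespaceTokenizer, [
  lowercaseFilter,
  stopFilter(new Set(['the', 'a', 'of'])),
]);

console.log(analyze('The Power of Apache Solr')); // [ 'power', 'apache', 'solr' ]
```

Filter order matters: the stop filter here only matches lowercase words, so it has to run after the lowercase filter.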

Advantages and Disadvantages

Advantages

  • Rich Full-Text Search Features: Highlighting, faceting, spell checking, auto-complete functionality
  • Excellent Scalability: Horizontal scaling and high availability through SolrCloud
  • Enterprise-Ready: Comprehensive security, monitoring, and management features
  • True Open Source: Released under the Apache License 2.0 and developed as an Apache Software Foundation project
  • Mature Ecosystem: Rich plugins, tools, and community support
  • Schema-less Capability: Flexible data structures through schemaless mode (field-type guessing) and dynamic field definitions
  • Multi-language Support: Analyzers and stemmers for over 50 languages

Disadvantages

  • Configuration Complexity: Initial setup and optimization require deep knowledge
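  • Memory Consumption: Large indexes require substantial JVM heap and operating-system page cache
  • Learning Curve: Working with XML-based configuration files requires some knowledge of Lucene concepts
  • Real-time Limitations: Updates are near real-time rather than instantaneous; documents become searchable only after a commit
  • JSON Query Support: The JSON Request API is a later addition on top of the original URL-parameter query syntax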
  • Developer Experience: More complex compared to Elasticsearch's REST API
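
For comparison with the URL-parameter style, the JSON-based query API mentioned above accepts a request body along the lines sketched here. The helper `buildJsonQuery` and the field names (`title`, `cat`, `price`) are assumptions for illustration.

```javascript
// Sketch of a request body for Solr's JSON Request API (POST /solr/<core>/query).
// buildJsonQuery is a hypothetical helper; the field names are illustrative.
function buildJsonQuery({ query, filters = [], offset = 0, limit = 10 }) {
  return {
    query,            // main query string (equivalent to q)
    filter: filters,  // filter queries (equivalent to fq)
    offset,           // first result to return (equivalent to start)
    limit,            // number of results (equivalent to rows)
  };
}

const body = buildJsonQuery({
  query: 'title:solr',
  filters: ['cat:electronics', 'price:[0 TO 100]'],
});

console.log(JSON.stringify(body, null, 2));
```

The resulting JSON can be sent with a Content-Type of application/json, for example with curl -d or fetch.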

Usage Examples

Installation & Setup

# Start Solr (techproducts sample)
bin/solr start -e techproducts

# Create a new core
bin/solr create -c mycore

# Start in SolrCloud mode
bin/solr start -c -p 8983

Basic Operations (Indexing & Search)

# Add documents (JSON)
curl -X POST "http://localhost:8983/solr/techproducts/update?commit=true" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "id": "doc1",
      "title": "Apache Solr Search Engine",
      "content": "Provides high-performance full-text search"
    }
  ]'

# Basic search
curl "http://localhost:8983/solr/techproducts/select?q=*:*"

# Field-specific search
curl "http://localhost:8983/solr/techproducts/select?q=title:Apache"
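
Queries containing spaces, quotes, or other special characters need URL encoding before they are sent. A minimal sketch of building a /select URL in JavaScript, assuming the same host and core as the curl examples above; `buildSelectUrl` is a hypothetical helper.

```javascript
// Build a /select URL with properly encoded parameters.
// buildSelectUrl is a hypothetical helper; host and core mirror the curl examples.
function buildSelectUrl(baseUrl, core, params) {
  const search = new URLSearchParams();
  for (const [key, value] of Object.entries(params)) {
    // Repeat the parameter when several values are given (e.g. multiple fq)
    for (const v of Array.isArray(value) ? value : [value]) {
      search.append(key, String(v));
    }
  }
  return `${baseUrl}/solr/${core}/select?${search.toString()}`;
}

const url = buildSelectUrl('http://localhost:8983', 'techproducts', {
  q: 'title:"Apache Solr"',
  rows: 5,
  fl: 'id,title',
});
console.log(url);
```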

Schema Design

<!-- schema.xml - Field definitions (classic schema file) -->
<!-- The field types used below (string, text_general, pint) must be defined
     with <fieldType> elements, as in Solr's _default configset -->
<schema name="example" version="1.5">
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="content" type="text_general" indexed="true" stored="true"/>
  <field name="category" type="string" indexed="true" stored="true" multiValued="true"/>
  
  <!-- Dynamic fields: any field whose name matches the pattern uses this definition -->
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <dynamicField name="*_i" type="pint" indexed="true" stored="true"/>
  
  
  <uniqueKey>id</uniqueKey>
</schema>
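
When a core uses the managed schema, fields are typically added through the Schema API rather than by editing XML. The sketch below builds an add-field command for that API; `addFieldCommand` is a hypothetical helper, and the defaults it applies are assumptions chosen to match the field definitions above.

```javascript
// Build a Schema API command that adds one field.
// POST the result as JSON to /solr/<core>/schema.
// addFieldCommand is a hypothetical helper, not part of any Solr client.
function addFieldCommand(name, type, options = {}) {
  return {
    'add-field': { name, type, indexed: true, stored: true, ...options },
  };
}

const cmd = addFieldCommand('category', 'string', { multiValued: true });
console.log(JSON.stringify(cmd));
```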

Faceted Search

# Facet on category (the techproducts schema names this field "cat")
curl "http://localhost:8983/solr/techproducts/select?q=*:*&facet=true&facet.field=cat"

# Range faceting on price: buckets of 100 from 0 to 1000
curl "http://localhost:8983/solr/techproducts/select?q=*:*&facet=true&facet.range=price&facet.range.start=0&facet.range.end=1000&facet.range.gap=100"

# Facet on several fields, restricted by a filter query (fq)
curl "http://localhost:8983/solr/techproducts/select?q=*:*&facet=true&facet.field=cat&facet.field=manu_exact&fq=cat:electronics"
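
In the default JSON response, the counts for each facet field come back as a flat array that alternates values and counts. A small sketch of turning that into objects follows; the sample response is made up for illustration and `parseFacetField` is a hypothetical helper.

```javascript
// Convert Solr's default flat facet list [value, count, value, count, ...]
// into an array of { value, count } objects.
function parseFacetField(flat) {
  const result = [];
  for (let i = 0; i + 1 < flat.length; i += 2) {
    result.push({ value: flat[i], count: flat[i + 1] });
  }
  return result;
}

// Made-up sample shaped like the facet_counts section of a Solr response
const sampleResponse = {
  facet_counts: {
    facet_fields: { cat: ['electronics', 12, 'memory', 3] },
  },
};

console.log(parseFacetField(sampleResponse.facet_counts.facet_fields.cat));
```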

Practical Example

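// JavaScript Solr search using the solr-client package.
// Assumes a core named "books" (hypothetical). solr-client 0.9+ returns Promises;
// older releases instead take a callback as the last argument to search().
const solr = require('solr-client');
const client = solr.createClient({ host: 'localhost', port: 8983, core: 'books' });

// Build the query with the client's query builder
const query = client.query()
  .q('category:books AND author:"Haruki Murakami"')
  .facet({ on: true, field: ['genre', 'publisher'] })
  .start(0)
  .rows(10)
  .sort({ score: 'desc' });

// Execute search
client.search(query)
  .then((obj) => {
    console.log(obj.response.docs);
    console.log(obj.facet_counts);
  })
  .catch((err) => {
    console.error(err);
  });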

Best Practices

<!-- solrconfig.xml - Performance optimization -->
<config>
  <!-- Cache configuration -->
  <query>
    <filterCache class="solr.search.CaffeineCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>
    
    <queryResultCache class="solr.search.CaffeineCache"
                      size="512"
                      initialSize="512"
                      autowarmCount="0"/>
  </query>
  
  <!-- Auto-commit configuration -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>30000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>1000</maxTime>
    </autoSoftCommit>
  </updateHandler>
  
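  <!-- Learning to Rank feature logging (needs the LTR module enabled;
       fvCacheName refers to a user cache defined in the <query> section) -->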
  <transformer name="features" class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory">
    <str name="fvCacheName">QUERY_DOC_FV</str>
  </transformer>
</config>
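
With auto-commit configured on the server as above, clients generally do not need to send commit=true with every update. A minimal sketch, assuming the standard commitWithin update parameter (in milliseconds); `buildUpdateUrl` is a hypothetical helper.

```javascript
// Build an update URL that lets Solr schedule the commit
// instead of forcing a commit on every request.
// buildUpdateUrl is a hypothetical helper; commitWithinMs is in milliseconds.
function buildUpdateUrl(baseUrl, core, commitWithinMs) {
  const params = new URLSearchParams();
  if (commitWithinMs !== undefined) {
    params.set('commitWithin', String(commitWithinMs));
  }
  const qs = params.toString();
  return `${baseUrl}/solr/${core}/update${qs ? `?${qs}` : ''}`;
}

console.log(buildUpdateUrl('http://localhost:8983', 'mycore', 5000));
// http://localhost:8983/solr/mycore/update?commitWithin=5000
console.log(buildUpdateUrl('http://localhost:8983', 'mycore'));
// http://localhost:8983/solr/mycore/update
```

Omitting the parameter leaves commit timing entirely to the autoCommit and autoSoftCommit settings.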