Apache Solr
Overview
Apache Solr is a high-performance, open-source search platform built on Apache Lucene. It provides enterprise-grade full-text search, faceted search, and distributed search capabilities, and is widely used in large-scale websites and enterprise applications.
Details
Key Features
- Lucene-Based: High-performance search engine built on the powerful Apache Lucene search library
- SolrCloud: Distributed architecture with horizontal scaling, coordinated through Apache ZooKeeper
- Rich Search Capabilities: Full-text search, faceted search, proximity search, fuzzy search, range search
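- Multi-Format Support: Indexes JSON, XML, and CSV directly, and extracts text from PDF, Word, and other rich document formats through Solr Cell (Apache Tika)
- Near Real-Time Indexing: Added or updated documents become searchable shortly after indexing, once a soft commit makes them visible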
- Machine Learning Integration: Learning to Rank (LTR), classification, and clustering features
- Vector Search: Dense vector fields and k-nearest-neighbor queries for semantic search
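The proximity, fuzzy, and range searches listed above correspond to operators in Solr's standard (Lucene) query parser. The following is a minimal sketch of how such query strings could be composed. The helper names (`phrase`, `proximity`, `fuzzy`, `range`) and the field names are hypothetical and serve only to illustrate the syntax.

```javascript
// Hypothetical helpers that compose standard query parser syntax.
// They only build strings; nothing here talks to a Solr server.

// Quote a phrase, escaping embedded backslashes and double quotes.
function phrase(text) {
  return '"' + text.replace(/(["\\])/g, '\\$1') + '"';
}

// Proximity search: the terms must occur within `slop` positions of each other.
function proximity(field, text, slop) {
  return `${field}:${phrase(text)}~${slop}`;
}

// Fuzzy search: match terms within `maxEdits` edits (0, 1, or 2) of `term`.
function fuzzy(field, term, maxEdits) {
  return `${field}:${term}~${maxEdits}`;
}

// Inclusive range search.
function range(field, from, to) {
  return `${field}:[${from} TO ${to}]`;
}

const query = [
  proximity('title', 'apache search', 3),
  fuzzy('title', 'solar', 1),
  range('price', 0, 100),
].join(' AND ');

console.log(query);
// title:"apache search"~3 AND title:solar~1 AND price:[0 TO 100]
```

The resulting string would be sent as the `q` parameter of a search request.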
Architecture
Solr consists of the following main components:
- Solr Cores: Basic units that manage search indexes and configurations
- Schema: Structure definition including field definitions, analyzers, and index settings
- SolrCloud: Cluster management and inter-node coordination via Apache ZooKeeper
- Request Handlers: Processing search, update, and admin API calls
- Analyzer Chains: Combinations of tokenizers and filters for text processing
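To make the analyzer-chain idea concrete, here is a toy model in plain JavaScript: one tokenizer followed by an ordered list of token filters. It is a conceptual sketch only. Real Solr analyzers are Lucene components configured in the schema, and the function names and stop-word list below are made up for illustration.

```javascript
// Toy model of an analyzer chain: tokenizer -> filter -> filter ...
// Not Solr code; all names here are illustrative.

// Tokenizer: split on anything that is not a letter or digit.
const simpleTokenizer = (text) => text.split(/[^\p{L}\p{N}]+/u).filter(Boolean);

// Token filters: each takes a token list and returns a new token list.
const lowerCaseFilter = (tokens) => tokens.map((t) => t.toLowerCase());
const STOP_WORDS = new Set(['a', 'an', 'the', 'on', 'of']);
const stopFilter = (tokens) => tokens.filter((t) => !STOP_WORDS.has(t));

// An analyzer applies the tokenizer once, then each filter in order.
function makeAnalyzer(tokenizer, filters) {
  return (text) => filters.reduce((tokens, f) => f(tokens), tokenizer(text));
}

const analyze = makeAnalyzer(simpleTokenizer, [lowerCaseFilter, stopFilter]);

console.log(analyze('The Search Platform built on Apache Lucene'));
// [ 'search', 'platform', 'built', 'apache', 'lucene' ]
```

Filter order matters in this model: with the stop filter placed before the lowercase filter, "The" would not match the lowercase stop-word list and would survive as "the".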
Advantages and Disadvantages
Advantages
- Rich Full-Text Search Features: Highlighting, faceting, spell checking, auto-complete functionality
- Excellent Scalability: Horizontal scaling and high availability through SolrCloud
- Enterprise-Ready: Comprehensive security, monitoring, and management features
- True Open Source: Completely open license under Apache License 2.0
- Mature Ecosystem: Rich plugins, tools, and community support
- Flexible Schema: Dynamic fields assign types by field-name pattern, and an optional schemaless mode guesses field types from incoming documents
- Multi-language Support: Analyzers and stemmers for over 50 languages
Disadvantages
- Configuration Complexity: Initial setup and optimization require deep knowledge
- Memory Consumption: High memory usage for large datasets
- Learning Curve: XML-based configuration files and Lucene knowledge required
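- Real-time Limitations: Search is near real-time rather than real-time; updates become visible only after a commit opens a new searcher
- JSON Query Support: The JSON Request API was added later than the original parameter-based query syntax
- Developer Experience: Often regarded as more complex to work with than Elasticsearch's REST API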
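For comparison with parameter-style queries, a request body for Solr's JSON Request API could be assembled as follows. This is a sketch, not a complete client: the `techproducts` core, the field names, and the `buildJsonRequest` helper are assumptions, and the code only prints the request instead of sending it.

```javascript
// Build a body for Solr's JSON Request API (normally POSTed to /solr/<core>/query).
// `buildJsonRequest` is a hypothetical helper; field names are illustrative.
function buildJsonRequest({ query, filters = [], offset = 0, limit = 10, sort }) {
  const body = { query, offset, limit };
  if (filters.length > 0) body.filter = filters; // plays the role of repeated fq parameters
  if (sort) body.sort = sort;
  return body;
}

const body = buildJsonRequest({
  query: 'title:Apache',
  filters: ['inStock:true'],
  sort: 'score desc',
});

console.log('POST http://localhost:8983/solr/techproducts/query');
console.log(JSON.stringify(body, null, 2));
```

In the JSON Request API, `query`, `filter`, `offset`, and `limit` play the roles of the `q`, `fq`, `start`, and `rows` parameters.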
Usage Examples
Installation & Setup
# Start Solr with the techproducts example (creates and populates a techproducts core)
bin/solr start -e techproducts
# Create a new core
bin/solr create -c mycore
# Alternatively, start in SolrCloud mode on port 8983
bin/solr start -c -p 8983
Basic Operations (Indexing & Search)
# Add documents (JSON)
curl -X POST "http://localhost:8983/solr/techproducts/update?commit=true" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "id": "doc1",
      "title": "Apache Solr Search Engine",
      "content": "Provides high-performance full-text search"
    }
  ]'
# Basic search
curl "http://localhost:8983/solr/techproducts/select?q=*:*"
# Field-specific search
curl "http://localhost:8983/solr/techproducts/select?q=title:Apache"
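When the same searches are issued from application code, the query parameters need URL encoding. Below is a small sketch that builds a `/select` URL with the standard `URL` and `URLSearchParams` globals. The `buildSelectUrl` helper is hypothetical, and the base URL and core name are taken from the curl examples.

```javascript
// Build a /select URL with properly encoded parameters.
// `buildSelectUrl` is a hypothetical helper; URL and URLSearchParams are
// standard globals in Node.js and browsers.
function buildSelectUrl(baseUrl, core, params) {
  const url = new URL(`${baseUrl}/${core}/select`);
  for (const [name, value] of Object.entries(params)) {
    // Arrays become repeated parameters, e.g. fq=a&fq=b.
    for (const v of [].concat(value)) {
      url.searchParams.append(name, String(v));
    }
  }
  return url.toString();
}

const url = buildSelectUrl('http://localhost:8983/solr', 'techproducts', {
  q: 'title:Apache',
  rows: 5,
  fl: 'id,title',
});

console.log(url);
// http://localhost:8983/solr/techproducts/select?q=title%3AApache&rows=5&fl=id%2Ctitle
```

The resulting string can be passed to any HTTP client.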
Schema Design
<!-- schema.xml (named managed-schema in recent Solr versions) - Field definitions -->
<!-- The fieldType definitions for string, text_general, and pint are omitted from this excerpt -->
<schema name="example" version="1.5">
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="content" type="text_general" indexed="true" stored="true"/>
  <field name="category" type="string" indexed="true" stored="true" multiValued="true"/>
  <!-- Dynamic fields: any field whose name matches the pattern uses this definition -->
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <!-- pint is the integer type name used in Solr's default configsets -->
  <dynamicField name="*_i" type="pint" indexed="true" stored="true"/>
  <uniqueKey>id</uniqueKey>
</schema>
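How a field name ends up with a type can be pictured with a toy resolver: explicitly declared fields are checked first, then dynamic-field patterns. This is a deliberately simplified sketch and not Solr's actual resolution logic. The field table, the patterns, and the `resolveType` helper are illustrative assumptions.

```javascript
// Toy resolver: explicit fields win; otherwise the first matching
// dynamic-field pattern decides the type. Simplified on purpose.
const explicitFields = new Map([
  ['id', 'string'],
  ['title', 'text_general'],
]);
const dynamicFields = [
  { pattern: '*_s', type: 'string' },
  { pattern: '*_txt', type: 'text_general' },
  { pattern: 'attr_*', type: 'text_general' },
];

// A pattern carries a single '*' at its start or at its end.
function matches(pattern, name) {
  return pattern.startsWith('*')
    ? name.endsWith(pattern.slice(1))
    : name.startsWith(pattern.slice(0, -1));
}

function resolveType(name) {
  if (explicitFields.has(name)) return explicitFields.get(name);
  const hit = dynamicFields.find((d) => matches(d.pattern, name));
  return hit ? hit.type : null; // null: no definition applies to this name
}

console.log(resolveType('title'));     // text_general
console.log(resolveType('author_s'));  // string
console.log(resolveType('attr_size')); // text_general
console.log(resolveType('rating'));    // null
```

A name that resolves to `null` here corresponds to a field that has no definition in the schema.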
Faceted Search
# Category faceting (cat is the category field in the techproducts example)
curl "http://localhost:8983/solr/techproducts/select?q=*:*&facet=true&facet.field=cat"
# Range faceting on price, in buckets of 100 from 0 to 1000
curl "http://localhost:8983/solr/techproducts/select?q=*:*&facet=true&facet.range=price&facet.range.start=0&facet.range.end=1000&facet.range.gap=100"
# Faceting on multiple fields, restricted by a filter query
curl "http://localhost:8983/solr/techproducts/select?q=*:*&facet=true&facet.field=cat&facet.field=manu_id_s&fq=cat:electronics"
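In a JSON response with default settings, each entry under `facet_counts.facet_fields` is a flat list that alternates between a value and its count. The sketch below converts that list into value/count pairs. The sample data is made up, `cat` is the category field of the techproducts example, and `parseFacetField` is a hypothetical helper.

```javascript
// Convert Solr's flat facet list [value, count, value, count, ...]
// into an array of { value, count } objects.
function parseFacetField(flat) {
  const buckets = [];
  for (let i = 0; i + 1 < flat.length; i += 2) {
    buckets.push({ value: flat[i], count: flat[i + 1] });
  }
  return buckets;
}

// Made-up fragment shaped like the facet_counts section of a response.
const facetCounts = {
  facet_fields: {
    cat: ['electronics', 12, 'currency', 4, 'memory', 3],
  },
};

const buckets = parseFacetField(facetCounts.facet_fields.cat);
for (const bucket of buckets) {
  console.log(`${bucket.value}: ${bucket.count}`);
}
// electronics: 12
// currency: 4
// memory: 3
```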
Practical Example
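// JavaScript Solr search (using the solr-client npm package)
// Assumes a callback-style solr-client release; newer releases return a Promise from search()
const solr = require('solr-client');
// Placeholder connection settings; point them at your own host and core
const client = solr.createClient({ host: 'localhost', port: 8983, core: 'books' });
// Execute search: raw Solr parameters, including q, are passed as one object
client.search({
  q: 'category:books AND author:"Haruki Murakami"',
  facet: true,
  'facet.field': ['genre', 'publisher'],
  start: 0,
  rows: 10,
  sort: 'score desc'
}, function(err, obj) {
  if (err) {
    console.error(err);
  } else {
    console.log(obj.response.docs);
    console.log(obj.facet_counts);
  }
});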
Best Practices
<!-- solrconfig.xml - Performance optimization -->
<config>
  <!-- Cache configuration: the sizes are starting points; tune them against observed hit ratios -->
  <query>
    <filterCache class="solr.search.CaffeineCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>
    <queryResultCache class="solr.search.CaffeineCache"
                      size="512"
                      initialSize="512"
                      autowarmCount="0"/>
  </query>
  <!-- Auto-commit configuration -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- Hard commit at most 30 s after an update, without opening a new searcher -->
    <autoCommit>
      <maxTime>30000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <!-- Soft commit at most 1 s after an update, so new documents become searchable -->
    <autoSoftCommit>
      <maxTime>1000</maxTime>
    </autoSoftCommit>
  </updateHandler>
  <!-- Learning to Rank: feature logging transformer. The LTR module, its ltr query parser,
       and the QUERY_DOC_FV cache must be configured as well -->
  <transformer name="features" class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory">
    <str name="fvCacheName">QUERY_DOC_FV</str>
  </transformer>
</config>
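The interaction between the two auto-commit settings can be approximated with a small model: a hard commit flushes updates to the index on disk, while visibility to searches requires a commit that opens a new searcher. The sketch below is a rough estimate under that assumption. The `commitDeadlines` helper is hypothetical and ignores explicit commits and other triggers such as `maxDocs`.

```javascript
// Rough model of the commit settings: given the time a document was added
// (ms), estimate the latest time it becomes searchable and the latest time
// a hard commit flushes it to the index on disk.
function commitDeadlines(addedAtMs, { autoCommitMs, openSearcher, autoSoftCommitMs }) {
  const hardCommitAt = addedAtMs + autoCommitMs;
  const candidates = [];
  // A non-positive soft commit interval is treated as "disabled".
  if (autoSoftCommitMs > 0) candidates.push(addedAtMs + autoSoftCommitMs);
  // A hard commit only makes documents visible when it opens a searcher.
  if (openSearcher) candidates.push(hardCommitAt);
  return {
    hardCommitAt,
    searchableAt: candidates.length > 0 ? Math.min(...candidates) : null,
  };
}

// Values taken from the configuration above.
const deadlines = commitDeadlines(0, {
  autoCommitMs: 30000,
  openSearcher: false,
  autoSoftCommitMs: 1000,
});

console.log(deadlines.searchableAt); // 1000
console.log(deadlines.hardCommitAt); // 30000
```

With these values, a document is expected to become searchable within about one second and to be flushed by a hard commit within about thirty seconds.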