Deep Understanding of Database Index Optimization

DatabaseIndexSQLPerformance TuningPostgreSQL

Overview

In database performance, index design and optimization are among the most critical elements. This article provides a deep understanding of major index structures such as B-tree, Hash, and Bitmap, along with practical optimization techniques leveraging the latest features of PostgreSQL 16 and MySQL 8.4. From detailed EXPLAIN PLAN analysis to specific techniques that dramatically improve actual query performance, we comprehensively cover the knowledge necessary for database engineers.

Fundamental Theory of Indexes

B-tree Index Structure and Characteristics

B-tree indexes are the most versatile index structure adopted as the default index type in both PostgreSQL and MySQL.

-- PostgreSQL: Creating a B-tree index
CREATE INDEX idx_users_email ON users(email);

-- Creating a composite index
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);

-- Partial index (PostgreSQL-specific)
CREATE INDEX idx_active_users ON users(email) 
WHERE status = 'active';

B-tree characteristics:

Balanced tree structure: All leaf nodes are at the same depth
O(log n) search performance: Doubling data only adds one search step
Range query support: Supports <, <=, =, >=, >, BETWEEN, IN, IS NULL, IS NOT NULL
Sorted data: Accelerates ORDER BY clauses

-- Example query where B-tree index is effective
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM orders 
WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31'
ORDER BY created_at;

Hash Index Characteristics and Use Cases

Hash indexes are high-speed index structures specialized for equality searches. They have been significantly improved since PostgreSQL 10, making them practical for production environments.

-- PostgreSQL: Creating a Hash index
CREATE INDEX idx_products_sku_hash ON products USING hash(sku);

-- MySQL supports Hash indexes only for MEMORY tables
CREATE TABLE memory_cache (
    cache_key VARCHAR(255) PRIMARY KEY,
    cache_value TEXT
) ENGINE=MEMORY;

Hash index advantages:

Fast equality searches: Particularly effective for long string columns
Smaller index size: About 30-50% smaller than B-tree
Constant access time: Ideal O(1) performance

Limitations:

No range queries
Cannot be used for sorting operations
Not WAL-logged in PostgreSQL versions before 10

Bitmap Index Concepts and Utilization

PostgreSQL's Bitmap Index Scan enables efficient query execution by combining multiple indexes.

-- Create multiple single-column indexes
CREATE INDEX idx_users_status ON users(status);
CREATE INDEX idx_users_country ON users(country);
CREATE INDEX idx_users_age_range ON users(age_range);

-- Query that uses Bitmap Index Scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users
WHERE status = 'active'
  AND country = 'JP'
  AND age_range = '20-29';

Latest Index Features in PostgreSQL 16

Parallel Index Creation

PostgreSQL 16 has enhanced parallel processing for index creation.

-- Settings for parallel index creation
SET max_parallel_maintenance_workers = 4;
SET maintenance_work_mem = '1GB';

-- Parallel index creation on large tables
CREATE INDEX CONCURRENTLY idx_large_table_parallel 
ON large_table(column1, column2);

B-tree Deduplication

B-tree deduplication introduced in PostgreSQL 13 significantly reduces index size.

-- Creating an index with deduplication enabled (enabled by default)
CREATE INDEX idx_events_user_id ON events(user_id) 
WITH (deduplicate_items = on);

-- Check index statistics
SELECT 
    schemaname,
    tablename,
    indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexname = 'idx_events_user_id';

Covering Indexes with INCLUDE Clause

Using the INCLUDE clause to promote index-only scans:

-- Creating a covering index
CREATE INDEX idx_users_email_include 
ON users(email) 
INCLUDE (first_name, last_name, created_at);

-- Query that enables index-only scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT email, first_name, last_name, created_at
FROM users
WHERE email = '[email protected]';

Latest Optimization Features in MySQL 8.4

Invisible Indexes

Invisible indexes introduced in MySQL 8.0 enable safer index management.

-- Creating an invisible index
CREATE INDEX idx_test INVISIBLE ON orders(total_amount);

-- Changing index visibility
ALTER TABLE orders ALTER INDEX idx_test VISIBLE;

-- Using invisible index with optimizer hint
SELECT /*+ USE_INDEX(orders idx_test) */ *
FROM orders
WHERE total_amount > 1000;

Descending Indexes

MySQL 8.0 and later support true descending indexes.

-- Creating a descending index
CREATE INDEX idx_logs_timestamp_desc ON logs(timestamp DESC);

-- Combining ascending and descending in composite indexes
CREATE INDEX idx_sales_date_amount ON sales(sale_date ASC, amount DESC);

Function-based Indexes

Advanced optimization with expression indexes:

-- Index using JSON functions
CREATE INDEX idx_json_email 
ON users((JSON_UNQUOTE(JSON_EXTRACT(data, '$.email'))));

-- Index using calculated expressions
CREATE INDEX idx_total_with_tax 
ON orders((price * 1.1));

Detailed EXPLAIN PLAN Analysis

PostgreSQL EXPLAIN ANALYZE

-- Getting detailed execution plan
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS)
SELECT 
    o.order_id,
    o.order_date,
    c.customer_name,
    SUM(oi.quantity * oi.unit_price) as total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= '2024-01-01'
GROUP BY o.order_id, o.order_date, c.customer_name
HAVING SUM(oi.quantity * oi.unit_price) > 1000
ORDER BY total DESC
LIMIT 100;

Important execution plan metrics:

Actual Time: Actual execution time (milliseconds)
Rows: Number of rows processed
Loops: Number of times the node was executed
Buffers: Number of pages read
I/O Timings: Disk I/O time

MySQL EXPLAIN FORMAT=TREE

New EXPLAIN format in MySQL 8.0.16 and later:

-- Display execution plan in tree format
EXPLAIN FORMAT=TREE
SELECT 
    c.customer_name,
    COUNT(o.order_id) as order_count,
    AVG(o.total_amount) as avg_order_value
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.country = 'Japan'
  AND o.order_date >= DATE_SUB(NOW(), INTERVAL 1 YEAR)
GROUP BY c.customer_id
HAVING order_count > 10;

-- Get detailed information in JSON format
EXPLAIN FORMAT=JSON
SELECT ...;

Advanced Index Design Strategies

Optimal Order for Composite Indexes

-- Analyze cardinality and query patterns
WITH index_stats AS (
    SELECT 
        attname as column_name,
        n_distinct,
        correlation
    FROM pg_stats
    WHERE tablename = 'orders'
)
SELECT * FROM index_stats;

-- Creating optimal composite index
-- 1. Equality condition columns first
-- 2. In order of high cardinality
-- 3. Range condition columns last
CREATE INDEX idx_optimal ON orders(
    status,           -- Equality condition, low cardinality
    customer_id,      -- Equality condition, high cardinality  
    created_at        -- Range condition
);

Index Strategy for Partitioned Tables

-- PostgreSQL: Creating partitioned table
CREATE TABLE sales (
    id BIGSERIAL,
    sale_date DATE NOT NULL,
    amount DECIMAL(10,2),
    PRIMARY KEY (id, sale_date)
) PARTITION BY RANGE (sale_date);

-- Creating partitions
CREATE TABLE sales_2024_q1 PARTITION OF sales
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

-- Creating global index
CREATE INDEX idx_sales_amount ON sales(amount);

Index Maintenance Strategy

-- PostgreSQL: Check index bloat
SELECT 
    schemaname,
    tablename,
    indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size,
    indexrelid::regclass as index_oid,
    100 * (pg_relation_size(indexrelid) - pg_relation_size(indexrelid, 'fsm')) 
        / pg_relation_size(indexrelid) as bloat_ratio
FROM pg_stat_user_indexes
WHERE pg_relation_size(indexrelid) > 1024 * 1024  -- Over 1MB
ORDER BY bloat_ratio DESC;

-- Online index rebuild
CREATE INDEX CONCURRENTLY idx_new ON table_name(column);
DROP INDEX CONCURRENTLY idx_old;
ALTER INDEX idx_new RENAME TO idx_old;

Practical Optimization Cases

Case 1: Full-Text Search Optimization

-- PostgreSQL: Full-text search using GIN index
CREATE INDEX idx_articles_fts ON articles 
USING gin(to_tsvector('english', title || ' ' || content));

-- Fast full-text search query
SELECT 
    id,
    title,
    ts_rank(
        to_tsvector('english', title || ' ' || content),
        plainto_tsquery('english', 'database index')
    ) as rank
FROM articles
WHERE to_tsvector('english', title || ' ' || content) 
    @@ plainto_tsquery('english', 'database index')
ORDER BY rank DESC
LIMIT 20;

Case 2: Time-Series Data Optimization

-- Time-series data optimization using BRIN index
CREATE INDEX idx_logs_timestamp_brin ON logs 
USING brin(timestamp) 
WITH (pages_per_range = 128);

-- Combination with partitioning
CREATE TABLE logs_2024_01 PARTITION OF logs
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Time-series query optimization
EXPLAIN (ANALYZE, BUFFERS)
SELECT 
    date_trunc('hour', timestamp) as hour,
    COUNT(*) as log_count,
    AVG(response_time) as avg_response_time
FROM logs
WHERE timestamp >= '2024-01-01'
  AND timestamp < '2024-02-01'
GROUP BY hour
ORDER BY hour;

Case 3: Geospatial Data Optimization

-- Geospatial index using PostGIS extension
CREATE EXTENSION IF NOT EXISTS postgis;

-- Creating GiST index
CREATE INDEX idx_locations_geom ON locations 
USING gist(geom);

-- Optimizing nearest neighbor search
EXPLAIN (ANALYZE, BUFFERS)
SELECT 
    id,
    name,
    ST_Distance(geom, ST_MakePoint(139.7, 35.7)::geography) as distance
FROM locations
WHERE ST_DWithin(geom, ST_MakePoint(139.7, 35.7)::geography, 5000)
ORDER BY distance
LIMIT 10;

Performance Monitoring and Troubleshooting

Monitoring Index Usage

-- PostgreSQL: Detecting unused indexes
SELECT 
    schemaname,
    tablename,
    indexname,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE 'pg_toast%'
ORDER BY pg_relation_size(indexrelid) DESC;

-- MySQL: Checking index statistics
SELECT 
    table_schema,
    table_name,
    index_name,
    cardinality,
    seq_in_index,
    column_name
FROM information_schema.statistics
WHERE table_schema = 'your_database'
ORDER BY table_name, index_name, seq_in_index;

Slow Query Analysis and Improvement

-- PostgreSQL: Analysis using pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT 
    query,
    calls,
    total_exec_time / 1000 as total_sec,
    mean_exec_time / 1000 as mean_sec,
    stddev_exec_time / 1000 as stddev_sec,
    rows
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat_statements%'
ORDER BY mean_exec_time DESC
LIMIT 20;

Conclusion

Database index optimization is not merely a technical task but a crucial design activity directly connected to application performance. By understanding the characteristics of each index type - B-tree, Hash, and Bitmap - and employing a scientific approach using EXPLAIN PLAN, query performance can be dramatically improved.

The latest features in PostgreSQL 16 and MySQL 8.4 enable more advanced optimizations. Continuous monitoring and proper maintenance are essential for maintaining database performance over the long term.

By implementing the techniques and best practices introduced in this article, you can build scalable and high-performance database systems. The key to sustainable system operation is evolving your index strategy as your data grows.