Modern software systems operate on datasets that can easily grow into millions or billions of records. At that scale, the difference between a fast system and a slow one is often determined by how efficiently the database can locate data. Database indexing plays a central role in this process.

For a senior developer, indexing is not simply about running a CREATE INDEX command. It requires understanding internal data structures, query planning, selectivity, storage overhead, and trade-offs between read and write performance. This article explores database indexing in depth, focusing on concepts and insights that are often expected in senior engineering roles and system design discussions.
Table of Contents
What is Database Indexing
Database indexing is a technique used by database management systems to speed up data retrieval operations. Instead of scanning every row in a table, the database uses an auxiliary data structure that maps column values to the locations of rows in the table.
A helpful analogy is the index section of a book. Imagine trying to locate a topic in a 1000-page book without an index. You would have to scan every page until you find the relevant section. However, with an index, you can quickly jump to the exact page where the topic is discussed.
Databases operate similarly. When a table grows very large, scanning every row becomes inefficient. An index allows the database engine to quickly navigate to the rows that match a query condition.
Consider a table called users:
id | name | email | city
Suppose we run the query:
SELECT * FROM users WHERE email = 'alice@example.com';
If the email column is not indexed, the database must perform a full table scan, checking every row until it finds the match.
If the email column is indexed, the database can directly navigate to the matching record using the index structure.
Indexes therefore dramatically improve read performance, but they introduce trade-offs such as additional storage consumption and slower write operations.
Why Indexes Matter in Large Systems
Indexes become critical when dealing with large datasets and high traffic applications.
Imagine a table containing 10 million rows. Without indexing, the database must inspect rows sequentially.
The complexity of such a search is approximately O(n).
When an index is present, the database uses a structured lookup mechanism such as a tree, reducing complexity to approximately O(log n).
This difference becomes enormous as data grows.
For example:
Searching 10 million rows sequentially could require scanning millions of entries.
Searching via an index may require only three or four tree traversals.
Indexes are therefore essential in:
- High-traffic APIs
- Search heavy workloads
- Financial systems
- Large scale microservices
- Analytics platforms
However, excessive indexing introduces overhead. Every insert, update, or delete operation must also update the associated index structures. Senior developers must therefore balance query speed vs write performance.
Internal Data Structures Used in Indexing
Most relational databases implement indexes using a B-Tree or B+Tree data structure. These structures are optimized for disk-based storage systems, where minimizing disk reads is crucial.
B-Tree Structure
A B-Tree is a balanced tree structure that maintains sorted data and allows efficient search, insertion, and deletion operations.
The tree consists of three main types of nodes:
- Root node
- Internal nodes
- Leaf nodes
The root node acts as the entry point to the tree.
The internal nodes store keys and pointers to child nodes.
The leaf nodes store indexed values and references to the corresponding table rows.
When a query searches for a value, the database starts at the root node and traverses down the tree based on comparisons.
Because the tree remains balanced, the height of the tree grows very slowly even when the dataset becomes very large.
For example, a B-Tree with only three levels can index millions of records.
This design ensures that locating a record requires only a small number of disk reads.
B+Tree Structure
Most modern relational databases actually use B+Trees, a variation of B-Trees that improves performance for range queries.
Key characteristics of a B+Tree include:
- Actual row pointers exist only in leaf nodes
- Leaf nodes are linked sequentially
- Internal nodes store only keys used for navigation
Because leaf nodes are linked, the database can efficiently perform range scans.
Example query:
SELECT * FROM orders WHERE price BETWEEN 100 AND 200;
The database finds the starting position in the index and then scans sequentially through linked leaf nodes.
This makes B+Trees ideal for ordered data access patterns.
Clustered vs Non-Clustered Indexes
One of the most important indexing concepts is the difference between clustered indexes and non-clustered indexes.
Clustered Index
A clustered index determines the physical order of rows in a table.
In other words, the table’s data is stored on disk according to the ordering defined by the clustered index.
Because rows can only be stored in one physical order, a table can have only one clustered index.
Example:
PRIMARY KEY (id)
If id is the clustered index, the rows are physically stored in sorted order of id.
Advantages include:
- Faster range queries
- Efficient sequential reads
- Improved locality of reference
Non-Clustered Index
A non-clustered index is a separate structure that stores indexed values and pointers to the actual rows.
The table’s physical order does not change.
Example:
CREATE INDEX idx_email ON users(email);
The index structure contains:
email value → pointer to row location
When the database finds the email in the index, it follows the pointer to retrieve the row.
This extra step is sometimes called a bookmark lookup or index lookup.
Tables can contain multiple non-clustered indexes.
Composite Indexes
A composite index indexes multiple columns together.
Example:
CREATE INDEX idx_city_age ON users(city, age);
The index entries are sorted first by city, then by age.
Composite indexes follow the leftmost prefix rule.
This means the index can be used if queries reference the leading columns.
Queries that can use this index:
SELECT * FROM users WHERE city = 'Delhi';
SELECT * FROM users WHERE city = 'Delhi' AND age = 25;
Query that cannot efficiently use the index:
SELECT * FROM users WHERE age = 25;
Because age is not the leading column, the database cannot efficiently navigate the index.
Designing composite indexes requires analyzing query patterns and filtering conditions.
Covering Indexes
A covering index is an index that contains all the columns required by a query.
When this happens, the database can return results directly from the index without accessing the base table.
Example:
CREATE INDEX idx_email_name ON users(email, name);
Query:
SELECT name FROM users WHERE email = 'alice@example.com';
Because both email and name exist in the index, the database does not need to read the table.
This is called an index-only scan.
Benefits include:
- Reduced disk I/O
- Faster query execution
- Improved cache efficiency
Hash Indexes
Some databases also support hash indexes.
A hash index uses a hash function to compute the storage location of a key.
This provides extremely fast equality lookups.
Example:
SELECT * FROM users WHERE id = 105;
However, hash indexes cannot efficiently support range queries.
Example:
SELECT * FROM orders WHERE amount > 100;
Because of this limitation, B-Trees remain the default indexing structure in most relational databases.
Bitmap Indexes
A bitmap index represents column values as bit arrays.
These indexes are particularly useful when columns have low cardinality, meaning they contain only a few distinct values.
Example column:
status → ACTIVE, INACTIVE, PENDING
Instead of storing many row pointers, the database stores bitmaps indicating which rows contain each value.
Bitmap indexes are widely used in data warehouses and analytical databases, where large datasets are scanned with complex filtering conditions.
Index Selectivity
Selectivity refers to how unique the values in a column are.
High selectivity means the column contains many unique values.
Examples:
- user_id
- transaction_id
These columns are excellent candidates for indexing.
Low selectivity columns include:
- gender
- boolean flags
- status columns
Indexing low-selectivity columns often provides limited benefits because many rows share the same value.
Query planners rely on database statistics to estimate selectivity and determine whether using an index is beneficial.
Query Planner and Execution Plans
When a SQL query is executed, the database does not immediately run it. Instead, it generates an execution plan.
The query planner evaluates multiple strategies, including:
- Sequential scan
- Index scan
- Index range scan
- Nested loop join
- Hash join
It selects the strategy with the lowest estimated cost.
In some cases, even if an index exists, the planner may still choose a full table scan.
For example, if a query retrieves 80–90% of rows, scanning the entire table may actually be faster than using an index.
Developers can inspect execution plans using tools such as:
EXPLAIN
or
EXPLAIN ANALYZE
Understanding query plans is an important skill for diagnosing performance bottlenecks in production systems.
Write Overhead and Index Maintenance
Indexes improve read performance but increase the cost of write operations.
When a row is inserted, the database must:
- Insert the row into the table
- Update every associated index
If a table contains five indexes, each insert requires five additional index updates.
This phenomenon is known as write amplification.
In write-heavy systems such as:
- financial transaction systems
- event logging platforms
- high frequency trading systems
excessive indexing can significantly reduce throughput.
Senior engineers must carefully evaluate which indexes truly provide value.
Best Practices for Index Design
Effective indexing requires understanding both data distribution and application query patterns.
Some widely accepted best practices include:
- Index columns used frequently in WHERE conditions
- Use composite indexes for multi-column filters
- Avoid indexing columns with very low selectivity
- Monitor execution plans using EXPLAIN
- Periodically remove unused indexes
Indexes should be treated as performance optimization tools, not default structures applied to every column.
Conclusion
Database indexing is one of the most powerful tools available for optimizing data retrieval. By organizing data using structures such as B-Trees and B+Trees, databases can reduce search complexity from linear scans to logarithmic lookups.
However, indexes introduce trade-offs. They consume storage, increase write latency, and require maintenance. Senior developers must understand not only how indexes work, but also when they should and should not be used.
Mastery of indexing involves understanding data structures, query planners, execution plans, and workload patterns. When applied thoughtfully, indexing enables systems to scale efficiently while maintaining fast response times even with massive datasets.