8.4 Selection of Indexes

Selection of Indexes

Section outline

Choosing which indexes to create involves a trade-off between speeding up queries and the overhead of maintaining indexes during data modifications (insertions, deletions, updates).
Two key factors are the potential speed-up of queries versus the increased complexity and time for data modifications.
The primary cost in query processing is often the number of disk pages that must be retrieved into main memory.
Indexes are particularly useful for keys, attributes that are ‘almost keys’ (high selectivity), and attributes on which the data is clustered.
The optimal set of indexes can be calculated by modeling the costs of common queries and modifications and their expected frequencies.
Database tuning tools can automate index selection by analyzing a query workload and using a greedy algorithm to choose indexes that provide the most benefit.

#1

read: Section 8.4: Selection of Indexes (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- Choosing which indexes to create involves a trade-off between speeding up queries and the overhead of maintaining indexes during data modifications (insertions, deletions, updates).
- Two key factors are the potential speed-up of queries versus the increased complexity and time for data modifications.
- The primary cost in query processing is often the number of disk pages that must be retrieved into main memory.
- Indexes are particularly useful for keys, attributes that are 'almost keys' (high selectivity), and attributes on which the data is clustered.
- The optimal set of indexes can be calculated by modeling the costs of common queries and modifications and their expected frequencies.
- Database tuning tools can automate index selection by analyzing a query workload and using a greedy algorithm to choose indexes that provide the most benefit.

Selection of Indexes

The choice of which indexes to create is a fundamental problem in database design, requiring a careful analysis of the trade-offs involved. An appropriate set of indexes is a principal factor in achieving acceptable database performance.

The Indexing Trade-Off

The selection of indexes involves a direct trade-off. An index on an attribute can significantly accelerate queries that specify a value or range for that attribute. Conversely, every index adds overhead to data modification operations—insertions, deletions, and updates—as the index structure must also be maintained.
A Simple Cost Model

The predominant cost in most database operations is the number of disk accesses required. Relations are stored across many disk pages, and to examine even a single tuple, its entire page must be read from disk into main memory. The objective of an index is to reduce the number of data pages that must be accessed to satisfy a query.
Characteristics of Useful Indexes
An index is most beneficial under certain conditions.
- Key Attributes: An index on a key is highly effective, as it typically maps a key value to a single tuple location, requiring at most one data page access.
- High Selectivity (“Almost a Key”): An index on an attribute where values are unique or nearly unique is also effective. It retrieves a small number of tuples, thus requiring few disk accesses.
- Clustered Attributes: If tuples are physically stored (clustered) on disk in the order of an attribute’s value, an index on that attribute can be very efficient. It allows the retrieval of all tuples with a common value by accessing only a small, contiguous set of pages.
Calculating the Optimal Set of Indexes

One can determine the best set of indexes by modeling the costs of the expected database workload. This involves estimating the cost of each frequent query and modification under various indexing schemes and weighting those costs by the frequency of each operation.

Definition: Average Operation Cost

Let $O_1, O_2, \ldots, O_n$ be a set of database operations (queries and modifications) that constitute the workload. Let $p_i$ be the probability (or frequency) of operation $O_i$, such that $\sum_{i=1}^{n} p_i = 1$.

For a given set of indexes $S$, let $C(O_i, S)$ be the cost, measured in disk accesses, of performing operation $O_i$ with the indexes in $S$.

The average cost of an operation for the index set $S$ is the weighted average: \[ \text{AverageCost}(S) = \sum_{i=1}^{n} p_i \cdot C(O_i, S) \] The optimal set of indexes is the set $S$ that minimizes this value.
Automatic Index Selection
Modern database systems often include tuning advisors that automate the index selection process. These tools analyze a representative query workload and employ a greedy algorithm to recommend an optimal set of indexes.
The typical process is as follows:
1. With no indexes selected, evaluate the performance benefit of each potential candidate index.
2. Choose the index that provides the greatest benefit.
3. Re-evaluate the remaining candidate indexes, assuming the previously selected index is available.
4. Repeat this process, adding the next-best index at each step, until no remaining candidate offers a positive performance benefit.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-10-15 23:45:54
Duration: 19.6110737323761
Cummulative cost: $2.44e-02

A Simple Cost Model

Section outline

Understanding index selection requires a cost model based on where time is spent during a query.
The main cost is retrieving data pages from disk into main memory.
Tuples of a relation are typically distributed across many pages on a disk.

#2

read: Section 8.4.1: A Simple Cost Model (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- Understanding index selection requires a cost model based on where time is spent during a query.
- The main cost is retrieving data pages from disk into main memory.
- Tuples of a relation are typically distributed across many pages on a disk.

To understand how to choose indexes for a database, we must first know where time is spent when answering a query. The details of data storage are complex, but we can state a simple model for the cost of database operations.

The principal cost of answering a query is the time spent retrieving data. The details of how relations are stored will determine the efficiency of this process.

To make an informed choice about creating an index, the database designer must analyze a trade-off between query speed and the overhead of maintaining the index during data modifications.
The tuples of a relation are normally distributed among many pages of a disk. A page, typically several thousand bytes, holds many tuples.

To examine even one tuple, the entire page containing that tuple must be brought from disk into main memory. The cost to examine all tuples on a page is only slightly more than the cost to examine a single tuple on that page.
For simplicity in our cost model, we shall assume that every page needed to answer a query must be retrieved from the disk. We do not consider the possibility that a page might already be in main memory.

This assumption simplifies the analysis by making the cost of an operation directly proportional to the number of distinct pages that must be accessed on disk.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-10-15 23:46:06
Duration: 11.934682130813599
Cummulative cost: $1.34e-02

Some Useful Indexes

Section outline

An index on a relation’s key is often the most useful, as key-based lookups are common and efficient, retrieving at most one tuple.
Indexes on non-key attributes can be effective in two situations: when the attribute is ‘almost a key’ (few tuples share a value) or when tuples are ‘clustered’ by that attribute (tuples with the same value are stored on few pages).

#3

read: Section 8.4.2: Some Useful Indexes (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- An index on a relation's key is often the most useful, as key-based lookups are common and efficient, retrieving at most one tuple.
- Indexes on non-key attributes can be effective in two situations: when the attribute is 'almost a key' (few tuples share a value) or when tuples are 'clustered' by that attribute (tuples with the same value are stored on few pages).

An index on a relation’s key is often the most useful index.

There are two principal reasons for this. First, queries that specify a value for the key are common; thus, an index on the key will be used frequently. Second, since at most one tuple can have a given key value, the index returns at most one tuple location. Consequently, at most one data page must be retrieved from disk to access the tuple itself.
An index on a non-key attribute can be effective in improving query performance under certain conditions.
The utility of a non-key index depends on the number of disk pages that must be accessed. Two situations are noteworthy:
1. The attribute is “almost a key.” If an attribute has relatively few tuples for any given value, an index is beneficial. Even if each tuple resides on a different page, the total number of pages to retrieve will be small compared to scanning the entire relation.
2. The tuples are “clustered” on the attribute. When tuples with a common value for an attribute are physically grouped onto a minimal number of disk pages, an index can efficiently locate this small set of pages.

Definition: Clustered Relation

A relation $R$ is said to be clustered on an attribute $A$ if the tuples of $R$ are physically stored such that for any value $v$ of $A$, all tuples $t \in R$ where $t.A = v$ are located on the smallest possible number of disk pages.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-10-15 23:46:24
Duration: 17.881883144378662
Cummulative cost: $2.16e-02

Calculating the Best Indexes to Create

Section outline

The decision to create indexes must balance the benefits for queries against the costs for modifications.
Cost calculation requires making assumptions about the expected queries, modifications, and their frequencies.
The principal cost of a query or modification is often the number of disk pages accessed.
An example is provided for the relation StarsIn(movieTitle, movieYear, starName) to analyze the costs of two query types ($Q_1$, $Q_2$) and one insertion type ($I$) under different indexing strategies.

#4

read: Section 8.4.3: Calculating the Best Indexes to Create (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- The decision to create indexes must balance the benefits for queries against the costs for modifications.
- Cost calculation requires making assumptions about the expected queries, modifications, and their frequencies.
- The principal cost of a query or modification is often the number of disk pages accessed.
- An example is provided for the relation `StarsIn(movieTitle, movieYear, starName)` to analyze the costs of two query types ($Q_1$, $Q_2$) and one insertion type ($I$) under different indexing strategies.

The decision to create an index requires a careful analysis of the trade-offs involved. An index can significantly improve the performance of queries but introduces overhead for data modification operations such as insertions, updates, and deletions.

Each modification to a relation forces a corresponding change to any index on the modified attributes. This process involves reading and writing pages for the index in addition to the pages for the data itself, thereby increasing the cost of modifications.
To determine the optimal set of indexes, we must formulate a cost model based on the expected workload.

This involves estimating the relative frequencies of various queries and modifications. The primary cost metric is typically the number of disk block (or page) accesses required to perform an operation, as disk I/O is often the main bottleneck in database performance.

Definition: Index Selection Cost Model

Let a set of database operations be $O_1, O_2, \dots, O_k$ with execution frequencies $p_1, p_2, \dots, p_k$ respectively, where $\sum_{i=1}^{k} p_i = 1$. Let $C(O_i, S)$ be the cost, in disk accesses, of performing operation $O_i$ given a set of indexes $S$. The average cost for the workload is:

\[ \text{Average Cost}(S) = \sum_{i=1}^{k} p_i \cdot C(O_i, S) \]

The objective is to choose the set of indexes $S$ that minimizes this average cost.

Consider the relation StarsIn(movieTitle, movieYear, starName) and the following three common operations.

We will analyze the cost of these operations under four different indexing strategies: no indexes, an index on starName, an index on (movieTitle, movieYear), and indexes on both.
Operations:

$Q_1$: Find the title and year of movies for a given star. (WHERE starName = s)

$Q_2$: Find the stars in a given movie. (WHERE movieTitle = t AND movieYear = y)

$I$: Insert a new tuple into the relation.

Assumptions:

The StarsIn relation occupies 10 disk pages. A full scan costs 10 disk accesses.

On average, a star appears in 3 movies, and a movie has 3 stars. These tuples are spread across the relation, requiring 3 disk accesses to retrieve if an index is used.

Using an index requires one additional disk access to read the index page itself.

An insertion requires reading and writing a data page (2 accesses). If an index exists, it also requires reading and writing an index page (2 additional accesses per index).

Cost Analysis: The costs, in disk accesses, are summarized below. The average cost is calculated assuming $Q_1$ occurs with frequency $p_1$ and $Q_2$ with frequency $p_2$. The insertion $I$ therefore occurs with frequency $1 - p_1 - p_2$.

Action No Index Star Index Movie Index Both Indexes

$Q_1$ 10 4 10 4

$Q_2$ 10 10 4 4

$I$ 2 4 4 6

Average Cost $2 + 8p_1 + 8p_2$ $4 + 6p_2$ $4 + 6p_1$ $6 - 2p_1 - 2p_2$

Conclusion: The optimal choice depends entirely on the workload frequencies $p_1$ and $p_2$.

If insertions are dominant (e.g., $p_1 = p_2 = 0.1$), the lowest cost is achieved with no indexes.

If both queries are frequent (e.g., $p_1 = p_2 = 0.4$), the lowest cost is achieved with both indexes.

If only one query type is frequent (e.g., $p_1 = 0.5, p_2 = 0.1$), the best choice is a single index that supports that query (the Star Index in this case).

Action	No Index	Star Index	Movie Index	Both Indexes
\(Q_1\)	10	4	10	4
\(Q_2\)	10	10	4	4
\(I\)	2	4	4	6
Average Cost	\(2 + 8p_1 + 8p_2\)	\(4 + 6p_2\)	\(4 + 6p_1\)	\(6 - 2p_1 - 2p_2\)

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-10-15 23:46:45
Duration: 20.336645126342773
Cummulative cost: $2.44e-02

Automatic Selection of Indexes to Create

Section outline

‘Tuning’ a database includes the process of index selection.
Tuning advisors can automate or assist in this process through a four-step approach: establishing a workload, specifying constraints, evaluating candidate indexes, and suggesting the optimal set.
A ‘greedy’ approach is often effective, where indexes are chosen iteratively based on which one provides the maximum benefit given the already selected indexes.

#5

read: Section 8.4.4: Automatic Selection of Indexes to Create (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- 'Tuning' a database includes the process of index selection.
- Tuning advisors can automate or assist in this process through a four-step approach: establishing a workload, specifying constraints, evaluating candidate indexes, and suggesting the optimal set.
- A 'greedy' approach is often effective, where indexes are chosen iteratively based on which one provides the maximum benefit given the already selected indexes.

Database Tuning

The process of adjusting a database for optimal performance is known as “tuning.” This process includes not only index selection but also the configuration of other parameters, such as memory allocation and checkpoint frequency. Tools known as tuning advisors can assist the database designer by automating or providing guidance on these choices.
The Tuning Advisor Process

Tuning advisors systematically approach index selection through a defined process. The goal is to identify a set of indexes that minimizes the cost for a given workload.
Definition: The Index Selection Process
A tuning advisor typically follows a four-step process to recommend an optimal set of indexes:
1. Establish the Workload: The advisor first determines a representative set of queries and modifications. This workload can be derived from system logs of past operations or from an analysis of the application programs that interact with the database.
2. Specify Constraints: The database designer may impose constraints, such as requiring certain indexes to be present or forbidding others.
3. Evaluate Candidate Indexes: A set of potential, or candidate, indexes is generated. The system’s query optimizer is then used to estimate the performance cost of the workload under the assumption that various subsets of these candidate indexes are available.
4. Recommend the Optimal Set: The set of indexes that results in the lowest estimated cost for the given workload is suggested to the designer or, in some cases, automatically created.
The Greedy Approach to Index Selection

A “greedy” algorithm has proven effective for navigating the large search space of possible index combinations. The existence of one index can affect the benefit provided by another, and this approach addresses that interdependency iteratively.
The algorithm proceeds as follows:
1. Initially, with no indexes selected, evaluate the benefit of each candidate index individually. Select the index that provides the greatest positive benefit, meaning it most reduces the average execution time of the workload.
2. Re-evaluate the benefit of each of the remaining candidate indexes, now assuming the previously selected index is available. Again, choose the index that provides the greatest additional benefit.
3. Repeat the evaluation process, at each step adding the index with the maximum marginal benefit, given the set of indexes already chosen. The process terminates when no remaining candidate index offers a positive benefit.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-10-15 23:46:59
Duration: 14.13808822631836
Cummulative cost: $1.63e-02