Design of Relational Database Schemas

Section outline

Careless schema selection leads to redundancy and anomalies, illustrated by the repetition of movie information.
Introduces a four-stage process for designing good relational schemas: exploring problems of poor design, introducing decomposition, defining Boyce-Codd Normal Form (BCNF), and using decomposition to achieve BCNF.

#1

read: Section 3.3: Design of Relational Database Schemas (section)

model: pro
rules: bullet definition math presentation short

Provide a succinct preview summarization for this section.  Focus on:

- Careless schema selection leads to redundancy and anomalies, illustrated by the repetition of movie information.
- Introduces a four-stage process for designing good relational schemas: exploring problems of poor design, introducing decomposition, defining Boyce-Codd Normal Form (BCNF), and using decomposition to achieve BCNF.

A careless selection of a relational database schema can lead to redundancy and related anomalies. This section introduces a systematic approach to designing good relational schemas.

Problems of Poor Schema Design

Poorly designed schemas often result in the repetition of information, which introduces redundancy. This redundancy, in turn, can lead to update, insertion, and deletion anomalies, creating potential for data inconsistencies.
A Four-Stage Design Process
A structured process for schema design is presented.
1. First, the problems arising from poor design are identified.
2. Second, the concept of “decomposition” is introduced as a method for breaking a relation schema into smaller schemas.
3. Third, “Boyce-Codd Normal Form,” or BCNF, is defined as a condition that a schema should satisfy to eliminate these problems.
4. Finally, we generalize a method of decomposition to ensure schemas meet the BCNF condition.

gemini-2.5-pro

Version: 3
Model: gemini-2.5-pro
Created: 2025-09-07 17:35:06
Duration: 11.867660999298096
Cummulative cost: $1.26e-02

Anomalies

Section outline

Defines ‘anomalies’ as problems that occur when too much information is combined into a single relation.
Describes three main types of anomalies: Redundancy (unnecessary repetition of information), Update Anomalies (risk of inconsistent data changes), and Deletion Anomalies (unintended loss of data when other data is deleted).

#2

read: Section 3.3.1: Anomalies (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- Defines 'anomalies' as problems that occur when too much information is combined into a single relation.
- Describes three main types of anomalies: Redundancy (unnecessary repetition of information), Update Anomalies (risk of inconsistent data changes), and Deletion Anomalies (unintended loss of data when other data is deleted).

Definition: Anomaly

Anomalies are problems that arise when a relation schema attempts to represent information about more than one type of entity, leading to redundancy and potential inconsistencies.

Redundancy

Information is repeated unnecessarily across multiple tuples. This wastes storage and complicates data management. For example, in a relation storing movie and star information, the movie’s length and genre might be repeated for every star in that movie.
Update Anomalies

A single logical update requires modification of multiple tuples. If one tuple containing the redundant information is updated but others are not, the database becomes inconsistent. For instance, changing a movie’s length would necessitate updating every tuple associated with that movie.
Deletion Anomalies

The deletion of a set of tuples may result in the unintended loss of other distinct information. For example, if the last star associated with a particular movie is deleted, all information about that movie (such as its length and genre) might also be lost if it is not stored elsewhere.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-07 17:44:27
Duration: 11.42677092552185
Cummulative cost: $1.34e-02

# 2.1

Give an example for each of the anomalies: redundancy, update, and deletion.

Use markdown table.  Generate a short paragraph describing the anomaly.

One observes that a poorly conceived relational database schema can introduce various problems, commonly referred to as anomalies. These anomalies may impede the integrity and efficient management of data. We shall now consider examples of the three principal kinds of anomalies.

Redundancy

Redundancy occurs when the same information is stored multiple times within a relation, often across different tuples. This wasteful repetition can inflate storage requirements and complicate data management.

Consider the following excerpt from the Movies1 relation:

title	year	length	genre	studioName	starName
Star Wars	1977	124	SciFi	Fox	Carrie Fisher
Star Wars	1977	124	SciFi	Fox	Mark Hamill
Star Wars	1977	124	SciFi	Fox	Harrison Ford

Here, the information pertaining to the movie “Star Wars”

its release year (1977),
length (124 minutes),
genre (SciFi),
and studio (Fox)

is repeated for each star associated with the film. This duplication is called redundancy.

Update Anomaly

An update anomaly arises when changes to information are not propagated consistently across all instances where that information is stored. This can lead to contradictory data within the database.

Using the Movies1 relation as an example, suppose we initially have the following tuples for “Star Wars”:

title	year	length	genre	studioName	starName
Star Wars	1977	124	SciFi	Fox	Carrie Fisher
Star Wars	1977	124	SciFi	Fox	Mark Hamill

If it is discovered that “Star Wars” is actually 125 minutes long, and a user updates only the first tuple, the relation might appear as follows:

title	year	length	genre	studioName	starName
Star Wars	1977	125	SciFi	Fox	Carrie Fisher
Star Wars	1977	124	SciFi	Fox	Mark Hamill

The differing lengths (125 and 124) for the same movie constitute an update anomaly, indicating an inconsistent state of the database.

Deletion Anomaly

A deletion anomaly occurs when the removal of a tuple inadvertently causes the loss of other, unrelated information that was stored only in that tuple.

Consider the tuple for “Gone With the Wind” from the Movies1 relation:

title	year	length	genre	studioName	starName
Gone With the Wind	1939	231	drama	MGM	Vivien Leigh

If, for example, Vivien Leigh is removed from the list of stars for “Gone With the Wind,” and this tuple is the only one associating Vivien Leigh with the movie, deleting this tuple would also remove the length (231 minutes) and genre (drama) information for “Gone With the Wind” from the database. This loss of information, incidental to the primary deletion, is a deletion anomaly.

gemini-2.5-flash

Version: 1
Model: gemini-2.5-flash
Created: 2025-09-07 17:56:59
Duration: 9.923089027404785
Cummulative cost: $5.83e-03

Decomposing Relations

Section outline

Presents decomposition as the accepted method to eliminate anomalies by splitting the attributes of a relation into schemas for new, smaller relations.
Formally defines the decomposition of a relation $R(A_1, ..., A_n)$ into two relations $S(B_1, ..., B_m)$ and $T(C_1, ..., C_k)$ such that the union of attribute sets is preserved and the new relations are projections of the original.

#3

read: Section 3.3.2: Decomposing Relations (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- Presents decomposition as the accepted method to eliminate anomalies by splitting the attributes of a relation into schemas for new, smaller relations.
- Formally defines the decomposition of a relation $R(A_1, ..., A_n)$ into two relations $S(B_1, ..., B_m)$ and $T(C_1, ..., C_k)$ such that the union of attribute sets is preserved and the new relations are projections of the original.

The accepted way to eliminate anomalies is to decompose relations.

Decomposition involves splitting the attributes of a relation $R$ to create the schemas of two new relations, say $S$ and $T$. After describing the process, we shall show how to select a decomposition that resolves the anomalies.
A decomposition must satisfy two formal properties.
Definition: Decomposition

Given a relation $R(A_1, A_2, \dots, A_n)$, we may decompose $R$ into two relations $S(B_1, B_2, \dots, B_m)$ and $T(C_1, C_2, \dots, C_k)$ such that:
1. The union of the attributes of the new relations equals the set of attributes of the original relation. \[ \{A_1, A_2, \dots, A_n\} = \{B_1, B_2, \dots, B_m\} \cup \{C_1, C_2, \dots, C_k\} \]
2. The instances of the new relations are the projections of the original relation’s instance onto their respective attribute sets. \[ S = \pi_{B_1, B_2, \dots, B_m}(R) \] \[ T = \pi_{C_1, C_2, \dots, C_k}(R) \]

The decomposition is lossless if we have can recover $R$ by natural join $S$ with T.

\[ R = S\bowtie T \]

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-07 18:06:44
Duration: 10.652580976486206
Cummulative cost: $1.70e-02

2-NF

# 3.1
model: pro
rules: presentation short bullet math definition

- Write a definition of 2NF.
- Describe why 2NF _reduces_ anomalies, but does not eliminate anomalies.
- Give an example that does not satisfay 2NF, and how it can be converted to 2NF by lossless decomposition.

Definition: Second Normal Form (2NF)

A relation schema $R$ is in second normal form (2NF) if it is in 1NF and every non-prime attribute $A \in R$ is fully functionally dependent on every candidate key of $R$. An attribute is fully functionally dependent on a set of attributes $X$ if it is functionally dependent on $X$ but not on any proper subset of $X$.

Reduction of Anomalies

Second Normal Form addresses certain anomalies by eliminating partial dependencies. A partial dependency exists when a non-prime attribute is dependent on only a part of a composite candidate key. This situation leads to data redundancy and update anomalies. For instance, if an attribute depends on a portion of the key, its value will be repeated for every tuple sharing that portion of the key. Converting a relation to 2NF by decomposition removes these partial dependencies, thus reducing such redundancy.

However, 2NF does not eliminate all anomalies. A relation in 2NF can still suffer from anomalies caused by transitive dependencies, where a non-prime attribute is functionally dependent on another non-prime attribute. These dependencies create indirect relationships between attributes and a candidate key, which can lead to redundancy and update issues of their own.
Example of Decomposition into 2NF
Consider the Movies1 relation schema from Figure 3.6: \[ R(\underline{title}, \underline{year}, \text{length}, \text{genre}, \text{studioName}, \underline{starName}) \] The sole candidate key is $\{title, year, starName\}$. The non-prime attributes are $\{length, genre, studioName\}$.

We have the functional dependency: \[ \{title, year\} \to \{\text{length}, \text{genre}, \text{studioName}\} \] This dependency violates 2NF because the non-prime attributes $\{length, genre, studioName\}$ are dependent on a proper subset of the candidate key, namely $\{title, year\}$. This is a partial dependency.

To achieve 2NF, we decompose Movies1 into two relations, eliminating the partial dependency. This is a lossless-join decomposition.
1. $R_1(\underline{title}, \underline{year}, \text{length}, \text{genre}, \text{studioName})$
  
  This relation contains the determinant of the partial dependency and the attributes dependent on it. The key is $\{title, year\}$. It is in 2NF as there are no partial dependencies on its key.
2. $R_2(\underline{title}, \underline{year}, \underline{starName})$
  
  This relation contains the original key and removes the attributes involved in the partial dependency. The key is $\{title, year, starName\}$. Since it has no non-prime attributes, it is trivially in 2NF.
This decomposition is identical to that which produced the Movies2 and Movies3 relations in Example 3.14. It resolves the redundancy where movie information (length, genre, etc.) was repeated for each star of that movie.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-07 23:53:32
Duration: 21.250662803649902
Cummulative cost: $2.60e-02

2NF is not enough

# 3.2

Give a quick example illustrating that 2NF does not guarantee no redundancy.

Consider a relation, which we may call EmployeeDepartments, with the following schema: $R(\text{EmployeeID}, \text{DepartmentID}, \text{DepartmentName})$

Let us posit the following functional dependencies (FDs) for this relation:

$\text{EmployeeID} \rightarrow \text{DepartmentID}$ (Each employee is assigned to precisely one department.)
$\text{DepartmentID} \rightarrow \text{DepartmentName}$ (Each department possesses a unique name.)

In this schema, $\text{EmployeeID}$ serves as the sole candidate key. Consequently, $\text{EmployeeID}$ is the prime attribute, while $\text{DepartmentID}$ and $\text{DepartmentName}$ are non-prime attributes.

To ascertain if this relation adheres to 2NF, we examine its dependencies:

The non-prime attribute $\text{DepartmentID}$ is fully functionally dependent on the candidate key $\text{EmployeeID}$. There is no partial dependency involving $\text{DepartmentID}$.
The non-prime attribute $\text{DepartmentName}$ depends on $\text{DepartmentID}$. As $\text{DepartmentID}$ is not a proper subset of the key $\text{EmployeeID}$ (it is itself a non-prime attribute), $\text{DepartmentName}$ is not partially dependent on the key.

Based upon these observations, the relation EmployeeDepartments satisfies the conditions for 2NF.

However, this relation regrettably still exhibits data redundancy. Consider an instance of this relation:

EmployeeID	DepartmentID	DepartmentName
101	D1	Sales
102	D1	Sales
103	D2	Marketing
104	D1	Sales

In this instance, the information “Department D1 is named Sales” is reiterated across multiple tuples (for Employee 101, 102, and 104). This redundancy stems from the transitive dependency: $\text{EmployeeID} \rightarrow \text{DepartmentID} \rightarrow \text{DepartmentName}$. While 2NF successfully eliminates partial dependencies, it does not preclude transitive dependencies where a non-key attribute determines another non-key attribute. Such transitive dependencies, as seen here, invariably lead to repeated information and the attendant risks of update and deletion anomalies, which Boyce-Codd Normal Form is designed to prevent.

gemini-2.5-flash

Version: 1
Model: gemini-2.5-flash
Created: 2025-09-08 00:12:40
Duration: 21.873821020126343
Cummulative cost: $1.17e-02

Boyce-Codd Normal Form

Section outline

Introduces Boyce-Codd Normal Form (BCNF) as a condition on a relation schema that guarantees the absence of redundancy and other anomalies.
Defines the BCNF condition: A relation R is in BCNF if and only if for every nontrivial functional dependency $A_1 A_2 \dots A_n \rightarrow B_1 B_2 \dots B_m$, the set of attributes on the left side, ${A_1, A_2, \dots, A_n}$, is a superkey for R.

#4

read: Section 3.3.3: Boyce-Codd Normal Form (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- Introduces Boyce-Codd Normal Form (BCNF) as a condition on a relation schema that guarantees the absence of redundancy and other anomalies.
- Defines the BCNF condition: A relation R is in BCNF if and only if for every nontrivial functional dependency $A_1 A_2 \dots A_n \rightarrow B_1 B_2 \dots B_m$, the set of attributes on the left side, ${A_1, A_2, \dots, A_n}$, is a superkey for R.

Boyce-Codd Normal Form (BCNF) is a condition for relation schemas that guarantees the absence of anomalies, such as redundancy. The objective of decomposition is often to produce a set of relations, each of which is in BCNF.
The formal condition for BCNF is defined as follows.

Definition: Boyce-Codd Normal Form (BCNF)

A relation R is in BCNF if and only if for every nontrivial functional dependency (FD) $X \rightarrow Y$ that holds in R, the set of attributes $X$ is a superkey for R.
This definition implies that for any nontrivial dependency, the determinant (the left-hand side of the dependency) must be capable of uniquely identifying a tuple. An equivalent statement is that the left side of every nontrivial FD must contain a key for the relation.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-08 00:10:02
Duration: 8.756283283233643
Cummulative cost: $1.22e-02

Decomposition into BCNF

Section outline

Describes a strategy to break any relation schema into a collection of smaller schemas that are all in BCNF.
The decomposition strategy is to find a nontrivial FD, $X \rightarrow Y$, that violates BCNF. The relation is then decomposed into two new relations: one with schema $X^+$ (the closure of X) and the other with schema containing X and all attributes not in $X^+$.

#5

read: Section 3.3.4: Decomposition into BCNF (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- Describes a strategy to break any relation schema into a collection of smaller schemas that are all in BCNF.
- The decomposition strategy is to find a nontrivial FD, $X \rightarrow Y$, that violates BCNF. The relation is then decomposed into two new relations: one with schema $X^+$ (the closure of X) and the other with schema containing X and all attributes not in $X^+$.

By repeatedly choosing suitable decompositions, we can break any relation schema into a collection of subsets of its attributes that are the schemas of relations in Boyce-Codd Normal Form (BCNF). This process ensures that the data in the original relation is represented faithfully in the resulting decomposed relations.

The strategy for decomposition is guided by the functional dependencies that violate the BCNF condition.

Decomposition Strategy

The decomposition strategy is to identify a nontrivial functional dependency, $X \rightarrow Y$, that violates the BCNF condition for a relation schema $R$. A violation occurs when the left-hand side, $X$, is not a superkey for $R$. This violating FD is then used to decompose the schema $R$ into two smaller schemas.
The Decomposition Rule
Let $R$ be a relation schema and let $X \rightarrow Y$ be a functional dependency that violates BCNF.
1. Compute the closure of the attributes on the left-hand side, denoted $X^+$.
2. Decompose $R$ into two new relation schemas, $R_1$ and $R_2$.
  - The schema for $R_1$ consists of all attributes in the closure $X^+$.
  - The schema for $R_2$ consists of the attributes in $X$ along with all attributes of $R$ that are not in $X^+$. That is, the schema is $X \cup (R - X^+)$.
3. This process is applied recursively to $R_1$ and $R_2$ until all resulting relation schemas are in BCNF.
Termination of the Algorithm

This decomposition process is guaranteed to terminate. Each application of the decomposition rule results in schemas that have strictly fewer attributes than the original schema. The process concludes when all schemas are in BCNF, which is certain to happen at the latest when all schemas have been reduced to two attributes.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-08 00:10:17
Duration: 14.536620140075684
Cummulative cost: $1.53e-02

#5.1

read: Section 3.3.4: Decomposition into BCNF (section)

model: pro
rules: bullet definition math presentation short

Write a pseudo code in python-like syntax on the BCNF decomposition algorithm.

    ```python
    ...
    ```

By repeatedly applying suitable decompositions, one may break any relation schema into a collection of schemas that are in Boyce-Codd Normal Form (BCNF). This decomposition must be guided carefully by the functional dependencies that violate the BCNF condition to ensure the resulting relations faithfully represent the data of the original relation.

Decomposition Strategy

The strategy is to find a nontrivial functional dependency, $X \to Y$, that violates the BCNF condition; that is, $X$ is not a superkey. The relation schema $R$ is then split into two schemas. The first, $R_1$, consists of all attributes functionally determined by $X$, which is the closure $X^+$. The second, $R_2$, consists of the attributes in $X$ along with all attributes of $R$ that are not in $X^+$. This process is repeated on the resulting schemas until no BCNF violations remain.

Definition: BCNF Decomposition Algorithm

An algorithm that takes as input a relation schema $R_0$ with a set of functional dependencies $S_0$. The output is a decomposition of $R_0$ into a collection of relation schemas, $\{R_1, R_2, \dots, R_k\}$, such that each schema $R_i$ in the collection is in Boyce-Codd Normal Form.

Algorithm

The decomposition process is summarized by the following recursive procedure, which reflects the method described in Algorithm 3.20.

def BCNF_Decomposition(R, S):
    """
    Recursively decomposes a relation schema R into a set of BCNF schemas.
    
    INPUT:
        R: A set of attributes (the relation schema).
        S: A set of functional dependencies for R.
        
    OUTPUT:
        A set of relation schemas {R1, R2, ...} where each Ri is in BCNF.
    """
    
    # Find a nontrivial FD X -> Y in S where X is not a superkey for R.
    violating_fd = find_bcnf_violation(R, S)

    if not violating_fd:
        # Base Case: R is in BCNF. Nothing more needs to be done.
        return {R}
    else:
        # Recursive Step: A BCNF violation X -> Y is found.
        X = violating_fd.left_hand_side
        
        # Let R1 be the schema containing all attributes in the closure of X.
        # R1 = X+
        R1 = compute_closure(X, S)
        
        # Let R2 be the schema containing X and all attributes of R not in R1.
        # R2 = X U (R - R1)
        R2 = X.union(R - R1)
        
        # Project the original FDs onto the new schemas to get S1 and S2.
        S1 = project_fds(S, R1)
        S2 = project_fds(S, R2)
        
        # Return the union of the recursive decompositions of R1 and R2.
        return BCNF_Decomposition(R1, S1).union(BCNF_Decomposition(R2, S2))

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-08 00:22:09
Duration: 27.75649404525757
Cummulative cost: $3.89e-02

Anomalies

Redundancy

Update Anomaly

Deletion Anomaly

Decomposing Relations

2-NF

Reduction of Anomalies

Example of Decomposition into 2NF

2NF is not enough

Boyce-Codd Normal Form

Decomposition into BCNF