Full-Relation Operations

Section outline

This section studies operations that act on relations as a whole, rather than on individual tuples.
It addresses how SQL handles relations as bags (multisets) rather than sets.
It discusses forcing results to be sets and preventing duplicate elimination.
It covers SQL’s support for the grouping and aggregation operator, $\gamma$, through the GROUP-BY and HAVING clauses.

#1

read: Section 6.4: Full-Relation Operations (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- This section studies operations that act on relations as a whole, rather than on individual tuples.
- It addresses how SQL handles relations as bags (multisets) rather than sets.
- It discusses forcing results to be sets and preventing duplicate elimination.
- It covers SQL's support for the grouping and aggregation operator, $\gamma$, through the `GROUP-BY` and `HAVING` clauses.

This section examines operations that act upon relations as a whole, rather than on individual tuples. We will discuss the distinction between the set-based relational model and SQL’s bag-based semantics, and explore SQL’s mechanisms for grouping and aggregation.

Full-Relation Operations

These operations consider the entire collection of tuples within a relation. This contrasts with operations such as joins, which operate on pairs or small sets of tuples. The primary focus is on managing duplicates and performing aggregations over groups of tuples.
Relations as Bags in SQL

In the formal relational model, a relation is a set of tuples, meaning no duplicates are permitted. SQL, however, treats relations as bags (or multisets), where a tuple can appear multiple times. Consequently, the result of a standard SELECT query may contain duplicate tuples.
Controlling Duplicates
SQL provides mechanisms to control the presence of duplicates in query results.
- Duplicate Elimination: To force a query result to be a set, the DISTINCT keyword is used following SELECT. This instructs the system to eliminate all duplicate tuples from the final result. Be advised that this operation is computationally expensive as it typically requires sorting the entire result set.
- Duplicate Preservation: The standard set operations UNION, INTERSECT, and EXCEPT eliminate duplicates by default. To preserve duplicates and apply bag semantics, one must append the ALL keyword (e.g., UNION ALL).
Grouping and Aggregation

SQL implements the grouping-and-aggregation operator, $\gamma$, from extended relational algebra. This powerful feature allows for the partitioning of tuples into groups and the computation of summary values for each group. This is accomplished through the GROUP BY clause and aggregate functions.

Definition: Grouping and Aggregation ($\gamma$)

Let $R$ be a relation. The grouping and aggregation operation is denoted as $\gamma_{L}(R)$, where $L$ is a list of elements that are either: 1. Grouping attributes: Attributes from $R$. 2. Aggregated attributes: Of the form $F(A)$, where $F$ is an aggregation operator (e.g., SUM, AVG, COUNT) and $A$ is an attribute of $R$.

The operation first partitions the tuples of $R$ into groups such that all tuples in a group have the same values for the specified grouping attributes. For each such group, a single result tuple is generated. This tuple contains the common values for the grouping attributes and, for each aggregation $F(A)$ in $L$, the result of applying the function $F$ to the multiset of values of attribute $A$ within that group.

SQL Clauses for Aggregation
- GROUP BY Clause: This clause follows the FROM and WHERE clauses and specifies a list of attributes. The relation resulting from the FROM and WHERE clauses is partitioned into groups based on the values of these attributes.
- HAVING Clause: This clause follows the GROUP BY clause and provides a condition for filtering groups. Whereas the WHERE clause filters individual tuples before grouping, the HAVING clause filters entire groups after they have been formed, typically using an aggregate function in its condition.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-20 13:18:40
Duration: 19.050811767578125
Cummulative cost: $2.41e-02

Eliminating Duplicates

Section outline

SQL’s relations are bags, meaning they can contain duplicate tuples, unlike the abstract set-based definition of relations.
By default, SQL queries do not eliminate duplicate tuples from the result.
The SELECT DISTINCT keyword is used to eliminate duplicate tuples and produce a set as the result.
This is analogous to the $\delta$ operator in relational algebra.

#2

read: Section 6.4.1: Eliminating Duplicates (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- SQL's relations are bags, meaning they can contain duplicate tuples, unlike the abstract set-based definition of relations.
- By default, SQL queries do not eliminate duplicate tuples from the result.
- The `SELECT DISTINCT` keyword is used to eliminate duplicate tuples and produce a set as the result.
- This is analogous to the $\delta$ operator in relational algebra.

SQL relations differ from the abstract notion of relations presented in set theory. A relation in SQL is a bag (or multiset), which means it can contain more than one copy of any given tuple.

In formal relational algebra, a relation is a set of tuples, and by definition, a set cannot have duplicate elements. SQL deviates from this formalism, which has practical implications for query results.
The standard SELECT-FROM-WHERE query in SQL does not eliminate duplicates from the result set.

The evaluation of a query may produce the same tuple from different source tuples. For instance, a projection that discards certain attributes can cause distinct tuples in the original relation to become identical in the result. By default, SQL will list each of these identical tuples.
To eliminate duplicate tuples from the output of a query, the SELECT keyword must be followed by the DISTINCT keyword.

The SELECT DISTINCT statement instructs the database system to produce only one copy of any given tuple in the final result, thereby converting the output bag into a set.
The DISTINCT keyword in SQL is the analog of the duplicate-elimination operator, $\delta$, from relational algebra.

If a query Q produces a relation (bag) $R$, then the query SELECT DISTINCT ... produces the set $\delta(R)$.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-20 13:18:52
Duration: 11.4144127368927
Cummulative cost: $1.23e-02

Duplicates in Unions, Intersections, and Differences

Section outline

The set operations UNION, INTERSECT, and EXCEPT normally eliminate duplicates, treating relations as sets.
To preserve duplicates and perform bag operations, the keyword ALL must be used (e.g., UNION ALL, INTERSECT ALL, EXCEPT ALL).
For bag intersection (INTERSECT ALL), a tuple appears in the result the minimum number of times it appears in either input relation.
For bag difference (EXCEPT ALL), a tuple appears in the result $n - m$ times, where $n$ and $m$ are its counts in the first and second relations, respectively, provided $n > m$.

#3

read: Section 6.4.2: Duplicates in Unions, Intersections, and Differences (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- The set operations `UNION`, `INTERSECT`, and `EXCEPT` normally eliminate duplicates, treating relations as sets.
- To preserve duplicates and perform bag operations, the keyword `ALL` must be used (e.g., `UNION ALL`, `INTERSECT ALL`, `EXCEPT ALL`).
- For bag intersection (`INTERSECT ALL`), a tuple appears in the result the minimum number of times it appears in either input relation.
- For bag difference (`EXCEPT ALL`), a tuple appears in the result $n - m$ times, where $n$ and $m$ are its counts in the first and second relations, respectively, provided $n > m$.

Unlike the SELECT statement, which preserves duplicates by default, the set-like operations in SQL behave differently.

UNION, INTERSECT, and EXCEPT

By default, these operations eliminate duplicate tuples from the result. The relations are treated as sets, and the corresponding set operation is performed.
The ALL Keyword for Bag Operations

To prevent the elimination of duplicates and perform operations on bags (multisets), the keyword ALL must be appended to the operator, resulting in UNION ALL, INTERSECT ALL, and EXCEPT ALL.
Bag Intersection (INTERSECT ALL)

This operation computes the multiset intersection of two relations.

Definition: Bag Intersection

Let $R$ and $S$ be two relations (bags). For a given tuple $t$, let $count_R(t)$ and $count_S(t)$ be the number of times $t$ appears in $R$ and $S$, respectively. The number of times $t$ appears in the result of R INTERSECT ALL S is $min(count_R(t), count_S(t))$.
Bag Difference (EXCEPT ALL)

This operation computes the multiset difference between two relations.

Definition: Bag Difference

Let $R$ and $S$ be two relations (bags). For a given tuple $t$, let $count_R(t)$ and $count_S(t)$ be the number of times $t$ appears in $R$ and $S$, respectively. The number of times $t$ appears in the result of R EXCEPT ALL S is $count_R(t) - count_S(t)$, provided that this difference is positive. If $count_R(t) \le count_S(t)$, the tuple $t$ does not appear in the result.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-20 13:19:06
Duration: 13.569606065750122
Cummulative cost: $1.60e-02

Grouping and Aggregation in SQL

Section outline

Introduces the grouping-and-aggregation operator $\gamma$ from extended relational algebra.
This operator partitions a relation’s tuples into groups based on specified attributes.

#4

read: Section 6.4.3: Grouping and Aggregation in SQL (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- Introduces the grouping-and-aggregation operator $\gamma$ from extended relational algebra.
- This operator partitions a relation's tuples into groups based on specified attributes.

SQL supports grouping and aggregation, operations analogous to the $\gamma$ operator from extended relational algebra.

This operator enables the partitioning of tuples within a relation into distinct groups. This is a fundamental operation for summarizing data.
The partitioning of tuples is based on the values of one or more specified attributes.

All tuples that share identical values for the designated grouping attributes are collected into a single group. Subsequently, aggregation functions can be applied to each group to compute summary statistics.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-20 13:19:14
Duration: 8.260210990905762
Cummulative cost: $9.69e-03

Aggregation Operators

Section outline

SQL provides five aggregation operators: SUM, AVG, MIN, MAX, and COUNT.
COUNT(*) counts all tuples in a relation or group.
The DISTINCT keyword can be used within aggregation functions (e.g., COUNT(DISTINCT attribute)) to operate on unique values only.

#5

read: Section 6.4.4: Aggregation Operators (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- SQL provides five aggregation operators: `SUM`, `AVG`, `MIN`, `MAX`, and `COUNT`.
- `COUNT(*)` counts all tuples in a relation or group.
- The `DISTINCT` keyword can be used within aggregation functions (e.g., `COUNT(DISTINCT attribute)`) to operate on unique values only.

Definition: Aggregation Operator

An aggregation operator is a function $f$ that takes a multiset (or bag) of values $B$ as input and returns a single scalar value. In SQL, these operators are applied to the values of an attribute within a relation or within groups of a relation.

SQL provides five standard aggregation operators.

The operators are SUM, AVG, MIN, MAX, and COUNT. They are applied to a scalar-valued expression, typically a column name, within a SELECT clause. By default, these operators consider all non-NULL values in the specified column, including duplicates.
The COUNT(*) operator is a special case for counting tuples.

The expression COUNT(*) counts all tuples in the relation that results from the FROM and WHERE clauses of a query. This differs from COUNT(attribute), which counts the number of non-NULL values in a specific attribute.
The DISTINCT keyword can be used to operate on unique values.

To eliminate duplicates before applying the aggregation, the keyword DISTINCT may be used before the attribute name. For example, COUNT(DISTINCT attribute) counts the number of unique non-NULL values in the specified attribute. While syntactically valid for other operators (e.g., SUM(DISTINCT attribute)), its application is less common.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-20 13:19:26
Duration: 12.270047187805176
Cummulative cost: $1.46e-02

Grouping

Section outline

The GROUP BY clause is used to group tuples based on the values of one or more attributes.
When GROUP BY is used, the SELECT clause can only contain aggregation functions or attributes listed in the GROUP BY clause.
Describes the logical evaluation order: FROM/WHERE clauses are evaluated first to create an intermediate relation, which is then grouped by the GROUP BY clause, and finally the SELECT list is computed for each group.

#6

read: Section 6.4.5: Grouping (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- The `GROUP BY` clause is used to group tuples based on the values of one or more attributes.
- When `GROUP BY` is used, the `SELECT` clause can only contain aggregation functions or attributes listed in the `GROUP BY` clause.
- Describes the logical evaluation order: `FROM`/`WHERE` clauses are evaluated first to create an intermediate relation, which is then grouped by the `GROUP BY` clause, and finally the `SELECT` list is computed for each group.

The GROUP BY clause is used to group tuples based on the values of one or more attributes.

It follows the WHERE clause and is succeeded by a list of grouping attributes. The query result will contain one row for each distinct combination of values in the grouping attributes.
When GROUP BY is used, the SELECT clause is constrained to include only aggregation functions or attributes that are also listed in the GROUP BY clause.

This constraint ensures that for each group of tuples, a single, unambiguous value can be produced for each column in the result set. Attributes from the GROUP BY clause have a single value per group by definition, and aggregate functions compute a single value from the tuples within that group.
The logical evaluation of a query with a GROUP BY clause proceeds in a specific sequence.
The conceptual order of operations is as follows:
1. First, the relation specified by the FROM and WHERE clauses is computed. This intermediate relation contains all tuples that satisfy the selection condition.
2. Next, the tuples of this intermediate relation are partitioned into groups according to the attributes listed in the GROUP BY clause.
3. Finally, the aggregations and attributes in the SELECT clause are computed for each group to generate the final output.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-20 13:19:45
Duration: 18.174846172332764
Cummulative cost: $1.74e-02

Grouping, Aggregation, and Nulls

Section outline

The value NULL is ignored in aggregations, except for COUNT(*). COUNT(attribute) only counts non-NULL values.
NULL is treated as a distinct value when forming groups; there can be a group for NULL values.
Aggregation over an empty set of values results in NULL, except for COUNT, which returns 0.

#7

read: Section 6.4.6: Grouping, Aggregation, and Nulls (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- The value `NULL` is ignored in aggregations, except for `COUNT(*)`. `COUNT(attribute)` only counts non-`NULL` values.
- `NULL` is treated as a distinct value when forming groups; there can be a group for `NULL` values.
- Aggregation over an empty set of values results in `NULL`, except for `COUNT`, which returns 0.

When performing grouping and aggregation operations in SQL, the handling of NULL values follows a specific set of rules. These rules dictate how nulls contribute to aggregate calculations and group formation.

Definition: Handling of NULL in Aggregation and Grouping

Let $g$ be a group of tuples formed by a GROUP BY clause, and let $B_A$ be the bag of values from attribute $A$ for the tuples in group $g$.

Aggregation Functions: For any aggregation function agg $\in$ {SUM, AVG, MIN, MAX}, the value NULL is eliminated from the bag $B_A$ before the function is computed. The function COUNT(A) computes the cardinality of the bag $B_A$ after NULL values are removed. The function COUNT(*) returns the total number of tuples in the group $g$.
Group Formation: The value NULL is treated as a distinct value when partitioning tuples into groups. All tuples that have a NULL in a grouping attribute are placed into a single group.
Empty Bag Aggregation: If the bag of values for an aggregation is empty (either because the group is empty or contains only NULLs for the aggregated attribute), the COUNT is 0. All other aggregation functions return NULL.

NULL values are ignored in aggregations.

An aggregation function such as SUM, AVG, MIN, or MAX operates on a collection of values. Any NULLs within this collection are discarded before the calculation is performed. COUNT(attribute) returns the number of non-NULL values, whereas COUNT(*) returns the total number of tuples in the group, regardless of NULLs.
NULL is treated as a distinct value for grouping.

When a GROUP BY clause is used, all tuples for which the grouping attribute is NULL are collected into a single group. This group is then treated like any other group formed from non-NULL values.
Aggregation over an empty set of values yields NULL, except for COUNT.

If a group contains only NULL values for the attribute being aggregated, the set of values for the aggregation is empty. For such an empty set, COUNT returns 0. Any other aggregate function, such as SUM or AVG, returns NULL. For example, computing SUM(B) on a group whose only B-value is NULL results in NULL, as we are summing an empty bag of values.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-20 13:20:06
Duration: 20.965851068496704
Cummulative cost: $2.63e-02

HAVING Clauses

Section outline

The HAVING clause is used to filter groups based on a condition involving aggregates.
It is applied after the GROUP BY clause has formed the groups.
Rules for the HAVING clause are similar to the SELECT clause in a grouped query: any unaggregated attribute must also be a grouping attribute.

#8

read: Section 6.4.7: HAVING Clauses (section)

model: pro
rules: bullet definition math presentation short


Present the following.  Use definition formal definition when it applies.

- The `HAVING` clause is used to filter groups based on a condition involving aggregates.
- It is applied after the `GROUP BY` clause has formed the groups.
- Rules for the `HAVING` clause are similar to the `SELECT` clause in a grouped query: any unaggregated attribute must also be a grouping attribute.

Definition: The HAVING Clause

The HAVING clause is a component of a SQL SELECT statement that filters the results of a GROUP BY clause. Whereas the WHERE clause filters individual tuples before grouping and aggregation, the HAVING clause filters the groups themselves based on aggregate properties.

Let $R$ be the relation produced by the FROM and WHERE clauses. Let the GROUP BY clause partition $R$ into a set of groups $G = \{g_1, g_2, \dots, g_k\}$. The HAVING clause contains a predicate $P$ that is evaluated for each group $g_i \in G$. The final result of the query is formed from the set of groups $\{g_i \in G \mid P(g_i) \text{ is true}\}$.

Post-Grouping Filtration

The HAVING clause is processed after the GROUP BY clause. It acts as a filter on the groups that have been formed, eliminating those that do not satisfy the specified condition.
Conditions on Aggregates

The primary function of the HAVING clause is to specify conditions that involve aggregate functions, such as COUNT, SUM, AVG, MIN, or MAX. This allows for the selection of groups based on a computed property of the group as a whole.
Rules for Attribute Reference
The rules governing the use of attributes within a HAVING clause are congruent with those for the SELECT clause when a GROUP BY is present.
- An attribute may be used within an aggregate function.
- An attribute that appears unaggregated must be one of the grouping attributes listed in the GROUP BY clause.

gemini-2.5-pro

Version: 1
Model: gemini-2.5-pro
Created: 2025-09-20 13:20:34
Duration: 27.979530096054077
Cummulative cost: $1.44e-02