Subqueries

Subqueries (also known as inner queries or nested queries) are queries within a query. Subqueries can be used in SELECT, FROM, WHERE, and HAVING clauses.

The examples below are based on the following tables.

SELECT * FROM x;

+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1        | 2        |
+----------+----------+
| 2        | 4        |
+----------+----------+
SELECT * FROM y;

+--------+--------+
| number | string |
+--------+--------+
| 1      | one    |
+--------+--------+
| 2      | two    |
+--------+--------+
| 3      | three  |
+--------+--------+
| 4      | four   |
+--------+--------+

Subquery operators

[ NOT ] EXISTS

The EXISTS operator returns all rows where a correlated subquery produces one or more matches for that row. NOT EXISTS returns all rows where a correlated subquery produces zero matches for that row. Only correlated subqueries are supported.

[NOT] EXISTS (subquery)

[ NOT ] IN

The IN operator returns all rows where a given expression’s value can be found in the results of a correlated subquery. NOT IN returns all rows where a given expression’s value cannot be found in the results of a subquery or list of values.

expression [NOT] IN (subquery|list-literal)

Examples

SELECT * FROM x WHERE column_1 IN (1,3);

+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1        | 2        |
+----------+----------+
SELECT * FROM x WHERE column_1 NOT IN (1,3);

+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 2        | 4        |
+----------+----------+

SELECT clause subqueries

SELECT clause subqueries use values returned from the inner query as part of the outer query’s SELECT list. The SELECT clause only supports scalar subqueries that return a single value per execution of the inner query. The returned value can be unique per row.

SELECT [expression1[, expression2, ..., expressionN],] (<subquery>)

Note: SELECT clause subqueries can be used as an alternative to JOIN operations.

Example

SELECT
  column_1,
  (
    SELECT
      first_value(string)
    FROM
      y
    WHERE
      number = x.column_1
  ) AS "numeric string"
FROM
  x;

+----------+----------------+
| column_1 | numeric string |
+----------+----------------+
|        1 | one            |
|        2 | two            |
+----------+----------------+

FROM clause subqueries

FROM clause subqueries return a set of results that is then queried and operated on by the outer query.

SELECT expression1[, expression2, ..., expressionN] FROM (<subquery>)

Example

The following query returns the average of maximum values per room. The inner query returns the maximum value for each field from each room. The outer query uses the results of the inner query and returns the average maximum value for each field.

SELECT
  column_2
FROM
  (
    SELECT
      *
    FROM
      x
    WHERE
      column_1 > 1
  );

+----------+
| column_2 |
+----------+
|        4 |
+----------+

WHERE clause subqueries

WHERE clause subqueries compare an expression to the result of the subquery and return true or false. Rows that evaluate to false or NULL are filtered from results. The WHERE clause supports correlated and non-correlated subqueries as well as scalar and non-scalar subqueries (depending on the the operator used in the predicate expression).

SELECT
  expression1[, expression2, ..., expressionN]
FROM
  <measurement>
WHERE
  expression operator (<subquery>)

Note: WHERE clause subqueries can be used as an alternative to JOIN operations.

Examples

WHERE clause with scalar subquery

The following query returns all rows with column_2 values above the average of all number values in y.

SELECT
  *
FROM
  x
WHERE
  column_2 > (
    SELECT
      AVG(number)
    FROM
      y
  );

+----------+----------+
| column_1 | column_2 |
+----------+----------+
|        2 |        4 |
+----------+----------+

WHERE clause with non-scalar subquery

Non-scalar subqueries must use the [NOT] IN or [NOT] EXISTS operators and can only return a single column. The values in the returned column are evaluated as a list.

The following query returns all rows with column_2 values in table x that are in the list of numbers with string lengths greater than three from table y.

SELECT
  *
FROM
  x
WHERE
  column_2 IN (
    SELECT
      number
    FROM
      y
    WHERE
      length(string) > 3
  );

+----------+----------+
| column_1 | column_2 |
+----------+----------+
|        2 |        4 |
+----------+----------+

WHERE clause with correlated subquery

The following query returns rows with column_2 values from table x greater than the average string value length from table y. The subquery in the WHERE clause uses the column_1 value from the outer query to return the average string value length for that specific value.

SELECT
  *
FROM
  x
WHERE
  column_2 > (
    SELECT
      AVG(length(string))
    FROM
      y
    WHERE
      number = x.column_1
  );

+----------+----------+
| column_1 | column_2 |
+----------+----------+
|        2 |        4 |
+----------+----------+

HAVING clause subqueries

HAVING clause subqueries compare an expression that uses aggregate values returned by aggregate functions in the SELECT clause to the result of the subquery and return true or false. Rows that evaluate to false are filtered from results. The HAVING clause supports correlated and non-correlated subqueries as well as scalar and non-scalar subqueries (depending on the the operator used in the predicate expression).

SELECT
  aggregate_expression1[, aggregate_expression2, ..., aggregate_expressionN]
FROM
  <measurement>
WHERE
  <conditional_expression>
GROUP BY
  column_expression1[, column_expression2, ..., column_expressionN]
HAVING
  expression operator (<subquery>)

Examples

The following query calculates the averages of even and odd numbers in table y and returns the averages that are equal to the maximum value of column_1 in table x.

HAVING clause with a scalar subquery

SELECT
  AVG(number) AS avg,
  (number % 2 = 0) AS even
FROM
  y
GROUP BY
  even
HAVING
  avg = (
    SELECT
      MAX(column_1)
    FROM
      x
  );

+-------+--------+
|   avg | even   |
+-------+--------+
|     2 | false  |
+-------+--------+

HAVING clause with a non-scalar subquery

Non-scalar subqueries must use the [NOT] IN or [NOT] EXISTS operators and can only return a single column. The values in the returned column are evaluated as a list.

The following query calculates the averages of even and odd numbers in table y and returns the averages that are in column_1 of table x.

SELECT
  AVG(number) AS avg,
  (number % 2 = 0) AS even
FROM
  y
GROUP BY
  even
HAVING
  avg IN (
    SELECT
      column_1
    FROM
      x
  );

+-------+--------+
|   avg | even   |
+-------+--------+
|     2 | false  |
+-------+--------+

Subquery categories

Subqueries can be categorized as one or more of the following based on the behavior of the subquery:

Correlated subqueries

In a correlated subquery, the inner query depends on the values of the current row being processed.

Note: DataFusion internally rewrites correlated subqueries into JOINs to improve performance. In general correlated subqueries are less performant than non-correlated subqueries.

Non-correlated subqueries

In a non-correlated subquery, the inner query doesn’t depend on the outer query and executes independently. The inner query executes first, and then passes the results to the outer query.

Scalar subqueries

A scalar subquery returns a single value (one column of one row). If no rows are returned, the subquery returns NULL.

Non-scalar subqueries

A non-scalar subquery returns 0, 1, or multiple rows, each of which may contain 1 or multiple columns. For each column, if there is no value to return, the subquery returns NULL. If no rows qualify to be returned, the subquery returns 0 rows.