Struct Type Coercion and Field Mapping#

DataFusion uses name-based field mapping when coercing struct types across different operations. This document explains how struct coercion works, when it applies, and how to handle NULL fields.

Overview: Name-Based vs Positional Mapping#

When combining structs from different sources (e.g., in UNION, array construction, or JOINs), DataFusion matches struct fields by name rather than by position. This provides more robust and predictable behavior compared to positional matching.

Example: Field Reordering is Handled Transparently#

-- These two structs have the same fields in different order
SELECT [{a: 1, b: 2}, {b: 3, a: 4}];

-- Result: Field names matched, values unified
-- [{"a": 1, "b": 2}, {"a": 4, "b": 3}]

Coercion Paths Using Name-Based Matching#

The following query operations use name-based field mapping for struct coercion:

1. Array Literal Construction#

When creating array literals with struct elements that have different field orders:

-- Structs with reordered fields in array literal
SELECT [{x: 1, y: 2}, {y: 3, x: 4}];

-- Unified type: List(Struct("x": Int32, "y": Int32))
-- Values: [{"x": 1, "y": 2}, {"x": 4, "y": 3}]

When it applies:

Array literals with struct elements: [{...}, {...}]
Nested arrays with structs: [[{x: 1}, {x: 2}]]

2. Array Construction from Columns#

When constructing arrays from table columns with different struct schemas:

CREATE TABLE t_left (s struct(x int, y int)) AS VALUES ({x: 1, y: 2});
CREATE TABLE t_right (s struct(y int, x int)) AS VALUES ({y: 3, x: 4});

-- Dynamically constructs unified array schema
SELECT [t_left.s, t_right.s] FROM t_left JOIN t_right;

-- Result: [{"x": 1, "y": 2}, {"x": 4, "y": 3}]

When it applies:

Array construction with column references: [col1, col2]
Array construction in joins with matching field names

3. UNION Operations#

When combining query results with different struct field orders:

SELECT {a: 1, b: 2} as s
UNION ALL
SELECT {b: 3, a: 4} as s;

-- Result: {"a": 1, "b": 2} and {"a": 4, "b": 3}

When it applies:

UNION ALL with structs: field names matched across branches
UNION (deduplicated) with structs

4. Common Table Expressions (CTEs)#

When multiple CTEs produce structs with different field orders that are combined:

WITH
  t1 AS (SELECT {a: 1, b: 2} as s),
  t2 AS (SELECT {b: 3, a: 4} as s)
SELECT s FROM t1
UNION ALL
SELECT s FROM t2;

-- Result: Field names matched across CTEs

5. VALUES Clauses#

When creating tables or temporary results with struct values in different field orders:

CREATE TABLE t AS VALUES ({a: 1, b: 2}), ({b: 3, a: 4});

-- Table schema unified: struct(a: int, b: int)
-- Values: {a: 1, b: 2} and {a: 4, b: 3}

6. JOIN Operations#

When joining tables where the JOIN condition involves structs with different field orders:

CREATE TABLE orders (customer struct(name varchar, id int));
CREATE TABLE customers (info struct(id int, name varchar));

-- Join matches struct fields by name
SELECT * FROM orders
JOIN customers ON orders.customer = customers.info;

7. Aggregate Functions#

When collecting structs with different field orders using aggregate functions like array_agg:

SELECT array_agg(s) FROM (
  SELECT {x: 1, y: 2} as s
  UNION ALL
  SELECT {y: 3, x: 4} as s
) t
GROUP BY category;

-- Result: Array of structs with unified field order

8. Window Functions#

When using window functions with struct expressions having different field orders:

SELECT
  id,
  row_number() over (partition by s order by id) as rn
FROM (
  SELECT {category: 1, value: 10} as s, 1 as id
  UNION ALL
  SELECT {value: 20, category: 1} as s, 2 as id
);

-- Fields matched by name in PARTITION BY clause

NULL Handling for Missing Fields#

When structs have different field sets, missing fields are filled with NULL values during coercion.

Example: Partial Field Overlap#

-- Struct in first position has fields: a, b
-- Struct in second position has fields: b, c
-- Unified schema includes all fields: a, b, c

SELECT [
  CAST({a: 1, b: 2} AS STRUCT(a INT, b INT, c INT)),
  CAST({b: 3, c: 4} AS STRUCT(a INT, b INT, c INT))
];

-- Result:
-- [
--   {"a": 1, "b": 2, "c": NULL},
--   {"a": NULL, "b": 3, "c": 4}
-- ]

Limitations#

Field count must match exactly. If structs have different numbers of fields and their field names don’t completely overlap, the query will fail:

-- This fails because field sets don't match:
-- t_left has {x, y} but t_right has {x, y, z}
SELECT [t_left.s, t_right.s] FROM t_left JOIN t_right;
-- Error: Cannot coerce struct with mismatched field counts

Workaround: Use explicit CAST

To handle partial field overlap, explicitly cast structs to a unified schema:

SELECT [
  CAST(t_left.s AS STRUCT(x INT, y INT, z INT)),
  CAST(t_right.s AS STRUCT(x INT, y INT, z INT))
] FROM t_left JOIN t_right;

Comparison and Ordering#

DataFusion supports comparing STRUCT values with standard comparison operators (=, !=, <, <=, >, >=). Ordering comparisons are lexicographical and follow DataFusion’s default ascending comparison behavior, where NULL sorts before non-NULL values.

Examples#

SELECT {x: 1, y: 2} < {x: 1, y: 3};
-- true

SELECT {x: 1, y: NULL} < {x: 1, y: 2};
-- true

SELECT {x: 1, y: NULL} = {x: 1, y: NULL};
--true

Migration Guide: From Positional to Name-Based Matching#

If you have existing code that relied on positional struct field matching, you may need to update it.

Example: Query That Changes Behavior#

Old behavior (positional):

-- These would have been positionally mapped (left-to-right)
SELECT [{x: 1, y: 2}, {y: 3, x: 4}];
-- Old result (positional): [{"x": 1, "y": 2}, {"y": 3, "x": 4}]

New behavior (name-based):

-- Now uses name-based matching
SELECT [{x: 1, y: 2}, {y: 3, x: 4}];
-- New result (by name): [{"x": 1, "y": 2}, {"x": 4, "y": 3}]

Migration Steps#

Review struct operations - Look for queries that combine structs from different sources
Check field names - Verify that field names match as expected (not positions)
Test with new coercion - Run queries and verify the results match your expectations
Handle field reordering - If you need specific field orders, use explicit CAST operations

Using Explicit CAST for Compatibility#

If you need precise control over struct field order and types, use explicit CAST:

-- Guarantee specific field order and types
SELECT CAST({b: 3, a: 4} AS STRUCT(a INT, b INT));
-- Result: {"a": 4, "b": 3}

Best Practices#

1. Be Explicit with Schema Definitions#

When joining or combining structs, define target schemas explicitly:

-- Good: explicit schema definition
SELECT CAST(data AS STRUCT(id INT, name VARCHAR, active BOOLEAN))
FROM external_source;

2. Use Named Struct Constructors#

Prefer named struct constructors for clarity:

-- Good: field names are explicit
SELECT named_struct('id', 1, 'name', 'Alice', 'active', true);

-- Or using struct literal syntax
SELECT {id: 1, name: 'Alice', active: true};

3. Test Field Mappings#

Always verify that field mappings work as expected:

-- Use arrow_typeof to verify unified schema
SELECT arrow_typeof([{x: 1, y: 2}, {y: 3, x: 4}]);
-- Result: List(Struct("x": Int32, "y": Int32))

4. Handle Partial Field Overlap Explicitly#

When combining structs with partial field overlap, use explicit CAST:

-- Instead of relying on implicit coercion
SELECT [
  CAST(left_struct AS STRUCT(x INT, y INT, z INT)),
  CAST(right_struct AS STRUCT(x INT, y INT, z INT))
];

5. Document Struct Schemas#

In complex queries, document the expected struct schemas:

-- Expected schema: {customer_id: INT, name: VARCHAR, age: INT}
SELECT {
  customer_id: c.id,
  name: c.name,
  age: c.age
} as customer_info
FROM customers c;

Error Messages and Troubleshooting#

“Cannot coerce struct with different field counts”#

Cause: Trying to combine structs with different numbers of fields.

Solution:

-- Use explicit CAST to handle missing fields
SELECT [
  CAST(struct1 AS STRUCT(a INT, b INT, c INT)),
  CAST(struct2 AS STRUCT(a INT, b INT, c INT))
];

“Field X not found in struct”#

Cause: Referencing a field name that doesn’t exist in the struct.

Solution:

-- Verify field names match exactly (case-sensitive)
SELECT s['field_name'] FROM my_table;  -- Use bracket notation for access
-- Or use get_field function
SELECT get_field(s, 'field_name') FROM my_table;

Unexpected NULL values after coercion#

Cause: Struct coercion added NULL for missing fields.

Solution: Check that all structs have the required fields, or explicitly handle NULLs:

SELECT COALESCE(s['field'], default_value) FROM my_table;

Struct Type Coercion and Field Mapping#

Overview: Name-Based vs Positional Mapping#

Example: Field Reordering is Handled Transparently#

Coercion Paths Using Name-Based Matching#

1. Array Literal Construction#

2. Array Construction from Columns#

3. UNION Operations#

4. Common Table Expressions (CTEs)#

5. VALUES Clauses#

6. JOIN Operations#

7. Aggregate Functions#

8. Window Functions#

NULL Handling for Missing Fields#

Example: Partial Field Overlap#

Limitations#

Comparison and Ordering#

Examples#

Migration Guide: From Positional to Name-Based Matching#

Example: Query That Changes Behavior#

Migration Steps#

Using Explicit CAST for Compatibility#

Best Practices#

1. Be Explicit with Schema Definitions#

2. Use Named Struct Constructors#

3. Test Field Mappings#

4. Handle Partial Field Overlap Explicitly#

5. Document Struct Schemas#

Error Messages and Troubleshooting#

“Cannot coerce struct with different field counts”#

“Field X not found in struct”#

Unexpected NULL values after coercion#

Related Functions#