# Struct Type Coercion and Field Mapping DataFusion uses **name-based field mapping** when coercing struct types across different operations. This document explains how struct coercion works, when it applies, and how to handle NULL fields. ## Overview: Name-Based vs Positional Mapping When combining structs from different sources (e.g., in UNION, array construction, or JOINs), DataFusion matches struct fields by **name** rather than by **position**. This provides more robust and predictable behavior compared to positional matching. ### Example: Field Reordering is Handled Transparently ```sql -- These two structs have the same fields in different order SELECT [{a: 1, b: 2}, {b: 3, a: 4}]; -- Result: Field names matched, values unified -- [{"a": 1, "b": 2}, {"a": 4, "b": 3}] ``` ## Coercion Paths Using Name-Based Matching The following query operations use name-based field mapping for struct coercion: ### 1. Array Literal Construction When creating array literals with struct elements that have different field orders: ```sql -- Structs with reordered fields in array literal SELECT [{x: 1, y: 2}, {y: 3, x: 4}]; -- Unified type: List(Struct("x": Int32, "y": Int32)) -- Values: [{"x": 1, "y": 2}, {"x": 4, "y": 3}] ``` **When it applies:** - Array literals with struct elements: `[{...}, {...}]` - Nested arrays with structs: `[[{x: 1}, {x: 2}]]` ### 2. Array Construction from Columns When constructing arrays from table columns with different struct schemas: ```sql CREATE TABLE t_left (s struct(x int, y int)) AS VALUES ({x: 1, y: 2}); CREATE TABLE t_right (s struct(y int, x int)) AS VALUES ({y: 3, x: 4}); -- Dynamically constructs unified array schema SELECT [t_left.s, t_right.s] FROM t_left JOIN t_right; -- Result: [{"x": 1, "y": 2}, {"x": 4, "y": 3}] ``` **When it applies:** - Array construction with column references: `[col1, col2]` - Array construction in joins with matching field names ### 3. UNION Operations When combining query results with different struct field orders: ```sql SELECT {a: 1, b: 2} as s UNION ALL SELECT {b: 3, a: 4} as s; -- Result: {"a": 1, "b": 2} and {"a": 4, "b": 3} ``` **When it applies:** - UNION ALL with structs: field names matched across branches - UNION (deduplicated) with structs ### 4. Common Table Expressions (CTEs) When multiple CTEs produce structs with different field orders that are combined: ```sql WITH t1 AS (SELECT {a: 1, b: 2} as s), t2 AS (SELECT {b: 3, a: 4} as s) SELECT s FROM t1 UNION ALL SELECT s FROM t2; -- Result: Field names matched across CTEs ``` ### 5. VALUES Clauses When creating tables or temporary results with struct values in different field orders: ```sql CREATE TABLE t AS VALUES ({a: 1, b: 2}), ({b: 3, a: 4}); -- Table schema unified: struct(a: int, b: int) -- Values: {a: 1, b: 2} and {a: 4, b: 3} ``` ### 6. JOIN Operations When joining tables where the JOIN condition involves structs with different field orders: ```sql CREATE TABLE orders (customer struct(name varchar, id int)); CREATE TABLE customers (info struct(id int, name varchar)); -- Join matches struct fields by name SELECT * FROM orders JOIN customers ON orders.customer = customers.info; ``` ### 7. Aggregate Functions When collecting structs with different field orders using aggregate functions like `array_agg`: ```sql SELECT array_agg(s) FROM ( SELECT {x: 1, y: 2} as s UNION ALL SELECT {y: 3, x: 4} as s ) t GROUP BY category; -- Result: Array of structs with unified field order ``` ### 8. Window Functions When using window functions with struct expressions having different field orders: ```sql SELECT id, row_number() over (partition by s order by id) as rn FROM ( SELECT {category: 1, value: 10} as s, 1 as id UNION ALL SELECT {value: 20, category: 1} as s, 2 as id ); -- Fields matched by name in PARTITION BY clause ``` ## NULL Handling for Missing Fields When structs have different field sets, missing fields are filled with **NULL** values during coercion. ### Example: Partial Field Overlap ```sql -- Struct in first position has fields: a, b -- Struct in second position has fields: b, c -- Unified schema includes all fields: a, b, c SELECT [ CAST({a: 1, b: 2} AS STRUCT(a INT, b INT, c INT)), CAST({b: 3, c: 4} AS STRUCT(a INT, b INT, c INT)) ]; -- Result: -- [ -- {"a": 1, "b": 2, "c": NULL}, -- {"a": NULL, "b": 3, "c": 4} -- ] ``` ### Limitations **Field count must match exactly.** If structs have different numbers of fields and their field names don't completely overlap, the query will fail: ```sql -- This fails because field sets don't match: -- t_left has {x, y} but t_right has {x, y, z} SELECT [t_left.s, t_right.s] FROM t_left JOIN t_right; -- Error: Cannot coerce struct with mismatched field counts ``` **Workaround: Use explicit CAST** To handle partial field overlap, explicitly cast structs to a unified schema: ```sql SELECT [ CAST(t_left.s AS STRUCT(x INT, y INT, z INT)), CAST(t_right.s AS STRUCT(x INT, y INT, z INT)) ] FROM t_left JOIN t_right; ``` ## Migration Guide: From Positional to Name-Based Matching If you have existing code that relied on **positional** struct field matching, you may need to update it. ### Example: Query That Changes Behavior **Old behavior (positional):** ```sql -- These would have been positionally mapped (left-to-right) SELECT [{x: 1, y: 2}, {y: 3, x: 4}]; -- Old result (positional): [{"x": 1, "y": 2}, {"y": 3, "x": 4}] ``` **New behavior (name-based):** ```sql -- Now uses name-based matching SELECT [{x: 1, y: 2}, {y: 3, x: 4}]; -- New result (by name): [{"x": 1, "y": 2}, {"x": 4, "y": 3}] ``` ### Migration Steps 1. **Review struct operations** - Look for queries that combine structs from different sources 2. **Check field names** - Verify that field names match as expected (not positions) 3. **Test with new coercion** - Run queries and verify the results match your expectations 4. **Handle field reordering** - If you need specific field orders, use explicit CAST operations ### Using Explicit CAST for Compatibility If you need precise control over struct field order and types, use explicit `CAST`: ```sql -- Guarantee specific field order and types SELECT CAST({b: 3, a: 4} AS STRUCT(a INT, b INT)); -- Result: {"a": 4, "b": 3} ``` ## Best Practices ### 1. Be Explicit with Schema Definitions When joining or combining structs, define target schemas explicitly: ```sql -- Good: explicit schema definition SELECT CAST(data AS STRUCT(id INT, name VARCHAR, active BOOLEAN)) FROM external_source; ``` ### 2. Use Named Struct Constructors Prefer named struct constructors for clarity: ```sql -- Good: field names are explicit SELECT named_struct('id', 1, 'name', 'Alice', 'active', true); -- Or using struct literal syntax SELECT {id: 1, name: 'Alice', active: true}; ``` ### 3. Test Field Mappings Always verify that field mappings work as expected: ```sql -- Use arrow_typeof to verify unified schema SELECT arrow_typeof([{x: 1, y: 2}, {y: 3, x: 4}]); -- Result: List(Struct("x": Int32, "y": Int32)) ``` ### 4. Handle Partial Field Overlap Explicitly When combining structs with partial field overlap, use explicit CAST: ```sql -- Instead of relying on implicit coercion SELECT [ CAST(left_struct AS STRUCT(x INT, y INT, z INT)), CAST(right_struct AS STRUCT(x INT, y INT, z INT)) ]; ``` ### 5. Document Struct Schemas In complex queries, document the expected struct schemas: ```sql -- Expected schema: {customer_id: INT, name: VARCHAR, age: INT} SELECT { customer_id: c.id, name: c.name, age: c.age } as customer_info FROM customers c; ``` ## Error Messages and Troubleshooting ### "Cannot coerce struct with different field counts" **Cause:** Trying to combine structs with different numbers of fields. **Solution:** ```sql -- Use explicit CAST to handle missing fields SELECT [ CAST(struct1 AS STRUCT(a INT, b INT, c INT)), CAST(struct2 AS STRUCT(a INT, b INT, c INT)) ]; ``` ### "Field X not found in struct" **Cause:** Referencing a field name that doesn't exist in the struct. **Solution:** ```sql -- Verify field names match exactly (case-sensitive) SELECT s['field_name'] FROM my_table; -- Use bracket notation for access -- Or use get_field function SELECT get_field(s, 'field_name') FROM my_table; ``` ### Unexpected NULL values after coercion **Cause:** Struct coercion added NULL for missing fields. **Solution:** Check that all structs have the required fields, or explicitly handle NULLs: ```sql SELECT COALESCE(s['field'], default_value) FROM my_table; ``` ## Related Functions - `arrow_typeof()` - Returns the Arrow type of an expression - `struct()` / `named_struct()` - Creates struct values - `get_field()` - Extracts field values from structs - `CAST()` - Explicitly casts structs to specific schemas