<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - Tim Saucer(rerun.io), Dewey Dunnington(Wherobots), Andrew Lamb(InfluxData)</title><link href="https://datafusion.apache.org/blog/" rel="alternate"/><link href="https://datafusion.apache.org/blog/feeds/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.atom.xml" rel="self"/><id>https://datafusion.apache.org/blog/</id><updated>2025-09-21T00:00:00+00:00</updated><entry><title>Implementing User Defined Types and Custom Metadata in DataFusion</title><link href="https://datafusion.apache.org/blog/2025/09/21/custom-types-using-metadata" rel="alternate"/><published>2025-09-21T00:00:00+00:00</published><updated>2025-09-21T00:00:00+00:00</updated><author><name>Tim Saucer(rerun.io), Dewey Dunnington(Wherobots), Andrew Lamb(InfluxData)</name></author><id>tag:datafusion.apache.org,2025-09-21:/blog/2025/09/21/custom-types-using-metadata</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;&lt;a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/"&gt;Apache DataFusion&lt;/a&gt; significantly improves support for user
defined types and metadata. The user defined function APIs let users access
metadata on the input columns to functions and produce metadata in the output.&lt;/p&gt;
&lt;h2 id="user-defined-types-extension-types"&gt;User defined types == extension types&lt;a class="headerlink" href="#user-defined-types-extension-types" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion directly uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt;'s &lt;a href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html"&gt;DataTypes&lt;/a&gt; as its type system. This
has …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;&lt;a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/"&gt;Apache DataFusion&lt;/a&gt; significantly improves support for user
defined types and metadata. The user defined function APIs let users access
metadata on the input columns to functions and produce metadata in the output.&lt;/p&gt;
&lt;h2 id="user-defined-types-extension-types"&gt;User defined types == extension types&lt;a class="headerlink" href="#user-defined-types-extension-types" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion directly uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt;'s &lt;a href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html"&gt;DataTypes&lt;/a&gt; as its type system. This
has several benefits: it is simple to explain, supports a rich set of
both scalar and nested types, offers true zero-copy interoperability with other Arrow
implementations, and has world-class library support (via &lt;a href="https://github.com/apache/arrow-rs"&gt;arrow-rs&lt;/a&gt;). However, one
challenge of directly using the Arrow type system is that there is no distinction
between logical types and physical types. For example, the Arrow type system
contains multiple types which can store "String"s (sequences of UTF-8 encoded
bytes) such as &lt;code&gt;Utf8&lt;/code&gt;, &lt;code&gt;LargeUtf8&lt;/code&gt;, &lt;code&gt;Dictionary(Utf8)&lt;/code&gt;, and &lt;code&gt;Utf8View&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;However, Apache Arrow does provide &lt;a href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types"&gt;extension types&lt;/a&gt;, a version of logical type
information, which describe how to interpret data stored in one of the existing
physical types. With the improved support for metadata in DataFusion 48.0.0, it
is now easier to implement user defined types using Arrow extension types.&lt;/p&gt;
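&lt;p&gt;The Arrow format records an extension type in field metadata under the reserved keys
&lt;code&gt;ARROW:extension:name&lt;/code&gt; and &lt;code&gt;ARROW:extension:metadata&lt;/code&gt;, and the canonical UUID type uses the
name &lt;code&gt;arrow.uuid&lt;/code&gt; over 16 byte fixed size binary storage. The sketch below models that idea using
only the Rust standard library. &lt;code&gt;StorageType&lt;/code&gt;, &lt;code&gt;FieldModel&lt;/code&gt;, and the two helper functions are
hypothetical stand-ins for illustration and are not arrow-rs or DataFusion APIs.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;use std::collections::HashMap;

/// Hypothetical stand-in for a physical (storage) type.
#[derive(Debug, PartialEq)]
enum StorageType {
    FixedSizeBinary(usize),
    Utf8,
}

/// Hypothetical stand-in for an Arrow field: a storage type plus string metadata.
struct FieldModel {
    storage_type: StorageType,
    metadata: HashMap&amp;lt;String, String&amp;gt;,
}

/// Returns the extension type name, if the field declares one.
fn extension_name(field: &amp;amp;FieldModel) -&amp;gt; Option&amp;lt;&amp;amp;str&amp;gt; {
    field
        .metadata
        .get("ARROW:extension:name")
        .map(String::as_str)
}

/// A field is treated as a UUID only when both the storage type and the
/// extension name match; the bytes alone do not say what they represent.
fn is_uuid(field: &amp;amp;FieldModel) -&amp;gt; bool {
    field.storage_type == StorageType::FixedSizeBinary(16)
        &amp;amp;&amp;amp; extension_name(field) == Some("arrow.uuid")
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this model, two fields with identical storage types are distinguished purely by their
metadata, which is why propagating metadata through every operation matters.&lt;/p&gt;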
&lt;h2 id="metadata-in-apache-arrow-fields"&gt;Metadata in Apache Arrow &lt;code&gt;Field&lt;/code&gt;s&lt;a class="headerlink" href="#metadata-in-apache-arrow-fields" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The &lt;a href="https://arrow.apache.org/docs/format/Columnar.html"&gt;Arrow specification&lt;/a&gt; defines Metadata as a map of key-value pairs of
strings. This metadata is used to attach extension types and use case-specific
context to a column of values. The Rust implementation of Apache Arrow,
&lt;a href="https://github.com/apache/arrow-rs"&gt;arrow-rs&lt;/a&gt;, stores metadata on &lt;a href="https://arrow.apache.org/docs/format/Glossary.html#term-field"&gt;Field&lt;/a&gt;s, but prior to DataFusion 48.0.0, many of
DataFusion's internal APIs used &lt;a href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html"&gt;DataTypes&lt;/a&gt; directly, and thus did not propagate
metadata through all operations.&lt;/p&gt;
&lt;p&gt;In previous versions of DataFusion, &lt;code&gt;Field&lt;/code&gt; metadata was propagated through certain
operations (e.g., renaming or selecting a column) but not through
others (e.g., scalar, window, or aggregate function calls). In DataFusion 48.0.0
and later, all user defined functions are passed the full
input &lt;code&gt;Field&lt;/code&gt; information and can return &lt;code&gt;Field&lt;/code&gt; information to the caller.&lt;/p&gt;
&lt;p&gt;Supporting extension types was a key motivation for adding metadata to
function processing, but the same mechanism can store arbitrary metadata on the
input and output fields, which supports other interesting use cases as we describe
later in this post.&lt;/p&gt;
&lt;h2 id="metadata-handling"&gt;Metadata handling&lt;a class="headerlink" href="#metadata-handling" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Arrow record batches carry a &lt;a href="https://docs.rs/arrow/latest/arrow/datatypes/struct.Schema.html"&gt;Schema&lt;/a&gt; in addition to the Arrow arrays. Each
&lt;a href="https://arrow.apache.org/docs/format/Glossary.html#term-field"&gt;Field&lt;/a&gt; in this &lt;code&gt;Schema&lt;/code&gt; contains a name, data type, nullability, and metadata. The
metadata is specified as a map of key-value pairs of strings. In the new
implementation, the input field information is passed to every user defined function
during processing.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and Array entry in the Columns." class="img-fluid" src="/blog/images/metadata-handling/arrow_record_batch.png" width="100%"/&gt;
&lt;figcaption&gt;
&lt;b&gt;Figure 1:&lt;/b&gt; Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and Array entry in the Columns.
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;It is often desirable to write a generic function for reuse. Prior versions of
user defined functions only had access to the &lt;code&gt;DataType&lt;/code&gt; of the input columns.
This works well for some features that only rely on the types of data, but other
use cases may need additional information that describes the data.&lt;/p&gt;
&lt;p&gt;For example, suppose we wish to write a function that takes in a UUID and returns a string
of the &lt;a href="https://www.ietf.org/rfc/rfc9562.html#section-4.1"&gt;variant&lt;/a&gt; of the input field. We would want this function to be able to handle
all of the string types and also a binary encoded UUID. Because the Arrow specification does not
contain an unsigned 128 bit integer type, it is common to encode a UUID as a fixed size binary
array where each element is 16 bytes long. With the metadata handling in DataFusion 48.0.0,
we can validate during planning that the input data not only has the correct underlying
data type, but that it also represents the right &lt;em&gt;kind&lt;/em&gt; of data. The UUID example is a
common one, and it is included in the &lt;a href="https://arrow.apache.org/docs/format/CanonicalExtensions.html"&gt;canonical extension types&lt;/a&gt; that are now
supported in DataFusion.&lt;/p&gt;
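&lt;p&gt;As a minimal sketch of the per-value logic such a function might apply once planning has
confirmed that the input really is a UUID, the following standard-library-only Rust reads the
variant from the most significant bits of octet 8, following the table in RFC 9562 section 4.1.
The function name &lt;code&gt;uuid_variant&lt;/code&gt; and the returned labels are illustrative assumptions and are
not part of DataFusion or arrow-rs.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;/// Returns a label for the variant of a UUID stored as 16 raw bytes.
///
/// Hypothetical helper for illustration only. The variant is encoded in the
/// most significant bits of octet 8 (RFC 9562, section 4.1).
fn uuid_variant(bytes: &amp;amp;[u8; 16]) -&amp;gt; &amp;amp;'static str {
    // The top four bits of octet 8 are enough to tell the variants apart.
    match bytes[8] &amp;gt;&amp;gt; 4 {
        // Bit pattern 0xxx
        0x0..=0x7 =&amp;gt; "reserved (NCS backward compatibility)",
        // Bit pattern 10xx
        0x8..=0xB =&amp;gt; "RFC 9562",
        // Bit pattern 110x
        0xC..=0xD =&amp;gt; "reserved (Microsoft backward compatibility)",
        // Bit pattern 111x
        _ =&amp;gt; "reserved (future definition)",
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Applied to each 16 byte element of a fixed size binary array, a helper like this would produce
the string output described above.&lt;/p&gt;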
&lt;p&gt;Another common application of metadata handling is understanding encoding of a blob of data.
Suppose you have a column that contains image data. Most likely this data is stored as
an array of &lt;code&gt;u8&lt;/code&gt; data. Without knowing a priori what the encoding of that blob of data is,
you cannot ensure you are using the correct methods for decoding it. You may work around
this by adding another column to your data source indicating the encoding, but this can be
wasteful for systems where the encoding never changes. Instead, you could use metadata to
specify the encoding for the entire column.&lt;/p&gt;
&lt;h2 id="how-to-use-metadata-in-user-defined-functions"&gt;How to use metadata in user defined functions&lt;a class="headerlink" href="#how-to-use-metadata-in-user-defined-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When working with metadata for &lt;a href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html"&gt;user defined scalar functions&lt;/a&gt;, there are typically two
places in the function definition that require implementation.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Computing the return field from the arguments&lt;/li&gt;
&lt;li&gt;Invocation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;During planning, we will attempt to call the function &lt;a href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.return_field_from_args"&gt;return_field_from_args()&lt;/a&gt;. This
provides a list of input fields to the function and returns the output field. To evaluate
metadata on the input side, you can write a function similar to this example:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;fn return_field_from_args(
    &amp;amp;self,
    args: ReturnFieldArgs,
) -&amp;gt; datafusion::common::Result&amp;lt;FieldRef&amp;gt; {
    if args.arg_fields.len() != 1 {
        return exec_err!("Incorrect number of arguments for uuid_version");
    }

    let input_field = &amp;amp;args.arg_fields[0];
    if &amp;amp;DataType::FixedSizeBinary(16) == input_field.data_type() {
        let Ok(CanonicalExtensionType::Uuid(_)) = input_field.try_canonical_extension_type()
        else {
            return exec_err!("Input field must contain the UUID canonical extension type");
        };
    }

    let is_nullable = args.arg_fields[0].is_nullable();

    Ok(Arc::new(Field::new(self.name(), DataType::UInt32, is_nullable)))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we take advantage of the fact that we already have support for extension
types that evaluate metadata. If we wanted to check for metadata other than
extension type support, we could have instead written a snippet such as:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;    if &amp;amp;DataType::FixedSizeBinary(16) == input_field.data_type() {
        // Fail planning if the expected metadata key is absent
        let _ = input_field
            .metadata()
            .get("ARROW:extension:metadata")
            .ok_or(exec_datafusion_err!("Input field must contain the UUID canonical extension type"))?;
    }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you are writing a user defined function that will instead return metadata on output
you can add this directly into the &lt;code&gt;Field&lt;/code&gt; that is the output of the &lt;code&gt;return_field_from_args&lt;/code&gt;
call. In our above example, we could change the return line to:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;    Ok(Arc::new(
        Field::new(self.name(), DataType::UInt32, is_nullable).with_metadata(
            [("my_key".to_string(), "my_value".to_string())]
                .into_iter()
                .collect(),
        ),
    ))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By checking the metadata during the planning process, we can identify errors early in
the query process. There are cases where we wish to have access to this metadata during
execution as well. The function &lt;a href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#tymethod.invoke_with_args"&gt;invoke_with_args&lt;/a&gt; in the user defined function takes
the updated struct &lt;a href="https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarFunctionArgs.html"&gt;ScalarFunctionArgs&lt;/a&gt;. This now contains the input fields, which can
be used to check for metadata. For example, you can do the following:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;fn invoke_with_args(&amp;amp;self, args: ScalarFunctionArgs) -&amp;gt; Result&amp;lt;ColumnarValue&amp;gt; {
    assert_eq!(args.arg_fields.len(), 1);
    let my_value = args.arg_fields[0]
        .metadata()
        .get("encoding_type");
    ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this snippet we have extracted an &lt;code&gt;Option&amp;lt;&amp;amp;String&amp;gt;&lt;/code&gt; from the input field metadata
which we can then use to determine which functions we might want to call. We could
then parse the returned value to determine what type of encoding to use when
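evaluating the array in the arguments. Since &lt;code&gt;return_field_from_args&lt;/code&gt; takes &lt;code&gt;&amp;amp;self&lt;/code&gt; rather than &lt;code&gt;&amp;amp;mut self&lt;/code&gt;,
a value looked up during planning cannot be stored on the function for later use, so the lookup is repeated here during execution.&lt;/p&gt;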
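&lt;p&gt;One way to parse such a value, sketched with the Rust standard library only: the
&lt;code&gt;ImageEncoding&lt;/code&gt; enum, the &lt;code&gt;encoding_from_metadata&lt;/code&gt; function, and the accepted values
&lt;code&gt;png&lt;/code&gt; and &lt;code&gt;jpeg&lt;/code&gt; are hypothetical choices for illustration. The map passed in plays the role of
the metadata returned by the input field.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;use std::collections::HashMap;

/// Image encodings this sketch knows how to decode (hypothetical set).
#[derive(Debug, PartialEq)]
enum ImageEncoding {
    Png,
    Jpeg,
}

/// Parses the value stored under the `encoding_type` metadata key.
///
/// Returns an error message when the key is missing or the value is unknown,
/// so the caller can convert it into a query error.
fn encoding_from_metadata(
    metadata: &amp;amp;HashMap&amp;lt;String, String&amp;gt;,
) -&amp;gt; Result&amp;lt;ImageEncoding, String&amp;gt; {
    match metadata.get("encoding_type").map(String::as_str) {
        Some("png") =&amp;gt; Ok(ImageEncoding::Png),
        Some("jpeg") =&amp;gt; Ok(ImageEncoding::Jpeg),
        Some(other) =&amp;gt; Err(format!("unsupported encoding_type: {other}")),
        None =&amp;gt; Err("missing encoding_type metadata".to_string()),
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Returning an error for a missing or unknown value, rather than guessing a default, keeps a
mislabeled column from being silently decoded with the wrong method.&lt;/p&gt;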
&lt;p&gt;The description in this section applies to scalar user defined functions, but equivalent
support exists for aggregate and window functions.&lt;/p&gt;
&lt;h2 id="extension-types"&gt;Extension types&lt;a class="headerlink" href="#extension-types" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Extension types are one of the primary motivations for this enhancement in
DataFusion 48.0.0. The official Rust implementation of Apache Arrow, &lt;a href="https://github.com/apache/arrow-rs"&gt;arrow-rs&lt;/a&gt;,
already contains support for the &lt;a href="https://arrow.apache.org/docs/format/CanonicalExtensions.html"&gt;canonical extension types&lt;/a&gt;. This support includes
helper functions such as &lt;code&gt;try_canonical_extension_type()&lt;/code&gt; in the earlier example.&lt;/p&gt;
&lt;p&gt;For a concrete example of how extension types can be used in DataFusion functions,
there is an &lt;a href="https://github.com/timsaucer/datafusion_extension_type_examples"&gt;example repository&lt;/a&gt; that demonstrates using UUIDs. The UUID extension
type specifies that the data are stored as a Fixed Size Binary of length 16. In the
DataFusion core functions, we have the ability to generate string representations of
UUIDs that match the version 4 specification. These are helpful, but a user may
wish to do additional work with UUIDs where having them in the dense representation
is preferable. Alternatively, the user may already have data with the binary encoding
and want to extract values such as the version, timestamp, or string
representation.&lt;/p&gt;
&lt;p&gt;In the example repository we have created three user defined functions: &lt;code&gt;UuidVersion&lt;/code&gt;,
&lt;code&gt;StringToUuid&lt;/code&gt;, and &lt;code&gt;UuidToString&lt;/code&gt;. Each of these implements &lt;code&gt;ScalarUDFImpl&lt;/code&gt; and can
be used as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;#[tokio::main]
async fn main() -&amp;gt; Result&amp;lt;()&amp;gt; {
    let ctx = create_context()?;

    // get a DataFrame from the context
    let mut df = ctx.table("t").await?;

    // Create the string UUIDs
    df = df.select(vec![uuid().alias("string_uuid")])?;

    // Convert string UUIDs to canonical extension UUIDs
    let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default());
    df = df.with_column("uuid", string_to_uuid.call(vec![col("string_uuid")]))?;

    // Extract version number from canonical extension UUIDs
    let version = ScalarUDF::new_from_impl(UuidVersion::default());
    df = df.with_column("version", version.call(vec![col("uuid")]))?;

    // Convert back to a string
    let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default());
    df = df.with_column("string_round_trip", uuid_to_string.call(vec![col("uuid")]))?;

    df.show().await?;

    Ok(())
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://github.com/timsaucer/datafusion_extension_type_examples"&gt;example repository&lt;/a&gt; also contains a crate that demonstrates how to expose these
UDFs to &lt;a href="https://datafusion.apache.org/python/"&gt;datafusion-python&lt;/a&gt;. This requires version 48.0.0 or later.&lt;/p&gt;
&lt;h2 id="other-use-cases"&gt;Other use cases&lt;a class="headerlink" href="#other-use-cases" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The metadata attached to the fields can be used to store &lt;em&gt;any&lt;/em&gt; user data in key/value
pairs. Some of the other use cases that have been identified include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Creating output for downstream systems. One user of DataFusion produces
  &lt;a href="https://rerun.io/blog/column-chunks"&gt;data visualizations&lt;/a&gt; that are dependant upon metadata in record batch fields. By
  enabling metadata on output of user defined functions, we can now produce batches
  that are directly consumable by these systems.&lt;/li&gt;
&lt;li&gt;Describing the relationships between columns of data. You can store data about how
  one column of data relates to another and use it during function evaluation. For
  example, in robotics it is common to use &lt;a href="https://wiki.ros.org/tf2"&gt;transforms&lt;/a&gt; to describe how to convert
  from one coordinate system to another. It can be convenient to send the function
  all the columns that contain transform information and then allow the function
  to determine which columns to use based on the metadata. This allows for
  encapsulation of the transform logic within the user function.&lt;/li&gt;
&lt;li&gt;Storing logical types of the data model. &lt;a href="https://docs.influxdata.com/influxdb/v1/concepts/schema_and_data_layout/"&gt;InfluxDB&lt;/a&gt; uses field metadata to specify
  which columns are used for tags, times, and fields.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Based on the experience of the authors, we recommend caution when using metadata
for use cases other than type extension. One issue that can arise is that as columns
are used to compute new fields, some functions may pass through the metadata and the
semantic meaning may change. For example, suppose you decided to use metadata to
store some kind of statistics for the entire stream of record batches. Then you pass
that column through a filter that removes many rows of data. Your statistics
metadata may now be invalid, even though it was passed through the filter.&lt;/p&gt;
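&lt;p&gt;The statistics pitfall can be reproduced in a small model built on the Rust standard library.
&lt;code&gt;Column&lt;/code&gt; and &lt;code&gt;filter_column&lt;/code&gt; are hypothetical stand-ins rather than DataFusion types, and the
&lt;code&gt;row_count&lt;/code&gt; key is an invented example of a statistic stored in metadata.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;use std::collections::HashMap;

/// Hypothetical stand-in for a column of values with attached metadata.
struct Column {
    values: Vec&amp;lt;i64&amp;gt;,
    metadata: HashMap&amp;lt;String, String&amp;gt;,
}

/// Keeps only the values for which `keep` returns true.
/// The metadata is copied through unchanged, as a pass-through operator would do.
fn filter_column(col: &amp;amp;Column, keep: impl Fn(i64) -&amp;gt; bool) -&amp;gt; Column {
    Column {
        values: col.values.iter().copied().filter(|v| keep(*v)).collect(),
        metadata: col.metadata.clone(),
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Filtering a four-row column whose metadata records &lt;code&gt;row_count&lt;/code&gt; as &lt;code&gt;4&lt;/code&gt; with a predicate
that keeps two rows leaves the metadata still claiming four rows.&lt;/p&gt;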
&lt;p&gt;Similarly, if you use metadata to form relations between one column and another and
the naming of the columns has changed at some point in your workflow, then the metadata
may point to the wrong column. This can be mitigated by
not relying on column naming but rather adding additional metadata to all columns of
interest.&lt;/p&gt;
&lt;h2 id="acknowledgements"&gt;Acknowledgements&lt;a class="headerlink" href="#acknowledgements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We would like to thank &lt;a href="https://rerun.io"&gt;Rerun.io&lt;/a&gt; for sponsoring the development of this work. &lt;a href="https://rerun.io"&gt;Rerun.io&lt;/a&gt;
is building a data visualization system for Physical AI and uses metadata to specify 
context about columns in Arrow record batches.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;a class="headerlink" href="#conclusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The enhanced metadata handling in DataFusion 48.0.0 is a significant step
forward in the ability to handle more interesting types of data. Users can
validate that the input data matches the intent of the data to be processed, enable
complex operations on binary data because the encoding is known, and
use metadata to create new and interesting user defined data types.
We can't wait to see what you build with it!&lt;/p&gt;
&lt;h2 id="get-involved"&gt;Get Involved&lt;a class="headerlink" href="#get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The DataFusion team is an active and engaging community and we would love to have you join
us and help the project.&lt;/p&gt;
&lt;p&gt;Here are some ways to get involved:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Learn more by visiting the &lt;a href="https://datafusion.apache.org/index.html"&gt;DataFusion&lt;/a&gt; project page.&lt;/li&gt;
&lt;li&gt;Try out the project and provide feedback, file issues, and contribute code.&lt;/li&gt;
&lt;li&gt;Work on a &lt;a href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;good first issue&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Reach out to us via the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;communication doc&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</content><category term="blog"/></entry></feed>