<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - timsaucer</title><link href="https://datafusion.apache.org/blog/" rel="alternate"/><link href="https://datafusion.apache.org/blog/feeds/timsaucer.atom.xml" rel="self"/><id>https://datafusion.apache.org/blog/</id><updated>2025-03-30T00:00:00+00:00</updated><entry><title>Apache DataFusion Python 46.0.0 Released</title><link href="https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0" rel="alternate"/><published>2025-03-30T00:00:00+00:00</published><updated>2025-03-30T00:00:00+00:00</updated><author><name>timsaucer</name></author><id>tag:datafusion.apache.org,2025-03-30:/blog/2025/03/30/datafusion-python-46.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are happy to announce that &lt;a href="https://pypi.org/project/datafusion/46.0.0/"&gt;datafusion-python 46.0.0&lt;/a&gt; has been released. This release
brings in all of the new features of the core &lt;a href="https://datafusion.apache.org/blog/2025/03/24/datafusion-46.0.0"&gt;DataFusion 46.0.0&lt;/a&gt; library. Since the last
blog post for &lt;a href="https://datafusion.apache.org/blog/2024/12/14/datafusion-python-43.1.0/"&gt;datafusion-python 43.1.0&lt;/a&gt;, a large number of improvements have been made
that can …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are happy to announce that &lt;a href="https://pypi.org/project/datafusion/46.0.0/"&gt;datafusion-python 46.0.0&lt;/a&gt; has been released. This release
brings in all of the new features of the core &lt;a href="https://datafusion.apache.org/blog/2025/03/24/datafusion-46.0.0"&gt;DataFusion 46.0.0&lt;/a&gt; library. Since the last
blog post for &lt;a href="https://datafusion.apache.org/blog/2024/12/14/datafusion-python-43.1.0/"&gt;datafusion-python 43.1.0&lt;/a&gt;, a large number of improvements have been made
that can be found in the &lt;a href="https://github.com/apache/datafusion-python/tree/main/dev/changelog"&gt;changelogs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We highly recommend reviewing the upstream &lt;a href="https://datafusion.apache.org/blog/2025/03/24/datafusion-46.0.0"&gt;DataFusion 46.0.0&lt;/a&gt; announcement.&lt;/p&gt;
&lt;h2 id="easier-file-reading"&gt;Easier file reading&lt;a class="headerlink" href="#easier-file-reading" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In these releases we have introduced two new ways to more easily read files into
DataFrames.&lt;/p&gt;
&lt;p&gt;PR &lt;a href="https://github.com/apache/datafusion-python/pull/982"&gt;#982&lt;/a&gt; introduced a series of easier read functions for Parquet, JSON, CSV, and
AVRO files. This introduces a concept of a global context that is available by
default when using these methods. Instead of creating a default &lt;code&gt;SessionContext&lt;/code&gt;
and then calling its read methods, you can simply import these alternative read
functions and begin working with your DataFrames. The example below shows how
concise the new approach is.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from datafusion.io import read_parquet
df = read_parquet(path="./examples/tpch/data/customer.parquet")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;PR &lt;a href="https://github.com/apache/datafusion-python/pull/980"&gt;#980&lt;/a&gt; adds a method for setting up a session context to use URL tables. With
this enabled, you can use a path to a local file as a table name. An example
of how to use this is demonstrated in the following snippet.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import datafusion
ctx = datafusion.SessionContext().enable_url_table()
df = ctx.table("./examples/tpch/data/customer.parquet")
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="registering-table-views"&gt;Registering Table Views&lt;a class="headerlink" href="#registering-table-views" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion supports registering a logical plan as a view with a session context. This
allows creating views in one part of your workflow and passing the session
context to other places where that logical plan can be reused. This is a useful
feature for building up complex workflows and for code clarity. PR &lt;a href="https://github.com/apache/datafusion-python/pull/1016"&gt;#1016&lt;/a&gt; enables this
feature in &lt;code&gt;datafusion-python&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For example, supposing you have a DataFrame called &lt;code&gt;df1&lt;/code&gt;, you could use this code snippet
to register the view and then use it in another place:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;ctx.register_view("view1", df1)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And then in another portion of your code which has access to the same session context
you can retrieve the DataFrame with:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;df2 = ctx.table("view1")
&lt;/code&gt;&lt;/pre&gt;
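&lt;p&gt;Combining these two calls, here is a minimal end-to-end sketch; the view name and
query are illustrative only:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from datafusion import SessionContext

ctx = SessionContext()
df1 = ctx.sql("SELECT 1 AS a")

# Register the DataFrame's logical plan as a named view
ctx.register_view("view1", df1)

# Elsewhere, with access to the same session context, retrieve it by name
df2 = ctx.table("view1")
print(df2.collect()[0].num_rows)
&lt;/code&gt;&lt;/pre&gt;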
&lt;h2 id="asynchronous-iteration-of-record-batches"&gt;Asynchronous Iteration of Record Batches&lt;a class="headerlink" href="#asynchronous-iteration-of-record-batches" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Previously, retrieving a &lt;code&gt;RecordBatch&lt;/code&gt; from a &lt;code&gt;RecordBatchStream&lt;/code&gt; was a synchronous call,
which forced the end user's code to block while waiting for data retrieval. This is described in
&lt;a href="https://github.com/apache/datafusion-python/issues/974"&gt;Issue 974&lt;/a&gt;. We continue to support the synchronous iterator, but we have also added
the ability to retrieve each &lt;code&gt;RecordBatch&lt;/code&gt; using the Python asynchronous &lt;code&gt;anext&lt;/code&gt;
function.&lt;/p&gt;
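&lt;p&gt;As a small sketch, assuming the stream returned by &lt;code&gt;execute_stream&lt;/code&gt; supports
asynchronous iteration as described, consuming batches might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import asyncio

from datafusion import SessionContext


async def collect_batches():
    ctx = SessionContext()
    df = ctx.sql("SELECT 1 AS a")
    # Consume the stream asynchronously instead of blocking on each batch
    return [batch async for batch in df.execute_stream()]


batches = asyncio.run(collect_batches())
print(len(batches))
&lt;/code&gt;&lt;/pre&gt;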
&lt;h2 id="default-zstd-compression-for-parquet-files"&gt;Default ZSTD Compression for Parquet files&lt;a class="headerlink" href="#default-zstd-compression-for-parquet-files" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With PR &lt;a href="https://github.com/apache/datafusion-python/pull/981"&gt;#981&lt;/a&gt;, we changed the default when saving Parquet files to use ZSTD compression.
Previously the default was uncompressed, which could consume excessive disk space. ZSTD is an
excellent compression scheme that balances speed and compression ratio. Users can still
save their Parquet files uncompressed by passing the appropriate value to the
&lt;code&gt;compression&lt;/code&gt; argument when calling &lt;code&gt;DataFrame.write_parquet&lt;/code&gt;.&lt;/p&gt;
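&lt;p&gt;For example, a minimal sketch of opting out of the new default, assuming
&lt;code&gt;"uncompressed"&lt;/code&gt; remains an accepted value for the &lt;code&gt;compression&lt;/code&gt; argument:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import os
import tempfile

from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("SELECT 1 AS a")

# Without a compression argument the output now uses zstd by default;
# passing "uncompressed" restores the previous behavior.
out_path = os.path.join(tempfile.mkdtemp(), "example.parquet")
df.write_parquet(out_path, compression="uncompressed")
print(os.path.exists(out_path))
&lt;/code&gt;&lt;/pre&gt;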
&lt;h2 id="udf-decorators"&gt;UDF Decorators&lt;a class="headerlink" href="#udf-decorators" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In PRs &lt;a href="https://github.com/apache/datafusion-python/pull/1040"&gt;#1040&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion-python/pull/1061"&gt;#1061&lt;/a&gt; we added methods that make creating user defined functions
easier by taking advantage of Python decorators. With these PRs you no longer need
to define a function and then separately wrap it as a UDF. Instead you can
simply apply the appropriate &lt;code&gt;udf&lt;/code&gt; decorator. Similar decorators exist for aggregate
and window user defined functions.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;@udf([pa.int64(), pa.int64()], pa.bool_(), "stable")
def my_custom_function(
    age: pa.Array,
    favorite_number: pa.Array,
) -&amp;gt; pa.Array:
    pass
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="uv-package-management"&gt;&lt;code&gt;uv&lt;/code&gt; package management&lt;a class="headerlink" href="#uv-package-management" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; is an extremely fast Python package manager, written in Rust. In the previous version
of &lt;code&gt;datafusion-python&lt;/code&gt; we had a combination of settings of PyPi and Conda. Instead, we
switch to using &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; is our primary method for dependency management.&lt;/p&gt;
&lt;p&gt;For most users of DataFusion, this change will be transparent. You can still install
via &lt;code&gt;pip&lt;/code&gt; or &lt;code&gt;conda&lt;/code&gt;. For developers, the instructions in the repository have been updated.&lt;/p&gt;
&lt;h2 id="code-cleanup"&gt;Code cleanup&lt;a class="headerlink" href="#code-cleanup" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In an effort to improve our code cleanliness and ensure we are following Python best
practices, we use &lt;a href="https://docs.astral.sh/ruff/"&gt;ruff&lt;/a&gt; to perform Python linting. Until now we had enabled only a portion
of the available linters. In PRs &lt;a href="https://github.com/apache/datafusion-python/pull/1055"&gt;#1055&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion-python/pull/1062"&gt;#1062&lt;/a&gt;, we enabled many more
of these linters and made code improvements to ensure we follow their
recommendations.&lt;/p&gt;
&lt;h2 id="improved-jupyter-notebook-rendering"&gt;Improved Jupyter Notebook rendering&lt;a class="headerlink" href="#improved-jupyter-notebook-rendering" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Since PR &lt;a href="https://github.com/apache/datafusion-python/pull/839"&gt;#839&lt;/a&gt; in DataFusion 41.0.0 we have been able to render DataFrames as HTML in
&lt;a href="https://jupyter.org/"&gt;Jupyter&lt;/a&gt; notebooks, a big improvement over the &lt;code&gt;show&lt;/code&gt; command in environments that
can render tables. In PR &lt;a href="https://github.com/apache/datafusion-python/pull/1036"&gt;#1036&lt;/a&gt; we went a step further and added a variety
of features.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;HTML tables are now scrollable, both vertically and horizontally.&lt;/li&gt;
&lt;li&gt;When data are truncated, we report this to the user.&lt;/li&gt;
&lt;li&gt;Instead of showing a small number of rows, we collect up to 2 megabytes of data to
display. Since the tables are scrollable, we can make more data available
to the user without sacrificing notebook usability.&lt;/li&gt;
&lt;li&gt;We report explicitly when the DataFrame is empty. Previously we would not output
anything for an empty table. This indicator helps users verify their plans are
written correctly, since missing output can easily be overlooked.&lt;/li&gt;
&lt;li&gt;For long data output, we generate a collapsed view that the user can click
to expand.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The view below shows an example of some of these features, such as the
expandable text and scroll bars.&lt;/p&gt;
&lt;figure class="text-center"&gt;
&lt;img alt="Fig 1: Example html rendering in a jupyter notebook." class="img-fluid" src="/blog/images/python-datafusion-46.0.0/html_rendering.png"/&gt;
&lt;figcaption&gt;
&lt;b&gt;Figure 1&lt;/b&gt;: With the html rendering enhancements, tables are more easily
   viewable in jupyter notebooks.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id="extension-documentation"&gt;Extension Documentation&lt;a class="headerlink" href="#extension-documentation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We have recently added &lt;a href="https://datafusion.apache.org/python/contributor-guide/ffi.html"&gt;Extension Documentation&lt;/a&gt; to the DataFusion in Python website. We
have received many requests for help integrating DataFusion
in Python with other Rust libraries. To address these questions we wrote an article about
some of the difficulties we encounter when using Rust libraries from Python and our
approach to addressing them.&lt;/p&gt;
&lt;h2 id="migration-guide"&gt;Migration Guide&lt;a class="headerlink" href="#migration-guide" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;During the upgrade from &lt;a href="https://github.com/apache/datafusion/blob/main/dev/changelog/43.0.0.md"&gt;DataFusion 43.0.0&lt;/a&gt; to &lt;a href="https://github.com/apache/datafusion/blob/main/dev/changelog/44.0.0.md"&gt;DataFusion 44.0.0&lt;/a&gt; as our upstream core
dependency, we discovered a few changes were necessary within our repository and our
unit tests. These notes serve to help guide users who may encounter similar issues when
upgrading.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;RuntimeConfig&lt;/code&gt; is now deprecated in favor of &lt;code&gt;RuntimeEnvBuilder&lt;/code&gt;. The migration is
fairly straightforward, and the corresponding classes have been marked as deprecated. For
end users it should simply be a matter of changing the class name.&lt;/li&gt;
&lt;li&gt;If you perform a &lt;code&gt;concat&lt;/code&gt; of a &lt;code&gt;string_view&lt;/code&gt; and &lt;code&gt;string&lt;/code&gt;, it will now return a
&lt;code&gt;string_view&lt;/code&gt; instead of a &lt;code&gt;string&lt;/code&gt;. This likely only impacts unit tests that are validating
return types. In general, it is recommended to switch to using &lt;code&gt;string_view&lt;/code&gt; whenever 
possible. You can see the blog articles &lt;a href="https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/"&gt;String View Pt 1&lt;/a&gt; and &lt;a href="https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-2/"&gt;Pt 2&lt;/a&gt; for more information
on these performance improvements.&lt;/li&gt;
&lt;li&gt;The function &lt;code&gt;date_part&lt;/code&gt; now returns an &lt;code&gt;int32&lt;/code&gt; instead of a &lt;code&gt;float64&lt;/code&gt;. This likely
only impacts unit tests.&lt;/li&gt;
&lt;li&gt;We have upgraded the Python minimum version to 3.9 since 3.8 is no longer officially
supported.&lt;/li&gt;
&lt;/ul&gt;
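&lt;p&gt;As a quick sanity check of the &lt;code&gt;date_part&lt;/code&gt; change above, a sketch like the
following can confirm the returned type; the query itself is illustrative only:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("SELECT date_part('year', DATE '2024-03-30') AS y")
batch = df.collect()[0]

# date_part now yields int32 rather than float64
print(batch.column(0).type)
&lt;/code&gt;&lt;/pre&gt;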
&lt;h2 id="coming-soon"&gt;Coming Soon&lt;a class="headerlink" href="#coming-soon" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There is a lot of excitement around the upcoming work. The following list is not
comprehensive, but offers a glimpse of what is ahead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reusable DataFusion UDFs: The way user defined functions are currently written in
&lt;code&gt;datafusion-python&lt;/code&gt; is slightly different from those written for the upstream Rust
&lt;code&gt;datafusion&lt;/code&gt;. The core ideas are usually the same, but users must expend effort
re-implementing functions already written for Rust projects before they can be used in Python. Issue
&lt;a href="https://github.com/apache/datafusion-python/issues/1017"&gt;#1017&lt;/a&gt; addresses this topic. Work is well underway to make it easier to expose these
user functions through the FFI boundary. This means that the work that already exists in
repositories such as those found in the &lt;a href="https://github.com/datafusion-contrib"&gt;datafusion-contrib&lt;/a&gt; project can be easily
re-used in Python. This will provide a low effort way to expose significant functionality
to the DataFusion in Python community.&lt;/li&gt;
&lt;li&gt;Additional table providers: Work is well underway to provide a host of table providers
for &lt;code&gt;datafusion-python&lt;/code&gt;, including SQLite, DuckDB, PostgreSQL, ODBC, and MySQL! In
&lt;a href="https://github.com/datafusion-contrib/datafusion-table-providers/issues/279"&gt;datafusion-contrib #279&lt;/a&gt; we track the progress of this excellent work. Once complete, users
will be able to &lt;code&gt;pip install&lt;/code&gt; this library and get easy access to all of these table
providers. This is another way we are leveraging the FFI work to greatly expand the usability
of &lt;code&gt;datafusion-python&lt;/code&gt; with relatively low effort.&lt;/li&gt;
&lt;li&gt;External catalog and schema providers: For users who wish to go beyond table providers
and have an entire custom catalog with schema, Issue &lt;a href="https://github.com/apache/datafusion-python/issues/1091"&gt;#1091&lt;/a&gt; tracks the progress of exposing
this in Python. With this work, if you have already written a Rust based table catalog you
will be able to use it from Python, similar to the table provider work described
above.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is only a sample of the great work that is being done. If there are features you would
love to see, we encourage you to open an issue and join us as we build something wonderful.&lt;/p&gt;
&lt;h2 id="appreciation"&gt;Appreciation&lt;a class="headerlink" href="#appreciation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We would like to thank everyone who has helped with these releases through their helpful
conversations, code review, issue descriptions, and code authoring. We would especially
like to thank the following authors of PRs who made these releases possible, listed in
alphabetical order by username: &lt;a href="https://github.com/chenkovsky"&gt;@chenkovsky&lt;/a&gt;, &lt;a href="https://github.com/CrystalZhou0529"&gt;@CrystalZhou0529&lt;/a&gt;, &lt;a href="https://github.com/ion-elgreco"&gt;@ion-elgreco&lt;/a&gt;,
&lt;a href="https://github.com/jsai28"&gt;@jsai28&lt;/a&gt;, &lt;a href="https://github.com/kevinjqliu"&gt;@kevinjqliu&lt;/a&gt;, &lt;a href="https://github.com/kylebarron"&gt;@kylebarron&lt;/a&gt;, &lt;a href="https://github.com/kosiew"&gt;@kosiew&lt;/a&gt;, &lt;a href="https://github.com/nirnayroy"&gt;@nirnayroy&lt;/a&gt;, and &lt;a href="https://github.com/Spaarsh"&gt;@Spaarsh&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Thank you!&lt;/p&gt;
&lt;h2 id="get-involved"&gt;Get Involved&lt;a class="headerlink" href="#get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The DataFusion Python team is an active and engaging community and we would love
to have you join us and help the project.&lt;/p&gt;
&lt;p&gt;Here are some ways to get involved:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Learn more by visiting the &lt;a href="https://datafusion.apache.org/python/index.html"&gt;DataFusion Python project&lt;/a&gt; page.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Try out the project and provide feedback, file issues, and contribute code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Join us on &lt;a href="https://s.apache.org/slack-invite"&gt;ASF Slack&lt;/a&gt; or the &lt;a href="https://discord.gg/Qw5gKqHxUM"&gt;Arrow Rust Discord Server&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Python 43.1.0 Released</title><link href="https://datafusion.apache.org/blog/2024/12/14/datafusion-python-43.1.0" rel="alternate"/><published>2024-12-14T00:00:00+00:00</published><updated>2024-12-14T00:00:00+00:00</updated><author><name>timsaucer</name></author><id>tag:datafusion.apache.org,2024-12-14:/blog/2024/12/14/datafusion-python-43.1.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are happy to announce that &lt;a href="https://pypi.org/project/datafusion/43.1.0/"&gt;datafusion-python 43.1.0&lt;/a&gt; has been released. This release
brings in all of the new features of the core &lt;a href="https://github.com/apache/datafusion/blob/main/dev/changelog/43.0.0.md"&gt;DataFusion 43.0.0&lt;/a&gt; library. Since the last
blog post for &lt;a href="https://datafusion.apache.org/blog/2024/08/20/python-datafusion-40.0.0/"&gt;datafusion-python 40.1.0&lt;/a&gt;, a large number of improvements have been made
that can …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are happy to announce that &lt;a href="https://pypi.org/project/datafusion/43.1.0/"&gt;datafusion-python 43.1.0&lt;/a&gt; has been released. This release
brings in all of the new features of the core &lt;a href="https://github.com/apache/datafusion/blob/main/dev/changelog/43.0.0.md"&gt;DataFusion 43.0.0&lt;/a&gt; library. Since the last
blog post for &lt;a href="https://datafusion.apache.org/blog/2024/08/20/python-datafusion-40.0.0/"&gt;datafusion-python 40.1.0&lt;/a&gt;, a large number of improvements have been made
that can be found in the &lt;a href="https://github.com/apache/datafusion-python/tree/main/dev/changelog"&gt;changelogs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We would like to point out four features that are particularly noteworthy.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Arrow PyCapsule import and export&lt;/li&gt;
&lt;li&gt;User-Defined Window Functions&lt;/li&gt;
&lt;li&gt;Foreign Table Providers&lt;/li&gt;
&lt;li&gt;String View performance enhancements&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="arrow-pycapsule-import-and-export"&gt;Arrow PyCapsule import and export&lt;a class="headerlink" href="#arrow-pycapsule-import-and-export" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Arrow has a stable C interface for moving data between different libraries, but difficulties
sometimes arise when different Python libraries expose this interface through different
methods, requiring developers to write function calls for each library they are attempting
to work with. A better approach is to use the &lt;a href="https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html"&gt;Arrow PyCapsule Interface&lt;/a&gt; which gives a
consistent method for exposing these data structures across libraries.&lt;/p&gt;
&lt;p&gt;In &lt;a href="https://github.com/apache/datafusion-python/pull/825"&gt;PR #825&lt;/a&gt;, we introduced support for both importing and exporting Arrow data in
&lt;code&gt;datafusion-python&lt;/code&gt;. With this improvement, you can now use a single function call to import
a table from &lt;strong&gt;any&lt;/strong&gt; Python library that implements the &lt;a href="https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html"&gt;Arrow PyCapsule Interface&lt;/a&gt;.
Many popular libraries, such as &lt;a href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt; and &lt;a href="https://pola.rs/"&gt;Polars&lt;/a&gt;
already support these interfaces.&lt;/p&gt;
&lt;p&gt;Suppose you have Pandas and Polars DataFrames named &lt;code&gt;df_pandas&lt;/code&gt; and &lt;code&gt;df_polars&lt;/code&gt;, respectively:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;ctx = SessionContext()
df_dfn1 = ctx.from_arrow(df_pandas)
df_dfn1.show()

df_dfn2 = ctx.from_arrow(df_polars)
df_dfn2.show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One great thing about using this interface is that any new library that adopts
these stable interfaces will work out of the box with DataFusion!&lt;/p&gt;
&lt;p&gt;Additionally, DataFusion DataFrames allow for exporting via the PyCapsule interface. For example,
to convert a DataFrame to a PyArrow table, it is simply&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import pyarrow as pa
table = pa.table(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="user-defined-window-functions"&gt;User-Defined Window Functions&lt;a class="headerlink" href="#user-defined-window-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In &lt;code&gt;datafusion-python 42.0.0&lt;/code&gt; we released user-defined window function support in &lt;a href="https://github.com/apache/datafusion-python/pull/880"&gt;PR #880&lt;/a&gt;.
For a detailed description of how these work, please see the online documentation for
all &lt;a href="https://datafusion.apache.org/python/user-guide/common-operations/udf-and-udfa.html"&gt;user-defined functions&lt;/a&gt;. Additionally, the &lt;a href="https://github.com/apache/datafusion-python/tree/main/examples"&gt;examples folder&lt;/a&gt; contains a complete
example demonstrating the four different modes of operation of window functions
within DataFusion.&lt;/p&gt;
&lt;h2 id="foreign-table-providers"&gt;Foreign Table Providers&lt;a class="headerlink" href="#foreign-table-providers" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the core &lt;a href="https://github.com/apache/datafusion/blob/main/dev/changelog/43.0.0.md"&gt;DataFusion 43.0.0&lt;/a&gt; release, support was added for a Foreign Function
Interface to table providers. This creates a stable way to share functionality
across different libraries, similar to how the &lt;a href="https://arrow.apache.org/docs/format/CDataInterface.html"&gt;Arrow C data interface&lt;/a&gt; operates. This
enables libraries such as &lt;a href="https://delta.io/docs/"&gt;delta lake&lt;/a&gt; and &lt;a href="https://github.com/datafusion-contrib/datafusion-table-providers"&gt;datafusion-contrib&lt;/a&gt; to write their own
table providers in Rust and expose them in Python without requiring a Rust dependency
on &lt;code&gt;datafusion-python&lt;/code&gt;. This is important because it allows these libraries to
operate with &lt;code&gt;datafusion-python&lt;/code&gt; regardless of which version of &lt;code&gt;datafusion&lt;/code&gt; they
were built against.&lt;/p&gt;
&lt;p&gt;Implementing this feature in a table provider is quite simple. There is a complete
example in the &lt;a href="https://github.com/apache/datafusion-python/tree/main/examples"&gt;examples folder&lt;/a&gt;, but the relevant code is here, exposed as a
Python function via &lt;a href="https://pyo3.rs/"&gt;pyo3&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;    fn __datafusion_table_provider__&amp;lt;'py&amp;gt;(
        &amp;amp;self,
        py: Python&amp;lt;'py&amp;gt;,
    ) -&amp;gt; PyResult&amp;lt;Bound&amp;lt;'py, PyCapsule&amp;gt;&amp;gt; {
        let name = CString::new("datafusion_table_provider").unwrap();

        let provider = self
            .create_table()
            .map_err(|e| PyRuntimeError::new_err(e.to_string()))?;
        let provider = FFI_TableProvider::new(Arc::new(provider), false);

        PyCapsule::new_bound(py, provider, Some(name.clone()))
    }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That's it! All of the work of converting the table provider to use the FFI interface
is performed by the core library.&lt;/p&gt;
&lt;h2 id="string-view-performance-enhancements"&gt;String View performance enhancements&lt;a class="headerlink" href="#string-view-performance-enhancements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the core &lt;a href="https://github.com/apache/datafusion/blob/main/dev/changelog/43.0.0.md"&gt;DataFusion 43.0.0&lt;/a&gt; release, the option to enable StringView by default
was turned on. This leads to some significant performance enhancements, but it &lt;em&gt;may&lt;/em&gt;
require some changes to users of &lt;code&gt;datafusion-python&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To learn more about the excellent work on this feature please read &lt;a href="https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/"&gt;part 1&lt;/a&gt; and &lt;a href="https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-2/"&gt;part 2&lt;/a&gt;
of the blog post describing how these enhancements can lead to 20-200% performance
gains in some tests.&lt;/p&gt;
&lt;p&gt;During our testing we identified some cases where we needed to adjust workflows to
account for the fact that StringView is now the default type for string based operations.
When performing manipulations on string objects, there is a performance cost to
casting between string and string view. To get the best performance,
ideally all of your string type data will use StringView. For most users this should be
transparent. However, if you specify a schema for reading or creating data, you will
likely need to change &lt;code&gt;pa.string()&lt;/code&gt; to &lt;code&gt;pa.string_view()&lt;/code&gt;. In our testing, this
primarily came up during data loading operations and in unit tests.&lt;/p&gt;
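&lt;p&gt;As a small sketch, assuming a recent &lt;code&gt;pyarrow&lt;/code&gt; that provides &lt;code&gt;pa.string_view()&lt;/code&gt;,
updating a schema definition looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import pyarrow as pa

# Before: pa.field("name", pa.string())
# After: use the string view type so no cast is needed
schema = pa.schema([pa.field("name", pa.string_view())])
print(schema.field("name").type)
&lt;/code&gt;&lt;/pre&gt;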
&lt;p&gt;If you wish to disable StringView as the default type to retain the old approach,
you can do so following this example:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from datafusion import SessionContext
from datafusion import SessionConfig
config = SessionConfig({"datafusion.execution.parquet.schema_force_view_types": "false"})
ctx = SessionContext(config=config)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="appreciation"&gt;Appreciation&lt;a class="headerlink" href="#appreciation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We would like to thank everyone who has helped with these releases through their helpful
conversations, code review, issue descriptions, and code authoring. We would especially
like to thank the following authors of PRs who made these releases possible, listed in
alphabetical order by username: &lt;a href="https://github.com/andygrove"&gt;@andygrove&lt;/a&gt;, &lt;a href="https://github.com/drauschenbach"&gt;@drauschenbach&lt;/a&gt;, &lt;a href="https://github.com/emgeee"&gt;@emgeee&lt;/a&gt;, &lt;a href="https://github.com/ion-elgreco"&gt;@ion-elgreco&lt;/a&gt;,
&lt;a href="https://github.com/jcrist"&gt;@jcrist&lt;/a&gt;, &lt;a href="https://github.com/kosiew"&gt;@kosiew&lt;/a&gt;, &lt;a href="https://github.com/mesejo"&gt;@mesejo&lt;/a&gt;, &lt;a href="https://github.com/Michael-J-Ward"&gt;@Michael-J-Ward&lt;/a&gt;, and &lt;a href="https://github.com/sir-sigurd"&gt;@sir-sigurd&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Thank you!&lt;/p&gt;
&lt;h2 id="get-involved"&gt;Get Involved&lt;a class="headerlink" href="#get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The DataFusion Python team is an active and engaging community and we would love
to have you join us and help the project.&lt;/p&gt;
&lt;p&gt;Here are some ways to get involved:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Learn more by visiting the &lt;a href="https://datafusion.apache.org/python/index.html"&gt;DataFusion Python project&lt;/a&gt;
page.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Try out the project and provide feedback, file issues, and contribute code.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="blog"/></entry><entry><title>Comparing approaches to User Defined Functions in Apache DataFusion using Python</title><link href="https://datafusion.apache.org/blog/2024/11/19/datafusion-python-udf-comparisons" rel="alternate"/><published>2024-11-19T00:00:00+00:00</published><updated>2024-11-19T00:00:00+00:00</updated><author><name>timsaucer</name></author><id>tag:datafusion.apache.org,2024-11-19:/blog/2024/11/19/datafusion-python-udf-comparisons</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h2 id="personal-context"&gt;Personal Context&lt;a class="headerlink" href="#personal-context" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For a few months now I’ve been working with &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt;, a
fast query engine written in Rust. In my experience, the language nearly all data scientists
work in is Python. In general, data scientists often use &lt;a href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt;
for in-memory tasks and &lt;a href="https://spark.apache.org/"&gt;PySpark&lt;/a&gt; for larger …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h2 id="personal-context"&gt;Personal Context&lt;a class="headerlink" href="#personal-context" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For a few months now I’ve been working with &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt;, a
fast query engine written in Rust. In my experience, the language nearly all data scientists
work in is Python. In general, data scientists often use &lt;a href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt;
for in-memory tasks and &lt;a href="https://spark.apache.org/"&gt;PySpark&lt;/a&gt; for larger tasks that require
distributed processing.&lt;/p&gt;
&lt;p&gt;In addition to DataFusion, there is another Rust based newcomer to the DataFrame world,
&lt;a href="https://pola.rs/"&gt;Polars&lt;/a&gt;. The latter is growing extremely fast, and it serves many of the same
use cases as DataFusion. For my use cases, I'm interested in DataFusion because I want to be able
to build small scale tests rapidly and then scale them up to larger distributed systems with ease.
I do recommend evaluating Polars for in-memory work.&lt;/p&gt;
&lt;p&gt;Personally, I would love a single query approach that is fast for both in-memory usage and can
extend to large batch processing to exploit parallelization. I think DataFusion, coupled with
&lt;a href="https://datafusion.apache.org/ballista/"&gt;Ballista&lt;/a&gt; or
&lt;a href="https://github.com/apache/datafusion-ray"&gt;DataFusion-Ray&lt;/a&gt;, may provide this solution.&lt;/p&gt;
&lt;p&gt;As I’m testing, I’m primarily limiting my work to the
&lt;a href="https://datafusion.apache.org/python/"&gt;datafusion-python&lt;/a&gt; project, a wrapper around the Rust
DataFusion library. This wrapper gives you the speed advantages of keeping all of the data in the
Rust implementation and the ergonomics of working in Python. Personally, I would prefer to work
purely in Rust, but I also recognize that since the industry works in Python we should meet the
people where they are.&lt;/p&gt;
&lt;h2 id="user-defined-functions"&gt;User-Defined Functions&lt;a class="headerlink" href="#user-defined-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The focus of this post is User-Defined Functions (UDFs). The DataFusion library already
provides many useful functions for DataFrame manipulation, similar to those you find in
other DataFrame libraries. You’ll be able to do simple arithmetic, create substrings of
columns, or find the average value across a group of rows. These cover most of the use
cases you’ll need in a DataFrame.&lt;/p&gt;
&lt;p&gt;However, there will always be times when you want a custom function. UDFs open up a
world of possibilities in your code. Sometimes there simply isn’t an easy way to achieve
your goals with the built-in functions.&lt;/p&gt;
&lt;p&gt;In the following, I’m going to demonstrate two example use cases, based on real world
problems I’ve encountered. I also want to demonstrate the approach of “make it work, make
it work well, make it work fast”, a motto I’ve seen thrown around in data science.&lt;/p&gt;
&lt;p&gt;I will demonstrate three approaches to writing UDFs. In order of increasing performance they are&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Writing a pure Python function to do your computation&lt;/li&gt;
&lt;li&gt;Using the PyArrow libraries in Python to accelerate your processing&lt;/li&gt;
&lt;li&gt;Writing a UDF in Rust and exposing it to Python&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, I will demonstrate two variants of the Rust approach. The first will be nearly identical to the
PyArrow library approach, to simplify understanding how to connect the Rust code to Python. In the
second version we will do the iteration through the input arrays ourselves, to give even greater
flexibility to the user.&lt;/p&gt;
&lt;p&gt;Here are the two example use cases, taken from my own work but generalized.&lt;/p&gt;
&lt;h3 id="use-case-1-scalar-function"&gt;Use Case 1: Scalar Function&lt;a class="headerlink" href="#use-case-1-scalar-function" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I have a DataFrame and a list of tuples that I’m interested in. I want to filter out the DataFrame
to only have values that match those tuples from certain columns in the DataFrame.&lt;/p&gt;
&lt;p&gt;To give a concrete example, we will use data generated for the &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H benchmarks&lt;/a&gt;.
Suppose I have a table of sales line items. There are many columns, but I am interested in three: a
part key (&lt;code&gt;p_partkey&lt;/code&gt;), supplier key (&lt;code&gt;p_suppkey&lt;/code&gt;), and return status (&lt;code&gt;p_returnflag&lt;/code&gt;). I want
only to return a DataFrame with a specific combination of these three values. That is, I want
to know if part number 1530 from supplier 4031 was sold (not returned), so I want a specific
combination of &lt;code&gt;p_partkey = 1530&lt;/code&gt;, &lt;code&gt;p_suppkey = 4031&lt;/code&gt;, and &lt;code&gt;p_returnflag = 'N'&lt;/code&gt;. I have a small
handful of these combinations I want to return.&lt;/p&gt;
&lt;p&gt;Probably the most ergonomic way to do this without a UDF is to turn that list of tuples into a
DataFrame itself, perform a join, and select the columns from the original DataFrame. If we were
working in PySpark we would probably broadcast join the DataFrame created from the tuple list since
it is tiny. In practice, I have found that with some DataFrame libraries performing a filter rather
than a join can be significantly faster. This is worth profiling for your specific use case.&lt;/p&gt;
&lt;h3 id="use-case-2-aggregate-function"&gt;Use Case 2: Aggregate Function&lt;a class="headerlink" href="#use-case-2-aggregate-function" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I have a DataFrame with many values that I want to aggregate. I have already analyzed it and
determined there is a noise level below which I do not want to include in my analysis. I want to
compute a sum of only values that are above my noise threshold.&lt;/p&gt;
&lt;p&gt;This can be done fairly easily without leaning on a User Defined Aggregate Function (UDAF). You can
simply filter the DataFrame and then aggregate using the built-in &lt;code&gt;sum&lt;/code&gt; function. Here, we
demonstrate doing this as a UDF primarily as an example of how to write UDAFs. We will use the
PyArrow compute approach.&lt;/p&gt;
&lt;h2 id="pure-python-approach"&gt;Pure Python approach&lt;a class="headerlink" href="#pure-python-approach" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The fastest way (in developer time, not compute time) for me to implement the scalar problem
solution was to do something along the lines of “for each row, check whether the values of
interest contain that row’s tuple”. I’ve published this as
&lt;a href="https://github.com/apache/datafusion-python/blob/main/examples/python-udf-comparisons.py"&gt;an example&lt;/a&gt;
in the &lt;a href="https://github.com/apache/datafusion-python"&gt;datafusion-python repository&lt;/a&gt;. Here is an
example of how this can be done:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;values_of_interest = [
    (1530, 4031, "N"),
    (6530, 1531, "N"),
    (5618, 619, "N"),
    (8118, 8119, "N"),
]

def is_of_interest_impl(
    partkey_arr: pa.Array,
    suppkey_arr: pa.Array,
    returnflag_arr: pa.Array,
) -&amp;gt; pa.Array:
    result = []
    for idx, partkey in enumerate(partkey_arr):
        partkey = partkey.as_py()
        suppkey = suppkey_arr[idx].as_py()
        returnflag = returnflag_arr[idx].as_py()
        value = (partkey, suppkey, returnflag)
        result.append(value in values_of_interest)

    return pa.array(result)

# Wrap our custom function with `datafusion.udf`, annotating expected 
# parameter and return types
is_of_interest = udf(
    is_of_interest_impl,
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable",
)

df_udf_filter = df_lineitem.filter(
    is_of_interest(col("l_partkey"), col("l_suppkey"), col("l_returnflag"))
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When working with a DataFusion UDF in Python, you define your function to take in some number of
expressions. During evaluation, these get computed into their corresponding values and
passed to your UDF as PyArrow Arrays. You must return an Array with the same number of
elements (rows). So the UDF example just iterates through all of the arrays and checks whether
the tuple created from each row’s columns matches any of those that we’re looking for.&lt;/p&gt;
&lt;p&gt;I’ll repeat this because it is something that tripped me up the first time I wrote a UDF for
DataFusion: &lt;strong&gt;DataFusion UDFs, even scalar UDFs, process an array of values at a time, not a single
row.&lt;/strong&gt; This is different from some other DataFrame libraries and may require a slight
change in mentality.&lt;/p&gt;
&lt;p&gt;Some important lines here are the ones like &lt;code&gt;partkey = partkey.as_py()&lt;/code&gt;. When we do this, we pay a
heavy cost. Instead of keeping the analysis in the Rust code, we have to take the values in the
array and convert them into Python objects. In this case we end up getting two numbers and a
string as real Python objects, complete with reference counting and all. We are also iterating
through the array in Python rather than in native Rust. These will &lt;strong&gt;significantly&lt;/strong&gt; slow down your
code. Any time you cross the barrier and convert values inside the Rust arrays into
Python objects or vice versa, you pay a &lt;strong&gt;heavy&lt;/strong&gt; cost for that transformation. You will want to
design your UDFs to avoid this as much as possible.&lt;/p&gt;
&lt;h2 id="python-approach-using-pyarrow-compute"&gt;Python approach using PyArrow compute&lt;a class="headerlink" href="#python-approach-using-pyarrow-compute" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion uses &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; as its in-memory data format. This can
be seen in the way that Arrow Arrays are passed into the UDFs. We can take advantage of the fact
that &lt;a href="https://arrow.apache.org/docs/python/"&gt;PyArrow&lt;/a&gt;, the canonical Python Arrow implementation,
provides a variety of
useful functions. In the example below, we use only a few of the boolean functions and the
equality function. Each of these functions takes two arrays and operates on them row by row. We
also shift the logic around a little, since we are now operating on an entire array of
values instead of checking a single row ourselves.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import pyarrow.compute as pc

def udf_using_pyarrow_compute_impl(
    partkey_arr: pa.Array,
    suppkey_arr: pa.Array,
    returnflag_arr: pa.Array,
) -&amp;gt; pa.Array:
    results = None
    for partkey, suppkey, returnflag in values_of_interest:
        filtered_partkey_arr = pc.equal(partkey_arr, partkey)
        filtered_suppkey_arr = pc.equal(suppkey_arr, suppkey)
        filtered_returnflag_arr = pc.equal(returnflag_arr, returnflag)

        resultant_arr = pc.and_(filtered_partkey_arr, filtered_suppkey_arr)
        resultant_arr = pc.and_(resultant_arr, filtered_returnflag_arr)

        if results is None:
            results = resultant_arr
        else:
            results = pc.or_(results, resultant_arr)

    return results


udf_using_pyarrow_compute = udf(
    udf_using_pyarrow_compute_impl,
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable",
)

df_udf_pyarrow_compute = df_lineitem.filter(
    udf_using_pyarrow_compute(col("l_partkey"), col("l_suppkey"), col("l_returnflag"))
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The idea in the code above is that we iterate through each of the values of interest, a list we
expect to be small. For each of the columns, we compare the value of interest to its corresponding
array using &lt;code&gt;pyarrow.compute.equal&lt;/code&gt;. This gives us three boolean arrays. We have a match to
the tuple when a row is true in all three arrays, so we combine them with &lt;code&gt;pyarrow.compute.and_&lt;/code&gt;. The
return value from the UDF needs to be true for rows that match any tuple in the values of interest
list, so we take the result from each loop iteration and combine it with the running result using
&lt;code&gt;pyarrow.compute.or_&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;From my benchmarking, switching from the approach of converting values into Python objects to this
approach of using the PyArrow built-in functions leads to about a 10x speed improvement on this
simple problem.&lt;/p&gt;
&lt;p&gt;It’s worth noting that almost all of the PyArrow compute functions expect to take one or two arrays
as their arguments. If you need to write a UDF that is evaluating three or more columns, you’ll
need to do something akin to what we’ve shown here.&lt;/p&gt;
&lt;h2 id="rust-udf-with-python-wrapper"&gt;Rust UDF with Python wrapper&lt;a class="headerlink" href="#rust-udf-with-python-wrapper" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This is the most complicated approach, but has the potential to be the most performant. What we
will do here is write a Rust function to perform our computation and then expose that function to
Python. I know of two use cases where I would recommend this approach. The first is the case when
the PyArrow compute functions are insufficient for your needs. Perhaps your code is too complex or
could be greatly simplified if you pulled in some outside dependency. The second use case is when
you have written a UDF that you’re sharing across multiple projects and have hardened the approach.
It is possible that you can implement your function in Rust to give a speed improvement and then
every project that is using this shared UDF will benefit from those updates.&lt;/p&gt;
&lt;p&gt;When deciding to use this approach, it’s worth considering how much you think you’ll actually
benefit from the Rust implementation to decide if it’s worth the additional effort to maintain and
deploy the Python wheels you generate. It is certainly not necessary for every use case.&lt;/p&gt;
&lt;p&gt;Due to the excellent work by the Arrow community, we can simplify our work down to only two
dependencies on the Rust side, &lt;a href="https://github.com/apache/arrow-rs"&gt;arrow-rs&lt;/a&gt; and
&lt;a href="https://pyo3.rs/"&gt;pyo3&lt;/a&gt;. I have posted a &lt;a href="https://github.com/timsaucer/tuple_filter_example"&gt;minimal example&lt;/a&gt;.
You’ll need &lt;a href="https://github.com/PyO3/maturin"&gt;maturin&lt;/a&gt; to build the project, and you must use
release mode when building to get the expected performance.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-bash"&gt;maturin develop --release
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you write your UDF in Rust, you will generally need to take these steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Write a function description that takes in some number of Python generic objects.&lt;/li&gt;
&lt;li&gt;Convert these objects to Arrow Arrays of the appropriate type(s).&lt;/li&gt;
&lt;li&gt;Perform your computation and create a resultant Array.&lt;/li&gt;
&lt;li&gt;Convert the array into a Python generic object.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For the conversion to and from Python objects, we can take advantage of the
&lt;code&gt;ArrayData::from_pyarrow_bound&lt;/code&gt; and &lt;code&gt;ArrayData::to_pyarrow&lt;/code&gt; functions.  All that remains is to
perform your computation.&lt;/p&gt;
&lt;p&gt;We are going to demonstrate doing this computation in two ways. The first is to mimic what we’ve
done in the above approach using PyArrow. In the second we demonstrate iterating through the three
arrays ourselves.&lt;/p&gt;
&lt;p&gt;In our first approach, we can expect the performance to be nearly identical to when we used the
PyArrow compute functions. On the Rust side we will have slightly less overhead but the heavy
lifting portions of the code are essentially the same between this Rust implementation and the
PyArrow approach above.&lt;/p&gt;
&lt;p&gt;The reason for demonstrating this approach, even though it doesn’t provide a significant speedup over
Python, is primarily to show how to make the transition from Python to Rust with a Python
wrapper. In the second implementation you can see how we can iterate through all of the arrays
ourselves.&lt;/p&gt;
&lt;p&gt;In this first example, we are hard coding the values of interest, but in the following section
we demonstrate passing these in during initialization.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;#[pyfunction]
pub fn tuple_filter_fn(
    py: Python&amp;lt;'_&amp;gt;,
    partkey_expr: &amp;amp;Bound&amp;lt;'_, PyAny&amp;gt;,
    suppkey_expr: &amp;amp;Bound&amp;lt;'_, PyAny&amp;gt;,
    returnflag_expr: &amp;amp;Bound&amp;lt;'_, PyAny&amp;gt;,
) -&amp;gt; PyResult&amp;lt;Py&amp;lt;PyAny&amp;gt;&amp;gt; {
    let partkey_arr: PrimitiveArray&amp;lt;Int64Type&amp;gt; =
        ArrayData::from_pyarrow_bound(partkey_expr)?.into();
    let suppkey_arr: PrimitiveArray&amp;lt;Int64Type&amp;gt; =
        ArrayData::from_pyarrow_bound(suppkey_expr)?.into();
    let returnflag_arr: StringArray = ArrayData::from_pyarrow_bound(returnflag_expr)?.into();

    let values_of_interest = vec![
        (1530, 4031, "N".to_string()),
        (6530, 1531, "N".to_string()),
        (5618, 619, "N".to_string()),
        (8118, 8119, "N".to_string()),
    ];

    let mut res: Option&amp;lt;BooleanArray&amp;gt; = None;

    for (partkey, suppkey, returnflag) in &amp;amp;values_of_interest {
        let filtered_partkey_arr = BooleanArray::from_unary(&amp;amp;partkey_arr, |p| p == *partkey);
        let filtered_suppkey_arr = BooleanArray::from_unary(&amp;amp;suppkey_arr, |s| s == *suppkey);
        let filtered_returnflag_arr =
            BooleanArray::from_unary(&amp;amp;returnflag_arr, |s| s == returnflag);

        let part_and_supp = compute::and(&amp;amp;filtered_partkey_arr, &amp;amp;filtered_suppkey_arr)
            .map_err(|e| PyValueError::new_err(e.to_string()))?;
        let resultant_arr = compute::and(&amp;amp;part_and_supp, &amp;amp;filtered_returnflag_arr)
            .map_err(|e| PyValueError::new_err(e.to_string()))?;

        res = match res {
            Some(r) =&amp;gt; compute::or(&amp;amp;r, &amp;amp;resultant_arr).ok(),
            None =&amp;gt; Some(resultant_arr),
        };
    }

    res.unwrap().into_data().to_pyarrow(py)
}


#[pymodule]
fn tuple_filter_example(module: &amp;amp;Bound&amp;lt;'_, PyModule&amp;gt;) -&amp;gt; PyResult&amp;lt;()&amp;gt; {
    module.add_function(wrap_pyfunction!(tuple_filter_fn, module)?)?;
    Ok(())
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To use this, we call the &lt;code&gt;udf&lt;/code&gt; function in &lt;code&gt;datafusion-python&lt;/code&gt; just as before.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from datafusion import udf
import pyarrow as pa
from tuple_filter_example import tuple_filter_fn

udf_using_custom_rust_fn = udf(
    tuple_filter_fn,
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable",
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That's it! We've now got a third party Rust UDF with Python wrappers working with DataFusion's
Python bindings!&lt;/p&gt;
&lt;h3 id="rust-udf-with-initialization"&gt;Rust UDF with initialization&lt;a class="headerlink" href="#rust-udf-with-initialization" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Looking at the code above, you can see that it hard codes the values we’re interested in. Many
types of UDFs don’t require any additional data provided to them before they start the
computation, but ours does, and hard coding it is sloppy, so let’s clean it up.&lt;/p&gt;
&lt;p&gt;We want to write the function to take some additional data. A limitation of the UDFs we create is
that they expect to operate on entire arrays of data at a time. We can get around this problem by
creating an initializer for our UDF. We do this by defining a Rust struct that contains the data we
need and implement two methods on this struct, &lt;code&gt;new&lt;/code&gt; and &lt;code&gt;__call__&lt;/code&gt;. By doing this we will create a
Python object that is callable, so it can be the function we provide to &lt;code&gt;udf&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;#[pyclass]
pub struct TupleFilterClass {
    values_of_interest: Vec&amp;lt;(i64, i64, String)&amp;gt;,
}

#[pymethods]
impl TupleFilterClass {
    #[new]
    fn new(values_of_interest: Vec&amp;lt;(i64, i64, String)&amp;gt;) -&amp;gt; Self {
        Self {
            values_of_interest,
        }
    }

    fn __call__(
        &amp;amp;self,
        py: Python&amp;lt;'_&amp;gt;,
        partkey_expr: &amp;amp;Bound&amp;lt;'_, PyAny&amp;gt;,
        suppkey_expr: &amp;amp;Bound&amp;lt;'_, PyAny&amp;gt;,
        returnflag_expr: &amp;amp;Bound&amp;lt;'_, PyAny&amp;gt;,
    ) -&amp;gt; PyResult&amp;lt;Py&amp;lt;PyAny&amp;gt;&amp;gt; {
        let partkey_arr: PrimitiveArray&amp;lt;Int64Type&amp;gt; =
            ArrayData::from_pyarrow_bound(partkey_expr)?.into();
        let suppkey_arr: PrimitiveArray&amp;lt;Int64Type&amp;gt; =
            ArrayData::from_pyarrow_bound(suppkey_expr)?.into();
        let returnflag_arr: StringArray = ArrayData::from_pyarrow_bound(returnflag_expr)?.into();

        let mut res: Option&amp;lt;BooleanArray&amp;gt; = None;

        for (partkey, suppkey, returnflag) in &amp;amp;self.values_of_interest {
            let filtered_partkey_arr = BooleanArray::from_unary(&amp;amp;partkey_arr, |p| p == *partkey);
            let filtered_suppkey_arr = BooleanArray::from_unary(&amp;amp;suppkey_arr, |s| s == *suppkey);
            let filtered_returnflag_arr =
                BooleanArray::from_unary(&amp;amp;returnflag_arr, |s| s == returnflag);

            let part_and_supp = compute::and(&amp;amp;filtered_partkey_arr, &amp;amp;filtered_suppkey_arr)
                .map_err(|e| PyValueError::new_err(e.to_string()))?;
            let resultant_arr = compute::and(&amp;amp;part_and_supp, &amp;amp;filtered_returnflag_arr)
                .map_err(|e| PyValueError::new_err(e.to_string()))?;

            res = match res {
                Some(r) =&amp;gt; compute::or(&amp;amp;r, &amp;amp;resultant_arr).ok(),
                None =&amp;gt; Some(resultant_arr),
            };
        }

        res.unwrap().into_data().to_pyarrow(py)
    }
}

#[pymodule]
fn tuple_filter_example(module: &amp;amp;Bound&amp;lt;'_, PyModule&amp;gt;) -&amp;gt; PyResult&amp;lt;()&amp;gt; {
    module.add_class::&amp;lt;TupleFilterClass&amp;gt;()?;
    Ok(())
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you write this, you don’t have to name your constructor &lt;code&gt;new&lt;/code&gt;. The more important part is that
the function is annotated with &lt;code&gt;#[new]&lt;/code&gt;. With this you can provide any kind of data you need
during processing. Using this initializer from Python is fairly straightforward.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from datafusion import udf
import pyarrow as pa
from tuple_filter_example import TupleFilterClass

tuple_filter_class = TupleFilterClass(values_of_interest)

udf_using_custom_rust_fn_with_data = udf(
    tuple_filter_class,
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable",
    name="tuple_filter_with_data"
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you use this approach you will need to provide a &lt;code&gt;name&lt;/code&gt; argument to &lt;code&gt;udf&lt;/code&gt;. This is because our
class instance does not have the &lt;code&gt;__qualname__&lt;/code&gt; attribute that the &lt;code&gt;udf&lt;/code&gt; function looks for. You
can give this UDF any name you choose.&lt;/p&gt;
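&lt;p&gt;You can see the reason in plain Python: functions carry a &lt;code&gt;__qualname__&lt;/code&gt;, but instances of a callable class do not (a small sketch, independent of DataFusion):&lt;/p&gt;

```python
def plain_function():
    pass

class CallableClass:
    def __call__(self):
        pass

instance = CallableClass()

# Functions have __qualname__; class *instances* do not, which is why
# a callable object passed to udf() needs an explicit name argument.
has_fn_name = hasattr(plain_function, "__qualname__")
has_instance_name = hasattr(instance, "__qualname__")
```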
&lt;h3 id="rust-udf-with-direct-iteration"&gt;Rust UDF with direct iteration&lt;a class="headerlink" href="#rust-udf-with-direct-iteration" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The final version of our scalar UDF is one where we implement it in Rust and iterate through all of
the arrays ourselves. If you are iterating through more than three arrays at a time, I recommend looking
at &lt;a href="https://docs.rs/itertools/latest/itertools/macro.izip.html"&gt;izip&lt;/a&gt; in the
&lt;a href="https://crates.io/crates/itertools"&gt;itertools crate&lt;/a&gt;. For ease of understanding, and since we only
have three arrays, I will just explicitly create my own tuples.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;#[pyclass]
pub struct TupleFilterDirectIterationClass {
    values_of_interest: Vec&amp;lt;(i64, i64, String)&amp;gt;,
}

#[pymethods]
impl TupleFilterDirectIterationClass {
    #[new]
    fn new(values_of_interest: Vec&amp;lt;(i64, i64, String)&amp;gt;) -&amp;gt; Self {
        Self { values_of_interest }
    }

    fn __call__(
        &amp;amp;self,
        py: Python&amp;lt;'_&amp;gt;,
        partkey_expr: &amp;amp;Bound&amp;lt;'_, PyAny&amp;gt;,
        suppkey_expr: &amp;amp;Bound&amp;lt;'_, PyAny&amp;gt;,
        returnflag_expr: &amp;amp;Bound&amp;lt;'_, PyAny&amp;gt;,
    ) -&amp;gt; PyResult&amp;lt;Py&amp;lt;PyAny&amp;gt;&amp;gt; {
        let partkey_arr: PrimitiveArray&amp;lt;Int64Type&amp;gt; =
            ArrayData::from_pyarrow_bound(partkey_expr)?.into();
        let suppkey_arr: PrimitiveArray&amp;lt;Int64Type&amp;gt; =
            ArrayData::from_pyarrow_bound(suppkey_expr)?.into();
        let returnflag_arr: StringArray = ArrayData::from_pyarrow_bound(returnflag_expr)?.into();

        let values_to_search: Vec&amp;lt;(&amp;amp;i64, &amp;amp;i64, &amp;amp;str)&amp;gt; = (&amp;amp;self.values_of_interest)
            .iter()
            .map(|(a, b, c)| (a, b, c.as_str()))
            .collect();

        let values = partkey_arr
            .values()
            .iter()
            .zip(suppkey_arr.values().iter())
            .zip(returnflag_arr.iter())
            .map(|((a, b), c)| (a, b, c.unwrap_or_default()))
            .map(|v| values_to_search.contains(&amp;amp;v));

        let res: BooleanArray = BooleanBuffer::from_iter(values).into();

        res.into_data().to_pyarrow(py)
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We convert the &lt;code&gt;values_of_interest&lt;/code&gt; into a vector of borrowed types so that we can do a fast search
without allocating additional memory. The other option is to turn the &lt;code&gt;returnflag&lt;/code&gt; into a &lt;code&gt;String&lt;/code&gt;,
but that memory allocation is unnecessary. After that we use two &lt;code&gt;zip&lt;/code&gt; operations so that we can
iterate over all three columns in a single pass. Since each &lt;code&gt;zip&lt;/code&gt; returns a tuple of two
elements, a quick &lt;code&gt;map&lt;/code&gt; turns them into the tuple format we need. Also, &lt;code&gt;StringArray&lt;/code&gt; uses a slightly
different buffer layout, so it is iterated slightly differently from the others.&lt;/p&gt;
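&lt;p&gt;The nested zip-then-map shape of the Rust code has a direct Python analogue (toy data; note how the doubly-nested tuple produced by two zips is flattened):&lt;/p&gt;

```python
partkey = [1530, 6530]
suppkey = [4031, 1531]
returnflag = ["N", "A"]

# Two zips produce ((a, b), c); the comprehension flattens it to (a, b, c),
# mirroring the map(|((a, b), c)| ...) step in the Rust version.
rows = [(a, b, c) for (a, b), c in zip(zip(partkey, suppkey), returnflag)]
```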
&lt;h2 id="user-defined-aggregate-function"&gt;User Defined Aggregate Function&lt;a class="headerlink" href="#user-defined-aggregate-function" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Writing a user defined aggregate function or user defined window function is slightly more complex
than writing a scalar function. This is because we must accumulate values across batches, and there is
no guarantee that one batch will contain all the values we are aggregating over. For this we need to
define an &lt;code&gt;Accumulator&lt;/code&gt;, which will do a few things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Process a batch and compute an internal state&lt;/li&gt;
&lt;li&gt;Share the state so that we can combine multiple batches&lt;/li&gt;
&lt;li&gt;Merge the results across multiple batches&lt;/li&gt;
&lt;li&gt;Return the final result&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the example below, we're going to look at customer orders and we want to know per customer ID,
how much they have ordered total. We want to ignore small orders, which we define as anything under
5000.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from datafusion import Accumulator, udaf
import pyarrow as pa
import pyarrow.compute as pc

IGNORE_THRESHOLD = 5000.0
class AboveThresholdAccum(Accumulator):
    def __init__(self) -&amp;gt; None:
        self._sum = 0.0

    def update(self, values: pa.Array) -&amp;gt; None:
        over_threshold = pc.greater(values, pa.scalar(IGNORE_THRESHOLD))
        sum_above = pc.sum(values.filter(over_threshold)).as_py()
        if sum_above is None:
            sum_above = 0.0
        self._sum = self._sum + sum_above

    def merge(self, states: List[pa.Array]) -&amp;gt; None:
        self._sum = self._sum + pc.sum(states[0]).as_py()

    def state(self) -&amp;gt; List[pa.Scalar]:
        return [pa.scalar(self._sum)]

    def evaluate(self) -&amp;gt; pa.Scalar:
        return pa.scalar(self._sum)

sum_above_threshold = udaf(AboveThresholdAccum, [pa.float64()], pa.float64(), [pa.float64()], 'stable')

df_orders.aggregate([col("o_custkey")], [sum_above_threshold(col("o_totalprice")).alias("sales")]).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since we are doing a &lt;code&gt;sum&lt;/code&gt;, we can keep a single value as our internal state. Each call to
&lt;code&gt;update()&lt;/code&gt; processes a single array and updates the internal state, which we expose through the
&lt;code&gt;state()&lt;/code&gt; function. When data is processed in multiple batches or partitions, &lt;code&gt;merge()&lt;/code&gt; is
called to combine their states. It is important to note that the &lt;code&gt;states&lt;/code&gt; argument to the
&lt;code&gt;merge()&lt;/code&gt; function is a list of arrays of the values returned from &lt;code&gt;state()&lt;/code&gt;. The
&lt;code&gt;merge&lt;/code&gt; function can differ significantly from &lt;code&gt;update&lt;/code&gt;, though in
our example they are very similar.&lt;/p&gt;
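&lt;p&gt;To make this lifecycle concrete, the following sketch simulates how an accumulator is driven across two partitions. It is illustrative only: pyarrow arrays are replaced with plain Python lists, and the class name is hypothetical.&lt;/p&gt;

```python
IGNORE_THRESHOLD = 5000.0

# Hypothetical plain-Python stand-in for the Accumulator above; pyarrow
# arrays are replaced by lists so the sketch runs standalone.
class AboveThresholdSketch:
    def __init__(self) -> None:
        self._sum = 0.0

    def update(self, values):
        # Process one batch: sum only the values over the threshold.
        self._sum += sum(v for v in values if v > IGNORE_THRESHOLD)

    def state(self):
        # Expose the internal state so other accumulators can merge it.
        return [self._sum]

    def merge(self, states):
        # Each entry in `states` has the shape returned by state().
        for st in states:
            self._sum += st[0]

    def evaluate(self):
        return self._sum

# Two partitions each see a different batch of order totals, then one
# accumulator absorbs the other's state before the final evaluation.
p1, p2 = AboveThresholdSketch(), AboveThresholdSketch()
p1.update([100.0, 6000.0])
p2.update([7500.0, 4999.0])
p1.merge([p2.state()])
assert p1.evaluate() == 13500.0
```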
&lt;p&gt;One example of implementing a user defined aggregate function where the &lt;code&gt;update()&lt;/code&gt; and &lt;code&gt;merge()&lt;/code&gt;
operations maintain a richer state is computing an average. In &lt;code&gt;update()&lt;/code&gt; we would accumulate both
a running sum and a count. &lt;code&gt;state()&lt;/code&gt; would return a list of these two values, &lt;code&gt;merge()&lt;/code&gt; would
add the sums and counts from the other states, and &lt;code&gt;evaluate()&lt;/code&gt; would compute the final result by
dividing the sum by the count.&lt;/p&gt;
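&lt;p&gt;A sketch of that average accumulator's state logic, again using plain Python values instead of pyarrow arrays (the class and its names are illustrative, not DataFusion's API):&lt;/p&gt;

```python
# Hypothetical sketch of an average accumulator. The state carries two
# values (sum and count); merging adds them component-wise, and only
# evaluate() performs the division.
class AvgAccumSketch:
    def __init__(self) -> None:
        self._sum = 0.0
        self._count = 0

    def update(self, values):
        self._sum += sum(values)
        self._count += len(values)

    def state(self):
        # Both pieces of state must be shared, not a partial average.
        return [self._sum, self._count]

    def merge(self, states):
        # Averaging partial averages would be wrong; add sums and counts.
        for s, c in states:
            self._sum += s
            self._count += c

    def evaluate(self):
        return self._sum / self._count if self._count else None
```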
&lt;h2 id="user-defined-window-functions"&gt;User Defined Window Functions&lt;a class="headerlink" href="#user-defined-window-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Writing a user defined window function is slightly more complex than an aggregate function due
to the variety of ways that window functions are called. I recommend reviewing the
&lt;a href="https://datafusion.apache.org/python/user-guide/common-operations/udf-and-udfa.html"&gt;online documentation&lt;/a&gt;
for a description of which functions need to be implemented. The details of how to implement
these generally follow the same patterns as described above for aggregate functions.&lt;/p&gt;
&lt;h2 id="performance-comparison"&gt;Performance Comparison&lt;a class="headerlink" href="#performance-comparison" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For the scalar functions above, we performed a timing evaluation, repeating the operation 100
times. For this simple example these are our results.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;+-----------------------------+--------------+---------+
| approach                    | Average Time | Std Dev |
+-----------------------------+--------------+---------+
| python udf                  | 4.969        | 0.062   |
| simple filter               | 1.075        | 0.022   |
| explicit filter             | 0.685        | 0.063   |
| pyarrow compute             | 0.529        | 0.017   |
| arrow rust compute          | 0.511        | 0.034   |
| arrow rust compute as class | 0.502        | 0.011   |
| rust custom iterator        | 0.478        | 0.009   |
+-----------------------------+--------------+---------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, converting to Python objects performs by far the worst. As soon as we use
functions that keep the data entirely on the native (Rust or C/C++) side, we see a
nearly 10x speed improvement. As we increase complexity from using PyArrow compute functions
to implementing the UDF in Rust, we see incremental improvements. Our fastest approach, iterating
through the arrays ourselves, runs nearly 10% faster than the PyArrow compute approach.&lt;/p&gt;
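&lt;p&gt;The kind of harness behind such a comparison can be built with the standard library alone; the &lt;code&gt;benchmark&lt;/code&gt; helper below is a hypothetical sketch, not the code used to produce the table above.&lt;/p&gt;

```python
import statistics
import timeit

def benchmark(fn, repeats=100):
    # Time `repeats` individual runs of fn and report mean and std dev.
    times = [timeit.timeit(fn, number=1) for _ in range(repeats)]
    return statistics.mean(times), statistics.stdev(times)

# Example: a trivial workload standing in for one query execution.
mean_t, std_t = benchmark(lambda: sum(range(1000)), repeats=10)
```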
&lt;h2 id="final-thoughts-and-recommendations"&gt;Final Thoughts and Recommendations&lt;a class="headerlink" href="#final-thoughts-and-recommendations" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For anyone who is curious about &lt;a href="https://datafusion.apache.org/"&gt;DataFusion&lt;/a&gt; I highly recommend
giving it a try. This post was designed to make it easier for users new to the Python implementation
to work with User Defined Functions by giving a few examples of how one might implement them.&lt;/p&gt;
&lt;p&gt;When it comes to designing UDFs, I strongly recommend seeing if you can write your UDF using
&lt;a href="https://arrow.apache.org/docs/python/api/compute.html"&gt;PyArrow functions&lt;/a&gt; rather than pure Python
objects. As shown in the scalar example above, you can achieve a 10x speedup by using PyArrow
functions. If you must do something that isn't well represented by the PyArrow compute functions,
then I would consider using a Rust based UDF in the manner shown above.&lt;/p&gt;
&lt;p&gt;I would like to thank &lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;, &lt;a href="https://github.com/andygrove"&gt;@andygrove&lt;/a&gt;, &lt;a href="https://github.com/comphead"&gt;@comphead&lt;/a&gt;, &lt;a href="https://github.com/emgeee"&gt;@emgeee&lt;/a&gt;, &lt;a href="https://github.com/kylebarron"&gt;@kylebarron&lt;/a&gt;, and &lt;a href="https://github.com/Omega359"&gt;@Omega359&lt;/a&gt;
for their helpful reviews and feedback.&lt;/p&gt;
&lt;p&gt;Lastly, the Apache Arrow and DataFusion community is an active group of very helpful people working
to make a great tool. If you want to get involved, please take a look at the
&lt;a href="https://datafusion.apache.org/python/"&gt;online documentation&lt;/a&gt; and jump in to help with one of the
&lt;a href="https://github.com/apache/datafusion-python/issues"&gt;open issues&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Python 40.1.0 Released, Significant usability updates</title><link href="https://datafusion.apache.org/blog/2024/08/20/python-datafusion-40.0.0" rel="alternate"/><published>2024-08-20T00:00:00+00:00</published><updated>2024-08-20T00:00:00+00:00</updated><author><name>timsaucer</name></author><id>tag:datafusion.apache.org,2024-08-20:/blog/2024/08/20/python-datafusion-40.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are happy to announce that &lt;a href="https://pypi.org/project/datafusion/40.1.0/"&gt;DataFusion in Python 40.1.0&lt;/a&gt; has been released. In addition to
bringing in all of the new features of the core &lt;a href="https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/"&gt;DataFusion 40.0.0&lt;/a&gt; package, this release
contains &lt;em&gt;significant&lt;/em&gt; updates to the user interface and documentation. We listened to the python …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are happy to announce that &lt;a href="https://pypi.org/project/datafusion/40.1.0/"&gt;DataFusion in Python 40.1.0&lt;/a&gt; has been released. In addition to
bringing in all of the new features of the core &lt;a href="https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/"&gt;DataFusion 40.0.0&lt;/a&gt; package, this release
contains &lt;em&gt;significant&lt;/em&gt; updates to the user interface and documentation. We listened to the python
user community to create a more &lt;em&gt;pythonic&lt;/em&gt; experience. If you have not used the python interface to
DataFusion before, this is an excellent time to give it a try!&lt;/p&gt;
&lt;h2 id="background"&gt;Background&lt;a class="headerlink" href="#background" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Until now, the Python bindings for DataFusion have primarily been a thin layer to expose the
underlying Rust functionality. This has worked well for early adopters using DataFusion
within their Python projects, but some users have found it difficult to work with. Compared to
other DataFrame libraries, the following issues were raised:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Most of the functions had little or no documentation. Users often had to refer to the Rust
documentation or code to learn how to use DataFusion. This alienated some Python users.&lt;/li&gt;
&lt;li&gt;Users could not take advantage of modern IDE features such as type hinting. These are valuable
tools for rapid testing and development.&lt;/li&gt;
&lt;li&gt;Some of the interfaces felt “clunky” to users since some Python concepts do not always map well
to their Rust counterparts.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This release aims to bring a better user experience to the DataFusion Python community.&lt;/p&gt;
&lt;h2 id="whats-changed"&gt;What's Changed&lt;a class="headerlink" href="#whats-changed" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The most significant difference is that we have added wrapper functions and classes for most of the
user facing interface. These wrappers, written in Python, contain both documentation and type
annotations.&lt;/p&gt;
&lt;p&gt;This documentation is now available on the &lt;a href="https://datafusion.apache.org/python/autoapi/datafusion/index.html"&gt;DataFusion in Python API&lt;/a&gt; website. There you can browse
the available functions and classes to see the breadth of available functionality.&lt;/p&gt;
&lt;p&gt;Modern IDEs use language servers such as
&lt;a href="https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance"&gt;Pylance&lt;/a&gt; or
&lt;a href="https://jedi.readthedocs.io/en/latest/"&gt;Jedi&lt;/a&gt; to perform analysis of python code, provide useful
hints, and identify usage errors. These are major tools in the python user community. With this
release, users can fully use these tools in their workflow.&lt;/p&gt;
&lt;figure class="text-center"&gt;
&lt;img alt="Fig 1: Enhanced tooltips in an IDE." class="img-fluid" src="/blog/images/python-datafusion-40.0.0/vscode_hover_tooltip.png"/&gt;
&lt;figcaption&gt;
&lt;b&gt;Figure 1&lt;/b&gt;: With the enhanced python wrappers, users can see helpful tool tips with
   type annotations directly in modern IDEs.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;By having the type annotations, these IDEs can also identify quickly when a user has incorrectly
used a function's arguments as shown in Figure 2.&lt;/p&gt;
&lt;figure class="text-center"&gt;
&lt;img alt="Fig 2: Error checking in static analysis" class="img-fluid" src="/blog/images/python-datafusion-40.0.0/pylance_error_checking.png"/&gt;
&lt;figcaption&gt;
&lt;b&gt;Figure 2&lt;/b&gt;: Modern Python language servers can perform static analysis and quickly find
   errors in the arguments to functions.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;In addition to these wrapper libraries, we have enhanced some of the functions to make them
easier to use.&lt;/p&gt;
&lt;h3 id="improved-dataframe-filter-arguments"&gt;Improved DataFrame filter arguments&lt;a class="headerlink" href="#improved-dataframe-filter-arguments" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;You can now apply multiple &lt;code&gt;filter&lt;/code&gt; statements in a single step. When using &lt;code&gt;DataFrame.filter&lt;/code&gt; you
can pass in multiple arguments, separated by a comma. These will act as a logical &lt;code&gt;AND&lt;/code&gt; of all of
the filter arguments. The following two statements are equivalent:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;df.filter(col("size") &amp;lt; col("max_size")).filter(col("color") == lit("green"))
df.filter(col("size") &amp;lt; col("max_size"), col("color") == lit("green"))
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="comparison-against-literal-values"&gt;Comparison against literal values&lt;a class="headerlink" href="#comparison-against-literal-values" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It is very common to write DataFrame operations that compare an expression to some fixed value.
For example, filtering a DataFrame might have an operation such as &lt;code&gt;df.filter(col("size") &amp;lt; lit(16))&lt;/code&gt;.
To make these common operations more ergonomic, you can now simply use &lt;code&gt;df.filter(col("size") &amp;lt; 16)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For the right hand side of the comparison operator, you can now use any Python value that can be
coerced into a &lt;code&gt;Literal&lt;/code&gt;. This gives an easy-to-read expression. For example, consider these few
lines from one of the
&lt;a href="https://github.com/apache/datafusion-python/tree/main/examples/tpch"&gt;TPC-H examples&lt;/a&gt; provided in
the DataFusion Python repository.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;df = (
    df_lineitem.filter(col("l_shipdate") &amp;gt;= lit(date))
    .filter(col("l_discount") &amp;gt;= lit(DISCOUNT) - lit(DELTA))
    .filter(col("l_discount") &amp;lt;= lit(DISCOUNT) + lit(DELTA))
    .filter(col("l_quantity") &amp;lt; lit(QUANTITY))
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above code closely mirrors how these filters would need to be applied in Rust. With this new
release, the user can simplify these lines. As shown in the example below, &lt;code&gt;filter()&lt;/code&gt;
now accepts a variable number of arguments and combines them with a logical &lt;code&gt;AND&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;df = df_lineitem.filter(
    col("l_shipdate") &amp;gt;= date,
    col("l_discount") &amp;gt;= DISCOUNT - DELTA,
    col("l_discount") &amp;lt;= DISCOUNT + DELTA,
    col("l_quantity") &amp;lt; QUANTITY,
)
&lt;/code&gt;&lt;/pre&gt;
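&lt;p&gt;Under the hood, this kind of coercion can be implemented in the comparison operators themselves: if the right hand side is not already an expression, it is wrapped as a literal. The toy classes below illustrate the pattern only and are not DataFusion's implementation.&lt;/p&gt;

```python
# Toy expression type: comparison operators coerce plain Python values
# into literal expressions automatically. Names are illustrative only.
class Expr:
    def __init__(self, desc):
        self.desc = desc  # a string rendering of the expression tree

    @staticmethod
    def _coerce(value):
        # Anything that is not already an Expr becomes a literal.
        return value if isinstance(value, Expr) else Expr(f"lit({value!r})")

    def __ge__(self, other):
        return Expr(f"({self.desc} >= {Expr._coerce(other).desc})")

def col(name):
    return Expr(f"col({name!r})")

# Both forms build the same expression:
e1 = col("l_quantity") >= Expr("lit(24)")
e2 = col("l_quantity") >= 24
```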
&lt;h3 id="select-columns-by-name"&gt;Select columns by name&lt;a class="headerlink" href="#select-columns-by-name" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It is very common for users to perform a &lt;code&gt;DataFrame&lt;/code&gt; selection where they simply want a column. For
this we have had the function &lt;code&gt;select_columns("a", "b")&lt;/code&gt;, or the user could call
&lt;code&gt;select(col("a"), col("b"))&lt;/code&gt;. In the new release, &lt;code&gt;select()&lt;/code&gt; accepts either full expressions
or strings of the column names, and you can mix the two.&lt;/p&gt;
&lt;p&gt;Where before you may have had to write an operation like&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;df_subset = df.select(col("a"), col("b"), f.abs(col("c")))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can now simplify this to&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;df_subset = df.select("a", "b", f.abs(col("c")))
&lt;/code&gt;&lt;/pre&gt;
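&lt;p&gt;One way an interface like this can work is to normalize every argument before building the plan: bare strings become column references and anything else is assumed to already be an expression. The tuple representation below is a stand-in for real expression objects, and the function name is hypothetical.&lt;/p&gt;

```python
# Hypothetical normalization step for a select() that accepts both
# column-name strings and expression objects.
def normalize_select_args(*exprs):
    def to_expr(e):
        # A plain string is shorthand for a column reference.
        return ("col", e) if isinstance(e, str) else e
    return [to_expr(e) for e in exprs]

# Mixing strings and (stand-in) expressions:
normalized = normalize_select_args("a", "b", ("abs", ("col", "c")))
```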
&lt;h3 id="creating-named-structs"&gt;Creating named structs&lt;a class="headerlink" href="#creating-named-structs" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Creating a &lt;code&gt;struct&lt;/code&gt; with named fields was previously difficult to use and allowed for potential
user errors when specifying the name of each field. Now we have a cleaner interface where the
user passes a list of tuples containing the name of the field and the expression to create.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;df.select(f.named_struct([
  ("a", col("a")),
  ("b", col("b"))
]))
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="next-steps"&gt;Next Steps&lt;a class="headerlink" href="#next-steps" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;While most of the user facing classes and functions have been exposed, a few remain,
namely the classes in &lt;code&gt;datafusion.object_store&lt;/code&gt; and the logical plans used by
&lt;code&gt;datafusion.substrait&lt;/code&gt;. The team is working on
&lt;a href="https://github.com/apache/datafusion-python/issues/767"&gt;these issues&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, the next release of DataFusion contains improvements to user-defined
aggregate and window functions that make them easier to use. We plan on
&lt;a href="https://github.com/apache/datafusion-python/issues/780"&gt;bringing these enhancements&lt;/a&gt; to this project.&lt;/p&gt;
&lt;h2 id="thank-you"&gt;Thank You&lt;a class="headerlink" href="#thank-you" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We would like to thank the following members for their very helpful discussions regarding these
updates: &lt;a href="https://github.com/andygrove"&gt;@andygrove&lt;/a&gt;, &lt;a href="https://github.com/max-muoto"&gt;@max-muoto&lt;/a&gt;, &lt;a href="https://github.com/slyons"&gt;@slyons&lt;/a&gt;, &lt;a href="https://github.com/Throne3d"&gt;@Throne3d&lt;/a&gt;, &lt;a href="https://github.com/Michael-J-Ward"&gt;@Michael-J-Ward&lt;/a&gt;, &lt;a href="https://github.com/datapythonista"&gt;@datapythonista&lt;/a&gt;,
&lt;a href="https://github.com/austin362667"&gt;@austin362667&lt;/a&gt;, &lt;a href="https://github.com/kylebarron"&gt;@kylebarron&lt;/a&gt;, &lt;a href="https://github.com/simicd"&gt;@simicd&lt;/a&gt;. The &lt;a href="https://github.com/apache/datafusion-python/pull/750"&gt;primary PR (#750)&lt;/a&gt; that includes these updates
had an extensive conversation, leading to a significantly improved end product. Again, thank you
to all who provided input!&lt;/p&gt;
&lt;p&gt;We would like to give a special thank you to &lt;a href="https://github.com/3ok"&gt;@3ok&lt;/a&gt; who created the initial version of the wrapper
definitions. The work they did was time consuming and required exceptional attention to detail. It
provided enormous value to starting this project. Thank you!&lt;/p&gt;
&lt;h2 id="get-involved"&gt;Get Involved&lt;a class="headerlink" href="#get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The DataFusion Python team is an active and engaging community and we would love
to have you join us and help the project.&lt;/p&gt;
&lt;p&gt;Here are some ways to get involved:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Learn more by visiting the &lt;a href="https://datafusion.apache.org/python/index.html"&gt;DataFusion Python project&lt;/a&gt;
page.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Try out the project and provide feedback, file issues, and contribute code.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="blog"/></entry></feed>