Introduction

We welcome and encourage contributions of all kinds, such as:

  1. Tickets with issue reports or feature requests

  2. Documentation improvements

  3. Code, both PRs and (especially) PR reviews.

In addition to submitting new PRs, we have a healthy tradition of community members reviewing each other’s PRs. Doing so is a great way to help the community as well as get more familiar with Rust and the relevant codebases.

How to develop

This assumes that you have Rust and Cargo installed. We use the workflow recommended by pyo3 and maturin. We recommend using uv for Python package management.

By default, uv will attempt to build the datafusion Python package. For development we prefer to build manually. This means that when creating your virtual environment with uv sync you need to pass the additional option --no-install-package datafusion, and for uv run commands the additional option --no-project.

Bootstrap:

# fetch this repo
git clone git@github.com:apache/datafusion-python.git
# create the virtual environment
uv sync --dev --no-install-package datafusion
# activate the environment
source .venv/bin/activate

The tests rely on test data in git submodules.

git submodule init
git submodule update

Whenever Rust code changes (through your own changes or a git pull), rebuild and rerun the tests:

# make sure you activate the venv using "source .venv/bin/activate" first
maturin develop --uv
python -m pytest

Running & Installing pre-commit hooks

arrow-datafusion-python takes advantage of pre-commit to assist developers with code linting to help reduce the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional for the developer but certainly helpful for keeping PRs clean and concise.

Our pre-commit hooks can be installed by running pre-commit install, which installs the configurations in your ARROW_DATAFUSION_PYTHON_ROOT/.github directory. The hooks then run each time you perform a commit. If an offending lint is found, the commit fails, allowing you to make changes locally before pushing.

The pre-commit hooks can also be run ad hoc, without installing them, by running pre-commit run --all-files.

Guidelines for Separating Python and Rust Code

Version 40 of datafusion-python introduced Python wrappers around the pyo3-generated code to vastly improve the user experience. (See the blog post and pull request for more details.)

For the most part, the Python code is limited to pure wrappers with type hints and good docstrings, but there are a few cases where it does more:

  1. Trivial aliases like array_append() and list_append().

  2. Simple type conversion, like from a path to a string of the path or from number to lit(number).

  3. The additional code makes an API much more pythonic, like we do for named_struct() (see source code).

Update Dependencies

To change test dependencies, edit pyproject.toml first. To apply the change, or to update dependencies in general, run:

uv sync --dev --no-install-package datafusion

Improving Build Speed

The pyo3 dependency of this project contains a build.rs file which can cause it to rebuild frequently. You can prevent this from happening by defining a PYO3_CONFIG_FILE environment variable that points to a file with your build configuration. Whenever your build configuration changes, such as during some major version updates, you will need to regenerate this file. This variable should point to a fully resolved path on your build machine.

To generate this file, use the following command:

PYO3_PRINT_CONFIG=1 cargo build

This will generate some output that looks like the following. You will want to copy these contents into a file. If you place this file in your project directory with filename .pyo3_build_config it will be ignored by git.

implementation=CPython
version=3.8
shared=true
abi3=true
lib_name=python3.12
lib_dir=/opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12/lib
executable=/Users/myusername/src/datafusion-python/.venv/bin/python
pointer_width=64
build_flags=
suppress_build_script_link_lines=false
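Copying the lines by hand works fine. As an alternative, the following is a minimal sketch of a helper that pulls the key=value lines out of captured build output. The function name extract_pyo3_config is hypothetical, and the sketch assumes the configuration lines appear one per line (possibly indented) among other cargo messages, so review the result before using it.

```python
from __future__ import annotations

import re

# Matches configuration lines such as "implementation=CPython" or "build_flags=".
# Lines like "error: ..." or "cargo:rerun-if-..." do not match because the key
# must consist only of lowercase letters, digits, and underscores before "=".
_CONFIG_LINE = re.compile(r"^[a-z0-9_]+=.*$")


def extract_pyo3_config(build_output: str) -> str:
    """Return only the key=value lines from captured build output."""
    kept = []
    for raw_line in build_output.splitlines():
        line = raw_line.strip()  # drop any indentation added around the output
        if _CONFIG_LINE.match(line):
            kept.append(line)
    return "\n".join(kept) + "\n"
```

The returned text can be written to .pyo3_build_config in the project directory, for example with pathlib.Path(".pyo3_build_config").write_text(...).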

Add the environment variable to your system.

export PYO3_CONFIG_FILE="/Users/myusername/src/datafusion-python/.pyo3_build_config"
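Because the variable should hold a fully resolved path, a quick sanity check can catch a relative path or a typo before a long rebuild. The sketch below is one way to do that. The function name check_pyo3_config_file is hypothetical and not part of this project or of pyo3.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Mapping


def check_pyo3_config_file(env: Mapping[str, str] | None = None) -> list[str]:
    """Return a list of problems with PYO3_CONFIG_FILE; an empty list means none found."""
    env = os.environ if env is None else env
    value = env.get("PYO3_CONFIG_FILE")
    if not value:
        return ["PYO3_CONFIG_FILE is not set"]

    problems = []
    path = Path(value)
    if not path.is_absolute():
        problems.append(f"{value} is not an absolute path")
    if not path.is_file():
        problems.append(f"{value} does not point to an existing file")
    return problems
```

Calling check_pyo3_config_file() with no arguments inspects the current environment. Passing a dictionary makes it easy to try out a candidate value first.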

If you are on a Mac and you use VS Code for your IDE, you will want to add these variables to your settings. You can find the appropriate rust flags by looking in the .cargo/config.toml file.

"rust-analyzer.cargo.extraEnv": {
    "RUSTFLAGS": "-C link-arg=-undefined -C link-arg=dynamic_lookup",
    "PYO3_CONFIG_FILE": "/Users/myusername/src/datafusion-python/.pyo3_build_config"
},
"rust-analyzer.runnables.extraEnv": {
    "RUSTFLAGS": "-C link-arg=-undefined -C link-arg=dynamic_lookup",
    "PYO3_CONFIG_FILE": "/Users/myusername/src/datafusion-python/.pyo3_build_config"
}