Introduction¶
We welcome and encourage contributions of all kinds, such as:
Tickets with issue reports or feature requests
Documentation improvements
Code, both PRs and (especially) PR reviews.
In addition to submitting new PRs, we have a healthy tradition of community members reviewing each other’s PRs. Doing so is a great way to help the community as well as get more familiar with Rust and the relevant codebases.
How to develop¶
This assumes that you have Rust and Cargo installed. We use the workflow recommended by pyo3 and maturin. We recommend using uv for Python package management.
By default, uv will attempt to build the datafusion Python package. For development we prefer to build it manually. This means that when creating your virtual environment with uv sync you need to pass the additional --no-install-package datafusion option, and for uv run commands the additional --no-project parameter.
Bootstrap:
# fetch this repo
git clone git@github.com:apache/datafusion-python.git
# create the virtual environment
uv sync --dev --no-install-package datafusion
# activate the environment
source .venv/bin/activate
The tests rely on test data in git submodules.
git submodule init
git submodule update
Whenever Rust code changes (your changes or via git pull):
# make sure you activate the venv using "source .venv/bin/activate" first
maturin develop --uv
python -m pytest
Running & Installing pre-commit hooks¶
arrow-datafusion-python takes advantage of pre-commit to assist developers with code linting to help reduce the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional for the developer but certainly helpful for keeping PRs clean and concise.
Our pre-commit hooks can be installed by running pre-commit install, which installs the hooks into your ARROW_DATAFUSION_PYTHON_ROOT/.git/hooks directory. They then run each time you perform a commit and abort the commit if an offending lint is found, allowing you to make changes locally before pushing.
The pre-commit hooks can also be run ad hoc, without installing them, by running pre-commit run --all-files
Guidelines for Separating Python and Rust Code¶
Version 40 of datafusion-python introduced python wrappers around the pyo3 generated code to vastly improve the user experience. (See the blog post and pull request for more details.)
Mostly, the python code is limited to pure wrappers with type hints and good docstrings, but there are a few cases where the code does more:
Trivial aliases like array_append() and list_append().
Simple type conversion, like from a path to a string of the path or from number to lit(number).
The additional code makes an API much more pythonic, like we do for named_struct() (see source code).
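As an illustration of the first two cases, here is a minimal sketch of what a thin wrapper layer can look like. The `_InternalFunctions` class is a hypothetical stand-in for the pyo3-generated bindings so that the example runs on its own, and `read_file()` is an invented name. Neither is the real datafusion-python code.

```python
from __future__ import annotations

from pathlib import Path


class _InternalFunctions:
    """Hypothetical stand-in for pyo3-generated bindings (not the real module)."""

    @staticmethod
    def array_append(array: list, element: object) -> list:
        return [*array, element]

    @staticmethod
    def read_file(path: str) -> str:
        # The pretend Rust side only accepts a plain string path.
        return f"reading {path}"


def array_append(array: list, element: object) -> list:
    """Return a new list with ``element`` appended to ``array``."""
    # Pure wrapper: type hints and a docstring, no extra logic.
    return _InternalFunctions.array_append(array, element)


def list_append(array: list, element: object) -> list:
    """Return a new list with ``element`` appended to ``array``.

    This is an alias for :py:func:`array_append`.
    """
    # Trivial alias: delegate to the primary Python wrapper.
    return array_append(array, element)


def read_file(path: str | Path) -> str:
    """Read a file given either a ``str`` or a ``pathlib.Path``."""
    # Simple type conversion happens in Python before calling the bindings.
    return _InternalFunctions.read_file(str(path))
```

In this sketch the alias calls the Python wrapper instead of the binding, so both names share one docstring-bearing code path.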
Update Dependencies¶
To change test dependencies, edit pyproject.toml. Then, to update the installed dependencies, run
uv sync --dev --no-install-package datafusion
Improving Build Speed¶
The pyo3 dependency of this project contains a build.rs file which can cause it to rebuild frequently. You can prevent this from happening by defining a PYO3_CONFIG_FILE environment variable that points to a file with your build configuration. Whenever your build configuration changes, such as during some major version updates, you will need to regenerate this file. This variable should point to a fully resolved path on your build machine.
To generate this file, use the following command:
PYO3_PRINT_CONFIG=1 cargo build
This will generate some output that looks like the following. You will want to copy these contents into a file. If you place this file in your project directory with the filename .pyo3_build_config, it will be ignored by git.
implementation=CPython
version=3.8
shared=true
abi3=true
lib_name=python3.12
lib_dir=/opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12/lib
executable=/Users/myusername/src/datafusion-python/.venv/bin/python
pointer_width=64
build_flags=
suppress_build_script_link_lines=false
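If you would rather not copy the lines by hand, the filtering step can be scripted. The sketch below keeps only lines whose key appears in the sample above and drops everything else. The function name `extract_pyo3_config` is made up for this example, and the assumption that the build output contains other lines or leading indentation around the configuration may not match your cargo version exactly.

```python
import re

# Keys taken from the sample output above; other pyo3 versions may print
# more or fewer keys, so treat this list as an assumption.
KNOWN_KEYS = (
    "implementation",
    "version",
    "shared",
    "abi3",
    "lib_name",
    "lib_dir",
    "executable",
    "pointer_width",
    "build_flags",
    "suppress_build_script_link_lines",
)
_CONFIG_LINE = re.compile(r"^\s*(%s)=(.*)$" % "|".join(KNOWN_KEYS))


def extract_pyo3_config(build_output: str) -> str:
    """Return only the known ``key=value`` lines found in ``build_output``."""
    kept = []
    for line in build_output.splitlines():
        match = _CONFIG_LINE.match(line)
        if match:
            # Drop leading indentation and trailing whitespace around the pair.
            kept.append(f"{match.group(1)}={match.group(2).rstrip()}\n")
    return "".join(kept)
```

Pass it the text captured from the command above and write the returned string to .pyo3_build_config.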
Add the environment variable to your system.
export PYO3_CONFIG_FILE="/Users/myusername/src/datafusion-python/.pyo3_build_config"
If you are on a Mac and you use VS Code for your IDE, you will want to add these variables to your settings. You can find the appropriate rust flags by looking in the .cargo/config.toml file.
"rust-analyzer.cargo.extraEnv": {
    "RUSTFLAGS": "-C link-arg=-undefined -C link-arg=dynamic_lookup",
    "PYO3_CONFIG_FILE": "/Users/myusername/src/datafusion-python/.pyo3_build_config"
},
"rust-analyzer.runnables.extraEnv": {
    "RUSTFLAGS": "-C link-arg=-undefined -C link-arg=dynamic_lookup",
    "PYO3_CONFIG_FILE": "/Users/myusername/src/datafusion-python/.pyo3_build_config"
}
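Both entries need the same values, so one way to keep them in sync is to generate the fragment from a single path. The sketch below does that; the function name `rust_analyzer_env_settings` is invented for illustration, and the flags are simply the ones shown above, which you should confirm against .cargo/config.toml for your platform.

```python
import json

# Flags copied from the settings shown above (macOS); treat them as an
# assumption and confirm against .cargo/config.toml.
MACOS_RUSTFLAGS = "-C link-arg=-undefined -C link-arg=dynamic_lookup"


def rust_analyzer_env_settings(config_file: str) -> dict:
    """Build both rust-analyzer ``extraEnv`` entries from one config file path."""
    env = {"RUSTFLAGS": MACOS_RUSTFLAGS, "PYO3_CONFIG_FILE": config_file}
    return {
        # Each entry gets its own copy so later edits to one do not leak.
        "rust-analyzer.cargo.extraEnv": dict(env),
        "rust-analyzer.runnables.extraEnv": dict(env),
    }


def to_settings_fragment(config_file: str) -> str:
    """Render the entries as indented JSON ready to paste into settings."""
    return json.dumps(rust_analyzer_env_settings(config_file), indent=4)
```

The string returned by `to_settings_fragment()` is a complete JSON object, so copy only its two entries into an existing settings file.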