# Crate Configuration

This section contains information on how to configure builds of DataFusion in your Rust project. The [Configuration Settings] section lists options that control additional aspects of DataFusion's runtime behavior.

[configuration settings]: configs.md

## Using the nightly DataFusion builds

DataFusion changes are published to `crates.io` according to the [release schedule](https://github.com/apache/datafusion/blob/main/dev/release/README.md#release-process).

If you would like to use or test versions of the DataFusion code that are merged but not yet published, you can use Cargo's [support for adding dependencies] directly to a GitHub branch:

```toml
datafusion = { git = "https://github.com/apache/datafusion", branch = "main"}
```

This also works at the package level:

```toml
datafusion-common = { git = "https://github.com/apache/datafusion", branch = "main", package = "datafusion-common"}
```

And with features:

```toml
datafusion = { git = "https://github.com/apache/datafusion", branch = "main", default-features = false, features = ["unicode_expressions"] }
```

See [Cargo dependencies](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies) for more details.

## Optimizing Builds

Here are several suggestions to get the Rust compiler to produce faster code when compiling DataFusion. Note that these changes may increase compile time and binary size.

### Generate Code with CPU Specific Instructions

By default, the Rust compiler produces code that runs on a wide range of CPUs, but may not take advantage of all the features of your specific CPU (such as certain [SIMD instructions]). This is especially true for x86_64 CPUs, where the default target is `x86_64-unknown-linux-gnu`, which only guarantees support for the `SSE2` instruction set. DataFusion can benefit from the more advanced instructions in the `AVX2` and `AVX512` instruction sets to speed up operations like filtering, aggregation, and joins.

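Before choosing a target CPU, it can help to check which of these instruction sets the machine actually reports. The following is a minimal sketch that uses the standard library's `is_x86_feature_detected!` macro. The helper name `detected_simd_features` is hypothetical and is not part of DataFusion, and the check only reports anything on x86 and x86_64 CPUs.

```rust
/// Hypothetical helper (not a DataFusion API): returns the names of the
/// SIMD feature sets that the CPU running this program reports.
fn detected_simd_features() -> Vec<&'static str> {
    #[allow(unused_mut)]
    let mut found = Vec::new();

    // The detection macro only exists on x86 / x86_64 targets.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("sse2") {
            found.push("sse2");
        }
        if std::is_x86_feature_detected!("avx2") {
            found.push("avx2");
        }
        if std::is_x86_feature_detected!("avx512f") {
            found.push("avx512f");
        }
    }

    found
}

fn main() {
    let features = detected_simd_features();
    if features.is_empty() {
        println!("no x86 SIMD features detected (or not an x86 CPU)");
    } else {
        println!("detected: {}", features.join(", "));
    }
}
```

This detection happens at runtime and does not change what the compiler emits. It only tells you whether code built for a more specific CPU could run on this machine.
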
To tell the Rust compiler to use these instructions, set the `RUSTFLAGS` environment variable to specify a more specific target CPU.

We recommend enabling at least `avx2`, or preferably setting `target-cpu` to `native` (whatever the current CPU is). For example, to build and run DataFusion with optimizations for your current CPU:

```shell
RUSTFLAGS='-C target-cpu=native' cargo run --release
```

[simd instructions]: https://en.wikipedia.org/wiki/SIMD

### Enable Link Time Optimization / Single Codegen Unit

You can potentially improve performance by compiling DataFusion into a single codegen unit, which gives the Rust compiler more opportunity to optimize across crate boundaries. To do so, modify your project's `Cargo.toml` to include `lto = true` and `codegen-units = 1` as shown below. Beware that using a single codegen unit _significantly_ increases `--release` build times.

```toml
[profile.release]
lto = true
codegen-units = 1
```

### Alternate Allocator: `snmalloc`

You can also use the [snmalloc-rs](https://crates.io/crates/snmalloc-rs) crate as the memory allocator for DataFusion to improve performance. To do so, add the dependency to your `Cargo.toml` as shown below.

```toml
[dependencies]
snmalloc-rs = "0.3"
```

Then, in `main.rs`, set the global memory allocator after your imports:

```no-run
use datafusion::prelude::*;

#[global_allocator]
static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  Ok(())
}
```

## Enable Backtraces

By default, DataFusion returns errors as a plain text message.
You can get more verbose error details, such as backtraces, by enabling the `backtrace` feature in your `Cargo.toml` file:

```toml
datafusion = { version = "31.0.0", features = ["backtrace"]}
```

Then set the backtrace [environment variables](https://doc.rust-lang.org/std/backtrace/index.html#environment-variables):

```bash
RUST_BACKTRACE=1 ./target/debug/datafusion-cli
DataFusion CLI v31.0.0
> select row_numer() over (partition by a order by a) from (select 1 a);
Error during planning: Invalid function 'row_numer'.
Did you mean 'ROW_NUMBER'?

backtrace:    0: std::backtrace_rs::backtrace::libunwind::trace
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1: std::backtrace_rs::backtrace::trace_unsynchronized
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: std::backtrace::Backtrace::create
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:332:13
   3: std::backtrace::Backtrace::capture
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:298:9
   4: datafusion_common::error::DataFusionError::get_back_trace
             at /datafusion/datafusion/common/src/error.rs:436:30
   5: datafusion_sql::expr::function::>::sql_function_to_expr
   ............
```

The backtraces are useful when debugging code.
Suppose there is a test in `datafusion/core/src/physical_planner.rs`:

```rust
#[tokio::test]
async fn test_get_backtrace_for_failed_code() -> Result<()> {
    let ctx = SessionContext::new();

    let sql = "
    select row_numer() over (partition by a order by a)
    from (select 1 a);
    ";

    let _ = ctx.sql(sql).await?.collect().await?;

    Ok(())
}
```

To obtain a backtrace, run:

```bash
cargo build --features=backtrace
RUST_BACKTRACE=1 cargo test --features=backtrace --package datafusion --lib -- physical_planner::tests::test_get_backtrace_for_failed_code --exact --nocapture

running 1 test
Error: Plan("Invalid function 'row_numer'.\nDid you mean 'ROW_NUMBER'?\n\nbacktrace:    0: std::backtrace_rs::backtrace::libunwind::trace\n             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5\n   1: std::backtrace_rs::backtrace::trace_unsynchronized\n...
```

Note: The backtrace is wrapped in system calls, so some steps at the top of the backtrace can be ignored.

To show the backtrace in a pretty-printed format, use `eprintln!("{e}");`.

```rust
#[tokio::test]
async fn test_get_backtrace_for_failed_code() -> Result<()> {
    let ctx = SessionContext::new();

    let sql = "select row_numer() over (partition by a order by a) from (select 1 a);";

    // Print the error with its Display implementation instead of
    // propagating it, so the backtrace keeps its line breaks.
    let _ = match ctx.sql(sql).await {
        Ok(result) => result.show().await?,
        Err(e) => {
            eprintln!("{e}");
        }
    };

    Ok(())
}
```

Then run the test:

```bash
$ RUST_BACKTRACE=1 cargo test --features=backtrace --package datafusion --lib -- physical_planner::tests::test_get_backtrace_for_failed_code --exact --nocapture

running 1 test
Error during planning: Invalid function 'row_numer'.
Did you mean 'ROW_NUMBER'?

backtrace:    0: std::backtrace_rs::backtrace::libunwind::trace
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
   1: std::backtrace_rs::backtrace::trace_unsynchronized
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: std::backtrace::Backtrace::create
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/backtrace.rs:331:13
   3: std::backtrace::Backtrace::capture
   ...
```
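The frames listed above include `std::backtrace::Backtrace::capture`, which only records a trace when `RUST_BACKTRACE` (or `RUST_LIB_BACKTRACE`) enables it. The standalone sketch below illustrates that standard library behavior outside DataFusion. The helper `was_captured` is a hypothetical name introduced for illustration only.

```rust
use std::backtrace::{Backtrace, BacktraceStatus};

/// Hypothetical helper: reports whether a backtrace was actually recorded.
fn was_captured(bt: &Backtrace) -> bool {
    matches!(bt.status(), BacktraceStatus::Captured)
}

fn main() {
    // `capture` consults RUST_LIB_BACKTRACE / RUST_BACKTRACE and returns a
    // disabled backtrace when neither variable enables capturing.
    let gated = Backtrace::capture();

    // `force_capture` ignores the environment variables.
    let forced = Backtrace::force_capture();

    println!("capture():       captured = {}", was_captured(&gated));
    println!("force_capture(): captured = {}", was_captured(&forced));
}
```

Running this program with and without `RUST_BACKTRACE=1` should change the first line of output. This is why the environment variable is needed in addition to the `backtrace` feature.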