Apache DataFusion

DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format.

The documentation on this site covers the core DataFusion project, which contains libraries and binaries for developers building fast, feature-rich database and analytic systems customized to particular workloads. See use cases for examples.

The following related subprojects target end users and have separate documentation.

DataFusion Python offers a Python interface for SQL and DataFrame queries.
DataFusion Ray provides a distributed version of DataFusion that scales out on Ray clusters.
DataFusion Comet is an accelerator for Apache Spark based on DataFusion.

“Out of the box,” DataFusion offers SQL and DataFrame APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. Python bindings are also available.

DataFusion features a full query planner; a columnar, streaming, multi-threaded, vectorized execution engine; and partitioned data sources. You can customize DataFusion at almost every point, including by adding data sources, query languages, functions, custom operators, and more. See the Architecture section for more details.

To get started, see:

The example usage section of the user guide and the datafusion-examples directory.
The library user guide for examples of using DataFusion’s extension APIs.
The developer’s guide for contributing, and the communication page for getting in touch with us.
