Apache DataFusion
DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format.
The documentation on this site covers the core DataFusion project, which contains libraries and binaries for developers building fast, feature-rich database and analytic systems customized to particular workloads. See use cases for examples.
The following related subprojects target end users and have separate documentation:
- DataFusion Python offers a Python interface for SQL and DataFrame queries.
- DataFusion Ray provides a distributed version of DataFusion that scales out on Ray clusters.
- DataFusion Comet is an accelerator for Apache Spark based on DataFusion.
“Out of the box,” DataFusion offers SQL and DataFrame APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. Python bindings are also available.
DataFusion features a full query planner; a columnar, streaming, multi-threaded, vectorized execution engine; and partitioned data sources. You can customize DataFusion at almost every point, including data sources, query languages, functions, custom operators, and more. See the Architecture section for more details.
To get started, see:
- The example usage section of the user guide and the datafusion-examples directory.
- The library user guide for examples of using DataFusion’s extension APIs.
- The developer’s guide for contributing and communication for getting in touch with us.
- Introduction
- Extensions List
- Using the SQL API
- Working with Exprs
- Using the DataFrame API
- Write DataFrame to Files
- Building Logical Plans
- Catalogs, Schemas, and Tables
- Adding User Defined Functions: Scalar/Window/Aggregate/Table Functions
- Custom Table Provider
- Extending DataFusion’s operators: custom LogicalPlan and Execution Plans
- Profiling Cookbook
- DataFusion Query Optimizer
- API health policy