<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# Introduction

DataFusion is a very fast, extensible query engine for building
high-quality data-centric systems in [Rust](http://rustlang.org),
using the [Apache Arrow](https://arrow.apache.org) in-memory format.
DataFusion originated as part of the [Apache Arrow](https://arrow.apache.org/)
project.

DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, [python bindings], extensive customization, a great community, and more.

[python bindings]: https://github.com/apache/datafusion-python

## Project Goals

DataFusion aims to be the query engine of choice for new, fast
data centric systems such as databases, dataframe libraries, machine
learning and streaming applications by leveraging the unique features
of [Rust](https://www.rust-lang.org/) and [Apache
Arrow](https://arrow.apache.org/).

## Features

- Feature-rich [SQL support](https://datafusion.apache.org/user-guide/sql/index.html) and [DataFrame API](https://datafusion.apache.org/user-guide/dataframe.html)
- Blazingly fast, vectorized, multithreaded, streaming execution engine.
- Native support for Parquet, CSV, JSON, and Avro file formats. Support
  for custom file formats and non-file datasources via the `TableProvider` trait.
- Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
  other query languages, custom plan and execution nodes, optimizer passes, and more.
- Streaming, asynchronous IO directly from popular object stores, including AWS S3,
  Azure Blob Storage, and Google Cloud Storage (Other storage systems are supported via the
  `ObjectStore` trait).
- [Excellent Documentation](https://docs.rs/datafusion/latest) and a
  [welcoming community](https://datafusion.apache.org/contributor-guide/communication.html).
- A state of the art query optimizer with expression coercion and
  simplification, projection and filter pushdown, sort and distribution
  aware optimizations, automatic join reordering, and more.
- Permissive Apache 2.0 License, predictable and well understood
  [Apache Software Foundation](https://www.apache.org/) governance.
- Implementation in [Rust](https://www.rust-lang.org/), a modern
  system language with development productivity similar to Java or
  Golang, the performance of C++, and [loved by programmers
  everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
- Support for [Substrait](https://substrait.io/) query plans, to
  easily pass plans across language and system boundaries.

## Use Cases

DataFusion can be used without modification as an embedded SQL
engine or can be customized and used as a foundation for
building new systems.

While most current use cases are "analytic" or (throughput) some
components of DataFusion such as the plan representations, are
suitable for "streaming" and "transaction" style systems (low
latency).

Here are some example systems built using DataFusion:

- Specialized Analytical Database systems such as [HoraeDB] and more general Apache Spark like system such as [Ballista]
- New query language engines such as [prql-query] and accelerators such as [VegaFusion]
- Research platform for new Database Systems, such as [Flock]
- SQL support to another library, such as [Vortex]
- Streaming data platforms such as [Synnada]
- Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as [qv]
- Native Spark runtime replacement such as [Auron]
- Distributed data cache to boost GPU utilization of AI workloads with [Kubeflow Trainer](https://www.kubeflow.org/docs/components/trainer/user-guides/data-cache/)

By using DataFusion, projects are freed to focus on their specific
features, and avoid reimplementing general (but still necessary)
features such as an expression representation, standard optimizations,
parallelized streaming execution plans, file format support, etc.

## Known Users

Here are some active projects using DataFusion:

 <!-- "Active" means github repositories that had at least one commit in the last 6 months -->

- [Arroyo](https://github.com/ArroyoSystems/arroyo) Distributed stream processing engine in Rust
- [ArkFlow](https://github.com/arkflow-rs/arkflow) High-performance Rust stream processing engine
- [Auron] The Auron accelerator for big data engine (e.g., Spark, Flink) leverages native vectorized execution to accelerate query processing
- [Ballista] Distributed SQL Query Engine
- [CnosDB] Open Source Distributed Time Series Database
- [Comet](https://github.com/apache/datafusion-comet) Apache Spark native query execution plugin
- [Cube Store] Cube’s universal semantic layer platform is the next evolution of OLAP technology for AI, BI, spreadsheets, and embedded analytics
- [datafusion-dft](https://github.com/datafusion-contrib/datafusion-dft) Batteries included CLI, TUI, and server implementations for DataFusion.
- [dbt Fusion engine](https://github.com/dbt-labs/dbt-fusion) The dbt Fusion engine, written in Rust, designed for speed and correctness with a native SQL understanding across DWH SQL dialects.
- [delta-rs] Native Rust implementation of Delta Lake
- [EDB Postgres Lakehouse] built with [Seafowl]
- [Feldera](https://github.com/feldera/feldera) Fast query engine for incremental computation
- [Funnel](https://funnel.io/) Data Platform powering Marketing Intelligence applications.
- [GlareDB](https://github.com/GlareDB/glaredb) Fast SQL database for querying and analyzing distributed data.
- [GreptimeDB] Open Source & Cloud Native Distributed Time Series Database
- [hiop](https://hiop.io) Serverless Data Logistic Platform
- [HoraeDB] Distributed Time-Series Database
- [Iceberg-rust](https://github.com/apache/iceberg-rust) Rust implementation of Apache Iceberg
- [InfluxDB] Time Series Database
- [Kamu] Planet-scale streaming data pipeline
- [Kubeflow Trainer](https://github.com/kubeflow/trainer) Kubernetes-native project designed for
  scalable LLMs fine-tuning and distributed AI model training.
- [LakeSoul](https://github.com/lakesoul-io/LakeSoul) Open source LakeHouse framework with native IO in Rust.
- [Lance](https://github.com/lancedb/lance) Modern columnar data format for ML
- [OpenObserve] Distributed cloud native observability platform
- [ParadeDB](https://github.com/paradedb/paradedb) PostgreSQL for Search & Analytics
- [Parseable] Log storage and observability platform
- [Polygon.io](https://polygon.io/) Stock Market API
- [qv] Quickly view your data
- [R2 Query Engine](https://blog.cloudflare.com/r2-sql-deep-dive/) Cloudflare's distributed engine for querying data in Iceberg Catalogs
- [rerun.io](https://rerun.io/) Visualize and query robotics logs and transform them into training data.
- [Restate](https://github.com/restatedev) Easily build resilient applications using distributed durable async/await
- [ROAPI] Create full-fledged APIs for slowly moving datasets without writing a single line of code
- [Sail](https://github.com/lakehq/sail) Unifying stream, batch and AI workloads with Apache Spark compatibility
- [SedonaDB](https://github.com/apache/sedona-db) A single-node analytical database engine with geospatial as a first-class citizen
- [Sleeper](https://github.com/gchq/sleeper) Serverless, cloud-native, log-structured merge tree based, scalable key-value store
- [Spice.ai] Building blocks for data-driven AI applications
- [Synnada] Streaming-first framework for data products
- [VegaFusion] Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar
- [Vortex] An extensible, state of the art columnar file format
- [Telemetry](https://telemetry.sh/) Structured logging made easy
- [Xorq](https://github.com/xorq-labs/xorq/) Xorq is a multi-engine batch transformation framework built on Ibis, DataFusion and Arrow

Here are some less active projects that used DataFusion:

- [bdt](https://github.com/datafusion-contrib/bdt) Boring Data Tool
- [Cloudfuse Buzz]
- [Dask SQL] Distributed SQL query engine in Python
- [Exon] Analysis toolkit for life-science applications
- [Flock]
- [Tensorbase]

If you know of another project, please submit a PR to add a link!

[ballista]: https://github.com/apache/datafusion-ballista
[auron]: https://github.com/apache/auron
[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust
[cnosdb]: https://github.com/cnosdb/cnosdb
[cube store]: https://github.com/cube-js/cube.js/tree/master/rust
[dask sql]: https://github.com/dask-contrib/dask-sql
[delta-rs]: https://github.com/delta-io/delta-rs
[edb postgres lakehouse]: https://www.enterprisedb.com/products/analytics
[exon]: https://github.com/wheretrue/exon
[flock]: https://github.com/flock-lab/flock
[kamu]: https://github.com/kamu-data/kamu-cli
[greptimedb]: https://github.com/GreptimeTeam/greptimedb
[horaedb]: https://github.com/apache/incubator-horaedb
[influxdb]: https://github.com/influxdata/influxdb
[openobserve]: https://github.com/openobserve/openobserve
[parseable]: https://github.com/parseablehq/parseable
[prql-query]: https://github.com/prql/prql-query
[qv]: https://github.com/timvw/qv
[roapi]: https://github.com/roapi/roapi
[seafowl]: https://github.com/splitgraph/seafowl
[spice.ai]: https://github.com/spiceai/spiceai
[synnada]: https://synnada.ai/
[tensorbase]: https://github.com/tensorbase/tensorbase
[vegafusion]: https://vegafusion.io/
[vortex]: https://vortex.dev/

## Integrations and Extensions

There are a number of community projects that extend DataFusion or
provide integrations with other systems, some of which are described below:

### Language Bindings

- [datafusion-c](https://github.com/datafusion-contrib/datafusion-c)
- [datafusion-python](https://github.com/apache/datafusion-python)
- [datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby)
- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java)

### Integrations

- [datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable)
- [datafusion-catalogprovider-glue](https://github.com/datafusion-contrib/datafusion-catalogprovider-glue)
- [datafusion-federation](https://github.com/datafusion-contrib/datafusion-federation)

## Why DataFusion?

- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet, and Flight), DataFusion works well with the rest of the big data ecosystem
- _Easy to Embed_: Allowing extension at almost any point in its design, and published regularly as a crate on [crates.io](http://crates.io), DataFusion can be integrated and tailored for your specific usecase.
- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can and is used as the foundation for production systems.

## Rust Version Compatibility Policy

The Rust toolchain releases are tracked at [Rust Versions](https://releases.rs) and follow
[semantic versioning](https://semver.org/). A Rust toolchain release can be identified
by a version string like `1.80.0`, or more generally `major.minor.patch`.

DataFusion supports the last 4 stable Rust minor versions released and any such versions released within the last 4 months.

For example, given the releases `1.78.0`, `1.79.0`, `1.80.0`, `1.80.1` and `1.81.0` DataFusion will support 1.78.0, which is 3 minor versions prior to the most minor recent `1.81`.

Note: If a Rust hotfix is released for the current MSRV, the MSRV will be updated to the specific minor version that includes all applicable hotfixes preceding other policies.

DataFusion enforces MSRV policy using a [MSRV CI Check](https://github.com/search?q=repo%3Aapache%2Fdatafusion+rust-version+language%3ATOML+path%3A%2F%5ECargo.toml%2F&type=code)