Data Sources¶
DataFusion provides a wide variety of ways to get data into a DataFrame to perform operations.
Local file¶
DataFusion has the ability to read from a variety of popular file formats, such as Parquet, CSV, JSON, and Avro.
In [1]: from datafusion import SessionContext
In [2]: ctx = SessionContext()
In [3]: df = ctx.read_csv("pokemon.csv")
In [4]: df.show()
DataFrame()
+----+---------------------------+--------+--------+-------+----+--------+---------+---------+---------+-------+------------+-----------+
| # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
+----+---------------------------+--------+--------+-------+----+--------+---------+---------+---------+-------+------------+-----------+
| 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | false |
| 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | false |
| 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | false |
| 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | false |
| 4 | Charmander | Fire | | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | false |
| 5 | Charmeleon | Fire | | 405 | 58 | 64 | 58 | 80 | 65 | 80 | 1 | false |
| 6 | Charizard | Fire | Flying | 534 | 78 | 84 | 78 | 109 | 85 | 100 | 1 | false |
| 6 | CharizardMega Charizard X | Fire | Dragon | 634 | 78 | 130 | 111 | 130 | 85 | 100 | 1 | false |
| 6 | CharizardMega Charizard Y | Fire | Flying | 634 | 78 | 104 | 78 | 159 | 115 | 100 | 1 | false |
| 7 | Squirtle | Water | | 314 | 44 | 48 | 65 | 50 | 64 | 43 | 1 | false |
| 8 | Wartortle | Water | | 405 | 59 | 63 | 80 | 65 | 80 | 58 | 1 | false |
| 9 | Blastoise | Water | | 530 | 79 | 83 | 100 | 85 | 105 | 78 | 1 | false |
| 9 | BlastoiseMega Blastoise | Water | | 630 | 79 | 103 | 120 | 135 | 115 | 78 | 1 | false |
| 10 | Caterpie | Bug | | 195 | 45 | 30 | 35 | 20 | 20 | 45 | 1 | false |
| 11 | Metapod | Bug | | 205 | 50 | 20 | 55 | 25 | 25 | 30 | 1 | false |
| 12 | Butterfree | Bug | Flying | 395 | 60 | 45 | 50 | 90 | 80 | 70 | 1 | false |
| 13 | Weedle | Bug | Poison | 195 | 40 | 35 | 30 | 20 | 20 | 50 | 1 | false |
| 14 | Kakuna | Bug | Poison | 205 | 45 | 25 | 50 | 25 | 25 | 35 | 1 | false |
| 15 | Beedrill | Bug | Poison | 395 | 65 | 90 | 40 | 45 | 80 | 75 | 1 | false |
| 15 | BeedrillMega Beedrill | Bug | Poison | 495 | 65 | 150 | 40 | 15 | 80 | 145 | 1 | false |
+----+---------------------------+--------+--------+-------+----+--------+---------+---------+---------+-------+------------+-----------+
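The other formats follow the same pattern through read_parquet(), read_json(), and read_avro(). A minimal sketch, assuming files exist at these (hypothetical) paths:
df_parquet = ctx.read_parquet("yellow_tripdata.parquet")  # hypothetical path
df_json = ctx.read_json("records.json")                   # hypothetical path
df_avro = ctx.read_avro("records.avro")                   # hypothetical path
df_parquet.show()
Each reader returns a DataFrame, just like read_csv() above.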
Create in-memory¶
Sometimes it can be convenient to create a small DataFrame from a Python list or dictionary object.
To do this in DataFusion, you can use one of three functions: from_pydict(), from_pylist(), or create_dataframe().
As their names suggest, from_pydict() and from_pylist() create DataFrames from Python dictionary and list objects, respectively. create_dataframe() expects a list of lists of PyArrow RecordBatches.
The following three examples all create identical DataFrames:
In [5]: import pyarrow as pa
In [6]: ctx.from_pylist([
...: { "a": 1, "b": 10.0, "c": "alpha" },
...: { "a": 2, "b": 20.0, "c": "beta" },
...: { "a": 3, "b": 30.0, "c": "gamma" },
...: ]).show()
...:
DataFrame()
+---+------+-------+
| a | b | c |
+---+------+-------+
| 1 | 10.0 | alpha |
| 2 | 20.0 | beta |
| 3 | 30.0 | gamma |
+---+------+-------+
In [7]: ctx.from_pydict({
...: "a": [1, 2, 3],
...: "b": [10.0, 20.0, 30.0],
...: "c": ["alpha", "beta", "gamma"],
...: }).show()
...:
DataFrame()
+---+------+-------+
| a | b | c |
+---+------+-------+
| 1 | 10.0 | alpha |
| 2 | 20.0 | beta |
| 3 | 30.0 | gamma |
+---+------+-------+
In [8]: batch = pa.RecordBatch.from_arrays(
...: [
...: pa.array([1, 2, 3]),
...: pa.array([10.0, 20.0, 30.0]),
...: pa.array(["alpha", "beta", "gamma"]),
...: ],
...: names=["a", "b", "c"],
...: )
...:
In [9]: ctx.create_dataframe([[batch]]).show()
DataFrame()
+---+------+-------+
| a | b | c |
+---+------+-------+
| 1 | 10.0 | alpha |
| 2 | 20.0 | beta |
| 3 | 30.0 | gamma |
+---+------+-------+
Object Store¶
DataFusion has support for multiple storage options in addition to local files. The example below requires an appropriate S3 account with access credentials.
Supported object stores include AmazonS3, GoogleCloud, MicrosoftAzure, Http, and LocalFileSystem, all available from the datafusion.object_store module.
import os

from datafusion.object_store import AmazonS3
region = "us-east-1"
bucket_name = "yellow-trips"
s3 = AmazonS3(
bucket_name=bucket_name,
region=region,
access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)
path = f"s3://{bucket_name}/"
ctx.register_object_store("s3://", s3, None)
ctx.register_parquet("trips", path)
ctx.table("trips").show()
Other DataFrame Libraries¶
DataFusion can import DataFrames directly from other libraries, such as Polars and Pandas.
Since DataFusion version 42.0.0, any DataFrame library that supports the Arrow FFI PyCapsule interface can be imported into DataFusion using the from_arrow() function. Older versions of Polars may not support the Arrow interface; in those cases, you can still import via the from_polars() function.
import pandas as pd
data = { "a": [1, 2, 3], "b": [10.0, 20.0, 30.0], "c": ["alpha", "beta", "gamma"] }
pandas_df = pd.DataFrame(data)
datafusion_df = ctx.from_arrow(pandas_df)
datafusion_df.show()
import polars as pl
polars_df = pl.DataFrame(data)
datafusion_df = ctx.from_arrow(polars_df)
datafusion_df.show()
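If your installed Polars version predates PyCapsule support, a minimal sketch of the from_polars() fallback, reusing polars_df from above:
# Uses the Polars-specific import path instead of the Arrow FFI interface.
datafusion_df = ctx.from_polars(polars_df)
datafusion_df.show()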
Delta Lake¶
DataFusion 43.0.0 and later support registering table providers from sources such as Delta Lake. This requires a recent version of deltalake to provide the required interfaces.
from deltalake import DeltaTable
delta_table = DeltaTable("path_to_table")
ctx.register_table_provider("my_delta_table", delta_table)
df = ctx.table("my_delta_table")
df.show()
On older versions of deltalake (prior to 0.22) you can use the Arrow Dataset interface to import into DataFusion, but this does not support features such as filter pushdown, which can lead to a significant performance difference.
from deltalake import DeltaTable
delta_table = DeltaTable("path_to_table")
ctx.register_dataset("my_delta_table", delta_table.to_pyarrow_dataset())
df = ctx.table("my_delta_table")
df.show()
Iceberg¶
Coming soon!
Custom Table Provider¶
You can implement a custom table provider in Rust and expose it to DataFusion through the interface described in the Custom Table Provider section. This is an advanced topic, but a user example is provided in the DataFusion repository.