Arrow¶

DataFusion implements the Apache Arrow PyCapsule interface for importing and exporting DataFrames with zero copy. With this feature, any Python project that implements this interface can share data back and forth with DataFusion with zero copy.

We can demonstrate using pyarrow.

Importing to DataFusion¶

Here we will create an Arrow table and import it to DataFusion.

To import an Arrow table, use datafusion.context.SessionContext.from_arrow(). This will accept any Python object that implements __arrow_c_stream__ or __arrow_c_array__ and returns a StructArray. Common pyarrow sources you can use are:

Array (but it must return a Struct Array)
Record Batch
Record Batch Reader
Table

In [1]: from datafusion import SessionContext

In [2]: import pyarrow as pa

In [3]: data = {"a": [1, 2, 3], "b": [4, 5, 6]}

In [4]: table = pa.Table.from_pydict(data)

In [5]: ctx = SessionContext()

In [6]: df = ctx.from_arrow(table)

In [7]: df
Out[7]: 
DataFrame()
+---+---+
| a | b |
+---+---+
| 1 | 4 |
| 2 | 5 |
| 3 | 6 |
+---+---+

Exporting from DataFusion¶

DataFusion DataFrames implement __arrow_c_stream__ PyCapsule interface, so any Python library that accepts these can import a DataFusion DataFrame directly.

Warning

It is important to note that this will cause the DataFrame execution to happen, which may be a time consuming task. That is, you will cause a datafusion.dataframe.DataFrame.collect() operation call to occur.

In [8]: df = df.select((col("a") * lit(1.5)).alias("c"), lit("df").alias("d"))

In [9]: pa.table(df)
Out[9]: 
pyarrow.Table
c: double
d: string_view not null
----
c: [[1.5,3,4.5]]
d: [["df","df","df"]]

Avro