ConceptsΒΆ
In this section, we will cover a basic example to introduce a few key concepts.
import datafusion
from datafusion import col
import pyarrow
# create a context
ctx = datafusion.SessionContext()
# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])
# create a new statement
df = df.select(
col("a") + col("b"),
col("a") - col("b"),
)
# execute and collect the first (and only) batch
result = df.collect()[0]
The first statement group:
# create a context
ctx = datafusion.SessionContext()
creates a SessionContext
, that is, the main interface for executing queries with DataFusion. It maintains the state
of the connection between a user and an instance of the DataFusion engine. Additionally it provides the following functionality:
Create a DataFrame from a CSV or Parquet data source.
Register a CSV or Parquet data source as a table that can be referenced from a SQL query.
Register a custom data source that can be referenced from a SQL query.
Execute a SQL query
The second statement group creates a DataFrame
,
# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])
A DataFrame refers to a (logical) set of rows that share the same column names, similar to a Pandas DataFrame.
DataFrames are typically created by calling a method on SessionContext
, such as read_csv
, and can then be modified by
calling the transformation methods, such as filter()
, select()
, aggregate()
,
and limit()
to build up a query definition.
The third statement uses Expressions
to build up a query definition.
df = df.select(
col("a") + col("b"),
col("a") - col("b"),
)
Finally the collect()
method converts the logical plan represented by the DataFrame into a physical plan and execute it,
collecting all results into a list of RecordBatch.