Concepts

In this section, we will cover a basic example to introduce a few key concepts. We will use the same source file as described in the Introduction, the Pokemon data set.

In [1]: from datafusion import SessionContext, col, lit, functions as f

In [2]: ctx = SessionContext()

In [3]: df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")

In [4]: df = df.select(
   ...:     "trip_distance",
   ...:     col("total_amount").alias("total"),
   ...:     (f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"),
   ...: )
   ...: 

In [5]: df.show()
DataFrame()
+---------------+-------+-------------+
| trip_distance | total | tip_percent |
+---------------+-------+-------------+
| 2.1           | 11.8  | 0.0         |
| 0.2           | 4.3   | 0.0         |
| 14.7          | 51.95 | 16.7        |
| 10.6          | 36.35 | 16.6        |
| 4.94          | 24.36 | 16.7        |
| 1.6           | 14.15 | 16.6        |
| 4.1           | 17.3  | 0.0         |
| 5.7           | 21.8  | 0.0         |
| 9.1           | 28.8  | 0.0         |
| 2.7           | 18.95 | 16.6        |
| 6.11          | 24.3  | 0.0         |
| 1.21          | 10.79 | 23.1        |
| 7.4           | 33.92 | 0.0         |
| 1.7           | 14.16 | 16.7        |
| 0.81          | 8.3   | 0.0         |
| 1.01          | 10.3  | 9.7         |
| 0.73          | 12.09 | 23.1        |
| 1.17          | 12.36 | 16.7        |
| 0.78          | 9.96  | 16.7        |
| 1.66          | 12.3  | 0.0         |
+---------------+-------+-------------+

Session Context

The first statement group creates a SessionContext.

# create a context
ctx = datafusion.SessionContext()

A Session Context is the main interface for executing queries with DataFusion. It maintains the state of the connection between a user and an instance of the DataFusion engine. Additionally it provides the following functionality:

  • Create a DataFrame from a data source.

  • Register a data source as a table that can be referenced from a SQL query.

  • Execute a SQL query

DataFrame

The second statement group creates a DataFrame,

# Create a DataFrame from a file
df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")

A DataFrame refers to a (logical) set of rows that share the same column names, similar to a Pandas DataFrame. DataFrames are typically created by calling a method on SessionContext, such as read_csv, and can then be modified by calling the transformation methods, such as filter(), select(), aggregate(), and limit() to build up a query definition.

Expressions

The third statement uses Expressions to build up a query definition. You can find explanations for what the functions below do in the user documentation for col(), lit(), round(), and alias().

df = df.select(
    "trip_distance",
    col("total_amount").alias("total"),
    (f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"),
)

Finally the show() method converts the logical plan represented by the DataFrame into a physical plan and execute it, collecting all results and displaying them to the user. It is important to note that DataFusion performs lazy evaluation of the DataFrame. Until you call a method such as show() or collect(), DataFusion will not perform the query.