SQL

DataFusion also offers a SQL API, read the full reference here

In [1]: import datafusion

In [2]: from datafusion import DataFrame, SessionContext

# create a context
In [3]: ctx = datafusion.SessionContext()

# register a CSV
In [4]: ctx.register_csv("pokemon", "pokemon.csv")

# create a new statement via SQL
In [5]: df = ctx.sql('SELECT "Attack"+"Defense", "Attack"-"Defense" FROM pokemon')

# collect and convert to pandas DataFrame
In [6]: df.to_pandas()
Out[6]: 
     pokemon.Attack + pokemon.Defense  pokemon.Attack - pokemon.Defense
0                                  98                                 0
1                                 125                                -1
2                                 165                                -1
3                                 223                               -23
4                                  95                                 9
..                                ...                               ...
158                               190                                10
159                               109                                19
160                               149                                19
161                               229                                39
162                               200                                20

[163 rows x 2 columns]

Parameterized queries

In DataFusion-Python 51.0.0 we introduced the ability to pass parameters in a SQL query. These are similar in concept to prepared statements, but allow passing named parameters into a SQL query. Consider this simple example.

In [7]: def show_attacks(ctx: SessionContext, threshold: int) -> None:
   ...:     ctx.sql(
   ...:         'SELECT "Name", "Attack" FROM pokemon WHERE "Attack" > $val', val=threshold
   ...:     ).show(num=5)
   ...: show_attacks(ctx, 75)
   ...: 
DataFrame()
+---------------------------+--------+
| Name                      | Attack |
+---------------------------+--------+
| Venusaur                  | 82     |
| VenusaurMega Venusaur     | 100    |
| Charizard                 | 84     |
| CharizardMega Charizard X | 130    |
| CharizardMega Charizard Y | 104    |
+---------------------------+--------+

When passing parameters like the example above we convert the Python objects into their string representation. We also have special case handling for DataFrame objects, since they cannot simply be turned into string representations for an SQL query. In these cases we will register a temporary view in the SessionContext using a generated table name.

The formatting for passing string replacement objects is to precede the variable name with a single $. This works for all dialects in the SQL parser except hive and mysql. Since these dialects do not support named placeholders, we are unable to do this type of replacement. We recommend either switching to another dialect or using Python f-string style replacement.

Warning

To support DataFrame parameterized queries, your session must support registration of temporary views. The default CatalogProvider and SchemaProvider do have this capability. If you have implemented custom providers, it is important that temporary views do not persist across SessionContext or you may get unintended consequences.

The following example shows passing in both a DataFrame object as well as a Python object to be used in parameterized replacement.

In [8]: def show_column(
   ...:     ctx: SessionContext, column: str, df: DataFrame, threshold: int
   ...: ) -> None:
   ...:     ctx.sql(
   ...:         'SELECT "Name", $col FROM $df WHERE $col > $val',
   ...:         col=column,
   ...:         df=df,
   ...:         val=threshold,
   ...:     ).show(num=5)
   ...: df = ctx.table("pokemon")
   ...: show_column(ctx, '"Defense"', df, 75)
   ...: 
DataFrame()
+---------------------------+---------+
| Name                      | Defense |
+---------------------------+---------+
| Venusaur                  | 83      |
| VenusaurMega Venusaur     | 123     |
| Charizard                 | 78      |
| CharizardMega Charizard X | 111     |
| CharizardMega Charizard Y | 78      |
+---------------------------+---------+

The approach implemented for conversion of variables into a SQL query relies on string conversion. This has the potential for data loss, specifically for cases like floating point numbers. If you need to pass variables into a parameterized query and it is important to maintain the original value without conversion to a string, then you can use the optional parameter param_values to specify these. This parameter expects a dictionary mapping from the parameter name to a Python object. Those objects will be cast into a PyArrow Scalar Value.

Using param_values will rely on the SQL dialect you have configured for your session. This can be set using the configuration options of your SessionContext. Similar to how prepared statements work, these parameters are limited to places where you would pass in a scalar value, such as a comparison.

In [9]: def param_attacks(ctx: SessionContext, threshold: int) -> None:
   ...:     ctx.sql(
   ...:         'SELECT "Name", "Attack" FROM pokemon WHERE "Attack" > $val',
   ...:         param_values={"val": threshold},
   ...:     ).show(num=5)
   ...: param_attacks(ctx, 75)
   ...: 
DataFrame()
+---------------------------+--------+
| Name                      | Attack |
+---------------------------+--------+
| Venusaur                  | 82     |
| VenusaurMega Venusaur     | 100    |
| Charizard                 | 84     |
| CharizardMega Charizard X | 130    |
| CharizardMega Charizard Y | 104    |
+---------------------------+--------+