SQL¶
DataFusion also offers a SQL API, read the full reference here
In [1]: import datafusion
In [2]: from datafusion import DataFrame, SessionContext
# create a context
In [3]: ctx = datafusion.SessionContext()
# register a CSV
In [4]: ctx.register_csv("pokemon", "pokemon.csv")
# create a new statement via SQL
In [5]: df = ctx.sql('SELECT "Attack"+"Defense", "Attack"-"Defense" FROM pokemon')
# collect and convert to pandas DataFrame
In [6]: df.to_pandas()
Out[6]:
pokemon.Attack + pokemon.Defense pokemon.Attack - pokemon.Defense
0 98 0
1 125 -1
2 165 -1
3 223 -23
4 95 9
.. ... ...
158 190 10
159 109 19
160 149 19
161 229 39
162 200 20
[163 rows x 2 columns]
Parameterized queries¶
In DataFusion-Python 51.0.0 we introduced the ability to pass parameters in a SQL query. These are similar in concept to prepared statements, but allow passing named parameters into a SQL query. Consider this simple example.
In [7]: def show_attacks(ctx: SessionContext, threshold: int) -> None:
...: ctx.sql(
...: 'SELECT "Name", "Attack" FROM pokemon WHERE "Attack" > $val', val=threshold
...: ).show(num=5)
...: show_attacks(ctx, 75)
...:
DataFrame()
+---------------------------+--------+
| Name | Attack |
+---------------------------+--------+
| Venusaur | 82 |
| VenusaurMega Venusaur | 100 |
| Charizard | 84 |
| CharizardMega Charizard X | 130 |
| CharizardMega Charizard Y | 104 |
+---------------------------+--------+
When passing parameters like the example above we convert the Python objects
into their string representation. We also have special case handling
for DataFrame objects, since they cannot simply
be turned into string representations for an SQL query. In these cases we
will register a temporary view in the SessionContext
using a generated table name.
The formatting for passing string replacement objects is to precede the
variable name with a single $. This works for all dialects in
the SQL parser except hive and mysql. Since these dialects do not
support named placeholders, we are unable to do this type of replacement.
We recommend either switching to another dialect or using Python
f-string style replacement.
Warning
To support DataFrame parameterized queries, your session must support
registration of temporary views. The default
CatalogProvider and
SchemaProvider do have this capability.
If you have implemented custom providers, it is important that temporary
views do not persist across SessionContext
or you may get unintended consequences.
The following example shows passing in both a DataFrame
object as well as a Python object to be used in parameterized replacement.
In [8]: def show_column(
...: ctx: SessionContext, column: str, df: DataFrame, threshold: int
...: ) -> None:
...: ctx.sql(
...: 'SELECT "Name", $col FROM $df WHERE $col > $val',
...: col=column,
...: df=df,
...: val=threshold,
...: ).show(num=5)
...: df = ctx.table("pokemon")
...: show_column(ctx, '"Defense"', df, 75)
...:
DataFrame()
+---------------------------+---------+
| Name | Defense |
+---------------------------+---------+
| Venusaur | 83 |
| VenusaurMega Venusaur | 123 |
| Charizard | 78 |
| CharizardMega Charizard X | 111 |
| CharizardMega Charizard Y | 78 |
+---------------------------+---------+
The approach implemented for conversion of variables into a SQL query
relies on string conversion. This has the potential for data loss,
specifically for cases like floating point numbers. If you need to pass
variables into a parameterized query and it is important to maintain the
original value without conversion to a string, then you can use the
optional parameter param_values to specify these. This parameter
expects a dictionary mapping from the parameter name to a Python
object. Those objects will be cast into a
PyArrow Scalar Value.
Using param_values will rely on the SQL dialect you have configured
for your session. This can be set using the configuration options
of your SessionContext. Similar to how
prepared statements
work, these parameters are limited to places where you would pass in a
scalar value, such as a comparison.
In [9]: def param_attacks(ctx: SessionContext, threshold: int) -> None:
...: ctx.sql(
...: 'SELECT "Name", "Attack" FROM pokemon WHERE "Attack" > $val',
...: param_values={"val": threshold},
...: ).show(num=5)
...: param_attacks(ctx, 75)
...:
DataFrame()
+---------------------------+--------+
| Name | Attack |
+---------------------------+--------+
| Venusaur | 82 |
| VenusaurMega Venusaur | 100 |
| Charizard | 84 |
| CharizardMega Charizard X | 130 |
| CharizardMega Charizard Y | 104 |
+---------------------------+--------+