Jupyter Notebooks¶
Ballista works well in Jupyter notebooks. DataFrames automatically render as formatted HTML tables when displayed in a notebook cell.
Basic Usage¶
from ballista import BallistaSessionContext
ctx = BallistaSessionContext("df://localhost:50050")
ctx.register_parquet("trips", "/path/to/nyctaxi.parquet")
# The result renders as an HTML table when this is the last expression in a cell
ctx.sql("SELECT * FROM trips LIMIT 10")
When a DataFrame is the last expression in a cell, Jupyter automatically calls its
_repr_html_() method, which renders a styled table with formatted column headers,
expandable cells for long text, and scrollable display for wide tables.
Converting Results¶
DataFrames can be converted to various formats for further analysis:
df = ctx.sql("SELECT * FROM trips WHERE fare_amount > 50")
pandas_df = df.to_pandas()
arrow_table = df.to_arrow_table()
polars_df = df.to_polars()
batches = df.collect()
Example Workflow¶
# Cell 1: Setup
from ballista import BallistaSessionContext
from datafusion import col, lit
ctx = BallistaSessionContext("df://localhost:50050")
ctx.register_parquet("orders", "/data/orders.parquet")
ctx.register_parquet("customers", "/data/customers.parquet")
# Cell 2: Explore the data
df = ctx.sql("SELECT * FROM orders LIMIT 5")
# Cell 3: Run analysis — DataFrame renders as an HTML table
df = ctx.sql("""
SELECT
c.name,
COUNT(*) as order_count,
SUM(o.amount) as total_spent
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.name
ORDER BY total_spent DESC
LIMIT 10
""")
# Cell 4: Convert to Pandas for visualization
import matplotlib.pyplot as plt
pandas_df = df.to_pandas()
pandas_df.plot(kind='bar', x='name', y='total_spent')
plt.show()
Running a Local Cluster in a Notebook¶
For development and testing, you can start a local cluster directly from a notebook:
from ballista import BallistaSessionContext, setup_test_cluster
host, port = setup_test_cluster()
ctx = BallistaSessionContext(f"df://{host}:{port}")