Basic Operations

In this section, you will learn how to inspect a DataFrame: preview its rows, view its schema, convert it to pandas, and summarize its contents.

In [1]: from datafusion import SessionContext

In [2]: import random

In [3]: ctx = SessionContext()

In [4]: df = ctx.from_pydict({
   ...:     "nrs": [1, 2, 3, 4, 5],
   ...:     "names": ["python", "ruby", "java", "haskell", "go"],
   ...:     "random": random.sample(range(1000), 5),
   ...:     "groups": ["A", "A", "B", "C", "B"],
   ...: })
   ...: 

In [5]: df
Out[5]: 
DataFrame()
+-----+---------+--------+--------+
| nrs | names   | random | groups |
+-----+---------+--------+--------+
| 1   | python  | 205    | A      |
| 2   | ruby    | 124    | A      |
| 3   | java    | 71     | B      |
| 4   | haskell | 121    | C      |
| 5   | go      | 294    | B      |
+-----+---------+--------+--------+

Use limit() to view the first rows of the DataFrame:

In [6]: df.limit(2)
Out[6]: 
DataFrame()
+-----+--------+--------+--------+
| nrs | names  | random | groups |
+-----+--------+--------+--------+
| 1   | python | 205    | A      |
| 2   | ruby   | 124    | A      |
+-----+--------+--------+--------+

Display the column names and data types of the DataFrame using schema():

In [7]: df.schema()
Out[7]: 
nrs: int64
names: string
random: int64
groups: string

The to_pandas() method uses pyarrow to convert the result to a pandas DataFrame: it collects the record batches, combines them into an Arrow table, and converts that table to a pandas DataFrame.

In [8]: df.to_pandas()
Out[8]: 
   nrs    names  random groups
0    1   python     205      A
1    2     ruby     124      A
2    3     java      71      B
3    4  haskell     121      C
4    5       go     294      B

describe() shows a quick statistical summary of your data:

In [9]: df.describe()
Out[9]: 
DataFrame()
+------------+--------------------+-------+-------------------+--------+
| describe   | nrs                | names | random            | groups |
+------------+--------------------+-------+-------------------+--------+
| count      | 5.0                | 5     | 5.0               | 5      |
| null_count | 0.0                | 0     | 0.0               | 0      |
| mean       | 3.0                | null  | 163.0             | null   |
| std        | 1.5811388300841898 | null  | 87.56997202237763 | null   |
| min        | 1.0                | go    | 71.0              | A      |
| max        | 5.0                | ruby  | 294.0             | C      |
| median     | 3.0                | null  | 124.0             | null   |
+------------+--------------------+-------+-------------------+--------+
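The numeric rows in the output above can be cross-checked with Python's standard statistics module. The sketch below copies the nrs and random values from the example output and recomputes each statistic. Note that the std row matches the sample standard deviation (divisor n - 1), which is what statistics.stdev computes. The summarize helper is hypothetical and exists only for this illustration.

```python
import statistics

# Values copied from the example output above.
columns = {
    "nrs": [1, 2, 3, 4, 5],
    "random": [205, 124, 71, 121, 294],
}


def summarize(values):
    """Recompute the numeric statistics that describe() reports."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        # Sample standard deviation (n - 1 in the denominator).
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
    }


summary = {name: summarize(values) for name, values in columns.items()}

for name, stats in summary.items():
    print(name, stats)
```

For string columns such as names and groups, describe() reports null for mean, std, and median, while min and max follow string ordering, as the table shows.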