Basic Operations

In this section, you will learn how to inspect the essential details of a DataFrame: its rows, its schema, and summary statistics of its columns.

In [1]: from datafusion import SessionContext

In [2]: import random

In [3]: ctx = SessionContext()

In [4]: df = ctx.from_pydict({
   ...:     "nrs": [1, 2, 3, 4, 5],
   ...:     "names": ["python", "ruby", "java", "haskell", "go"],
   ...:     "random": random.sample(range(1000), 5),
   ...:     "groups": ["A", "A", "B", "C", "B"],
   ...: })
   ...: 

In [5]: df
Out[5]: 
DataFrame()
+-----+---------+--------+--------+
| nrs | names   | random | groups |
+-----+---------+--------+--------+
| 1   | python  | 97     | A      |
| 2   | ruby    | 46     | A      |
| 3   | java    | 344    | B      |
| 4   | haskell | 855    | C      |
| 5   | go      | 11     | B      |
+-----+---------+--------+--------+

Use limit() to view the top rows of the frame:

In [6]: df.limit(2)
Out[6]: 
DataFrame()
+-----+--------+--------+--------+
| nrs | names  | random | groups |
+-----+--------+--------+--------+
| 1   | python | 97     | A      |
| 2   | ruby   | 46     | A      |
+-----+--------+--------+--------+

Display the column names and types of the DataFrame using schema():

In [7]: df.schema()
Out[7]: 
nrs: int64
names: string
random: int64
groups: string

The to_pandas() method uses pyarrow to convert the result to a pandas DataFrame: it collects the record batches, combines them into an Arrow table, and then converts that table to a pandas DataFrame.

In [8]: df.to_pandas()
Out[8]: 
   nrs    names  random groups
0    1   python      97      A
1    2     ruby      46      A
2    3     java     344      B
3    4  haskell     855      C
4    5       go      11      B

describe() shows a quick statistical summary of your data:

In [9]: df.describe()
Out[9]: 
DataFrame()
+------------+--------------------+-------+--------------------+--------+
| describe   | nrs                | names | random             | groups |
+------------+--------------------+-------+--------------------+--------+
| count      | 5.0                | 5     | 5.0                | 5      |
| null_count | 0.0                | 0     | 0.0                | 0      |
| mean       | 3.0                | null  | 270.6              | null   |
| std        | 1.5811388300841898 | null  | 351.74038721761826 | null   |
| min        | 1.0                | go    | 11.0               | A      |
| max        | 5.0                | ruby  | 855.0              | C      |
| median     | 3.0                | null  | 97.0               | null   |
+------------+--------------------+-------+--------------------+--------+
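
The numbers in this table can be cross-checked with Python's standard library. The sketch below assumes the column values from this particular run (the random column differs on every run) and uses a hypothetical helper named summarize; it is a way to verify the output, not a description of how describe() computes it. The std row matches the sample standard deviation, which divides by n - 1, and min and max for string columns follow lexicographic order.

```python
import statistics

# Column values copied from the example DataFrame shown above.
nrs = [1, 2, 3, 4, 5]
random_col = [97, 46, 344, 855, 11]
names = ["python", "ruby", "java", "haskell", "go"]


def summarize(values):
    """Hypothetical helper: summary statistics for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        # statistics.stdev is the sample standard deviation (divides by n - 1).
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
    }


nrs_stats = summarize(nrs)
random_stats = summarize(random_col)

for label, stats in (("nrs", nrs_stats), ("random", random_stats)):
    print(label, stats)

# String columns only get count, min, and max; comparison is lexicographic.
print("names", {"count": len(names), "min": min(names), "max": max(names)})
```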