<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - agrove</title><link href="https://datafusion.apache.org/blog/" rel="alternate"/><link href="https://datafusion.apache.org/blog/feeds/agrove.atom.xml" rel="self"/><id>https://datafusion.apache.org/blog/</id><updated>2021-04-12T00:00:00+00:00</updated><entry><title>Ballista: A Distributed Scheduler for Apache Arrow</title><link href="https://datafusion.apache.org/blog/2021/04/12/ballista-donation" rel="alternate"/><published>2021-04-12T00:00:00+00:00</published><updated>2021-04-12T00:00:00+00:00</updated><author><name>agrove</name></author><id>tag:datafusion.apache.org,2021-04-12:/blog/2021/04/12/ballista-donation</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are excited to announce that &lt;a href="https://github.com/apache/arrow-datafusion/tree/master/ballista"&gt;Ballista&lt;/a&gt; has been donated 
to the Apache Arrow project.&lt;/p&gt;
&lt;p&gt;Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built
on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are excited to announce that &lt;a href="https://github.com/apache/arrow-datafusion/tree/master/ballista"&gt;Ballista&lt;/a&gt; has been donated 
to the Apache Arrow project.&lt;/p&gt;
&lt;p&gt;Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built
on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as
first-class citizens without paying a penalty for serialization costs.&lt;/p&gt;
&lt;p&gt;The foundational technologies in Ballista are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; memory model and compute kernels for efficient processing of data.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/arrow-datafusion"&gt;Apache Arrow DataFusion&lt;/a&gt; query planning and 
  execution framework, extended by Ballista to provide distributed planning and execution.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/"&gt;Apache Arrow Flight Protocol&lt;/a&gt; for efficient
  data transfer between processes.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.google.com/protocol-buffers"&gt;Google Protocol Buffers&lt;/a&gt; for serializing query plans.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.docker.com/"&gt;Docker&lt;/a&gt; for packaging up executors along with user-defined code.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Ballista can be deployed as a standalone cluster and also supports &lt;a href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt;. In either
case, the scheduler can be configured to use &lt;a href="https://etcd.io/"&gt;etcd&lt;/a&gt; as a backing store, with the goal of (eventually) providing
redundancy if a scheduler fails.&lt;/p&gt;
&lt;h2 id="status"&gt;Status&lt;a class="headerlink" href="#status" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ballista project is at an early stage of development. However, it is capable of running complex analytics queries 
in a distributed cluster with reasonable performance (comparable to more established distributed query frameworks).&lt;/p&gt;
&lt;p&gt;One of the benefits of Ballista being part of the Arrow codebase is that there is now an opportunity to push parts of 
the scheduler down to DataFusion so that it is possible to seamlessly scale across cores in DataFusion, and across nodes 
in Ballista, using the same unified query scheduler.&lt;/p&gt;
&lt;h2 id="contributors-welcome"&gt;Contributors Welcome!&lt;a class="headerlink" href="#contributors-welcome" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;If you are excited about being able to use Rust for distributed compute and ETL and would like to contribute to this 
work then there are many ways to get involved. The simplest way to get started is to try out Ballista against your own 
datasets and file bug reports for any issues that you find. You could also check out the current 
&lt;a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20component%20%3D%20%22Rust%20-%20Ballista%22"&gt;list of issues&lt;/a&gt; and have a go at fixing one.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/apache/arrow/blob/master/rust/README.md#arrow-rust-community"&gt;Arrow Rust Community&lt;/a&gt;
section of the Rust README provides information on other ways to interact with the Ballista contributors and 
maintainers.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>DataFusion: A Rust-native Query Engine for Apache Arrow</title><link href="https://datafusion.apache.org/blog/2019/02/04/datafusion-donation" rel="alternate"/><published>2019-02-04T00:00:00+00:00</published><updated>2019-02-04T00:00:00+00:00</updated><author><name>agrove</name></author><id>tag:datafusion.apache.org,2019-02-04:/blog/2019/02/04/datafusion-donation</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are excited to announce that &lt;a href="https://github.com/apache/arrow-datafusion"&gt;DataFusion&lt;/a&gt; has been donated to the Apache Arrow project. DataFusion is an in-memory query engine for the Rust implementation of Apache Arrow.&lt;/p&gt;
&lt;p&gt;Although DataFusion was started two years ago, it was recently re-implemented to be Arrow-native and currently has limited capabilities but does support …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are excited to announce that &lt;a href="https://github.com/apache/arrow-datafusion"&gt;DataFusion&lt;/a&gt; has been donated to the Apache Arrow project. DataFusion is an in-memory query engine for the Rust implementation of Apache Arrow.&lt;/p&gt;
&lt;p&gt;DataFusion was started two years ago but was recently re-implemented to be Arrow-native. It currently has limited capabilities, but it does support SQL queries against iterators of RecordBatch as well as CSV files. There are plans to &lt;a href="https://issues.apache.org/jira/browse/ARROW-4466"&gt;add support for Parquet files&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;SQL support is limited to projection (&lt;code&gt;SELECT&lt;/code&gt;), selection (&lt;code&gt;WHERE&lt;/code&gt;), and simple aggregates (&lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;) with an optional &lt;code&gt;GROUP BY&lt;/code&gt; clause.&lt;/p&gt;
&lt;p&gt;Supported expressions are identifiers, literals, simple math operations (&lt;code&gt;+&lt;/code&gt;, &lt;code&gt;-&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt;, &lt;code&gt;/&lt;/code&gt;), logical operators (&lt;code&gt;AND&lt;/code&gt;, &lt;code&gt;OR&lt;/code&gt;), equality and comparison operators (&lt;code&gt;=&lt;/code&gt;, &lt;code&gt;!=&lt;/code&gt;, &lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;&amp;lt;=&lt;/code&gt;, &lt;code&gt;&amp;gt;=&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;), and &lt;code&gt;CAST(expr AS type)&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="example"&gt;Example&lt;a class="headerlink" href="#example" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The following example demonstrates running a simple aggregate SQL query against a CSV file.&lt;/p&gt;
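&lt;pre&gt;&lt;code class="language-rust"&gt;// imports; the datafusion module paths are assumed from the crate layout at the time of writing
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::Arc;

use arrow::array::{BinaryArray, Float64Array};
use arrow::datatypes::{DataType, Field, Schema};
use datafusion::execution::context::ExecutionContext;
use datafusion::execution::datasource::CsvDataSource;

// create execution context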
let mut ctx = ExecutionContext::new();

// define schema for data source (csv file)
let schema = Arc::new(Schema::new(vec![
    Field::new("c1", DataType::Utf8, false),
    Field::new("c2", DataType::UInt32, false),
    Field::new("c3", DataType::Int8, false),
    Field::new("c4", DataType::Int16, false),
    Field::new("c5", DataType::Int32, false),
    Field::new("c6", DataType::Int64, false),
    Field::new("c7", DataType::UInt8, false),
    Field::new("c8", DataType::UInt16, false),
    Field::new("c9", DataType::UInt32, false),
    Field::new("c10", DataType::UInt64, false),
    Field::new("c11", DataType::Float32, false),
    Field::new("c12", DataType::Float64, false),
    Field::new("c13", DataType::Utf8, false),
]));

// register csv file with the execution context
let csv_datasource =
    CsvDataSource::new("test/data/aggregate_test_100.csv", schema.clone(), 1024);
ctx.register_datasource("aggregate_test_100", Rc::new(RefCell::new(csv_datasource)));

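// keep rows where c11 is between 0.1 and 0.9 (exclusive), then compute MIN and MAX of c12 per c1 value
let sql = "SELECT c1, MIN(c12), MAX(c12) FROM aggregate_test_100 WHERE c11 &amp;gt; 0.1 AND c11 &amp;lt; 0.9 GROUP BY c1";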

// execute the query
let relation = ctx.sql(&amp;amp;sql).unwrap();
let mut results = relation.borrow_mut();

// iterate over the results
while let Some(batch) = results.next().unwrap() {
    println!(
        "RecordBatch has {} rows and {} columns",
        batch.num_rows(),
        batch.num_columns()
    );

    let c1 = batch
        .column(0)
        .as_any()
        .downcast_ref::&amp;lt;BinaryArray&amp;gt;()
        .unwrap();

    let min = batch
        .column(1)
        .as_any()
        .downcast_ref::&amp;lt;Float64Array&amp;gt;()
        .unwrap();

    let max = batch
        .column(2)
        .as_any()
        .downcast_ref::&amp;lt;Float64Array&amp;gt;()
        .unwrap();

    for i in 0..batch.num_rows() {
        let c1_value: String = String::from_utf8(c1.value(i).to_vec()).unwrap();
        println!("{}, Min: {}, Max: {}", c1_value, min.value(i), max.value(i));
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="roadmap"&gt;Roadmap&lt;a class="headerlink" href="#roadmap" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The roadmap for DataFusion will depend on interest from the Rust community, but here are some of the short term items that are planned:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extending test coverage of the existing functionality&lt;/li&gt;
&lt;li&gt;Adding support for Parquet data sources&lt;/li&gt;
&lt;li&gt;Implementing more SQL features such as &lt;code&gt;JOIN&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt; and &lt;code&gt;LIMIT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Implementing a DataFrame API as an alternative to SQL&lt;/li&gt;
&lt;li&gt;Adding support for partitioning and parallel query execution using Rust's async and await functionality&lt;/li&gt;
&lt;li&gt;Creating a Docker image to make it easy to use DataFusion as a standalone query tool for interactive and batch queries&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="contributors-welcome"&gt;Contributors Welcome!&lt;a class="headerlink" href="#contributors-welcome" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;If you are excited about being able to use Rust for data science and would like to contribute to this work then there are many ways to get involved. The simplest way to get started is to try out DataFusion against your own data sources and file bug reports for any issues that you find. You could also check out the current &lt;a href="https://cwiki.apache.org/confluence/display/ARROW/Rust+JIRA+Dashboard"&gt;list of issues&lt;/a&gt; and have a go at fixing one. You can also join the &lt;a href="http://mail-archives.apache.org/mod_mbox/arrow-user/"&gt;user mailing list&lt;/a&gt; to ask questions.&lt;/p&gt;</content><category term="blog"/></entry></feed>