GSoC Project Ideas
Introduction
Welcome to the Apache DataFusion Google Summer of Code (GSoC) 2025 project ideas list. Below you can find information about the projects. Please refer to this page for application guidelines.
Projects
Implement Continuous Monitoring of DataFusion Performance
Description and Outcomes: DataFusion lacks continuous monitoring of how its performance evolves over time; today we track this somewhat manually. Even though performance has been one of our top priorities for a while, we have not yet built a continuous monitoring system. This linked issue contains a summary of the previous efforts that brought us closer to having such a system, but a functioning system still needs to be built on top of that progress. A student who successfully completes this project will gain experience in building an end-to-end monitoring system that integrates with GitHub, scheduling and running benchmarks on cloud infrastructure, and building a versatile web UI to expose the results. The outcome of this project will benefit Apache DataFusion on an ongoing basis in its quest for ever-better performance.
Category: Tooling
Difficulty: Medium
Possible Mentor(s) and/or Helper(s): alamb and mertak-synnada
Skills: DevOps, Cloud Computing, Web Development, Integrations
Expected Project Size: 175 to 350 hours*
Improving DataFusion DX (e.g. 1 and 2)
Description and Outcomes: While performance, extensibility and customizability are DataFusion’s strengths, we have much work to do in terms of user-friendliness and ease of debugging. This project aims to make strides in these areas by improving terminal visualizations of query plans and increasing the adoption of the newly added diagnostics framework. This is a potentially high-impact project with highly visible output, and it will lower the barrier to entry for new users.
Category: DX
Difficulty: Medium
Possible Mentor(s) and/or Helper(s): eliaperantoni and mkarbo
Skills: Software Engineering, Terminal Visualizations
Expected Project Size: 175 to 350 hours*
Robust WASM Support
Description and Outcomes: DataFusion can be compiled today to WASM with some care. However, it is somewhat tricky and brittle. Having robust WASM support improves the embeddability aspect of DataFusion, and can enable many practical use cases. A good conclusion of this project would be the addition of a live demo sub-page to the DataFusion homepage.
Category: Build
Difficulty: Medium
Skills: WASM, Advanced Rust, Web Development, Software Engineering
Expected Project Size: 175 to 350 hours*
High Performance Aggregations
Description and Outcomes: Aggregation is one of the most fundamental operations within a query engine. Practical performance in many use cases, and results in many well-known benchmarks (e.g. ClickBench), depend heavily on aggregation performance. The DataFusion community has been working on improving aggregation performance for a while now, but there is still work to do. A student working on this project will get the chance to hone their skills in high-performance, low(ish)-level coding, the intricacies of measuring performance, data structures, and more.
Category: Core
Difficulty: Advanced
Possible Mentor(s) and/or Helper(s): jayzhan-synnada and Rachelint
Skills: Algorithms, Data Structures, Advanced Rust, Databases, Benchmarking Techniques
Expected Project Size: 350 hours
Improving Python Bindings
Description and Outcomes: DataFusion offers Python bindings that enable users to build data systems using Python. However, the Python bindings are still relatively low-level, and do not expose all of the APIs that end-user-focused libraries like Pandas and Polars offer. This project aims to bring DataFusion’s Python bindings closer to such libraries in terms of built-in APIs and functionality.
Category: Python Bindings
Difficulty: Medium
Possible Mentor(s) and/or Helper(s): timsaucer
Skills: APIs, FFIs, DataFrame Libraries
Expected Project Size: 175 to 350 hours*
Optimizing DataFusion Binary Size
Description and Outcomes: DataFusion is a foundational library with a large feature set. Even though we try to avoid adding too many dependencies and implement many low-level functionalities inside the codebase, the fast-moving nature of the project results in an accumulation of dependencies over time. This inflates DataFusion’s binary size, which reduces portability and embeddability. This project involves studying the codebase with compiler tooling to understand where code bloat comes from, reducing the number of dependencies through efficient in-house implementations, and avoiding code duplication.
Category: Core/Build
Difficulty: Medium
Skills: Software Engineering, Refactoring, Dependency Management, Compilers
Expected Project Size: 175 to 350 hours*
Ergonomic SQL Features
Description and Outcomes: DuckDB has many innovative features that significantly improve the SQL UX. Even though some of those features are already implemented in DataFusion, there are many others we can implement (or draw inspiration from). This page contains a good summary of such features. Each such feature will serve as a bite-sized, achievable milestone for a cool GSoC project with user-facing impact that improves the UX on a broad basis. The project will start with a survey of what is already implemented and what is missing, and will kick off with a prioritization proposal and implementation plan.
Category: SQL FE
Difficulty: Medium
Possible Mentor(s) and/or Helper(s): berkaysynnada
Skills: SQL, Planning, Parsing, Software Engineering
Expected Project Size: 350 hours
Advanced Interval Analysis
Description and Outcomes: DataFusion implements interval arithmetic and utilizes it for range estimations, which enables use cases in data pruning, optimizations and statistics. However, the current implementation only works efficiently for forward evaluation; i.e. calculating the output range of an expression given input ranges (ranges of columns). When propagating constraints using the same graph, the current approach requires multiple bottom-up and top-down traversals to narrow column bounds fully. This project aims to fix this deficiency by utilizing a better algorithmic approach. Note that this is a very advanced project for students with a deep interest in computational methods, expression graphs, and constraint solvers.
Category: Core
Difficulty: Advanced
Possible Mentor(s) and/or Helper(s): ozankabak and berkaysynnada
Skills: Algorithms, Data Structures, Applied Mathematics, Software Engineering
Expected Project Size: 350 hours
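To make the distinction in the Advanced Interval Analysis description concrete, the following is a minimal Python sketch of forward evaluation and of a single bottom-up/top-down propagation step for one `x + y` node. It is not DataFusion's implementation; the `Interval` class and the `propagate_sum` helper are hypothetical names introduced only for illustration, and the sketch assumes closed intervals over real numbers.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi]; a hypothetical stand-in for a column range."""
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        # Forward evaluation of a sum: add the matching endpoints.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other: "Interval") -> "Interval":
        # Forward evaluation of a difference: lowest minus highest, and vice versa.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def intersect(self, other: "Interval") -> "Interval":
        # Keep only the values allowed by both intervals.
        lo, hi = max(self.lo, other.lo), min(self.hi, other.hi)
        if lo > hi:
            raise ValueError("empty intersection: the constraint is infeasible")
        return Interval(lo, hi)


def propagate_sum(x: Interval, y: Interval, target: Interval) -> Tuple[Interval, Interval]:
    """Narrow x and y given that x + y must lie within target."""
    total = (x + y).intersect(target)   # bottom-up pass, then apply the constraint
    new_x = x.intersect(total - y)      # top-down pass: x = total - y
    new_y = y.intersect(total - new_x)  # top-down pass: y = total - x
    return new_x, new_y


x = Interval(0.0, 10.0)
y = Interval(5.0, 20.0)

# Forward evaluation: output range of x + y given the input ranges.
forward = x + y

# Constraint propagation: narrow the column bounds given 0 <= x + y <= 8.
narrowed_x, narrowed_y = propagate_sum(x, y, Interval(0.0, 8.0))
print(forward, narrowed_x, narrowed_y)
```

With a single node, one pass in each direction is enough. In a larger expression graph where columns appear in several places, narrowing one node can enable further narrowing elsewhere, which is why a naive approach may need repeated traversals.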
Spark-Compatible Functions Crate
Description and Outcomes: In general, DataFusion aims to be compatible with PostgreSQL in terms of functions and behaviors. However, there are many users (and downstream projects, such as DataFusion Comet) that desire compatibility with Apache Spark. This project aims to collect Spark-compatible functions into a separate crate to help such users and/or projects. The project will be an exercise in creating the right APIs, explaining how to use them, and then telling the world about them (e.g. via creating a compatibility-tracking page cataloging such functions, writing blog posts etc.).
Category: Extensions
Difficulty: Medium
Skills: SQL, Spark, Software Engineering
Expected Project Size: 175 to 350 hours*
SQL Fuzzing Framework in Rust
Description and Outcomes: Fuzz testing is a very important technique that we use often in DataFusion. Having SQL-level fuzz testing enables us to battle-test DataFusion in an end-to-end fashion. The initial version of our fuzzing framework is Java-based, but the time has come to migrate to a Rust-native solution. This will simplify the overall implementation (by avoiding things like JDBC), enable us to implement more advanced algorithms for query generation, and attract more contributors over time. This project is a good blend of software engineering, algorithms and testing techniques (i.e. fuzzing techniques).
Category: Extensions
Difficulty: Advanced
Possible Mentor(s) and/or Helper(s): 2010YOUY01
Skills: SQL, Testing Techniques, Advanced Rust, Software Engineering
Expected Project Size: 175 to 350 hours*
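As a rough illustration of the query generation mentioned in the SQL fuzzing project, here is a minimal Python sketch of a seeded random SQL generator. It is written in Python only for brevity and does not reflect the existing Java-based framework or any planned Rust design; the table name, column names, and both helper functions are hypothetical.

```python
import random

# Hypothetical schema, used only for illustration.
TABLE = "t"
COLUMNS = ["a", "b", "c"]
COMPARISONS = ["<", "<=", "=", ">=", ">"]
AGGREGATES = ["COUNT", "MIN", "MAX", "SUM"]


def random_predicate(rng: random.Random, depth: int = 0) -> str:
    """Build a random boolean expression, nesting at most two levels deep."""
    if depth < 2 and rng.random() < 0.3:
        op = rng.choice(["AND", "OR"])
        left = random_predicate(rng, depth + 1)
        right = random_predicate(rng, depth + 1)
        return f"({left} {op} {right})"
    column = rng.choice(COLUMNS)
    return f"{column} {rng.choice(COMPARISONS)} {rng.randint(-100, 100)}"


def random_query(rng: random.Random) -> str:
    """Generate one random SELECT statement, sometimes with a GROUP BY."""
    where = random_predicate(rng)
    if rng.random() < 0.5:
        key = rng.choice(COLUMNS)
        agg = f"{rng.choice(AGGREGATES)}({rng.choice(COLUMNS)})"
        return f"SELECT {key}, {agg} FROM {TABLE} WHERE {where} GROUP BY {key}"
    cols = rng.sample(COLUMNS, rng.randint(1, len(COLUMNS)))
    return f"SELECT {', '.join(cols)} FROM {TABLE} WHERE {where}"


# A fixed seed makes every generated query reproducible when a failure is found.
rng = random.Random(42)
queries = [random_query(rng) for _ in range(5)]
for q in queries:
    print(q)
```

A real framework would also execute each generated query against the engine and check the outcome, for example by comparing results with a reference system; that part is omitted here.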
*There is enough material to make this a 350-hour project, but it is granular enough to make it a 175-hour project as well.
Contact Us
You can join our mailing list and Discord to introduce yourself and ask questions.