Cybergod AGI research has been developing a VimGolf AI benchmark for some time. The benchmark evaluates AI-generated VimGolf solutions on a fixed set of VimGolf problems, scoring each run by execution time and accuracy.

The VimGolf public challenge dataset has 612 challenges in total.

The benchmark is defined as follows:

  • The AI is given the VimGolf input file, the output file, and the problem description.
  • The AI must produce a VimGolf solution (a keystroke sequence) that solves the problem, either by reasoning directly or by interacting with the VimGolf environment in a terminal.
  • The solution's keystroke count must be shorter than the output length, to rule out naive solutions that simply retype the output; a minimal validity check is sketched after this list.
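
A minimal sketch of that validity check in Python, assuming the solution is a plain keystroke string; `apply_keys` is a hypothetical callable standing in for whatever mechanism the harness actually uses to replay keystrokes in Vim:

```python
from typing import Callable


def is_valid_solution(keys: str,
                      input_text: str,
                      output_text: str,
                      apply_keys: Callable[[str, str], str]) -> bool:
    """Check a candidate VimGolf solution against the benchmark rules.

    `apply_keys(input_text, keys)` is assumed to replay the keystroke
    sequence in Vim and return the resulting buffer contents; the real
    harness may instead drive Vim interactively in a terminal.
    """
    # The keystroke sequence must be shorter than the target output,
    # otherwise a model could trivially retype the output verbatim.
    if len(keys) >= len(output_text):
        return False
    # Replaying the keys on the input must reproduce the output exactly.
    return apply_keys(input_text, keys) == output_text
```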

Cybergod AGI research has run the benchmark under different benchmark frameworks and AI models. The results so far are as follows:

| Benchmark Name | Model Name | Average Execution Time (minutes/challenge) | Accuracy (%) | Resolved | Unresolved |
|---|---|---|---|---|---|
| VimGolf Terminal Bench Adaptor | ollama/gpt-oss:120b | 15 | 8.82 | 54 | 558 |
| VimGolf Single Shot | ollama/gpt-oss:120b | 10 | 18.62 | 114 | 498 |
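
For reference, the accuracy column appears to follow from the resolved count divided by the 612-challenge total; a quick check under that assumption:

```python
TOTAL_CHALLENGES = 612  # size of the VimGolf public challenge dataset

resolved_counts = {
    "VimGolf Terminal Bench Adaptor": 54,
    "VimGolf Single Shot": 114,
}

for benchmark, resolved in resolved_counts.items():
    accuracy = 100 * resolved / TOTAL_CHALLENGES
    print(f"{benchmark}: {resolved}/{TOTAL_CHALLENGES} resolved ≈ {accuracy:.2f}%")
```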

Both configurations resolve fewer than one in five challenges, showing that the VimGolf AI benchmark remains a challenging task for current AI models.

In the future, we may add new challenges and raise the benchmark's difficulty (for example, by tightening the execution-time or solution-length limits) to keep it challenging for AI models.


Code for reproducing the benchmark results: