Cybergod AGI research has been developing a VimGolf AI benchmark for some time. The benchmark evaluates AI-generated VimGolf solutions on a fixed set of VimGolf problems, scoring each run by execution time and accuracy.

The VimGolf public challenge dataset has 612 challenges in total.

The benchmark is defined as follows:

  • The AI is given the VimGolf input file, the output file, and the problem description.
  • The AI must produce a VimGolf solution (a keystroke sequence) that solves the problem, either by reasoning directly or by interacting with the VimGolf environment in a terminal.
  • The solution's keystroke count must be shorter than the output length, to rule out naive solutions that simply retype the output; a minimal validity check is sketched after this list.
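
A minimal sketch of that validity check in Python, assuming the solution is a plain keystroke string; `apply_keys` is a hypothetical callable standing in for whatever mechanism the harness actually uses to replay keystrokes in Vim:

```python
from typing import Callable


def is_valid_solution(keys: str,
                      input_text: str,
                      output_text: str,
                      apply_keys: Callable[[str, str], str]) -> bool:
    """Check a candidate VimGolf solution against the benchmark rules.

    `apply_keys(input_text, keys)` is assumed to replay the keystroke
    sequence in Vim and return the resulting buffer contents; the real
    harness may instead drive Vim interactively in a terminal.
    """
    # The keystroke sequence must be shorter than the target output,
    # otherwise a model could trivially retype the output verbatim.
    if len(keys) >= len(output_text):
        return False
    # Replaying the keys on the input must reproduce the output exactly.
    return apply_keys(input_text, keys) == output_text
```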

Cybergod AGI research has run the benchmark under different benchmark frameworks and AI models. The results so far are as follows:

| Benchmark Name | Model Name | Average Execution Time (minutes/challenge) | Accuracy (%) | Resolved | Unresolved |
|---|---|---|---|---|---|
| VimGolf Terminal Bench Adaptor | ollama/gpt-oss:120b | 15 | 8.82 | 54 | 558 |
| VimGolf Single Shot | ollama/gpt-oss:120b | 10 | 18.62 | 114 | 498 |
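
For reference, the accuracy column appears to follow from the resolved count divided by the 612-challenge total; a quick check under that assumption:

```python
TOTAL_CHALLENGES = 612  # size of the VimGolf public challenge dataset

resolved_counts = {
    "VimGolf Terminal Bench Adaptor": 54,
    "VimGolf Single Shot": 114,
}

for benchmark, resolved in resolved_counts.items():
    accuracy = 100 * resolved / TOTAL_CHALLENGES
    print(f"{benchmark}: {resolved}/{TOTAL_CHALLENGES} resolved ≈ {accuracy:.2f}%")
```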

Both configurations resolve fewer than one in five challenges, showing that the VimGolf AI benchmark remains a challenging task for current AI models.

In the future, we may add new challenges and raise the benchmark's difficulty (for example, by tightening the execution-time or solution-length limits) to keep it challenging for AI models.


Code for reproducing the benchmark results: