Recent advances in integrating large language models (LLMs) with application programming interfaces (APIs) have attracted significant interest in both academia and industry. By leveraging the strong autonomy and planning capabilities of LLMs, these API-based agents can efficiently solve problems that require multi-step actions.
However, how well they handle multi-dimensional difficulty levels, diverse task types, and real-world demands through APIs remains unknown. In this paper, we introduce ShortcutsBench, a large-scale benchmark for comprehensively evaluating API-based agents on tasks of varying difficulty, diverse task types, and real-world demands.
ShortcutsBench includes a wealth of real APIs from Apple Inc.'s operating systems, refined user queries from shortcuts, high-quality action sequences annotated by shortcut developers, and accurate parameter-filling values covering primitive parameter types, enum parameter types, outputs of previous actions, and parameters that require requesting necessary information from the system or the user. Our extensive evaluation of agents built with 5 leading open-source LLMs (size ≥ 57B) and 4 closed-source LLMs (e.g., Gemini-1.5-Pro and GPT-3.5) reveals significant limitations in handling complex queries related to API selection, parameter filling, and requesting necessary information from systems and users. These findings highlight the challenges API-based agents face in effectively fulfilling real and complex user queries. All datasets, code, and experimental results will be available at https://github.com/eachsheep/shortcutsbench.
To our knowledge, ShortcutsBench is the first large-scale agent benchmark built on real APIs that jointly considers APIs, queries, and the corresponding action sequences. It provides a rich set of real APIs, queries of varying difficulty and task types drawn from real user needs, and high-quality action sequences annotated by shortcut developers. It also offers precise parameter-filling values, covering primitive types, enum types, and the use of outputs from previous actions as parameter values, and it evaluates an agent's awareness of when to request necessary information from the system or the user. Moreover, the scale of APIs, queries, and action sequences in ShortcutsBench rivals or even surpasses benchmarks and datasets created by LLMs or adapted from existing datasets. A comprehensive comparison between ShortcutsBench and existing benchmarks/datasets is shown in the table below.
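To make the annotation format concrete, here is a hypothetical sketch of what a single query with its action sequence and parameter-filling values might look like. The field names and structure are our own illustration, not the benchmark's actual schema; the `is.workflow.actions.*` identifiers follow Apple's Shortcuts action-naming convention.

```python
# Hypothetical illustration of one benchmark sample; field names are
# assumptions for exposition, not ShortcutsBench's actual schema.
sample = {
    "query": "Zip the photos I took today and email them to a contact.",
    "action_sequence": [
        {
            # API selection: pick the right action from the real API set
            "api": "is.workflow.actions.filter.photos",
            "parameters": {
                "date": "Today",               # primitive parameter type
                "sort": "Latest First",        # enum parameter type
            },
        },
        {
            "api": "is.workflow.actions.makezip",
            "parameters": {
                "input": {"from_action": 0},   # output of a previous action
            },
        },
        {
            "api": "is.workflow.actions.sendemail",
            "parameters": {
                "recipient": {"ask_user": True},  # request info from the user
                "attachment": {"from_action": 1},
            },
        },
    ],
}
```

The four parameter-filling cases the benchmark distinguishes (primitive values, enum values, outputs of previous actions, and information requested from the system or user) each appear once in this sketch.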
Referencing existing work, we selected and tested the 9 most advanced LLMs...
Following existing work, we slightly modified the ReAct prompt templates...
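As a rough illustration of the ReAct-style prompting this involves, below is a minimal sketch of the Thought/Action/Observation loop. The `call_llm` and `execute_action` callables are hypothetical stand-ins, and the template wording is our own, not the benchmark's actual prompt.

```python
# A minimal sketch of a ReAct-style agent loop; `call_llm` and
# `execute_action` are hypothetical callables supplied by the caller.
REACT_TEMPLATE = """Answer the user's query by composing the available APIs.

Available APIs:
{api_docs}

Use the following format:
Thought: reason about what to do next
Action: the API call to make, as JSON
Observation: the result of the action
... (Thought/Action/Observation can repeat)
Final Answer: the completed action sequence

Query: {query}
{scratchpad}"""


def run_agent(query, api_docs, call_llm, execute_action, max_steps=10):
    """Drive the Thought -> Action -> Observation loop until a final answer."""
    scratchpad = ""
    for _ in range(max_steps):
        prompt = REACT_TEMPLATE.format(
            api_docs=api_docs, query=query, scratchpad=scratchpad
        )
        # Stop generation before the model hallucinates its own observation.
        step = call_llm(prompt, stop=["Observation:"])
        scratchpad += step
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        observation = execute_action(step)  # parse and run the proposed API call
        scratchpad += f"\nObservation: {observation}\n"
    return None  # no final answer within max_steps
```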
Key Findings:

For most models, performance degrades sharply as query difficulty increases: for example, Qwen2-57B drops from 37.24% on (1, 5] to 8.11% on (15, 30], while Gemini-1.5-Pro and GLM-4-Air remain comparatively stable across difficulty levels. The table below breaks down results by difficulty level.
| Levels | Gemini-1.5-Pro | Qwen2-72B | DeepSeek-V2-Chat | DeepSeek-Coder-V2 | LLaMA-3-70B | Gemini-1.5-Flash | Qwen2-57B | GPT-3.5-Turbo | GLM-4-Air |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (0, 1] | 33.33% | 37.78% | 64.29% | 62.71% | 47.62% | 62.79% | 22.22% | 28.89% | 47.62% |
| (1, 5] | 45.95% | 50.40% | 55.50% | 60.08% | 44.08% | 53.99% | 37.24% | 37.70% | 48.06% |
| (5, 15] | 51.85% | 36.42% | 40.76% | 49.44% | 35.71% | 40.65% | 28.37% | 20.33% | 48.42% |
| (15, 30] | 46.67% | 25.00% | 27.59% | 43.14% | 22.22% | 44.64% | 8.11% | 17.14% | 48.89% |
| Overall | 46.59% | 41.97% | 47.90% | 55.18% | 49.89% | 40.71% | 30.74% | 30.55% | 48.28% |
@misc{shen2024shortcutsbenchlargescalerealworldbenchmark,
  title={ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents},
  author={Haiyang Shen and Yue Li and Desong Meng and Dongqi Cai and Sheng Qi and Li Zhang and Mengwei Xu and Yun Ma},
  year={2024},
  eprint={2407.00132},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2407.00132},
}