ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents

Haiyang Shen1,2, Yue Li3, Desong Meng4, Dongqi Cai5, Sheng Qi2, Li Zhang5, Mengwei Xu5, Yun Ma1*
1Institute for Artificial Intelligence, Peking University, 2School of Computer Science, Peking University, 3School of Software & Microelectronics, Peking University, 4School of Electronics Engineering and Computer Science, Peking University, 5Beijing University of Posts and Telecommunications

Abstract

Recent advancements in integrating large language models (LLMs) with application programming interfaces (APIs) have attracted significant interest in both academia and industry. These API-based agents, leveraging the strong autonomy and planning capabilities of LLMs, can efficiently solve problems that require multi-step actions.

However, their ability to handle tasks spanning multiple difficulty levels, diverse task types, and real-world demands through APIs remains unknown. In this paper, we introduce ShortcutsBench, a large-scale benchmark for the comprehensive evaluation of API-based agents in solving tasks with varying levels of difficulty, diverse task types, and real-world demands.

ShortcutsBench includes a wealth of real APIs from Apple Inc.'s operating systems, refined user queries from shortcuts, human-annotated high-quality action sequences from shortcut developers, and accurate parameter-filling values covering primitive parameter types, enum parameter types, outputs from previous actions, and parameters that require requesting necessary information from the system or user. Our extensive evaluation of agents built with 5 leading open-source (size >= 57B) and 4 closed-source LLMs (e.g., Gemini-1.5-Pro and GPT-3.5) reveals significant limitations in handling complex queries related to API selection, parameter filling, and requesting necessary information from systems and users. These findings highlight the challenges that API-based agents face in effectively fulfilling real and complex user queries. All datasets, code, and experimental results will be available at https://github.com/eachsheep/shortcutsbench.

Advantages

To our knowledge, ShortcutsBench is the first large-scale agent benchmark built on real APIs that jointly considers APIs, queries, and the corresponding action sequences. ShortcutsBench provides a rich set of real APIs, queries of varying difficulty and task types drawn from real user needs, and high-quality action sequences annotated by shortcut developers. Additionally, it offers precise parameter-value filling, covering primitive data types, enumeration types, and the use of outputs from previous actions as parameter values, and it evaluates the agent's awareness of when to request necessary information from the system or users. Moreover, the scale of APIs, queries, and corresponding action sequences in ShortcutsBench rivals or even surpasses that of benchmarks and datasets created by LLMs or adapted from existing datasets. A comprehensive comparison between ShortcutsBench and existing benchmarks/datasets is shown in the table below.

Table: Advantages of ShortcutsBench compared with existing benchmarks and datasets.
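To make the annotation concrete, the Python sketch below illustrates the kind of information ShortcutsBench associates with a single task: the user query, the ground-truth action (API) sequence, and parameter values that are primitives or enums, outputs of earlier actions, or values that must be requested from the system or user. All field names and action identifiers here are hypothetical, chosen only for illustration; they do not reflect the released dataset schema.

# Illustrative only: hypothetical field names and action identifiers that
# mirror the categories of information described above, not the real schema.
example_task = {
    "query": "Every morning, get today's weather and text it to my partner.",
    "action_sequence": [
        {
            "api": "is.workflow.actions.weather.currentconditions",  # hypothetical identifier
            "parameters": {},
        },
        {
            "api": "is.workflow.actions.sendmessage",  # hypothetical identifier
            "parameters": {
                # Parameter filled with the output of a previous action.
                "WFSendMessageContent": {"source": "previous_output", "from_step": 0},
                # Parameter whose value must be requested from the user.
                "WFSendMessageActionRecipients": {"source": "ask_user"},
            },
        },
    ],
}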

Evaluation Setup

Model

Referencing existing work, we selected and tested 9 of the most advanced LLMs...

Prompt Template

Following existing work, we slightly modified the ReAct templates...
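As an illustration only, a ReAct-style template has roughly the following shape. The wording below is hypothetical; the exact templates used in ShortcutsBench are released in the repository linked above.

# Hypothetical ReAct-style template for illustration; not the actual
# ShortcutsBench prompt wording.
REACT_TEMPLATE = """You are an agent that completes the user's request by calling APIs.

Available APIs:
{api_descriptions}

Use the following format:
Thought: reason about what to do next
Action: the API to call, with its parameter values as JSON
Observation: the result returned by the API
... (Thought/Action/Observation can repeat)
Final Answer: report that the task is complete

User request: {query}
"""

def build_prompt(api_descriptions: str, query: str) -> str:
    """Fill the template with the candidate API descriptions and the user query."""
    return REACT_TEMPLATE.format(api_descriptions=api_descriptions, query=query)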

Result Analysis: API Selection

The API selection accuracy on queries with different complexity levels.
The API selection accuracy difference of each LLM across 8 task types.
The API selection accuracy of each task type on 9 API-based agents.

Key Findings:

  • Open-source LLMs perform comparably to closed-source models on lower-difficulty tasks...
  • LLM-based agents perform poorly on tasks requiring multi-step reasoning...
  • Significant performance variations across different types of tasks...
  • Better performance on daily life tasks...
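For intuition, here is a minimal sketch of how API selection could be scored, assuming per-step exact matching of the predicted API identifiers against the annotated action sequence; the scoring actually used in the paper may differ (see the released evaluation code for the real metric).

def api_selection_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of ground-truth steps whose API identifier is matched in order.

    Simplified per-step exact-match metric for illustration only; the
    official ShortcutsBench scoring of API selection may differ.
    """
    if not gold:
        return 1.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)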

Result Analysis: API Parameter Value Filling

Accuracy of primitive data types and enum data types (upper) and outputs from previous actions (lower).
The error rates for action parameter value filling.

Key Findings:

  • Increased task difficulty has a much smaller impact on the accuracy of parameter filling for the most intelligent LLMs like Gemini-1.5-Pro...
  • API parameter filling remains a bottleneck for cost-effective LLMs like GPT-3.5-turbo and ChatGLM-4-Air...
  • Extracting relevant parameters from the user's query...
  • Errors for cost-effective LLMs...
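As a rough illustration of what parameter-value filling is checked against, the sketch below distinguishes the kinds of values discussed above: primitive literals, enum members, outputs of previous actions, and values that must be requested. Field names are hypothetical, and this is not the official scorer.

from typing import Any

def parameter_value_correct(pred: Any, gold: dict) -> bool:
    """Illustrative check for a single parameter slot (not the official scorer).

    `gold["kind"]` marks which kind of value the annotation expects:
    "primitive", "enum", "previous_output", or "ask". All field names here
    are hypothetical and only mirror the categories described above.
    """
    kind = gold["kind"]
    if kind in ("primitive", "enum"):
        # Literal and enum values must match the annotation (a case-insensitive
        # string comparison is used here for simplicity).
        return str(pred).strip().lower() == str(gold["value"]).strip().lower()
    if kind == "previous_output":
        # The agent must reference the output of the correct earlier action;
        # `pred` is assumed to be a dict describing that reference.
        return isinstance(pred, dict) and pred.get("from_step") == gold["from_step"]
    if kind == "ask":
        # The agent must explicitly request the value from the system or user
        # rather than inventing one.
        return isinstance(pred, dict) and pred.get("source") == "ask"
    return False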

Result Analysis: Recognition of Need for Input

Key Findings:

  • All agents perform poorly at recognizing necessary system and user inputs. Overall accuracy ranges between 30.55% (GPT-3.5-turbo) and 55.18% (DeepSeek-2-Coder). Larger LLMs like DeepSeek-2-Chat (236B) demonstrate better recognition accuracy.
Accuracy of recognition of the need for input from the system or the user.
Levels    Gemini-1.5-Pro  QWen-2-72B  DeepSeek-2-Chat  DeepSeek-2-Coder  LLaMA-3-70B  Gemini-1.5-Flash  QWen-2-57B  GPT-3.5-turbo  ChatGLM-4-Air
(0, 1]    33.33%          37.78%      64.29%           62.71%            47.62%       62.79%            22.22%      28.89%         47.62%
(1, 5]    45.95%          50.40%      55.50%           60.08%            44.08%       53.99%            37.24%      37.70%         48.06%
(5, 15]   51.85%          36.42%      40.76%           49.44%            35.71%       40.65%            28.37%      20.33%         48.42%
(15, 30]  46.67%          25.00%      27.59%           43.14%            22.22%       44.64%            8.11%       17.14%         48.89%
Overall   46.59%          41.97%      47.90%           55.18%            49.89%       40.71%            30.74%      30.55%         48.28%
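For intuition about the numbers above, here is a minimal sketch of how recognition of the need for input could be measured: of the parameter slots annotated as requiring input from the system or user, the fraction for which the agent actually issued a request rather than guessing a value. Field names are hypothetical; the released evaluation code defines the actual metric.

def ask_recognition_accuracy(predicted_slots: list[dict], gold_ask_slots: list[dict]) -> float:
    """Illustrative metric (not the official one): of the parameter slots
    annotated as requiring input from the system or user, what fraction did
    the agent actually request (e.g., via an explicit ask action)?
    All field names are hypothetical.
    """
    if not gold_ask_slots:
        return 1.0
    asked = {
        (slot["step"], slot["parameter"])
        for slot in predicted_slots
        if slot.get("source") == "ask"
    }
    hits = sum(1 for slot in gold_ask_slots if (slot["step"], slot["parameter"]) in asked)
    return hits / len(gold_ask_slots)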

BibTeX

@misc{shen2024shortcutsbenchlargescalerealworldbenchmark,
  title={ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents},
  author={Haiyang Shen and Yue Li and Desong Meng and Dongqi Cai and Sheng Qi and Li Zhang and Mengwei Xu and Yun Ma},
  year={2024},
  eprint={2407.00132},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2407.00132},
}