Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

1The University of Hong Kong, 2Shanghai Jiao Tong University, 3Google Cloud AI Research,
4Google DeepMind, 5Salesforce Research, 6Yale University, 7Sea AI Lab, 8University of Waterloo
Email: ruishengcao@gmail.com, tyu@cs.hku.hk
Spider2-V Overview
**Spider2-V** is a multimodal agent benchmark spanning the entire data science and engineering workflow (e.g., the five task examples above). It involves various professional, enterprise-level applications and requires intensive GUI control in addition to code writing, all through real-time, multi-turn interaction with an executable computer environment.

Abstract

Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like *BigQuery*, *dbt*, and *Airbyte*. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this work, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task. Furthermore, we supplement multimodal agents with comprehensive documentation of these enterprise data software systems. Our empirical evaluation reveals that existing state-of-the-art LLM/VLM-based agents do not reliably automate full data workflows (14.0% success). Even with step-by-step guidance, these agents still underperform on tasks that require fine-grained, knowledge-intensive GUI actions (16.2%) or involve remote cloud-hosted workspaces (10.6%). We hope that Spider2-V paves the way for autonomous multimodal agents to transform the automation of data science and engineering workflows.

Spider2-V Framework Infrastructure

Overview of **Spider2-V**, which features:
- 494 real-world tasks across the complete data science and engineering workflow (from data warehousing to orchestration)
- 20 professional, enterprise-level applications (e.g., *BigQuery*, *dbt*, *Airbyte*)
- integration of both command-line interfaces (CLI) and graphical user interfaces (GUI), with an emphasis on intensive GUI controls
- an interactive, executable computer environment (adapted from our previous work [OSWorld](https://os-world.github.io/))
- a document warehouse ([Download](https://drive.usercontent.google.com/download?id=1aGaHXDkBeoUZ9EOIPj7iIRFra_2FjJoZ&export=download&authuser=0&confirm=t)) for agent retrieval

Executable Environment

The interactive environment is a computer desktop running the Ubuntu operating system.
- The action space is either 1) pyautogui code or 2) a customized JSON dict.
- The observation space is either 1) an image-style screenshot or 2) a text-format accessibility tree.
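To make the action space concrete, here is a minimal illustrative sketch of what the two formats could look like. The JSON-dict field names (`action_type`, `x`, `y`) are our own assumptions, not the benchmark's exact schema.

```python
# Illustrative only: the JSON-dict field names below are assumptions,
# not the exact schema used by Spider2-V.

# 1) pyautogui-code action: executable Python that drives mouse and keyboard.
pyautogui_action = """
import pyautogui
pyautogui.click(417, 288)            # click a labeled UI element
pyautogui.typewrite('0 18 * * *')    # type text into the focused field
"""

# 2) customized JSON-dict action: a structured command that the environment
#    translates into the corresponding GUI operation.
json_action = {
    "action_type": "CLICK",  # assumed field name
    "x": 417,
    "y": 288,
}
```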

Task Demonstration

We present one task example (using the application Airbyte, uuid 66936a8e-5cbe-4638-a03a-3ae92eb81e6c) below to showcase:
1. the .json data format (an illustrative sketch appears after this list);
2. the two types of task instructions (abstract and verbose);
3. the environment setup methods;
4. the video recording and action trajectory that complete the task;
5. the task-specific evaluation metric.
**Note that** the action trajectory is generated by the GPT-4o agent, and it successfully completes the task 🎉!
For a quick glance at more task examples, please refer to the [Task Viewer](explorer.html) page.
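For readers who want a sense of the .json data format before opening the Task Viewer, below is a heavily simplified, hypothetical sketch written as a Python dict. The field names, the setup command, and the evaluator name are illustrative assumptions, not the repository's actual schema.

```python
# Hypothetical, simplified task entry; all field names are illustrative only.
task_example = {
    "id": "66936a8e-5cbe-4638-a03a-3ae92eb81e6c",   # task uuid
    "related_apps": ["airbyte"],
    "instruction": {
        "abstract": "Change the replication schedule of the Faker -> local CSV "
                    "connection to 6:00 pm every day.",
        "verbose": "Step-by-step guideline ... (omitted)",
    },
    "config": [
        # environment setup steps executed before the agent starts,
        # e.g., launching the Airbyte services and opening its web UI
        {"type": "launch", "parameters": {"command": "docker compose up -d"}},
    ],
    "evaluator": {
        # task-specific metric, e.g., checking the connection's cron schedule
        "func": "check_airbyte_connection_schedule",   # hypothetical name
        "expected": "0 0 18 * * ?",
    },
}
```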

Task Instruction

I have established a connection from Faker to local .csv file. Could you help me change the running schedule? I hope it can be replicated at 6:00 pm every day.

Video Recording

Action Trajectory

## Action 1
import pyautogui
import time

# index_80 holds the (x, y) screen coordinates of the target UI element
index_80 = (417, 288)
pyautogui.click(index_80)
time.sleep(1)

## Action 2
index_83 = (502, 307)
pyautogui.click(index_83)
time.sleep(1)

## Action 3
index_91 = (883, 404)
pyautogui.click(index_91)
time.sleep(1)

## Action 4
index_102 = (1130, 481)
pyautogui.click(index_102)
time.sleep(1)

## Action 5
index_121 = (1130, 782)
pyautogui.click(index_121)
time.sleep(1)
                    
## Action 6
index_98 = (1130, 430)
pyautogui.click(index_98)
time.sleep(1)

## Action 7
index_105 = (1130, 560)
pyautogui.click(index_105)
time.sleep(1)

## Action 8
index_103 = (1050, 481)
# Clear the current cron expression
pyautogui.click(index_103)
pyautogui.hotkey('ctrl', 'a')
pyautogui.press('backspace')
time.sleep(1)
                    
# Enter the new cron expression
pyautogui.typewrite('0 18 * * *')
time.sleep(1)

## Action 9
index_103 = (1050, 481)
# Clear the current cron expression
pyautogui.click(index_103)
pyautogui.hotkey('ctrl', 'a')
pyautogui.press('backspace')
time.sleep(1)

# Airbyte's cron scheduling uses Quartz-style expressions (with a leading
# seconds field), so enter the schedule again in that format
pyautogui.typewrite('0 0 18 * * ?')
time.sleep(1)

## Action 10
index_134 = (1426, 834)
pyautogui.click(index_134)
time.sleep(1)

## Action 11
DONE
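As a rough illustration of what the task-specific evaluation metric for this example could check, the sketch below compares the connection's configured cron expression against the expected Quartz expression. The helper `get_connection_schedule` is a placeholder of our own; this is not the benchmark's actual evaluation code.

```python
# Hypothetical evaluation sketch; not the benchmark's actual metric code.

def get_connection_schedule(connection_id: str) -> str:
    """Placeholder: in practice this would query the local Airbyte instance
    and return the connection's configured cron expression."""
    raise NotImplementedError

def evaluate_schedule(connection_id: str, expected_cron: str = "0 0 18 * * ?") -> float:
    """Return 1.0 if the connection is scheduled with the expected Quartz cron
    (6:00 pm every day), else 0.0."""
    try:
        actual = get_connection_schedule(connection_id)
    except Exception:
        return 0.0
    return 1.0 if actual.strip() == expected_cron else 0.0
```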
                  

Data Statistics and Comparison

We classify all 494 tasks in Spider2-V into 7 categories and 11 software sub-categories; the main statistics are shown below.

β€œverbose” means a step-by-step guideline on how to complete the task is included in the instruction.

Figure: Key statistics of Spider2-V.

The distribution below is organized by task category and professional application to convey the composition intuitively.

Figure: Distribution of tasks in Spider2-V.
We compare **Spider2-V** with other benchmarks for VLM/LLM-based agents.
**The headers indicate:** the research field (Field), the number of tasks (# Tasks), whether an executable environment is provided (Exec. Env.?), whether enterprise services are utilized (Enter. Serv.?), whether GUI actions are supported (GUI Support?), and other statistics (i.e., the number of involved applications or websites, and the number of execution-based evaluation functions).

| Benchmark | Field | # Tasks | Exec. Env.? | Enter. Serv.? | GUI Support? | # Apps/Sites | # Exec. Eval. Func. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Spider2-V** | Data Science & Engineering | 494 | ✔️ | ✔️ | ✔️ | 20 | 151 |
| Spider1.0 | Text-to-SQL | 1034 | ❌ | ❌ | ❌ | 1 | 0 |
| DS1000 | Data Science | 1000 | ❌ | ❌ | ❌ | 1 | 0 |
| Arcade | Data Science | 1082 | ❌ | ❌ | ❌ | 1 | 0 |
| Intercode | Data Science | 1350 | ✔️ | ❌ | ❌ | 3 | 3 |
| SheetCopilot | Sheet Coding | 221 | ❌ | ❌ | ❌ | 1 | 0 |
| MLAgentBench | Machine Learning | 13 | ✔️ | ❌ | ❌ | 4 | 13 |
| SWEBench | Software Engineering | 2294 | ❌ | ❌ | ❌ | 12 | 1 |
| Mind2Web | Web | 2000 | ❌ | ❌ | ✔️ | 137 | 0 |
| WEBLINX | Web | 2337 | ❌ | ❌ | ✔️ | 155 | 0 |
| GAIA | Web | 466 | ❌ | ❌ | ❌ | n/a | 0 |
| WebArena | Web | 812 | ✔️ | ❌ | ✔️ | 5 | 5 |
| WorkArena | Web | 29 | ✔️ | ✔️ | ✔️ | 1 | 7 |
| OSWorld | Computer Control | 369 | ✔️ | ❌ | ✔️ | 9 | 134 |
| AitW | Android | 30k | ❌ | ❌ | ✔️ | 357 | 0 |
| AndroidWorld | Android | 116 | ✔️ | ❌ | ✔️ | 20 | 6 |

Benchmarking

We experiment with state-of-the-art LLMs and VLMs, including open-source representatives such as Mixtral-8x7B and Llama-3-70B, and closed-source ones such as Qwen-Max, Gemini-Pro-1.5, Claude-3-Opus, and the GPT family (GPT-4o and GPT-4V). The baseline agent adopts three techniques: 1) Set-of-Mark (SoM), 2) execution feedback (EF), and 3) retrieval-augmented generation (RAG). We also split the overall results into different subsets, namely Abstract, Verbose, Account, and Non-account. **We are actively updating the benchmark with new LLMs, VLMs, and methods. Pull requests are welcome!** 👍
**Notice:** t = temperature, top-p = top-p cutoff, len = max context length, a11ytree = accessibility tree
| Rank | Date | Model | Organization | Techniques | Hyper-parameters | Score |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Jun 3, 2024 | GPT-4V (1106) (OpenAI, '23) | OpenAI | SoM + EF + RAG | t=1.0, top-p=0.9, len=128k | 14.0 |
| 2 | Jun 2, 2024 | GPT-4o (0513) (OpenAI, '24) | OpenAI | SoM + EF + RAG | t=1.0, top-p=0.9, len=128k | 13.8 |
| 3 | Jun 5, 2024 | Gemini-Pro-1.5 (Gemini Team, Google, '24) | Google | SoM + EF + RAG | t=1.0, top-p=0.9, len=128k | 9.1 |
| 4 | Jun 6, 2024 | Claude-3-Opus (Anthropic, '24) | AnthropicAI | SoM + EF + RAG | t=1.0, top-p=0.9, len=200k | 8.1 |
| 5 | Jun 6, 2024 | Llama-3-70B (Meta Llama, Meta, '24) | Meta | a11ytree + EF + RAG | t=1.0, top-p=0.9, len=32k | 2.0 |
| 6 | Jun 6, 2024 | Mixtral-8x7B (Jiang et al., '24) | MistralAI | a11ytree + EF + RAG | t=1.0, top-p=0.9, len=32k | 0.8 |
| 7 | Jun 6, 2024 | Qwen-Max (Qwen Team, '24) | Qwen | a11ytree + EF + RAG | t=1.0, top-p=0.9, len=32k | 0.6 |
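For concreteness, here is a minimal sketch of how a baseline agent loop combining the three techniques above (SoM, EF, RAG) could be wired together. Every helper name (`observe_with_som`, `retrieve`, `predict`, `execute`) and the step budget are placeholders of our own, not Spider2-V's actual agent implementation.

```python
# Hypothetical baseline agent loop (SoM + execution feedback + RAG);
# all helper names and the step budget are placeholders.

MAX_STEPS = 15  # assumed step budget

def run_agent(task_instruction: str, env, vlm, retriever) -> bool:
    history = []
    for step in range(MAX_STEPS):
        # Observation: screenshot annotated with Set-of-Mark labels (plus a11y tree).
        screenshot, som_labels = env.observe_with_som()
        # RAG: retrieve relevant snippets from the document warehouse.
        docs = retriever.retrieve(task_instruction, top_k=3)
        # The VLM proposes the next action: pyautogui code or a special token.
        action = vlm.predict(task_instruction, screenshot, som_labels, docs, history)
        if action.strip() == "DONE":
            return True
        if action.strip() == "FAIL":
            return False
        # Execute the action and keep the execution feedback for the next turn.
        feedback = env.execute(action)
        history.append((action, feedback))
    return False
```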

Analysis

We delve into the different factors (e.g., action space, observation space, the techniques above, and two hyper-parameters) that influence the eventual success rate. The baseline agent is built on GPT-4o.

Acknowledgement

We thank Yiheng Xu, Hongjin Su, Xiaochuan Li, and Toh Jing Hua for their helpful assistance and feedback on this work.

FAQ

Where can I download the resources?

The GitHub repository, virtual machine snapshots, and crawled documents can be downloaded from:
- GitHub repository: Spider2-V (including the environment and task examples)
- VM snapshots: ubuntu-arm.zip or ubuntu-x86.zip
- Crawled documents: docs.zip

What is the username and password for the virtual machines?

The username and password for the virtual machines are as follows:
- Username: user
- Password: password

How do I tackle task examples that require accounts?

See Account Guideline.

How can I configure a proxy for the VM if I'm behind a GFW?

See Proxy Guideline.

I still have problems when using Spider2-V. Where can I find support?

You can open an issue on the GitHub repository or email ruishengcao@gmail.com or tyu@cs.hku.hk.

BibTeX

@article{2024-spider2v,
    title={Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?}, 
    author={Ruisheng Cao and Fangyu Lei and Haoyuan Wu and Jixuan Chen and Yeqiao Fu and Hongcheng Gao and Xinzhuang Xiong and Hanchong Zhang and Yuchen Mao and Wenjing Hu and Tianbao Xie and Hongshen Xu and Danyang Zhang and Sida Wang and Ruoxi Sun and Pengcheng Yin and Caiming Xiong and Ansong Ni and Qian Liu and Victor Zhong and Lu Chen and Kai Yu and Tao Yu},
    year={2024},
    journal={CoRR},
    volume={abs/2407.10956},
    eprint={2407.10956},
    eprinttype={arXiv},
    url={https://arxiv.org/abs/2407.10956}
}