Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

1The University of Hong Kong, 2Shanghai Jiao Tong University, 3Google Cloud AI Research,
4Google DeepMind, 5Salesforce Research, 6Yale University, 7Sea AI Lab, 8University of Waterloo
Email: ruishengcao@gmail.com, tyu@cs.hku.hk
Spider2-V Overview
**Spider2-V** is a multimodal agent benchmark spanning the entire data science and engineering workflow (e.g., the five task examples above). It involves various professional, enterprise-level applications and requires intensive GUI control in addition to code writing, all through real-time, multi-turn interaction with an executable computer environment.

Abstract

Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like *BigQuery*, *dbt*, and *Airbyte*. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this work, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task. Furthermore, we supplement multimodal agents with comprehensive documentation of these enterprise data software systems. Our empirical evaluation reveals that existing state-of-the-art LLM/VLM-based agents do not reliably automate full data workflows (14.0% success). Even with step-by-step guidance, these agents still underperform on tasks that require fine-grained, knowledge-intensive GUI actions (16.2%) or involve remote cloud-hosted workspaces (10.6%). We hope that Spider2-V paves the way for autonomous multimodal agents to transform the automation of data science and engineering workflows.

Spider2-V Framework Infrastructure

Overview of **Spider2-V**, which features:
- 494 real-world tasks across the complete data science and engineering workflow (from data warehousing to orchestration)
- 20 professional, enterprise-level applications (e.g., *BigQuery*, *dbt*, *Airbyte*)
- integration of both command-line interfaces (CLI) and graphical user interfaces (GUI), with an emphasis on intensive GUI controls
- an interactive, executable computer environment (adapted from our previous work [OSWorld](https://os-world.github.io/))
- a document warehouse ([Download](https://drive.usercontent.google.com/download?id=1aGaHXDkBeoUZ9EOIPj7iIRFra_2FjJoZ&export=download&authuser=0&confirm=t)) for agent retrieval

Executable Environment

The interactive environment is a computer desktop running the Ubuntu operating system.
- The action space is either 1) pyautogui code or 2) a customized JSON dict.
- The observation space is either 1) an image-style screenshot or 2) a text-format accessibility tree.
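To make the action space concrete, here is a minimal illustrative sketch of what the two formats could look like. The JSON-dict field names (`action_type`, `x`, `y`) are our own assumptions, not the benchmark's exact schema.

```python
# Illustrative only: the JSON-dict field names below are assumptions,
# not the exact schema used by Spider2-V.

# 1) pyautogui-code action: executable Python that drives mouse and keyboard.
pyautogui_action = """
import pyautogui
pyautogui.click(417, 288)            # click a labeled UI element
pyautogui.typewrite('0 18 * * *')    # type text into the focused field
"""

# 2) customized JSON-dict action: a structured command that the environment
#    translates into the corresponding GUI operation.
json_action = {
    "action_type": "CLICK",  # assumed field name
    "x": 417,
    "y": 288,
}
```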

Task Demonstration

We present one task example (using the application Airbyte, uuid 66936a8e-5cbe-4638-a03a-3ae92eb81e6c) below to showcase:
1. the .json data format (an illustrative sketch appears after this list);
2. the two types of task instructions (abstract and verbose);
3. the environment setup methods;
4. the video recording and action trajectory that complete the task;
5. the task-specific evaluation metric.
**Note that** the action trajectory is generated by the GPT-4o agent, and it successfully completes the task 🎉!
For a quick glance at more task examples, please refer to the [Task Viewer](explorer.html) page.
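For readers who want a sense of the .json data format before opening the Task Viewer, below is a heavily simplified, hypothetical sketch written as a Python dict. The field names, the setup command, and the evaluator name are illustrative assumptions, not the repository's actual schema.

```python
# Hypothetical, simplified task entry; all field names are illustrative only.
task_example = {
    "id": "66936a8e-5cbe-4638-a03a-3ae92eb81e6c",   # task uuid
    "related_apps": ["airbyte"],
    "instruction": {
        "abstract": "Change the replication schedule of the Faker -> local CSV "
                    "connection to 6:00 pm every day.",
        "verbose": "Step-by-step guideline ... (omitted)",
    },
    "config": [
        # environment setup steps executed before the agent starts,
        # e.g., launching the Airbyte services and opening its web UI
        {"type": "launch", "parameters": {"command": "docker compose up -d"}},
    ],
    "evaluator": {
        # task-specific metric, e.g., checking the connection's cron schedule
        "func": "check_airbyte_connection_schedule",   # hypothetical name
        "expected": "0 0 18 * * ?",
    },
}
```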

Task Instruction

I have established a connection from Faker to local .csv file. Could you help me change the running schedule? I hope it can be replicated at 6:00 pm every day.

Video Recording

Action Trajectory

## Action 1
import pyautogui
import time

# index_80 holds the (x, y) screen coordinates of the target UI element
index_80 = (417, 288)
pyautogui.click(index_80)
time.sleep(1)

## Action 2
index_83 = (502, 307)
pyautogui.click(index_83)
time.sleep(1)

## Action 3
index_91 = (883, 404)
pyautogui.click(index_91)
time.sleep(1)

## Action 4
index_102 = (1130, 481)
pyautogui.click(index_102)
time.sleep(1)

## Action 5
index_121 = (1130, 782)
pyautogui.click(index_121)
time.sleep(1)
                    
## Action 6
index_98 = (1130, 430)
pyautogui.click(index_98)
time.sleep(1)

## Action 7
index_105 = (1130, 560)
pyautogui.click(index_105)
time.sleep(1)

## Action 8
index_103 = (1050, 481)
# Clear the current cron expression
pyautogui.click(index_103)
pyautogui.hotkey('ctrl', 'a')
pyautogui.press('backspace')
time.sleep(1)
                    
# Enter the new cron expression
pyautogui.typewrite('0 18 * * *')
time.sleep(1)

## Action 9
index_103 = (1050, 481)
# Clear the current cron expression
pyautogui.click(index_103)
pyautogui.hotkey('ctrl', 'a')
pyautogui.press('backspace')
time.sleep(1)

# Airbyte's cron scheduling uses Quartz-style expressions (with a leading
# seconds field), so enter the schedule again in that format
pyautogui.typewrite('0 0 18 * * ?')
time.sleep(1)

## Action 10
index_134 = (1426, 834)
pyautogui.click(index_134)
time.sleep(1)

## Action 11
DONE
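As a rough illustration of what the task-specific evaluation metric for this example could check, the sketch below compares the connection's configured cron expression against the expected Quartz expression. The helper `get_connection_schedule` is a placeholder of our own; this is not the benchmark's actual evaluation code.

```python
# Hypothetical evaluation sketch; not the benchmark's actual metric code.

def get_connection_schedule(connection_id: str) -> str:
    """Placeholder: in practice this would query the local Airbyte instance
    and return the connection's configured cron expression."""
    raise NotImplementedError

def evaluate_schedule(connection_id: str, expected_cron: str = "0 0 18 * * ?") -> float:
    """Return 1.0 if the connection is scheduled with the expected Quartz cron
    (6:00 pm every day), else 0.0."""
    try:
        actual = get_connection_schedule(connection_id)
    except Exception:
        return 0.0
    return 1.0 if actual.strip() == expected_cron else 0.0
```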
                  

Data Statistics and Comparison

We classify all 494 tasks in Spider2-V into 7 categories and 11 software sub-categories; the main statistics are shown below.

β€œverbose” means a step-by-step guideline on how to complete the task is included in the instruction.

Figure: Key statistics of Spider2-V.

The distribution below is organized by task category and professional application to convey the composition intuitively.

Figure: Distribution of tasks in Spider2-V.
We compare **Spider2-V** with other benchmarks for VLM/LLM-based agents.
**The headers indicate:** the research field (Field), the number of tasks (# Tasks), whether an executable environment is provided (Exec. Env.?), whether enterprise services are utilized (Enter. Serv.?), whether GUI actions are supported (GUI Support?), and other statistics (i.e., the number of involved applications or websites, and the number of execution-based evaluation functions).

| Benchmark | Field | # Tasks | Exec. Env.? | Enter. Serv.? | GUI Support? | # Apps/Sites | # Exec. Eval. Func. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Spider2-V** | Data Science & Engineering | 494 | ✔️ | ✔️ | ✔️ | 20 | 151 |
| Spider1.0 | Text-to-SQL | 1034 | ❌ | ❌ | ❌ | 1 | 0 |
| DS1000 | Data Science | 1000 | ❌ | ❌ | ❌ | 1 | 0 |
| Arcade | Data Science | 1082 | ❌ | ❌ | ❌ | 1 | 0 |
| Intercode | Data Science | 1350 | ✔️ | ❌ | ❌ | 3 | 3 |
| SheetCopilot | Sheet Coding | 221 | ❌ | ❌ | ❌ | 1 | 0 |
| MLAgentBench | Machine Learning | 13 | ✔️ | ❌ | ❌ | 4 | 13 |
| SWEBench | Software Engineering | 2294 | ❌ | ❌ | ❌ | 12 | 1 |
| Mind2Web | Web | 2000 | ❌ | ❌ | ✔️ | 137 | 0 |
| WEBLINX | Web | 2337 | ❌ | ❌ | ✔️ | 155 | 0 |
| GAIA | Web | 466 | ❌ | ❌ | ❌ | n/a | 0 |
| WebArena | Web | 812 | ✔️ | ❌ | ✔️ | 5 | 5 |
| WorkArena | Web | 29 | ✔️ | ✔️ | ✔️ | 1 | 7 |
| OSWorld | Computer Control | 369 | ✔️ | ❌ | ✔️ | 9 | 134 |
| AitW | Android | 30k | ❌ | ❌ | ✔️ | 357 | 0 |
| AndroidWorld | Android | 116 | ✔️ | ❌ | ✔️ | 20 | 6 |

Benchmarking

We experiment with state-of-the-art LLMs and VLMs, including open-source representatives such as Mixtral-8x7B and Llama-3-70B, and closed-source ones such as Qwen-Max, Gemini-Pro-1.5, Claude-3-Opus, and the GPT family (GPT-4o and GPT-4V). The baseline agent adopts three techniques: 1) Set-of-Mark (SoM), 2) execution feedback (EF), and 3) retrieval-augmented generation (RAG). We also split the overall results into different subsets, namely Abstract, Verbose, Account, and Non-account. **We are actively updating the benchmark with new LLMs, VLMs, and methods. Pull requests are welcome!** 👍
**Notice:** t = temperature, top-p = top-p cutoff, len = max context length, a11ytree = accessibility tree
| Rank | Date | Model | Organization | Techniques | Hyper-parameters | Score |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Jun 3, 2024 | GPT-4V (1106) (OpenAI, '23) | OpenAI | SoM + EF + RAG | t=1.0, top-p=0.9, len=128k | 14.0 |
| 2 | Jun 2, 2024 | GPT-4o (0513) (OpenAI, '24) | OpenAI | SoM + EF + RAG | t=1.0, top-p=0.9, len=128k | 13.8 |
| 3 | Jun 5, 2024 | Gemini-Pro-1.5 (Gemini Team, Google, '24) | Google | SoM + EF + RAG | t=1.0, top-p=0.9, len=128k | 9.1 |
| 4 | Jun 6, 2024 | Claude-3-Opus (Anthropic, '24) | AnthropicAI | SoM + EF + RAG | t=1.0, top-p=0.9, len=200k | 8.1 |
| 5 | Jun 6, 2024 | Llama-3-70B (Meta Llama, Meta, '24) | Meta | a11ytree + EF + RAG | t=1.0, top-p=0.9, len=32k | 2.0 |
| 6 | Jun 6, 2024 | Mixtral-8x7B (Jiang et al., '24) | MistralAI | a11ytree + EF + RAG | t=1.0, top-p=0.9, len=32k | 0.8 |
| 7 | Jun 6, 2024 | Qwen-Max (Qwen Team, '24) | Qwen | a11ytree + EF + RAG | t=1.0, top-p=0.9, len=32k | 0.6 |
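For concreteness, here is a minimal sketch of how a baseline agent loop combining the three techniques above (SoM, EF, RAG) could be wired together. Every helper name (`observe_with_som`, `retrieve`, `predict`, `execute`) and the step budget are placeholders of our own, not Spider2-V's actual agent implementation.

```python
# Hypothetical baseline agent loop (SoM + execution feedback + RAG);
# all helper names and the step budget are placeholders.

MAX_STEPS = 15  # assumed step budget

def run_agent(task_instruction: str, env, vlm, retriever) -> bool:
    history = []
    for step in range(MAX_STEPS):
        # Observation: screenshot annotated with Set-of-Mark labels (plus a11y tree).
        screenshot, som_labels = env.observe_with_som()
        # RAG: retrieve relevant snippets from the document warehouse.
        docs = retriever.retrieve(task_instruction, top_k=3)
        # The VLM proposes the next action: pyautogui code or a special token.
        action = vlm.predict(task_instruction, screenshot, som_labels, docs, history)
        if action.strip() == "DONE":
            return True
        if action.strip() == "FAIL":
            return False
        # Execute the action and keep the execution feedback for the next turn.
        feedback = env.execute(action)
        history.append((action, feedback))
    return False
```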

Analysis

We delve into the different factors (e.g., action space, observation space, the techniques above, and two hyper-parameters) that influence the eventual success rate. The baseline agent is built on GPT-4o.

Acknowledgement

We thank Yiheng Xu, Hongjin Su, Xiaochuan Li, and Toh Jing Hua for their helpful assistance and feedback on this work.

FAQ

Where can I download the resources?

The GitHub repository, virtual machine snapshots, and crawled documents can be downloaded from:
- GitHub repository: Spider2-V (including the environment and task examples)
- VM snapshots: ubuntu-arm.zip or ubuntu-x86.zip
- Crawled documents: docs.zip

What is the username and password for the virtual machines?

The username and password for the virtual machines are as follows:
- Username: user
- Password: password

How do I tackle task examples that require accounts?

See Account Guideline.

How can I configure a proxy for the VM if I'm behind a GFW?

See Proxy Guideline.

I still have problems when using Spider2-V. Where can I find support?

You can open an issue on the GitHub repository or email ruishengcao@gmail.com or tyu@cs.hku.hk.

BibTeX

@article{2024-spider2v,
    title={Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?}, 
    author={Ruisheng Cao and Fangyu Lei and Haoyuan Wu and Jixuan Chen and Yeqiao Fu and Hongcheng Gao and Xinzhuang Xiong and Hanchong Zhang and Yuchen Mao and Wenjing Hu and Tianbao Xie and Hongshen Xu and Danyang Zhang and Sida Wang and Ruoxi Sun and Pengcheng Yin and Caiming Xiong and Ansong Ni and Qian Liu and Victor Zhong and Lu Chen and Kai Yu and Tao Yu},
    year={2024},
    journal={CoRR},
    volume={abs/2407.10956},
    eprint={2407.10956},
    eprinttype={arXiv},
    url={https://arxiv.org/abs/2407.10956}
}