ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

Shanghai Artificial Intelligence Laboratory
Teaser Image

(a) ComfyBench is a comprehensive benchmark for evaluating agents' ability to design collaborative AI systems in ComfyUI. Given a task instruction, agents are required to learn from documents and create workflows that describe collaborative AI systems. Performance is measured by pass rate and resolve rate, reflecting whether the workflow can be executed correctly and whether the task requirements are realized. (b) ComfyAgent builds collaborative AI systems in ComfyUI by generating workflows. The workflows are converted into equivalent code so that LLMs can understand them better. ComfyAgent can learn from existing workflows and autonomously design new ones. The generated workflows can be interpreted as collaborative AI systems to complete given tasks.

Gallery

Leaderboard

Notice: The score is defined as the overall resolve rate on ComfyBench.

| Rank | Agent | Parameters | Score |
|------|-------|------------|-------|
| 1 | GPT-4o + ComfyAgent (OpenAI, 2024; Xue et al., 2024) | num_references = 5, step_limitation = 5, debug_limitation = 1 | 32.50 |
| 2 | o1-preview + RAG (OpenAI, 2024; Lewis et al., 2020) | num_references = 5 | 32.50 |
| 3 | GPT-4o + RAG (OpenAI, 2024; Lewis et al., 2020) | num_references = 5 | 26.50 |
| 4 | Qwen-2.5-72B + RAG (Qwen, 2024; Lewis et al., 2020) | num_references = 5, step_limitation = 5, debug_limitation = 1 | 25.50 |
| 5 | Llama-3.1-70B + ComfyAgent (Meta, 2024; Xue et al., 2024) | num_references = 5, step_limitation = 5, debug_limitation = 1 | 24.00 |
| 6 | InternLM-2.5-20B + RAG (InternLM, 2024; Lewis et al., 2020) | num_references = 5, step_limitation = 5, debug_limitation = 1 | 21.00 |
| 7 | Llama-3.1-70B + RAG (Meta, 2024; Lewis et al., 2020) | num_references = 5 | 20.00 |
| 8 | GPT-4o + CoT-SC (OpenAI, 2024; Wang et al., 2022) | num_examples = 5, num_trajectories = 3 | 18.50 |
| 9 | Mixtral-8x7B + RAG (Mistral, 2023; Lewis et al., 2020) | num_references = 5 | 17.00 |
| 10 | GPT-4o + CoT (OpenAI, 2024; Wei et al., 2022) | num_examples = 5 | 17.00 |
| 11 | GPT-4o + Few-shot (OpenAI, 2024; Brown et al., 2020) | num_examples = 5 | 16.00 |
| 12 | o1-mini + RAG (OpenAI, 2024; Lewis et al., 2020) | num_references = 5 | 12.00 |
| 13 | Claude-3.5-Sonnet + RAG (Anthropic, 2024; Lewis et al., 2020) | num_references = 5 | 8.50 |
| 14 | GPT-4o + Zero-shot (OpenAI, 2024; Brown et al., 2020) | num_examples = 5 | 0.00 |

Abstract

Much previous AI research has focused on developing monolithic models to maximize their intelligence, with the primary goal of enhancing performance on specific tasks. In contrast, this work studies whether LLM-based agents can autonomously design collaborative AI systems. To explore this problem, we first introduce ComfyBench to evaluate agents' ability to design collaborative AI systems in ComfyUI. ComfyBench is a comprehensive benchmark comprising 200 diverse tasks covering various instruction-following generation challenges, along with detailed annotations for 3,205 nodes and 20 workflows. Based on ComfyBench, we further develop ComfyAgent, a novel framework that empowers LLM-based agents to autonomously design collaborative AI systems by generating workflows. ComfyAgent rests on two core concepts. First, it represents workflows with code, which can be reversibly converted into workflows and executed as collaborative systems by the interpreter. Second, it constructs a multi-agent system whose agents cooperate to learn from existing workflows and generate new ones for a given task. While experimental results demonstrate that ComfyAgent achieves a resolve rate comparable to o1-preview and significantly surpasses other agents on ComfyBench, it resolves only 15% of creative tasks. LLM-based agents still have a long way to go in autonomously designing collaborative AI systems; we hope progress on ComfyBench paves the way for more intelligent and autonomous collaborative AI systems.

ComfyBench: Benchmark Contents

ComfyBench is a comprehensive benchmark that includes:

  • Node Documentation provides complete documentation for all 3,205 nodes registered in ComfyUI, detailing each node's usage, inputs, and outputs and serving as an essential reference for agents.
  • Curriculum Workflow is a collection of 20 annotated tutorial workflows that demonstrate common nodes and patterns for solving basic tasks in ComfyUI, helping agents learn useful skills in a structured way.
  • Task Instruction serves as the core component, consisting of 200 task instructions that require designing and executing ComfyUI workflows. The tasks are divided into three difficulty levels:
    • Vanilla includes 100 tasks, which can be solved by learning from a single demonstration with minor modifications.
    • Complex includes 60 tasks, which require combining multiple curriculum workflows and making adjustments.
    • Creative includes 40 tasks, which require understanding core principles and applying skills in novel ways.
If an additional image or video is involved in the task, it will be specified in the instruction.

ComfyBench Contents

ComfyBench: Evaluation Metrics

We design two metrics to evaluate the generated workflows in ComfyBench:

  • Pass Rate measures whether the generated workflow is executable. It is evaluated by the ComfyUI server: a task is marked as passed only if the server finishes the execution process and returns a success message.
  • Resolve Rate measures whether the generated workflow produces the expected results. It is evaluated automatically with GPT-4o: a task is marked as resolved only if GPT-4o confirms that all the task requirements are satisfied. A sketch of both checks follows.
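Below is a minimal sketch of how the two checks could be implemented. ComfyUI's `/prompt` and `/history` HTTP endpoints are part of its public API, but the payload handling, the judging prompt, and the helper names here are illustrative assumptions rather than the benchmark's exact code.

```python
import json
import time
import urllib.request

from openai import OpenAI

COMFYUI_URL = "http://127.0.0.1:8188"  # assumes a locally running ComfyUI server
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def check_pass(workflow: dict, timeout: float = 300.0) -> bool:
    """Pass Rate: submit the workflow and wait for the server to finish executing it."""
    payload = json.dumps({"prompt": workflow}).encode()
    request = urllib.request.Request(f"{COMFYUI_URL}/prompt", data=payload,
                                     headers={"Content-Type": "application/json"})
    prompt_id = json.load(urllib.request.urlopen(request))["prompt_id"]
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(f"{COMFYUI_URL}/history/{prompt_id}") as response:
            history = json.load(response)
        if prompt_id in history:  # the entry appears once execution has finished
            return history[prompt_id].get("status", {}).get("status_str") == "success"
        time.sleep(1.0)
    return False


def check_resolve(instruction: str, output_image_b64: str) -> bool:
    """Resolve Rate: ask GPT-4o whether the output satisfies the task instruction."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": f"Task: {instruction}\nDoes the attached output "
                                     "satisfy every requirement? Answer yes or no."},
            {"type": "image_url", "image_url":
                {"url": f"data:image/png;base64,{output_image_b64}"}},
        ]}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```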
To verify the reliability of VLM-based evaluation, we conduct a human study to analyze the agreement between human evaluators and GPT-4o. We randomly select 50 tasks completed by ComfyAgent, on which human evaluators and GPT-4o provide their judgments independently. We compute the average score for each question over a sample size of 400 and conduct a correlation analysis, which indicates strong agreement between human evaluators and GPT-4o.

Human Alignment

ComfyAgent: Workflow Representation

There are four common formats to represent workflows:

  • Flow graph provides intuitive visualization for humans but is unsuitable for LLM or VLM processing.
  • JSON provides a structured representation but carries redundant information that strains LLMs' context windows.
  • Element list is closer to natural language and provides a more compact representation, but it lacks explicit topological relationships, hindering LLMs from correctly processing complex workflows.
  • Code emerges as the most effective representation, offering various advantages including Turing completeness, rich semantic information, and natural compatibility with LLMs' code generation capabilities.
We implement the code representation using a restricted subset of Python-like syntax, illustrated by the sketch below.
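As an illustration, a minimal text-to-image workflow could be written in this style as follows. The node names are real ComfyUI nodes, but the surface syntax shown here is a simplified approximation: each assignment corresponds to one node, and the variables make the topology explicit.

```python
# Approximate code representation of a minimal text-to-image workflow.
# Each call corresponds to a ComfyUI node; each variable is an edge in the graph.
model, clip, vae = CheckpointLoaderSimple(ckpt_name="sd_v1-5.safetensors")
positive = CLIPTextEncode(text="a photo of a cat wearing a spacesuit", clip=clip)
negative = CLIPTextEncode(text="blurry, low quality", clip=clip)
latent = EmptyLatentImage(width=512, height=512, batch_size=1)
samples = KSampler(model=model, positive=positive, negative=negative,
                   latent_image=latent, seed=42, steps=20, cfg=7.0,
                   sampler_name="euler", scheduler="normal", denoise=1.0)
image = VAEDecode(samples=samples, vae=vae)
SaveImage(images=image, filename_prefix="output")
```

Because this mapping is bijective, the interpreter can convert generated code back into an executable JSON workflow, which is what makes the representation reversible.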

Workflow Representation

ComfyAgent: Multi-Agent Framework

We propose ComfyAgent, a multi-agent framework consisting of three independent modules:

  • Memory stores the recent state of ComfyAgent, which is formulated into three parts:
    • History maintains recent plans and actions of Planner, enabling action review for subsequent planning.
    • Reference stores information retrieved from the knowledge base, and can be updated through active retrieval.
    • Workspace contains the current workflow together with its natural language annotation.
  • Planner serves as the core of ComfyAgent, providing the global scheme to design and modify workflows:
    • At the beginning of the task, PlanAgent selects an existing workflow to initialize the memory and produces a thorough multi-step plan based on the task instruction.
    • For each step, PlanAgent produces a high-level plan, together with an action based on the current memory.
    • For each step, PlanAgent evaluates the completion status of the task. Once the task is deemed completed, PlanAgent will finish the procedure and save the workflow.
  • Actions are selected by PlanAgent at each step. We define three actions as follows:
    • Combine is processed by CombineAgent, combining the current workflow with another workflow from references.
    • Adapt is processed by AdaptAgent, adapting the details of the current workflow based on the prompt.
    • Retrieve is processed by RetrieveAgent, retrieving relevant information and updating references.
    After combination or adaptation, the updated workflow will be checked and refined by RefineAgent.
After the action is processed, ComfyAgent enters a new step, where PlanAgent updates the existing plan and selects a new action. Once the procedure finishes, the code representation is converted back into the standard workflow format that describes the collaborative AI system. This loop is sketched below.
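The control flow can be summarized with the following sketch. The agent names match the framework described above, but the method signatures, the Memory container, and the helper functions are illustrative assumptions.

```python
# Illustrative ComfyAgent control loop; signatures and helper names are assumptions.
def run_comfy_agent(task: str, step_limitation: int = 5) -> dict:
    # Memory holds the three parts described above: history, reference, workspace.
    memory = Memory(history=[], reference=RetrieveAgent.retrieve(task),
                    workspace=PlanAgent.select_initial_workflow(task))
    plan = PlanAgent.make_plan(task, memory)  # thorough multi-step plan

    for _ in range(step_limitation):
        action = PlanAgent.next_action(task, plan, memory)
        memory.history.append((plan, action))
        if action.name == "retrieve":
            memory.reference = RetrieveAgent.retrieve(action.query)
        else:
            if action.name == "combine":
                workflow = CombineAgent.combine(memory.workspace, action.reference)
            else:  # "adapt"
                workflow = AdaptAgent.adapt(memory.workspace, action.prompt)
            # combined or adapted workflows are checked and refined before acceptance
            memory.workspace = RefineAgent.refine(workflow)
        if PlanAgent.is_completed(task, memory):
            break
        plan = PlanAgent.update_plan(task, plan, memory)

    # convert the code representation back into ComfyUI's standard JSON format
    return code_to_workflow(memory.workspace)
```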

ComfyAgent Framework

Experiments: Evaluation Results

We conduct experiments on advanced LLMs, including Llama-3.1 (llama-3.1-70b-instruct), Claude-3.5 (claude-3.5-sonnet-20240620), GPT-4o (gpt-4o-2024-08-06), o1-mini (o1-mini-2024-09-12), and o1-preview (o1-preview-2024-09-12). We adopt five methods that are universally effective and can be conveniently adapted to solve the tasks in ComfyBench:

  • Zero-shot Learning directly feeds LLMs with the task instruction to conduct inference.
  • Few-shot Learning provides a set of demonstrations in the prompt, which utilizes the in-context learning ability of LLMs.
  • Chain-of-Thought (CoT) instructs the agent to articulate the reasoning process before providing the final answer.
  • CoT with Self-consistency (CoT-SC) ensembles parallel trajectories and then selects the most consistent answer.
  • Retrieval-Augmented Generation (RAG) retrieves the most relevant demonstrations for each task and learns from them in context.
We evaluate the methods mainly on GPT-4o. ComfyAgent is also evaluated on Llama-3.1 to verify its performance on the open-source model. RAG is evaluated on all the models to provide a horizontal comparison of their capabilities.
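As a concrete example of the strongest baseline family, the following is a minimal sketch of the RAG pipeline, assuming an OpenAI-style embedding API; the knowledge base, prompt template, and helper names are simplified assumptions, while num_references matches the leaderboard parameter.

```python
# Minimal RAG baseline sketch; the prompt template and helpers are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])


def rag_generate(task: str, knowledge_base: list[str], num_references: int = 5) -> str:
    # Retrieve the top-k most similar annotated workflows by cosine similarity.
    doc_vectors = embed(knowledge_base)
    query_vector = embed([task])[0]
    scores = doc_vectors @ query_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector))
    top_k = np.argsort(scores)[::-1][:num_references]
    references = "\n\n".join(knowledge_base[i] for i in top_k)

    # Condition generation on the retrieved references.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Reference workflows:\n{references}\n\nTask: {task}\n"
                   "Write a ComfyUI workflow in the code representation."}],
    )
    return completion.choices[0].message.content
```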

Evaluation Result

Experiments: Ablation Studies

We conduct ablation studies to verify two core designs of ComfyAgent:

  • Representation: To verify the effectiveness of code representation, we implement three variants of RAG on GPT-4o, where the workflows are respectively represented in JSON, element list, and code.
  • Architecture: To verify the rationale behind the multi-agent framework, we implement four variants of ComfyAgent on GPT-4o, each removing one agent from the original framework.
It turns out that RAG with the code representation outperforms the other representations, and that removing any agent from ComfyAgent leads to significant performance degradation.

Ablation Study

BibTeX

@article{xue2024comfybench,
  title={ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems},
  author={Xue, Xiangyuan and Lu, Zeyu and Huang, Di and Wang, Zidong and Ouyang, Wanli and Bai, Lei},
  journal={arXiv preprint arXiv:2409.01392},
  year={2024}
}