ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

Shanghai Artificial Intelligence Laboratory
Teaser Image

(a) ComfyBench is a comprehensive benchmark for evaluating agents' ability to design collaborative AI systems in ComfyUI. Given a task instruction, agents are required to learn from documents and create workflows that describe collaborative AI systems. Performance is measured by pass rate and resolve rate, reflecting whether the workflow can be executed correctly and whether the task requirements are realized. (b) ComfyAgent builds collaborative AI systems in ComfyUI by generating workflows. The workflows are converted into equivalent code so that LLMs can understand them better. ComfyAgent can learn from existing workflows and autonomously design new ones. The generated workflows can be interpreted as collaborative AI systems to complete given tasks.

Gallery

Leaderboard

Notice: The score is defined as the overall resolve rate on ComfyBench.

| Rank | Agent | Parameters | Score |
|------|-------|------------|-------|
| 1 | GPT-4o + ComfyAgent (OpenAI, 2024; Xue et al., 2024) | num_references = 5, step_limitation = 5, debug_limitation = 1 | 32.50 |
| 2 | o1-preview + RAG (OpenAI, 2024; Lewis et al., 2020) | num_references = 5 | 32.50 |
| 3 | GPT-4o + RAG (OpenAI, 2024; Lewis et al., 2020) | num_references = 5 | 26.50 |
| 4 | Qwen-2.5-72B + RAG (Qwen, 2024; Lewis et al., 2020) | num_references = 5, step_limitation = 5, debug_limitation = 1 | 25.50 |
| 5 | Llama-3.1-70B + ComfyAgent (Meta, 2024; Xue et al., 2024) | num_references = 5, step_limitation = 5, debug_limitation = 1 | 24.00 |
| 6 | InternLM-2.5-20B + RAG (InternLM, 2024; Lewis et al., 2020) | num_references = 5, step_limitation = 5, debug_limitation = 1 | 21.00 |
| 7 | Llama-3.1-70B + RAG (Meta, 2024; Lewis et al., 2020) | num_references = 5 | 20.00 |
| 8 | GPT-4o + CoT-SC (OpenAI, 2024; Wang et al., 2022) | num_examples = 5, num_trajectories = 3 | 18.50 |
| 9 | Mixtral-8x7B + RAG (Mistral, 2023; Lewis et al., 2020) | num_references = 5 | 17.00 |
| 10 | GPT-4o + CoT (OpenAI, 2024; Wei et al., 2022) | num_examples = 5 | 17.00 |
| 11 | GPT-4o + Few-shot (OpenAI, 2024; Brown et al., 2020) | num_examples = 5 | 16.00 |
| 12 | o1-mini + RAG (OpenAI, 2024; Lewis et al., 2020) | num_references = 5 | 12.00 |
| 13 | Claude-3.5-Sonnet + RAG (Anthropic, 2024; Lewis et al., 2020) | num_references = 5 | 8.50 |
| 14 | GPT-4o + Zero-shot (OpenAI, 2024; Brown et al., 2020) | num_examples = 5 | 0.00 |

Abstract

Much previous AI research has focused on developing monolithic models to maximize their intelligence, with the primary goal of enhancing performance on specific tasks. In contrast, this work studies whether LLM-based agents can autonomously design collaborative AI systems. To explore this problem, we first introduce ComfyBench to evaluate agents' ability to design collaborative AI systems in ComfyUI. ComfyBench is a comprehensive benchmark comprising 200 diverse tasks covering various instruction-following generation challenges, along with detailed annotations for 3,205 nodes and 20 workflows. Based on ComfyBench, we further develop ComfyAgent, a novel framework that empowers LLM-based agents to autonomously design collaborative AI systems by generating workflows. ComfyAgent rests on two core concepts. First, it represents workflows with code, which can be reversibly converted into workflows and executed as collaborative systems by the interpreter. Second, it constructs a multi-agent system whose agents cooperate to learn from existing workflows and generate new ones for a given task. While experimental results demonstrate that ComfyAgent achieves a resolve rate comparable to o1-preview and significantly surpasses other agents on ComfyBench, it resolves only 15% of creative tasks. LLM-based agents still have a long way to go in autonomously designing collaborative AI systems; we hope progress on ComfyBench paves the way for more intelligent and autonomous collaborative AI systems.

ComfyBench: Benchmark Contents

ComfyBench is a comprehensive benchmark that includes:

  • Node Documentation provides complete documentation for all 3,205 nodes registered in ComfyUI, detailing each node's usage, inputs, and outputs and serving as an essential reference for agents.
  • Curriculum Workflow is a collection of 20 annotated tutorial workflows that demonstrate common nodes and patterns for solving basic tasks in ComfyUI, helping agents learn useful skills in a structured way.
  • Task Instruction serves as the core component, consisting of 200 task instructions that require designing and executing ComfyUI workflows. The tasks are divided into three difficulty levels:
    • Vanilla includes 100 tasks, which can be solved by learning from a single demonstration with minor modifications.
    • Complex includes 60 tasks, which require combining multiple curriculum workflows and making adjustments.
    • Creative includes 40 tasks, which require understanding core principles and applying skills in novel ways.
If an additional image or video is involved in the task, it will be specified in the instruction.

ComfyBench Contents

ComfyBench: Evaluation Metrics

We design two metrics to evaluate the generated workflows in ComfyBench:

  • Pass Rate measures whether the generated workflow is executable. It is evaluated by the ComfyUI server: a task is marked as passed only if the server finishes the execution process and returns a success message.
  • Resolve Rate measures whether the generated workflow produces the expected results. It is evaluated automatically with GPT-4o: a task is marked as resolved only if GPT-4o confirms that all the task requirements are satisfied. A sketch of both checks follows.
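Below is a minimal sketch of how the two checks could be implemented. ComfyUI's `/prompt` and `/history` HTTP endpoints are part of its public API, but the payload handling, the judging prompt, and the helper names here are illustrative assumptions rather than the benchmark's exact code.

```python
import json
import time
import urllib.request

from openai import OpenAI

COMFYUI_URL = "http://127.0.0.1:8188"  # assumes a locally running ComfyUI server
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def check_pass(workflow: dict, timeout: float = 300.0) -> bool:
    """Pass Rate: submit the workflow and wait for the server to finish executing it."""
    payload = json.dumps({"prompt": workflow}).encode()
    request = urllib.request.Request(f"{COMFYUI_URL}/prompt", data=payload,
                                     headers={"Content-Type": "application/json"})
    prompt_id = json.load(urllib.request.urlopen(request))["prompt_id"]
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(f"{COMFYUI_URL}/history/{prompt_id}") as response:
            history = json.load(response)
        if prompt_id in history:  # the entry appears once execution has finished
            return history[prompt_id].get("status", {}).get("status_str") == "success"
        time.sleep(1.0)
    return False


def check_resolve(instruction: str, output_image_b64: str) -> bool:
    """Resolve Rate: ask GPT-4o whether the output satisfies the task instruction."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": f"Task: {instruction}\nDoes the attached output "
                                     "satisfy every requirement? Answer yes or no."},
            {"type": "image_url", "image_url":
                {"url": f"data:image/png;base64,{output_image_b64}"}},
        ]}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```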
To verify the reliability of VLM-based evaluation, we conduct a human study to analyze the agreement between human evaluators and GPT-4o. We randomly select 50 tasks completed by ComfyAgent, on which human evaluators and GPT-4o provide their judgments independently. We compute the average score for each question over a sample size of 400 and conduct a correlation analysis, which indicates strong agreement between human evaluators and GPT-4o.

Human Alignment

ComfyAgent: Workflow Representation

There are four common formats to represent workflows:

  • Flow graph provides intuitive visualization for humans but is unsuitable for LLM or VLM processing.
  • JSON provides a structured representation but carries redundant information that strains LLMs' context windows.
  • Element list is closer to natural language and provides a more compact representation, but it lacks explicit topological relationships, hindering LLMs from correctly processing complex workflows.
  • Code emerges as the most effective representation, offering various advantages including Turing completeness, rich semantic information, and natural compatibility with LLMs' code generation capabilities.
We implement the code representation using a restricted subset of Python-like syntax, illustrated by the sketch below.
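As an illustration, a minimal text-to-image workflow could be written in this style as follows. The node names are real ComfyUI nodes, but the surface syntax shown here is a simplified approximation: each assignment corresponds to one node, and the variables make the topology explicit.

```python
# Approximate code representation of a minimal text-to-image workflow.
# Each call corresponds to a ComfyUI node; each variable is an edge in the graph.
model, clip, vae = CheckpointLoaderSimple(ckpt_name="sd_v1-5.safetensors")
positive = CLIPTextEncode(text="a photo of a cat wearing a spacesuit", clip=clip)
negative = CLIPTextEncode(text="blurry, low quality", clip=clip)
latent = EmptyLatentImage(width=512, height=512, batch_size=1)
samples = KSampler(model=model, positive=positive, negative=negative,
                   latent_image=latent, seed=42, steps=20, cfg=7.0,
                   sampler_name="euler", scheduler="normal", denoise=1.0)
image = VAEDecode(samples=samples, vae=vae)
SaveImage(images=image, filename_prefix="output")
```

Because this mapping is bijective, the interpreter can convert generated code back into an executable JSON workflow, which is what makes the representation reversible.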

Workflow Representation

ComfyAgent: Multi-Agent Framework

We propose ComfyAgent, a multi-agent framework consisting of three independent modules:

  • Memory stores the recent state of ComfyAgent, which is formulated into three parts:
    • History maintains recent plans and actions of Planner, enabling action review for subsequent planning.
    • Reference stores information retrieved from the knowledge base, and can be updated through active retrieval.
    • Workspace contains the current workflow together with its natural language annotation.
  • Planner serves as the core of ComfyAgent, providing the global scheme to design and modify workflows:
    • At the beginning of the task, PlanAgent selects an existing workflow to initialize the memory and produces a thorough multi-step plan based on the task instruction.
    • For each step, PlanAgent produces a high-level plan, together with an action based on the current memory.
    • For each step, PlanAgent evaluates the completion status of the task. Once the task is deemed completed, PlanAgent will finish the procedure and save the workflow.
  • Actions are selected by PlanAgent at each step. We define three actions as follows:
    • Combine is processed by CombineAgent, combining the current workflow with another workflow from references.
    • Adapt is processed by AdaptAgent, adapting the details of the current workflow based on the prompt.
    • Retrieve is processed by RetrieveAgent, retrieving relevant information and updating references.
    After combination or adaptation, the updated workflow will be checked and refined by RefineAgent.
After the action is processed, ComfyAgent enters a new step, where PlanAgent updates the existing plan and selects a new action. Once the procedure finishes, the code representation is converted back into the standard workflow format that describes the collaborative AI system. This loop is sketched below.
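The control flow can be summarized with the following sketch. The agent names match the framework described above, but the method signatures, the Memory container, and the helper functions are illustrative assumptions.

```python
# Illustrative ComfyAgent control loop; signatures and helper names are assumptions.
def run_comfy_agent(task: str, step_limitation: int = 5) -> dict:
    # Memory holds the three parts described above: history, reference, workspace.
    memory = Memory(history=[], reference=RetrieveAgent.retrieve(task),
                    workspace=PlanAgent.select_initial_workflow(task))
    plan = PlanAgent.make_plan(task, memory)  # thorough multi-step plan

    for _ in range(step_limitation):
        action = PlanAgent.next_action(task, plan, memory)
        memory.history.append((plan, action))
        if action.name == "retrieve":
            memory.reference = RetrieveAgent.retrieve(action.query)
        else:
            if action.name == "combine":
                workflow = CombineAgent.combine(memory.workspace, action.reference)
            else:  # "adapt"
                workflow = AdaptAgent.adapt(memory.workspace, action.prompt)
            # combined or adapted workflows are checked and refined before acceptance
            memory.workspace = RefineAgent.refine(workflow)
        if PlanAgent.is_completed(task, memory):
            break
        plan = PlanAgent.update_plan(task, plan, memory)

    # convert the code representation back into ComfyUI's standard JSON format
    return code_to_workflow(memory.workspace)
```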

ComfyAgent Framework

Experiments: Evaluation Results

We conduct experiments on advanced LLMs, including Llama-3.1 (llama-3.1-70b-instruct), Claude-3.5 (claude-3.5-sonnet-20240620), GPT-4o (gpt-4o-2024-08-06), o1-mini (o1-mini-2024-09-12), and o1-preview (o1-preview-2024-09-12). We adopt five methods that are universally effective and can be conveniently adapted to solve the tasks in ComfyBench:

  • Zero-shot Learning directly feeds LLMs with the task instruction to conduct inference.
  • Few-shot Learning provides a set of demonstrations in the prompt, which utilizes the in-context learning ability of LLMs.
  • Chain-of-Thought (CoT) instructs the agent to articulate the reasoning process before providing the final answer.
  • CoT with Self-consistency (CoT-SC) ensembles parallel trajectories and then selects the most consistent answer.
  • Retrieval-Augmented Generation (RAG) retrieves the most relevant demonstrations for each task and learns from them in context.
We evaluate the methods mainly on GPT-4o. ComfyAgent is also evaluated on Llama-3.1 to verify its performance on the open-source model. RAG is evaluated on all the models to provide a horizontal comparison of their capabilities.
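As a concrete example of the strongest baseline family, the following is a minimal sketch of the RAG pipeline, assuming an OpenAI-style embedding API; the knowledge base, prompt template, and helper names are simplified assumptions, while num_references matches the leaderboard parameter.

```python
# Minimal RAG baseline sketch; the prompt template and helpers are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])


def rag_generate(task: str, knowledge_base: list[str], num_references: int = 5) -> str:
    # Retrieve the top-k most similar annotated workflows by cosine similarity.
    doc_vectors = embed(knowledge_base)
    query_vector = embed([task])[0]
    scores = doc_vectors @ query_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector))
    top_k = np.argsort(scores)[::-1][:num_references]
    references = "\n\n".join(knowledge_base[i] for i in top_k)

    # Condition generation on the retrieved references.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Reference workflows:\n{references}\n\nTask: {task}\n"
                   "Write a ComfyUI workflow in the code representation."}],
    )
    return completion.choices[0].message.content
```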

Evaluation Result

Experiments: Ablation Studies

We conduct ablation studies to verify two core designs of ComfyAgent:

  • Representation: To verify the effectiveness of code representation, we implement three variants of RAG on GPT-4o, where the workflows are respectively represented in JSON, element list, and code.
  • Architecture: To verify the rationale behind the multi-agent framework, we implement four variants of ComfyAgent on GPT-4o, each removing one agent from the original framework.
It turns out that RAG with the code representation outperforms the other representations, and that removing any agent from ComfyAgent leads to significant performance degradation.

Ablation Study

BibTeX

@article{xue2024comfybench,
  title={ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems},
  author={Xue, Xiangyuan and Lu, Zeyu and Huang, Di and Wang, Zidong and Ouyang, Wanli and Bai, Lei},
  journal={arXiv preprint arXiv:2409.01392},
  year={2024}
}