The GenAgent framework builds collaborative AI systems by creating workflows. The workflows are converted into
code so that LLM agents can better understand them. GenAgent can learn from human-designed workflows and
create new ones. The generated workflows can be interpreted as collaborative systems to complete complex
tasks.
Abstract
Much previous AI research has focused on developing monolithic models to maximize their intelligence and
capability, with the primary goal of enhancing performance on specific tasks. In contrast, this paper
explores an alternative approach: collaborative AI systems that use workflows to integrate models, data
sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an
LLM-based framework that automatically generates complex workflows, offering greater flexibility and
scalability compared to monolithic models. The core innovation of GenAgent lies in representing workflows
with code, alongside constructing workflows with collaborative agents in a step-by-step manner. We
implement GenAgent on the ComfyUI platform and propose a new benchmark,
OpenComfy. The results demonstrate that GenAgent outperforms baseline approaches in both
run-level and task-level evaluations, showing its capability to generate complex workflows with superior
effectiveness and stability.
Representing Workflow with Code
Workflows are widely used across various applications, with diverse representations: flow graph, JSON,
element list and code.
Flow graph is one of the most intuitive and user-friendly representations of
workflow DAGs for humans, but not for LLMs.
JSON is a popular way for LLMs to represent structured information, but processing
long JSON files is extremely difficult.
Element list is a natural representation for LLMs to grasp workflows, but it lacks
semantic and topological information.
Code is a reasonable and effective representation for LLMs to understand and
generate workflows.
Four different representations of workflows, including flow graph, JSON, element list, and code. Flow
graph is only intuitive for human vision. JSON is a structured format but is complex and redundant.
Element list is more compact and closer to natural language but lacks semantic and topological
information. Code is compact, Turing complete, semantically rich, and friendly for LLMs, thus suitable
for describing workflows.
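To make the contrast between the JSON and code representations concrete, here is a minimal sketch that turns a small JSON-like workflow into code-like lines. The node types (`load_model`, `encode_text`, `sample`, `save_image`), the dictionary layout, and the `workflow_to_code` helper are all hypothetical and chosen for illustration; they are not the ComfyUI schema or GenAgent's actual converter.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow in a JSON-like form: each node has a type, literal
# parameters, and inputs that name the upstream node they are linked to.
workflow = {
    "model": {"type": "load_model", "params": {"name": "base.ckpt"}, "inputs": {}},
    "prompt": {"type": "encode_text", "params": {"text": "a cat"},
               "inputs": {"model": "model"}},
    "image": {"type": "sample", "params": {"steps": 20},
              "inputs": {"model": "model", "cond": "prompt"}},
    "out": {"type": "save_image", "params": {}, "inputs": {"image": "image"}},
}


def workflow_to_code(wf):
    """Emit one assignment per node, ordered so every input is defined first."""
    # Map each node to the set of nodes it depends on.
    deps = {name: set(node["inputs"].values()) for name, node in wf.items()}
    lines = []
    for name in TopologicalSorter(deps).static_order():
        node = wf[name]
        # Links become variable references; parameters become literals.
        args = [f"{key}={src}" for key, src in node["inputs"].items()]
        args += [f"{key}={value!r}" for key, value in node["params"].items()]
        lines.append(f"{name} = {node['type']}({', '.join(args)})")
    return "\n".join(lines)


print(workflow_to_code(workflow))
```

In this sketch each link in the DAG becomes a variable reference, so the topology is carried by ordinary data flow rather than by separate link records.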
Building Workflow with GenAgent
We propose the GenAgent framework where the agents collaborate to complete the workflow generation task.
GenAgent is mainly composed of three independent modules: Memory, PlanAgent, and Action.
Memory includes history, reference, and workspace, storing the agent’s recent
behaviors, intermediate results, external reference knowledge, and internal reasoning.
PlanAgent is responsible for the global planning of workflows under the task
instruction. At each step, PlanAgent generates a high-level plan with an action decision based on the
current memory and task instruction.
Actions represent different activities that PlanAgent can select, and the goal of
each action is to modify the current memory. Different actions are handled by different agents or
modules.
The architecture of the GenAgent framework. Multiple agents collaborate to generate workflows in a
step-by-step manner. The PlanAgent receives the task instruction and generates high-level plans and
action decisions at every step. Different actions are then handled by the CombineAgent, AdaptAgent,
and RetrieveAgent, respectively. The agents are equipped with memory, which consists of history,
reference, and workspace. The RefineAgent is responsible for debugging if needed. Once the PlanAgent
decides to finish the generation process, the workflow will be submitted to the interpreter for
execution.
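The step-by-step loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the GenAgent implementation: the planner is a fixed rule rather than an LLM call, the three action functions are simple stand-ins for RetrieveAgent, CombineAgent, and AdaptAgent, the RefineAgent debugging step is omitted, and names such as `generate_workflow` and `PLACEHOLDER` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    history: list = field(default_factory=list)    # past action decisions
    reference: list = field(default_factory=list)  # retrieved external knowledge
    workspace: str = ""                            # workflow code under construction


def plan(task, memory):
    """Stand-in for PlanAgent: pick the next action from the current memory."""
    if not memory.reference:
        return "retrieve"
    if not memory.workspace:
        return "combine"
    if task not in memory.workspace:
        return "adapt"
    return "finish"


def retrieve(task, memory):
    # Assumption: retrieval returns a reference snippet with a placeholder prompt.
    memory.reference.append("image = generate(prompt='PLACEHOLDER')")


def combine(task, memory):
    # Assumption: combining merges the references into the workspace.
    memory.workspace = "\n".join(memory.reference)


def adapt(task, memory):
    # Assumption: adapting fills in task-specific parameters.
    memory.workspace = memory.workspace.replace("PLACEHOLDER", task)


ACTIONS = {"retrieve": retrieve, "combine": combine, "adapt": adapt}


def generate_workflow(task, max_steps=10):
    """Plan, act, and update memory until the planner decides to finish."""
    memory = Memory()
    for _ in range(max_steps):
        action = plan(task, memory)
        memory.history.append(action)
        if action == "finish":
            break
        ACTIONS[action](task, memory)
    return memory


result = generate_workflow("a cat")
print(result.history)
print(result.workspace)
```

Every action only reads and modifies the shared memory, which keeps the planner independent of how each action is carried out.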
Benchmark Evaluation
We implement GenAgent on the ComfyUI platform as a proof of concept. ComfyUI uses workflows to describe
the generation pipelines, supporting various models and tools, making it possible to solve a wide range
of generation tasks. A typical ComfyUI workflow consists of tens of nodes and links, which are connected
to form a complex DAG. We propose a benchmark, OpenComfy, which contains 20 different tasks of various
types. We provide complete documentation for every node and a set of examples containing 12 basic
workflows with manual annotations, so that agents can learn from this external knowledge. We compare
GenAgent with 4 baseline agents: Zero-shot Agent, Few-shot Agent,
CoT Agent, and RAG Agent.
The evaluation results on the OpenComfy benchmark. Two types of pass rates are reported for both
run-level and task-level evaluations. We compare GenAgent with zero-shot, few-shot, CoT, and RAG
agents. The best results are highlighted in bold.
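As a rough illustration of how run-level and task-level rates can differ, the sketch below aggregates the same run records in two ways. The aggregation rules are assumptions made for this example: the run-level rate is taken as the fraction of individual runs that pass, and a task is counted as passed when at least one of its runs passes. The records are made-up placeholders, not benchmark results.

```python
from collections import defaultdict

# Hypothetical run records: (task id, whether that run passed).
runs = [
    ("t1", True), ("t1", False),
    ("t2", False), ("t2", False),
    ("t3", True), ("t3", True),
]


def run_level_pass_rate(records):
    """Fraction of individual runs that passed."""
    return sum(passed for _, passed in records) / len(records)


def task_level_pass_rate(records):
    """Fraction of tasks with at least one passing run (assumed rule)."""
    by_task = defaultdict(list)
    for task, passed in records:
        by_task[task].append(passed)
    return sum(any(results) for results in by_task.values()) / len(by_task)


print(run_level_pass_rate(runs))
print(task_level_pass_rate(runs))
```

Under these assumptions the task-level rate is never lower than the run-level rate when every task has the same number of runs, since one successful run is enough for a task to count.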
Generation Example
We present the generation results of two different tasks selected from the OpenComfy benchmark, which
illustrate that GenAgent can generate complex ComfyUI workflows and complete various
generation tasks.
Example 1
Example overview: The task provides a photo of a girl playing the guitar and requires
generating an image of an old man in the forest, playing the guitar with the same pose as the girl. The
expected style and resolution are also specified.
Task requirement: You are given an image of a girl playing guitar in
`play_guitar.jpg`. Generate an image of an old man playing guitar in the forest with the same pose as
the girl. The result should be a realistic and detailed image with 1024x768 resolution.
Generated workflow: The generated workflow consists of 13 nodes, involving a pose
estimator and a ControlNet model to inject the pose information as conditions. You can see the generated
workflow in the embedding below.
Generation result: The image generated by the executed workflow is shown below.
Example 2
Example overview: The task requires generating an image of London following the style
of the given photo of Budapest and converting it into a video. Because the resolution and frame rate
achievable with a single model are limited, the task also involves upscaling and interpolation to form
a high-quality video.
Task requirement: You are given a photo of Budapest `budapest.jpg`. First generate an
image of London with the same style as the given image. Then turn it into a 2-second video with 512x512
resolution and 8 frames per second. Finally increase its resolution to 1024x1024 and frame rate to 24.
The result should be a high-quality video saved in gif format.
Generated workflow: The generated workflow consists of 22 nodes and complicated
connections, utilizing multiple models such as SVD, ESRGAN, and RIFE. You can see the generated workflow
in the embedding below.
Generation result: The video generated by the executed workflow is shown below.
BibTeX
@misc{xue2024genagentbuildcollaborativeai,
title={GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI},
author={Xiangyuan Xue and Zeyu Lu and Di Huang and Wanli Ouyang and Lei Bai},
year={2024},
eprint={2409.01392},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.01392},
}