GenAgent

Abstract

Much previous AI research has focused on developing monolithic models to maximize their intelligence and capability, with the primary goal of enhancing performance on specific tasks. In contrast, this paper explores an alternative approach: collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models. The core innovation of GenAgent lies in representing workflows with code, alongside constructing workflows with collaborative agents in a step-by-step manner. We implement GenAgent on the ComfyUI platform and propose a new benchmark, OpenComfy. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations, showing its capability to generate complex workflows with superior effectiveness and stability.

Representing Workflow with Code

Workflows are widely used across applications and can be represented in several ways: as a flow graph, as JSON, as an element list, or as code.

A flow graph is one of the most intuitive and user-friendly representations of a workflow DAG for humans, but not for LLMs.
JSON is a popular way for LLMs to represent structured information, but long JSON files are extremely difficult for them to process.
An element list is a natural representation for LLMs to grasp workflows, but it lacks semantic and topological information.
Code is a reasonable and effective representation for LLMs to understand and generate workflows.

Four representations of a workflow: flow graph, JSON, element list, and code. A flow graph is intuitive only to the human eye. JSON is structured but complex and redundant. An element list is more compact and closer to natural language but lacks semantic and topological information. Code is compact, Turing complete, semantically rich, and LLM-friendly, which makes it suitable for describing workflows.
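As a rough illustration of the difference between the JSON and code representations, the sketch below renders a small JSON workflow as one call statement per node. The three-node fragment is hypothetical and modeled on ComfyUI's API-style JSON, where a link is written as a `[source_node_id, output_index]` pair. The `node_<id>` naming and the call syntax are assumptions made for this example, not the exact code format GenAgent uses.

```python
import json

# Hypothetical three-node fragment in ComfyUI's API-style JSON.
# Each node has a class_type and inputs; a link is [source_node_id, output_index].
WORKFLOW_JSON = """
{
  "1": {"class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "model.safetensors"}},
  "2": {"class_type": "CLIPTextEncode",
        "inputs": {"text": "an old man playing guitar", "clip": ["1", 1]}},
  "3": {"class_type": "EmptyLatentImage",
        "inputs": {"width": 1024, "height": 768, "batch_size": 1}}
}
"""


def is_link(value):
    """A link is a two-element list: [source node id, output slot index]."""
    return (isinstance(value, list) and len(value) == 2
            and isinstance(value[0], str) and isinstance(value[1], int))


def to_code(workflow):
    """Render each node as one call statement; links become variable references."""
    lines = []
    emitted = set()

    def emit(node_id):
        if node_id in emitted:
            return
        emitted.add(node_id)
        args = []
        for name, value in workflow[node_id]["inputs"].items():
            if is_link(value):
                emit(value[0])  # define the producing node before its consumer
                args.append(f"{name}=node_{value[0]}[{value[1]}]")
            else:
                args.append(f"{name}={value!r}")
        class_type = workflow[node_id]["class_type"]
        lines.append(f"node_{node_id} = {class_type}({', '.join(args)})")

    for node_id in sorted(workflow, key=int):
        emit(node_id)
    return "\n".join(lines)


code = to_code(json.loads(WORKFLOW_JSON))
print(code)
# node_1 = CheckpointLoaderSimple(ckpt_name='model.safetensors')
# node_2 = CLIPTextEncode(text='an old man playing guitar', clip=node_1[1])
# node_3 = EmptyLatentImage(width=1024, height=768, batch_size=1)
```

In the rendered form each node takes one line and each link reads as an ordinary variable reference, so the topology is visible without cross-referencing node IDs.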

Building Workflow with GenAgent

We propose the GenAgent framework where the agents collaborate to complete the workflow generation task. GenAgent is mainly composed of three independent modules: Memory, PlanAgent, and Action.

Memory includes history, reference, and workspace, which store the agent’s recent behaviors, intermediate results, external reference knowledge, and internal reasoning.
PlanAgent is responsible for the global planning of workflows under the task instruction. At each step, PlanAgent generates a high-level plan with an action decision based on the current memory and task instruction.
Actions represent different activities that PlanAgent can select, and the goal of each action is to modify the current memory. Different actions are handled by different agents or modules.

The architecture of the GenAgent framework. Multiple agents collaborate to generate workflows in a step-by-step manner. The PlanAgent receives the task instruction and generates high-level plans and action decisions at every step. Different actions are then handled by the CombineAgent, AdaptAgent, and RetrieveAgent, respectively. The agents are equipped with memory, which consists of history, reference, and workspace. The RefineAgent is responsible for debugging if needed. Once the PlanAgent decides to finish the generation process, the workflow will be submitted to the interpreter for execution.
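The step-by-step loop described above can be sketched roughly as follows. This is a minimal illustration, not the GenAgent implementation: the `plan` function is a fixed rule standing in for the LLM-backed PlanAgent, the three action functions are placeholders for the RetrieveAgent, CombineAgent, and AdaptAgent, the RefineAgent and the interpreter are omitted, and the contents stored in each memory field are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    history: list = field(default_factory=list)    # names of actions taken so far
    reference: list = field(default_factory=list)  # retrieved external knowledge
    workspace: str = ""                            # assumed: workflow code in progress


def plan(task, memory):
    """Stand-in for PlanAgent: a fixed rule replaces the LLM call."""
    if not memory.reference:
        return "retrieve"
    if not memory.workspace:
        return "combine"
    if "adapt" not in memory.history:
        return "adapt"
    return "finish"


def retrieve(task, memory):
    """Placeholder for RetrieveAgent: fetch a relevant example workflow."""
    memory.reference.append("example: text_to_image workflow")


def combine(task, memory):
    """Placeholder for CombineAgent: merge a reference into the workspace."""
    memory.workspace = "# code merged from: " + memory.reference[-1]


def adapt(task, memory):
    """Placeholder for AdaptAgent: adjust the workspace to the task."""
    memory.workspace += f"\n# parameters adapted for: {task}"


ACTIONS = {"retrieve": retrieve, "combine": combine, "adapt": adapt}


def generate(task, max_steps=10):
    """Plan, act, and update memory until the planner decides to finish."""
    memory = Memory()
    for _ in range(max_steps):
        action = plan(task, memory)
        if action == "finish":
            break
        ACTIONS[action](task, memory)  # every action only modifies memory
        memory.history.append(action)
    return memory


memory = generate("old man playing guitar, 1024x768")
print(memory.history)  # ['retrieve', 'combine', 'adapt']
```

Keeping every action as a function that only mutates memory mirrors the statement that the goal of each action is to modify the current memory; the planner never edits the workflow directly.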

Benchmark Evaluation

We implement GenAgent on the ComfyUI platform as a proof of concept. ComfyUI uses workflows to describe generation pipelines and supports various models and tools, which makes it possible to solve a wide range of generation tasks. A typical ComfyUI workflow consists of tens of nodes and links, which are connected to form a complex DAG. We propose a benchmark, OpenComfy, which contains 20 different tasks of various types. We provide complete documentation for every node and a set of examples containing 12 basic workflows with manual annotations, so that agents can learn from this external knowledge. We compare GenAgent with 4 baseline agents: Zero-shot Agent, Few-shot Agent, CoT Agent, and RAG Agent.

The evaluation results on the OpenComfy benchmark. Two types of pass rates of both run-level and task-level evaluations are reported. We compare GenAgent with zero-shot, few-shot, CoT, and RAG agents. The best results are highlighted in bold.
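This page does not define how the run-level and task-level evaluations are aggregated. One plausible reading, stated here purely as an assumption, is that a run-level rate averages over every individual run, while a task-level rate counts a task as passed if at least one of its runs passes. The sketch below computes both under that assumption; the task names and outcomes are made up for illustration and are not benchmark results.

```python
def run_level_pass_rate(results):
    """Fraction of individual runs that passed, pooled over all tasks."""
    runs = [ok for task_runs in results.values() for ok in task_runs]
    return sum(runs) / len(runs)


def task_level_pass_rate(results):
    """Fraction of tasks with at least one passing run (assumed definition)."""
    return sum(any(task_runs) for task_runs in results.values()) / len(results)


# Made-up outcomes: four hypothetical tasks with three runs each.
results = {
    "task_a": [True, False, True],
    "task_b": [False, False, False],
    "task_c": [True, True, True],
    "task_d": [False, True, False],
}

run_rate = run_level_pass_rate(results)
task_rate = task_level_pass_rate(results)
print(run_rate, task_rate)  # 0.5 0.75
```

Under this reading the task-level rate can never be lower than the run-level rate, since a single successful run is enough for a task to count.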

Generation Example

We present the generation results for two tasks selected from the OpenComfy benchmark. They show that GenAgent can generate complex ComfyUI workflows and complete a variety of generation tasks.

Example 1

Example overview: The task provides a photo of a girl playing the guitar and requires generating an image of an old man in the forest, playing the guitar with the same pose as the girl. The expected style and resolution are also specified.

Task requirement: You are given an image of a girl playing guitar in `play_guitar.jpg`. Generate an image of an old man playing guitar in the forest with the same pose as the girl. The result should be a realistic and detailed image with 1024x768 resolution.

Generated workflow: The generated workflow consists of 13 nodes, involving a pose estimator and a ControlNet model to inject the pose information as conditions. You can see the generated workflow in the embedding below.

Generation result: The image generated by the executed workflow is shown below.

Example 2

Example overview: The task requires generating an image of London in the style of the given photo of Budapest and converting it into a video. Because the resolution and frame rate that a single model can produce are limited, the task also involves upscaling and interpolation to form a high-quality video.

Task requirement: You are given a photo of Budapest `budapest.jpg`. First generate an image of London with the same style as the given image. Then turn it into a 2-second video with 512x512 resolution and 8 frames per second. Finally increase its resolution to 1024x1024 and frame rate to 24. The result should be a high-quality video saved in gif format.

Generated workflow: The generated workflow consists of 22 nodes and complicated connections, utilizing multiple models such as SVD, ESRGAN, and RIFE. You can see the generated workflow in the embedding below.
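The numbers in the task requirement fix how much work the upscaling and interpolation stages have to do. The short calculation below derives the frame counts and scale factors from the stated duration, resolutions, and frame rates. The helper `integer_factor` is introduced only for this example, and the calculation says nothing about how the individual models in the workflow are configured.

```python
def integer_factor(target, base):
    """Return target / base as an integer, or fail if it does not divide evenly."""
    if target % base != 0:
        raise ValueError(f"{target} is not an integer multiple of {base}")
    return target // base


duration_s = 2                       # video length from the task requirement
base_fps, target_fps = 8, 24         # frame rate before and after interpolation
base_size, target_size = 512, 1024   # side length before and after upscaling

base_frames = duration_s * base_fps      # frames produced by the video model
target_frames = duration_s * target_fps  # frames needed in the final video
interp_factor = integer_factor(target_fps, base_fps)
upscale_factor = integer_factor(target_size, base_size)

print(base_frames, target_frames, interp_factor, upscale_factor)  # 16 48 3 2
```

So the initial video has 16 frames, the final one needs 48, the frame rate must triple, and each side of the frame must double.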

Generation result: The video generated by the executed workflow is shown below.

BibTeX

@misc{xue2024genagentbuildcollaborativeai,
  title={GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI},
  author={Xiangyuan Xue and Zeyu Lu and Di Huang and Wanli Ouyang and Lei Bai},
  year={2024},
  eprint={2409.01392},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.01392},
}
