Summary of Agent S: An Open Agentic Framework that Uses Computers Like a Human

The paper introduces Agent S, an open agentic framework that lets an agent operate a computer autonomously through its Graphical User Interface (GUI). The framework aims to automate complex, multi-step tasks and addresses three key challenges: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.

Key Concepts and Ideas

Experience-Augmented Hierarchical Planning:
- Agent S employs a novel planning method that leverages both external web knowledge and internal experience retrieval. This approach decomposes complex tasks into manageable subtasks, facilitating efficient task planning and execution.
- The framework draws on Online Web Knowledge for up-to-date information about specific applications, and on Narrative Memory, which stores high-level, abstractive task experiences from past interactions.
- During task execution, the agent retrieves detailed, step-by-step subtask experiences from Episodic Memory to refine actions and improve planning continuously.
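
The planning loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names Memory, search_web, plan_subtasks, and run_task are hypothetical, retrieval is reduced to an exact-match dictionary lookup, and the planner and executor are stubs standing in for MLLM calls.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Memory:
    """Toy memory store (assumption: exact-match keys instead of real retrieval)."""
    entries: dict[str, str] = field(default_factory=dict)

    def retrieve(self, query: str) -> str:
        # Returns an empty string when no experience is stored for the query.
        return self.entries.get(query, "")

    def store(self, key: str, experience: str) -> None:
        self.entries[key] = experience


def search_web(task: str) -> str:
    # Stub for Online Web Knowledge; a real agent would query a search engine.
    return f"web notes for: {task}"


def plan_subtasks(task: str, web_knowledge: str, narrative: str) -> list[str]:
    # Stub for the planner; a real agent would prompt an MLLM with both contexts.
    return [f"{task} / step {i}" for i in (1, 2)]


def run_task(task: str, narrative: Memory, episodic: Memory) -> list[str]:
    # 1. Gather external knowledge and high-level past experience.
    web_knowledge = search_web(task)
    past_summary = narrative.retrieve(task)
    # 2. Decompose the task into subtasks.
    subtasks = plan_subtasks(task, web_knowledge, past_summary)
    log = []
    for subtask in subtasks:
        # 3. Retrieve step-by-step experience for this subtask, if any.
        hint = episodic.retrieve(subtask)
        outcome = f"done: {subtask}" + (f" (reused: {hint})" if hint else "")
        log.append(outcome)
        # 4. Write the new subtask experience back to Episodic Memory.
        episodic.store(subtask, outcome)
    # 5. Summarize the whole task into Narrative Memory.
    narrative.store(task, f"completed in {len(subtasks)} subtasks")
    return log
```

Running run_task twice on the same task with the same two Memory objects shows the intended effect: the second run finds stored experience in both memories, whereas the first run starts from empty ones.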
Agent-Computer Interface (ACI):
- Agent S introduces a language-centric ACI to enhance the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs).
- The ACI employs a dual-input strategy using visual input and an image-augmented accessibility tree for precise element grounding.
- It defines a bounded action space of language-based primitives, such as click(element_id), which are conducive to MLLM common-sense reasoning and generate environment transitions at the right temporal resolution.
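
A bounded, language-based action space of this kind might look like the sketch below. Only click(element_id) appears in the summary above; the Element fields, the type_text primitive, the observe format, and the low-level command dictionaries are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One node of a (hypothetical) accessibility tree."""
    element_id: int
    role: str
    name: str
    center: tuple[int, int]  # screen coordinates in pixels


class ACI:
    """Toy Agent-Computer Interface: the agent names elements by ID, not by pixel."""

    def __init__(self, tree: list[Element]) -> None:
        self._by_id = {e.element_id: e for e in tree}

    def observe(self) -> str:
        # Text view of the tree that could be placed in an MLLM prompt.
        return "\n".join(
            f"{e.element_id}\t{e.role}\t{e.name}" for e in self._by_id.values()
        )

    def _resolve(self, element_id: int) -> Element:
        # Grounding step: reject IDs that are not in the current tree.
        if element_id not in self._by_id:
            raise ValueError(f"unknown element_id: {element_id}")
        return self._by_id[element_id]

    def click(self, element_id: int) -> dict:
        e = self._resolve(element_id)
        return {"op": "mouse_click", "x": e.center[0], "y": e.center[1]}

    def type_text(self, element_id: int, text: str) -> list[dict]:
        # Hypothetical primitive: focus the element, then send keystrokes.
        return [self.click(element_id), {"op": "keyboard_type", "text": text}]
```

Because every primitive goes through _resolve, the agent can only act on elements that exist in the current observation, which is one way to keep the action space bounded.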

Evaluation and Findings

Performance on OSWorld Benchmark:
- Agent S outperforms the baseline by 9.37 percentage points in success rate, an 83.6% relative improvement, and achieves a new state of the art.
- The framework demonstrates consistent improvements across five broad computer task categories.
Generalizability on WindowsAgentArena:
- Agent S shows a performance improvement from 13.3% to 18.2% on an equivalent setup without explicit adaptation, highlighting its broad generalizability to different operating systems.
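
The absolute and relative figures above are related by simple arithmetic, which the short calculation below works through. The derived values (an implied OSWorld baseline and a relative gain on WindowsAgentArena) are computed from the numbers in this summary and are not quoted from the paper; the function names are illustrative.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain in percent, given two success rates in percent."""
    return (improved - baseline) / baseline * 100


def implied_baseline(absolute_gain: float, relative_gain_pct: float) -> float:
    """Baseline success rate implied by an absolute and a relative gain."""
    return absolute_gain / (relative_gain_pct / 100)


# OSWorld: a 9.37-point gain that is also an 83.6% relative gain
# implies a baseline success rate of roughly 11.2%.
osworld_baseline = implied_baseline(9.37, 83.6)
osworld_agent_s = osworld_baseline + 9.37  # roughly 20.6%

# WindowsAgentArena: going from 13.3% to 18.2% is a gain of about
# 4.9 points, or roughly 36.8% relative to the starting value.
waa_relative = relative_improvement(13.3, 18.2)

print(f"OSWorld implied baseline: {osworld_baseline:.2f}%")
print(f"OSWorld implied Agent S:  {osworld_agent_s:.2f}%")
print(f"WindowsAgentArena relative gain: {waa_relative:.1f}%")
```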

Contributions

Introduction of Agent S:
- A new agentic framework integrating experience-augmented hierarchical planning, self-supervised continual memory update, and an Agent-Computer Interface for MLLM-based GUI agents.
Experience-Augmented Hierarchical Planning:
- A method that uses experience from external web knowledge and the agent’s internal memory to decompose complex tasks into executable subtasks.
Extension of ACI to GUI Agents:
- An extension of the ACI concept to GUI agents that lets MLLM-based agents operate computers more precisely through a set of high-level, predefined primitive actions.
Extensive Experiments:
- Experiments on OSWorld show the effectiveness of the individual components of Agent S and establish new state-of-the-art results in automating computer tasks.
- Results on WindowsAgentArena demonstrate generalizability across different operating systems.

Source(s):

arXiv:2410.08164v1
