
CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

A unified ecosystem of expert video demonstrations and dense annotations for training and evaluating desktop computer-use agents across 87 professional applications.

*Equal Contribution. 1 ServiceNow, 2 University of Waterloo, 3 Mila, 4 Université de Montréal, 5 McGill University, 6 University of Oxford, 7 National University of Singapore

Overview

CUA-Suite unifies three complementary resources into a single ecosystem for full-stack computer-use intelligence.


CUA-Suite Overview. Human GUI trajectories are recorded across desktop platforms, expert-verified, and annotated with keyframes, OCR-enhanced bounding boxes, and interaction logs. The resulting CUA-Suite comprises UI-Vision, a comprehensive benchmark; GroundCUA, densely labeled UI screenshots with 3.5M annotations; and VideoCUA, 55 hours of expert video with detailed action trajectories.

87 applications across 12 categories
~10K expert-designed tasks
55 hours of continuous 30 fps video
6M frames with full temporal dynamics
56K screenshots with dense annotations
5M+ human-verified UI elements
GroundCUA

A large-scale, pixel-precise UI grounding dataset built entirely by human curators. It powers the training of the GroundNext 3B/7B vision-language models, which achieve state-of-the-art desktop grounding performance.

56K screenshots, 5M+ elements, 700K tuning examples
UI-Vision

A rigorous, desktop-centric benchmark evaluating element grounding, layout understanding, and action prediction across diverse professional applications.

450 demonstrations, 8.2K+ queries, 3 evaluation tasks

Data Collection Pipeline

Expert human behavior is captured as continuous 30 fps video across 87 professional desktop applications and enriched with dense, multi-faceted annotations.

1

Selecting Diverse Applications

87 open-source applications across 12 categories, from development (VS Code, Blender) to productivity (LibreOffice, GnuCash), mirroring popular commercial software.

2

Expert-Driven Task Design

Human experts design over 10,000 real-world tasks ranging from simple actions to complex multi-step workflows, ensuring coherent, goal-oriented demonstrations.

3

Recording High-Fidelity Video

Continuous screen recording at 30 fps yields ~55 hours of video and 6 million frames, with every mouse click, drag, scroll, and keystroke logged with millisecond precision (a hypothetical record layout for these logs is sketched after step 4).

4

Dense UI Annotation

Human annotators label every visible UI element with bounding boxes, textual labels, OCR text, and semantic categories, producing 5M+ element annotations across 56K screenshots.
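To make the annotation layers from step 4 concrete, a single annotated screenshot might be represented as a record like the sketch below. The JSON layout and field names (screenshot, elements, bbox, ocr_text, category) are illustrative assumptions for this page, not the released format.

# Minimal sketch of one densely annotated screenshot, assuming a JSON layout
# with bounding boxes, textual labels, OCR text, and semantic categories.
# All field names and the example path are hypothetical.
example_record = {
    "screenshot": "gimp/task_0132/frame_004512.png",   # hypothetical path
    "application": "GIMP",
    "elements": [
        {
            "bbox": [412, 88, 463, 112],        # [x_min, y_min, x_max, y_max] in pixels
            "label": "Opacity slider",
            "ocr_text": "Opacity 100.0",
            "category": "slider",
        },
    ],
}

def to_xywh(bbox):
    """Convert [x_min, y_min, x_max, y_max] to [x, y, width, height]."""
    x0, y0, x1, y1 = bbox
    return [x0, y0, x1 - x0, y1 - y0]

print(to_xywh(example_record["elements"][0]["bbox"]))   # [412, 88, 51, 24]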
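The interaction logs from step 3 can similarly be pictured as timestamped event records that map directly onto video frames. The sketch below is hypothetical (field names such as timestamp_ms and event_type are not the released schema) and only illustrates how millisecond timestamps line up with 30 fps frames.

# Illustrative sketch of a per-event interaction-log record; the schema is an
# assumption for this page, not the released format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class InputEvent:
    timestamp_ms: int                 # millisecond-precision time since recording start
    event_type: str                   # "click", "drag", "scroll", or "keystroke"
    x: Optional[int] = None           # cursor position in screen pixels (mouse events)
    y: Optional[int] = None
    button: Optional[str] = None      # "left", "right", "middle" for clicks/drags
    key: Optional[str] = None         # key identifier for keystrokes
    scroll_dy: Optional[int] = None   # scroll delta for scroll events

def frame_index(event: InputEvent, fps: int = 30) -> int:
    """Map a logged event onto the corresponding video frame."""
    return event.timestamp_ms * fps // 1000

# Example: a click logged 2.5 s into a recording lands on frame 75.
assert frame_index(InputEvent(timestamp_ms=2500, event_type="click", x=640, y=360)) == 75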


Distribution of annotations across application categories in the CUA-Suite ecosystem.

VideoCUA

The largest open expert video corpus for desktop computer use.

55 hours of continuous 30 fps video
6M frames with full temporal dynamics
2.5× larger than the largest existing dataset

Unlike sparse screenshot datasets, these continuous video streams preserve the full temporal dynamics of human interaction and can be losslessly transformed into any agent framework format.
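As a minimal illustration of such a transformation, a recording and its event log can be sliced into per-step (screenshot, action) pairs for a screenshot-based agent framework. The sketch below assumes events are dicts with a timestamp_ms key, as in the hypothetical log format from the pipeline section, and uses OpenCV only for frame extraction; it is a sketch under those assumptions, not the released conversion tooling.

# Sketch: convert a continuous recording plus its event log into per-step
# (screenshot, action) pairs for a screenshot-based agent framework.
import cv2

def extract_steps(video_path, events, fps=30):
    """Yield (frame_before_action, event) pairs, one per logged input event."""
    cap = cv2.VideoCapture(video_path)
    steps = []
    for ev in sorted(events, key=lambda e: e["timestamp_ms"]):
        idx = ev["timestamp_ms"] * fps // 1000
        cap.set(cv2.CAP_PROP_POS_FRAMES, max(idx - 1, 0))  # frame just before the action
        ok, frame = cap.read()
        if ok:
            steps.append((frame, ev))
    cap.release()
    return steps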

Dataset Comparison

VideoCUA is the only large-scale, human-curated dataset providing continuous 30 fps expert video for professional desktop applications with multi-layered reasoning annotations.

Dataset | Platform | Tasks | #Envs | Video | Desktop | Human | CoT | Scale
Web
Mind2Web | Web | 2,350 | 137 |  |  |  |  | ~17K SS
AgentTrek | Web | 10,398 | 127 |  |  |  | short | ~126K SS
Mobile
AITW | Mobile | 715K | 357 |  |  | Mix. |  | ~4.6M SS
GUI-Odyssey | Mobile | 8,334 | 212 |  |  | Mix. |  | ~128K SS
Desktop & Cross-platform
OmniACT | D+W | 9,802 | 65 |  |  |  |  | ~9.8K SS
OSWorld | Desktop | 369 | 9 |  |  |  |  | Eval.
VideoGUI | Desktop | 178 | 11 |  |  | Mix. |  | ~7h
OpenCUA | Desktop | 22,625 | 330+ |  |  |  | long | ~421K SS
ScaleCUA | Cross | ~19K |  |  |  | Mix. |  | ~2M SS
VideoCUA (Ours) | Desktop | ~10K | 87 | ✓ | ✓ | ✓ | long | 55h (6M fr.)

SS = screenshots. Video = continuous video recordings (not per-step screenshots). CoT = chain-of-thought annotations (long = multi-layered, short = brief).

Application Frontiers

CUA-Suite's continuous video streams and dense annotations form a superset of information that supports emerging research directions beyond traditional action prediction.

Generalist Screen Parsing

Dense, human-verified bounding-box annotations covering all interactable elements—including canvas-based and custom-drawn widgets missed by DOM-based approaches. ScreenParse has already explored this on a subset of GroundCUA data.
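One way to turn these annotations into screen-parsing supervision is to serialize every labeled element into a structured text target that a model learns to predict from the raw screenshot. The sketch below reuses the illustrative annotation record from the pipeline section; the output format is our own assumption, not ScreenParse's recipe.

def serialize_elements(record):
    """Turn a densely annotated screenshot into a line-per-element parse target."""
    lines = []
    for el in record["elements"]:
        x0, y0, x1, y1 = el["bbox"]
        lines.append(f'{el["category"]} "{el["label"]}" @ ({x0},{y0},{x1},{y1})')
    return "\n".join(lines)

# e.g. produces: slider "Opacity slider" @ (412,88,463,112)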

Continuous Spatial Control

Intermediate cursor movements and complete video context preserve kinematic priors (e.g., Fitts's Law deceleration), enabling imitation learning or offline RL policies for feedback-driven navigation.
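As a concrete handle on these kinematic priors, Fitts's Law relates movement time to target distance D and width W, commonly MT = a + b * log2(D/W + 1). The sketch below computes the index of difficulty and a predicted movement time; the constants a and b are placeholder values that would need to be fit on recorded cursor trajectories, and the (timestamp_ms, x, y) sample format is an assumption.

import math

def index_of_difficulty(distance_px, target_width_px):
    """Shannon formulation of Fitts's index of difficulty, in bits."""
    return math.log2(distance_px / target_width_px + 1)

def predicted_movement_time(distance_px, target_width_px, a=0.1, b=0.15):
    """Fitts's Law: MT = a + b * ID (a, b are placeholder constants, in seconds)."""
    return a + b * index_of_difficulty(distance_px, target_width_px)

def observed_movement_time(cursor_samples):
    """Duration of a cursor trajectory given (timestamp_ms, x, y) samples."""
    return (cursor_samples[-1][0] - cursor_samples[0][0]) / 1000.0

# Example: a 600 px move onto a 40 px-wide button -> ID = log2(16) = 4 bits.
print(index_of_difficulty(600, 40))        # 4.0
print(predicted_movement_time(600, 40))    # 0.1 + 0.15 * 4 = 0.7 s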

Visual World Models

30 fps video recordings paired with timestamped actions provide dense (s_t, a_t, s_{t+1}) triplets for action-conditioned video generation and visual lookahead planning for desktop workflows.
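A minimal sketch of assembling such triplets, assuming the 30 fps frames have already been decoded into an indexable sequence and that each action is a dict with a millisecond timestamp (field names are illustrative):

def build_transition_triplets(frames, events, fps=30, horizon_frames=15):
    """Pair each action with the frame at which it occurs (s_t) and the frame
    `horizon_frames` later (s_{t+1}), yielding (s_t, a_t, s_{t+1}) triplets."""
    triplets = []
    for ev in events:
        t = ev["timestamp_ms"] * fps // 1000
        t_next = t + horizon_frames          # e.g. a 0.5 s lookahead at 30 fps
        if t_next < len(frames):
            triplets.append((frames[t], ev, frames[t_next]))
    return triplets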

Video-Based Reward Modeling

Continuous expert video recordings with task-level annotations provide positive demonstrations for training reward models, while dense step-level annotations enable fine-grained, step-wise reward signals.
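One hedged reading of this idea: expert step windows act as positives, while mismatched window/annotation combinations act as negatives for a step-wise reward model. The pairing scheme below is a speculative sketch, not a released training recipe; step_windows and its fields are hypothetical.

import random

def make_reward_pairs(step_windows, seed=0):
    """Build (positive, negative) pairs for a step-wise reward model.
    Each expert window (a short frame clip plus its step annotation) is paired
    with a negative formed by mismatching it with another step's annotation."""
    rng = random.Random(seed)
    pairs = []
    for i, window in enumerate(step_windows):
        j = rng.randrange(len(step_windows))
        while j == i and len(step_windows) > 1:
            j = rng.randrange(len(step_windows))
        negative = {"frames": window["frames"],
                    "annotation": step_windows[j]["annotation"]}  # mismatched label
        pairs.append((window, negative))
    return pairs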

Citation

If you find CUA-Suite useful for your research, please cite our work.

@inproceedings{
    jian2026cuasuite,
    title={{CUA}-Suite: Expert Trajectories and Pixel-Precise Grounding for Computer-use Agents},
    author={Xiangru Jian and Shravan Nayak and Kevin Qinghong Lin and Aarash Feizi and Kaixin Li and Patrice Bechard and Spandana Gella and Sai Rajeswar},
    booktitle={ICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving},
    year={2026},
    url={https://openreview.net/forum?id=IgTUGrZfMr}
}

CUA-Suite enables UI-Vision and GroundCUA. Please also cite the individual components if you use them.