
CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

A unified ecosystem of expert video demonstrations and dense annotations for training and evaluating desktop computer-use agents across 87 professional applications.

*Equal Contribution. 1 ServiceNow, 2 University of Waterloo, 3 Mila, 4 Université de Montréal, 5 McGill University, 6 University of Oxford, 7 National University of Singapore

Overview

CUA-Suite unifies three complementary resources into a single ecosystem for full-stack computer-use intelligence.


CUA-Suite Overview. Human GUI trajectories are recorded across desktop platforms, expert-verified, and annotated with keyframes, OCR-enhanced bounding boxes, and interaction logs. The resulting CUA-Suite comprises UI-Vision, a comprehensive benchmark; GroundCUA, densely labeled UI screenshots with 3.5M annotations; and VideoCUA, 55 hours of expert video with detailed action trajectories.

87 applications across 12 categories
~10K expert-designed tasks
55 hours of continuous 30 fps video
6M frames with full temporal dynamics
56K screenshots with dense annotations
5M+ human-verified UI elements
GroundCUA

A large-scale, pixel-precise UI grounding dataset built entirely by human curators. It powers the training of the GroundNext 3B/7B vision-language models, which achieve state-of-the-art desktop grounding performance.

56K screenshots, 5M+ elements, 700K tuning examples
UI-Vision

A rigorous, desktop-centric benchmark evaluating element grounding, layout understanding, and action prediction across diverse professional applications.

450 demonstrations, 8.2K+ queries, 3 evaluation tasks

Data Collection Pipeline

Expert human behavior is captured as continuous 30 fps video across 87 professional desktop applications and enriched with dense, multi-faceted annotations.

1

Selecting Diverse Applications

87 open-source applications across 12 categories, from development (VS Code, Blender) to productivity (LibreOffice, GnuCash), mirroring popular commercial software.

2

Expert-Driven Task Design

Human experts design over 10,000 real-world tasks ranging from simple actions to complex multi-step workflows, ensuring coherent, goal-oriented demonstrations.

3

Recording High-Fidelity Video

Continuous screen recording at 30 fps yields ~55 hours of video and 6 million frames, with every mouse click, drag, scroll, and keystroke logged with millisecond precision (a hypothetical record layout for these logs is sketched after step 4).

4

Dense UI Annotation

Human annotators label every visible UI element with bounding boxes, textual labels, OCR text, and semantic categories, producing 5M+ element annotations across 56K screenshots.
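To make the annotation layers from step 4 concrete, a single annotated screenshot might be represented as a record like the sketch below. The JSON layout and field names (screenshot, elements, bbox, ocr_text, category) are illustrative assumptions for this page, not the released format.

# Minimal sketch of one densely annotated screenshot, assuming a JSON layout
# with bounding boxes, textual labels, OCR text, and semantic categories.
# All field names and the example path are hypothetical.
example_record = {
    "screenshot": "gimp/task_0132/frame_004512.png",   # hypothetical path
    "application": "GIMP",
    "elements": [
        {
            "bbox": [412, 88, 463, 112],        # [x_min, y_min, x_max, y_max] in pixels
            "label": "Opacity slider",
            "ocr_text": "Opacity 100.0",
            "category": "slider",
        },
    ],
}

def to_xywh(bbox):
    """Convert [x_min, y_min, x_max, y_max] to [x, y, width, height]."""
    x0, y0, x1, y1 = bbox
    return [x0, y0, x1 - x0, y1 - y0]

print(to_xywh(example_record["elements"][0]["bbox"]))   # [412, 88, 51, 24]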
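The interaction logs from step 3 can similarly be pictured as timestamped event records that map directly onto video frames. The sketch below is hypothetical (field names such as timestamp_ms and event_type are not the released schema) and only illustrates how millisecond timestamps line up with 30 fps frames.

# Illustrative sketch of a per-event interaction-log record; the schema is an
# assumption for this page, not the released format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class InputEvent:
    timestamp_ms: int                 # millisecond-precision time since recording start
    event_type: str                   # "click", "drag", "scroll", or "keystroke"
    x: Optional[int] = None           # cursor position in screen pixels (mouse events)
    y: Optional[int] = None
    button: Optional[str] = None      # "left", "right", "middle" for clicks/drags
    key: Optional[str] = None         # key identifier for keystrokes
    scroll_dy: Optional[int] = None   # scroll delta for scroll events

def frame_index(event: InputEvent, fps: int = 30) -> int:
    """Map a logged event onto the corresponding video frame."""
    return event.timestamp_ms * fps // 1000

# Example: a click logged 2.5 s into a recording lands on frame 75.
assert frame_index(InputEvent(timestamp_ms=2500, event_type="click", x=640, y=360)) == 75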


Distribution of annotations across application categories in the CUA-Suite ecosystem.

VideoCUA

The largest open expert video corpus for desktop computer use.

55 hours of continuous 30 fps video
6M frames with full temporal dynamics
2.5× larger than the largest existing dataset

Unlike sparse screenshot datasets, these continuous video streams preserve the full temporal dynamics of human interaction and can be losslessly transformed into any agent framework format.
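As a minimal illustration of such a transformation, a recording and its event log can be sliced into per-step (screenshot, action) pairs for a screenshot-based agent framework. The sketch below assumes events are dicts with a timestamp_ms key, as in the hypothetical log format from the pipeline section, and uses OpenCV only for frame extraction; it is a sketch under those assumptions, not the released conversion tooling.

# Sketch: convert a continuous recording plus its event log into per-step
# (screenshot, action) pairs for a screenshot-based agent framework.
import cv2

def extract_steps(video_path, events, fps=30):
    """Yield (frame_before_action, event) pairs, one per logged input event."""
    cap = cv2.VideoCapture(video_path)
    steps = []
    for ev in sorted(events, key=lambda e: e["timestamp_ms"]):
        idx = ev["timestamp_ms"] * fps // 1000
        cap.set(cv2.CAP_PROP_POS_FRAMES, max(idx - 1, 0))  # frame just before the action
        ok, frame = cap.read()
        if ok:
            steps.append((frame, ev))
    cap.release()
    return steps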

Dataset Comparison

VideoCUA is the only large-scale, human-curated dataset providing continuous 30 fps expert video for professional desktop applications with multi-layered reasoning annotations.

Dataset | Platform | Tasks | #Envs | Video | Desktop | Human | CoT | Scale
Web
Mind2Web | Web | 2,350 | 137 |  |  |  |  | ~17K SS
AgentTrek | Web | 10,398 | 127 |  |  |  | short | ~126K SS
Mobile
AITW | Mobile | 715K | 357 |  |  | Mix. |  | ~4.6M SS
GUI-Odyssey | Mobile | 8,334 | 212 |  |  | Mix. |  | ~128K SS
Desktop & Cross-platform
OmniACT | D+W | 9,802 | 65 |  |  |  |  | ~9.8K SS
OSWorld | Desktop | 369 | 9 |  |  |  |  | Eval.
VideoGUI | Desktop | 178 | 11 |  |  | Mix. |  | ~7h
OpenCUA | Desktop | 22,625 | 330+ |  |  |  | long | ~421K SS
ScaleCUA | Cross | ~19K |  |  |  | Mix. |  | ~2M SS
VideoCUA (Ours) | Desktop | ~10K | 87 | ✓ | ✓ | ✓ | long | 55h (6M fr.)

SS = screenshots. Video = continuous video recordings (not per-step screenshots). CoT = chain-of-thought annotations (long = multi-layered, short = brief).

Application Frontiers

CUA-Suite's continuous video streams and dense annotations form a superset of information that supports emerging research directions beyond traditional action prediction.

Generalist Screen Parsing

Dense, human-verified bounding-box annotations covering all interactable elements—including canvas-based and custom-drawn widgets missed by DOM-based approaches. ScreenParse has already explored this on a subset of GroundCUA data.
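One way to turn these annotations into screen-parsing supervision is to serialize every labeled element into a structured text target that a model learns to predict from the raw screenshot. The sketch below reuses the illustrative annotation record from the pipeline section; the output format is our own assumption, not ScreenParse's recipe.

def serialize_elements(record):
    """Turn a densely annotated screenshot into a line-per-element parse target."""
    lines = []
    for el in record["elements"]:
        x0, y0, x1, y1 = el["bbox"]
        lines.append(f'{el["category"]} "{el["label"]}" @ ({x0},{y0},{x1},{y1})')
    return "\n".join(lines)

# e.g. produces: slider "Opacity slider" @ (412,88,463,112)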

Continuous Spatial Control

Intermediate cursor movements and complete video context preserve kinematic priors (e.g., Fitts's Law deceleration), enabling imitation learning or offline RL policies for feedback-driven navigation.
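As a concrete handle on these kinematic priors, Fitts's Law relates movement time to target distance D and width W, commonly MT = a + b * log2(D/W + 1). The sketch below computes the index of difficulty and a predicted movement time; the constants a and b are placeholder values that would need to be fit on recorded cursor trajectories, and the (timestamp_ms, x, y) sample format is an assumption.

import math

def index_of_difficulty(distance_px, target_width_px):
    """Shannon formulation of Fitts's index of difficulty, in bits."""
    return math.log2(distance_px / target_width_px + 1)

def predicted_movement_time(distance_px, target_width_px, a=0.1, b=0.15):
    """Fitts's Law: MT = a + b * ID (a, b are placeholder constants, in seconds)."""
    return a + b * index_of_difficulty(distance_px, target_width_px)

def observed_movement_time(cursor_samples):
    """Duration of a cursor trajectory given (timestamp_ms, x, y) samples."""
    return (cursor_samples[-1][0] - cursor_samples[0][0]) / 1000.0

# Example: a 600 px move onto a 40 px-wide button -> ID = log2(16) = 4 bits.
print(index_of_difficulty(600, 40))        # 4.0
print(predicted_movement_time(600, 40))    # 0.1 + 0.15 * 4 = 0.7 s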

Visual World Models

30 fps video recordings paired with timestamped actions provide dense (s_t, a_t, s_{t+1}) triplets for action-conditioned video generation and visual lookahead planning for desktop workflows.
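A minimal sketch of assembling such triplets, assuming the 30 fps frames have already been decoded into an indexable sequence and that each action is a dict with a millisecond timestamp (field names are illustrative):

def build_transition_triplets(frames, events, fps=30, horizon_frames=15):
    """Pair each action with the frame at which it occurs (s_t) and the frame
    `horizon_frames` later (s_{t+1}), yielding (s_t, a_t, s_{t+1}) triplets."""
    triplets = []
    for ev in events:
        t = ev["timestamp_ms"] * fps // 1000
        t_next = t + horizon_frames          # e.g. a 0.5 s lookahead at 30 fps
        if t_next < len(frames):
            triplets.append((frames[t], ev, frames[t_next]))
    return triplets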

Video-Based Reward Modeling

Continuous expert video recordings with task-level annotations provide positive demonstrations for training reward models, while dense step-level annotations enable fine-grained, step-wise reward signals.
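One hedged reading of this idea: expert step windows act as positives, while mismatched window/annotation combinations act as negatives for a step-wise reward model. The pairing scheme below is a speculative sketch, not a released training recipe; step_windows and its fields are hypothetical.

import random

def make_reward_pairs(step_windows, seed=0):
    """Build (positive, negative) pairs for a step-wise reward model.
    Each expert window (a short frame clip plus its step annotation) is paired
    with a negative formed by mismatching it with another step's annotation."""
    rng = random.Random(seed)
    pairs = []
    for i, window in enumerate(step_windows):
        j = rng.randrange(len(step_windows))
        while j == i and len(step_windows) > 1:
            j = rng.randrange(len(step_windows))
        negative = {"frames": window["frames"],
                    "annotation": step_windows[j]["annotation"]}  # mismatched label
        pairs.append((window, negative))
    return pairs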

Citation

If you find CUA-Suite useful for your research, please cite our work.

@inproceedings{
    jian2026cuasuite,
    title={{CUA}-Suite: Expert Trajectories and Pixel-Precise Grounding for Computer-use Agents},
    author={Xiangru Jian and Shravan Nayak and Kevin Qinghong Lin and Aarash Feizi and Kaixin Li and Patrice Bechard and Spandana Gella and Sai Rajeswar},
    booktitle={ICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving},
    year={2026},
    url={https://openreview.net/forum?id=IgTUGrZfMr}
}

CUA-Suite enables UI-Vision and GroundCUA. Please also cite the individual components if you use them.