OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

1Nanjing University 2SenseTime 3Nanyang Technological University
4Shanghai AI Laboratory 5The University of Hong Kong 6Xi'an Jiaotong University

OpenMobile DataViewer

We provide an online website to visualize OpenMobile's synthesized trajectories. Give it a try!

Abstract

Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) a scalable task synthesis pipeline that constructs a global environment memory from exploration and then leverages it to generate diverse, grounded instructions; and (2) a policy-switching strategy for trajectory rollout that alternates between learner and expert models, capturing essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting.

Why OpenMobile?

Recent industrial mobile agents approach 70% success on AndroidWorld, while open-data models remain around 30%, leaving the open-source community unable to train models that perform competitively on dynamic mobile agent benchmarks. OpenMobile is designed to close this gap with an open recipe for task synthesis and trajectory rollout.

We also conduct transparent analyses on the overlap between synthesized instructions and benchmark test instructions to clarify community concerns about data leakage. The results show that OpenMobile's gains do not depend on a few test-similar instructions; instead, they are driven by broad functionality coverage and improvement in generalizable agentic capability.

OpenMobile main results figure

Performance comparison across AndroidWorld, AndroidLab, and MobileWorld, together with data scaling and error-correction analysis.

Framework Overview

OpenMobile framework overview

OpenMobile first builds a global memory of app functionalities from exploration, then synthesizes grounded instructions and rollout trajectories with policy switching.

Scalable task synthesis. Instead of generating instructions from a single local trajectory, OpenMobile first explores the environment to build a global environment memory, then retrieves short-term and long-term memory to compose diverse and grounded tasks.
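The memory-based composition step can be sketched as follows. This is a toy sketch with a dict-based memory: the app names and functionality strings are illustrative, and the real pipeline composes instructions with a vision-language model over explored screens rather than string templates.

```python
import itertools

# Illustrative global environment memory: app -> functionalities
# discovered during exploration (all entries here are hypothetical).
GLOBAL_MEMORY = {
    "Clock": ["set an alarm for 7 am", "start a 5-minute timer"],
    "Notes": ["create a note titled 'groceries'", "search notes for 'rent'"],
}

def synthesize_tasks(memory):
    """Compose grounded instructions from functionalities that the
    exploration phase actually observed, both within a single app and
    across apps."""
    entries = [(app, f) for app, funcs in memory.items() for f in funcs]
    # Single-app tasks grounded in per-app (short-term) memory.
    tasks = [f"In {app}, {func}." for app, func in entries]
    # Cross-app compositions drawn from the global (long-term) memory.
    for (a1, f1), (a2, f2) in itertools.combinations(entries, 2):
        if a1 != a2:
            tasks.append(f"First {f1} in {a1}, then {f2} in {a2}.")
    return tasks
```

Single-app tasks stay grounded because every functionality was observed during exploration; cross-app compositions add the long-horizon diversity that purely local, single-trajectory synthesis misses.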

Policy-switching rollout. Rather than relying on ideal expert trajectories, the rollout alternates between learner and expert models. Error-intervention switching introduces corrective signals that are crucial for real-world error recovery.
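A minimal sketch of the error-intervention loop, assuming stand-in learner/expert behaviors (the real rollout runs VLM policies in an Android emulator, with errors detected by the expert rather than sampled):

```python
import random

def rollout_with_switching(num_steps=8, learner_error_rate=0.4, seed=0):
    """Toy policy-switching rollout: the learner acts at each step; when
    its action is judged wrong, the expert intervenes with a corrective
    action, and control returns to the learner at the next step."""
    rng = random.Random(seed)
    trajectory = []
    for state in range(num_steps):
        if rng.random() < learner_error_rate:  # learner makes a mistake
            trajectory.append((state, "learner_error"))
            trajectory.append((state, "expert_correction"))
        else:
            trajectory.append((state, "learner_action"))
    return trajectory
```

Unlike pure expert rollouts, the resulting trajectories contain (error, correction) pairs, which is exactly the error-recovery signal that standard imitation learning data lacks.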

OpenMobile Dataset

We instantiate OpenMobile on the AndroidWorld emulator infrastructure. The resulting dataset covers 20 Android apps and provides both high-level task instructions and agent execution trajectories.

- 2.8K synthesized instructions
- 34K action steps
- 20 Android apps
- 12.2 avg. steps per trajectory

Main Results

Models fine-tuned on OpenMobile data substantially outperform open-data baselines across AndroidWorld, AndroidLab, and MobileWorld. Notably, our Qwen2.5-VL-7B and Qwen3-VL-8B variants reach 51.7% and 64.7% Pass@1 on AndroidWorld, while the 8B model also achieves 51.5% on AndroidLab and 17.7% on the more challenging MobileWorld benchmark.

These gains transfer beyond the data collection environment, showing that the synthesized tasks and rollout trajectories provide strong generalization to unseen apps and long-horizon workflows.

OpenMobile main results table

Main results on AndroidWorld, AndroidLab, and MobileWorld. OpenMobile outperforms open-data baselines by a large margin and remains competitive with leading closed-data systems.

What Makes It Work?

We analyze OpenMobile data from two complementary angles. First, we examine potential benchmark overlap, since our data is synthesized in the AndroidWorld environment. Although OpenMobile instructions are naturally more similar to the test set, only a small fraction exhibit high similarity, suggesting moderate relevance rather than task-level duplication. Removing a small portion of the most test-similar samples causes only a marginal performance drop, showing that our gains do not rely on a few benchmark-like examples. The full list of similar instruction pairs is provided in the paper.
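One simple way to run such an overlap check is shown below; this sketch uses word-level Jaccard similarity, which is an assumption for illustration (the paper's exact similarity metric may differ).

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two instructions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def flag_test_similar(synthetic, test, threshold=0.6):
    """Return synthesized instructions whose best match against any test
    instruction exceeds the threshold (candidates for removal in the
    ablation)."""
    return [
        s for s in synthetic
        if max(jaccard(s, t) for t in test) >= threshold
    ]
```

Flagged instructions can then be removed before training to measure how much performance actually depends on test-similar samples.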

Second, we study functionality coverage to understand what truly drives downstream performance. As synthesized instructions scale, OpenMobile consistently covers more of the atomic functionalities required by benchmark tasks than a coupled baseline. Tasks with higher functionality coverage are more likely to be solved successfully, while tasks involving more required functionalities remain harder. Together, these analyses show that OpenMobile works not because of benchmark overfitting, but because its decoupled synthesis pipeline provides broad and compositional coverage of app capabilities.
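The coverage statistic can be made concrete as follows; this sketch assumes benchmark tasks are annotated with sets of atomic functionality IDs, which is an illustrative assumption about the annotation scheme.

```python
def functionality_coverage(required, covered):
    """Fraction of a benchmark task's required atomic functionalities
    that appear somewhere in the synthetic training data.

    required: set of functionality IDs the benchmark task needs.
    covered: set of functionality IDs present in the synthetic data.
    """
    if not required:
        return 1.0  # a task with no requirements is trivially covered
    return len(required & covered) / len(required)
```

Aggregating this ratio over benchmark tasks, as the instruction count scales, yields the coverage curves reported in the analysis.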

Overlap analysis

Overlap analysis. OpenMobile instructions are grounded in the benchmark environment, but only a small fraction are highly similar to AndroidWorld test instructions. Removing a small percentage of the most similar training instructions leads to only a marginal performance drop, mitigating concerns about benchmark data leakage.

Functionality coverage analysis

Functionality coverage analysis. OpenMobile consistently achieves higher coverage of benchmark-required functionalities as instruction count increases. Tasks that are simpler and have higher functionality coverage by synthetic data achieve higher downstream success rates.

BibTeX

TBD

Acknowledgement

This website is adapted from LLaVA-VL and Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.