TextEditBench: Evaluating Reasoning-aware
Text Editing Beyond Rendering

Rui Gui 1*, Yang Wan 1*, Haochen Han 2*, Dongxing Mao 1, Fangming Liu 2,
Min Li 1, Alex Jinpeng Wang 1✉
1Central South University   2Pengcheng Laboratory
*Equal contribution   ✉Corresponding author
TextEditBench teaser figure: red rectangles highlight the major regions of modification.

14 Topics · 6 Task Types · 1,196 Annotated Instances · 12 Models Evaluated

Abstract

Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention in large-scale diffusion and multimodal modeling. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence.

To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies.

We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's ability to reason about and maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.

🏆 Leaderboard & Analysis

Overall comparison of all 12 models on the Real-World dataset.

Model Source SSIM↑ LPIPS↓ PSNR↑ MSE↓
Qwen-Image-Edit Open Source 0.901 0.039 28.22 278.5
NanoBanana Closed Source 0.904 0.036 29.88 160.3
Seedream Closed Source 0.775 0.136 20.33 1121.8
Step1X-Edit-Think Open Source 0.899 0.067 27.01 584.2
FLUX.1-Kontext Open Source 0.906 0.056 28.29 522.1
Bagel (512) Open Source 0.873 0.077 24.42 726.4
Bagel-Think (512) Open Source 0.896 0.057 25.52 459.2
Emu3.5 (512) Open Source 0.769 0.092 19.32 1131.3
Step1X-Edit Open Source 0.879 0.089 25.52 982.5
OmniGen2 Open Source 0.836 0.129 21.38 2046.6
InstructPix2Pix Open Source 0.768 0.187 17.77 2718.5
MagicBrush Open Source 0.788 0.162 19.29 3456.7
Note: If you would like to submit your results or have any questions, please contact us at 8212231014@csu.edu.cn.
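For reference, the fidelity metrics in the table can be reproduced roughly as follows. This is a minimal sketch using scikit-image, PyTorch, and the lpips package; the exact preprocessing (image resolution, color range, and the assumption that MSE is computed on the 0-255 scale) may differ from the benchmark's official evaluation script.

import numpy as np
import torch
import lpips
from PIL import Image
from skimage.metrics import structural_similarity, peak_signal_noise_ratio, mean_squared_error

def load_rgb(path):
    # Load an image as a float32 RGB array in [0, 1].
    return np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0

# LPIPS perceptual metric; expects NCHW tensors scaled to [-1, 1].
loss_fn = lpips.LPIPS(net="alex")

def score_pair(edited_path, reference_path):
    edited, reference = load_rgb(edited_path), load_rgb(reference_path)

    # SSIM over RGB channels (higher is better).
    ssim = structural_similarity(reference, edited, channel_axis=2, data_range=1.0)

    # PSNR in dB (higher is better).
    psnr = peak_signal_noise_ratio(reference, edited, data_range=1.0)

    # MSE on the 0-255 scale (an assumption, chosen to match the magnitudes in the table).
    mse = mean_squared_error(reference * 255.0, edited * 255.0)

    # LPIPS perceptual distance (lower is better).
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0
    with torch.no_grad():
        lp = loss_fn(to_tensor(reference), to_tensor(edited)).item()

    return {"SSIM": ssim, "LPIPS": lp, "PSNR": psnr, "MSE": mse}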

Semantic Expectation (SE) Examples

Semantic Expectation (SE) evaluates models' deeper reasoning and contextual understanding capabilities. Specifically, it measures whether models can infer and apply the implicit semantic dependencies between textual instructions and the corresponding visual or contextual outcomes. The key dimensions of SE are tone and style modification, cross-modal semantic consistency, knowledge-grounded linkage, reasoning-based SE, and semantic preservation.
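As a hypothetical illustration of how these five dimensions could be aggregated into a single SE score, consider the sketch below. The dimension names come from the text above, but the 1-5 scale, equal weighting, and judge setup are assumptions for illustration only, not the benchmark's official SE protocol.

from dataclasses import dataclass

# Dimension names taken from the description above; everything else is assumed.
SE_DIMENSIONS = [
    "tone_and_style_modification",
    "cross_modal_semantic_consistency",
    "knowledge_grounded_linkage",
    "reasoning_based_se",
    "semantic_preservation",
]

@dataclass
class SEJudgment:
    scores: dict  # dimension name -> integer score in [1, 5]

    def overall(self) -> float:
        # Unweighted mean over the five SE dimensions, rescaled to [0, 1].
        raw = sum(self.scores[d] for d in SE_DIMENSIONS) / len(SE_DIMENSIONS)
        return (raw - 1) / 4

# Example: a judge model (e.g., a VLM) would produce the per-dimension scores
# after seeing the instruction, the source image, and the edited image.
judgment = SEJudgment(scores={d: 4 for d in SE_DIMENSIONS})
print(f"SE overall: {judgment.overall():.2f}")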

Qualitative Comparisons

BibTeX

@article{texteditbench2026,
  title   = {TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering},
  author  = {Anonymous Authors},
  journal = {CVPR Submission},
  volume  = {3050},
  year    = {2026}
}