Text rendering has recently emerged as one of the most challenging frontiers in visual
generation, attracting significant attention in work on large-scale diffusion and multimodal models.
However, text editing within images remains largely underexplored,
as it requires generating legible characters while preserving semantic, geometric, and
contextual coherence.
To fill this gap, we introduce TextEditBench, a comprehensive evaluation
benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel
manipulations, our benchmark emphasizes reasoning-intensive editing
scenarios that require models to understand physical plausibility, linguistic
meaning, and cross-modal dependencies.
We further propose a novel evaluation dimension, Semantic Expectation (SE),
which measures a model's reasoning ability in maintaining semantic consistency, contextual
coherence, and cross-modal alignment during text editing. Extensive experiments on
state-of-the-art editing systems reveal that while current models
can follow simple textual instructions, they still struggle with context-dependent
reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this
long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground
for advancing
text-guided image editing and reasoning in
multimodal generation.