DocRevive: A Unified Pipeline for Document Text Restoration
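This paper proposes a unified pipeline that combines OCR, image analysis, masked language modeling, and diffusion models to reconstruct damaged, occluded, or incomplete document text while preserving visual integrity. The authors create a synthetic dataset of 30,078 degraded documents. The method can significantly improve performance on downstream document understanding tasks and offers an effective solution for text restoration in real-world scenarios.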
arXiv:2604.10077v2 Announce Type: replace Abstract: In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degrade