RS
RadStudio News

A news aggregation platform focused on the latest developments in medical imaging AI, deep learning, and radiomics

Today's News

Monday, May 25, 2026 · Latest developments in AI × medical imaging, aggregated (182 articles)

  • arXiv cs.CV · Paper · 9 hours ago

    Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

    arXiv:2602.11146v2 Announce Type: replace Abstract: Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent

  • arXiv cs.CV · Paper · 9 hours ago

    ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization

    arXiv:2603.02897v2 Announce Type: replace Abstract: Recent advances in generative image compression (GIC) have delivered remarkable improvements in perceptual quality. However, many GICs rely on large-scale and rigid models, which severely constrain their utility for flexible transmission and practical deployment in low-bitrate scenarios. To address these issues, we propose Progressive Generative Image Compression (ProGIC), a compact codec built on residual vector quantization (RVQ). In RVQ, a sequence of vector quantizers encodes the residuals stage by stage, each with its own codebook. The r

  • arXiv cs.CV · Paper · 9 hours ago

    Anatomy-Guided Vision-Language Learning with Angular Prototype Separation for Multi-Label Video Capsule Endoscopy Classification Under Class Imbalance

    arXiv:2603.17879v2 Announce Type: replace Abstract: This work presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset by combining two principal contributions: an Angular Separation Loss on class prototypes and a Biological State Machine temporal decoder. The backbone remains BiomedCLIP, a biomedical vision-language foundation model. Three consecutive frames are fused through a Local Differencing Attention module that amplifies transient pathological signals by suppressing static tempor

  • arXiv cs.CV · Paper · 9 hours ago

    Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning

    arXiv:2603.24985v3 Announce Type: replace Abstract: Segmenting the left atrial (LA) wall from late gadolinium enhancement magnetic resonance imaging (LGE-MRI) is challenging because of its thin geometry, low contrast, and limited expert annotations. We propose a model-agnostic meta-learning (MAML) framework with a 3D residual U-Net backbone for K-shot (K = 5, 10, 20) LA wall segmentation. The framework is meta-trained on LA wall tasks together with auxiliary LA and right atrial (RA) cavity tasks and uses a boundary-aware composite loss to improve thin-structure delineation. We evaluated MAML o

  • arXiv cs.CV · Paper · 9 hours ago

    Gen-Searcher: Reinforcing Agentic Search for Image Generation

    arXiv:2603.28767v3 Announce Type: replace Abstract: Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To ac

  • arXiv cs.CV · Paper · 9 hours ago

    Visually-Guided Policy Optimization for Multimodal Reasoning

    arXiv:2604.09349v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual fo

  • arXiv cs.CV · Paper · 9 hours ago

    DocRevive: A Unified Pipeline for Document Text Restoration

    arXiv:2604.10077v2 Announce Type: replace Abstract: In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degrade

  • arXiv cs.CV · Paper · 9 hours ago

    Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

    arXiv:2604.11679v2 Announce Type: replace Abstract: Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust foundation models that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused

  • arXiv cs.CV · Paper · 9 hours ago

    VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

    arXiv:2604.13596v3 Announce Type: replace Abstract: Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even w

  • arXiv cs.CV · Paper · 9 hours ago

    VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

    arXiv:2604.21502v2 Announce Type: replace Abstract: Real-world weather, illumination, and imaging variations often induce severe domain shifts, degrading single-source detectors in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant learning, while largely overlooking how domain shift disrupts detector prediction stability. Through analytical experiments, we find that performance degradation is mainly dominated by increasing missed detections. Further analysis shows that this phenomenon stems from reduced

  • arXiv cs.CV · Paper · 9 hours ago

    World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

    arXiv:2604.24764v3 Announce Type: replace Abstract: Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedb

  • arXiv cs.CV · Paper · 9 hours ago

    Towards Generalizable Mapping of Hedges and Linear Woody Features from Earth Observation Data: a national Product for Germany

    arXiv:2604.27247v2 Announce Type: replace Abstract: Hedges and other linear woody features provide valuable ecosystem services, particularly within intensively managed agricultural landscapes. They are key elements for climate adaptation and biodiversity amongst others not only due to a largely varying flora, but also as a feeding-, resting-, and nesting place for many animals and insects including valuable pollinators. Therefore, they require dedicated management, preservation, and attention. Thus, systematic and large-scale mapping of these features from Earth observation data is of high imp

  • arXiv cs.CV · Paper · 9 hours ago

    WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

    arXiv:2605.01018v2 Announce Type: replace Abstract: Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for natur

  • arXiv cs.CV · Paper · 9 hours ago

    4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

    arXiv:2605.05997v2 Announce Type: replace Abstract: Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic

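  • arXiv cs.CV · Paper · 9 hours ago

    Improved Vision-to-Chart Buoy Association with Learned World-to-Image Projection

    For the MaCVi 2026 vision-to-chart data association challenge, this work makes a lightweight modification to the DETR-based fusion Transformer baseline. The baseline decoder learns the geometric projection implicitly from buoy queries that encode world-frame distance and bearing. The authors instead train a dedicated MLP (QueryMLP) that uses chart measurements and IMU orientation to explicitly predict where each buoy's waterline contact point falls in the image, which simplifies the learning task and improves data association accuracy.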

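  • arXiv cs.CV · Paper · 9 hours ago

    GazeBehavior Annotation Toolkit (GBAT): AI-powered toolkit for automatic annotation of egocentric eye-tracking and video data of child-caregiver interaction

    This paper introduces the deep-learning-based GazeBehavior Annotation Toolkit, which streamlines data preprocessing and feature extraction for child-caregiver interaction videos. It supports multi-video synchronization and semi-automatic annotation, aiding research on real-time attention, action, and language interaction while reducing manual labor.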

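  • arXiv cs.CV · Paper · 9 hours ago

    Scene Reconstruction as Mapping Priors for 3D Detection

    The paper proposes using maps as priors on static environment structure to make 3D object detection for autonomous driving more robust, particularly when sensor data is sparse or noisy at long range or in adverse weather. Because conventional high-definition maps are expensive to acquire and maintain, the authors propose a more efficient approach that balances performance against deployment cost.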

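  • arXiv cs.CV · Paper · 9 hours ago

    The TIME Machine: On The Power of Motion for Efficient Perception

    The paper observes that video representation learning has advanced through large-scale training and language-contrastive learning, but that these approaches are costly and restrict learned concepts to what text descriptions capture, leaving models with shortcomings. It highlights the limits of the current reliance on language contrast and suggests that future work should explore more efficient video learning strategies that do not depend on language.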

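  • arXiv cs.CV · Paper · 9 hours ago

    Millimeter-wave Imaging for Anthropometric Body Measurement

    This study uses millimeter-wave radar for contactless measurement of body shape and circumferences (such as waist-to-hip ratio and limb and trunk circumference). It requires no undressing or fixed pose, preserves privacy, and can see through clothing, making it especially suitable for older adults and people with limited mobility while improving measurement speed and dignity.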

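  • arXiv cs.CV · Paper · 9 hours ago

    Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization

    IaMSB is a new method for audio-visual deepfake localization. It uses an inconsistency-aware multimodal Schrödinger bridge to jointly estimate cross-modal consistency and perform segment-level localization. Unlike diffusion models, it needs no explicit noise injection: it produces consistency scores by minimizing the discrepancy between path distributions, which suppresses cross-modal noise propagation under symmetric fusion and improves high-precision localization.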
