RadStudio News

A news aggregation platform focused on the latest developments in medical imaging AI, deep learning, and radiomics

Today's News

Monday, May 25, 2026 · Latest developments in AI × medical imaging (20 items)

  • arXiv cs.CV · Paper · 9 hours ago

    GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

    arXiv:2605.22882v1 Announce Type: new Abstract: Video world models can generate realistic futures from a single instruction, but they often fail to preserve consistent point-level motion over time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. This s

  • arXiv cs.CV · Paper · 9 hours ago

    Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

    arXiv:2605.22903v1 Announce Type: new Abstract: Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by a surprising observation that removing a substantial fraction of image tokens only degrades model performance very slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis spans multiple levels of granularity, spanning global visual degradatio

  • arXiv cs.CV · Paper · 9 hours ago

    Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations

    arXiv:2605.22904v1 Announce Type: new Abstract: Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high-risk situations can enable timely intervention. This requires assessing suicide risk from a surveillance video by jointly reasoning about the behavior of each passenger, his/her spatial context, and temporal dynamics. However, this assessment using videos captured by surveillance cameras is challenging, as it demands accurate perception of human motion, understanding of platform geometry

  • arXiv cs.CV · Paper · 9 hours ago

    VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

    arXiv:2605.22907v1 Announce Type: new Abstract: Real-world long video understanding requires models to perform continuous tracking, information integration and memory retention over massive temporal spans within extreme video durations. Mastering this intense cognitive load constitutes the fundamental bottleneck in long video understanding. While existing benchmarks have driven progress by scaling up video duration, their evaluation tasks often require comprehending only short and isolated video segments, falling short of capturing the challenge of ultra-long-context reasoning. To measure this

  • arXiv cs.CV · Paper · 9 hours ago

    Improved Vision-to-Chart Buoy Association with Learned World-to-Image Projection

    arXiv:2605.22942v1 Announce Type: new Abstract: This report presents a lightweight modification to the DETR-based fusion transformer baseline for the MaCVi 2026 Vision-to-Chart data association challenge. The challenge baseline decoder receives per-buoy queries encoding world-space distance and bearing, forcing the transformer to implicitly learn the complex geometric projection from world coordinates to image pixels. Instead, this work trains an additional dedicated MLP, QueryMLP, to explicitly predict the buoy's waterline contact point in the image from chart measurements and IMU orientation

  • arXiv cs.CV · Paper · 9 hours ago

    GazeBehavior Annotation Toolkit (GBAT): AI-powered toolkit for automatic annotation of egocentric eye-tracking and video data of child-caregiver interaction

    arXiv:2605.22962v1 Announce Type: new Abstract: Video recordings of child-caregiver interactions enable investigation of attentional dynamics during naturalistic behavior. Such multimodal recording also allows researchers to examine how attention interacts with action and language use in real time. However, manual annotation of such data is time-consuming. Here, we introduce GazeBehavior Annotation Toolkit, a deep-learning-based toolkit designed to facilitate three key processes in data preprocessing and feature extraction: post-hoc synchronization across multiple videos, semi-automatic annota

  • arXiv cs.CV · Paper · 9 hours ago

    CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

    arXiv:2605.22996v1 Announce Type: new Abstract: We present CoMoGen, a controllable video generation framework that generates realistic interactive dynamics from a single binary mask sequence conditioned on an input image. CoMoGen introduces a lightweight MaskAdapter that encodes binary mask sequences into a latent residual signal, injected into the Multi Modal Diffusion Transformer (MMDiT) model through a cosine-weighted schedule. Unlike the hierarchical coarse-to-fine design of UNet architectures, MMDiT operates as a sequence of uniform transformer blocks, making it difficult to identify whic

  • arXiv cs.CV · Paper · 9 hours ago

    Scene Reconstruction as Mapping Priors for 3D Detection

    arXiv:2605.22997v1 Announce Type: new Abstract: In autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks such as 3D object detection. Maps can provide robust structural priors of the static environment, helping resolve ambiguities and correct for sensor data sparsity or noise, especially for distant objects or under adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for efficient, large-scale deployment. In this paper, we propose a

  • arXiv cs.CV · Paper · 9 hours ago

    The TIME Machine: On The Power of Motion for Efficient Perception

    arXiv:2605.23045v1 Announce Type: new Abstract: Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the boundaries of what video models can do, they also introduce their own set of limitations: first, scaling video models can reach prohibitive costs and second, learning from language restricts the range of concepts that can be learned to those in captions. As a result, video models still struggle with

  • arXiv cs.CV · Paper · 9 hours ago

    Millimeter-wave Imaging for Anthropometric Body Measurement

    arXiv:2605.23064v1 Announce Type: new Abstract: Body shape and circumferences are clinically informative biomarkers for risk stratification, including measures such as waist to hip ratio, limb and trunk girths, yet conventional tools such as manual tape measures and optical scanners often require undressing and sustained poses. These demands slow workflows, compromise dignity, and exclude many older adults and people with limited mobility. To make measurement fast and contactless, we leverage millimeter-wave (mmWave) radar, which preserves privacy and operates through typical clothing, enablin

  • arXiv cs.CV · Paper · 9 hours ago

    Dithering Defense: Adversarial Robustness of Vision Foundation Models via Multi-Level Floyd-Steinberg Dithering

    arXiv:2605.23065v1 Announce Type: new Abstract: Vision foundation models are widely used as frozen backbones across many downstream tasks, making them a single point of failure under adversarial attack. We study multi-level Floyd-Steinberg error-diffusion dithering as a lightweight, model-agnostic input transformation that disrupts adversarial perturbations while preserving semantic content. Unlike prior work, which was limited to binary dithering, grayscale CIFAR-10, and a single small model trained from scratch, we evaluate across six tasks (classification, segmentation, depth estimation, re

  • arXiv cs.CV · Paper · 9 hours ago

    RoboSurg-VQA: A Multimodal Benchmark for Surgical Segmentation-Aware Visual Question Answering

    arXiv:2605.23068v1 Announce Type: new Abstract: Reliable visual understanding in robot-assisted and minimally invasive surgery (RMIS/MIS) demands more than accurate masks: in clinical practice, clinicians pose language-like questions about procedural context, visibility, artefacts, and the presence of anatomical structures and surgical instruments, often under degraded views caused by occlusion, smoke, bleeding, and specular highlights. We present RoboSurg-VQA, a segmentation-aware visual question answering (VQA) benchmark built by repurposing public surgical segmentation datasets und

  • arXiv cs.CV · Paper · 9 hours ago

    Flow Mismatching: Unsupervised Anomaly Detection via Velocity Discrepancies in Flow Matching Models

    arXiv:2605.23070v1 Announce Type: new Abstract: We propose Flow Mismatching, an unsupervised anomaly detection method that deliberately avoids reconstruction-based paradigms. Instead, we treat flow matching as geometric dynamics and leverage a key insight: anomalies occur at places where the learned normal flow disagrees with the geometric path toward a test image. Given a flow matching model trained only on normal images, we probe its learned velocity field along affine paths from Gaussian noise to a target image. Along each path, we compare the model-predicted velocity, which follows normal

  • arXiv cs.CV · Paper · 9 hours ago

    Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization

    arXiv:2605.23113v1 Announce Type: new Abstract: Audio-visual deepfake localization demands interval-level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single-sided or asynchronous forgeries propagates cross-modal noise, degrading high-precision localization. We present IaMSB, an inconsistency-aware multimodal Schrödinger Bridge (SB) that jointly estimates cross-modal consistency and performs interval-level localization. Unlike diffusion models, SB minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection

  • arXiv cs.CV · Paper · 9 hours ago

    CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection

    arXiv:2605.23116v1 Announce Type: new Abstract: Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning. However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external La

  • arXiv cs.CV · Paper · 9 hours ago

    Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking

    arXiv:2605.23118v1 Announce Type: new Abstract: Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade-off: end-to-end trackers achieve high automation but offer no opportunity to correct silent tracking failures, while decoupled registration-segmentation pipelines permit user verification yet discard the lesion's prior appearance, limiting accuracy in ambiguous cases. In this work, we propose a Verified Tracking paradigm: a clinician verifies a registration-proposed prompt, which the model leverages a

  • arXiv cs.CV · Paper · 9 hours ago

    VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images

    arXiv:2605.23141v1 Announce Type: new Abstract: A useful test of visual concept learning is not just whether a model can recognize a concept in a single image, but whether it can preserve and manipulate concept-level properties under transformation and transfer them to new scenes. We introduce VisAnalog, a controlled suite for this setting on natural images. Each example instantiates $A\!:\!B::C\!:\,?$: images $B$ and a hidden target image $D$ are produced by applying the same deterministic transformation sequence to source images $A$ and $C$. Given $A$, $B$, and $C$, a model must answer a mul
