UPDATED -- 2024-01-03
Classification/Detection/Recognition/Segmentation
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2024-01-03 | FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding | Xingxing Zuo, Pouya Samangouei, Yunwen Zhou, Yan Di, Mingyang Li | 2401.01970v1 | null |
| 2024-01-03 | Distilling Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection | Haowen Zheng, Dong Cao, Jintao Xu, Rui Ai, Weihao Gu, Yang Yang, Yanyan Liang | 2401.01918v1 | null |
| 2024-01-03 | Detours for Navigating Instructional Videos | Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, Kristen Grauman | 2401.01823v1 | null |
| 2024-01-03 | Towards Robust Semantic Segmentation against Patch-based Attack via Attention Refinement | Zheng Yuan, Jie Zhang, Yude Wang, Shiguang Shan, Xilin Chen | 2401.01750v1 | null |
| 2024-01-03 | Lightweight Adaptive Feature De-drifting for Compressed Image Classification | Long Peng, Yang Cao, Yuejin Sun, Yang Wang | 2401.01724v1 | null |
| 2024-01-03 | Local Adaptive Clustering Based Image Matching for Automatic Visual Identification | Zhizhen Wang | 2401.01720v1 | null |
| 2024-01-03 | Modality Exchange Network for Retinogeniculate Visual Pathway Segmentation | Hua Han, Cheng Li, Lei Xie, Yuanjing Feng, Alou Diakite, Shanshan Wang | 2401.01685v1 | null |
| 2024-01-03 | DiffYOLO: Object Detection for Anti-Noise via YOLO and Diffusion Models | Yichen Liu, Huajian Zhang, Daqing Gao | 2401.01659v1 | null |
| 2024-01-03 | S3Net: Innovating Stereo Matching and Semantic Segmentation with a Single-Branch Semantic Stereo Network in Satellite Epipolar Imagery | Qingyuan Yang, Guanzhou Chen, Xiaoliang Tan, Tong Wang, Jiaqi Wang, Xiaodong Zhang | 2401.01643v1 | null |
| 2024-01-04 | BLADE: Box-Level Supervised Amodal Segmentation through Directed Expansion | Zhaochen Liu, Zhixuan Li, Tingting Jiang | 2401.01642v2 | null |
| 2024-01-03 | Context-Aware Interaction Network for RGB-T Semantic Segmentation | Ying Lv, Zhi Liu, Gongyang Li | 2401.01624v1 | link |
| 2024-01-03 | MLIP: Medical Language-Image Pre-training with Masked Local Representation Learning | Jiarun Liu, Hong-Yu Zhou, Cheng Li, Weijian Huang, Hao Yang, Yong Liang, Shanshan Wang | 2401.01591v1 | null |
| 2024-01-03 | Enhancing Generalization of Invisible Facial Privacy Cloak via Gradient Accumulation | Xuannan Liu, Yaoyao Zhong, Weihong Deng, Hongzhi Shi, Xingchen Cui, Yunfeng Yin, Dongchao Wen | 2401.01575v1 | null |
| 2024-01-03 | DDN-SLAM: Real-time Dense Dynamic Neural Implicit SLAM with Joint Semantic Encoding | Mingrui Li, Jiaming He, Guangan Jiang, Hongyu Wang | 2401.01545v1 | null |
| 2024-01-03 | LORE++: Logical Location Regression Network for Table Structure Recognition with Pre-training | Rujiao Long, Hangdi Xing, Zhibo Yang, Qi Zheng, Zhi Yu, Cong Yao, Fei Huang | 2401.01522v1 | null |
| 2024-01-03 | From Pixel to Slide image: Polarization Modality-based Pathological Diagnosis Using Representation Learning | Jia Dong, Yao Yao, Yang Dong, Hui Ma | 2401.01496v1 | null |
| 2024-01-03 | Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition | Kyle Buettner, Sina Malakouti, Xiang Lorraine Li, Adriana Kovashka | 2401.01482v1 | null |
OCR
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2024-01-03 | WordArt Designer API: User-Driven Artistic Typography Synthesis with Large Language Models on ModelScope | Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, Jingdong Sun, Wangmeng Xiang, Yusen Hu, Xianhui Lin, Xiaoyang Kang, Zengke Jin, Bin Luo, et al. | 2401.01699v1 | null |
Transformer
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2024-01-03 | Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions | David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo | 2401.01827v1 | link |
| 2024-01-03 | VGA: Vision and Graph Fused Attention Network for Rumor Detection | Lin Bai, Caiyan Jia, Ziying Song, Chaoqun Cui | 2401.01759v1 | null |
| 2024-01-03 | FullLoRA-AT: Efficiently Boosting the Robustness of Pretrained Vision Transformers | Zheng Yuan, Jie Zhang, Shiguang Shan | 2401.01752v1 | null |
| 2024-01-03 | STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion | Wei Yao, Hongwen Zhang, Yunlian Sun, Jinhui Tang | 2401.01730v1 | null |
| 2024-01-03 | Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens | Dengdi Sun, Yajie Pan, Andong Lu, Chenglong Li, Bin Luo | 2401.01674v1 | null |
| 2024-01-03 | Enhancing the medical foundation model with multi-scale and cross-modality feature learning | Weijian Huang, Cheng Li, Hong-Yu Zhou, Jiarun Liu, Hao Yang, Yong Liang, Shanshan Wang | 2401.01583v1 | null |
| 2024-01-03 | Context-Guided Spatio-Temporal Video Grounding | Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, Libo Zhang | 2401.01578v1 | link |
| 2024-01-03 | A Transformer-Based Adaptive Semantic Aggregation Method for UAV Visual Geo-Localization | Shishen Li, Cuiwei Liu, Huaijun Qiu, Zhaokui Li | 2401.01574v1 | null |
| 2024-01-03 | AttentionLut: Attention Fusion-based Canonical Polyadic LUT for Real-time Image Enhancement | Kang Fu, Yicong Peng, Zicheng Zhang, Qihang Xu, Xiaohong Liu, Jia Wang, Guangtao Zhai | 2401.01569v1 | null |
| 2024-01-03 | Multi-modal Learning with Missing Modality in Predicting Axillary Lymph Node Metastasis | Shichuan Zhang, Sunyi Zheng, Zhongyi Shui, Honglin Li, Lin Yang | 2401.01553v1 | null |
| 2024-01-03 | CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers | Yi Rong, Haoran Zhou, Lixin Yuan, Cheng Mei, Jiahao Wang, Tong Lu | 2401.01552v1 | link |
| 2024-01-03 | Collaborative Perception for Connected and Autonomous Driving: Challenges, Possible Solutions and Opportunities | Senkang Hu, Zhengru Fang, Yiqin Deng, Xianhao Chen, Yuguang Fang | 2401.01544v1 | null |
| 2024-01-03 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | Ziyi Bai, Ruiping Wang, Xilin Chen | 2401.01529v1 | link |
| 2024-01-03 | Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports | Haopeng Li, Andong Deng, Qiuhong Ke, Jun Liu, Hossein Rahmani, Yulan Guo, Bernt Schiele, Chen Chen | 2401.01505v1 | null |
| 2024-01-03 | Token Propagation Controller for Efficient Vision Transformer | Wentao Zhu | 2401.01470v1 | null |
Model Compression/Optimization
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2024-01-03 | From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations | Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard | 2401.01885v1 | null |
| 2024-01-03 | Retraining-free Model Quantization via One-Shot Weight-Coupling Learning | Chen Tang, Yuan Meng, Jiacheng Jiang, Shuzhao Xie, Rongwei Lu, Xinzhu Ma, Zhi Wang, Wenwu Zhu | 2401.01543v1 | null |
Generative Models
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2024-01-03 | Instruct-Imagen: Image Generation with Multi-modal Instruction | Hexiang Hu, Kelvin C. K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, et al. | 2401.01952v1 | null |
| 2024-01-03 | Can We Generate Realistic Hands Only Using Convolution? | Mehran Hosseini, Peyman Hosseini | 2401.01951v1 | null |
| 2024-01-04 | Few-shot Adaptation of Multi-modal Foundation Models: A Survey | Fan Liu, Tianshu Zhang, Wenwen Dai, Wenwen Cai, Xiaocong Zhou, Delong Chen | 2401.01736v2 | null |
| 2024-01-03 | SIGNeRF: Scene Integrated Generation for Neural Radiance Fields | Jan-Niklas Dihlmann, Andreas Engelhardt, Hendrik Lensch | 2401.01647v1 | null |
| 2024-01-03 | Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning | Zitong Huang, Ze Chen, Zhixing Chen, Erjin Zhou, Xinxing Xu, Rick Siow Mong Goh, Yong Liu, Chunmei Feng, Wangmeng Zuo | 2401.01598v1 | link |
| 2024-01-03 | S | Yixuan Wang, Shuangyin Li | 2401.01520v1 | link |
Multimodal
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2024-01-04 | HawkRover: An Autonomous mmWave Vehicular Communication Testbed with Multi-sensor Fusion and Deep Learning | Ethan Zhu, Haijian Sun, Mingyue Ji | 2401.01822v2 | null |
| 2024-01-03 | Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction | Yilan Zhang, Yingxue Xu, Jianqi Chen, Fengying Xie, Hao Chen | 2401.01646v1 | link |
| 2024-01-03 | GPT-4V(ision) is a Generalist Web Agent, if Grounded | Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su | 2401.01614v1 | link |
| 2024-01-03 | Multimodal self-supervised learning for lesion localization | Hao Yang, Hong-Yu Zhou, Cheng Li, Weijian Huang, Jiarun Liu, Yong Liang, Shanshan Wang | 2401.01524v1 | null |
Zero/Few-Shot Learning
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2024-01-03 | Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers | Aleksandar Stanić, Sergi Caelles, Michael Tschannen | 2401.01974v1 | null |
| 2024-01-03 | Few-shot Image Generation via Information Transfer from the Built Geodesic Surface | Yuexing Han, Liheng Ruan, Bing Wang | 2401.01749v1 | null |
Semi-Supervised/Unsupervised Learning
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2024-01-03 | Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints | Jinyang Yuan, Tonglin Chen, Zhimeng Shen, Bin Li, Xiangyang Xue | 2401.01922v1 | null |
| 2024-01-03 | Test-Time Personalization with Meta Prompt for Gaze Estimation | Huan Liu, Julia Qi, Zhenhao Li, Mohammad Hassanpour, Yang Wang, Konstantinos Plataniotis, Yuanhao Yu | 2401.01577v1 | null |
| 2024-01-03 | Boosting of Implicit Neural Representation-based Image Denoiser | Zipei Yan, Zhengji Liu, Jizhou Li | 2401.01548v1 | link |
Other
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2024-01-03 | GPS-SSL: Guided Positive Sampling to Inject Prior Into Self-Supervised Learning | Aarash Feizi, Randall Balestriero, Adriana Romero-Soriano, Reihaneh Rabbany | 2401.01990v1 | link |
| 2024-01-03 | AUPIMO: Redefining Visual Anomaly Detection Benchmarks with High Speed and Low Tolerance | Joao P. C. Bertoldo, Dick Ameln, Ashwin Vaidya, Samet Akçay | 2401.01984v1 | link |
| 2024-01-03 | LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry | Weirong Chen, Le Chen, Rui Wang, Marc Pollefeys | 2401.01887v1 | null |
| 2024-01-03 | Step length measurement in the wild using FMCW radar | Parthipan Siva, Alexander Wong, Patricia Hewston, George Ioannidis, Dr. Jonathan Adachi, Dr. Alexander Rabinovich, Andrea Lee, Alexandra Papaioannou | 2401.01868v1 | null |
| 2024-01-03 | A Vision Check-up for Language Models | Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba | 2401.01862v1 | null |
| 2024-01-03 | Synthetic dataset of ID and Travel Document | Carlos Boned, Maxime Talarmain, Nabil Ghanmi, Guillaume Chiron, Sanket Biswas, Ahmad Montaser Awal, Oriol Ramos Terrades | 2401.01858v1 | link |
| 2024-01-04 | Frequency Domain Modality-invariant Feature Learning for Visible-infrared Person Re-Identification | Yulin Li, Tianzhu Zhang, Yongdong Zhang | 2401.01839v2 | null |
| 2024-01-03 | aMUSEd: An Open MUSE Reproduction | Suraj Patil, William Berman, Robin Rombach, Patrick von Platen | 2401.01808v1 | null |
| 2024-01-03 | Learning Keypoints for Robotic Cloth Manipulation using Synthetic Data | Thomas Lips, Victor-Louis De Gusseme, Francis wyffels | 2401.01734v1 | null |
| 2024-01-03 | Fact-checking based fake news detection: a review | Yuzhou Yang, Yangming Zhou, Qichao Ying, Zhenxing Qian, Dan Zeng, Liang Liu | 2401.01717v1 | null |
| 2024-01-03 | AID-DTI: Accelerating High-fidelity Diffusion Tensor Imaging with Detail-Preserving Model-based Deep Learning | Wenxin Fan, Jian Cheng, Cheng Li, Xinrui Ma, Jing Yang, Juan Zou, Ruoyou Wu, Qiegen Liu, Shanshan Wang | 2401.01693v1 | null |
| 2024-01-03 | ODTrack: Online Dense Temporal Token Learning for Visual Tracking | Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, Xianxian Li | 2401.01686v1 | link |
| 2024-01-03 | Performance Evaluation of GPS Trajectory Rasterization Methods | Necip Enes Gengec, Ergin Tari | 2401.01676v1 | null |
| 2024-01-03 | Simultaneous q-Space Sampling Optimization and Reconstruction for Fast and High-fidelity Diffusion Magnetic Resonance Imaging | Jing Yang, Jian Cheng, Cheng Li, Wenxin Fan, Juan Zou, Ruoyou Wu, Shanshan Wang | 2401.01662v1 | null |
| 2024-01-03 | AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI | Fanda Fan, Chunjie Luo, Jianfeng Zhan, Wanling Gao | 2401.01651v1 | null |
| 2024-01-03 | De-Confusing Pseudo-Labels in Source-Free Domain Adaptation | Idit Diamant, Idan Achituve, Arnon Netzer | 2401.01650v1 | null |
| 2024-01-03 | Real-Time Human Fall Detection using a Lightweight Pose Estimation Technique | Ekram Alam, Abu Sufian, Paramartha Dutta, Marco Leo | 2401.01587v1 | null |
| 2024-01-03 | View Distribution Alignment with Progressive Adversarial Learning for UAV Visual Geo-Localization | Cuiwei Liu, Jiahao Liu, Huaijun Qiu, Zhaokui Li, Xiangbin Shi | 2401.01573v1 | null |
| 2024-01-03 | One-Step Late Fusion Multi-view Clustering with Compressed Subspace | Qiyuan Ou, Pei Zhang, Sihang Zhou, En Zhu | 2401.01558v1 | null |
| 2024-01-03 | DDPM based X-ray Image Synthesizer | Praveen Mahaulpatha, Thulana Abeywardane, Tomson George | 2401.01539v1 | null |
| 2024-01-03 | Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering | Haopeng Li, Qiuhong Ke, Mingming Gong, Tom Drummond | 2401.01510v1 | null |