Publications

(2025). VQ-VA World: Towards High-Quality Visual Question-Visual Answering. arXiv preprint. * denotes equal contribution.

PDF

(2025). When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought. arXiv preprint.

PDF

(2025). LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation. arXiv preprint. * denotes equal contribution.

PDF

(2025). Emerging Properties in Unified Multimodal Pretraining. arXiv preprint.

PDF Code Project

(2025). Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness. arXiv preprint.

PDF

(2024). What If We Recaption Billions of Web Images with LLaMA-3? ICML 2025. * denotes equal contribution.

PDF Code Dataset Project

(2024). Revisiting Adversarial Training at Scale. CVPR 2024. * denotes equal contribution.

PDF Code

(2023). Rejuvenating image-GPT as Strong Visual Representation Learners. ICML 2024. * denotes equal contribution.

PDF Code

(2023). Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training. CVPR 2024.

PDF Code

(2023). FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning. TMLR 2024. * denotes equal contribution.

PDF Code

(2023). DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation. ICCV 2023. * denotes equal contribution.

PDF Code

(2023). An Inverse Scaling Law for CLIP Training. NeurIPS 2023. * denotes equal contribution.

PDF Code

(2023). On the Adversarial Robustness of Camera-based 3D Object Detection. TMLR 2024.

PDF Code

(2022). Masked Autoencoders Enable Efficient Knowledge Distillers. CVPR 2023.

PDF Code

(2022). Can CNNs Be More Robust Than Transformers? ICLR 2023.

PDF Code

(2019). SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines. AAAI 2020. * denotes equal contribution.

PDF Code