I am a Ph.D. candidate in the Department of Computer Science at the National University of Singapore, advised by Prof. Wei Tsang Ooi, Prof. Benoit Cottereau, and Dr. Lai Xing Ng. I also collaborate closely with Prof. Ziwei Liu from Nanyang Technological University, Singapore.
My research focuses on spatial intelligence, multimodal large language models, and 3D/4D world modeling and evaluation.
I am the recipient of the Research Achievement Award (NUS Computing, 2023), Dean's Graduate Research Excellence Award (NUS Computing, 2024), DAAD AInet Fellowship (DAAD, 2025), and Apple Scholars in AI/ML Ph.D. Fellowship (Apple, 2025).
I have been fortunate to collaborate with Apple Machine Learning Research, NVIDIA Research, OpenMMLab, MMLab@NTU, and Motional.
Apple AI/ML | CNRS@CREATE | NVIDIA Research | TikTok | Motional
* equal contribution ‡ project lead § corresponding author
Learning to Remove Lens Flare in Event Camera
Preprint, 2026

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
Preprint, 2026

AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models
Preprint, 2026

3D and 4D World Modeling: A Survey
Preprint, 2026
LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences

La La LiDAR: Large-Scale Layout Generation from LiDAR Data
Open-o3 Video: Grounded Video Reasoning with Spatio-Temporal Evidence
Preprint, 2025

RewardMap: Tackling Sparse Rewards in Fine-Grained Visual Reasoning via Multi-Stage Reinforcement Learning
Preprint, 2025

EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
Preprint, 2025

Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps
Preprint, 2025

Stairway to Success: Zero-Shot Floor-Aware Object-Goal Navigation via LLM-Driven Coarse-to-Fine Exploration
Preprint, 2025

PixelThink: Towards Efficient Chain-of-Pixel Reasoning
Preprint, 2025

See4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
Preprint, 2025
Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

VideoLucy: Deep Memory Backtracking for Long Video Understanding

3EED: Ground Everything Everywhere in 3D

MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding

FlexEvent: Towards Flexible Event-Frame Object Detection at Varying Operational Frequencies

Perspective-Invariant 3D Object Detection

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations

MonoMRN: Monocular Semantic Scene Completion via Masked Recurrent Networks

SafeMap: Robust HD Map Construction from Incomplete Observations

EventFly: Event Camera Perception from Ground to the Sky

LiMoE: Mixture of LiDAR Data Representation Learners from Automotive Scenes

GEAL: Generalizable 3D Object Affordance Learning with Cross-Modal Consistency

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning

DynamicCity: Large-Scale 4D Occupancy Generation from Dynamic Scenes

Calib3D: Calibrating Model Preferences for Reliable 3D Scene Understanding

LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving

Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving

FRNet: Frustum-Range Networks for Scalable LiDAR-Based Semantic Segmentation

NUC-Net: Non-Uniform Cylindrical Partition Networks for Efficient LiDAR Semantic Segmentation

Visual Foundation Models Boost Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation

Is Your LiDAR Placement Optimized for 3D Scene Understanding?

Is Your HD Map Constructor Reliable under Sensor Corruptions?

4D Contrastive Superflows are Dense 3D Representation Learners

Learning to Adapt SAM for Segmenting Cross-Domain Point Clouds

OpenESS: Event-Based Semantic Scene Understanding with Open Vocabularies

Multi-Space Alignments Towards Universal LiDAR Segmentation

Unified 3D and 4D Panoptic Segmentation via Dynamic Shifting Networks

Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving

RoboDepth: Robust Out-of-Distribution Depth Estimation under Corruptions

Segment Any Point Cloud Sequences by Distilling Vision Foundation Models

Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective

Towards Label-Free Scene Understanding by Vision Foundation Models

Robo3D: Towards Robust and Reliable 3D Perception against Corruptions

Rethinking Range View Representation for LiDAR Segmentation

UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase

LaserMix for Semi-Supervised LiDAR Semantic Segmentation

CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP

ConDA: Unsupervised Domain Adaptation for LiDAR Segmentation via Regularized Domain Concatenation

Benchmarking 3D Robustness to Common Corruptions and Sensor Failure
The RoboSense Challenge: Sense Anything, Navigate Anywhere, Adapt Across Platforms
Technical Report, 2025

The RoboDrive Challenge: Drive Anytime Anywhere in Any Condition
Technical Report, 2024

The RoboDepth Challenge: Methods and Advancements Towards Robust Depth Estimation
Technical Report, 2023