

Robust visual sim-to-real transfer for robotic manipulation
Ricardo Garcia-Pinel, Robin Strudel, Shizhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid
In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023.
↳ webpage.
PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation
Shizhe Chen*, Ricardo Garcia-Pinel*, Cordelia Schmid, Ivan Laptev
In Conference on Robot Learning (CoRL), 2023.
↳ webpage.
Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy
Ricardo Garcia-Pinel*, Shizhe Chen*, Cordelia Schmid
In IEEE International Conference on Robotics and Automation (ICRA), 2025.
↳ webpage.
Name | Institution | Jury Role |
---|---|---|
Serena Ivaldi | Inria Nancy Grand-Est | President |
Christian Wolf | Naver Labs Europe | Examiner |
Abhinav Valada | University of Freiburg | Examiner |
Cordelia Schmid | Inria Paris | Advisor |
Shizhe Chen | Inria Paris | Co-advisor |
My thesis focuses on learning visuomotor policies for robotic manipulation in unstructured environments. A central difficulty in this setting is building robust, generalist policies that can perceive, plan, and act from both visual and language inputs. We address it along three directions: sim-to-real transfer, language-guided policy learning from 3D point clouds, and policy generalization.
First, we propose a data-driven approach to optimize domain randomization parameters for sim-to-real transfer, using multi-object localization as a proxy task. This method improves the transferability of visuomotor policies trained in simulation to the real world.
Next, we introduce PolarNet and 3D-LOTUS, two architectures that use 3D point cloud encoders and multimodal transformers to fuse language and visual inputs for robotic manipulation. These models significantly outperform 2D-based baselines across multiple tasks on the RLBench benchmark and demonstrate strong real-world performance.
Finally, we present GemBench, a new benchmark designed to evaluate policy generalization across increasing levels of task complexity, and 3D-LOTUS++, a modular policy that integrates large language models for high-level planning and vision-language models for object grounding. This system achieves state-of-the-art results on GemBench and on a real robot, demonstrating its capabilities on long-horizon and diverse manipulation tasks.
Document | Description | Size |
---|---|---|
Thesis | (draft) | 132 MB |
Live: 2PM-3PM (CET)
The defense will be livestreamed on YouTube.