Learning Visuomotor Policies for Robotic Manipulation

Robust visual sim-to-real transfer for robotic manipulation
Ricardo Garcia-Pinel, Robin Strudel, Shizhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid
In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023.
↳ webpage.

PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation
Shizhe Chen*, Ricardo Garcia-Pinel*, Cordelia Schmid, Ivan Laptev
In Conference on Robot Learning (CoRL), 2023.
↳ webpage.

Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy
Ricardo Garcia-Pinel*, Shizhe Chen*, Cordelia Schmid
In IEEE International Conference on Robotics and Automation (ICRA), 2025.
↳ webpage.

Name	Institution	Jury Role
Serena Ivaldi	Inria Nancy Grand-Est	President
Christian Wolf	Naver Labs Europe	Examiner
Abhinav Valada	University of Freiburg	Examiner
Cordelia Schmid	Inria Paris	Advisor
Shizhe Chen	Inria Paris	Co-advisor

My thesis focuses on learning visuomotor policies for robotic manipulation in unstructured environments. A central challenge in robotic manipulation is building robust, generalist policies that can perceive, plan, and act based on both visual and language inputs. We address it along three axes: sim-to-real transfer, language-guided policy learning from 3D point clouds, and policy generalization.

First, we propose a data-driven approach to optimize domain randomization parameters for sim-to-real transfer, using multi-object localization as a proxy task. This method improves the transferability of visuomotor policies trained in simulation to the real world.

Next, we introduce PolarNet and 3D-LOTUS, two architectures that leverage 3D point cloud encoders and multimodal transformers to fuse language and visual inputs for robotic manipulation. These models significantly outperform 2D-based baselines across multiple tasks on the RLBench benchmark and demonstrate strong real-world performance.

Finally, we present GemBench, a new benchmark designed to evaluate policy generalization across increasing task complexity, and 3D-LOTUS++, a modular policy that integrates large language models for high-level planning and vision-language models for object grounding. This system achieves state-of-the-art results on GemBench and on a real robot, handling long-horizon and diverse manipulation tasks.

Warning: The document below is a draft. The final version of my thesis will be available on the day of the PhD defense.

Document	Description	Size
Thesis	(draft)	132 MB

Attendance

Timetable

2PM-3PM	3PM-4PM	4PM-5PM
Presentation	Questions	Decision

Access (map)

Metro: Tolbiac Metro | Corvisart Metro | Poterne des Peupliers

Location: Inria

Room: Anita Borg

Live: 2PM-3PM (CET)

The defense will be livestreamed on YouTube.