Learning Visuomotor Policies for Robotic Manipulation

PhD Defense

Presented by Ricardo Garcia-Pinel

2PM, Wednesday the 4th of June, 2025

Robust visual sim-to-real transfer for robotic manipulation
Ricardo Garcia-Pinel, Robin Strudel, Shizhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid
In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023.
webpage.

PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation
Shizhe Chen*, Ricardo Garcia-Pinel*, Cordelia Schmid, Ivan Laptev
In Conference on Robot Learning (CoRL), 2023
webpage.

Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy
Ricardo Garcia-Pinel*, Shizhe Chen*, Cordelia Schmid
In IEEE International Conference on Robotics and Automation (ICRA), 2025
webpage.

Name | Institution | Jury Role
Serena Ivaldi | Inria Nancy Grand-Est | President
Christian Wolf | Naver Labs Europe | Examiner
Abhinav Valada | University of Freiburg | Examiner
Cordelia Schmid | Inria Paris | Advisor
Shizhe Chen | Inria Paris | Co-advisor

My thesis focuses on learning visuomotor policies for robotic manipulation in unstructured environments. One of the main challenges in robotic manipulation is building robust, generalist policies that can perceive, plan, and act based on both visual and language inputs. We address this by tackling three challenges: sim-to-real transfer, 3D point cloud-based language-guided policy learning, and policy generalization.

First, we propose a data-driven approach to optimize domain randomization parameters for sim-to-real transfer, using multi-object localization as a proxy task. This method improves the transferability of visuomotor policies trained in simulation to the real world.

Next, we introduce PolarNet and 3D-LOTUS, two architectures that leverage 3D point cloud encoders and multimodal transformers to fuse language and visual inputs for robotic manipulation. These models significantly outperform 2D-based baselines across multiple tasks in the RLBench benchmark and demonstrate strong real-world performance.

Finally, we present GemBench, a new benchmark designed to evaluate policy generalization across increasing task complexity, and 3D-LOTUS++, a modular policy that integrates large language models for high-level planning and vision-language models for object grounding. This system achieves state-of-the-art results on GemBench and on a real robot, demonstrating its capabilities on long-horizon and diverse manipulation tasks.

Warning: For the final version of my thesis, please wait for the day of the PhD defense.
Document: Thesis (draft), 132 MB

Timetable

2PM-3PM: Presentation
3PM-4PM: Questions
4PM-5PM: Decision

Access (map)

Path to Inria

Metro: Tolbiac | Corvisart | Poterne des Peupliers

Location: Inria

Room: Anita Borg

Live: 2PM-3PM (CEST)

The defense will be livestreamed on YouTube.