CoDex: Learning Compositional Dexterous Functional Manipulation without Demonstrations

Anonymous

CoDex translates VLM guidance into semantic constraints and uses constrained optimization + RL in simulation to master complex functional object manipulation without human demos.

TL;DR

Why care? CoDex learns dexterous grasp–move–actuate skills for functional object manipulation (FOM) without any human demonstrations.

How we do it:

  1. VLM → Semantic Constraints: From a single RGB‑D view and a task prompt, a VLM yields local constraints (actuation/function points and directions) and a global goal pose via VLM‑CEM.
  2. Analytic Constrained Optimization: Enforce those constraints to synthesize functional, stable grasp candidates with collision avoidance and force‑closure guarantees.
  3. Constraint‑Guided RL: In ManiSkill3 with 2,048 parallel environments (~1 hour of training), refine the candidates into a parameterized motion‑primitive policy that composes grasp, motion, and actuation and transfers to a Franka arm with a LEAP hand.
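Step 2's force‑closure requirement can be illustrated with a minimal two‑contact check: an antipodal pair of contacts achieves force closure when the line through the contact points lies inside both friction cones. This is a deliberately simplified stand‑in for the full force‑closure constraint used in the paper's grasp optimization; the function name and friction coefficient are illustrative.

```python
import numpy as np

def antipodal_force_closure(p1, n1, p2, n2, mu=0.5):
    """Two-contact force-closure test for an antipodal grasp.

    Holds when the line through the contact points lies inside both
    friction cones (half-angle arctan(mu)) around the inward unit
    surface normals n1 and n2. A simplified sketch, not the paper's
    full force-closure formulation.
    """
    d = p2 - p1
    d = d / np.linalg.norm(d)
    half_angle = np.arctan(mu)
    ang1 = np.arccos(np.clip(np.dot(d, n1), -1.0, 1.0))   # cone at contact 1
    ang2 = np.arccos(np.clip(np.dot(-d, n2), -1.0, 1.0))  # cone at contact 2
    return bool(ang1 <= half_angle and ang2 <= half_angle)

# Opposing contacts on opposite faces of a box: force closure holds.
ok = antipodal_force_closure(np.array([0., 0., 0.]), np.array([1., 0., 0.]),
                             np.array([1., 0., 0.]), np.array([-1., 0., 0.]))
```

Moving the second contact off-axis beyond the friction-cone half-angle makes the test fail, which is exactly the geometric condition the grasp optimizer must respect.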

What CoDex achieves: 73% overall success across six real‑world CD‑FOM tasks, outperforming Analytical+VLM‑CEM (33%) and RL+PIVOT (0%).

Why is Functional Object Manipulation challenging?

"Robots can pick and place, but can they use a spray bottle or a glue‑gun?"

The challenge. CD‑FOM tasks demand coordinated control over internal (buttons, triggers) and external (object pose) DoF so the effect lands on the right target.

Our bridge. We let a VLM provide semantics (what to actuate, where to aim) and translate them into semantic constraints that drive analytic optimization and RL to achieve physical dexterity (functional contacts, stable grasps, collision‑free motion).


Stage 1 — VLM‑Generated Semantic Constraints

We parse (language, RGB‑D) to (i) segment and reconstruct the tool, (ii) locate the Actuation and Function points and directions, and (iii) derive a global goal pose via VLM‑CEM—an iterative, keypoint‑anchored pose search guided by VLM scoring.
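The iterative pose search behind VLM‑CEM can be sketched as a standard cross‑entropy method loop: sample candidate goal poses from a Gaussian, score them, refit the Gaussian to the elites, and repeat. In CoDex the score would come from the VLM rating rendered, keypoint‑annotated pose candidates; here it is stubbed with a hypothetical geometric score, and all hyperparameter values are assumptions.

```python
import numpy as np

def vlm_cem_pose_search(score_fn, init_mean, init_std,
                        n_samples=64, n_elite=8, n_iters=10, std_floor=0.05):
    """Cross-entropy method over a 6-DoF goal pose (xyz + rpy).

    score_fn rates a candidate pose; in CoDex this role is played by
    VLM scoring of rendered, keypoint-annotated candidates.
    """
    mean = np.asarray(init_mean, dtype=float)
    std = np.asarray(init_std, dtype=float)
    for _ in range(n_iters):
        samples = np.random.normal(mean, std, size=(n_samples, mean.size))
        scores = np.array([score_fn(s) for s in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]  # keep top scorers
        mean = elites.mean(axis=0)
        std = np.maximum(elites.std(axis=0), std_floor)  # avoid premature collapse
    return mean

# Toy stand-in for VLM scoring: prefer poses near a hypothetical target pose.
np.random.seed(0)
target = np.array([0.4, 0.0, 0.3, 0.0, 0.0, 1.57])
best = vlm_cem_pose_search(lambda p: -np.linalg.norm(p - target),
                           init_mean=np.zeros(6), init_std=np.ones(6))
```

The std floor keeps the search distribution from collapsing before the mean has converged, a common CEM safeguard.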

Stage 2 — Constraint‑Guided Policy Training

Analytic constrained optimization turns local constraints into function‑aligned, stable grasp candidates; then constraint‑guided RL in ManiSkill3 refines them into a single motion primitive that completes grasp–move–actuate end‑to‑end.
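The grasp–move–actuate composition can be sketched as a parameterized primitive whose rollout executes the three phases in order, with RL tuning the parameters against the episode reward. All class, field, and environment method names below are hypothetical, not the paper's actual parameterization.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class PrimitiveParams:
    """Parameters of the composed grasp-move-actuate primitive
    (field names are illustrative)."""
    grasp_pose: np.ndarray       # 6-DoF hand pose for grasping
    goal_pose: np.ndarray        # 6-DoF object goal pose from VLM-CEM
    actuation_delta: np.ndarray  # finger-joint offsets that press the trigger

class FakeEnv:
    """Minimal stand-in environment that records the phase order."""
    def __init__(self):
        self.log = []
    def move_hand_to(self, pose):    self.log.append("grasp")
    def close_fingers(self):         self.log.append("close")
    def move_object_to(self, pose):  self.log.append("move")
    def offset_fingers(self, delta): self.log.append("actuate")
    def task_reward(self):
        return 1.0 if self.log == ["grasp", "close", "move", "actuate"] else 0.0

def rollout(env, params):
    """Run the phases in sequence; RL would tune `params` in
    simulation to maximize the returned reward."""
    env.move_hand_to(params.grasp_pose)
    env.close_fingers()
    env.move_object_to(params.goal_pose)
    env.offset_fingers(params.actuation_delta)
    return env.task_reward()

env = FakeEnv()
reward = rollout(env, PrimitiveParams(np.zeros(6), np.ones(6), np.zeros(16)))
```

Because the policy outputs primitive parameters rather than per-step torques, a single learned primitive can cover the full grasp–move–actuate sequence end to end.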

Stage 3 — Real‑world Execution

The learned primitive transfers directly to a Franka arm with a LEAP hand, executing the full pipeline on unseen objects without task‑specific demonstrations or tuning.

Quantitative Results

Real‑world performance across six CD‑FOM tasks (5 trials each). The bottom row gives each method's overall success rate.

| Task | CoDex (Ours) | Analytical + VLM‑CEM | RL+PIVOT |
| --- | --- | --- | --- |
| Spray whiteboard | 5/5 (100%) | 5/5 (100%) | 0/5 (0%) |
| Spray plant | 5/5 (100%) | 5/5 (100%) | 0/5 (0%) |
| Clean keyboard | 3/5 (60%) | 0/5 (0%) | 0/5 (0%) |
| Illuminate toy | 0/5 (0%) | 0/5 (0%) | 0/5 (0%) |
| Glue blocks | 5/5 (100%) | 0/5 (0%) | 0/5 (0%) |
| Grind salt | 4/5 (80%) | 0/5 (0%) | 0/5 (0%) |
| Average (Overall) | 22/30 (73%) | 10/30 (33%) | 0/30 (0%) |

CoDex reliably solves the two spray tasks and Glue blocks, degrades partially on Clean keyboard and Grind salt, and fails fully on Illuminate toy. The baselines mirror the composition plot: Analytical+VLM‑CEM succeeds only on the two spray tasks, and RL+PIVOT achieves no end‑to‑end success despite frequent correct actuations.

Overall success on six CD‑FOM tasks (5 real‑world trials each). Solid = success; shaded = partial (movement‑only or actuation‑only). CoDex reaches 73% overall, outperforming Analytical + VLM‑CEM (33%) and RL+PIVOT (0%).