Social and Predictive Navigation
Anticipating human motion is a key skill for intelligent systems that share a space or interact with humans. Accurate long-term predictions of human movement trajectories, body poses, actions or activities may significantly improve the ability of robots to plan ahead, anticipate the effects of their actions or to foresee hazardous situations. The topic has received increasing attention in recent years across several scientific communities with a growing spectrum of applications in service robots, self-driving cars, collaborative manipulators or tracking and surveillance.
This workshop is the sixth in a series of events held at ICRA 2019-2024. It aims to bring together researchers and practitioners from different communities to discuss recent developments in this field: promising approaches, their limitations, benchmarking techniques, and open challenges.
The workshop takes place on Monday, May 13, 2024, in Room 313-314. The Zoom link will be available via the InfoVaya portal managed by ICRA 2024.
Accurate human motion prediction matters across application domains:

- Service robots must predict human motion for safe and efficient operation.
- Working and co-manipulating in close proximity to humans requires precise full-body motion, task, and activity anticipation.
- Urban and highway navigation is impossible without fast inference on the dynamic environment.
This workshop will feature talks by several high-profile invited speakers from diverse academic and industrial backgrounds, as well as a poster session.
The program of the workshop is available below.
We encourage researchers to submit novel material as short papers (up to 4 pages), to be presented as posters.
Participants can also test the generalization capabilities of their motion prediction models in diverse indoor environments.
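Benchmarks of this kind are commonly scored with Average and Final Displacement Error (ADE/FDE). As a point of reference, here is a minimal Python sketch of how these metrics are computed; the function name and the random stand-in data are illustrative only and not part of any official evaluation kit.

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and Final Displacement Error for one predicted trajectory.

    pred, gt: arrays of shape (T, 2) holding predicted and ground-truth
    (x, y) positions over T future time steps.
    """
    errors = np.linalg.norm(pred - gt, axis=-1)  # per-step Euclidean error
    return errors.mean(), errors[-1]

# Illustrative usage with random stand-in trajectories; a real benchmark would
# load held-out data from each environment to probe generalization.
rng = np.random.default_rng(0)
pred = np.cumsum(rng.normal(scale=0.1, size=(12, 2)), axis=0)
gt = np.cumsum(rng.normal(scale=0.1, size=(12, 2)), axis=0)
ade, fde = ade_fde(pred, gt)
print(f"ADE: {ade:.3f} m, FDE: {fde:.3f} m")
```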
Recordings of the past LHMP events are available on our YouTube channel.
The workshop is planned for Monday, May 13, 2024, with the following program:
Time | Speaker | Title | Abstract |
---|---|---|---|
9:00-9:15 (JST) | Organizers | Intro | |
9:15-9:45 (JST) | Sanjiban Choudhury, Cornell University | Isn't Motion Prediction just Model-based RL? | We present a clarifying perspective on motion prediction as an attempt to learn a model of human motion useful for a downstream policy. Drawing from new (and some old) principles of Model-based RL (MBRL), we discuss how to train such models — which loss functions to use, and which data distribution to train on. We present first results on Collaborative Manipulation, where we train transformer models to predict human motion around robots and close the loop to plan with these predictions. |
9:45-10:15 (JST) | Yuxiao Chen, NVIDIA | How to better predict for planning: categorical behavior prediction for AV planning | In a typical autonomous vehicle (AV) stack, motion predictions are consumed by the planning module to generate safe and efficient motion plans for the AV. With the advent of LLMs and foundation models, categorical/tokenized data plays an increasingly important role. This talk focuses on how we can embrace the change and generate motion predictions that are not only accurate but also friendly to downstream planning, potentially with LLMs/FMs involved. We argue that interpretable behavior-level multimodality is essential to understanding human behavior in traffic, and it is possible to draw a direct connection between trajectory predictions and natural language descriptions. On the other hand, having multiple modes to cover trajectory-level variance not only increases computation cost, but may not be necessary for planning. On this subject, our recent work proposes an easy workaround that enables integration of a gradient-based planner and prediction models. |
10:15-10:30 (JST) | Coffee break | | |
10:30-11:00 (JST) | Tao Chen, Fudan University | The Next Motion Generation: An Observation and Discussion on Motion Generation | This presentation seeks to explore the frontier of human motion generation, offering a comprehensive overview and critical analysis of the latest advancements and methodologies within this field. A significant portion of our discussion is dedicated to "Motion Latent Diffusion," as introduced at CVPR 2023, which illustrates a substantial advancement in efficient motion generation using diffusion models. Additionally, "MotionGPT," presented at NeurIPS 2023, ushers in a revolutionary approach by equating motion generation processes with linguistic analysis, thereby opening new avenues for exploration. Our latest project, "MotionChain," integrates Vision-Language Models with motion generation and aims to establish a holistic framework for autonomous agents. Through this dialogue, we aspire to forge stronger collaborations and innovate beyond the current frontiers of motion generation technologies. |
11:00-11:30 (JST) | Angelo Cangelosi, The University of Manchester | Trust and Theory of Mind in Human Robot interaction | There is growing psychology and social robotics literature showing that theory of mind (ToM) capabilities affect trust in human-robot interaction (HRI). This concerns both users’ ToM and trust of robots, and robots’ artificial ToM and trust of people. We present developmental robotics models and HRI experiments exploring different aspects of these two dimensions of trust, to contribute towards trustworthy and transparent social robots. |
11:30-12:00 (JST) | Arash Ajoudani, Italian Institute of Technology | Predictive and Perspective Control of Human-Robot Interaction through Kino-dynamic State Fusion | |
12:00-13:30 (JST) | Lunch break | | |
13:30-14:30 (JST) | Poster Session | | |
14:30-15:00 (JST) | Mo Chen, Simon Fraser University | Long-Term Human Motion Prediction Through Hierarchy, Learning, and Control | One of the keys to long-term human motion prediction is hierarchy: just a few high-level actions can encompass a long duration, while the details of human motion at shorter time scales can be inherently encoded in the high-level actions themselves. In this talk, we will discuss two long-term human trajectory prediction frameworks that take advantage of hierarchy. At the high level, we will look at both action spaces that are hand-designed and those that are learned from data. At the low level, we will examine how details of human motion at shorter time scales can be reconstructed through a combination of control- and learning-based methods. *(An illustrative code sketch of this hierarchical idea follows the program table.)* |
15:00-15:30 (JST) | Alina Roitberg, University of Stuttgart | Towards resource-efficient and uncertainty-aware driver behaviour understanding and maneuver prediction | This talk will explore recent advances in video-based driver observation techniques aimed at creating adaptable, resource- and data-efficient, as well as uncertainty-aware models for in-vehicle monitoring and maneuver prediction. Topics covered will include: (1) an overview of state-of-the-art methods and public datasets for driver activity analysis, (2) the importance of adaptability in driver observation systems to cater to new situations (environments, vehicle types, driver behaviours), as well as strategies for addressing such open-world tasks, and (3) uncertainty-aware approaches, vital for robust and safe decision-making. The talk will conclude with a discussion of future research directions and potential applications of this technology, such as improving driver safety and the overall driving experience. |
15:30-15:45 (JST) | Coffee break | | |
15:45-16:15 (JST) | Shuhan Tan, The University of Texas at Austin | Leveraging Natural Language for Traffic Simulation in Autonomous Vehicle Development | Simulation forms the backbone of modern self-driving development. Simulators help develop, test, and improve driving systems without putting humans, vehicles, or their environment at risk. However, simulators face a major challenge: they rely on realistic, scalable, yet interesting content. While recent advances in rendering and scene reconstruction make great strides in creating static scene assets, modeling their layout, dynamics, and behaviors remains challenging. Natural language allows practitioners to easily articulate interesting and complex traffic scenarios through high-level descriptions. Instead of meticulously crafting the details of each individual scenario, language allows for a seamless conversion of semantic ideas into simulation scenarios at scale. In this talk, I will first introduce our work LCTGen, which takes as input a natural language description of a traffic scenario and outputs traffic actors’ initial states and motions on a compatible map. I will then introduce our more recent work on building closed-loop simulation scenario environments with natural language and traffic models. |
16:15-16:45 (JST) | Tim Schreiter, TUM | THÖR-MAGNI Dataset and Benchmark update | This presentation introduces the THÖR-MAGNI dataset, designed to facilitate research in motion prediction and human-robot interaction, and gives an update on the associated benchmark. |
16:45-17:00 (JST) | Organizers | Discussion and conclusions | |
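As referenced in Mo Chen's abstract above, the hierarchical factorization (high-level actions encode long horizons, low-level motion is reconstructed beneath them) can be sketched in a few lines of Python. The sketch below is purely illustrative and makes no claims about the speaker's actual models; all names, goals, and the stub logic are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical two-level predictor: a high-level model commits to a coarse
# action over a long horizon; a low-level model fills in fine-grained motion.
GOALS = {"desk": np.array([4.0, 1.0]), "door": np.array([0.0, 5.0])}

def predict_high_level(history):
    """Choose a coarse action from observed motion (a learned classifier in
    practice; a heuristic stub here)."""
    heading_up = history[-1, 1] - history[0, 1] > 0
    return "door" if heading_up else "desk"

def predict_low_level(start, goal, steps=20):
    """Reconstruct short-time-scale motion toward the chosen goal (a
    controller or learned decoder in practice; a straight-line rollout here)."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return start + alphas * (goal - start)

history = np.array([[0.0, 0.0], [0.1, 0.4], [0.2, 0.9]])    # observed (x, y)
action = predict_high_level(history)                        # long-horizon intent
trajectory = predict_low_level(history[-1], GOALS[action])  # fine motion
print(f"predicted action: {action}, endpoint: {trajectory[-1]}")
```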
The following papers have been selected for presentation at the workshop. Click on a title to view the PDF.
If you would like more information, feel free to reach out via e-mail!