Deep Learning Research Papers for Robot Perception

A collection of deep learning research papers covering perception and associated robotic tasks. Within each research area outlined below, the course staff has identified a core set and an extended set of papers. The core set will form the basis of our seminar-style lectures starting in week 8. The extended set provides additional coverage of other exciting work being done within each area.

Table of contents

RGB-D Architectures
Point Cloud Processing
Object Pose, Geometry, SDF, Implicit surfaces
Dense Descriptors, Category-level Representations
Recurrent Networks and Object Tracking
Visual Odometry and Localization
Semantic Scene Graphs and Explicit Representations
Neural Radiance Fields and Implicit Representations
Datasets
Self-Supervised Learning
Grasp Pose Detection
Tactile Perception for Grasping and Manipulation
Pre-training for Robot Manipulation
Perception Beyond Vision
More Frontiers

RGB-D Architectures

Core List

PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes, Xiang et al., 2018
A Unified Framework for Multi-View Multi-Class Object Pose Estimation, Li et al., 2018
PVN3D: A Deep Point-Wise 3D Keypoints Voting Network for 6DoF Pose Estimation, He et al., 2020
Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation, Xiang et al., 2021

Extended List

3D ShapeNets: A Deep Representation for Volumetric Shapes, Wu et al., 2015
VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition, Maturana and Scherer, 2015
Multi-view Convolutional Neural Networks for 3D Shape Recognition, Su et al., 2015
Volumetric and Multi-View CNNs for Object Classification on 3D Data, Qi et al., 2016
Robust 6D Object Pose Estimation with Stochastic Congruent Sets, Mitash et al., 2018
What’s Behind the Couch? Directed Ray Distance Functions (DRDF) for 3D Scene Reconstruction, Kulkarni et al., 2022

Point Cloud Processing

Core List

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, Qi et al., 2017
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, Qi et al., 2017
PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation, Xu et al., 2018
DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion, Wang et al., 2019

Extended List

Just Go with the Flow: Self-Supervised Scene Flow Estimation, Mittal et al., 2019
PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows, Yang et al., 2019
3D Object Detection with Pointformer, Pan et al., 2021
Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories, Harley et al., 2022

Object Pose, Geometry, SDF, Implicit surfaces

Core List

SUM: Sequential scene understanding and manipulation, Sui et al., 2017
DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation, Park et al., 2019
Implicit surface representations as layers in neural networks, Michalkiewicz et al., 2019
iSDF: Real-Time Neural Signed Distance Fields for Robot Perception, Ortiz et al., 2022

Extended List

Local Deep Implicit Functions for 3D Shape, Genova et al., 2020
Implicit geometric regularization for learning shapes, Gropp et al., 2020
TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation, Pan et al., 2022
Improving Object Pose Estimation by Fusion With a Multimodal Prior – Utilizing Uncertainty-Based CNN Pipelines for Robotics, Richter-Klug et al., 2022

Dense Descriptors, Category-level Representations

Core List

Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation, Florence et al., 2018
Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation, Wang et al., 2019
kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation, Manuelli et al., 2019
Single-Stage Keypoint-Based Category-Level Object Pose Estimation from an RGB Image, Lin et al., 2022

Extended List

Visual Descriptor Learning from Monocular Video, Deekshith et al., 2020
SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings, Haugaard et al., 2021

Recurrent Networks and Object Tracking

Core List

DeepIM: Deep Iterative Matching for 6D Pose Estimation, Li et al., 2018
PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking, Deng et al., 2019
6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints, Wang et al., 2020
XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model, Cheng and Schwing, 2022

Extended List

Long Short-Term Memory, Hochreiter and Schmidhuber, 1997
The Unreasonable Effectiveness of Recurrent Neural Networks, Karpathy, 2015
TrackFormer: Multi-Object Tracking with Transformers, Meinhardt et al., 2022
RNNPose: Recurrent 6-DoF Object Pose Refinement with Robust Correspondence Field Estimation and Pose Optimization, Xu et al., 2022

Visual Odometry and Localization

Core List

Backprop KF: Learning Discriminative Deterministic State Estimators, Haarnoja et al., 2016
Differentiable Particle Filters: End-to-End Learning with Algorithmic Priors, Jonschkowski et al., 2018
Multimodal Sensor Fusion with Differentiable Filters, Lee et al., 2020
Differentiable SLAM-net: Learning Particle SLAM for Visual Navigation, Karkus et al., 2021

Extended List

Factor Graphs and GTSAM, Dellaert, 2012
SuperPoint: Self-Supervised Interest Point Detection and Description, DeTone et al., 2017
Particle Filter Recurrent Neural Networks, Ma et al., 2019
Differentiable Algorithm Networks for Composable Robot Learning, Karkus et al., 2019
SuperGlue: Learning Feature Matching with Graph Neural Networks, Sarlin et al., 2019
Chasing Ghosts: Instruction Following as Bayesian State Tracking, Anderson et al., 2019
Differentiable Factor Graph Optimization for Learning Smoothers, Yi et al., 2021
How to train your differentiable filter, Kloss et al., 2021
Differentiable Nonparametric Belief Propagation, Opipari et al., 2021
A Robot Web for Distributed Many-Device Localisation, Murai et al., 2022

Semantic Scene Graphs and Explicit Representations

Core List

Image Retrieval using Scene Graphs, Johnson et al., 2015
Semantic Robot Programming for Goal-Directed Manipulation in Cluttered Scenes, Zeng et al., 2018
Semantic Linking Maps for Active Visual Object Search, Zeng et al., 2020
Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization, Hughes et al., 2022

Extended List

RoboSherlock: Unstructured information processing for robot perception, Beetz et al., 2015
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, Krishna et al., 2016
Image Generation from Scene Graphs, Johnson et al., 2018
3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera, Armeni et al., 2019
Differentiable Scene Graphs, Raboh et al., 2020
ConceptFusion: Open-set Multimodal 3D Mapping, Jatavallabhula et al., 2023

Neural Radiance Fields and Implicit Representations

Core List

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Mildenhall et al., 2020
iMAP: Implicit Mapping and Positioning in Real-Time, Sucar et al., 2021
NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields, Rosinol et al., 2022
NARF22: Neural Articulated Radiance Fields for Configuration-Aware Rendering, Lewis et al., 2022
Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation, Shen et al., 2023

Extended List

NeRF Explosion 2020, Dellaert, 2020
Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations, Sitzmann et al., 2019
Local Implicit Grid Representations for 3D Scenes, Jiang et al., 2020
Convolutional occupancy networks, Peng et al., 2020
Object-Centric Neural Scene Rendering, Guo et al., 2020
iNeRF: Inverting Neural Radiance Fields for Pose Estimation, Yen-Chen et al., 2021
iLabel: Interactive Neural Scene Labelling, Zhi et al., 2021
Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation, Simeonov et al., 2021
BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-scale Scene Rendering, Xiangli et al., 2021
Block-NeRF: Scalable Large Scene Neural View Synthesis, Tancik et al., 2022
NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields, Yen-Chen et al., 2022
Language Embedded Radiance Fields, Kerr et al., 2023

Datasets

Core List

Deep Learning for Robots: Learning from Large-Scale Interaction, Levine et al., 2016
Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning, Makoviychuk et al., 2021
Grounding Predicates through Actions, Migimatsu and Bohg, 2022
All You Need is LUV: Unsupervised Collection of Labeled Images using Invisible UV Fluorescent Indicators, Thananjeyan et al., 2022

Extended List

Collecting data with robots

TossingBot: Learning to Throw Arbitrary Objects, Zeng et al., 2019

RGB-D Datasets

(NYU Depth v2) Indoor Segmentation and Support Inference from RGBD Images, Silberman et al., 2012
SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite, Song et al., 2015
YCB-Video Dataset, Xiang et al., 2018
BOP: Benchmark for 6D Object Pose Estimation, Hodaň et al., 2018
ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes, Dai et al., 2017
ProgressLabeller: Visual Data Stream Annotation for Training Object-Centric 3D Perception, Chen et al., 2022
TO-Scene: A Large-scale Dataset for Understanding 3D Tabletop Scenes, Xu et al., 2022

Semantic Datasets

Understanding Human Hands in Contact at Internet Scale, Shan et al., 2020
Habitat-Matterport 3D Semantics Dataset, Yadav et al., 2022

Object Model Datasets

ShapeNet: An Information-Rich 3D Model Repository, Chang et al., 2015
PartNet-Mobility Dataset

Simulators

MuJoCo: A physics engine for model-based control, Todorov et al., 2012
PyBullet, a Python module for physics simulation for games, robotics and machine learning, Coumans and Bai, 2016
NVIDIA Isaac Sim
CARLA: An Open Urban Driving Simulator, Dosovitskiy et al., 2017
SoftGym: Benchmarking Deep Reinforcement Learning for Deformable Object Manipulation, Lin et al., 2020
ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills, Gu et al., 2023

Self-Supervised Learning

Core List

Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks, Lee et al., 2019
VICRegL: Self-Supervised Learning of Local Visual Features, Bardes et al., 2022
Fully Self-Supervised Class Awareness in Dense Object Descriptors, Hadjivelichkov and Kanoulas, 2022
Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild, Zhang et al., 2022

Extended List

Emerging Properties in Self-Supervised Vision Transformers, Caron et al., 2021
DINOv2: Learning Robust Visual Features without Supervision, Oquab et al., 2023

Grasp Pose Detection

Core List

Real-Time Grasp Detection Using Convolutional Neural Networks, Redmon and Angelova, 2015
Using Geometry to Detect Grasps in 3D Point Clouds, ten Pas and Platt, 2015
Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics, Mahler et al., 2017
Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes, Sundermeyer et al., 2021
Sample Efficient Grasp Learning Using Equivariant Models, Zhu et al., 2022

Extended List

Deep Learning for Detecting Robotic Grasps, Lenz et al., 2013
High precision grasp pose detection in dense clutter, Gualtieri et al., 2016
GlassLoc: Plenoptic Grasp Pose Detection in Transparent Clutter, Zhou et al., 2019
MetaGraspNet_v0: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis, Chen et al., 2021
Grasp Learning: Models, Methods, and Performance, Platt, 2022

Tactile Perception for Grasping and Manipulation

Core List

More Than a Feeling: Learning to Grasp and Regrasp using Vision and Touch, Calandra et al., 2018
Tactile Object Pose Estimation from the First Touch with Geometric Contact Rendering, Bauza et al., 2020
Visuotactile Affordances for Cloth Manipulation with Local Control, Sunil et al., 2022
ShapeMap 3-D: Efficient shape mapping through dense touch and vision, Suresh et al., 2022

Extended List

The Feeling of Success: Does Touch Sensing Help Predict Grasp Outcomes?, Calandra et al., 2017
GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force, Yuan et al., 2017
Soft-bubble: A highly compliant dense geometry tactile sensor for robot manipulation, Alspach et al., 2019
A Review of Tactile Information: Perception and Action Through Touch, Li et al., 2020
TACTO: A Fast, Flexible, and Open-source Simulator for High-Resolution Vision-based Tactile Sensors, Wang et al., 2020
Active Extrinsic Contact Sensing: Application to General Peg-in-Hole Insertion, Kim et al., 2021
Active Visuo-Haptic Object Shape Completion, Rustler et al., 2022
Learning Self-Supervised Representations from Vision and Touch for Active Sliding Perception of Deformable Surfaces, Kerr and Huang et al., 2022
See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation, Li et al., 2022
Learning to Grasp the Ungraspable with Emergent Extrinsic Dexterity, Zhou and Held, 2022

Pre-training for Robot Manipulation

Core List

SORNet: Spatial Object-Centric Representations for Sequential Manipulation, Yuan et al., 2021
CLIPort: What and Where Pathways for Robotic Manipulation, Shridhar et al., 2021
Real-World Robot Learning with Masked Visual Pre-training, Radosavovic et al., 2022
R3M: A Universal Visual Representation for Robot Manipulation, Nair et al., 2022
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
RT-1: Robotics Transformer for Real-World Control at Scale, Brohan et al., 2022

Extended List

Attention and Augmented Recurrent Neural Networks, Olah & Carter, 2016
Attention is All You Need, Vaswani et al., 2017
Feature-wise transformations, Dumoulin et al., 2018
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2020
Transporter Networks: Rearranging the Visual World for Robotic Manipulation, Zeng et al., 2020
CLIP: Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
Masked Autoencoders Are Scalable Vision Learners, He et al., 2021
Interactive Language: Talking to Robots in Real Time, Lynch et al., 2022
Transformers are Adaptable Task Planners, Jain et al., 2022

Perception Beyond Vision

Specialized Sensors

Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images, Levenson et al., 2015
Automatic color correction for 3D reconstruction of underwater scenes, Skinner et al., 2017
GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force, Yuan et al., 2017
Classification of Household Materials via Spectroscopy, Erickson et al., 2018
Through-Wall Human Pose Estimation Using Radio Signals, Zhao et al., 2018
A bio-hybrid odor-guided autonomous palm-sized air vehicle, Anderson et al., 2020
Event-based, Direct Camera Tracking from a Photometric 3D Map using Nonlinear Optimization, Bryner et al., 2019
SoundSpaces: Audio-Visual Navigation in 3D Environments, Chen et al., 2019
Neural Implicit Surface Reconstruction using Imaging Sonar, Qadri et al., 2022

More Frontiers

Interpreting Deep Learning Models

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, Simonyan et al., 2013
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Selvaraju et al., 2016
The Building Blocks of Interpretability, Olah et al., 2018
Multimodal Neurons in Artificial Neural Networks, Goh et al., 2021

Fairness and Ethics

Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, Buolamwini and Gebru, 2018
Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing, Raji et al., 2020

Certifiable Perception

Certifiably Optimal Outlier-Robust Geometric Perception: Semidefinite Relaxations and Scalable Global Optimization, Yang and Carlone, 2021
Certifiable 3D Object Pose Estimation: Foundations, Learning Models, and Self-Training, Talak et al., 2022

Articulated Objects

Autonomous Tool Construction Using Part Shape and Attachment Prediction, Nair et al., 2019
Parts-Based Articulated Object Localization in Clutter Using Belief Propagation, Pavlasek et al., 2020
Category-Level Articulated Object Pose Estimation, Li et al., 2020
Differentiable Nonparametric Belief Propagation, Opipari et al., 2021
Category-Independent Articulated Object Tracking with Factor Graphs, Heppert et al., 2022
Kineverse: A Symbolic Articulation Model Framework for Model-Agnostic Mobile Manipulation, Röfer et al., 2022

Deformable Objects

DensePose: Dense Human Pose Estimation In The Wild, Güler et al., 2018
FabricFlowNet: Bimanual Cloth Manipulation with a Flow-based Policy, Weng et al., 2021
DextAIRity: Deformable Manipulation Can be a Breeze, Xu et al., 2022
Self-supervised Transparent Liquid Segmentation for Robotic Pouring, Narasimhan et al., 2022
Visio-tactile Implicit Representations of Deformable Objects, Wi et al., 2022

Transparent Objects

LIT: Light-field Inference of Transparency for Refractive Object Localization, Zhou et al., 2019
Multi-modal Transfer Learning for Grasping Transparent and Specular Objects, Weng et al., 2020
Dex-NeRF: Using a Neural Radiance Field to Grasp Transparent Objects, Ichnowski et al., 2021
ClearPose: Large-scale Transparent Object Dataset and Benchmark, Chen et al., 2022
TransNet: Category-Level Transparent Object Pose Estimation, Zhang et al., 2022

Dynamic Scenes

D-NeRF: Neural Radiance Fields for Dynamic Scenes, Pumarola et al., 2020
3D Neural Scene Representations for Visuomotor Control, Li et al., 2021
HexPlane: A Fast Representation for Dynamic Scenes, Cao and Johnson, 2023

Beyond 2D Convolutions

Learning Decentralized Controllers for Robot Swarms with Graph Neural Networks, Tolstaya et al., 2019
A Gentle Introduction to Graph Neural Networks, Sanchez-Lengeling et al., 2021

Reinforcement Learning

Deep Reinforcement Learning from Human Preferences, Christiano et al., 2017
Understanding RL Vision, Hilton et al., 2020

Generative Modeling

WaterGAN: Unsupervised Generative Network to Enable Real-time Color Correction of Monocular Underwater Images, Li et al., 2017
Differentiable Particle Filters through Conditional Normalizing Flow, Chen et al., 2021
Planning with Diffusion for Flexible Behavior Synthesis, Janner et al., 2022
Anything-3D: Towards Single-view Anything Reconstruction in the Wild, Shen et al., 2023