Deep Learning Research Papers for Robot Perception
A collection of deep learning research papers covering perception and associated robotic tasks. Within each research area outlined below, the course staff has identified a core set and an extended set of research papers. The core set will form the basis of our seminar-style lectures starting in week 8. The extended set provides additional coverage of other exciting work being done within each area.
Table of contents
- RGB-D Architectures
- Point Cloud Processing
- Object Pose, Geometry, SDF, Implicit surfaces
- Dense Descriptors, Category-level Representations
- Recurrent Networks and Object Tracking
- Visual Odometry and Localization
- Semantic Scene Graphs and Explicit Representations
- Neural Radiance Fields and Implicit Representations
- Datasets
- Self-Supervised Learning
- Grasp Pose Detection
- Tactile Perception for Grasping and Manipulation
- Pre-training for Robot Manipulation
- Perception Beyond Vision
- More Frontiers
RGB-D Architectures
Core List
- PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes, Xiang et al., 2018
- A Unified Framework for Multi-View Multi-Class Object Pose Estimation, Li et al., 2018
- PVN3D: A Deep Point-Wise 3D Keypoints Voting Network for 6DoF Pose Estimation, He et al., 2020
- Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation, Xiang et al., 2021
Extended List
- 3D ShapeNets: A Deep Representation for Volumetric Shapes, Wu et al., 2015
- VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition, Maturana and Scherer, 2015
- Multi-view Convolutional Neural Networks for 3D Shape Recognition, Su et al., 2015
- Volumetric and Multi-View CNNs for Object Classification on 3D Data, Qi et al., 2016
- Robust 6D Object Pose Estimation with Stochastic Congruent Sets, Mitash et al., 2018
- What’s Behind the Couch? Directed Ray Distance Functions (DRDF) for 3D Scene Reconstruction, Kulkarni et al., 2022
Point Cloud Processing
Core List
- PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, Qi et al., 2017
- PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, Qi et al., 2017
- PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation, Xu et al., 2018
- DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion, Wang et al., 2019
Extended List
- Just Go with the Flow: Self-Supervised Scene Flow Estimation, Mittal et al., 2019
- PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows, Yang et al., 2019
- 3D Object Detection with Pointformer, Pan et al., 2021
- Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories, Harley et al., 2022
Object Pose, Geometry, SDF, Implicit surfaces
Core List
- SUM: Sequential scene understanding and manipulation, Sui et al., 2017
- DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation, Park et al., 2019
- Implicit surface representations as layers in neural networks, Michalkiewicz et al., 2019
- iSDF: Real-Time Neural Signed Distance Fields for Robot Perception, Ortiz et al., 2022
Extended List
- Local Deep Implicit Functions for 3D Shape, Genova et al., 2020
- Implicit geometric regularization for learning shapes, Gropp et al., 2020
- TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation, Pan et al., 2022
- Improving Object Pose Estimation by Fusion With a Multimodal Prior – Utilizing Uncertainty-Based CNN Pipelines for Robotics, Richter-Klug et al., 2022
Dense Descriptors, Category-level Representations
Core List
- Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation, Florence et al., 2018
- Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation, Wang et al., 2019
- kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation, Manuelli et al., 2019
- Single-Stage Keypoint-Based Category-Level Object Pose Estimation from an RGB Image, Lin et al., 2022
Extended List
- Visual Descriptor Learning from Monocular Video, Deekshith et al., 2020
- SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings, Haugaard et al., 2021
Recurrent Networks and Object Tracking
Core List
- DeepIM: Deep Iterative Matching for 6D Pose Estimation, Li et al., 2018
- PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking, Deng et al., 2019
- 6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints, Wang et al., 2020
- XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model, Cheng and Schwing, 2022
Extended List
- Long Short-Term Memory, Hochreiter and Schmidhuber, 1997
- The Unreasonable Effectiveness of Recurrent Neural Networks, Karpathy, 2015
- TrackFormer: Multi-Object Tracking with Transformers, Meinhardt et al., 2022
- RNNPose: Recurrent 6-DoF Object Pose Refinement with Robust Correspondence Field Estimation and Pose Optimization, Xu et al., 2022
Visual Odometry and Localization
Core List
- Backprop KF: Learning Discriminative Deterministic State Estimators, Haarnoja et al., 2016
- Differentiable Particle Filters: End-to-End Learning with Algorithmic Priors, Jonschkowski et al., 2018
- Multimodal Sensor Fusion with Differentiable Filters, Lee et al., 2020
- Differentiable SLAM-net: Learning Particle SLAM for Visual Navigation, Karkus et al., 2021
Extended List
- Factor Graphs and GTSAM, Dellaert et al., 2012
- SuperPoint: Self-Supervised Interest Point Detection and Description, DeTone et al., 2017
- Particle Filter Recurrent Neural Networks, Ma et al., 2019
- Differentiable Algorithm Networks for Composable Robot Learning, Karkus et al., 2019
- SuperGlue: Learning Feature Matching with Graph Neural Networks, Sarlin et al., 2019
- Chasing Ghosts: Instruction Following as Bayesian State Tracking, Anderson et al., 2019
- Differentiable Factor Graph Optimization for Learning Smoothers, Yi et al., 2021
- How to train your differentiable filter, Kloss et al., 2021
- Differentiable Nonparametric Belief Propagation, Opipari et al., 2021
- A Robot Web for Distributed Many-Device Localisation, Murai et al., 2022
Semantic Scene Graphs and Explicit Representations
Core List
- Image Retrieval using Scene Graphs, Johnson et al., 2015
- Semantic Robot Programming for Goal-Directed Manipulation in Cluttered Scenes, Zeng et al., 2018
- Semantic Linking Maps for Active Visual Object Search, Zeng et al., 2020
- Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization, Hughes et al., 2022
Extended List
- RoboSherlock: Unstructured information processing for robot perception, Beetz et al., 2015
- Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, Krishna et al., 2016
- Image Generation from Scene Graphs, Johnson et al., 2018
- 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera, Armeni et al., 2020
- Differentiable Scene Graphs, Raboh et al., 2020
- ConceptFusion: Open-set Multimodal 3D Mapping, Jatavallabhula et al., 2023
Neural Radiance Fields and Implicit Representations
Core List
- NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Mildenhall et al., 2020
- iMAP: Implicit Mapping and Positioning in Real-Time, Sucar et al., 2021
- NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields, Rosinol et al., 2022
- NARF22: Neural Articulated Radiance Fields for Configuration-Aware Rendering, Lewis et al., 2022
- Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation, Shen et al., 2023
Extended List
- NeRF Explosion 2020, Dellaert, 2020
- Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations, Sitzmann et al., 2019
- Local Implicit Grid Representations for 3D Scenes, Jiang et al., 2020
- Convolutional occupancy networks, Peng et al., 2020
- Object-Centric Neural Scene Rendering, Guo et al., 2020
- iNeRF: Inverting Neural Radiance Fields for Pose Estimation, Yen-Chen et al., 2021
- iLabel: Interactive Neural Scene Labelling, Zhi et al., 2021
- Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation, Simeonov et al., 2021
- BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-scale Scene Rendering, Xiangli et al., 2021
- Block-NeRF: Scalable Large Scene Neural View Synthesis, Tancik et al., 2022
- NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields, Yen-Chen et al., 2022
- Language Embedded Radiance Fields, Kerr et al., 2023
Datasets
Core List
- Deep Learning for Robots: Learning from Large-Scale Interaction, Levine et al., 2016
- Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning, Makoviychuk et al., 2021
- Grounding Predicates through Actions, Migimatsu and Bohg, 2022
- All You Need is LUV: Unsupervised Collection of Labeled Images using Invisible UV Fluorescent Indicators, Thananjeyan et al., 2022
Extended List
Collecting data with robots
- TossingBot: Learning to Throw Arbitrary Objects, Zeng et al., 2019
RGB-D Datasets
- (NYU Depth v2) Indoor Segmentation and Support Inference from RGBD Images, Silberman et al., 2012
- SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite, Song et al., 2015
- YCB-Video Dataset, Xiang et al., 2018
- BOP: Benchmark for 6D Object Pose Estimation, Hodaň et al., 2019
- ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes, Dai et al., 2019
- ProgressLabeller: Visual Data Stream Annotation for Training Object-Centric 3D Perception, Chen et al., 2022
- TO-Scene: A Large-scale Dataset for Understanding 3D Tabletop Scenes, Xu et al., 2022
Semantic Datasets
- Understanding Human Hands in Contact at Internet Scale, Shan et al., 2020
- Habitat-Matterport 3D Semantics Dataset, Yadav et al., 2022
Object Model Datasets
- ShapeNet: An Information-Rich 3D Model Repository, Chang et al., 2015
Simulators
- MuJoCo: A physics engine for model-based control, Todorov et al., 2015
- Pybullet, a python module for physics simulation for games, robotics and machine learning, Coumans et al., 2015
- CARLA: An Open Urban Driving Simulator, Dosovitskiy et al., 2017
- SoftGym: Benchmarking Deep Reinforcement Learning for Deformable Object Manipulation, Lin et al., 2020
- ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills, Gu et al., 2023
Self-Supervised Learning
Core List
- Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks, Lee et al., 2019
- VICRegL: Self-Supervised Learning of Local Visual Features, Bardes et al., 2022
- Fully Self-Supervised Class Awareness in Dense Object Descriptors, Hadjivelichkov and Kanoulas, 2022
- Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild, Zhang et al., 2022
Extended List
- Emerging Properties in Self-Supervised Vision Transformers, Caron et al., 2021
- DINOv2: Learning Robust Visual Features without Supervision, Oquab et al., 2023
Grasp Pose Detection
Core List
- Real-Time Grasp Detection Using Convolutional Neural Networks, Redmon and Angelova, 2015
- Using Geometry to Detect Grasps in 3D Point Clouds, ten Pas and Platt, 2015
- Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics, Mahler et al., 2017
- Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes, Sundermeyer et al., 2021
- Sample Efficient Grasp Learning Using Equivariant Models, Zhu et al., 2022
Extended List
- Deep Learning for Detecting Robotic Grasps, Lenz et al., 2013
- High precision grasp pose detection in dense clutter, Gualtieri et al., 2016
- GlassLoc: Plenoptic Grasp Pose Detection in Transparent Clutter, Zhou et al., 2019
- MetaGraspNet_v0: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis, Chen et al., 2021
- Grasp Learning: Models, Methods, and Performance, Platt, 2022
Tactile Perception for Grasping and Manipulation
Core List
- More Than a Feeling: Learning to Grasp and Regrasp using Vision and Touch, Calandra et al., 2018
- Tactile Object Pose Estimation from the First Touch with Geometric Contact Rendering, Bauza et al., 2020
- Visuotactile Affordances for Cloth Manipulation with Local Control, Sunil et al., 2022
- ShapeMap 3-D: Efficient shape mapping through dense touch and vision, Suresh et al., 2022
Extended List
- The Feeling of Success: Does Touch Sensing Help Predict Grasp Outcomes?, Calandra et al., 2017
- GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force, Yuan et al., 2017
- Soft-bubble: A highly compliant dense geometry tactile sensor for robot manipulation, Alspach et al., 2019
- A Review of Tactile Information: Perception and Action Through Touch, Li et al., 2020
- TACTO: A Fast, Flexible, and Open-source Simulator for High-Resolution Vision-based Tactile Sensors, Wang et al., 2020
- Active Extrinsic Contact Sensing: Application to General Peg-in-Hole Insertion, Kim et al., 2021
- Active Visuo-Haptic Object Shape Completion, Rustler et al., 2022
- Learning Self-Supervised Representations from Vision and Touch for Active Sliding Perception of Deformable Surfaces, Kerr and Huang et al., 2022
- See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation, Li et al., 2022
- Learning to Grasp the Ungraspable with Emergent Extrinsic Dexterity, Zhou and Held, 2022
Pre-training for Robot Manipulation
Core List
- SORNet: Spatial Object-Centric Representations for Sequential Manipulation, Yuan et al., 2021
- CLIPort: What and Where Pathways for Robotic Manipulation, Shridhar et al., 2021
- Real-World Robot Learning with Masked Visual Pre-training, Radosavovic et al., 2022
- R3M: A Universal Visual Representation for Robot Manipulation, Nair et al., 2022
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
- RT-1: Robotics Transformer for Real-World Control at Scale, Brohan et al., 2022
Extended List
- Attention and Augmented Recurrent Neural Networks, Olah & Carter, 2016
- Attention is All You Need, Vaswani et al., 2017
- Feature-wise transformations, Dumoulin et al., 2018
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2020
- Transporter Networks: Rearranging the Visual World for Robotic Manipulation, Zeng et al., 2020
- CLIP: Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
- Masked Autoencoders Are Scalable Vision Learners, He et al., 2021
- Interactive Language: Talking to Robots in Real Time, Lynch et al., 2022
- Transformers are Adaptable Task Planners, Jain et al., 2022
Perception Beyond Vision
Specialized Sensors
- Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images, Levenson et al., 2015
- Automatic color correction for 3D reconstruction of underwater scenes, Skinner et al., 2017
- GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force, Yuan et al., 2017
- Classification of Household Materials via Spectroscopy, Erickson et al., 2018
- Through-Wall Human Pose Estimation Using Radio Signals, Zhao et al., 2018
- A bio-hybrid odor-guided autonomous palm-sized air vehicle, Anderson et al., 2020
- Event-based, Direct Camera Tracking from a Photometric 3D Map using Nonlinear Optimization, Bryner et al., 2019
- SoundSpaces: Audio-Visual Navigation in 3D Environments, Chen et al., 2019
- Neural Implicit Surface Reconstruction using Imaging Sonar, Qadri et al., 2022
More Frontiers
Interpreting Deep Learning Models
- Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, Simonyan et al., 2013
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Selvaraju et al., 2016
- The Building Blocks of Interpretability, Olah et al., 2018
- Multimodal Neurons in Artificial Neural Networks, Goh et al., 2021
Fairness and Ethics
- Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, Buolamwini and Gebru, 2018
- Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing, Raji et al., 2020
Certifiable Perception
- Certifiably Optimal Outlier-Robust Geometric Perception: Semidefinite Relaxations and Scalable Global Optimization, Yang and Carlone, 2021
- Certifiable 3D Object Pose Estimation: Foundations, Learning Models, and Self-Training, Talak et al., 2022
Articulated Objects
- Autonomous Tool Construction Using Part Shape and Attachment Prediction, Nair et al., 2019
- Parts-Based Articulated Object Localization in Clutter Using Belief Propagation, Pavlasek et al., 2020
- Category-Level Articulated Object Pose Estimation, Li et al., 2020
- Differentiable Nonparametric Belief Propagation, Opipari et al., 2021
- Category-Independent Articulated Object Tracking with Factor Graphs, Heppert et al., 2022
- Kineverse: A Symbolic Articulation Model Framework for Model-Agnostic Mobile Manipulation, Röfer et al., 2022
Deformable Objects
- DensePose: Dense Human Pose Estimation In The Wild, Güler et al., 2018
- FabricFlowNet: Bimanual Cloth Manipulation with a Flow-based Policy, Weng et al., 2021
- DextAIRity: Deformable Manipulation Can be a Breeze, Xu et al., 2022
- Self-supervised Transparent Liquid Segmentation for Robotic Pouring, Narasimhan et al., 2022
- Visio-tactile Implicit Representations of Deformable Objects, Wi et al., 2022
Transparent Objects
- LIT: Light-field Inference of Transparency for Refractive Object Localization, Zhou et al., 2019
- Multi-modal Transfer Learning for Grasping Transparent and Specular Objects, Weng et al., 2020
- Dex-NeRF: Using a Neural Radiance Field to Grasp Transparent Objects, Ichnowski et al., 2021
- ClearPose: Large-scale Transparent Object Dataset and Benchmark, Chen et al., 2022
- TransNet: Category-Level Transparent Object Pose Estimation, Zhang et al., 2022
Dynamic Scenes
- D-NeRF: Neural Radiance Fields for Dynamic Scenes, Pumarola et al., 2020
- 3D Neural Scene Representations for Visuomotor Control, Li et al., 2021
- HexPlane: A Fast Representation for Dynamic Scenes, Cao and Johnson, 2023
Beyond 2D Convolutions
- Learning Decentralized Controllers for Robot Swarms with Graph Neural Networks, Tolstaya et al., 2019
- A Gentle Introduction to Graph Neural Networks, Sanchez-Lengeling et al., 2021
Reinforcement Learning
- Deep Reinforcement Learning from Human Preferences, Christiano et al., 2017
- Understanding RL Vision, Hilton et al., 2020
Generative Modeling
- WaterGAN: Unsupervised Generative Network to Enable Real-time Color Correction of Monocular Underwater Images, Li et al., 2017
- Differentiable Particle Filters through Conditional Normalizing Flow, Chen et al., 2021
- Planning with Diffusion for Flexible Behavior Synthesis, Janner et al., 2022
- Anything-3D: Towards Single-view Anything Reconstruction in the Wild, Shen et al., 2023