Deep Learning Research Papers for Robot Perception: Important Works

This collection highlights foundational papers and promising recent research in deep learning and robot perception.

RGB-D Architectures
Point Cloud Processing
Object Pose, Geometry, SDF, Implicit surfaces
Dense Descriptors, Category-level Representations
Recurrent Networks and Object Tracking
Visual Odometry and Localization
Semantic Scene Graphs and Explicit Representations
Neural Radiance Fields and Implicit Representations
NeRF SLAM
Datasets
Self-Supervised & Representation Learning
Grasp Pose Detection
Tactile Perception for Grasping and Manipulation
Pre-training for Robot Manipulation
Generative Modeling & Dynamic Scenes
Transparent Objects
Explainable and Interpretable AI

RGB-D Architectures

⭐ VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition, Maturana and Scherer, 2015
⭐ PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes, Xiang et al., 2018

Point Cloud Processing

⭐ PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, Qi et al., 2017
⭐ PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, Qi et al., 2017
⭐ DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion, Wang et al., 2019
⭐ Point Transformer
PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies
KPConvX: Modernizing Kernel Point Convolution with Kernel Attention

Object Pose, Geometry, SDF, Implicit surfaces

⭐ DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation, Park et al., 2019
iSDF: Real-Time Neural Signed Distance Fields for Robot Perception, Ortiz et al., 2022
ReFlow6D: Refraction-Guided Transparent Object 6D Pose Estimation via Intermediate Representation Learning

Dense Descriptors, Category-level Representations

⭐ Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation, Florence et al., 2018
⭐ Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation, Wang et al., 2019
kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation, Manuelli et al., 2019
SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings, Haugaard and Buch, 2021

Recurrent Networks and Object Tracking

⭐ Long Short-Term Memory, Hochreiter and Schmidhuber, 1997
XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model, Cheng and Schwing, 2022
TrackFormer: Multi-Object Tracking with Transformers, Meinhardt et al., 2022
6D Object Pose Tracking in Internet Videos for Robotic Manipulation

Visual Odometry and Localization

⭐ Factor Graphs and GTSAM, Dellaert, 2012
⭐ SuperPoint: Self-Supervised Interest Point Detection and Description, DeTone et al., 2017
⭐ SuperGlue: Learning Feature Matching with Graph Neural Networks, Sarlin et al., 2019
Differentiable Particle Filters: End-to-End Learning with Algorithmic Priors, Jonschkowski et al., 2018
Differentiable SLAM-net: Learning Particle SLAM for Visual Navigation, Karkus et al., 2021
VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Semantic Scene Graphs and Explicit Representations

⭐ Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, Krishna et al., 2016
⭐ Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization, Hughes et al., 2022
3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera, Armeni et al., 2019
ConceptFusion: Open-set Multimodal 3D Mapping, Jatavallabhula et al., 2023

Neural Radiance Fields and Implicit Representations

⭐ NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Mildenhall et al., 2020
⭐ 3D Gaussian Splatting for Real-Time Radiance Field Rendering
NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields, Rosinol et al., 2022
Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation, Shen et al., 2023
Language Embedded Radiance Fields, Kerr et al., 2023
PIN-SLAM: LiDAR SLAM Using a Point-Based Implicit Neural Representation for Achieving Global Map Consistency
MISO: Multiresolution Submap Optimization for Efficient Globally Consistent Neural Implicit Reconstruction

NeRF SLAM

Datasets

⭐ Deep Learning for Robots: Learning from Large-Scale Interaction, Levine et al., 2016
⭐ Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning, Makoviychuk et al., 2021
⭐ ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes, Dai et al., 2017
⭐ ShapeNet: An Information-Rich 3D Model Repository, Chang et al., 2015
⭐ MuJoCo: A physics engine for model-based control, Todorov et al., 2012
⭐ CARLA: An Open Urban Driving Simulator, Dosovitskiy et al., 2017
ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills, Gu et al., 2023

Self-Supervised & Representation Learning

⭐ Emerging Properties in Self-Supervised Vision Transformers, Caron et al., 2021
⭐ DINOv2: Learning Robust Visual Features without Supervision, Oquab et al., 2023
Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis
Personalized Representation from Personalized Generation

Grasp Pose Detection

Tactile Perception for Grasping and Manipulation

⭐ GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force, Yuan et al., 2017
More Than a Feeling: Learning to Grasp and Regrasp using Vision and Touch, Calandra et al., 2018

Pre-training for Robot Manipulation

⭐ Attention is All You Need, Vaswani et al., 2017
⭐ An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2020
⭐ CLIP: Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
Transporter Networks: Rearranging the Visual World for Robotic Manipulation, Zeng et al., 2020
CLIPort: What and Where Pathways for Robotic Manipulation, Shridhar et al., 2021
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
RT-1: Robotics Transformer for Real-World Control at Scale, Brohan et al., 2022
⭐ DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO), Shao et al., 2024
⭐ OpenVLA: An Open-Source Vision-Language-Action Model
π0.5: a Vision-Language-Action Model with Open-World Generalization

Generative Modeling & Dynamic Scenes

Transparent Objects

Dex-NeRF: Using a Neural Radiance Field to Grasp Transparent Objects, Ichnowski et al., 2021
ClearPose: Large-scale Transparent Object Dataset and Benchmark, Chen et al., 2022
TranSplat: Surface Embedding-guided 3D Gaussian Splatting for Transparent Object Manipulation

Explainable and Interpretable AI

⭐ Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Selvaraju et al., 2016
⭐ Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, Buolamwini and Gebru, 2018
Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis
