Deep Learning Research Papers for Robot Perception: Important Works
This collection highlights foundational papers and promising recent research in deep learning and robot perception.
Table of contents
- RGB-D Architectures
- Point Cloud Processing
- Object Pose, Geometry, SDF, Implicit surfaces
- Dense Descriptors, Category-level Representations
- Recurrent Networks and Object Tracking
- Visual Odometry and Localization
- Semantic Scene Graphs and Explicit Representations
- Neural Radiance Fields and Implicit Representations
- NeRF SLAM
- Datasets
- Self-Supervised & Representation Learning
- Grasp Pose Detection
- Tactile Perception for Grasping and Manipulation
- Pre-training for Robot Manipulation
- Generative Modeling & Dynamic Scenes
- Transparent Objects
- Explainable and Interpretable AI
RGB-D Architectures
- ⭐ VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition, Maturana and Scherer, 2015
- ⭐ PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes, Xiang et al., 2018
Point Cloud Processing
- ⭐ PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, Qi et al., 2017
- ⭐ PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, Qi et al., 2017
- ⭐ DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion, Wang et al., 2019
- ⭐ Point Transformer
- PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies
- KPConvX: Modernizing Kernel Point Convolution with Kernel Attention
Object Pose, Geometry, SDF, Implicit surfaces
- ⭐ DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation, Park et al., 2019
- iSDF: Real-Time Neural Signed Distance Fields for Robot Perception, Ortiz et al., 2022
- ReFlow6D: Refraction-Guided Transparent Object 6D Pose Estimation via Intermediate Representation Learning
Dense Descriptors, Category-level Representations
- ⭐ Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation, Florence et al., 2018
- ⭐ Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation, Wang et al., 2019
- kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation, Manuelli et al., 2019
- SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings, Haugaard and Buch, 2021
Recurrent Networks and Object Tracking
- ⭐ Long Short-Term Memory, Hochreiter and Schmidhuber, 1997
- XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model, Cheng and Schwing, 2022
- TrackFormer: Multi-Object Tracking with Transformers, Meinhardt et al., 2022
- 6D Object Pose Tracking in Internet Videos for Robotic Manipulation
Visual Odometry and Localization
- ⭐ Factor Graphs and GTSAM, Dellaert, 2012
- ⭐ SuperPoint: Self-Supervised Interest Point Detection and Description, DeTone et al., 2017
- ⭐ SuperGlue: Learning Feature Matching with Graph Neural Networks, Sarlin et al., 2019
- Differentiable Particle Filters: End-to-End Learning with Algorithmic Priors, Jonschkowski et al., 2018
- Differentiable SLAM-net: Learning Particle SLAM for Visual Navigation, Karkus et al., 2021
- VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold
Semantic Scene Graphs and Explicit Representations
- ⭐ Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, Krishna et al., 2016
- ⭐ Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization, Hughes et al., 2022
- 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera, Armeni et al., 2019
- ConceptFusion: Open-set Multimodal 3D Mapping, Jatavallabhula et al., 2023
Neural Radiance Fields and Implicit Representations
- ⭐ NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Mildenhall et al., 2020
- ⭐ 3D Gaussian Splatting for Real-Time Radiance Field Rendering
- NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields, Rosinol et al., 2022
- Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation, Shen et al., 2023
- LERF: Language Embedded Radiance Fields, Kerr et al., 2023
- PIN-SLAM: LiDAR SLAM Using a Point-Based Implicit Neural Representation for Achieving Global Map Consistency
- MISO: Multiresolution Submap Optimization for Efficient Globally Consistent Neural Implicit Reconstruction
NeRF SLAM
- DDN-SLAM: Real Time Dense Dynamic Neural Implicit SLAM
- GlORIE-SLAM: Globally Optimized RGB-only Implicit Encoding Point Cloud SLAM
- EC-SLAM: Effectively Constrained Neural RGB-D SLAM with Sparse TSDF Encoding and Global Bundle Adjustment
- HERO-SLAM: Hybrid Enhanced Robust Optimization of Neural SLAM
Datasets
- ⭐ Deep Learning for Robots: Learning from Large-Scale Interaction, Levine et al., 2016
- ⭐ Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning, Makoviychuk et al., 2021
- ⭐ ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes, Dai et al., 2017
- ⭐ ShapeNet: An Information-Rich 3D Model Repository, Chang et al., 2015
- ⭐ MuJoCo: A physics engine for model-based control, Todorov et al., 2012
- ⭐ CARLA: An Open Urban Driving Simulator, Dosovitskiy et al., 2017
- ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills, Gu et al., 2023
Self-Supervised & Representation Learning
- ⭐ Emerging Properties in Self-Supervised Vision Transformers, Caron et al., 2021
- ⭐ DINOv2: Learning Robust Visual Features without Supervision, Oquab et al., 2023
- Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis
- Personalized Representation from Personalized Generation
Grasp Pose Detection
- ⭐ Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics, Mahler et al., 2017
- ⭐ FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
- Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes, Sundermeyer et al., 2021
- Any6D: Model-free 6D Pose Estimation of Novel Objects
- Superquadrics-based Grasp Pose Estimation on Larger Objects for Mobile-Manipulation
- RTAGrasp: Learning Task-Oriented Grasping from Human Videos via Retrieval, Transfer, and Alignment
Tactile Perception for Grasping and Manipulation
- ⭐ GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force, Yuan et al., 2017
- More Than a Feeling: Learning to Grasp and Regrasp using Vision and Touch, Calandra et al., 2018
Pre-training for Robot Manipulation
- ⭐ Attention is All You Need, Vaswani et al., 2017
- ⭐ An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2020
- ⭐ CLIP: Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
- Transporter Networks: Rearranging the Visual World for Robotic Manipulation, Zeng et al., 2020
- CLIPort: What and Where Pathways for Robotic Manipulation, Shridhar et al., 2021
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
- RT-1: Robotics Transformer for Real-World Control at Scale, Brohan et al., 2022
- ⭐ DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO), Shao et al., 2024
- ⭐ OpenVLA: An Open-Source Vision-Language-Action Model
- π0.5: a Vision-Language-Action Model with Open-World Generalization
Generative Modeling & Dynamic Scenes
- ⭐ Deep Reinforcement Learning from Human Preferences, Christiano et al., 2017
- Planning with Diffusion for Flexible Behavior Synthesis, Janner et al., 2022
- Anything-3D: Towards Single-view Anything Reconstruction in the Wild, Shen et al., 2023
- DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
- One Diffusion to Generate Them All
- DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
- SplatFormer: Point Transformer for Robust 3D Gaussian Splatting
- Self-Contrastive Fine-Tuning for Equitable Image Generation
Transparent Objects
- Dex-NeRF: Using a Neural Radiance Field to Grasp Transparent Objects, Ichnowski et al., 2021
- ClearPose: Large-scale Transparent Object Dataset and Benchmark, Chen et al., 2022
- TranSplat: Surface Embedding-guided 3D Gaussian Splatting for Transparent Object Manipulation
Explainable and Interpretable AI
- ⭐ Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Selvaraju et al., 2016
- ⭐ Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, Buolamwini and Gebru, 2018
- Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis