OD-VOS: Object Detection for Video Object Segmentation

Omer Benharush
University of Michigan

Sarah Chan
University of Michigan

Jack Kernan
University of Michigan

Max Rucker
University of Michigan

Abstract

Long-term video object segmentation (VOS) often requires manual intervention for selecting object masks, which can be impractical for various applications in robotics. In our project, termed OD-VOS, we propose a novel approach that integrates Vision Transformer for Open-World Localization (OWL-ViT) with XMem. OWL-ViT is an object detection network that can identify objects based on text queries, and XMem is a VOS framework which efficiently stores memory inspired by the Atkinson-Shiffrin memory model. OD-VOS uses OWL-ViT to automate the mask selection process for XMem. Our method eliminates the need for manual mask selection, thus streamlining the segmentation pipeline and reducing human effort. By automating the mask selection process, our framework enhances the applicability of VOS techniques.

Demo

Example with 1 object:

Example with 2 objects:

Project Video

Citation

If you found our work helpful, consider citing us with the following BibTeX reference:

@article{chan2024deeprob,
  title = {OD-VOS: Object Detection Video Object Segmentation},
  author = {Benharush, Omer and Chan, Sarah and Kernan, Jack and Rucker, Max},
  year = {2024}
}

Contact

If you have any questions, feel free to contact Sarah Chan.