Automatic Data Generation for SORNet: PROPS Relation Dataset

Jace Aldrich
University of Michigan

Ariana Verges Alicea
University of Michigan

Hannah Ho
University of Michigan

SORNet Architecture as specified by the original authors (Yuan et al., 2021).

Abstract

Spatial Object-Centric Representation Network (SORNet, Yuan et al., 2021) is a network architecture that takes an RBG image with several canonical object views and outputs object-centric embeddings. The authors of the original paper trained and tested SORnet on their custom Leonardo and Kitchen data sets, as well as the CLEVR dataset (Compositional Language and Elementary Visual Reasoning). We expanded SORnet’s capability by training it on PROPS Dataset (Progress Robot Object Perception Samples), which was extensively used throughout this course. Training SORNet with PROPS dataset allow us to test its capabilities to a real-world dataset in order to better understand how it performs in real-life applications.

Introduction

There are a plethora of applications for robots that can perform sequential tasks that involve manipulating objects around them. These tasks can range from object assembly to organizing and sorting to packing to much more. However, in order to perform these tasks, robots need a way to recognize the orientation of objects in the world frame and in relation to each other. Having accurate results on the positional relationships between objects in a real-world setting is important in order to perform those tasks. So we tackled applying SORNet to real-world data through training it on the PROPS dataset.

Algorithmic Extension

Our update to SORnet introduces an algorithmic extension designed to boost its performance with real-world data. By developing a base class to compute relations on 3D pose or bounding box datasets, we have made it possible for SORnet to process a diverse range of datasets, including scene images, identifiable objects, and 3D object coordinates. Users only need to overload a few methods to return image and object data, and their dataset will be compatable with SORNet. This enhancement notably streamlines the conversion of data into a format that SORnet can handle.

PROPS Best Object Canonical Views

Results

As an example of our dataset conversion framework, SORNet was trained on the PROPS Pose dataset, with the resulting dataset named the PROPS Relation Dataset. It trained on relations regarding if objects in the dataset are “left”, “right”, “in front of”, or “behind” other objects.

Example Predicate Classifications

Model Performance

With the PROPS Relational dataset, SORNet achieved over 99% total validation accuracy, and over 98% validation accuracy per object, demonstrating the efficacy of our approach. These results are almost identical to that of results on n CLEVR-CoGenT.

**Full Size PROPS Data Validation Accuracy Percentages** for all Relationships. Querries are in the form of "is [*row*] [*relation*] [*column*]?", e.g., "is the potted meat can behind the master chef can?"
	Master Chef Can	Cracker Box	Sugar Box	Tomato Soup Can	Mustard Bottle	Tuna Fish Can	Gelatin Box	Potted Meat Can	Mug	Large Marker	Average
Master Chef Can	-	99.30	99.77	98.80	98.90	98.77	98.65	99.20	99.15	98.85	99.04
Cracker Box	99.10	-	99.37	99.80	99.20	99.39	98.54	98.70	99.55	98.30	99.11
Sugar Box	99.20	99.14	-	99.09	99.37	98.89	99.75	99.32	99.54	99.20	99.28
Tomato Soup Can	98.40	99.65	99.26	-	99.40	98.87	98.86	99.60	99.00	99.15	99.13
Mustard Bottle	99.30	98.90	99.26	99.90	-	98.87	99.68	98.95	98.55	98.95	99.15
Tuna Fish Can	98.98	99.28	99.41	98.98	97.95	-	99.11	99.13	98.98	99.28	99.01
Gelatin Box	99.19	99.40	99.88	99.51	99.89	99.33	-	98.81	99.78	99.03	99.43
Potted Meat Can	99.20	98.70	99.03	99.75	98.30	99.38	98.81	-	98.90	98.20	98.92
Mug	98.80	99.45	99.49	98.80	98.70	98.92	99.51	99.65	-	99.45	99.20
Large Marker	98.30	98.10	99.43	99.20	98.95	99.23	99.24	99.20	99.55	-	99.03
Average	98.94	99.10	99.43	99.31	98.96	99.08	99.13	99.17	99.22	98.93	99.13

Complete Average	98.99	99.10	99.36	99.22	99.06	99.04	99.28	99.05	99.21	98.98

Training Performance

With respect to the training process, the PROPS datset had an initial period of little improvent much longer than CLEVR relative to total training time - the figure below shows only the first 9th of CLEVR training and improves much faster relative to its convergence. We theorize that real world data has more noise, leading to increased time to learn proper object embeddings.


CLEVR Dataset Results	PROPS Dataset Results

Other Dataset Use

We highly encourage you to checkout our code and train sornet on your own datasets! Checkout our example usage in our SORNET fork and our example implementaion for the PROPS dataset

Dataset Code Model Code

To use this framework with another dataset, simply create a new file and overload the BaseRelationDataset class, following the exmaple in PropsRelationDataset. You need to overload each method that raises a “NotImplemented” error in the same manner as which PROPS does. If there is an existing dataset manager class, initialize it in the _init_parent_dataset() method of your derived class to make the implementation easier. Otherwise, load the appropriate file information in each class method and the base class should handle the relation information automatically, provided that the camera frame uses the standard notation.

If you would like to add further relations, simply overload the get_spatial_relations() method, and watch for any locations a “4” was hardcoded for the number of relations.

Citation

If you found our work helpful, consider citing us with the following BibTeX reference:

@article{aldrich2024propsrelation,
  title = {Automatic Data Generation for SORNet: PROPS Relation Dataset},
  author = {Jace Aldrich, Ariana Verges Alicea, Hannah Ho},
  year = {2024}
}

Contact

If you have any questions, feel free to contact Jace Aldrich, Ariana Verges and Hannah Ho.