Enhance Distilled Feature Field for Better Language Grounding
Abstract
We replicate the paper Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation (Shen et al., 2023). We identify a research problem: the original method can only handle simple language queries and fails to ground scene objects correctly once the queries become complex and compositional, due to CLIP's bag-of-words behavior. We develop a grounding-decomposition algorithm that grounds these more complex queries to objects in the 3D distilled feature field.
Methodology
At a high level, we first decompose the complex, compositional query into atomic parts by prompting GPT-4 to identify the target object and the reference object (e.g., the metal mug in "wooden block under the metal mug"). We then query these objects separately against the feature field to obtain candidate points. Next, we use the DBSCAN clustering algorithm to group the candidate points by their spatial positions. We then prompt GPT-4 to select the cluster we want and weight the similarity according to the distance to that cluster. Finally, we ground the complex, compositional query to the expected object in the feature field. A minimal sketch of this pipeline is shown below.
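The sketch below assumes precomputed 3D points and CLIP-aligned feature vectors sampled from the distilled feature field. The `encode_text` and `ask_gpt4` callables, the prompt wording, the 0.9 similarity thresholds, and the DBSCAN parameters are illustrative assumptions, not our exact implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cosine_similarity(text_emb: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Cosine similarity between one CLIP text embedding (D,) and N feature vectors (N, D)."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    return feats @ text_emb


def ground(query, points, features, encode_text, ask_gpt4):
    """Ground a compositional query to points of the distilled feature field.

    points:      (N, 3) 3D positions sampled from the feature field
    features:    (N, D) CLIP-aligned feature vectors at those points
    encode_text: callable mapping a string to a CLIP text embedding (D,)
    ask_gpt4:    callable sending a prompt to GPT-4 and returning its answer
    """
    # 1. Decompose the query into target / reference / relation with GPT-4.
    parts = ask_gpt4(
        "Return JSON with keys 'target', 'reference', 'relation' for: " + query
    )  # e.g. {"target": "wooden block", "reference": "metal mug", "relation": "under"}

    # 2. Query target and reference separately against the feature field.
    target_sim = cosine_similarity(encode_text(parts["target"]), features)
    ref_sim = cosine_similarity(encode_text(parts["reference"]), features)
    cand_pts = points[target_sim > 0.9 * target_sim.max()]
    ref_center = points[ref_sim > 0.9 * ref_sim.max()].mean(axis=0)

    # 3. Cluster candidate target points by spatial position with DBSCAN.
    labels = DBSCAN(eps=0.05, min_samples=10).fit_predict(cand_pts)
    centroids = np.stack(
        [cand_pts[labels == k].mean(axis=0) for k in set(labels) if k != -1]
    )

    # 4. Ask GPT-4 which cluster satisfies the relation w.r.t. the reference.
    idx = int(ask_gpt4(
        f"Reference object centroid: {ref_center.tolist()}. "
        f"Candidate cluster centroids: {centroids.tolist()}. "
        f"Which cluster index is '{parts['relation']}' the reference? Answer with the index."
    ))

    # 5. Re-weight the similarity by distance to the chosen cluster so the
    #    final heatmap concentrates on the expected object.
    dists = np.linalg.norm(points - centroids[idx], axis=1)
    return target_sim * np.exp(-dists / (dists.std() + 1e-8))
```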
Result
The baseline model simply encodes the query with CLIP and computes the cosine similarity against the feature vectors. However, the CLIP embedding cannot handle complex, compositional relationships because of CLIP's bag-of-words behavior. The baseline's outputs are noisy and incorrect, while our algorithm produces clear and correct groundings, showcasing a much stronger language-grounding ability. A minimal sketch of the baseline scoring is shown below, followed by several demos.
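This sketch assumes OpenAI's clip package and a precomputed tensor of per-point feature vectors from the distilled field; loading on CPU is an illustrative choice.

```python
import clip
import torch


def baseline_similarity(query: str, features: torch.Tensor) -> torch.Tensor:
    """CLIP-encode the full query and score every feature vector (N, D) by cosine similarity."""
    model, _ = clip.load("ViT-B/32", device="cpu")
    with torch.no_grad():
        text = model.encode_text(clip.tokenize([query]))
    text = text / text.norm(dim=-1, keepdim=True)
    feats = features / features.norm(dim=-1, keepdim=True)
    # (N,) similarity scores; noisy for compositional queries such as
    # "wooden block under the metal mug" because of CLIP's bag-of-words behavior.
    return feats @ text.squeeze(0)
```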
Complex queries consisting of a target, a reference, and a relationship
- Query: wooden block under the metal mug. (Left) Baseline (Right) Ours
- Query: wooden block closest to the apple. (Left) Baseline (Right) Ours
- Query: block under the pink mug. (Left) Baseline (Right) Ours
- Query: mug furthest to the bear. (Left) Baseline (Right) Ours
Grounding with only a target object, without any other reference
- Query: right baymax. (Left) Baseline (Right) Ours
- Query: wooden block at the left. (Left) Baseline (Right) Ours
Multi-view Rendering
We can render the grounding result from arbitrary views. For example, below is a demo of grounding rendered from a top-down view.
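As a simplified illustration (the actual system renders the feature field volumetrically), the sketch below projects the scored 3D points into a hypothetical camera with intrinsics K and pose (R, t) to obtain a similarity heatmap from a new view:

```python
import numpy as np


def project_scores(points, scores, K, R, t, hw=(480, 640)):
    """Splat per-point similarity scores into an image seen from camera (K, R, t)."""
    cam = (R @ points.T + t[:, None]).T            # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6                    # keep points in front of the camera
    uv = (K @ cam[in_front].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)      # perspective divide to pixel coordinates
    h, w = hw
    img = np.zeros(hw)
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    img[uv[ok, 1], uv[ok, 0]] = scores[in_front][ok]
    return img                                     # similarity heatmap from the new view
```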
Limitations
GPT-4 can sometimes hallucinate in spatial-relationship reasoning. For example, GPT-4 may interpret "under" in the 2D image view instead of in a 3D spatial sense. In the following picture, the selected block is "under" the mug if we consider their 2D coordinates in the image, but this is not the "under" we really mean in 3D space.
Project Video Demo
Citation
If you find our work helpful, please consider citing us with the following BibTeX entry:
@article{yinong2024deeprob,
title = {Enhance Distilled Feature Field for Better Language Grounding},
author = {Yinong He and Yixiang Dai and Huijie Tang},
year = {2024}
}
Contact
If you have any questions, feel free to contact Yinong.