Open-set Object Detection on Instagram Images
With the image collection finalized, the next step is to identify the objects present in the gathered images. The analysis focuses on fashion-related items, so the effects of background elements and non-fashion objects must be mitigated.
This section focuses solely on object detection; filtering out non-fashion-related objects is reserved for the next phase.
Setting Up Grounded-Segment-Anything
Thanks to the contributions of the open-source community, a variety of open-set object detection models are now freely available. For this particular use case, I primarily utilize the pipelines from Grounded-Segment-Anything, incorporating several customized modifications to align with specific project needs.
Image source: Grounded-Segment-Anything
To start, it’s necessary to configure the required environment and dependencies following the guidelines provided in Grounded-Segment-Anything. For this use case, setting up RAM (Recognize Anything Model) and GroundingDINO is sufficient. To create a local GPU environment for GroundingDINO without using Docker, the environment variables must be configured manually; detailed guidance can be found in this GitHub issue: issue-360.
The pretrained models can be accessed via the links below. These models should be downloaded to a local directory for subsequent use.
Model | Pretrained Model Link |
---|---|
RAM (Recognize Anything Model) | ram_swin_large_14m |
GroundingDINO | groundingdino_swint_ogc |
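Once the checkpoints are downloaded, they can be loaded roughly as follows. This is a minimal sketch assuming the default Grounded-Segment-Anything repository layout and a local `checkpoints/` directory; the config path and image size are illustrative:

```python
import torch
from groundingdino.util.inference import load_model
from ram import get_transform
from ram.models import ram

device = "cuda" if torch.cuda.is_available() else "cpu"

# GroundingDINO: the config file ships with the repo; weights downloaded above
gdino_model = load_model(
    "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "checkpoints/groundingdino_swint_ogc.pth",
)

# RAM: image_size must match the pretrained checkpoint (384 for ram_swin_large_14m)
ram_model = ram(
    pretrained="checkpoints/ram_swin_large_14m.pth",
    image_size=384,
    vit="swin_l",
).eval().to(device)
ram_transform = get_transform(image_size=384)
```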
Pipeline Overview
The major steps are:
- Load and process the image for the `RAM` and `GroundingDINO` models.
- Get predicted tags using `RAM`.
- Obtain bounding boxes and phrases from `GroundingDINO` based on the predicted `tags` from the previous step.
- Process bounding boxes and phrases for the final output.
A simplified version of the inference process is shown below:
def inference(self, image_path, res_path, debug=False):
    # Load and preprocess the image into the formats each model expects
    image_pil, image_gdino, image_ram = self.process_image(image_path)

    # Recognize Anything Model: predict free-form tags for the image
    tags = self.get_auto_tags(image_ram)

    # GroundingDINO: ground the predicted tags as bounding boxes and phrases
    bboxes, scores, phrases = self.get_grounding_output(image_gdino, tags)

    # Post-process the raw detections for the final output
    bboxes_processed, phrases_processed = self.process_bboxes(
        image_pil, bboxes, scores, phrases
    )
    return bboxes_processed, phrases_processed
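For reference, the two model-specific helpers map onto the upstream `ram` and `groundingdino` inference APIs roughly as below. This is a sketch rather than the exact implementation: the thresholds are illustrative, and the method bodies assume the models were loaded as in the setup section.

```python
from groundingdino.util.inference import predict
from ram import inference_ram

def get_auto_tags(self, image_ram):
    # RAM returns tags separated by " | "; GroundingDINO expects a
    # comma-separated caption, so the separator is normalized here
    tags_english, _ = inference_ram(image_ram, self.ram_model)
    return tags_english.replace(" |", ",")

def get_grounding_output(self, image_gdino, tags):
    # The RAM tags serve as the free-form text prompt for GroundingDINO
    boxes, logits, phrases = predict(
        model=self.gdino_model,
        image=image_gdino,
        caption=tags,
        box_threshold=0.25,  # illustrative thresholds
        text_threshold=0.2,
    )
    return boxes, logits, phrases
```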
Results
The following are sample outcomes produced by the pipeline. The final results comprise a list of bounding boxes and their corresponding tags, depicted on the right side of each image. Additionally, the initial tags predicted by `RAM` are displayed at the bottom of each example for reference.
RAM Tags: armchair, chair, couch, table, floor, girl, hassock, living room, relax, room, sit, stool, window, woman
RAM Tags: bush, dress, flower, garden, goggles, jumpsuit, pink, rose, sleepwear, stand, sunglasses, walk, wear, woman
RAM Tags: bench, boot, dress, event, goggles, person, picnic table, sit, stool, woman
The advantage of an open-set detection model is its ability to identify a wide range of objects without a predefined taxonomy. However, this flexibility also presents a significant challenge: since the focus is on fashion-related objects, the relevant detections must be selectively retained. Given that the output tags are free-form text, this filtering step becomes much more challenging.
The final result shows that, out of the 30,179 images processed, objects were detected in 29,937 of them, yielding a total of 182,123 identified objects. With 4,561 unique tags, manually selecting the fashion-related ones is impractical. It is therefore essential to develop a strategy that groups similar tags into broader categories, facilitating more efficient processing and analysis; a sketch of the tag-counting step follows the table below.
Most Frequent Tag | Count | Least Frequent Tag | Count |
---|---|---|---|
woman | 13874 | control mp3 player | 1 |
dress | 6688 | oval pendant | 1 |
sandal | 5060 | convenience store | 1 |
girl | 3664 | outlet | 1 |
goggles | 3454 | outhouse | 1 |
man | 3428 | origami | 1 |
hand | 3402 | orchid plant | 1 |
shoe | 2625 | convenience store storefront | 1 |
person | 2562 | conversation couple | 1 |
stool | 2561 | zoo | 1 |
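As a starting point for that grouping, the tag frequencies above can be reproduced directly from the saved detection files. A minimal sketch, assuming the output format shown in the Result Sample below, where each tag carries a trailing confidence such as `goggles(0.60)`:

```python
import json
import re
from collections import Counter
from pathlib import Path

def count_tags(root="ins_posts"):
    # Strip the trailing confidence from tags like "goggles(0.60)"
    strip_score = re.compile(r"\(\d+\.\d+\)$")
    counter = Counter()
    for bbox_file in Path(root).glob("*/bboxes/*.json"):
        result = json.loads(bbox_file.read_text())
        counter.update(strip_score.sub("", tag) for tag in result["tags"])
    return counter

tag_counts = count_tags()
print(tag_counts.most_common(10))  # e.g. [("woman", 13874), ("dress", 6688), ...]
```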
Implementation Details
Input
Name | Description |
---|---|
ins_posts/<username>/images | Images of Instagram posts |
Process
Code | Description |
---|---|
codes/open-set_detection/inference.ipynb | Open-set object detection inference |
Output
Name | Description |
---|---|
ins_posts/<username>/bboxes | New bboxes folder under the same root directory as the images folder, containing the detection results |
Folder Structure:
ins_posts_3
├── AylaDimitri
│ ├── AylaDimitri_posts.json
│ ├── AylaDimitri_profile.json
│ ├── bboxes
│ │ ├── 008a4a567a8c4a5f5f0cb06ec0dc92e8.json
│ │ ├── ...
│ │ └── fcff621d25e77ac24889686453e1befe.json
│ └── images
│ ├── 008a4a567a8c4a5f5f0cb06ec0dc92e8.jpg
│ ├── ...
│ └── fcff621d25e77ac24889686453e1befe.jpg
└── xeniaadonts
├── bboxes
│ ├── 07d8c562f6d1ee6f1a2bdb1453e912d7.json
│ ├── ...
│ └── fbeab7c9d911db651ad4bc1d3bc25062.json
├── images
│ ├── 07d8c562f6d1ee6f1a2bdb1453e912d7.jpg
│ ├── ...
│ └── fbeab7c9d911db651ad4bc1d3bc25062.jpg
├── xeniaadonts_posts.json
└── xeniaadonts_profile.json
Result Sample:
{
"bboxes": [
[
819.04150390625,
450.6321105957031,
984.864013671875,
532.0742797851562
],
[
1132.3551025390625,
928.917236328125,
1436.1605224609375,
1323.56005859375
]
],
"tags": [
"goggles(0.60)",
"stool(0.27)"
]
}
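To sanity-check a result file, the saved boxes can be drawn back onto the source image. A quick sketch, assuming the boxes are absolute `[x1, y1, x2, y2]` pixel coordinates as in the sample above; `draw_bboxes` is a hypothetical helper, not part of the pipeline:

```python
import json
from PIL import Image, ImageDraw

def draw_bboxes(image_path, bbox_json_path, out_path):
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    with open(bbox_json_path) as f:
        result = json.load(f)
    # Each box pairs with the tag at the same index
    for (x1, y1, x2, y2), tag in zip(result["bboxes"], result["tags"]):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, max(y1 - 12, 0)), tag, fill="red")
    image.save(out_path)

draw_bboxes(
    "ins_posts_3/xeniaadonts/images/07d8c562f6d1ee6f1a2bdb1453e912d7.jpg",
    "ins_posts_3/xeniaadonts/bboxes/07d8c562f6d1ee6f1a2bdb1453e912d7.json",
    "preview.jpg",
)
```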