Cluster Review and Selection

Group and Display by Clusters

The open-source library Fiftyone is utilized for reviewing and selecting clusters of interest, thanks to its powerful dataset visualization capabilities.


Image source: Fiftyone

The “Create dynamic group” feature in Fiftyone is exceptionally useful for cluster analysis, as it permits users to dynamically group samples based on a specific field.

In the context of this project, this feature is leveraged to group samples according to their assigned clusters. This organization enables a systematic and efficient review of each cluster, streamlining the process of evaluating and understanding the characteristics of each group.


Create dynamic group by “cluster”, ordered by “dist”.

A potential enhancement to the “Create dynamic group” feature would be the addition of dynamic group size information. As an interim solution, I have manually added an attribute called “cluster_size” to each sample within the dataset to indicate the number of samples in each cluster. While this method serves its purpose, it lacks adaptability and convenience. As the purpose of “Create dynamic group” is to form groups dynamically, it would be ideal if the group size could be displayed dynamically as well.

After grouping, extra information is displayed to facilitate more informative browsing experience.


Extra sample information.

This additional information includes:

  • group_name: Remember that each object tag has been assigned to a tag_group in the object grouping and filtering session. The group_name here is actually the name of the tag_group. This provides an indicative description of what might be contained within the cluster, offering a preliminary understanding of the cluster’s contents.
  • cluster_size: Indicates the total number of samples present in the cluster, giving an idea of the cluster’s magnitude.
  • sim: Represents the cosine similarity score between each sample and the cluster’s center. This score helps in assessing how closely each sample aligns with the central theme or characteristic of the cluster.
  • dist: Shows the euclidean distance between each sample and the cluster center. Similar to the cosine similarity score, this metric also helps in evaluating how closely each sample aligns with the central theme or characteristic of the cluster.

Round One Review: Initial Annotation

After the setup, the next step is to browse through the clusters and select the ones of interest. Since the samples are already organized into clusters, browsing can be efficiently conducted by groups. Furthermore, annotating an entire cluster can be conveniently achieved by adding a sample tag to just a single sample within that cluster.

Annotate the cluster as a whole.

My strategy is to initially annotate the clusters with simple, straightforward tags, like “black”, “red”, “sunglasses”, etc. Due to the nature of the KMeans clustering algorithm, it’s common to encounter clusters that are quite similar.

Using basic tags for these clusters facilitates the grouping of similar clusters in later stages. This approach not only streamlines the initial review process but also sets a solid foundation for more efficient organization and analysis of the clusters subsequently.

After the initial annotation, here are the top 10 tags along with the highest number of associated clusters:

Tag Number of Clusters
sunglasses 12
pink 8
white_dress 6
black_boots 5
yellow 5
denim 4
cap 4
black_sexy 4
white_suit 3
black_bag 3

Round Two Review: Refinement

Utilizing the general tags applied in the initial annotation phase, we can efficiently conduct a second round of browsing. This allows us to re-examine clusters that are grouped under similar tags and further refine our analysis.

For instance, we can collectively review clusters tagged with “sunglasses” to discern if there are any particularly unique clusters within this category. Additionally, we can review all clusters associated with the color black, such as “black_boots”, “black_sexy”, “black_bag”, etc., to identify noteworthy clusters featuring black fashion items.

The purpose of this second round of browsing is to pinpoint genuinely interesting items by reassessing clusters that are similar and, where appropriate, merging them.

Results

The final interesting groups identified through this process are outlined below.

The selected groups.

In this context, a “group” refers to a collection that may consist of several similar clusters combined into a single category.

Group Clusters Num Samples
pink 40, 56, 72, 257, 325, 477, 484, 568 385
bossy female 175, 420, 424 239
red 69, 128, 316, 426 235
black 537 208
weave 195, 195, 244, 342 207
futural sunglasses 107, 530 185
orange 5, 205, 439 172
medal_earring 136, 324 115
green 331, 458, 361, 376 102
denim wear 517 77
black and shining 333, 372 73
bright_floral 196, 367 62
stride 309, 474, 509 62
blue 147, 498 60
white 178 59
light_floral 25, 510, 0 53
warm and soft 53 49
leather 217 40
purple 10 38
cool_floral 518, 209 31
colorful 405 29
yellow 368 22

Implementation Details

Input

Name Description
ins_posts_clusters.csv csv file containing selected objects and their clusters

Process

Code Description
codes/cluster_analysis/clustering.ipynb Clustering on selected objects and pick the clusters of interest

Output

Name Description
ins_posts_clusters_selected_group.csv csv file containing selected clusters and their annotated tags and tag_groups

Data Sample:

{
    "img_path": "/mnt/ssd3/jiangchun/data/ins_posts_batch/ins_posts_2/OliviaLazuardy/images/95fee94fead30f5caca05a6ab1eafc3e.jpg",
    "image_name": "95fee94fead30f5caca05a6ab1eafc3e",
    "cluster": "cluster_537",
    "tag": "black_dress | classic",
    "tag_group": "black"
}