Cluster Review and Selection

Group and Display by Clusters

The open-source library Fiftyone is utilized for reviewing and selecting clusters of interest, thanks to its powerful dataset visualization capabilities.

Image source: Fiftyone

The “Create dynamic group” feature in Fiftyone is exceptionally useful for cluster analysis, as it permits users to dynamically group samples based on a specific field.

In the context of this project, this feature is leveraged to group samples according to their assigned clusters. This organization enables a systematic and efficient review of each cluster, streamlining the process of evaluating and understanding the characteristics of each group.

Create dynamic group by “cluster”, ordered by “dist”.

A potential enhancement to the “Create dynamic group” feature would be the addition of dynamic group size information. As an interim solution, I have manually added an attribute called “cluster_size” to each sample within the dataset to indicate the number of samples in each cluster. While this method serves its purpose, it lacks adaptability and convenience. As the purpose of “Create dynamic group” is to form groups dynamically, it would be ideal if the group size could be displayed dynamically as well.

After grouping, extra information is displayed to facilitate more informative browsing experience.

Extra sample information.

This additional information includes:

group_name: Remember that each object tag has been assigned to a tag_group in the object grouping and filtering session. The group_name here is actually the name of the tag_group. This provides an indicative description of what might be contained within the cluster, offering a preliminary understanding of the cluster’s contents.
cluster_size: Indicates the total number of samples present in the cluster, giving an idea of the cluster’s magnitude.
sim: Represents the cosine similarity score between each sample and the cluster’s center. This score helps in assessing how closely each sample aligns with the central theme or characteristic of the cluster.
dist: Shows the euclidean distance between each sample and the cluster center. Similar to the cosine similarity score, this metric also helps in evaluating how closely each sample aligns with the central theme or characteristic of the cluster.

Round One Review: Initial Annotation

After the setup, the next step is to browse through the clusters and select the ones of interest. Since the samples are already organized into clusters, browsing can be efficiently conducted by groups. Furthermore, annotating an entire cluster can be conveniently achieved by adding a sample tag to just a single sample within that cluster.

Annotate the cluster as a whole.

My strategy is to initially annotate the clusters with simple, straightforward tags, like “black”, “red”, “sunglasses”, etc. Due to the nature of the KMeans clustering algorithm, it’s common to encounter clusters that are quite similar.

Using basic tags for these clusters facilitates the grouping of similar clusters in later stages. This approach not only streamlines the initial review process but also sets a solid foundation for more efficient organization and analysis of the clusters subsequently.

After the initial annotation, here are the top 10 tags along with the highest number of associated clusters:

Tag	Number of Clusters
sunglasses	12
pink	8
white_dress	6
black_boots	5
yellow	5
denim	4
cap	4
black_sexy	4
white_suit	3
black_bag	3

Utilizing the general tags applied in the initial annotation phase, we can efficiently conduct a second round of browsing. This allows us to re-examine clusters that are grouped under similar tags and further refine our analysis.

For instance, we can collectively review clusters tagged with “sunglasses” to discern if there are any particularly unique clusters within this category. Additionally, we can review all clusters associated with the color black, such as “black_boots”, “black_sexy”, “black_bag”, etc., to identify noteworthy clusters featuring black fashion items.

The purpose of this second round of browsing is to pinpoint genuinely interesting items by reassessing clusters that are similar and, where appropriate, merging them.

Results

The final interesting groups identified through this process are outlined below.

The selected groups.

In this context, a “group” refers to a collection that may consist of several similar clusters combined into a single category.

Group	Clusters	Num Samples
pink	40, 56, 72, 257, 325, 477, 484, 568	385
bossy female	175, 420, 424	239
red	69, 128, 316, 426	235
black	537	208
weave	195, 195, 244, 342	207
futural sunglasses	107, 530	185
orange	5, 205, 439	172
medal_earring	136, 324	115
green	331, 458, 361, 376	102
denim wear	517	77
black and shining	333, 372	73
bright_floral	196, 367	62
stride	309, 474, 509	62
blue	147, 498	60
white	178	59
light_floral	25, 510, 0	53
warm and soft	53	49
leather	217	40
purple	10	38
cool_floral	518, 209	31
colorful	405	29
yellow	368	22

Implementation Details

Input

Name	Description
`ins_posts_clusters.csv`	csv file containing selected objects and their clusters

Process

Code	Description
`codes/cluster_analysis/clustering.ipynb`	Clustering on selected objects and pick the clusters of interest

Output

Name	Description
`ins_posts_clusters_selected_group.csv`	csv file containing selected clusters and their annotated tags and tag_groups

Data Sample:

{
    "img_path": "/mnt/ssd3/jiangchun/data/ins_posts_batch/ins_posts_2/OliviaLazuardy/images/95fee94fead30f5caca05a6ab1eafc3e.jpg",
    "image_name": "95fee94fead30f5caca05a6ab1eafc3e",
    "cluster": "cluster_537",
    "tag": "black_dress | classic",
    "tag_group": "black"
}