Clustering on Selected Objects
Handling a Large Embedding Matrix
As discussed in the object embedding section, the feature embedding dimension is set at 1,792. Clustering 61,751 objects therefore requires an embedding matrix of 61,751 x 1,792, and managing such an extensive matrix in memory presents a significant challenge.
NumPy, however, offers an effective solution to this problem. With `numpy.memmap`, it is possible to create a memory-mapped file that links to an array stored in a binary file on disk. Opening the memmap in `"w+"` mode, individual embedding files can be loaded one at a time and appended to it. This breaks the loading and processing of a large array into manageable segments, effectively circumventing the limitations of in-memory operations.
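A minimal sketch of this pattern follows; the file names, directory layout, and dtype are illustrative assumptions, not the project's actual paths:

```python
from pathlib import Path
import numpy as np

N_OBJECTS, DIM = 61_751, 1_792        # totals from this project
MMAP_PATH = "object_embeddings.dat"   # hypothetical output path

# "w+" creates (or overwrites) the binary file on disk with read/write access.
mmap = np.memmap(MMAP_PATH, dtype="float32", mode="w+", shape=(N_OBJECTS, DIM))

offset = 0
# Hypothetical layout: per-user .npy files, each holding one or more 1792-d rows.
for npy_file in sorted(Path("ins_posts").glob("*/embeddings/*.npy")):
    batch = np.load(npy_file).reshape(-1, DIM)
    mmap[offset : offset + len(batch)] = batch
    offset += len(batch)

mmap.flush()   # push buffered writes to disk
```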
Efficient Clustering with MiniBatchKMeans
As noted in the text clustering section, Hierarchical Agglomerative Clustering (HAC) is intuitive because it allows setting a `distance_threshold` rather than specifying the number of clusters up front. However, HAC scales poorly, and with a substantial number of objects to cluster — specifically 61,751 — a more efficient and scalable clustering algorithm is essential.
Image source: MiniBatchKMeans: Mini-Batch K-Means clustering.
A suitable solution is the MiniBatchKMeans algorithm, an adaptation of the KMeans algorithm designed to enhance computational efficiency. MiniBatchKMeans employs mini-batches to reduce computation time while still aiming to optimize the same objective function as KMeans.
The key advantage of MiniBatchKMeans lies in its `partial_fit` method, which trains on one mini-batch of data at a time. This capability pairs naturally with the pre-built memory-mapped file: opening the memmap in `"r"` mode, batches of embeddings can be loaded sequentially and `partial_fit` applied to each.
Once the entire embedding matrix has been processed, it can be reloaded in batches for `predict`, which assigns each object to its most appropriate cluster based on the learned model.
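The sketch below covers both passes, assuming the memory-mapped file from the previous step and an illustrative batch size of 1,024:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

N_OBJECTS, DIM, BATCH = 61_751, 1_792, 1_024   # batch size is an assumption

# "r" reopens the memory-mapped file read-only, without loading it into RAM.
mmap = np.memmap("object_embeddings.dat", dtype="float32", mode="r",
                 shape=(N_OBJECTS, DIM))

kmeans = MiniBatchKMeans(n_clusters=600, random_state=42)

# Pass 1 — training: feed the matrix to partial_fit one slice at a time.
for start in range(0, N_OBJECTS, BATCH):
    kmeans.partial_fit(mmap[start : start + BATCH])

# Pass 2 — prediction: assign every object to its nearest learned centroid.
labels = np.concatenate([
    kmeans.predict(mmap[start : start + BATCH])
    for start in range(0, N_OBJECTS, BATCH)
])
```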
Clustering Results
The number of clusters (`n_clusters`) is set to 600, aiming for roughly 100 samples per cluster given the overall count of 61,751 objects.
The 10 clusters with the most samples and the 10 clusters with the fewest samples are listed below:
| Top Clusters | Num Samples | Tail Clusters | Num Samples |
|---|---|---|---|
| 152 | 512 | 222 | 1 |
| 61 | 493 | 192 | 1 |
| 202 | 440 | 328 | 1 |
| 233 | 422 | 551 | 1 |
| 24 | 420 | 596 | 1 |
| 48 | 355 | 284 | 1 |
| 186 | 347 | 584 | 1 |
| 157 | 341 | 540 | 1 |
| 132 | 339 | 429 | 1 |
| 47 | 338 | 513 | 1 |
It’s essential to recognize that while MiniBatchKMeans offers efficiency, these clustering results are initial and somewhat provisional. The choice of `n_clusters` is an approximation, and because k-means uses every sample when computing cluster centres, potential outliers are included as well. This underscores the need for careful interpretation of the clustering outcomes.
Post-processing
Post-processing refines the clustering results in several key steps, using the Euclidean distance and cosine similarity computed between each sample and its cluster centre (see the sketch after this list).
- Removing Duplicate Images within Clusters: To avoid redundancy (like having both left and right shoes in the same cluster), duplicate image names are removed from each cluster. This action reduces the overall sample count to 54,310.
- Filtering Based on Cosine Similarity: Samples with a cosine similarity greater than 0.7 are retained. This filtering narrows down the clusters to 585, containing 20,815 objects in total.
- Retaining Clusters with Sufficient Samples: Clusters with more than 10 samples are kept. This final step results in 339 clusters with a collective count of 20,030 objects.
These steps ensure the clustering results are more coherent and relevant for further analysis and applications.
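A sketch of these three filters with pandas, assuming a DataFrame with the `image_name`, `cluster`, and `sim` columns shown in the output sample at the end of this section (the input file name is hypothetical):

```python
import pandas as pd

# One row per object; `sim` is the cosine similarity to its cluster centre.
df = pd.read_csv("ins_posts_clusters_raw.csv")   # hypothetical intermediate file

# 1. Drop duplicate image names within each cluster (e.g. left/right shoes).
df = df.drop_duplicates(subset=["cluster", "image_name"])

# 2. Keep only samples whose cosine similarity to the centre exceeds 0.7.
df = df[df["sim"] > 0.7]

# 3. Keep only clusters that still have more than 10 samples.
df = df.groupby("cluster").filter(lambda g: len(g) > 10)
```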
The top and tail 10 clusters after post-processing are listed below:
| Top Clusters | Num Samples | Tail Clusters | Num Samples |
|---|---|---|---|
| 152 | 497 | 469 | 11 |
| 61 | 460 | 221 | 11 |
| 575 | 278 | 134 | 11 |
| 556 | 267 | 495 | 11 |
| 157 | 244 | 421 | 10 |
| 57 | 240 | 68 | 10 |
| 430 | 239 | 33 | 10 |
| 48 | 234 | 218 | 10 |
| 132 | 231 | 102 | 10 |
| 347 | 216 | 15 | 10 |
Observing the shifts in top clusters after post-processing is insightful. The changes underscore the efficacy of the post-processing steps in retaining compact clusters. These clusters are composed of objects that not only closely resemble each other but also have a sufficient number of samples to be statistically significant.
| Original Top Clusters | Original Num Samples | After Post-processing | New Top Clusters | Original Num Samples | After Post-processing |
|---|---|---|---|---|---|
| 152 | 512 | 497 | 152 | 512 | 497 |
| 61 | 493 | 460 | 61 | 493 | 460 |
| 202 | 440 | 0 | 575 | 328 | 278 |
| 233 | 422 | 146 | 556 | 288 | 267 |
| 24 | 420 | 135 | 157 | 341 | 244 |
| 48 | 355 | 234 | 57 | 258 | 240 |
| 186 | 347 | 158 | 430 | 276 | 239 |
| 157 | 341 | 244 | 48 | 355 | 234 |
| 132 | 339 | 231 | 132 | 339 | 231 |
| 424 | 338 | 94 | 347 | 282 | 216 |
Analyzing specific clusters can provide valuable insights into the types of objects that are excluded during post-processing.
Take cluster 202, for instance, which is entirely eliminated after post-processing. This suggests that the cluster may have been noisy, containing a diverse or unrelated set of objects that did not cohere well as a group.
Samples from Cluster 202.
Reviewing the top and tail samples by cosine similarity provides a clear picture of the cluster’s range and helps to validate the effectiveness of the post-processing steps in refining the overall clustering results.
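Selecting those samples is straightforward with the post-processed DataFrame from the sketch above (`df` and its `sim` column are the same assumed names):

```python
# Rank one cluster's members by similarity to the centre, then take both ends.
cluster_df = df[df["cluster"] == 24].sort_values("sim", ascending=False)
top_samples = cluster_df.head(10)    # most central, prototypical objects
tail_samples = cluster_df.tail(10)   # least central, likely borderline objects
```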
Cluster 24
Top Samples from Cluster 24.
Tail Samples from Cluster 24.
Cluster 233
Top Samples from Cluster 233.
Tail Samples from Cluster 233.
Potential Improvements
It’s important to note that several parameters of `MiniBatchKMeans` could be adjusted to potentially enhance the clustering results: modifying the batch size, increasing the number of iterations, or altering the number of clusters could all affect performance.
Additionally, the order in which embeddings are appended to the memory-mapped file can influence the clustering process. Here, embeddings are appended by username and then by image name, so the mini-batches used during clustering are not random, which could reduce the effectiveness of the clustering algorithm; one possible mitigation is sketched below.
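Under the same assumed names as the training snippet above (`mmap`, `kmeans`, `N_OBJECTS`, `BATCH`), the rows can be visited in a random order so that each mini-batch mixes users:

```python
import numpy as np

rng = np.random.default_rng(42)
order = rng.permutation(N_OBJECTS)   # random visiting order over all objects

for start in range(0, N_OBJECTS, BATCH):
    # Sorting each batch's indices keeps disk reads mostly sequential;
    # fancy indexing on the memmap materialises only one batch in RAM.
    idx = np.sort(order[start : start + BATCH])
    kmeans.partial_fit(mmap[idx])
```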
Nevertheless, given that the assessment of clustering results often relies on subjective judgement, extensive tuning of the clustering algorithm may not be the most efficient use of resources.
The ultimate goal is to ensure that clusters are compact, meaning that they consist of highly similar objects. This ensures that later manual review, where interesting clusters are selected and potentially combined to form groups for display, is based on solid, coherent clusters.
Therefore, as long as the clusters exhibit a high degree of similarity among their objects, the process can proceed to the manual review and selection phase for final group-level display.
Implementation Details
Input
| Name | Description |
|---|---|
| ins_posts_object_tags_cleaned.csv | CSV file containing objects with both raw and cleaned tags |
| ins_posts/<username>/embeddings | Visual embeddings of full images and objects |
Process
| Code | Description |
|---|---|
| codes/cluster_analysis/clustering.ipynb | Clusters the selected objects and picks the clusters of interest |
Output
| Name | Description |
|---|---|
| ins_posts_clusters.csv | CSV file containing the selected objects and their clusters |
Data Sample:
{
  "batch_folder": "ins_posts_2",
  "username": "amandashadforth",
  "res_name": "c5d37b4a4b21a1593b1086976d9d92cb.json",
  "bbox": "[235.21124267578125, 166.27415466308594, 388.292724609375, 257.6859130859375]",
  "tag": "forehead",
  "score": 0.6,
  "idx": 3,
  "image_name": "c5d37b4a4b21a1593b1086976d9d92cb",
  "cluster": 29,
  "dist": 0.5734329223632812,
  "sim": 0.8205487132072449
}