Text Embedding and Clustering

As highlighted in the previous section, grouping similar free-form tags into broader categories via text classification depends heavily on how candidate_labels is constructed, and that design process tends to be iterative and inefficient.

Consequently, an alternative approach is considered: embedding the tags as vectors and clustering the embeddings to group similar tags.

Obtaining text embeddings for the tags is straightforward; clustering is then performed on these embeddings to form groups.
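As a minimal sketch of the embedding step (assuming the sentence-transformers package; the model is the one discussed under "Challenges of the Choice of Model" below, and the short `tags` list is a stand-in for the cleaned tag vocabulary):

```python
from sentence_transformers import SentenceTransformer

# The pre-trained model used in this section.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Stand-in for the cleaned tag vocabulary loaded from ins_posts_object_tags_cleaned.csv.
tags = ["shirt shirtless", "coffee cup", "duck", "swimwear bikini"]

# Each tag becomes a fixed-size vector (384 dimensions for this model).
embeddings = model.encode(tags)  # shape: (len(tags), 384)
```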

Agglomerative Clustering with Precomputed Cosine Distance

Reference: sklearn.cluster.AgglomerativeClustering

Cosine distance is a logical metric for measuring the proximity of the generated embeddings: its bounded range of 0 to 2 makes threshold setting straightforward. One preparatory step is to L2-normalize the embeddings, after which the cosine distance between two tags is simply 1 minus the dot product of their vectors.

Agglomerative Clustering is a fitting choice for the clustering task because it accepts a distance_threshold instead of a fixed number of clusters, which makes the parameter more intuitive to set. The relatively small number of tags also keeps this potentially resource-intensive algorithm affordable.

Since the resulting clusters must be reviewed manually, the distance_threshold is chosen mainly by the number of clusters it yields. A lower threshold produces more clusters, because tags are merged only when their cosine distance falls below it.

To streamline the review process, the cluster count is capped at fewer than 200. For the current use case, setting the distance_threshold to 0.7 results in 166 clusters.
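A sketch of this step, reusing `embeddings` from above; the average linkage is an assumption, since the source does not state which linkage was used ("ward" is not available with precomputed distances):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# L2-normalize so the cosine distance reduces to 1 minus a dot product.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Pairwise cosine distance matrix, bounded in [0, 2].
dist = 1.0 - normed @ normed.T

clustering = AgglomerativeClustering(
    n_clusters=None,          # let distance_threshold decide the cluster count
    distance_threshold=0.7,   # the value that yields 166 clusters here
    metric="precomputed",     # we pass distances, not raw features (sklearn >= 1.2)
    linkage="average",        # assumption: any linkage except "ward" works here
)
labels = clustering.fit_predict(dist)
print(labels.max() + 1, "clusters")
```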

Below are some of the top clusters identified:

| group | num_tags |
|---|---|
| 100 | 276 |
| 14 | 276 |
| 19 | 184 |
| 69 | 138 |
| 27 | 104 |
| 13 | 99 |
| 122 | 97 |
| 53 | 84 |
| 117 | 81 |
| 91 | 79 |

To gain insight into the top clusters, a sampling of 20 tags from each cluster can be examined:

| Cluster | Concept | Sampled Tags |
|---|---|---|
| 100 | Clothing | shirt shirtless, cardigan sweater, suit, legging, undercloth, shirt t shirt, business suit jumpsuit, top shirt, sweatshirt wear, gray legging, trench coat coat, t shirt wear, blue dress dress shirt, black sports coat jacket, wear woman, sports coat, bride, pillow shirt, costume, muscle shirtless |
| 14 | Food | grain, oats peanut, daikon, butter, pancake toy, french toast, sandwich, pepperoni, almond, chocolate chip cookie, spaghetti, potato tofu, pancake pastry waffle, dessert, cheese cream cheese, bread sandwich, bean potato, chip peanut butter, cheese, carrot |
| 19 | Room/House | mall, collage store, shelf, store store, boutique, file cabinet, stadium, closet room, bookcase bookshelf, family room living room, convenience store storefront, barn, clothing store store, bookshel, hotel room, bedroom living room room, kitchen room, liquor store store, bookstore building, cabinetry kitchen cabinet |
| 69 | Beverage | bottle drink, beverage coffee cup, coffee coffee cup cup mug, wine bottle, beverage champagne, coffee coffee cup cup, beverage paper cup, beer beverage, cup drink, bartender person man, wine, glass wine glass, champagne wine, orange juice soda, coffee cup cup, beverage, cup face, juice beer beverage, liquid tea, drink juice soda |
| 27 | Animals | duck, cub, owl, animal lamb, hen, bull cow, insect, animal dog, bird cage lamp, wolf, bird cage candle, eagle, rabbit, jungle, turkey, goat, husky, snake, butterfly, pigtail |

The above results clearly demonstrate the efficacy of the clustering process in grouping tags that share similar concepts.
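Review tables like the two above can be produced with a few lines of pandas; a hypothetical sketch, reusing the `tags` and `labels` stand-ins from the earlier snippets:

```python
import pandas as pd

df = pd.DataFrame({"tag": tags, "group": labels})

# Cluster sizes, largest first (the "top clusters" table).
sizes = df["group"].value_counts().rename("num_tags")
print(sizes.head(10))

# Up to 20 sampled tags per cluster for manual review.
samples = df.groupby("group")["tag"].apply(
    lambda s: ", ".join(s.sample(min(20, len(s)), random_state=0))
)
print(samples.loc[sizes.index[:10]])
```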

Nonetheless, hierarchical clustering has a drawback: it does not yield cluster centers. Cluster centers therefore have to be defined and given meaningful names to aid the subsequent filtering steps, the ultimate goal being to filter at the cluster level rather than over the individual original tags.

Assigning Cluster Names

Method 1: Utilizing Top-N Tags Closest to Cluster Center

In this method, the center embedding of each cluster is its L2-normalized mean embedding. The top-n tags closest to this center are chosen to represent the cluster, the aim being to reflect the cluster’s diversity through multiple representative samples.

However, there’s a noticeable drawback. Even though this method doesn’t solely rely on the tag nearest to the center embedding, it still selects the top tags based on their proximity to a single center. Consequently, these selected tags might be too similar to each other, potentially failing to capture the full diversity of the cluster.
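Despite that caveat, the method is simple. A sketch, assuming the rows of cluster_embeddings are L2-normalized as in the clustering step, with n=3 matching the three-part names shown later:

```python
import numpy as np

def name_cluster_top_n(cluster_embeddings, cluster_tags, n=3):
    """Name a cluster with the n tags closest to its L2-normalized mean embedding."""
    center = cluster_embeddings.mean(axis=0)
    center /= np.linalg.norm(center)      # L2-normalize the mean embedding
    sims = cluster_embeddings @ center    # cosine similarity to the center
    top = np.argsort(-sims)[:n]           # indices of the n nearest tags
    return " | ".join(cluster_tags[i] for i in top)
```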

Method 2: Utilizing N Sub-cluster Centers for Each Cluster

To obtain a more representative and diverse set of tags for each cluster, KMeans clustering is applied to the text embeddings of its samples, with the number of sub-clusters set to n.

Within these n sub-clusters, the tag nearest to the center of each sub-cluster is pinpointed. Consequently, n tags are identified for the n sub-clusters, and these tags are then used to represent the overarching main cluster.

This technique is anticipated to produce a broader and more varied collection of tags, effectively encapsulating the unique characteristics of each cluster.
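A corresponding sketch under the same assumptions; the helper name and the guard for small clusters are illustrative additions:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def name_cluster_kmeans(cluster_embeddings, cluster_tags, n=3):
    """Name a cluster with the tag nearest to each of its n KMeans sub-cluster centers."""
    n = min(n, len(cluster_tags))  # guard: a cluster may hold fewer than n tags
    km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(cluster_embeddings)
    # Index of the closest tag embedding to each sub-cluster center.
    nearest = pairwise_distances_argmin(km.cluster_centers_, cluster_embeddings)
    return " | ".join(cluster_tags[i] for i in nearest)
```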

Comparing Both Methods

To compare the two methods, 10 groups are randomly sampled and the cluster names each method produces are shown side by side:

| Group | #Tags | Group Name | Group Name (KMeans) |
|---|---|---|---|
| 26 | 49 | strap tie \| scarf tie \| bow tie | shawl \| hanger \| strap tie |
| 61 | 59 | swimwear bikini \| swimwear bikini top \| swim bikini | bath \| swimwear bikini \| pool |
| 39 | 69 | child girl woman \| girl person woman \| child woman | child girl \| person woman \| mother |
| 158 | 57 | doorway \| arch doorway \| doorway entrance | hallway \| stair \| door doorway |
| 38 | 75 | bonnet hat \| cap hat \| dress hat hat | neck \| dress hat hat \| helmet |
| 81 | 38 | ice \| snow \| snowboard | ski \| snow \| ice cube |
| 45 | 72 | hairbrush \| hair \| hair hair | haircut \| braid ponytail \| brush |
| 69 | 138 | beverage \| beverage drink \| beverage cup | alcohol beverage bottle \| juice juice \| coffee cup |
| 160 | 30 | sandal shoe \| sandal \| foot sandal | sleepwear \| sandal \| sandal shoe |
| 31 | 42 | river water \| creek water \| water waterway | canal \| hose \| lake water |

The comparison clearly shows that group_name_kmeans offers greater diversity compared to group_name. Consequently, group_name_kmeans is selected as the definitive naming convention for the clusters. Following this, a manual review of the 166 cluster names is conducted to identify those most relevant for further analysis, with a focus primarily on fashion-related groups.

The finalized selection of groups includes:

| Group Name | Number of Tags |
|---|---|
| coat jacket \| dress garment \| dress shirt | 276 |
| neck \| dress hat hat \| helmet | 75 |
| child girl \| person woman \| mother | 69 |
| boot cowboy boot \| leg \| foot shoe | 66 |
| bath \| swimwear bikini \| pool | 59 |
| bag bag \| tarp \| clutch | 55 |
| shawl \| hanger \| strap tie | 49 |
| earring \| necklace \| earring | 46 |
| hip waist \| stomach waist \| belt | 42 |
| sleepwear \| sandal \| sandal shoe | 30 |
| earphone goggles \| glasses goggles \| lens | 23 |
| trainer woman \| wrestler \| ballet dancer | 15 |
| kiwi \| khaki wear \| dress kilt | 14 |
| cardigan crop top \| crop top shirt \| crop top | 13 |
| bat \| man rugby player \| baseball player | 12 |
| person spectator \| antler \| crowd person | 8 |
| student \| class classroom \| campus | 8 |
| worker \| home office \| office | 8 |
| boy man \| hisk \| man | 7 |
| love \| beautiful girl \| beautiful | 6 |
| tourist woman \| tourist \| spa | 6 |
| poncho \| poinsettia \| poncho robe | 5 |
| suede \| yurt \| crape | 5 |
| groom man \| bishop \| jockey | 4 |
| miniskirt \| big ben \| miniature | 3 |

Challenges of the Choice of Model

Just like with zero-shot text classification, the effectiveness of text embedding and clustering hinges on the chosen pre-trained language model. In this case, the sentence-transformers/all-MiniLM-L6-v2 model is utilized.

However, considering the task involves embedding tags, typically simple words rather than complex sentences, the impact of the model choice may not be as critical. Additionally, the unsupervised nature of clustering algorithms often makes them more resilient to minor variations in embeddings.

Furthermore, the clustering method reduces the need for meticulous taxonomy design, which simplifies the task significantly. Because it produces groups, the results can be reviewed group by group; with text classification, by contrast, correctness has to be verified tag by tag when necessary.

Implementation Details

Input

| Name | Description |
|---|---|
| ins_posts_object_tags_cleaned.csv | csv file containing objects with both raw and cleaned tags |

Process

| Code | Description |
|---|---|
| codes/object_grouping_and_filtering/text_embedding_clustering.ipynb | Tag embedding and clustering |

Output

| Name | Description |
|---|---|
| tag_group.csv | csv file containing tags and their respective groups |

Data Sample:

```json
{
    "tag": "clown fish",
    "idx": 4464,
    "group": 50,
    "group_name": "meat seafood | shellfish | seafood",
    "group_size": 66,
    "selected": 0
}
```

`selected` indicates whether the tag is selected for further analysis.
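Downstream filtering can then operate at the group level; a hypothetical usage sketch, assuming `selected` is stored as 0/1:

```python
import pandas as pd

tag_group = pd.read_csv("tag_group.csv")

# Keep only tags whose cluster survived the manual review.
kept = tag_group[tag_group["selected"] == 1]

# Map each kept tag to its human-readable group name for cluster-level filtering.
tag_to_group = dict(zip(kept["tag"], kept["group_name"]))
```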