Text Embedding and Clustering
As highlighted in the previous section, the method of grouping similar free-form tags into broader categories through text classification is heavily dependent on the construction of `candidate_labels`. This design process, unfortunately, tends to be iterative and inefficient.
Consequently, an alternative approach is considered, involving the use of text embedding combined with clustering to group similar tags.
Obtaining text embeddings for the tags is straightforward. Clustering is then performed on these embeddings to form groups.
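For illustration, here is a minimal sketch of the embedding step with the sentence-transformers library (the model used in this section is named later, under Challenges); the tag list is purely illustrative:

```python
# Minimal sketch: embed free-form tags with a pre-trained sentence encoder.
# The tag list is illustrative; the real input comes from the cleaned tags.
from sentence_transformers import SentenceTransformer

tags = ["shirt", "sweater", "coffee cup", "wine bottle", "duck"]
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(tags)  # shape: (num_tags, 384)
```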
Agglomerative Clustering with Precomputed Cosine Distance
Reference: `sklearn.cluster.AgglomerativeClustering`
Utilizing cosine distance as the metric for evaluating the proximity of the generated embeddings is a logical choice: its bounded range of 0 to 2 makes threshold setting intuitive. However, the embeddings must be L2-normalized before the cosine distances are computed.
The choice of Agglomerative Clustering for the clustering task is fitting, given that it allows for the use of a `distance_threshold` instead of specifying the number of clusters. This makes the parameter setting more intuitive. Additionally, the relatively small number of tags justifies the use of this potentially resource-intensive algorithm.
Since the resultant clusters require manual review, the `distance_threshold` is primarily determined by the desired number of clusters it yields. A lower threshold leads to a greater number of clusters, as it groups tags only when their cosine distance falls below this threshold. To streamline the review process, the cluster count is capped at fewer than 200. For the current use case, setting the `distance_threshold` to 0.7 results in 166 clusters.
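A sketch of this setup, continuing from the embeddings above. The `average` linkage is an assumption, as the source only fixes the precomputed cosine distance and the 0.7 threshold:

```python
# Sketch: agglomerative clustering over a precomputed cosine-distance matrix.
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances
from sklearn.preprocessing import normalize

embeddings = normalize(embeddings)   # L2-normalize before cosine distance
dist = cosine_distances(embeddings)  # pairwise distances, bounded in [0, 2]

clusterer = AgglomerativeClustering(
    n_clusters=None,          # let the distance threshold decide the count
    metric="precomputed",     # use `affinity="precomputed"` on scikit-learn < 1.2
    linkage="average",        # assumption; "ward" cannot use precomputed distances
    distance_threshold=0.7,   # tags merge only while distances stay below 0.7
)
labels = clusterer.fit_predict(dist)
```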
Below are some of the top clusters identified:
group | num_tags |
---|---|
100 | 276 |
14 | 276 |
19 | 184 |
69 | 138 |
27 | 104 |
13 | 99 |
122 | 97 |
53 | 84 |
117 | 81 |
91 | 79 |
To gain insight into the top clusters, a sampling of 20 tags from each cluster can be examined:
Cluster | Concept | Sampled Tags |
---|---|---|
100 | Clothing | shirt shirtless, cardigan sweater, suit, legging, undercloth, shirt t shirt, business suit jumpsuit, top shirt, sweatshirt wear, gray legging, trench coat coat, t shirt wear, blue dress dress shirt, black sports coat jacket, wear woman, sports coat, bride, pillow shirt, costume, muscle shirtless |
14 | Food | grain, oats peanut, daikon, butter, pancake toy, french toast, sandwich, pepperoni, almond, chocolate chip cookie, spaghetti, potato tofu, pancake pastry waffle, dessert, cheese cream cheese, bread sandwich, bean potato, chip peanut butter, cheese, carrot |
19 | Room/House | mall, collage store, shelf, store store, boutique, file cabinet, stadium, closet room, bookcase bookshelf, family room living room, convenience store storefront, barn, clothing store store, bookshel, hotel room, bedroom living room room, kitchen room, liquor store store, bookstore building, cabinetry kitchen cabinet |
69 | Beverage | bottle drink, beverage coffee cup, coffee coffee cup cup mug, wine bottle, beverage champagne, coffee coffee cup cup, beverage paper cup, beer beverage, cup drink, bartender person man, wine, glass wine glass, champagne wine, orange juice soda, coffee cup cup, beverage, cup face, juice beer beverage, liquid tea, drink juice soda |
27 | Animals | duck, cub, owl, animal lamb, hen, bull cow, insect, animal dog, bird cage lamp, wolf, bird cage candle, eagle, rabbit, jungle, turkey, goat, husky, snake, butterfly, pigtail |
The above results clearly demonstrate the efficacy of the clustering process in grouping tags that share similar concepts.
Nonetheless, employing hierarchical clustering presents a drawback: it does not yield cluster centers. Consequently, it becomes necessary to define the cluster centers and assign them meaningful names to aid subsequent filtering processes. The ultimate goal is to perform filtering at the cluster level rather than on the individual original tags.
Assigning Cluster Names
Method 1: Utilizing Top-N Tags Closest to Cluster Center
In this methodology, the center embedding of each cluster is represented by its L2 normalized mean embedding. For each cluster, the top-n tags that are closest to this center embedding are chosen to represent the cluster. The aim of this approach is to reflect the cluster’s diversity by using multiple representative samples.
However, there’s a noticeable drawback. Even though this method doesn’t solely rely on the tag nearest to the center embedding, it still selects the top tags based on their proximity to a single center. Consequently, these selected tags might be too similar to each other, potentially failing to capture the full diversity of the cluster.
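A sketch of this method, assuming `tags`, L2-normalized `embeddings`, and cluster `labels` from the steps above; `top_n` is an illustrative parameter:

```python
import numpy as np

def top_n_tags_near_center(tags, embeddings, labels, cluster_id, top_n=3):
    # Members of the requested cluster
    idx = np.where(labels == cluster_id)[0]
    members = embeddings[idx]
    # The cluster center is the L2-normalized mean embedding
    center = members.mean(axis=0)
    center /= np.linalg.norm(center)
    # With unit-length embeddings, cosine similarity reduces to a dot product
    sims = members @ center
    top = idx[np.argsort(-sims)[:top_n]]
    return " | ".join(tags[i] for i in top)
```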
Method 2: Utilizing N Sub-cluster Centers for Each Cluster
To guarantee a more representative and diverse set of tags for each cluster, KMeans clustering is applied to the text embeddings of its samples, with the number of sub-clusters set to n.
Within these n sub-clusters, the tag nearest to the center of each sub-cluster is pinpointed. Consequently, n tags are identified for the n sub-clusters, and these tags are then used to represent the overarching main cluster.
This technique is anticipated to produce a broader and more varied collection of tags, effectively encapsulating the unique characteristics of each cluster.
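A sketch of this method under the same assumptions; `n_sub=3` mirrors the three-part group names shown below but is itself an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_representative_tags(tags, embeddings, labels, cluster_id, n_sub=3):
    # Members of the requested cluster
    idx = np.where(labels == cluster_id)[0]
    members = embeddings[idx]
    # Partition the cluster into n_sub sub-clusters
    km = KMeans(n_clusters=n_sub, n_init=10, random_state=0).fit(members)
    reps = []
    for center in km.cluster_centers_:
        # The tag nearest to each sub-cluster center represents it
        nearest = idx[np.argmin(np.linalg.norm(members - center, axis=1))]
        reps.append(tags[nearest])
    return " | ".join(reps)
```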
Comparing Both Methods
To compare the final cluster names generated by the two methods, 10 groups are randomly sampled:
Group | #Tags | Group Name | Group Name KMeans |
---|---|---|---|
26 | 49 | strap tie \| scarf tie \| bow tie | shawl \| hanger \| strap tie |
61 | 59 | swimwear bikini \| swimwear bikini top \| swim bikini | bath \| swimwear bikini \| pool |
39 | 69 | child girl woman \| girl person woman \| child woman | child girl \| person woman \| mother |
158 | 57 | doorway \| arch doorway \| doorway entrance | hallway \| stair \| door doorway |
38 | 75 | bonnet hat \| cap hat \| dress hat hat | neck \| dress hat hat \| helmet |
81 | 38 | ice \| snow \| snowboard | ski \| snow \| ice cube |
45 | 72 | hairbrush \| hair \| hair hair | haircut \| braid ponytail \| brush |
69 | 138 | beverage \| beverage drink \| beverage cup | alcohol beverage bottle \| juice juice \| coffee cup |
160 | 30 | sandal shoe \| sandal \| foot sandal | sleepwear \| sandal \| sandal shoe |
31 | 42 | river water \| creek water \| water waterway | canal \| hose \| lake water |
The comparison clearly shows that `group_name_kmeans` offers greater diversity compared to `group_name`. Consequently, `group_name_kmeans` is selected as the definitive naming convention for the clusters. Following this, a manual review of the 166 cluster names is conducted to identify those most relevant for further analysis, with a focus primarily on fashion-related groups.
The finalized selection of groups includes:
Group Name | Number of Tags |
---|---|
coat jacket \| dress garment \| dress shirt | 276 |
neck \| dress hat hat \| helmet | 75 |
child girl \| person woman \| mother | 69 |
boot cowboy boot \| leg \| foot shoe | 66 |
bath \| swimwear bikini \| pool | 59 |
bag bag \| tarp \| clutch | 55 |
shawl \| hanger \| strap tie | 49 |
earring \| necklace \| earring | 46 |
hip waist \| stomach waist \| belt | 42 |
sleepwear \| sandal \| sandal shoe | 30 |
earphone goggles \| glasses goggles \| lens | 23 |
trainer woman \| wrestler \| ballet dancer | 15 |
kiwi \| khaki wear \| dress kilt | 14 |
cardigan crop top \| crop top shirt \| crop top | 13 |
bat \| man rugby player \| baseball player | 12 |
person spectator \| antler \| crowd person | 8 |
student \| class classroom \| campus | 8 |
worker \| home office \| office | 8 |
boy man \| hisk \| man | 7 |
love \| beautiful girl \| beautiful | 6 |
tourist woman \| tourist \| spa | 6 |
poncho \| poinsettia \| poncho robe | 5 |
suede \| yurt \| crape | 5 |
groom man \| bishop \| jockey | 4 |
miniskirt \| big ben \| miniature | 3 |
Challenges of Model Choice
Just like with zero-shot text classification, the effectiveness of text embedding and clustering hinges on the chosen pre-trained language model. In this case, the `sentence-transformers/all-MiniLM-L6-v2` model is utilized.
However, considering the task involves embedding tags, typically simple words rather than complex sentences, the impact of the model choice may not be as critical. Additionally, the unsupervised nature of clustering algorithms often makes them more resilient to minor variations in embeddings.
Furthermore, the clustering method reduces the need for meticulous taxonomy design, significantly simplifying the task. Because the tags are grouped, the results can be reviewed group by group, whereas with text classification the tags still need to be reviewed one by one to ensure correctness when necessary.
Implementation Details
Input
Name | Description |
---|---|
`ins_posts_object_tags_cleaned.csv` | CSV file containing objects with both raw and cleaned tags |
Process
Code | Description |
---|---|
`codes/object_grouping_and_filtering/text_embedding_clustering.ipynb` | Tag embedding and clustering |
Output
Name | Description |
---|---|
`tag_group.csv` | CSV file containing tags and their respective groups |
Data Sample:
```json
{
  "tag": "clown fish",
  "idx": 4464,
  "group": 50,
  "group_name": "meat seafood | shellfish | seafood",
  "group_size": 66,
  "selected": 0
}
```
`selected` indicates whether the tag is selected for further analysis.