Text Embedding and Clustering
As highlighted in the previous section, the method of grouping similar free-form tags into broader categories through text classification is heavily dependent on the construction of `candidate_labels`. This design process, unfortunately, tends to be iterative and inefficient.
Consequently, an alternative approach is considered, involving the use of text embedding combined with clustering to group similar tags.
Obtaining text embeddings for the tags is straightforward. Clustering is then performed on these embeddings to form groups.
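For illustration, here is a minimal sketch of the embedding step with the sentence-transformers library (the model used in this section is named later, under Challenges); the tag list is purely illustrative:

```python
# Minimal sketch: embed free-form tags with a pre-trained sentence encoder.
# The tag list is illustrative; the real input comes from the cleaned tags.
from sentence_transformers import SentenceTransformer

tags = ["shirt", "sweater", "coffee cup", "wine bottle", "duck"]
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(tags)  # shape: (num_tags, 384)
```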
Agglomerative Clustering with Precomputed Cosine Distance
Reference: `sklearn.cluster.AgglomerativeClustering`
Utilizing cosine distance as the metric for evaluating the proximity of the generated embeddings is a logical choice: its bounded range of 0 to 2 makes threshold setting intuitive. However, the embeddings must be L2-normalized before the cosine distances are computed.
The choice of Agglomerative Clustering for the clustering task is fitting, given that it allows for the use of a `distance_threshold` instead of specifying the number of clusters. This makes the parameter setting more intuitive. Additionally, the relatively small number of tags justifies the use of this potentially resource-intensive algorithm.
Since the resultant clusters require manual review, the `distance_threshold` is primarily determined by the desired number of clusters it yields. A lower threshold leads to a greater number of clusters, as it groups tags only when their cosine distance falls below this threshold. To streamline the review process, the cluster count is capped at fewer than 200. For the current use case, setting the `distance_threshold` to 0.7 results in 166 clusters.
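A sketch of this setup, continuing from the embeddings above. The `average` linkage is an assumption, as the source only fixes the precomputed cosine distance and the 0.7 threshold:

```python
# Sketch: agglomerative clustering over a precomputed cosine-distance matrix.
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances
from sklearn.preprocessing import normalize

embeddings = normalize(embeddings)   # L2-normalize before cosine distance
dist = cosine_distances(embeddings)  # pairwise distances, bounded in [0, 2]

clusterer = AgglomerativeClustering(
    n_clusters=None,          # let the distance threshold decide the count
    metric="precomputed",     # use `affinity="precomputed"` on scikit-learn < 1.2
    linkage="average",        # assumption; "ward" cannot use precomputed distances
    distance_threshold=0.7,   # tags merge only while distances stay below 0.7
)
labels = clusterer.fit_predict(dist)
```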
Below are some of the top clusters identified:
group | num_tags |
---|---|
100 | 276 |
14 | 276 |
19 | 184 |
69 | 138 |
27 | 104 |
13 | 99 |
122 | 97 |
53 | 84 |
117 | 81 |
91 | 79 |
To gain insight into the top clusters, a sampling of 20 tags from each cluster can be examined:
Cluster | Concept | Sampled Tags |
---|---|---|
100 | Clothing | shirt shirtless, cardigan sweater, suit, legging, undercloth, shirt t shirt, business suit jumpsuit, top shirt, sweatshirt wear, gray legging, trench coat coat, t shirt wear, blue dress dress shirt, black sports coat jacket, wear woman, sports coat, bride, pillow shirt, costume, muscle shirtless |
14 | Food | grain, oats peanut, daikon, butter, pancake toy, french toast, sandwich, pepperoni, almond, chocolate chip cookie, spaghetti, potato tofu, pancake pastry waffle, dessert, cheese cream cheese, bread sandwich, bean potato, chip peanut butter, cheese, carrot |
19 | Room/House | mall, collage store, shelf, store store, boutique, file cabinet, stadium, closet room, bookcase bookshelf, family room living room, convenience store storefront, barn, clothing store store, bookshel, hotel room, bedroom living room room, kitchen room, liquor store store, bookstore building, cabinetry kitchen cabinet |
69 | Beverage | bottle drink, beverage coffee cup, coffee coffee cup cup mug, wine bottle, beverage champagne, coffee coffee cup cup, beverage paper cup, beer beverage, cup drink, bartender person man, wine, glass wine glass, champagne wine, orange juice soda, coffee cup cup, beverage, cup face, juice beer beverage, liquid tea, drink juice soda |
27 | Animals | duck, cub, owl, animal lamb, hen, bull cow, insect, animal dog, bird cage lamp, wolf, bird cage candle, eagle, rabbit, jungle, turkey, goat, husky, snake, butterfly, pigtail |
The above results clearly demonstrate the efficacy of the clustering process in grouping tags that share similar concepts.
Nonetheless, employing hierarchical clustering presents a drawback: it does not yield cluster centers. Consequently, it becomes necessary to define the cluster centers and assign them meaningful names to aid subsequent filtering processes. The ultimate goal is to perform filtering at the cluster level rather than on the individual original tags.
Assigning Cluster Names
Method 1: Utilizing Top-N Tags Closest to Cluster Center
In this methodology, the center embedding of each cluster is represented by its L2 normalized mean embedding. For each cluster, the top-n tags that are closest to this center embedding are chosen to represent the cluster. The aim of this approach is to reflect the cluster’s diversity by using multiple representative samples.
However, there’s a noticeable drawback. Even though this method doesn’t solely rely on the tag nearest to the center embedding, it still selects the top tags based on their proximity to a single center. Consequently, these selected tags might be too similar to each other, potentially failing to capture the full diversity of the cluster.
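A sketch of this method, assuming `tags`, L2-normalized `embeddings`, and cluster `labels` from the steps above; `top_n` is an illustrative parameter:

```python
import numpy as np

def top_n_tags_near_center(tags, embeddings, labels, cluster_id, top_n=3):
    # Members of the requested cluster
    idx = np.where(labels == cluster_id)[0]
    members = embeddings[idx]
    # The cluster center is the L2-normalized mean embedding
    center = members.mean(axis=0)
    center /= np.linalg.norm(center)
    # With unit-length embeddings, cosine similarity reduces to a dot product
    sims = members @ center
    top = idx[np.argsort(-sims)[:top_n]]
    return " | ".join(tags[i] for i in top)
```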
Method 2: Utilizing N Sub-cluster Centers for Each Cluster
To guarantee a more representative and diverse set of tags for each cluster, KMeans clustering is applied to the text embeddings of its samples, with the number of sub-clusters set to n.
Within these n sub-clusters, the tag nearest to the center of each sub-cluster is pinpointed. Consequently, n tags are identified for the n sub-clusters, and these tags are then used to represent the overarching main cluster.
This technique is anticipated to produce a broader and more varied collection of tags, effectively encapsulating the unique characteristics of each cluster.
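A sketch of this method under the same assumptions; `n_sub=3` mirrors the three-part group names shown below but is itself an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_representative_tags(tags, embeddings, labels, cluster_id, n_sub=3):
    # Members of the requested cluster
    idx = np.where(labels == cluster_id)[0]
    members = embeddings[idx]
    # Partition the cluster into n_sub sub-clusters
    km = KMeans(n_clusters=n_sub, n_init=10, random_state=0).fit(members)
    reps = []
    for center in km.cluster_centers_:
        # The tag nearest to each sub-cluster center represents it
        nearest = idx[np.argmin(np.linalg.norm(members - center, axis=1))]
        reps.append(tags[nearest])
    return " | ".join(reps)
```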
Comparing Both Methods
To compare the final cluster names generated by the two methods, 10 groups are randomly sampled:
Group | #Tags | Group Name | Group Name KMeans |
---|---|---|---|
26 | 49 | strap tie \| scarf tie \| bow tie | shawl \| hanger \| strap tie |
61 | 59 | swimwear bikini \| swimwear bikini top \| swim bikini | bath \| swimwear bikini \| pool |
39 | 69 | child girl woman \| girl person woman \| child woman | child girl \| person woman \| mother |
158 | 57 | doorway \| arch doorway \| doorway entrance | hallway \| stair \| door doorway |
38 | 75 | bonnet hat \| cap hat \| dress hat hat | neck \| dress hat hat \| helmet |
81 | 38 | ice \| snow \| snowboard | ski \| snow \| ice cube |
45 | 72 | hairbrush \| hair \| hair hair | haircut \| braid ponytail \| brush |
69 | 138 | beverage \| beverage drink \| beverage cup | alcohol beverage bottle \| juice juice \| coffee cup |
160 | 30 | sandal shoe \| sandal \| foot sandal | sleepwear \| sandal \| sandal shoe |
31 | 42 | river water \| creek water \| water waterway | canal \| hose \| lake water |
The comparison clearly shows that `group_name_kmeans` offers greater diversity compared to `group_name`. Consequently, `group_name_kmeans` is selected as the definitive naming convention for the clusters. Following this, a manual review of the 166 cluster names is conducted to identify those most relevant for further analysis, with a focus primarily on fashion-related groups.
The finalized selection of groups includes:
Group Name | Number of Tags |
---|---|
coat jacket \| dress garment \| dress shirt | 276 |
neck \| dress hat hat \| helmet | 75 |
child girl \| person woman \| mother | 69 |
boot cowboy boot \| leg \| foot shoe | 66 |
bath \| swimwear bikini \| pool | 59 |
bag bag \| tarp \| clutch | 55 |
shawl \| hanger \| strap tie | 49 |
earring \| necklace \| earring | 46 |
hip waist \| stomach waist \| belt | 42 |
sleepwear \| sandal \| sandal shoe | 30 |
earphone goggles \| glasses goggles \| lens | 23 |
trainer woman \| wrestler \| ballet dancer | 15 |
kiwi \| khaki wear \| dress kilt | 14 |
cardigan crop top \| crop top shirt \| crop top | 13 |
bat \| man rugby player \| baseball player | 12 |
person spectator \| antler \| crowd person | 8 |
student \| class classroom \| campus | 8 |
worker \| home office \| office | 8 |
boy man \| hisk \| man | 7 |
love \| beautiful girl \| beautiful | 6 |
tourist woman \| tourist \| spa | 6 |
poncho \| poinsettia \| poncho robe | 5 |
suede \| yurt \| crape | 5 |
groom man \| bishop \| jockey | 4 |
miniskirt \| big ben \| miniature | 3 |
Challenges of Model Choice
Just like with zero-shot text classification, the effectiveness of text embedding and clustering hinges on the chosen pre-trained language model. In this case, the `sentence-transformers/all-MiniLM-L6-v2` model is utilized.
However, considering the task involves embedding tags, typically simple words rather than complex sentences, the impact of the model choice may not be as critical. Additionally, the unsupervised nature of clustering algorithms often makes them more resilient to minor variations in embeddings.
Furthermore, the clustering method reduces the need for meticulous taxonomy design, significantly simplifying the task. Because the tags are grouped, the results can be reviewed group by group, whereas with text classification the tags still need to be reviewed one by one to ensure correctness when necessary.
Implementation Details
Input
Name | Description |
---|---|
`ins_posts_object_tags_cleaned.csv` | CSV file containing objects with both raw and cleaned tags |
Process
Code | Description |
---|---|
`codes/object_grouping_and_filtering/text_embedding_clustering.ipynb` | Tag embedding and clustering |
Output
Name | Description |
---|---|
`tag_group.csv` | CSV file containing tags and their respective groups |
Data Sample:
```json
{
  "tag": "clown fish",
  "idx": 4464,
  "group": 50,
  "group_name": "meat seafood | shellfish | seafood",
  "group_size": 66,
  "selected": 0
}
```
`selected` indicates whether the tag is selected for further analysis.