Introduction and general motivation
Personal style and fit play a crucial role in purchase decisions, yet many users run into the limitations of traditional keyword-based search. Because these searches rely heavily on exact keyword matches, discovery is limited and potential matches go undiscovered. Our idea was to address the challenges of online clothing selection by bridging the gap between human language and search terms, producing meaningful search results, and capturing the essence of style through more nuanced interpretation of queries. As e-commerce continues to grow, integrating advanced tools to improve product recommendations and the customer experience is becoming increasingly important.
Fine-tuning AI models is key to adapting them for specific tasks, improving accuracy and performance. By training models on task-specific data, fine-tuning helps them recognize both general patterns and crucial details, allowing customization for various industries, from healthcare to personalized shopping recommendations.
CLIP vs FashionCLIP
CLIP, developed by OpenAI, is a versatile model designed to understand and relate textual descriptions and images. It was trained on around 400 million image-caption pairs of diverse, internet-sourced imagery and natural language. CLIP understands a broad spectrum of visual concepts, enabling it to perform various tasks without needing task-specific training data. FashionCLIP is a specialization of CLIP, fine-tuned specifically for the fashion domain on around 40k images. It builds on the foundational capabilities of CLIP but is enhanced to handle the unique requirements of fashion-related tasks.
To better understand the main difference between these two models, it helps to look at their typical use cases. CLIP is suitable for general image and text processing tasks, such as image classification, captioning, and broad visual search. FashionCLIP, on the other hand, is ideal for fashion e-commerce platforms, enhancing product search capabilities, improving recommendation systems, and providing detailed attribute-based filtering.
We chose to fine-tune FashionCLIP since it was the model used in the first project iteration, and we were already familiar with its capabilities and performance.
The first iteration aimed to develop a robust product search functionality capable of handling both text and image queries:
- Text Queries: Users can input descriptive text queries (e.g., “t-shirt with white stripes”) to find relevant products.
- Image Queries: Users can upload an image of an item, and the system will identify and return the most visually related products available on the website.
During the testing phase, we concluded that the chosen model did not perform adequately in certain fashion-specific cases, such as recognizing collars, soles, and shoelaces. To address these shortcomings, we decided to fine-tune FashionCLIP, focusing particularly on footwear attributes such as shoelaces. This fine-tuning ensures the model performs well for these fashion attributes while preserving its original capabilities.
Preparing the dataset
With the model and our goal established, the next logical step was to choose the dataset. The main requirement was that the image descriptions contain information about shoelace color, which would enable the model to accurately identify and differentiate shoelace colors within the broader context of footwear images.
After some investigation, we decided to use the Fashion Product Images Dataset from Kaggle, containing 44,441 diverse clothing images. From it, only footwear was selected, totaling 7,344 items. The dataset was then split into 80% for training (5,877 images) and 20% for testing (1,467 images), as commonly recommended. During the testing phase, a validation dataset was created by taking a 10% sample from the training dataset, split evenly by category.
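The split can be reproduced in a few lines of pandas and scikit-learn; the sketch below is a minimal example assuming the Kaggle metadata file styles.csv with masterCategory and articleType columns (the exact file and column names are assumptions and may differ from our setup).

```python
# Minimal sketch of the footwear selection and 80/20/10 split (assumed file/column names)
import pandas as pd
from sklearn.model_selection import train_test_split

styles = pd.read_csv("styles.csv", on_bad_lines="skip")
footwear = styles[styles["masterCategory"] == "Footwear"]

# 80/20 train/test split, stratified by article type so categories stay balanced
# (very rare categories may need to be grouped before stratifying)
train_df, test_df = train_test_split(
    footwear, test_size=0.2, stratify=footwear["articleType"], random_state=42
)

# 10% of the training set, again stratified by category, as a validation set
train_df, val_df = train_test_split(
    train_df, test_size=0.1, stratify=train_df["articleType"], random_state=42
)

print(len(train_df), len(val_df), len(test_df))
```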
The dataset already included descriptions for each item, featuring details like brand, product name, gender, and main color. However, we needed more detailed annotations, so we used the BakLLaVA tool to enrich the shoe descriptions. BakLLaVA is an open-source multimodal language model that combines the Mistral 7B language model with the LLaVA (Large Language and Vision Assistant) architecture to support multimodal inputs. It gave the best results while also being free of charge.
The prompt given was “Describe the item in the image.” The resulting descriptions were inconsistent, sometimes highly detailed and sometimes lacking. After enrichment, the shoelace color was still missing for some training items, so we manually checked and added the correct color to each footwear item.
The final fields used to train the model were the image and the enriched description. This description now included the initial details (brand, product name, gender, and main color), an additional sentence generated by BakLLaVA, and a sentence describing the shoelace color.
Fine-tuning techniques
We fine-tuned FashionCLIP using PyTorch Lightning on Lightning AI and Google Colab, both of which offer cloud-based environments with Jupyter Notebooks that simplify writing and running Python code directly in the browser. This setup removed the need for complex local configuration and gave us access to powerful GPUs/CPUs with a free tier for a limited time. Lightning AI provided additional free credits and a built-in filesystem for saving models, while Google Colab let us use Google Drive for smooth file storage and access. In the end, we used Lightning AI more extensively because of the extra free credits it offered.
We also used PyTorch DataLoader for data manipulation and processing during training, which efficiently handled batching, shuffling, and parallel loading of data, enhancing the speed and efficiency of our training process.
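A minimal sketch of how such a DataLoader could be wired up, assuming the Hugging Face model id patrickjohncyh/fashion-clip for the processor, a train_df like the one from the split sketch above, and illustrative column names image_path and enriched_description:

```python
# Minimal sketch: Dataset of (image, enriched description) pairs batched by a DataLoader
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")  # assumed model id

class FootwearDataset(Dataset):
    def __init__(self, dataframe):
        self.df = dataframe.reset_index(drop=True)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(row["image_path"]).convert("RGB")      # assumed column name
        return image, row["enriched_description"]                 # assumed column name

def collate_fn(batch):
    images, texts = zip(*batch)
    # The processor resizes/normalizes the images and tokenizes the texts in one call
    return processor(
        text=list(texts), images=list(images),
        return_tensors="pt", padding=True, truncation=True
    )

train_loader = DataLoader(
    FootwearDataset(train_df), batch_size=32, shuffle=True,
    num_workers=4, collate_fn=collate_fn
)
```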
During fine-tuning, we repeat the process of going through the entire training and validation dataset several times. Each complete pass through the dataset is known as an epoch. Essentially, epochs determine how many times the model learns from the entire dataset. Increasing the number of epochs can improve the model’s performance up to a certain point, as it allows the model to learn more from the data. However, with too many epochs the model may memorize the training data too well and perform poorly on new data, which is called overfitting.
As each epoch consists of a training and a validation iteration, the validation part tells us in which direction the training is going and signals the scheduler to react and prevent overfitting if it occurs. By comparing performance on the training and validation datasets, overfitting can be detected and training stopped early if there is a significant performance gap.
So, finding the right number of epochs in combination with the best-fitting optimizer and scheduler is crucial for achieving the best-performing model.
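A minimal sketch of how epochs and overfitting control can be wired together with PyTorch Lightning callbacks; FashionCLIPFineTuner is a hypothetical LightningModule that logs a val_loss metric, and the data loaders are assumed from the previous step:

```python
# Minimal sketch: capping epochs while stopping early if validation loss stops improving
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

early_stop = EarlyStopping(monitor="val_loss", patience=3, mode="min")
checkpoint = ModelCheckpoint(monitor="val_loss", save_top_k=3, mode="min")

trainer = pl.Trainer(
    max_epochs=20,                       # upper bound; training may stop earlier
    callbacks=[early_stop, checkpoint],
    accelerator="auto",
)
trainer.fit(FashionCLIPFineTuner(), train_loader, val_loader)  # hypothetical module
```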
Choosing optimizer and scheduler
Selecting the appropriate optimizer and scheduler is critical for achieving optimal performance in model training. The choice of optimizer and scheduler impacts how quickly and effectively a model learns and generalizes from the training data. Factors such as the nature of the task, the size and complexity of the dataset, and the model architecture all influence this decision.
An optimizer is an algorithm that adjusts the attributes of the neural network, such as its weights and learning rate, in order to reduce the training loss. Our choice was the AdamW optimizer. Unlike traditional methods, AdamW decouples the weight decay step, which helps maintain adaptive learning rates without interference. Key parameters for AdamW are the learning rate, betas, eps, and weight decay. By tuning these parameters, AdamW helped our model learn and generalize better, especially for tasks with unique features or smaller datasets.
A scheduler helped us improve performance by adjusting the learning rate during training. The ReduceLROnPlateau scheduler gave the best results for our case of fine-tuning on a smaller dataset. It reduces the learning rate when the model’s performance metric stops improving, preventing the model from missing the optimal point due to a high learning rate. Key parameters include mode, factor, patience, threshold, and cooldown. This approach keeps the training progress steady and ensures the model converges effectively.
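A minimal sketch of how this optimizer/scheduler pairing can be returned from a LightningModule’s configure_optimizers method; the hyperparameter values are illustrative, not the exact ones we used:

```python
# Minimal sketch: AdamW + ReduceLROnPlateau inside a LightningModule (illustrative values)
import torch

def configure_optimizers(self):
    optimizer = torch.optim.AdamW(
        self.parameters(),
        lr=1e-5,                 # learning rate
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0.01,       # decoupled weight decay
    )
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode="min",      # watch a metric that should decrease (e.g. validation loss)
        factor=0.5,      # halve the learning rate on a plateau
        patience=2,      # epochs with no improvement before reducing
        threshold=1e-4,
        cooldown=1,
    )
    return {
        "optimizer": optimizer,
        "lr_scheduler": {"scheduler": scheduler, "monitor": "val_loss"},
    }
```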
Despite experimenting with various optimizers and schedulers, including SGD, Adagrad, ExponentialLR, CosineAnnealingLR, and CyclicLR, we encountered challenges such as overfitting and poor performance on the testing set. In the end, AdamW and ReduceLROnPlateau proved to be our winning combination to overcome challenges in training, particularly in mitigating overfitting and enhancing testing performance.
Pre-test phase
As mentioned before, determining whether model training is heading in the right direction is essential. To achieve this, we not only used a validation dataset to prevent overfitting but also employed various metrics and tests to assess the training quality. This approach helped us decide whether the fine-tuning parameters needed further optimization. Although there are many ways to measure classification performance, we focused on several key indicators as recommended. These classification metrics evaluate the discrete values produced by the model after classifying all the given data.
Metrics
The first metric to address is accuracy, the percentage of correctly classified items. However, with an imbalanced dataset like ours, relying solely on accuracy does not effectively evaluate the model’s performance. To better assess performance, we use the F1 score, a measure that combines precision and recall. The F1 score is particularly useful for imbalanced datasets because it forces the model to balance precision and recall. In multi-class classification, where there are more than two classes, precision and recall do not apply directly, so we calculate the F1 score for each class individually and then average the scores to obtain a single number describing the overall performance of the model. That brings us to the weighted average, which weights each class’s F1 score by its support, where support is the number of actual occurrences of the class in the dataset. Finally, the macro average is computed by taking the arithmetic mean of all per-class F1 scores, treating all classes equally regardless of their support. For an imbalanced dataset in which all classes are equally important, the macro average is therefore the preferable choice.
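A small illustration of the difference between the two averages using scikit-learn, with made-up labels:

```python
# Macro vs. weighted F1 on a tiny, imbalanced toy example (labels are made up)
from sklearn.metrics import f1_score

y_true = ["white", "white", "white", "black", "brown"]
y_pred = ["white", "white", "black", "black", "white"]

# Macro: unweighted mean of per-class F1 scores (every class counts equally)
print(f1_score(y_true, y_pred, average="macro"))
# Weighted: per-class F1 scores weighted by each class's support
print(f1_score(y_true, y_pred, average="weighted"))
```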
To calculate the metrics, we needed the shoelace categories and the shoe images from the test dataset. The shoelace categories are an array of unique shoelace colors extracted from our dataset. Here’s the process we followed (a code sketch appears after the list):
- Embeddings are calculated for the categories and the images using the fine-tuned (or original) FashionCLIP model.
- Cosine similarity is used to measure the resemblance between each image and each category.
- The most similar category is identified for each image, which serves as the model’s prediction.
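A minimal sketch of this evaluation loop; embed_texts, embed_images, test_images, and actual_categories are hypothetical helpers and variables standing in for our actual embedding and data-loading code:

```python
# Minimal sketch: predict a shoelace category per image and report metrics
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Unique shoelace colors extracted from the dataset (illustrative values)
categories = ["white shoelaces", "black shoelaces", "beige shoelaces"]

# embed_texts / embed_images are hypothetical helpers that run the model under test
text_emb = embed_texts(categories)        # shape: (num_categories, dim)
image_emb = embed_images(test_images)     # shape: (num_images, dim)

# Cosine similarity between every image and every category
similarity = cosine_similarity(image_emb, text_emb)   # (num_images, num_categories)

# The most similar category is the model's prediction for each image
predicted = [categories[i] for i in similarity.argmax(axis=1)]

print(classification_report(actual_categories, predicted))
```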
The process resulted in an array of predicted categories for every image in the test dataset. The ‘classification_report’ function from the sklearn library then takes the array of actual categories and the array of predicted categories to evaluate the model. The image below shows the differences in these metrics between the original FashionCLIP model and our fine-tuned model:
Besides the aforementioned standard metrics, we created tests for zero-shot classification and Top-K. Zero-shot classification is a machine learning method in which a model categorizes objects or concepts without having seen labeled examples of those categories beforehand. In our case, for each photo in the testing dataset, we compared the actual shoelace color from the metadata with the shoelace color predicted by the model. The model either correctly guessed the shoelace color or it did not, and the zero-shot accuracy is the percentage of correct guesses. The FashionCLIP model achieved an accuracy of 59.66%, and our goal was to surpass this with the fine-tuned model. The fine-tuned model achieved an accuracy of 69.76%.
Top-K evaluation, unlike the zero-shot method, which only considers whether the single best guess is correct, looks at the ranking of all predicted categories and checks whether the correct one appears among the K highest-ranked predictions. In our case, we used Top-K to find the top 5 shoelace colors that the model predicted as most similar for each image.
In addition to the recommended metrics, we defined our own scoring technique to better assess the model. This scoring system relied on the top 5 category guesses: the highest number of points was awarded if the correct color was in first place, and none if it fell outside the top five. Scores ranged from 5 down to 1 for 1st through 5th place, respectively, with guesses below 5th place receiving a score of 0. The final score is the sum of the scores for each image in our dataset. The maximum possible total was 7,325 points. In this scenario, FashionCLIP scored 5,622 and our fine-tuned model scored 6,290.
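A minimal sketch of this top-5 scoring, reusing the similarity matrix, category list, and actual_categories from the evaluation sketch above:

```python
# Minimal sketch: 5 points for the correct color in 1st place, down to 1 point for 5th
import numpy as np

total_score = 0
for img_idx, true_color in enumerate(actual_categories):
    # Indices of the 5 highest-scoring categories, best match first
    top5 = np.argsort(similarity[img_idx])[::-1][:5]
    ranked = [categories[i] for i in top5]
    if true_color in ranked:
        total_score += 5 - ranked.index(true_color)   # 1st place -> 5 ... 5th place -> 1
print(total_score)
```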
A representation of the aforementioned metrics and scores is visible in the image below.
Testing strategy
We structured our test cases to prioritize enhancing the search for shoes with specific shoelace colors while keeping the existing search system’s functionality intact. The test cases primarily focus on the main search criteria: shoe category and shoelace color. For instance, an example test case is the prompt “Casual shoes with Beige shoelace.” or an image of casual shoes with beige shoelaces. These test cases are used in image retrieval tests. To perform a test, we follow these steps (sketched in code after the list):
- Calculate embeddings for all images from the test dataset using our fine-tuned model.
- Calculate the embedding for the text or image test case, also using our fine-tuned model.
- Use cosine similarity to measure the resemblance between the test case embedding and every image embedding.
- Return the top 5 most similar images as the result of the image retrieval test.
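A minimal sketch of a single retrieval test, reusing the hypothetical embedding helpers and precomputed image embeddings from the earlier sketches:

```python
# Minimal sketch: retrieve the 5 test images most similar to a text test case
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query_emb = embed_texts(["Casual shoes with Beige shoelace"])   # (1, dim), hypothetical helper
scores = cosine_similarity(query_emb, image_emb)[0]             # one score per test image

top5_idx = np.argsort(scores)[::-1][:5]            # 5 most similar images, best first
top5_images = [test_images[i] for i in top5_idx]
```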

Our model was tested on several different levels. The levels were created based on our knowledge of the dataset and the expected combinations of shoe categories and shoelace colors. This approach was necessary because our testing dataset is imbalanced and not all test cases have more than five corresponding results in it.
- The first level consists of test cases which correspond to 5 or more combinations in the testing dataset.
- The second level consists of test cases which correspond to between 1 and 5 combinations in the testing dataset.
- The third level consists of test cases which do not correspond to any combinations in the testing dataset.
Scoring
Multiple scoring methods were used to evaluate the model’s predictions as accurately as possible. For each test case mentioned above, we took 5 image results and independently scored whether the model correctly guessed the shoe category and the shoelace color.
In the first method, the model receives 2 points if the shoe category is guessed correctly and 1 point if the shoelace color is guessed correctly; otherwise, no points are awarded. Priority is given to the shoe category.
| | Correct shoe category | Wrong shoe category |
| --- | --- | --- |
| Correct shoelace color | 3 | 1 |
| Wrong shoelace color | 2 | 0 |
After summing these points for every image and every test case, we get the result. The maximum possible result is 390 points; FashionCLIP got 247 points and the fine-tuned model got 274 points.
| | Score (max. 390) |
| --- | --- |
| FashionCLIP | 247 |
| Fine-tuned model | 274 |
The second method awarded 1 point if the shoelace color was guessed correctly; otherwise, no points were given. With a maximum possible score of 130 points, the FashionCLIP model scored 35 points, while the fine-tuned model scored 50 points.
| | Score (max. 130) |
| --- | --- |
| FashionCLIP | 35 |
| Fine-tuned model | 50 |
Unlike the first two methods, which did not consider the position of the image results, the third method took position into account, emphasizing the importance of a ‘good’ result appearing higher in the search results. The scores were defined so that if the model did not correctly guess both the shoe category and the shoelace color, no points were awarded, while higher-ranking correct results received more points: the 1st position received 5 points, the 2nd position 4 points, and so on.
Example 1:
Test case: Casual shoes with black shoelace
5 results are returned:
| | Shoe category | Shoelace color | Points |
| --- | --- | --- | --- |
| 1. | Casual shoes | Beige | 0 |
| 2. | Casual shoes | Black | 4 |
| 3. | Casual shoes | Black | 3 |
| 4. | Casual shoes | Black | 2 |
| 5. | Casual shoes | Black | 1 |
Example 2:
Test case: Casual shoes with grey shoelace
5 results are returned:
| | Shoe category | Shoelace color | Points |
| --- | --- | --- | --- |
| 1. | Casual shoes | Grey | 5 |
| 2. | Casual shoes | White | 0 |
| 3. | Casual shoes | Beige | 0 |
| 4. | Casual shoes | Grey | 2 |
| 5. | Casual shoes | Black | 0 |
After summing these points for every image and every test case we get the result.
| | Score (max. 390) |
| --- | --- |
| FashionCLIP | 84 |
| Fine-tuned model | 127 |
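A minimal sketch of this position-aware scoring for a single test case, using Example 2 above as input:

```python
def positional_score(results, expected_category, expected_color):
    """Score one test case: points only when both shoe category and shoelace color
    match, with 5 points for 1st place down to 1 point for 5th."""
    score = 0
    for position, (category, color) in enumerate(results, start=1):
        if category == expected_category and color == expected_color:
            score += 6 - position          # 1st -> 5, 2nd -> 4, ..., 5th -> 1
    return score

# Example 2 from above: only positions 1 and 4 match "Casual shoes" + "Grey"
results = [("Casual shoes", "Grey"), ("Casual shoes", "White"),
           ("Casual shoes", "Beige"), ("Casual shoes", "Grey"),
           ("Casual shoes", "Black")]
print(positional_score(results, "Casual shoes", "Grey"))   # 5 + 2 = 7
```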
The final method also included scoring based on the position of results but focused solely on the correct shoelace color.
| | Score (max. 390) |
| --- | --- |
| FashionCLIP | 111 |
| Fine-tuned model | 163 |
By combining these multiple scoring methods, we ensured a comprehensive evaluation of the model’s performance, considering both the accuracy and relevance of the results and their ranking in the search results.
Benchmarks
Different datasets were used to benchmark our fine-tuned model: Fashion-MNIST, DeepFashion, and a custom fashion dataset. Fashion-MNIST is a dataset of Zalando’s article images for training and testing models; we used its test split of 10,000 examples, where each example is a 28×28 grayscale image associated with a label from 10 classes. DeepFashion is a large-scale clothes database that holds 800,000 diverse fashion images. Several subsets are available, and we used the one called In-shop, which contains images with significant pose and scale variations, extensive diversity, large quantities, and rich annotations, with each image associated with a label from 17 classes. The custom dataset was created from data provided to us by a fashion retail company: pictures of clothing items on a white surface, associated with many categories that we filtered down to 65. This dataset has 18,435 images.
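As an illustration of the benchmark idea, the sketch below runs zero-shot classification of the default Fashion-MNIST classes through a CLIP-style model; the checkpoint path is hypothetical and the prompt template is illustrative:

```python
# Minimal sketch: zero-shot classification of one Fashion-MNIST test image
import torch
from torchvision.datasets import FashionMNIST
from transformers import CLIPModel, CLIPProcessor

test_set = FashionMNIST(root="data", train=False, download=True)
class_names = test_set.classes            # "T-shirt/top", "Trouser", ...

model = CLIPModel.from_pretrained("path/to/fine-tuned-fashionclip")       # hypothetical path
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")  # assumed model id

image, label = test_set[0]                # PIL image (28x28 grayscale) and class index
inputs = processor(
    text=[f"a photo of a {name}" for name in class_names],
    images=image.convert("RGB"),          # CLIP expects 3-channel input
    return_tensors="pt", padding=True,
)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # similarity of the image to each class prompt
predicted = class_names[logits.argmax(dim=-1).item()]
print(predicted, class_names[label])
```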
The benchmark results are shown in the image below. Our testing involved assessing how well the model predicts the default dataset categories, demonstrating that our results are not significantly inferior to those of the original FashionCLIP model. While retaining FashionCLIP’s general performance, we tailored it to better suit our specific use case of predicting shoelace colors.

Choosing the best model
Following the fine-tuning phase, we generated numerous models from different epochs, each requiring the pretesting and testing phases to ensure a fair comparison. After a comprehensive analysis encompassing metrics, test results, and benchmarks for each model, we determined the optimal choice, presented in the table below.

Conclusion
Fine-tuning models for specific domains like fashion can significantly improve performance and user experience. By carefully selecting and preparing datasets, choosing proper optimizers and schedulers, and rigorously testing the model, we can achieve superior results. Our fine-tuned FashionCLIP model demonstrates the potential of specialized models in enhancing e-commerce platforms, providing more accurate and meaningful search results for consumers.
Who knew that something as simple as a shoelace could present such a challenge in the world of AI? So next time you are searching for the perfect pair of shoes, remember – it is not just about finding the right style, but also making sure those laces are spot on!

Jana Terzić, Milica Milivojević, Bojana Gudurić Maksimović
Levi9