Two-Stage Method for Clothing Feature Detection
1. Introduction
One of the primary challenges in clothing recognition stems from the immense diversity in the style, texture, and cut of clothing. This diversity is shaped by cultural backgrounds, individual preferences, and constantly evolving fashion trends, leading to wide variation in the visual appearance of clothing items. Such variation poses significant challenges for recognition systems, which often struggle to identify styles accurately.
2. Related Work
3. Proposed Method
First, gender and age are associated with facial information. Original images are annotated with facial keypoints, identifying the coordinates of eye corners, nose tips, and mouth corners; these keypoints play a crucial role in capturing facial expressions and individual characteristics. After face detection, the original images are cropped to obtain facial images. Because these cropped facial images have varied orientations, they require facial calibration: the nose-bridge keypoints determine the rotation needed to align the facial images in a uniform direction. The calibrated facial images and corrected keypoint data are then input into classifiers to learn gender and age attributes.
Additionally, this study introduces a discrete differential operator for edge detection, the Scharr operator in the x direction (Scharr-x), to preprocess images and enhance horizontal gradients in facial images. This enhancement highlights features such as the jawline, eyebrows, and lip edges, which are essential for learning age-related features (like wrinkles or structural changes in the face) and gender-related features (such as facial hair or jawline shape), aiding more precise classification. Thus, Scharr-x edge images serve as an additional input channel alongside the original images and facial keypoint information for classification learning.
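As a concrete illustration, x-direction Scharr filtering can be sketched in a few lines of NumPy (in OpenCV this step would typically be `cv2.Scharr(gray, cv2.CV_64F, 1, 0)`). The kernel weights below are the standard Scharr coefficients; the rest of the pipeline (stacking the result as an extra channel) is assumed, not taken from the paper:

```python
import numpy as np

# 3x3 Scharr kernel for the x direction (standard Scharr coefficients).
SCHARR_X = np.array([[ 3, 0,  -3],
                     [10, 0, -10],
                     [ 3, 0,  -3]], dtype=np.float64)

def scharr_x(image):
    """Apply the x-direction Scharr operator to a 2D grayscale image.

    Returns the absolute gradient response, which can then be stacked as
    an extra channel alongside the original image for classifier input.
    """
    h, w = image.shape
    padded = np.pad(image.astype(np.float64), 1, mode="edge")
    out = np.zeros((h, w))
    # Correlate the 3x3 kernel with every pixel neighborhood.
    for i in range(3):
        for j in range(3):
            out += SCHARR_X[i, j] * padded[i:i + h, j:j + w]
    return np.abs(out)

# A vertical intensity step produces a strong x-gradient response
# at the step boundary and zero response in flat regions.
step = np.zeros((5, 5))
step[:, 3:] = 1.0
edges = scharr_x(step)
```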
Clothing features are also obtained by identifying essential body parts, such as the head, elbows, knees, and ankles, to obtain corresponding coordinate information. We propose a custom segmentation logic based on keypoints, dividing the body into detailed areas like upper arms, forearms, neck, torso, collar, thighs, and calves to obtain respective body part frames. These parts are then cropped to obtain detailed images of each section, facilitating the subsequent classification learning of detailed features for each part.
In this research, keypoint detection methods utilize OpenPose and Dlib. OpenPose outputs a set of 2D coordinates for each keypoint on each individual in the image, along with confidence scores for each detection, while Dlib outputs keypoint coordinates and the bounding boxes of detected objects.
The primary classification model used in this study is VGG16, renowned for its high accuracy and proven performance across a wide range of image recognition tasks. As backbone networks, VGG models are particularly effective at recognizing and classifying facial features, capturing complex patterns in images. Moreover, VGG models pre-trained on large datasets can be easily applied to other image tasks through transfer learning, achieving good performance even on smaller datasets. Additionally, SVM classifiers are used to recognize features such as zippers and collar shapes: the datasets for zipper and collar types are relatively small, and on such data SVMs perform well while being more efficient than even small neural network classifiers.
4. Experiment
4.1. Dataset
In this study, we employed random preprocessing adjustments to images from the same dataset to enhance the model’s generalization capability. By applying random transformations such as cropping, rotation, and color adjustment, we generated a diversified training sample set that simulates various scenarios and conditions encountered in the real world. This data augmentation strategy is instrumental in mitigating model overfitting, bolstering its robustness against new and unseen images.
- Randomly mirror the image with a probability of 0.5.
- Randomly adjust the brightness to between 0.9 and 1.1 times that of the original image.
- Randomly adjust the contrast to between 0.9 and 1.1 times that of the original image.
- Randomly adjust the hue to between 0.9 and 1.1 times that of the original image.
- Randomly adjust the saturation to between 0.9 and 1.1 times that of the original image.
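Assuming images are float arrays in [0, 1], the mirror, brightness, and contrast adjustments above can be sketched as follows (the hue and saturation adjustments would additionally require an HSV conversion and are omitted from this sketch; the function name is ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def augment(image, rng):
    """Randomly augment an HxWx3 float image with values in [0, 1].

    Mirrors with probability 0.5, then scales brightness and contrast by
    factors drawn uniformly from [0.9, 1.1], matching the ranges listed
    above.
    """
    out = image.copy()
    if rng.random() < 0.5:                              # mirror
        out = out[:, ::-1, :]
    out = out * rng.uniform(0.9, 1.1)                   # brightness scale
    mean = out.mean()
    out = (out - mean) * rng.uniform(0.9, 1.1) + mean   # contrast scale
    return np.clip(out, 0.0, 1.0)

sample = rng.random((8, 8, 3))
augmented = augment(sample, rng)
```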
4.2. Obtaining Bounding Boxes
In human images, variations like tilted head angles are common. Effectively calibrating faces that are not fully oriented enhances facial feature recognition accuracy. Standard calibration methods include DeepFace and DEX. This study adopts a self-improved calibration method based on facial keypoints.
The rotation angle θ is computed as the arctangent of the slope of the line connecting keypoints 27 and 30, θ = arctan((y30 − y27)/(x30 − x27)), and is used for image rotation alignment. Notably, during this process, the original image is rotated and the facial recognition box is recalibrated, using the line between keypoints 27 and 30 as the rotation axis. This method rotates the entire original image rather than just the cropped face portion. Rotating the original image brings in edge information beyond the facial recognition box, including hair, ears, and collars; this extra context improves the algorithm's ability to identify facial features.
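Under the 68-point Dlib landmark convention (keypoint 27 at the top of the nose bridge, keypoint 30 at the nose tip), the calibration step can be sketched as below. The helper names and sign conventions are illustrative, not the paper's implementation:

```python
import math

def nose_bridge_tilt(p27, p30):
    """Tilt (radians) of the nose-bridge line away from vertical.

    p27 and p30 are (x, y) pixel coordinates of Dlib keypoints 27 and 30.
    An upright face has its nose bridge vertical, so the tilt is the
    angle of the 27-30 line minus 90 degrees; rotating the whole image
    by the negative tilt aligns the face upright.
    """
    dx = p30[0] - p27[0]
    dy = p30[1] - p27[1]
    return math.atan2(dy, dx) - math.pi / 2

def rotate_point(pt, center, theta):
    """Rotate a point about `center` by angle `theta`; used to
    recalibrate the facial bounding box after rotating the image."""
    x, y = pt[0] - center[0], pt[1] - center[1]
    c, s = math.cos(theta), math.sin(theta)
    return (center[0] + c * x - s * y, center[1] + s * x + c * y)

# Example: a nose bridge tilted 45 degrees is rotated back to vertical.
tilt = nose_bridge_tilt((0.0, 0.0), (10.0, 10.0))
corrected = rotate_point((10.0, 10.0), (0.0, 0.0), -tilt)
```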
A similar method is used for the body part boxes. Keypoints identify specific body parts, and the corresponding body part boxes are then cropped. The principal logic for cropping the various body parts is as follows:
- Torso: Keypoints 2, 5, 8, 11, and 1 are primarily used. A rectangle is formed using the line between keypoints 2 and 5 as the width and the line between keypoints 1 and 8 as the length. The length direction is determined by the line connecting the midpoint of keypoints 8 and 11 with keypoint 1.
- Collar: Keypoints 0 and 1 are primarily used. A square is formed with the neck as the intersection point of the diagonals and the distance between the neck and nose as the side length, oriented from the neck toward the nose.
- Zipper: Keypoints 1, 8, and 11 are primarily used. A rectangle is formed with the line connecting keypoint 1 and the midpoint of the line between keypoints 8 and 11 as the central axis of the longer side; the width is a quarter of the length of the line between keypoints 1 and 8.
- Upper Arm: Keypoints 2 and 3 are used for the right upper arm, and keypoints 5 and 6 for the left. A rectangle is formed with the line between keypoints 2 and 3 (or 5 and 6) as the central axis of the longer side; the width is half the length of the longer side.
- Forearm: Keypoints 3 and 4 are used for the right forearm, and keypoints 6 and 7 for the left. A rectangle is formed with the line between keypoints 3 and 4 (or 6 and 7) as the central axis of the longer side.
- Bottom: Keypoints 8, 10, 11, and 13 are used. The distance between the midpoint of the line connecting keypoints 8 and 11 and the midpoint of the line connecting keypoints 10 and 13 is used as the length of the rectangle, with half of this distance as the width.
- Thigh: Keypoints 8 and 9 are used for the right thigh, and keypoints 11 and 12 for the left. A rectangle is formed with the line between keypoints 8 and 9 (or 11 and 12) as the central axis of the longer side.
- Lower Leg: Keypoints 9 and 10 are used for the right lower leg, and keypoints 12 and 13 for the left. The logic is similar to the thigh: a rectangle is formed with the line between keypoints 9 and 10 (or 12 and 13) as the central axis of the longer side.
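Several of the rules above share the same geometric pattern: a rectangle whose longer central axis is the segment between two keypoints and whose width is a fixed fraction of that axis. A minimal sketch of that shared pattern, using the upper-arm ratio of one half (the function name and return convention are ours):

```python
import math

def limb_box(p_a, p_b, width_ratio=0.5):
    """Corners of a rectangle whose longer central axis is the segment
    p_a -> p_b (e.g., keypoints 2 and 3 for the right upper arm), with
    width = width_ratio * axis length, as in the upper-arm rule above.

    Returns the four corners in order; cropping this rotated rectangle
    yields the body part image.
    """
    ax, ay = p_a
    bx, by = p_b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    # Unit normal to the axis, used to offset the corners sideways.
    nx, ny = -dy / length, dx / length
    half_w = 0.5 * width_ratio * length
    return [
        (ax + nx * half_w, ay + ny * half_w),
        (bx + nx * half_w, by + ny * half_w),
        (bx - nx * half_w, by - ny * half_w),
        (ax - nx * half_w, ay - ny * half_w),
    ]

# Vertical upper-arm axis of length 8 -> a 4-wide, 8-long rectangle.
corners = limb_box((0.0, 0.0), (0.0, 8.0), width_ratio=0.5)
```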
4.3. Facial Attributes
4.4. Clothing Attributes
4.5. Processing Speed
4.6. Discussion and Results
In the realm of facial attribute recognition, the accuracy rate for gender identification reached an impressive 95.89%, while age recognition achieved an accuracy of 77.28%. This indicates that the employed models are adept at capturing the key features distinguishing gender and age. For age recognition, the accuracy may be affected by factors such as makeup and obstructions. Obstructions like glasses, hats, or hair could conceal key age-indicative features, thereby reducing the accuracy of recognition.
Regarding clothing attribute recognition, a significant variance in accuracy across attributes was observed. The recognition accuracy for top patterns in the test set was 89.81%, whereas the accuracy for identifying top materials was only 57.04%. Patterns typically exhibit distinctive colors and shapes that remain identifiable even in lower-resolution images, whereas the texture details that distinguish materials may become indiscernible at lower image quality; material characteristics are therefore more challenging than pattern features under the same data quality. The lower performance on clothing materials is also limited by inherent constraints of the VGG model: its focus on extracting global features and shape information may not sufficiently capture the nuanced differences in material textures that are crucial for distinguishing fabrics, and these subtle textures and details might be overlooked during convolutional processing. Moreover, the model's performance is notably influenced by image quality and resolution; lower-resolution or subpar images might lack the detail needed to differentiate material textures effectively.
To address these limitations, future research could ensure the use of high-resolution images and enhance the model's generalization capability by increasing the diversity of material samples, including variations in lighting, angles, and potential obstructions. More advanced deep learning architectures that capture the fine details of material textures could also be explored.
For specific attributes like sleeve and bottom lengths, a logic-based method relying on the ratio of exposed skin pixels was employed, achieving an average accuracy of 89.1% for sleeve length and 80.23% for bottom length, thereby validating the effectiveness of this approach. Our study excluded the effects of stockings and tattoos because they could interfere with the judgment of skin pixels. Future models will need to account for these elements, facing challenges such as accurately distinguishing between tattoos and clothing patterns, as well as addressing changes in skin color and texture caused by stockings.
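The skin-pixel-ratio logic for sleeve length can be sketched as below; the threshold value and the "short"/"long" labels are illustrative assumptions, not the paper's actual parameters, and the skin mask is assumed to come from an upstream skin-color detector:

```python
import numpy as np

def exposed_skin_ratio(skin_mask):
    """Fraction of pixels in a limb crop classified as skin.

    skin_mask is a boolean HxW array produced by any skin-color
    detector applied to the cropped forearm or lower-leg region.
    """
    return float(skin_mask.mean())

def sleeve_length_label(skin_mask, short_threshold=0.4):
    """Heuristic: a largely exposed forearm region suggests short sleeves.

    The 0.4 threshold is a placeholder, not the paper's tuned value.
    """
    return "short" if exposed_skin_ratio(skin_mask) > short_threshold else "long"

# Synthetic example: a forearm crop where the lower 60% of rows are bare skin.
mask = np.zeros((10, 4), dtype=bool)
mask[4:, :] = True
label = sleeve_length_label(mask)
```

The same ratio test applies to bottom length on the lower-leg crops; as noted above, stockings and tattoos would perturb the skin mask and are excluded here.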
Due to the feature segmentation of clothing in this study, the accuracy of individual segmented features contributes to the overall recognition error for tops. In this experiment, the overall recognition accuracy for tops was determined to be 81.4%, and the overall recognition accuracy for bottoms was 85.72%. The overall recognition accuracy for tops is calculated as the average accuracy of the segmented features, including collar, zipper, pattern, material, and sleeve length. Similarly, the overall recognition accuracy for bottoms is determined by averaging the accuracies of two categories: the type and length of the bottom wear. By integrating different models for various features, we attained an above-average recognition accuracy for complete top categories, underscoring the efficacy and practicality of our method.
In terms of performance and processing time, our approach also demonstrated commendable results. The experimental findings revealed that the complete recognition process for top categories averaged only 1.6582 s, and the bottom category recognition process took merely 0.8359 s, keeping the average total processing time within a reasonable range. This ensures the practicality and operability of the model. These results suggest that, despite the typically high computational demands of deep learning models, our method has been optimized for performance, maintaining reasonable processing times.
In summary, our method, through the meticulous segmentation of clothing features and the adaptive application of the most suitable classification models for different features, not only enhances recognition accuracy but also ensures the efficiency of the model’s operation. This showcases the potential and practical value of our approach in the fields of clothing and facial attribute recognition.
5. Conclusions
In this study, we have developed an innovative approach for identifying and classifying clothing and facial attributes, with a particular emphasis on the detailed segmentation of features on tops. In contrast to traditional research that classifies entire tops or lower body attire directly, we dissect tops into several feature categories, such as collars, zippers, materials, and patterns. Depending on the specific characteristics of each feature, such as whether it poses a binary or multi-class problem, we select the most suitable classification model for processing.
The methodologies employed in this research, such as data augmentation, SVM classifiers, and deep learning models like VGG16, provide several advantages. Data augmentation enhances the model’s generalization capability by simulating real-world variations in training data, reducing the risk of overfitting. SVM classifiers have been proven effective in recognizing specific attributes such as zippers and collars. The VGG16 model is renowned for its deep architecture, adept at capturing complex patterns within facial and clothing features, which aids in improving the accuracy of gender recognition and certain clothing attributes.
In conclusion, the experiment yielded an overall recognition accuracy of 81.4% for tops and 85.72% for bottoms, highlighting the efficacy of the applied methodologies in garment categorization. The use of data augmentation and the combination of SVM and deep learning approaches represent methodological advancements, offering a more nuanced understanding of the complex interactions between different attributes in fashion images.
However, these methods also have limitations. SVM classifiers may not capture hierarchical feature representations as effectively as deep learning models, potentially limiting their performance in more complex classification tasks. The performance of the VGG16 model may be influenced by the quality and diversity of the training data.
In summary, the method proposed in this study performs well in terms of overall accuracy. In terms of computational performance, by optimizing models and algorithms, as well as leveraging high-performance hardware, this approach can achieve rapid processing times while maintaining high accuracy, making it suitable for practical application scenarios. Future work could further explore the optimization of models and algorithms to improve the accuracy of age recognition and certain clothing attribute identifications while maintaining or enhancing computational efficiency.