Given an unlabeled image dataset $X = \{x_i\}_{i=1}^{N}$ consisting of $N$ images, where $x_i$ represents the $i$-th unlabeled image, our objective is to associate $X$ with the true labels of $K$ categories. In other words, our goal is to match each image $x_i$ with its corresponding true label $y_i$, where $y_i$ represents the true label of the $i$-th image. Subsequently, we employ diffusion models and classifiers to process the labeled images and generate high-quality images guided by their respective categories. To accomplish these objectives, we propose GDUI. The overall flow of the proposed GDUI for unlabeled-image manipulation is illustrated in Figure 1. First, the input unlabeled images $X$ are clustered into $K$ classes; the pseudo-label-matching algorithm then transforms the image set with pseudo-labels into a set of images with true labels. Second, we fine-tune the labels of the images using the label-matching refinement algorithm. Third, we optimize guided diffusion using the labeled images matched by the label-matching refinement algorithm. In the following subsections, we first provide a brief background on diffusion models, followed by a detailed dissection of the individual modules.
3.1. Preliminary
Diffusion probabilistic models [40,41] are a class of latent variable models that involve both a forward diffusion process and a reverse diffusion process. The forward process of a diffusion model is a Markov chain in which data are gradually corrupted with Gaussian noise based on a variance schedule $\beta_1, \ldots, \beta_T$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big),$$

where $x_1, \ldots, x_T$ are latent variables of the same dimension as the data $x_0$, and $x_T$ approximately follows the distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. The inverse process of the diffusion model, denoted as $p_\theta(x_{t-1} \mid x_t)$, is defined as a Markov chain with learned Gaussian transitions:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\big),$$

where the mean $\mu_\theta(x_t, t)$ can be represented as a linear combination of $x_t$ and a noise predictor $\epsilon_\theta(x_t, t)$, and the variance $\sigma_t^2$ is typically fixed to a known constant. The quality of samples can be optimized by the following parameterized and simplified objective:

$$L_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\Big[\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big) \big\|^2\Big],$$

where $t$ is uniformly distributed between 1 and $T$, $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
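The forward corruption and the simplified objective can be sketched in a few lines of NumPy. The linear variance schedule, the array shapes, and the absence of a real noise-prediction network are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Illustrative linear variance schedule beta_1..beta_T."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
    alpha_bars = np.cumprod(alphas)         # abar_t = prod_{s<=t} alpha_s
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def l_simple(eps, eps_pred):
    """Squared error between true and predicted noise, averaged over the batch."""
    return float(np.mean(np.sum((eps - eps_pred) ** 2, axis=-1)))
```

With a trained noise predictor producing `eps_pred` from `(x_t, t)`, minimizing `l_simple` recovers the objective above; here any array of matching shape stands in for it.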
Compared to unconditional diffusion models, conditional diffusion models can generate images specified by conditions. The classifier-guided [5,42] sampling method demonstrates that the gradient $\nabla_{x_t} \log p_\phi(y \mid x_t)$ of a classifier can guide conditional diffusion models to generate higher-quality samples with a specified class $y$.
3.2. Pseudo-Label Generation
To extend the classifier guidance technique to unlabeled images, we adopt a deep clustering approach for the unsupervised learning of visual features [43] to cluster the samples and generate synthetic labels. Specifically, we adopt the SPICE [31] framework, which divides network training into three stages. In the first stage, there are two branches that take two different random transformations of the same image as inputs. Each branch includes a feature model and a projection head. Given two transformations $x'$ and $x''$ of an image $x$, the outputs of the two branches are represented as $z'$ and $z''$, respectively. The parameters $\theta_f$ of the feature model and $\theta_p$ of the projection head are optimized by the following loss function:

$$\mathcal{L}_{con} = -\log \frac{\exp\big(\mathrm{sim}(z', z'')/\tau\big)}{\sum_{z^-} \exp\big(\mathrm{sim}(z', z^-)/\tau\big)},$$

where $z^-$ is a negative sample, $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, and $\tau$ is the temperature. The finally optimized feature model parameters are denoted as $\theta_f^*$.
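A minimal NumPy sketch of this contrastive objective, treating the other in-batch samples as negatives; the batch construction and temperature value are assumptions for illustration:

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """Contrastive loss for two views z_a, z_b of shape (B, D).

    Positives sit on the diagonal of the similarity matrix; every other
    pair in the batch serves as a negative sample.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                       # cosine sim / temperature
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

When the two views of each image embed close together and away from other images, the loss approaches zero; mismatched pairs drive it up.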
In the second stage, given the feature model parameters $\theta_f^*$ and the unlabeled images $X$, the goal is to separately optimize the parameters $\theta_c$ of the clustering head in order to predict cluster labels $\hat{y}$. The optimization of $\theta_c$ is performed within the EM framework: in the expectation (E) step, the cluster labels $\hat{y}$ are obtained given $\theta_c$, and in the maximization (M) step, the parameters $\theta_c$ are optimized upon obtaining the cluster labels $\hat{y}$.
In the third stage, the feature model $\theta_f$ and clustering head $\theta_c$ are jointly optimized. After obtaining the embedding features $z_i$ and cluster labels $\hat{y}_i$ corresponding to the images $x_i$ in the first two stages, a subset of reliable samples $X_r$ is selected as:

$$X_r = \{x_i \mid r_i > \gamma\},$$

where $r_i$ is the semantically consistent ratio of the sample $x_i$ and $\gamma$ denotes the confidence threshold for $r_i$. The semantically consistent ratio of the sample is defined as:

$$r_i = \frac{1}{n_s} \sum_{x_j \in \mathcal{N}(x_i)} \mathbb{1}\big(\hat{y}_j = \hat{y}_i\big),$$

where $\mathcal{N}(x_i)$ represents the $n_s$ samples that are closest to the sample $x_i$ based on the cosine similarity between their embedding features, and $\hat{y}_j$ represents the corresponding labels of these samples. The jointly trained network optimizes the parameters $\theta_f$ and $\theta_c$ using the following loss function:

$$\mathcal{L} = \frac{1}{L}\sum_{i=1}^{L} \mathrm{CE}\big(\hat{y}_i,\ f(\mathcal{T}_w(x_i))\big) + \frac{1}{U}\sum_{j=1}^{U} \mathrm{CE}\big(\tilde{y}_j,\ f(\mathcal{T}_s(x_j))\big),$$

where the first part is calculated using the $L$ reliable samples from $X_r$, and the second part is calculated using $U$ pseudo-labeled samples with pseudo-labels $\tilde{y}_j$. These pseudo-labels are assigned to the classes predicted by the network with the highest probability when that probability exceeds a certain threshold. $\mathcal{T}_w$ and $\mathcal{T}_s$ respectively denote the operators for weak and strong transformations of the input image, and $\mathrm{CE}$ is the cross-entropy loss function.

After the three stages of clustering, the input unlabeled images $X$ are divided into $K$ clusters with clustering labels $\hat{y}$. A probability matrix $P \in \mathbb{R}^{N \times K}$ is generated for the image set over the clusters; $P$ represents the probability of each image belonging to each cluster.
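The reliable-sample selection above can be sketched as follows; the neighbor count, the threshold $\gamma$, and the function names are illustrative assumptions:

```python
import numpy as np

def consistency_ratios(embeddings, labels, n_neighbors=3):
    """Fraction of each sample's nearest neighbors (by cosine similarity
    of embeddings) that share its cluster label."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z.T
    np.fill_diagonal(sims, -np.inf)          # exclude the sample itself
    ratios = np.empty(len(labels))
    for i in range(len(labels)):
        nn = np.argsort(-sims[i])[:n_neighbors]
        ratios[i] = np.mean(labels[nn] == labels[i])
    return ratios

def select_reliable(embeddings, labels, gamma=0.9, n_neighbors=3):
    """Indices of samples whose consistency ratio exceeds gamma."""
    return np.where(consistency_ratios(embeddings, labels, n_neighbors) > gamma)[0]
```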
3.3. Diffusion Pseudo-Label Matching
Based on the obtained cluster labels $\hat{y}$ of $X$ over the $K$ clusters, our goal is to match them with the ground-truth labels to guide target generation in the GDUI model. In unsupervised situations, however, we have no ground truth to match against. To address this challenge and ensure a globally attentive alignment, we adopt the principles of the Stable Marriage Algorithm (SMA) [44] for the overall matching strategy, which emphasizes the importance of considering global information in the matching process. We therefore propose the pseudo-label-matching algorithm, which leverages the zero-shot capability of CLIP to achieve bilateral matching between the clustering labels and the ground truth. Given the clustering probability matrix $P$ and the $K$ clusters with cluster labels $\hat{y}$, the most confident samples are selected as the clustering prototypes for each cluster.
To illustrate the process, we take the $m$-th cluster as an example. We define the $m$-th cluster as:

$$C_m = \{x_i \mid \hat{y}_i = m\}. \qquad (10)$$

The top $k$ confident samples are selected as:

$$S_m = \operatorname*{argtopk}_{x_i \in C_m} P_{i,m}, \qquad (11)$$

where the argtopk function returns the indices of the $k$ highest-scoring samples in the $m$-th cluster.
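A small sketch of this prototype selection, assuming hard cluster assignments come from the row-wise argmax of $P$ (an assumption made for illustration):

```python
import numpy as np

def argtopk(P, m, k):
    """Indices of the k samples most confidently assigned to cluster m."""
    members = np.where(np.argmax(P, axis=1) == m)[0]   # samples in cluster m
    order = np.argsort(-P[members, m])                 # most confident first
    return members[order[:k]]
```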
Using CLIP, zero-shot classification is performed on the samples in $S_m$ with respect to the provided ground-truth label set $Y$. The class with the highest classification probability is then selected as the label for each sample. We obtain the CLIP classification probability for the $m$-th cluster by calculating the proportion of each class among these samples. This ensures that the priority of each class in the clustering result is directly proportional to its probability, such that higher probabilities correspond to higher priorities. Then, based on the previously obtained $K$ clusters, the CLIP priority matrix for the $K$ clusters, $Q$, is constructed.
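One row of the priority matrix $Q$ is simply the per-class proportion among the prototypes' zero-shot predictions. In this sketch the CLIP call is replaced by a precomputed list of predicted class indices, which is an assumption:

```python
import numpy as np
from collections import Counter

def class_priorities(predicted_classes, num_classes):
    """Fraction of prototype samples assigned to each class by the
    (mocked) zero-shot classifier."""
    counts = Counter(predicted_classes)
    total = len(predicted_classes)
    return np.array([counts.get(c, 0) / total for c in range(num_classes)])
```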
A cluster, for example, the $m$-th cluster $C_m$, is selected from the unmatched clusters $\mathcal{U}$ that have not been matched with the ground-truth label set $Y$, and the highest-priority class $y^*$ is selected as:

$$y^* = \operatorname*{argmax}_{j \in Y_m} Q_{m,j}, \qquad (12)$$

where $Y_m$ represents the ground-truth labels that the $m$-th cluster has not yet requested for matching, and $Q_{m,j}$ denotes the element in the $m$-th row and column $j$ of matrix $Q$, with $j$ the index of an element in $Y_m$. The argmax thus returns the index of the highest-priority class among all unrequested ground truths for the $m$-th cluster. If the class $y^*$ has not been assigned, it is assigned to the current $m$-th cluster. Otherwise, its priority is compared with that of the cluster it is already assigned to: the cluster with the lower priority is added back to the unmatched cluster set $\mathcal{U}$, while the one with the higher priority is matched with class $y^*$. Once all clusters are matched, we obtain each cluster and its corresponding label, denoted by $\{(C_i, y_i)\}_{i=1}^{K}$, where each label corresponds to a true class.
The above process for diffusion pseudo-label matching is summarized in Algorithm 1.
Algorithm 1 Pseudo-label matching
Input: Unmatched clusters $\mathcal{U}$
Output: Matched clusters $\mathcal{M}$
1: for $m = 1$ to $K$ do
2:   Select the $m$-th cluster $C_m$ from $\mathcal{U}$ with Equation (10);
3:   Select the top confident samples $S_m$ from $C_m$ with Equation (11);
4:   Compute class proportions based on the highest classification probability of each sample in $S_m$ using CLIP;
5: end for
6: Generate priority matrix $Q$;
7: while $\mathcal{U} \neq \emptyset$ do
8:   Pick a cluster $C_m$ from $\mathcal{U}$;
9:   Select the highest-priority true label $y^*$ among unrequested matches with Equation (12);
10:  if $y^*$ has not been assigned then
11:    Assign $y^*$ to cluster $C_m$;
12:  else
13:    Let $C_n$ be the cluster to which $y^*$ has been assigned;
14:    Assign $y^*$ to the cluster with higher priority between $C_m$ and $C_n$;
15:    Add the lower-priority cluster back to the unmatched clusters $\mathcal{U}$;
16:  end if
17: end while
18: return Matched clusters $\mathcal{M}$;
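The proposer side of Algorithm 1 can be sketched as a Gale-Shapley-style loop over a toy priority matrix; the data structures and tie-breaking are illustrative assumptions:

```python
import numpy as np

def match_clusters(Q):
    """Stable-marriage-style matching between K clusters and K labels.

    Q[m, j] is the proportion of cluster m's top samples that the
    zero-shot classifier assigned to class j. Returns {cluster: label}.
    """
    K = Q.shape[0]
    unmatched = list(range(K))
    requested = [set() for _ in range(K)]   # labels each cluster already tried
    assigned = {}                            # label -> cluster currently holding it
    while unmatched:
        m = unmatched.pop()
        # highest-priority label this cluster has not yet requested
        candidates = [j for j in range(K) if j not in requested[m]]
        j = max(candidates, key=lambda c: Q[m, c])
        requested[m].add(j)
        if j not in assigned:
            assigned[j] = m
        elif Q[m, j] > Q[assigned[j], j]:
            unmatched.append(assigned[j])    # displaced cluster re-enters
            assigned[j] = m
        else:
            unmatched.append(m)              # try the next-best label later
    return {m: j for j, m in assigned.items()}
```

Because each cluster requests each label at most once, the loop terminates after at most $K^2$ proposals.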
3.4. Diffusion Label Matching Refinement
In the SPICE framework, an imperfect feature model can cause samples from truly different classes to be assigned similar features, while imperfect clustering heads can result in dissimilar samples being assigned the same cluster label. These issues may eventually lead to the presence of samples from different classes in the same cluster and to mismatches between samples and their true labels. Such errors can also be caused by the pseudo-label-matching algorithm, which can produce mismatches between the cluster labels and the true labels of the clusters they represent. To overcome these issues, we propose a diffusion model label-matching refinement algorithm to adjust the matching of labels within clusters.
Here, we again use the $m$-th cluster as an example. Similar to our previous selection of the top confident samples $S_m$, the least confident samples $S_m^-$ for the $m$-th cluster are selected as:

$$S_m^- = \operatorname*{arglowk}_{x_i \in C_m} P_{i,m},$$

where the arglowk function returns the indices of the least confident samples, selected among the indices belonging to the $m$-th cluster in the $m$-th column of matrix $P$. Similar to $S_m$, zero-shot classification with respect to the true labels is also performed on $S_m^-$ using CLIP.
Furthermore, the semantic matching ratios $r_m^{top}$ and $r_m^{bot}$ for the top and bottom samples of the $m$-th cluster can be represented as:

$$r_m^{top} = \frac{1}{|S_m|} \sum_{x_i \in S_m} \mathbb{1}\big(\mathrm{CLIP}(x_i) = y_m\big), \qquad r_m^{bot} = \frac{1}{|S_m^-|} \sum_{x_i \in S_m^-} \mathbb{1}\big(\mathrm{CLIP}(x_i) = y_m\big),$$

where $r_m^{top}$ and $r_m^{bot}$ reflect the matching status of the top and bottom samples of the $m$-th cluster. To comprehensively reflect the matching status of the $m$-th cluster, the overall semantic matching ratio is defined as:

$$r_m = w_{top}\, r_m^{top} + w_{bot}\, r_m^{bot},$$

where $w_{top}$ and $w_{bot}$ represent the weights of $r_m^{top}$ and $r_m^{bot}$, respectively, in the overall matching ratio $r_m$.
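The two ratios and their weighted combination amount to a few lines; the weight values and argument names here are assumptions for illustration:

```python
def matching_ratio(top_preds, bot_preds, cluster_label, w_top=0.5, w_bot=0.5):
    """Agreement between zero-shot predictions and the cluster's matched label.

    top_preds / bot_preds are the predicted class indices of the most and
    least confident samples of the cluster.
    """
    r_top = sum(p == cluster_label for p in top_preds) / len(top_preds)
    r_bot = sum(p == cluster_label for p in bot_preds) / len(bot_preds)
    return w_top * r_top + w_bot * r_bot, r_top, r_bot
```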
If the overall semantic matching ratio $r_m > \eta$, where $\eta$ is the overall matching threshold, the high matching degree for the $m$-th cluster $C_m$ implies that its cluster label $y_m$ is trustworthy. In other cases, further examination is required to determine the matching status of the top and bottom samples of the $m$-th cluster. When the matching status of the top samples $r_m^{top}$ is greater than the top matching threshold $\eta_{top}$ but the matching status of the bottom samples $r_m^{bot}$ is less than the bottom matching threshold $\eta_{bot}$, it is necessary to evaluate the semantic consistency ratio $s_m$ of the least confident samples $S_m^-$, which can be defined as:

$$s_m = \frac{1}{|S_m^-|} \sum_{x_i \in S_m^-} P_{i,m},$$

where $P_{i,m}$ denotes the clustering probability of the element located in the $i$-th row and $m$-th column of matrix $P$, specifically the probability that $x_i$ belongs to the $m$-th cluster $C_m$. When $s_m$ exceeds the confidence threshold $\delta$, it implies that even if the matching degree at the bottom level is lower than the threshold, the overall consistency of the $m$-th cluster is sufficiently reliable, thus suggesting that the cluster label as a whole can be trusted. Otherwise, the samples in the $m$-th cluster with low clustering consistency, denoted as:

$$X_m^{low} = \{x_i \in C_m \mid P_{i,m} < \delta\},$$

need to be reassigned to other clusters using CLIP.
If $r_m^{top}$ is less than the top matching threshold $\eta_{top}$, it suggests a mismatch in the overall clustering. If $s_m$ is also lower than the confidence threshold $\delta$, we use CLIP to reassign the samples in the $m$-th cluster with low clustering consistency to other clusters, indicating that their original cluster labels are no longer valid. Furthermore, when $s_m$ exceeds the confidence threshold $\delta$, suggesting overall clustering consistency despite the mismatch, we maintain the original matched cluster labels $y_m$. Following the fine-tuning of each cluster, the resulting fine-tuned clusters along with their corresponding labels, $\{(\tilde{C}_i, y_i)\}_{i=1}^{K}$, are obtained; $\tilde{C}_i$ denotes a cluster that has undergone fine-tuning.
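The branching logic of the refinement step can be summarized as a single decision function; all threshold values here are illustrative assumptions, not the paper's settings:

```python
def refine_decision(r_top, r_bot, s_m, w_top=0.5, w_bot=0.5,
                    eta=0.8, eta_top=0.8, eta_bot=0.5, delta=0.7):
    """Decide whether a cluster's matched label can be trusted.

    r_top, r_bot: semantic matching ratios of the top/bottom samples;
    s_m: mean clustering probability of the least confident samples.
    Returns "trust" or "reassign" (reassign low-consistency samples).
    """
    r_m = w_top * r_top + w_bot * r_bot
    if r_m > eta:
        return "trust"                      # overall match is high
    if r_top > eta_top and r_bot < eta_bot:
        # top matches but bottom does not: fall back on cluster consistency
        return "trust" if s_m > delta else "reassign"
    if r_top < eta_top:
        # overall mismatch: keep labels only when consistency is high
        return "trust" if s_m > delta else "reassign"
    return "trust"
```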
3.5. Synthesis Guided with Matching Labels
For conditional image synthesis, we use a classifier $p_\phi(y \mid x_t)$ to enhance the generator of the diffusion model based on the clusters with their matched real classes, where $x_t$ is the input image to the classifier and $y$ is the corresponding output label. The classification network is composed of the feature model $\theta_f$ and the clustering head $\theta_c$ from the clustering stage, along with an additional classification head. As demonstrated in previous works [41,42], a pre-trained diffusion model can be conditioned via the gradients of a classifier. The conditioned reverse denoising process, denoted as $p_\theta(x_{t-1} \mid x_t)$ in Equation (4), can be expressed as $p_{\theta,\phi}(x_{t-1} \mid x_t, y)$. In [41,42], the following equation:

$$p_{\theta,\phi}(x_{t-1} \mid x_t, y) \approx \mathcal{N}\big(x_{t-1};\ \mu + \Sigma g,\ \Sigma\big),$$

where $g = \nabla_{x_t} \log p_\phi(y \mid x_t)\big|_{x_t = \mu}$ and, for brevity, $\mu = \mu_\theta(x_t, t)$ and $\Sigma = \sigma_t^2 \mathbf{I}$, has been proven; $\mu$ and $\Sigma$ are constants with respect to the gradient. $p_\phi(y \mid x_t)$ is a shorthand for the classifier $p_\phi(y \mid x_t, t)$ trained on noisy images $x_t$.
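The shifted mean can be sketched directly. The quadratic toy log-classifier, whose gradient simply points at a class center, is an assumption standing in for the trained network:

```python
import numpy as np

def guided_mean(mu, sigma2, grad_log_p, scale=1.0):
    """mu_hat = mu + scale * sigma^2 * grad_x log p(y | x), evaluated at mu."""
    return mu + scale * sigma2 * grad_log_p(mu)

def make_grad(center):
    """Toy classifier log p(y|x) = -||x - c_y||^2 / 2; its gradient is c_y - x."""
    return lambda x: center - x
```

Sampling with this shifted mean nudges each reverse step toward regions the classifier scores highly for class $y$; the `scale` factor corresponds to the common guidance-strength knob.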