Given an unlabeled image dataset $X = \{x_i\}_{i=1}^{N}$ consisting of $N$ images, where $x_i$ represents the $i$-th unlabeled image, our objective is to associate $X$ with the true labels of $K$ categories. In other words, our goal is to match each image $x_i$ with its corresponding true label $y_i$, where $y_i$ represents the true label of the $i$-th image. Subsequently, we employ diffusion models and classifiers to process the labeled images and generate high-quality images guided by their respective categories. To accomplish these objectives, we propose GDUI. The overall flow of the proposed GDUI for unlabeled-image manipulation is illustrated in Figure 1. First, the input unlabeled images $X$ are clustered into $K$ classes; the pseudo-label-matching algorithm then transforms the image set with pseudo-labels into a set of images with true labels. Second, we fine-tune the labels of the images using the label-matching refinement algorithm. Third, we optimize guided diffusion using the labeled images matched by the label-matching refinement algorithm. In the following subsections, we first provide a brief background on diffusion models, followed by a detailed dissection of the individual modules.
3.1. Preliminary
Diffusion probabilistic models [40,41] are a class of latent variable models that involve both a forward diffusion process and a reverse diffusion process. The forward process of a diffusion model is a Markov chain in which data are gradually corrupted with Gaussian noise based on a variance schedule $\beta_1, \ldots, \beta_T$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big),$$

where $x_1, \ldots, x_T$ are latent variables of the same dimension as the data $x_0$, and $x_T$ approximately follows the distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. The inverse process of the diffusion model, denoted as $p_\theta(x_{t-1} \mid x_t)$, is defined as a Markov chain with learned Gaussian transitions:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\big),$$

where the mean $\mu_\theta(x_t, t)$ can be represented as a linear combination of $x_t$ and a noise predictor $\epsilon_\theta(x_t, t)$, and the variance $\sigma_t^2$ is typically fixed to a known constant. The quality of samples can be optimized by the following parameterized and simplified objective:

$$L_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\Big[\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big) \big\|^2\Big],$$

where $t$ is uniformly distributed between 1 and $T$, $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
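The forward corruption and the simplified objective can be sketched in a few lines of NumPy. The linear variance schedule, the array shapes, and the absence of a real noise-prediction network are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Illustrative linear variance schedule beta_1..beta_T."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
    alpha_bars = np.cumprod(alphas)         # abar_t = prod_{s<=t} alpha_s
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def l_simple(eps, eps_pred):
    """Squared error between true and predicted noise, averaged over the batch."""
    return float(np.mean(np.sum((eps - eps_pred) ** 2, axis=-1)))
```

With a trained noise predictor producing `eps_pred` from `(x_t, t)`, minimizing `l_simple` recovers the objective above; here any array of matching shape stands in for it.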
Compared to unconditional diffusion models, conditional diffusion models can generate images specified by conditions. The classifier-guided [5,42] sampling method demonstrates that the gradient $\nabla_{x_t} \log p_\phi(y \mid x_t)$ of a classifier can guide conditional diffusion models to generate higher-quality samples with a specified class $y$.
3.2. Pseudo-Label Generation
To extend the classifier guidance technique to unlabeled images, we adopt a deep clustering approach for the unsupervised learning of visual features [43] to cluster the samples and generate synthetic labels. Specifically, we adopt the SPICE [31] framework, which divides network training into three stages. In the first stage, there are two branches that take two different random transformations of the same image as inputs. Each branch includes a feature model and a projection head. Given two transformations $x'$ and $x''$ of an image $x$, the outputs of the two branches are represented as $z'$ and $z''$, respectively. The parameters $\theta_f$ of the feature model and $\theta_p$ of the projection head are optimized by the following loss function:

$$\mathcal{L}_{con} = -\log \frac{\exp\big(\mathrm{sim}(z', z'')/\tau\big)}{\sum_{z^-} \exp\big(\mathrm{sim}(z', z^-)/\tau\big)},$$

where $z^-$ is a negative sample, $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, and $\tau$ is the temperature. The finally optimized feature model parameters are denoted as $\theta_f^*$.
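A minimal NumPy sketch of this contrastive objective, treating the other in-batch samples as negatives; the batch construction and temperature value are assumptions for illustration:

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """Contrastive loss for two views z_a, z_b of shape (B, D).

    Positives sit on the diagonal of the similarity matrix; every other
    pair in the batch serves as a negative sample.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                       # cosine sim / temperature
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

When the two views of each image embed close together and away from other images, the loss approaches zero; mismatched pairs drive it up.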
In the second stage, given the feature model parameters $\theta_f^*$ and the unlabeled images $X$, the goal is to separately optimize the parameters $\theta_c$ of the clustering head in order to predict cluster labels $\hat{y}$. The optimization of $\theta_c$ is performed within the EM framework: in the expectation (E) step, the cluster labels $\hat{y}$ are obtained given $\theta_c$, and in the maximization (M) step, the parameters $\theta_c$ are optimized upon obtaining the cluster labels $\hat{y}$.
In the third stage, the feature model $\theta_f$ and clustering head $\theta_c$ are jointly optimized. After obtaining the embedding features $z_i$ and cluster labels $\hat{y}_i$ corresponding to the images $x_i$ in the first two stages, a subset of reliable samples $X_r$ is selected as:

$$X_r = \{x_i \mid r_i > \gamma\},$$

where $r_i$ is the semantically consistent ratio of the sample $x_i$ and $\gamma$ denotes the confidence threshold for $r_i$. The semantically consistent ratio of the sample is defined as:

$$r_i = \frac{1}{n_s} \sum_{x_j \in \mathcal{N}(x_i)} \mathbb{1}\big(\hat{y}_j = \hat{y}_i\big),$$

where $\mathcal{N}(x_i)$ represents the $n_s$ samples that are closest to the sample $x_i$ based on the cosine similarity between their embedding features, and $\hat{y}_j$ represents the corresponding labels of these samples. The jointly trained network optimizes the parameters $\theta_f$ and $\theta_c$ using the following loss function:

$$\mathcal{L} = \frac{1}{L}\sum_{i=1}^{L} \mathrm{CE}\big(\hat{y}_i,\ f(\mathcal{T}_w(x_i))\big) + \frac{1}{U}\sum_{j=1}^{U} \mathrm{CE}\big(\tilde{y}_j,\ f(\mathcal{T}_s(x_j))\big),$$

where the first part is calculated using the $L$ reliable samples from $X_r$, and the second part is calculated using $U$ pseudo-labeled samples with pseudo-labels $\tilde{y}_j$. These pseudo-labels are assigned to the classes predicted by the network with the highest probability when that probability exceeds a certain threshold. $\mathcal{T}_w$ and $\mathcal{T}_s$ respectively denote the operators for weak and strong transformations of the input image, and $\mathrm{CE}$ is the cross-entropy loss function.

After the three stages of clustering, the input unlabeled images $X$ are divided into $K$ clusters with clustering labels $\hat{y}$. A probability matrix $P \in \mathbb{R}^{N \times K}$ is generated for the image set over the clusters; $P$ represents the probability of each image belonging to each cluster.
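The reliable-sample selection above can be sketched as follows; the neighbor count, the threshold $\gamma$, and the function names are illustrative assumptions:

```python
import numpy as np

def consistency_ratios(embeddings, labels, n_neighbors=3):
    """Fraction of each sample's nearest neighbors (by cosine similarity
    of embeddings) that share its cluster label."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z.T
    np.fill_diagonal(sims, -np.inf)          # exclude the sample itself
    ratios = np.empty(len(labels))
    for i in range(len(labels)):
        nn = np.argsort(-sims[i])[:n_neighbors]
        ratios[i] = np.mean(labels[nn] == labels[i])
    return ratios

def select_reliable(embeddings, labels, gamma=0.9, n_neighbors=3):
    """Indices of samples whose consistency ratio exceeds gamma."""
    return np.where(consistency_ratios(embeddings, labels, n_neighbors) > gamma)[0]
```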
3.3. Diffusion Pseudo-Label Matching
Based on the obtained cluster labels $\hat{y}$ of $X$ over the $K$ clusters, our goal is to match them with the ground-truth labels to guide target generation in the GDUI model. In unsupervised situations, however, we have no ground truth to match against. To address this challenge and ensure a globally attentive alignment, we adopt the principles of the Stable Marriage Algorithm (SMA) [44] for the overall matching strategy, which emphasizes the importance of considering global information in the matching process. We therefore propose the pseudo-label-matching algorithm, which leverages the zero-shot capability of CLIP to achieve bilateral matching between the clustering labels and the ground truth. Given the clustering probability matrix $P$ and the $K$ clusters with cluster labels $\hat{y}$, the most confident samples are selected as the clustering prototypes for each cluster.
To illustrate the process, we take the $m$-th cluster as an example. We define the $m$-th cluster as:

$$C_m = \{x_i \mid \hat{y}_i = m\}. \qquad (10)$$

The top $k$ confident samples are selected as:

$$S_m = \operatorname*{argtopk}_{x_i \in C_m} P_{i,m}, \qquad (11)$$

where the argtopk function returns the indices of the $k$ highest-scoring samples in the $m$-th cluster.
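A small sketch of this prototype selection, assuming hard cluster assignments come from the row-wise argmax of $P$ (an assumption made for illustration):

```python
import numpy as np

def argtopk(P, m, k):
    """Indices of the k samples most confidently assigned to cluster m."""
    members = np.where(np.argmax(P, axis=1) == m)[0]   # samples in cluster m
    order = np.argsort(-P[members, m])                 # most confident first
    return members[order[:k]]
```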
Using CLIP, zero-shot classification is performed on the samples in $S_m$ with respect to the provided ground-truth label set $Y$. The class with the highest classification probability is then selected as the label for each sample. We obtain the CLIP classification probability for the $m$-th cluster by calculating the proportion of each class among these samples. This ensures that the priority of each class in the clustering result is directly proportional to its probability, such that higher probabilities correspond to higher priorities. Then, based on the previously obtained $K$ clusters, the CLIP priority matrix for the $K$ clusters, $Q$, is constructed.
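One row of the priority matrix $Q$ is simply the per-class proportion among the prototypes' zero-shot predictions. In this sketch the CLIP call is replaced by a precomputed list of predicted class indices, which is an assumption:

```python
import numpy as np
from collections import Counter

def class_priorities(predicted_classes, num_classes):
    """Fraction of prototype samples assigned to each class by the
    (mocked) zero-shot classifier."""
    counts = Counter(predicted_classes)
    total = len(predicted_classes)
    return np.array([counts.get(c, 0) / total for c in range(num_classes)])
```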
A cluster, for example, the $m$-th cluster $C_m$, is selected from the unmatched clusters $\mathcal{U}$ that have not been matched with the ground-truth label set $Y$, and the highest-priority class $y^*$ is selected as:

$$y^* = \operatorname*{argmax}_{j \in Y_m} Q_{m,j}, \qquad (12)$$

where $Y_m$ represents the ground-truth labels that the $m$-th cluster has not yet requested for matching, and $Q_{m,j}$ denotes the element in the $m$-th row and column $j$ of matrix $Q$, with $j$ the index of an element in $Y_m$. The argmax thus returns the index of the highest-priority class among all unrequested ground truths for the $m$-th cluster. If the class $y^*$ has not been assigned, it is assigned to the current $m$-th cluster. Otherwise, its priority is compared with that of the cluster it is already assigned to: the cluster with the lower priority is added back to the unmatched cluster set $\mathcal{U}$, while the one with the higher priority is matched with class $y^*$. Once all clusters are matched, we obtain each cluster and its corresponding label, denoted by $\{(C_i, y_i)\}_{i=1}^{K}$, where each label corresponds to a true class.
The above process for diffusion pseudo-label matching is summarized in Algorithm 1.
Algorithm 1 Pseudo-label matching
Input: Unmatched clusters $\mathcal{U}$
Output: Matched clusters $\mathcal{M}$
1: for $m = 1$ to $K$ do
2:   Select the $m$-th cluster $C_m$ from $\mathcal{U}$ with Equation (10);
3:   Select the top confident samples $S_m$ from $C_m$ with Equation (11);
4:   Compute class proportions based on the highest classification probability of each sample in $S_m$ using CLIP;
5: end for
6: Generate priority matrix $Q$;
7: while $\mathcal{U} \neq \emptyset$ do
8:   Pick a cluster $C_m$ from $\mathcal{U}$;
9:   Select the highest-priority true label $y^*$ among unrequested matches with Equation (12);
10:  if $y^*$ has not been assigned then
11:    Assign $y^*$ to cluster $C_m$;
12:  else
13:    Let $C_n$ be the cluster to which $y^*$ has been assigned;
14:    Assign $y^*$ to the cluster with higher priority between $C_m$ and $C_n$;
15:    Add the lower-priority cluster back to the unmatched clusters $\mathcal{U}$;
16:  end if
17: end while
18: return Matched clusters $\mathcal{M}$;
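The proposer side of Algorithm 1 can be sketched as a Gale-Shapley-style loop over a toy priority matrix; the data structures and tie-breaking are illustrative assumptions:

```python
import numpy as np

def match_clusters(Q):
    """Stable-marriage-style matching between K clusters and K labels.

    Q[m, j] is the proportion of cluster m's top samples that the
    zero-shot classifier assigned to class j. Returns {cluster: label}.
    """
    K = Q.shape[0]
    unmatched = list(range(K))
    requested = [set() for _ in range(K)]   # labels each cluster already tried
    assigned = {}                            # label -> cluster currently holding it
    while unmatched:
        m = unmatched.pop()
        # highest-priority label this cluster has not yet requested
        candidates = [j for j in range(K) if j not in requested[m]]
        j = max(candidates, key=lambda c: Q[m, c])
        requested[m].add(j)
        if j not in assigned:
            assigned[j] = m
        elif Q[m, j] > Q[assigned[j], j]:
            unmatched.append(assigned[j])    # displaced cluster re-enters
            assigned[j] = m
        else:
            unmatched.append(m)              # try the next-best label later
    return {m: j for j, m in assigned.items()}
```

Because each cluster requests each label at most once, the loop terminates after at most $K^2$ proposals.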
3.4. Diffusion Label Matching Refinement
In the SPICE framework, an imperfect feature model can cause samples from truly different classes to be assigned similar features, while imperfect clustering heads can result in dissimilar samples being assigned the same cluster label. These issues may eventually lead to the presence of samples from different classes in the same cluster and to mismatches between samples and their true labels. Such errors can also be caused by the pseudo-label-matching algorithm, which can produce mismatches between the cluster labels and the true labels of the clusters they represent. To overcome these issues, we propose a diffusion model label-matching refinement algorithm to adjust the matching of labels within clusters.
Here, we again use the $m$-th cluster as an example. Similar to our previous selection of the top confident samples $S_m$, the least confident samples $S_m^-$ for the $m$-th cluster are selected as:

$$S_m^- = \operatorname*{arglowk}_{x_i \in C_m} P_{i,m},$$

where the arglowk function returns the indices of the least confident samples, selected among the indices belonging to the $m$-th cluster in the $m$-th column of matrix $P$. Similar to $S_m$, zero-shot classification with respect to the true labels is also performed on $S_m^-$ using CLIP.
Furthermore, the semantic matching ratios $r_m^{top}$ and $r_m^{bot}$ for the top and bottom samples of the $m$-th cluster can be represented as:

$$r_m^{top} = \frac{1}{|S_m|} \sum_{x_i \in S_m} \mathbb{1}\big(\mathrm{CLIP}(x_i) = y_m\big), \qquad r_m^{bot} = \frac{1}{|S_m^-|} \sum_{x_i \in S_m^-} \mathbb{1}\big(\mathrm{CLIP}(x_i) = y_m\big),$$

where $r_m^{top}$ and $r_m^{bot}$ reflect the matching status of the top and bottom samples of the $m$-th cluster. To comprehensively reflect the matching status of the $m$-th cluster, the overall semantic matching ratio is defined as:

$$r_m = w_{top}\, r_m^{top} + w_{bot}\, r_m^{bot},$$

where $w_{top}$ and $w_{bot}$ represent the weights of $r_m^{top}$ and $r_m^{bot}$, respectively, in the overall matching ratio $r_m$.
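The two ratios and their weighted combination amount to a few lines; the weight values and argument names here are assumptions for illustration:

```python
def matching_ratio(top_preds, bot_preds, cluster_label, w_top=0.5, w_bot=0.5):
    """Agreement between zero-shot predictions and the cluster's matched label.

    top_preds / bot_preds are the predicted class indices of the most and
    least confident samples of the cluster.
    """
    r_top = sum(p == cluster_label for p in top_preds) / len(top_preds)
    r_bot = sum(p == cluster_label for p in bot_preds) / len(bot_preds)
    return w_top * r_top + w_bot * r_bot, r_top, r_bot
```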
If the overall semantic matching ratio $r_m > \eta$, where $\eta$ is the overall matching threshold, the high matching degree for the $m$-th cluster $C_m$ implies that its cluster label $y_m$ is trustworthy. In other cases, further examination is required to determine the matching status of the top and bottom samples of the $m$-th cluster. When the matching status of the top samples $r_m^{top}$ is greater than the top matching threshold $\eta_{top}$ but the matching status of the bottom samples $r_m^{bot}$ is less than the bottom matching threshold $\eta_{bot}$, it is necessary to evaluate the semantic consistency ratio $s_m$ of the least confident samples $S_m^-$, which can be defined as:

$$s_m = \frac{1}{|S_m^-|} \sum_{x_i \in S_m^-} P_{i,m},$$

where $P_{i,m}$ denotes the clustering probability of the element located in the $i$-th row and $m$-th column of matrix $P$, specifically the probability that $x_i$ belongs to the $m$-th cluster $C_m$. When $s_m$ exceeds the confidence threshold $\delta$, it implies that even if the matching degree at the bottom level is lower than the threshold, the overall consistency of the $m$-th cluster is sufficiently reliable, thus suggesting that the cluster label as a whole can be trusted. Otherwise, the samples in the $m$-th cluster with low clustering consistency, denoted as:

$$X_m^{low} = \{x_i \in C_m \mid P_{i,m} < \delta\},$$

need to be reassigned to other clusters using CLIP.
If $r_m^{top}$ is less than the top matching threshold $\eta_{top}$, it suggests a mismatch in the overall clustering. If $s_m$ is also lower than the confidence threshold $\delta$, we use CLIP to reassign the samples in the $m$-th cluster with low clustering consistency to other clusters, indicating that their original cluster labels are no longer valid. Furthermore, when $s_m$ exceeds the confidence threshold $\delta$, suggesting overall clustering consistency despite the mismatch, we maintain the original matched cluster labels $y_m$. Following the fine-tuning of each cluster, the resulting fine-tuned clusters along with their corresponding labels, $\{(\tilde{C}_i, y_i)\}_{i=1}^{K}$, are obtained; $\tilde{C}_i$ denotes a cluster that has undergone fine-tuning.
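The branching logic of the refinement step can be summarized as a single decision function; all threshold values here are illustrative assumptions, not the paper's settings:

```python
def refine_decision(r_top, r_bot, s_m, w_top=0.5, w_bot=0.5,
                    eta=0.8, eta_top=0.8, eta_bot=0.5, delta=0.7):
    """Decide whether a cluster's matched label can be trusted.

    r_top, r_bot: semantic matching ratios of the top/bottom samples;
    s_m: mean clustering probability of the least confident samples.
    Returns "trust" or "reassign" (reassign low-consistency samples).
    """
    r_m = w_top * r_top + w_bot * r_bot
    if r_m > eta:
        return "trust"                      # overall match is high
    if r_top > eta_top and r_bot < eta_bot:
        # top matches but bottom does not: fall back on cluster consistency
        return "trust" if s_m > delta else "reassign"
    if r_top < eta_top:
        # overall mismatch: keep labels only when consistency is high
        return "trust" if s_m > delta else "reassign"
    return "trust"
```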
3.5. Synthesis Guided with Matching Labels
For conditional image synthesis, we use a classifier $p_\phi(y \mid x_t)$ to enhance the generator of the diffusion model based on the clusters with their matched real classes, where $x_t$ is the input image to the classifier and $y$ is the corresponding output label. The classification network is composed of the feature model $\theta_f$ and the clustering head $\theta_c$ from the clustering stage, along with an additional classification head. As demonstrated in previous works [41,42], a pre-trained diffusion model can be conditioned via the gradients of a classifier. The conditioned reverse denoising process, denoted as $p_\theta(x_{t-1} \mid x_t)$ in Equation (4), can be expressed as $p_{\theta,\phi}(x_{t-1} \mid x_t, y)$. In [41,42], the following equation:

$$p_{\theta,\phi}(x_{t-1} \mid x_t, y) \approx \mathcal{N}\big(x_{t-1};\ \mu + \Sigma g,\ \Sigma\big),$$

where $g = \nabla_{x_t} \log p_\phi(y \mid x_t)\big|_{x_t = \mu}$ and, for brevity, $\mu = \mu_\theta(x_t, t)$ and $\Sigma = \sigma_t^2 \mathbf{I}$, has been proven; $\mu$ and $\Sigma$ are constants with respect to the gradient. $p_\phi(y \mid x_t)$ is a shorthand for the classifier $p_\phi(y \mid x_t, t)$ trained on noisy images $x_t$.
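The shifted mean can be sketched directly. The quadratic toy log-classifier, whose gradient simply points at a class center, is an assumption standing in for the trained network:

```python
import numpy as np

def guided_mean(mu, sigma2, grad_log_p, scale=1.0):
    """mu_hat = mu + scale * sigma^2 * grad_x log p(y | x), evaluated at mu."""
    return mu + scale * sigma2 * grad_log_p(mu)

def make_grad(center):
    """Toy classifier log p(y|x) = -||x - c_y||^2 / 2; its gradient is c_y - x."""
    return lambda x: center - x
```

Sampling with this shifted mean nudges each reverse step toward regions the classifier scores highly for class $y$; the `scale` factor corresponds to the common guidance-strength knob.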