# GANs
## Pseudorandom Numbers
There are several algorithms to generate *random* numbers on computers. These
techniques provide randomness over inputs for experiments, but
are still deterministic algorithms.
In fact, they take a seed that in turn generates the whole sequence. Whether
this seed is truly random or not is not the point. The important
thing is that **for each seed** there exists **only one sequence** of *random*
values produced by a given algorithm.
$$
random_{seed = 12}(k) = random_{seed = 12}(k)
$$
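As a minimal illustration (assuming Python, whose `random` module implements a
Mersenne Twister PRNG), two generators created with the same seed produce
exactly the same sequence:

```python
# A minimal sketch: the same seed always reproduces the same "random" sequence,
# which is what the equation above expresses.
import random

a = random.Random(12)  # seed = 12
b = random.Random(12)  # same seed, independent generator object

print([a.random() for _ in range(3)])
print([b.random() for _ in range(3)])  # identical to the first list
```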
> [!CAUTION]
> Never use these generators for cryptography-related tasks. There are
> ways to leak both the seed and the whole sequence, making any encryption
> or signature useless.
>
> Cryptographically secure algorithms need a source of entropy, i.e. input that
> is truly random and external to the system (since computers are
> deterministic)
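In practice (Python assumed) the two use cases are served by different
modules: `random` for reproducible pseudorandomness in experiments, and
`secrets`, backed by the OS entropy source, for anything security related.

```python
# Sketch: OS-backed entropy for security-sensitive values (Python assumed).
import secrets

token = secrets.token_hex(16)         # 128 bits from the OS CSPRNG, not reproducible
key_material = secrets.randbits(256)  # a 256-bit secret integer
print(token, key_material)
```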
## Pseudorandom Algorithms
### [Rejection Sampling (aka Acceptance-Rejection Method)](https://en.wikipedia.org/wiki/Rejection_sampling)
Instead of sampling directly from the **complex distribution**, we sample from
a **simpler one**, usually a uniform distribution,
and accept or reject values according to some conditions.
If these conditions are crafted in a specific way, our accepted samples will resemble those of the **complex distribution**.
However, we need to know the real probability distribution before being able
to accept and reject samples.
![rejection sampling image](./pngs/rejection-sampling.png)
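A minimal sketch of the idea (Python/NumPy assumed; `target_pdf`, the interval
and the envelope height `m` are illustrative choices, not from the source):
draw candidates from a uniform proposal and keep each one only if a second
uniform draw falls under the target density.

```python
import numpy as np

def rejection_sample(target_pdf, n, lo=-5.0, hi=5.0, m=1.0, seed=12):
    """Draw n samples from target_pdf on [lo, hi], assuming target_pdf <= m there."""
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < n:
        x = rng.uniform(lo, hi)    # candidate from the simple (uniform) distribution
        u = rng.uniform(0.0, m)    # uniform height under the envelope m
        if u <= target_pdf(x):     # keep it only if it falls under the target curve
            samples.append(x)
    return np.array(samples)

# Example: a standard normal density, truncated to [-5, 5]
normal_pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
draws = rejection_sample(normal_pdf, 1_000)
```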
### [Metropolis-Hasting](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)
The idea is to **construct a Markov chain** over the possible values we can sample, whose **stationary distribution** is the target distribution.
We then let the chain run for a long time, see where we end up, and take that point as our sample.
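A minimal sketch (Python/NumPy assumed; the Gaussian random-walk proposal and
the burn-in length are illustrative choices): because the proposal is
symmetric, the acceptance ratio only needs the target density, even up to a
normalizing constant.

```python
import numpy as np

def metropolis_hastings(target_pdf, n_steps, x0=0.0, step=1.0, seed=12):
    """Random-walk Metropolis: a Markov chain whose stationary distribution is target_pdf."""
    rng = np.random.default_rng(seed)
    chain = [x0]
    x = x0
    for _ in range(n_steps):
        proposal = x + rng.normal(0.0, step)                  # symmetric proposal
        accept_prob = min(1.0, target_pdf(proposal) / target_pdf(x))
        if rng.uniform() < accept_prob:                       # accept, otherwise stay put
            x = proposal
        chain.append(x)
    return np.array(chain)

# Example: sample a standard normal (unnormalized density is enough);
# discard the first steps as burn-in
normal_unnormalized = lambda x: np.exp(-x**2 / 2)
samples = metropolis_hastings(normal_unnormalized, 10_000)[1_000:]
```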
### [Inverse Transform (aka Smirnov Transform)](https://en.wikipedia.org/wiki/Inverse_transform_sampling)
The idea is to sample directly from a **well-known distribution** (typically the uniform one) and then return the smallest number
whose cumulative probability under the **complex distribution** is at least our **simple sample**:
$$
u \sim \mathcal{U}[0, 1] \\
\text{take } x \text{ so that } F(x) \geq u \text{ where } \\
F(x) \text{ is the desired cumulative distribution function}
$$
As proof we have that
$$
F_{X}(x) \in [0, 1], \quad F_{X}^{-1}: [0, 1] \rightarrow \mathbb{R} \\
\text{Let's define } Y = F_{X}^{-1}(U) \\
F_{Y}(y) =
P(Y \leq y) =
P(F_{X}^{-1}(U) \leq y) = P(F_{X}(F_{X}^{-1}(U)) \leq F_{X}(y)) =
P(U \leq F_{X}(y)) = F_{X}(y)
$$
In particular, we demonstrated that $Y$ has the same distribution as $X$,
i.e. they are identically distributed. We also demonstrated that it is possible
to define a random variable with any desired distribution, starting from a uniform one.
> [!TIP]
> Another way to view this is by thinking of the uniform sample as the
> probability that we want from our target distribution.
>
> When we find the smallest number that gives the same probability, it's
> like sampling from the real distribution (especially because we discretize).
>
> ![gif of an example](./pngs/wikipedia_inverse_transform_sampling_example.gif)
>
> By [Davidjessop](https://commons.wikimedia.org/w/index.php?curid=100369573) - Own work, [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0)
> [!NOTE]
> The last passage compares a uniform variable with a value in $[0, 1]$ that
> comes from a CDF. Since $F_{U}(x) = P(U \leq x) = x$ on $[0, 1]$, the last
> equality holds.
>
> We can apply the CDF of $X$ to both sides inside the probability without
> flipping the inequality because any CDF is monotonically non-decreasing
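As a minimal sketch (Python assumed; the exponential distribution and its rate
are illustrative choices): for $\text{Exp}(\lambda)$ the CDF
$F(x) = 1 - e^{-\lambda x}$ inverts in closed form, so each uniform draw maps
directly to one sample.

```python
import math
import random

def exponential_inverse_transform(lam, rng=random):
    """Sample Exp(lam) by inverting its CDF F(x) = 1 - exp(-lam * x)."""
    u = rng.random()                 # u ~ U[0, 1): the "probability we want"
    return -math.log(1.0 - u) / lam  # the smallest x with F(x) >= u

random.seed(12)
draws = [exponential_inverse_transform(2.0) for _ in range(5)]
print(draws)
```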
## Generative Models
Let's suppose for now that all similar things are governed by a
hidden probability distribution. This means that similar
things live close together under it, but we don't know the actual
distribution.
Let's say that we need to generate flower photos. To do so,
we would collect several pictures of them. This allows us
to infer a probability distribution that, while different
from the real one, is still useful for our goal.
Now, to generate flower photos, we only need to sample from
this distribution. However, if none of the above methods are
feasible (which is usually the case), we need to
train a network capable of learning this distribution.
### Direct Learning
This method consists in comparing the generated
distribution with the real one, e.g. by computing the MMD (Maximum Mean Discrepancy)
distance between them. However, this is usually cumbersome.
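A hedged sketch (Python/NumPy assumed; the RBF kernel, bandwidth and toy data
are illustrative choices) of the kind of comparison meant here, using a
plug-in estimate of the squared MMD between a batch of real samples and a
batch of generated ones:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased (plug-in) estimate of the squared MMD between sample sets x and y."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(12)
real = rng.normal(0.0, 1.0, size=(500, 2))        # stand-in for real data
generated = rng.normal(0.5, 1.0, size=(500, 2))   # stand-in for generated data
print(mmd2(real, generated))                       # near 0 only if the two match
```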
### Indirect Learning
We introduce the concept of a **discriminator**, which guesses whether
our image was generated or comes from the actual distribution.
We then update our model based on the score given by the **discriminator**
## Generative Adversarial Networks
If we use the **indirect learning** approach, we need a `Network` capable of
classifying generated and genuine content according to their labels.
However, for the same reasons seen with **generative models**, we don't have such
a `Network` readily available; rather, we have to **learn** it.
The idea is that **the discriminator learns to classify images** (meaning that
it will learn the goal distribution) and **the generator learns to fool
the discriminator** (meaning that it will learn to generate according to
the goal distribution)
## Training Phases
We can train both the **generator and discriminator** together. They will
`backpropagate` based on the **classification errors** of the **discriminator**;
however, they have two distinct objectives:
- `Generator`: **Maximize** this error
- `Discriminator`: **Minimize** this error

From a **game theory** view, they are playing a **zero-sum** game and the
perfect outcome is **when the discriminator guesses the labels correctly only 50% of the time**
### Loss functions
Assuming that $G(\vec{z})$ is the output of the `generator`, $D(\vec{a})$ is the output of the `discriminator`, $y = 0$ labels generated content and $y = 1$ labels real content
- **Loss Function**:\
This is the actual loss for the `Discriminator`; however, we use the ones
below to explain the antagonistic goals of the two networks. Here $\vec{x}$ may
be either real or generated data.
$$
\min_{D} \{
\underbrace{
-y \log{(D(\vec{x}))}
}_{\text{Loss over Real Data}}
\underbrace{
- (1 -y)\log{(1-D(\vec{x}))}
}_{\text{Loss over Generated Data}}
\}
$$
> [!NOTE]
> Notice that our generator can only control the loss over generated data.
> This will be its objective.
- **`Generator`**:\
We want to maximize the error of the `Discriminator` $D(\vec{a})$ when $y = 0$
$$
\max_{G}\{ -(1 -y) \log(1 - D(G(\vec{z})))\} =
\max_{G}\{- \log(1 - D(G(\vec{z})))\}
$$
- **`Discriminator`**: \
We want to minimize its error on both real ($y = 1$) and generated ($y = 0$) data,
which is exactly the loss function above. However, when the `Discriminator` is good at its job,
$D(G(\vec{z})) \approx 0$ and the `Generator`'s objective above stays near 0 (its gradient
vanishes), so in practice the `Generator` uses this non-saturating variant instead
(see the toy check after this list):
$$
\begin{aligned}
\min_{G}\{ (1 -y) \log(1 - D(G(\vec{z})))\} &=
\min_{G}\{\log(1 - D(G(\vec{z})))\} \\
&\approx \min_{G}\{- \log(D(G(\vec{z})))\}
\end{aligned}
$$
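A toy check of the swap above (Python assumed; the derivatives are written out
by hand and the value of $D(G(\vec{z}))$ is an arbitrary example): when the
discriminator confidently rejects a fake, the saturating objective gives the
generator almost no gradient, while the non-saturating one still does.

```python
# d/dD [  log(1 - D) ] = -1 / (1 - D)   (saturating generator objective)
# d/dD [ -log(D)     ] = -1 / D         (non-saturating generator objective)
d_fake = 0.01                        # D(G(z)): discriminator says "almost surely fake"
grad_saturating = -1 / (1 - d_fake)  # ~ -1.01 -> tiny learning signal
grad_non_saturating = -1 / d_fake    # -100.0  -> strong learning signal
print(grad_saturating, grad_non_saturating)
```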
Putting the two objectives together, they are basically playing a `minimax` game:
$$
\min_{D} \max_{G} \{
- E_{x \sim Data} \log(D(\vec{x})) - E_{z \sim Noise} \log(1 - D(G(\vec{z})))
\}
$$
In this case we want the two balanced: the goal is a discriminator output near 0.5 on every sample,
meaning that both networks are performing at their best.
> [!NOTE]
> Performing at their best means that the **`Discriminator` is always able
> to correctly classify real data** and the **`Generator` is always able to
> fool the `Discriminator`.**
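To make the alternating updates concrete, here is a minimal sketch of one
training step (PyTorch assumed; `G`, `D`, their optimizers and `latent_dim`
are hypothetical placeholders, and `D` is assumed to end with a sigmoid so
binary cross-entropy applies directly):

```python
import torch
import torch.nn as nn

def train_step(G, D, opt_G, opt_D, real_batch, latent_dim=64):
    """One alternating GAN update: D minimizes the BCE loss, G tries to fool D."""
    bce = nn.BCELoss()
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)   # y = 1 for real data
    fake_labels = torch.zeros(batch_size, 1)  # y = 0 for generated data

    # Discriminator step: minimize -y log D(x) - (1 - y) log(1 - D(x))
    z = torch.randn(batch_size, latent_dim)
    fake_batch = G(z).detach()                # do not backpropagate into G here
    loss_D = bce(D(real_batch), real_labels) + bce(D(fake_batch), fake_labels)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator step: non-saturating objective, minimize -log D(G(z))
    z = torch.randn(batch_size, latent_dim)
    loss_G = bce(D(G(z)), real_labels)        # pretend the fakes are real
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

    return loss_D.item(), loss_G.item()
```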