## Pseudorandom Numbers
There are several algorithms to generate *random* numbers on computers. These
techniques are meant to provide randomness over inputs for experiments, but
they are still deterministic algorithms.
In fact, they take a seed that in turn generates the whole sequence. Whether
this seed is truly random or not is not the point here. The important
thing is that **for each seed** there exists **only one sequence** of *random*
values for a given algorithm.
$$
\mathrm{random}_{seed = 12}^{(run\ 1)}(k) = \mathrm{random}_{seed = 12}^{(run\ 2)}(k) \quad \forall k
$$
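A minimal check of this property, using Python's built-in `random` module (the seed value is arbitrary):

```python
import random

# Two generators seeded with the same value...
a = random.Random(12)
b = random.Random(12)

seq_a = [a.random() for _ in range(5)]
seq_b = [b.random() for _ in range(5)]

# ...produce exactly the same sequence, on every run.
assert seq_a == seq_b
```
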
> [!CAUTION]
> Never use these generators for cryptography-related tasks. There are
> ways to leak both the seed and the whole sequence, making any encryption
> or signature useless.
>
> Cryptographically secure algorithms need a source of entropy: input that
> is truly random and external to the system (since computers are
> deterministic).
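
For cryptographic tasks one should instead read from the OS entropy pool; in Python, for example, this is what the `secrets` module does:

```python
import secrets

# Backed by the OS entropy source (e.g. /dev/urandom), not by a seeded PRNG.
token = secrets.token_hex(16)  # 32 hex characters, usable as a key or nonce
n = secrets.randbelow(100)     # unbiased integer in [0, 100)
```
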
## Pseudorandom Algorithms
### [Rejection Sampling (aka Acceptance-Rejection Method)](https://en.wikipedia.org/wiki/Rejection_sampling)
Instead of sampling directly from the **complex distribution**, we sample from
a **simpler one**, usually a uniform distribution,
and accept or reject values according to some conditions.
If these conditions are crafted in a specific way, our samples will resemble those of the **complex distribution**.
However, we need to know the real probability distribution before being able
to accept and reject samples.
![rejection sampling image](./pngs/rejection-sampling.png)
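
A sketch of the method in plain Python, for a made-up target density $f(x) = 3x^2$ on $[0, 1]$ (both the density and the envelope constant are assumptions chosen for the example):

```python
import random

def target_pdf(x: float) -> float:
    # A "complex" target density on [0, 1], known in advance: f(x) = 3x^2.
    return 3 * x ** 2

M = 3.0  # envelope constant so that f(x) <= M * g(x), with g = Uniform[0, 1]

def rejection_sample() -> float:
    while True:
        x = random.random()          # sample from the simple distribution
        u = random.random()          # uniform acceptance threshold
        if u <= target_pdf(x) / M:   # accept with probability f(x) / (M * g(x))
            return x

samples = [rejection_sample() for _ in range(10_000)]
```
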
### [Metropolis-Hastings](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)
The idea is of **constructing a stationary Markov Chain** along the possible values we can sample.
Then we take a long journey along the chain and see where we end up; then we sample that point.
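
A minimal sketch in Python, assuming a target density known only up to a constant (here a standard normal shape) and a symmetric random-walk proposal:

```python
import math
import random

def unnormalized_target(x: float) -> float:
    # Target known only up to a normalizing constant.
    return math.exp(-0.5 * x * x)

def metropolis_hastings(steps: int = 50_000, step_size: float = 1.0) -> list:
    x = 0.0      # arbitrary starting point
    chain = []
    for _ in range(steps):
        proposal = x + random.uniform(-step_size, step_size)  # symmetric proposal
        # Accept with probability min(1, p(proposal) / p(x)).
        if random.random() < unnormalized_target(proposal) / unnormalized_target(x):
            x = proposal
        chain.append(x)
    return chain[10_000:]  # discard burn-in: keep the near-stationary part

samples = metropolis_hastings()
```
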
### [Inverse Transform (aka Smirnov Transform)](https://en.wikipedia.org/wiki/Inverse_transform_sampling)
The idea is to sample directly from a **well known distribution** and then return a number from the **complex distribution** such
that its cumulative probability is at least as high as our **simple sample**
$$
u \sim \mathcal{U}[0, 1] \\
\text{take } x \text{ so that } F(x) \geq u \text{ where } \\
F(x) \text{ is the desired cumulative distribution function}
$$
As proof, we have that
$$
F_{X}: \R \rightarrow [0, 1], \qquad F_{X}^{-1}: [0, 1] \rightarrow \R \\
\text{Let's define } Y = F_{X}^{-1}(U) \\
F_{Y}(y) =
P(Y \leq y) =
P(F_{X}^{-1}(U) \leq y) = P(F_{X}(F_{X}^{-1}(U)) \leq F_{X}(y)) =
P(U \leq F_{X}(y)) = F_{X}(y)
$$
In particular, we demonstrated that $X$ and $Y$ have the same distribution,
i.e. they are identically distributed. We also demonstrated that it is possible
to define a random variable with a desired distribution, starting from another.
> [!TIP]
> Another way to view this is by thinking of the uniform sample as the
> probability that we want from our target distribution.
>
> When we find the smallest number that gives the same probability, it's
> like sampling from the real distribution (especially because we discretize).
>
> ![gif of an example](./pngs/wikipedia_inverse_transform_sampling_example.gif)
>
> By [Davidjessop](https://commons.wikimedia.org/w/index.php?title=User:Davidjessop&action=edit&redlink=1) - Own work, [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0), [Link](https://commons.wikimedia.org/w/index.php?curid=100369573)

> [!NOTE]
> The last passage says that a uniform variable is less than a value, that
> is exactly the definition of its CDF: for $U \sim \mathcal{U}[0, 1]$, $P(U \leq u) = u$.
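
For a distribution with an invertible CDF this becomes a one-liner. A sketch for the exponential distribution, where $F(x) = 1 - e^{-\lambda x}$ gives $F^{-1}(u) = -\ln(1 - u)/\lambda$:

```python
import math
import random

def sample_exponential(lam: float) -> float:
    # F(x) = 1 - exp(-lam * x)  =>  F^{-1}(u) = -ln(1 - u) / lam,
    # i.e. the smallest x with F(x) >= u.
    u = random.random()  # u ~ U[0, 1)
    return -math.log(1.0 - u) / lam

samples = [sample_exponential(2.0) for _ in range(10_000)]
print(sum(samples) / len(samples))  # should be close to 1 / lam = 0.5
```
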
## Generative Models
Let's suppose for now that all similar things are ruled by a
hidden probability distribution. This means that similar
things live close together, but we don't know the actual
distribution.
Let's say that we need to generate flower photos. To do so,
we would collect several pictures of them. This allows us
to infer a probability distribution that, while different
from the real one, is still useful for our goal.
Now, to generate flower photos, we only need to sample from
this distribution. However, if none of the above methods are
feasible (which is usually the case), we need to
train a network capable of learning this distribution.
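
As a toy illustration of inferring a distribution from collected examples (made-up 1-D features standing in for flower photos), one can fit a kernel density estimate and sample from it:

```python
import random

# Hypothetical 1-D "flower features"; real data would be images.
collected = [random.gauss(5.0, 1.0) for _ in range(500)]

BANDWIDTH = 0.3  # smoothing parameter of the kernel density estimate

def sample_inferred() -> float:
    # Sampling from a Gaussian KDE: pick a collected point, then jitter it.
    center = random.choice(collected)
    return random.gauss(center, BANDWIDTH)

new_samples = [sample_inferred() for _ in range(10)]
```
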
### Direct Learning
This method consists in comparing the generated
distribution with the real one, e.g. by computing the MMD distance between them.
However, this is usually cumbersome.
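
A sketch of a (biased) MMD estimator with an RBF kernel over 1-D samples, to make the comparison concrete; the kernel and bandwidth are arbitrary choices here:

```python
import math
from itertools import product

def rbf(x: float, y: float, sigma: float = 1.0) -> float:
    # Radial basis function kernel.
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mmd_squared(xs: list, ys: list) -> float:
    # Biased estimator: MMD^2 = E[k(x, x')] - 2 E[k(x, y)] + E[k(y, y')].
    kxx = sum(rbf(a, b) for a, b in product(xs, xs)) / len(xs) ** 2
    kyy = sum(rbf(a, b) for a, b in product(ys, ys)) / len(ys) ** 2
    kxy = sum(rbf(a, b) for a, b in product(xs, ys)) / (len(xs) * len(ys))
    return kxx - 2 * kxy + kyy
```

A value near 0 means the two sample sets are hard to tell apart; this is the quantity that direct learning would drive down.
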
### Indirect Learning
We let a **discriminator** grade the generated content;
we then update our model based on the grade achieved by the **discriminator**.
If we use the **indirect learning** approach, we need a `Network` capable of
classifying generated and genuine content according to their labels.
However, for the same reasons as **generative models**, we don't have such
a `Network` readily available; rather, we have to **learn** it.
The idea is that **the discriminator learns to classify images** (meaning that
it will learn the goal distribution) and **the generator learns to fool
the discriminator** (meaning that it will learn to generate according to
the goal distribution).
## Training Phases
We can train both the **generator and discriminator** together. They will
compete against each other, and the perfect outcome is **when the discriminator
matches tags 50% of the times**.
Assuming that $G(\vec{z})$ is the result of the `generator`, $D(\vec{a})$ is the result of the `discriminator`, $y = 0$ is the generated content and $y = 1$ is the real one:
- **Loss Function**:\
This is the actual loss for the `Discriminator`; however, we use the ones
below to explain the antagonistic goals of both networks. Here $\vec{x}$ may
be either real or generated (a numeric sketch of these losses follows at the end of these notes).
$$
\min_{D} \{
\underbrace{
-y \log{(D(\vec{x}))}
}_{\text{Loss over Real Data}}
\underbrace{
- (1 -y)\log{(1-D(\vec{x}))}
}_{\text{Loss over Generated Data}}
\}
$$
> [!NOTE]
> Notice that our generator can only control the loss over generated data.
> This will be its objective.
- **`Generator`**:\
We want to maximize the error of the `Discriminator` $D(\vec{a})$ when $y = 0$
$$
\max_{G}\{- \log(1 - D(G(\vec{z})))\}
$$
- **`Discriminator`**: \
We want to minimize its error when $y = 1$. However, if the `Generator` is
good at its job, $D(G(\vec{z})) \approx 1$, so $1 - D(G(\vec{z}))$ in the equation above
will be near 0 and we use this instead:
$$
\min_{D} \max_{G} \{
- E_{x \sim Data} \log(D(\vec{x})) - E_{z \sim Noise} \log(1 - D(G(\vec{z})))
\}
$$
In this case we want both scores balanced, so a value near 0.5 is our goal,
meaning that both are performing at their best.
> [!NOTE]
> Performing at their best means that the **`Discriminator` is always able
> to correctly classify real data** and the **`Generator` is always able to
> fool the `Discriminator`.**
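
As a numeric sketch of the losses above (scalar discriminator outputs assumed; function names are just for illustration):

```python
import math

def d_loss(d_real: float, d_fake: float) -> float:
    # Discriminator: minimize -log D(x) - log(1 - D(G(z))).
    return -math.log(d_real) - math.log(1.0 - d_fake)

def g_score(d_fake: float) -> float:
    # Generator: maximize -log(1 - D(G(z))) by pushing D(G(z)) toward 1.
    return -math.log(1.0 - d_fake)

# At the balance point the discriminator outputs ~0.5 on everything:
print(d_loss(0.5, 0.5))  # 2 * ln(2) ~ 1.386, the equilibrium value of the game
print(g_score(0.5))      # ln(2) ~ 0.693
```
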