Revised notes

2025-11-01 17:02:58 +01:00
parent e17d067c24
commit 586b295ed8
1 changed files with 103 additions and 28 deletions
--- a/Chapters/11-GANs/INDEX.md
+++ b/Chapters/11-GANs/INDEX.md
@@ -2,28 +2,58 @@

 ## Pseudorandom Numbers

-### Rejection Sampling
+There are several alagorithms to generate *random* numbers on computers. These
+techniques are to have randomness over inputs to make experiments, but
+are still deterministic algorithms.

-Instead of sampling directly over the **complex distribution**, we sample over a **simpler one**
+In fact, they take a seed that in turn generates the whole sequence. Now, if
+this seed is either truly random or not is not the goal. However the important
+thing is that **for each seed** there exists **only one sequence** of *random*
+values for an algorithm.
+
+$$
+random_{seed = 12}(k) = random_{seed = 12}(k)
+$$
+
+> [!CAUTION]
+> Never use these generators for crypthographical related tasks. There are
+> ways to leak both the seed and the whole sequence, making any encryption
+> or signature useless.
+>
+> Cryptographically secure algorithms need a source of entropy, input that
+> is truly random, that is external to the system (since computers are
+> deterministic)
+
+## Pseudorandom Algorithms
+
+### [Rejection Sampling (aka Acceptance-Rejection Method)](https://en.wikipedia.org/wiki/Rejection_sampling)
+
+Instead of sampling directly over the **complex distribution**, we sample over
+a **simpler one**, usually a uniform distribution,
 and accept or reject values according to some conditions.

 If these conditions are crafted in a specific way, our samples will resemble those of the **complex distribution**

-### Metropolis-Hasting
+However we need to know the real probability distribution before being able
+to accept and reject samples.
+
+![rejection sampling image](./pngs/rejection-sampling.png)
+
+### [Metropolis-Hasting](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)

 The idea is of **constructing a stationary Markov Chain** along the possible values we can sample.

-Then we take a long path and see where we end up and sample that point
+Then we take a long journey and see where we end up. then we sample that point

-### Inverse Transform
+### [Inverse Transform (aka Smirnov Transform)](https://en.wikipedia.org/wiki/Inverse_transform_sampling)

 The idea is of sampling directly from a **well known distribution** and then return a number from the **complex distribution** such
 that its probability function is higher than our **simple sample**

 $$
 u \in \mathcal{U}[0, 1] \\
-\text{take } x \text{ so that } u \leq F(x) \text{ where } \\
-F(x) \text{ is the the cumulative distribution funcion of } x
+\text{take } x \text{ so that }  F(x) \geq u \text{ where } \\
+F(x) \text{ is the the desired cumulative distribution function }
 $$

 As proof we have that
@@ -33,13 +63,24 @@ F_{X}(x) \in [0, 1] = X \rightarrow F_{X}^{-1}(X) = x \in \R \\
 \text{Let's define } Y = F_{X}^{-1}(U) \\
 F_{Y}(y) =
    P(Y \leq y) =
-    P(F_{X}^{-1}(U) \leq y) \rightarrow \\
-\rightarrow P(F_{X}(F_{X}^{-1}(U)) \leq F_{X}(y)) =
+    P(F_{X}^{-1}(U) \leq y) = P(F_{X}(F_{X}^{-1}(U)) \leq F_{X}(y)) =
    P(U \leq F_{X}(y)) = F_{X}(y)
 $$

-In particular, we demonstrated that by crafting a variable such as that,
-makes it possible to sample $X$ and $Y$ by sampling $U$
+In particular, we demonstrated that $X$ and $Y$ have the same distribution,
+so they are the same random variable. We also demonstrated that it is possible
+to define a random variable of a desired distribution, starting from another.
+
+> [!TIP]
+> Another way to view this is by thinking of the uniform sample as the
+> probability that we want from our target distribution.
+>
+> When we find the smallest number that gives the same probability, it's
+> like sampling from the real distribution (especially because we discretize).
+>
+> ![gif of an example](./pngs/wikipedia_inverse_transform_sampling_example.gif)
+>
+> By <a href="//commons.wikimedia.org/w/index.php?title=User:Davidjessop&amp;action=edit&amp;redlink=1" class="new" title="User:Davidjessop (page does not exist)">Davidjessop</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=100369573">Link</a>

 > [!NOTE]
 > The last passage says that a uniform variable is less than a value, that
@@ -51,22 +92,26 @@ makes it possible to sample $X$ and $Y$ by sampling $U$

 ## Generative Models

-The idea is that, given a $n \times n$ vector $N$, made of stacked pixels,
-not all vectors will be a dog photo, so we must find the probability
-distribution associated with dog images
+Let's suppose for now that all similar things are ruled by a
+hidden probability distribution. This means that similar
+things live close together, but we don't know the actual
+distribution.

-However we have little to no information about the **actual distribution**
-of dog images.
+Let's say that we need to generate flower photos. To do so,
+we would collect several pictures of them. This allows us
+to infer a probability distribution that, while being different
+from the real one, it is still useful for our goal.

-Our solution is to **sample from a simple distribution, transform** it
-into our target distribution and then **compare with** a **sampled subset** of
-the **complex distribution**
-
-We then train our network to make better transformations
+Now, to generate flower photos, we only need to sample from
+this distribution. However if none of the above methods are
+feasible (which is usually the case), we need to
+train a network capable of learning this distribution.

 ### Direct Learning

-We see the MMD distance between our generated set and the real one
+This method consisits in comparing the generated
+distribution with the real one, e.g. see the MMD distance between them.
+ However this is usually cumbersome.

 ### Indirect Learning

@@ -80,9 +125,14 @@ We then update our model based on the grade achieved by the **discriminator**
 If we use the **indirect learning** approach, we need a `Network` capable of
 classificating generated and genuine content according to their labels.

-However, for the same reasons of the **generative models**, we don't have such
+However, for the same reasons of **generative models**, we don't have such
 a `Network` readily available, but rather we have to **learn** it.

+The idea is that **the discriminator learns to classify images** (meaning that
+it will learn the goal distribution) and **the generator learns to fool
+the discriminator** (meaning that it will learn to generate according to
+the goal distribution)
+
 ## Training Phases

 We can train both the **generator an discriminator** together. They will
@@ -99,6 +149,26 @@ perfect outcome is **when the discriminator matches tags 50% of times**

 Assuming that $G(\vec{z})$ is the result of the `generator`, $D(\vec{a})$ is the result of the `discriminator`, $y = 0$ is the generated content and $y = 1$ is the real one

+- **Loss Function**:\
+    This is the actual loss for the `Discriminator`, however we use the ones
+    below to explain the antagonist goal of both networks. Here $\vec{x}$ may
+    be both.
+
+$$
+\min_{D} \{
+    \underbrace{
+        -y \log{(D(\vec{x}))}
+    }_{\text{Loss over Real Data}}
+    \underbrace{
+        - (1 -y)\log{(1-D(\vec{x}))}
+    }_{\text{Loss over Generated Data}}
+\}
+$$
+
+> [!NOTE]
+> Notice that our generator can only control the loss over generated data.
+> This will be its objective.
+
 - **`Generator`**:\
 We want to maximize the error of the `Discriminator` $D(\vec{a})$ when $y = 0$

@@ -107,12 +177,9 @@ $$
    \max_{G}\{- \log(1 - D(G(\vec{z})))\}
 $$

-
-
-
 - **`Discriminator`**: \
 We want to minimize its error when $y = 1$. However, if the `Generator` is good at its job
-$D(G(\vec{z})) \approx 1$ the above equation
+$D(G(\vec{z})) \approx 1$ the equation above
 will be near 0, so we use these instead:

 $$
@@ -129,4 +196,12 @@ $$
 \min_{D} \max_{G} \{
 - E_{x \sim Data} \log(D(\vec{x})) - E_{z \sim Noise} \log(1 - D(G(\vec{z})))
 \}
-$$
+$$
+
+In this case we want both scores balanced, so a value near 0.5 is our goal,
+meaning that both are performing at their best.
+
+> [!NOTE]
+> Performing at their best means that the **`Discriminator` is always able
+> to classify correctly real data** and the **`Generator` is always able to
+> fool the `Discriminator`.**