# Appendix A

## Entropy[^wiki-entropy]

The entropy of a random variable gives us the *"surprise"* or *"informativeness"* of
knowing the result.

You can visualize it like this: ***"What can I learn from getting to know something
obvious?"***

As an example, you would be unsurprised to learn that if you let go of an apple in
mid-air it falls. However, if it were to remain suspended, that would be mind-boggling!

Entropy captures this same sentiment from the actual probability values: the less
likely an outcome, the more surprising it is, and entropy is the average surprise
over all outcomes. Its formula is:

$$
H(\mathcal{X}) \coloneqq - \sum_{x \in \mathcal{X}} p(x) \log p(x)
$$

> [!NOTE]
> Technically speaking, another interpretation is the number of bits needed to
> represent a random event happening, but in that case we use $\log_2$
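
As a quick illustration (a minimal sketch, assuming `numpy` is available and the coin
probabilities are made up), the helper below computes this sum for a discrete
distribution and shows that a fair coin carries more average surprise than a heavily
biased one:

```python
import numpy as np

def entropy(p, base: float = np.e) -> float:
    """Entropy of a discrete distribution p (terms with p(x) = 0 contribute 0)."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return float(-np.sum(p[nonzero] * np.log(p[nonzero])) / np.log(base))

# A fair coin is maximally surprising for 2 outcomes: 1 bit.
print(entropy([0.5, 0.5], base=2))    # 1.0
# A heavily biased coin is almost predictable: close to 0 bits.
print(entropy([0.99, 0.01], base=2))  # ~0.08
```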
## Kullback-Leibler Divergence

This value measures how much an approximating distribution $q$ differs from the
true one $p$:

$$
D_{KL}(p || q) = \sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
$$
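
A minimal numerical sketch (again assuming `numpy`; the example distributions are
made up): the divergence is zero when the two distributions coincide and positive
otherwise.

```python
import numpy as np

def kl_divergence(p, q) -> float:
    """D_KL(p || q) for discrete distributions; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nonzero = p > 0
    return float(np.sum(p[nonzero] * np.log(p[nonzero] / q[nonzero])))

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # estimate
print(kl_divergence(p, p))  # 0.0: identical distributions do not diverge
print(kl_divergence(p, q))  # > 0: the worse the estimate, the larger the value
```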
## Cross Entropy Loss derivation

Cross entropy[^wiki-cross-entropy] is the average *"surprise"*
we get when outcomes are drawn from the true distribution $p$
but we measure the surprise using the estimated distribution $q$.
It is defined as the entropy of $p$ plus the
[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$:

$$
\begin{aligned}
H(p, q) &= H(p) + D_{KL}(p || q) = \\
&= - \sum_{x\in\mathcal{X}}p(x)\log p(x) +
\sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \\
&= \sum_{x\in \mathcal{X}} p(x) \left(
\log \frac{p(x)}{q(x)} - \log p(x)
\right) = \\
&= \sum_{x\in \mathcal{X}} p(x) \log \frac{1}{q(x)} = \\
&= - \sum_{x\in \mathcal{X}} p(x) \log q(x)
\end{aligned}
$$
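
The identity can be checked numerically; the short sketch below (assuming `numpy`,
with made-up distributions) verifies it for a small example pair:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # true distribution
q = np.array([0.5, 0.3, 0.2])  # estimate

entropy_p = -np.sum(p * np.log(p))
kl_pq = np.sum(p * np.log(p / q))
cross_entropy = -np.sum(p * np.log(q))

# The three quantities satisfy H(p, q) = H(p) + D_KL(p || q).
print(np.isclose(cross_entropy, entropy_p + kl_pq))  # True
```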
In deep learning we usually don't work with full distributions: the target $p$ is a
one-hot distribution that puts probability 1 on the true class $c$, so the sum
collapses to a single term:

$$
l_n = - \log \hat{y}_{n,c} \\
\hat{y}_{n,c} \coloneqq \text{predicted probability of class } c \text{ for sample } n
$$

Usually $\hat{y}$ comes from using a
[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since it uses a
logarithm and probability values are at most 1, the closer $\hat{y}_{n,c}$ is to 0,
the higher the loss.
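
As a concrete sketch (assuming `numpy`; the logits and the class index below are
made up for illustration), this is the per-sample loss computed from raw scores
through a softmax:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: shift by the max before exponentiating."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 0.5, -1.0])  # raw network outputs for 3 classes (made up)
true_class = 0

y_hat = softmax(logits)
loss = -np.log(y_hat[true_class])

print(y_hat)  # predicted class probabilities
print(loss)   # small when y_hat[true_class] is close to 1, large when close to 0
```

In practice, frameworks fuse the softmax and the logarithm for numerical stability,
but the value computed is the same as in this sketch.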
## Computing PCA[^wiki-pca]

> [!CAUTION]
> $X$ here is the dataset matrix with **<ins>features over rows</ins>**

- $\Sigma = \frac{X \times X^T}{N} \coloneqq$ covariance matrix estimate
  (assuming centered features)
- $\vec{\lambda} \coloneqq$ vector of eigenvalues of $\Sigma$
- $\Lambda \coloneqq$ columnar matrix of eigenvectors, sorted by eigenvalue
- $\Lambda_{red} \coloneqq$ eigenvector matrix reduced to the columns of the $k$
  highest eigenvalues
- $Z = \Lambda_{red}^T \times X \coloneqq$ compressed representation

> [!NOTE]
> You may have studied PCA in terms of SVD, Singular Value Decomposition. The two
> are closely related and capture the same concept through different
> mathematical computations.
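
A minimal sketch of these steps (assuming `numpy`; the random data and the choice
$k = 2$ are made up, and the features are centered first so that $X X^T / N$ really
is a covariance estimate):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))          # 5 features over rows, 100 samples
X = X - X.mean(axis=1, keepdims=True)  # center each feature

N = X.shape[1]
sigma = (X @ X.T) / N                  # covariance estimate (5 x 5)

eigvals, eigvecs = np.linalg.eigh(sigma)  # eigh: symmetric matrix, ascending order
order = np.argsort(eigvals)[::-1]         # re-sort by decreasing eigenvalue
lam = eigvecs[:, order]                   # columnar eigenvector matrix

k = 2
lam_red = lam[:, :k]                   # keep the k leading eigenvectors
Z = lam_red.T @ X                      # compressed representation (k x 100)
print(Z.shape)                         # (2, 100)
```

`np.linalg.eigh` returns eigenvalues in ascending order, hence the explicit re-sort
before truncating to the leading $k$ components.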
## Laplace Operator[^khan-1]

It is defined as $\nabla \cdot \nabla f \in \R$, i.e. the
**divergence of the gradient of the function**. Intuitively it gives us the
**magnitude of a local maximum or minimum**.

Negative values mean that we are around a local maximum, positive values around a
local minimum. The higher the magnitude, the more pronounced that maximum (or
minimum) is.

Another way to see this is as the divergence of the gradient field, which tells us
whether that point acts as a point of attraction (a sink) or of divergence (a source).

It can also be used to compute the net flow of particles in that region of space.

> [!CAUTION]
> This is not the **discrete Laplace operator**, which is instead a **matrix**;
> there are many other formulations.
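
A quick symbolic check (a sketch assuming `sympy`; the two functions are made up):
the paraboloid $f(x, y) = x^2 + y^2$ has a local minimum at the origin and a positive
Laplacian, while its negation has a local maximum and a negative one.

```python
import sympy as sp

x, y = sp.symbols("x y")
f = x**2 + y**2                      # bowl shape: local minimum at the origin

laplacian = sp.diff(f, x, 2) + sp.diff(f, y, 2)
print(laplacian)                     # 4  -> positive, consistent with a minimum

g = -(x**2 + y**2)                   # dome shape: local maximum at the origin
print(sp.diff(g, x, 2) + sp.diff(g, y, 2))  # -4 -> negative, consistent with a maximum
```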
## [Hessian Matrix](https://en.wikipedia.org/wiki/Hessian_matrix)

A Hessian matrix represents the 2nd derivative of a function, thus it gives
us the curvature of the function.

It is also used to tell us whether a critical point is a local minimum (the Hessian
is positive definite), a local maximum (negative definite) or a saddle point
(neither positive nor negative definite).

It is computed by taking the partial derivatives of the gradient along
all dimensions and then transposing the result (for smooth functions the Hessian is
symmetric, so the transpose does not change it).

$$
\nabla f = \begin{bmatrix}
\frac{\partial f}{\partial x} & \frac{\partial f}{\partial y}
\end{bmatrix} \\
H(f) = \begin{bmatrix}
\frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \, \partial y} \\
\frac{\partial^2 f}{\partial y \, \partial x} & \frac{\partial^2 f}{\partial y^2}
\end{bmatrix}
$$
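
A symbolic sketch (assuming `sympy`; the function is made up for illustration) that
builds the Hessian and classifies the critical point at the origin from the signs of
its eigenvalues:

```python
import sympy as sp

x, y = sp.symbols("x y")
f = x**2 - y**2                         # classic saddle at the origin

H = sp.hessian(f, (x, y))               # matrix of second partial derivatives
print(H)                                # Matrix([[2, 0], [0, -2]])

eigenvalues = list(H.eigenvals().keys())
print(eigenvalues)                      # [2, -2]: mixed signs -> saddle point
```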
## [Flow](https://en.wikipedia.org/wiki/Flow_(mathematics))[^wiki-flow]

A flow over a set $A$ is a mapping from $A \times \R$ to $A$:

$$
a \in A, t \in \R \\
\varphi(a, t) \in A
$$

Moreover, since $\varphi(a, t) \in A$, the following properties hold:

$$
a \in A, t \in \R, s \in \R \\
\begin{aligned}
\varphi(\varphi(a, t), s) &= \varphi(a, t + s) \in A \\
\varphi(a, 0) &= a
\end{aligned}
$$

In other words, applying a flow to the result of a flow is the same as applying a
single flow over the sum of the two real parameters (think of summing times).

Also, 0 is the neutral element of a flow.
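
As an illustration (a sketch; the exponential map below is just one example of a
flow, namely the flow of the ODE $\dot{x} = x$), the two properties can be checked
numerically:

```python
import math

def flow(a: float, t: float) -> float:
    """Flow of the ODE dx/dt = x: phi(a, t) = a * e^t."""
    return a * math.exp(t)

a, t, s = 1.5, 0.7, -0.2
print(math.isclose(flow(flow(a, t), s), flow(a, t + s)))  # True: phi(phi(a,t),s) = phi(a,t+s)
print(flow(a, 0.0) == a)                                  # True: t = 0 is the identity
```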
## [Vector Field](https://en.wikipedia.org/wiki/Vector_field)[^wiki-vector-field]

It is a mapping from a set $A \subset \R^n$ such that:

$$
V: A \rightarrow \R^n
$$

This means that to each element of $A$, which we can consider a point, it
associates another vector, which we may consider a velocity (but also a point).

So, in a way, it can be seen as the amount of movement of each point in space.
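
A tiny sketch (assuming `numpy`; the rotational field below is a made-up example)
that evaluates a vector field at a few points:

```python
import numpy as np

def V(p: np.ndarray) -> np.ndarray:
    """Vector field on R^2 that rotates points around the origin: V(x, y) = (-y, x)."""
    x, y = p
    return np.array([-y, x])

for point in [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]:
    print(point, "->", V(point))  # the associated "velocity" at each point
```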
## Change of Variables in probability[^stack-change-var]

Let's take two random variables, $X$ and $Y$, where $X$ has CDF $F_X$,
$Y = g(X)$, and $g$ is monotonically increasing:

$$
\begin{aligned}
P(Y \leq y) &= P(g(X) \leq y) = P(g^{-1}(g(X)) \leq g^{-1}(y)) = \\
&= P(X \leq x) \rightarrow \\
\rightarrow F_Y(y) &= F_X(x) = F_X(g^{-1}(y))
\end{aligned}
$$

Now, let's differentiate both sides of the equation with respect to $y$:

$$
f_Y(y) = f_X(g^{-1}(y)) \cdot \frac{d\, g^{-1}(y)}{d \, y}
$$

For a monotonically decreasing $g$ the inequality flips, and in general the last
factor appears in absolute value.
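
A numerical sanity check (a sketch using only the standard library; the choice
$X \sim \mathcal{N}(0, 1)$ with $g(x) = e^x$ is made up for illustration): the
density given by the formula should match a finite-difference derivative of
$F_Y(y) = F_X(\ln y)$.

```python
import math

def norm_pdf(x: float) -> float:
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def norm_cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Y = g(X) = exp(X) with X ~ N(0, 1), so g^{-1}(y) = ln(y) and d g^{-1}/dy = 1/y.
def f_Y_formula(y: float) -> float:
    return norm_pdf(math.log(y)) * (1 / y)

def f_Y_numeric(y: float, h: float = 1e-5) -> float:
    F = lambda t: norm_cdf(math.log(t))      # F_Y(y) = F_X(g^{-1}(y))
    return (F(y + h) - F(y - h)) / (2 * h)   # finite-difference derivative of the CDF

y = 1.7
print(f_Y_formula(y), f_Y_numeric(y))  # the two values agree to several decimals
```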
> [!NOTE]
> In case $x$ and $y$ are in higher dimensions, the last term is the absolute value
> of the determinant of the Jacobian matrix of $g^{-1}$ (often just called the Jacobian)

[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)

[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)

[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))

[^wiki-pca]: [Wikipedia | Principal Component Analysis | 18th November 2025](https://en.wikipedia.org/wiki/Principal_component_analysis#Computation_using_the_covariance_method)

[^wiki-flow]: [Wikipedia | Flow (Mathematics) | 23rd November 2025](https://en.wikipedia.org/wiki/Flow_(mathematics))

[^wiki-vector-field]: [Wikipedia | Vector Field | 23rd November 2025](https://en.wikipedia.org/wiki/Vector_field)

[^stack-change-var]: [StackExchange | Derivation of change of variables of a probability density function? | 25th November 2025](https://stats.stackexchange.com/questions/239588/derivation-of-change-of-variables-of-a-probability-density-function)