Appendix A
Entropy
The entropy of a random variable measures the "surprise", or "informativeness", of learning its outcome.
You can think of it as asking: "How much can I learn from observing something obvious?"
For example, you would be unsurprised to see that an apple released in mid-air falls. If it were to remain suspended, however, that would be mind-boggling!
Entropy captures this same intuition numerically: the higher its value, the more surprising (on average) the outcomes. Its formula is:
H(\mathcal{X}) \coloneqq - \sum_{x \in \mathcal{X}} p(x) \log p(x)
Note
Technically speaking, another interpretation is the average number of bits needed to encode an outcome of the random variable, but in that case we use
\log_2
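As a quick numerical check, here is a minimal sketch (assuming NumPy is available; the function name entropy is just illustrative) that computes the entropy in bits of a fair and of a biased coin. The fair coin is maximally unpredictable and reaches 1 bit, while the biased one is lower.

import numpy as np

def entropy(p, base=2):
    """Entropy H(X) = -sum p(x) log p(x); base 2 gives bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))   # fair coin   -> 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin -> ~0.47 bits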
Kullback-Leibler Divergence
This value measures how much an estimated distribution q
differs from the true one p:
D_{KL}(p || q) = \sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
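As an illustrative sketch (again assuming NumPy; the function name kl_divergence is not from the text), the divergence is zero when q matches p exactly and grows as q drifts away from p.

import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum p(x) log(p(x) / q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with p(x) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.5]
print(kl_divergence(p, [0.5, 0.5]))    # 0.0   (identical distributions)
print(kl_divergence(p, [0.9, 0.1]))    # ~0.51 (q is a poor estimate of p)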
Cross Entropy Loss derivation
Cross entropy is the average "surprise"
we experience when events drawn from the true distribution p are
modeled with the estimated distribution q. It is defined as the entropy of p plus the
Kullback-Leibler Divergence between p and q:
\begin{aligned}
H(p, q) &= H(p) + D_{KL}(p || q) =\\
&= - \sum_{x\in\mathcal{X}}p(x)\log p(x) +
\sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \\
&= \sum_{x\in \mathcal{X}} p(x) \left(
\log \frac{p(x)}{q(x)} - \log p(x)
\right) = \\
&= \sum_{x\in \mathcal{X}} p(x) \log \frac{1}{q(x)} = \\
&= - \sum_{x\in \mathcal{X}} p(x) \log q(x)
\end{aligned}
In deep learning the target p is usually not a full distribution but a one-hot vector (a single true class c for each sample n), so the sum collapses to a single term:
l_n = - \log \hat{y}_{n,c} \\
\hat{y}_{n,c} \coloneqq \text{predicted probability of the true class } c \text{ for sample } n
Usually \hat{y} comes from a
softmax. Moreover, since the loss takes a
logarithm and probabilities are at most 1, the closer the predicted probability of the true class is to 0, the higher the loss.
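The following minimal sketch (assuming NumPy; the names softmax and cross_entropy are only illustrative) turns raw scores into probabilities and applies the loss above. Note how a prediction that puts little probability on the true class is punished much harder than a confident correct one.

import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities (numerically stabilized)."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, true_class):
    """l = -log(softmax(logits)[true_class]), as derived above."""
    return -np.log(softmax(logits)[true_class])

logits = np.array([2.0, 0.5, -1.0])
print(cross_entropy(logits, 0))   # ~0.24  (true class got most of the probability)
print(cross_entropy(logits, 2))   # ~3.24  (true class got almost no probability)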
Computing PCA
Caution
X here is the dataset matrix, with features over rows.
\begin{aligned}
\Sigma &= \frac{X \times X^T}{N} && \text{correlation matrix approximation} \\
\vec{\lambda} &\coloneqq \text{vector of eigenvalues of } \Sigma \\
\Lambda &\coloneqq \text{matrix with the eigenvectors of } \Sigma \text{ as columns, sorted by eigenvalue} \\
\Lambda_{red} &\coloneqq \Lambda \text{ reduced to the } k \text{ highest-eigenvalue columns} \\
Z &= X \times \Lambda_{red}^T && \text{compressed representation}
\end{aligned}
Note
You may have studied PCA in terms of the SVD (Singular Value Decomposition). The two are closely related and capture the same idea through different mathematical formulations.
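Below is a minimal NumPy sketch of the recipe above, under the document's convention that X stores features over rows (shape d x N); with that shape the projection onto the top-k eigenvectors is written as \Lambda_{red}^T X, which gives the k-dimensional compressed representation Z. The variable names mirror the symbols above and are only illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, N, k = 5, 200, 2
X = rng.normal(size=(d, N))                # data matrix: features over rows
X = X - X.mean(axis=1, keepdims=True)      # center each feature (standard practice)

Sigma = (X @ X.T) / N                      # correlation matrix approximation (d x d)
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh: Sigma is symmetric

order = np.argsort(eigvals)[::-1]          # sort eigenvectors by decreasing eigenvalue
Lambda = eigvecs[:, order]
Lambda_red = Lambda[:, :k]                 # keep the k highest-eigenvalue directions (d x k)

Z = Lambda_red.T @ X                       # compressed representation (k x N)
print(Z.shape)                             # (2, 200)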
Laplace Operator
It is defined as \nabla \cdot \nabla f \in \R, i.e. the
divergence of the gradient of f. Intuitively, it tells us how
strongly a point behaves like a local maximum or minimum.
Negative values mean that we are around a local maximum, positive values around a local minimum, and the larger the magnitude, the more pronounced that maximum (or minimum) is.
Another way to see this: as the divergence of the gradient field, it tells us whether the point acts as a point of attraction (a sink) or of repulsion (a source).
It can also be used to compute the net flow of particles in that region of space.
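As a small worked example (a sanity check of the sign convention above), take the paraboloid f(x, y) = x^2 + y^2, which has a local (in fact global) minimum at the origin:
\nabla f = \begin{bmatrix} 2x & 2y \end{bmatrix}, \qquad
\nabla \cdot \nabla f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = 2 + 2 = 4 > 0
The Laplacian is positive, matching the minimum; for f(x, y) = -(x^2 + y^2) the sign flips and the origin is a maximum.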
Caution
This is not the discrete Laplace operator, which is instead represented as a matrix; several other formulations of the operator exist.
Hessian Matrix
A Hessian matrix collects the second derivatives of a function, thus describing its local curvature.
It also tells us whether a critical point is a local minimum (the Hessian is positive definite), a local maximum (negative definite) or a saddle point (indefinite, i.e. neither positive nor negative definite).
It is computed by taking the partial derivatives of the gradient along all dimensions and transposing the result (for smooth functions the Hessian is symmetric, so the transpose does not change it).
\nabla f = \begin{bmatrix}
\frac{\partial f}{\partial x} & \frac{\partial f}{\partial y}
\end{bmatrix} \\
H(f) = \begin{bmatrix}
\frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \, \partial y} \\
\frac{\partial^2 f}{\partial y \, \partial x} & \frac{\partial^2 f}{\partial y^2}
\end{bmatrix}
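As a short worked example, the function f(x, y) = x^2 - y^2 has a critical point at the origin; its Hessian there is:
H(f) = \begin{bmatrix}
2 & 0 \\
0 & -2
\end{bmatrix}
Its eigenvalues are 2 and -2, so the matrix is neither positive nor negative definite: the origin is a saddle point (curving up along x and down along y). Note also that its trace, 2 - 2 = 0, is exactly the Laplacian from the previous section.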
Flow
A flow over a set A is a mapping that takes an element of A and a real number (think of it as a time) and returns another element of A:
a \in A, t \in \R \\
\varphi(a, t) \in A
Moreover, since \varphi(a, t) \in A, the following properties also hold:
a \in A, t \in \R, s \in \R \\
\begin{aligned}
\varphi(\varphi(a, t), s) &= \varphi(a, t + s) \in A \\
\varphi(a, 0) &= a
\end{aligned}
In other words, flowing the result of a flow for an additional time s is the same as flowing the original element for the summed time t + s (think of summing times).
Also, 0 is the neutral element of a flow: flowing for no time leaves the element unchanged.
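A concrete example of a flow on A = \R (not from the text, but a standard one) is exponential growth, \varphi(a, t) = a e^t; both properties are easy to verify:
\varphi(\varphi(a, t), s) = (a e^t) e^s = a e^{t + s} = \varphi(a, t + s) \\
\varphi(a, 0) = a e^0 = a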
Vector Field
It is a mapping from a set A \subset \R^n such that:
V: A \rightarrow \R^n
This means that to each element of A, which we can regard as a point, it
associates a vector, which we can regard as a velocity (although it is also just a point of \R^n).
So, in a way, it can be seen as the amount (and direction) of movement of that point in space.
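To tie this to the previous section (a standard connection, not spelled out above): a vector field generates a flow by prescribing the velocity of every point, i.e. the flow solves \frac{d}{dt}\varphi(a, t) = V(\varphi(a, t)) with \varphi(a, 0) = a. For instance, the vector field V(x) = x on \R generates exactly the flow \varphi(a, t) = a e^t from the example above:
\frac{d}{dt}\varphi(a, t) = \frac{d}{dt}\left( a e^t \right) = a e^t = V(\varphi(a, t))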
Change of Variables in probability
Let's consider two random variables X and Y, where X has CDF F_X,
Y = g(X), and g is monotonically increasing:
\begin{aligned}
P(Y \leq y) &= P(g(X) \leq y) = P(g^{-1}(g(X)) \leq g^{-1}(y)) = \\
&= P(X \leq x) \rightarrow \\
\rightarrow F_Y(y) &= F_X(x) = F_X(g^{-1}(y))
\end{aligned}
Now, let's differentiate both sides of the equation with respect to y:
f_Y(y) = f_X(g^{-1}(y)) \cdot \frac{d\, g^{-1}(y)}{d \, y}
(For a monotonically decreasing g, the same argument goes through with the derivative wrapped in an absolute value.)
Note
In case x and y live in higher dimensions, the last term becomes the absolute value of the determinant of the Jacobian matrix of g^{-1} (often just called the Jacobian).
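As a quick worked example of the formula (with a monotonically increasing g), let X be uniform on (0, 1), so f_X(x) = 1, and take Y = g(X) = X^2, whose inverse is g^{-1}(y) = \sqrt{y}:
f_Y(y) = f_X(\sqrt{y}) \cdot \frac{d \, \sqrt{y}}{d \, y} = 1 \cdot \frac{1}{2\sqrt{y}} = \frac{1}{2\sqrt{y}}, \qquad y \in (0, 1)
The resulting density integrates to 1 over (0, 1), as it should, and piles up near 0, where squaring compresses the values.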