Revised notes and added citations
parent 434e4cdd0e
commit e07a80649a

@@ -2,30 +2,80 @@

## Linear Classifiers and Limitations

A `Linear Classifier` can be defined as a **hyperplane** with the following expression:

$$
\sum_{i=1}^{N} w_{i}x_{i} + b = 0
$$

<!-- TODO: Add images -->

So, for each point in the hyperspace the output is

$$
o = sign \left( \sum_{i=1}^{N} w_{i}x_{i} + b \right)
$$

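For concreteness, here is a minimal NumPy sketch of this decision rule; the weights, bias and test point below are made-up values, not taken from the notes.

```python
import numpy as np

# Hypothetical weights, bias and point, only to illustrate o = sign(w . x + b).
w = np.array([0.5, -1.0, 2.0])   # one weight per feature (N = 3)
b = 0.1                          # bias term
x = np.array([1.0, 0.5, -0.2])   # a point in the N-dimensional space

o = np.sign(w @ x + b)           # +1 on one side of the hyperplane, -1 on the other
print(o)                         # -1.0 for these values
```
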
This quickly gives us a separation of points into 2 classes. However,
is it always possible to linearly separate $P$ `points` in $N$ `dimensions`?

According to Cover's Theorem, ***"The probability that a dichotomy over $P$
points in $N$ dimensions is linearly separable goes to zero as $P$ gets
larger than $N$"***[^cover].

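To see why, a small sketch (not from the notes) based on Cover's function counting result: for $P$ points in general position in $N$ dimensions, $C(P, N) = 2\sum_{k=0}^{N-1}\binom{P-1}{k}$ of the $2^P$ possible dichotomies are linearly separable, and that fraction collapses once $P$ grows past $N$.

```python
from math import comb

def separable_fraction(P: int, N: int) -> float:
    """Fraction of the 2**P dichotomies of P points (in general position)
    in N dimensions that are linearly separable (Cover's counting formula)."""
    C = 2 * sum(comb(P - 1, k) for k in range(N))
    return C / 2 ** P

for P in (5, 10, 20, 40):
    print(P, separable_fraction(P, N=5))   # 1.0, 0.5, then rapidly towards 0
```
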
## Ideas to extract features generically

### Polynomial Classifier[^polinomial-sklearn]

We could try to extract features by combining all the input features;
however, this scales poorly in higher dimensions.

Let's say that we choose a degree $d$ and the number of dimensions of
point $\vec{x}$ is $N$: we would have $N^d$ computations to do.

However, most of the time we will have $d = 2$, so as long as we don't
exaggerate with $d$ it is still feasible.

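A quick sketch with scikit-learn's `PolynomialFeatures` (the class cited in the footnote) showing how fast the number of generated features grows with the degree; the $N = 10$ input size is an arbitrary choice.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(1, 10)                      # a single point with N = 10 features

for d in (2, 3, 5):
    n_out = PolynomialFeatures(degree=d).fit_transform(X).shape[1]
    print(f"degree {d}: {n_out} polynomial features")   # 66, 286, 3003
```
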
### Space Tiling

It is a family of functions that preserve as much information as possible in
images while reducing their dimension.

### Random Projections

It consists of lowering the dimensionality of the data by using a randomly
chosen matrix. The idea is that points in a high-dimensional space may be
projected into a low-dimensional space while preserving most of their
pairwise distances.

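A minimal NumPy sketch of the idea; the dimensions and the Gaussian random matrix below are assumptions for illustration, not prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_points, high_dim, low_dim = 100, 1000, 50
X = rng.normal(size=(n_points, high_dim))            # points in a high-dimensional space

# Random projection matrix; the 1/sqrt(low_dim) scaling keeps distances comparable.
R = rng.normal(size=(high_dim, low_dim)) / np.sqrt(low_dim)
X_low = X @ R                                        # the same points in low_dim dimensions

# Compare one pairwise distance before and after projection.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_low[0] - X_low[1]))
```
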
### Radial Basis Functions[^wikipedia-radial]

This is a method that employs a family of functions that operate on the
**distance** of the input from a fixed point.

> [!CAUTION]
> They may take a point (or vector) as their input, but they'll compute a
> distance, usually the Euclidean one, and then use it for the final
> computation.

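For example, a Gaussian RBF (one common member of the family; the centre and width below are made-up values):

```python
import numpy as np

def gaussian_rbf(x, centre, sigma=1.0):
    """Gaussian radial basis function: depends on x only through its
    Euclidean distance from the fixed centre."""
    r = np.linalg.norm(x - centre)
    return np.exp(-(r ** 2) / (2 * sigma ** 2))

centre = np.array([0.0, 0.0])                         # the fixed point
print(gaussian_rbf(np.array([1.0, 1.0]), centre))     # ~0.37 for these values
```
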
### Kernel Machines[^wikipedia-kernel-machines]

They are machine-learning algorithms that make use of a kernel function,
which makes computation in higher dimensions possible without actually
computing higher-dimensional coordinates for our points.

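A tiny sketch of the trick with a degree-2 polynomial kernel (an assumed example): the kernel gives the same inner product as the explicitly expanded feature vectors, without ever building them.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: all pairwise products x_i * x_j (N**2 features)."""
    return np.outer(x, x).ravel()

def k(x, y):
    """Degree-2 polynomial kernel: the same inner product, computed in the original space."""
    return (x @ y) ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

print(phi(x) @ phi(y))   # 20.25, inner product in the expanded space
print(k(x, y))           # 20.25, no expansion needed
```
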
## Deep vs Shallow Networks

While it is **theoretically possible** to have only `shallow-networks` as
**universal predictors**, this is **not feasible in practice** as it would
require more **hardware**.

Instead, `deep-networks` trade `time` for `space`, taking longer, but requiring **less hardware**.

Usually, even for plain boolean functions, a shallow network can need $O(2^n)$
units in the number of inputs $n$, i.e. it scales exponentially.

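A toy illustration of this trade-off, using the $n$-bit parity function (a standard example, not taken from the notes): a shallow, sum-of-products style representation needs one term per odd-parity input pattern, $2^{n-1}$ of them, while a deep chain of 2-input XORs only needs $n - 1$ units.

```python
from itertools import product

n = 8

# "Shallow": one term for every odd-parity input pattern.
odd_patterns = [p for p in product([0, 1], repeat=n) if sum(p) % 2 == 1]
print("shallow terms:", len(odd_patterns), "=", 2 ** (n - 1))   # 128

# "Deep": a chain of 2-input XOR units, one per additional input bit.
print("deep XOR units:", n - 1)                                 # 7
```
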
> [!NOTE]
> Mainly, the hardware we are talking about is both Compute Units (we don't care
> if they come from a CPU, GPU or NPU, you name it) and Memory.

## Invariant Feature Learning

@@ -47,3 +97,11 @@ We can assume that whatever we are trying to represent **doesn't need that much

Thus, our goal is to find this **low dimensional latent-space** and **disentangle** features, so that each
`direction` of our **latent-space** is a `feature`. Essentially, what we want to achieve with `Deep-Learning` is
a **system** capable of *learning* this **latent-space** on ***its own***.

[^cover]: T. M. Cover, *Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition*, 1965

[^polinomial-sklearn]: [Sklearn | Polynomial Features | 14th November 2025](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)

[^wikipedia-radial]: [Wikipedia | Radial Basis Function | 14th November 2025](https://en.wikipedia.org/wiki/Radial_basis_function)

[^wikipedia-kernel-machines]: [Wikipedia | Kernel Method | 14th November 2025](https://en.wikipedia.org/wiki/Kernel_method)