Learning Representations
Linear Classifiers and Limitations
A Linear Classifier can be defined by a hyperplane of the form:
\sum_{i=1}^{N} w_{i}x_{i} + b = 0
So, for each point in the input space, the output is
o = sign \left( \sum_{i=1}^{N} w_{i}x_{i} + b \right)
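As a minimal sketch of this decision rule (the weights, bias, and sample point below are made-up illustration values):

```python
import numpy as np

# Hypothetical weights, bias, and input point (illustrative values only)
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 0.4, -2.0])

# o = sign(sum_i w_i * x_i + b): which side of the hyperplane x falls on
o = np.sign(np.dot(w, x) + b)
print(o)  # -1.0 for this x; points on the other side of the hyperplane give +1.0
```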
This immediately gives us a separation of the points into 2 classes. However,
is it always possible to linearly separate P points in N dimensions?
According to Cover's Theorem, "the probability that a dichotomy over P
points in N dimensions is linearly separable goes to zero as P gets
larger than N"1
Ideas for extracting features generically
Polynomial Classifier2
We could try to extract features by combining all the input features; however, this scales poorly in higher dimensions.
Let's say that we choose a degree d and the number of dimensions of
a point \vec{x} is N: we would then have on the order of N^d computations to do.
However, most of the time we will have d = 2, so as long as we don't
exaggerate with d it is still feasible.
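A minimal sketch of the d = 2 case, assuming we simply append all pairwise products of the input coordinates (the helper name below is made up):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly2_features(x):
    """Degree-2 polynomial expansion: original coordinates plus all pairwise products."""
    pairs = [x[i] * x[j] for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.concatenate([x, np.array(pairs)])

x = np.array([1.0, 2.0, 3.0])    # N = 3
print(len(poly2_features(x)))    # 3 + 6 = 9 features; the count grows roughly like N^d
```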
Space Tiling
It is a family of functions that try to preserve as much information as possible in images while reducing their dimension.
Random Projections
It consists of lowering the dimensionality of the data by using a randomly chosen matrix. The idea is that points living in a high-dimensional space can be projected into a low-dimensional space while preserving most of their pairwise distances.
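A minimal sketch, assuming a Gaussian random projection matrix (all sizes below are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, P = 1000, 50, 20                      # original dim, projected dim, number of points

X = rng.normal(size=(P, N))                 # P points in N dimensions
R = rng.normal(size=(N, K)) / np.sqrt(K)    # randomly chosen projection matrix
Y = X @ R                                   # the same P points, now in K dimensions

# Pairwise distances are approximately preserved after the projection
i, j = 0, 1
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j]))
```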
Radial Basis Functions3
This is a method that employs a family of functions that operate on the distance of the input from a fixed point (a center).
Caution
They may take a point (or vector) as their input, but they'll compute a distance, usually the Euclidean one, and then use it for the final computation.
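A minimal sketch, assuming the common Gaussian RBF (the centers and the width gamma are arbitrary illustration values):

```python
import numpy as np

def gaussian_rbf(x, center, gamma=1.0):
    """Gaussian RBF: depends on x only through its Euclidean distance from the center."""
    return np.exp(-gamma * np.linalg.norm(x - center) ** 2)

x = np.array([0.5, 1.0])
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]        # fixed points
features = np.array([gaussian_rbf(x, c) for c in centers])
print(features)   # one feature per center, each computed from a distance
```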
Kernel Machines4
They are machine-learning algorithms that make use of a kernel function, which makes computation in a higher-dimensional feature space possible without actually computing the higher-dimensional coordinates of our points.
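A minimal sketch of the idea (the "kernel trick"), using a degree-2 polynomial kernel on 2-D points; the explicit feature map phi is written out only to show that the kernel returns the same number without ever constructing the high-dimensional coordinates:

```python
import numpy as np

def poly2_kernel(x, y):
    """k(x, y) = (1 + x . y)^2 -- computed directly in the input space."""
    return (1.0 + np.dot(x, y)) ** 2

def phi(x):
    """Explicit degree-2 feature map matching the kernel above (2-D input -> 6-D)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# Same value (up to rounding), with no high-dimensional coordinates ever computed
print(poly2_kernel(x, y), np.dot(phi(x), phi(y)))
```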
Deep vs Shallow Networks
While it is theoretically possible to use only shallow networks as
universal approximators,
this is not practically feasible, as it would require far more hardware.
Deep networks instead trade time for space: they take longer (more layers computed in sequence), but require less hardware.
For a generic Boolean function of n inputs, a shallow (two-layer) network needs on the order of
O(2^n) units, i.e. it scales exponentially, whereas a deeper network can get away with far fewer.
Note
The hardware we are talking about here is mainly compute units (we don't care if they come from a CPU, GPU or NPU, you name it) and memory.
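To make the time/space trade-off concrete, here is a rough back-of-the-envelope sketch using the classic n-bit parity example, treating the shallow network as a two-layer AND/OR circuit and the deep network as a chain of two-input XOR blocks (these modelling choices are illustration assumptions, not taken from the source):

```python
# Parity of n bits: a two-layer (shallow) AND/OR circuit needs one AND term per
# odd-parity input pattern, i.e. 2^(n-1) terms, while a deep circuit just chains
# n-1 two-input XOR blocks (each block being a small constant-size sub-circuit).
for n in (8, 16, 24):
    shallow_terms = 2 ** (n - 1)   # exponential in n: lots of hardware, but only 2 layers
    deep_blocks = n - 1            # linear in n: little hardware, but n-1 sequential layers
    print(f"n={n:2d}  shallow terms ~ {shallow_terms:>10,}  deep XOR blocks = {deep_blocks}")
```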
Invariant Feature Learning
When we need to learn something that may have varying features (backgrounds, colors, shapes, etc.), it is useful to do these steps:
- Embed data into high-dimensional spaces
- Bring similar data closer together and reduce the dimensionality (see the sketch after this list)
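A minimal sketch of the second step using a contrastive-style loss on an embedding function f; the linear embedding, its sizes, and the margin are all made-up illustration choices (a real system would learn f):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))            # embed 64-D inputs into a smaller 16-D space

def f(x):
    return W @ x                         # toy linear embedding (stand-in for a learned one)

def contrastive_loss(xa, xb, similar, margin=1.0):
    """Pull similar pairs together, push dissimilar pairs at least `margin` apart."""
    d = np.linalg.norm(f(xa) - f(xb))
    return d ** 2 if similar else max(0.0, margin - d) ** 2

xa, xb = rng.normal(size=64), rng.normal(size=64)
print(contrastive_loss(xa, xb, similar=True), contrastive_loss(xa, xb, similar=False))
```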
Sparse Non-Linear Expansion
In this case we first break our data apart, expanding it into a higher-dimensional, sparse, non-linear representation, and then we aggregate the pieces back together (see the sketch below).
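A minimal sketch of this expand-then-aggregate pattern, assuming a random projection followed by a ReLU (which zeroes out many components, hence "sparse") and a simple max-pooling aggregation; all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 32))                      # expand a 32-D input into 256 components

x = rng.normal(size=32)
expanded = np.maximum(0.0, W @ x)                   # non-linear expansion; about half the entries become 0
aggregated = expanded.reshape(64, 4).max(axis=1)    # aggregate groups of 4 by max-pooling -> 64-D

print((expanded == 0).mean(), aggregated.shape)     # fraction of zeros, pooled size
```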
Manifold Hypothesis
We can assume that whatever we are trying to represent doesn't actually need that many features, but rather is a point in a latent space with a lower number of dimensions.
Thus, our goal is to find this low-dimensional latent space and disentangle its features, so that each
direction of our latent space is a feature. Essentially, what we want to achieve with Deep Learning is
a system capable of learning this latent space on its own.
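As a minimal illustration of the hypothesis (not of how Deep Learning actually learns the mapping), here is a sketch where 10-D observations are really generated from a 2-D latent variable, and a simple SVD reveals that the effective dimensionality is 2; all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))     # the "true" 2-D latent variables
A = rng.normal(size=(2, 10))      # fixed linear map into a 10-D observation space
X = z @ A                         # observed data: 10-D points lying on a 2-D subspace

# Singular values reveal the intrinsic dimensionality: only 2 are essentially non-zero
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.round(s, 3))
```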