Learning Representations
Linear Classifiers and Limitations
A Linear Classifier can be defined by a hyperplane of the form:
\sum_{i=1}^{N} w_{i}x_{i} + b = 0
So, for each point in the input space, the output is
o = sign \left( \sum_{i=1}^{N} w_{i}x_{i} + b \right)
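As a minimal sketch of this decision rule (the weights, bias, and sample point below are made-up illustration values):

```python
import numpy as np

# Hypothetical weights, bias, and input point (illustrative values only)
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 0.4, -2.0])

# o = sign(sum_i w_i * x_i + b): which side of the hyperplane x falls on
o = np.sign(np.dot(w, x) + b)
print(o)  # -1.0 for this x; points on the other side of the hyperplane give +1.0
```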
This immediately gives us a separation of the points into 2 classes. However,
is it always possible to linearly separate P points in N dimensions?
According to Cover's Theorem, "the probability that a dichotomy over P
points in N dimensions is linearly separable goes to zero as P gets
larger than N"1
Ideas for extracting features generically
Polynomial Classifier2
We could try to extract features by combining all the input features; however, this scales poorly in higher dimensions.
Let's say that we choose a degree d and the number of dimensions of
a point \vec{x} is N: we would then have on the order of N^d computations to do.
However, most of the time we will have d = 2, so as long as we don't
exaggerate with d it is still feasible.
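A minimal sketch of the d = 2 case, assuming we simply append all pairwise products of the input coordinates (the helper name below is made up):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly2_features(x):
    """Degree-2 polynomial expansion: original coordinates plus all pairwise products."""
    pairs = [x[i] * x[j] for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.concatenate([x, np.array(pairs)])

x = np.array([1.0, 2.0, 3.0])    # N = 3
print(len(poly2_features(x)))    # 3 + 6 = 9 features; the count grows roughly like N^d
```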
Space Tiling
It is a family of functions that try to preserve as much information as possible in images while reducing their dimension.
Random Projections
It consists of lowering the dimensionality of the data by using a randomly chosen matrix. The idea is that points living in a high-dimensional space can be projected into a low-dimensional space while preserving most of their pairwise distances.
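A minimal sketch, assuming a Gaussian random projection matrix (all sizes below are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, P = 1000, 50, 20                      # original dim, projected dim, number of points

X = rng.normal(size=(P, N))                 # P points in N dimensions
R = rng.normal(size=(N, K)) / np.sqrt(K)    # randomly chosen projection matrix
Y = X @ R                                   # the same P points, now in K dimensions

# Pairwise distances are approximately preserved after the projection
i, j = 0, 1
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j]))
```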
Radial Basis Functions3
This is a method that employs a family of functions that operate on the distance of the input from a fixed point (a center).
Caution
They may take a point (or vector) as their input, but they'll compute a distance, usually the Euclidean one, and then use it for the final computation.
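A minimal sketch, assuming the common Gaussian RBF (the centers and the width gamma are arbitrary illustration values):

```python
import numpy as np

def gaussian_rbf(x, center, gamma=1.0):
    """Gaussian RBF: depends on x only through its Euclidean distance from the center."""
    return np.exp(-gamma * np.linalg.norm(x - center) ** 2)

x = np.array([0.5, 1.0])
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]        # fixed points
features = np.array([gaussian_rbf(x, c) for c in centers])
print(features)   # one feature per center, each computed from a distance
```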
Kernel Machines4
They are machine-learning algorithms that make use of a kernel function, which makes computation in a higher-dimensional feature space possible without actually computing the higher-dimensional coordinates of our points.
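A minimal sketch of the idea (the "kernel trick"), using a degree-2 polynomial kernel on 2-D points; the explicit feature map phi is written out only to show that the kernel returns the same number without ever constructing the high-dimensional coordinates:

```python
import numpy as np

def poly2_kernel(x, y):
    """k(x, y) = (1 + x . y)^2 -- computed directly in the input space."""
    return (1.0 + np.dot(x, y)) ** 2

def phi(x):
    """Explicit degree-2 feature map matching the kernel above (2-D input -> 6-D)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# Same value (up to rounding), with no high-dimensional coordinates ever computed
print(poly2_kernel(x, y), np.dot(phi(x), phi(y)))
```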
Deep vs Shallow Networks
While it is theoretically possible to use only shallow networks as
universal approximators,
this is not practically feasible, as it would require far more hardware.
Deep networks instead trade time for space: they take longer (more layers computed in sequence), but require less hardware.
For a generic Boolean function of n inputs, a shallow (two-layer) network needs on the order of
O(2^n) units, i.e. it scales exponentially, whereas a deeper network can get away with far fewer.
Note
The hardware we are talking about here is mainly compute units (we don't care if they come from a CPU, GPU or NPU, you name it) and memory.
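To make the time/space trade-off concrete, here is a rough back-of-the-envelope sketch using the classic n-bit parity example, treating the shallow network as a two-layer AND/OR circuit and the deep network as a chain of two-input XOR blocks (these modelling choices are illustration assumptions, not taken from the source):

```python
# Parity of n bits: a two-layer (shallow) AND/OR circuit needs one AND term per
# odd-parity input pattern, i.e. 2^(n-1) terms, while a deep circuit just chains
# n-1 two-input XOR blocks (each block being a small constant-size sub-circuit).
for n in (8, 16, 24):
    shallow_terms = 2 ** (n - 1)   # exponential in n: lots of hardware, but only 2 layers
    deep_blocks = n - 1            # linear in n: little hardware, but n-1 sequential layers
    print(f"n={n:2d}  shallow terms ~ {shallow_terms:>10,}  deep XOR blocks = {deep_blocks}")
```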
Invariant Feature Learning
When we need to learn something that may have varying features (backgrounds, colors, shapes, etc.), it is useful to do these steps:
- Embed data into high-dimensional spaces
- Bring similar data closer together and reduce the dimensionality (see the sketch after this list)
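A minimal sketch of the second step using a contrastive-style loss on an embedding function f; the linear embedding, its sizes, and the margin are all made-up illustration choices (a real system would learn f):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))            # embed 64-D inputs into a smaller 16-D space

def f(x):
    return W @ x                         # toy linear embedding (stand-in for a learned one)

def contrastive_loss(xa, xb, similar, margin=1.0):
    """Pull similar pairs together, push dissimilar pairs at least `margin` apart."""
    d = np.linalg.norm(f(xa) - f(xb))
    return d ** 2 if similar else max(0.0, margin - d) ** 2

xa, xb = rng.normal(size=64), rng.normal(size=64)
print(contrastive_loss(xa, xb, similar=True), contrastive_loss(xa, xb, similar=False))
```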
Sparse Non-Linear Expansion
In this case we first break our data apart, expanding it into a higher-dimensional, sparse, non-linear representation, and then we aggregate the pieces back together (see the sketch below).
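A minimal sketch of this expand-then-aggregate pattern, assuming a random projection followed by a ReLU (which zeroes out many components, hence "sparse") and a simple max-pooling aggregation; all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 32))                      # expand a 32-D input into 256 components

x = rng.normal(size=32)
expanded = np.maximum(0.0, W @ x)                   # non-linear expansion; about half the entries become 0
aggregated = expanded.reshape(64, 4).max(axis=1)    # aggregate groups of 4 by max-pooling -> 64-D

print((expanded == 0).mean(), aggregated.shape)     # fraction of zeros, pooled size
```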
Manifold Hypothesis
We can assume that whatever we are trying to represent doesn't actually need that many features, but rather is a point in a latent space with a lower number of dimensions.
Thus, our goal is to find this low-dimensional latent space and disentangle its features, so that each
direction of our latent space is a feature. Essentially, what we want to achieve with Deep Learning is
a system capable of learning this latent space on its own.
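As a minimal illustration of the hypothesis (not of how Deep Learning actually learns the mapping), here is a sketch where 10-D observations are really generated from a 2-D latent variable, and a simple SVD reveals that the effective dimensionality is 2; all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))     # the "true" 2-D latent variables
A = rng.normal(size=(2, 10))      # fixed linear map into a 10-D observation space
X = z @ A                         # observed data: 10-D points lying on a 2-D subspace

# Singular values reveal the intrinsic dimensionality: only 2 are essentially non-zero
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.round(s, 3))
```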