# Learning Representations

## Linear Classifiers and Limitations

A `Linear Classifier` can be defined as a **hyperplane**:

$$ \sum_{i=1}^{N} w_{i}x_{i} + b = 0 $$

So, for each point $\vec{x}$ in the input space the output is

$$ o = \operatorname{sign} \left( \sum_{i=1}^{N} w_{i}x_{i} + b \right) $$

This gives us a separation of the points into 2 classes. However, is it always possible to linearly separate $P$ `points` in $N$ `dimensions`? According to Cover's Theorem, ***"the probability that a dichotomy over $P$ points in $N$ dimensions is linearly separable goes to zero as $P$ gets larger than $N$"***[^cover]. In particular, $P \le N + 1$ points in general position are always separable, the probability drops to $1/2$ around $P \approx 2N$, and it vanishes as $P/N$ grows.

## Ideas to extract features generically

### Polynomial Classifier[^polynomial-sklearn]

We could try to extract features by combining all the input features, but this scales poorly in high dimensions: if we choose a degree $d$ and the point $\vec{x}$ has $N$ dimensions, we get on the order of $N^d$ monomials (exactly $\binom{N+d}{d}$) to compute. In practice, most of the time $d = 2$, so as long as we don't exaggerate with $d$ this is still feasible. A minimal sketch of this idea (on the XOR problem) is given at the end of this note.

### Space Tiling

A family of functions that preserve as much information as possible in images while reducing their dimensionality.

### Random Projections

This consists in lowering the dimensionality of the data by multiplying it by a randomly chosen matrix. The idea is that points in a high-dimensional space can be projected into a lower-dimensional space while approximately preserving their pairwise distances (the content of the Johnson-Lindenstrauss lemma).

### Radial Basis Functions[^wikipedia-radial]

This method employs a family of functions whose value depends only on the **distance** between the input and a fixed point (the center).

> [!CAUTION]
> They may take a point (or vector) as their input, but they'll compute a
> distance, usually the Euclidean one, and then use it for the final
> computation

### Kernel Machines[^wikipedia-kernel-machines]

These are machine-learning algorithms that make use of a kernel function, which makes computation in a higher-dimensional feature space possible without ever actually computing the higher-dimensional coordinates of our points.

## Deep vs Shallow Networks

While it is **theoretically possible** to use only `shallow-networks` as **universal approximators**, this is **not practical**, as it would require far more **hardware**. Instead, `deep-networks` trade `time` for `space`: they take longer, but require **less hardware**. For a generic Boolean function of $n$ variables, a shallow network may need on the order of $O(2^n)$ units, i.e. it scales exponentially, whereas adding depth lets us get away with far fewer units.

> [!NOTE]
> The hardware we are talking about is mainly Compute Units (we don't care
> whether they come from a CPU, GPU or NPU, you name it) and Memory.

## Invariant Feature Learning

When we need to learn something that may have varying features (**backgrounds**, **colors**, **shapes**, etc...), it is useful to follow these steps:

1. Embed the data into a **high-dimensional** space
2. Bring **similar** data points **closer** together and **reduce the dimensionality**

## Sparse Non-Linear Expansion

In this case, we first break our data apart into a sparse, higher-dimensional representation, and then we aggregate (pool) the pieces back together.

## Manifold Hypothesis

We can assume that whatever we are trying to represent **doesn't need that many features**, but rather **is a point of a latent space with far fewer dimensions**. Thus, our goal is to find this **low-dimensional latent-space** and **disentangle** its features, so that each `direction` of the **latent-space** is a `feature`. Essentially, what we want to achieve with `Deep-Learning` is a **system** capable of *learning* this **latent-space** on ***its own***.
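To make the linear-separability limitation and the polynomial fix concrete, here is a minimal sketch (assuming `numpy` and `scikit-learn` are available; the weight vector is handcrafted for the example, not learned): the XOR labelling of four 2-D points cannot be separated by any line, but after a degree-2 expansion with `PolynomialFeatures` the cross-term $x_1 x_2$ makes a single separating hyperplane possible.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Four points whose XOR labelling is NOT linearly separable in 2-D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Degree-2 polynomial expansion; columns are x1, x2, x1^2, x1*x2, x2^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
Z = poly.fit_transform(X)
print(poly.get_feature_names_out(["x1", "x2"]))  # sanity-check the column order

# Handcrafted hyperplane in the expanded space:
#   x1 + x2 - 2*(x1*x2) - 0.5  is negative for class 0 and positive for class 1.
w = np.array([1.0, 1.0, 0.0, -2.0, 0.0])
b = -0.5
pred = (Z @ w + b > 0).astype(int)
print(pred, y)  # -> [0 1 1 0] [0 1 1 0]
```

No choice of weights in the original 2-D space can achieve this, which is exactly the limitation the feature expansion works around.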
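For the **Random Projections** idea, a minimal sketch assuming only `numpy` (the sizes 1000 → 50 and the Gaussian matrix are arbitrary choices for illustration): projecting with a random matrix scaled by $1/\sqrt{k}$ roughly preserves pairwise Euclidean distances, in the spirit of the Johnson-Lindenstrauss lemma.

```python
import numpy as np

rng = np.random.default_rng(0)

n, N, k = 100, 1000, 50            # 100 points, from 1000 dimensions down to 50
X = rng.normal(size=(n, N))        # original high-dimensional points

# Random projection matrix: Gaussian entries, scaled so that squared norms
# (and hence pairwise distances) are preserved in expectation.
R = rng.normal(size=(N, k)) / np.sqrt(k)
Y = X @ R                          # low-dimensional projections

# Compare a few pairwise distances before and after the projection.
for i, j in [(0, 1), (2, 3), (4, 5)]:
    d_hi = np.linalg.norm(X[i] - X[j])
    d_lo = np.linalg.norm(Y[i] - Y[j])
    print(f"pair ({i},{j}): original {d_hi:.2f}, projected {d_lo:.2f}")
```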
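For **Radial Basis Functions** and **Kernel Machines**, a minimal sketch, again only `numpy`, with the centers, the bandwidth $\sigma$ and the Gaussian shape chosen purely for illustration: an RBF feature map turns a point into a vector of Gaussian-weighted distances to fixed centers, while the RBF *kernel* compares two points directly, without ever building the implicit (here infinite-dimensional) feature vector.

```python
import numpy as np

def rbf_features(X, centers, sigma=1.0):
    """Gaussian RBF features: phi_j(x) = exp(-||x - c_j||^2 / (2 sigma^2))."""
    # Squared Euclidean distances between every point and every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def rbf_kernel(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))        # 5 points in 2-D
centers = rng.normal(size=(3, 2))  # 3 fixed centers

print(rbf_features(X, centers).shape)  # (5, 3): one feature per center
print(rbf_kernel(X, X).shape)          # (5, 5): pairwise similarities
```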
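Finally, for the **Manifold Hypothesis**, a deliberately simplified sketch (only `numpy`, and a *linear* latent space, so this is the easy special case that an SVD/PCA can already handle, whereas deep learning targets curved manifolds): 50-dimensional data generated from a single latent factor has essentially one dominant direction of variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 points in 50-D that really depend on a single latent factor t:
# each observed dimension is a fixed pattern scaled by t, plus small noise.
t = rng.uniform(-1, 1, size=(200, 1))    # 1-D latent coordinate
pattern = rng.normal(size=(1, 50))       # how the latent factor shows up in 50-D
X = t @ pattern + 0.01 * rng.normal(size=(200, 50))

# Singular values reveal the effective dimensionality: one dominant direction.
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.round(s[:5], 2))   # the first value dwarfs the rest
```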
[^cover]: T. M. Cover, "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition", IEEE Transactions on Electronic Computers, 1965

[^polynomial-sklearn]: [Sklearn | Polynomial Features | 14th November 2025](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)

[^wikipedia-radial]: [Wikipedia | Radial Basis Function | 14th November 2025](https://en.wikipedia.org/wiki/Radial_basis_function)

[^wikipedia-kernel-machines]: [Wikipedia | Kernel Method | 14th November 2025](https://en.wikipedia.org/wiki/Kernel_method)