Why Neural Networks Work: A Gentle Dive Into the Math
Neural networks can feel like a mystery: huge webs of weights and activations that learn to recognize faces, translate languages, or spot anomalies in medical images. But behind the apparent magic lies a surprisingly elegant form of mathematics.
A neural network is made up of artificial neurons organized into layers. Each neuron does little more than compute a weighted sum of its inputs, apply a non-linear activation (like ReLU or sigmoid), and pass the result along to the next layer. While inspired by biological brains, modern networks are mathematical objects (nodes and edges, realized as matrix multiplications) whose job is to approximate functions.
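To make this concrete, here is a minimal sketch (in Python with NumPy) of what a single layer computes: a weighted sum of its inputs followed by a ReLU activation. The shapes and numbers are arbitrary and purely illustrative.

import numpy as np

def dense_layer(x, W, b):
    # One layer: weighted sum of the inputs, then a ReLU non-linearity.
    return np.maximum(0.0, W @ x + b)

# Toy example: 3 inputs feeding a layer of 2 neurons (weights chosen arbitrarily).
x = np.array([0.5, -1.0, 2.0])
W = np.array([[0.2, -0.4, 0.1],
              [0.7,  0.3, -0.5]])
b = np.array([0.1, -0.2])
print(dense_layer(x, W, b))  # this layer's output, which would feed the next layer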
One of the foundational results in the theory of neural networks is the Universal Approximation Theorem. Roughly, it says that even a very simple feedforward network, with a single hidden layer and a non-polynomial activation, can approximate any continuous function on a bounded domain to any desired precision, provided the hidden layer has enough neurons. Adding more layers does not break this guarantee: deeper architectures let us trade width for depth, stacking complexity while keeping universality.
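As an informal illustration of that approximating power (not a proof of the theorem), the sketch below builds a one-hidden-layer tanh network whose output weights are fit by least squares to sin(x). The hidden-layer size, the random hidden weights, and the target function are all arbitrary choices made for the example.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
target = np.sin(x)                       # a continuous function to approximate

# One hidden layer of tanh units: f(x) ~= sum_i c_i * tanh(w_i * x + b_i)
n_hidden = 50
W = rng.normal(size=(1, n_hidden))
b = rng.normal(size=n_hidden)
H = np.tanh(x @ W + b)                   # hidden activations, shape (200, 50)
c, *_ = np.linalg.lstsq(H, target, rcond=None)   # fit the output weights

approx = H @ c
print("max |error|:", np.abs(approx - target).max())

Increasing the number of hidden units typically drives the error down further, which is the behavior the theorem promises when the weights are well chosen.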
The real magic happens with backpropagation, an efficient application of the chain rule that tells us how much each weight contributes to the total error. Once the gradients are computed, stochastic gradient descent (SGD) nudges the weights in the direction that reduces a loss function, such as classification error. Repeating this process over many batches of samples is how the network learns and, ideally, how it learns to generalize.
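Here is a minimal sketch of that loop on the simplest possible model, a single logistic neuron trained on made-up data. With only one layer, "backpropagation" reduces to a single application of the chain rule, and for brevity the update uses the full batch rather than stochastic mini-batches; the data, learning rate, and step count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 2))                  # a small batch of 8 samples
y = (x[:, 0] + x[:, 1] > 0).astype(float)    # toy binary labels

w = np.zeros(2)
b = 0.0
lr = 0.1                                     # learning rate

for step in range(200):
    # Forward pass: logistic prediction for every sample.
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    # Backward pass: the chain rule gives the gradient of the average
    # cross-entropy loss with respect to each parameter.
    grad_logits = (p - y) / len(y)
    grad_w = x.T @ grad_logits
    grad_b = grad_logits.sum()
    # Gradient-descent update: nudge each parameter against its gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print("learned weights:", w, "bias:", b)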
You may wonder: why go deeper rather than just wider? The mathematical trade-off is roughly this. Shallow, wide networks can approximate complex functions (by universal approximation), but they often need to be enormous to do so. Deep networks can represent complex hierarchical functions more concisely because they reuse intermediate features. The hierarchy is what matters: early layers capture simple patterns such as edges, and later layers compose them into higher-level concepts such as faces or sentiment.
A stack of purely linear layers, no matter how deep, is still a linear function, so depth alone would buy nothing. Non-linear activation functions are what let networks capture richer patterns: ReLU, sigmoid, and tanh each transform the weighted inputs so that successive layers do not collapse into a single linear map. It is the activation function, in other words, that makes a non-linear decision boundary possible and determines how much non-linearity the network can express.
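The collapse of stacked linear layers is easy to check numerically. In the sketch below (random weights, chosen only for illustration), two linear layers compose into one linear map, while inserting a ReLU between them breaks that equivalence.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

# Two linear layers with no activation are equivalent to one linear layer.
deep_linear = W2 @ (W1 @ x)
single_linear = (W2 @ W1) @ x
print(np.allclose(deep_linear, single_linear))   # True

# A ReLU between the layers prevents the collapse into a single linear map.
with_relu = W2 @ np.maximum(0.0, W1 @ x)
print(np.allclose(with_relu, single_linear))     # almost surely False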
Despite the solid theory behind these ideas, open questions remain, especially in practice. Generalization: why do huge, heavily over-parameterized networks not simply memorize their training data? Research points to some combination of SGD dynamics, the network architecture itself, and implicit forms of regularization. Catastrophic forgetting: teaching a network new things can overwrite what it learned before, and preventing this is still an active area of research. Biological fidelity: despite the early excitement about their resemblance to the brain, neural networks are not precise models of it, and the invariances they develop can differ sharply from what humans perceive.
Deep learning refers to networks with many layers that learn hierarchical features on their own, with little need for manual feature engineering. It supports a huge range of applications, from image recognition and voice assistants to graph networks, often running on specialized hardware. Because these models discover and automate feature extraction, they scale with data, and in the best cases they surpass classical machine-learning methods.
Sources:
https://news.mit.edu/2017/explained-neural-networks-deep-learning-0414