
The successes of deep learning are often attributed to its ability to automatically discover new representations of the data, rather than relying on handcrafted features as other learning methods do. In this talk, however, Pedro will show that deep networks learned by the standard gradient descent algorithm are in fact approximately equivalent, in a mathematical sense, to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel).
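
To make "memorizes the data and uses it directly for prediction" concrete, here is a minimal sketch of a kernel machine. It is an illustration only and not the construction from the talk: the Gaussian (RBF) kernel, the kernel perceptron training rule, the toy XOR data, and all function names are assumptions chosen for brevity.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) similarity between two points (an assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def predict(x, train_x, train_y, alphas, kernel=rbf_kernel):
    """Predict the sign of a weighted sum of similarities to the stored examples."""
    score = sum(a * y * kernel(x, xi)
                for a, y, xi in zip(alphas, train_y, train_x))
    return 1 if score >= 0 else -1

def fit_kernel_perceptron(train_x, train_y, epochs=10, kernel=rbf_kernel):
    """Kernel perceptron: raise an example's weight each time it is misclassified."""
    alphas = [0] * len(train_x)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(train_x, train_y)):
            if predict(x, train_x, train_y, alphas, kernel) != y:
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return alphas

# Toy XOR data: not linearly separable in the raw inputs.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
train_y = [-1, 1, 1, -1]

alphas = fit_kernel_perceptron(train_x, train_y)
predictions = [predict(x, train_x, train_y, alphas) for x in train_x]
```

Note that the model has no learned features: the training examples themselves are kept, and a prediction is just a weighted vote among them, with the kernel deciding how much each stored example counts for a given query.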