
Machine Learning Papers I've Been Reading

The standard text in most machine learning classes is The Elements of Statistical Learning (“ESL”), which provides a solid foundation in the main techniques of machine learning. ESL is still an excellent reference for many techniques, but it was released in 2009, so its treatment of neural networks is positively ancient and it predates most of modern deep learning. Two sets of major advancements stand out since then: attention and transformers, which have led to an explosion in large language models and computer vision, and diffusion, which has led to a similar explosion in image and video generation. The following is a selection of papers for people who already have a basic understanding of neural networks, for example from Deep Learning by Bishop and Bishop.

Most or all of these papers are posted as preprints on arXiv (arxiv.org), but arXiv is a bit of a firehose. A good resource for finding high-quality papers is Papers With Code. For some of the more technical papers, I’ve added short reference notes for my future self.

Data Preprocessing, Pretraining, and Transfer Learning:

Layers:

  • Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014)

    This paper describes dropout, a technique that randomly “drops” neurons from the network during training. This effectively samples from an exponential number of “thinned” neural networks, and using the full network with scaled-down weights at test time approximates averaging their predictions, which reduces overfitting. It’s a pretty easy and well-motivated read, and the inspiration, the “dropping” of genes during sexual reproduction, was neat. There is also a discussion of generalizing dropout beyond Bernoulli random variables, and the justification for dropout being a general technique rather than a domain-specific one is also worth a read.

  • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015)

    Internal covariate shift is the change in the distribution of each layer’s inputs (for example, their mean and variance) during training. It happens naturally whenever the parameters of earlier layers are updated, and it slows down training by forcing much lower learning rates. Batch normalization normalizes the inputs to each layer over the current mini-batch and then applies a learned scale and shift. This keeps parameters from blowing up without the normalization restricting what each layer can represent, both of which can happen with other techniques. Batch normalization means neural nets need fewer epochs to converge and can reduce the need for dropout layers, but it adds computational cost.
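
As a rough illustration of the two layers above, here is a minimal pure-Python sketch of their training-time forward passes. It is not the papers' reference code. The function names (`dropout`, `batch_norm`), the `p_drop` and `eps` parameters, and the toy inputs are assumptions made for the example. The dropout function uses the "inverted" convention (rescaling survivors during training) rather than scaling weights at test time as the paper describes, and the batch normalization function omits the running averages that are used at inference.

```python
import random
import statistics


def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop while training.

    Survivors are scaled by 1 / (1 - p_drop) so the expected activation is
    unchanged and nothing needs rescaling at test time. This is an assumed
    convention; the paper scales the weights at test time instead.
    """
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch is a list of examples, each a list of features. gamma and beta
    are the learned per-feature scale and shift that let the layer undo
    the normalization if training prefers that.
    """
    columns = list(zip(*batch))  # one tuple of values per feature
    means = [statistics.fmean(col) for col in columns]
    variances = [statistics.pvariance(col, mu) for col, mu in zip(columns, means)]
    return [
        [
            g * (x - mu) / (var + eps) ** 0.5 + b
            for x, mu, var, g, b in zip(row, means, variances, gamma, beta)
        ]
        for row in batch
    ]


# Toy usage with made-up numbers.
rng = random.Random(0)
dropped = dropout([1.0] * 10_000, p_drop=0.5, rng=rng)
print(sum(dropped) / len(dropped))  # mean stays close to 1.0 in expectation

toy_batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(batch_norm(toy_batch, gamma=[1.0, 1.0], beta=[0.0, 0.0]))
```

With `gamma` set to ones and `beta` to zeros, each feature column comes out with roughly zero mean and unit variance, even though the two input columns differ in scale by a factor of ten.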

Architectures:

Convolutional Neural Nets

Graph Neural Nets

Large Language Models

Optimizers:

  • Adam: A Method for Stochastic Optimization (2017)

    This paper introduces Adam, a very popular optimizer that tends to converge faster and more consistently than plain stochastic gradient descent. Adam is adaptive: each update uses running estimates of the first and second moments of the gradient (its mean and uncentered variance) to scale the step for each parameter individually, which tends to speed up convergence. I highly recommend reading the next paper, which improves on Adam by correctly implementing weight decay.

  • Decoupled Weight Decay Regularization (2019)

    Common implementations of Adam conflated weight decay with $L^2$ regularization. The two are equivalent for standard stochastic gradient descent (once rescaled by the learning rate), but not for Adam. As a result, Adam was still being outperformed by SGD with momentum on specific tasks like image recognition. This paper introduces AdamW, which decouples weight decay from the gradient-based update, and results in an optimizer that generalizes substantially better and can compete with SGD with momentum on image classification.
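
To make the difference between the two optimizers above concrete, here is a minimal pure-Python sketch of a single update step. The function name `adam_step`, the `decoupled` flag, and the toy numbers are assumptions made for the example. Scaling the decoupled decay by the learning rate is one common convention that I am assuming here, not necessarily the exact form in the paper.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One Adam update on a list of parameters; returns (theta, m, v).

    decoupled=False folds the decay into the gradient (L2 regularization),
    so it gets rescaled by the adaptive denominator. decoupled=True applies
    the decay to the weights directly, in the style of AdamW.
    """
    if weight_decay and not decoupled:
        grad = [g + weight_decay * w for g, w in zip(grad, theta)]

    # Running estimates of the gradient's first and second moments.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]

    # Bias correction: m and v start at zero, so early estimates are too small.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]

    new_theta = []
    for w, mh, vh in zip(theta, m_hat, v_hat):
        w_next = w - lr * mh / (math.sqrt(vh) + eps)
        if weight_decay and decoupled:
            # Assumed convention: decay is scaled by the learning rate.
            w_next -= lr * weight_decay * w
        new_theta.append(w_next)
    return new_theta, m, v


# Toy comparison on the first step (t=1) with made-up numbers.
theta, grad, zeros = [1.0], [4.0], [0.0]
plain, _, _ = adam_step(theta, grad, zeros, zeros, t=1)
l2, _, _ = adam_step(theta, grad, zeros, zeros, t=1,
                     weight_decay=0.1, decoupled=False)
adamw, _, _ = adam_step(theta, grad, zeros, zeros, t=1,
                        weight_decay=0.1, decoupled=True)
print(plain, l2, adamw)
```

On this first step the $L^2$ variant lands on essentially the same weight as plain Adam, because the extra gradient term is divided out by the adaptive denominator. The decoupled variant shrinks the weight by an additional `lr * weight_decay * w`, independent of the gradient's scale.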

Attention and Transformers:

Topological Data Analysis

This post is licensed under CC BY 4.0 by the author.