Notable Works

Influential works not published at venues the anthology already indexes.

50 works spanning 1763–2021.

The Foundations

1763 An Essay Towards Solving a Problem in the Doctrine of Chances
Philosophical Transactions of the Royal Society

The foundation of everything probabilistic in ML.

PDF
1942 Runaround
Astounding Science Fiction

The Three Laws of Robotics. "After 'Runaround' appeared in the March 1942 issue of Astounding, I never stopped thinking about how minds might work." — Marvin Minsky

1943 A Logical Calculus of the Ideas Immanent in Nervous Activity
Bulletin of Mathematical Biophysics

The first mathematical model of a neuron. Everything starts here.

PDF
1948 A Mathematical Theory of Communication
Bell System Technical Journal

Entropy, mutual information, channel capacity. Every loss function in deep learning is downstream of this.

PDF
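
As a quick illustration of the central quantity, here is Shannon entropy in a few lines of Python (the toy distributions are made up for illustration):

    import math

    def entropy_bits(p):
        # H(p) = -sum_i p_i * log2(p_i): the average information per symbol, in bits
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    print(entropy_bits([0.5, 0.5]))  # 1.0 bit: a fair coin is maximally unpredictable
    print(entropy_bits([0.9, 0.1]))  # ~0.47 bits: a biased coin carries less information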
1950 Computing Machinery and Intelligence
Mind

The imitation game and the question of whether machines can think.

PDF
1961 Steps Toward Artificial Intelligence
Proceedings of the IRE

The call to arms for a generation of AI researchers. Search, pattern recognition, learning, planning, induction.

PDF

Neural Computation

1949 The Organization of Behavior: A Neuropsychological Theory
Wiley, New York

"Neurons that fire together wire together." The origin of associative learning.

PDF
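
A minimal sketch of the Hebbian rule the slogan summarizes: a connection strengthens in proportion to the joint activity of the units it links. The learning rate and toy activity trace below are illustrative, not from the book.

    def hebb_update(w, pre, post, lr=0.1):
        # "fire together, wire together": the weight grows only when both units are active
        return w + lr * pre * post

    w = 0.0
    for pre, post in [(1, 1), (1, 1), (0, 1), (1, 0)]:
        w = hebb_update(w, pre, post)
    print(w)  # 0.2: only the two co-active steps strengthened the connection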
1958 The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain
Psychological Review

The idea of neural networks as computation.

PDF
1962 Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex
Journal of Physiology

The neuroscience that inspired convolutional networks.

PDF
1969 Perceptrons: An Introduction to Computational Geometry
MIT Press

The book that killed connectionism for a decade. The XOR problem.

PDF
1980 Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position
Biological Cybernetics

The origin of convolutional neural networks.

PDF
1982 Neural Networks and Physical Systems with Emergent Collective Computational Abilities
Proceedings of the National Academy of Sciences

Brought physicists into neural networks and led to Boltzmann machines.

PDF
1986 Learning Representations by Back-Propagating Errors
Nature

Backpropagation. You know what this is.

PDF
1998 Gradient-Based Learning Applied to Document Recognition
Proceedings of the IEEE

The LeNet paper.

PDF
2006 A Fast Learning Algorithm for Deep Belief Nets
Neural Computation

Deep belief nets. The paper that ended the second AI winter.

PDF

Statistical Learning Theory

1951 A Stochastic Approximation Method
Annals of Mathematical Statistics

SGD. You also know what this is.

PDF
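
A minimal sketch of the stochastic approximation idea in its familiar SGD form: step against a noisy estimate of the gradient and still settle at the optimum. Robbins and Monro use a decreasing step size; the constant step, quadratic objective, and noise level here are illustrative simplifications.

    import random

    theta, lr = 5.0, 0.05
    for _ in range(2000):
        noisy_grad = 2 * (theta - 1.0) + random.gauss(0, 0.5)  # noisy gradient of (theta - 1)^2
        theta -= lr * noisy_grad
    print(theta)  # hovers near the true minimum at 1.0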
1964 A Formal Theory of Inductive Inference, Part I
Information and Control

Algorithmic probability, compression and intelligence.

PDF
1971 On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities
Theory of Probability and Its Applications

VC dimension. The theoretical foundation of statistical learning.

PDF
1977 Maximum Likelihood from Incomplete Data via the EM Algorithm
Journal of the Royal Statistical Society, Series B

The EM algorithm. Cited constantly across all of machine learning.

PDF
1978 Modeling by Shortest Data Description
Automatica

Minimum description length. The formal link between compression and learning.

1984 A Theory of the Learnable
Communications of the ACM

PAC learning. The origin of computational learning theory.

PDF
1989 Approximation by Superpositions of a Sigmoidal Function
Mathematics of Control, Signals and Systems

The universal approximation theorem; why neural networks work at all.

PDF
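
For reference, the theorem in its usual modern statement (a paraphrase, not a quotation from the paper): for any continuous f on the unit cube and any tolerance, some single-hidden-layer network with a sigmoidal activation approximates f uniformly,

    \forall \varepsilon > 0 \;\; \exists N, \alpha_i, w_i, b_i : \quad
    \sup_{x \in [0,1]^n} \Big| f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma(w_i^\top x + b_i) \Big| < \varepsilon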
1996 Bayesian Learning for Neural Networks
PhD thesis

The original treatment of Bayesian neural networks.

PDF

Reinforcement Learning

1957 Dynamic Programming
Princeton University Press

The Bellman equation. The foundation of dynamic programming and RL.

PDF
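
A minimal sketch of value iteration on a made-up two-state problem, just to show the Bellman optimality recursion V(s) = max_a [ r(s,a) + γ·V(s') ] reaching its fixed point. The toy MDP, rewards, and discount are illustrative.

    gamma = 0.9
    # transitions[state][action] = (reward, next_state); deterministic for brevity
    transitions = {
        0: {"stay": (0.0, 0), "go": (1.0, 1)},
        1: {"stay": (2.0, 1), "go": (0.0, 0)},
    }
    V = {0: 0.0, 1: 0.0}
    for _ in range(200):
        V = {s: max(r + gamma * V[s_next] for r, s_next in actions.values())
             for s, actions in transitions.items()}
    print(V)  # the fixed point of the Bellman optimality equation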
1959 Some Studies in Machine Learning Using the Game of Checkers
IBM Journal of Research and Development

Coined the term "machine learning."

PDF
1989 Learning from Delayed Rewards
PhD thesis, University of Cambridge

Q-learning.

PDF
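
The one-line update at the heart of the thesis, sketched in isolation; the state names and parameters below are illustrative.

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
        # Q-learning: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')
        target = r + gamma * max(Q[s_next].values())
        Q[s][a] += alpha * (target - Q[s][a])

    Q = {"s0": {"left": 0.0, "right": 0.0},
         "s1": {"left": 0.0, "right": 0.0}}
    q_update(Q, "s0", "right", r=1.0, s_next="s1")
    print(Q["s0"]["right"])  # 0.1 after a single observed transition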
1991 Curious Model-Building Control Systems
IJCNN

Intrinsic motivation, curiosity-driven learning. Predates the curiosity-driven exploration literature in deep RL by decades.

PDF

Information Coding

1957 Information Theory and Statistical Mechanics
Physical Review

Maximum entropy. The bridge between Shannon's information theory and statistical inference.

1961 Possible Principles Underlying the Transformations of Sensory Messages
Sensory Communication

Efficient coding hypothesis, redundancy reduction. The intellectual ancestor of autoencoders, sparse coding, and non-contrastive self-supervised learning.

PDF
1991 Elements of Information Theory
Wiley

The textbook that taught ML how to use information theory. KL divergence, rate-distortion, channel capacity.

1997 Low-Complexity Art
Leonardo, Journal of the International Society for the Arts, Sciences and Technology

Guess what else Schmidhuber said he anticipated in 1997? Hint: it starts with A.

PDF
1999 The Information Bottleneck Method
Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing

The information bottleneck. Learning as compression formalized.

PDF
2013 Efficient Estimation of Word Representations in Vector Space
arXiv

Word2Vec. King minus man plus woman equals queen.

PDF
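
The arithmetic the quip refers to, sketched with fabricated 3-d vectors purely to show the mechanics; a real demo would load pretrained embeddings instead.

    import numpy as np

    # tiny made-up embedding table; real word2vec vectors have hundreds of dimensions
    emb = {
        "king":   np.array([0.80, 0.90, 0.10]),
        "man":    np.array([0.70, 0.10, 0.10]),
        "woman":  np.array([0.70, 0.10, 0.90]),
        "queen":  np.array([0.80, 0.90, 0.90]),
        "prince": np.array([0.75, 0.85, 0.15]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    query = emb["king"] - emb["man"] + emb["woman"]
    candidates = [w for w in emb if w not in {"king", "man", "woman"}]
    print(max(candidates, key=lambda w: cosine(emb[w], query)))  # queen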

Critiques

1965 Alchemy and Artificial Intelligence
RAND

The first philosophical critique of AI.

PDF
1973 Artificial Intelligence: A General Survey
UK Science Research Council

"In no part of the field have the discoveries made so far produced the major impact that was then promised."

PDF
1980 Minds, Brains, and Programs
Behavioral and Brain Sciences

The Chinese Room. The famous argument against strong AI.

PDF
1991 Intelligence Without Representation
Artificial Intelligence

The embodied cognition manifesto. Looks increasingly prescient.

PDF
2014 Superintelligence: Paths, Dangers, Strategies
Oxford University Press

The book that made the world take AI risk seriously.

2021 On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
FAccT

Stochastic parrots. The paper that pushed the AI ethics debate into the mainstream and cost an author her job at Google.

PDF

Proposals and Founding Documents

1955 A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence
Dartmouth College

A two-page funding application that named the field of artificial intelligence.

PDF
1959 Pandemonium: A Paradigm for Learning
Mechanisation of Thought Processes, HMSO London

Parallel feature detection. A conceptual ancestor of the convolutional network.

PDF
1975 Adaptation in Natural and Artificial Systems
University of Michigan Press

The founding document of evolutionary computation.

1988 Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
Morgan Kaufmann

Graphical models. Probabilistic reasoning made tractable.

Blog Posts and Informal Publications

2015 The Unreasonable Effectiveness of Recurrent Neural Networks
Blog post

Taught a lot of people how RNNs work.

2015 Understanding LSTM Networks
Blog post

Neural network interpretability as visual explanation. The seed of Distill.

2018 The Illustrated Transformer
Blog post

Much of the world learned how transformers work this way.

2019 The Bitter Lesson
Blog post

The theoretical justification for the scaling hypothesis.

2020 Scaling Laws for Neural Language Models
arXiv

The empirical confirmation of the scaling hypothesis.

PDF