Notable Works

Influential works not published at venues the anthology already indexes.

50 works spanning 1763–2021.

The Foundations

1763 An Essay Towards Solving a Problem in the Doctrine of Chances
Philosophical Transactions of the Royal Society

The foundation of everything probabilistic in ML.

PDF
1942 Runaround
Astounding Science Fiction

The Three Laws of Robotics. "After 'Runaround' appeared in the March 1942 issue of Astounding, I never stopped thinking about how minds might work." — Marvin Minsky

1943 A Logical Calculus of the Ideas Immanent in Nervous Activity
Bulletin of Mathematical Biophysics

The first mathematical model of a neuron. Everything starts here.

PDF
1948 A Mathematical Theory of Communication
Bell System Technical Journal

Entropy, mutual information, channel capacity. Every loss function in deep learning is downstream of this.

PDF
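
As a quick illustration of the central quantity, here is Shannon entropy in a few lines of Python (the toy distributions are made up for illustration):

    import math

    def entropy_bits(p):
        # H(p) = -sum_i p_i * log2(p_i): the average information per symbol, in bits
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    print(entropy_bits([0.5, 0.5]))  # 1.0 bit: a fair coin is maximally unpredictable
    print(entropy_bits([0.9, 0.1]))  # ~0.47 bits: a biased coin carries less information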
1950 Computing Machinery and Intelligence
Mind

The imitation game and the question of whether machines can think.

PDF
1961 Steps Toward Artificial Intelligence
Proceedings of the IRE

The call to arms for a generation of AI researchers. Search, pattern recognition, learning, planning, induction.

PDF

Neural Computation

1949 The Organization of Behavior: A Neuropsychological Theory
Wiley, New York

"Neurons that fire together wire together." The origin of associative learning.

PDF
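
A minimal sketch of the Hebbian rule the slogan summarizes: a connection strengthens in proportion to the joint activity of the units it links. The learning rate and toy activity trace below are illustrative, not from the book.

    def hebb_update(w, pre, post, lr=0.1):
        # "fire together, wire together": the weight grows only when both units are active
        return w + lr * pre * post

    w = 0.0
    for pre, post in [(1, 1), (1, 1), (0, 1), (1, 0)]:
        w = hebb_update(w, pre, post)
    print(w)  # 0.2: only the two co-active steps strengthened the connection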
1958 The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain
Psychological Review

The idea of neural networks as computation.

PDF
1962 Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex
Journal of Physiology

The neuroscience that inspired convolutional networks.

PDF
1969 Perceptrons: An Introduction to Computational Geometry
MIT Press

The book that killed connectionism for a decade. The XOR problem.

PDF
1980 Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position
Biological Cybernetics

The origin of convolutional neural networks.

PDF
1982 Neural Networks and Physical Systems with Emergent Collective Computational Abilities
Proceedings of the National Academy of Sciences

Brought physicists into neural networks and led to Boltzmann machines.

PDF
1986 Learning Representations by Back-Propagating Errors
Nature

Backpropagation. You know what this is.

PDF
1998 Gradient-Based Learning Applied to Document Recognition
Proceedings of the IEEE

The LeNet paper.

PDF
2006 A Fast Learning Algorithm for Deep Belief Nets
Neural Computation

Deep belief nets. The paper that ended the second AI winter.

PDF

Statistical Learning Theory

1951 A Stochastic Approximation Method
Annals of Mathematical Statistics

SGD. You also know what this is.

PDF
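
A minimal sketch of the stochastic approximation idea in its familiar SGD form: step against a noisy estimate of the gradient and still settle at the optimum. Robbins and Monro use a decreasing step size; the constant step, quadratic objective, and noise level here are illustrative simplifications.

    import random

    theta, lr = 5.0, 0.05
    for _ in range(2000):
        noisy_grad = 2 * (theta - 1.0) + random.gauss(0, 0.5)  # noisy gradient of (theta - 1)^2
        theta -= lr * noisy_grad
    print(theta)  # hovers near the true minimum at 1.0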
1964 A Formal Theory of Inductive Inference, Part I
Information and Control

Algorithmic probability, compression and intelligence.

PDF
1971 On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities
Theory of Probability and Its Applications

VC dimension. The theoretical foundation of statistical learning.

PDF
1977 Maximum Likelihood from Incomplete Data via the EM Algorithm
Journal of the Royal Statistical Society, Series B

The EM algorithm. Cited constantly across all of machine learning.

PDF
1978 Modeling by Shortest Data Description
Automatica

Minimum description length. The formal link between compression and learning.

1984 A Theory of the Learnable
Communications of the ACM

PAC learning. The origin of computational learning theory.

PDF
1989 Approximation by Superpositions of a Sigmoidal Function
Mathematics of Control, Signals and Systems

The universal approximation theorem; why neural networks work at all.

PDF
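
For reference, the theorem in its usual modern statement (a paraphrase, not a quotation from the paper): for any continuous f on the unit cube and any tolerance, some single-hidden-layer network with a sigmoidal activation approximates f uniformly,

    \forall \varepsilon > 0 \;\; \exists N, \alpha_i, w_i, b_i : \quad
    \sup_{x \in [0,1]^n} \Big| f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma(w_i^\top x + b_i) \Big| < \varepsilon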
1996 Bayesian Learning for Neural Networks
PhD thesis

The original treatment of Bayesian neural networks.

PDF

Reinforcement Learning

1957 Dynamic Programming
Princeton University Press

The Bellman equation. The foundation of dynamic programming and RL.

PDF
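
A minimal sketch of value iteration on a made-up two-state problem, just to show the Bellman optimality recursion V(s) = max_a [ r(s,a) + γ·V(s') ] reaching its fixed point. The toy MDP, rewards, and discount are illustrative.

    gamma = 0.9
    # transitions[state][action] = (reward, next_state); deterministic for brevity
    transitions = {
        0: {"stay": (0.0, 0), "go": (1.0, 1)},
        1: {"stay": (2.0, 1), "go": (0.0, 0)},
    }
    V = {0: 0.0, 1: 0.0}
    for _ in range(200):
        V = {s: max(r + gamma * V[s_next] for r, s_next in actions.values())
             for s, actions in transitions.items()}
    print(V)  # the fixed point of the Bellman optimality equation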
1959 Some Studies in Machine Learning Using the Game of Checkers
IBM Journal of Research and Development

Coined the term "machine learning."

PDF
1989 Learning from Delayed Rewards
PhD thesis, University of Cambridge

Q-learning.

PDF
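
The one-line update at the heart of the thesis, sketched in isolation; the state names and parameters below are illustrative.

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
        # Q-learning: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')
        target = r + gamma * max(Q[s_next].values())
        Q[s][a] += alpha * (target - Q[s][a])

    Q = {"s0": {"left": 0.0, "right": 0.0},
         "s1": {"left": 0.0, "right": 0.0}}
    q_update(Q, "s0", "right", r=1.0, s_next="s1")
    print(Q["s0"]["right"])  # 0.1 after a single observed transition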
1991 Curious Model-Building Control Systems
IJCNN

Intrinsic motivation, curiosity-driven learning. Predates the curiosity-driven exploration literature in deep RL by decades.

PDF

Information Coding

1957 Information Theory and Statistical Mechanics
Physical Review

Maximum entropy. The bridge between Shannon's information theory and statistical inference.

1961 Possible Principles Underlying the Transformations of Sensory Messages
Sensory Communication

Efficient coding hypothesis, redundancy reduction. The intellectual ancestor of autoencoders, sparse coding, and non-contrastive self-supervised learning.

PDF
1991 Elements of Information Theory
Wiley

The textbook that taught ML how to use information theory. KL divergence, rate-distortion, channel capacity.

1997 Low-Complexity Art
Leonardo, Journal of the International Society for the Arts, Sciences and Technology

Guess what else Schmidhuber said he anticipated in 1997? Hint: it starts with A.

PDF
1999 The Information Bottleneck Method
Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing

The information bottleneck. Learning as compression formalized.

PDF
2013 Efficient Estimation of Word Representations in Vector Space
arXiv

Word2Vec. King minus man plus woman equals queen.

PDF
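
The arithmetic the quip refers to, sketched with fabricated 3-d vectors purely to show the mechanics; a real demo would load pretrained embeddings instead.

    import numpy as np

    # tiny made-up embedding table; real word2vec vectors have hundreds of dimensions
    emb = {
        "king":   np.array([0.80, 0.90, 0.10]),
        "man":    np.array([0.70, 0.10, 0.10]),
        "woman":  np.array([0.70, 0.10, 0.90]),
        "queen":  np.array([0.80, 0.90, 0.90]),
        "prince": np.array([0.75, 0.85, 0.15]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    query = emb["king"] - emb["man"] + emb["woman"]
    candidates = [w for w in emb if w not in {"king", "man", "woman"}]
    print(max(candidates, key=lambda w: cosine(emb[w], query)))  # queen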

Critiques

1965 Alchemy and Artificial Intelligence
RAND

The first philosophical critique of AI.

PDF
1973 Artificial Intelligence: A General Survey
UK Science Research Council

"In no part of the field have the discoveries made so far produced the major impact that was then promised."

PDF
1980 Minds, Brains, and Programs
Behavioral and Brain Sciences

The Chinese Room. The famous argument against strong AI.

PDF
1991 Intelligence Without Representation
Artificial Intelligence

The embodied cognition manifesto. Looks increasingly prescient.

PDF
2014 Superintelligence: Paths, Dangers, Strategies
Oxford University Press

The book that made the world take AI risk seriously.

2021 On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
FAccT

Stochastic parrots. The paper that pushed the AI ethics debate into the mainstream and cost an author her job at Google.

PDF

Proposals and Founding Documents

1955 A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence
Dartmouth College

A two-page funding application that named the field of artificial intelligence.

PDF
1959 Pandemonium: A Paradigm for Learning
Mechanisation of Thought Processes, HMSO London

Parallel feature detection. A conceptual ancestor of the convolutional network.

PDF
1975 Adaptation in Natural and Artificial Systems
University of Michigan Press

The founding document of evolutionary computation.

1988 Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
Morgan Kaufmann

Graphical models. Probabilistic reasoning made tractable.

Blog Posts and Informal Publications

2015 The Unreasonable Effectiveness of Recurrent Neural Networks
Blog post

Taught a lot of people how RNNs work.

2015 Understanding LSTM Networks
Blog post

Neural network interpretability as visual explanation. The seed of Distill.

2018 The Illustrated Transformer
Blog post

Much of the world learned how transformers work this way.

2019 The Bitter Lesson
Blog post

The theoretical justification for the scaling hypothesis.

2020 Scaling Laws for Neural Language Models
arXiv

The empirical confirmation of the scaling hypothesis.

PDF