The year was 1936. Alan Turing asked a simple question: can machines think? Actually, no, that's not right. What he really asked was something way more boring: can every mathematical problem be solved by an algorithm? Surprisingly, he proved the answer is no. But in the process, he accidentally invented the computer. Then 12 years later, in 1948, another legend named Claude Shannon shows up and reduces all human communication down to ones and zeros, casually inventing the bit like it was no big deal. One thing led to another, and now in 2026, we have 18-year-olds in hoodies typing import torch into Python files and cashing billion-dollar checks from venture capitalist boomers. But reaching this
point has been underpinned by a century-long chain reaction of computer science papers written mostly by dead people much smarter than us. In today's video, we'll look at 10 of the most important scientific papers in the history of computer science and how they changed the world for better or worse.

Our story begins nearly a century ago when mathematician David Hilbert asked the field's biggest flex of a question: is there a universal algorithm that can decide whether any mathematical statement is true? Or in other words, can we automate math itself? He called this the Entscheidungsproblem, which is German for "decision problem." In 1936, Alan Turing comes around and gives a brutal answer to this question: no. But
in order to prove it, he wrote this paper, "On Computable Numbers," which had to define what an algorithm even is. And so he imagined a hypothetical machine with an infinite tape, a read/write head, and a tiny table of rules. This Turing machine is the abstract blueprint for every computing device you've ever owned. With the machine defined, he asked whether it could solve the halting problem: can you write a program that looks at any other program and tells you if it'll finish running or loop forever? Turing proved that it's impossible for a program like this to exist. It simply leads to a logical contradiction, which means math has problems that no algorithm can solve. That's annoying.

But 12 years later, a guy named Claude Shannon would
ask his own annoying question: what is information, as a thing you can measure? In his paper, "A Mathematical Theory of Communication," he rips the meaning out of normal words entirely. "I love you" and "the cat is on fire" carry the same information if they're equally surprising. And he measures that surprise in a unit called the bit. He proved that all information could ultimately be boiled down to a stream of ones and zeros. But here's the crazy part. To estimate how much information was needed to transmit a message, he borrowed a word from thermodynamics nobody understands: entropy. To estimate the entropy of English, Shannon made people guess the next letter in a sentence. When a letter is easy to guess, it has
low entropy. When a letter is hard to guess, it has high entropy. But wait a minute. Guessing the next token is exactly what AI does today, just on a much bigger scale. Shannon wasn't trying to build artificial intelligence, but he gave us the math for uncertainty, prediction, and compression, and accidentally wrote the spiritual ancestor to the loss function. And that's widely believed to be why Anthropic named their AI model Claude.

Then 10 years later at Cornell, a psychologist, not a computer scientist, builds the first machine that actually learns. He gets inspired by the way neurons work in the brain. So he designs a thing called a perceptron that takes inputs, weighs them, and then adjusts those weights
when it's wrong until it can classify patterns on its own. It's the building block for modern neural networks, and the hype is immediate and unhinged. The Navy funds it, and the New York Times reports that the computer will soon be conscious. But 11 years later, the hype would die out completely, thanks to two haters at MIT who published another paper with a completely different vibe. With basic math, they prove that a single-layer perceptron can't even learn exclusive or (XOR), which is just trivial logic that means this or that, but not both. This paper, or technically a book, was essentially a death certificate for AI at the time. Funding evaporated, and deep neural networks entered their first
AI winter. But there was a twist buried in the fine print. They actually figured out that stacking layers of perceptrons fixes everything. The only problem is that back then, nobody knew how to train a stack of perceptrons. It would take another 17 years to figure it out.

But first, we need to talk about "Time, Clocks, and the Ordering of Events in a Distributed System" by Leslie Lamport, because neural networks are useless unless you can run them on a massive scale. This paper realized that separate computers with no shared clock can't really have a universal "now." And that's a big problem when you have multiple computers in a distributed system trying to do things in order.
Well, he figured out a way to fix this with the happened-before relation. You stop trusting the wall clock time and order events by causality instead. If A could have caused B, A comes first. From that, he builds logical clocks, which allow an unlimited number of machines to agree on the order of events without ever looking at a real clock. Eventually, this paper would become the bedrock for every database, every blockchain, and every massive AI training run, because you need thousands of GPUs that constantly stay in sync and agree on state without dissolving into chaos. That was a game changer.

But 17 years after neural networks were left for dead, three researchers, including the godfather Geoffrey Hinton, answered the question
that everyone gave up on: how do you train a stack of layers?

But before we answer that, we need to quickly talk about Coder, who was cool enough to sponsor this 10-minute video on esoteric computer science papers. They provide self-hosted development environments that let you work with multiple agents in parallel and with enterprise-level security. And they just launched Coder Agents, a chat interface and API for delegating coding jobs to agents running on your own infrastructure. It's the only architecture that lets organizations self-host both the agent workflow and the development environments where the code is actually executed. This gives teams greater control over source code access, agent
execution, governance, and security boundaries. It's also model agnostic, so you can connect any LLM you want and switch between them with just a config change. Coder Agents are designed for teams in regulated industries who need to self-host their AI workflows with complete control, and they're already used by dozens of financial institutions and government organizations. You can check it out at the link below.

Now, back to the question: how do you train a stack of layers? The answer is backpropagation. Run your data forward, measure how wrong the output is, and then push that error backward through every layer using the chain rule from calculus to nudge each weight in the
direction that's a little less wrong. Do that a few million times and the network teaches itself. The crazy discovery, though, is that the middle hidden layers started inventing their own features: edges, shapes, and concepts that nobody programmed in. That exclusive or problem that was impossible 17 years earlier? It just became trivial. Backpropagation is still essential to neural networks today, but back then they sucked because we didn't have enough data or compute.

Well, that was about to change in 1998 with the rise of the internet and this famous paper from Larry and Sergey, "The Anatomy of a Large-Scale Hypertextual Web Search Engine." The paper describes the PageRank algorithm, where instead of ranking a web page by how often a word appears, it treats every link as a vote, and each vote is weighted by how trustworthy the voter is. They built a prototype in their dorm room, which eventually became a company called Google that you may have heard of. Most importantly though, this algorithm helped assemble the largest structured pile of human text ever created. And that massive pile of text would eventually become the training data, or feedstock, for future AI models.

We'd finally see this in action in 2012 with a legendary ImageNet paper. It was created by a dream team of Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Remember when I said backpropagation needs data and compute?
Well, the stars finally aligned. The data set is called ImageNet, and it's a monster data set of millions of hand-labeled photos. The compute is a couple of Nvidia consumer-grade gaming GPUs. A grad student named Alex wires up a deep convolutional neural network, names it AlexNet, and trains it in his bedroom. Then he walks it into the annual ImageNet contest and humiliates everyone. This is a contest where AI models try to classify objects in an image, like hot dog or not hot dog. And while everyone was fighting over a fraction of a percent, AlexNet walked in and dropped the error rate by 10 points in a single year. And this freaked everyone out because it was suddenly clear that deep learning
actually works. It just needs more data, more compute, and the right architecture.

Luckily, we would get that architecture a few years later thanks to Ashish Vaswani and Google in the paper "Attention Is All You Need." Around this time, large language models had a huge problem. They would start a sentence and by the end they would forget what they were even talking about. That's because they would read and predict tokens sequentially, one after the other. This paper fixed that by introducing a new architecture called the transformer that throws out sequential reading entirely.
Instead, it lets every word look at every other word at once and decide what's relevant. Not only does this make large language models feel more intelligent, but transformers also scale better. Google made the big mistake of giving this architecture away for free, and now every AI lab uses it. That's where you get the T in ChatGPT.

Speaking of which, that brings us to a paper released by OpenAI in 2020, "Language Models are Few-Shot Learners." Basically, OpenAI takes the transformer and then asks the dumbest question possible: what if we just make it enormous? Not two times bigger, but scale it to 175 billion parameters and feed it the entire internet as a data
set. They made a crazy bet that intelligence isn't some secret algorithm we're missing, but rather something that simply emerges once you cross a threshold of scale. The end result was GPT-3, the model that ignited the current AI bubble that we're living through right now. What's crazy is that all of a sudden, this model could translate, summarize, and write code without ever being specifically told how to do these things. At such a large scale, it learned how to generalize on the fly. Two years later, this work would evolve into ChatGPT, which is now a trillion-dollar product. But when you think about it, what is ChatGPT even doing? Well, it's just predicting the next word or token, just like Claude Shannon was doing in 1948.

So, here's
the TL;DR for the last 100 years. Alan Turing defined the machine. Claude Shannon gave it currency. Rosenblatt gave it a neuron. Geoffrey Hinton taught it how to learn. Google gave it data and an architecture. And OpenAI just turned the dial to the maximum. This has been the history of artificial intelligence in 10 scientific papers. Thanks for watching and I will see you in the next