1
00:00:00,000 --> 00:00:03,475
How does a machine learn to recognize handwritten digits

2
00:00:03,575 --> 00:00:06,668
— turning a blurry sketch into a confident answer?

3
00:00:06,783 --> 00:00:10,019
Start with an image. A twenty-eight by twenty-eight grid of pixels, each holding a

4
00:00:10,119 --> 00:00:14,867
brightness value between zero and one. Seven hundred and eighty-four numbers.

5
00:00:14,983 --> 00:00:18,617
Stretch them into a column. This is the input layer of a neural

6
00:00:18,717 --> 00:00:22,707
network — every pixel becomes a neuron, lit up by its own activation.
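
In code, that is one flattening step. A minimal sketch, assuming NumPy and a twenty-eight by twenty-eight grayscale image already scaled to the zero-to-one range; the "image" array below is a random stand-in, not a real scan:

    import numpy as np

    image = np.random.rand(28, 28)      # stand-in for a scanned digit
    activations = image.reshape(784)    # stretch the grid into one column
    print(activations.shape)            # (784,)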

7
00:00:22,816 --> 00:00:26,904
Then come hidden layers, neurons that don't see the image directly,

8
00:00:27,004 --> 00:00:30,592
only the layers before them. And finally an output layer of

9
00:00:30,692 --> 00:00:33,780
ten neurons, one for each digit, zero through nine.

10
00:00:33,883 --> 00:00:37,137
Every neuron in one layer connects to every neuron in

11
00:00:37,237 --> 00:00:40,934
the next. Each connection carries a weight. Big weights mean

12
00:00:41,034 --> 00:00:45,111
strong influence. Small weights mean almost nothing flows through.

13
00:00:45,216 --> 00:00:49,321
A neuron computes the weighted sum of everything feeding in,

14
00:00:49,421 --> 00:00:53,338
plus a bias. We squash that number through a non-linear function

15
00:00:53,438 --> 00:00:56,852
— making the neuron either quiet or confidently active.
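
A single neuron's rule is small enough to write out. A minimal sketch, assuming the classic sigmoid as the squashing function; the names here are illustrative, not from the video:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))    # squashes any number into (0, 1)

    inputs = np.random.rand(784)       # activations feeding in
    weights = np.random.randn(784)     # one weight per connection
    bias = np.random.randn()
    activation = sigmoid(weights @ inputs + bias)   # quiet near 0, active near 1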

16
00:00:56,966 --> 00:01:01,936
Feed the pixels in. The signal cascades layer by layer, transformed by thousands

17
00:01:02,036 --> 00:01:06,119
of weights and squashed at every step, until at the end one output

18
00:01:06,219 --> 00:01:09,922
neuron lights up — the network's guess of what digit it saw.
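
The full cascade is that same rule applied layer after layer. A sketch, assuming an architecture of 784 inputs, two hidden layers of 16 neurons, and 10 outputs, with sigmoid activations and random placeholder weights:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    sizes = [784, 16, 16, 10]                     # assumed layer widths
    weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [np.random.randn(m) for m in sizes[1:]]

    def feedforward(a):
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)                # weighted sum, bias, squash
        return a

    output = feedforward(np.random.rand(784))
    print(output.argmax())                        # the brightest output neuron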

19
00:01:10,033 --> 00:01:13,709
But at first, the weights are random — chosen with no thought

20
00:01:13,809 --> 00:01:17,485
behind them. The guess is noise. The network has no idea what

21
00:01:17,585 --> 00:01:20,085
a three looks like, or a seven, or a four.

22
00:01:20,200 --> 00:01:24,239
So we measure the damage. Compare the guess to the truth, square each

23
00:01:24,339 --> 00:01:29,534
difference, add them up, and average across thousands of training examples. This single number — the

24
00:01:29,634 --> 00:01:32,700
cost — is what we want to make as small as possible.
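
That cost is one line of arithmetic. A sketch, where "outputs" stands for the network's guesses and "targets" for the one-hot truth vectors; both are random placeholders here:

    import numpy as np

    def cost(outputs, targets):
        # per example: sum the squared differences; then average over examples
        return np.mean(np.sum((outputs - targets) ** 2, axis=1))

    outputs = np.random.rand(1000, 10)                     # placeholder guesses
    targets = np.eye(10)[np.random.randint(0, 10, 1000)]   # one-hot truths
    print(cost(outputs, targets))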

25
00:01:32,816 --> 00:01:35,578
Imagine the cost as a landscape, one axis for

26
00:01:35,678 --> 00:01:38,695
every weight in the network. We find the steepest

27
00:01:38,795 --> 00:01:41,620
direction down, and take a tiny step that way.
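
Here is that downhill walk in miniature, on a toy cost with a single weight, estimating the slope numerically; real training computes the same slopes far more efficiently:

    def cost_of(w):
        guess = w * 0.5              # a toy "network": one input, one weight
        return (guess - 1.0) ** 2    # squared distance from the truth

    w = 0.0                          # arbitrary starting point
    eps = 1e-6
    for _ in range(1000):
        # nudge the weight both ways to estimate the slope of the landscape
        slope = (cost_of(w + eps) - cost_of(w - eps)) / (2 * eps)
        w -= 0.1 * slope             # a tiny step in the downhill direction
    print(w, cost_of(w))             # w approaches 2.0, the cost approaches 0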

28
00:01:41,733 --> 00:01:45,246
Repeat this again and again, across thousands of training

29
00:01:45,346 --> 00:01:49,295
examples, nudging weights in microscopic steps. Each example tugs

30
00:01:49,395 --> 00:01:52,409
the network just a little. Slowly, the cost falls.

31
00:01:52,509 --> 00:01:55,337
Slowly, the noise resolves into something else.
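
The whole loop, guess, measure, nudge, repeat, fits in a short sketch. This one trains a single sigmoid layer on random placeholder data, with the gradient written out by hand; real digit training would swap in actual images and labels and push gradients back through every layer:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((1000, 784))                  # placeholder "images"
    Y = np.eye(10)[rng.integers(0, 10, 1000)]    # placeholder one-hot labels

    W = rng.standard_normal((784, 10)) * 0.01    # random start: the guess is noise
    b = np.zeros(10)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(500):
        A = sigmoid(X @ W + b)                          # forward: every guess
        cost = np.mean(np.sum((A - Y) ** 2, axis=1))    # measure the damage
        dZ = 2.0 * (A - Y) * A * (1.0 - A) / len(X)     # slope of cost w.r.t. sums
        W -= 0.05 * (X.T @ dZ)                          # each example tugs a little
        b -= 0.05 * dZ.sum(axis=0)
    print(cost)                                         # slowly, the cost falls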

32
00:01:55,450 --> 00:01:59,143
The hidden layers start to detect edges, then loops, then the

33
00:01:59,243 --> 00:02:02,999
strokes that make up digits. Patterns no one programmed — only

34
00:02:03,099 --> 00:02:06,606
learned, distilled out of the data, one example at a time.

35
00:02:06,716 --> 00:02:08,800
This is a neural network. Layers of

36
00:02:08,900 --> 00:02:11,608
weighted sums, sliding downhill toward truth.

