## Foreword

Let's start by telling the truth: machines don't learn. What a typical "learning machine" does is find a mathematical formula that, when applied to a collection of inputs (called "training data"), produces the desired outputs. This formula also produces the correct outputs for most other inputs (distinct from the training data), on the condition that those inputs *come from the same statistical distribution* as the one that generated the training data.

Why isn't that learning? Because if you distort the input slightly, the output can become completely wrong. This is not how learning works: if you learned to play a video game by looking straight at the screen, you will still be a good player if someone rotates the screen slightly. A machine learning model, unless it was trained to recognize rotation, will fail to apply a skill learned on a straight screen to a rotated version of the input.

So why is it called "machine learning"? There are several reasons, but the most important one is marketing. Like many important things we use today, the term "machine learning" came out of IBM. Arthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term in 1959 while at IBM. Just as IBM today markets the term "cognitive computing" to distinguish itself from the competition, in the sixties the new term "machine learning" sounded cool and fresh and attracted clients and talented employees.

Now, for the sake of simplicity, let's stick to the term "machine learning" and pretend that machines actually learn. How do they do that?

## Machine Learning

The machine learning process starts with gathering training data. Training data is a collection of (input, output) pairs. Inputs can be anything, for example, email messages or pictures. Outputs are usually labels (e.g., "spam", "not_spam", "cat", "dog"). In some cases, labels are vectors (e.g., the four coordinates of a rectangle around a person in a picture), sequences (e.g., ["adjective", "adjective", "noun"] for the input "big beautiful car"), or have some other structure (e.g., a tree or a graph).

Now that we have training data, for example, 10,000 email messages, each labeled either "spam" or "not_spam", we have to transform each email into a **feature vector**. Don't worry, a feature vector is just a list of numbers that represents one email in our collection.

The machine learning engineer decides, based on their experience, how to convert a real-world entity, such as an email message, into a list of numbers. One frequent way to convert a text into a list of numbers is to take a dictionary of English words (let's say it contains 20,000 words) and decide that in our list:

- the first element is equal to 1 if the email message contains the word "a", otherwise this element is 0;
- the second element is equal to 1 if the email message contains the word "aaron", otherwise this element is 0;
- ...
- the element at position 20,000 is equal to 1 if the email message contains the word "zulu", otherwise this element is 0.

We repeat the above procedure for every email message in our collection, which gives us 10,000 feature vectors, each vector having the dimensionality of 20,000 and a label (spam/not_spam).
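The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the five-word dictionary, the two sample emails, the function names, and the simple whitespace tokenization are all assumptions made for the example.

```python
def build_vocabulary(words):
    """Map each dictionary word to a fixed position in the feature vector."""
    return {word: index for index, word in enumerate(sorted(set(words)))}


def to_feature_vector(text, vocabulary):
    """Return a list of 0/1 values: 1 if the word at that position occurs in the text."""
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        token = token.strip(".,!?;:\"'()")  # crude punctuation removal (assumption)
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector


# Hypothetical five-word dictionary standing in for the 20,000-word one.
vocabulary = build_vocabulary(["a", "aaron", "free", "money", "zulu"])

# Hypothetical emails standing in for the 10,000-message collection.
emails = ["Free money for a friend!", "Aaron sent a report."]
feature_vectors = [to_feature_vector(email, vocabulary) for email in emails]

for email, vector in zip(emails, feature_vectors):
    print(vector, "<-", email)
```

Each vector has as many elements as the dictionary has words, so with the real dictionary every email would become a list of 20,000 zeros and ones.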

Now we have machine-readable input data, but the output labels are still human-readable text. Different machine learning algorithms require transforming labels into numbers in different ways. The algorithm we will use to illustrate machine learning in this post is called Support Vector Machine (SVM). This algorithm requires that the positive label (in our case, "spam") has the numeric value of +1 (one) and the negative label ("not_spam") has the value of -1 (minus one).
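As a small sketch of this label conversion (the mapping name and the sample labels are made up for the example):

```python
# Numeric encoding described in the text: positive class +1, negative class -1.
LABEL_TO_NUMBER = {"spam": 1, "not_spam": -1}


def encode_labels(labels):
    """Convert human-readable labels into the +1/-1 values the SVM expects."""
    return [LABEL_TO_NUMBER[label] for label in labels]


encoded = encode_labels(["spam", "not_spam", "spam"])
print(encoded)
```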

The SVM algorithm sees every feature vector as a point in a high-dimensional space (in our case, the space is 20,000-dimensional). The algorithm places all points on an imaginary 20,000-dimensional plot and draws an imaginary line (strictly speaking, a hyperplane) that separates positive examples from negative ones.

The equation of this line is given by two parameters: a real-valued vector \(w\) of the same dimensionality as our input feature vector \(x\), and a real number \(b\). Writing \(wx\) for the dot product \(w_1x_1 + w_2x_2 + \ldots\), the equation is:

$$ wx - b = 0 $$

Now, the classification for some input feature vector is given like this:

$$ y = sign(wx - b) $$

\(sign\) is a mathematical operation that takes any value as input and returns +1 if the input value is a positive number or -1 if the input value is a negative number.

Therefore, to predict whether an email message is spam or not using an SVM model, we have to take the text of the message, convert it into a feature vector, then multiply this vector by \(w\), subtract \(b\), and take the sign of the result. This gives us the prediction (+1 means "spam", -1 means "not_spam").
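The prediction step can be sketched as follows. The values of `w` and `b` below are invented for illustration and were not produced by any training run; the feature vectors assume a hypothetical five-word dictionary. The text does not say what \(sign\) returns for an input of exactly zero, so this sketch arbitrarily maps zero to -1.

```python
def dot(w, x):
    """Dot product of two equally long lists of numbers."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def predict(w, b, x):
    """Return +1 ("spam") or -1 ("not_spam") for feature vector x."""
    # Assumption: a score of exactly 0 is treated as the negative class.
    return 1 if dot(w, x) - b > 0 else -1


# Hypothetical model parameters, one weight per dictionary word.
w = [0.5, -1.0, 2.0, 1.5, 0.0]
b = 1.0

first_prediction = predict(w, b, [1, 0, 1, 1, 0])   # score: 4.0 - 1.0 = 3.0
second_prediction = predict(w, b, [1, 1, 0, 0, 0])  # score: -0.5 - 1.0 = -1.5
print(first_prediction, second_prediction)
```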

Now, how does the machine find the values of \(w\) and \(b\)? It solves an optimization problem. Machines are good at optimizing functions under constraints.

So what are the constraints we want to satisfy? First of all, we want the model to correctly predict the labels of our 10,000 examples. Each example \(i = 1..10000\) is given by a pair \((x_i, y_i)\), where \(x_i\) is the feature vector of example \(i\) and \(y_i\) is its label, which takes the value either -1 or +1. So the constraints are naturally:

- \(wx_i-b\geq 1\) if \(y_i = +1\), and
- \(wx_i-b\leq -1\) if \(y_i = -1\)

We are also interested in having the line that separates positive examples from negative ones lie as far as possible from both groups of points in the multi-dimensional space, so that it sits halfway between them. To achieve that, we need to make \(w\) small. Because \(w\) is a vector, what we actually minimize is its norm \(\|w\|\): the smaller the norm, the wider the gap between the two groups.
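A short derivation, assuming the usual Euclidean norm, shows how the norm of \(w\) relates to the distance between the two groups (the symbols \(x_+\), \(x_-\) and \(t\) are introduced here for illustration only). Take a point \(x_-\) on the hyperplane \(wx - b = -1\) and move along the direction of \(w\) by a distance \(t\) until reaching a point \(x_+\) on the hyperplane \(wx - b = 1\):

$$ x_+ = x_- + t\frac{w}{\|w\|} $$

$$ wx_+ - b = (wx_- - b) + t\frac{ww}{\|w\|} = -1 + t\|w\| = 1 \quad\Rightarrow\quad t = \frac{2}{\|w\|} $$

So the distance between the two hyperplanes is \(2/\|w\|\), and making \(\|w\|\) as small as the constraints allow makes this distance as large as possible, with the line \(wx - b = 0\) lying exactly in the middle.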

So, the optimization problem that we ask the machine to solve looks like this:

Minimize \(\|w\|\) subject to \(y_i(wx_i-b)\geq 1\) for \(i=1,\,\ldots ,\,n\), where \(n\) is the number of training examples.
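To make the problem statement concrete, the sketch below checks the constraints for a few candidate pairs \((w, b)\) on a tiny two-dimensional training set and compares their norms. The data points and the candidates are hypothetical, and the code only evaluates candidates; it does not search for the best one.

```python
import math


def dot(w, x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def norm(w):
    """Euclidean norm ||w||."""
    return math.sqrt(dot(w, w))


def satisfies_constraints(w, b, examples):
    """True if y_i * (w.x_i - b) >= 1 holds for every (x_i, y_i) pair."""
    return all(y * (dot(w, x) - b) >= 1 for x, y in examples)


# Hypothetical toy training set: (feature vector, label) pairs.
examples = [((2, 2), +1), ((3, 3), +1), ((0, 0), -1), ((-1, 0), -1)]

# Hypothetical candidates, each a pair (w, b).
candidates = {
    "small norm": ((0.5, 0.5), 1.0),
    "large norm": ((1.0, 1.0), 2.0),
    "too small": ((0.25, 0.25), 0.5),
}

for name, (w, b) in candidates.items():
    print(name, "feasible:", satisfies_constraints(w, b, examples),
          "norm:", round(norm(w), 3))
```

Both of the first two candidates satisfy every constraint, so the optimization problem prefers the one with the smaller norm. The third has an even smaller norm but violates a constraint, so it is not an admissible solution.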

The solution of this optimization problem, given by \(w\) and \(b\), is called the **statistical model**.
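The text does not say which solver the machine uses. As a rough sketch, one common substitute is subgradient descent on the soft-margin (hinge loss) objective, which approximates the constrained problem above rather than solving it exactly. The function name, the toy data, and the values of the learning rate, regularization strength, and epoch count are all assumptions chosen for this example.

```python
def dot(w, x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def train_linear_svm(examples, learning_rate=0.05, regularization=0.01, epochs=2000):
    """Subgradient descent on
    regularization / 2 * ||w||^2 + mean(max(0, 1 - y_i * (w.x_i - b)))."""
    n = len(examples)
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        # The regularization term pulls w toward a smaller norm.
        grad_w = [regularization * w_j for w_j in w]
        grad_b = 0.0
        for x, y in examples:
            if y * (dot(w, x) - b) < 1:  # constraint violated: hinge loss is active
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b += y / n
        w = [w_j - learning_rate * g_j for w_j, g_j in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy training set: (feature vector, label) pairs.
examples = [((2, 2), +1), ((3, 3), +1), ((0, 0), -1), ((-1, 0), -1)]

w, b = train_linear_svm(examples)
predictions = [1 if dot(w, x) - b > 0 else -1 for x, _ in examples]
print("w =", w, "b =", b)
print("predictions:", predictions)
```

The pair `(w, b)` returned here plays the role of the statistical model: it is all that is needed to classify a new feature vector.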

For two-dimensional feature vectors, the problem and the solution can be visualized as in the plot below (taken from Wikipedia):

In the illustration above, the dark circles are positive examples, the white circles are negative examples, and the line given by \(wx - b = 0\) is the so-called **decision boundary**.

## Why machine learning works for new data

Why is a machine-learned statistical model capable of predicting the labels of previously unseen examples? To understand that, look at the above plot. It is **much more likely** that a new negative example will be located somewhere not far from the other negative examples. The same goes for a new positive example: it will most likely be somewhere around the other positive examples. So our decision boundary will still separate them well from one another. In other, less likely situations, our model will make errors, but because those situations are unlikely, the number of errors will be small.
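This argument can be illustrated with a small simulation. Everything in it is an assumption made for the example: the decision boundary is hand-picked rather than trained, and the cluster centers, spread, and sample sizes are arbitrary. New examples drawn around the same clusters are almost always classified correctly, while examples whose clusters have moved (that is, inputs from a different statistical distribution) are not.

```python
import random


def dot(w, x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def predict(w, b, x):
    return 1 if dot(w, x) - b > 0 else -1


def sample(rng, center, label, count, spread=1.0):
    """Draw `count` two-dimensional points scattered around `center`."""
    return [((rng.gauss(center[0], spread), rng.gauss(center[1], spread)), label)
            for _ in range(count)]


def error_rate(w, b, examples):
    wrong = sum(1 for x, y in examples if predict(w, b, x) != y)
    return wrong / len(examples)


rng = random.Random(0)

# Hypothetical model: the boundary x1 + x2 = 0 between the two clusters.
w, b = (1.0, 1.0), 0.0

# New examples scattered around the same two cluster centers.
same = sample(rng, (3, 3), +1, 500) + sample(rng, (-3, -3), -1, 500)

# New examples whose clusters have both moved: a different distribution.
shifted = sample(rng, (-1, -1), +1, 500) + sample(rng, (-7, -7), -1, 500)

same_error = error_rate(w, b, same)
shifted_error = error_rate(w, b, shifted)
print("error on same distribution:", same_error)
print("error on shifted distribution:", shifted_error)
```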

That's it. Now you have an answer to the question of how machines learn. Algorithms other than SVM have different internal mechanics, but all of them find a decision boundary in one way or another.
