We, at semanti.ca, use neural networks a lot. However, they are not easy to explain to someone who is just starting in Machine Learning. If you struggle to find a simple-to-follow introduction to neural networks, here's one for you.

In simple terms, a neural network is a complex mathematical function:

$$y = f(x)$$

In the above equation, \(x\) is the input you give to the neural network, \(y\) is the output which is usually called "prediction", and \(f\) is the function, the network itself.

However, \(f\) is not an arbitrary function. It's a nested function. You have probably heard of neural network **layers**. So, for a 4-layer neural network, \(f\) will look like this:

$$y = f(x) = f_4(f_3(f_2(f_1(x)))),$$

where \(f_1\), \(f_2\), \(f_3\), and \(f_4\) are simple functions like this:

$$f_i(z) = nonlinear_i(a_i z + b_i),$$

where \(i\) is called the layer index and runs from 1 to the number of layers. The function \(nonlinear_i\) is a fixed mathematical function chosen by the neural network designer (a human); it doesn't change once chosen. The coefficients \(a_i\) and \(b_i\), by contrast, are learned for every \(i\) using an optimization algorithm called **gradient descent**.
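This nesting of simple layer functions can be sketched in plain Python. The sketch below is a minimal illustration with scalar \(a_i\) and \(b_i\) and tanh as the nonlinearity; real networks use matrices and vectors, and the coefficient values here are arbitrary placeholders, not learned ones:

```python
import math

def layer(z, a, b):
    # f_i(z) = nonlinear_i(a_i * z + b_i), here with tanh as the nonlinearity
    return math.tanh(a * z + b)

def f(x, params):
    # a 4-layer network is the nested function f_4(f_3(f_2(f_1(x))))
    for a, b in params:
        x = layer(x, a, b)
    return x

# four (a_i, b_i) pairs, one per layer
y = f(0.5, [(1.0, 0.0), (2.0, -1.0), (0.5, 0.3), (1.5, 0.1)])
```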

The gradient descent algorithm finds the best values for all \(a_i\) and \(b_i\) for all layers \(i\) at once. What is considered "best" is also defined by the neural network designer by choosing the **loss function**. The latter is also a fixed function that doesn't change.

For example, the designer can decide that the best values of those coefficients are the ones for which the loss, given by the loss function, is equal to zero. The loss function can be as simple as this:

$$loss(f(x), y) = (y - f(x))^2.$$

To train the neural network, the designer has to prepare the training data. The training data is a collection of pairs \((x, y)\), where \(x\) is what the neural network should take as input (can be a scalar or a vector) and \(y\) is what the neural network has to output (usually a scalar, but can be a vector too).

For example, \(x\) represents the text of an email message and \(y\) is equal to \(1\) when the message is spam, and to \(0\) otherwise. In this case, \(x\) is a vector that somehow encodes the text. One way to encode text as a vector is called bag-of-words.

## Bag-of-Words

Let's say you want to encode all text messages in English as vectors that your neural network can take as input. What you can do is take the English dictionary, sort the words in alphabetical order, and assign to each word an index: the word "a" will get the index 0, the word "aaron" will get the index 1, the word "abandoned" will get the index 2, and so on, up to the last word "zulu" that will have the index \(25762\) (let's pretend there are just \(25762\) words in the English dictionary).

To convert a text \(A\) into a vector \(x\) as a bag-of-words, you first fix the dimensionality of \(x\) to be \(25762\) (i.e., the cardinality of your English dictionary). Then you fill all dimensions of \(x\) with zeroes. Finally, for every word in \(A\), you look up its index in the dictionary and you fill the dimension in \(x\) corresponding to this index with a \(1\).
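A minimal sketch of this encoding, using a toy four-word dictionary instead of the full \(25762\)-word one:

```python
vocabulary = sorted(["a", "aaron", "abandoned", "zulu"])  # toy dictionary
index = {word: i for i, word in enumerate(vocabulary)}

def bag_of_words(text):
    # start with a zero vector whose dimensionality is the dictionary's cardinality
    x = [0] * len(vocabulary)
    for word in text.lower().split():
        if word in index:
            x[index[word]] = 1  # set the dimension of every word present in the text
    return x

print(bag_of_words("a zulu"))  # → [1, 0, 0, 1]
```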

Now all your texts are converted to vectors of the same dimensionality (\(25762\)) but different texts have ones and zeroes in different dimensions.

## Nonlinearities

As we mentioned, our neural network's complex function \(f\) consists of several nonlinear functions \(nonlinear_i\) (also called **activation functions**), and we have one \(nonlinear_i\) for every layer \(i\). The neural network designer is free to choose any mathematical function, as long as it's differentiable. The main purpose of having nonlinear components in the function \(f\) is to allow the neural network to approximate functions of any form. Without nonlinearities, \(f\) would always remain linear, because \(a z + b\) is a linear function and a linear function of a linear function is itself linear.

Popular choices of nonlinear functions are **Sigmoid** (also known as the logistic function), **TanH**, and **ReLU**.

## Training

Now that you have a training set (a collection of pairs \((x, y)\)) and a loss function, you can start training.

The learning algorithm, called gradient descent, will first assign random values to all \(a_i\) and \(b_i\). Then the algorithm will compute the value of \(f(x) = f_4(f_3(f_2(f_1(x))))\) for each \(x\). Then it will compute the loss using the loss function:

$$loss(f(x), y) = (y - f(x))^2.$$

Then the learning algorithm will iteratively update all \(a_i\) and \(b_i\) so that the average loss (i.e., the loss averaged over all pairs \((x, y)\) in the training data) is minimized. At each iteration, small steps are used to update \(a_i\) and \(b_i\). These update steps are given by the gradient of the loss function. This is why it was important that the nonlinearity functions be differentiable: when \(f\) is differentiable, the loss is differentiable as well. The gradients are computed with respect to every \(a_i\) and \(b_i\).

The training stops when the loss becomes zero or when a fixed number of gradient descent iterations (also called "epochs") is reached.
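The whole loop can be sketched for the simplest possible "network": a single linear layer \(f(x) = a x + b\) with no nonlinearity, trained with the squared loss from above. This is a toy illustration, not the full backpropagation algorithm, and the toy dataset below is invented for the example:

```python
# toy training data generated from y = 2x + 1
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0]]

a, b = 0.0, 0.0   # initial values (a real implementation starts from random ones)
lr = 0.05         # the size of the small update step

for epoch in range(2000):
    grad_a = grad_b = 0.0
    for x, y in data:
        err = (a * x + b) - y     # the gradient of (y - f(x))^2 w.r.t. a is 2*err*x
        grad_a += 2 * err * x
        grad_b += 2 * err         # and w.r.t. b it is 2*err
    # take a small step against the average gradient
    a -= lr * grad_a / len(data)
    b -= lr * grad_b / len(data)

# after training, a is close to 2 and b is close to 1
```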

## Improvements

There are multiple ways our neural network model can be improved.

First of all, for our binary classification problem (spam/not spam), a better loss function can be used. Usually, the logistic loss function is used:

$$loss(f(x), y)={\frac{1}{\ln 2}}\ln(1+e^{{-yf(x)}}).$$

The computation of the gradient using the whole training dataset can be very resource-consuming and slow, so the **stochastic gradient descent** algorithm is used to train almost all neural network models. In stochastic gradient descent, only a small fraction of the training set is used to compute the gradient. This small fraction of the training set is called a **batch**. One epoch then consists of updating \(a_i\) and \(b_i\) multiple times, once for every batch. The size of the batch, usually a value between \(64\) and \(1024\), is specified by the network designer.

There are multiple improvements over the classical stochastic gradient descent algorithm. In practice, neural network designers use such algorithms as Adam, RMSprop, Momentum and several others. The choice of the algorithm is also made by the neural network designer before the training starts.

In this guide, we only presented the so-called *feedforward* neural network. It is a good choice for classification problems where the dimensions of the input vector \(x\) are independent of one another. If there are local dependencies between dimensions, as in images (neighboring pixels are often similar) or sounds (neighboring frequencies are often similar), then such neural network architectures as convolutional neural networks and recurrent neural networks are used.

If you are interested in building your first neural network, we, at semanti.ca, recommend starting with high-level neural network programming frameworks, such as Keras or Gluon. These frameworks are easy to use, well documented, and contain multiple examples for your inspiration.

This is how you define a feedforward neural network with three layers in Keras:

```
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(64, input_dim=2000, activation="relu"))
model.add(Dense(64, activation="relu"))
model.add(Dense(2))
model.add(Activation("softmax"))
```

In the above example, Layer \(1\) takes a vector of dimensionality \(2000\) (the dimensionality of the input \(x\)) and outputs a vector of dimensionality \(64\). Layer \(2\) takes the input of dimensionality \(64\) (the output of Layer \(1\)) and outputs a vector of dimensionality \(64\). Layer \(3\) takes the input of dimensionality \(64\) and outputs a vector of dimensionality \(2\). The first two layers' activation functions (nonlinearities) are the same, ReLU. The activation function of the third layer is softmax:

$$\sigma (z)_{j}={\frac {e^{z_{j}}}{\sum _{k=1}^{K}e^{z_{k}}}}\textrm{ for }j = 1, \ldots, K,$$

where \(K\) is the number of classes of the classification problem.
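The softmax formula above translates directly into code. A minimal sketch in plain Python (the max-subtraction is a standard trick that doesn't change the result but avoids overflow in \(e^{z_j}\)):

```python
import math

def softmax(z):
    # subtract the max for numerical stability; the output is unchanged
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# the outputs are positive and sum to 1, like probabilities
probs = softmax([1.0, 2.0, 3.0])
```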

Softmax acts similarly to Sigmoid: it outputs values that look like probabilities. However, softmax is applicable when there are more than two classes in the classification problem.

To continue learning about neural networks, we recommend the following books:

To learn the neural networks theory, we recommend the Deep Learning book (which can be legally downloaded online). To learn the practice, with Python and Keras, we recommend the book Deep Learning with Python.

If you are interested in Natural Language Processing, then these two books are important to read: Foundations of Statistical Natural Language Processing for the theory and Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit for the practice (with Python and NLTK).

Read our previous post "How to Start in Machine Learning and Data Science" or subscribe to our RSS feed.

*Found a mistyping or an inconsistency in the text? Let us know and we will improve it.*