At semanti.ca, we are heavily using machine learning to build our data extraction and classification solutions. We are constantly looking for new use cases for machine learning that could make the life of our clients, as well as our employees, better. However, many tasks still require human attention and could hardly be automated.
In this guide, we have gathered a collection of practical problems that can currently be solved using machine learning. We also outline the properties of problems that most likely cannot be solved because of inherent limitations of the modern scientific and engineering approach to AI.
What Machine Learning Can Do
Machine learning can solve almost any problem that:
- you can formulate as \(y = f(x)\) and
- you can gather a sufficiently big collection of pairs in the form \((x, y)\).
We agree it's a very vague definition. However, in practice, before you try you don't know if a certain problem can be solved using machine learning.
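As a minimal illustration of the \(y = f(x)\) formulation, the sketch below recovers a function from a handful of \((x, y)\) pairs. It assumes the simplest possible case, a straight line fitted by least squares; the function name `fit_line` and the three data points are made up for this example.

```python
def fit_line(pairs):
    """Estimate slope and intercept of y = slope * x + intercept by least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    variance = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training pairs (x, y), generated by y = 2x + 1.
pairs = [(1, 3), (2, 5), (3, 7)]
slope, intercept = fit_line(pairs)

# The learned f can now be applied to an x that was never seen in training.
print(slope * 4 + intercept)  # 9.0
```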
For example, nobody could predict in advance that one can train an accurate classifier of text by topic by transforming text documents into high-dimensional vectors \(x\), where each dimension \(x_i\) equals either \(1\) or \(0\) depending on whether a specific dictionary word \(i\) is present in the text document (\(1\)) or absent (\(0\)).
In this problem formulation, \(x\) represents words that can be found in the text document, but the order of words in the text is completely lost. Scientists just decided to try and it worked! They could have failed. But this specific problem happened to be solvable using machine learning when formulated that way.
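The binary encoding described above can be sketched in a few lines. The five-word dictionary and the helper name `bag_of_words` are assumptions made for this example; a real dictionary would contain tens of thousands of words.

```python
def bag_of_words(text, vocabulary):
    """Return a binary vector: 1 if the vocabulary word occurs in text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Hypothetical five-word dictionary, chosen only for illustration.
vocabulary = ["ball", "election", "goal", "team", "vote"]

x1 = bag_of_words("The team scored a goal", vocabulary)
x2 = bag_of_words("A goal the team scored", vocabulary)

print(x1)        # [0, 0, 1, 1, 0]
print(x1 == x2)  # True: the order of words is lost
```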
Or, who could tell a priori that, by looking at small patches of an image's pixels, one patch at a time, a machine trained on millions of images would become capable of telling with near certainty what is depicted in each image?
This is how a Convolutional Neural Network "looks" at images. Source: Machine Learning Guru.
Here are examples of other applications where machine learning excels:
\(x\): a vector of numbers where each \(x_i\) is a number between 0 and 255 representing the red, green, or blue channel of a specific pixel of the image. \(y\): a vector of real values between 0 and 1, where each dimension represents the probability that the image belongs to a specific class. For example, if our problem is to classify an image as being either an object, an animal or something else, then \(y\) could be a vector like this: \([0.2, 0.7, 0.1]\) where \(0.2\) is the probability of \(x\) being an object, \(0.7\) is the probability of \(x\) being an animal and \(0.1\) is the probability of \(x\) being something else. Usually, a single class is assigned to an input by taking the most probable class predicted by the machine learning algorithm (in our case it's "animal").
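Picking the most probable class from the output vector is a one-line operation. A small sketch using the probabilities from the example above; the helper name `most_probable_class` is ours.

```python
def most_probable_class(probabilities, class_names):
    """Return the name of the class with the highest predicted probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best_index]


class_names = ["object", "animal", "something else"]
y = [0.2, 0.7, 0.1]

print(most_probable_class(y, class_names))  # animal
```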
\(x\): a high-dimensional vector representing one word of the vocabulary. For example, if in our dictionary the word at position 127 is "car" then \(x = [0,0,0,...,1,0...0]\) where \(x_i = 1\) if \(i = 127\) and \(x_i = 0\) if \(i \neq 127\). Such an encoding of words is called "one-hot". \(y\): a low-dimensional vector of real numbers (usually the dimensionality of \(y\) is much lower than that of \(x\) and lies between \(100\) and \(500\)). \(y\) has the following property: if two words \(x\) and \(x'\) have a similar meaning, then the corresponding real-valued vectors \(y\) and \(y'\) have similar values. In a pair \(x, y\), vector \(y\) is called an embedding vector of the word \(x\) or simply a word embedding.
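To make the relation between the two representations concrete, here is a toy sketch. It assumes a three-word vocabulary and hand-picked 2-dimensional embeddings (real embeddings are learned from data, not written by hand), and it uses cosine similarity as one common way to compare vectors. Multiplying a one-hot vector by the embedding matrix simply selects one row of that matrix.

```python
import math


def one_hot(index, size):
    """Return a vector of zeros with a single 1 at position index."""
    vector = [0] * size
    vector[index] = 1
    return vector


def embed(x, embedding_matrix):
    """Multiply the one-hot vector x by the matrix; this selects one row."""
    dimensions = len(embedding_matrix[0])
    return [
        sum(x[i] * embedding_matrix[i][d] for i in range(len(x)))
        for d in range(dimensions)
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: close to 1 for similar directions."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


# Hypothetical vocabulary and made-up embedding values, for illustration only.
vocabulary = ["car", "automobile", "banana"]
embedding_matrix = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

car = embed(one_hot(0, 3), embedding_matrix)
automobile = embed(one_hot(1, 3), embedding_matrix)
banana = embed(one_hot(2, 3), embedding_matrix)

# Words with similar meaning get more similar vectors.
print(cosine_similarity(car, automobile) > cosine_similarity(car, banana))  # True
```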
We have already mentioned the problem of classification of text using the bag-of-words representation of the input text as the vector \(x\): each vocabulary word \(i\) is either present or absent in the input text, so \(x_i\) is either \(1\) or \(0\). Bag-of-words has the property that the order of words is lost and similarity in the meaning between words is not reflected either.
Using a simple combination of the word embeddings of every word in a document, such as the average or the sum of the word embedding vectors, one can encode the input document as a low-dimensional vector \(x\) that takes into account the meaning of every word: this can improve the classification quality. The order of words is still lost though.
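A sketch of the averaging approach, assuming a tiny dictionary of made-up 2-dimensional embeddings (the values and the helper name `average_embedding` are hypothetical):

```python
def average_embedding(words, embeddings):
    """Average the embedding vectors of all known words in the document."""
    vectors = [embeddings[word] for word in words if word in embeddings]
    if not vectors:
        raise ValueError("no word of the document has a known embedding")
    dimensions = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dimensions)]


# Made-up embeddings; real ones would have hundreds of dimensions.
embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

print(average_embedding(["good", "movie"], embeddings))  # [0.5, 0.5]
print(average_embedding(["movie", "good"], embeddings))  # [0.5, 0.5]
```

Both calls return the same vector, which shows that the order of words is indeed lost.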
Recurrent or Convolutional Neural Networks can also take into account the order of words in the document, so state-of-the-art classification results are obtained by using word embeddings together with neural networks.
Text can be classified by either:
- Topic (ex: art, engineering, sport, etc.), or
- Sentiment (ex: positive, neutral or negative), or
- Writing style (ex: fiction, blog post, news article, press release, court decision, etc.)
Sequence Labeling / Segmentation
Machine learning excels at segmenting sequences or matrices into regions of related elements.
- Word Sequence
Assign to every word in a sequence a label. For example, if the problem is to assign part of speech to every word in a sentence then input could look like ["I", "like", "to", "sing", "."] (each word is represented as either a one-hot vector or as an embedding) and the output would be ["personal pronoun", "verb", "preposition", "verb", "punctuation mark"] (each part of speech is represented as a vector of probabilities assigned to each possible part of speech).
Another example of sequence labeling is the problem of named entity extraction. The input can look like this: ["Mark", "Zuckerberg", "went", "to", "San", "Francisco", "."] and the output would be ["name", "name", "other", "other", "location", "location", "other"].
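Once every word has a label, the named entities can be read off by merging consecutive words that share the same label. A minimal sketch using the example above; the helper name `extract_entities` is ours. Note that this simple merging cannot separate two different entities of the same type that stand next to each other, which is why practical systems often use richer labeling schemes that mark where an entity begins.

```python
from itertools import groupby


def extract_entities(tokens, labels, ignore="other"):
    """Merge runs of identically labeled tokens into (label, text) pairs."""
    entities = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        if label != ignore:
            entities.append((label, " ".join(token for token, _ in group)))
    return entities


tokens = ["Mark", "Zuckerberg", "went", "to", "San", "Francisco", "."]
labels = ["name", "name", "other", "other", "location", "location", "other"]

print(extract_entities(tokens, labels))
# [('name', 'Mark Zuckerberg'), ('location', 'San Francisco')]
```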
- Sound Sequence
Similarly to labeling/segmenting a sequence of words and punctuation marks, a sequence of sound frequencies can be labeled/segmented. For example, in the Speech-to-Text problem, the input is a sequence of sound frequencies in Hertz (one can cut a continuous flow of frequencies into small patches where the frequencies within one patch are averaged) and the output is the character of the alphabet corresponding to each patch. Some characters can span multiple patches: for example, the input frequencies are ["85", "86", "112", "118", "114", "66"] and the labels are ["c", "c", "a", "a", "a", "r"] ("car").
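Turning the per-patch labels into a word then amounts to collapsing runs of repeated characters. A naive sketch of that last step (the helper name `collapse_repeats` is ours):

```python
from itertools import groupby


def collapse_repeats(labels):
    """Keep one character per run of identical consecutive labels."""
    return "".join(label for label, _ in groupby(labels))


print(collapse_repeats(["c", "c", "a", "a", "a", "r"]))  # car
```

This naive version would also collapse genuine double letters (the labels for "book" would come out as "bok"). Practical systems avoid this, for example by letting the model emit a special "blank" label between repeated characters.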
Videos are sequences of sound frequencies and images. Labels, such as "people talking", "car moving", "people dancing", "animals in nature" can be assigned to every frame of the video by a machine learning algorithm, based on sound frequencies and values of RGB-channels of pixels of every frame.
In image segmentation, the input is the same as in the task of image classification, but the output is a vector that assigns a label to every pixel (so the dimensionality of the input and the output is the same). Each pixel can get a label like "sky", "person", "animal", "car", "tree", etc.:
An example of image segmentation. Source: Vladlen Koltun.
All the above examples of sequence labeling/segmentation have one thing in common: the length of the output is always the same as the length of the input. For a long time, this was a major constraint that limited the application of machine learning to sequential data.
Recent advancements in artificial neural networks removed this constraint. Now output sequences can have a different length from input sequences. The practical problems that are solved using this kind of machine learning are:
- Machine translation (state-of-the-art systems reach the performance comparable to that of a human translator)
- Text to speech (the quality of speech generated is currently almost indistinguishable from real human speech)
- Text summarization (as of 2018, this is still an active research area)
- Sentence paraphrasing (as of 2018, this is still an active research area)
The principle of functioning of sequence-to-sequence algorithms is as follows:
1. Encoding: the neural network sequentially "reads" the input, for example, one word embedding at a time, and combines each read vector with previously read vectors to form a so-called state vector that represents the whole sequence.
2. Decoding start: the neural network reads a special "start-of-sequence" element in the form of a vector, combines this vector with the state vector to obtain an updated state vector, and generates the output vector that represents the first element of the output sequence.
3. Decoding loop: the neural network takes the output element generated on the previous decoding iteration, combines this vector with the current state vector to obtain an updated state vector, and, using the updated state vector, generates the output vector that represents the next element of the output sequence.
4. Step 3 is repeated until a special "end-of-sequence" element is generated as output.
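The control flow of the encoding, decoding start and decoding loop described above can be sketched without any neural network. Everything in this example is a stand-in: the "state" is a plain tuple rather than a learned vector, and the hypothetical `TOY_PHRASES` table replaces what a trained network would have learned. Only the structure of the loop is meant to carry over.

```python
START, END = "<s>", "</s>"

# Hypothetical phrase table standing in for the knowledge of a trained network.
TOY_PHRASES = {"thank": ["merci"], "you": [], "very": ["beaucoup"], "much": []}


def encode(tokens):
    """Encoding: read the input one element at a time, folding it into the state."""
    state = ()
    for token in tokens:
        state = state + tuple(TOY_PHRASES.get(token, [token]))
    return state


def decode_step(state, previous_output):
    """Return (updated_state, output). A real decoder would also use
    previous_output; this toy version ignores it."""
    if not state:
        return state, END
    return state[1:], state[0]


def translate(tokens, max_length=50):
    state = encode(tokens)                     # encoding
    state, output = decode_step(state, START)  # decoding start
    outputs = []
    # Decoding loop: repeat until the end-of-sequence element is generated.
    while output != END and len(outputs) < max_length:
        outputs.append(output)
        state, output = decode_step(state, output)
    return outputs


print(translate(["thank", "you", "very", "much"]))  # ['merci', 'beaucoup']
```

The input has four elements and the output has two, which is exactly the freedom that sequence labeling lacks.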
An example of sequence-to-sequence transformation. Source: Google.
Image to Text
A famous example of image-to-text is the problem of generating captions for an image. An algorithm takes an image as input and outputs a sequence of words that describe the image. Training data is easily available on the Web. The principle behind an image-to-text machine learning algorithm is similar to that of sequence-to-sequence transformation. The only significant difference is in the encoding step (Step 1): an input image (represented, as usual, as a vector of RGB-channels of image pixels) is transformed into a state vector by "scanning" the image. The scanning is done by sequentially multiplying patches of the image by a collection of matrices. The matrices are called "convolutions", and the types of artificial neural networks that use them are called Convolutional Neural Networks. We have already seen an example of convolution on a picture above.
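The "scanning" can be sketched as follows for a single-channel image and a single matrix. This is a simplified illustration under our own assumptions (no padding, a step of one pixel, hand-picked matrix values); in a real Convolutional Neural Network the matrix values are learned during training.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position, multiply the patch
    by the kernel element by element and sum the products."""
    kernel_height, kernel_width = len(kernel), len(kernel[0])
    out_height = len(image) - kernel_height + 1
    out_width = len(image[0]) - kernel_width + 1
    return [
        [
            sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(kernel_height)
                for j in range(kernel_width)
            )
            for col in range(out_width)
        ]
        for row in range(out_height)
    ]


# A tiny grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
# Hand-picked matrix that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

print(convolve2d(image, kernel))  # [[0, 510, 0], [0, 510, 0]]
```

The output is large only where the patch contains the edge, so the result is a map of where that feature occurs in the image.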
Image generation is the problem of transforming a random number into a picture, such that the picture looks like one made by a human (by using a camera or a pencil). Usually, such a problem is solved by creating two independent neural networks. The first neural network takes a random input and generates a matrix of pixels (represented as RGB-channels). The second neural network looks at a random picture and tries to recognize whether the picture is real (that is, made by a human) or artificial (that is, generated by the first neural network). When the second neural network recognizes the artificial image, it gets a positive signal, while the first neural network gets a negative signal. Those signals are used to update neural network parameters so that each of the two neural networks gets better after every round of such a competitive game. The neural networks that are trained this way are called Generative Adversarial Networks, or GANs.
Music generation is similar to image generation. The only significant difference is that the first neural network generates sequences of sound frequencies while the second one is capable of reading those sequences and classifying them as either music or not.
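The positive and negative signals are usually expressed as loss values that each network tries to make small. The sketch below assumes one common formulation based on binary cross-entropy (the text above does not prescribe a specific one). `d_real` and `d_fake` are hypothetical names for the probability, estimated by the second network, that a real picture and a generated picture, respectively, are real.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Low when real pictures are scored near 1 and generated ones near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Low when the second network is fooled, i.e. scores the fake near 1."""
    return -math.log(d_fake)


# The second network catches the fake: good for it, bad for the first network.
print(discriminator_loss(0.9, 0.1) < discriminator_loss(0.9, 0.9))  # True
print(generator_loss(0.1) > generator_loss(0.9))                    # True
```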
Text to Image
Text-to-image is the inverse of the previous, image-to-text problem. Here, from the sequence-to-sequence algorithm, we only keep the encoding part, while the decoding part is similar to the image generation part from GANs: it takes the state vector corresponding to the input text and produces the matrix of pixels represented as RGB-channels. Again, training examples are relatively easily obtainable from web pages.
Image Style Transfer
Because neural networks have multiple layers, each layer representing some features of the input, by analyzing each layer of the trained neural network, the machine learning engineer can understand which layer represents which kind of features:
Features learned by different layers of a neural network. Source: ResearchGate.
While training a neural network on pictures by various artists to generate similar images, one can observe that some layers of the neural network are responsible for the style: types of strokes, color palette, geometry, etc. By combining these layers with layers of another neural network, for example, one trained on photographic images, one can transform photos into paintings. This can be done by applying the style of various painters to the layers of the "photographic" neural network responsible for representing the factual information about the input photo: the objects it contains, their form and position:
Artistic style transfer. Source: Priyanka Mandikal.
What Machine Learning Cannot Do
The state of the art in machine learning is rapidly changing. Thousands of machine learning papers, especially those that study or experiment with artificial neural networks, are published every year. Problems that a computer could not solve five years ago are solved today at a superhuman level. For example, the game of Go was previously considered too complex to be solved by a machine. Today, an algorithm trained only by playing against itself (without any human assistance and without looking at games played by humans) plays at a superhuman level: it won 100 games out of 100 against the earlier program that had defeated one of the best human players.
However, today the limits of what's possible "to machine learn" are becoming clear.
Data wrangling is the process of transforming and mapping data from "raw" form into a format appropriate for use by a machine. We already mentioned that machine learning algorithms expect feature vectors as input, so in practice this means transforming raw data into feature vectors "consumable" by the machine. Activities such as extracting the data from its location (for example, from a web page or a hospital archive), cleaning, normalizing, filling in the missing parts, and removing noise and outliers are considered impossible for a machine to do on its own (unless the machine really understands its purpose).
A common property of all machine learning use cases described in the previous sections of this post is that the trained algorithm is applied to input that always comes in the same format and in the form of examples that don't differ much from the examples used to train the algorithm. Should the dimensionality of the input change, or some noise not present during the training appear in the production setting, the trained algorithm almost always becomes completely useless. Again, without the machine gaining consciousness and understanding its purpose, it's impossible for the machine to notice the changed setting and adapt:
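Each individual wrangling step is easy to program once a human has decided what should be done. The sketch below shows two such steps on a made-up column of ages. The choices themselves (fill gaps with the median, rescale to the range from 0 to 1) are our assumptions for this example, and making such choices is precisely the part that requires understanding the data.

```python
from statistics import median


def fill_missing(values):
    """Replace missing entries (None) with the median of the known values."""
    known = [value for value in values if value is not None]
    middle = median(known)
    return [middle if value is None else value for value in values]


def min_max_scale(values):
    """Rescale the values linearly so that they fall between 0 and 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Hypothetical raw column with one missing value.
ages = [20, None, 40, 30]

filled = fill_missing(ages)
print(filled)                 # [20, 30, 40, 30]
print(min_max_scale(filled))  # [0.0, 0.5, 1.0, 0.5]
```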
These limitations mean that a lot of automation will prove more elusive than AI hyperbolists imagine. “A self-driving car can drive millions of miles, but it will eventually encounter something new for which it has no experience,” explains Pedro Domingos, the author of The Master Algorithm and a professor of computer science at the University of Washington. “Or consider robot control: A robot can learn to pick up a bottle, but if it has to pick up a cup, it starts from scratch.” In January, Facebook abandoned M, a text-based virtual assistant that used humans to supplement and train a deep learning system, but never offered useful suggestions or employed language naturally. — Wired
Any machine learning algorithm needs a mathematical function that measures how bad a prediction about the input is (this function is called the loss function). According to the current machine learning paradigm, this function is defined by a human based on their goals. This function does not necessarily reflect what the human really needs; rather, it is a proxy that the machine can easily use. For example, neural networks require that this function be differentiable. Without a deep understanding of what the practical problem is, defining such a function could be very difficult or even impossible. So, in many cases, the machine learning engineer has to be a good subject matter expert.
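A small sketch of the difference between the goal and its proxy, assuming a binary classification task. What the human cares about is whether the prediction is right or wrong (the "zero-one" loss), but that function is flat almost everywhere and gives the learning algorithm no direction to improve in. Cross-entropy is a commonly used differentiable proxy; the function names are ours.

```python
import math


def zero_one_loss(p_true_class, threshold=0.5):
    """What we really care about: 0 if the prediction is right, 1 if wrong."""
    return 0.0 if p_true_class > threshold else 1.0


def cross_entropy_loss(p_true_class):
    """Differentiable proxy: shrinks smoothly as the model grows more confident
    in the correct class."""
    return -math.log(p_true_class)


# p is the probability the model assigns to the correct class.
for p in (0.51, 0.99):
    print(p, zero_one_loss(p), cross_entropy_loss(p))
```

Both predictions are "correct" according to the zero-one loss, yet only the proxy distinguishes the barely correct prediction from the confident one.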
Learning from Observing
Machines cannot learn a language by conversing with people. Humans can do that (all babies do this). However, scientists do not yet know how this works in the human brain. So, after two years spent in your home, your Siri or Google Home doesn't become better at speaking with you. It only improves when machine learning engineers program additional use cases and gather additional training data.
Similarly, the machine cannot learn to perform a task by observing a human doing the task. The machine can learn to repeat the movements, but it cannot learn the purpose. So, again, should the setting in production change a little bit, the movements learned by the machine become useless.
Long-Term Dependencies
Machine learning excels in reflex-like tasks: "if this then that". It's also good at tasks that have short-term dependencies: in translation, for example, the dependencies between words in one sentence can be effectively learned given enough data. However, today it seems completely impossible to learn dependencies between the actions of a character in a book. The quantity of data and the depth of the neural network needed to learn that would be so huge that no computer, not even the very best modern supercomputer, would succeed.
Other Hard Problems
Below are several problems where some progress exists, but it doesn't get close to the human-level performance and it's not clear today whether it will ever get any closer:
- Sarcasm detection (many humans fail at that too)
- Text summarization (see Long-Term Dependencies above)
- Sentence comprehension: given two sentences, do they say the same thing?
- Learning from very few examples. Today, the state-of-the-art machine learning systems are trained on millions of examples. Humans can learn the same from a handful of examples. There's no clear idea how to bridge the gap.
- Producing emotions, such as love, empathy or anger.
- Computer programming.
- Explainability of the output. The machines can be very good at solving some problems, but incapable of explaining why a specific output is a correct solution to the given problem.
Modern artificial intelligence is powered by machine learning. In many practical cases, especially in perceptive tasks such as classification of images, sounds or texts, machine learning today performs comparably to or even better than humans. However, trained machine learning models are rigid and incapable of adapting to a changed environment. They depend on goals set by humans in the form of mathematical equations, and they are incapable of learning long-term relationships in data and of explaining their output. Without changing the machine learning paradigm, most of these limitations today seem insoluble.