The Most Important Machine Learning Algorithms

Pros and cons of the top 11 algorithms every machine learning engineer should know

Posted by Tobias on 03-09-2018

There are plenty of machine learning algorithms. Some of them are very useful for solving some very specific problems. Others could be applied to solve a wide range of tasks.

In this post, we list the algorithms most frequently used by machine learning engineers at as well as those that often win Kaggle competitions.

Classification and Regression

Gradient Boosting Machine

Gradient Boosting Machine is currently (as of 2018) the most popular algorithm among Kagglers and most winning teams used Gradient Boosting Machine in their solutions.


  • Can handle huge datasets (millions of examples and millions of dimensions);
  • Extremely accurate;
  • Can be used for both classification and regression tasks;
  • Lots of flexibility with the choice of loss functions: can be tailored to the characteristics of the problem.


  • Can be slow at training, since the trees are built sequentially;
  • Model is a black box (as for all ensemble methods);
  • Prone to overfitting (however hyperparameter tuning helps).

Random Forest

Random Forests are still frequently used by the machine learning practitioners. They provide good accuracy and speed of training and can handle big datasets.


  • Much easier to tune than Gradient Boosting Machine (two hyperparameters vs three;
  • Almost always perform as well as or better than SVMs;
  • Deal well with uneven datasets that have missing variables;
  • Rarely overfits.


  • Slow at prediction time so less appropriate for high-speed data processing;
  • If data includes categorical variables with different numbers of levels, Random Forest is biased in favor of those with more levels. Therefore, the variable importance scores from Random Forest are not reliable.

Support Vector Machine

Support Vector Machines is a mature and well-studied machine learning algorithm, with a solid theoretical foundation. It supports kernels, so can handle non-linearly separable classification problems.


  • Handles both classification and regression;
  • Supports very high-dimensional and sparse data (hundreds of thousands of dimension); highly effective in text classification where high-scarcity and dimensionality are frequently observed in data.
  • High accuracy, good theoretical guarantees regarding overfitting;
  • Handles very well outliers and noise.
  • Extremely fast at prediction time;
  • Generalize well from few data.


  • Cannot be parallelized, the whole dataset has to fit in memory;
  • Not good at handling very big dataset (more than a hundred of thousands of examples);
  • Sometimes hard to tune due to a wide range of possible hyperparameter values.

LSTM Neural Network

LSTM Neural Network is a variant of the Recurrent Neural Network. LSTM units effectively solve the problem of vanishing gradient. They are typically used for sequential classification problems: text labeling, speech recognition. The Encoder-Decoder architecture of LSTM networks allows building machine translation systems.


  • Classify relatively long (up to 40-50 tokens) sequences well;
  • Highly accurate even with a couple of hidden layers.


  • Very slow at training as cannot be parallelized between multiple GPUs;
  • Relatively slow at classification, since it classifies one token at a time (especially the Encoder-Decoder architecture);
  • Black boxes without strong theoretical foundations.

Convolutional Neural Network

As of 2018, one cannot imagine the image processing without Convolutional Neural Networks. Such state-of-the-art CNN architectures as AlexNet, GoogLeNet, and ResNet are all variants of a Convolutional Neural Network.


  • Very fast to train, as can be parallelized between multiple GPUs;
  • Learn low-level and high-level features so can be used as feature learners for other ML algorithms;
  • Very accurate at image classification and many other tasks, including language processing.


  • Require a lot of training data (however, such techniques as transfer learning can reduce the need for data);
  • Sensible to parameter initialization and choice of hyperparameters;
  • Slow to train on CPU;
  • Black boxes without strong theoretical foundations (as of 2018).

K-Nearest Neighbors

K-Nearest Neighbors is one of the oldest and simplest algorithms of machine learning. In fact, there's not so much learning happening: the predictions are given as the majority (classification) or the average (regression) of the labels in the close neighborhood of the input, unannotated example. KNN can be very accurate recently, however, it was for a long time considered as slow at prediction time. However, modern libraries made this algorithm very fast and today it can easily compete with the state-of-the-art techniques.


  • No assumptions about data :  works well for non-linearly separable data;
  • Simple to implement;
  • Flexible to feature/distance choices;
  • Naturally handles multi-class cases;
  • Does well in practice with enough representative data
  • Can be used for both classification and regression.


  • Search in a large space of examples to find nearest neighbors can be slow (mostly resolved in modern libraries);
  • With a lot of training examples, the model can require a lot of memory;
  • Sensitive to irrelevant features and the scale of the data.

Clustering and Dimensionality Reduction


The K-means algorithm remains one of the most used clustering algorithms, despite its simplicity.


  • Simple to use;
  • Fast (when used with an appropriate data structure to store examples that preserve spatial locality);
  • Results are simple to understand by a human.


  • K-means only works well when the shape of clusters are hyper-spherical or when clusters are far from one another;
  • Different runs of the algorithm will most probably yield in different clusterings of the same data;
  • Requires the number of clusters to be known in advance.

Expectation Maximization

Expectation Maximization, as well as Latent Dirichlet Allocation (below), are both examples of the so-called "soft" clustering. They assign one example to multiple clusters with a probability of membership. Expectation Maximization learns Gaussian Mixture Models which solves the problem of K-Means: the clusters can be non-spherical.


  • Simplicity and ease of implementation;
  • Solutions to the M-steps often exist in the closed form;
  • Can often be easily parallelized and its memory requirements tend to be modest compared to other methods;
  • The algorithm is numerically very stable;
  • Soft-membership in clusters is often desirable;
  • Can be naturally used for finding outliers.


  • Slow linear convergence;
  • Can converge to local optima;
  • As in K-Means, one needs to know the number of clusters;
  • One needs to carefully choose the generative model.


DBSCAN (for Density-Based Spatial Clustering of Applications with Noise) is another popular clustering algorithm. Its difference from K-Means lies in the fact that DBSCAN doesn't use the notion of cluster centroids. It's density based, so the number of clusters depends on the data itself. Clusters can have absolutely any form: for DBSCAN a cluster is a subset of points that form a dense "cloud".


  • Doesn't need the number of clusters as input;
  • Clusters can have arbitrary shapes;
  • Robust to noise;
  • Deterministic.


  • Requires connected regions of sufficiently high density;
  • The maximal distance between two points that belong to one cluster has to be given as input;
  • Sensitive to hyperparameters;
  • Sampling affects density measures.
  • Datasets with varying densities are problematic.

Latent Dirichlet Allocation

LDA is one the most important unsupervised learning algorithms. It was successfully applied to text analysis (topic modeling), social network analysis (community overlaps), content recommendation, genetic population analysis, and many others.

In the nutshell, LDA takes a collection of documents, the number of topics, and returns a distribution of topics over documents and a distribution of words over topics.

LDA can be used as a soft-clustering method: every topic can be seen as a cluster and each document can have several topics with different probabilities, thus can belong to several clusters.


  • Excellent empirical results;
  • Mitigates overfitting well;


  • High computational complexity;
  • Nondeterministic (different runs can give different results);
  • Doesn't work well on short documents.


One important aspect of the machine learning practice is data visualization and dimensionality reduction. Dimensionality reduction can be done to visualize data (the human can only see up to three-dimensional data) and to make learning tractable and/or more accurate.

For a long Principal Component Analysis or PCA was the most popular tool for dimensionality reduction and t-SNE was used to visualize data. However, in 2018 the new algorithm, UMAP (for Uniform Manifold Approximation and Projection) was proposed and efficiently implemented. UMAP combines the advantage of both PCA and t-SNE: it can be used as a dimensionality reduction techniques (t-SNE can't) and has visualization properties similar or surpassing those of t-SNE on many tasks.


  • Can be used for both dimensionality reduction and visualization;
  • Has very fast implementation in multiple programming languages, including Python.


  • Has hyperparameters one has to tune to find a good visualization.

These were the 11 most important machine learning algorithms, according to engineers and scientists. Did we miss something important? Please, let us know and we will improve this text.

Read our previous post "Data Science Work Productivity Tips & Tricks" or subscribe to our RSS feed.

Like it? Share it!