Goal
The goal of this cheatsheet is to briefly summarize machine learning concepts and algorithms. Thank you to the following sources:
- Data Science Cheatsheet Compiled by Maverick Lin
- Machine Learning Cheatsheet by Rémi Canard
- Super VIP Cheatsheet: Machine Learning by Afshine Amidi and Shervine Amidi
Motivation
Robotics.
Topics
- History
- General
- Definition of Machine Learning
- Linear vs Non-linear Algorithms
- Supervised vs Unsupervised vs Semi-Supervised Learning
- Feature Engineering
- Bias-Variance trade-off
- Underfitting and Overfitting
- Optimization
- Supervised Learning
- Linear Regression
- Logistic Regression
- K-Nearest Neighbors Algorithm
- Linear Discriminant Analysis
- Unsupervised Learning
- Decision Trees
- Naive Bayes Classifier
- K-nearest Neighbors
- Support Vector Machines
- Ensemble Algorithms
- Bagging and Random Forest
- Boosting and AdaBoost
History
- Samuel’s Checker Player (1952)
- Perceptron (1957)
- AI Winter: Minsky & Papert “killed” AI (1969)
- TD-Gammon (1994)
- Deep Blue (1997)
- Tom Mitchell (1997): “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
General
Definition of Machine Learning
What is Machine Learning? Algorithms that improve on some task with experience. Based on Statistics and Optimization, not logic.
Arthur Samuel, a pioneer in Artificial Intelligence, defined Machine Learning in 1959 as the “field of study that gives computers the capability to learn without being explicitly programmed.”
You can think of it mathematically as follows: we want to learn a target function \(f\) that maps an input variable \(X\) to an output variable \(Y\), with an error \(e\):
\[Y = f(X) + e\]
Linear vs Non-linear Models
Linear models are usually simpler, faster, and require less data, whereas non-linear algorithms are more flexible, more powerful (they can represent a larger class of functions), and often more performant, but much harder to train.
- Linear (or parametric): models that first make an assumption about the functional form, or shape, of \(f\) (e.g. linear) and then fit the model. This reduces estimating \(f\) to estimating a set of parameters, but if the assumption is wrong it will lead to bad results.
- Non-linear (or non-parametric): models that make no assumptions about the form of \(f\), which allows them to fit a wider range of shapes but may lead to overfitting.
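As a concrete sketch of the parametric case, a linear model assumes \(f\) takes the form
\[f(X) \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p\]
so estimating \(f\) reduces to estimating the \(p + 1\) coefficients \(\beta_0, \dots, \beta_p\).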
Supervised vs Unsupervised vs Semi-Supervised Learning
- Supervised learning methods learn to predict Y from X given that the data is labeled.
- Unsupervised learning methods work with unlabeled data; the algorithms learn the inherent structure of the input data.
- Semi-Supervised Learning methods have some data labeled but most of it is unlabeled.
Feature Engineering
The process of using domain knowledge to create features, or input variables, that help machine learning algorithms perform better. Done correctly, it can increase the predictive power of your models. Feature engineering is more of an art than a science, and it is one of the most important steps in creating a good model.
As Andrew Ng puts it: “Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
Bias-Variance trade-off
In supervised learning, the prediction error \(e\) is composed of the bias, the variance, and the irreducible error.
- Bias refers to the simplifying assumptions made by the model to make the target function easier to learn.
- Variance refers to the sensitivity of the model to changes in the training data.
The goal of parameterization (writing the model in terms of parameters) is to achieve both low bias (the underlying pattern is not oversimplified) and low variance (the model is not overly sensitive to the specifics of the training data), which requires a trade-off.
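Under squared-error loss this decomposition can be written out explicitly; the following is the standard textbook form (not stated in the original text), where \(\hat{f}\) is the learned model and \(\sigma_e^2\) is the variance of the irreducible error \(e\):
\[\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma_e^2}_{\text{irreducible error}}\]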
Underfitting and Overfitting
In statistics, fit refers to how well the target function is approximated.
- Underfitting refers to poor inductive learning from training data and poor generalization.
- Overfitting refers to learning the training data's detail and noise, which leads to poor generalization. It can be limited by using resampling and by defining a validation dataset.
Dealing with bias and variance is really about dealing with over and under-fitting. Bias is reduced and variance is increased in relation to model complexity. As more and more parameters are added to a model, the complexity of the model rises and variance becomes our primary concern while bias steadily falls.
The sweet spot for any model is the level of complexity at which the increase in bias is equivalent to the reduction in variance. If our model complexity exceeds this sweet spot, we are in effect over-fitting our model; while if our complexity falls short of the sweet spot, we are under-fitting the model.
Optimization
Optimization refers to the task of minimizing/maximizing an objective function \(f(x)\) parameterized by \(x\). In machine learning, it is the task of minimizing the cost or loss function \(J(\theta)\) parameterized by the model's parameters \(\theta \in \mathbb{R}^d\). In the case of minimization, optimization has one of the following goals:
- Find the global minimum of the objective function. This is feasible if the objective function is convex, i.e. any local minimum is a global minimum.
- Find the lowest possible value of the objective function within a neighborhood. This is usually the goal when the objective function is not convex, as is the case in most deep learning problems.
Gradient Descent
Gradient Descent is the most common optimization algorithm. It is a first-order optimization algorithm, meaning it uses only the first derivative when updating the parameters. On each iteration, we update the parameters in the direction opposite to the gradient of the objective function \(\nabla_\theta J(\theta)\) with respect to the parameters, since the gradient points in the direction of steepest ascent. The size of the step taken on each iteration toward the local minimum is determined by the learning rate \(\alpha\). In other words, we follow the slope of the surface created by the objective function downhill until we reach a valley (a local minimum).
Algorithm
- Initialize the coefficients: \(\theta = 0\) or small random values.
- Pick a value for the learning rate \(\alpha\). If \(\alpha\) is very small, it will take a long time to converge and become computationally expensive. If \(\alpha\) is large, it may fail to converge and overshoot the minimum. Commonly used rates are: \(0.001, 0.003, 0.01, 0.03, 0.1, 0.3\).
- Calculate the cost: \(J(\theta) = \text{evaluate}(f(\text{coefficients}))\)
- Calculate the gradient of the cost: \(\frac{\partial}{\partial\theta_j}J(\theta)\)
- Update the coefficients: \(\theta_j = \theta_j - \alpha \frac{\partial}{\partial\theta_j}J(\theta)\)
This update process is repeated until convergence or until a minimum is found.
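A minimal sketch of batch gradient descent applied to linear regression with a mean-squared-error cost; the function name `gradient_descent` and the toy data are illustrative assumptions, not part of the original text:

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent for linear regression with an MSE cost (illustrative sketch)."""
    m, d = X.shape
    theta = np.zeros(d)              # initialize coefficients to zero
    for _ in range(n_iters):
        errors = X @ theta - y       # h_theta(x) - y for every training example
        grad = X.T @ errors / m      # gradient of J(theta) w.r.t. theta
        theta -= lr * grad           # step in the direction opposite the gradient
    return theta

# Usage: recover the slope and intercept of y = 2x + noise on toy data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(100, 1))
X = np.hstack([np.ones((100, 1)), x])          # add a bias column: theta = [intercept, slope]
y = 2 * x[:, 0] + 0.1 * rng.normal(size=100)
print(gradient_descent(X, y, lr=0.5, n_iters=2000))   # approximately [0.0, 2.0]
```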
Ordinary Least Squares
Maximum Likelihood Estimation
Supervised Learning
Given a set of data points \(\{x^{(1)}, \dots, x^{(m)}\}\) associated to a set of outcomes \(\{y^{(1)}, \dots, y^{(m)}\}\), we would like to learn a function \(h\) such that for a new pair belonging to some distribution \((x, y) \sim P\), we have \(h(x) = y\) with high probability (or \(h(x) \approx y\)).
Types of Prediction
| Type | Example |
| --- | --- |
| Binary classification | E.g. spam filtering. An email is either spam (+1) or not spam (−1). |
| Multi-class classification | E.g. face classification. A person can be exactly one of K identities (e.g. 1 = “Barack Obama”, 2 = “George W. Bush”, etc.). |
| Regression | E.g. predicting future temperature or the height of a person. |
Hypothesis
Before we can find a function \(h\), we must specify what type of function it is that we are looking for. It could be an artificial neural network, a decision tree or many other types of classifiers.
No Free Lunch Theorem: no single machine learning algorithm is better than all the others on all problems. It is common to try multiple models and find one that works best for a particular problem.
Loss Function
A loss function evaluates a hypothesis \(h \in H\) on our training data and tells us how bad it is. The higher the loss, the worse the hypothesis; a loss of zero means it makes perfect predictions. A few of the losses below are sketched in code after this list.
- Regression Loss Functions
- Mean Squared Error Loss (L2 Loss)
- Mean Absolute Error Loss (L1 Loss)
- Huber Loss
- Binary Classification Loss Functions
- Binary Cross-Entropy (Log Loss)
- Hinge Loss
- Multi-class Classification Loss Functions
- Multi-class Cross Entropy Loss
- Kullback Leibler Divergence Loss
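A minimal NumPy sketch of three of the losses listed above; the function names and signatures (`mse_loss`, `mae_loss`, `binary_cross_entropy`) are illustrative assumptions, not part of the original text:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean Squared Error (L2) loss."""
    return np.mean((y_true - y_pred) ** 2)

def mae_loss(y_true, y_pred):
    """Mean Absolute Error (L1) loss."""
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy (log loss); y_true in {0, 1}, p_pred in (0, 1)."""
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Usage on a tiny example.
y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.4])
print(binary_cross_entropy(y_true, p_pred))   # approximately 0.40
```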
Cost Function
The cost function \(J\) is commonly used to assess the performance of a model, and is defined with the loss function \(L\) as the total loss over the training set:
\[J(\theta) = \sum_{i=1}^{m} L\big(h_\theta(x^{(i)}),\, y^{(i)}\big)\]
Modeling - Evaluation Metrics
To determine how good our model is, the best approach is to assess it on data points the model has never seen.
Regression Task
Errors are defined as the difference between a prediction \(y'\) and the actual result \(y\).
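Three commonly used regression metrics built from these errors (standard definitions, added here as a brief reference) are the mean squared error, the mean absolute error, and the root mean squared error:
\[\text{MSE} = \frac{1}{m}\sum_{i=1}^{m}\big(y^{(i)} - y'^{(i)}\big)^2, \qquad \text{MAE} = \frac{1}{m}\sum_{i=1}^{m}\big|y^{(i)} - y'^{(i)}\big|, \qquad \text{RMSE} = \sqrt{\text{MSE}}\]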
Classification Task