Gurpreet's AI Blog

Machine Learning Engineer Cheatsheet

Goal

The goal of this cheatsheet is to briefly summarize machine learning concepts and algorithms, with thanks to the sources it draws on.

Motivation

Robotics.

Topics

History

General

Definition of Machine Learning

What is Machine Learning? Algorithms that improve at some task with experience. It is grounded in statistics and optimization rather than formal logic.

Arthur Samuel, a pioneer in Artificial Intelligence, defined Machine Learning in 1959 as the “field of study that gives computers the capability to learn without being explicitly programmed.”

You can think of it mathematically as: we want to learn a target function \(f\) that maps an input variable \(X\) to an output variable \(Y\), with an error \(e\):

\[Y = f(X) + e\]

Linear vs Non-linear Models

Linear models are usually simpler, faster, and require less data, whereas non-linear models are more flexible, more powerful (they can represent a larger class of functions), and often more performant, but much harder to train.
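As a minimal sketch of this trade-off (assuming scikit-learn and NumPy are installed; the data is synthetic):

```python
# Contrast a linear model with a non-linear one on the same data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)  # non-linear target

linear = LinearRegression().fit(X, y)       # fast and simple, but limited
forest = RandomForestRegressor().fit(X, y)  # flexible, but heavier to train

print("linear R^2:", linear.score(X, y))    # poor fit on a non-linear target
print("forest R^2:", forest.score(X, y))    # much closer fit
```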

Supervised vs Unsupervised vs Semi-Supervised Learning

Supervised learning learns from labeled examples (inputs paired with outputs), unsupervised learning finds structure in unlabeled data, and semi-supervised learning combines a small labeled set with a larger unlabeled one.

Feature Engineering

The process of using domain knowledge to create features, or input variables, that help machine learning algorithms perform better. Done correctly, feature engineering can increase the predictive power of your models. It is more of an art than a science, and one of the most important steps in creating a good model.

As Andrew Ng puts it: “Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
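A small illustration of the idea, assuming pandas is installed (the column names here are hypothetical): deriving new features from a raw timestamp and two numeric columns.

```python
# Derive new features from raw columns; column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "signup_time": ["2023-01-02 08:30", "2023-06-17 22:05"],
    "clicks": [12, 87],
    "sessions": [3, 9],
})

df["signup_time"] = pd.to_datetime(df["signup_time"])
df["signup_hour"] = df["signup_time"].dt.hour            # time-of-day effect
df["signup_dayofweek"] = df["signup_time"].dt.dayofweek  # weekday vs. weekend
df["clicks_per_session"] = df["clicks"] / df["sessions"] # ratio feature

print(df)
```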

Bias-Variance trade-off

In supervised learning, the prediction error \(e\) is composed of the bias, the variance, and an irreducible part (noise inherent to the problem).

When parameterizing a model (writing it in terms of parameters), the goal is to strike a trade-off between low bias (the underlying pattern is not oversimplified) and low variance (the model is not overly sensitive to the specifics of the training data).


Underfitting and Overfitting

In statistics, fit refers to how well the target function is approximated.

Dealing with bias and variance is really about dealing with over- and under-fitting. Bias is reduced and variance is increased in relation to model complexity: as more and more parameters are added to a model, its complexity rises, variance becomes our primary concern, and bias steadily falls.


The sweet spot for any model is the level of complexity at which the increase in bias is equivalent to the reduction in variance. If our model's complexity exceeds this sweet spot, we are in effect over-fitting the model, while if it falls short of the sweet spot, we are under-fitting it.
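A quick numeric sketch of this effect (synthetic data; polynomial degree stands in for model complexity):

```python
# Vary model complexity (polynomial degree); watch train vs. validation error.
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-1, 1, 30)
x_val = rng.uniform(-1, 1, 30)
true_fn = lambda x: np.sin(2 * np.pi * x)
y_train = true_fn(x_train) + rng.normal(scale=0.2, size=30)
y_val = true_fn(x_val) + rng.normal(scale=0.2, size=30)

for degree in (1, 3, 9, 15):
    coeffs = np.polyfit(x_train, y_train, degree)  # fit on training data only
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    # Low degree: both errors high (underfitting). High degree: train error
    # keeps falling while validation error rises again (overfitting).
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
```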

Optimization

Optimization refers to the task of minimizing or maximizing an objective function \(f(x)\) parameterized by \(x\). In machine learning, it is the task of minimizing the cost or loss function \(J(\theta)\) parameterized by the model's parameters \(\theta \in \mathbb{R}^d\). In the case of minimization, the goal is either to find the global minimum of the objective function (feasible when it is convex) or, when the function is non-convex, to find the lowest possible value in some neighborhood, i.e. a good local minimum.

Gradient Descent

Gradient Descent is the most common optimization algorithm. It is a first-order method, meaning it uses only the first derivative when updating the parameters. On each iteration, we update the parameters in the direction opposite to the gradient of the objective function, \(\nabla_\theta J(\theta)\), with respect to the parameters, since the gradient gives the direction of steepest ascent. The size of the step we take on each iteration toward the local minimum is determined by the learning rate \(\alpha\). In other words, we follow the slope of the surface created by the objective function downhill until we reach a valley.
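Written out, each iteration applies the standard update rule:

\[\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)\]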

Algorithm

The parameter update is repeated until convergence or until a minimum is found.
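A minimal sketch of the loop, assuming a mean-squared-error cost for 1-D linear regression (the function and variable names are illustrative, not from the original post):

```python
# Gradient descent for 1-D linear regression with an MSE cost.
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    theta0, theta1 = 0.0, 0.0            # initialize parameters
    m = len(x)
    for _ in range(n_iters):
        pred = theta0 + theta1 * x       # current hypothesis h(x)
        error = pred - y
        # Gradients of J(theta) = (1/2m) * sum(error^2) w.r.t. each parameter.
        grad0 = error.sum() / m
        grad1 = (error * x).sum() / m
        # Step opposite the gradient, scaled by the learning rate alpha.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                        # ground truth: theta0=1, theta1=2
print(gradient_descent(x, y))            # approaches (1.0, 2.0)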

Ordinary Least Squares
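In brief, ordinary least squares (OLS) fits a linear model by choosing the parameters that minimize the sum of squared residuals; when \(X^\top X\) is invertible, the estimate has a closed form:

\[\hat{\theta} = (X^\top X)^{-1} X^\top y\]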

Maximum Likelihood Estimation
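In brief, maximum likelihood estimation (MLE) picks the parameters under which the observed data is most probable. In practice the log-likelihood is maximized instead, which turns the product over examples into a sum:

\[\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big)\]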

Supervised Learning

Given a set of data points \(\{x^{(1)}, \dots, x^{(m)}\}\) associated with a set of outcomes \(\{y^{(1)}, \dots, y^{(m)}\}\), we would like to learn a function \(h\) such that, for a new pair drawn from some distribution \((x, y) \sim P\), we have \(h(x) = y\) with high probability (or \(h(x) \approx y\)).

Types of Prediction

Binary classification, e.g. spam filtering: an email is either spam (+1) or not (−1).
Multi-class classification, e.g. face classification: a person is exactly one of K identities (e.g. 1 = “Barack Obama”, 2 = “George W. Bush”, etc.).
Regression, e.g. predicting a future temperature or the height of a person.

Hypothesis

Before we can find a function \(h\), we must specify what type of function we are looking for: it could be an artificial neural network, a decision tree, or one of many other types of classifiers.

No Free Lunch Theorem

No single machine learning algorithm is better than all the others on all problems. It is therefore common to try multiple models and pick the one that works best for a particular problem.

Loss Function

A loss function evaluates a hypothesis \(h \in H\) on our training data and tells us how bad it is. The higher the loss, the worse the hypothesis; a loss of zero means it makes perfect predictions.
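Two standard examples are the squared loss, used in regression, and the zero-one loss, used in classification:

\[L\big(h(x), y\big) = \big(h(x) - y\big)^2 \qquad\qquad L\big(h(x), y\big) = \mathbf{1}\{h(x) \neq y\}\]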

Cost Function

The cost function \(J\) is commonly used to assess the performance of a model, and is defined with the loss function \(L\) as follows, typically as the average (or sum) of the loss over the \(m\) training examples:

\[J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L\big(h_\theta(x^{(i)}),\, y^{(i)}\big)\]

Modeling - Evaluation Metrics

To determine how good our model is, the best way to assess it is on data points the model has never seen, i.e. a held-out test set.
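For example, a held-out split with scikit-learn might look like this (a sketch, assuming scikit-learn is installed; the data is synthetic):

```python
# Hold out a test set so evaluation uses data the model never saw.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = np.arange(100, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + np.random.default_rng(2).normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))  # scored on unseen points
```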

Regression Task

Errors are defined as the difference between a prediction \(y'\) and the actual result \(y\).
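Common aggregates of these errors are the mean squared error (MSE) and the mean absolute error (MAE); a minimal NumPy sketch:

```python
# Mean squared error and mean absolute error from predictions vs. actuals.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_pred - y_true            # per-point error y' - y
mse = np.mean(errors ** 2)          # penalizes large errors more heavily
mae = np.mean(np.abs(errors))       # more robust to outliers
print(f"MSE={mse:.3f}  MAE={mae:.3f}")
```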

Classification Task