The idea of Machine Learning is to solve problems by gathering a dataset that describes the problem and algorithmically building a statistical model based on that dataset. The learning can be:
- supervised (the dataset is a collection of labelled examples, where each example is a feature vector made of features that describe the example, plus a label that either belongs to a finite set of classes like {spam, not_spam} or is a real number)
- unsupervised (the dataset is a collection of unlabelled feature vectors and the learning transforms a vector into another; it is used for clustering, dimensionality reduction, or outlier detection, where the output indicates how far an example is from the "typical" examples in the dataset)
- semi-supervised (works like supervised learning, but the dataset contains both labelled and unlabelled examples)
- reinforcement (the machine lives in an environment, perceives its state and executes actions that bring a "reward"; the idea is to learn a policy, i.e. a model that takes the state as input and decides the action that maximizes the expected average reward; games, robotics and resource management are typical applications)
The work of a data analyst is to decide how to represent the dataset, i.e. how each example is described as a feature vector. After converting all examples into a consistent vector of features, the analyst selects a learning algorithm that analyses the dataset to produce a model.
One of the most basic learning algorithms is the Support Vector Machine (SVM): it treats every example as a point in a D-dimensional space (where D is the number of features) and draws a (D-1)-dimensional hyperplane (the decision boundary) that separates examples with positive labels from examples with negative labels, with the highest possible margin, the margin being the distance between the decision boundary and the closest examples of the two classes.
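As a rough sketch of what this looks like in practice (using scikit-learn, which is mentioned later in this post, and a tiny invented dataset):

```python
# Minimal sketch of a linear SVM with scikit-learn, on a tiny invented dataset.
# The data and parameter values are illustrative, not from the original text.
from sklearn.svm import SVC

# Each example is a 2-feature vector; labels are +1 / -1 (e.g. spam / not_spam).
X = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.5], [6.0, 7.0], [7.0, 8.0], [8.0, 8.5]]
y = [-1, -1, -1, 1, 1, 1]

model = SVC(kernel="linear")  # linear decision boundary (a hyperplane)
model.fit(X, y)               # the "learning" step: finds the max-margin boundary

print(model.predict([[2.5, 3.0], [7.5, 8.0]]))  # -> [-1  1]
```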
The accuracy of the statistical model (or just model) is the ratio of the examples that are predicted correctly to the total number of examples.
To understand ML a fair amount of math is required: derivatives, standard deviation, variance, unbiased estimators (sample statistics, computed from a finite sample of examples, whose expected value equals the real statistic they estimate; the sample mean is one unbiased estimator), Bayes' rule, parameters (the values that the learning algorithm determines to create the model), and hyperparameters (inputs a learning algorithm needs in order to compute the model; a hyperparameter is not extracted by processing the data but is chosen by the data analyst).
Model-based learning algorithms are those that use the examples to determine the parameters of a model and then don't need those examples anymore; SVM is an example. Instance-based learning algorithms use all the examples as the model: kNN (k-nearest neighbors) is an algorithm that takes a feature vector as input and decides the label based on the labels of the closest vectors in the dataset.
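A minimal sketch of the kNN idea (plain Python/NumPy, with invented data and k=3 as an assumed hyperparameter):

```python
# Minimal kNN sketch: the "model" is simply the stored dataset.
import numpy as np
from collections import Counter

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y = np.array(["a", "a", "a", "b", "b", "b"])

def knn_predict(x_new, k=3):
    # Euclidean distance from x_new to every stored example.
    distances = np.linalg.norm(X - x_new, axis=1)
    nearest = np.argsort(distances)[:k]              # indices of the k closest vectors
    return Counter(y[nearest]).most_common(1)[0][0]  # majority label among them

print(knn_predict(np.array([8.2, 8.7])))  # -> "b"
```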
Classification algorithms are those that automatically assign a label to an unlabelled example. Regression algorithms assign a real value, called the target. A shallow learning algorithm uses the dataset to learn the values of the parameters of the model directly; a deep learning algorithm (or deep neural network learning) learns the parameters of the model through several layers, each one feeding data to the next.
Linear regression is a regression learning algorithm that works a bit like an SVM, except that it minimises the distance of the line to all the examples instead of maximising a margin. The loss function is the penalty applied to a wrong prediction (for linear regression, typically the squared error). Overfitting is a property of a model such that the model predicts very well the examples in the dataset but very badly all the others.
Despite its name, Logistic Regression is a classification learning algorithm. It tries to classify with a linear function but, because classification has a finite number of accepted results, it needs a continuous function with a limited codomain; that is the standard logistic function, aka sigmoid function, f(x) = 1 / (1 + e^-x). While in linear regression we minimise the risk defined by the average squared error loss, in logistic regression we maximise the likelihood of the training set according to the model, i.e. how likely the observed labels are according to the model.
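A quick sketch of the sigmoid and of logistic regression with scikit-learn (invented data; the library maximises the likelihood internally):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    # Standard logistic function: squeezes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))   # -> 0.5
print(sigmoid(5.0))   # -> ~0.993

# Tiny invented dataset: one feature, two classes.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.2], [3.7]]))        # -> [0 1]
print(clf.predict_proba([[1.2], [3.7]]))  # class probabilities produced via the sigmoid
```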
Decision tree learning is also a classification algorithm, based on an acyclic graph: at each branching node one feature is examined, and at a leaf the class is decided. One of these algorithms is ID3. To evaluate how good a split in the tree is, we use the notion of entropy, defined as "a measure of uncertainty about a random variable", which reaches its maximum when all values of the variable are equiprobable and its minimum when only one value is possible. Pruning is a backtracking technique that removes branches that don't contribute enough to the error reduction, replacing them with leaf nodes.
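A small sketch of the entropy computation used to score a split (plain Python; the label lists are invented):

```python
import math

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

print(entropy(["spam", "spam", "not_spam", "not_spam"]))      # -> 1.0 (maximum uncertainty)
print(entropy(["spam", "spam", "spam", "not_spam"]))          # -> ~0.81 (less uncertainty)
# ID3-style splitting picks the feature whose split lowers the weighted entropy the most.
```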
For dealing with noise (examples that are not linearly separable) in an SVM we introduce the hinge loss function, which is zero for vectors on the right side of the decision boundary and grows proportionally to the distance for vectors on the wrong side. Minimizing the cost function with hinge loss requires a hyperparameter C: with a high value the SVM ignores misclassifications and looks for the widest possible margin, while with a low value misclassifications become costly and the SVM tries to classify every vector correctly, even at the price of a narrower margin.
For non-linearity we can use kernel functions, or kernels, which allow computations in higher-dimensional spaces without paying the cost of explicitly transforming the data.
A learning algorithm is composed of three parts: a loss function, an optimization criterion based on the loss function (a cost function, for example), and an optimization routine that leverages the training data to find a solution to the optimization criterion.
Two optimization algorithms are gradient descent and stochastic gradient descent. Gradient descent takes steps proportional to the negative of the gradient of the function at the current point in order to find a minimum of that function. It is used to find optimal parameters for many of the discussed algorithms, including neural networks. The negative sign is used because a positive derivative means the function is growing, so we need to move in the descending direction. After a number of epochs (iterations over the training data), gradient descent converges towards a minimum. The learning rate alpha controls the size of each step. There are faster versions of this algorithm, like Minibatch SGD, Adagrad, Momentum and, for neural networks, RMSProp and Adam.
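A bare-bones sketch of gradient descent fitting a one-feature linear regression (invented data; alpha and the number of epochs are assumed values):

```python
import numpy as np

# Invented data roughly following y = 2x + 1 plus a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

w, b = 0.0, 0.0   # parameters to learn
alpha = 0.01      # learning rate (step size)

for epoch in range(2000):
    y_pred = w * x + b
    # Gradients of the mean squared error with respect to w and b.
    dw = (2.0 / len(x)) * np.sum((y_pred - y) * x)
    db = (2.0 / len(x)) * np.sum(y_pred - y)
    # Step in the direction opposite to the gradient.
    w -= alpha * dw
    b -= alpha * db

print(w, b)  # should approach roughly 2 and 1
```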
Most of these algorithms are quite standard and there are libraries that implement them: in Python, for example, scikit-learn implements all the algorithms discussed above.
Feature engineering is the problem of transforming raw data into a dataset of (possibly labelled) vectors. It demands a lot of creativity and, possibly, domain knowledge. Everything measurable can be used as an informative feature in the vector, ideally one with high predictive power. If a model predicts the training data well, we say that it has low bias.
Features that are categorical (a color, for example) can be modelled with one binary feature per category; this is called one-hot encoding. We should refrain from using a single feature with a different value per category, because that would imply an order among the values (unless an order is actually meaningful) and it may lead to overfitting.
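A small sketch of one-hot encoding (plain Python; the color values are invented):

```python
# One-hot encoding: one binary feature per category instead of a single ordinal value.
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))            # ['blue', 'green', 'red']

encoded = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(encoded)
# [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```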
Binning (or bucketing) goes the opposite way: you take a numerical feature (age, for example), split its range into bins, and create one feature per bin. Normalization is the process of taking the range of values a feature can take and mapping it onto a standard range (such as [0, 1] or [-1, 1]); it is used to prevent features with larger ranges from dominating the derivative (in gradient descent, for example) and it also helps avoid numerical overflow.
Standardization is the procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution (zero mean and unit variance). There is no hard rule for choosing between standardization and normalization; usually standardization is preferred for unsupervised learning algorithms, for features that are distributed close to a normal distribution (bell curve) or for features that have outliers (extremely high or low values), and normalization in all the other cases.
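A quick sketch of both rescalings applied to an invented feature column:

```python
import numpy as np

ages = np.array([18.0, 25.0, 40.0, 60.0, 90.0])  # invented feature values

# Normalization (min-max scaling) to the [0, 1] range.
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: zero mean, unit variance.
standardized = (ages - ages.mean()) / ages.std()

print(normalized)    # values in [0, 1]
print(standardized)  # values centred around 0
```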
When a dataset has examples with missing features, there are three possibilities: 1. removing such examples, if the dataset is big enough; 2. using a learning algorithm that can deal with missing features; 3. using a data imputation technique (filling in an average value, an out-of-range value or a middle-of-the-range value, building a model to infer the missing value, or adding a boolean feature that signals that the value is missing).
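A small sketch of mean imputation, using scikit-learn's SimpleImputer on an invented matrix with a missing value:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented dataset: np.nan marks a missing feature value.
X = np.array([[1.0, 7.0],
              [2.0, np.nan],
              [3.0, 9.0]])

imputer = SimpleImputer(strategy="mean")  # replace missing values with the column mean
X_filled = imputer.fit_transform(X)
print(X_filled)  # the nan becomes (7.0 + 9.0) / 2 = 8.0
```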
Data analysts work with three sets: the training set (usually the biggest one), the validation set and the test set (both much smaller, usually of similar size, not used for training and therefore called holdout sets). The validation set is used to choose the learning algorithm and the best values of the hyperparameters, the test set to assess the model before delivering it. While overfitting (high variance) is the tendency of a model to perform well on the training data and badly on at least one of the other two sets, underfitting is the problem of performing badly even on the training data, and it may happen when the model is too simple for the data (a linear model fitting a curve) or because of bad feature engineering. Overfitting can be countered in four ways: simplifying the model, reducing the dimensionality, adding more training data, or using regularization (L1 and L2 regularization, where L1 performs feature selection and is called lasso, while L2 is called ridge regularization or weight decay); a sketch of L1 and L2 regularization follows.
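A brief sketch of L1 (lasso) and L2 (ridge) regularization with scikit-learn (invented data; alpha is the regularization strength, an assumed hyperparameter value):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                 # 5 features, only the first 2 matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: tends to push useless weights to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all weights, none exactly 0

print(lasso.coef_)  # coefficients of the 3 irrelevant features end up at (or near) 0
print(ridge.coef_)  # all coefficients shrunk but non-zero
```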
If a model generalises well to the targets/categories of the test set, it performs well. For a regression model this means doing better than the mean model (a model that always predicts the average of the labels). We can compute the mean squared error on the training set and on the test set: if the error on the test set is substantially higher, the model may be overfitting.
For a classification model we may use other metrics. The confusion matrix is a table that summarizes how successful a model is at predicting categories, with one row per actual category and one column per predicted category; from its counts of true positives TP, false positives FP, true negatives TN and false negatives FN we can compute the precision, i.e. the ratio of correct positive predictions to all positive predictions, TP/(TP+FP), and the recall, i.e. the ratio of correct positive predictions to all the positive examples in the dataset, TP/(TP+FN). The accuracy is the number of correct predictions divided by the total number of examples, (TP + TN)/(TP + TN + FP + FN), and the cost-sensitive accuracy is an accuracy where mistakes have different costs: you multiply FP and FN by a positive cost and then apply the accuracy formula. You can also plot the ROC curve and compute the Area Under the ROC Curve (AUC) to visualize the performance of a classification model.
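A small sketch of these metrics computed directly from invented confusion-matrix counts:

```python
# Invented counts from a hypothetical binary classifier.
TP, FP, TN, FN = 80, 10, 95, 15

precision = TP / (TP + FP)                    # correct positives over predicted positives
recall    = TP / (TP + FN)                    # correct positives over actual positives
accuracy  = (TP + TN) / (TP + TN + FP + FN)   # correct predictions over all examples

print(precision, recall, accuracy)  # -> 0.888..., 0.842..., 0.875
```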
Tuning hyperparameters is a task with its own techniques. In the SVM example there are two hyperparameters: the penalty parameter C and the kernel (two values: linear or rbf). With grid search you set a logarithmic scale of values for C (for example 0.001, 0.01, 0.1, 1, 10, 100, 1000), combine it with both kernel values (generating 14 models in this example) and then select the best model based on the metrics mentioned above. Alternatively you can use random search (you define a statistical distribution for each hyperparameter and the number of models to generate, and then the values are sampled at random), Bayesian hyperparameter optimization, or other techniques.
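A compact sketch of this grid search with scikit-learn's GridSearchCV (invented data; note that GridSearchCV already uses cross-validation, described next, to score each combination):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # invented labels

param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],  # 7 values
    "kernel": ["linear", "rbf"],                # 2 values -> 14 combinations
}

search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold cross-validation per combination
search.fit(X, y)
print(search.best_params_, search.best_score_)
```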
Cross-validation is a technique used when there is not enough data for a decent validation set: you split the training set into folds (usually five) and you train a model on all but one fold, using the remaining fold for validation, repeating the process for each fold. You can use this technique to find the best values of the hyperparameters.
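A minimal sketch of 5-fold cross-validation with scikit-learn (again on invented data):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)   # invented labels

# Train on 4 folds, validate on the 5th, rotate, and collect the 5 accuracy scores.
scores = cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=5)
print(scores, scores.mean())
```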
Neural networks are built out of logistic-regression-like units: the generalisation of logistic regression, the softmax regression model, is a standard unit of neural networks. A neural network is a nested function, and each level of nesting is called a layer. The inner layers are vector functions of the form F(z) = g(Wz + b), where g is a fixed, usually non-linear, activation function, W is a matrix and b is a vector; W and b are both learned using gradient descent by optimising a cost function.
Feed-forward neural networks (FFNN) include the Multilayer Perceptron (MLP). Each layer receives the output vector of the previous layer and computes one output value per row of its matrix W (and the corresponding element of b), applying the activation function to each value; together these values form the layer's output vector. The activation function may differ between units of the same layer. If the input of a layer is the vector of all the outputs of the previous one, the two layers are said to be fully connected. Activation functions should be differentiable at most points of their domain.
Popular activation functions are the logistic function, TanH (the hyperbolic tangent) and ReLU (a function that is zero for negative inputs and equal to the input for positive ones).
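A tiny sketch of one fully connected layer F(z) = g(Wz + b) with these activations (NumPy; W, b and the input are invented values, not learned ones):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def tanh(z):    return np.tanh(z)
def relu(z):    return np.maximum(0.0, z)

# Invented parameters: a layer mapping 3 inputs to 2 outputs.
W = np.array([[0.2, -0.5, 0.1],
              [0.7,  0.3, -0.4]])
b = np.array([0.1, -0.2])
z = np.array([1.0, 2.0, 3.0])   # output of the previous layer

for g in (sigmoid, tanh, relu):
    print(g.__name__, g(W @ z + b))   # one output value per row of W
```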
Deep learning refers to training neural networks with more than two non-output layers. Convolutional Neural Networks (CNN) are a special kind of FFNN that reduces the number of parameters of a deep network without losing too much quality; they were created for image processing. For example, for a 100×100 image a convolutional layer slides a small filter (say 10×10) across the image in patches; each application of the filter produces one value, and all those values together form a new matrix. In a CNN, if two subsequent layers are convolutions, layer l+1 uses the outputs of the preceding layer as a collection of images, called a volume. In computer vision, CNNs usually take volumes as input, with one matrix per RGB channel.
The step by which the sliding window is moved at each application of the filter is called the stride. A stride greater than 1 reduces the size of the output matrix. Padding is the number of rows/columns added around the image, usually filled with zeros, to get a larger output matrix and to let the filter scan the borders of the image properly.
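A compact sketch of a single-channel 2D convolution with stride and zero padding (NumPy; the image, filter and hyperparameter values are invented):

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # Zero-pad the image on all sides.
    img = np.pad(image, padding)
    kh, kw = kernel.shape
    out_h = (img.shape[0] - kh) // stride + 1
    out_w = (img.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # one value per filter application
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # invented 6x6 "image"
kernel = np.ones((3, 3)) / 9.0                     # simple averaging filter

print(conv2d(image, kernel).shape)                        # (4, 4): no padding, stride 1
print(conv2d(image, kernel, stride=2, padding=1).shape)   # (3, 3): stride 2 shrinks the output
```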
Pooling works like convolution: it slides a window over the input and applies a fixed function to it, with the size of the window and the stride as hyperparameters. A typical example is applying the max function with a window of size 2 or 3 and stride 2. Pooling reduces the number of parameters; usually each matrix of a volume is treated separately, so the number of matrices in the volume stays the same.
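A short sketch of 2×2 max pooling with stride 2 (NumPy, invented input):

```python
import numpy as np

def max_pool(matrix, size=2, stride=2):
    out_h = (matrix.shape[0] - size) // stride + 1
    out_w = (matrix.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = matrix[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()   # keep only the strongest value in each window
    return out

m = np.array([[1., 3., 2., 0.],
              [4., 6., 1., 2.],
              [7., 2., 9., 1.],
              [0., 1., 3., 5.]])
print(max_pool(m))   # [[6. 2.] [7. 9.]]
```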
Recurrent Neural Networks (RNN) are used to label, classify or generate sequences (a sequence is a matrix whose rows are feature vectors and whose order matters). Labelling means predicting a class for each vector; classifying means predicting a class for the entire sequence. RNNs are commonly used in text and speech processing, and they are not feed-forward because they contain loops and a state (the memory of the unit): the input and the previous state are used to compute the output and the new state. The output activation function is usually the softmax function, a multidimensional generalisation of the sigmoid. To train RNNs we use backpropagation through time.
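A minimal sketch of a single recurrent step (NumPy; the weight matrices, the tanh state activation and the softmax output are assumed choices for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())        # subtract the max for numerical stability
    return e / e.sum()

# Invented dimensions: 3 input features, 4 state units, 2 output classes.
rng = np.random.default_rng(0)
Wx, Wh = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
Wy = rng.normal(size=(2, 4))

def rnn_step(x, h_prev):
    h_new = np.tanh(Wx @ x + Wh @ h_prev)   # new state from input and previous state
    y = softmax(Wy @ h_new)                 # output for this position in the sequence
    return y, h_new

h = np.zeros(4)
for x in np.eye(3):                         # a toy 3-step sequence
    y, h = rnn_step(x, h)
    print(y)                                # class probabilities at each step
```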
Two issues affect RNNs: the vanishing gradient (some gradient values become very small during training and stop influencing the updates) and long-term dependencies (early vectors in a sequence tend to be forgotten by the time the last units of the network are reached).
Gated RNNs are networks that "store" information by applying filters to vectors. For example, a network may decide to store the information about the gender of a subject by filtering it in and filtering everything else out, and that information can then be used by later units. All of this is done with gates, which decide whether or not a value needs to be updated.
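A rough sketch of the gate idea, in the spirit of a GRU-style update gate (NumPy; the matrices and the exact update rule are illustrative assumptions, not a full gated unit):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Wg, Ug = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))  # invented gate parameters

def gated_update(x, h_prev, h_candidate):
    g = sigmoid(Wg @ x + Ug @ h_prev)      # gate: values in (0, 1), one per state unit
    # Near 1 -> overwrite the stored value; near 0 -> keep what was stored.
    return g * h_candidate + (1.0 - g) * h_prev

h_prev = np.array([0.5, -0.2, 0.1, 0.9])       # invented stored state
h_candidate = np.array([1.0, 1.0, 1.0, 1.0])   # invented new information
x = np.array([0.3, -0.7, 0.2])
print(gated_update(x, h_prev, h_candidate))
```

Real gated units (such as LSTM or GRU cells) combine several gates of this kind.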