7. Model Performance

When training ML models, it is important to avoid overfitting the training data. Overfitting occurs when the model learns the noise in the training data and therefore fails to generalize to data it has not been trained on. One hyperparameter that affects whether the model will overfit is the number of epochs, i.e. complete passes through the training split. If we use too many epochs, the model is likely to overfit. On the other hand, if we use too few epochs, the model might not get the chance to learn fully from the training data.

[Figure: model performance, underfitting vs. overfitting]

Important terms

  • Training data is the data given to the model; it is what the model learns from.
  • Training accuracy measures how well the model has learned to map inputs to outputs.
  • Validation data is the data with which the training process is validated.
  • Validation accuracy measures the generalizing power of the model and is evaluated during training.
  • Decreasing validation accuracy means the model generalizes less well beyond the training data.
  • Test data assesses the performance of a trained model.
  • Test accuracy gives the final measure of generalization power and is evaluated after training.
  • If training and validation accuracy are both low, you are probably underfitting; you can probably increase the capacity of your model and train more or longer (increase the number of epochs).

7.1. Data Splitting

In practice, detecting that our model is overfitting is difficult. It is not uncommon for a trained model to already be in production before we start to realize that something is wrong. In fact, it is only by confronting new data that we can make sure everything is working properly. During training, however, we should try to reproduce real conditions as closely as possible. For this reason, it is good practice to divide our dataset into three parts: the training set, the dev set (also known as the cross-validation or hold-out set), and the test set. Our model learns by seeing only the first of these parts. The hold-out set is used to track our progress and draw conclusions that optimize the model. Finally, we use the test set at the end of the training process to evaluate the performance of our model.
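
As a minimal sketch of this three-way split (assuming scikit-learn and synthetic stand-in data), two successive calls to train_test_split produce the training, dev, and test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Carve out the 20% test set first, then split the remaining 80%
# into train (60% of the total) and dev (20% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)  # 0.25 * 0.80 = 0.20

print(len(X_train), len(X_dev), len(X_test))  # 600 200 200
```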

[Figure: splitting the data into training, dev, and test sets]

It is very important to make sure that the cross-validation and test sets come from the same distribution and that they accurately reflect the data we expect to receive in the future. Only then can we be sure that the decisions we make during the learning process bring us closer to a better solution. Our dev and test sets simply need to be large enough to give us high confidence in the performance of our model.

7.2. Validation

We need to create a model with the best settings (e.g. the polynomial degree), but we don’t want to have to keep going through training and testing. There are no consequences in our example from poor test performance, but in a real application where we might be performing a critical task such as diagnosing cancer, there would be serious downsides to deploying a faulty model. We need some sort of pre-test to use for model optimization and evaluation. This pre-test is known as a validation set. A basic approach would be to use a validation set in addition to the training and test sets. This presents a few problems though: we could end up overfitting to the validation set, and we would have less training data. A smarter implementation of the validation concept is k-fold cross-validation.

The idea is straightforward: rather than using a separate validation set, we split the training set into a number of subsets, called folds. Let’s use five folds as an example. We perform a series of train and evaluate cycles where each time we train on 4 of the folds and test on the 5th, called the hold-out set. We repeat this cycle 5 times, each time using a different fold for evaluation. At the end, we average the scores for each of the folds to determine the overall performance of a given model. This allows us to optimize the model before deployment without having to use additional data.
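
A hedged sketch of this procedure using scikit-learn’s cross_val_score, with a synthetic dataset and logistic regression standing in for the model under study:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Five train/evaluate cycles: each fold serves once as the hold-out set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged score = overall performance estimate
```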

[Figure: data splits in k-fold cross-validation]

7.3. Cost function

Whenever a model is trained on training data and used to predict values on a test set, there is a difference between the true and predicted values. The closer the predicted values are to their corresponding real values, the better the model. A cost function is therefore used to measure how close the predicted values are to their corresponding real values. Depending on the situation or problem, the function is either minimized or maximized.

For example, in the case of ordinary least squares (OLS), the cost function (to be minimized) would be:

(1)\[\operatorname{J}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_i)-y_i)^2\]

where,

  • \(J\) denotes the cost function,
  • \(m\) is the number of observations in the dataset,
  • \(h_{\theta}(x_i)\) is the predicted value of the response for the \(i\)-th observation,
  • \(y_i\) is the true value of the response for the \(i\)-th observation.
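
A direct NumPy translation of Equation (1), assuming the linear hypothesis \(h_{\theta}(x) = \theta^{T}x\) implied by OLS; the function and variable names are illustrative:

```python
import numpy as np

def ols_cost(theta, X, y):
    """Equation (1): J(theta) = 1/(2m) * sum((h_theta(x_i) - y_i)^2)."""
    m = len(y)
    residuals = X @ theta - y          # h_theta(x_i) - y_i for every observation
    return (residuals @ residuals) / (2 * m)

# Tiny check: y = 2*x fits exactly, so the cost is zero.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
print(ols_cost(np.array([2.0]), X, y))  # 0.0
```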

7.4. High Bias and High Variance

If a model is under-performing (e.g. if the test or training error is too high), there are several ways to improve performance. To find out which of these many techniques is the right one for the situation, the first step is to determine the root of the problem.

[Figure: training and test error in the high-variance and high-bias regimes]

The graph above plots the training error and the test error and can be divided into two overarching regimes. In the first regime (on the left side of the graph), training error is below the desired error threshold (denoted by ϵ), but test error is significantly higher. In the second regime (on the right side of the graph), test error is remarkably close to training error, but both are above the desired tolerance of ϵ.

7.4.1. Regime 1 (High Variance)

In the first regime, the cause of the poor performance is high variance.

  • Symptoms

    1. Training error is much lower than test error
    2. Training error is lower than ϵ
    3. Test error is above ϵ
  • Remedies

    1. Add more training data
    2. Reduce model complexity – complex models are prone to high variance
    3. Bagging (will be covered later in the course)

7.4.2. Regime 2 (High Bias)

Unlike the first regime, the second regime indicates high bias: the model being used is not expressive enough to produce an accurate prediction.

  • Symptoms

    1. Training error is higher than ϵ
  • Remedies

    1. Use a more complex model (e.g. kernelize, or use non-linear models)
    2. Add features
    3. Boosting
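
As a rough sketch of this diagnosis in code (the threshold ϵ and the decision tree are illustrative choices, not prescriptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

eps = 0.15  # desired error threshold (an illustrative choice)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_err = 1 - model.score(X_tr, y_tr)
test_err = 1 - model.score(X_te, y_te)

if train_err < eps < test_err:          # Regime 1
    print("High variance: add data, simplify the model, or use bagging.")
elif train_err > eps:                   # Regime 2
    print("High bias: add features or use a more complex model.")
else:
    print("Both errors are within tolerance.")
```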

7.5. Regularizations

One of the first methods we should try when we need to reduce overfitting is regularization. It involves adding an extra element to the loss function which punishes our model for being too complex or, in simple terms, for using excessively large values in the weight matrix. This way we try to limit its flexibility, but also encourage it to build solutions based on multiple features. Two popular versions of this method are:

[Figure: regularization]

  1. L1 regularization (Lasso regression, short for Least Absolute Shrinkage and Selection Operator) adds the absolute value of the magnitude of each coefficient as a penalty term to the loss function:
(2)\[\operatorname{J}(\theta)^{L1} = \operatorname{J}(\theta)^{OLS} + \lambda\sum_{j=1}^{n}|\theta_{j}|\]
  2. L2 regularization (Ridge regression) adds the squared magnitude of each coefficient as a penalty term to the loss function:
(3)\[\operatorname{J}(\theta)^{L2} = \operatorname{J}(\theta)^{OLS} + \lambda\sum_{j=1}^{n}\theta_{j}^{2}\]

In addition to the cost function we had in the case of OLS, a regularization term is added: the regularization parameter \(\lambda\) multiplied by the norm of the coefficients, \(\sum_{j}|\theta_{j}|\) in the L1 case and \(\sum_{j}\theta_{j}^{2}\) in the L2 case. This term penalizes big coefficients and shrinks them toward zero, although the L2 penalty does not make them exactly zero. If the \(\theta\)’s take on large values, the optimization function is penalized; we would therefore prefer smaller \(\theta\)’s, or \(\theta\)’s that are close to zero, to keep the penalty term small.
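
The two penalized costs translate directly to NumPy. This is a minimal sketch; note that Equations (2) and (3) sum from j = 1, i.e. the intercept \(\theta_0\) is conventionally left unpenalized, whereas for brevity this sketch penalizes every coefficient:

```python
import numpy as np

def ols_cost(theta, X, y):
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m)

def lasso_cost(theta, X, y, lam):
    """Equation (2): OLS cost plus lambda * sum(|theta_j|)."""
    return ols_cost(theta, X, y) + lam * np.sum(np.abs(theta))

def ridge_cost(theta, X, y, lam):
    """Equation (3): OLS cost plus lambda * sum(theta_j^2)."""
    return ols_cost(theta, X, y) + lam * np.sum(theta ** 2)

theta = np.array([3.0, -0.5])
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 0.0]])
y = np.array([2.0, 5.0, 9.0])
print(lasso_cost(theta, X, y, lam=0.1), ridge_cost(theta, X, y, lam=0.1))
```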

7.5.1. Key points

  1. Lasso shrinks the less important features’ coefficients to zero, thus removing some features altogether. It therefore works well for feature selection when we have a huge number of features.
  2. Built-in feature selection is frequently mentioned as a useful property of the L1 norm, one which the L2 norm does not have. This is a result of the L1 norm’s tendency to produce sparse coefficients.
  3. Computational efficiency: the L1 norm has no analytical solution, but the L2 norm does. This allows L2 solutions to be calculated efficiently. However, L1 solutions do have the sparsity property, which allows them to be used with sparse algorithms and makes those calculations more computationally efficient.
  4. When there are many predictors in the dataset (with some collinearity among them) and not all of them have the same predictive power, L2 regression can be used to estimate predictor importance and penalize predictors that are not important. One issue with collinearity is that the variance of the parameter estimates is huge. In cases where the number of features is greater than the number of observations, the matrix used in OLS may not be invertible, but Ridge regression makes this matrix invertible. Ridge seeks to reduce the MSE by adding some bias and, at the same time, reducing the variance. Remember that high variance corresponds to an overfitting model.
  5. One thing Ridge cannot be used for is variable selection, since it retains all the predictors. Lasso overcomes this problem by forcing some of the predictors’ coefficients to zero.
  6. As \(\lambda\) increases, variance is reduced and bias is added to the model, so getting the right value of lambda is essential. Cross-validation is generally used to estimate it, as in the sketch below.
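
A hedged sketch of the last point using scikit-learn, which provides cross-validated estimators for both penalties (scikit-learn calls \(\lambda\) "alpha"); the dataset is synthetic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Both estimators pick the regularization strength by cross-validation.
lasso = LassoCV(cv=5).fit(X, y)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)

print(lasso.alpha_, np.sum(lasso.coef_ == 0))  # chosen lambda; coefficients forced to zero
print(ridge.alpha_)                            # ridge keeps all predictors
```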

7.6. Early Stopping

When you’re training a learning algorithm iteratively, you can measure how well each iteration of the model performs. Up to a certain number of iterations, new iterations improve the model. After that point, however, the model’s ability to generalize can weaken as it begins to overfit the training data. Early stopping refers to stopping the training process before the learner passes that point.
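
A minimal sketch of patience-based early stopping; the validation-loss curve here is fabricated to first improve and then degrade, standing in for a real training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

best_val_loss = float("inf")
patience, stale_epochs = 5, 0

for epoch in range(200):
    # In a real setup this would be your framework's train step plus a
    # validation pass; here we fabricate a loss curve that improves until
    # roughly epoch 60 and then degrades as the model starts to overfit.
    val_loss = (epoch - 60) ** 2 / 3600 + rng.normal(scale=0.01)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        stale_epochs = 0      # improvement: reset the counter
        # (in practice, checkpoint the model weights here)
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            print(f"Early stopping at epoch {epoch}")
            break
```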

[Figure: early stopping]

Today, this technique is mostly used in deep learning while other techniques (e.g. regularization) are preferred for classical machine learning.

7.7. Hyperparameter Optimization

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

The same kind of machine learning model can require different constraints, weights, or learning rates to generalize to different data patterns. These settings are called hyperparameters, and they have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model, one which minimizes a predefined loss function on given independent data.[1] The objective function takes a tuple of hyperparameters and returns the associated loss. Cross-validation is often used to estimate this generalization performance, as in the sketch below.
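
One common concrete strategy is an exhaustive grid search scored by cross-validation. A minimal sketch, assuming scikit-learn; the SVM and its parameter grid are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Try every combination in the grid; score each by 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)
```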

7.7.1. Bayesian optimization

Bayesian optimization is a global optimization method for noisy black-box functions. Applied to hyperparameter optimization, it builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model and then updating the model, Bayesian optimization aims to gather observations that reveal as much information as possible about this function and, in particular, the location of the optimum. It tries to balance exploration (hyperparameters whose outcome is most uncertain) and exploitation (hyperparameters expected to be close to the optimum).
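
An illustrative sketch using the third-party Optuna library, whose default TPE sampler is one sequential model-based (Bayesian-flavored) approach; the search space and objective here are assumptions for the example:

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

def objective(trial):
    # Each trial proposes a promising configuration based on past results.
    c = trial.suggest_float("C", 1e-3, 1e3, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e1, log=True)
    return cross_val_score(SVC(C=c, gamma=gamma), X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")  # maximize CV accuracy
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```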

7.7.2. Gradient-based optimization

For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using gradient descent. The first usage of these techniques was focused on neural networks. Since then, these methods have been extended to other models such as support vector machines or logistic regression.
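
A minimal sketch of the idea using JAX automatic differentiation: ridge regression has a closed-form solution, so the validation loss is differentiable with respect to \(\lambda\) and we can run gradient descent on it. All data and names here are illustrative:

```python
import jax
import jax.numpy as jnp

# Toy setting where regularization matters: fewer training samples (15)
# than features (20).
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
w_true = jax.random.normal(k1, (20,))
X_tr = jax.random.normal(k2, (15, 20))
y_tr = X_tr @ w_true + 0.5 * jax.random.normal(k3, (15,))
X_val = jax.random.normal(k4, (100, 20))
y_val = X_val @ w_true

def val_loss(log_lam):
    lam = jnp.exp(log_lam)  # optimize log(lambda) to keep lambda positive
    # Closed-form ridge fit: theta = (X'X + lam*I)^-1 X'y, differentiable in lam.
    theta = jnp.linalg.solve(X_tr.T @ X_tr + lam * jnp.eye(20), X_tr.T @ y_tr)
    return jnp.mean((X_val @ theta - y_val) ** 2)

grad_fn = jax.grad(val_loss)
log_lam = jnp.array(0.0)
for _ in range(100):
    log_lam = log_lam - 0.1 * grad_fn(log_lam)  # gradient descent on the hyperparameter
print(jnp.exp(log_lam))  # tuned regularization strength
```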


References

  1. Model Performance
  2. Model Evaluation