
Statistical Interpretation of Loss Function


Machine learning (ML) defines a loss function and optimizes the model to minimize it. Since ML is grounded in probability theory and statistics, it is natural to interpret the loss function from a statistical perspective.

Model Output as Deterministic Value

We can interpret ML models without any statistics. In that case, the output of the model is directly treated as a deterministic prediction, and the loss function can take an arbitrary form as long as it satisfies a few constraints (a minimal code sketch follows the list):

  • The loss over a dataset is the sum of the individual losses of each data point: $Loss(\mathbf{x}) = \sum_{i} Loss(x_i)$
  • It should give a low value when the output is close to the true label. This is not strictly required of a loss function, but the loss is only semantically meaningful when it satisfies this condition.
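
A minimal sketch of these two constraints in Python (the function names and the absolute-error choice are mine, purely for illustration):

```python
import numpy as np

def per_sample_loss(prediction, label):
    # Any function that is small when the prediction is close to the label
    # works; absolute error is just one arbitrary choice.
    return abs(prediction - label)

def dataset_loss(predictions, labels):
    # Constraint 1: the loss over a dataset is the sum of per-sample losses.
    return sum(per_sample_loss(p, y) for p, y in zip(predictions, labels))

predictions = np.array([0.9, 2.1, 2.9])
labels = np.array([1.0, 2.0, 3.0])
print(dataset_loss(predictions, labels))  # small, since predictions are close to labels
```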

Model Output as Noise Distribution Parameter

In the statistical interpretation, we assume that the observations include some noise. Therefore, the relationship between the input $X$ and the output $Y$ is not a deterministic function $Y=f(X)$ but a stochastic process $Y \sim Noise(X)$. We can define the noise distribution to have a parametrized form (e.g., Gaussian noise). Then, the output of the model is interpreted as the parameters of the noise distribution.

$$
\begin{align}
Y &\sim Noise\ Dist. \\
&\sim P(Y \vert w=f_{\theta}(X))
\end{align}
$$
note
  • The parameter $\theta$ is the parameter of the ML model, e.g., the weights in a neural network.
  • The parameter $w$ is the parameter of the noise distribution that we defined.
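
As a toy sketch of this setup (a hand-rolled linear model, not anything from a specific library), the model output is read as the parameter $w$ of an assumed Gaussian noise distribution rather than as $Y$ itself:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=2)  # theta: the ML model's own parameters

def f_theta(x):
    # A toy linear model; its output is interpreted as the noise
    # distribution's parameter w (here, the Gaussian mean), not as Y.
    return theta[0] * x + theta[1]

def sample_y(x, sigma=0.1):
    # Y ~ Noise(X): draw the observation from the noise distribution
    # parametrized by the model output, i.e. Y ~ N(w, sigma) with w = f_theta(X).
    w = f_theta(x)
    return rng.normal(loc=w, scale=sigma)

print(sample_y(1.5))
```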

Define Loss Function

Let's consider the $i$-th sample $(x_i, y_i)$. Given the input $x_i$, the likelihood of this sample is $P(Y=y_i \vert w=f_\theta(x_i))$. For ease of notation, I'll denote it as $p(y_{i}\vert f_{\theta}(x_{i}))$. Thanks to the i.i.d. assumption on the data, the total likelihood over the dataset is as follows:

$$
\begin{align}
P(X=x_{i},Y=y_{i} \vert \theta) &= p(y_{i}\vert f_{\theta}(x_{i})) &&\quad \text{(likelihood of a single data point)}\\
P(D \vert \theta) &= \prod_{i} p(y_{i} \vert f_{\theta}(x_{i})) &&\quad \text{(likelihood over the dataset)}
\end{align}
$$

The goal is to maximize the likelihood. Because the loss function must be a sum of per-sample terms, and we prefer to minimize rather than maximize, we use the negative log-likelihood (NLL) as the loss function.

$$
\begin{align*}
\max_\theta \prod_i p(y_i \vert f_\theta(x_i)) &\iff \max_\theta \sum_i \log p(y_i \vert f_\theta(x_i)) \\
&\iff \min_\theta \sum_i -\log p(y_i \vert f_\theta(x_i))
\end{align*}
$$
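
A hedged sketch of this recipe in code: given any parametrized noise density $p(y \vert w)$ (a Gaussian is used here only as a placeholder), the dataset loss is the sum of per-sample negative log-likelihoods:

```python
import numpy as np

def gaussian_pdf(y, mean, sigma=1.0):
    # Placeholder noise density p(y | w); any parametrized density would do.
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def nll_loss(f_theta, xs, ys):
    # -log prod_i p(y_i | f_theta(x_i)) = sum_i -log p(y_i | f_theta(x_i))
    return sum(-np.log(gaussian_pdf(y, f_theta(x))) for x, y in zip(xs, ys))

# Usage with a toy model f_theta(x) = 2x and a few samples
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.1, 2.0, 3.9])
print(nll_loss(lambda x: 2.0 * x, xs, ys))
```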

MSE with Fixed Variance Gaussian Noise

Let's assume that the noise distribution is Gaussian with a fixed variance. Then the mean ($\mu$) is the only parameter that can change as the input data ($X$) varies. We can rewrite the distribution as follows:

$$
P(Y\vert f_\theta(X)) = \mathcal{N}\big(Y\vert \mu(X), \sigma\big)
$$

By replacing $\mu(X)$ with the ML model output $f_\theta(X)$, we can show that minimizing the negative log-likelihood is equivalent to minimizing the squared error; the additive constant and the positive factor $1/2\sigma^2$ do not depend on $\theta$, so they do not affect the minimizer.

$$
\begin{align*}
-\log p(y_i \vert f_\theta(x_i)) &= -\log \mathcal{N}\big(y_i\vert f_\theta(x_i), \sigma\big) \\
&\propto \big(y_i-f_\theta(x_i)\big)^{2}\\
\therefore \text{loss over dataset} &\propto \sum_i \big(y_i-f_\theta(x_i)\big)^2\\
&\propto \frac{1}{N}\sum_i \big(y_i-f_\theta(x_i)\big)^2 = \text{mean squared error}
\end{align*}
$$
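
A small numerical check of this equivalence (my own sketch with arbitrary values and a fixed $\sigma$): the Gaussian NLL equals the sum of squared errors scaled by $1/2\sigma^2$ plus a $\theta$-independent constant, so both have the same minimizer:

```python
import numpy as np

sigma = 0.5
y = np.array([1.0, 2.0, 3.0])       # true labels y_i
pred = np.array([0.8, 2.3, 2.7])    # model outputs f_theta(x_i)

# Exact Gaussian NLL with fixed sigma
nll = np.sum(0.5 * ((y - pred) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi)))

# Sum of squared errors, rescaled and shifted by constants that do not depend on theta
sse = np.sum((y - pred) ** 2)
print(nll, sse / (2 * sigma ** 2) + len(y) * np.log(sigma * np.sqrt(2 * np.pi)))  # identical
```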

Cross Entropy with Multinomial Distribution Noise

If we instead assume the noise distribution to be a multinomial (categorical) distribution, the same derivation yields the cross-entropy loss.
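
To spell this out (a standard step, not in the original post): if the model output $f_\theta(x_i)$ is a vector of class probabilities and $y_i$ is a one-hot label, the per-sample NLL of the categorical (multinomial with one trial) distribution is exactly the cross entropy between the label and the prediction.

$$
-\log p(y_i \vert f_\theta(x_i)) = -\log \prod_{c} f_\theta(x_i)_{c}^{\,y_{i,c}} = -\sum_{c} y_{i,c} \log f_\theta(x_i)_{c}
$$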
