Machine learning (ML) defines a loss function and optimizes the model to minimize it. Since ML is grounded in probability theory and statistics, it is natural to interpret the loss function from a statistical perspective.
Model Output as Deterministic Value
We can interpret ML models without statistics. In that case, the output of the model is directly taken as a deterministic prediction, and the loss function can have an arbitrary form as long as it satisfies some constraints:
- The loss over a dataset is the sum of the individual losses of each data point.
- It should give a low value when the output is close to the true label. Although this is not a necessary condition for a loss function, a loss function is semantically valid when it satisfies this condition (see the sketch after this list).
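As a concrete illustration, here is a minimal sketch of these two constraints, assuming a squared-error per-sample loss (the function names `squared_error` and `dataset_loss` are illustrative, not from the original text):

```python
import numpy as np

def squared_error(pred: float, label: float) -> float:
    """Per-sample loss: low when the prediction is close to the label."""
    return (pred - label) ** 2

def dataset_loss(preds: np.ndarray, labels: np.ndarray) -> float:
    """Dataset loss = sum of the individual per-sample losses."""
    return sum(squared_error(p, y) for p, y in zip(preds, labels))

print(dataset_loss(np.array([0.9, 2.1]), np.array([1.0, 2.0])))  # small: good fit
```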
Model Output as Noise Distribution Parameter
In the statistical interpretation, we assume that the observations include some noise. Therefore, the relationship between input and output is not a deterministic function but a stochastic process. We can define the noise distribution to have a parametrized form (e.g., Gaussian noise). Then, the output of the model is interpreted as the parameters of the noise distribution. Note that two kinds of parameters appear here:
- $\theta$ is the parameter of the ML model, e.g., the weights of a neural network.
- $\eta$ is the parameter of the noise distribution that we defined, e.g., the mean $\mu$ of a Gaussian (see the sketch after this list).
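To make the distinction concrete, here is a minimal sketch assuming a linear model with Gaussian noise (the names `weights` and `predict_mu` are my own, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)

# theta: parameters of the ML model (here, the weights of a linear model).
weights = rng.normal(size=3)

def predict_mu(x: np.ndarray) -> float:
    """The model output is interpreted as eta, a parameter of the
    noise distribution -- here, the mean mu of a Gaussian."""
    return float(weights @ x)

x = np.array([1.0, 2.0, 3.0])
mu = predict_mu(x)  # eta = mu(x): a distribution parameter, not a model weight
```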
Defining the Loss Function
Let's consider the $i$-th sample $(x_i, y_i)$. Given the input $x_i$, the likelihood of this sample is $p(y_i \mid x_i; \theta)$. For ease of notation, I'll denote it as $p_i$. Thanks to the i.i.d. condition on the data, the total likelihood over the dataset is as follows:

$$p(\mathcal{D}; \theta) = \prod_{i=1}^{N} p(y_i \mid x_i; \theta) = \prod_{i=1}^{N} p_i$$
The goal is to maximize the likelihood. Considering that a loss function must be represented as a sum of terms, and that it is preferred to minimize the loss, we use the negative log likelihood (NLL) as the loss function:

$$\mathcal{L}(\theta) = -\log \prod_{i=1}^{N} p_i = -\sum_{i=1}^{N} \log p_i$$
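A quick numerical sketch of this correspondence, assuming the per-sample likelihoods are already given (the values are made up for illustration):

```python
import numpy as np

# Hypothetical per-sample likelihoods p_i for a small dataset.
p = np.array([0.8, 0.6, 0.9, 0.7])

total_likelihood = np.prod(p)   # product over i.i.d. samples
nll = -np.sum(np.log(p))        # loss: sum of per-sample -log p_i

# Maximizing the likelihood is the same as minimizing the NLL.
assert np.isclose(nll, -np.log(total_likelihood))
```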
MSE with Fixed Variance Gaussian Noise
Let's assume that the noise distribution is Gaussian with fixed variance $\sigma^2$. Then, the mean $\mu$ is the only parameter that can change as the input data $x$ varies. We can rewrite the distribution as follows:

$$p(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y - \mu(x))^2}{2\sigma^2}\right)$$
By replacing $\mu(x)$ with the ML model output $f_\theta(x)$, we can derive that minimizing the negative log likelihood is equivalent to minimizing the squared error:

$$-\log p_i = \frac{(y_i - f_\theta(x_i))^2}{2\sigma^2} + \log\!\left(\sqrt{2\pi}\,\sigma\right)$$

Since $\sigma$ is fixed, the second term is a constant and the first term is the squared error up to a constant factor, so minimizing the NLL is exactly minimizing the mean squared error (MSE).
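Here is a quick numerical check of this equivalence, a sketch assuming a fixed $\sigma = 1$ and made-up labels and predictions:

```python
import numpy as np
from scipy.stats import norm

sigma = 1.0
y = np.array([0.5, -1.2, 2.0])   # true labels
f = np.array([0.4, -1.0, 2.5])   # model outputs, treated as Gaussian means

nll = -np.sum(norm.logpdf(y, loc=f, scale=sigma))
half_sse = np.sum((y - f) ** 2) / (2 * sigma**2)
const = len(y) * np.log(np.sqrt(2 * np.pi) * sigma)

# NLL = squared error / (2 sigma^2) + constant
assert np.isclose(nll, half_sse + const)
```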
Cross Entropy with Multinomial Distribution Noise
If we assume the noise distribution to be a multinomial (categorical) distribution over $K$ classes, whose parameters are the predicted class probabilities $\hat{y}_k = f_\theta(x)_k$, the NLL becomes the cross entropy:

$$-\log p_i = -\log \prod_{k=1}^{K} \hat{y}_{i,k}^{\,y_{i,k}} = -\sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}$$

where $y_{i,k}$ is the one-hot encoded true label.
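A minimal numerical sketch of this identity, assuming a one-hot label and made-up predicted probabilities:

```python
import numpy as np

# Model output: predicted class probabilities (e.g., after a softmax).
y_hat = np.array([0.7, 0.2, 0.1])
# One-hot encoded true label: class 0.
y = np.array([1.0, 0.0, 0.0])

# NLL of the categorical sample equals the cross entropy.
nll = -np.log(np.prod(y_hat ** y))
cross_entropy = -np.sum(y * np.log(y_hat))
assert np.isclose(nll, cross_entropy)
```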