
While following some code for a least squares problem using gradient descent, the claim was that the functional to be minimized is the "mean square error", $E=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$, where the $y_i$ are the data points and the $\hat{y}_i$ are the outputs of the linear model at the points $x_i$.

My question is: why the factor of $\frac{1}{n}$? It makes no difference when setting the partial derivatives equal to $0$, so is there a statistical reason for using it, say if the error terms are distributed a certain way (for the least squares problem itself we don't care how they're distributed), or is there a coding reason?
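
To make concrete what I mean by "makes no difference", here is a small sketch (numpy, with made-up toy data; the helper `gradients` is just mine for illustration). The gradient of the sum of squares is exactly $n$ times the gradient of the mean square, so both vanish at the same parameter values.

```python
import numpy as np

# Made-up toy data: fit y ≈ w*x + b by least squares.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=50)

def gradients(w, b, use_mean):
    """Gradient of the squared-error objective with respect to (w, b)."""
    resid = (w * x + b) - y
    scale = 1.0 / len(x) if use_mean else 1.0
    return 2.0 * scale * np.array([np.dot(resid, x), resid.sum()])

g_sse = gradients(1.5, 0.5, use_mean=False)  # sum of squared errors
g_mse = gradients(1.5, 0.5, use_mean=True)   # mean squared error

# The MSE gradient is the SSE gradient divided by n, so both are zero
# at exactly the same (w, b).
print(g_sse / g_mse)  # -> [50. 50.], i.e. n for both components
```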

  • Hi: you could use the sum when minimizing it. The term "mean" square error is more common probably because, when it comes to statistical hypothesis testing, the mean squared error (rather than the sum) takes on a specific role because it is used (and the $n$ is really needed) in various test statistics.
    – mlofton
    Commented Mar 1 at 12:39
  • One of many possible answers: although multiplying by $1/n$ does not change any solution, it makes the values of $E$ more comparable for differing $n$. In many circumstances, the value of $E$ converges to a fixed value or a fixed distribution as $n$ grows large, whereas $nE$ would always diverge in any sense. There's no conceivable coding reason, because no reasonable $n$ will be so large as to compensate for overflow in the sum or create underflow in $E$, and the division by $n$ scarcely affects the floating point precision. (A small numerical sketch of this point follows the comments.)
    – whuber
    Commented Mar 1 at 15:26
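
A quick numerical illustration of the second comment (a toy simulation of my own, assuming i.i.d. standard-normal residuals): $E$ settles near the noise variance as $n$ grows, while $nE$ keeps growing.

```python
import numpy as np

# Toy simulation: residuals are i.i.d. N(0, 1), so E = mean of squares
# should stabilise near 1.0 as n grows, while n*E grows without bound.
rng = np.random.default_rng(1)
for n in [10, 100, 1_000, 10_000, 100_000]:
    resid = rng.normal(0.0, 1.0, size=n)   # stand-in for y_i - yhat_i
    E = np.mean(resid ** 2)                # "mean square error"
    print(f"n={n:>7d}   E={E:6.3f}   n*E={n * E:12.1f}")
```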

2 Answers


As you note, the factor $1/n$ makes no difference to the calculation. In fact, it is not uncommon in some texts for the factor to be dropped, with "least squares" defined as the solution that minimizes the sum of squared errors rather than the MSE.

The way that least squares solutions are actually computed in practice depends on the code, but a common approach is to gather the explanatory variables into a matrix $X$ and the dependent variable into a corresponding vector $y$, and then work with the first-order condition for minimizing the sum of squared errors. That first-order condition can be written $X'Xb = X'y$, where $b$ is the solution to the minimization problem. There are many ways to solve numerically for $b$. The best-known solution is to invert the $X'X$ matrix and multiply both sides by this inverse, which gives the familiar equation $b=(X'X)^{-1}X'y$. In practice this is not necessarily the most efficient or numerically stable way to find the answer; consult any advanced text on statistical computing for details.

But to get back to your question: notice that the factor $1/n$ does not appear in these equations, so, no, it is not needed for coding reasons.
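
A small numpy sketch of this (my own illustration with made-up data, not from any particular text): the normal equations give the same $b$ whether or not both sides are divided by $n$, and a library least-squares routine such as `np.linalg.lstsq` is generally preferred to forming the inverse explicitly.

```python
import numpy as np

# Made-up data: 100 observations, intercept plus two explanatory variables.
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0.0, 0.1, size=n)

# Normal equations X'X b = X'y; solve() avoids forming the inverse.
b_sum  = np.linalg.solve(X.T @ X, X.T @ y)          # from the sum of squares
b_mean = np.linalg.solve(X.T @ X / n, X.T @ y / n)  # both sides divided by n

# A more numerically stable route, closer to what library code does.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(b_sum, b_mean))   # True: the 1/n factor cancels
print(np.allclose(b_sum, b_lstsq))  # True, up to floating point
```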


You are averaging the squared errors: the summation $\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$ adds up all the squared errors, and the factor $\frac{1}{n}$ divides by the number of terms, i.e. it takes their mean.

Hence the name mean square error.

