While following some code for a least squares problem using gradient descent, the claim was that the functional to be minimized is the "mean square error", $E=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$, where the $y_i$ are the data points and the $\hat{y}_i$ are the outputs of the linear model at the points $x_i$.
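For concreteness, here is a minimal sketch (in Python; this is not the original code I was following, and the model $\hat{y}_i = wx_i + b$ and the learning-rate/step-count parameters are just illustrative assumptions) of the kind of gradient descent loop in question:

```python
import numpy as np

def fit_least_squares(x, y, lr=0.01, steps=1000):
    """Gradient descent on E = (1/n) * sum((y - y_hat)^2) for y_hat = w*x + b."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        y_hat = w * x + b
        # Partial derivatives of the MSE with respect to w and b;
        # the 1/n factor from E carries through into both gradients.
        grad_w = (-2.0 / n) * np.sum((y - y_hat) * x)
        grad_b = (-2.0 / n) * np.sum(y - y_hat)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```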
My question is: why the factor of $\frac{1}{n}$? It makes no difference when setting the partial derivatives equal to $0$. So is there a statistical reason for including it, say if the error terms are distributed a certain way (for the least squares problem itself, we don't care how they're distributed), or is there a coding reason?
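To make the first point concrete, here is a quick numerical check (on made-up data, continuing the sketch above) that scaling the objective by the positive constant $\frac{1}{n}$ leaves the minimizer unchanged, since it rescales both sides of the normal equations $X^\top X w = X^\top y$ by the same factor:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=50)

X = np.column_stack([x, np.ones_like(x)])
# lstsq minimizes the *sum* of squared residuals sum((y - Xw)^2)
w_sum, *_ = np.linalg.lstsq(X, y, rcond=None)
# Minimizing the *mean* instead just divides the normal equations by n
w_mean = np.linalg.solve((X.T @ X) / len(x), (X.T @ y) / len(x))
print(np.allclose(w_sum, w_mean))  # True: same fitted coefficients
```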