https://datascience.stackexchange.com/questions/52157/why-do-we-have-to-divide-by-2-in-the-ml-squared-error-cost-function
It is simple: when you take the derivative of the cost function, which is used to update the parameters during gradient descent, the 2 from the power cancels with the 1/2 multiplier, so the derivative is cleaner. This technique, or something similar, is widely used in mathematics "to make the derivations mathematically more convenient". You can simply remove the multiplier (see here, for example) and expect the same result, since scaling the cost function by a positive constant does not change where its minimum is.
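To see the cancellation explicitly, here is a minimal sketch with the usual squared-error cost carrying the 1/2 factor, assuming a linear hypothesis $h_\theta(x) = \theta^\top x$ over $m$ training examples (this notation is my choice, not from the original question):

$$
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
$$

$$
\frac{\partial J}{\partial \theta_j}
= \frac{1}{2m}\sum_{i=1}^{m} 2\left(h_\theta(x^{(i)}) - y^{(i)}\right)\frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}
= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
$$

The 2 brought down by the power rule cancels the 1/2, leaving a tidy gradient; without the 1/2 you would simply carry an extra factor of 2, which changes nothing but the effective learning rate.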
In short, it makes the computation easier during backpropagation. This is a common trick in mathematics.