
Why do we use Xavier initialization?
The following factors call for the application of Xavier initialization:
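If the weights in a network start very small, most of the signals will shrink as they pass through each layer and become dormant (too small to be useful) at the activation functions in the later layers.
If the weights start very large, most of the signals will grow massively as they pass through each layer and become too large to be useful in the later layers (for example, by saturating sigmoid or tanh activation functions).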
Thus, Xavier initialization helps in generating weights of a suitable scale, so that the signals stay within a workable range, thereby minimizing the chances of the signals becoming either too small or too large.
The derivation of the preceding formula is beyond the scope of this book. For a better understanding, you can go through the derivation here: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization.
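The effect of the starting weight scale described above can be checked with a small experiment. The following is a minimal sketch, not a definitive implementation: it assumes a plain stack of tanh layers built with NumPy, and it assumes the commonly used uniform variant of Xavier initialization, where weights are drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The helper names (xavier_uniform, final_layer_stats), the layer count, and the layer width are all hypothetical choices made for illustration.

```python
import numpy as np


def xavier_uniform(fan_in, fan_out, rng):
    # Assumed variant: uniform Xavier (Glorot) initialization,
    # U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def final_layer_stats(init_fn, n_layers=10, width=256, batch=512, seed=0):
    # Push random inputs through a stack of tanh layers (hypothetical setup)
    # and report the standard deviation of the last layer's activations,
    # along with the fraction of units that are saturated (|a| > 0.99).
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    for _ in range(n_layers):
        w = init_fn(width, width, rng)
        x = np.tanh(x @ w)
    return x.std(), np.mean(np.abs(x) > 0.99)


# Three starting scales: very small, very large, and Xavier.
small_std, small_sat = final_layer_stats(lambda i, o, r: r.normal(0.0, 0.01, (i, o)))
large_std, large_sat = final_layer_stats(lambda i, o, r: r.normal(0.0, 1.0, (i, o)))
xavier_std, xavier_sat = final_layer_stats(xavier_uniform)

print("small weights : std =", small_std, " saturated =", small_sat)
print("large weights : std =", large_std, " saturated =", large_sat)
print("Xavier weights: std =", xavier_std, " saturated =", xavier_sat)
```

In this sketch, the very small weights should leave the last layer's activations close to zero, the very large weights should push most units into the flat regions of tanh, and the Xavier weights should keep the activations in between. The exact numbers depend on the chosen depth, width, and activation function.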