What Makes Training Multi-modal Classification Networks Hard?
Balanced Multimodal Learning via On-the-fly Gradient Modulation