The Multimodal Residual Network (MRN) extends residual learning to visual question answering and achieves state-of-the-art results. MRN introduces shortcut connections into the joint learning of question and image embeddings, so that stacking blocks avoids the degradation seen in very deep networks. In evaluation, MRN outperforms stacked attention networks, and its accuracy improves as depth increases up to three blocks. Although the model computes no explicit weighted-sum attention, its implicit attention maps can still be visualized to reveal where it focuses spatially.
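To make the idea of a multimodal shortcut concrete, the following is a minimal NumPy sketch of one possible block, not the authors' implementation. It assumes the joint residual is an element-wise product of tanh-projected question and image features, and that the shortcut is a linear map of the question embedding. The names `mrn_block`, `init_params`, `mrn`, `W_q`, `W_v`, and `W_s` are hypothetical and chosen only for illustration.

```python
import numpy as np


def init_params(rng, d, m):
    """Random weights for one block (hypothetical layout).

    d: question/joint embedding size, m: visual feature size.
    """
    return {
        "W_q": rng.standard_normal((d, d)) * 0.1,  # question projection
        "W_v": rng.standard_normal((d, m)) * 0.1,  # visual projection
        "W_s": rng.standard_normal((d, d)) * 0.1,  # shortcut projection
    }


def mrn_block(q, v, params):
    """One assumed multimodal residual block.

    The joint residual multiplies the two projected modalities
    element-wise; the shortcut carries the question embedding forward.
    """
    joint = np.tanh(params["W_q"] @ q) * np.tanh(params["W_v"] @ v)
    return params["W_s"] @ q + joint  # shortcut + joint residual


def mrn(q, v, blocks):
    """Stack blocks: each block's output is the next block's question input."""
    h = q
    for params in blocks:
        h = mrn_block(h, v, params)
    return h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m = 8, 16
    blocks = [init_params(rng, d, m) for _ in range(3)]  # three blocks
    q = rng.standard_normal(d)  # stand-in question embedding
    v = rng.standard_normal(m)  # stand-in image feature
    print(mrn(q, v, blocks).shape)
```

In this sketch the image feature `v` is fed unchanged to every block, while the question path is the one carried through the shortcut. If the residual branch contributes nothing, a block reduces to the shortcut alone, which is the property that lets extra depth be added without degrading the signal.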