Abstract— At present, sign language recognition (SLR) researchers are mainly committed to building recognition models from single-modality data. However, this practice often yields an incomplete understanding of sign language semantics and discards useful visual information. In short, the key challenges lie in removing redundancy and aligning the sign language data with the given labels. To address these challenges, this paper proposes a deep learning framework called CRB-Net, which employs a multimodal fusion attention mechanism. We first extract features from RGB video and depth video, respectively, and then perform multimodal fusion. Finally, the fused features are fed into an encoder-decoder network to achieve end-to-end continuous SLR. We verify the effectiveness of our method on three datasets: the German dataset RWTH-Phoenix-Weather-2014 and the Chinese datasets USTC-CSL and TJUT-SLRT. Experimental results show that CRB-Net achieves an accuracy of 98.5% and outperforms state-of-the-art methods in both accuracy and algorithm execution efficiency.
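To make the pipeline described above concrete, the following is a minimal sketch of the overall flow (per-frame RGB and depth features, an attention-based fusion step, and an encoder-decoder producing gloss logits). The class names `AttentionFusion` and `EncoderDecoderSLR`, the GRU-based encoder-decoder, and all dimensions are illustrative assumptions; the abstract does not specify CRB-Net's actual fusion module or network architecture.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse RGB and depth feature sequences with learned attention weights.
    (Hypothetical sketch; CRB-Net's actual fusion mechanism is not given here.)"""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, rgb_feat, depth_feat):
        # rgb_feat, depth_feat: (batch, time, dim)
        stacked = torch.stack([rgb_feat, depth_feat], dim=2)   # (B, T, 2, D)
        weights = torch.softmax(self.score(stacked), dim=2)    # (B, T, 2, 1)
        return (weights * stacked).sum(dim=2)                  # (B, T, D)

class EncoderDecoderSLR(nn.Module):
    """Encoder-decoder over the fused sequence, emitting per-step gloss logits.
    (GRU layers are an assumption for illustration only.)"""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.fusion = AttentionFusion(dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, rgb_feat, depth_feat):
        fused = self.fusion(rgb_feat, depth_feat)
        enc_out, hidden = self.encoder(fused)
        dec_out, _ = self.decoder(enc_out, hidden)
        return self.classifier(dec_out)                        # (B, T, vocab)

# Usage with dummy per-frame features standing in for CNN backbone outputs.
model = EncoderDecoderSLR(dim=512, vocab_size=1200)
rgb = torch.randn(2, 60, 512)    # 2 clips, 60 frames, 512-d features
depth = torch.randn(2, 60, 512)
logits = model(rgb, depth)
print(logits.shape)              # torch.Size([2, 60, 1200])
```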