merge asm fragments in H264_CHROMA_MC2_TMPL()
10% faster avg_h264_chroma_mc2_mmx2()
5% faster put_h264_chroma_mc2_mmx2()

factor out common subexpression (gcc of course is too stupid to do this ...)
5% faster avg_h264_chroma_mc2_mmx2()
10% faster put_h264_chroma_mc2_mmx2()
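
The log does not say which subexpression was factored out, so the following is only an illustrative sketch in plain C, not the MMX2 code from the commit. It assumes the standard H.264 chroma bilinear filter with eighth-pel offsets x and y, and shows the weight products being computed once instead of per pixel. The function names put_chroma_mc2_naive() and put_chroma_mc2_hoisted() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Naive form: the four weight products are written out per output pixel.
 * src must provide h + 1 rows of at least 3 readable bytes each. */
static void put_chroma_mc2_naive(uint8_t *dst, const uint8_t *src,
                                 int stride, int h, int x, int y)
{
    assert(x >= 0 && x < 8 && y >= 0 && y < 8);
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < 2; j++) {
            dst[j] = (uint8_t)(((8 - x) * (8 - y) * src[j] +
                                x       * (8 - y) * src[j + 1] +
                                (8 - x) * y       * src[j + stride] +
                                x       * y       * src[j + stride + 1] +
                                32) >> 6);
        }
        dst += stride;
        src += stride;
    }
}

/* Same filter with the common subexpressions (the weights) factored out
 * of the loops by hand. */
static void put_chroma_mc2_hoisted(uint8_t *dst, const uint8_t *src,
                                   int stride, int h, int x, int y)
{
    assert(x >= 0 && x < 8 && y >= 0 && y < 8);
    const int A = (8 - x) * (8 - y);
    const int B = x * (8 - y);
    const int C = (8 - x) * y;
    const int D = x * y;

    for (int i = 0; i < h; i++) {
        dst[0] = (uint8_t)((A * src[0] + B * src[1] +
                            C * src[stride] + D * src[stride + 1] + 32) >> 6);
        dst[1] = (uint8_t)((A * src[1] + B * src[2] +
                            C * src[stride + 1] + D * src[stride + 2] + 32) >> 6);
        dst += stride;
        src += stride;
    }
}
```

Both functions produce identical output. A compiler can usually hoist such products in plain C; inside hand-written inline asm it cannot, which is presumably why the factoring had to be done by hand.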