The microarchitecture of AI chips differs from vendor to vendor, and Ascend is no exception: it is quite different from NVIDIA GPUs. To make Triton ops perform well, we usually need to do some code ...