Inferring gene expressions from histopathological images has long been a fascinating yet challenging task, primarily due to the substantial disparities between the two modality. Existing strategies using local or global features of histological images are suffering model complexity, GPU consumption, low interpretability, insufficient encoding of local features, and over-smooth prediction of gene expressions among neighboring sites. In this paper, we develop TCGN (Transformer with Convolution …