In recent years, with the development of Artificial Intelligence (AI) technology, deep learning algorithms and specialized AI processor chips have been applied ever more widely in edge and end-device signal processing systems. A key technical challenge is achieving high-bandwidth, low-latency data transmission between heterogeneous processors while giving the system high-level intelligent computing capabilities. To address this, an embedded heterogeneous intelligent computing system was designed that integrates a Cambricon MLU220 chip, a domestically produced Feiteng (Phytium) FT2000/4 CPU, and a Xilinx XC7K325T Field Programmable Gate Array (FPGA). High-speed interconnection and data transmission between the system's heterogeneous processors are realized over the Peripheral Component Interconnect Express (PCIe) bus. In addition, an optimization technique for PCIe Scatter-Gather Direct Memory Access (DMA) transmission under Linux is proposed. It combines a prefetch scheme based on double buffering with interrupt handling based on work queues, and effectively improves the PCIe data transmission bandwidth between the CPU and the FPGA. In image transmission tests, 10 grayscale images of 2048×1024 pixels were transferred between the CPU and the FPGA over a PCIe 2.0 x4 bus. In dual-channel DMA mode, the system achieved read and write speeds of 1610 MB/s and 1655 MB/s, that is, 81% and 83% of the theoretical PCIe 2.0 x4 bandwidth, respectively. These results demonstrate the practicality and advanced nature of the designed system.
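As a rough check of the reported utilization figures, the following sketch recomputes the theoretical PCIe 2.0 x4 bandwidth from the link's 5 GT/s per-lane signalling rate and 8b/10b encoding, then compares it with the measured speeds. It is an illustration only, not the authors' test code. It assumes 1 MB = 10^6 bytes, ignores packet and protocol overhead, and assumes 8-bit grayscale pixels (the pixel depth is not stated in the abstract). All function names are hypothetical.

```python
# Back-of-the-envelope check of the PCIe 2.0 x4 figures quoted in the abstract.
# Assumptions (not from the source): 1 MB = 1e6 bytes, no TLP/protocol overhead,
# and 1 byte per grayscale pixel.

GT_PER_S_PER_LANE = 5.0e9      # PCIe 2.0 signalling rate: 5 GT/s per lane
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b line encoding: 8 data bits per 10 line bits
LANES = 4                      # x4 link


def theoretical_bandwidth_mb_s(lanes=LANES):
    """Raw one-direction link bandwidth in MB/s after 8b/10b encoding."""
    bits_per_s = GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes
    return bits_per_s / 8 / 1e6


def utilization_percent(measured_mb_s, lanes=LANES):
    """Measured throughput as a percentage of the theoretical bandwidth."""
    return 100.0 * measured_mb_s / theoretical_bandwidth_mb_s(lanes)


def payload_bytes(width=2048, height=1024, images=10, bytes_per_pixel=1):
    """Total test payload; bytes_per_pixel=1 is an assumed 8-bit depth."""
    return width * height * images * bytes_per_pixel


def transfer_time_ms(n_bytes, measured_mb_s):
    """Time to move n_bytes at the measured speed, in milliseconds."""
    return n_bytes / (measured_mb_s * 1e6) * 1e3


if __name__ == "__main__":
    print("theoretical MB/s:", theoretical_bandwidth_mb_s())
    for label, speed in (("read", 1610), ("write", 1655)):
        print(label,
              "utilization %:", utilization_percent(speed),
              "transfer ms:", transfer_time_ms(payload_bytes(), speed))
```

Under these assumptions the theoretical bandwidth is 2000 MB/s, so 1610 MB/s and 1655 MB/s correspond to 80.5% and 82.75%, in line with the 81% and 83% reported. The full 10-image payload of about 21 MB would take roughly 13 ms to move at either speed.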