Research

DMA Explained: Moving Data Without the CPU

How Direct Memory Access works on microcontrollers, why it matters for high-throughput embedded systems and how to configure a DMA transfer on an STM32 without relying on HAL.

1 July 2026 · 6 min read

DMA

STM32

Embedded

Performance

Every time a microcontroller moves data using the CPU - reading bytes from a UART receive register, filling a display buffer, copying ADC samples into memory - the CPU is doing work that does not require the CPU. Each byte requires a load instruction, a store instruction and a loop iteration. At 1 MHz sample rates or when streaming data to a DAC, this becomes a significant fraction of available CPU time. Direct Memory Access (DMA) is the hardware mechanism that moves data between peripherals and memory without the CPU handling each byte: the CPU sets up the transfer once and is only involved again when it completes.

How DMA Works

A DMA controller is a dedicated hardware block that acts as a second bus master alongside the CPU, with its own access to memory and peripherals. Each channel has its own registers: a source address, a destination address, a transfer count and a configuration register specifying the transfer size (byte, half-word or word), the direction (peripheral to memory, memory to peripheral or memory to memory) and whether addresses should auto-increment after each transfer. The CPU programs these registers, then starts the transfer. The DMA controller arbitrates for the bus, moves the data and fires an interrupt when it is done. The CPU is free to do other work throughout, although it can be stalled for a few cycles when both masters want the same bus at the same time.
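To make that register set concrete, here is a minimal host-side model of a single DMA channel in plain C. It runs on a PC, not on the microcontroller, and the names `dma_model_t` and `dma_model_run` are hypothetical: they do not correspond to any real STM32 register layout. The sketch only mimics what the hardware does with the source address, destination address, count, unit size and increment flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical software model of one DMA channel (not a real register map). */
typedef struct {
    const uint8_t *src;  /* source address */
    uint8_t *dst;        /* destination address */
    size_t count;        /* number of transfers remaining */
    size_t unit;         /* bytes per transfer: 1, 2 or 4 */
    int src_inc;         /* non-zero: advance source after each transfer */
    int dst_inc;         /* non-zero: advance destination after each transfer */
    int done;            /* set when count reaches zero (stands in for the interrupt) */
} dma_model_t;

/* Performs the whole transfer one unit at a time, as the hardware would. */
void dma_model_run(dma_model_t *ch)
{
    while (ch->count > 0) {
        memcpy(ch->dst, ch->src, ch->unit);
        if (ch->src_inc) ch->src += ch->unit;
        if (ch->dst_inc) ch->dst += ch->unit;
        ch->count--;
    }
    ch->done = 1;
}
```

A peripheral-to-memory transfer corresponds to `src_inc = 0` (the data register stays at one address) and `dst_inc = 1` (the buffer pointer walks forward), which is the same idea as setting only the memory-increment bit on real hardware.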

The performance difference is substantial. Receiving 1024 bytes over SPI in software requires 1024 load-from-SPI-DR operations, 1024 store-to-buffer operations and 1024 loop iterations: roughly 3072 instructions. DMA moves the same 1024 bytes with the CPU executing essentially zero instructions for the transfer itself. On a core running at 72 MHz, and assuming one instruction per cycle, that is about 43 microseconds of CPU time spent purely on copying, before counting any time spent polling the SPI status flag while waiting for each byte. With DMA the CPU is released for that whole period.
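The arithmetic behind that figure can be checked with a few lines of C. This is a rough estimate only: the helper name `sw_copy_time_us` is made up for illustration, and it assumes every instruction takes exactly one clock cycle, which ignores flash wait states, branch penalties and bus stalls.

```c
#include <stdint.h>

/* Estimated CPU time in microseconds for a software copy loop,
   assuming every instruction takes exactly one clock cycle. */
double sw_copy_time_us(uint32_t bytes, uint32_t instr_per_byte, uint32_t cpu_hz)
{
    double instructions = (double)bytes * (double)instr_per_byte;
    return instructions / (double)cpu_hz * 1e6;
}
```

Calling `sw_copy_time_us(1024, 3, 72000000u)` gives 3072 / 72 = roughly 42.7 microseconds, which is where the 43 microsecond figure comes from.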

Put another way, DMA is a dedicated delivery driver for data: you tell it where to pick up (source address), where to drop off (destination address) and how many items to move (transfer count), then it does the work while the CPU handles something else entirely.

DMA on STM32: Channels and Requests

STM32 microcontrollers have one or two DMA controllers (DMA1 and DMA2 on larger devices), each with multiple channels. Each channel can be triggered by a specific peripheral request: USART1 receive maps to DMA1 Channel 5 on the STM32F103, SPI1 receive maps to DMA1 Channel 2, ADC1 maps to DMA1 Channel 1. On the STM32F1 family these mappings are fixed in hardware and documented in the reference manual (some newer STM32 families add a DMAMUX request router that makes them configurable). On a fixed-mapping device, configuring the wrong channel for a peripheral fails silently: the request never reaches that channel, so no data moves.
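One way to keep these mappings out of scattered magic numbers is a small lookup table. The sketch below contains only the three STM32F103 mappings quoted above; the type `dma_map_t` and the function `dma_lookup` are hypothetical names, and the table would need to be filled in from the reference manual for any other request.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *request;  /* peripheral request name */
    int controller;       /* 1 = DMA1, 2 = DMA2 */
    int channel;          /* channel number on that controller */
} dma_map_t;

/* Only the three STM32F103 mappings quoted in the text. */
static const dma_map_t f103_dma_map[] = {
    { "ADC1",      1, 1 },
    { "SPI1_RX",   1, 2 },
    { "USART1_RX", 1, 5 },
};

/* Returns the table entry for a request name, or NULL if it is not listed. */
const dma_map_t *dma_lookup(const char *request)
{
    for (size_t i = 0; i < sizeof f103_dma_map / sizeof f103_dma_map[0]; i++) {
        if (strcmp(f103_dma_map[i].request, request) == 0) {
            return &f103_dma_map[i];
        }
    }
    return NULL;
}
```

A NULL result is a useful signal during bring-up: it means the mapping has not been checked against the reference manual yet.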

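// I configure DMA1 Channel 5 for USART1 receive on STM32F103
#include "stm32f1xx.h"  // CMSIS device header; the name depends on your toolchain setup

void usart1_dma_rx_init(uint8_t *buf, uint16_t len) {
    RCC->AHBENR |= RCC_AHBENR_DMA1EN;                // I enable the DMA1 clock first

    DMA1_Channel5->CCR = 0;                          // I disable before configuring
    DMA1->IFCR = DMA_IFCR_CGIF5;                     // I clear any stale channel 5 flags
    DMA1_Channel5->CPAR  = (uint32_t)&USART1->DR;   // peripheral address
    DMA1_Channel5->CMAR  = (uint32_t)buf;            // memory address
    DMA1_Channel5->CNDTR = len;                      // transfer count
    DMA1_Channel5->CCR   =
        DMA_CCR_MINC   |   // increment memory address after each byte
        DMA_CCR_TCIE   |   // interrupt when transfer complete
        DMA_CCR_EN;        // enable channel (DIR = 0: peripheral to memory, 8-bit sizes)

    // TCIE only reaches the CPU once the NVIC line is enabled;
    // this needs a DMA1_Channel5_IRQHandler that clears TCIF5
    NVIC_EnableIRQ(DMA1_Channel5_IRQn);

    USART1->CR3 |= USART_CR3_DMAR;  // I enable USART DMA receive request
}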

Circular Mode and Continuous Sampling

Normal mode DMA stops when the transfer count reaches zero and fires the transfer complete interrupt. Circular mode (the CIRC bit in the channel configuration register) reloads the counter with its programmed value automatically and continues indefinitely. This is the correct mode for continuous ADC sampling: configure DMA in circular mode pointing at a buffer, enable ADC continuous conversion and its DMA request, and the DMA controller fills the buffer repeatedly. The CPU can process each completed half-buffer while the DMA fills the other half - a double-buffering pattern that is standard practice in audio and sensor data acquisition.
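In circular mode the transfer count register (CNDTR in the STM32F103 code earlier) keeps counting down and reloading, so reading it tells you where the DMA is currently writing. The helpers below are a sketch of that index arithmetic as plain C; the function names are hypothetical, `buf_len` must be non-zero, and the second helper assumes the DMA has not lapped the reader.

```c
#include <stdint.h>

/* Index of the next element the DMA will write, derived from the
   remaining-transfer counter (remaining == buf_len right after a reload). */
uint16_t dma_write_index(uint16_t buf_len, uint16_t remaining)
{
    return (uint16_t)((buf_len - remaining) % buf_len);
}

/* Number of unread elements between the reader index and the DMA write
   index, assuming the DMA has not overtaken the reader. */
uint16_t dma_unread(uint16_t buf_len, uint16_t read_idx, uint16_t write_idx)
{
    return (uint16_t)((write_idx + buf_len - read_idx) % buf_len);
}
```

On the target, `remaining` would come from reading the channel's CNDTR register. With a 512-element buffer, a reader at index 500 and a write index of 12, there are 24 unread elements that wrap around the end of the buffer.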

Cache Coherency on Cortex-M7

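On Cortex-M7 devices (STM32H7 and STM32F7 series) the CPU has a data cache. DMA accesses main memory directly, bypassing the cache. If the CPU has cached a region of memory that DMA is writing to, the CPU reads stale data from cache rather than the fresh DMA-written values. The reverse also applies: data the CPU has written may still be sitting in the cache when DMA reads the buffer from memory, so transmit buffers must be cleaned (written back) before the transfer starts. The solution is to either place DMA buffers in non-cacheable memory (an MPU region marked non-cacheable, or DTCM via the linker script on devices where the DMA controller can reach it) or to explicitly invalidate the cache region before reading DMA-written data. Cache maintenance works on whole 32-byte cache lines, so buffers should be aligned to and padded to 32 bytes. Failing to handle this is one of the most common bugs when porting DMA code from Cortex-M4 to Cortex-M7 devices.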
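Because cache maintenance operates on whole lines, the address and length passed to an invalidate call are usually rounded out to line boundaries. The following is a sketch of that rounding as portable C, assuming the 32-byte line size of the Cortex-M7 data cache; the helper names `cache_align_down` and `cache_span` are hypothetical.

```c
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 data cache line size in bytes */

/* Rounds an address down to the start of its cache line. */
uintptr_t cache_align_down(uintptr_t addr)
{
    return addr & ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
}

/* Number of bytes covering [addr, addr + len) in whole cache lines,
   measured from the aligned start address. */
uint32_t cache_span(uintptr_t addr, uint32_t len)
{
    uintptr_t start = cache_align_down(addr);
    uintptr_t end = (addr + len + DCACHE_LINE_SIZE - 1u)
                    & ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
    return (uint32_t)(end - start);
}
```

On the target, the aligned address and span would typically be handed to the CMSIS routine `SCB_InvalidateDCache_by_Addr` before the CPU reads a receive buffer. Note that rounding outwards also invalidates whatever else shares those lines, which is why giving the buffer its own 32-byte-aligned, 32-byte-padded storage is the safer design.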

Half-Transfer Interrupt for Double Buffering

Circular DMA with a single buffer creates a race: if the main loop processes bytes too slowly, the DMA controller will overwrite the start of the buffer before the main loop has finished reading it. The solution is double buffering with the half-transfer interrupt. Set up DMA in circular mode with a buffer twice the size you actually need and enable both the half-transfer and transfer-complete interrupts (HTIE and TCIE). DMA then fires two interrupts per pass: half-transfer complete (HT) when it reaches the midpoint, and transfer complete (TC) when it wraps around. While DMA is writing the second half, your ISR or main loop processes the first half. When DMA writes the first half again, you process the second half. The two halves never conflict, provided each half is processed in less time than DMA takes to fill the other.

// I use a double buffer with half-transfer and transfer-complete interrupts
#define BUF_SIZE 512  // 256 samples per half

uint16_t adc_buf[BUF_SIZE];
volatile uint8_t half_ready = 0;  // 0 = nothing ready, 1 = first half ready, 2 = second half ready

void DMA1_Channel1_IRQHandler(void) {
    if (DMA1->ISR & DMA_ISR_HTIF1) {
        DMA1->IFCR = DMA_IFCR_CHTIF1;
        half_ready = 1;  // first half is ready to process
    }
    if (DMA1->ISR & DMA_ISR_TCIF1) {
        DMA1->IFCR = DMA_IFCR_CTCIF1;
        half_ready = 2;  // second half is ready
    }
}

// In main loop:
uint8_t ready = half_ready;  // I take a snapshot so the ISR cannot change it mid-check
if (ready != 0) {
    half_ready = 0;          // I clear the flag so each half is processed only once
    if (ready == 1) {
        process_samples(adc_buf, BUF_SIZE / 2);                 // first half
    } else {
        process_samples(adc_buf + BUF_SIZE / 2, BUF_SIZE / 2);  // second half
    }
}
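A single flag like the one above records only the most recent half, so it cannot tell you when the main loop fell behind and a half was overwritten before it was processed. One possible refinement, sketched here as hardware-independent C, keeps one pending bit per half and counts overruns. The names `half_completed`, `take_pending_half`, `pending_halves` and `overruns` are hypothetical, and on a real target the read-modify-write in `take_pending_half` would need to run with the DMA interrupt briefly masked.

```c
#include <stdint.h>

#define HALF_FIRST  0x01u
#define HALF_SECOND 0x02u

volatile uint8_t  pending_halves = 0;  /* bit set = that half holds unprocessed data */
volatile uint32_t overruns = 0;        /* halves overwritten before being processed */

/* Call from the DMA interrupt handler with HALF_FIRST (on HT) or HALF_SECOND (on TC). */
void half_completed(uint8_t half)
{
    if (pending_halves & half) {
        overruns++;  /* the previous contents of this half were never consumed */
    }
    pending_halves |= half;
}

/* Call from the main loop. Returns the half to process, or 0 if none is pending.
   If both are pending, the first half is returned first (a simplification). */
uint8_t take_pending_half(void)
{
    uint8_t half = 0;
    if (pending_halves & HALF_FIRST) {
        half = HALF_FIRST;
    } else if (pending_halves & HALF_SECOND) {
        half = HALF_SECOND;
    }
    pending_halves &= (uint8_t)~half;
    return half;
}
```

A non-zero `overruns` count during testing is a direct sign that processing one half takes longer than DMA needs to fill the other, which means either the buffer is too small or the processing is too slow.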

Common Pitfalls

Forgetting to enable the DMA request on the peripheral side - for USART, USART_CR3_DMAR must be set; DMA alone does nothing without the peripheral triggering it
Using the wrong DMA channel - channel-to-peripheral mappings are fixed in hardware and device-specific; the reference manual is the only reliable source
Not enabling the DMA clock - RCC->AHBENR |= RCC_AHBENR_DMA1EN is easy to forget
Cache coherency on Cortex-M7 - placing DMA buffers in cacheable SRAM without invalidating cache before reads; use __attribute__((section(".dma_buf"))) with a matching section in the linker script and configure the MPU to make that region non-cacheable
Transfer size mismatch - if the peripheral sends 16-bit values and you configure DMA for 8-bit transfers, every value is silently truncated; DMA_CCR_PSIZE and DMA_CCR_MSIZE must match the peripheral register width
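The transfer size pitfall is easy to see in isolation. The sketch below assumes the STM32F1 layout of the channel configuration register, where PSIZE occupies bits 9:8 and MSIZE bits 11:10 with 0, 1 and 2 meaning 8, 16 and 32 bits; the enum and function names are hypothetical, and the truncation helper is a simplified illustration of keeping only the low byte, not a model of exact bus behaviour.

```c
#include <stdint.h>

/* Bit positions of the size fields in DMA_CCRx on STM32F1. */
#define CCR_PSIZE_POS 8u
#define CCR_MSIZE_POS 10u

typedef enum {
    XFER_SIZE_8  = 0u,
    XFER_SIZE_16 = 1u,
    XFER_SIZE_32 = 2u
} xfer_size_t;

/* Builds the PSIZE/MSIZE portion of a CCR value. */
uint32_t ccr_size_bits(xfer_size_t psize, xfer_size_t msize)
{
    return ((uint32_t)psize << CCR_PSIZE_POS) | ((uint32_t)msize << CCR_MSIZE_POS);
}

/* Simplified view of a 16-bit sample moved as an 8-bit transfer:
   only the low byte survives. */
uint8_t truncate_to_byte(uint16_t sample)
{
    return (uint8_t)(sample & 0xFFu);
}
```

For a 12-bit ADC result both fields should be 16-bit, so `ccr_size_bits(XFER_SIZE_16, XFER_SIZE_16)` would be OR-ed into the configuration value. Leaving both at their reset value of 8-bit turns a sample such as 0x0ABC into 0xBC with no error flag raised.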

DMA moves the data. The CPU moves the product.
- Embedded systems engineering principle

Isaac Adjei

BEng Electronic Engineering & Computer Science at Aston University · embedded systems, full-stack software, AI/ML and open source
