Direct Memory Access on the PIC32MZ

What is Direct Memory Access (DMA)?

Direct Memory Access is a way for the CPU to offload data transfers, to or from a peripheral, to a separate module that handles the transfer in the background and lets the CPU know when it's done.
DMA is, in my opinion, one of the most powerful things found on microcontrollers and a big differentiator between them. But how does it work and why would I want to use it?

In my LCD example I was sending an image of Tux to the LCD. What if I now want to expand that to read frames from an SD card and blit them to the LCD? As per my example, I'd do this:

while (!f_eof(&file))
{
    // Read frame from disk
    f_read(&file, frame, FRAME_SIZE, &bytes_read);

    // Set LCD window position and size
    LCD_set_address(0, 0, FRAME_WIDTH - 1, FRAME_HEIGHT - 1);
    PMADDR = 1;

    // Send the pixel data to the LCD
    // (frame holds 16-bit pixels, FRAME_SIZE is in bytes)
    for (cnt = 0; cnt < FRAME_SIZE / 2; cnt++)
        PMDIN = frame[cnt];
}

OK, that works fine but it also means the CPU is occupied 100% of the time in that for loop, just sending pixels to the LCD. The Tux image was 210 x 248 pixels, which is fairly big. What kind of frame rate could we expect from such an approach?
210 x 248 x 2 = 104,160 bytes per frame. If I want to do 30 frames per second, that translates to 3,124,800 bytes per second that I need to read from the SD card and send to the LCD. That might be possible, but just barely.

Let's expand this example to an actual LCD-sized frame, 320 x 240 at 30 fps. This is 320 x 240 x 2 x 30 = 4,608,000 bytes per second to read and send to the LCD. Currently, even running the SD card at 50 MHz, I only get about 3.3 MB/s reading, so this would be impossible. How could we speed this up? Well, the slow part is the SD code; the writing to the LCD is actually quite fast. So what I want is some way to spend less CPU time sending data to the LCD and devote more CPU time to reading from the SD card. In effect, I want some way to send the data to the LCD that doesn't involve me waiting around in a for loop. Well, that's what the DMA module can do for us. It can read from and write to peripherals or ports in the background without using any CPU time, which means we are free to do other tasks, like reading from an SD card, while it is busy.
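The arithmetic above is easy to re-check on a PC. This is only a sketch: the function names are made up for illustration, and the SD card rate is the roughly 3.3 MB/s figure quoted above, taken as 3,300,000 bytes per second.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second needed to stream frames of the given size. */
static uint32_t stream_rate(uint32_t width, uint32_t height,
                            uint32_t bytes_per_pixel, uint32_t fps)
{
    return width * height * bytes_per_pixel * fps;
}

/* Non-zero if a source delivering source_rate bytes/s can feed the stream. */
static int source_keeps_up(uint32_t needed, uint32_t source_rate)
{
    return source_rate >= needed;
}

/* Re-checks the figures quoted in the text. */
void check_frame_rates(void)
{
    const uint32_t sd_rate = 3300000u;  /* assumed: ~3.3 MB/s as bytes/s */

    assert(stream_rate(210, 248, 2, 30) == 3124800u);  /* Tux image  */
    assert(stream_rate(320, 240, 2, 30) == 4608000u);  /* full frame */
    assert(source_keeps_up(stream_rate(210, 248, 2, 30), sd_rate));
    assert(!source_keeps_up(stream_rate(320, 240, 2, 30), sd_rate));
}
```

So the Tux-sized frame only just fits under the SD read rate, and the full 320 x 240 frame does not fit at all.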

Let's take a look at the official block diagram of the DMA controller:

[Figure: PIC32MZ DMA module block diagram]

From the diagram you can see that the CPU and the DMA module are separate. The CPU can give the DMA module an instruction, like "Send the data in the frame array to the PMP module", and the DMA module will start doing that immediately. This instruction to the DMA module only takes a few lines of code to set up, and is therefore much, much faster than having to run through an entire for loop. Again, it also happens entirely in the background, without the CPU's involvement, which leaves us free to do whatever we want while it's busy.

To summarise: One of the biggest advantages of DMA is it frees up the CPU to do other work while large data transfers are happening.

Using DMA on the PIC32MZ

The PIC32MZ has eight of these DMA channels, and each of them can transfer up to 64 KB (65,536 bytes) at a time. The DMA module runs directly off the System Clock (SYSCLK). There are also advanced features like chaining channels together, pattern matching and CRC generation. Today we're going to look at how to set up DMA transfers and how to use pattern matching.

For starters, let's see what information we need to give the DMA controller:

  • The address of the source of the data
  • The size of the source of the data, in bytes
  • The cell size (how much data to transfer each time), in bytes
  • The address of the destination of the data
  • The size of the destination of the data, in bytes
  • The source of the "clock signal" or interrupt to tell it to move the data (covered later)
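To keep that list straight, it can help to picture it as one structure. The sketch below is purely illustrative: dma_request and dma_request_is_sane are hypothetical names, not part of any Microchip header, and the only rule it checks is the 65,536-byte limit mentioned earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_BYTES 65536u  /* per-transfer limit quoted in this post */

/* Hypothetical container for the settings listed above. */
typedef struct {
    uint32_t src_addr;   /* address of the source of the data       */
    uint32_t src_size;   /* size of the source, in bytes            */
    uint32_t cell_size;  /* bytes moved each time the trigger fires */
    uint32_t dst_addr;   /* address of the destination of the data  */
    uint32_t dst_size;   /* size of the destination, in bytes       */
    uint32_t start_irq;  /* interrupt that tells it to move a cell  */
} dma_request;

/* Non-zero if every size is between 1 and DMA_MAX_BYTES. */
int dma_request_is_sane(const dma_request *r)
{
    assert(r != NULL);
    if (r->src_size == 0 || r->src_size > DMA_MAX_BYTES) return 0;
    if (r->dst_size == 0 || r->dst_size > DMA_MAX_BYTES) return 0;
    if (r->cell_size == 0 || r->cell_size > DMA_MAX_BYTES) return 0;
    return 1;
}
```

A 1,024-byte buffer sent 2 bytes at a time to a 2-byte register passes this check, while a whole 153,600-byte frame as a single source does not.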

In theory, it's very simple but this is the PIC32MZ. It takes your cute "theory" and laughs at it before ripping out your heart and laughing at you. There are many things the documentation either doesn't mention or describes very poorly. The most important of them is this:

Any buffers you use **MUST be declared coherent or nothing will work**

Coherent? The RAM on the PIC32MZ is slow compared to the CPU, so the CPU keeps recently used data in a cache to get better speed. The problem with this is that the DMA module reads and writes the RAM directly and never sees the cache, so the CPU and the DMA module can end up with different values for the same memory location. In DMA, this would lead to disaster. Coherent memory is memory that is accessed without any caching, so everything reads and writes the RAM directly. This means it's slower but more reliable.

If you look in any Harmony example that uses DMA, they declare their buffers like this:

unsigned short APP_MAKE_BUFFER_DMA_READY buffer[1024];

It turns out that APP_MAKE_BUFFER_DMA_READY is a friendly way of saying:

unsigned short __attribute__ ((coherent, aligned(16)))

This tells the compiler to place the array in coherent memory. So, where before you had to declare your buffer like this:

unsigned short read_buffer[1024];

You now need to declare it like this:

unsigned short __attribute__ ((coherent, aligned(16))) read_buffer[1024];

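It looks confusing but it's not a huge change. Please note that the 16 in aligned(16) is an alignment in bytes, not a number of bits. It makes the buffer start on a 16-byte boundary, which matches the size of a cache line on the PIC32MZ. It has nothing to do with the element type, so it stays at 16 for an unsigned char buffer too.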
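What aligned(16) actually requests can be checked on a PC with GCC or Clang. This is a sketch only: the coherent attribute is XC32-specific, so it is left out here, and the buffer and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* aligned(16) asks for a 16-byte boundary regardless of element type.
   The XC32-only "coherent" attribute is omitted so this builds on a PC. */
unsigned char  __attribute__((aligned(16))) byte_buffer[1024];
unsigned short __attribute__((aligned(16))) word_buffer[1024];

/* Non-zero if p sits on a 16-byte boundary. */
int starts_on_16_byte_boundary(const void *p)
{
    assert(p != NULL);
    return ((uintptr_t)p % 16u) == 0;
}
```

Both buffers start on a 16-byte boundary, the unsigned char one included, while the address one byte into byte_buffer does not.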

If you prefer using heap memory with malloc() and free(), the coherent memory versions of those are __pic32_alloc_coherent() and __pic32_free_coherent().
Remember though, if you use heap memory you need to specify a heap size under the XC32 compiler options or it will not work.

OK, enough theory for now, let's take a look at some code to send a 16-bit buffer to the PMP:

volatile int DMA_DONE_FLAG = 0;

void LCD_blit(unsigned short *buffer, int num_bytes)
{
    DCH0CONbits.CHEN = 0;                           // Turn off this channel

    DCH0SSA = virt_to_phys(buffer);                 // Move the data from the [buffer] array
    DCH0DSA = virt_to_phys((const void*)&PMDIN);    // Move the data to the PMDIN register
    DCH0SSIZ = num_bytes;                           // Move num_bytes bytes of data in total
    DCH0CSIZ = 2;                                   // Move 2 bytes at a time
    DCH0DSIZ = 2;                                   // Destination size is 2 bytes

    DCH0ECON = 0;                                   // Clear the DMA configuration settings
    DCH0ECONbits.CHSIRQ = _PMP_VECTOR;              // Move data on PMP interrupt
    DCH0ECONbits.CHAIRQ = _PMP_ERROR_VECTOR;        // Abort on PMP error
    DCH0ECONbits.SIRQEN = 1;                        // Enable Start IRQ
    DCH0ECONbits.AIRQEN = 1;                        // Enable Abort IRQ

    DCH0CONbits.CHPRI = 3;                          // The priority of this channel is 3 (highest)
    DCH0CONbits.CHEN = 1;                           // Turn this channel on now

    IPC33bits.DMA0IP = 3;                           // Set DMA 0 interrupt priority to 3
    IPC33bits.DMA0IS = 1;                           // Set DMA 0 interrupt sub-priority to 1
    IFS4bits.PMPIF = 0;                             // Clear the PMP interrupt flag
    IFS4bits.DMA0IF = 0;                            // Clear the DMA channel 0 interrupt flag
    IEC4bits.DMA0IE = 1;                            // Enable the DMA 0 interrupt
    DCH0INTbits.CHBCIE = 1;                         // Enable the Channel Block Transfer Complete (CHBC) Interrupt

    DCH0ECONbits.CFORCE = 1;                        // Force the start of the transfer now
    DMACONSET = 0x8000;                             // Turn the DMA module on
}

// Interrupt handler
void __attribute__((vector(_DMA0_VECTOR), interrupt(IPL3SRS), nomips16)) DMA0_handler()
{
    IFS4bits.DMA0IF = 0;                            // Clear the DMA channel 0 interrupt flag
    IEC4bits.DMA0IE = 0;                            // Disable the DMA 0 interrupt
    DMA_DONE_FLAG = 1;                              // DMA transfer is done
}

Important: Before continuing, I want to mention again that this can transfer a maximum of 65,536 bytes. This means it cannot transfer an entire 320 x 240 x 2 = 153,600 byte frame of data in one go. That can be accomplished by DMA chaining or interrupt handling, neither of which I am going into today.
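To get a feel for that limit, here is how a full frame would have to be cut up. This is just the arithmetic, runnable on a PC, and not the chaining or interrupt code itself; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DMA_MAX_BYTES 65536u  /* per-transfer limit quoted above */

/* Number of DMA transfers needed to move total_bytes. */
size_t dma_chunk_count(size_t total_bytes)
{
    return (total_bytes + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES;
}

/* Size in bytes of chunk number index (0-based). */
size_t dma_chunk_size(size_t total_bytes, size_t index)
{
    assert(index < dma_chunk_count(total_bytes));

    size_t left = total_bytes - index * DMA_MAX_BYTES;
    return left < DMA_MAX_BYTES ? left : DMA_MAX_BYTES;
}
```

For a 153,600-byte frame this gives three transfers of 65,536, 65,536 and 22,528 bytes.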

There are a few new things here. First of all, what is virt_to_phys? Then what's this IRQ-related stuff? Well, virt_to_phys is the name I copied from the datasheet. Let's take a look at what it does:

extern __inline__ unsigned int __attribute__((always_inline)) virt_to_phys(const void* p)
{
    return (int)p<0?((int)p&0x1fffffffL):(unsigned int)((unsigned char*)p+0x40000000L);
}

Easy, right? Seriously though, what it's doing is converting the virtual memory address of something to a physical memory address because the DMA module works with physical addresses.

Virtual vs Physical memory. To put it very simply, the PIC32 takes the physical memory and maps it into segments (like KSEG0, KSEG1, etc.), some of which are cacheable and some of which are not. It uses something called Fixed Mapping Translation (FMT) to translate these addresses to the actual physical memory location when they are used. The DMA module requires the actual physical address of the memory used, so we need to translate the PIC32's virtual memory address into a physical address, which is what virt_to_phys() does.
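The same translation can be tried on a PC by working on plain 32-bit numbers instead of pointers. This is a sketch with a made-up name (virt_to_phys_u32) and example addresses chosen only to show the arithmetic. It follows the same rule as virt_to_phys(): if the top bit is set, keep the low 29 bits, otherwise add 0x40000000.

```c
#include <assert.h>
#include <stdint.h>

/* Same rule as virt_to_phys(), applied to a 32-bit address value. */
uint32_t virt_to_phys_u32(uint32_t vaddr)
{
    if (vaddr & 0x80000000u)          /* kernel segments: top bit set */
        return vaddr & 0x1FFFFFFFu;   /* keep the low 29 bits         */
    return vaddr + 0x40000000u;       /* user segment                 */
}

/* Example addresses, picked only for illustration. */
void check_virt_to_phys(void)
{
    assert(virt_to_phys_u32(0x80001000u) == 0x00001000u);  /* KSEG0 */
    assert(virt_to_phys_u32(0xA0001000u) == 0x00001000u);  /* KSEG1 */
    assert(virt_to_phys_u32(0xBF80E000u) == 0x1F80E000u);  /* KSEG1 */
}
```

Note how the KSEG0 and KSEG1 addresses land on the same physical location. They are two views of the same memory, one cached and one not.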

The next thing you'll notice is that we have to supply a "source" interrupt for the DMA transfers. If you remember from the LCD example, in PMP_init() I had this line:

    PMMODEbits.IRQM = 1;    // IRQ at the end of the Read/Write cycle

This means that after every PMP transfer completes, an interrupt is generated. We do not need to write an Interrupt Service Routine (ISR) for this. It's all handled internally: the DMA module intercepts the interrupt and clears the interrupt flag for us each time.
There is also the option to abort the DMA transfer if the PMP error interrupt is generated, which is what _PMP_ERROR_VECTOR is doing.
Next, we can see that each DMA channel has a priority, just like interrupts do. This priority is also important in DMA chaining.

A word on interrupts. First, why have I changed to using a Shadow Register Set instead of software context saving? Simply put, it's faster because it means the PIC32 doesn't have to save the contents of all the many registers to memory before it calls the interrupt service routine (ISR). Before using this feature, it needs to be enabled, usually somewhere after set_performance_mode() in your main() function, like this:

PRISS = 0x76543210; // Assign shadow register sets to interrupt priorities 1 through 7
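That magic number is less mysterious once it is pulled apart. The sketch below assumes that each priority level gets its own 4-bit field in PRISS, with priority N in bits 4N+3 down to 4N; please check that against the datasheet for your part. The function name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Shadow set assigned to an interrupt priority (1..7), ASSUMING one
   4-bit field per priority with priority N at bits 4N+3..4N.
   Returns 0 for a priority outside 1..7. */
unsigned priss_shadow_set(uint32_t priss, unsigned priority)
{
    if (priority < 1 || priority > 7)
        return 0;
    return (unsigned)((priss >> (4u * priority)) & 0xFu);
}

void check_priss(void)
{
    /* With 0x76543210 every priority gets the set with the same number. */
    assert(priss_shadow_set(0x76543210u, 1) == 1u);
    assert(priss_shadow_set(0x76543210u, 3) == 3u);
    assert(priss_shadow_set(0x76543210u, 7) == 7u);
}
```

Under that assumption, the IPL3SRS handler above ends up using shadow set 3.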

When the DMA transfer is done, it can generate an interrupt to let us know it's done. Knowing what we know about ISRs and how they take valuable time, you may be tempted to do this:

while (DCH0INTbits.CHSDIF == 0);    // Wait for DMA transfer to finish

However, this would be a big mistake. In a code example in the DMA datasheet they say:
"continuously polling the DMA controller in a tight loop would affect the performance of the DMA transfer"
You could check the flag, wait a few microseconds and check again, but I prefer to use the interrupt approach as, in theory, it could result in better turn-around times. There are 8 different kinds of interrupts that can be generated, which makes the DMA module very flexible.

Can't we do something to shorten that horrendous ISR declaration? Turns out yes, we can. Somewhere in your code, you can define:
#pragma interrupt DMA0_handler IPL3SRS vector _DMA0_VECTOR
and then later in your code, for the actual ISR function, you can just say:
void DMA0_handler(void)
Which is quite a nice change from that mess up above. All a matter of personal preference, really.

The last thing I want to take a look at today is a really cool ability of the DMA module called "Pattern Matching". This is a way to abort a DMA transfer upon receiving a certain byte / word. This pattern can be either 8 or 16 bits.
This is very useful in reading from SD cards, because before we read a block we have to output 0xFF until the SD card returns the 0xFE token to tell us it's ready to give us the data. You can set up a pattern match like this:

DCH0ECONbits.PATEN = 1;     // Enable abort on pattern match
DCH0CONbits.CHPATLEN = 0;   // 8-bit pattern
DCH0DAT = 0xFE;             // Pattern to match: the 0xFE token
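To see what the pattern match is looking for, here is the same search done in plain software on a PC. This only illustrates the idea. It does not model the DMA hardware, and the function name and sample bytes are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the first byte equal to pattern, or len if it never appears. */
size_t first_match(const uint8_t *data, size_t len, uint8_t pattern)
{
    assert(data != NULL);
    for (size_t i = 0; i < len; i++) {
        if (data[i] == pattern)
            return i;
    }
    return len;
}

void check_first_match(void)
{
    /* Made-up SD card reply: idle 0xFF bytes, then the 0xFE token. */
    const uint8_t reply[] = { 0xFF, 0xFF, 0xFF, 0xFE, 0x12, 0x34 };

    assert(first_match(reply, sizeof reply, 0xFE) == 3);
    assert(first_match(reply, sizeof reply, 0xAA) == sizeof reply);
}
```

With pattern matching enabled, the DMA channel does this comparison itself on every byte it moves, so the CPU never has to loop over the 0xFF bytes.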

Right, that's long enough for one day. Next time I'll write about how you can use two DMA channels to read from an SD card.
