SQI on the PIC32MZ

Refresher on SPI

A while back, I covered using the Serial Peripheral Interface (SPI) and how it worked. If you recall, we had four signals:

  • Chip Select (aka Slave Select) - To choose which slave we are talking to
  • Clock - Generated by the master and provided to all the slaves
  • Master Out Slave In (MOSI) - For the master to send data to the slave, and
  • Master In Slave Out (MISO) - For the slave to send data back to the master

PIC32MZ - Typical SPI connection

The communication was full-duplex, meaning data could be sent and received at the same time. However, the transactions had to be quite carefully controlled. If, for example, we wanted to send the command 0x9F to the slave and get a byte of data back, we'd have to do this:

PIC32MZ - SPI in a jarring GIF

<TL;DR>
As can hopefully be seen in the thrilling animation above, the master starts off sending the byte 0x9F to the slave. At the same time, the slave is also sending data to the master. Until the slave has received the whole 8 bits, it does not know what the master is sending to it. So it has to wait until it receives the 8 bits, formulate a response and then send it out to the master the next time. Even assuming it generates the response instantly, it still has no way to send the data back to the master because only the master can generate clock signals. This is why the master is seen to again be sending 0xFF (a dummy value) to the slave, purely to provide it with the clock signals it needs to send the data back to the master. 0xFF is typically used to avoid confusion with actual commands.
</TL;DR>

If we look at the code, we'd have to do this:

// Send the 0x9F byte
SPI1BUF = 0x9F;
// Wait until the reply has been received
while (SPI1STATbits.SPIRBE);
// Very important: read the reply from the buffer to clear the buffer
reply = SPI1BUF;
// Now we need to read the reply
// Send a dummy byte 0xFF
SPI1BUF = 0xFF;
// Wait until the reply has been received
while (SPI1STATbits.SPIRBE);
// Read the actual reply we want
reply = SPI1BUF;
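
To see why the dummy byte is needed, here is a minimal host-side simulation of one SPI byte exchange. It is only a sketch: it models the master and the slave as two 8-bit shift registers and runs on a PC, not on the PIC32MZ. The names spi_sim_dev and spi_sim_exchange are made up for this illustration.

```c
#include <stdint.h>

/* One simulated SPI device, modelled as nothing more than an 8-bit shift register. */
typedef struct {
    uint8_t shift_reg;
} spi_sim_dev;

/* Clock one full byte between master and slave, most significant bit first.
 * On every clock each side puts its top bit on its output line and shifts in
 * the bit coming from the other side, so both directions move at once. */
static void spi_sim_exchange(spi_sim_dev *master, spi_sim_dev *slave)
{
    for (int clk = 0; clk < 8; clk++) {
        uint8_t mosi = (uint8_t)((master->shift_reg >> 7) & 1u); /* master -> slave */
        uint8_t miso = (uint8_t)((slave->shift_reg >> 7) & 1u);  /* slave -> master */
        master->shift_reg = (uint8_t)((master->shift_reg << 1) | miso);
        slave->shift_reg  = (uint8_t)((slave->shift_reg << 1) | mosi);
    }
}
```

After eight clocks the two registers have simply swapped contents. The slave only knows the command once the first exchange is over, so its real answer can only come back during a second exchange, which is the one the master drives with 0xFF.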

So as you can see, it's fairly involved and there's a lot of hand-holding required. The speed is also not fantastic. We have a maximum speed of 50MHz, though some peripherals only work up to 20MHz or even less. We can also send only one bit at a time, so even at 50MHz, we're only getting a maximum transfer rate of 6.25MB/s. This is where Serial Quad Interface (SQI) comes in.
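
The 6.25MB/s figure is just clock rate times bits per clock, divided by 8. A small sketch of that arithmetic follows. The function name is made up, and the numbers are peak rates that ignore command and address overhead. Whether a given device can actually be clocked at 50MHz in quad mode is an assumption to check against its datasheet.

```c
#include <stdint.h>

/* Peak bytes per second for a serial bus that moves `lanes` bits on every clock.
 * lanes = 1 corresponds to plain SPI, lanes = 4 to quad-lane SQI. */
static uint32_t bus_bytes_per_second(uint32_t clock_hz, uint32_t lanes)
{
    return (uint32_t)(((uint64_t)clock_hz * lanes) / 8u);
}
```

With a 50MHz clock this gives 6,250,000 bytes per second on one lane and 25,000,000 bytes per second on four lanes.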

So what is SQI?

SQI, as the name implies, can transfer up to 4 bits at once. While SPI uses 4 lines, SQI uses 6. They are:

  • Clock - The same as SPI, generated by the master
  • Chip select - Again, the same as SPI
  • D0 - The first data bit
  • D1 - The second data bit
  • D2 - The third data bit
  • D3 - The fourth data bit

<TL;DR>
"Up to"? Yes, SQI can be configured to send either 1 bit, 2 bits or 4 bits at a time. Many external devices that support both SPI and SQI start up in SPI mode and need to be sent a special command in order to switch to SQI mode. The PIC32MZ SQI peripheral fully supports this, thankfully, so if you really wanted you could use the SQI peripheral as a regular SPI peripheral.
</TL;DR>

From this, we can see the major difference between SPI and SQI. With SPI, we had one line for sending data to the slave and one line for receiving data, and they were both in use at the same time. That is to say that communication was full duplex. With SQI, communication can only occur in one direction at a time; that is, it is half duplex.

Let's take a look at a typical SQI connection:

PIC32MZ - Typical SQI connection

OK, not that hard so far. The hard part comes next.

The PIC32MZ's SQI peripheral

One of the most important things to remember with SQI on the PIC32MZ is that the communication can only happen in one direction at a time. This means we need some way of telling the PIC32MZ whether we want to send or receive data. This also means that, unlike the SPI peripheral, we cannot just write data to the SQI buffer and hope it'll get sent because it won't.

It turns out there are six registers we need to concern ourselves with:

  • SQI1THR - SQI Threshold Control Register
  • SQI1INTTHR - SQI Interrupt Threshold Register
  • SQI1CMDTHR - SQI Command Threshold Register
  • SQI1CON - SQI Control Register
  • SQI1TXDATA - SQI Transmit Data Buffer Register
  • SQI1RXDATA - SQI Receive Data Buffer Register

OK, that's a lot of registers, but why? Well, with SQI, Microchip decided: you know what, why don't we make this as fancy as we can? And to their credit, it is very fancy, but it can be very confusing too. Whereas with SPI we had to handle the transactions one by one, that is no longer the case with the SQI peripheral. Everything now works with buffers, meaning I can load up a whole list of transactions and it'll do them one by one. I repeat, writing to any of these registers will add whatever you've written to a buffer (or a queue).

This means that for every transaction with SQI, I need to tell it how many bytes I want to send or receive, how many lanes (1, 2 or 4) I will use and whether or not it's a send or receive transaction and then write the data to SQI1TXDATA (or read it from SQI1RXDATA). It can be confusing, so I'll go into it more in a moment.

If you've been following along with the PIC32MZ datasheet, you may have experienced cases where the datasheet contradicts itself multiple times. The SQI is one such case and why it's taken so long for me to get this code working at an acceptable level! As such, some of these registers remain a bit of a mystery to me but I know they have to be set in order for stuff to work :) Let's take a look at them one by one, very briefly.

SQI1THR

I think this controls how many transactions we can queue up in the SQI Control buffer. I just set it to 0x100 when I initialise the SQI peripheral and never touch it again.

SQI1INTTHR

I think this is used to set how many bytes transferred or received will trigger a transfer or receive interrupt. Again, I just set it to 0x100 when I initialise the SQI peripheral and never touch it again.

SQI1CMDTHR

I think this is used to set how many bytes need to be in the send or receive buffer before the SQI peripheral will start doing anything. I set this before each transaction, and I set it to be the same as the number of bytes I'm about to transmit/receive. The lower 6 bits are receive bytes and bits 8 to 13 are for transmit bytes.
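
A minimal sketch of packing those two fields into one value, assuming the bit positions are exactly as described above (receive threshold in bits 0 to 5, transmit threshold in bits 8 to 13). The helper name sqi_cmdthr_value is made up for illustration.

```c
#include <stdint.h>

/* Build a value for SQI1CMDTHR from a transmit and a receive threshold.
 * Assumed layout (taken from the description in the text):
 *   bits 0-5  : receive threshold in bytes
 *   bits 8-13 : transmit threshold in bytes */
static uint32_t sqi_cmdthr_value(uint32_t tx_bytes, uint32_t rx_bytes)
{
    return ((tx_bytes & 0x3Fu) << 8) | (rx_bytes & 0x3Fu);
}
```

For example, a transmit threshold of 4 bytes gives 0x400 and a receive threshold of 1 byte gives 0x01, which match the raw constants used in the examples later in this post.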

SQI1CON

Finally, one I understand. This is used to tell the SQI peripheral how many bytes we are going to write to SQI1TXDATA (or read from SQI1RXDATA), how many lanes (so 1-lane (like SPI), 2-lane or 4-lane) and what kind of transaction it is, be it send or receive. There are other things it can do but I'm limiting it to this much today. It bears looking at how it's defined in the datasheet:

PIC32MZ - SQI1CON register

Note: There are two Chip Select pins (SQICS0 and SQICS1) so I presume that's where device 0 and 1 come from. I don't use either, I use a different pin as chip select and have only been able to get it to work using device 1. No, I don't know why.

Note 2: TXRXCOUNT is an ambitious 16 bits wide but the actual buffer itself is much smaller at 32 bytes, so this number should never exceed 32.
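
The examples further down write raw hex constants such as 0x00510004 to SQI1CON. The sketch below builds the same values from named fields. The field positions are inferred from those constants and their comments (byte count in bits 0 to 15, command in bits 16 to 17, lane mode in bits 18 to 19, device select in bits 20 to 21, deassert chip select in bit 22), so treat them as an assumption and check them against the register figure. The helper and enum names are made up.

```c
#include <stdint.h>

/* Command field values as used in this post */
enum { SQI_CMD_TRANSMIT = 1, SQI_CMD_RECEIVE = 2 };
/* Lane mode values as used in this post (dual lane is not exercised here) */
enum { SQI_LANES_SINGLE = 0, SQI_LANES_QUAD = 2 };

/* Compose a value for SQI1CON. Bit positions are assumptions inferred from
 * the constants 0x00510004, 0x00520008 and 0x00590001 used in the examples. */
static uint32_t sqi_con_value(int deassert_cs, uint32_t device,
                              uint32_t lane_mode, uint32_t command,
                              uint32_t byte_count)
{
    return ((deassert_cs ? 1u : 0u) << 22) |  /* deassert CS when done   */
           ((device & 0x3u) << 20) |          /* device select           */
           ((lane_mode & 0x3u) << 18) |       /* single or quad lane     */
           ((command & 0x3u) << 16) |         /* transmit or receive     */
           (byte_count & 0xFFFFu);            /* TXRXCOUNT, keep it <= 32 */
}
```

With this, sqi_con_value(1, 1, SQI_LANES_SINGLE, SQI_CMD_TRANSMIT, 4) produces 0x00510004, which reads a little more easily than the bare number.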

SQI1TXDATA

Writing to this register will cause whatever data we write to be pushed into the transmit buffer. Note, however, that it defaults to writing 32 bits of data to the SQI transmit buffer! This means that this:

unsigned char tmp = 128;
SQI1TXDATA = tmp;

Will actually write 4 bytes of data to the transmit buffer (that is, the 32-bit value 0x00000080)! If you want to write only 8 bits, you need to perform this trick:

unsigned char *TXDATA = (unsigned char *)&SQI1TXDATA;   // Address to write to for 8-bit data
*TXDATA = 128;

This is unlike SPI and is something to be wary of!

SQI1RXDATA

Reading from this register will pop whatever is on the top of the receive buffer off. Like with SQI1TXDATA, to access an 8-bit value, you have to do the following:

unsigned char *RXDATA = (unsigned char *)&SQI1RXDATA;   // Address to read from for 8-bit data
unsigned char result;
result = *RXDATA;

OK, so how do we use this in code?

First up, we have to initialise the SQI peripheral. Bear in mind it is connected to Reference Clock 2, so that is something we first have to set up. The initialisation is done like this:

void SQI_init()
{
    CFGCONbits.TROEN = 0; // Disable trace outputs because SQI shares them

    // Set up Reference Clock 2
    if (!REFO2CONbits.ACTIVE)
    {
        REFO2CONbits.RODIV = 1;
        REFO2CONbits.ROSEL = 1;
        REFO2CONbits.ON = 1;
        while (REFO2CONbits.DIVSWEN);
        REFO2CONbits.OE = 1;
    }

    // Turn *off* clock division according to the errata
    SQI1CLKCONbits.CLKDIV = 0;
    SQI1CLKCONbits.EN = 1;
    // Wait until the SQI clock reports it is stable
    while (!SQI1CLKCONbits.STABLE);

    // Tell the SQI peripheral to reset
    SQI1CFGbits.RESET = 1;    
    SQI1CFGbits.CPOL = 0;
    SQI1CFGbits.CPHA = 0;
    // Set the mode to 1, which is PIO mode (where we control it directly)
    SQI1CFGbits.MODE = 1;
    // Enable burst mode, again as datasheet says
    SQI1CFGbits.BURSTEN = 1;
    // Enable the SQI peripheral
    SQI1CFGbits.SQIEN = 1;
    // Enable data lines SQID0, SQID1, SQID2 and SQID3
    SQI1CFGbits.DATAEN = 0b10;

    // Set up buffers to trigger as soon as 1 byte is present
    SQI1THR = 0x100;
    SQI1INTTHR = 0x100;
}

OK, now it's set up, how on earth do we use the thing?

My first example today is very basic. I am communicating with an 8MB PSRAM. The PSRAM starts up in SPI mode. So initially, what I want to do is send it the command 0x9F, which will cause it to send me its EID information. Again, this is what I want to do:

  • Send the 8-bit command 0x9F
  • Send 3 dummy bytes for address (as the PSRAM datasheet says to)
  • Read 8 bytes of data as a response

In code, I'd do this:

void SRAM_get_EID(unsigned int *buf)
{    
    // Pull CS low, select the PSRAM
    SRAM_select(0);

    // Set up for 4 bytes initially
    SQI1THR = 0x100;
    SQI1INTTHR = 0x100;

    // Sending 4 bytes, 0x9F "Get EID" command and 3 empty address bytes
    SQI1CMDTHR = 0x100;
    // Deassert chip select when done, using device 1, using single lane mode, transmit command, 4 bytes
    SQI1CON = 0x00510004;
    // Remember, this next line actually queues 4 bytes (0x0000009F): 0x9F goes out first, then the three 0x00 address bytes!
    SQI1TXDATA = 0x9F;

    // Wait until the transmit buffer is empty i.e. the data has been fully transmitted
    while (SQI1STAT1bits.TXBUFFREE < 32);

    // Trigger on each byte received
    SQI1CMDTHR = 0x01; // Receive threshold of 1 byte
    // Deassert chip select when done, using device 1, using single lane mode, receive command, 8 bytes
    SQI1CON = 0x00520008;

    // Wait until the lower 8 bits of SQI1STAT1 (which are received bytes count) = 8
    while ((SQI1STAT1 & 0xFF) != 0x08);

    // Read the first 4 bytes and store them in the array pointer
    *buf = SQI1RXDATA;
    // Move array pointer to next element
    buf++;
    // Read the final 4 bytes and store them in the array pointer
    *buf = SQI1RXDATA;

    // Set CS high again, deselect the PSRAM
    SRAM_select(1);
}

This same PSRAM then has a command to set it into quad lane mode (command 0x35, but this time with no dummy bytes). I'd do this as follows:

void SRAM_go_SQI()
{
    // We want to write an 8-bit value to the SQI transmit buffer, so set that up
    unsigned char *TXDATA;
    TXDATA = (unsigned char *)&SQI1TXDATA;

    // Pull CS low, select the PSRAM
    SRAM_select(0);

    SQI1THR = 0x100;
    SQI1INTTHR = 0x100;
    // Trigger on 4 bytes, though this works just fine
    SQI1CMDTHR = 0x100;
    // Deassert chip select when done, using device 1, using single lane mode, transmit command, 1 byte
    SQI1CON = 0x00510001;

    // Write the 8-bit value 0x35 to the transmit buffer
    *TXDATA = 0x35;

    // Wait until the transmit buffer is empty i.e. the data has been fully transmitted
    while (SQI1STAT1bits.TXBUFFREE < 32);

    // Set CS high again, deselect the PSRAM
    SRAM_select(1);
}

Now that we're in quad lane mode, let's try writing some data to that PSRAM. First, we need to tell the PSRAM we are going to be writing to it. The quad-lane write command is 0x38 and it needs to be followed by a 3-byte address. We'll do it like this:

void SRAM_start_write_quad(int address)
{
    // We're about to have some endian fun
    unsigned int data;
    unsigned int endian[3];

    // Pull CS low, select the PSRAM
    SRAM_select(0);

    // Trigger on 4 bytes
    SQI1CMDTHR = 0x00000400;
    // Deassert chip select when done, using device 1, using quad lane mode now, transmit command, 4 bytes
    SQI1CON = 0x00590004;

    // Now, if our address is 0x123456, we are going to have to switch that around to 0x563412 because the PIC32 is a little-endian device. Yay.  
    endian[0] = address >> 16;
    endian[1] = (address & 0x00FF00) >> 8;
    endian[2] = (address & 0xFF);

    address = (endian[2] << 16) | (endian[1] << 8) |(endian[0]);
    data = (address << 8) | 0x38;

    // The actual order of bytes *sent* will be 0x38, address[16:23], address[8:15], address[0:7].
    // I've done it in a bit of a round-about way to hopefully make it clearer.

    // Send the 4 bytes of data to the transfer queue
    SQI1TXDATA = data;

    // Wait until the transmit buffer is empty i.e. the data has been fully transmitted
    while (SQI1STAT1bits.TXBUFFREE < 32);

    // NOTE: HERE I DO NOT SET THE CHIP SELECT LINE HIGH. That would indicate to the PSRAM chip that the transaction was over!
}
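
The byte shuffling above can be wrapped in a small helper. This is only a sketch: sqi_pack_cmd_addr and sqi_wire_byte are made-up names, and it relies on the same assumption as the code above, namely that the lowest byte of the 32-bit word written to SQI1TXDATA is the first one sent.

```c
#include <stdint.h>

/* Pack a command byte and a 24-bit address into one 32-bit word so that,
 * with the low byte sent first, the wire order is:
 * command, address[23:16], address[15:8], address[7:0]. */
static uint32_t sqi_pack_cmd_addr(uint8_t command, uint32_t address)
{
    uint32_t addr_hi  = (address >> 16) & 0xFFu;
    uint32_t addr_mid = (address >> 8) & 0xFFu;
    uint32_t addr_lo  = address & 0xFFu;

    return (addr_lo << 24) | (addr_mid << 16) | (addr_hi << 8) | command;
}

/* Byte number n (0 = first on the wire) of a packed word, low byte first. */
static uint8_t sqi_wire_byte(uint32_t word, unsigned n)
{
    return (uint8_t)(word >> (8u * n));
}
```

For command 0x38 and address 0x123456 the packed word is 0x56341238, and the bytes leave in the order 0x38, 0x12, 0x34, 0x56.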

OK, so I've told it to get ready for data, now let's write that data in super fast quad lane mode, one byte at a time:

    for (cnt = 0; cnt < num_bytes; cnt++)
    {
        // Set SQI TX command threshold to 1 byte in a slightly different way why not
        SQI1CMDTHRbits.TXCMDTHR = 1; 
        // Deassert chip select when done, using device 1, using quad lane mode now, transmit command, 1 byte
        SQI1CON = 0x00590001;

        // Send an 8-bit value to the buffer
        *TXDATA = buffer[cnt];

        // Wait until the transmit buffer is empty i.e. the data has been fully transmitted
        while (SQI1STAT1bits.TXBUFFREE < 32);
    }

    // Now I can set Chip Select high again and thus deselect the PSRAM because I'm done writing
    SRAM_select(1);

Would it be faster if I didn't send one byte at a time and then nanny over the transmit buffer? Surely it would, yes. However, as this is an intro to getting SQI to work let's keep it as safe as we can for now :)
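
For the curious, here is a rough sketch of how the data could be split into larger bursts instead of single bytes. It only shows the chunking logic and runs on a PC. The 32-byte limit comes from the buffer size mentioned earlier, and send_chunk is a hypothetical stand-in for the real "set SQI1CON, fill SQI1TXDATA, wait" sequence, which has not been tested with bursts here.

```c
#include <stddef.h>
#include <stdint.h>

#define SQI_MAX_CHUNK 32u /* transmit buffer size quoted earlier in this post */

/* Hypothetical hook that would push one burst out through the SQI peripheral. */
typedef void (*sqi_send_chunk_fn)(const uint8_t *data, size_t len, void *ctx);

/* Split num_bytes into bursts of at most SQI_MAX_CHUNK bytes.
 * Returns the number of bursts that were handed to send_chunk. */
static size_t sqi_write_chunked(const uint8_t *data, size_t num_bytes,
                                sqi_send_chunk_fn send_chunk, void *ctx)
{
    size_t chunks = 0;

    while (num_bytes > 0) {
        size_t len = (num_bytes < SQI_MAX_CHUNK) ? num_bytes : SQI_MAX_CHUNK;
        send_chunk(data, len, ctx);
        data += len;
        num_bytes -= len;
        chunks++;
    }
    return chunks;
}

/* Stand-in for the hardware: just records the length of each burst. */
typedef struct {
    size_t lengths[8];
    size_t count;
} sqi_chunk_log;

static void sqi_log_chunk(const uint8_t *data, size_t len, void *ctx)
{
    sqi_chunk_log *log = ctx;
    (void)data;
    if (log->count < 8) {
        log->lengths[log->count++] = len;
    }
}
```

Writing 70 bytes this way results in three bursts of 32, 32 and 6 bytes rather than 70 separate transactions.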

Phew, and now we've written data to the PSRAM. In the example code below I've included code for reading and writing in both single and quad-lane mode.

Please bear in mind that on my development board the PSRAM's Chip Select is connected to Port RJ1.

Here's the code. Good luck!

Tags: code, SQI

My PIC32MZ Dev Board

Disclaimer

The files / images I'm sharing today are for my own personal development board, based on the PIC32MZ2048EFH144-I/PL. Do not use this board in any project that requires super precision or in life saving equipment type of projects. I cannot and will not be responsible if you make this and somehow manage to burn your house / neighbourhood / country down. I'm uploading them in the hopes that someone starting out can learn how to make their own PIC32MZ development board, hopefully better than my own.

Background on why I started making my own dev boards

When I first started with the PIC32, the dev board options weren't awesome. I was looking at using the PIC32MX in a DIP package, so it could be bread-boarded. I found myself wanting a more permanent solution and decided to look into what development boards were available. Microchip and other companies' dev boards are fine and all, some of them aren't even too expensive. However, they're often either designed with a very specific purpose in mind or designed to attach to other dev boards of theirs and the cost very quickly gets out of control. I just wanted something I could plug Dupont cables into and I couldn't find what I was looking for, so I decided to make my own.

I pretty soon got into making my own PCBs at home using a laser printer, iron-on paper, an iron and some etchant. For years I made my own boards and they were fine. When it came to the PIC32MZ I was able to make my own PCB for the 144-pin version but I quickly began to realise the limitations of making single-sided PCBs. You can make double-sided PCBs at home but you have to be very careful to line up both sides correctly, drill the vias, solder the via pins in, etc. and it turns into a lot of work very quickly. I also got tired of breathing in fibreglass when drilling all the holes for the headers. It may not seem like a lot but 200 holes done repeatedly gets a bit much. The etchant had already eaten holes in a good few pairs of pants too and I finally decided enough was enough and started looking online.

I first used Seeedstudio's excellent Fusion PCB service and found the quality to be great. I also appreciate that the different PCB colours don't cost any extra money. An alternative to them is JLCPCB. Their service is slightly cheaper and has faster and more reliable turn-around times but you have to pay extra for any PCB colour except green. A huge advantage with JLCPCB is that you can order components at their sister site LCSC and use combined shipping to save on those painful DHL shipping costs, which for me come to about $16.

Down-sides of making your own dev board

First of all, all the Harmony examples are set up to use their own dev boards, so whenever I want to use an example I have to modify code for LEDs and buttons. That's not too much bother really. The big problem, however, is that

BOARDS DESIGNED BY NOOBS LIKE ME BREAK TONS OF DESIGN RULES

While Seeedstudio and JLCPCB's prices are both good, you can get really cheap prices if you fit the board into 100mm x 100mm, double-layer. So my dev boards represent an effort to cram as much stuff as I can into that size limit while still having a working board. As such, there are too many vias, fast signal tracks are too long and routed through vias, which they shouldn't be, and the power and ground planes are probably more of a mess than they need be, despite multiple efforts to clean them up. Now, that out of the way, my boards work fine. The USB is as fast as it should be, the ESP32 works, the SD card can be read at a very decent speed, everything works. If you can get over the worry of having an engineer looking at your board in disgust, then you too can make your own PIC32MZ dev board.

So why? Well, this is my hobby, I enjoy it. I use my dev boards to get modules, motors, LCDs and all sorts of things to work before designing specific boards for separate projects. It's a kitchen sink. A very clogged up kitchen sink. And today I'm going to share all the files for it with the Internet. If anyone even reads this, I'm sure they'll leave some delightful comments but eh, I'm uploading them all the same.

Overview of this dev board

First, this is what it looks like when assembled by a noob (me). Top:

PIC32MZ - Scorpio Dev Board - Top view

Yes, the erroneous "BUTTON 2" text has been removed in the uploaded Gerber files.

Bottom:

PIC32MZ - Scorpio Dev Board - Bottom view

Yes, the scorpion motif was cheesy as heck and has been removed (also, it was downloaded from a royalty free clipart site I can no longer find the link for).

This dev board, being based on a kitchen sink design philosophy, has a lot going on with it. Most of the extras can be left out entirely without affecting the PIC32 at all. I will mark these extras with a *. The list:

  • SD card attached in SPI mode to SPI channel 2 (*)
  • CS4344 audio DAC attached in I2S mode to SPI3 (*)
  • 8MB VTI7064 PSRAM attached via SQI (*)
  • 128MB W25N01GV flash memory attached to SPI5 (*)
  • Parallel Master Port (PMP) driven 16-bit TFT LCD connector for SSD1934 displays with capacitive touch (*)
  • HD44780 compatible text LCD port connected to the PMP (*)
  • FT232RL connected to UART4 to allow communications with PC (*)
  • USB host connector (*)
  • Stereo PWM audio output connector with single stage RC filter designed to work at 44.1kHz (*)
  • ESP32-WROOM-32 module connected to UART2 and SPI1, with connections to allow PIC32 and ESP32 to wake each other (*)
  • Power via micro USB port in either debug mode (with FT232RL) or device mode (two separate ports)

So basically, a lot of stuff, some of which is a hassle to solder by hand and none of which is necessary except for the USB port which provides power to the PIC32 chip. If you don't even want that, you could also power it directly via the ICSP connector using a PICKit or other programmer but bear in mind that needs to be 3.3V.

The ESP32 has been added very recently and in rather a slap-dash fashion. It is supplied by its own 3.3V regulator and can be entirely disabled by removing the jumper near it labelled "ESP32".

I have tried to use 1206 sized SMD components to make it easier to hand solder but there are one or two places where I ran out of space (/willpower) and so used 0603.

Bill of Materials (BOM) and where to buy the components

I have put together a list of components for use when soldering and a Bill of Materials with links showing where to buy the components.

Here are the Gerber files for this project.

Here are the Eagle files for this project.

Most, but not all, of my example code on this site was made with these ports in mind.

Tags: PCB, herebedragons, horror

Updated SPI SD DMA code and DMA Pattern Matching

Updated SPI SD DMA code

The code is now more stable and cleaner, so I'm uploading it again here. Here is a list of changes:

  • Fixed bugs relating to SPI buffers overflowing, causing the program to stop working at different BRG settings.
  • No longer need to set SCK as an input.
  • Now works properly at multiple values of SPIBRG, so you can run at whatever SPI, CPU or System frequency you like.
  • Configuration settings have been moved to mmcpic32.h and diskio.h and diskio.c have been removed. You need to #include "mmcpic32.h" in your main program now.
  • Configuration made possible by changing a few lines of code in mmcpic32.h. Thanks again to Bryn Thomas and Ivo Colleoni for their help with this.
  • Added a callback function that will be called, if set, multiple times during an SPI DMA read.

New configuration settings

Upon opening mmcpic32.h, you will see this:

// ***************************************************
// ** CHANGE THE BELOW SETTINGS TO MATCH YOUR BOARD **
// ***************************************************
// SD card port and pin settings
#define CS_PORT H                   // Port on which CS is to be found, A - H
#define CS_PIN 12                   // Pin number of CS, 0 - 15
#define SDO_PORT B                  // Port on which SDO/MOSI is to be found, A - H
#define SDO_PIN 5                   // Pin number of SDO/MOSI, 0 - 15

// SPI channel and DMA channel configuration
#define SPI_CHANNEL 2               // Channel number to use for SD card
#define DMA_RX_CHANNEL 0            // DMA channel number to use for Receiving data, 0 - 7
#define DMA_TX_CHANNEL 1            // DMA channel number to use for Transmitting data, 0 - 7
#define DMA_RX_CHANNEL_PRIORITY 3   // Priority of DMA receiving channel, 0 - 3
#define DMA_TX_CHANNEL_PRIORITY 2   // Priority of DMA transmitting channel, 0 - 3
#define DMA_RX_INT_PRIORITY 4       // Priority of DMA receive complete interrupt, 0 - 7
#define DMA_RX_INT_SUBPRIORITY 1    // Sub-priority of DMA receive complete interrupt, 0 - 3

I've tried to make it easy to see. The settings currently there are for my board, with the SD card's Chip Select on port H12 and the MOSI / SDO pin on RB5. Please change these to be correct for your board or nothing will work. Right under that is the only other setting you may have to change, the SPI channel number. Change this to whatever your SPI channel SD card is connected to. The rest of the settings can be left as is or changed as desired. The allowed ranges are shown in the comments for each line.
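
Settings like these are usually turned into real register names with the preprocessor's token pasting operator. The following is a minimal sketch of that technique, not necessarily how mmcpic32.h does it. All the helper macro names (CAT2, CAT3, STR, CS_LAT_REG and so on) are made up for illustration.

```c
/* Settings in the same style as the header shown above */
#define CS_PORT H
#define CS_PIN 12
#define SPI_CHANNEL 2

/* Two-level macros so the arguments are expanded before they are pasted */
#define CAT2_(a, b) a##b
#define CAT2(a, b) CAT2_(a, b)
#define CAT3_(a, b, c) a##b##c
#define CAT3(a, b, c) CAT3_(a, b, c)

/* Turn an expanded macro into a string, handy for checking the result */
#define STR_(x) #x
#define STR(x) STR_(x)

#define CS_LAT_REG  CAT2(LAT, CS_PORT)           /* expands to LATH    */
#define CS_LAT_SET  CAT3(LAT, CS_PORT, SET)      /* expands to LATHSET */
#define CS_LAT_CLR  CAT3(LAT, CS_PORT, CLR)      /* expands to LATHCLR */
#define SPI_BUF_REG CAT3(SPI, SPI_CHANNEL, BUF)  /* expands to SPI2BUF */
#define CS_MASK     (1u << CS_PIN)               /* bit mask for pin 12 */
```

On the PIC32 a line such as CS_LAT_CLR = CS_MASK; would then pull the chip select low without the port letter appearing anywhere else in the code. On a PC, STR(CS_LAT_REG) can be printed to confirm that the expansion is "LATH".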

Callback function

The callback function will be called multiple times during a call to f_read(). It can be used for things like checking keys, starting other transfers, updating LCDs, whatever you want really. Do note that if you take too long in the callback function the DMA transfer's performance will either suffer or, in extreme cases, stop working (hasn't happened yet but who knows). So it is recommended you do something fairly short during the callback function.

The example callback function is also declared in mmcpic32.h, like this:

void (*DMA_CALLBACK)(int stage, int args);

stage refers to the stage of the DMA read it is in, which can be DMA_STAGE_WAIT_TOKEN (waiting for the 0xFE token) or DMA_STAGE_WAIT_READ (reading 512-byte sector).
args can be one of two things. In DMA_STAGE_WAIT_TOKEN it is how many bytes were read before 0xFE was found. In DMA_STAGE_WAIT_READ it is how many bytes were read (always 512 in this program).

You can change this callback function to whatever you like, I've just given an example of how it could be used.

To set your own callback function, create a function in your main program file, for example:

void my_callback(int stage, int args)
{
}

Then, call the set_callback() function like this:

set_callback(my_callback);

Done!
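
For anyone unfamiliar with function pointers, here is a minimal, PC-runnable sketch of the general pattern behind a callback like this. It is not the code from mmcpic32.h. The stage values, the dma_callback variable and the notify_callback helper are placeholders made up for illustration, and only set_callback and the callback signature are taken from the text above.

```c
#include <stddef.h>

/* Placeholder stage values; the real ones are defined in mmcpic32.h */
enum { DMA_STAGE_WAIT_TOKEN = 0, DMA_STAGE_WAIT_READ = 1 };

typedef void (*dma_callback_fn)(int stage, int args);

/* The currently registered callback, or NULL if none has been set */
static dma_callback_fn dma_callback = NULL;

static void set_callback(dma_callback_fn fn)
{
    dma_callback = fn;
}

/* What a driver might do at each point where it wants to notify the application */
static void notify_callback(int stage, int args)
{
    if (dma_callback != NULL) {
        dma_callback(stage, args);
    }
}

/* Example application callback: count completed 512-byte sector reads.
 * Keep the work in here short so the transfer is not held up. */
static int sectors_read = 0;

static void my_callback(int stage, int args)
{
    if (stage == DMA_STAGE_WAIT_READ && args == 512) {
        sectors_read++;
    }
}
```

The NULL check is what makes the callback optional: until set_callback(my_callback) has been called, notify_callback does nothing.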

A word on DMA Pattern Matching mode in this program

As I've discussed previously, a multi-block read from an SD card looks like this:

  • Send the command for multiple block read (CMD18)
  • Send the starting sector number
  • Send 0xFF until the 0xFE token is returned
  • Send 0xFF and read the reply 512 times to read a sector
  • If you wish to read more sectors, go back to step 3 and repeat until done
  • Send the command to stop transmission (CMD12)

In this program, the DMA read starts at step 3, waiting for 0xFE. At this stage, we have been sending 0xFF until 0xFE was returned. The DMA Transfer channel is now aborted when a Pattern Match for 0xFE is found on channel 0, resulting in less data being left in the SPI buffer. I have added code to handle bytes left in the SPI buffer and I strongly suggest you do not remove this code, even if it seems that no bytes are remaining. At lower settings for SPIBRG there can sometimes be one or two bytes left over each time and that can quickly lead to an SPI buffer overflow if not handled correctly.

Please note: The standard for SPI mode on SD cards specifies up to 25MHz for transfers. I am using 50MHz and it works fine. However, if you want to use this code in something that requires reliability, please set your SPIBRG to 1 to halve the speed to 25MHz!
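
The relationship between SPIBRG and the SPI clock is F_SCK = F_PBCLK / (2 x (SPIBRG + 1)). The sketch below works through it, assuming a 100MHz peripheral bus clock, which is what the 50MHz and 25MHz figures above imply. The function names and the 13-bit limit on SPIBRG are assumptions to check against the datasheet.

```c
#include <stdint.h>

#define SPI_BRG_MAX 8191u /* assumed 13-bit SPIxBRG register; check your datasheet */

/* SPI clock in Hz for a given peripheral bus clock and SPIxBRG value */
static uint32_t spi_sck_hz(uint32_t pbclk_hz, uint32_t brg)
{
    return pbclk_hz / (2u * (brg + 1u));
}

/* Smallest SPIxBRG value that keeps the SPI clock at or below max_sck_hz */
static uint32_t spi_brg_for_max(uint32_t pbclk_hz, uint32_t max_sck_hz)
{
    uint32_t brg = 0;

    while (brg < SPI_BRG_MAX && spi_sck_hz(pbclk_hz, brg) > max_sck_hz) {
        brg++;
    }
    return brg;
}
```

With a 100MHz bus clock, SPIBRG = 0 gives 50MHz and SPIBRG = 1 gives 25MHz, and spi_brg_for_max(100000000, 25000000) returns 1.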

As always, here's the code. If there are any issues with it, please do let me know.

Tags: code, DMA, SD, SPI

DMA SD card reads on the PIC32MZ

Why read from SD card in DMA mode at all?

Warning: This post is going to be long because it's a complex topic and I've included lots of code in it.

Turns out the "next time" from last post was today, the same day. An entire day spent on writing about DMA and airing my ignorance online. Yay!

For the last several weeks / months / eternities I've been working on getting the DMA module to work with the SPI peripheral so that I can read from the SD card using DMA. My initial motivation for doing this was that when I had a few (i.e. too many) ISRs in my main code the SPI module would sometimes seem to get confused at all these interruptions and just stop working, crashing my program. However, DMA has also resulted in a nice large speed boost to SD reading, which is very useful. I've been working on this for ages in my spare time and I still don't understand all of it but today I'm going to go over my code and my findings. It works in my MP3 player and in large block transfers but I can't get over the feeling of mistrust I have for it so YMMV.

Reading from an SD card using DMA

Before we even get to using DMA, let's remind ourselves how the SD card works in SPI mode. For block reads, there is single block read mode and multi block read mode. Single block read mode needs to send the command to the SD card each time it wants to read a block. Multi block read sends a command once and then reads however many blocks it wants. Naturally, multi block read results in much faster transfer times than a single block read does. Don't forget that before the following flow chart, the SD card needs to be told to set up the multi block transfer first. This code can be found in mmcpic32_dma.c in the disk_read() function. Once we've sent this command (and sector number and dummy CRC etc), the actual reading of the data works like this (click to enlarge):

PIC32MZ - SPI SD multi block read flowchart

(Shout-out to the website https://www.draw.io for providing a way to make flowcharts easily online, though with my drawing skills maybe they don't want people to know I used their site :))

So as you can (hopefully) see there are three phases to each 512-byte sector read:

  • Send out 0xFF via SPI until the SD card replies with 0xFE
  • Send out 0xFF and receive a data byte for each of the 512 bytes in the sector
  • Send out 2 x 0xFF and receive the CRC to finish the sector read

After that, the process repeats until you have read as many sectors as are required. So how can we adapt this to make use of DMA? Well, let's think about what we need to do:

  • Send 0xFF to the SD card over SPI
  • Receive the data the SD sends us over SPI

In both of these cases, the PIC32MZ is the master and provides the clock signal to the SD card. This means the SD card cannot do anything unless we send it some data first. As you can see, there are two types of transactions here, the sending of the data and the receiving of the data. Assuming I'm using SPI Channel 2, without using DMA we would do this to send 0xFF to receive a byte of information from the SD card:

    SPI2BUF = 0xFF;
    while (SPI2STATbits.SPIRBE);
    data = SPI2BUF;
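
Putting the three phases of a sector read together without DMA looks roughly like the sketch below. It is written against a generic "exchange one byte" function so that it can be run on a PC with a fake card. On the PIC32MZ that function would be the three SPI2BUF lines above. All names here (sd_read_sector, spi_exchange_fn, fake_card) are made up, and the timeout is an addition for safety rather than something from the original driver.

```c
#include <stddef.h>
#include <stdint.h>

#define SD_SECTOR_SIZE 512u
#define SD_DATA_TOKEN  0xFEu

/* Sends one byte and returns the byte received at the same time */
typedef uint8_t (*spi_exchange_fn)(uint8_t out, void *ctx);

/* Read one 512-byte sector. Returns 0 on success, or -1 if the 0xFE token
 * did not arrive within max_wait exchanges. */
static int sd_read_sector(spi_exchange_fn exchange, void *ctx,
                          uint8_t *dest, uint8_t crc[2], unsigned max_wait)
{
    unsigned waited = 0;

    /* Phase 1: send 0xFF until the card answers with the 0xFE token */
    while (exchange(0xFF, ctx) != SD_DATA_TOKEN) {
        if (++waited >= max_wait) {
            return -1;
        }
    }

    /* Phase 2: send 0xFF and store the reply, once per data byte */
    for (size_t i = 0; i < SD_SECTOR_SIZE; i++) {
        dest[i] = exchange(0xFF, ctx);
    }

    /* Phase 3: two more 0xFF bytes to clock out the CRC */
    crc[0] = exchange(0xFF, ctx);
    crc[1] = exchange(0xFF, ctx);
    return 0;
}

/* Fake card for testing: replays a prepared byte stream, then idles at 0xFF */
typedef struct {
    const uint8_t *stream;
    size_t len;
    size_t pos;
} fake_card;

static uint8_t fake_card_exchange(uint8_t out, void *ctx)
{
    fake_card *card = ctx;
    (void)out; /* the card ignores the dummy bytes */
    return (card->pos < card->len) ? card->stream[card->pos++] : 0xFF;
}
```

Every received byte costs one sent byte, which is exactly the pairing the two DMA channels take over in the rest of this post.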

The overall flow of the program

OK, so one sending transaction and one receiving transaction means we will need to use two DMA channels. As I discussed last time, DMA channels need to be triggered by an Interrupt Request (IRQ), so what shall we choose? Again, let's think about the flow of this program:

  • Send 0xFF to SD card
  • Receive data in response

The SPI peripheral has both a transfer done (TX) and receive done (RX) IRQ that it generates, so this is perfect. I'm going to choose my two DMA channels as follows:

  • DMA Channel 0 is in charge of receiving data from the SPI buffer
  • DMA Channel 1 is in charge of sending data to the SPI buffer

This means that DMA Channel 0's start IRQ (SIRQ) will be SPI Channel 2's Receive Done IRQ (_SPI2_RX_VECTOR) and the source of DMA Channel 1's SIRQ will be SPI Channel 2's Transfer Done IRQ (_SPI2_TX_VECTOR). The _SPI2_RX_VECTOR is triggered whenever the SPI2 channel has finished receiving a byte of data, and the _SPI2_TX_VECTOR triggers whenever the SPI2 channel has finished sending a byte of data.
This additionally means that the source address for DMA Channel 0 is the SPI Buffer SPI2BUF, because we are reading from that and the destination address of DMA Channel 1 is SPI2BUF because we are sending to it.
We are going to send 1 byte at a time (so cell size is 1), because we are using SPI in 8-bit mode. Perhaps we could get even more speed gains in 32-bit mode but we're fast enough for the moment.
We want to generate an interrupt when the transfer is done and I've chosen to use Interrupt Priority 4, Sub-priority 1.
Finally, I've chosen to abort DMA transfers whenever there's an error on SPI channel 2.

Before we move on, let's clarify what the heck we've been talking about and see how this is going to operate (click to enlarge):

PIC32MZ - SPI SD DMA Flow

As should hopefully be clear thanks to that fantastic image, DMA Channel 0 and DMA Channel 1 are not talking to each other at all. DMA Channel 1 sends as many bytes as we tell it to to SPI Buffer 2 (SPI2BUF) until it's done and DMA Channel 0 receives as many bytes as we tell it to until it's done. This can be tricky to understand, so it bears further thought. When I send 0xFF to the SD card, what does it send in response? Due to the nature of SPI, it is sending the response to the last byte I sent it. Referring to the flow-chart above, when I'm waiting for the 0xFE token, I'm actually doing this:

    while (token != 0xFE)
    {
        SPI2BUF = 0xFF;
        while (SPI2STATbits.SPIRBE);
        token = SPI2BUF;
    }

When that's done, the SD card has already internally queued up the first byte of data to send to me; it just has no way of sending it yet. The next time I send it a 0xFF, it will send me that queued-up byte at the same time as I send it the 0xFF. What this means for my DMA approach is that the reply I get while sending a 0xFF is actually the answer to the previous 0xFF. And because I've just sent it another 0xFF, it will have another byte of data prepared and waiting, which it can only send when I send it yet another 0xFF.
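If the one-byte lag is hard to picture, here is a tiny PC-side model of it. This is only a sketch: fake_slave_t, fake_slave_init() and spi_exchange() are made-up names, and the "reply" rule (the slave answers each byte with its bitwise inverse) is invented purely so the lag is easy to see. It is not how an SD card computes its replies.

```c
#include <stdint.h>

/* Hypothetical slave model: it can only shift out the byte it queued
   in response to the PREVIOUS byte it received. */
typedef struct {
    uint8_t queued; /* byte that will go out during the next exchange */
} fake_slave_t;

static void fake_slave_init(fake_slave_t *s)
{
    s->queued = 0xFF; /* nothing prepared yet, so the line idles high */
}

/* One full-duplex exchange: the master shifts out tx while the slave
   shifts out whatever it had queued from last time. */
static uint8_t spi_exchange(fake_slave_t *s, uint8_t tx)
{
    uint8_t rx = s->queued;  /* answer to the previous byte */
    s->queued = (uint8_t)~tx; /* invented rule: reply is the inverse of tx */
    return rx;
}
```

Sending 0x9F first gives back the idle 0xFF, and the answer to 0x9F only shows up during the next exchange, which is why the dummy 0xFF bytes are needed.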

Secondly, looking at my fantastic picture of DMA data flow right above this, it becomes clear that we are going to make use of the SPI 2 Buffer. This means Enhanced Buffer Mode must be enabled for this to work at all. OK, theory out of the way, for now.
The happy news is that none of the above information ever changes, so we can set that all up once at the beginning of the program and never have to set it up again. I have done this in a function called SPI_DMA_init(), here's the code for it:

void SPI_DMA_init(void)
{
    DCH0SSA = virt_to_phys((void*)&SPI2BUF); // Source address
    DCH0ECONbits.CHSIRQ = _SPI2_RX_VECTOR;   // Trigger cell transfer event on SPI2 Receive IRQ
    DCH0ECONbits.CHAIRQ = _SPI2_FAULT_VECTOR;// Abort on SPI 2 error

    DCH0ECONbits.SIRQEN = 1;                 // Enable cell transfer event on IRQ
    DCH0ECONbits.AIRQEN = 1;                 // Abort the transfer when the CHAIRQ interrupt occurs
    DCH0CONCLR = 1 << 4;                     // CHAEN = 0, turn off channel automatic enable
    DCH0CONSET = 3 << 16;                    // CHPRI = 3, set channel priority to 3

    DCH0SSIZ = 1;                            // Source size is 1 byte
    DCH0CSIZ = 1;                            // Transfer 1 byte at a time

    DCH1DSA = virt_to_phys((void*)&SPI2BUF); // Destination address
    DCH1ECONbits.CHSIRQ = _SPI2_TX_VECTOR;   // Trigger cell transfer event on SPI2 Transmit IRQ    
    DCH1ECONbits.CHAIRQ = _SPI2_FAULT_VECTOR;// Abort on SPI 2 error

    DCH1ECONbits.SIRQEN = 1;                 // Enable cell transfer event on IRQ
    DCH1ECONbits.AIRQEN = 1;                 // Abort the transfer when the CHAIRQ interrupt occurs
    DCH1CONCLR = 1 << 4;                     // CHAEN = 0, turn off channel automatic enable
    DCH1CONSET = 2 << 16;                    // CHPRI = 2, set channel priority to 2

    DCH1CSIZ = 1;                            // Cell size
    DCH1DSIZ = 1;                            // Destination size

    IPC33CLR = 0b11111 << 16;                // Clear DMA1IP and DMA1IS bits
    IPC33SET = 0b10001 << 16;                // Interrupt Priority 4, Interrupt Sub-priority 1

    DMACONSET = 0x8000;                      // Enable DMA module if it hasn't been
}
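The virt_to_phys() helper used above isn't shown in this post. As a rough sketch of what such a helper usually boils down to on a PIC32: the DMA controller wants physical addresses, and for KSEG0/KSEG1 addresses that just means clearing the top three bits (XC32 ships a KVA_TO_PA macro in sys/kmem.h that does the same masking). The name virt_to_phys_sketch() and the uint32_t argument are my own choices so the example also compiles on a PC; the real helper takes a pointer.

```c
#include <stdint.h>

/* Sketch only: convert a KSEG0 (0x8xxxxxxx) or KSEG1 (0xAxxxxxxx /
   0xBxxxxxxx) virtual address into the physical address the DMA
   module expects, by masking off the top three bits. */
static uint32_t virt_to_phys_sketch(uint32_t virt_addr)
{
    return virt_addr & 0x1FFFFFFFu;
}
```

So a peripheral register living at virtual 0xBF800000 would be handed to the DMA module as 0x1F800000.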

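You may be wondering why the source size for channel 0 and the destination size for channel 1 are set to 1. Both of those point at SPI2BUF, which we read or write one byte at a time. The other side of each channel (the memory buffer) gets set to the size of the actual transfer later, and the DMA module uses whichever of the two sizes is bigger as the block size anyway, so the SPI2BUF side never needs to change and we can save repeating those two lines of code. As you'll see later, there's plenty more code to come.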
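To make the "bigger one wins" rule concrete, here is a PC-side model of a single block transfer with a cell size of 1. It is a sketch, not register code: dma_block_model() is a made-up name, and unlike the real SPI2BUF the source here is ordinary memory whose contents don't change between reads.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of one DMA block transfer, one byte per cell.
   The block ends after max(src_size, dst_size) bytes; whichever
   pointer reaches its own size first wraps back to zero. */
static size_t dma_block_model(const uint8_t *src, size_t src_size,
                              uint8_t *dst, size_t dst_size)
{
    if (src_size == 0 || dst_size == 0)
        return 0; /* nothing sensible to model */

    size_t block = (src_size > dst_size) ? src_size : dst_size;
    size_t sptr = 0, dptr = 0;

    for (size_t moved = 0; moved < block; moved++) {
        dst[dptr] = src[sptr];
        if (++sptr == src_size) sptr = 0; /* source pointer wraps */
        if (++dptr == dst_size) dptr = 0; /* destination pointer wraps */
    }
    return block; /* number of bytes moved in this block */
}
```

With a source size of 1 and a destination size of 512, the model moves 512 bytes, rereading the same source location every time, which is exactly the shape of the SPI2BUF-to-buffer transfer on channel 0.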

Enough chat, plz give me teh codes tx

For the sake of a simpler explanation, I have made my multi block read function into a mini state machine. As a result, the code is fairly long and not as optimised as it could be but still yields good results. It looks like this:

static int rcvr_datablock_multiple(BYTE *buff, INT btr)
{
    unsigned char READ_DONE = 0;
    unsigned char READ_ERROR = 0;
    int DMA_sectors_left;
    char DMA_stage;
    int DMA_read_size;
    int DMA_bytes_left;
    BYTE crc[2];

    // Divide by 512 to get the number of 512-byte sectors to read
    DMA_sectors_left = btr >> 9;

    // Initialise the state machine
    READ_DONE = 0;
    DMA_BUSY = 0;
    DMA_stage = 0;
    READ_ERROR = 0;

    if (btr >= 512)
        DMA_read_size = 512; // Reading 512 bytes (one sector) at a time
    else
        DMA_read_size = btr; // Reading less than 512 bytes at a time

    // How many bytes do we need to read each time?
    DMA_bytes_left = btr;

    // Disable the SDO pin
    SPICONbits.DISSDO = 1;

    // Set the SDO pin to 1 so it will always output 0b11111111 (0xFF)
    SDO_PIN = 1;

    // Start waiting for the 0xFE token that precedes each sector read
    SPI_DMA_wait_token(buff, MAX_TOKEN_WAIT_BYTES);

    while (!READ_DONE)
    {
        while (DMA_BUSY)
        {
            // Could place a callback routine in here to do something while DMA is busy but haven't gotten around to that yet
        };

        switch (DMA_stage)
        {
            case 0: // Finished waiting for 0xFE token
            {
                if (DCH1SPTR > MAX_TOKEN_WAIT_BYTES) 
                {
                    // 0xFE was not found in [MAX_TOKEN_WAIT_BYTES] bytes, give up
                    READ_DONE = 1;
                    READ_ERROR = 1;
                }
                else
                {
                    if (DMA_bytes_left > DMA_read_size)
                    {
                        SPI_DMA_read(buff, DMA_read_size);
                        DMA_bytes_left -= DMA_read_size;
                    }
                    else
                    {
                        SPI_DMA_read(buff, DMA_bytes_left);
                        DMA_bytes_left = 0;
                    }

                    DMA_stage = 1;
                }
                break;
            }
            case 1: // Finished reading data
            {
                buff += DMA_read_size; // Increment buffer position by number of bytes read

                // Read the CRC data now
                SPI_DMA_read(crc, 2);        

                DMA_stage = 2;
                break;
            }
            case 2: // Finished reading CRC
            {
                DMA_sectors_left--;

                if (DMA_sectors_left > 0)
                {
                    // Restart the process
                    DMA_stage = 0;
                    SPI_DMA_wait_token(buff, MAX_TOKEN_WAIT_BYTES);
                }
                else
                {
                    READ_DONE = 1;
                }

                break;
            }
        }            
    }

    // Re-enable SDO (leave the DMA module itself on, other channels may be using it)
    SPICONbits.DISSDO = 0;

    if (READ_ERROR)
        return 0;
    else
        return 1;                       
}

OK, I realise that's pretty long so let's break it up into sections.

Initialising all the variables used and starting the wait for the 0xFE token

// Divide by 512 to get the number of 512-byte sectors to read
DMA_sectors_left = btr >> 9;

// Initialise the state machine
READ_DONE = 0;
DMA_BUSY = 0;
DMA_stage = 0;
READ_ERROR = 0;

if (btr >= 512)
    DMA_read_size = 512; // Reading 512 bytes (one sector) at a time
else
    DMA_read_size = btr; // Reading less than 512 bytes at a time

// How many bytes do we need to read each time?
DMA_bytes_left = btr;

// Disable the SDO pin
SPICONbits.DISSDO = 1;

// Set the SDO pin to 1 so it will always output 0b11111111 (0xFF)
SDO_PIN = 1;

// Start waiting for the 0xFE token that precedes each sector read
SPI_DMA_wait_token(buff, MAX_TOKEN_WAIT_BYTES);

The one trick I used here, which I got from some Microchip forum ages ago, is this: we need to output a constant 0xFF, which is 0b11111111 in binary, so we can just disable the SDO pin of the SPI peripheral (by setting DISSDO to 1) and set the port pin to 1. The pin then outputs nothing but 1s, no matter what gets written to SPI2BUF. Pretty neat trick.

When I call SPI_DMA_wait_token(), I give it a maximum number of bytes to wait for; in my program it's 8192. The number of bytes it takes before 0xFE arrives is not fixed and seems to differ between SD cards. Let's take a look at the code in that function:

// SPI_DMA_wait_token sends out 0xFF and waits for the 0xFE token to come in. It will send a maximum of [num_bytes] bytes
void SPI_DMA_wait_token(unsigned char *buffer, unsigned int num_bytes)
{
    DCH0CONCLR = 1 << 7;            // Disable DMA Channel 0
    DCH1CONCLR = 1 << 7;            // Disable DMA Channel 1

    DCH0DSA = virt_to_phys(buffer); // Destination address
    DCH0CONCLR = 1 << 4;            // CHAEN = 0
    DCH0CONSET = 3 << 16;           // CHPRI = 3
    DCH0INTCLR = 0xFF00FF;          // Clear all DMA Channel 0 interrupt enables and flags
    DCH0INTSET = 0x90000;           // Enable the CHBCIE and CHERIE interrupts for DMA channel 0
    DCH0DAT = 0xFE;                 // Pattern to match: the 0xFE token
    DCH0ECONSET = 1 << 5;           // PATEN is enabled
    DCH0CONbits.CHPATLEN = 0;       // 8bit pattern

    DCH0DSIZ = num_bytes;           // Destination size is [num_bytes]

    DCH1SSIZ = num_bytes;           // Source size is [num_bytes]

    DCH1INTCLR = 0xFF00FF;          // Clear all DMA Channel 1 interrupt enables and flags
    DCH1SSA = virt_to_phys(buffer); // Source address

    IFS4CLR = 0b11 << 14;           // Clear SPI2RXIF and SPI2TXIF
    IFS4CLR = 1 << 6;               // Clear DMA1IF    
    IEC4SET = 1 << 6;               // Set DMA1IE

    DCH0CONSET = 1 << 7;            // Enable DMA Channel 0
    DCH1CONSET = 1 << 7;            // Enable DMA Channel 1
    DCH1ECONSET = 1 << 7;           // Set CFORCE on

    DMA_BUSY = 1;                   // DMA_BUSY flag set to 1 indicating active transfer
}

As always, before we configure anything, turn it off. In this case, clearing the CHEN bit of DCH0CON and DCH1CON does this fine. We do not want to disable the entire DMA module while we do this because we have no idea what the other 6 channels are doing.
The code is as discussed in my previous post, even the pattern matching, which is looking for the 8-bit value 0xFE (the character 'þ' in Latin-1). As we have previously set up most of the registers, we don't need to keep setting them up again. I routinely clear all the interrupt enables and flags in both DCH0INT and DCH1INT to avoid any potential problems they may cause. I am using the Channel Block Transfer Complete (CHBC) interrupt to tell me when the transfer is finished. The DMA_BUSY flag is my own internal flag that I wait on, to avoid hammering the DMA module's status bits and thus slowing down the transfer.

Once the 0xFE token is found, or MAX_TOKEN_WAIT_BYTES is exceeded, we will get to the next stage of the state machine.
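In software terms, what the pattern match is doing for us is roughly the following. This is a PC-side sketch with made-up names (bytes_until_token(), TOKEN_NOT_FOUND); on the chip the DMA hardware does the comparing, not a loop.

```c
#include <stddef.h>
#include <stdint.h>

#define TOKEN_NOT_FOUND ((size_t)-1)

/* Walk through the bytes received while clocking out 0xFF and report
   how many had arrived when the token showed up (counting the token
   itself). Gives up after max_bytes, like the bounded DMA transfer. */
static size_t bytes_until_token(const uint8_t *rx, size_t max_bytes,
                                uint8_t token)
{
    for (size_t i = 0; i < max_bytes; i++) {
        if (rx[i] == token)
            return i + 1;
    }
    return TOKEN_NOT_FOUND;
}
```

A card that answers 0xFF, 0xFF, 0xFE would report 3 here, and one that never sends 0xFE within the limit ends up on the give-up path.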

Starting a 512-byte sector read

case 0: // Finished waiting for 0xFE token
{
    if (DCH1SPTR > MAX_TOKEN_WAIT_BYTES) 
    {
        // 0xFE was not found in [MAX_TOKEN_WAIT_BYTES] bytes, give up
        READ_DONE = 1;
        READ_ERROR = 1;
    }
    else
    {
        if (DMA_bytes_left > DMA_read_size)
        {
            SPI_DMA_read(buff, DMA_read_size);
            DMA_bytes_left -= DMA_read_size;
        }
        else
        {
            SPI_DMA_read(buff, DMA_bytes_left);
            DMA_bytes_left = 0;
        }

        DMA_stage = 1;
    }
    break;
}

The first thing I do here is check the DMA Channel 1 Source Pointer to see how many bytes it actually sent before receiving 0xFE. If this exceeds MAX_TOKEN_WAIT_BYTES, I abort the transfer. If not, I check to see how many bytes I need to read and then call SPI_DMA_read(). Let's take a look at the code behind it:

// SPI_DMA_read sends out 0xFF and reads in the returned data into [buffer], for a total of [num_bytes] byte transfers
void SPI_DMA_read(unsigned char *buffer, unsigned int num_bytes)
{
    DCH0CONCLR = 1 << 7;            // Disable DMA Channel 0
    DCH1CONCLR = 1 << 7;            // Disable DMA Channel 1

    DCH0DSA = virt_to_phys(buffer); // Destination address
    DCH0INTCLR = 0xFF00FF;          // All flag and ints off
    DCH0INTSET = 0x80000;           // CHBCIE = 1
    DCH0DAT = 0xFFFF;
    DCH0ECONCLR = 1 << 5;           // PATEN is disabled

    DCH0DSIZ = num_bytes;           // Destination size

    DCH1SSIZ = num_bytes;           // Source size    

    DCH1INTCLR = 0xFF00FF;          // Clear all DMA Channel interrupt enables and flags
    DCH1SSA = virt_to_phys(buffer); // Source address

    IFS4CLR = 0b11 << 14;           // Clear SPI2RXIF and SPI2TXIF
    IFS4CLR = 1 << 6;               // Clear DMA1IF   
    IEC4SET = 1 << 6;               // Set DMA1IE

    DCH0CONSET = 1 << 7;            // Enable DMA Channel 0
    DCH1CONSET = 1 << 7;            // Enable DMA Channel 1
    DCH1ECONSET = 1 << 7;           // Set CFORCE on

    DMA_BUSY = 1;
}

This code is almost exactly the same as SPI_DMA_wait_token(). The only real difference is that it disables pattern matching. They could easily be combined into one function, but I've chosen to separate them for clarity as this is already a long and complicated subject.
Once the transfer is done, DMA_BUSY is again cleared and we move on to the next stage of the state machine, reading the 2-byte CRC.
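The interrupt handler that clears DMA_BUSY isn't listed in this post, so here is a hedged PC-side model of the idea. Everything in it is a stand-in: fake_dch0int plays the part of the DCH0INT register (block complete, CHBCIF, is bit 3 of the flag byte), and dma_busy_flag plays the part of DMA_BUSY. On the real chip the handler would also need the XC32 interrupt attribute and would have to clear the DMA interrupt flag in IFS4.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHBCIF_MASK (1u << 3) /* block transfer complete flag */

/* Shared between the main loop and the handler, hence volatile. */
static volatile bool dma_busy_flag = false;

/* Stand-in for the DCH0INT register on a PC. */
static volatile uint32_t fake_dch0int = 0;

/* What starting a transfer does to the flag in this model. */
static void dma_start_model(void)
{
    fake_dch0int &= ~CHBCIF_MASK;
    dma_busy_flag = true;
}

/* Body of a possible DMA channel 0 handler: acknowledge the
   block-complete flag and release the waiting state machine. */
static void dma0_handler_model(void)
{
    if (fake_dch0int & CHBCIF_MASK) {
        fake_dch0int &= ~CHBCIF_MASK;
        dma_busy_flag = false;
    }
}
```

The important detail is the volatile on the flag: without it the compiler is allowed to read DMA_BUSY once and spin on the stale value forever.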

Reading the CRC

case 1: // Finished reading data
{
    buff += DMA_read_size; // Increment buffer position by number of bytes read

    // Read the CRC data now
    SPI_DMA_read(crc, 2);        

    DMA_stage = 2;
    break;
}

Nothing much to say here. The CRC is 2 bytes of data and it must be read before either finishing the transfer or waiting for the 0xFE token again.
Note: This could easily be combined into the last SPI_DMA_read() before reading CRC, but I've chosen not to do this for clarity.
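The code above reads the two CRC bytes and then ignores them. If you wanted to actually check them, a minimal sketch could look like this, assuming the card sends a valid CRC16 (polynomial 0x1021, initial value 0, high byte first) after each data block. crc16_xmodem() and sector_crc_ok() are names I made up, and this bit-by-bit version is written for clarity rather than speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC16, polynomial 0x1021, initial value 0x0000. */
static uint16_t crc16_xmodem(const uint8_t *data, size_t len)
{
    uint16_t crc = 0x0000;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Compare the computed CRC with the two bytes read after the data.
   crc_bytes[0] is taken as the high byte. Returns 1 on a match. */
static int sector_crc_ok(const uint8_t *sector, size_t len,
                         const uint8_t crc_bytes[2])
{
    uint16_t received = (uint16_t)(((uint16_t)crc_bytes[0] << 8) | crc_bytes[1]);
    return crc16_xmodem(sector, len) == received;
}
```

In the state machine this would slot into case 2, called on the sector that was just read together with the crc array, before deciding whether to carry on to the next sector.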

Restarting the state machine if needed

case 2: // Finished reading CRC
{
    DMA_sectors_left--;

    if (DMA_sectors_left > 0)
    {
        // Restart the process
        DMA_stage = 0;
        SPI_DMA_wait_token(buff, MAX_TOKEN_WAIT_BYTES);
    }
    else
    {
        READ_DONE = 1;
    }

    break;
}

OK, we got the CRC. Do we have any more sectors left to read? If so, restart the wait for the 0xFE token again. In multi block reads, the first wait for the 0xFE token can require hundreds or even thousands of 0xFF bytes to be sent while the SD card gets ready, but subsequent waits for 0xFE usually require only a few. This is part of the reason multi block reads are much faster than single block ones.

Cleaning up

// Re-enable SDO
SPICONbits.DISSDO = 0;

if (READ_ERROR)
    return 0;
else
    return 1;                       

Don't forget to re-enable SDO or the SPI port is not going to work and you're going to spend hours debugging your code :)

As I've mentioned multiple times before, this code isn't perfect and it's still under development. It seems to be working so far but I wouldn't trust it in anything you truly care about. Again, the SD card SPI specification allows for a maximum of 25MHz, and my code is running at 50MHz so if you experience issues, that's the first place I'd look (set SPIBRG to 1 to get 25MHz).
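For reference, the PIC32 SPI clock works out as the peripheral clock divided by 2 × (SPIxBRG + 1). Here's a small sketch of that arithmetic, assuming a 100 MHz peripheral clock (which is what the 50 MHz and 25 MHz figures above imply). The function names spi_clock_hz() and spi_brg_for() are made up for this example.

```c
#include <stdint.h>

/* SPI clock for a given peripheral clock and SPIxBRG value:
   F_SCK = F_PBCLK / (2 * (BRG + 1)) */
static uint32_t spi_clock_hz(uint32_t pbclk_hz, uint32_t brg)
{
    return pbclk_hz / (2u * (brg + 1u));
}

/* Smallest BRG value that keeps the SPI clock at or below max_hz. */
static uint32_t spi_brg_for(uint32_t pbclk_hz, uint32_t max_hz)
{
    uint32_t brg = 0;

    while (spi_clock_hz(pbclk_hz, brg) > max_hz)
        brg++;
    return brg;
}
```

With a 100 MHz peripheral clock, a BRG of 0 gives 50 MHz and a BRG of 1 gives 25 MHz, matching the numbers above.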

Here's the code
