Final Result
The ManWorm TV
Over the past few months, Bayley and I have been working on the MANWORM TV. The MANWORM TV has an STM32F446RE microcontroller connected to a 4-bit resistor DAC and a buffer for generating composite video signals. We have developed simple games (Pong, racing, a Wolfenstein clone), 3D graphics (vector and raster), and even a program to play back video from an SD card (it once played back all of Star Wars Episode IV).
Here's what it looks like:
The NTSC video format is a series of horizontal scans, each approximately 63 microseconds long. Each line begins with a sync pattern and is followed by the video data. A higher voltage indicates a brighter white. To generate the waveform on the microcontroller, an interrupt is configured to fire every 63 microseconds. The interrupt loops through all the pixels in the line and changes the output voltage. The timing between pixels is achieved by inserting a few NOP instructions - there are only a few hundred nanoseconds between pixels!
There are two major drawbacks to this approach. The first is that drawing the full screen takes almost all of the CPU time, leaving no time to generate image data. In practice, I get around this by truncating lines early, which frees up around 20 microseconds of CPU time per line, and by having a more efficient way of generating the all-black lines that are offscreen and near the bottom. Notice that the Star Wars demo uses only a small fraction of the screen so that it can copy data from the SD card in time. You are also basically required to use double-buffering, which uses more than half the RAM on the microcontroller.
The second drawback is that the amount of time it takes to enter the interrupt and load the ISR into icache can be somewhat variable, causing some weirdness. You can see this most clearly in the Star Wars video.
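To make the waveform a little more concrete, here is a minimal sketch of how one scanline could be laid out as a sequence of 4-bit DAC codes: a sync pulse, a stretch of blanking, then the pixels. The level constants and sample counts are illustrative assumptions, not the values used on the MANWORM TV.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical 4-bit DAC codes; the real levels depend on the resistor values.
constexpr uint8_t kSyncLevel  = 0;   // sync tip (lowest voltage)
constexpr uint8_t kBlankLevel = 4;   // blanking / black
constexpr uint8_t kWhiteLevel = 15;  // brightest white

// Hypothetical sample counts for one ~63 us line.
constexpr int kSyncSamples   = 10;   // horizontal sync pulse
constexpr int kPorchSamples  = 12;   // blanking before the video starts
constexpr int kActiveSamples = 100;  // visible pixels

// Build one scanline: sync pulse, blanking, then the pixel data.
// Each input pixel is a brightness from 0 (black) to 15 (white).
// Missing pixels are drawn black, which is how a truncated line would look.
std::vector<uint8_t> buildScanline(const std::vector<uint8_t>& pixels) {
  std::vector<uint8_t> line;
  line.reserve(kSyncSamples + kPorchSamples + kActiveSamples);
  line.insert(line.end(), kSyncSamples, kSyncLevel);
  line.insert(line.end(), kPorchSamples, kBlankLevel);
  for (int i = 0; i < kActiveSamples; i++) {
    int brightness = i < static_cast<int>(pixels.size()) ? (pixels[i] & 0x0F) : 0;
    // Map brightness 0..15 onto the range between black and white.
    int level = kBlankLevel + brightness * (kWhiteLevel - kBlankLevel) / 15;
    line.push_back(static_cast<uint8_t>(level));
  }
  return line;
}
```

On the real hardware each of these values would be written to the DAC pins with a fixed delay between writes, which is where the NOP padding comes in.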
Gameboy Emulator
One day, I was bored and decided to try writing a Gameboy emulator. One weekend and ~20 hours of programming later, I was playing Pokemon.
Part 1
The Gameboy CPU is custom, but similar to both the Z80 and the Intel 8080. It has an 8-bit accumulator register, a 16-bit stack pointer register, a 16-bit program counter, and 6 other 8-bit registers, which can sometimes be used in pairs as a 16-bit register. It has 8 kB of internal RAM and 8 kB of VRAM, as well as additional RAM and ROM in the cartridge.
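As a rough illustration of the register pairing, the register file could be modelled as below. This is a sketch with made-up names, not the layout used in the emulator; the flags register `f` is included alongside the accumulator.

```cpp
#include <cstdint>

// Sketch of the register file: 8-bit registers that can be read as 16-bit pairs.
struct Registers {
  uint8_t a = 0;  // accumulator
  uint8_t f = 0;  // flags
  uint8_t b = 0, c = 0, d = 0, e = 0, h = 0, l = 0;  // the six general registers
  uint16_t sp = 0;  // stack pointer
  uint16_t pc = 0;  // program counter

  // The first register of each pair holds the high byte.
  uint16_t bc() const { return static_cast<uint16_t>((b << 8) | c); }
  uint16_t de() const { return static_cast<uint16_t>((d << 8) | e); }
  uint16_t hl() const { return static_cast<uint16_t>((h << 8) | l); }

  void setHL(uint16_t value) {
    h = static_cast<uint8_t>(value >> 8);
    l = static_cast<uint8_t>(value & 0xff);
  }
};
```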
The first step to writing the Gameboy emulator was to write the memory and CPU subsystems. There are several types of Gameboy cartridges, which contain the game data, stored in (possibly multiple) ROM banks, as well as additional RAM. The memory system keeps track of which memory banks are currently mapped, and can do reads/writes of Gameboy memory. The memory layout is roughly as follows:
- 16 kB ROM bank #0 (always mapped to this bank, contains Interrupt handlers)
- 16 kB switchable ROM bank
- 8 kB VRAM (stores tiles)
- 8 kB switchable RAM bank (cartridge RAM)
- 8 kB internal RAM
- Mirror of 8 kB of internal RAM
- 160 bytes of Sprite Attribute Memory (where each sprite should be)
- Various registers
- Fast Top RAM (used for stack)
- Interrupt Enable byte
In my first pass, I did not implement any of the registers and only implemented internal memory.
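The layout above corresponds to fixed address ranges on the Gameboy, so the first job of a memory subsystem is to work out which region an address falls in. A minimal sketch (the enum and function names are made up for illustration):

```cpp
#include <cstdint>

enum class Region {
  Rom0, RomBanked, Vram, CartRam, InternalRam, EchoRam,
  Oam, Unused, IoRegs, HighRam, InterruptEnable
};

// Classify a 16-bit Gameboy address by the region it belongs to.
Region decodeAddress(uint16_t addr) {
  if (addr < 0x4000) return Region::Rom0;         // 16 kB ROM bank #0
  if (addr < 0x8000) return Region::RomBanked;    // 16 kB switchable ROM bank
  if (addr < 0xA000) return Region::Vram;         // 8 kB VRAM
  if (addr < 0xC000) return Region::CartRam;      // 8 kB switchable cartridge RAM
  if (addr < 0xE000) return Region::InternalRam;  // 8 kB internal RAM
  if (addr < 0xFE00) return Region::EchoRam;      // mirror of internal RAM
  if (addr < 0xFEA0) return Region::Oam;          // 160 bytes of sprite attributes
  if (addr < 0xFF00) return Region::Unused;       // gap between OAM and registers
  if (addr < 0xFF80) return Region::IoRegs;       // various registers
  if (addr < 0xFFFF) return Region::HighRam;      // fast top RAM
  return Region::InterruptEnable;                 // 0xFFFF
}
```

A read or write function would then switch on the region and index into the matching buffer, applying the current bank offset for the two switchable regions.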
Next, I started writing the CPU emulator. The Gameboy has around 512 opcodes. 256 of them have a single byte indicating what the opcode is, followed by a few bytes of arguments. The remaining 256 opcodes start with the byte 0xCB and then have a second byte indicating the function. The CPU emulator I started with uses the following approach:
- increment the DIV register (increases at 16 kHz)
- Read the opcode at PC
- Interpret the opcode at PC (does the function of the opcode, increments PC, increments the cycle count, which is different for different instructions)
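The fetch and interpret steps, including the 0xCB prefix, can be sketched as follows. Only three instructions are handled, flag updates are omitted, and a `switch` stands in for the two 256-entry tables a full emulator would use; the struct and function names are made up for this example.

```cpp
#include <cstdint>

struct Cpu {
  uint8_t a = 0;                 // accumulator
  uint16_t pc = 0;               // program counter
  uint64_t cycles = 0;           // running cycle count
  const uint8_t* mem = nullptr;  // flat program memory, for this sketch only
};

using OpHandler = void (*)(Cpu&);

// Each handler does the work, advances PC, and adds its own cycle cost.
void opNop(Cpu& cpu)  { cpu.pc += 1; cpu.cycles += 4; }            // 0x00
void opIncA(Cpu& cpu) { cpu.a++; cpu.pc += 1; cpu.cycles += 4; }   // 0x3C
void opSwapA(Cpu& cpu) {                                           // 0xCB 0x37
  cpu.a = static_cast<uint8_t>((cpu.a << 4) | (cpu.a >> 4));
  cpu.pc += 2;
  cpu.cycles += 8;
}
// Placeholders so unknown opcodes still advance in this sketch.
void opUnknown(Cpu& cpu)   { cpu.pc += 1; cpu.cycles += 4; }
void opUnknownCb(Cpu& cpu) { cpu.pc += 2; cpu.cycles += 8; }

// Pick a handler: 0xCB selects the second set, keyed by the following byte.
OpHandler lookup(const Cpu& cpu) {
  uint8_t opcode = cpu.mem[cpu.pc];
  if (opcode == 0xCB) {
    uint8_t ext = cpu.mem[cpu.pc + 1];
    return ext == 0x37 ? opSwapA : opUnknownCb;
  }
  switch (opcode) {
    case 0x00: return opNop;
    case 0x3C: return opIncA;
    default:   return opUnknown;
  }
}

// Execute one instruction and return how many cycles it took.
uint32_t step(Cpu& cpu) {
  uint64_t before = cpu.cycles;
  lookup(cpu)(cpu);
  return static_cast<uint32_t>(cpu.cycles - before);
}
```

Returning the cycle count from `step` is what lets the video and timer emulation stay in sync with the CPU.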
Finally, I set up the video system emulator. It is updated after every single emulated instruction and is told how many emulated cycles have elapsed. The video system writes to the display line-by-line and is timed off of the CPU clock. I implemented a few of the special registers which tell the CPU what line is currently being drawn.
After all this, I was able to partially emulate the Gameboy BIOS, which would normally display the Nintendo logo. Without video, it was challenging to verify it was actually working, but I was printing each write to the video SCROLLY register, which showed that the CPU was decrementing this register to zero, waiting around a second, then attempting to jump out of the BIOS.
Part 2
The next step was to implement more CPU instructions, like the shifts, bit sets/clears/checks, fix a few bugs in setting various flag bits, and add in interrupts. When going into an interrupt, the address to return to is pushed onto the stack, and the PC jumps straight to the interrupt handler. There is a master enable/disable of interrupts accomplished with the EI and DI instructions, as well as an interrupt mask byte. The first interrupt I added was the V-Blank interrupt, which runs 60 times per second, when the video RAM is not being used by the display hardware and can be accessed. Here's what the main CPU step function looks like:
// set the globalState so the next call to step() will run an ISR
void interrupt(u16 addr) {
  globalState.ime = 0; // disable interrupts
  writeU16(globalState.pc, globalState.sp - (u16)2); // push pc to stack
  globalState.sp -= 2; // push pc to stack
  globalState.cycleCount += 12; // timing
  globalState.pc = addr; // jump to ISR
}

// step the CPU 1 instruction
// returns number of clock cycles
u32 cpuStep() {
  uint64_t oldCycleCount = globalState.cycleCount;

  // update div register
  globalMemState.ioRegs[IO_DIV] = (u8)((globalState.cycleCount - globalState.divOffset) >> 8);

  // execute, if we aren't halted.
  if(!globalState.halt) {
    // fetch opcode
    u8 opcode = readByte(globalState.pc);
    // execute opcode
    opcodes[opcode](opcode);
  }

  // interrupts
  if(globalState.ime && globalMemState.upperRam[0x7f] && globalMemState.ioRegs[IO_IF]) {
    // mask interrupts with the interrupt enable register at 0xffff.
    u8 interrupts = globalMemState.upperRam[0x7f] & globalMemState.ioRegs[IO_IF];
    if(interrupts & 0x01) {
      globalMemState.ioRegs[IO_IF] &= ~1;
      interrupt(VBLANK_INTERRUPT);
      globalState.halt = false;
    } else if(interrupts & 0x02) {
      globalMemState.ioRegs[IO_IF] &= ~2;
      interrupt(LCDC_INTERRUPT);
      globalState.halt = false;
    } else if(interrupts & 0x04) {
      globalMemState.ioRegs[IO_IF] &= ~4;
      interrupt(TIMER_INTERRUPT);
      globalState.halt = false;
    } else if(interrupts & 0x08) {
      globalMemState.ioRegs[IO_IF] &= ~8;
      interrupt(SERIAL_INTERRUPT);
      globalState.halt = false;
    } else if(interrupts & 0x10) {
      globalMemState.ioRegs[IO_IF] &= ~0x10;
      interrupt(HIGH_TO_LOW_P10_P13);
      globalState.halt = false;
    }
  }

  // even if we have IME off, and we're halted, we're supposed to check IF and IE register
  // this won't fire an interrupt, but will get us out of halt
  // this behavior isn't well documented, but was required to pass cpu_instr.gb
  // (though none of the games seem to need it...)
  if(globalState.halt) {
    globalState.cycleCount += 400; // just to keep ticking the timer...
    u8 interrupts = globalMemState.upperRam[0x7f] & globalMemState.ioRegs[IO_IF];
    if(interrupts & 0x01) {
      globalState.halt = false;
    } else if(interrupts & 0x02) {
      globalState.halt = false;
    } else if(interrupts & 0x04) {
      globalState.halt = false;
    } else if(interrupts & 0x08) {
      globalState.halt = false;
    } else if(interrupts & 0x10) {
      globalState.halt = false;
    }
  }

  // cycle count
  uint64_t cyclesThisIteration = globalState.cycleCount - oldCycleCount;
  if(cyclesThisIteration == 0) {
    printf("0 cycles!\n");
  }

  // update timer
  u8 tac = globalMemState.ioRegs[IO_TAC];
  bool ten = ((tac >> 2) & 1) != 0; // timer enable?
  if(ten) {
    u8 tclk = (tac & (u8)3); // timer speed
    globalState.timSubcount += cyclesThisIteration;
    if(globalState.timSubcount >= timReset[tclk]) { // timer tick
      globalState.timSubcount = 0;
      u8 timv = globalMemState.ioRegs[IO_TIMA]; // check for overflow
      if(timv == 255) {
        globalMemState.ioRegs[IO_IF] |= 4; // set interrupt
        globalMemState.ioRegs[IO_TIMA] = globalMemState.ioRegs[IO_TMA]; // reset
      } else {
        globalMemState.ioRegs[IO_TIMA] = timv + (u8)1; // increment.
      }
    }
  }

  return (u32)cyclesThisIteration;
}

// example opcode:
void CALL_NZ(u8 opcode) { // 0xC4
  globalState.pc++; // increment PC to point to immediate
  u16 addr = readU16(globalState.pc); // load 16-bit immediate (address to jump to)
  globalState.pc += 2; // increment PC past immediate
  if(!getZeroFlag()) { // if the zero flag is not set, then take the jump
    writeU16(globalState.pc, globalState.sp - (u16)2); // push the current PC onto the stack
    globalState.sp -= 2;
    globalState.pc = addr; // jump
  }
  globalState.cycleCount += 12; // update cycles
}
I also added a frame buffer and display window with SDL. It displays the framebuffer as an SDL texture. SDL was also used for reading the keyboard, which also triggers an interrupt. The Gameboy uses a scan matrix to determine which keys are pressed.
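The scan matrix works like this: the game writes to the joypad register at 0xFF00 to select either the direction keys or the buttons, then reads the low four bits back, where a 0 means "pressed". A minimal sketch of the read side, with made-up struct and function names (this is not the emulator's own code):

```cpp
#include <cstdint>

struct Buttons {
  bool right = false, left = false, up = false, down = false;
  bool a = false, b = false, select = false, start = false;
};

// Compute the value read back from the joypad register (0xFF00).
// selectBits is the last value the game wrote; bits 4 and 5 pick a row
// and are active low, as are the four key lines in bits 0-3.
uint8_t readJoypad(uint8_t selectBits, const Buttons& btn) {
  uint8_t low = 0x0F;  // all four lines released
  if (!(selectBits & 0x10)) {  // direction row selected
    if (btn.right) low &= ~0x01;
    if (btn.left)  low &= ~0x02;
    if (btn.up)    low &= ~0x04;
    if (btn.down)  low &= ~0x08;
  }
  if (!(selectBits & 0x20)) {  // button row selected
    if (btn.a)      low &= ~0x01;
    if (btn.b)      low &= ~0x02;
    if (btn.select) low &= ~0x04;
    if (btn.start)  low &= ~0x08;
  }
  // The unused top two bits are returned as 1 in this sketch.
  return static_cast<uint8_t>(0xC0 | (selectBits & 0x30) | low);
}
```

With SDL, the `Buttons` struct would be filled in from key-down and key-up events, and a newly pressed key would also raise the joypad interrupt.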
At this point, I also implemented the background render. The VRAM is filled with tile data, as well as a tile map, which tells you which tile should go in which spot. There are also SCROLLX and SCROLLY registers, which allow you to scroll the background around (it wraps!).
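The lookup from a screen pixel to a tile works out as below. The background is 256x256 pixels (32x32 tiles of 8x8), so adding the scroll values in 8-bit arithmetic gives the wraparound for free. The struct and function names are made up for this sketch.

```cpp
#include <cstdint>

struct TileCoord {
  uint16_t mapIndex;  // index into the 32x32 tile map
  uint8_t tileX;      // pixel column inside the tile (0..7)
  uint8_t tileY;      // pixel row inside the tile (0..7)
};

// Find which tile-map entry and which pixel of that tile cover a screen pixel.
TileCoord locateBackgroundPixel(uint8_t screenX, uint8_t screenY,
                                uint8_t scrollX, uint8_t scrollY) {
  // Truncating to 8 bits wraps the position around the 256-pixel background.
  uint8_t bgX = static_cast<uint8_t>(screenX + scrollX);
  uint8_t bgY = static_cast<uint8_t>(screenY + scrollY);
  TileCoord coord;
  coord.mapIndex = static_cast<uint16_t>((bgY >> 3) * 32 + (bgX >> 3));
  coord.tileX = bgX & 7;
  coord.tileY = bgY & 7;
  return coord;
}
```

The byte stored at `mapIndex` in the tile map is the tile number, which in turn selects 16 bytes of tile data in VRAM.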
After a few issues with bit-shifting operations, I was able to get the following:
Notice that the (R) is a bit corrupt. This is because the Gameboy BIOS ROM I copied from the internet is slightly corrupted...
I was also able to run the "blargg CPU instruction test ROM", which showed that there were still many bugs in the CPU emulation and would crash before running all tests:
Part 3
In round 3 of Gameboy programming, I fixed more CPU bugs, implemented some ROM bank mapping, implemented the DMA function, and added the sprite renderer. This let me play Tetris, though the colors were still wrong:
I could also run the entire CPU instruction test without crashing the emulator, though several instructions still failed.
Part 4
After fixing a few more CPU bugs, implementing the HALT instruction (suspend until next interrupt), and adding the programmable timer, I was able to boot Dr. Mario. There were still a few bugs related to tile maps and sprite transparency that needed fixing:
Part 5
Next up was a rewrite of the ROM/RAM banking system to be more flexible. I finally revisited the graphics, fixing the tiles, adding the window renderer, and implementing palettes/transparency. There were still a few small bugs, but it was good enough to start Pokemon. Here's what the memory banking code looked like:
// handler for MBC3 switch (doesn't handle anything yet...)
void mbc3Handler(u16 addr, u8 value) {
  if(addr >= 0x2000 && addr <= 0x3fff) {
    // ROM bank switch
    if(value >= globalMemState.nRomBanks) {
      printf("\trequested rom bank %d when there are only %d banks!\n", value, globalMemState.nRomBanks);
      assert(false);
    }
    if(value == 0) value = 1;
    globalMemState.mappedRom = fileData + 0x4000 * value;
  } else if(addr <= 0x1fff) {
    // RAM enable/disable
    if(value == 0) {
      globalMemState.disabledMappedRam = globalMemState.mappedRam;
      globalMemState.mappedRam = nullptr;
    } else if(value == 0xa) {
      globalMemState.mappedRam = globalMemState.disabledMappedRam;
    } else {
      //assert(false);
    }
  } else if(addr >= 0x4000 && addr <= 0x5fff) {
    // RAM bank switch
    if(value < globalMemState.nRamBanks) {
      globalMemState.mappedRam = globalMemState.mappedRamAllocation + 0x2000 * value;
    } else {
      //assert(false);
    }
  } else if(addr >= 0x6000 && addr <= 0x7fff) {
    // ?? RTC latch nonsense
  } else {
    assert(false);
  }
}
Part 6
Finally, Pokemon was working correctly. Here's the renderLine function, which draws a single line onto the screen:
// from a pointer to a tile, read a pixel
u8 readSpriteTileAddr(u8* tileAddr, u8 x, u8 y, u8 palette) {
  //tileAddr = 0x8180;
  assert(x < 8);
  assert(y < 8);
  x = (7 - x);
  u8* loAddr = tileAddr + (y*(u16)2);
  u8* hiAddr = loAddr + (u16)1;
  u8 lo = *(loAddr);
  u8 hi = *(hiAddr);
  u8 loV = (lo >> x) & (u8)1;
  u8 hiV = (hi >> x) & (u8)1;
  u8 colorIdx = loV + 2 * hiV;
  if(colorIdx == 0) {
    return TRANSPARENT_SPRITE;
  }
  u8 colorID = (palette >> (2 * colorIdx)) & 3;
  return colors[colorID];
}

// find the address of the given tile
u8* computeTileAddrPtr(u8 tileIdx, bool tileData) {
  if(tileData) {
    return globalMemState.vram + 16 * tileIdx;
  } else {
    if(tileIdx <= 127) {
      return globalMemState.vram + 0x1000 + 16 * tileIdx;
    } else {
      return globalMemState.vram + 16 * (tileIdx);
    }
  }
}

// main function to render a line of the display.
// this implementation is missing a number of things, including (but not limited to)
// -- proper position of the WINDOW
// -- 16x8 sprites
// -- sprite sorting
// -- 10 sprite limit
void renderLine() {
  u8 lcdc = globalMemState.ioRegs[IO_LCDC]; // lcd control register
  bool lcdOn = (lcdc >> 7) & (u8)1; // lcd display on?
  bool windowTileMap = (lcdc >> 6) & (u8)1; // select tilemap source for window
  bool windowEnable = (lcdc >> 5) & (u8)1; // draw window?
  bool tileData = (lcdc >> 4) & (u8)1; // select tile data source
  bool bgTileMap = (lcdc >> 3) & (u8)1; // select tilemap source for background
  bool objSize = (lcdc >> 2) & (u8)1; // pick sprite size (nyi)
  bool objEnable = (lcdc >> 1) & (u8)1; // enable sprite renderer
  bool bgWinEnable = (lcdc >> 0) & (u8)1; // enable background and window renderer

  u16 windowMapAddr = (u16)(windowTileMap ? 0x9c00 : 0x9800);
  u16 bgTileMapAddr = (u16)(bgTileMap ? 0x9c00 : 0x9800);

  // background renderer
  if(lcdOn && bgWinEnable) {
    // render background onto framebuffer
    u8 pal = globalMemState.ioRegs[IO_BGP]; // color palette
    u16 tileMapRowAddr = (u16)(bgTileMapAddr + 32*((((u16)globalVideoState.line +
                               globalMemState.ioRegs[IO_SCROLLY]) & (u16)255) >> 3)); // address of the row of the tilemap
    u8 tileMapColIdx = globalMemState.ioRegs[IO_SCROLLX] >> 3; // column index of the tilemap
    u8 yPixOffset = ((u8)globalVideoState.line + globalMemState.ioRegs[IO_SCROLLY]) & (u8)7; // y-pixel of tile
    u8 xPixOffset = globalMemState.ioRegs[IO_SCROLLX] & (u8)7; // x-pixel of tile
    u8 tileIdx = readByte(tileMapRowAddr + tileMapColIdx); // tile index

    // loop over pixels in the line
    for(u8 px = 0; px < 160; px++) {
      globalVideoState.frameBuffer[xy2px(px,globalVideoState.line)] =
          readTilePtr(computeTileAddrPtr(tileIdx, tileData), xPixOffset, yPixOffset, pal);
      xPixOffset++; // increment tile pixel
      if(xPixOffset == 8) { // if we have overflowed the tile
        xPixOffset = 0; // go to the beginning
        tileMapColIdx = (tileMapColIdx + 1) & 31; // of the next tile (allow wraparound)
        tileIdx = readByte(tileMapRowAddr + tileMapColIdx); // and look up the tile index in the tile map
      }
    }
  }

  // window renderer
  if(windowEnable) {
    u8 pal = globalMemState.ioRegs[IO_BGP]; // palette
    u8 wx = globalMemState.ioRegs[IO_WINX]; // location of the window (nyi)
    u8 wy = globalMemState.ioRegs[IO_WINY]; // location of the window (nyi)
    if(wx > 166 || wy > 143) {
      // if the window is out of this range, it is disabled too.
    } else {
      u16 tileMapRowAddr = windowMapAddr + 32*((((u16)globalVideoState.line)) >> 3); // address of the row of the tilemap
      u8 tileMapColIdx = 0; // column index of the tilemap
      u8 yPixOffset = ((u8)globalVideoState.line) & (u8)7; // y-pixel of tile
      u8 xPixOffset = 0; // x-pixel of tile
      u8 tileIdx = readByte(tileMapRowAddr + tileMapColIdx); // tile index

      // loop over pixels in the line
      for(u8 px = 0; px < 160; px++) {
        globalVideoState.frameBuffer[xy2px(px,globalVideoState.line)] =
            readTilePtr(computeTileAddrPtr(tileIdx, tileData), xPixOffset, yPixOffset, pal);
        xPixOffset++; // increment tile pixel
        if(xPixOffset == 8) { // if we have overflowed the tile
          xPixOffset = 0; // go to the beginning
          tileMapColIdx = (tileMapColIdx + 1) & 31; // of the next tile (allow wraparound, but it shouldn't happen?)
          tileIdx = readByte(tileMapRowAddr + tileMapColIdx); // and look up the tile index in the tile map
        }
      }
    }
  }

  // sprite renderer
  if(objEnable) {
    for(u16 spriteID = 0; spriteID < 40; spriteID++) {
      u16 oamPtr = 0xfe00 + 4 * spriteID; // sprite information table
      u8 spriteY = readByte(oamPtr); // y-coordinate of sprite
      u8 spriteX = readByte(oamPtr + 1); // x-coordinate of sprite
      u8 patternIdx = readByte(oamPtr + 2); // sprite pattern
      u8 flags = readByte(oamPtr + 3); // flag bits

      bool pri = (flags >> 7) & (u8)1; // priority (transparency stuff)
      bool yFlip = (flags >> 6) & (u8)1; // flip around y?
      bool xFlip = (flags >> 5) & (u8)1; // flip around x?
      bool palID = (flags >> 4) & (u8)1; // palette ID (OBP0/OBP1)

      u8 pal = palID ? globalMemState.ioRegs[IO_OBP1] : globalMemState.ioRegs[IO_OBP0];

      if(spriteX | spriteY) {
        // the sprite coordinates have an offset
        u8 spriteStartY = spriteY - 16;
        u8 spriteLastY = spriteStartY + 8; // todo 16 row sprites

        // reject based on y if the sprite won't be visible in the current line
        if(globalVideoState.line < spriteStartY || globalVideoState.line >= spriteLastY) {
          continue;
        }

        // get y px relative to the sprite pattern
        u8 tileY = globalVideoState.line - spriteStartY;
        if(yFlip) {
          tileY = 7 - tileY;
        }
        assert(tileY < 8);

        // loop over the 8 pixels that the sprite is on:
        for(u8 tileX = 0; tileX < 8; tileX++) {
          u8 xPos = spriteX - 8 + tileX; // position on the screen
          if(xPos >= 160) continue; // reject if we go off the end, don't wrap around

          u32 fbIdx = xy2px(xPos, globalVideoState.line);

          // current color at the screen
          u8 old = globalVideoState.frameBuffer[fbIdx];

          // get the pixel from the sprite pattern data
          u8 tileLookupX = tileX;
          if(xFlip) {
            tileLookupX = 7 - tileX;
          }
          u8 tileValue = readSpriteTileAddr(globalMemState.vram + patternIdx * 16, tileLookupX, tileY, pal);

          // don't draw transparent
          if(tileValue == TRANSPARENT_SPRITE) continue; // (transparent sprites)

          if(!pri) {
            globalVideoState.frameBuffer[fbIdx] = tileValue;
          } else {
            if(old == BRIGHTEST_COLOR) {
              globalVideoState.frameBuffer[fbIdx] = tileValue;
            }
          }
        }
      }
    }
  }
}
This code is available on GitHub here:
https://github.com/dicarlo236/gameboy
It doesn't run every game, but seems to be pretty good with the games that it does. There are two known bugs with this version: the button reading emulation is slightly wrong, causing some games like Pokemon Green to have trouble, and some of the video registers are wrong, causing the move CONFUSION in Pokemon to get stuck forever.
Part 7 - On to the STM32F446RE!
The next step was to port the emulator to the microcontroller. I wrote the code knowing that I'd have to do this port, so it was pretty straightforward. From the beginning, it was clear that this would be a struggle - I needed to remove the bitmapped font in order to have enough memory to store everything. There were problems with both RAM and flash size - there simply wasn't enough flash storage to store Pokemon Red, and I didn't have enough RAM to do double buffering. Despite this, I got the Gameboy code booting in around 20 minutes, though it was extremely slow. I implemented a number of tricks to speed up the game:
- Only output frames at ~10fps
- Improve the DMA and tile reading functions to be much faster (use memcpy instead of for loop which uses gameboy memory subsystem)
- Skip cycles when the CPU is in halt mode, but this does cause timing issues in some games
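As an illustration of the second trick, here is a sketch of the sprite DMA done with a single `memcpy`. On the Gameboy, writing a value XX to the DMA register copies 160 bytes from address XX00 into sprite attribute memory at 0xFE00. The flat 64 kB array is a simplification for this example; the real emulator has separate banked buffers, so the source pointer would have to be resolved to the right buffer first.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Flat 64 kB address space, used only to keep this sketch self-contained.
struct FlatMemory {
  uint8_t bytes[0x10000] = {};
};

// Copy 160 bytes from (value << 8) into sprite attribute memory at 0xFE00.
// One memcpy replaces 160 calls through the general read/write path.
void oamDma(FlatMemory& mem, uint8_t value) {
  assert(value <= 0xDF);  // keeps source and destination from overlapping
  uint16_t source = static_cast<uint16_t>(value << 8);
  std::memcpy(mem.bytes + 0xFE00, mem.bytes + source, 160);
}
```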
As you can see, the quality of the image is poor, and we are limited to running games which have a small ROM, small RAM, use the halt instruction, and are not CPU intensive. Tetris ran much slower.
Part 8 - Pokemon?
I spent a lot more time improving the speed of the emulator, mostly related to graphics and more intelligent cycle skipping. I was able to shrink some video buffers down in size to give me enough RAM for Pokemon, but I did not have enough memory to store the game in flash. To get around this, I broke up the 1 MB ROM into a bunch of pages, then stored as many as possible of the most commonly used pages in flash. I then used the remaining 32 kB of RAM on the Nucleo and an SD card to implement a caching system that would load in pages from the SD card as needed, then store them in the RAM cache until another page needed the same spot. I got the best performance when making the pages the same size as the SD card read block size. The best cache design used a hash table with a limited-length linear probing scheme and LRU replacement (it probed a limited number of slots, then replaced the least recently used entry among them). Unfortunately, there are simply some sequences in Pokemon which require tons of bank switches, meaning I need to read from the SD card incredibly often. Some basic timing calculations showed that it was unlikely Pokemon would ever run at full speed using this technique, but it did work:
#define CACHE_ENTRY_SIZE_LG2 9
#define N_CACHE_ENTRIES_LG2 6

u8 romBankCache[(1 << CACHE_ENTRY_SIZE_LG2) * (1 << N_CACHE_ENTRIES_LG2)];
u32 blockIDs[(1<<N_CACHE_ENTRIES_LG2)] = {0};
u32 cacheBlock = 0xffffffff;

u8 readByteRomBank(u16 addr, u8 bank) {
  u16 bankedRomAddress = addr & 0x3fff; // address relative to start of memory bank
  u32 fileAddress = 0x4000 * bank + bankedRomAddress; // address relative to start of cartridge
  u32 block = fileAddress >> CACHE_ENTRY_SIZE_LG2; // block index (relative to start of cartridge)
  u32 tableSlot = hash(block) & ((1 << N_CACHE_ENTRIES_LG2) - 1);
  assert(tableSlot < (1 << N_CACHE_ENTRIES_LG2));
  if (blockIDs[tableSlot] == block) {
    // cache hit!
    return romBankCache[tableSlot * (1 << CACHE_ENTRY_SIZE_LG2) + (fileAddress & ((1 << CACHE_ENTRY_SIZE_LG2) - 1))];
  } else {
    // cache miss.
    fseek(fp, (block << CACHE_ENTRY_SIZE_LG2), SEEK_SET);
    fread(romBankCache + (tableSlot * (1 << CACHE_ENTRY_SIZE_LG2)), 1, (1 << CACHE_ENTRY_SIZE_LG2), fp);
    blockIDs[tableSlot] = block;
    return romBankCache[tableSlot * (1 << CACHE_ENTRY_SIZE_LG2) + (fileAddress & ((1 << CACHE_ENTRY_SIZE_LG2) - 1))];
  }
}
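The listing above maps each block to exactly one slot. The limited-length linear probing with LRU replacement described in the text might look roughly like the following sketch. The probe length, the hash function, the `loadBlock` callback, and all of the names are assumptions made for illustration; this is not the code that ran on the microcontroller.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

constexpr uint32_t kBlockSize = 512;        // bytes per cached block
constexpr uint32_t kNumSlots  = 64;         // 64 * 512 B = 32 kB of cache
constexpr uint32_t kProbeLen  = 4;          // slots examined per lookup (assumed)
constexpr uint32_t kEmpty     = 0xffffffff; // marks a slot that holds nothing

struct BlockCache {
  uint8_t data[kNumSlots][kBlockSize];
  uint32_t blockIds[kNumSlots];
  uint32_t lastUsed[kNumSlots];  // 0 = never used, so empty slots are evicted first
  uint32_t tick = 0;             // logical clock (wraparound ignored here)
  uint32_t misses = 0;
  // Hypothetical loader: fills dst with kBlockSize bytes of the given block,
  // e.g. by reading from the SD card.
  std::function<void(uint32_t block, uint8_t* dst)> loadBlock;

  BlockCache() {
    for (uint32_t i = 0; i < kNumSlots; i++) {
      blockIds[i] = kEmpty;
      lastUsed[i] = 0;
    }
  }

  // Placeholder multiplicative hash.
  static uint32_t hashBlock(uint32_t block) { return block * 2654435761u; }

  uint8_t read(uint32_t fileAddress) {
    uint32_t block = fileAddress / kBlockSize;
    uint32_t offset = fileAddress % kBlockSize;
    uint32_t home = hashBlock(block) % kNumSlots;
    uint32_t victim = home;
    // Probe a fixed number of slots, remembering the least recently used one.
    for (uint32_t i = 0; i < kProbeLen; i++) {
      uint32_t slot = (home + i) % kNumSlots;
      if (blockIds[slot] == block) {  // cache hit
        lastUsed[slot] = ++tick;
        return data[slot][offset];
      }
      if (lastUsed[slot] < lastUsed[victim]) victim = slot;
    }
    // Cache miss: overwrite the least recently used slot in the probe window.
    assert(loadBlock);
    misses++;
    loadBlock(block, data[victim]);
    blockIds[victim] = block;
    lastUsed[victim] = ++tick;
    return data[victim][offset];
  }
};
```

Compared with the one-slot-per-block version, two hot blocks that hash to the same place can now live side by side, at the cost of a few extra comparisons per read.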
Part 9 - A Better Microcontroller
The solution to my speed problem was to switch to a better microcontroller. Bayley recently purchased some STM32H7 dev boards, which have a roughly 2x faster clock and enough flash to store all of Pokemon Red. However, this meant porting all of the Gameboy and NTSC video code from the MBED online compiler to the AC6 Workbench, learning how to do interrupts on the H7, and making another DAC out of some resistors and a random op-amp. I didn't know it at the time, but I was mistakenly programming the H7 in some sort of debug mode (even though I compiled with -O2...) which gave it around a factor of 3 decrease in performance. Even then, the performance improvement was huge. Pokemon was now much closer to real time (running at 60 fps!), and simple games like Mario Land and Dr. Mario were running at full speed! I also implemented the sound subsystem of the Gameboy, and used a "1-bit DAC" (aka a digital output pin) to play back the music.
The sound was very bad, so I switched to the built-in DAC, which improved things a lot. There are a number of hacks to get the sound working (the arbitrary waveform is always a triangle, the noise channel is greatly simplified...) but the trick to getting a nice sound is to run the sound interrupt inside of the video interrupt. This only gives us 15 kHz sampling, but it's not the end of the world.
Here's a video showing the progression of the sound system, from absolutely terrible to halfway decent:
///////////////
// channel 1
///////////////

// update the sndState with frequency from the NR13/NR14 register
void ch1_update_freq() {
  u8 nr13 = globalMemState.ioRegs[IO_NR13];
  u8 nr14 = globalMemState.ioRegs[IO_NR14];
  // compute the "gb freq"
  u16 gbFreq = (((u16)nr14 & 0x7) << 8) + (u16)nr13;
  sndState.ch1.gbFreq = gbFreq;
  // compute frequency in Hz (todo don't use floats)
  float gbf = gbFreq;
  float freqHz = 131072.f / (2048.f - gbf);
  //printf("freq %f\r\n", freqHz);
  // compute period in timer ticks
  float period = 2.f*15625.f / freqHz;
  sndState.ch1.per = period;
}

void ch1_length_load() {
  ch1_update_freq();
  // todo: currently ignoring duty cycle
  // first check the mode:
  u8 nr14 = globalMemState.ioRegs[IO_NR14];
  u8 lenEnable = (nr14 & 0x40) != 0;
  sndState.ch1.lenCntEnable = lenEnable;
  if(lenEnable) {
    //printf("length load\r\n");
    u8 nr11 = globalMemState.ioRegs[IO_NR11];
    sndState.ch1.len = nr11 & 0x3f;
    sndState.ch1.enable = 1;
  } else {
    //sndState.ch1.enable = 1; // not 100% sure about this...
  }
}

void ch1_trigger_load() {
  u8 nr14 = globalMemState.ioRegs[IO_NR14];
  u8 nr12 = globalMemState.ioRegs[IO_NR12];
  ch1_update_freq();
  u8 lenEnable = (nr14 & 0x40) != 0;
  sndState.ch1.lenCntEnable = lenEnable;
  //printf("trigger: 0x%x\r\n", nr14);
  if(nr14 & 0x80) {
    // trigger!
    u8 startingVol = nr12 >> 4;
    sndState.ch1.enable = 1;
    sndState.ch1.vol = startingVol;
    u8 nr11 = globalMemState.ioRegs[IO_NR11];
    sndState.ch1.len = (64 - (nr11 & 0x3f));
  }
}

void ch1_len_update() {
  if(sndState.ch1.lenCntEnable && sndState.ch1.len) {
    //printf("len: %d\r\n", sndState.ch1.len);
    sndState.ch1.len--;
  }
  if(sndState.ch1.len == 0) {
    sndState.ch1.lenCntEnable = 0;
    sndState.ch1.enable = 0;
  }
}

u8 ch1_env_subcount = 0;

void ch1_env_update() {
  u8 nr12 = globalMemState.ioRegs[IO_NR12];
  ch1_env_subcount++;
  u8 dir = (nr12 & 8) != 0;
  u8 sbc = nr12 & 7;
  if(ch1_env_subcount > sbc) {
    ch1_env_subcount = 0;
    if(dir && sndState.ch1.vol < 15) {
      sndState.ch1.vol++;
    } else if(sndState.ch1.vol > 0){
      sndState.ch1.vol--;
    }
  }
}

// sound interrupt:
void snd() {
  u8 sndOut = 0;
  sndCount++;
  if(!sndState.masterEnable) return;

  if(sndState.ch1.enable && sndState.ch1.vol) {
    c1c++;
    u32 progress = sndCount % sndState.ch1.per;
    if(progress > sndState.ch1.per / 2) {
      sndOut += sndState.ch1.vol;
    }
    if(c1c > sndState.ch1.per) c1c = 0;
  }

  if(sndState.ch2.enable && sndState.ch2.vol) {
    c2c++;
    u32 progress = sndCount % sndState.ch2.per;
    if(progress > sndState.ch2.per / 2) {
      sndOut += sndState.ch2.vol;
    }
    if(c2c > sndState.ch2.per) c2c = 0;
  }

  if(sndState.ch3.enable && sndState.ch3.vol) {
    c3c++;
    u32 progress = sndCount % sndState.ch3.per;
    if(progress > sndState.ch3.per / 2) {
      sndOut += sndState.ch3.vol;
    }
    if(c3c > sndState.ch3.per) c3c = 0;
  }

  if(sndState.ch4.enable && sndState.ch4.vol) {
    sndOut += sndState.ch4.vol * (rand() & 1);
  }

  DAC1->DHR12R1 = (sndOut << 6);
}
Part 10 - A better video routine
The H7 has a very fancy DMA system which Bayley realized might help with the video code. The idea is that you set up a timer to run at 8 MHz (this gives us 512 pixels per line) to clock the DMA output. The DMA can then be configured to output an entire horizontal line independently from the CPU, then trigger an interrupt at the end. This interrupt would then reload the DMA for the next line. Because the line-end interrupt happens at 15 kHz, we can also use it to compute the sound DAC output voltage and get reasonable quality sound. Getting the DMA up and running took most of a day, but the results were very good:
In this video, I am still running in the reduced performance debug mode, but you can see the "lag" counter which displays how many frames behind (or ahead, if it's negative) of real time we are.
Here is the DMA NTSC code:
// DMA interrupt handler:
// - this is called when the transfer is completed.
// it reloads the DMA for the next line (wrapping around at the end)
// and also calls the sound interrupt.
// This should run at ~15.625 kHz for the video/sound timing!
extern "C" void DMA1_Stream0_IRQHandler() {
  DMA1->LIFCR |= (1 << 5); // clear the transfer interrupt flag
  // there is some sort of DMA error generated on each transfer
  // it needs to be cleared before we can re-enable
  DMA1->LIFCR = 0xffffffff; // clear other flags which are confusing and bad
  //toggle_led(); // toggle led
  frameCountNTSC++; // increment number of lines drawn counter
  DMA1_Stream0->CR &= ~1; // disable the stream (EN bit) before reloading it
  // start the next transfer
  DMA1_Stream0->NDTR = H_BUFF_SIZE; // reset size to 1 line
  DMA1_Stream0->M0AR = (uint32_t)(ntscDmaBuffer + H_BUFF_SIZE * line); // set line
  DMA1_Stream0->CR |= 1; // enable
  line++; // increment line
  if(line >= V_BUFF_SIZE) line = 0; // wrap line
  snd(); // run sound
}

// sync pulse definition
u8 getSyncVal(u32 pos) {
  if(pos < 22) return 0;
  return DAC_SYNC_LEVEL;
}

// convert screen coordinates to DMA buffer index
u32 xyToDmaPx(u32 x, u32 y) {
  return X_OFF + x + (Y_OFF + y) * H_BUFF_SIZE;
}

// initialize the frame buffer with the values needed for sync pattern
void initFrameBuffer(u8* buff) {
  printf("[DMA NTSC] Reset Frame Buffer\r\n");
  memset(buff, 0, H_BUFF_SIZE * V_BUFF_SIZE);

  for(int l = 0; l < V_PORCH_LEN; l++) {
    for(int rpx = 0; rpx < H_PORCH_LEN; rpx++) {
      buff[l * H_BUFF_SIZE + rpx] = getSyncVal(rpx) << 4;
    }
    for(int rpx = H_PORCH_LEN; rpx < H_BUFF_SIZE; rpx++) {
      buff[l * H_BUFF_SIZE + rpx] = getSyncVal(rpx) << 4; // todo check if this goes low?
    }
  }

  // seg 1: the image
  for(int l = V_PORCH_LEN; l < V_BUFF_SIZE - 1; l++) {
    if(l < 243) { // visible lines
      for(int rpx = 0; rpx < H_PORCH_LEN; rpx++) {
        buff[l * H_BUFF_SIZE + rpx] = getSyncVal(rpx) << 4;
      }
      for(int rpx = H_PORCH_LEN; rpx < H_BUFF_SIZE; rpx++) {
        buff[l * H_BUFF_SIZE + rpx] = (getSyncVal(rpx) << 4) + 0*((((rpx + l)/2)%4) ? (6 << 4) : 0);
      }
    } else {
      for(int rpx = 0; rpx < H_PORCH_LEN; rpx++) {
        buff[l * H_BUFF_SIZE + rpx] = getSyncVal(rpx) << 4;
      }
      for(int rpx = H_PORCH_LEN; rpx < H_BUFF_SIZE; rpx++) {
        buff[l * H_BUFF_SIZE + rpx] = (getSyncVal(rpx) << 4) + 0*((((rpx + l)/2)%4) ? (6 << 4) : 0);
      }
    }
  }

  int l = V_BUFF_SIZE - 1;
  for(int rpx = 4; rpx < 16; rpx++) {
    buff[l * H_BUFF_SIZE + rpx] = (DAC_SYNC_LEVEL << 4);
  }

  // draw gameboy rectangle to put something on the screen if the emulator crashes
  int gbCenterX = 160/2;
  int gbCenterY = 144/2;
  for(int i = 0; i < 160; i++) {
    for(int j = 0; j < 144; j++) {
      buff[xyToDmaPx(i, j)] = (6 << 4);
    }
  }
}

void initDMA_NTSC() {
  printf("[DMA NTSC]\r\n");
  ntscDmaBuffer = (u8*)badalloc_check(H_BUFF_SIZE * V_BUFF_SIZE, "ntsc-dma-buffer");
  initFrameBuffer(ntscDmaBuffer);

  // disable timer 3
  TIM3->CR1 &= ~TIM_CR1_CEN;
  // enabled clocks for timer and DMA
  RCC->APB1LENR |= RCC_APB1LENR_TIM3EN;
  RCC->AHB1ENR |= RCC_AHB1ENR_DMA1EN;
  // set timer prescaler to 0
  TIM3->PSC = 0;
  // set timer (pixel clock)
  TIM3->ARR = 23; // ~8 MHz
  // load timer settings
  TIM3->EGR |= TIM_EGR_UG;
  // enable DMA requests on timer update events
  TIM3->DIER |= TIM_DIER_UDE;
  // load timer settings
  TIM3->EGR |= TIM_EGR_UG;
  // enable the timer
  TIM3->CR1 |= TIM_CR1_CEN;

  // a simple 8-step procedure to enable DMA
  // step 1: disable the dma stream
  DMA1_Stream0->CR &= ~1;
  // step 2: set PAR (destination in the mem->peripheral mode)
  DMA1_Stream0->PAR = (uint32_t)(&(GPIOD->ODR));
  // step 3: set M0AR (source in the mem->peripheral mode)
  DMA1_Stream0->M0AR = (uint32_t)ntscDmaBuffer;
  // step 4: set the number of "data items" to be transferred. For us, this is bytes
  DMA1_Stream0->NDTR = H_BUFF_SIZE;
  // step 5: use DMAMUX1 to connect timer 3 to the dma
  DMAMUX1_Channel0->CCR |= 27;
  // step 6: flow control
  // do nothing
  // step 7: set the priority (set to very high because why not)
  DMA1_Stream0->CR |= (1 << 17);
  DMA1_Stream0->CR |= (1 << 16);
  // step 8: config fifo
  // do nothing
  // step 8.5: enable the interrupt for the dma (not sure this is really supposed to go here, but it works)
  NVIC_EnableIRQ(DMA1_Stream0_IRQn);
  // step 9: set up transfer mode, increment, and friends
  DMA1_Stream0->CR |= (1 << 4); // transfer complete interrupt enable
  DMA1_Stream0->CR |= (1 << 6); // mem -> peripheral transfer
  DMA1_Stream0->CR |= (1 << 10); // minc (increment the memory address, 1 byte with defaults)
  // step 9.5: enable the DMA
  DMA1_Stream0->CR |= 1;
  // step 9.75: enable the DMAMUX request generator
  DMAMUX1_RequestGenerator0->RGCR |= (1 << 16);
  // step 9.875: inform user that the DMA configuration has not killed the cpu
  printf("[DMA NTSC] The CPU is still running!\r\n");
}
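The timing figures quoted in this part can be checked with a little arithmetic. The sketch below relates the timer reload value to the pixel clock, and the pixel clock to the line rate. The timer input clock is passed in as a parameter because the post does not state it; the 192 MHz mentioned in the comment is just a value that makes ARR = 23 come out at exactly 8 MHz, not a measured figure.

```cpp
#include <cstdint>

// A timer with prescaler 0 overflows once every (ARR + 1) input clock ticks.
constexpr double pixelClockHz(double timerClockHz, uint32_t arr) {
  return timerClockHz / (arr + 1);
}

// One DMA transfer outputs one line, so the line rate is the pixel clock
// divided by the number of samples per line.
constexpr double lineRateHz(double pixelHz, uint32_t samplesPerLine) {
  return pixelHz / samplesPerLine;
}

// Duration of one line in microseconds.
constexpr double lineTimeUs(double pixelHz, uint32_t samplesPerLine) {
  return 1e6 / lineRateHz(pixelHz, samplesPerLine);
}

// With an 8 MHz pixel clock and 512 samples per line, the line rate is
// 15625 Hz and each line lasts 64 us. A hypothetical 192 MHz timer clock
// with ARR = 23 would give exactly 8 MHz.
```

This is also where the 15 kHz sound sampling rate comes from: the sound routine runs once per line.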
Summary
In total, the project is 6,023 lines long, of which 1,515 are blank/comments and 4,538 are actual code. The largest files are:
- gb_cpu: 2,445 lines
- gb_mem: 678 lines
- gb_sound: 369 lines
- gb_video: 309 lines