Rendering logic

On NeoGeo hardware, the GPU (Graphics Processing Unit), a.k.a. VDP, may refer to a single chip or to a group of different chips used to generate the video signal:


 * LSPC-A0, PRO-B0 (early)
 * LSPC2-A2, NEO-B1 (most common)
 * NEO-GRC, NEO-OFC (CD systems)
 * NEO-GRZ (CDZ, MV-1C ?)

See graphics pipeline for an overview of the interconnections between chips and cartridges.

Graphics fetch
Tile pixel lines are rendered in halves.

Fix tiles
32mclk = 8 pixels, 2 pixels per read.


 * S1 ROM address is ...1**** (PCK2 pulse)
 * 2H1 is 0 for 2 pixels (columns 0 & 1)
 * Then 1 for 2 pixels (columns 2 & 3)
 * S1 ROM address is ...0**** (PCK2 pulse)
 * 2H1 is 0 for 2 pixels (columns 4 & 5)
 * Then 1 for 2 pixels (columns 6 & 7)
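The fetch order above can be sketched as a small lookup. This is only an illustration of the list: the function name `fix_fetch_state` and the idea of returning the half-select address bit together with 2H1 are assumptions made for the example, not signals taken from a schematic.

```python
def fix_fetch_state(column):
    """Return (s1_half_bit, h2_1) for a fix tile pixel column (0..7).

    s1_half_bit is the S1 ROM address bit written as 1 or 0 in the list,
    h2_1 is the level of 2H1 while that column is fetched.
    """
    if not 0 <= column <= 7:
        raise ValueError("fix tiles are 8 pixels wide")
    s1_half_bit = 1 if column < 4 else 0  # first half is read with the bit set
    h2_1 = (column >> 1) & 1              # 0 for the first pair, 1 for the second
    return s1_half_bit, h2_1


# 32 mclk for 8 pixels gives the per-pixel cost
MCLK_PER_FIX_PIXEL = 32 // 8

for col in range(8):
    print(col, fix_fetch_state(col))
```
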

Sprite tiles
16mclk = 16 pixels, 8 pixels per read.


 * C ROM address is ...1***** (PCK1 pulse) (columns 0~7)
 * C ROM address is ...0***** (PCK1 pulse) (columns 8~15)

CA4 is used to select upper/lower tiles (see Sprite graphics format).

Video generation
See Display timing for the timing of the sync signals.

NEO-B1 is used for double-buffering scanlines. While one buffer is output to the screen, the other one is filled. They are swapped on each new scanline. Each of the two line buffers is actually made of 2 buffers, one for even pixels and one for odd pixels. They will be named (1 & 2) and (3 & 4).

CSK signals
CSK1~4 signals are used to clock each buffer (on rising edge ?).

During rendering, the pulses usually come in pairs (1+2 and 3+4) to render pixels 2 by 2. Horizontal scaling causes CSK pulses to be skipped.

During output, the pulses are synchronized and alternate (1+2 then 3+4 then 1+2...) to output even/odd pixels in sequence.


 * If the corresponding LD* signal is high, then the buffer pointer is incremented (rendering is always done left to right ?).
 * If the corresponding LD* signal is low, then the buffer pointer is loaded from the P bus (X position of sprite, or 0 to start line output).

Inactive during H-blank.
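A minimal sketch of the pointer behaviour described above, assuming the pointer is updated once per CSK pulse. The class name `BufferPointer` is hypothetical, and the pointer width and wrap-around are not modelled since they are not documented here.

```python
class BufferPointer:
    """Address pointer of one line buffer, clocked by its CSK signal."""

    def __init__(self):
        self.value = 0

    def csk_pulse(self, ld, p_bus=0):
        """Apply one CSK pulse.

        ld high: the pointer is incremented.
        ld low:  the pointer is loaded from the P bus.
        """
        if ld:
            self.value += 1
        else:
            self.value = p_bus
        return self.value


ptr = BufferPointer()
ptr.csk_pulse(ld=0, p_bus=0x08)   # load the sprite's position
for _ in range(3):
    ptr.csk_pulse(ld=1)           # step through the following pixels
print(hex(ptr.value))
```
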

LD signals
The LD1~2 signals set the X position in B1. The value is the X position divided by 2, since it is loaded into both the even and the odd buffers.

Example P bus data for 20 sprites right next to each other (X+=16px):

0000,0808,1010,1838,2000,2808,3010,3838,40C0,48E8,50F0,58F8,60C0,68E8,70F0,78F8,8000,8808,9010,9838,0,0,0...

The lower byte might just be garbage ignored by B1.
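The example data can be checked with a few lines, assuming the upper byte of each P bus word is the X position divided by 2 and the lower byte is ignored, as suggested above. The helper name `x_positions` is made up for this sketch.

```python
P_BUS = ("0000,0808,1010,1838,2000,2808,3010,3838,40C0,48E8,"
         "50F0,58F8,60C0,68E8,70F0,78F8,8000,8808,9010,9838")


def x_positions(p_bus_words):
    """Recover pixel X positions from P bus words (upper byte = X / 2)."""
    return [((word >> 8) & 0xFF) * 2 for word in p_bus_words]


words = [int(w, 16) for w in P_BUS.split(",")]
xs = x_positions(words)
print(xs)
```

With this reading, the 20 words decode to positions 16 pixels apart, which matches sprites placed right next to each other.
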

WSE signals
WSE1~4 signals are used to indicate if the pixel color index on GAD/GBD needs to be written to the buffer.

During rendering, the pulses are synchronized to CSK signals.


 * If the pixel is transparent, there is a CSK pulse but no WSE pulse.
 * If the pixel is opaque, there are both pulses at the same time.
 * If the pixel is skipped for horizontal reduction, there are no pulses at all.

During output, the pulses are also synchronized to CSK signals and always present. This is used to clear the buffer to the backdrop color for the next rendering cycle.

The buffer's /WE signals seem to be (CSK & WSE).
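The three rendering cases and the supposed write enable can be summed up as follows. This is a sketch of the observations above, not of the actual gate-level logic; both function names are hypothetical.

```python
def classify_pixel(csk, wse):
    """Name the rendering case for one pixel slot from its CSK/WSE pulses."""
    if csk and wse:
        return "opaque"        # pointer advances, pixel is written
    if csk:
        return "transparent"   # pointer advances, nothing is written
    if not wse:
        return "skipped"       # horizontal shrinking: no pulse at all
    return "undocumented"      # WSE without CSK is not described


def buffer_write_enabled(csk, wse):
    """Write enable as guessed above: active when CSK and WSE coincide."""
    return bool(csk and wse)


for csk, wse in [(1, 1), (1, 0), (0, 0)]:
    print(csk, wse, classify_pixel(csk, wse), buffer_write_enabled(csk, wse))
```
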

SS signals
The complementary pair of SS* signals from LSPC tell B1 how the buffers are used:


 * SS1 low & SS2 high: Buffers 1&2 are written to. Buffers 3&4 are output to the TV.
 * SS1 high & SS2 low: Buffers 1&2 are output to the TV. Buffers 3&4 are written to.
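The two cases can be written as a lookup. `buffer_roles` is a hypothetical helper, and rejecting equal levels is an assumption based on the pair being complementary.

```python
def buffer_roles(ss1, ss2):
    """Return which buffer pair is rendered to and which one is displayed."""
    if bool(ss1) == bool(ss2):
        raise ValueError("SS1 and SS2 are expected to be complementary")
    if not ss1:
        return {"written": (1, 2), "output": (3, 4)}
    return {"written": (3, 4), "output": (1, 2)}


print(buffer_roles(0, 1))
print(buffer_roles(1, 0))
```
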

Parsing
SCB3 (Y) is read (from fast VRAM $8200 to $8380). Each time there is a scanline match, a write state to the active sprite list is inserted (apparently 2 states after the corresponding SCB3 read).

Once SCB3 address $8380 is reached, only $0000 writes to the sprite list are done (in order to fill the rest of the sprite list with zeros).

No matter how many sprites are matched in the scanline, there will always be 384 SCB3 read states and 96 sprite list write states.

This explains why sprite #0 cannot be used: 0 is the value used to pad the sprite list. If the list isn't filled (which is the case most of the time), sprite #0 is rendered over and over again until the end of the list is reached.
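A simplified model of the parsing pass, to show where the padding comes from. The scanline test is passed in as a hypothetical `matches_scanline` callback because the SCB3 comparison itself is not detailed here, and dropping matches beyond the 96th is an assumption.

```python
SCB3_START = 0x8200
SCB3_END = 0x8380
SCB3_READS = SCB3_END - SCB3_START   # 384 read states
LIST_WRITES = 96                     # sprite list write states


def build_sprite_list(matches_scanline):
    """Return the 96-entry active sprite list for one scanline."""
    active = []
    for sprite in range(SCB3_READS):
        # Assumption: matches found once the list is full are dropped
        if len(active) < LIST_WRITES and matches_scanline(sprite):
            active.append(sprite)
    # The remaining write states store $0000, i.e. sprite #0
    active += [0] * (LIST_WRITES - len(active))
    return active


sprite_list = build_sprite_list(lambda n: n in (5, 6, 20))
print(sprite_list[:6])
```

Every unused entry reads back as sprite #0, which is why that sprite number cannot carry useful graphics.
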

Rendering
The state order is:
 * 1: Read sprite list ($8600+/$8680+) to get sprite #
 * 2: Read SCB2 zoom values ($8000+)
 * 3: Read SCB3 Y pos/size/sticky ($8200+), Y pos ignored in this state ?
 * 4: Read SCB4 X pos ($8400+)
 * 5: Read/write from CPU if needed
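The VRAM addresses of states 1 to 4 can be sketched from the base addresses given in the list. `render_state_addresses` is a hypothetical helper, and indexing the two sprite lists as 0 and 1 is an arbitrary choice made for this example.

```python
SCB2_BASE = 0x8000
SCB3_BASE = 0x8200
SCB4_BASE = 0x8400
SPRITE_LISTS = (0x8600, 0x8680)


def render_state_addresses(list_index, slot, sprite):
    """Fast VRAM addresses read in states 1..4 for one sprite list entry.

    list_index: which of the two sprite lists is being read (0 or 1)
    slot:       position in that list
    sprite:     sprite number found at that position
    """
    return [
        SPRITE_LISTS[list_index] + slot,  # state 1: sprite list
        SCB2_BASE + sprite,               # state 2: zoom values
        SCB3_BASE + sprite,               # state 3: Y pos/size/sticky
        SCB4_BASE + sprite,               # state 4: X pos
    ]


print([hex(a) for a in render_state_addresses(1, 4, 5)])
```

The result for list entry 4 holding sprite 5 is consistent with the 684, 005, 205, 405 address sequence visible in the timing diagram further down.
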

CPU access to fast VRAM
SNK specifies a minimum of 12 68kclk between writes (so 24mclk). 1 write every 24mclk = 64 writes per scanline.
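The arithmetic behind these figures, using the 384px * 4clk/px line length given further down and 2 mclk per 68kclk. The function name is made up for the example.

```python
MCLK_PER_LINE = 384 * 4   # 1536 mclk per scanline
MCLK_PER_68KCLK = 2       # 12 68kclk = 24 mclk


def writes_per_scanline(gap_68kclk):
    """Number of CPU writes that fit in one scanline for a given spacing."""
    return MCLK_PER_LINE // (gap_68kclk * MCLK_PER_68KCLK)


print(writes_per_scanline(12))  # spacing recommended by SNK
print(writes_per_scanline(8))   # spacing of the 8 68kclk alternative
```
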

Why 12 and not 8 ?

CPU access during state #5 occurs asynchronously to the 68000 bus, so the address and data have to be stored in LSPC until then.

My guess on the HW implementation is that during state #5, the GPU always reads the memory content pointed by REG_VRAMADDR and updates the read latch of REG_VRAMRW.

If write latch REG_VRAMRW is written by the 68000, the next state #5 becomes a write access and REG_VRAMADDR is incremented by REG_VRAMMOD value.
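The guess above can be expressed as a small behavioural model. This only illustrates the hypothesis and is not a verified description of LSPC: the class `VramPort`, its method names, the 64K-word memory array and the 16-bit address wrap are all assumptions.

```python
class VramPort:
    """Behavioural sketch of the guessed state #5 CPU access."""

    def __init__(self, size=0x10000):
        self.vram = [0] * size
        self.addr = 0            # REG_VRAMADDR
        self.mod = 0             # REG_VRAMMOD
        self.read_latch = 0      # value returned when REG_VRAMRW is read
        self.write_latch = None  # pending value written to REG_VRAMRW

    def cpu_write_vramrw(self, value):
        """68000 side: store the data until the next state #5."""
        self.write_latch = value & 0xFFFF

    def state5(self):
        """GPU side: one CPU access slot."""
        if self.write_latch is not None:
            # Pending write: commit it, then step the address
            self.vram[self.addr] = self.write_latch
            self.write_latch = None
            self.addr = (self.addr + self.mod) & 0xFFFF
        else:
            # No pending write: refresh the read latch
            self.read_latch = self.vram[self.addr]


port = VramPort()
port.addr, port.mod = 0x8200, 1
port.cpu_write_vramrw(0x1234)
port.state5()
print(hex(port.vram[0x8200]), hex(port.addr))
```

In this model a second write issued before the next state #5 would overwrite the pending one, which is one way to picture why a minimum spacing between writes is required.
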

Even if a theoretical limit of 16mclk (8 68kclk) is possible, some additional cycles are needed for the address and data to propagate through the chip.

Or maybe SNK has given the worst case scenario between slow Low VRAM and fast High VRAM ?

Timing diagram when the sprite list is being filled:

Parse       ################################                                ################################
Render                                      ##########################                                      ##########################
24M   |'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr  | 600 |  20F  | 210 | 211 | 600 | 601 |  684  | 005 | 205 | 405 | 600 |  212  | 213 | 602 | 603 | 214 |  685  | 006 | 206 | 406
PCK1  ______||___________________________________________________________||_____________________________________________________
PCK1B |___||___|
LOAD  ||_______________________||_______________________||_______________________||_______________________
12M   __||___||___||___||___||___||___||___||___||___||___||___||___||___||___||___||_
2Pixel      |       |       |       |       |       |       |       |       |       |       |       |       |       |       |       |
/WE   '|___|'|___||___|'|___|'
Read      ? !    !     !                   !     !     !     !     ?       !     !                 !       !     !     !     !


 * R/W sequences: (2 write buffers ?)
 * 600 RRRWW... 600 RRWWR...
 * 600 WWRRW... 600 WRRWW... 600 RRWWR ... 600 RWWRR
 * Even lines: Write to list A, Read from list B (Start of display)
 * Odd lines: Write to list B, Read from list A
 * 384px * 4clk/px = 1536clk/line


 * Available CPU R/W slots depending on parsing progress, safest is ? cycles

CPU access to slow VRAM

 * Slow VRAM is 100ns
 * 4 slots per render cycle, 1 slot for CPU R/W (1 each 16 68k cycles)