Rendering logic: Difference between revisions

From NeoGeo Development Wiki
Jump to navigation Jump to search
mNo edit summary
 
(17 intermediate revisions by 3 users not shown)
Line 1: Line 1:
On the NeoGeo hardware, the GPU (Graphics Processing Unit) a.k.a. VDP, may refer to a chip or a group of different chips used to generate the video signal.
Depending on the chipset, video is generated by 3, 2 or one unique chip:


* [[LSPC-A0]], [[PRO-B0]] (early)
* 3: [[LSPC-A0]], [[PRO-B0]], [[PRO-C0]] (early)
* [[LSPC2-A2]], [[NEO-B1]] (most common)
* 2: [[LSPC2-A2]], [[NEO-B1]] (most common)
* [[NEO-GRC]], [[NEO-OFC]] (CD systems)
* 1: [[NEO-GRC]] (CD systems), [[NEO-GRZ]] (CDZ, MV-1C...)
* [[NEO-GRZ]] (CDZ, MV-1C ?)


See [[graphics pipeline]] for an overview of the interconnections between chips and cartridges.
See [[graphics pipeline]] for an overview of the interconnections between chips and cartridges. See [[Display timing]] for the sync signal's timing.


==Temporary notes==
There are two main parts in generating video:


*PCK2 rises with BNKB and CHBL
* An address generator (LSPC), which queries the graphics ROMs in the cartridges according to the data set in [[VRAM]].
*The first valid rendering cycle is 32mclk after CHBL low ?
* Line buffers, to which pixels can be written in any order from the graphics ROMs data.
*Fix and sprite pixels are rendered at the same speed because sprite pixels are written by pairs
*Tile pixel lines are rendered in halves:


*For the fix (32mclk = 8 pixels corresponds to 6MHz pixel clock):
=Line buffers=
**Full address is ...1**** (PCK2 pulse)
**2H1 is 0 for 2 pixels (columns 0 & 1), then 1 for 2 pixels (columns 2 & 3)
**Full address is ...0**** (PCK2 pulse)
**2H1 is 0 for 2 pixels (columns 4 & 5), then 1 for 2 pixels (columns 6 & 7)


*For sprites (32mclk = 16 pixels):
To render sprites, the NeoGeo uses a pair of line buffers which are each 320 pixels long (a whole scanline). When one is used for rendering, the other one is shifted out for video output. Each new scanline, the buffers are flipped. This can be seen as a kind of double-buffering, allowing pixels to be rendered in any order.
**Full address is ...1***** (PCK1 pulse)
**CA4 is 0 for 4 pixels (columns 0~3), then 1 for 4 pixels (columns 4~7)
**Full address is ...0***** (PCK1 pulse)
**CA4 is 0 for 4 pixels (columns 8~11), then 1 for 4 pixels (columns 12~15)


*As fix is rendered in realtime, the fix tile address is set before sprites (on a new line PCK2 pulses before PCK1)
To increase bandwidth, pixels are rendered two by two in sub-pairs: there are actually 4, 160-pixels-long buffers interleaved in an odd/even fashion. This scheme was inherited from the [[Alpha68k]].
*X position to B1, just before each PCK2 pulse (SP during 1mclk), for 20 sprites next to each other (X+16px each time):
** Start of line: 0000,0808,1010,1838,2000,2808,3010,3838,40C0,48E8,50F0,58F8,60C0,68E8,70F0,78F8,8000,8808,9010,9838,0,0,0...


=Video generation=
The fix layer pixels are rendered in real time over the buffers output.


See [[Display timing]] for the sync signal's timing.
==Fix tiles==


[[NEO-B1]] is used for double-buffering scanlines. While a buffer is output to the screen, the other one is filled up. They're swapped each new scanline. Each of the two line buffers are actually 2 buffers of even/odd pixels. They will be named (1 & 2), and (3 & 4).
* Fix pixels are output in time with the pixel clock (6MHz, 4mclk).
* One fix tile line is therefore output in: 8 pixels * 4mclk = 32mclk. No variation.
* The S ROM outputs 8 bits at a time so: 8 bits / 4bpp = 2 pixels at a time.
* S ROM reads needed for one fix tile line: 8 pixels / 2 pixels per read = 4 reads.


*The TMS0 signal from LSPC tells B1 how the pair of buffers are used:
Address sequence for one tile line:
**0: Buffers 1&2 are output to the TV. Buffers 3&4 are written to.
{|class="wikitable"
**1: Buffers 1&2 are written to. Buffers 3&4 are output to the TV.
! A4 !! 2H1 !! pixel pair
|-
| 1 || 0 || A
|-
| 1 || 1 || B
|-
| 0 || 0 || C
|-
| 0 || 1 || D
|}


*CSK1~4 signals are used to step to the next pixel (rising edge ?), periodic for video output, VRAM-dependent when filling up. Inactive during H-blank.
2H1 bypasses the PCK* latchs.
*WSE1~4 signals are used to indicate if the pixel color from GAD/GBD needs to be written to the buffer (falling edge ?), matches CSK for video output (ignored ?), depends on DOTA/DOTB (opaque pixel signal) when filling up.
*SS1~2 signals are used to reset the pixel pointers on falling edge ? (probably wrong)


*The rising edge of PCK2 latches the X position of the sprite (and something else in a byte ?)
==Sprite tiles==
*1H1 (6MHz / 2 pixels per byte = 3MHz) is used to clock in the pixels of FIXD into the pixel buffers (or directly to the output ?) if they're not 0000.
16mclk = 16 pixels, 8 pixels per read.
 
* Sprite pixels are rendered two-by-two at 12MHz (2mclk).
* One sprite tile line is therefore rendered in: 16 pixels / 2 * 2mclk = 16mclk. No variation, even if shrinking is used.
* The C ROM outputs 2 * 16 = 32 bits at a time so: 32 bits / 4bpp = 8 pixels at a time.
* C ROM reads needed for one sprite tile line: 16 pixels / 8 pixels per read = 2 reads.
 
Address sequence for one tile line:
{|class="wikitable"
! CA4 !! 8-pixel line
|-
| 1 || A
|-
| 0 || B
|}
 
CA4 bypasses the PCK* latchs.
 
=Active lists=
 
The NeoGeo uses a pair of active lists, where the sprites numbers which need to be rendered on the next scanline are written to. As with the line buffers, the active lists are swapped every new scanline so that one is being filled by parsing, the other one is used for rendering.
 
They are located in the fast VRAM at addresses $8600 and $8680. Each list is 96-entries long.
 
=Slow VRAM access slots=
 
Slow VRAM has four 4mclk-long access slots running in sequence with no variations:
 
# Read sprite map even word
# Read sprite map odd word
# Read fix map
# Read/Write for CPU
 
=Fast VRAM access slots=
 
Fast VRAM is more complex and faster. It has 10 access slots with varying widths running in sequence with no variations, which can be seen as 5 parsing slots and 5 rendering slots:
 
{|class="wikitable"
! Slot # !! Duration !! Description
|-
|1
|2mclk
|rowspan=5|Parsing
|-
|2
|1.5mclk
|-
|3
|1.5mclk
|-
|4
|1.5mclk
|-
|5
|1.5mclk
|-
|6
|2mclk
|Read active list
|-
|7
|1.5mclk
|Read SCB2
|-
|8
|1.5mclk
|Read SCB3
|-
|9
|1.5mclk
|Read SCB4
|-
|10
|1.5mclk
|Read/write for CPU
|}
 
Yellow are parsing cycles, purple is the active list read, green is SCB* reads for rendering, red is for CPU access:
 
[[file:timing_gpu1.png]]
 
The parsing cycles aren't consistent, they depend on the matching of sprites. One cycle will read from SCB3 to test if its Y position matches with the current raster line. If there's a match, the next cycle will be a write to the active list. Otherwise it's another read cycle.
 
Fast VRAM must be fast enough (45ns) as the shortest slots are 1.5mclk (62.5ns). 1mclk (41.6ns) would be too fast and SRAM was already expensive.


=Sprite parsing=
=Sprite parsing=


<span style="color:#FF0000">This is a draft. The following information shouldn't be considered as exact.</span>
<span style="color:#FF0000">This is still a draft. The following information shouldn't be considered as correct.</span>
 
LSPC splits the workload needed to render sprites in two passes: parsing and rendering.
 
* Parsing for a raster line N is done during line N-2
* Rendering is done during line N-1
* Finally the line is ready for output just at the right time.
 
During parsing, the Y positions of 381 sprites are read to see if they will be visible on line N. If that's the case, the sprite number is written to the active list currently being filled. This goes on until 381 sprites were parsed, OR the active list is full (96 sprite numbers were written), whichever comes first.
 
* If sprite #382 is reached, the remaining time is used to fill the active list up to 96 entries with zeros.
* If the active list is full, sprites are still parsed up to #382 but no writes are done to the active list, whatever the matching result.
 
No matter how many sprites are matched in the scanline, there will always be 381 SCB3 reads and 96 active list writes.


*LSPC runs at 24MHz
This explains why sprite #0 cannot be used: this is the value used to top-up the active list. If there are less than 96 sprite matches (like most of the time), the sprite #0 will be rendered over and over again until the end of the list is reached.
*Fast VRAM is 35ns
 
*The reads always occur 1mclk (41.6ns) after address is set
In the next paragraphs, each character represents a parsing slot: R is an SCB3 read, W is a sprite number write to the active list, F is a filling write to the active list, - is just idle waiting. There are always 1536mclk per line / 16mclk per cycle * 5 slots per cycle = 480 slots.
[[file:timing_gpu1.png]]
 
==Case 1: Not a single sprite match==
 
Fast VRAM cycles:
<pre>
24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr    200  | 201 | 202 | 203 | 204 |  681  | 00E | 20E | 40E | 600 |  205  | 206 | 207 | 208 | 209 |  682  | 00F | 20F | 40F
R/W    Read  Read  Read  Read  Read                                  Read  Read  Read  Read  Read
</pre>
 
381 read slots, 96 fill slots, 3 waiting slots:
<pre>
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---
</pre>
 
==Case 2: Some sprites match==


*FIXT: P23~16 are 0, P15~0 are S ROM address (+ external 2H1)
Fast VRAM cycles:
*SPRT: P23~0 are C ROM address (+ external CA4)
<pre>
*LO: P23~16 are [[LO]] ROM data, P15~0 are LO address
24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
*FP: P19~16 is the fix tile palette, rest is 0
Addr    200  | 201 | 600 | 202 | 203 |  681  | 00E | 20E | 40E | 600 |  601  | 204 | 205 | 602 | 206 |  682  | 00F | 20F | 40F
*SP: P23~16 is the sprite tile palette, P15~8 is X position, P7~0 is ?
R/W    Read  Read  Write Read  Read                                  Write  Read  Read  Write Read
</pre>


*LSPC always starts filling up active sprite list A ($8600) each new frame
If 17 sprites match: 381 read slots, 17 write slots, 96-17=79 fill slots, 3 waiting slots:
*Read sequence (100p capacitor delay on AES too on PCKxB ?):
*Timing diagram when the sprite list for the actual line is already filled (no writes):
<pre>
<pre>
24M    |'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
RRRRRRWRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRWRRRRRRWRRRRRRWRRRRRWRWRWRRRRRRRRRWRRRRRRRRRRRRRRWRRRRRRRRRRRRR
Addr  | 600 |  200  | 201 | 202 | 203 | 204 |  681  | 00E | 20E | 40E | 600 |  205  | 206 | 207 | 208 | 209 |  682  | 00F | 20F | 40F
RRRRRRRRRRRRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
PCK1  ______|'''|___________________________________________________________|'''|_____________________________________________________
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRWRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
PCK1B  '''''''|____|''''''''''''''''''''''''''''''''''''''''''''''''''''''''''|___|''''''''''''''''''''''''''''''''''''''''''''''''''''
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFF
LOAD  |'''''''|_______________________|'''''''|_______________________|'''''''|_______________________|'''''''|_______________________
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---
12M    __|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|_
2Pixel      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |
Read      ?      !    !    !    !    !      !    !    !    !    ?      !    !    !    !    !      !    !    !    !
What      1      2      2    2    2    2      3      4    5    6    1      2      2    2    2    2      3      4    5    6...
</pre>
</pre>


*1: ?
==Case 3: Exactly 96 sprites match==
*2: Read SCB3 to see if sprite is in next scanline (just increments), starts frame at sprite 1 ?
 
*3: Read sprite list to get sprite #
Fast VRAM cycles:
*4: Read SCB2 zoom values
<pre>
*5: Read SCB3 Y/size/chain
24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
*6: Read SCB4 X
Addr    200  | 600 | 201 | 601 | 202 |  681  | 00E | 20E | 40E | 600 |  602  | 203 | 204 | 603 | 604 |  682  | 00F | 20F | 40F
R/W    Read   Write Read  Write Read                                  Write  Read  Read Write Read
</pre>


*Timing diagram when the sprite list is being filled:
381 read slots, 96 write slots, 0 fill slots, 3 waiting slots:
<pre>
<pre>
24M    |'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
RRRRRRWRRRRWRRRRRRRWRRRRRWRWRWRRWRRRRRRRRRWRRRRRRWRRWRRRRWRRRRRWRWRWRRRRRWRRRRWRRRRRRRRWRRRRRRWRRRWR
Addr  | 600 |  20F  | 210 | 211 | 600 | 601 |  684  | 005 | 205 | 405 | 600 |  212  | 213 | 602 | 603 | 214 |  685  | 006 | 206 | 406
WRWRRWRRRRRRRWRRRWRRRWRRRRWRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRWRRRWRWRWRWRRWRW
PCK1  ______|'''|___________________________________________________________|'''|_____________________________________________________
RRRRWRRWRRWRRRWRRRWRRWRRRRWRWRWRRWRRRWRWRWRRWRRRRWRRRRRWRRWRRWRRRRRRWRRRWRRRRWRRRRRWRRWRRRWRRRWRRRRR
PCK1B  '''''''|____|''''''''''''''''''''''''''''''''''''''''''''''''''''''''''|___|''''''''''''''''''''''''''''''''''''''''''''''''''''
WRWRWRWRWRWRRRWRRRWRRRRRRRWRRRWRWRRRWRRRRRRRRRWRRRRRRRWRWRWRWRRRRRRRRRRWRRWRRRRRRWRRRRRRRRRRRRRRRRRR
LOAD  |'''''''|_______________________|'''''''|_______________________|'''''''|_______________________|'''''''|_______________________
RRRRRRRRRRRRRRRWRRRWRRRRWRRRWRRRRRRRWRRRRRRRWRRRRWRWRRRWRRWRRRRWRRRRRWRRRRRWR---
12M    __|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|_
2Pixel      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |
/WE    ''''''''''''''''''''''''''|___|'|___|'''''''''''''''''''''''''''''''''''''''''''''''|___|'|___|'''''''''''''''''''''''''''''''''
Read      ?      !    !    !                  !    !    !    !    ?      !    !                !      !    !    !    !
</pre>
</pre>


*R/W sequences: (2 write buffers ?)
==Case 4: More than 96 sprites match==
*600 RRRWW... 600 RRWWR...
 
*600 WWRRW... 600 WRRWW... 600 RRWWR ... 600 RWWRR
Same as case 3, except after 96 "W"s, there are only useless "R"s.
 
==Rendering==
 
# Read active list ($8600+ or $8680+) to get sprite #
# Read SCB2 zoom values ($8000+)
# Read SCB3 Y position, height, and chain bit ($8200+)
# Read SCB4 X position ($8400+)
 
The tile # and its attributes are also read from slow VRAM.
 
==CPU access to VRAM==
 
SNK says min. 12 68kclk between writes (so 24mclk). 1 write every 24mclk = 64 per scanline.
 
CPU access occurs asynchronously with the 68000 bus -> storage in LSPC. If no write is requested, then the slots are occupied by reads, effectively updating one of the two read buffers continuously with the value pointed by the last used VRAM address.
 
=Buffers control=
 
==CK signals==
CK1~4 signals are used to clock each of the 4 buffers.
 
* During rendering, the pulses often go by pair (1+2 or 3+4) to render pixels 2 by 2 if the corresponding WE signal is asserted (opaque pixel). Horizontal shrinking causes pulses to be skipped, so that the buffer's address isn't incremented.
* During output, the pulses are slower and always alternate (1/2/1/2... or 3/4/3/4...) to output even/odd pixels in sequence.
 
* If the corresponding LD* signal is high, the buffer pointer is incremented (rendering left to right).
* If the corresponding LD* signal is low, the buffer pointer is loaded from the [[P bus]] (X position of sprite, or 0 to start line output).
 
Inactive during H-blank.
 
==LD signals==
The LD1~2 signals are synchronous signals used to load the pointers for a buffer pair as two bytes.
 
Example P bus values for 5 full-width sprites right next to each other, starting at X=0:
 
<pre>0000,0808,1010,1818,2020</pre>
 
Example P bus values for 5 full-width sprites right next to each other, starting at X=1 (pixel pairs will be flipped by NEO-ZMC2):
 
<pre>0100,0908,1110,1918,2120</pre>
 
As sprite lines '''always''' take 16mclk to render, there's an LD* pulse every 16mclk to set the new starting address (X position) '''except''' for chained sprites. There's also always an unique pulse just before output to reset the pointers to 0.
 
==WE signals==
WE1~4 signals are used to tell if the pixel should be written to a buffer.
 
During rendering, the pulses are synchronized to CK signals.
 
* If the pixel is opaque, there are both pulses at the same time (write pixel).
* If the pixel is transparent, there is a CK pulse but no WE pulse (skip pixel, move to next one).
* If the pixel is skipped for horizontal shrink, there are no pulses at all (do nothing).
 
During output, the pulses are also synchronized to CK signals and always present. This is used to clear the buffers to the backdrop color for the next rendering cycle.


*Even lines: Write to list A, Read from list B (Start of display)
==SS signals==
*Odd lines: Write to list B, Read from list A
The SS1/2 signals enable clearing of buffer pairs, active during output.
*In 16clk, 2 sprites SCB3 max. are checked to fill up sprite list , and 1 sprite's attributes are read for output
*384px * 4clk/px = 1536clk/line
*1536clk / 16clk = 96 sprites max/line
*This means that there's at least 2 sprite SCB3 checked each 16clk, 4 writes to sprite list can be done max per 16clk ?


*Available CPU R/W slots depending on parsing progress, safest is ? cycles
==Others==


==Slow (lower) VRAM==
* TMS0 is used to flip the buffers, related to the lowest bit of the raster counter.
* The rising edge of PCK1 and PCK2 latches fix or sprite pixels from the cart ROMs.


*Slow VRAM is 100ns (10MHz) and is read at ?
Fix data is read 8 pixels in advance (32mclk, confirms what Charles wrote in mvstech.txt).
*4 slots per render cycle, 1 slot for CPU R/W (1 each 16 68k cycles)


[[Category:Video system]]
[[Category:Video system]]

Latest revision as of 21:44, 5 December 2018

Depending on the chipset, video is generated by 3, 2 or one unique chip:

See graphics pipeline for an overview of the interconnections between chips and cartridges. See Display timing for the sync signal's timing.

There are two main parts in generating video:

  • An address generator (LSPC), which queries the graphics ROMs in the cartridges according to the data set in VRAM.
  • Line buffers, to which pixels can be written in any order from the graphics ROMs data.

Line buffers

To render sprites, the NeoGeo uses a pair of line buffers which are each 320 pixels long (a whole scanline). When one is used for rendering, the other one is shifted out for video output. Each new scanline, the buffers are flipped. This can be seen as a kind of double-buffering, allowing pixels to be rendered in any order.

To increase bandwidth, pixels are rendered two by two in sub-pairs: there are actually 4, 160-pixels-long buffers interleaved in an odd/even fashion. This scheme was inherited from the Alpha68k.

The fix layer pixels are rendered in real time over the buffers output.

Fix tiles

  • Fix pixels are output in time with the pixel clock (6MHz, 4mclk).
  • One fix tile line is therefore output in: 8 pixels * 4mclk = 32mclk. No variation.
  • The S ROM outputs 8 bits at a time so: 8 bits / 4bpp = 2 pixels at a time.
  • S ROM reads needed for one fix tile line: 8 pixels / 2 pixels per read = 4 reads.

Address sequence for one tile line:

A4 2H1 pixel pair
1 0 A
1 1 B
0 0 C
0 1 D

2H1 bypasses the PCK* latchs.

Sprite tiles

16mclk = 16 pixels, 8 pixels per read.

  • Sprite pixels are rendered two-by-two at 12MHz (2mclk).
  • One sprite tile line is therefore rendered in: 16 pixels / 2 * 2mclk = 16mclk. No variation, even if shrinking is used.
  • The C ROM outputs 2 * 16 = 32 bits at a time so: 32 bits / 4bpp = 8 pixels at a time.
  • C ROM reads needed for one sprite tile line: 16 pixels / 8 pixels per read = 2 reads.

Address sequence for one tile line:

CA4 8-pixel line
1 A
0 B

CA4 bypasses the PCK* latchs.

Active lists

The NeoGeo uses a pair of active lists, where the sprites numbers which need to be rendered on the next scanline are written to. As with the line buffers, the active lists are swapped every new scanline so that one is being filled by parsing, the other one is used for rendering.

They are located in the fast VRAM at addresses $8600 and $8680. Each list is 96-entries long.

Slow VRAM access slots

Slow VRAM has four 4mclk-long access slots running in sequence with no variations:

  1. Read sprite map even word
  2. Read sprite map odd word
  3. Read fix map
  4. Read/Write for CPU

Fast VRAM access slots

Fast VRAM is more complex and faster. It has 10 access slots with varying widths running in sequence with no variations, which can be seen as 5 parsing slots and 5 rendering slots:

Slot # Duration Description
1 2mclk Parsing
2 1.5mclk
3 1.5mclk
4 1.5mclk
5 1.5mclk
6 2mclk Read active list
7 1.5mclk Read SCB2
8 1.5mclk Read SCB3
9 1.5mclk Read SCB4
10 1.5mclk Read/write for CPU

Yellow are parsing cycles, purple is the active list read, green is SCB* reads for rendering, red is for CPU access:

The parsing cycles aren't consistent, they depend on the matching of sprites. One cycle will read from SCB3 to test if its Y position matches with the current raster line. If there's a match, the next cycle will be a write to the active list. Otherwise it's another read cycle.

Fast VRAM must be fast enough (45ns) as the shortest slots are 1.5mclk (62.5ns). 1mclk (41.6ns) would be too fast and SRAM was already expensive.

Sprite parsing

This is still a draft. The following information shouldn't be considered as correct.

LSPC splits the workload needed to render sprites in two passes: parsing and rendering.

  • Parsing for a raster line N is done during line N-2
  • Rendering is done during line N-1
  • Finally the line is ready for output just at the right time.

During parsing, the Y positions of 381 sprites are read to see if they will be visible on line N. If that's the case, the sprite number is written to the active list currently being filled. This goes on until 381 sprites were parsed, OR the active list is full (96 sprite numbers were written), whichever comes first.

  • If sprite #382 is reached, the remaining time is used to fill the active list up to 96 entries with zeros.
  • If the active list is full, sprites are still parsed up to #382 but no writes are done to the active list, whatever the matching result.

No matter how many sprites are matched in the scanline, there will always be 381 SCB3 reads and 96 active list writes.

This explains why sprite #0 cannot be used: this is the value used to top-up the active list. If there are less than 96 sprite matches (like most of the time), the sprite #0 will be rendered over and over again until the end of the list is reached.

In the next paragraphs, each character represents a parsing slot: R is an SCB3 read, W is a sprite number write to the active list, F is a filling write to the active list, - is just idle waiting. There are always 1536mclk per line / 16mclk per cycle * 5 slots per cycle = 480 slots.

Case 1: Not a single sprite match

Fast VRAM cycles:

24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr     200  | 201 | 202 | 203 | 204 |  681  | 00E | 20E | 40E | 600 |  205  | 206 | 207 | 208 | 209 |  682  | 00F | 20F | 40F
R/W     Read   Read  Read  Read  Read                                   Read   Read  Read  Read  Read

381 read slots, 96 fill slots, 3 waiting slots:

RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---

Case 2: Some sprites match

Fast VRAM cycles:

24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr     200  | 201 | 600 | 202 | 203 |  681  | 00E | 20E | 40E | 600 |  601  | 204 | 205 | 602 | 206 |  682  | 00F | 20F | 40F
R/W     Read   Read  Write Read  Read                                   Write  Read  Read  Write Read

If 17 sprites match: 381 read slots, 17 write slots, 96-17=79 fill slots, 3 waiting slots:

RRRRRRWRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRWRRRRRRWRRRRRRWRRRRRWRWRWRRRRRRRRRWRRRRRRRRRRRRRRWRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRWRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---

Case 3: Exactly 96 sprites match

Fast VRAM cycles:

24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr     200  | 600 | 201 | 601 | 202 |  681  | 00E | 20E | 40E | 600 |  602  | 203 | 204 | 603 | 604 |  682  | 00F | 20F | 40F
R/W     Read   Write Read  Write Read                                   Write  Read  Read  Write Read

381 read slots, 96 write slots, 0 fill slots, 3 waiting slots:

RRRRRRWRRRRWRRRRRRRWRRRRRWRWRWRRWRRRRRRRRRWRRRRRRWRRWRRRRWRRRRRWRWRWRRRRRWRRRRWRRRRRRRRWRRRRRRWRRRWR
WRWRRWRRRRRRRWRRRWRRRWRRRRWRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRWRRRWRWRWRWRRWRW
RRRRWRRWRRWRRRWRRRWRRWRRRRWRWRWRRWRRRWRWRWRRWRRRRWRRRRRWRRWRRWRRRRRRWRRRWRRRRWRRRRRWRRWRRRWRRRWRRRRR
WRWRWRWRWRWRRRWRRRWRRRRRRRWRRRWRWRRRWRRRRRRRRRWRRRRRRRWRWRWRWRRRRRRRRRRWRRWRRRRRRWRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRWRRRWRRRRWRRRWRRRRRRRWRRRRRRRWRRRRWRWRRRWRRWRRRRWRRRRRWRRRRRWR---

Case 4: More than 96 sprites match

Same as case 3, except after 96 "W"s, there are only useless "R"s.

Rendering

  1. Read active list ($8600+ or $8680+) to get sprite #
  2. Read SCB2 zoom values ($8000+)
  3. Read SCB3 Y position, height, and chain bit ($8200+)
  4. Read SCB4 X position ($8400+)

The tile # and its attributes are also read from slow VRAM.

CPU access to VRAM

SNK says min. 12 68kclk between writes (so 24mclk). 1 write every 24mclk = 64 per scanline.

CPU access occurs asynchronously with the 68000 bus -> storage in LSPC. If no write is requested, then the slots are occupied by reads, effectively updating one of the two read buffers continuously with the value pointed by the last used VRAM address.

Buffers control

CK signals

CK1~4 signals are used to clock each of the 4 buffers.

  • During rendering, the pulses often go by pair (1+2 or 3+4) to render pixels 2 by 2 if the corresponding WE signal is asserted (opaque pixel). Horizontal shrinking causes pulses to be skipped, so that the buffer's address isn't incremented.
  • During output, the pulses are slower and always alternate (1/2/1/2... or 3/4/3/4...) to output even/odd pixels in sequence.
  • If the corresponding LD* signal is high, the buffer pointer is incremented (rendering left to right).
  • If the corresponding LD* signal is low, the buffer pointer is loaded from the P bus (X position of sprite, or 0 to start line output).

Inactive during H-blank.

LD signals

The LD1~2 signals are synchronous signals used to load the pointers for a buffer pair as two bytes.

Example P bus values for 5 full-width sprites right next to each other, starting at X=0:

0000,0808,1010,1818,2020

Example P bus values for 5 full-width sprites right next to each other, starting at X=1 (pixel pairs will be flipped by NEO-ZMC2):

0100,0908,1110,1918,2120

As sprite lines always take 16mclk to render, there's an LD* pulse every 16mclk to set the new starting address (X position) except for chained sprites. There's also always an unique pulse just before output to reset the pointers to 0.

WE signals

WE1~4 signals are used to tell if the pixel should be written to a buffer.

During rendering, the pulses are synchronized to CK signals.

  • If the pixel is opaque, there are both pulses at the same time (write pixel).
  • If the pixel is transparent, there is a CK pulse but no WE pulse (skip pixel, move to next one).
  • If the pixel is skipped for horizontal shrink, there are no pulses at all (do nothing).

During output, the pulses are also synchronized to CK signals and always present. This is used to clear the buffers to the backdrop color for the next rendering cycle.

SS signals

The SS1/2 signals enable clearing of buffer pairs, active during output.

Others

  • TMS0 is used to flip the buffers, related to the lowest bit of the raster counter.
  • The rising edge of PCK1 and PCK2 latches fix or sprite pixels from the cart ROMs.

Fix data is read 8 pixels in advance (32mclk, confirms what Charles wrote in mvstech.txt).