Rendering logic: Difference between revisions

On the NeoGeo hardware, the GPU (Graphics Processing Unit), a.k.a. VDP, may refer to a single chip or to a group of chips used to generate the video signal.
Depending on the chipset, video is generated by 3, 2 or one unique chip:


* [[LSPC-A0]], [[PRO-B0]] (early)
* 3: [[LSPC-A0]], [[PRO-B0]], [[PRO-C0]] (early)
* [[LSPC2-A2]], [[NEO-B1]] (most common)
* 2: [[LSPC2-A2]], [[NEO-B1]] (most common)
* [[NEO-GRC]], [[NEO-OFC]] (CD systems)
* 1: [[NEO-GRC]] (CD systems), [[NEO-GRZ]] (CDZ, MV-1C...)
* [[NEO-GRZ]] (CDZ, MV-1C ?)


See [[graphics pipeline]] for an overview of the interconnections between chips and cartridges. See [[Display timing]] for the sync signal's timing.


==Temporary notes==
There are two main parts in generating video:


*Fix, then sprites (PCK1 then PCK2)
* An address generator (LSPC), which queries the graphics ROMs in the cartridges according to the data set in [[VRAM]].
*Fix and sprite pixels are rendered at the same speed because sprite pixels are also written by pairs (reason for the odd/even buffers)
* Line buffers, to which pixels can be written in any order from the graphics ROMs data.
*Tile pixel lines are rendered in halves:


*For the fix (32mclk = 8 pixels corresponds to 6MHz pixel clock):
=Line buffers=
**Full address is ...1**** (PCK2 pulse)
**2H1 is 0 for 2 pixels (columns 0 & 1), then 1 for 2 pixels (columns 2 & 3)
**Full address is ...0**** (PCK2 pulse)
**2H1 is 0 for 2 pixels (columns 4 & 5), then 1 for 2 pixels (columns 6 & 7)


*For sprites (32mclk = 16 pixels):
To render sprites, the NeoGeo uses a pair of line buffers which are each 320 pixels long (a whole scanline). When one is used for rendering, the other one is shifted out for video output. Each new scanline, the buffers are flipped. This can be seen as a kind of double-buffering, allowing pixels to be rendered in any order.
**Full address is ...1***** (PCK1 pulse)
**CA4 is 0 for 4 pixels (columns 0~3), then 1 for 4 pixels (columns 4~7)
**Full address is ...0***** (PCK1 pulse)
**CA4 is 0 for 4 pixels (columns 8~11), then 1 for 4 pixels (columns 12~15)


*As fix is rendered in realtime, the fix tile address is set before sprites (on a new line PCK1 pulses before PCK2)
To increase bandwidth, pixels are rendered two by two in sub-pairs: there are actually four 160-pixel-long buffers, interleaved in an odd/even fashion. This scheme was inherited from the [[Alpha68k]].
*X position to B1, just before each PCK2 pulse (SP during 1mclk), for 20 sprites next to each other (X+16px each time):
** Start of line: 0000,0808,1010,1838,2000,2808,3010,3838,40C0,48E8,50F0,58F8,60C0,68E8,70F0,78F8,8000,8808,9010,9838,0,0,0...


=Video generation=
The fix layer pixels are rendered in real time over the buffers' output.


See [[Display timing]] for the sync signal's timing.
==Fix tiles==


[[NEO-B1]] is used for double-buffering scanlines. While a buffer is output to the screen, the other one is filled up. They're swapped each new scanline. Each of the two line buffers are actually 2 buffers of even/odd pixels. They will be named (1 & 2), and (3 & 4).
* Fix pixels are output in time with the pixel clock (6MHz, 4mclk).
* One fix tile line is therefore output in: 8 pixels * 4mclk = 32mclk. No variation.
* The S ROM outputs 8 bits at a time so: 8 bits / 4bpp = 2 pixels at a time.
* S ROM reads needed for one fix tile line: 8 pixels / 2 pixels per read = 4 reads.


==CSK signals==
Address sequence for one tile line:
CSK1~4 signals are used to clock each buffer (on rising edge ?).
{|class="wikitable"
! A4 !! 2H1 !! pixel pair
|-
| 1 || 0 || A
|-
| 1 || 1 || B
|-
| 0 || 0 || C
|-
| 0 || 1 || D
|}


During rendering, the pulses usually go by pairs (1+2 and 3+4) to render pixels 2 by 2. Horizontal scaling cause CSK pulses to be skipped.
2H1 bypasses the PCK* latches.


During output, the pulses are synchronized and alternative (1+2 then 3+4 then 1+2...) to output even/odd pixels in sequence.
==Sprite tiles==
16mclk = 16 pixels, 8 pixels per read.


* If the corresponding LD* signal is high, then the buffer pointer is incremented (rendering is always done left to right ?).
* Sprite pixels are rendered two-by-two at 12MHz (2mclk).
* If the corresponding LD* signal is low, then the buffer pointer is loaded from the [[P bus]] (X position of sprite, or 0 to start line output).
* One sprite tile line is therefore rendered in: 16 pixels / 2 * 2mclk = 16mclk. No variation, even if shrinking is used.
* The C ROM outputs 2 * 16 = 32 bits at a time so: 32 bits / 4bpp = 8 pixels at a time.
* C ROM reads needed for one sprite tile line: 16 pixels / 8 pixels per read = 2 reads.


Inactive during H-blank.
Address sequence for one tile line:
{|class="wikitable"
! CA4 !! 8-pixel line
|-
| 1 || A
|-
| 0 || B
|}
 
CA4 bypasses the PCK* latches.


==WSE signals==
=Active lists=
WSE1~4 signals are used to indicate if the pixel color index on GAD/GBD needs to be written to the buffer.


During rendering, the pulses are synchronized to CSK signals.
The NeoGeo uses a pair of active lists, where the sprites numbers which need to be rendered on the next scanline are written to. As with the line buffers, the active lists are swapped every new scanline so that one is being filled by parsing, the other one is used for rendering.


* If the pixel is transparent, there is a CSK pulse but no WSE pulse.
They are located in the fast VRAM at addresses $8600 and $8680. Each list is 96 entries long.
* If the pixel is opaque, there are both pulses at the same time.
* If the pixel is skipped for horizontal reduction, there are no pulses at all.


During output, the pulses are also synchronized and always present. This is used to clear to the backdrop color for the next rendering cycle.
=Slow VRAM access slots=


The buffer's /WE signals seems to be (CSK & WSE).
Slow VRAM has four 4mclk-long access slots running in sequence with no variations:


==SS signals==
# Read sprite map even word
# Read sprite map odd word
# Read fix map
# Read/Write for CPU


The complementary pair of SS* signals from LSPC tell B1 how the buffers are used:
=Fast VRAM access slots=


* SS1 low & SS2 high: Buffers 1&2 are written to. Buffers 3&4 are output to the TV.
Fast VRAM is more complex and faster. It has 10 access slots with varying widths running in sequence with no variations, which can be seen as 5 parsing slots and 5 rendering slots:
* SS1 high & SS2 low: Buffers 1&2 are output to the TV. Buffers 3&4 are written to.


==Notes==
{|class="wikitable"
! Slot # !! Duration !! Description
|-
|1
|2mclk
|rowspan=5|Parsing
|-
|2
|1.5mclk
|-
|3
|1.5mclk
|-
|4
|1.5mclk
|-
|5
|1.5mclk
|-
|6
|2mclk
|Read active list
|-
|7
|1.5mclk
|Read SCB2
|-
|8
|1.5mclk
|Read SCB3
|-
|9
|1.5mclk
|Read SCB4
|-
|10
|1.5mclk
|Read/write for CPU
|}


Whatever happens, a single sprite line (16 pixels) always takes 16mclk.
Yellow are parsing cycles, purple is the active list read, green is SCB* reads for rendering, red is for CPU access:


It can also be observed that there's always 96 pulses on LD* during rendering, since 1536mclk per line / 16mclk per sprite = 96 sprites max per line.
[[file:timing_gpu1.png]]


* The rising edge of PCK1 and PCK2 stores fix or sprite pixels.
The parsing cycles aren't consistent, they depend on the matching of sprites. One cycle will read from SCB3 to test if its Y position matches with the current raster line. If there's a match, the next cycle will be a write to the active list. Otherwise it's another read cycle.
* 1H1 is probably used to split pixels of FIXD between left and right.


Fix data is read 8 pixels in advance (confirms what Charles wrote in mvstech.txt). This seems to be inherited from the [[Alpha68k]] as 3 successive 8bit registers forming a waiting "pipeline".
Fast VRAM must be fast enough (45ns) as the shortest slots are 1.5mclk (62.5ns). 1mclk (41.6ns) would be too fast and SRAM was already expensive.


=Sprite parsing=


<span style="color:#FF0000">This is a draft. The following information shouldn't be considered as exact.</span>
<span style="color:#FF0000">This is still a draft. The following information shouldn't be considered as correct.</span>
 
LSPC splits the workload needed to render sprites in two passes: parsing and rendering.
 
* Parsing for a raster line N is done during line N-2
* Rendering is done during line N-1
* Finally the line is ready for output just at the right time.
 
During parsing, the Y positions of 381 sprites are read to see if they will be visible on line N. If that's the case, the sprite number is written to the active list currently being filled. This goes on until 381 sprites were parsed, OR the active list is full (96 sprite numbers were written), whichever comes first.


To do: Edit waveforms, FP and SP windows of the P BUS start 0.5mclk earlier (1.5,5,1.5,1.5,5,1.5 = 16).
* If sprite #382 is reached, the remaining time is used to fill the active list up to 96 entries with zeros.
* If the active list is full, sprites are still parsed up to #382 but no writes are done to the active list, whatever the matching result.


*LSPC runs at 24MHz, but generates signals on rising and falling edges ("48MHz")
No matter how many sprites are matched in the scanline, there will always be 381 SCB3 reads and 96 active list writes.
*Fast VRAM is 35ns (<1mclk), slow VRAM is 100ns (<2.5mclk, 3 ?)
*The fast VRAM reads always occur 1mclk (41.6ns) after address is set. Smallest access window is 1.5mclk.
[[file:timing_gpu1.png]]


*FIXT: P23~16 is 0, P15~0 is S ROM address (+ external 2H1)
This explains why sprite #0 cannot be used: this is the value used to top-up the active list. If there are less than 96 sprite matches (like most of the time), the sprite #0 will be rendered over and over again until the end of the list is reached.
*SPRT: P23~0 is C ROM address (+ external CA4)
*LO: P23~16 is [[LO]] ROM data, P15~0 is LO address
*FP: P19~16 is the fix tile palette, rest is 0
*SP: P23~16 is the sprite tile palette, P15~8 is X position, P7~0 is ?


*LSPC always starts filling up active sprite list A ($8600) each new frame
In the next paragraphs, each character represents a parsing slot: R is an SCB3 read, W is a sprite number write to the active list, F is a filling write to the active list, - is just idle waiting. There are always 1536mclk per line / 16mclk per cycle * 5 slots per cycle = 480 slots.


Read sequence:
==Case 1: Not a single sprite match==


Timing diagram when no sprites fall in the next scanline (no writes to sprite list):
Fast VRAM cycles:
<pre>
<pre>
Parse        ################################                                ################################
24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Render                                      ##########################                                      ##########################
Addr     200  | 201 | 202 | 203 | 204 |  681  | 00E | 20E | 40E | 600 |  205  | 206 | 207 | 208 | 209 |  682  | 00F | 20F | 40F
24M    |'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
R/W    Read   Read  Read  Read Read                                  Read   Read Read  Read  Read
Addr   | 600 |  200  | 201 | 202 | 203 | 204 |  681  | 00E | 20E | 40E | 600 |  205  | 206 | 207 | 208 | 209 |  682  | 00F | 20F | 40F
PCK1   ______|'''|___________________________________________________________|'''|_____________________________________________________
PCK1B '''''''|____|''''''''''''''''''''''''''''''''''''''''''''''''''''''''''|___|''''''''''''''''''''''''''''''''''''''''''''''''''''
LOAD   |'''''''|_______________________|'''''''|_______________________|'''''''|_______________________|'''''''|_______________________
12M    __|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|_
2Pixel      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |
Read       ?      !    !    !    !    !      !    !    !    !    ?      !    !    !    !    !      !    !    !    !
What      1      2      2    2    2    2      3      4    5    6    1      2      2    2    2    2      3      4    5    6...
</pre>
</pre>


*1: Probably CPU acces slot with last address latched ($600)
381 read slots, 96 fill slots, 3 waiting slots:
*2: Read sprite Y position from SCB3 ($200+) to see if it's in next scanline
<pre>
*3: Read sprite list ($600+) to get sprite #
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
*4: Read SCB2 zoom values ($000+)
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
*5: Read SCB3 Y/size/chain ($200+)
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
*6: Read SCB4 X ($400+)
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---
</pre>


10 states in 16 cycles (or 5 in 8 cycles: 4-3-3-3-3).
==Case 2: Some sprites match==


One scanline contains 1536mclk cycles or 96 sequences of 16mclk cycles.
Fast VRAM cycles:
<pre>
24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr    200  | 201 | 600 | 202 | 203 |  681  | 00E | 20E | 40E | 600 |  601  | 204 | 205 | 602 | 206 |  682  | 00F | 20F | 40F
R/W    Read  Read  Write Read  Read                                  Write  Read  Read  Write Read
</pre>
 
If 17 sprites match: 381 read slots, 17 write slots, 96-17=79 fill slots, 3 waiting slots:
<pre>
RRRRRRWRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRWRRRRRRWRRRRRRWRRRRRWRWRWRRRRRRRRRWRRRRRRRRRRRRRRWRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRWRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---
</pre>


Half of the mclk cycles are reserved for Sprite parsing, the other half is for sprite rendering and CPU access.
==Case 3: Exactly 96 sprites match==


Each half has 96 x 5 = 480 states.
Fast VRAM cycles:
<pre>
24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr    200  | 600 | 201 | 601 | 202 |  681  | 00E | 20E | 40E | 600 |  602  | 203 | 204 | 603 | 604 |  682  | 00F | 20F | 40F
R/W    Read  Write Read  Write Read                                  Write  Read  Read  Write Read
</pre>


For the parsing :
381 read slots, 96 write slots, 0 fill slots, 3 waiting slots:
-----------------
<pre>
SCB3 is read (from $200 to $380), each time there is a sprite match, a write state to the sprite list is inserted (apparently 2 states after the corresponding SCB3 read).
RRRRRRWRRRRWRRRRRRRWRRRRRWRWRWRRWRRRRRRRRRWRRRRRRWRRWRRRRWRRRRRWRWRWRRRRRWRRRRWRRRRRRRRWRRRRRRWRRRWR
WRWRRWRRRRRRRWRRRWRRRWRRRRWRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRWRRRWRWRWRWRRWRW
RRRRWRRWRRWRRRWRRRWRRWRRRRWRWRWRRWRRRWRWRWRRWRRRRWRRRRRWRRWRRWRRRRRRWRRRWRRRRWRRRRRWRRWRRRWRRRWRRRRR
WRWRWRWRWRWRRRWRRRWRRRRRRRWRRRWRWRRRWRRRRRRRRRWRRRRRRRWRWRWRWRRRRRRRRRRWRRWRRRRRRWRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRWRRRWRRRRWRRRWRRRRRRRWRRRRRRRWRRRRWRWRRRWRRWRRRRWRRRRRWRRRRRWR---
</pre>


Once SCB3 address $380 is reached, only $0000 write states to the sprite list are possible (in order to fill the rest of the sprite list with zeros).
==Case 4: More than 96 sprites match==


No matter how many sprites are matched in the scanline, we will always have 384 SCB3 read states and 96 sprite list write states.
Same as case 3, except after 96 "W"s, there are only useless "R"s.


That explains why sprite #0 cannot be used : this is the value used to terminate the sprite list. It would have been smarter for SNK to use the value 511 instead...
==Rendering==


According to Charles MacDonald's document, the GPU always renders 96 sprites : the "filler" sprite #0 can be rendered many times per scanline.
# Read active list ($8600+ or $8680+) to get sprite #
# Read SCB2 zoom values ($8000+)
# Read SCB3 Y position, height, and chain bit ($8200+)
# Read SCB4 X position ($8400+)


For the rendering :
The tile # and its attributes are also read from slow VRAM.
-------------------
The state order is :
*1: Read sprite list ($600+) to get sprite #
*2: Read SCB2 zoom values ($000+)
*3: Read SCB3 Y pos/size/sticky ($200+)
*4: Read SCB4 X pos ($400+)
*5: Read/write from CPU
One remark : it is more logical to have SCB3 read before SCB2 because we need the sticky bit to make the decision of keeping the previous vertical shrink value or not.


Is the sticky bit written to the sprite list along with the sprite number or did SNK waste an additionnal 8-bit temporary register in their design ?
==CPU access to VRAM==


CPU access to High VRAM :
-------------------------
SNK says min. 12 68kclk between writes (so 24mclk). 1 write every 24mclk = 64 per scanline.


Why 12 and not 8 ?
CPU access occurs asynchronously with the 68000 bus -> storage in LSPC. If no write is requested, then the slots are occupied by reads, effectively updating one of the two read buffers continuously with the value pointed by the last used VRAM address.
 
=Buffers control=
 
==CK signals==
CK1~4 signals are used to clock each of the 4 buffers.
 
* During rendering, the pulses often go by pair (1+2 or 3+4) to render pixels 2 by 2 if the corresponding WE signal is asserted (opaque pixel). Horizontal shrinking causes pulses to be skipped, so that the buffer's address isn't incremented.
* During output, the pulses are slower and always alternate (1/2/1/2... or 3/4/3/4...) to output even/odd pixels in sequence.
 
* If the corresponding LD* signal is high, the buffer pointer is incremented (rendering left to right).
* If the corresponding LD* signal is low, the buffer pointer is loaded from the [[P bus]] (X position of sprite, or 0 to start line output).
 
Inactive during H-blank.


68000 DTACK# logic is apparently only tied to GPU registers access. So CPU access during state #5 occurs asynchronously with the 68000 bus.
==LD signals==
The LD1~2 signals are synchronous signals used to load the pointers for a buffer pair as two bytes.


My guess on the HW implementation is that during state #5, the GPU always reads the memory content pointed by REG_VRAMADDR and updates the read latch of REG_VRAMRW.
Example P bus values for 5 full-width sprites right next to each other, starting at X=0:


If write latch REG_VRAMRW is written by the 68000, the next state #5 becomes a write access and REG_VRAMADDR is incremented by REG_VRAMMOD value.
<pre>0000,0808,1010,1818,2020</pre>


Even if a theoretical limit of 16mclk or 8 68kclk is possible, some additionnal cycles are needed for the address and data to propagate through the chip.
Example P bus values for 5 full-width sprites right next to each other, starting at X=1 (pixel pairs will be flipped by NEO-ZMC2):


Or maybe SNK has given the worst case scenario between slow Low VRAM and fast High VRAM ?
<pre>0100,0908,1110,1918,2120</pre>


Timing diagram when the sprite list is being filled:
As sprite lines '''always''' take 16mclk to render, there's an LD* pulse every 16mclk to set the new starting address (X position) '''except''' for chained sprites. There's also always a single pulse just before output to reset the pointers to 0.
<pre>
+5/8:


0  5  2  7  4  1  6  3
==WE signals==
|      |      |  |      |
WE1~4 signals are used to tell if the pixel should be written to a buffer.
  5  2  7  4  1  6  3  0
      |      |  |      |  |
0 1 2 3 4 5 6 7 8 9 A B C D E F
|      |    |    |    |   


LLLLLLLLHHHHHHHHLLLLHHHHHHHHLLLL
During rendering, the pulses are synchronized to CK signals.
HHHHHHLLLLLLLLHHHHLLLLLLLLHHHLLL


            0 1 2 3 4 5 6 7 8 9 A B C D E F
* If the pixel is opaque, there are both pulses at the same time (write pixel).
            |      |    |    |    |
* If the pixel is transparent, there is a CK pulse but no WE pulse (skip pixel, move to next one).
* If the pixel is skipped for horizontal shrink, there are no pulses at all (do nothing).


Parse        ################################                                ################################
During output, the pulses are also synchronized to CK signals and always present. This is used to clear the buffers to the backdrop color for the next rendering cycle.
Render                                      ##########################                                      ##########################
24M    |'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr  | 600 |  20F  | 210 | 211 | 600 | 601 |  684  | 005 | 205 | 405 | 600 |  212  | 213 | 602 | 603 | 214 |  685  | 006 | 206 | 406
PCK1  ______|'''|___________________________________________________________|'''|_____________________________________________________
PCK1B  '''''''|___|'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''|___|''''''''''''''''''''''''''''''''''''''''''''''''''''
LOAD  |'''''''|_______________________|'''''''|_______________________|'''''''|_______________________|'''''''|_______________________
12M    __|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|_
2Pixel      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |
/WE    ''''''''''''''''''''''''''|___|'|___|'''''''''''''''''''''''''''''''''''''''''''''''|___|'|___|'''''''''''''''''''''''''''''''''
Read      ?      !    !    !                  !    !    !    !    ?      !    !                !      !    !    !    !
</pre>


*R/W sequences: (2 write buffers ?)
==SS signals==
*600 RRRWW... 600 RRWWR...
The SS1/2 signals enable clearing of buffer pairs, active during output.
*600 WWRRW... 600 WRRWW... 600 RRWWR ... 600 RWWRR
*Even lines: Write to list A, Read from list B (Start of display)
*Odd lines: Write to list B, Read from list A
*In 16clk, 2 sprites SCB3 max. are checked to fill up sprite list , and 1 sprite's attributes are read for output
*384px * 4clk/px = 1536clk/line
*1536clk / 16clk = 96 sprites max/line


*Available CPU R/W slots depending on parsing progress, safest is ? cycles
==Others==


==Slow (lower) VRAM==
* TMS0 is used to flip the buffers, related to the lowest bit of the raster counter.
* The rising edge of PCK1 and PCK2 latches fix or sprite pixels from the cart ROMs.


*Slow VRAM is 100ns (10MHz) and is read at ?
Fix data is read 8 pixels in advance (32mclk, confirms what Charles wrote in mvstech.txt).
*4 slots per render cycle, 1 slot for CPU R/W (1 each 16 68k cycles)


[[Category:Video system]]

Latest revision as of 21:44, 5 December 2018

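Depending on the chipset, video is generated by three, two, or a single chip:

  • 3: LSPC-A0, PRO-B0, PRO-C0 (early)
  • 2: LSPC2-A2, NEO-B1 (most common)
  • 1: NEO-GRC (CD systems), NEO-GRZ (CDZ, MV-1C...)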

See graphics pipeline for an overview of the interconnections between chips and cartridges. See Display timing for the sync signal's timing.

There are two main parts in generating video:

  • An address generator (LSPC), which queries the graphics ROMs in the cartridges according to the data set in VRAM.
  • Line buffers, to which pixels can be written in any order from the graphics ROMs data.

Line buffers

To render sprites, the NeoGeo uses a pair of line buffers which are each 320 pixels long (a whole scanline). When one is used for rendering, the other one is shifted out for video output. Each new scanline, the buffers are flipped. This can be seen as a kind of double-buffering, allowing pixels to be rendered in any order.

To increase bandwidth, pixels are rendered two by two in sub-pairs: there are actually four 160-pixel-long buffers, interleaved in an odd/even fashion. This scheme was inherited from the Alpha68k.

The fix layer pixels are rendered in real time over the buffers' output.

Fix tiles

  • Fix pixels are output in time with the pixel clock (6MHz, 4mclk).
  • One fix tile line is therefore output in: 8 pixels * 4mclk = 32mclk. No variation.
  • The S ROM outputs 8 bits at a time so: 8 bits / 4bpp = 2 pixels at a time.
  • S ROM reads needed for one fix tile line: 8 pixels / 2 pixels per read = 4 reads.

Address sequence for one tile line:

A4   2H1   Pixel pair
1    0     A
1    1     B
0    0     C
0    1     D

2H1 bypasses the PCK* latches.
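
The fix tile arithmetic and address sequence above can be sketched in a few lines. This is only an illustration of the figures given in this section; the function and constant names are hypothetical and do not correspond to any real tool.

```python
# Illustrative sketch only: names are hypothetical, values come from the list above.
MCLK_PER_PIXEL = 4         # 6MHz pixel clock = 4mclk per pixel
PIXELS_PER_FIX_LINE = 8    # one fix tile line
BITS_PER_S_ROM_READ = 8    # the S ROM outputs 8 bits at a time
BITS_PER_PIXEL = 4         # 4bpp


def fix_tile_line_mclk():
    """Time needed to output one fix tile line, in mclk."""
    return PIXELS_PER_FIX_LINE * MCLK_PER_PIXEL


def fix_tile_line_reads():
    """Return the (A4, 2H1, pixel pair) sequence for one fix tile line."""
    pixels_per_read = BITS_PER_S_ROM_READ // BITS_PER_PIXEL   # 2 pixels per read
    reads = PIXELS_PER_FIX_LINE // pixels_per_read            # 4 reads per line
    sequence = []
    for i in range(reads):
        a4 = 1 if i < reads // 2 else 0   # A4 is 1 for the first two reads, then 0
        h1 = i % 2                        # 2H1 toggles on every read
        sequence.append((a4, h1, "ABCD"[i]))
    return sequence


print(fix_tile_line_mclk())
print(fix_tile_line_reads())
```

The returned sequence reproduces the four rows of the table, in order.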

Sprite tiles

16mclk = 16 pixels, 8 pixels per read.

  • Sprite pixels are rendered two-by-two at 12MHz (2mclk).
  • One sprite tile line is therefore rendered in: 16 pixels / 2 * 2mclk = 16mclk. No variation, even if shrinking is used.
  • The C ROM outputs 2 * 16 = 32 bits at a time so: 32 bits / 4bpp = 8 pixels at a time.
  • C ROM reads needed for one sprite tile line: 16 pixels / 8 pixels per read = 2 reads.

Address sequence for one tile line:

CA4   8-pixel line
1     A
0     B

CA4 bypasses the PCK* latches.
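
The same kind of sketch works for sprite tile lines. It also derives the 96 sprites per line limit from the 1536mclk line length quoted later on this page. All names are hypothetical and the code only restates the numbers above.

```python
# Illustrative sketch only: names are hypothetical, values come from this page.
MCLK_PER_LINE = 1536          # mclk per scanline
MCLK_PER_PIXEL_PAIR = 2       # sprite pixels are rendered two-by-two at 12MHz
PIXELS_PER_SPRITE_LINE = 16
BITS_PER_C_ROM_READ = 2 * 16  # the C ROM outputs 32 bits at a time
BITS_PER_PIXEL = 4


def sprite_tile_line_mclk():
    """Time needed to render one sprite tile line, in mclk (shrinking does not change it)."""
    return PIXELS_PER_SPRITE_LINE // 2 * MCLK_PER_PIXEL_PAIR


def sprite_tile_line_reads():
    """Return the (CA4, 8-pixel line) sequence for one sprite tile line."""
    pixels_per_read = BITS_PER_C_ROM_READ // BITS_PER_PIXEL    # 8 pixels per read
    reads = PIXELS_PER_SPRITE_LINE // pixels_per_read          # 2 reads per line
    return [(1 - i, "AB"[i]) for i in range(reads)]            # CA4 is 1, then 0


def max_sprites_per_line():
    """How many 16mclk sprite lines fit in one scanline."""
    return MCLK_PER_LINE // sprite_tile_line_mclk()


print(sprite_tile_line_mclk(), sprite_tile_line_reads(), max_sprites_per_line())
```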

Active lists

The NeoGeo uses a pair of active lists, to which the numbers of the sprites that need to be rendered on the next scanline are written. As with the line buffers, the active lists are swapped every new scanline, so that one is being filled by parsing while the other is used for rendering.

They are located in the fast VRAM at addresses $8600 and $8680. Each list is 96 entries long.
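
A small sketch of the active list layout follows. It assumes one list entry per VRAM word address, which matches the consecutive list addresses seen in the timing diagrams on this page. Which of the two lists is filled on even lines is an assumption made here for illustration only, and all names are hypothetical.

```python
# Illustrative sketch only: names are hypothetical.
ACTIVE_LIST_BASES = (0x8600, 0x8680)   # fast VRAM addresses of the two lists
ACTIVE_LIST_LENGTH = 96                # entries per list


def active_list_entry_address(list_index, entry):
    """Fast VRAM address of one entry, assuming one entry per word address."""
    if list_index not in (0, 1):
        raise ValueError("list_index must be 0 or 1")
    if not 0 <= entry < ACTIVE_LIST_LENGTH:
        raise ValueError("entry must be in 0..95")
    return ACTIVE_LIST_BASES[list_index] + entry


def lists_for_line(line):
    """Return (list being filled, list being rendered) for a raster line.

    The lists swap every scanline; the even/odd assignment used here is an
    assumption, not something stated by the text above.
    """
    fill = line & 1
    return fill, fill ^ 1


print(hex(active_list_entry_address(1, 95)), lists_for_line(0), lists_for_line(1))
```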

Slow VRAM access slots

Slow VRAM has four 4mclk-long access slots running in sequence with no variations:

  1. Read sprite map even word
  2. Read sprite map odd word
  3. Read fix map
  4. Read/Write for CPU
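
The fixed slow VRAM schedule can be modelled as a lookup on the master clock count. This is a sketch under the assumption that slot 1 starts at mclk 0 (the real phase relative to the line start is not stated here); names are hypothetical.

```python
# Illustrative sketch only: names are hypothetical, phase alignment is assumed.
SLOW_VRAM_SLOTS = (
    "read sprite map even word",
    "read sprite map odd word",
    "read fix map",
    "read/write for CPU",
)
SLOT_MCLK = 4   # every slow VRAM slot is 4mclk long


def slow_vram_slot(mclk):
    """Return the slow VRAM access slot active at a given mclk count."""
    return SLOW_VRAM_SLOTS[(mclk // SLOT_MCLK) % len(SLOW_VRAM_SLOTS)]


for t in (0, 4, 8, 12, 16):
    print(t, slow_vram_slot(t))
```

With four 4mclk slots, the whole sequence repeats every 16mclk, which is also the time taken by one sprite tile line.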

Fast VRAM access slots

Fast VRAM access is faster and more complex. It has 10 access slots of varying widths that always run in the same sequence, which can be seen as 5 parsing slots followed by 5 rendering slots:

Slot #   Duration   Description
1        2mclk      Parsing
2        1.5mclk    Parsing
3        1.5mclk    Parsing
4        1.5mclk    Parsing
5        1.5mclk    Parsing
6        2mclk      Read active list
7        1.5mclk    Read SCB2
8        1.5mclk    Read SCB3
9        1.5mclk    Read SCB4
10       1.5mclk    Read/write for CPU

Figure (timing_gpu1.png): fast VRAM access slots. Parsing cycles are shown in yellow, the active list read in purple, SCB* reads for rendering in green, and CPU access in red.

The content of the parsing cycles isn't fixed: it depends on which sprites match. One cycle reads from SCB3 to test whether the sprite's Y position matches the raster line being parsed. If there's a match, the next cycle is a write to the active list. Otherwise it's another read cycle.

Fast VRAM must be fast enough (45ns) because the shortest slots are 1.5mclk (62.5ns) long. 1mclk slots (41.7ns) would have been too short for it, and SRAM was already expensive.
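
The slot table and the timing figures above can be checked with a short sketch. It assumes that slot 1 starts at mclk 0 of each 16mclk cycle, and all names are hypothetical. Fractions are used so that the half-mclk boundaries stay exact.

```python
# Illustrative sketch only: names are hypothetical, phase alignment is assumed.
from fractions import Fraction

MCLK_NS = Fraction(1000, 24)   # one 24MHz master clock period, in ns

FAST_VRAM_SLOTS = [
    (Fraction(2), "parsing"),
    (Fraction(3, 2), "parsing"),
    (Fraction(3, 2), "parsing"),
    (Fraction(3, 2), "parsing"),
    (Fraction(3, 2), "parsing"),
    (Fraction(2), "read active list"),
    (Fraction(3, 2), "read SCB2"),
    (Fraction(3, 2), "read SCB3"),
    (Fraction(3, 2), "read SCB4"),
    (Fraction(3, 2), "read/write for CPU"),
]
CYCLE_MCLK = sum(duration for duration, _ in FAST_VRAM_SLOTS)   # 16mclk


def fast_vram_slot(mclk):
    """Return (slot number, description) for a given mclk count."""
    t = Fraction(mclk) % CYCLE_MCLK
    end = Fraction(0)
    for number, (duration, description) in enumerate(FAST_VRAM_SLOTS, start=1):
        end += duration
        if t < end:
            return number, description
    raise AssertionError("unreachable: t is always below the cycle length")


def shortest_slot_ns():
    """Duration of the shortest slot in nanoseconds."""
    return float(min(duration for duration, _ in FAST_VRAM_SLOTS) * MCLK_NS)


print(CYCLE_MCLK, fast_vram_slot(8), shortest_slot_ns())
```

The durations add up to 16mclk, so one full parsing plus rendering sequence takes the same time as one sprite tile line.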

Sprite parsing

This is still a draft. The following information shouldn't be considered as correct.

LSPC splits the workload needed to render sprites in two passes: parsing and rendering.

  • Parsing for a raster line N is done during line N-2
  • Rendering is done during line N-1
  • Finally the line is ready for output just at the right time.

During parsing, the Y positions of 381 sprites are read to see if they will be visible on line N. If that's the case, the sprite number is written to the active list currently being filled. Writes go on until all 381 sprites have been parsed or the active list is full (96 sprite numbers written), whichever comes first.

  • If sprite #382 is reached, the remaining time is used to fill the active list up to 96 entries with zeros.
  • If the active list is full, sprites are still parsed up to #382 but no writes are done to the active list, whatever the matching result.

No matter how many sprites are matched in the scanline, there will always be 381 SCB3 reads and 96 active list writes.

This explains why sprite #0 cannot be used: it is the value used to top up the active list. If there are fewer than 96 sprite matches (as is the case most of the time), sprite #0 is rendered over and over again until the end of the list is reached.

In the next paragraphs, each character represents a parsing slot: R is an SCB3 read, W is a sprite number write to the active list, F is a filling write to the active list, - is just idle waiting. There are always 1536mclk per line / 16mclk per cycle * 5 slots per cycle = 480 slots.
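
The parsing rules can be turned into a small simulation that produces slot strings in the notation just described. This is a sketch based only on the behaviour described on this draft page (a write immediately follows a matching read, writes stop once 96 entries are stored, and the list is then topped up with filler writes); the function name and its input format are hypothetical.

```python
# Illustrative sketch only: models the draft description, not verified hardware behaviour.
SPRITES_PARSED = 381
ACTIVE_LIST_LENGTH = 96
SLOTS_PER_LINE = 1536 // 16 * 5   # 480 parsing slots per line


def parse_line(matches):
    """Return the 480-character slot string for one scanline.

    matches: one boolean per parsed sprite, True if its Y range covers the line.
    """
    matches = list(matches)
    if len(matches) != SPRITES_PARSED:
        raise ValueError("expected one flag per parsed sprite (381)")
    slots = []
    written = 0
    for hit in matches:
        slots.append("R")                         # SCB3 read
        if hit and written < ACTIVE_LIST_LENGTH:
            slots.append("W")                     # sprite number write
            written += 1
    slots.extend("F" * (ACTIVE_LIST_LENGTH - written))   # fill the rest with zeros
    slots.extend("-" * (SLOTS_PER_LINE - len(slots)))    # idle slots
    return "".join(slots)


no_match = parse_line([False] * SPRITES_PARSED)
print(no_match.count("R"), no_match.count("F"), no_match.count("-"))
```

Whatever the match pattern, the string always contains 381 reads, 96 writes of either kind, and 3 idle slots.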

Case 1: Not a single sprite match

Fast VRAM cycles:

24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr     200  | 201 | 202 | 203 | 204 |  681  | 00E | 20E | 40E | 600 |  205  | 206 | 207 | 208 | 209 |  682  | 00F | 20F | 40F
R/W     Read   Read  Read  Read  Read                                   Read   Read  Read  Read  Read

381 read slots, 96 fill slots, 3 waiting slots:

RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---

Case 2: Some sprites match

Fast VRAM cycles:

24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr     200  | 201 | 600 | 202 | 203 |  681  | 00E | 20E | 40E | 600 |  601  | 204 | 205 | 602 | 206 |  682  | 00F | 20F | 40F
R/W     Read   Read  Write Read  Read                                   Write  Read  Read  Write Read

If 17 sprites match: 381 read slots, 17 write slots, 96-17=79 fill slots, 3 waiting slots:

RRRRRRWRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRWRRRRRRWRRRRRRWRRRRRWRWRWRRRRRRRRRWRRRRRRRRRRRRRRWRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRWRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---

Case 3: Exactly 96 sprites match

Fast VRAM cycles:

24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr     200  | 600 | 201 | 601 | 202 |  681  | 00E | 20E | 40E | 600 |  602  | 203 | 204 | 603 | 604 |  682  | 00F | 20F | 40F
R/W     Read   Write Read  Write Read                                   Write  Read  Read  Write Read

381 read slots, 96 write slots, 0 fill slots, 3 waiting slots:

RRRRRRWRRRRWRRRRRRRWRRRRRWRWRWRRWRRRRRRRRRWRRRRRRWRRWRRRRWRRRRRWRWRWRRRRRWRRRRWRRRRRRRRWRRRRRRWRRRWR
WRWRRWRRRRRRRWRRRWRRRWRRRRWRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRWRRRWRWRWRWRRWRW
RRRRWRRWRRWRRRWRRRWRRWRRRRWRWRWRRWRRRWRWRWRRWRRRRWRRRRRWRRWRRWRRRRRRWRRRWRRRRWRRRRRWRRWRRRWRRRWRRRRR
WRWRWRWRWRWRRRWRRRWRRRRRRRWRRRWRWRRRWRRRRRRRRRWRRRRRRRWRWRWRWRRRRRRRRRRWRRWRRRRRRWRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRWRRRWRRRRWRRRWRRRRRRRWRRRRRRRWRRRRWRWRRRWRRWRRRRWRRRRRWRRRRRWR---

Case 4: More than 96 sprites match

Same as case 3, except after 96 "W"s, there are only useless "R"s.

Rendering

  1. Read active list ($8600+ or $8680+) to get sprite #
  2. Read SCB2 zoom values ($8000+)
  3. Read SCB3 Y position, height, and chain bit ($8200+)
  4. Read SCB4 X position ($8400+)

The tile # and its attributes are also read from slow VRAM.
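
The four fast VRAM reads of a rendering cycle can be sketched as follows. The example values are taken from the address row of the timing diagrams above, where the list entry at $681 is followed by reads at $00E, $20E and $40E (shown there without the $8000 offset). Names are hypothetical.

```python
# Illustrative sketch only: names are hypothetical.
SCB2_BASE = 0x8000   # zoom values
SCB3_BASE = 0x8200   # Y position, height, chain bit
SCB4_BASE = 0x8400   # X position


def render_reads(list_address, sprite_number):
    """Return the fast VRAM reads done to render one active list entry.

    sprite_number is the value that the read at list_address would return.
    """
    return [
        ("active list", list_address),
        ("SCB2", SCB2_BASE + sprite_number),
        ("SCB3", SCB3_BASE + sprite_number),
        ("SCB4", SCB4_BASE + sprite_number),
    ]


for name, address in render_reads(0x8681, 0x00E):
    print(f"{name}: ${address:04X}")
```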

CPU access to VRAM

SNK specifies a minimum of 12 68000 clock cycles (24mclk) between writes. One write every 24mclk gives 1536 / 24 = 64 writes per scanline.

CPU access occurs asynchronously with the 68000 bus, which implies that the value is stored in LSPC until a slot comes up. If no write is requested, the slots are occupied by reads, which continuously update one of the two read buffers with the value pointed to by the last used VRAM address.
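
The write rate arithmetic can be compared with the number of CPU slots that actually exist. The sketch below only restates figures from this page (12 68000 clocks = 24mclk, 1536mclk per line, one CPU slot per 16mclk cycle); names are hypothetical.

```python
# Illustrative sketch only: names are hypothetical.
MCLK_PER_LINE = 1536
MCLK_PER_68K_CLOCK = 2     # 12 68000 clocks are quoted as 24mclk
MCLK_PER_SLOT_CYCLE = 16   # one CPU slot per 16mclk access sequence


def max_cpu_writes_per_line(min_68k_clocks_between_writes=12):
    """Writes per scanline if the CPU respects the given minimum spacing."""
    spacing_mclk = min_68k_clocks_between_writes * MCLK_PER_68K_CLOCK
    return MCLK_PER_LINE // spacing_mclk


def cpu_slots_per_line():
    """Number of CPU access slots available in one scanline."""
    return MCLK_PER_LINE // MCLK_PER_SLOT_CYCLE


print(max_cpu_writes_per_line(), cpu_slots_per_line())
```

At the recommended spacing the CPU uses at most 64 of the 96 available slots per line, so a write never has to wait for more than one slot cycle.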

Buffers control

CK signals

CK1~4 signals are used to clock each of the 4 buffers.

  • During rendering, the pulses often go in pairs (1+2 or 3+4) to render pixels 2 by 2 if the corresponding WE signal is asserted (opaque pixel). Horizontal shrinking causes pulses to be skipped, so that the buffer's address isn't incremented.
  • During output, the pulses are slower and always alternate (1/2/1/2... or 3/4/3/4...) to output even/odd pixels in sequence.
  • If the corresponding LD* signal is high, the buffer pointer is incremented (rendering left to right).
  • If the corresponding LD* signal is low, the buffer pointer is loaded from the P bus (X position of sprite, or 0 to start line output).

Inactive during H-blank.

LD signals

The LD1~2 signals are synchronous signals used to load the pointers for a buffer pair as two bytes.

Example P bus values for 5 full-width sprites right next to each other, starting at X=0:

0000,0808,1010,1818,2020

Example P bus values for 5 full-width sprites right next to each other, starting at X=1 (pixel pairs will be flipped by NEO-ZMC2):

0100,0908,1110,1918,2120

As sprite lines always take 16mclk to render, there's an LD* pulse every 16mclk to set the new starting address (X position), except for chained sprites. There's also always a single pulse just before output to reset the pointers to 0.
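
The two example sequences are consistent with a simple rule: the low byte is X / 2 rounded down and the high byte is X / 2 rounded up. The sketch below encodes that rule. It is inferred from the ten example values only, so the mapping of each byte to the odd or even buffer is not claimed here, and the function name is hypothetical.

```python
# Illustrative sketch only: rule inferred from the example values above.
def ld_pointer_word(x):
    """Return the 16-bit P bus value loaded for a sprite starting at pixel x."""
    low = (x >> 1) & 0xFF          # X / 2, rounded down
    high = ((x + 1) >> 1) & 0xFF   # X / 2, rounded up
    return (high << 8) | low


aligned = [ld_pointer_word(16 * k) for k in range(5)]
shifted = [ld_pointer_word(1 + 16 * k) for k in range(5)]
print(",".join(f"{v:04X}" for v in aligned))
print(",".join(f"{v:04X}" for v in shifted))
```

For even X both pointers are equal; for odd X one of them is one step ahead, which fits the remark that pixel pairs are then flipped by NEO-ZMC2.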

WE signals

WE1~4 signals are used to tell if the pixel should be written to a buffer.

During rendering, the pulses are synchronized to CK signals.

  • If the pixel is opaque, there are both pulses at the same time (write pixel).
  • If the pixel is transparent, there is a CK pulse but no WE pulse (skip pixel, move to next one).
  • If the pixel is skipped for horizontal shrink, there are no pulses at all (do nothing).

During output, the pulses are also synchronized to CK signals and always present. This is used to clear the buffers to the backdrop color for the next rendering cycle.
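
The three rendering cases can be modelled with a simplified, single-buffer sketch. It ignores the odd/even split into four buffers and only shows how CK advances the pointer while WE decides whether the pixel is stored. The names and the use of None for a transparent pixel are choices made for this illustration.

```python
# Illustrative sketch only: single buffer, no odd/even interleaving.
def render_pulses(opaque, skipped_by_shrink):
    """Return (CK pulse, WE pulse) for one sprite pixel during rendering."""
    if skipped_by_shrink:
        return (False, False)   # no pulses at all: do nothing
    return (True, opaque)       # CK always, WE only for opaque pixels


def render_pixels(buffer, start, pixels):
    """Render (color, skipped) pairs into buffer from start; return the final pointer.

    A color of None stands for a transparent pixel.
    """
    pointer = start
    for color, skipped in pixels:
        ck, we = render_pulses(color is not None, skipped)
        if we:
            buffer[pointer] = color   # write pixel
        if ck:
            pointer += 1              # move to the next buffer address
    return pointer


line = [None] * 8
end = render_pixels(line, 2, [(5, False), (None, False), (7, True), (9, False)])
print(line, end)
```

In this run the transparent pixel leaves a gap at address 3, while the pixel skipped by shrinking is dropped without advancing the pointer.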

SS signals

The SS1/2 signals enable clearing of buffer pairs, active during output.

Others

  • TMS0 is used to flip the buffers, related to the lowest bit of the raster counter.
  • The rising edge of PCK1 and PCK2 latches fix or sprite pixels from the cart ROMs.

Fix data is read 8 pixels (32mclk) in advance, which confirms what Charles MacDonald wrote in mvstech.txt.