Rendering logic: Difference between revisions

From NeoGeo Development Wiki
m (Furrtek moved page GPU to Rendering logic without leaving a redirect: More details, also "GPU" isn't really the right word)
 
(3 intermediate revisions by 2 users not shown)
On the NeoGeo hardware, the GPU (Graphics Processing Unit) a.k.a. VDP, may refer to a chip or a group of different chips used to generate the video signal.
Depending on the chipset, video is generated by 3, 2 or one unique chip:


* [[LSPC-A0]], [[PRO-B0]] (early)
* 3: [[LSPC-A0]], [[PRO-B0]], [[PRO-C0]] (early)
* [[LSPC2-A2]], [[NEO-B1]] (most common)
* 2: [[LSPC2-A2]], [[NEO-B1]] (most common)
* [[NEO-GRC]], [[NEO-OFC]] (CD systems)
* 1: [[NEO-GRC]] (CD systems), [[NEO-GRZ]] (CDZ, MV-1C...)
* [[NEO-GRZ]] (CDZ, MV-1C ?)


See [[graphics pipeline]] for an overview of the interconnections between chips and cartridges.
See [[graphics pipeline]] for an overview of the interconnections between chips and cartridges. See [[Display timing]] for the sync signal's timing.


==Graphics fetch==
There are two main parts in generating video:
Tile pixel lines are rendered in halves.


===Fix tiles===
* An address generator (LSPC), which queries the graphics ROMs in the cartridges according to the data set in [[VRAM]].
32mclk = 8 pixels, 2 pixels per read.
* Line buffers, to which pixels can be written in any order from the graphics ROMs data.


* S1 ROM address is ...1**** (PCK2 pulse)
=Line buffers=
** 2H1 is 0 for 2 pixels (columns 0 & 1)
** Then 1 for 2 pixels (columns 2 & 3)
* S1 ROM address is ...0**** (PCK2 pulse)
** 2H1 is 0 for 2 pixels (columns 4 & 5)
** Then 1 for 2 pixels (columns 6 & 7)


===Sprite tiles===
To render sprites, the NeoGeo uses a pair of line buffers which are each 320 pixels long (a whole scanline). When one is used for rendering, the other one is shifted out for video output. Each new scanline, the buffers are flipped. This can be seen as a kind of double-buffering, allowing pixels to be rendered in any order.
16mclk = 16 pixels, 8 pixels per read.


* C ROM address is ...1***** (PCK1 pulse) (columns 0~7)
To increase bandwidth, pixels are rendered two by two in sub-pairs: there are actually four 160-pixel-long buffers interleaved in an odd/even fashion. This scheme was inherited from the [[Alpha68k]].
* C ROM  address is ...0***** (PCK1 pulse) (columns 8~15)


CA4 is used to select upper/lower tiles (see [[Sprite graphics format]]).
The fix layer pixels are rendered in real time over the buffers' output.


=Video generation=
==Fix tiles==
See [[Display timing]] for the sync signal's timing.


[[NEO-B1]] is used for double-buffering scanlines. While one buffer is output to the screen, the other one is filled up. They're swapped each new scanline. Each of the two line buffers is actually 2 buffers of even/odd pixels. They will be named (1 & 2), and (3 & 4).
* Fix pixels are output in time with the pixel clock (6MHz, 4mclk).
* One fix tile line is therefore output in: 8 pixels * 4mclk = 32mclk. No variation.
* The S ROM outputs 8 bits at a time so: 8 bits / 4bpp = 2 pixels at a time.
* S ROM reads needed for one fix tile line: 8 pixels / 2 pixels per read = 4 reads.


==CSK signals==
Address sequence for one tile line:
CSK1~4 signals are used to clock each buffer (on rising edge ?).
{|class="wikitable"
! A4 !! 2H1 !! pixel pair
|-
| 1 || 0 || A
|-
| 1 || 1 || B
|-
| 0 || 0 || C
|-
| 0 || 1 || D
|}


During rendering, the pulses usually come in pairs (1+2 and 3+4) to render pixels 2 by 2. Horizontal scaling causes CSK pulses to be skipped.
2H1 bypasses the PCK* latches.


During output, the pulses are synchronized and alternate (1+2 then 3+4 then 1+2...) to output even/odd pixels in sequence.
==Sprite tiles==
16mclk = 16 pixels, 8 pixels per read.


* If the corresponding LD* signal is high, then the buffer pointer is incremented (rendering is always done left to right ?).
* Sprite pixels are rendered two-by-two at 12MHz (2mclk).
* If the corresponding LD* signal is low, then the buffer pointer is loaded from the [[P bus]] (X position of sprite, or 0 to start line output).
* One sprite tile line is therefore rendered in: 16 pixels / 2 * 2mclk = 16mclk. No variation, even if shrinking is used.
* The C ROM outputs 2 * 16 = 32 bits at a time so: 32 bits / 4bpp = 8 pixels at a time.
* C ROM reads needed for one sprite tile line: 16 pixels / 8 pixels per read = 2 reads.


Inactive during H-blank.
Address sequence for one tile line:
{|class="wikitable"
! CA4 !! 8-pixel line
|-
| 1 || A
|-
| 0 || B
|}
 
CA4 bypasses the PCK* latches.


==LD signals==
=Active lists=
The LD1~2 signals set the X position in B1. It is /2, since it is set in both even and odd buffers.


Example P bus data for 20 sprites right next to each other (X+=16px):
The NeoGeo uses a pair of active lists, where the sprites numbers which need to be rendered on the next scanline are written to. As with the line buffers, the active lists are swapped every new scanline so that one is being filled by parsing, the other one is used for rendering.


<pre>0000,0808,1010,1838,2000,2808,3010,3838,40C0,48E8,50F0,58F8,60C0,68E8,70F0,78F8,8000,8808,9010,9838,0,0,0...</pre>
They are located in the fast VRAM at addresses $8600 and $8680. Each list is 96 entries long.


The lower byte might just be garbage ignored by B1.
=Slow VRAM access slots=


==WSE signals==
Slow VRAM has four 4mclk-long access slots running in sequence with no variations:
WSE1~4 signals are used to indicate if the pixel color index on GAD/GBD needs to be written to the buffer.


During rendering, the pulses are synchronized to CSK signals.
# Read sprite map even word
# Read sprite map odd word
# Read fix map
# Read/Write for CPU


* If the pixel is transparent, there is a CSK pulse but no WSE pulse.
=Fast VRAM access slots=
* If the pixel is opaque, there are both pulses at the same time.
* If the pixel is skipped for horizontal reduction, there are no pulses at all.


During output, the pulses are also synchronized to CSK signals and always present. This is used to clear to the backdrop color for the next rendering cycle.
Fast VRAM is more complex and faster. It has 10 access slots with varying widths running in sequence with no variations, which can be seen as 5 parsing slots and 5 rendering slots:


The buffer's /WE signal seems to be (CSK & WSE).
{|class="wikitable"
! Slot # !! Duration !! Description
|-
|1
|2mclk
|rowspan=5|Parsing
|-
|2
|1.5mclk
|-
|3
|1.5mclk
|-
|4
|1.5mclk
|-
|5
|1.5mclk
|-
|6
|2mclk
|Read active list
|-
|7
|1.5mclk
|Read SCB2
|-
|8
|1.5mclk
|Read SCB3
|-
|9
|1.5mclk
|Read SCB4
|-
|10
|1.5mclk
|Read/write for CPU
|}


==SS signals==
Yellow are parsing cycles, purple is the active list read, green is SCB* reads for rendering, red is for CPU access:
The complementary pair of SS* signals from LSPC tell B1 how the buffers are used:


* SS1 low & SS2 high: Buffers 1&2 are written to. Buffers 3&4 are output to the TV.
[[file:timing_gpu1.png]]
* SS1 high & SS2 low: Buffers 1&2 are output to the TV. Buffers 3&4 are written to.


==Notes==
The parsing cycles aren't consistent: they depend on which sprites match. One cycle reads from SCB3 to test whether the sprite's Y position matches the current raster line. If there's a match, the next cycle is a write to the active list. Otherwise it's another read cycle.


* What's the use of TMS0 ? Not buffer flip, SS1~2 do that.
Fast VRAM must be fast enough (45ns) as the shortest slots are 1.5mclk (62.5ns). 1mclk (41.6ns) would be too fast and SRAM was already expensive.
* Fix pixel rendering: 4mclk (realtime)
* Sprite pixel rendering: 1mclk  
* LSPC runs at 24MHz, but generates signals on rising and falling edges ("48MHz")


A single sprite line (16 pixels) '''always''' takes 16mclk, whatever the zoom value or the amount of transparent pixels.
=Sprite parsing=


It can also be observed that there's always 96 pulses on LD* during rendering, since 1536mclk per line / 16mclk per sprite = 96 sprites max per line.
<span style="color:#FF0000">This is still a draft. The following information shouldn't be considered as correct.</span>


* The rising edge of PCK1 and PCK2 latches fix or sprite pixels from the cart ROMs.
LSPC splits the workload needed to render sprites into two passes: parsing and rendering.
* 1H1 is probably used to split pixels of FIXD between left and right.


Fix data is read 8 pixels in advance (confirms what Charles wrote in mvstech.txt). This seems to be inherited from the [[Alpha68k]] as 3 successive 8bit registers forming a waiting "pipeline".
* Parsing for a raster line N is done during line N-2
* Rendering is done during line N-1
* Finally the line is ready for output just at the right time.


=Sprite parsing=
During parsing, the Y positions of 381 sprites are read to see if they will be visible on line N. If that's the case, the sprite number is written to the active list currently being filled. This goes on until 381 sprites were parsed, OR the active list is full (96 sprite numbers were written), whichever comes first.


<span style="color:#FF0000">This is a draft. The following information shouldn't be considered as exact.</span>
* If sprite #382 is reached, the remaining time is used to fill the active list up to 96 entries with zeros.
* If the active list is full, sprites are still parsed up to #382 but no writes are done to the active list, whatever the matching result.


* Fast VRAM is 35ns (<1mclk), slow VRAM is 100ns (<2.5mclk)
No matter how many sprites are matched in the scanline, there will always be 381 SCB3 reads and 96 active list writes.
* The fast VRAM reads always occur 1mclk (41.6ns) after the address is set. Smallest access window is 1.5mclk.
[[file:timing_gpu1.png]]


* FIXT: P23~16 is 0, P15~0 is S ROM address (+ external 2H1)
This explains why sprite #0 cannot be used: it is the value used to top up the active list. If there are fewer than 96 sprite matches (as is the case most of the time), sprite #0 will be rendered over and over again until the end of the list is reached.
* SPRT: P23~0 is C ROM address (+ external CA4)
* LO: P23~16 is [[LO]] ROM data, P15~0 is LO address
* FP: P19~16 is the fix tile palette, rest is 0
* SP: P23~16 is the sprite tile palette, P15~8 is X position, P7~0 is ?


* LSPC always starts filling up the active sprite list A ($8600) each new frame.
In the next paragraphs, each character represents a parsing slot: R is an SCB3 read, W is a sprite number write to the active list, F is a filling write to the active list, - is just idle waiting. There are always 1536mclk per line / 16mclk per cycle * 5 slots per cycle = 480 slots.


Read sequence:
==Case 1: Not a single sprite match==


Timing diagram when no sprites fall in the next scanline (no writes to sprite list):
Fast VRAM cycles:
<pre>
<pre>
Parse        ################################                                ################################
24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Render                                      ##########################                                      ##########################
Addr     200  | 201 | 202 | 203 | 204 |  681  | 00E | 20E | 40E | 600 |  205  | 206 | 207 | 208 | 209 |  682  | 00F | 20F | 40F
24M    |'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
R/W    Read   Read  Read  Read Read                                  Read   Read Read  Read  Read
Addr   | 600 |  200  | 201 | 202 | 203 | 204 |  681  | 00E | 20E | 40E | 600 |  205  | 206 | 207 | 208 | 209 |  682  | 00F | 20F | 40F
PCK1   ______|'''|___________________________________________________________|'''|_____________________________________________________
PCK1B '''''''|____|''''''''''''''''''''''''''''''''''''''''''''''''''''''''''|___|''''''''''''''''''''''''''''''''''''''''''''''''''''
LOAD   |'''''''|_______________________|'''''''|_______________________|'''''''|_______________________|'''''''|_______________________
12M    __|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|_
2Pixel      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |
Read       ?      !    !    !    !    !      !    !    !    !    ?      !    !    !    !    !      !    !    !    !
What      1      2      2    2    2    2      3      4    5    6    1      2      2    2    2    2      3      4    5    6...
</pre>
</pre>


*1: Probably CPU access slot with last address latched ($600)
381 read slots, 96 fill slots, 3 waiting slots:
*2: Read sprite Y position from SCB3 ($200+) to see if it's in next scanline
<pre>
*3: Read sprite list ($600+) to get sprite #
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
*4: Read SCB2 zoom values ($000+)
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
*5: Read SCB3 Y/size/chain ($200+)
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
*6: Read SCB4 X ($400+)
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---
</pre>


10 states in 16 cycles (or 5 in 8 cycles: 4-3-3-3-3).
==Case 2: Some sprites match==


One scanline lasts 1536mclk cycles = 96 sequences of 16mclk cycles.
Fast VRAM cycles:
<pre>
24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr    200  | 201 | 600 | 202 | 203 |  681  | 00E | 20E | 40E | 600 |  601  | 204 | 205 | 602 | 206 |  682  | 00F | 20F | 40F
R/W    Read  Read  Write Read  Read                                  Write  Read  Read  Write Read
</pre>


Half of the mclk cycles are reserved for sprite Y parsing, the other half is for sprite rendering and CPU access.
If 17 sprites match: 381 read slots, 17 write slots, 96-17=79 fill slots, 3 waiting slots:
<pre>
RRRRRRWRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRWRRRRRRWRRRRRRWRRRRRWRWRWRRRRRRRRRWRRRRRRRRRRRRRRWRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRWRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---
</pre>


==Parsing==
==Case 3: Exactly 96 sprites match==
SCB3 (Y) is read (from fast VRAM $8200 to $8380). Each time there is a scanline match, a write state to the active sprite list is inserted (apparently 2 states after the corresponding SCB3 read).


Once SCB3 address $8380 is reached, only $0000 writes to the sprite list are done (in order to fill the rest of the sprite list with zeros).
Fast VRAM cycles:
<pre>
24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr    200  | 600 | 201 | 601 | 202 |  681  | 00E | 20E | 40E | 600 |  602  | 203 | 204 | 603 | 604 |  682  | 00F | 20F | 40F
R/W    Read  Write Read  Write Read                                  Write  Read  Read  Write Read
</pre>


No matter how many sprites are matched in the scanline, there will always be 384 SCB3 read states and 96 sprite list write states.
381 read slots, 96 write slots, 0 fill slots, 3 waiting slots:
<pre>
RRRRRRWRRRRWRRRRRRRWRRRRRWRWRWRRWRRRRRRRRRWRRRRRRWRRWRRRRWRRRRRWRWRWRRRRRWRRRRWRRRRRRRRWRRRRRRWRRRWR
WRWRRWRRRRRRRWRRRWRRRWRRRRWRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRWRRRWRWRWRWRRWRW
RRRRWRRWRRWRRRWRRRWRRWRRRRWRWRWRRWRRRWRWRWRRWRRRRWRRRRRWRRWRRWRRRRRRWRRRWRRRRWRRRRRWRRWRRRWRRRWRRRRR
WRWRWRWRWRWRRRWRRRWRRRRRRRWRRRWRWRRRWRRRRRRRRRWRRRRRRRWRWRWRWRRRRRRRRRRWRRWRRRRRRWRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRWRRRWRRRRWRRRWRRRRRRRWRRRRRRRWRRRRWRWRRRWRRWRRRRWRRRRRWRRRRRWR---
</pre>


This explains why sprite #0 cannot be used: this is the value used to pad the sprite list. If the list isn't filled (like most of the time), sprite #0 is rendered over and over again until the end is reached.
==Case 4: More than 96 sprites match==
 
Same as case 3, except after 96 "W"s, there are only useless "R"s.


==Rendering==
==Rendering==
The state order is :
*1: Read sprite list ($8600+/$8680+) to get sprite #
*2: Read SCB2 zoom values ($8000+)
*3: Read SCB3 Y pos/size/sticky ($8200+), Y pos ignored in this state ?
*4: Read SCB4 X pos ($8400+)
*5: Read/write from CPU if needed


==CPU access to fast VRAM==
# Read active list ($8600+ or $8680+) to get sprite #
# Read SCB2 zoom values ($8000+)
# Read SCB3 Y position, height, and chain bit ($8200+)
# Read SCB4 X position ($8400+)
 
The tile # and its attributes are also read from slow VRAM.
 
==CPU access to VRAM==
 
SNK says min. 12 68kclk between writes (so 24mclk). 1 write every 24mclk = 64 per scanline.
SNK says min. 12 68kclk between writes (so 24mclk). 1 write every 24mclk = 64 per scanline.


Why 12 and not 8 ?
CPU access occurs asynchronously with the 68000 bus -> storage in LSPC. If no write is requested, then the slots are occupied by reads, effectively updating one of the two read buffers continuously with the value pointed by the last used VRAM address.


CPU access during state #5 occurs asynchronously with the 68000 bus -> storage in LSPC.
=Buffers control=


My guess on the HW implementation is that during state #5, the GPU always reads the memory content pointed by REG_VRAMADDR and updates the read latch of REG_VRAMRW.
==CK signals==
CK1~4 signals are used to clock each of the 4 buffers.


If write latch REG_VRAMRW is written by the 68000, the next state #5 becomes a write access and REG_VRAMADDR is incremented by REG_VRAMMOD value.
* During rendering, the pulses often come in pairs (1+2 or 3+4) to render pixels 2 by 2 if the corresponding WE signal is asserted (opaque pixel). Horizontal shrinking causes pulses to be skipped, so that the buffer's address isn't incremented.
* During output, the pulses are slower and always alternate (1/2/1/2... or 3/4/3/4...) to output even/odd pixels in sequence.


Even if a theoretical limit of 16mclk or 8 68kclk is possible, some additional cycles are needed for the address and data to propagate through the chip.
* If the corresponding LD* signal is high, the buffer pointer is incremented (rendering left to right).
* If the corresponding LD* signal is low, the buffer pointer is loaded from the [[P bus]] (X position of sprite, or 0 to start line output).


Or maybe SNK has given the worst case scenario between slow Low VRAM and fast High VRAM ?
Inactive during H-blank.


Timing diagram when the sprite list is being filled:
==LD signals==
The LD1~2 signals are synchronous signals used to load the pointers for a buffer pair as two bytes.


<pre>
Example P bus values for 5 full-width sprites right next to each other, starting at X=0:
Parse        ################################                                ################################
 
Render                                      ##########################                                      ##########################
<pre>0000,0808,1010,1818,2020</pre>
24M    |'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
 
Addr  | 600 |  20F  | 210 | 211 | 600 | 601 |  684  | 005 | 205 | 405 | 600 |  212  | 213 | 602 | 603 | 214 |  685  | 006 | 206 | 406
Example P bus values for 5 full-width sprites right next to each other, starting at X=1 (pixel pairs will be flipped by NEO-ZMC2):
PCK1  ______|'''|___________________________________________________________|'''|_____________________________________________________
 
PCK1B  '''''''|___|'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''|___|''''''''''''''''''''''''''''''''''''''''''''''''''''
<pre>0100,0908,1110,1918,2120</pre>
LOAD  |'''''''|_______________________|'''''''|_______________________|'''''''|_______________________|'''''''|_______________________
 
12M    __|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|_
As sprite lines '''always''' take 16mclk to render, there's an LD* pulse every 16mclk to set the new starting address (X position) '''except''' for chained sprites. There's also always a single pulse just before output to reset the pointers to 0.
2Pixel      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |
 
/WE   ''''''''''''''''''''''''''|___|'|___|'''''''''''''''''''''''''''''''''''''''''''''''|___|'|___|'''''''''''''''''''''''''''''''''
==WE signals==
Read      ?      !    !    !                  !    !    !    !    ?      !    !                !      !    !    !    !
WE1~4 signals are used to tell if the pixel should be written to a buffer.
</pre>
 
During rendering, the pulses are synchronized to CK signals.
 
* If the pixel is opaque, there are both pulses at the same time (write pixel).
* If the pixel is transparent, there is a CK pulse but no WE pulse (skip pixel, move to next one).
* If the pixel is skipped for horizontal shrink, there are no pulses at all (do nothing).
 
During output, the pulses are also synchronized to CK signals and always present. This is used to clear the buffers to the backdrop color for the next rendering cycle.


*R/W sequences: (2 write buffers ?)
==SS signals==
*600 RRRWW... 600 RRWWR...
The SS1/2 signals enable clearing of buffer pairs, active during output.
*600 WWRRW... 600 WRRWW... 600 RRWWR ... 600 RWWRR
*Even lines: Write to list A, Read from list B (Start of display)
*Odd lines: Write to list B, Read from list A
*384px * 4clk/px = 1536clk/line


*Available CPU R/W slots depending on parsing progress, safest is ? cycles
==Others==


==CPU access to slow VRAM==
* TMS0 is used to flip the buffers, related to the lowest bit of the raster counter.
* The rising edge of PCK1 and PCK2 latches fix or sprite pixels from the cart ROMs.


* Slow VRAM is 100ns
Fix data is read 8 pixels in advance (32mclk, confirms what Charles wrote in mvstech.txt).
* 4 slots per render cycle, 1 slot for CPU R/W (1 each 16 68k cycles)


[[Category:Video system]]
[[Category:Video system]]

Latest revision as of 21:44, 5 December 2018

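Depending on the chipset, video is generated by three, two, or a single chip:

  • 3: LSPC-A0, PRO-B0, PRO-C0 (early)
  • 2: LSPC2-A2, NEO-B1 (most common)
  • 1: NEO-GRC (CD systems), NEO-GRZ (CDZ, MV-1C...)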

See graphics pipeline for an overview of the interconnections between chips and cartridges. See Display timing for the sync signal's timing.

There are two main parts in generating video:

  • An address generator (LSPC), which queries the graphics ROMs in the cartridges according to the data set in VRAM.
  • Line buffers, to which pixels decoded from the graphics ROM data can be written in any order.

Line buffers

To render sprites, the NeoGeo uses a pair of line buffers which are each 320 pixels long (a whole scanline). When one is used for rendering, the other one is shifted out for video output. Each new scanline, the buffers are flipped. This can be seen as a kind of double-buffering, allowing pixels to be rendered in any order.

To increase bandwidth, pixels are rendered two by two in sub-pairs: there are actually four 160-pixel-long buffers interleaved in an odd/even fashion. This scheme was inherited from the Alpha68k.

The fix layer pixels are rendered in real time over the buffers' output.
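The double-buffering scheme described above can be sketched roughly as follows. This is only an illustrative software model, not a description of the actual NEO-B1 logic: the class and method names are hypothetical, and clearing the buffer to a backdrop value while it is shifted out is taken from the WE signals description later in this page.

```python
LINE_WIDTH = 320  # pixels per scanline, as stated in the text


class LineBuffer:
    """One 320-pixel line buffer stored as two 160-entry even/odd halves."""

    def __init__(self, backdrop=0):
        self.even = [backdrop] * (LINE_WIDTH // 2)
        self.odd = [backdrop] * (LINE_WIDTH // 2)

    def write(self, x, color):
        # Pixels can be written in any order; bit 0 of X selects the half.
        half = self.odd if x & 1 else self.even
        half[x >> 1] = color

    def shift_out(self, backdrop=0):
        """Return the line in screen order and clear it for the next render."""
        line = []
        for i in range(LINE_WIDTH // 2):
            line.append(self.even[i])
            line.append(self.odd[i])
            self.even[i] = self.odd[i] = backdrop
        return line


class DoubleBuffer:
    """Pair of line buffers: one is rendered to while the other is output."""

    def __init__(self):
        self.render = LineBuffer()
        self.output = LineBuffer()

    def flip(self):
        # Hypothetical helper: called once per new scanline.
        self.render, self.output = self.output, self.render
```

In this model, sprite pixels written to `render` during one scanline only become visible after the next `flip()`, which is what allows them to be written in any order.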

Fix tiles

  • Fix pixels are output in time with the pixel clock (6MHz, 4mclk).
  • One fix tile line is therefore output in: 8 pixels * 4mclk = 32mclk. No variation.
  • The S ROM outputs 8 bits at a time so: 8 bits / 4bpp = 2 pixels at a time.
  • S ROM reads needed for one fix tile line: 8 pixels / 2 pixels per read = 4 reads.

Address sequence for one tile line:

A4  2H1  Pixel pair
1   0    A
1   1    B
0   0    C
0   1    D

2H1 bypasses the PCK* latches.
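The fix tile figures above can be checked with a few lines of arithmetic. The constant and function names below are made up for this sketch; the values come from the list and table above.

```python
MCLK_PER_PIXEL = 4        # 6MHz pixel clock = 4mclk per pixel
FIX_TILE_WIDTH = 8        # pixels per fix tile line
S_ROM_BITS_PER_READ = 8   # S ROM data bus width
BITS_PER_PIXEL = 4

PIXELS_PER_READ = S_ROM_BITS_PER_READ // BITS_PER_PIXEL   # 2
READS_PER_LINE = FIX_TILE_WIDTH // PIXELS_PER_READ        # 4
MCLK_PER_LINE = FIX_TILE_WIDTH * MCLK_PER_PIXEL           # 32


def fix_read_sequence():
    """Return (A4, 2H1, pixel pair) for each S ROM read of one tile line."""
    sequence = []
    for read in range(READS_PER_LINE):
        a4 = 1 if read < 2 else 0   # A4 is 1 for the first two reads
        h2_1 = read & 1             # 2H1 toggles on every read
        sequence.append((a4, h2_1, "ABCD"[read]))
    return sequence


for a4, h2_1, pair in fix_read_sequence():
    print(f"A4={a4} 2H1={h2_1} -> pixel pair {pair}")
```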

Sprite tiles

16mclk = 16 pixels, 8 pixels per read.

  • Sprite pixels are rendered two-by-two at 12MHz (2mclk).
  • One sprite tile line is therefore rendered in: 16 pixels / 2 * 2mclk = 16mclk. No variation, even if shrinking is used.
  • The C ROM outputs 2 * 16 = 32 bits at a time so: 32 bits / 4bpp = 8 pixels at a time.
  • C ROM reads needed for one sprite tile line: 16 pixels / 8 pixels per read = 2 reads.

Address sequence for one tile line:

CA4  8-pixel line
1    A
0    B

CA4 bypasses the PCK* latches.
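The same kind of check works for sprite tiles. The names are again hypothetical; the 1536mclk scanline length is the figure used in the sprite parsing section of this page.

```python
SPRITE_TILE_WIDTH = 16        # pixels per sprite tile line
C_ROM_BITS_PER_READ = 2 * 16  # two 16-bit C ROM outputs read together
BITS_PER_PIXEL = 4
MCLK_PER_PIXEL_PAIR = 2       # pixels are rendered two by two at 12MHz
MCLK_PER_SCANLINE = 1536

PIXELS_PER_READ = C_ROM_BITS_PER_READ // BITS_PER_PIXEL             # 8
READS_PER_LINE = SPRITE_TILE_WIDTH // PIXELS_PER_READ               # 2
MCLK_PER_LINE = SPRITE_TILE_WIDTH // 2 * MCLK_PER_PIXEL_PAIR        # 16
MAX_SPRITES_PER_SCANLINE = MCLK_PER_SCANLINE // MCLK_PER_LINE       # 96

# CA4 value for each of the two reads: 1 for line half A, 0 for half B.
CA4_SEQUENCE = [(1, "A"), (0, "B")]

print(PIXELS_PER_READ, READS_PER_LINE, MCLK_PER_LINE, MAX_SPRITES_PER_SCANLINE)
```

Since a tile line always costs 16mclk, the 96 sprites per scanline limit follows directly from the scanline length and does not depend on shrinking.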

Active lists

The NeoGeo uses a pair of active lists, which hold the numbers of the sprites that need to be rendered on the next scanline. As with the line buffers, the active lists are swapped every new scanline, so that one is being filled by parsing while the other one is used for rendering.

They are located in the fast VRAM at addresses $8600 and $8680. Each list is 96 entries long.
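As a rough illustration, the two lists can be modelled like this. Which list is filled on even lines and the one-word-per-entry layout are assumptions of this sketch (the latter is suggested by the consecutive addresses in the timing diagrams); the function names are hypothetical.

```python
ACTIVE_LIST_BASES = (0x8600, 0x8680)  # fast VRAM addresses from the text
ACTIVE_LIST_LENGTH = 96


def active_list_roles(scanline):
    """Return (base being filled, base being rendered) for a scanline.

    Assumption: the list at $8600 is filled on even lines.
    """
    parity = scanline & 1
    return ACTIVE_LIST_BASES[parity], ACTIVE_LIST_BASES[parity ^ 1]


def entry_address(base, index):
    """Address of entry `index`, assuming one VRAM word per entry."""
    if not 0 <= index < ACTIVE_LIST_LENGTH:
        raise IndexError("active lists only hold 96 entries")
    return base + index


fill, render = active_list_roles(0)
print(f"fill ${fill:04X}, render ${render:04X}")
print(f"last entry of first list: ${entry_address(0x8600, 95):04X}")
```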

Slow VRAM access slots

Slow VRAM has four 4mclk-long access slots that always run in the same sequence:

  1. Read sprite map even word
  2. Read sprite map odd word
  3. Read fix map
  4. Read/Write for CPU
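The fixed slot sequence can be expressed as a simple lookup. Aligning slot 1 with mclk 0 is an arbitrary choice made for this sketch, and the names are hypothetical.

```python
SLOW_SLOT_MCLK = 4
SLOW_SLOTS = [
    "Read sprite map even word",
    "Read sprite map odd word",
    "Read fix map",
    "Read/Write for CPU",
]
SLOW_CYCLE_MCLK = SLOW_SLOT_MCLK * len(SLOW_SLOTS)  # 16mclk per full sequence


def slow_vram_slot(mclk):
    """Return the slot active at a given mclk count (slot 1 starts at 0)."""
    return SLOW_SLOTS[(mclk // SLOW_SLOT_MCLK) % len(SLOW_SLOTS)]


for t in range(0, SLOW_CYCLE_MCLK, SLOW_SLOT_MCLK):
    print(f"mclk {t:2d}: {slow_vram_slot(t)}")
```

The full sequence lasts 16mclk, the same duration as one sprite tile line.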

Fast VRAM access slots

Fast VRAM access is faster and more complex. There are 10 access slots of varying widths that always run in the same sequence; they can be seen as 5 parsing slots and 5 rendering slots:

Slot #  Duration  Description
1       2mclk     Parsing
2       1.5mclk   Parsing
3       1.5mclk   Parsing
4       1.5mclk   Parsing
5       1.5mclk   Parsing
6       2mclk     Read active list
7       1.5mclk   Read SCB2
8       1.5mclk   Read SCB3
9       1.5mclk   Read SCB4
10      1.5mclk   Read/write for CPU

In the timing diagram (timing_gpu1.png), yellow marks the parsing cycles, purple the active list read, green the SCB* reads for rendering, and red the CPU access.

The parsing cycles aren't consistent: they depend on which sprites match. One cycle reads from SCB3 to test whether the sprite's Y position matches the current raster line. If there's a match, the next cycle is a write to the active list. Otherwise it's another read cycle.

Fast VRAM has to be fast (45ns) because the shortest slots last only 1.5mclk (62.5ns). Slots of 1mclk (41.6ns) would have been too short for such SRAM, and faster SRAM was already expensive.
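The slot durations can be checked against the 16mclk cycle and the SRAM access time. This sketch assumes a 24MHz master clock, which is what the 6MHz = 4mclk pixel clock implies; the names are hypothetical.

```python
from fractions import Fraction

MCLK_HZ = 24_000_000  # assumed master clock
SRAM_ACCESS_NS = 45   # fast VRAM speed quoted in the text

# (description, duration in mclk) for the 10 fast VRAM slots.
FAST_SLOTS = [
    ("Parsing", Fraction(2)),
    ("Parsing", Fraction(3, 2)),
    ("Parsing", Fraction(3, 2)),
    ("Parsing", Fraction(3, 2)),
    ("Parsing", Fraction(3, 2)),
    ("Read active list", Fraction(2)),
    ("Read SCB2", Fraction(3, 2)),
    ("Read SCB3", Fraction(3, 2)),
    ("Read SCB4", Fraction(3, 2)),
    ("Read/write for CPU", Fraction(3, 2)),
]


def mclk_to_ns(mclk):
    """Convert a duration in mclk to nanoseconds."""
    return float(mclk) * 1e9 / MCLK_HZ


TOTAL_MCLK = sum(duration for _, duration in FAST_SLOTS)
SHORTEST_MCLK = min(duration for _, duration in FAST_SLOTS)

print(f"cycle length: {TOTAL_MCLK} mclk")
print(f"shortest slot: {mclk_to_ns(SHORTEST_MCLK)} ns")
print(f"1 mclk: {mclk_to_ns(1):.1f} ns")
```

The ten slots add up to exactly 16mclk, and only the 1.5mclk slots leave any margin over a 45ns access.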

Sprite parsing

This is still a draft. The following information shouldn't be considered as correct.

LSPC splits the workload needed to render sprites into two passes: parsing and rendering.

  • Parsing for a raster line N is done during line N-2
  • Rendering is done during line N-1
  • Finally the line is ready for output just at the right time.
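Put differently, three raster lines are in flight at any time. A tiny sketch (hypothetical function name, frame wrap-around ignored):

```python
def lines_in_flight(current_line):
    """Return which raster line each stage works on during `current_line`."""
    return {
        "parsing": current_line + 2,    # line N is parsed during line N-2
        "rendering": current_line + 1,  # line N is rendered during line N-1
        "output": current_line,
    }


print(lines_in_flight(10))
```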

During parsing, the Y positions of 381 sprites are read to see if they will be visible on line N. If that's the case, the sprite number is written to the active list currently being filled. Writes go on until 381 sprites have been parsed OR the active list is full (96 sprite numbers written), whichever comes first.

  • If sprite #382 is reached, the remaining time is used to fill the active list up to 96 entries with zeros.
  • If the active list is full, sprites are still parsed up to #382 but nothing more is written to the active list, regardless of the matching result.

No matter how many sprites are matched in the scanline, there will always be 381 SCB3 reads and 96 active list writes.

This explains why sprite #0 cannot be used: it is the value used to top up the active list. If there are fewer than 96 sprite matches (as is the case most of the time), sprite #0 will be rendered over and over again until the end of the list is reached.

In the next paragraphs, each character represents a parsing slot: R is an SCB3 read, W is a sprite number write to the active list, F is a filling write to the active list, - is just idle waiting. There are always 1536mclk per line / 16mclk per cycle * 5 slots per cycle = 480 slots.
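The slot patterns shown for the cases in this section can be reproduced with a small simulation. It is only a sketch of the behaviour as described on this page (which is itself marked as a draft): a write slot is assumed to directly follow the read that matched, and `parsing_slots` is a hypothetical name.

```python
SPRITES_PARSED = 381
ACTIVE_LIST_LENGTH = 96
SLOTS_PER_LINE = 1536 // 16 * 5  # 480 parsing slots per scanline


def parsing_slots(matches):
    """Return the R/W/F/- slot string for one scanline.

    `matches` holds one boolean per parsed sprite: True if that sprite
    is visible on the line being parsed.
    """
    matches = list(matches)
    if len(matches) != SPRITES_PARSED:
        raise ValueError("expected one match flag per parsed sprite (381)")
    slots = []
    written = 0
    for hit in matches:
        slots.append("R")                     # SCB3 read
        if hit and written < ACTIVE_LIST_LENGTH:
            slots.append("W")                 # sprite number write
            written += 1
    slots.extend("F" * (ACTIVE_LIST_LENGTH - written))  # top up with zeros
    slots.extend("-" * (SLOTS_PER_LINE - len(slots)))   # idle slots
    return "".join(slots)


pattern = parsing_slots(i % 20 == 0 for i in range(SPRITES_PARSED))
print(pattern.count("R"), pattern.count("W"), pattern.count("F"), pattern.count("-"))
```

Whatever the number of matches, the string is 480 characters long, with 381 reads, 96 writes (W plus F) and 3 idle slots.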

Case 1: Not a single sprite match

Fast VRAM cycles:

24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr     200  | 201 | 202 | 203 | 204 |  681  | 00E | 20E | 40E | 600 |  205  | 206 | 207 | 208 | 209 |  682  | 00F | 20F | 40F
R/W     Read   Read  Read  Read  Read                                   Read   Read  Read  Read  Read

381 read slots, 96 fill slots, 3 waiting slots:

RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---

Case 2: Some sprites match

Fast VRAM cycles:

24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr     200  | 201 | 600 | 202 | 203 |  681  | 00E | 20E | 40E | 600 |  601  | 204 | 205 | 602 | 206 |  682  | 00F | 20F | 40F
R/W     Read   Read  Write Read  Read                                   Write  Read  Read  Write Read

If 17 sprites match: 381 read slots, 17 write slots, 96-17=79 fill slots, 3 waiting slots:

RRRRRRWRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRWRRRRRRWRRRRRRWRRRRRWRWRWRRRRRRRRRWRRRRRRRRRRRRRRWRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRWRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF---

Case 3: Exactly 96 sprites match

Fast VRAM cycles:

24M    _|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr     200  | 600 | 201 | 601 | 202 |  681  | 00E | 20E | 40E | 600 |  602  | 203 | 204 | 603 | 604 |  682  | 00F | 20F | 40F
R/W     Read   Write Read  Write Read                                   Write  Read  Read  Write Read

381 read slots, 96 write slots, 0 fill slots, 3 waiting slots:

RRRRRRWRRRRWRRRRRRRWRRRRRWRWRWRRWRRRRRRRRRWRRRRRRWRRWRRRRWRRRRRWRWRWRRRRRWRRRRWRRRRRRRRWRRRRRRWRRRWR
WRWRRWRRRRRRRWRRRWRRRWRRRRWRRRRRRRRWRWRRRRWRWRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRRWRRRWRWRWRWRRWRW
RRRRWRRWRRWRRRWRRRWRRWRRRRWRWRWRRWRRRWRWRWRRWRRRRWRRRRRWRRWRRWRRRRRRWRRRWRRRRWRRRRRWRRWRRRWRRRWRRRRR
WRWRWRWRWRWRRRWRRRWRRRRRRRWRRRWRWRRRWRRRRRRRRRWRRRRRRRWRWRWRWRRRRRRRRRRWRRWRRRRRRWRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRWRRRWRRRRWRRRWRRRRRRRWRRRRRRRWRRRRWRWRRRWRRWRRRRWRRRRRWRRRRRWR---

Case 4: More than 96 sprites match

Same as case 3, except that once 96 "W"s have been written, the remaining slots are only "R"s whose results are ignored.

Rendering

  1. Read active list ($8600+ or $8680+) to get sprite #
  2. Read SCB2 zoom values ($8000+)
  3. Read SCB3 Y position, height, and chain bit ($8200+)
  4. Read SCB4 X position ($8400+)

The tile # and its attributes are also read from slow VRAM.

CPU access to VRAM

SNK specifies a minimum of 12 68kclk (24mclk) between writes. At one write every 24mclk, that allows 1536 / 24 = 64 writes per scanline.

CPU access is asynchronous to the 68000 bus, so the request has to be stored in LSPC until the CPU slot comes up. If no write is requested, the slots are occupied by reads, which continuously update one of the two read buffers with the value pointed to by the last used VRAM address.
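The write budget and the behaviour of the CPU slot can be sketched as follows. The `CpuSlotModel` class is purely hypothetical: it models a single pending write and a single read latch, and ignores address auto-increment and the second read buffer.

```python
MCLK_PER_SCANLINE = 1536
MIN_68KCLK_BETWEEN_WRITES = 12
MCLK_PER_68KCLK = 2  # 12 68kclk = 24mclk, per the text

MCLK_BETWEEN_WRITES = MIN_68KCLK_BETWEEN_WRITES * MCLK_PER_68KCLK   # 24
MAX_WRITES_PER_SCANLINE = MCLK_PER_SCANLINE // MCLK_BETWEEN_WRITES  # 64


class CpuSlotModel:
    """Toy model of the VRAM slot reserved for the CPU."""

    def __init__(self, vram):
        self.vram = vram
        self.address = 0
        self.pending_write = None  # value stored until the next CPU slot
        self.read_latch = 0

    def cpu_write(self, value):
        # The 68000 write is asynchronous: just remember it for now.
        self.pending_write = value

    def cpu_slot(self):
        # Called each time the CPU access slot comes up.
        if self.pending_write is not None:
            self.vram[self.address] = self.pending_write
            self.pending_write = None
        else:
            # No write requested: refresh the read latch instead.
            self.read_latch = self.vram[self.address]


print(MAX_WRITES_PER_SCANLINE)
```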

Buffers control

CK signals

CK1~4 signals are used to clock each of the 4 buffers.

  • During rendering, the pulses often come in pairs (1+2 or 3+4) to render pixels 2 by 2 if the corresponding WE signal is asserted (opaque pixel). Horizontal shrinking causes pulses to be skipped, so that the buffer's address isn't incremented.
  • During output, the pulses are slower and always alternate (1/2/1/2... or 3/4/3/4...) to output even/odd pixels in sequence.
  • If the corresponding LD* signal is high, the buffer pointer is incremented (rendering left to right).
  • If the corresponding LD* signal is low, the buffer pointer is loaded from the P bus (X position of sprite, or 0 to start line output).

The CK signals are inactive during H-blank.

LD signals

The LD1~2 signals are synchronous signals used to load the pointers for a buffer pair as two bytes.

Example P bus values for 5 full-width sprites right next to each other, starting at X=0:

0000,0808,1010,1818,2020

Example P bus values for 5 full-width sprites right next to each other, starting at X=1 (pixel pairs will be flipped by NEO-ZMC2):

0100,0908,1110,1918,2120

As sprite lines always take 16mclk to render, there's an LD* pulse every 16mclk to set the new starting address (X position) except for chained sprites. There's also always a single pulse just before output to reset the pointers to 0.
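Both example sequences above are consistent with the two bytes being X/2 pointers for the two buffers of a pair, the upper byte being rounded up when X is odd. This is an inference from the two examples only, not a confirmed description of the hardware, and the function names are hypothetical.

```python
def ld_pointer_word(x):
    """Return the 16-bit P bus value loaded for a sprite at X position `x`.

    Assumption: upper byte = (x + 1) // 2, lower byte = x // 2.
    """
    upper = ((x + 1) >> 1) & 0xFF
    lower = (x >> 1) & 0xFF
    return (upper << 8) | lower


def adjacent_sprites(start_x, count, width=16):
    """P bus values for `count` full-width sprites placed side by side."""
    return [ld_pointer_word(start_x + i * width) for i in range(count)]


print(",".join(f"{w:04X}" for w in adjacent_sprites(0, 5)))
print(",".join(f"{w:04X}" for w in adjacent_sprites(1, 5)))
```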

WE signals

WE1~4 signals are used to tell if the pixel should be written to a buffer.

During rendering, the pulses are synchronized to CK signals.

  • If the pixel is opaque, there are both pulses at the same time (write pixel).
  • If the pixel is transparent, there is a CK pulse but no WE pulse (skip pixel, move to next one).
  • If the pixel is skipped for horizontal shrink, there are no pulses at all (do nothing).

During output, the pulses are also synchronized to CK signals and always present. This is used to clear the buffers to the backdrop color for the next rendering cycle.
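The three rendering cases can be summarised in code. This sketch treats the line buffer as a single list (the even/odd split is ignored), assumes color index 0 stands for a transparent pixel, and uses hypothetical names.

```python
TRANSPARENT = 0  # assumed color index for transparent pixels


def render_pulses(opaque, skipped):
    """Return (CK pulse, WE pulse) for one pixel during rendering."""
    if skipped:
        return (False, False)  # horizontal shrink: do nothing
    return (True, opaque)      # CK always; WE only for opaque pixels


def render_pixels(buffer, start, pixels):
    """Apply a run of (color, skipped) pixels; return the final pointer."""
    pointer = start
    for color, skipped in pixels:
        ck, we = render_pulses(color != TRANSPARENT, skipped)
        if we:
            buffer[pointer] = color  # write pixel
        if ck:
            pointer += 1             # move to the next buffer position
    return pointer


line = [9] * 8
end = render_pixels(line, 2, [(5, False), (0, False), (7, True), (6, False)])
print(line, end)
```

Transparent pixels leave the previous buffer content in place, while skipped pixels do not even advance the pointer, which is how horizontal shrinking compresses a tile line.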

SS signals

The SS1/2 signals enable clearing of the buffer pairs. They are active during output.

Others

  • TMS0 is used to flip the buffers. It is related to the lowest bit of the raster counter.
  • The rising edge of PCK1 and PCK2 latches fix or sprite pixels from the cart ROMs.

Fix data is read 8 pixels (32mclk) in advance, which confirms what Charles wrote in mvstech.txt.