Rendering logic: Difference between revisions

From NeoGeo Development Wiki

Revision as of 21:28, 10 May 2017

On the NeoGeo hardware, the GPU (Graphics Processing Unit), a.k.a. VDP, may refer to a chip or a group of different chips used to generate the video signal:

  • LSPC-A0, PRO-B0 (early)
  • LSPC2-A2, NEO-B1 (most common)
  • NEO-GRC, NEO-OFC (CD systems)

See graphics pipeline for an overview of the interconnections between chips and cartridges.

Graphics fetch

Tile pixel lines are rendered in halves.

Fix tiles

32mclk = 8 pixels, 2 pixels per read.

  • S1 ROM address is ...1**** (PCK2 pulse)
    • 2H1 is 0 for 2 pixels (columns 0 & 1)
    • Then 1 for 2 pixels (columns 2 & 3)
  • S1 ROM address is ...0**** (PCK2 pulse)
    • 2H1 is 0 for 2 pixels (columns 4 & 5)
    • Then 1 for 2 pixels (columns 6 & 7)
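
The fetch order above can be sketched as a lookup from pixel column to the S ROM address bit and the 2H1 level. This is only an illustration: the function name `fix_fetch_for_column` is hypothetical, and reading `...1****` as "address bit 4 is set" is an assumption.

```python
def fix_fetch_for_column(column):
    """Return (address_bit4, h2_1) for a fix tile pixel column (0..7).

    Assumption: "...1****" means bit 4 of the S ROM address is 1.
    """
    if not 0 <= column <= 7:
        raise ValueError("fix tiles are 8 pixels wide")
    address_bit4 = 1 if column < 4 else 0   # columns 0~3 are fetched with bit 4 set
    h2_1 = (column >> 1) & 1                # 2H1 toggles every 2 pixels
    return address_bit4, h2_1


for col in range(8):
    print(col, fix_fetch_for_column(col))
```

Each (address bit, 2H1) pair covers 2 pixels, which matches the "2 pixels per read" figure given for fix tiles.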

Sprite tiles

16mclk = 16 pixels, 8 pixels per read.

  • C ROM address is ...1***** (PCK1 pulse) (columns 0~7)
  • C ROM address is ...0***** (PCK1 pulse) (columns 8~15)

CA4 is used to select upper/lower tiles (see Sprite graphics format).

Video generation

See Display timing for the sync signal's timing.

NEO-B1 is used for double-buffering scanlines. While one buffer is output to the screen, the other one is filled up. They're swapped each new scanline. Each of the two line buffers is actually made of 2 buffers, one for even pixels and one for odd pixels. They will be named (1 & 2) and (3 & 4).

CSK signals

CSK1~4 signals are used to clock each buffer (on rising edge ?).

During rendering, the pulses usually come in pairs (1+2 and 3+4) to render pixels 2 by 2. Horizontal scaling causes CSK pulses to be skipped.

During output, the pulses are synchronized and alternate (1+2 then 3+4 then 1+2...) to output even/odd pixels in sequence.

  • If the corresponding LD* signal is high, then the buffer pointer is incremented (rendering is always done left to right ?).
  • If the corresponding LD* signal is low, then the buffer pointer is loaded from the P bus (X position of sprite, or 0 to start line output).

Inactive during H-blank.
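
The load/increment behaviour of a buffer pointer can be modelled as below. This is a toy sketch of the observations above, not a description of the actual B1 internals: the class and method names are hypothetical, and pointer width and wrap-around are not modelled.

```python
class BufferPointer:
    """Toy model of one B1 line buffer pointer (hypothetical names)."""

    def __init__(self):
        self.position = 0

    def csk_pulse(self, ld_high, p_bus_value=0):
        """Apply one CSK pulse.

        ld_high=True  -> the pointer is incremented.
        ld_high=False -> the pointer is loaded from the P bus value.
        """
        if ld_high:
            self.position += 1
        else:
            self.position = p_bus_value
        return self.position


ptr = BufferPointer()
ptr.csk_pulse(ld_high=False, p_bus_value=0x18)  # load a sprite X position
ptr.csk_pulse(ld_high=True)                     # step to the next pixel
ptr.csk_pulse(ld_high=True)
print(hex(ptr.position))                        # 0x1a
```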

LD signals

The LD1~2 signals set the X position in B1. The value is X/2, since it is loaded into both the even and odd buffers.

Example P bus data for 20 sprites right next to each other (X+=16px):

0000,0808,1010,1838,2000,2808,3010,3838,40C0,48E8,50F0,58F8,60C0,68E8,70F0,78F8,8000,8808,9010,9838,0,0,0...

The lower byte might just be garbage ignored by B1.
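
As a quick check of the /2 relationship, the example words can be decoded by taking the upper byte and doubling it. This sketch assumes the upper byte alone carries the halved X position and ignores the lower byte entirely; the function name is hypothetical.

```python
# Example P bus words captured for 20 adjacent sprites (from the text above).
P_BUS_WORDS = [
    0x0000, 0x0808, 0x1010, 0x1838, 0x2000, 0x2808, 0x3010, 0x3838,
    0x40C0, 0x48E8, 0x50F0, 0x58F8, 0x60C0, 0x68E8, 0x70F0, 0x78F8,
    0x8000, 0x8808, 0x9010, 0x9838,
]


def sprite_x_from_word(word):
    """Recover the sprite X position, assuming the upper byte holds X/2."""
    upper = (word >> 8) & 0xFF
    return upper * 2


xs = [sprite_x_from_word(w) for w in P_BUS_WORDS]
print(xs)  # 0, 16, 32, ... 304: one sprite every 16 pixels
```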

WSE signals

WSE1~4 signals are used to indicate if the pixel color index on GAD/GBD needs to be written to the buffer.

During rendering, the pulses are synchronized to CSK signals.

  • If the pixel is transparent, there is a CSK pulse but no WSE pulse.
  • If the pixel is opaque, there are both pulses at the same time.
  • If the pixel is skipped for horizontal reduction, there are no pulses at all.

During output, the pulses are also synchronized to CSK signals and always present. This is used to clear to the backdrop color for the next rendering cycle.

The buffer's /WE signal seems to be (CSK & WSE).
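
The three rendering cases and the resulting write strobe can be summarized in a small table. This is a sketch of the observations above: the names are hypothetical, pulses are represented as 1 (present) or 0 (absent), and the electrical polarity of /WE is not modelled.

```python
# (CSK pulse, WSE pulse) for each kind of pixel during rendering.
RENDER_PULSES = {
    "opaque": (1, 1),        # pointer advances and the pixel is written
    "transparent": (1, 0),   # pointer advances, nothing is written
    "skipped": (0, 0),       # horizontal reduction: no pulses at all
}


def buffer_write(csk, wse):
    """Write strobe as observed: a write happens only when CSK and WSE both pulse."""
    return csk & wse


for kind, (csk, wse) in RENDER_PULSES.items():
    print(kind, "write" if buffer_write(csk, wse) else "no write")
```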

SS signals

The complementary pair of SS* signals from LSPC tell B1 how the buffers are used:

  • SS1 low & SS2 high: Buffers 1&2 are written to. Buffers 3&4 are output to the TV.
  • SS1 high & SS2 low: Buffers 1&2 are output to the TV. Buffers 3&4 are written to.

Notes

  • What's the use of TMS0 ? Not buffer flip, SS1~2 do that.
  • Fix pixel rendering: 4mclk (realtime)
  • Sprite pixel rendering: 1mclk
  • LSPC runs at 24MHz, but generates signals on rising and falling edges ("48MHz")

A single sprite line (16 pixels) always takes 16mclk, whatever the zoom value or the amount of transparent pixels.

It can also be observed that there are always 96 pulses on LD* during rendering, since 1536mclk per line / 16mclk per sprite = 96 sprites max per line.

  • The rising edge of PCK1 and PCK2 latches fix or sprite pixels from the cart ROMs.
  • 1H1 is probably used to split pixels of FIXD between left and right.

Fix data is read 8 pixels in advance (confirms what Charles wrote in mvstech.txt). This seems to be inherited from the Alpha68k as 3 successive 8bit registers forming a waiting "pipeline".

Sprite parsing

This is a draft. The following information shouldn't be considered as exact.

To do: Edit waveforms, FP and SP windows of the P BUS start 0.5mclk earlier (1.5,5,1.5,1.5,5,1.5 = 16).

  • Fast VRAM is 35ns (<1mclk), slow VRAM is 100ns (<2.5mclk)
  • The fast VRAM reads always occur 1mclk (41.6ns) after the address is set. Smallest access window is 1.5mclk.
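
The access time figures can be checked against the mclk period. This sketch assumes a nominal 24MHz master clock (the exact crystal frequency may differ slightly between systems); the helper name `fits` is hypothetical.

```python
MCLK_HZ = 24_000_000          # assumption: nominal 24MHz master clock
mclk_ns = 1e9 / MCLK_HZ       # about 41.67ns per mclk


def fits(access_ns, window_mclk):
    """True if a RAM with the given access time fits in a window of N mclk."""
    return access_ns < window_mclk * mclk_ns


print(round(mclk_ns, 2))      # 41.67
print(fits(35, 1))            # fast VRAM within 1mclk
print(fits(100, 2.5))         # slow VRAM within 2.5mclk
```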

  • FIXT: P23~16 is 0, P15~0 is S ROM address (+ external 2H1)
  • SPRT: P23~0 is C ROM address (+ external CA4)
  • LO: P23~16 is LO ROM data, P15~0 is LO address
  • FP: P19~16 is the fix tile palette, rest is 0
  • SP: P23~16 is the sprite tile palette, P15~8 is X position, P7~0 is ?
  • LSPC always starts filling up the active sprite list A ($8600) each new frame.
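
The FP and SP field layouts listed above can be expressed as simple bit extractions. This is a minimal sketch: the function and key names are hypothetical, and the meaning of P7~0 during SP is left as unknown, as in the list.

```python
def decode_fp(p_bus):
    """FP window: P19~16 is the fix tile palette, the rest is 0."""
    return {"fix_palette": (p_bus >> 16) & 0xF}


def decode_sp(p_bus):
    """SP window: P23~16 sprite palette, P15~8 X position, P7~0 unknown."""
    return {
        "sprite_palette": (p_bus >> 16) & 0xFF,
        "x_position": (p_bus >> 8) & 0xFF,
        "unknown_low": p_bus & 0xFF,
    }


# 0x121838 is a made-up 24-bit value used only to show the field split.
print(decode_sp(0x121838))
print(decode_fp(0x0A0000))
```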

Read sequence:

Timing diagram when no sprites fall in the next scanline (no writes to sprite list):

Parse        ################################                                ################################
Render                                       ##########################                                      ##########################
24M    |'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr   | 600 |  200  | 201 | 202 | 203 | 204 |  681  | 00E | 20E | 40E | 600 |  205  | 206 | 207 | 208 | 209 |  682  | 00F | 20F | 40F
PCK1   ______|'''|___________________________________________________________|'''|_____________________________________________________
PCK1B  '''''''|____|''''''''''''''''''''''''''''''''''''''''''''''''''''''''''|___|''''''''''''''''''''''''''''''''''''''''''''''''''''
LOAD   |'''''''|_______________________|'''''''|_______________________|'''''''|_______________________|'''''''|_______________________
12M    __|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|_
2Pixel       |       |       |       |       |       |       |       |       |       |       |       |       |       |       |       |
Read       ?       !     !     !     !     !       !     !     !     !     ?       !     !     !     !     !       !     !     !     !
What      1      2      2     2     2     2      3      4     5     6     1      2      2     2     2     2      3      4     5     6...
  • 1: Probably CPU access slot with last address latched ($600)
  • 2: Read sprite Y position from SCB3 ($200+) to see if it's in next scanline
  • 3: Read sprite list ($600+) to get sprite #
  • 4: Read SCB2 zoom values ($000+)
  • 5: Read SCB3 Y/size/chain ($200+)
  • 6: Read SCB4 X ($400+)

10 states in 16 cycles (or 5 in 8 cycles: 4-3-3-3-3).

One scanline lasts 1536mclk cycles = 96 sequences of 16mclk cycles.

Half of the mclk cycles are reserved for sprite Y parsing, the other half is for sprite rendering and CPU access.

Parsing

SCB3 (Y) is read (from fast VRAM $8200 to $8380). Each time there is a scanline match, a write state to the active sprite list is inserted (apparently 2 states after the corresponding SCB3 read).

Once SCB3 address $8380 is reached, only $0000 writes to the sprite list are done (in order to fill the rest of the sprite list with zeros).

No matter how many sprites are matched in the scanline, there will always be 384 SCB3 read states and 96 sprite list write states.
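
The state counts can be cross-checked with a little arithmetic. This sketch assumes that 5 of the 10 states in each 16mclk sequence belong to parsing (the "half" mentioned above); the constant names are hypothetical.

```python
MCLK_PER_LINE = 1536
MCLK_PER_SEQUENCE = 16
PARSE_STATES_PER_SEQUENCE = 5   # assumption: half of the 10 states per sequence

sequences = MCLK_PER_LINE // MCLK_PER_SEQUENCE           # 96 sequences per line
parse_states = sequences * PARSE_STATES_PER_SEQUENCE     # 480 parsing states
scb3_reads = 0x8380 - 0x8200                             # 384 SCB3 entries scanned
list_writes = sequences                                  # 96 sprite list writes

print(sequences, parse_states, scb3_reads, list_writes)
assert scb3_reads + list_writes == parse_states
```

Under that assumption the 384 reads and 96 writes exactly fill the parsing half of a scanline.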

This explains why sprite #0 cannot be used: this is the value used to pad the sprite list. If the list isn't filled (which is the case most of the time), sprite #0 is rendered over and over again until the end of the list is reached.
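
The padding behaviour can be illustrated with a small model of the list being built. This is a sketch only: the Y match test is abstracted away (the caller passes the sprite numbers that matched), the function name is hypothetical, and dropping matches beyond the 96th entry is an assumption.

```python
LIST_SIZE = 96  # one entry per 16mclk sequence


def build_active_list(matching_sprites):
    """Return the 96-entry active sprite list for one scanline.

    matching_sprites: sprite numbers whose Y range hits the scanline,
    in parsing order. Unused entries are padded with 0, which is why
    sprite #0 ends up being rendered as filler.
    """
    entries = list(matching_sprites)[:LIST_SIZE]   # assumption: extra matches are dropped
    entries += [0] * (LIST_SIZE - len(entries))
    return entries


active = build_active_list([5, 6, 20])
print(active[:5], len(active))   # [5, 6, 20, 0, 0] 96
```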

Rendering

The state order is :

  • 1: Read sprite list ($8600+/$8680+) to get sprite #
  • 2: Read SCB2 zoom values ($8000+)
  • 3: Read SCB3 Y pos/size/sticky ($8200+), Y pos ignored in this state ?
  • 4: Read SCB4 X pos ($8400+)
  • 5: Read/write from CPU if needed
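
The first four states amount to one list read followed by three attribute reads indexed by the sprite number. The sketch below computes those addresses; the function and constant names are hypothetical, and using $8680 as the base of the second list is inferred from the "$8600+/$8680+" note above.

```python
LIST_A, LIST_B = 0x8600, 0x8680            # active sprite lists (B base inferred)
SCB2, SCB3, SCB4 = 0x8000, 0x8200, 0x8400  # fast VRAM attribute blocks


def render_reads(slot, sprite, list_base=LIST_B):
    """Addresses read in states 1~4 for list entry `slot` holding `sprite`."""
    return [
        ("list", list_base + slot),   # state 1: get sprite #
        ("SCB2", SCB2 + sprite),      # state 2: zoom values
        ("SCB3", SCB3 + sprite),      # state 3: Y pos/size/sticky
        ("SCB4", SCB4 + sprite),      # state 4: X pos
    ]


for name, addr in render_reads(slot=1, sprite=0x0E):
    print(name, hex(addr))
```

With slot 1 and sprite $0E this gives $8681, $800E, $820E, $840E, the same pattern as the `681, 00E, 20E, 40E` run in the first timing diagram.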

CPU access to fast VRAM

SNK says min. 12 68kclk between writes (so 24mclk). 1 write every 24mclk = 64 per scanline.

Why 12 and not 8 ?

CPU access during state #5 occurs asynchronously with the 68000 bus, so the data presumably has to be stored in LSPC in the meantime.

My guess on the HW implementation is that during state #5, the GPU always reads the memory content pointed by REG_VRAMADDR and updates the read latch of REG_VRAMRW.

If write latch REG_VRAMRW is written by the 68000, the next state #5 becomes a write access and REG_VRAMADDR is incremented by REG_VRAMMOD value.
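
The guess above can be written down as a small behavioural model. Everything here is speculative, like the guess itself: the class and attribute names are hypothetical, the memory is a flat 64K-word array for simplicity, and a pending write is assumed to replace the read during that state #5.

```python
class VramPort:
    """Speculative model of CPU access to VRAM during state #5."""

    def __init__(self, size=0x10000):
        self.mem = [0] * size
        self.addr = 0              # REG_VRAMADDR
        self.mod = 0               # REG_VRAMMOD
        self.read_latch = 0        # value seen when the 68000 reads REG_VRAMRW
        self.pending_write = None  # value written to REG_VRAMRW, if any

    def cpu_write_rw(self, value):
        """68000 writes REG_VRAMRW: the value waits for the next state #5."""
        self.pending_write = value & 0xFFFF

    def state5(self):
        """One state #5 slot: write if something is pending, else refresh the read latch."""
        if self.pending_write is not None:
            self.mem[self.addr] = self.pending_write
            self.pending_write = None
            self.addr = (self.addr + self.mod) & 0xFFFF
        else:
            self.read_latch = self.mem[self.addr]


port = VramPort()
port.addr, port.mod = 0x8200, 1
port.cpu_write_rw(0x1234)
port.state5()                      # write slot, address moves to $8201
port.state5()                      # read slot, latch refreshed from $8201
print(hex(port.addr), hex(port.mem[0x8200]), hex(port.read_latch))
```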

Even if a theoretical limit of 16mclk (8 68kclk) is possible, some additional cycles are needed for the address and data to propagate through the chip.

Or maybe SNK has given the worst case scenario between slow Low VRAM and fast High VRAM ?
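
To put numbers on the two figures discussed above, the number of CPU writes that fit in one scanline can be computed for both the documented 12 68kclk spacing and the theoretical 8 68kclk one. The function name is hypothetical; the 2 mclk per 68kclk ratio comes from "12 68kclk ... (so 24mclk)".

```python
MCLK_PER_LINE = 1536
MCLK_PER_68KCLK = 2   # 12 68kclk = 24mclk


def writes_per_line(m68k_clocks_between_writes):
    """Maximum number of evenly spaced CPU writes in one scanline."""
    return MCLK_PER_LINE // (m68k_clocks_between_writes * MCLK_PER_68KCLK)


print(writes_per_line(12))  # 64, SNK's documented minimum spacing
print(writes_per_line(8))   # 96, one write per 16mclk sequence
```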

Timing diagram when the sprite list is being filled:

+5/8:

0   5   2   7   4   1   6   3
|       |       |   |       |
  5   2   7   4   1   6   3  0
      |       |   |       |  |
0 1 2 3 4 5 6 7 8 9 A B C D E F
|       |     |     |     |    

LLLLLLLLHHHHHHHHLLLLHHHHHHHHLLLL
HHHHHHLLLLLLLLHHHHLLLLLLLLHHHLLL

             0 1 2 3 4 5 6 7 8 9 A B C D E F
             |       |     |     |     |

Parse        ################################                                ################################
Render                                       ##########################                                      ##########################
24M    |'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_|'|_
Addr   | 600 |  20F  | 210 | 211 | 600 | 601 |  684  | 005 | 205 | 405 | 600 |  212  | 213 | 602 | 603 | 214 |  685  | 006 | 206 | 406
PCK1   ______|'''|___________________________________________________________|'''|_____________________________________________________
PCK1B  '''''''|___|'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''|___|''''''''''''''''''''''''''''''''''''''''''''''''''''
LOAD   |'''''''|_______________________|'''''''|_______________________|'''''''|_______________________|'''''''|_______________________
12M    __|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|___|'''|_
2Pixel       |       |       |       |       |       |       |       |       |       |       |       |       |       |       |       |
/WE    ''''''''''''''''''''''''''|___|'|___|'''''''''''''''''''''''''''''''''''''''''''''''|___|'|___|'''''''''''''''''''''''''''''''''
Read       ?       !     !     !                   !     !     !     !     ?       !     !                 !       !     !     !     !
  • R/W sequences: (2 write buffers ?)
  • 600 RRRWW... 600 RRWWR...
  • 600 WWRRW... 600 WRRWW... 600 RRWWR ... 600 RWWRR
  • Even lines: Write to list A, Read from list B (Start of display)
  • Odd lines: Write to list B, Read from list A
  • In 16clk, at most 2 sprites' SCB3 entries are checked to fill up the sprite list, and 1 sprite's attributes are read for output
  • 384px * 4clk/px = 1536clk/line
  • 1536clk / 16clk = 96 sprites max/line
  • Available CPU R/W slots depending on parsing progress, safest is ? cycles
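
The even/odd list alternation from the list above can be sketched as follows. The function name is hypothetical, $8680 as the base of list B is inferred from the "$8600+/$8680+" note, and which physical scanline counts as "even" is an assumption.

```python
LIST_A, LIST_B = 0x8600, 0x8680   # list B base is inferred, not confirmed


def list_roles(line):
    """Return which active sprite list is written and which is read on a line."""
    if line % 2 == 0:
        return {"write": LIST_A, "read": LIST_B}   # even lines
    return {"write": LIST_B, "read": LIST_A}       # odd lines


for line in range(4):
    roles = list_roles(line)
    print(line, hex(roles["write"]), hex(roles["read"]))
```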

CPU access to slow VRAM

  • Slow VRAM is 100ns
  • 4 slots per render cycle, 1 slot for CPU R/W (1 each 16 68k cycles)