Monday, January 28, 2008

SEAForth matrix multiplication example

Disclaimer: I don't work for IntellaSys and I consider myself a novice in NOSC programming in general and the SEAForth architecture in particular. But, by using documentation publicly at the IntellaSys website I have figured out how to write working programs for it.



I'm going to diverge from ColorForth for this particular blog entry to talk about SEAForth. Hopefully the ColorForth version of the SEAForth compiler will be released soon, but for now the public SEAForth programming environment, VentureForth, is closer IMO to MachineForth. (No color tokens. You have to keep up with code slots etc.) Still VentureForth does have a strong "fun factor" when it comes to programming. Personally I think that when programming, rather than thinking about ALGOL like "algorithms" you should think about the hardware and how to directly map your problem across it. Programming comes later.

SEAForth Architecture

The SEAForth architecture is the latest hardware design from Forth creator Chuck Moore. Like previous designs it is a NOSC or "no operand stack computer". The history and documentation of Mr. Moore's early NOSC work is documented at Jeff Fox's Ultratechnology website. This means that all operations take their operands from the stack. SEAForth goes a step further and has multiple cores (24 in the first implementation) on one chip.


Here is a diagram of a SEAForth chip.



U

















U





18R19L20R21L22R23


D

D

D

D

D

D



12R13L14R15L16R17


U

U

U

U

U

U



06R07L08R09L10R23


D

D

D

D

D

D

L00R01L02R03L04R05


U





















The numbers 0..23 represent the individual cores. The letters U, D, L and R represent the ports. Mr. Fox explained that these ports can point to other cores on chip or to off chip devices. For example communication between ports 18 and 19 always use the "R" port regardless of whether its 18 sending the data or receiving it. Also the U port for cores 18 and 23 goes to the A/D and D/A pins in some configurations. Knowing how port communication works is essential for understanding the "crawler" application walkthrough that IntellaSys has provided with their development kit.

Sample Application

In trying to come up with a test program to explore SEAForth, I settled on an obvious parallel processing application that's easy to understand, matrix multiplication. First lets look at the mathematical definition of matrix multiplication. The product of two matrices A and B results in a new matrix C where each element of C is the sum of the products of the rows in A and the columns in B. Let's take the case of a 2 row matrix by a 2 column matrix.




Matrix A:




12345679
910111213141516


Matrix B:









1725
1826
1927
2028
2129
2230
2331
2432

If we label the matrix A rows R1 and R2 and the matrix B columns C1 and C2 then the resulting matrix is:




R1*C1R1*C2
R2*C1R2*C2
Mapping of problem to SEAForth chip

So this is how I mapped the problem to the SEAForth chip.
  • Data is initially in node 12
  • Row 1 is passed to node 18
  • Row 2 is passed to node 13
  • Column 1 is passed to nodes 18 and 13 simultaneously
  • Multiplications R1*C1 and R2*C1 are done simultaneously
  • Column 2 is passed to nodes 18 and 13 simultaneously
  • Multiplications R1*C2 and R2*C2 are done simultaneously
  • Results R1*C1, R1*C2 from node 18 are returned to node 12
  • Results R2*C1, R2*C2 from node 13 are returned to node 12
Data streaming

As you may have noticed, in this program node 12 is responsible for streaming data to other nodes and then collecting the results. For this test program the data is stored in node 12's core memory. For larger applications this could be streamed in from external memory. Consider the following code snippet.


\ Stream R1 to node 13
'r--- b! . . \ point reg b to right node. (I/O node 13)
7 for @p+ !b . unext
[ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , ]

This is a microloop. The first line configures the b register to communicate with the "right" node to node 12. The next line is a microloop. The entire loop fits in one SEAForth word of memory. The data (values 1...8) are stored in the words immediately following the loop. The instruction @p+ copies a word from the program counter to the stack and the increments the program counter. The instruction !b takes the word on the top of the stack and sends it to the node(s) pointed to by the b register. You can write to more than one node at once as shown by this code.


\ Stream C1 to nodes 13 and 18 simultaneously
'rd-- b! . . \ point reg b to right and down nodes. (I/O nodes 13 and 18)
0 !b . . \ dummy write for synch purposes
15 for @p+ !b . unext
[ 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 ,
25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , ]

The a register is used to write to internal core memory. The last part of the code for node 12 reads 4 results (2 from node 13 and 2 from node 18) and stores them sequentially starting at hex address 32.

$32 a! . . \ point A register to address 32
'r--- b! . . \ point reg b to right node. (I/O node 13)
@b !a+ . . \ Get result R1*C1
@b !a+ . . \ Get result R1*C2
'-d-- b! . . \ point reg b to down node. (I/O node 18)
@b !a+ . . \ Get result R2*c1
@b !a+ . . \ Get result R2*c2


Multiplication on SEAForth

The SEAForth chip does not have a direct multiplication instruction. Instead it has a multiply step. All you really need to know is that if you want to multiply two n-bit numbers n1 and n2, put both on the stack, shift 1 number left n times, then execute the multiply step n times. For 8 bit multiplication you can use the following code.

: 8*8 ( n1 n2 -- n1*n2 ) \ 16 bit output
push 2* 2* . \ left shift n1 8 times
2* 2* 2* .
2* 2* 2* .
pop +* +* +*
+* +* +* +*
+* push drop .
pop ;

This creates a subroutine 8*8 which is used to do 8 bit multiplies. That's all I need for this particular example. Here is the rest of the code for node 13. Node 18 uses the same code.


-d-- b! . . \ point reg b to down node (I/O node 12)
$20 dup a! push \ point reg a to buffer - leave adr on return stack

7 for @b !a+ . unext \ read row R1 from node 12 and store in buffer

@b drop . . \ dummy read for synchronization purposes

pop dup a! push \ reset a to start of buffer
dup xor \ initialize TOS to 0
7 for \ Loop to multiply elements from R1 and C1
@b @a+ 8*8 . \ Get next R1 element from buffer and next C1 element from node 12
+ next . .

dup dup xor . \ preserve result from vector multiply - initialize TOS to 0

pop a! . .
\ reset a to start of buffer
7 for \ Loop to multiply elements from R1 and C2
@b @a+ 8*8 . \ Get next R1 element from buffer and next C2 element from node 12
+ next . .

!b !b . . \ Send results back to node 12

Performance

First I need to point out that the goal isn't to show how fast I can do multiplication but rather how efficient the SEAForth architecture can be for exploiting parallelism. On the simulator I have it was able to complete a single 8 bit multiplication in 50 cycles. Considering that there are a total of 32 multiplications that 1,600 cycles for multiplication alone. Using 3 cores the operation is completed in 945 cycles which represents a 41% speed up. It should be intuitively obvious that even great speed ups can be achieved by mapping the problem over more cores. I have a couple of ideas on that to test but I'll save that for another day.

Labels: