I'm going to diverge from ColorForth for this particular blog entry to talk about SEAForth. Hopefully the ColorForth version of the SEAForth compiler will be released soon, but for now the public SEAForth programming environment, VentureForth, is closer IMO to MachineForth. (No color tokens. You have to keep up with code slots etc.) Still VentureForth does have a strong "fun factor" when it comes to programming. Personally I think that when programming, rather than thinking about ALGOL like "algorithms" you should think about the hardware and how to directly map your problem across it. Programming comes later.
SEAForth Architecture
The SEAForth architecture is the latest hardware design from Forth creator Chuck Moore. Like previous designs it is a NOSC or "no operand stack computer". The history and documentation of Mr. Moore's early NOSC work is documented at Jeff Fox's Ultratechnology website. This means that all operations take their operands from the stack. SEAForth goes a step further and has multiple cores (24 in the first implementation) on one chip.
Here is a diagram of a SEAForth chip.
U  U  
18  R  19  L  20  R  21  L  22  R  23  
D  D  D  D  D  D  
12  R  13  L  14  R  15  L  16  R  17  
U  U  U  U  U  U  
06  R  07  L  08  R  09  L  10  R  23  
D  D  D  D  D  D  
L  00  R  01  L  02  R  03  L  04  R  05  
U 
Sample Application
In trying to come up with a test program to explore SEAForth, I settled on an obvious parallel processing application that's easy to understand, matrix multiplication. First lets look at the mathematical definition of matrix multiplication. The product of two matrices A and B results in a new matrix C where each element of C is the sum of the products of the rows in A and the columns in B. Let's take the case of a 2 row matrix by a 2 column matrix.
Matrix A:
 Matrix B:

R1*C1  R1*C2 
R2*C1  R2*C2 
So this is how I mapped the problem to the SEAForth chip.
 Data is initially in node 12
 Row 1 is passed to node 18
 Row 2 is passed to node 13
 Column 1 is passed to nodes 18 and 13 simultaneously
 Multiplications R1*C1 and R2*C1 are done simultaneously
 Column 2 is passed to nodes 18 and 13 simultaneously
 Multiplications R1*C2 and R2*C2 are done simultaneously
 Results R1*C1, R1*C2 from node 18 are returned to node 12
 Results R2*C1, R2*C2 from node 13 are returned to node 12
As you may have noticed, in this program node 12 is responsible for streaming data to other nodes and then collecting the results. For this test program the data is stored in node 12's core memory. For larger applications this could be streamed in from external memory. Consider the following code snippet.
\ Stream R1 to node 13
'r b! . . \ point reg b to right node. (I/O node 13)
7 for @p+ !b . unext
[ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , ]
This is a microloop. The first line configures the b register to communicate with the "right" node to node 12. The next line is a microloop. The entire loop fits in one SEAForth word of memory. The data (values 1...8) are stored in the words immediately following the loop. The instruction @p+ copies a word from the program counter to the stack and the increments the program counter. The instruction !b takes the word on the top of the stack and sends it to the node(s) pointed to by the b register. You can write to more than one node at once as shown by this code.
\ Stream C1 to nodes 13 and 18 simultaneously
'rd b! . . \ point reg b to right and down nodes. (I/O nodes 13 and 18)
0 !b . . \ dummy write for synch purposes
15 for @p+ !b . unext
[ 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 ,
25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , ]
The a register is used to write to internal core memory. The last part of the code for node 12 reads 4 results (2 from node 13 and 2 from node 18) and stores them sequentially starting at hex address 32.
$32 a! . . \ point A register to address 32
'r b! . . \ point reg b to right node. (I/O node 13)
@b !a+ . . \ Get result R1*C1
@b !a+ . . \ Get result R1*C2
'd b! . . \ point reg b to down node. (I/O node 18)
@b !a+ . . \ Get result R2*c1
@b !a+ . . \ Get result R2*c2
Multiplication on SEAForth
The SEAForth chip does not have a direct multiplication instruction. Instead it has a multiply step. All you really need to know is that if you want to multiply two nbit numbers n1 and n2, put both on the stack, shift 1 number left n times, then execute the multiply step n times. For 8 bit multiplication you can use the following code.
: 8*8 ( n1 n2  n1*n2 ) \ 16 bit output
push 2* 2* . \ left shift n1 8 times
2* 2* 2* .
2* 2* 2* .
pop +* +* +*
+* +* +* +*
+* push drop .
pop ;
This creates a subroutine 8*8 which is used to do 8 bit multiplies. That's all I need for this particular example. Here is the rest of the code for node 13. Node 18 uses the same code.
d b! . . \ point reg b to down node (I/O node 12)
$20 dup a! push \ point reg a to buffer  leave adr on return stack
7 for @b !a+ . unext \ read row R1 from node 12 and store in buffer
@b drop . . \ dummy read for synchronization purposes
pop dup a! push \ reset a to start of buffer
dup xor \ initialize TOS to 0
7 for \ Loop to multiply elements from R1 and C1
@b @a+ 8*8 . \ Get next R1 element from buffer and next C1 element from node 12
+ next . .
dup dup xor . \ preserve result from vector multiply  initialize TOS to 0
pop a! . .
\ reset a to start of buffer
7 for
\ Loop to multiply elements from R1 and C2
@b @a+ 8*8 .
\ Get next R1 element from buffer and next C2 element from node 12
+ next . .
!b !b . . \ Send results back to node 12
Performance
First I need to point out that the goal isn't to show how fast I can do multiplication but rather how efficient the SEAForth architecture can be for exploiting parallelism. On the simulator I have it was able to complete a single 8 bit multiplication in 50 cycles. Considering that there are a total of 32 multiplications that 1,600 cycles for multiplication alone. Using 3 cores the operation is completed in 945 cycles which represents a 41% speed up. It should be intuitively obvious that even great speed ups can be achieved by mapping the problem over more cores. I have a couple of ideas on that to test but I'll save that for another day.
4 comments:
Here is the complete code listing.
( $Id: matrix.f,v 1.0 2006112 $ )
\ Matrix multiplication example
\ Parrallel multiplication of a 2 row matrix by a 2 column matrix
\ 1) Data initially is in node 12
\ 2) Row1 is passed to node 18
\ 3) Row2 is passed to node 13
\ 4) Column1 is passed to nodes 18 and 13 simultaneously
\ 5) Multiplications R1*C1 and R2*C1 done simultaneously
\ 6) Column2 is passed to nodes 18 and 13 simultaneously
\ 7) Multiplications R1*C2 and R2*C2 done simultaneously
\ 8) Results R1*C1, R1*C2, R2*C1, R2*C2 are returned to node 12
include compatibility.f
cd ../t18
include t18x.f \ load compiler and simulator
cd ../bios
decimal
include rombios.mf \ 1/24/06 load ROM BIOS code
hex
init
\ ******* node 12 ***************************************
\ Pass row1 to node 13, row2 to node 18
\ and columns 1 and 2 to nodes 13 and 18 simultaneously
decimal
12 node!
$0 org
machine
\ Stream R1 to node 13
'r b! . . \ point reg b to right node. (I/O node 13)
7 for @p+ !b . unext
[ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , ]
\ Stream R2 to node 18
'd b! . . \ point reg b to down node. (I/O node 18)
7 for @p+ !b . unext
[ 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , ]
\ Stream C1 to nodes 13 and 18 simultaneously
'rd b! . . \ point reg b to right and down nodes. (I/O nodes 13 and 18)
0 !b . . \ dummy write for synch purposes
15 for @p+ !b . unext
[ 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 ,
25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , ]
$32 a! . . \ point A register to address 32
'r b! . . \ point reg b to right node. (I/O node 13)
@b !a+ . . \ Get result R1*C1
@b !a+ . . \ Get result R1*C2
'd b! . . \ point reg b to down nodw. (I/O node 18)
@b !a+ . . \ Get result R2*c1
@b !a+ . . \ Get result R2*c2
[
\ ******* node 13 *********************************************
\ Receive row1 then multiply it by columns 1 and 2
\ and return results.
decimal
13 node !
0 org
machine
: 8*8 ( n1 n2  n1*n2 ) \ 16 bit output
push 2* 2* . \ left shift n1 8 times
2* 2* 2* .
2* 2* 2* .
pop +* +* +*
+* +* +* +*
+* push drop .
pop ;
'r b! . . \ point reg b to right node
$20 dup a! push \ point reg a to buffer  push adr
7 for
@b !a+ . unext
@b drop . . \ dummy read for synchronization purposes
pop dup a! push \ reset a to start of buffer
dup xor \ initilize TOS to 0
7 for
@b @a+ 8*8 .
+ next . .
dup dup xor . \ preserve result  0 TOS
pop a! . .
7 for
@b @a+ 8*8 .
+ next . .
!b !b . . \ Send results back to r node
[
\ ******* node 18 *********************************************
\ Receive row2 then multiply it by columns 1 and 2
\ and return results.
decimal
18 node !
0 org
machine
: 8*8 ( n1 n2  n1*n2 ) \ 16 bit output
push 2* 2* . \ left shift n1 8 times
2* 2* 2* .
2* 2* 2* .
pop +* +* +*
+* +* +* +*
+* push drop .
pop ;
'd b! . . \ point reg b to down node
$20 dup a! push \ point reg a to buffer  leave adr on return stack
7 for @b !a+ . unext
@b drop . . \ dummy read for synchronization purposes
pop dup a! push \ reset a to start of buffer
dup xor \ initilize TOS to 0
7 for
@b @a+ 8*8 .
+ next . .
dup dup xor . \ preserve result from vector mult  initialize TOS to 0
pop a! . .
7 for
@b @a+ 8*8 .
+ next . .
!b !b . . \ Send results back to right node
[
\ ******* node 23 *********************************************
\ Test multiplication
decimal
23 node !
0 org
machine
: 8*8 ( n1 n2  n1*n2 ) \ 16 bit output
push 2* 2* . \ left shift n1 8 times
2* 2* 2* .
2* 2* 2* .
pop +* +* +*
+* +* +* +*
+* push drop .
pop ;
2 5 8*8 .
'r b! . .
!b
[
decimal
0 to bootadr
12 node! initnode
7 to bootadr
13 node! initnode 18 node! initnode
23 node! initnode
host
Great example John. Thanks for posting it. I'm looking forward to reading more about the SEAForth chip and the programming ideas behind it.
Hello John,
Thanks for the SeaForth tutorial. You are the only one outside of Intellasys who seems to understand the simulator and you blog comes up tops on a google search too. Can you please post a little bit more about SeaForth for commoners like me?
Please excuse my ignorance if the below is a stupid question. Can the simulator be used for a "Hello World" app? Can it be used to communicate through the hosts' tcp connection?
I am trying to understand the SeaForth simulator and given its' coolness, I am trying to understand how I could use it for text processing, etc?
Thanks
Jack
Good for people to know.
Post a Comment