I'm going to diverge from ColorForth for this particular blog entry to talk about SEAForth. Hopefully the ColorForth version of the SEAForth compiler will be released soon, but for now the public SEAForth programming environment, VentureForth, is closer IMO to MachineForth. (No color tokens. You have to keep up with code slots etc.) Still VentureForth does have a strong "fun factor" when it comes to programming. Personally I think that when programming, rather than thinking about ALGOL like "algorithms" you should think about the hardware and how to directly map your problem across it. Programming comes later.
SEAForth Architecture
The SEAForth architecture is the latest hardware design from Forth creator Chuck Moore. Like previous designs it is a NOSC or "no operand stack computer". The history and documentation of Mr. Moore's early NOSC work is documented at Jeff Fox's Ultratechnology website. This means that all operations take their operands from the stack. SEAForth goes a step further and has multiple cores (24 in the first implementation) on one chip.
Here is a diagram of a SEAForth chip.
U  U  
18  R  19  L  20  R  21  L  22  R  23  
D  D  D  D  D  D  
12  R  13  L  14  R  15  L  16  R  17  
U  U  U  U  U  U  
06  R  07  L  08  R  09  L  10  R  23  
D  D  D  D  D  D  
L  00  R  01  L  02  R  03  L  04  R  05  
U 
Sample Application
In trying to come up with a test program to explore SEAForth, I settled on an obvious parallel processing application that's easy to understand, matrix multiplication. First lets look at the mathematical definition of matrix multiplication. The product of two matrices A and B results in a new matrix C where each element of C is the sum of the products of the rows in A and the columns in B. Let's take the case of a 2 row matrix by a 2 column matrix.
Matrix A:
 Matrix B:

R1*C1  R1*C2 
R2*C1  R2*C2 
So this is how I mapped the problem to the SEAForth chip.
 Data is initially in node 12
 Row 1 is passed to node 18
 Row 2 is passed to node 13
 Column 1 is passed to nodes 18 and 13 simultaneously
 Multiplications R1*C1 and R2*C1 are done simultaneously
 Column 2 is passed to nodes 18 and 13 simultaneously
 Multiplications R1*C2 and R2*C2 are done simultaneously
 Results R1*C1, R1*C2 from node 18 are returned to node 12
 Results R2*C1, R2*C2 from node 13 are returned to node 12
As you may have noticed, in this program node 12 is responsible for streaming data to other nodes and then collecting the results. For this test program the data is stored in node 12's core memory. For larger applications this could be streamed in from external memory. Consider the following code snippet.
\ Stream R1 to node 13
'r b! . . \ point reg b to right node. (I/O node 13)
7 for @p+ !b . unext
[ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , ]
This is a microloop. The first line configures the b register to communicate with the "right" node to node 12. The next line is a microloop. The entire loop fits in one SEAForth word of memory. The data (values 1...8) are stored in the words immediately following the loop. The instruction @p+ copies a word from the program counter to the stack and the increments the program counter. The instruction !b takes the word on the top of the stack and sends it to the node(s) pointed to by the b register. You can write to more than one node at once as shown by this code.
\ Stream C1 to nodes 13 and 18 simultaneously
'rd b! . . \ point reg b to right and down nodes. (I/O nodes 13 and 18)
0 !b . . \ dummy write for synch purposes
15 for @p+ !b . unext
[ 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 ,
25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , ]
The a register is used to write to internal core memory. The last part of the code for node 12 reads 4 results (2 from node 13 and 2 from node 18) and stores them sequentially starting at hex address 32.
$32 a! . . \ point A register to address 32
'r b! . . \ point reg b to right node. (I/O node 13)
@b !a+ . . \ Get result R1*C1
@b !a+ . . \ Get result R1*C2
'd b! . . \ point reg b to down node. (I/O node 18)
@b !a+ . . \ Get result R2*c1
@b !a+ . . \ Get result R2*c2
Multiplication on SEAForth
The SEAForth chip does not have a direct multiplication instruction. Instead it has a multiply step. All you really need to know is that if you want to multiply two nbit numbers n1 and n2, put both on the stack, shift 1 number left n times, then execute the multiply step n times. For 8 bit multiplication you can use the following code.
: 8*8 ( n1 n2  n1*n2 ) \ 16 bit output
push 2* 2* . \ left shift n1 8 times
2* 2* 2* .
2* 2* 2* .
pop +* +* +*
+* +* +* +*
+* push drop .
pop ;
This creates a subroutine 8*8 which is used to do 8 bit multiplies. That's all I need for this particular example. Here is the rest of the code for node 13. Node 18 uses the same code.
d b! . . \ point reg b to down node (I/O node 12)
$20 dup a! push \ point reg a to buffer  leave adr on return stack
7 for @b !a+ . unext \ read row R1 from node 12 and store in buffer
@b drop . . \ dummy read for synchronization purposes
pop dup a! push \ reset a to start of buffer
dup xor \ initialize TOS to 0
7 for \ Loop to multiply elements from R1 and C1
@b @a+ 8*8 . \ Get next R1 element from buffer and next C1 element from node 12
+ next . .
dup dup xor . \ preserve result from vector multiply  initialize TOS to 0
pop a! . .
\ reset a to start of buffer
7 for
\ Loop to multiply elements from R1 and C2
@b @a+ 8*8 .
\ Get next R1 element from buffer and next C2 element from node 12
+ next . .
!b !b . . \ Send results back to node 12
Performance
First I need to point out that the goal isn't to show how fast I can do multiplication but rather how efficient the SEAForth architecture can be for exploiting parallelism. On the simulator I have it was able to complete a single 8 bit multiplication in 50 cycles. Considering that there are a total of 32 multiplications that 1,600 cycles for multiplication alone. Using 3 cores the operation is completed in 945 cycles which represents a 41% speed up. It should be intuitively obvious that even great speed ups can be achieved by mapping the problem over more cores. I have a couple of ideas on that to test but I'll save that for another day.