|
|
|
|
| Author |
Message |
Chris Thomasson Guest
|
Posted: Sat Sep 09, 2006 4:49 am Post subject: Re: IBM to build Opteron-Cell hybrid supercomputer of 1 Peta |
|
|
Ooops!
"Chris Thomasson" <cristom@comcast.net> wrote in message
news:rLadnbtuobfZiZ_YnZ2dnUVZ_sKdnZ2d@comcast.com...
| Quote: | "Scott Michel" <scooter.phd@gmail.com> wrote in message
news:1157761489.824538.48430@i3g2000cwc.googlegroups.com...
Chris Thomasson wrote:
|
[...]
| Quote: | LL/SC does not really fit the bill... You have to implement logic that
uses LL/SC in a loop. You can predict exactly how many times a thread will
retry.
^^^^^^^^^^ |
You CAN'T predict exactly how many times a thread will retry.
| Quote: | It's similar to the live-lock-like situations that are inherent in
obstruction-free algorithms...
Is this the kind of 'pathological conditions' you were getting at?
|
[...]
Sorry for any confusion. |
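The retry loop discussed in this post can be sketched in C11. This is a minimal sketch, not code from the thread: C has no portable LL/SC primitive, so `atomic_compare_exchange_weak` stands in for it here (on LL/SC machines it is typically compiled to such a pair and, like a store-conditional, it is allowed to fail spuriously). The helper name `add_with_retry_count` is hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical helper: add `delta` to `*target` and report how many
   times the update had to be retried. */
unsigned add_with_retry_count(atomic_int *target, int delta)
{
    unsigned retries = 0;
    int observed = atomic_load(target);   /* plays the role of load-linked */

    /* Plays the role of store-conditional: the exchange fails if *target
       changed since it was observed, and the weak form may also fail
       spuriously. On failure, `observed` is refreshed with the current
       value, so the loop simply tries again. */
    while (!atomic_compare_exchange_weak(target, &observed, observed + delta))
        ++retries;

    return retries;
}
```

The count returned depends on what other threads do between the load and the exchange, so nothing in the loop itself bounds it in advance.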
|
|
 |
|
|
Tom Horsley Guest
|
Posted: Sun Sep 10, 2006 3:40 am Post subject: Re: IBM to build Opteron-Cell hybrid supercomputer of 1 Peta |
|
|
On Fri, 08 Sep 2006 21:03:24 +0000, Sander Vesik wrote:
| Quote: | Or maybe what happens is what has happened time and again, and the magic
compilers fail to show up. Even more so for compilers that can work
their magic on bad old code.
|
Hey, the compiler writers had all their brain cells used up
trying to generate code for the x86 architecture.
You're gonna have to wait for a whole new generation
of compiler writers, which is gonna be tricky since
practically every university computer science program
is now nothing but web design and javascript :-). |
|
|
 |
Tom Horsley Guest
|
Posted: Sun Sep 10, 2006 3:43 am Post subject: Re: IBM to build Opteron-Cell hybrid supercomputer of 1 Peta |
|
|
On Fri, 08 Sep 2006 17:17:54 -0700, Scott Michel wrote:
| Quote: | But the original question: why both Cell and Opteron...?
|
Opteron so they can get the performance they need?
Cell because IBM makes 'em and they can unload a bunch
of them on the gummint while they are at it?
(Just a theory :-). |
|
|
 |
The little lost angel Guest
|
Posted: Sun Sep 10, 2006 7:30 am Post subject: Re: IBM to build Opteron-Cell hybrid supercomputer of 1 Peta |
|
|
On Sat, 09 Sep 2006 22:40:26 GMT, Tom Horsley
<tomhorsley@adelphia.net> wrote:
| Quote: | You're gonna have to wait for a whole new generation
of compiler writers, which is gonna be tricky since
practically every university computer science program
is now nothing but web design and javascript :-).
|
That's bull, it's web design and JAVA ;)
--
A Lost Angel, fallen from heaven
Lost in dreams, Lost in aspirations,
Lost to the world, Lost to myself |
|
|
 |
Scott Michel Guest
|
Posted: Mon Sep 11, 2006 9:59 pm Post subject: Re: IBM to build Opteron-Cell hybrid supercomputer of 1 Peta |
|
|
Sander Vesik wrote:
| Quote: | In comp.arch Scott Michel <scooter.phd@gmail.com> wrote:
| Quote: | I sense a new evolution in compilers is going to happen in the
near future to address these multi-core processor issues. IBM's is not the
only multi-core processor with an avant-garde design; compilers will
have to deal with Niagara's threading intricacies too. I wouldn't
expect to see much software that takes advantage of the SPUs in the
near future. My understanding is that game engine developers are
likewise staying away from using the SPUs at this point in time.
|
Or maybe what happens is what has happened time and again, and the magic
compilers fail to show up. Even more so for compilers that can work
their magic on bad old code.
|
gcc doesn't really help you if you don't know what you're doing. Loop
unrolling comes to mind: can't tell you how many times I've had to
forcibly do loop unrolling where one would have expected gcc to do it
with "-O3 -funroll-loops".
There is some hope on the horizon, like LLVM from UIUC, which you'll
see under the hood in OS X "Leopard". I'm not sure if I'd expect to
see Cell SPU support in Java, although IBM will likely make that
happen. Sure, compilers can take hints, but it seems to me that it needs an
interpretive system, like LLVM or Python, to take the "bird's eye" view
and dispatch tasks to SPUs. Simple loop-level parallelism, while
common, is likely the wrong level of granularity. |
|
|
 |
|
|
Robert Redelmeier Guest
|
Posted: Tue Sep 12, 2006 2:25 pm Post subject: Re: IBM to build Opteron-Cell hybrid supercomputer of 1 Peta |
|
|
In comp.sys.ibm.pc.hardware.chips Scott Michel <scooter.phd@gmail.com> wrote in part:
| Quote: | gcc doesn't really help you if you don't know what you're doing.
|
Agreed, `gcc` can be cantankerous.
| Quote: | Loop unrolling comes to mind: can't tell you how many times
I've had to forcibly do loop unrolling where one would have
expected gcc to do it with "-O3 -funroll-loops".
|
Loop unrolling is not as useful on modern processors (I do not
consider the Pentium4 "modern") as it used to be: It dilutes the
I-cache and forces more fetches, and the cost of branching/looping
is relatively low with decent branch prediction and parallel
OoO exec. An unroll of 2x or 4x should be more than enough for
the ROB to chew on.
-- Robert |
|
|
 |
Scott Michel Guest
|
Posted: Wed Sep 13, 2006 9:37 pm Post subject: Re: IBM to build Opteron-Cell hybrid supercomputer of 1 Peta |
|
|
Robert Redelmeier wrote:
| Quote: | In comp.sys.ibm.pc.hardware.chips Scott Michel <scooter.phd@gmail.com> wrote in part:
| Quote: | gcc doesn't really help you if you don't know what you're doing.
|
Agreed, `gcc` can be cantankerous.
| Quote: | Loop unrolling comes to mind: can't tell you how many times
I've had to forcibly do loop unrolling where one would have
expected gcc to do it with "-O3 -funroll-loops".
|
Loop unrolling is not as useful on modern processors (I do not
consider the Pentium4 "modern") as it used to be: It dilutes the
I-cache and forces more fetches, and the cost of branching/looping
is relatively low with decent branch prediction and parallel
OoO exec. An unroll of 2x or 4x should be more than enough for
the ROB to chew on.
|
I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU and found I had to unroll the
"y_new[i] = y_old[i] + alpha * x[i]" equation 16x to get around a GFLOP
on single precision numbers. By contrast, "-O3 -funroll-loops" and
"-O3" were very disappointing at around 40 MFLOPS or so (although it did
show that a GPU can by far outperform the AMD-64 and gcc.) |
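For readers who want to see the two forms side by side, here is a minimal sketch of the saxpy loop above, plain and manually unrolled. It is not the code that was measured: the function names are hypothetical, and the unroll factor is 4 rather than 16 only to keep the listing short (a 16x version follows the same pattern).

```c
#include <stddef.h>

/* Straightforward form: one multiply-add per loop iteration. */
void saxpy_simple(size_t n, float alpha, const float *x,
                  const float *y_old, float *y_new)
{
    for (size_t i = 0; i < n; ++i)
        y_new[i] = y_old[i] + alpha * x[i];
}

/* Manually unrolled 4x: four independent multiply-adds per iteration,
   followed by a cleanup loop for the last n % 4 elements. */
void saxpy_unrolled4(size_t n, float alpha, const float *x,
                     const float *y_old, float *y_new)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        y_new[i]     = y_old[i]     + alpha * x[i];
        y_new[i + 1] = y_old[i + 1] + alpha * x[i + 1];
        y_new[i + 2] = y_old[i + 2] + alpha * x[i + 2];
        y_new[i + 3] = y_old[i + 3] + alpha * x[i + 3];
    }

    for (; i < n; ++i)               /* remainder */
        y_new[i] = y_old[i] + alpha * x[i];
}
```

Both functions compute the same result; whether the unrolled one is faster depends on the compiler, the flags, and whether the arrays fit in cache, so it is worth timing both.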
|
|
 |
Robert Redelmeier Guest
|
Posted: Wed Sep 13, 2006 10:59 pm Post subject: Re: IBM to build Opteron-Cell hybrid supercomputer of 1 Peta |
|
|
In comp.sys.ibm.pc.hardware.chips Scott Michel <scooter.phd@gmail.com> wrote in part:
| Quote: | Robert Redelmeier wrote:
| Quote: | Loop unrolling is not as useful on modern processors (I do not
consider the Pentium4 "modern") as it used to be: It dilutes the
I-cache and forces more fetches, and the cost of branching/looping
is relatively low with decent branch prediction and parallel
OoO exec. An unroll of 2x or 4x should be more than enough for
the ROB to chew on.
|
I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU and found I had to unroll the
"y_new[i] = y_old[i] + alpha * x[i]" equation 16x to get around a GFLOP
on single precision numbers. By contrast, "-O3 -funroll-loops" and
"-O3" were very disappointing at around 40 MFLOPS or so (although it did
show that a GPU can by far outperform the AMD-64 and gcc.)
|
If I understand you correctly, the GPU benefitted from
the unrolling. I'm hardly surprised. But are you sure you
weren't comparing memory speeds more than processing speeds?
Try it on a working set size that fits inside L1.
40 MFLOPS corresponds to about 480 Mbyte/s which might be
all that system can sustain for interleaved read-read-write.
GPUs (graphics processing units, I assume) have _much_ higher
bandwidth, at least to local memory.
-- Robert
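The bandwidth estimate above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not a measurement: it assumes 4-byte floats and counts each element update as one operation that reads x[i] and y_old[i] and writes y_new[i]. The function name is hypothetical.

```c
#include <stddef.h>

/* Memory traffic of single-precision saxpy for a given rate of element
   updates per second. Each update touches three floats: two reads
   (x[i], y_old[i]) and one write (y_new[i]). */
double saxpy_bytes_per_second(double updates_per_second)
{
    const double bytes_per_update = 3.0 * sizeof(float);   /* 12 with 4-byte floats */
    return updates_per_second * bytes_per_update;
}
```

With 40 million updates per second this gives 480 Mbyte/s, the figure quoted in the post. If the multiply and the add are counted as separate floating-point operations, 40 MFLOPS is only 20 million updates per second and the same formula gives 240 Mbyte/s; either way the question of whether the loop is memory bound still applies.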
|
|
|
 |
Scott Michel Guest
|
Posted: Thu Sep 14, 2006 10:11 pm Post subject: Re: IBM to build Opteron-Cell hybrid supercomputer of 1 Peta |
|
|
Robert Redelmeier wrote:
| Quote: | In comp.sys.ibm.pc.hardware.chips Scott Michel <scooter.phd@gmail.com> wrote in part:
| Quote: | I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU and found I had to unroll the
"y_new[i] = y_old[i] + alpha * x[i]" equation 16x to get around a GFLOP
on single precision numbers. By contrast, "-O3 -funroll-loops" and
"-O3" were very disappointing at around 40 MFLOPS or so (although it did
show that a GPU can by far outperform the AMD-64 and gcc.)
|
If I understand you correctly, the GPU benefitted from
the unrolling. I'm hardly surprised. But are you sure you
weren't comparing memory speeds more than processing speeds?
Try it on a working set size that fits inside L1.
40 MFLOPS corresponds to about 480 Mbyte/s which might be
all that system can sustain for interleaved read-read-write.
GPUs (graphics processing units, I assume) have _much_ higher
bandwidth, at least to local memory.
|
The reverse. The GPU can't do loop unrolling, since it controls the
entire iteration through the matrix being processed (it's implied
looping, to be precise.) It was the AMD-64 for which I had to do the
manual unrolling.
gcc is not your friend. |
|
|
 |
Phil Armstrong Guest
|
Posted: Thu Sep 14, 2006 11:50 pm Post subject: Re: IBM to build Opteron-Cell hybrid supercomputer of 1 Peta |
|
|
Scott Michel <scooter.phd@gmail.com> wrote:
| Quote: | In comp.sys.ibm.pc.hardware.chips Scott Michel <scooter.phd@gmail.com> wrote in part:
I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU and found I had to unroll the
"y_new[i] = y_old[i] + alpha * x[i]" equation 16x to get around a GFLOP
on single precision numbers. By contrast, "-O3 -funroll-loops" and
"-O3" were very disappointing at around 40 MFLOPS or so (although it did
show that a GPU can by far outperform the AMD-64 and gcc.)
[snip]
gcc is not your friend.
|
Was the loop not being unrolled at all by gcc? Did -funroll-all-loops
help?
Phil
--
http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt |
|
|
 |
|
|
Bernd Paysan Guest
|
Posted: Fri Sep 15, 2006 1:24 pm Post subject: Re: IBM to build Opteron-Cell hybrid supercomputer of 1 Peta |
|
|
Scott Michel wrote:
| Quote: | The reverse. The GPU can't do loop unrolling, since it controls the
entire iteration through the matrix being processed (it's implied
looping, to be precise.) It was the AMD-64 for which I had to do the
manual unrolling.
gcc is not your friend.
|
More than 10 years ago, when I was still a student, one of the PhDs of the
numeric faculty ran a matrix multiply competition for the HP-RISC CPUs we
had in our workstations. He estimated that 30 MFLOPS would be possible, even
though a naive C loop got less than 1 MFLOPS, and the HP Fortran
compiler with a built-in "extremely fast" matrix multiplication got no more
than 10 MFLOPS.
After doing some experiments, I did indeed get 30 MFLOPS out of the thing, by
doing several levels of blocking. The inner loop kept a small submatrix
accumulator (as much as would fit; I think I got 5x5 into the registers), so
that several rows and columns could be multiplied together in one go
(saving a lot of loads and stores). The next blocking level was the (quite
large) cache of the PA-RISC machine, i.e. subareas of both matrices were
multiplied together.
I never got around to making the matrix multiplication routine general purpose
(the benchmark one could only multiply 512x512 matrices), but today this
sort of blocking is state of the art in high performance numerical
libraries. GCC isn't your friend, because loop unrolling here is really the
wrong approach. The inner loop I used just did all the multiplications for
the 5x5 submatrix, and no further unrolling was necessary.
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/ |
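The register-level blocking described in this post can be sketched as follows. This is an illustration under stated assumptions, not the benchmark routine: the tile is 4x4 rather than 5x5, the matrices are square and row-major, n must be a multiple of the tile size, and the second (cache) blocking level is left out. The names `TILE` and `matmul_tiled` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 4 };   /* edge of the register tile; an assumption */

/* C += A * B for row-major n x n matrices, with n a multiple of TILE. */
void matmul_tiled(size_t n, const double *a, const double *b, double *c)
{
    assert(n % TILE == 0);

    for (size_t i0 = 0; i0 < n; i0 += TILE) {
        for (size_t j0 = 0; j0 < n; j0 += TILE) {
            /* Small accumulator submatrix; the intent is that the
               compiler keeps it in registers for the whole k loop. */
            double acc[TILE][TILE] = {{0.0}};

            /* Each k step loads TILE values of A and TILE values of B
               and performs TILE * TILE multiply-adds with them. */
            for (size_t k = 0; k < n; ++k) {
                for (size_t i = 0; i < TILE; ++i) {
                    double a_ik = a[(i0 + i) * n + k];
                    for (size_t j = 0; j < TILE; ++j)
                        acc[i][j] += a_ik * b[k * n + (j0 + j)];
                }
            }

            /* Write the finished tile back once. */
            for (size_t i = 0; i < TILE; ++i)
                for (size_t j = 0; j < TILE; ++j)
                    c[(i0 + i) * n + (j0 + j)] += acc[i][j];
        }
    }
}
```

The saving comes from the ratio of arithmetic to memory traffic: a naive inner loop does one multiply-add per two loads, while a TILE x TILE accumulator does TILE * TILE multiply-adds per 2 * TILE loads.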
|
|
 |
|
|
| |