Doom_Machine
New Member
continued:
since most collision detection branches are basically random and can't be predicted even with the best
branch predictor. So not having a branch predictor doesn't hurt; what does hurt, however, is the very small
amount of local memory available to each SPE. In order to access main memory, the SPE places a DMA
request on the bus (or the PPE can initiate the DMA request) and waits for it to be fulfilled. According to
those who have had experience with the PS3 development kits, this access takes far too long to be used in
many real world scenarios. It is the small amount of local memory available to each SPE that keeps the
SPEs from working on more than a handful of tasks. While physics acceleration is an important
one, there are many more tasks that can't be accelerated by the SPEs because of the memory limitation.
The other point that has been made is that even if you can offload some of the physics calculations to the
SPE array, the Cell's PPE ends up being a pretty big bottleneck thanks to its overall lackluster
performance. It's akin to having an extremely fast GPU but without a fast CPU to pair it up with.
What About Multithreading?
We of course asked the obvious question: would game developers rather have 3 slow general purpose cores,
or one of those cores paired with an array of specialized SPEs? The response was unanimous: everyone we
have spoken to would rather take the general purpose core approach.
The cross-platform developers we've spoken to cite everything from ease of programming to the limitations
of the SPEs we mentioned previously, and by their account the Xbox 360 appears to be the more
developer-friendly of the two platforms. Despite being more developer-friendly, the Xenon CPU is still not
what developers wanted.
The most ironic bit of it all is that according to developers, if either manufacturer had decided to use
an Athlon 64 or a Pentium D in their next-gen console, they would be significantly ahead of the
competition in terms of CPU performance.
While the developers we've spoken to agree that heavily multithreaded game engines are the future, that
future won't really take form for another 3 - 5 years. Even Microsoft admitted to us that all developers
are focusing on having, at most, one or two threads of execution for the game engine itself - not the four
or six threads that the Xbox 360 was designed for.
Even when games become more aggressive with their multithreading, targeting 2 - 4 threads, most of the
work will still be done in a single thread. It won't be until the next step in multithreaded
architectures that this single thread gets broken down even further, and by that time we'll be talking
about Xbox 720 and PlayStation 4. In the end, the more multithreaded nature of these new console CPUs
doesn't help paint much of a brighter performance picture - multithreaded or not, game developers are not
pleased with the performance of these CPUs.
What about all those Flops?
The one statement that we heard over and over again was that Microsoft was
sold on the peak theoretical performance of the Xenon CPU. Ever since the announcement of the Xbox 360
and PS3 hardware, people have been set on comparing Microsoft's figure of 1 trillion floating point
operations per second to Sony's figure of 2 trillion floating point operations per second (TFLOPs). Any
AnandTech reader should know for a fact that these numbers are meaningless, but just in case you need some
reasoning for why, let's look at the facts.
First and foremost, a floating point operation can be anything: it can be adding two floating point
numbers together, it can be performing a dot product on two floating point vectors, or it can even be just
calculating the complement of a floating point number. Anything that is executed on an FPU is fair game to
be called a floating point operation.
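To make the point concrete, here is a minimal sketch of how one and the same dot product can be reported as very different "operation" counts. The three counting conventions and the function name are assumptions picked for illustration; they are not taken from Microsoft's or Sony's figures.

```python
def dot_product_flop_counts(n):
    """Count the 'floating point operations' in one n-element dot product
    under three hypothetical counting conventions."""
    return {
        # The whole dot product is issued as one instruction and counted once.
        "one_vector_instruction": 1,
        # n multiplies plus (n - 1) adds, each counted separately.
        "separate_mul_and_add": n + (n - 1),
        # n fused multiply-adds, each counted as two operations.
        "fma_counted_as_two": 2 * n,
    }


counts = dot_product_flop_counts(4)
for convention, flops in counts.items():
    print(f"{convention}: {flops}")
```

The work done is identical in all three cases; only the bookkeeping changes, which is why a raw operations-per-second figure says little without knowing what was counted.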
Secondly, both floating point power numbers refer to the whole system, CPU and GPU. Obviously a GPU's
floating point processing power doesn't mean anything if you're trying to run general purpose code on it
and vice versa. As we've seen from the graphics market, characterizing GPU performance in terms of generic
floating point operations per second is far from the full performance story.
Third, when a manufacturer is talking about peak floating point performance there are a few things that
they aren't taking into account. Being able to process billions of operations per second depends on
actually being able to have that many floating point operations to work on. That means that you have to
have enough bandwidth to keep the FPUs fed, no mispredicted branches, no cache misses and the right
structure of code to make sure that all of the FPUs can be fed at all times so they can execute at their
peak rates. We already know that's not the case as game developers have already told us that the Xenon
CPU isn't even in the same realm of performance as the Pentium 4 or Athlon 64. Not to mention that the
requirements for hitting peak theoretical performance are always ridiculous; caches are only so big and
thus there will come a time when a request to main memory is needed, and you can expect that request to
take a few hundred clock cycles to be fulfilled, during which no floating point operations will be
happening at all.
So while there may be some extreme cases where the Xenon CPU can hit its peak performance, it sure isn't
happening in any real world code.
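A rough sketch of why stalls pull the real number below the peak figure: if the FPUs are only busy for part of the time, the sustained rate is the peak rate scaled by that busy fraction. The peak value and cycle counts below are made-up illustrative inputs, not measurements of Xenon or Cell, and the model ignores everything except memory stalls.

```python
def sustained_gflops(peak_gflops, compute_cycles, stall_cycles):
    """Scale a peak rate by the fraction of cycles spent doing FP work.

    compute_cycles: cycles of useful floating point work between stalls.
    stall_cycles:   cycles spent waiting on main memory, doing no FP work.
    """
    busy_fraction = compute_cycles / (compute_cycles + stall_cycles)
    return peak_gflops * busy_fraction


# Hypothetical chip: 100 GFLOPs peak, 900 cycles of work per 300-cycle memory wait.
print(sustained_gflops(100.0, 900, 300))  # 75.0
# With no stalls at all, the model returns the peak figure unchanged.
print(sustained_gflops(100.0, 900, 0))    # 100.0
```

Even this simple model, with a memory request every 900 cycles, gives up a quarter of the advertised figure; mispredicted branches and poorly structured code would only lower it further.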
The Cell processor is no different; given that its PPE is identical to one of the PowerPC cores in Xenon,
it must derive its floating point performance superiority from its array of SPEs. So what's the issue
with the 218 GFLOPs number (2 TFLOPs for the whole system)? Well, from what we've heard, game developers
are finding that they can't use the SPEs for a lot of tasks. So in the end, it doesn't matter what the peak
theoretical performance of Cell's SPE array is, if those SPEs aren't being used all the time.
Another way to look at this comparison of flops is to look at peak integer add rates on the Pentium 4 vs.
the Athlon 64. The Pentium 4 has two double pumped ALUs, each capable of performing two add operations
per clock, for a total of 4 add operations per clock; so we could say that a 3.8GHz Pentium 4 can
perform 15.2 billion operations per second. The Athlon 64 has three ALUs each capable of executing an add
every clock; so a 2.8GHz Athlon 64 can perform 8.4 billion operations per second. By this silly console
marketing logic, the Pentium 4 would be almost twice as fast as the Athlon 64, and a multi-core Pentium 4
would be faster than a multi-core Athlon 64. Any AnandTech reader should know that's hardly the case. No
code is composed entirely of add instructions, and even if it were, eventually the Pentium 4 and Athlon 64
will have to go out to main memory for data, and when they do, the Athlon 64 has a much lower latency
access to memory than the P4. In the end, despite what these horribly concocted numbers may lead you to
believe, they say absolutely nothing about performance. The exact same situation exists with the CPUs of
the next-generation consoles; don't fall for it.
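The arithmetic behind those two figures can be reproduced in a few lines. This is only a sketch of the peak-rate calculation described above, using the clock speeds and ALU counts from the text; the function name is made up for the example.

```python
def peak_adds_per_second(clock_ghz, alus, adds_per_alu_per_clock):
    """Peak integer adds per second, in billions: clock x ALUs x adds per ALU per clock."""
    return clock_ghz * alus * adds_per_alu_per_clock


pentium4 = peak_adds_per_second(3.8, alus=2, adds_per_alu_per_clock=2)
athlon64 = peak_adds_per_second(2.8, alus=3, adds_per_alu_per_clock=1)

print(f"Pentium 4 3.8GHz: {pentium4:.1f} billion adds/s")
print(f"Athlon 64 2.8GHz: {athlon64:.1f} billion adds/s")
print(f"Ratio on paper:   {pentium4 / athlon64:.2f}x")
```

Nothing in this calculation knows about memory latency, instruction mix or anything else that decides real performance, which is exactly the problem with the console numbers.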
Why did Sony/MS do it?
For Sony, it doesn't take much to see that the Cell processor is eerily similar to
the Emotion Engine in the PlayStation 2, at least conceptually. Sony clearly has an idea of what direction
they would like to go in, and it doesn't happen to be one that's aligned with much of the rest of the
industry.