RISC-V is an opinionated architecture, and that is always going to get some people fired up. Any technology that aims for simplicity has to make hard choices and trade-offs. It isn't hard to complain about missing instructions when there are fewer than 100 of them. Meanwhile, nobody will complain about ARM64 missing instructions, because it has about 1000 of them.
Therein lies the problem. Nobody ever goes out guns blazing complaining about too many instructions, despite the fact that complexity has its own downsides.
RISC-V has been designed aggressively to have a minimal ISA, to leave plenty of room to grow, and to require a minimal number of transistors for a minimal implementation.
Should this be a showstopper down the road, there will be plenty of space to add an extension that fixes the problem. Meanwhile, embedded systems paying a premium for transistors will not have to pay for these extra instructions, since only 47 instructions have to be implemented in a minimal solution.
So this is one tiny corner of the ISA, not something that makes ALL instruction sequences longer. Essentially, RISC-V has no condition codes (they're a bit of an architectural nightmare for anyone building anything more than the simplest CPUs: they make every instruction potentially have dependencies or anti-dependencies with every other instruction).
It's a trade-off, and the one that's been made makes ALL instructions a little faster at the expense of one particular case that isn't used much. That's how you do computer architecture: you look at the whole, not just one particular case.
RISC-V also specifies a 128-bit variant, which is of course FASTER than these examples.
> I believe that an average computer science student could come up with a better instruction set than Risc V in a single term project.
When you hear "<person / group> could make a better <implementation> in <short time period>", call them out. Do it. The world will not shun a better openly licensed ISA. We even have some pretty awesome FPGA boards these days that would let you prototype your own ISA at home.
In terms of the market, now is an exceptionally good time to go back to the design room. It's not as if anybody will be manufacturing much during the next year, with the fabs unable to make even the existing chips fast enough to meet demand. There is a window of opportunity here.
> It is, more-or-less a watered down version of the 30 year old Alpha ISA after all. (Alpha made sense at its time, with the transistor budget available at the time.)
As I see it, lower numbers of transistors could also be a good thing. It seems blatantly obvious at this point that multi-core software is not only here to stay, but is the future. Lower numbers of transistors means squeezing more cores onto the same silicon, or implementing larger caches, etc.
I also really like the Unix philosophy of doing one simple thing well. Sure, it could have some special instruction that does exactly your use case in one cycle using all the registers, but that's not what has created such advances in general purpose computing.
> Sure, it is "clean" but just to make it clean, there was no reason to be naive.
I would much rather we build on a conceptually clean instruction set than try to cobble together hacks on top of fundamentally flawed designs, even at the cost of performance. It's exactly these cobbled-together conceptual hacks that have led to the likes of the Spectre and Meltdown vulnerabilities, when instruction sets become so complicated that they cannot be easily tested.
The idea is to use the compressed instruction extension. Two adjacent compressed instructions can then be handled like a single "fat" instruction with a special-case implementation (macro-op fusion).
That allows more flexibility for CPU designs to optimize transistor count vs speed vs energy consumption.
This guy clearly did not look at the stated rationale for the design decisions of RISC-V.
I think talking about ISAs as better or worse than one another is often a bad idea, for the same reason that arguing about whether C or Python is better is a bad idea. Different ISAs are used for different purposes. We can point to some specific things as almost always being bad in the modern world, like branch delay slots or the way the C preprocessor works, but even then, for widely employed languages or ISAs, there was a point to them when they were created.
RISC-V has a number of places where it's employed and makes an excellent fit. First of all, academia. For an undergrad building the netlist for their first processor, or a grad student doing their first out-of-order processor, RISC-V's simplicity is great for the pedagogical purpose. For a researcher trying to experiment with better branch-prediction techniques, having a standard high-ish-performance open-source design they can take and modify with their ideas is immensely helpful. And many companies in the real world with their eyes on the bottom line like having an ISA where you can add instructions that happen to accelerate your own particular workload, where you can use a standard compiler framework outside your special assembly inner loops, and where you don't have to spend transistors on features you don't need.
I'm not optimistic about RISC-V's widescale adoption as an application processor. If I were going to start designing an open source processor in that space I'd probably start with IBM's now open Power ISA. But there are so many more niches in the world than just that and RISC-V is already a success in some of them.
> My conclusion is that Risc V is a terrible architecture.
Kinda stopped reading here. It's a pretty arrogant hot take. I don't know this guy; maybe he's some sort of ISA expert. But it strains credulity that, after all the time and work put into it, RISC-V is a "terrible architecture".
My expectation here is that RISC-V requires some inefficient instruction sequences in some corners somewhere (and one of these corners happens to be OP's pet use case), but by and large things are fine.
And even then, I don't think that's clear. You're not going to determine performance on a modern CPU just by looking at a stream of instructions. Hell, it's really hard to compare streams of instructions from different ISAs at all.
Godbolt:

typedef __int128_t int128_t;

int128_t add(int128_t left, int128_t right)
{
    return left + right;
}
GCC 10, -O2, RISC-V:

add(__int128, __int128):
mv a5,a0
add a0,a0,a2
sltu a5,a0,a5
add a1,a1,a3
add a1,a5,a1
ret
ARM64:

add(__int128, __int128):
adds x0, x0, x2
adc x1, x1, x3
ret
This issue hurts the wider types that are compiler built-ins. Even though C's programming model is devoid of any carry-flag concept, canned types like a 128-bit integer can take advantage of it.
Portable C code to simulate a 128-bit integer will probably emit bad code across the board: it will explicitly calculate the carry as an additional operand and pull it into the result. In that case the RISC-V won't look any worse, in all likelihood.
(The above RISC-V instruction sequence is shorter than the mailing list post author's 7-line sequence because it doesn't calculate a carry-out: the result is truncated. You'd need a carry-out to continue a wider addition.)
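For illustration, here is a minimal sketch of that portable approach; the u128 type and function name are hypothetical, not from the post:

#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128; /* hypothetical two-limb type */

u128 u128_add(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    /* an unsigned add wrapped iff the sum is smaller than an operand */
    uint64_t carry = r.lo < a.lo;
    r.hi = a.hi + b.hi + carry;
    return r;
}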
Hmmm... I think this argument is solid. It's admittedly biased toward GMP's perspective, but bignums are used all the time in RSA / ECC and probably other common tasks, so maybe it's important enough to analyze at this level.
2 instructions to work with 64 bits, maybe 1 more instruction / macro-op for the compare-and-jump back up to the top of the loop, and 1 more instruction for a loop counter of some kind?
So we're looking at ~4 instructions per 64 bits on ARM/x86, but ~9 instructions on RISC-V.
In practice the loop iterations will overlap thanks to out-of-order / superscalar execution, so the comparison inside the post (2 instructions on x86 vs 7 instructions on RISC-V) is probably closest to the truth.
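For reference, the loop under discussion looks roughly like this in portable C (a sketch; the function name and limb layout are my own):

#include <stddef.h>
#include <stdint.h>

/* Add the n-limb bignums a and b into dst; returns the final carry-out.
   The two comparisons are the explicit carry computations that an
   add/adc pair handles implicitly on x86 and ARM. */
uint64_t bignum_add(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] + carry;
        uint64_t c = t < carry;   /* carry-out of a[i] + carry */
        dst[i] = t + b[i];
        c |= dst[i] < t;          /* carry-out of t + b[i] */
        carry = c;
    }
    return carry;
}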
----------
Question: is ~2 clock ticks per 64 bits really the ideal? I don't think so. It seems to me that bignum arithmetic is easily SIMD-able. Carries are NOT accounted for in x86 AVX or ARM NEON instructions, so x86, ARM, and RISC-V would all be on an equal footing there.
I don't know exactly how to write a bignum addition loop in AVX off the top of my head, but I'd assume it'd be similar to the 7 instructions listed here, except using 256-bit AVX registers or 512-bit AVX512 registers.
So 7 instructions to perform 512 bits of bignum addition is ~73 bits per clock cycle, far superior in speed to the 32 bits per clock cycle from add + adc (the 64-bit code with implicit condition codes).
AVX512 is uncommon, but 256-bit AVX is common on x86 at least, leading to ~36 bits per clock tick.
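For what it's worth, the usual trick for making bignum addition vector-friendly is a redundant limb representation, so lane-wise adds can't overflow and carries are resolved in a separate cheap pass. A hedged sketch in portable C (the 52-bit limb size and all names are my choice, not from GMP or the post):

#include <stddef.h>
#include <stdint.h>

#define LIMB_BITS 52
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* Lane-independent add: 52-bit limbs in 64-bit lanes leave 12 spare bits,
   so no carry chain is needed per add, and iterations are independent,
   which lets a compiler auto-vectorize with AVX2 / NEON. */
void limb_add(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* An occasional normalization pass pushes the accumulated carries up
   (final carry-out handling omitted in this sketch). */
void limb_normalize(uint64_t *x, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t v = x[i] + carry;
        x[i] = v & LIMB_MASK;
        carry = v >> LIMB_BITS;
    }
}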
----------
ARM has SVE, whose vector length is implementation-defined (128 bits on some cores, 512 on others). RISC-V has a bunch of competing vector extensions.
..........
Ultimately, I'm not convinced that the add + adc methodology is still best for bignums. With a wide enough vector, it seems more important to bring big 256-bit or 512-bit vector instructions to bear on this use case.
EDIT: How many bits is the typical bignum? I think add+adc probably is best at 128, 256, or maybe even 512 bits. But moving up to 1024, 2048, or 4096 bits, SIMD might win out (hard to say without writing the code; it's just a hunch).
2048-bit RSA is the common bignum, right? Any other bignums that are commonly used? EDIT2: Now that I think of it, addition isn't the common operation in RSA; multiplication is (and division, which is built on multiplication).
A bit of a computer-history question: I have never looked at the ISA of the Alpha (referenced in the post), but RISC-V has always struck me as nearly identical to (early) MIPS, just without the HI and LO registers for multiply results, and with the addition of variable-length instruction support, even if the core ISA doesn't use it.
MIPS didn't have a flags register either; it depended on a dedicated zero register and slt (set if less than) instructions.
A few years ago, I designed my own ISA. At the time I investigated the design decisions in lots of ISAs and compared them. Nothing in the RISC-V instruction set stood out to me the way that, for example, the SuperH instruction set did, which is remarkably well designed.
Edit: Don't get me wrong, I don't think RISC-V is "garbage" or anything like that. I just think it could have been better. But of course, most of an architecture's value comes from its ecosystem and the time spent optimizing and tailoring everything...
What if the multi-precision code is written in C?
You can detect the carry of (a+b) in C, branch-free, with (for 32-bit words):

((a & b) | ((a | b) & ~(a + b))) >> 31

So a 64-bit add in C is:

f_low = a_low + b_low
c_high = ((a_low & b_low) | ((a_low | b_low) & ~f_low)) >> 31
f_high = a_high + b_high + c_high
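(The same computation as a self-contained function, for anyone who wants to paste it into a compiler; the name and signature are my own, assuming the usual 32-bit unsigned int to match the >> 31 above:)

void add64(unsigned a_low, unsigned a_high,
           unsigned b_low, unsigned b_high,
           unsigned *f_low, unsigned *f_high)
{
    unsigned lo = a_low + b_low;
    /* branch-free carry-out of the low-word add */
    unsigned c = ((a_low & b_low) | ((a_low | b_low) & ~lo)) >> 31;
    *f_low = lo;
    *f_high = a_high + b_high + c;
}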
So for RISC-V in gcc 8.2.0 with -O2 -S I get:

add a1,a3,a2
or a5,a3,a2
not a7,a1
and a5,a5,a7
and a3,a3,a2
or a5,a5,a3
srli a5,a5,31
add a4,a4,a6
add a4,a4,a5
But for ARM I get (with gcc 9.3.1):

add ip, r2, r1
orr r3, r2, r1
and r1, r1, r2
bic r3, r3, ip
orr r3, r3, r1
lsr r3, r3, #31
add r2, r2, lr
add r2, r2, r3
It's shorter because ARM has bic. Neither one figures out how to use the carry-related instructions.

Ah! But! There is a gcc built-in, __builtin_uadd_overflow(), that replaces the first two C lines above:

c_high = __builtin_uadd_overflow(a_low, b_low, &f_low);
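(Folded into the same hypothetical add64 helper:)

void add64(unsigned a_low, unsigned a_high,
           unsigned b_low, unsigned b_high,
           unsigned *f_low, unsigned *f_high)
{
    /* nonzero iff a_low + b_low wrapped; the 32-bit sum goes to *f_low */
    unsigned c = __builtin_uadd_overflow(a_low, b_low, f_low);
    *f_high = a_high + b_high + c;
}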
So with this:
RISC-V:
add a3,a4,a3
sltu a4,a3,a4
add a5,a5,a2
add a5,a5,a4
ARM:

adds r2, r3, r2
movcs r1, #1
movcc r1, #0
add r3, r3, ip
add r3, r3, r1
RISC-V is faster.

EDIT: Clang has one better: __builtin_addc().
f_low = __builtin_addcl(a_low, b_low, 0, &c);
f_high = __builtin_addcl(a_high, b_high, c, &junk);
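(As a self-contained Clang sketch, assuming the operands are unsigned long to match the l-suffixed builtin; c and junk receive the carry-outs, and the function name is mine:)

void wide_add(unsigned long a_low, unsigned long a_high,
              unsigned long b_low, unsigned long b_high,
              unsigned long *f_low, unsigned long *f_high)
{
    unsigned long c, junk;
    /* add with explicit carry-in; carry-out comes back via the pointer */
    *f_low  = __builtin_addcl(a_low, b_low, 0, &c);
    *f_high = __builtin_addcl(a_high, b_high, c, &junk);
}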
x86:

addl 8(%rdi), %eax
adcl 4(%rdi), %ecx
ARM:

adds w8, w8, w10
add w9, w11, w9
cinc w9, w9, hs
RISC-V:

add a1, a4, a5
add a6, a2, a3
sltu a2, a2, a3
add a6, a6, a2
Why do these half-baked slam pieces always make it to the top of HN?
The unwritten rule of HN:
You do not criticise The Rusted Holy Grail and the Riscy Silver Bullet.
One thing that bothers me: RISC-V seems to use up a lot of the available instruction set space with "HINT" instructions that nobody has (yet) found a use for. Is it anticipated that all of the available HINTs will actually be used, or is the hope that the compressed version of the instruction set will avoid the wasted space?
So, how meaningful is the "projected score of 11+ SPECInt2006/GHz" as claimed here: https://www.sifive.com/press/sifive-raises-risc-v-performanc... ?
The technical lead for SoC architecture at Nokia dismissed RISC-V: https://www.quora.com/Is-RISC-V-the-future/answer/Heikki-Kul...
If this really is an issue, I imagine RISC-V could easily get an extension for adding/subtracting/etc. SIMD vectors in a way that scales to the capabilities of the underlying processor without hardcoding a width.
Oh wow, everybody else is debating the specific intricacies of the design decisions, and I'm here wondering why you would complain about not enough instructions in an architecture with "RISC" in the name.
Rather than glib hand-waving in front of the chalkboard: might there be a decent piece or two of RISC-V hardware that could actually be compared to non-RISC-V hardware with similar budgets (for design work, transistor count, etc.), to see how things work out when running substantial pieces of decently compiled code?
Doesn't RISC-V have an add-with-carry instruction as part of the vector extension? I see it listed here: https://github.com/riscv/riscv-v-spec/releases/tag/v1.0
TL;DR
My code snippet results in bloated code for RISC-V RV64I.
I'm not sure how bloated it is. All of those instructions will compress [1].

[1] https://riscv.org/wp-content/uploads/2015/05/riscv-compresse...
It's slower on RISC-V, but not by a lot on a superscalar core. The x86 and ARMv8 snippets have 2 cycles of latency; the RISC-V one has 4 (dual-issued):

cycle 1: add t0, a4, a6     add t1, a5, a7
cycle 2: sltu t6, t0, a4    sltu t2, t1, a5
cycle 3: add t4, t1, t6     sltu t3, t4, t1
cycle 4: add t6, t2, t3
I'm not getting "terrible" from this.

Experimenting with RISC-V is one of those things I keep postponing.
For those who are more versed, is this really a general problem?
I was under the impression that the real bottleneck is memory, that things like this get hidden in real applications by out-of-order execution, and that it paid off to have simpler instructions because compilers had more freedom to rearrange things.
I noticed "high" and "low" in there, so those code snippets look like 32-bit code, at least to me.
Is that even a fair comparison, given that the ARM and x86 versions used as examples of "better" were 64-bit?
If we're really comparing 32-bit against 64-bit and complaining that the 32-bit code uses more instructions, perhaps we should dig out the 4-bit processors and really sharpen the pitchforks. Alternatively, we could simply not. Comparing apples to oranges doesn't really help.
From the article:
Let's look at some examples of how Risc V underperforms.
First, addition of a double-word integer with carry-out:
add t0, a4, a6 // add low words
sltu t6, t0, a4 // compute carry-out from low add
add t1, a5, a7 // add hi words
sltu t2, t1, a5 // compute carry-out from high add
add t4, t1, t6 // add low-word carry into high result
sltu t3, t4, t1 // compute carry out from the carry add
add t6, t2, t3 // combine carries
Same for 64-bit ARM:
adds x12, x6, x10
adcs x13, x7, x11
Same for 64-bit x86:
add %r8, %rax
adc %r9, %rdx
I call this "benchmark by visual inspection". It is completely useless. Yet, many top devs that I know seem to think that they can emulate a complex chip in their head better than... the chip itself.
Honestly in my eyes, author loses all credibility after saying this:
"I believe that an average computer science student could come up with a better instruction set that Risc V in a single term project"
Utter horse manure.
A bit off-topic, but when did a DWORD implicitly become 64 bits?
In AI you have all these standard data sets (ImageNet, MNIST, etc.) that act as benchmarks for how well an algorithm performs in a given area (character recognition, image recognition, etc.).
Perhaps something similar is needed for ISAs / CPUs? Say an OS kernel, a ZIP algorithm, Mandelbrot, Fizz-buzz... These could measure code compactness, but also performance and energy usage.
A carry flag and overflow checking? We don't need those things because C doesn't support them! That, sadly, seems to be the kind of thought process behind RISC-V and a lot of other "modern" computing:
Everything should be written in C, or some scripting language implemented in C. Writing safe code is easy, just wrap everything in layers of macros that the compiler will magically optimize away, and if it doesn't, computers are fast enough anyway, right? The mark of a real programmer is that every one of their source files includes megabytes of headers defining things like __GNU__EXTENSION_FOO_BAR_F__UNDERSCORE_.
You say your processor has a single instruction to do some extremely common operation, and want to use it? You shouldn't even be reading a processor manual unless you are working on one of the two approved compilers, preferably GCC! If you are very lucky, those compiler people that are so much smarter than you could hope to be, have already implemented some clever transformation that recognizes the specific kind of expression produced by a set of deeply nested macros, and turns them into that single instruction. In the process, it will helpfully remove null pointer checks because you are relying on undefined behaviour somewhere else.
You say you'll do it in assembly? For Kernighan's sake, think about portability!!! I mean, portable to any other system that more or less looks the same as UNIX, with a generous sprinkling of #ifdefs and a configure script that takes minutes to run.
Implement a better language? Sure, as long as the compiler is written in C, preferably outputs C source code (that is then run through GCC), and the output binary must of course link against the system's C library. You can't do it any other way, and every proper UNIX - BSD or Mac OS X - will make it literally impossible by preventing syscalls from any other piece of code.
IMO this is like a cultural virus that has infected everything IT-related, and I don't exactly understand why. Sure, having all these layers of cruft down below lets us build the next web app faster, but isn't it normal to want to fix things? Do some people actually get a sense of satisfaction out of saying "it is a solved problem, don't reinvent the wheel"? Or do they want to think that their knowledge of UNIX and C intricacies is somehow the most important, fundamental thing in computer science?
> Let's look at some examples (7 instructions vs 2 vs 2)
Isn't this the classic RISC vs CISC problem?
Comparing x86/ARM to RISC-V feels like Apples to Grains of Rice.
If RISC-V was born out of a need for an open-source embedded ISA, wouldn't the ISA need to remain very RISC-like to accommodate implementations with fewer available transistors? Or is this an outdated assumption?
Over 200 comments and not a single benchmark comparison. If only there were some way to settle this argument... sigh.
The code given is arbitrary-precision addition. How often do you need that in general computing? Hardly often enough to make a measurable difference.
This isolated case doesn't tell us whether similar awkwardness applies to a lot of other code or not.
The author seems to be assuming that the designers have never thought about this corner case.
Who changed the title?
Moderators, where are you?
TL;DR RISC-V doesn't have add with carry.
I'm not a fan of the RISC-V design but the presence or absence of this instruction doesn't make it a terrible architecture.
The sad thing is that many people will just read the headline or the original email and walk away misinformed, indeed disinformed.
The original title was "Risc V greatly underperforms", which seems like a far more defensible and less inflammatory claim than "Risc V is a terrible architecture", which was picked from the actual message but still isn't the title.
"Gee no carry flag how will we cope?"
All of the discussions about instruction sets, the "mine is better than yours" and the "anyone could do better in a small amount of time", are useless: those arguments, if true, haven't actually resulted in any free ISA that is broadly available, broadly embraced, and implemented in hardware you can buy.
It doesn't matter how great something else could be in theory if it doesn't exist or doesn't reach the same scale and mindshare (or adoption).
I don't think they even tried to read the ISA spec documents. If they had, they would have found that the rationale for most of these decisions is solid: evidence was considered, all the factors were weighed, and decisions were made accordingly.
But ultimately, the gist of their argument is this:
>Any task will require more Risc V instructions than any contemporary instruction set.
Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and compare the sizes of the binaries across architectures.