is this bullshit? or does ISA not really matter in some fictitious world where we can normalize for process and other factors?
https://www.techpowerup.com/340779/amd-claims-arm-isa-doesnt-offer-efficiency-advantage-over-x86
@regehr Every serious study (both from independent researchers and from vendors themselves) that I've ever seen (and I'm up to 5 or so at this point), broadly, supports this, with some caveats.
It's not "no difference", but for server/application cores, what differences there are typically somewhere in the single-digit %. You can always find pathological examples, but typically it's not that much.
There is a real cost to x86's many warts, but it's mostly in design/validation cost and toolchains.
@regehr Some more details:
- The D/V and toolchain costs are amortized. Broadly speaking, the bigger your ecosystem/market share, the easier it is to absorb that cost.
- This holds for what ARM would call "application" cores; oversimplifying a bit, it's essentially a constant overhead on the design that adds some extra area and pipe stages. It's more onerous for smaller cores, but you have to go really small before it starts to dominate.
@regehr Eventually, there's nowhere left to hide. For applications where you'd use, say, an ARM Cortex-M0 or a bare-bones minimal RV32I CPU, I'm not aware of anything x86, past or present, that would really make sense.
Intel did "Quark" a while back which I believe was either a 486 or P5 derivative, so still something like a 5-stage pipelined integer core. If you want to go even lower than that, I don't think anyone has (or wants to do) anything.
@steve @regehr Anyway, take that with whatever amount of salt you want, but Intel and AMD both are strongly incentivized to seriously look at this.
They for sure would prefer to sell you x86s because they have decades of experience with that, but they're looking at what it costs them to do it both in capex and in how much it hurts the resulting designs.
And for the latter, the consistent answer has been "a bit, but not much".
@regehr @steve Anecdotally, there are at least 3 companies (Intel, AMD, Centaur) that do this on the regular, and one of them (Centaur) is quite small as such things go.
I wouldn't want to do it either, but the other thing you gotta keep in mind is that the CPU core, while important, is only part of an SoC, and the ISA has very little impact on the "everything else".
@regehr @steve For example, it's a goddamn NIGHTMARE doing a high-performance memory subsystem for absolutely anything.
This whole "shared memory" fiction we're committed to maintaining is a significant drag on all HW, but HW impls of it are just in another league perf-wise than "just" building message-passing and trying to work around it in SW (lots have tried, but there's little code for it and it's a PITA), so we're kind of stuck with it.
@regehr @steve To wit: virtual memory is a lie, by design. Uniform memory is a lie. Shared instruction/data memory is a lie. Coherent caches are a lie, caches would rather be _anything_ else. Buses are a lie. Memory-mapped IO is IO lying about being memory. Oh and the data bits and wires are small and shitty enough now that they started lying too and everything is slowly creeping towards ECCing all the things
@regehr @steve Also, re: ISA efficiency, I like re-posting this, by now, rather old image that shows you what the score really is.
This was on the Xeon Phis but the general trend holds to this day. (Source: https://people.eecs.berkeley.edu/~ysshao/assets/papers/shao2013-islped.pdf p. 3) NB this is an in-order core with 512b vector units.
@regehr @steve This is one of the bigger reasons why ISA doesn't matter more.
Broadly, your uArch is only as good as its data movement, because that shit is what's really expensive, not the logic gates.
It's things like:
- how good is your entire memory subsystem
- how good is your bypass network
- how good are your register files
etc.
It's not like you can't make mistakes in the ISA that will really kill your design; you can. That's what happened to the VAX.
@regehr @steve The VAX ISA turns out to be, inadvertently, _extremely_ hostile to an implementation that tries to decouple frontend and backend, which ultimately broke its neck.
x86 has many flaws, but none that create a massive discontinuity where there's basically nothing you can do about a particular problem until you have like 10x the transistor/power/whatever budget; that kind of discontinuity is what kills archs.
@argv_minus_one @rygorous @regehr @steve 68k went way way too CISC right at the point RISC got all trendy. Like... RISC was wrong in the long term, but it was 20% right for a decade or so. And then it was wrong. Sadly, that was long enough to kill 68k as a mainstream part (though it lived on for a looooong time in the embedded space)
@TomF @argv_minus_one @rygorous @regehr @steve wasn't 68k basically a VAX?
@wolf480pl it was absolutely not, no.
One of the VAX's more notable problems was that absolutely every operand could be a memory reference or even an indirect memory reference (meaning a memory location containing a pointer to the memory location the instruction actually accessed). Some VAX instructions had 6 operands, each of which could be a double-indirect memory reference, and IIRC also unaligned and spanning a page boundary, so the worst-case number of page faults per instruction was bonkers.
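To put a rough number on "bonkers", here's a back-of-envelope sketch. It assumes a 6-operand instruction where every operand is a double-indirect reference, every pointer and every datum is unaligned and straddles a page boundary, and the instruction bytes themselves cross a page; the breakdown is my own illustration, not anything out of a VAX manual.

```python
# Back-of-envelope worst case for a single VAX instruction, under the
# assumptions above (illustrative only; the real worst case depends on the
# exact opcode and addressing modes).
operands          = 6
pages_per_pointer = 2   # unaligned pointer longword straddling a page boundary
pages_per_datum   = 2   # unaligned operand data straddling a page boundary
pages_for_insn    = 2   # the variable-length instruction bytes crossing a page

worst_case_pages = operands * (pages_per_pointer + pages_per_datum) + pages_for_insn
print(worst_case_pages)   # 26 distinct pages, each of which might fault
```

And the hardware has to be able to take a fault at any of those points and later restart or resume the instruction, which is a lot of machinery before you've computed anything.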
@wolf480pl Everything could also be an immediate operand.
There were two ways to encode immediates: "literal" was for short integers and was more compact; anything out of range used the actual immediate encoding.
On the VAX, you had 16 GPRs R0-R15, and R15 was just your PC. (32-bit ARM later copied that mistake, and it is a mistake.)
The immediate encoding boiled down to (r15)+, i.e., fetch data (of whatever the right size is) at the PC and auto-increment. That's literally how it was encoded, too: autoincrement mode with R15 as the register.
@wolf480pl So, in the VAX encoding, if you have say an add instruction where the first operand is an immediate, you get the encoding for the first operand, then the immediate bytes, then the encoding for the second operand, and so forth.
Crucially, you don't really know where the byte describing the second operand starts until you've finished the first operand; and this goes for all (up to 6) operands.
Nobody does this anymore because, as it turns out, it's a _terrible_ idea.
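Here's a minimal sketch of that serial dependency. The mode table below is a simplified subset of the real VAX operand specifiers (indexed mode and a few others are left out, and the example opcode/operand bytes are made up), but it shows the structural problem: each specifier's length depends on its mode byte, immediates are just (PC)+ data inlined into the instruction stream, and you can't even locate operand N+1 without fully parsing operand N.

```python
# Simplified VAX-style operand walk (subset of the real modes; indexed mode
# and others are omitted). Each operand specifier is a mode/register byte,
# possibly followed by displacement or immediate bytes, so specifier lengths
# are only known after parsing the previous specifier.

def operand_length(mode, reg, opsize):
    if mode <= 0x3:                   # short "literal": value packed into the byte
        return 1
    if mode == 0x8 and reg == 15:     # (PC)+  == immediate data follows inline
        return 1 + opsize
    if mode == 0x9 and reg == 15:     # @(PC)+ == 4-byte absolute address follows
        return 1 + 4
    if mode in (0x5, 0x6, 0x7, 0x8, 0x9):   # register / deferred / auto-inc/dec
        return 1
    if mode in (0xA, 0xB):            # byte displacement (deferred)
        return 1 + 1
    if mode in (0xC, 0xD):            # word displacement (deferred)
        return 1 + 2
    if mode in (0xE, 0xF):            # longword displacement (deferred)
        return 1 + 4
    raise ValueError("mode not modeled in this sketch")

def operand_offsets(insn, num_operands, opsize):
    """Byte offset of each operand specifier; note the serial dependency."""
    offsets, pos = [], 1              # skip the 1-byte opcode
    for _ in range(num_operands):
        offsets.append(pos)
        mode, reg = insn[pos] >> 4, insn[pos] & 0xF
        pos += operand_length(mode, reg, opsize)   # must finish this operand...
    return offsets                                 # ...before the next one exists

# A made-up "ADDL3 #1, disp16(r2), r3" in this toy encoding:
insn = bytes([0xC1, 0x8F, 1, 0, 0, 0, 0xC2, 0x34, 0x12, 0x53])
print(operand_offsets(insn, num_operands=3, opsize=4))   # [1, 6, 9]
```

A wide decoder wants to find several instruction boundaries per cycle; with an encoding like this, even finding the end of one instruction is a serial walk.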
@rygorous sounds like it'd save you a lot of gates in the uninteresting scenario of a cacheless byte-addressable memory and a core that takes 3+ cycles to process each operand
@wolf480pl VAX was multi-cycle everything: basically something like at least 1 cycle for the base operation (even with no operands), at least 1 extra cycle for every operand, and more if memory accesses were involved.
They did try to pipeline it past that (with the NVAX) but the ISA proved to be remarkably resistant to doing something much better, at least with the transistor budget they had at the time (late 80s/early 90s).
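Just to make that cost model concrete (with made-up, illustrative numbers, not real timings from any particular VAX model):

```python
# Rough cycle estimate using the model above: 1 cycle base, 1 per operand,
# plus some penalty per memory operand. The mem_penalty value is made up.
def rough_cycles(num_operands, mem_operands, mem_penalty=2):
    return 1 + num_operands + mem_operands * mem_penalty

print(rough_cycles(num_operands=3, mem_operands=0))   # 3-reg op        -> 4 cycles
print(rough_cycles(num_operands=3, mem_operands=1))   # one mem operand -> 6 cycles
```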
@wolf480pl Which is all just a random historical footnote at this point, but it is important context because all the original first-wave early-to-mid-80s RISC papers were subtweeting VAXen, specifically and especially.
VAX is '77. The 8086 is from '78 and descended from the 8080 and ultimately the 8008 ('72). IBM z mainframes are still based on the original System/360 architecture from 1965. x86 and S/360: both decidedly not RISC, and both made the jumps to first pipelined, then superscalar, then OoO just fine. VAX, nope.
@wolf480pl The original RISC papers put all "CISCs" in the same boat, but historically, that is demonstrably false.
VAX made some very specific decisions that felt clean and elegant in the short term and screwed them over big-time in the long term.
Same for the first-gen RISCs - load delay slots made it into MIPS I but were gone again by MIPS II, branch delay slots stuck around longer but were regretted not long after, etc.
I don't think there's a big lesson here other than predicting the future is hard.
@wolf480pl Except, of course, for ISA designers, where there's plenty of immediately actionable information from how the VAX shook out, but that's less along RISC/CISC ideological lines and more like:
- make instructions fixed-size or, when that's not practical, at least make it easy to tell the insn size from the first word (see the sketch after this list)
- don't bake in decisions that really lock you into one particular implementation; you might not want that implementation in the future
- don't make the PC a GPR (SP is somewhat special too)
etc.
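As a concrete example of the first point (the sketch promised above): this is roughly the RISC-V approach, where the length of every instruction is determined by its first 16-bit parcel alone, so a wide fetch/decode unit can size instructions without parsing the rest of them. The 16- and 32-bit rules match the ratified spec; the 48/64-bit rules are the reserved longer-encoding scheme, included for flavor.

```python
# RISC-V-style instruction sizing from the first 16-bit parcel.
def insn_length_bytes(first_parcel: int) -> int:
    if first_parcel & 0b11 != 0b11:
        return 2                       # compressed (RVC) instruction
    if first_parcel & 0b11100 != 0b11100:
        return 4                       # standard 32-bit instruction
    if first_parcel & 0b111111 == 0b011111:
        return 6                       # reserved 48-bit encoding
    if first_parcel & 0b1111111 == 0b0111111:
        return 8                       # reserved 64-bit encoding
    raise ValueError("longer/reserved encoding")

print(insn_length_bytes(0x4501))   # c.li a0, 0                  -> 2
print(insn_length_bytes(0x0513))   # low half of addi a0, a0, 0  -> 4
```

Contrast with the VAX walk earlier in the thread, where you can't even size a single instruction without chasing every operand specifier in order.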
@rygorous those things are fairly obvious if you know pipelining is a thing. I'm guessing in the 70s they didn't know that yet?
@wolf480pl They did. Pipelining is 50s tech (IBM Stretch had a 3-stage pipeline, designed starting in 1956). Superscalar and OoO were developed in the 60s (e.g. IBM ACS, the Tomasulo algorithm) and shipped that same decade (IBM System/360 Model 91 FPU, 1967).
But this was all mostly the purview of supercomputers.
The whole idea of an Instruction Set Architecture with multiple impls goes back to the S/360. That was barely 10 years old when they designed the VAX.
@rygorous @wolf480pl That's why we used "emulators" of a virtual insn set (a modern example being Java) - so that moving to a new ISA was trivial and barely a blip. We moved from 68k to 88k to PPC to x86 with all the code binaries unchanged - just port the emulator.
IBM did something similar with the AS/400. The apps were mostly RPG, which was unchanged as they kept changing the CPU.