is this bullshit? or does ISA not really matter in some fictitious world where we can normalize for process and other factors?

techpowerup.com/340779/amd-cla

@regehr Every serious study (both from independent researchers and from vendors themselves) that I've ever seen (and I'm up to 5 or so at this point), broadly, supports this, with some caveats.

It's not "no difference", but for server/application cores, what differences there are typically somewhere in the single-digit %. You can always find pathological examples, but typically it's not that much.

There is a real cost to x86's many warts, but it's mostly in design/validation cost and toolchains.

@regehr Some more details:
- The D/V and toolchain costs are amortized. Broadly speaking, the bigger your ecosystem/market share, the bigger your ability to absorb that cost.
- This holds for what ARM would call "application" cores; oversimplifying a bit, it's essentially a constant overhead on the design that adds some extra area and pipe stages. It's more onerous for smaller cores, but you need to go really small before it starts to matter.

@regehr Eventually, there's nowhere left to hide. For applications where you'd use say an ARM Cortex-M0 or a bare-bones minimal RV32I CPU, I'm not aware of anything x86 past or present that would really make sense.

Intel did "Quark" a while back which I believe was either a 486 or P5 derivative, so still something like a 5-stage pipelined integer core. If you want to go even lower than that, I don't think anyone has (or wants to do) anything.

@steve @regehr Anyway, take that with whatever amount of salt you want, but Intel and AMD both are strongly incentivized to seriously look at this.

They for sure would prefer to sell you x86s because they have decades of experience with that, but they're looking at what it costs them to do it both in capex and in how much it hurts the resulting designs.

And for the latter, the consistent answer has been "a bit, but not much".

@rygorous @steve I've seen part of a convincing / complete formal spec for x86 and I would run away from any effort to validate an implementation of this

@regehr @steve Anecdotally, there are at least 3 companies (Intel, AMD, Centaur) that do this on the regular, and one of them (Centaur) is quite small as such things go.

I wouldn't want to do it either, but the other thing you gotta keep in mind is that the CPU core, while important, is only part of a SoC and ISA has very little impact on the "everything else".

@regehr @steve For example, it's a goddamn NIGHTMARE doing a high-performance memory subsystem for absolutely anything.

This whole "shared memory" fiction we're committed to maintaining is a significant drag on all HW, but HW impls of it are just in another league perf-wise than "just" building message-passing and trying to work around it in SW (lots have tried, but there's little code for it and it's a PITA), so we're kind of stuck with it.

@regehr @steve Basically almost everything that _all_ major ISAs pretend is true about memory at the ISA level is an expensive lie, but one that ~ALL the SW depends on. :)

@regehr @steve To wit: virtual memory is a lie, by design. Uniform memory is a lie. Shared instruction/data memory is a lie. Coherent caches are a lie, caches would rather be _anything_ else. Buses are a lie. Memory-mapped IO is IO lying about being memory. Oh and the data bits and wires are small and shitty enough now that they started lying too and everything is slowly creeping towards ECCing all the things

@regehr @steve Also, re: ISA efficiency, I like re-posting this, by now, rather old image that shows you what the score really is.

This was on the Xeon Phis but the general trend holds to this day. (Source: people.eecs.berkeley.edu/~yssh p. 3) NB this is an in-order core with 512b vector units.

@regehr @steve This is one of the bigger reasons for why ISA doesn't matter more.

Broadly, your uArch is only as good as its data movement, because that shit is what's really expensive, not the logic gates.

It's things like:
- how good is your entire memory subsystem
- how good is your bypass network
- how good are your register files
etc.
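A rough way to see the "data movement is the expensive part" point on whatever machine you have handy (a sketch under my own assumptions; the exact numbers vary wildly by system): time a chain of dependent loads that miss cache against a chain of dependent integer adds.

/* Rough microbenchmark sketch, illustrative only.
 * Build with something like: cc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define N     (1u << 24)   /* ~16M entries, well past any cache */
#define STEPS (1u << 24)

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    uint32_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Build a random single-cycle permutation (Sattolo's algorithm) so every
     * load depends on the previous one and the prefetcher can't help. */
    for (uint32_t i = 0; i < N; i++) next[i] = i;
    srand(1);
    for (uint32_t i = N - 1; i > 0; i--) {
        uint32_t j = (uint32_t)rand() % i;
        uint32_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Dependent loads: latency-bound on the memory subsystem. */
    double t0 = now_sec();
    uint32_t p = 0;
    for (uint32_t s = 0; s < STEPS; s++) p = next[p];
    double t1 = now_sec();

    /* Dependent adds: latency-bound on the ALU. */
    uint64_t acc = p;   /* consume p so neither loop gets optimized away */
    double t2 = now_sec();
    for (uint32_t s = 0; s < STEPS; s++) acc += s ^ acc;
    double t3 = now_sec();

    printf("pointer chase: %.1f ns/step\n", (t1 - t0) / STEPS * 1e9);
    printf("dependent add: %.1f ns/step (acc=%llu)\n",
           (t3 - t2) / STEPS * 1e9, (unsigned long long)acc);
    free(next);
    return 0;
}

On a typical desktop the pointer chase lands somewhere in the tens of nanoseconds per step while the dependent adds are well under a nanosecond each, give or take; that gap is what the whole memory hierarchy (and most of your die area) exists to paper over.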

It's not like you can't make mistakes in the ISA that will really kill your design; you can. That's what happened to the VAX.

@regehr @steve The VAX ISA turns out to be, inadvertently, _extremely_ hostile to an implementation that tries to decouple frontend and backend, which ultimately broke its neck.

x86 has many flaws, but nothing that creates a massive discontinuity where there's basically nothing you can do about a particular problem until you have something like 10x the transistor/power/whatever budget, which is the kind of thing that kills archs.

@argv_minus_one @rygorous @regehr @steve 68k went way way too CISC right at the point RISC got all trendy. Like... RISC was wrong in the long term, but it was 20% right for a decade or so. And then it was wrong. Sadly, that was long enough to kill 68k as a mainstream part (though it lived on for a looooong time in the embedded space)

@wolf480pl it was absolutely not, no.

One of VAX's more notable problems was that absolutely every operand could be a memory reference or even an indirect memory reference (meaning a memory location containing a pointer to the memory location that the instruction actually accessed). Some VAX instructions had 6 operands, each of which could be such a double-indirect memory reference, and IIRC also unaligned and spanning a page boundary, so the worst-case number of page faults per instruction was bonkers.
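Back-of-envelope on how bonkers, under the assumptions just described (my rough count, not an official VAX figure): each of the 6 operands can need its specifier bytes resident (up to 2 pages if they straddle a boundary), plus an unaligned pointer longword (up to 2 pages), plus the unaligned data itself (up to 2 pages), so roughly 6 x 6 = 36 pages touched by the operands alone, on top of the instruction bytes themselves straddling a page or two. Every one of those pages could in principle fault, so a single instruction can be on the hook for dozens of page faults before it completes.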

@wolf480pl Everything could also be an immediate operand.

There were two ways to encode immediates: "literal" was for short integers and was more compact; anything out of range used the actual immediate encoding.

On the VAX, you had 16 GPRs R0-R15, and R15 was just your PC. (32-bit ARM later copied that mistake, and it is a mistake.)

The immediate encoding boiled down to (r15)+, i.e., fetch data (of whatever the right size is) at PC and auto-increment. That's also how it was encoded.

@wolf480pl So, in the VAX encoding, if you have say an add instruction where the first operand is an immediate, you get the encoding for the first operand, then the immediate bytes, then the encoding for the second operand, and so forth.

Crucially, you don't really know where the byte describing the second operand starts until you've finished the first operand; and this goes for all (up to 6) operands.

Nobody does this anymore, because, turns out, it's a _terrible_ idea.
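To make the serial dependency concrete, here's a deliberately simplified C sketch of VAX-style operand-length calculation (my own toy rendering of the idea, not a faithful decoder; the mode numbers roughly follow the real encoding, but don't rely on the details):

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Returns the number of bytes consumed by one operand specifier starting at p,
 * for an operand whose data size is `width` bytes. Simplified and hypothetical. */
static size_t operand_length(const uint8_t *p, size_t width) {
    uint8_t mode = p[0] >> 4;
    uint8_t reg  = p[0] & 0x0f;

    if (mode < 4)            return 1;              /* short literal             */
    switch (mode) {
    case 5: case 6: case 7:  return 1;              /* register / deferred / autodec */
    case 8:                  /* autoincrement; (PC)+ is how immediates are done  */
        return (reg == 15) ? 1 + width : 1;
    case 9:                  /* autoincrement deferred; with PC it's an absolute address */
        return (reg == 15) ? 1 + 4 : 1;
    case 10: case 11:        return 1 + 1;          /* byte displacement (+deferred) */
    case 12: case 13:        return 1 + 2;          /* word displacement (+deferred) */
    case 14: case 15:        return 1 + 4;          /* long displacement (+deferred) */
    case 4:                  /* indexed: a whole second specifier follows        */
        return 1 + operand_length(p + 1, width);
    }
    return 1;                /* unreachable */
}

int main(void) {
    /* A made-up 3-operand instruction: opcode, then a 4-byte immediate,
     * a register operand, and a word-displacement operand. */
    uint8_t insn[] = { 0xC1,                          /* opcode byte (a 3-operand add, if memory serves) */
                       0x8F, 0x78, 0x56, 0x34, 0x12,  /* (PC)+ i.e. immediate 0x12345678 */
                       0x50,                          /* register mode, R0               */
                       0xC5, 0x00, 0x10 };            /* word displacement off R5        */
    size_t off = 1;                                   /* skip the opcode */
    for (int i = 0; i < 3; i++) {
        size_t len = operand_length(&insn[off], 4);
        printf("operand %d starts at byte offset %zu and is %zu bytes long\n",
               i + 1, off, len);
        off += len;   /* only now do we know where the next operand begins */
    }
    return 0;
}

The whole problem is that last `off += len`: a wide decoder that wants to crack several operands (or several instructions) per cycle has to speculate or brute-force every plausible starting offset, because nothing short of fully decoding operand N tells you where operand N+1 begins.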


@rygorous @wolf480pl My all-time fave ISA as an assembly programmer was the DEC-20. 36-bit insns and memory architecture. String insns could pack 4- to 12-bit values into 36-bit words. 4-bit hex, 5-bit Baudot, 7-bit ASCII, and 9-bit extended ASCII were all used in my code.

It's a pipe dream now, but I keep hoping a balanced ternary CPU will drop before I die.
