Does memory fencing block threads in multi-core CPUs?

I was reading the Intel 64 and IA-32 instruction set reference to get an idea of how memory fences work. My question, taking SFENCE as an example: in order to make sure that all store operations are globally visible, does a multi-core CPU park all threads, even those running on other cores, until cache coherence is achieved?

@Stephen C - why don’t you make this comment an answer?
– theMayer
Aug 12 at 13:28

1 Answer

Barriers don't make other threads/cores wait. They make some operations in the current thread wait, depending on what kind of barrier it is. Out-of-order execution of non-memory instructions isn't necessarily blocked.

Barriers don't even make your loads/stores visible to other threads any faster; CPU cores already commit retired stores from the store buffer to L1d cache as fast as possible, once all the necessary MESI coherency rules have been followed. (x86's strong memory model only allows stores to commit in program order, even without barriers.)

Barriers don't necessarily order instruction execution, they order global visibility, i.e. what comes out the far end of the store buffer.

mfence (or a locked operation like lock add or xchg [mem], reg) makes all later loads/stores in the current thread wait until all previous loads and stores are completed and globally visible (i.e. the store buffer is flushed).
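A portable way to see what a full barrier buys you is the classic store-buffering litmus test. The sketch below is only an illustration: the function name and parameters are made up, and it uses std::atomic_thread_fence(std::memory_order_seq_cst) rather than raw asm (compilers for x86-64 typically emit mfence or a locked instruction for it, but that mapping is an assumption about the compiler, not a guarantee of the C++ standard).

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical litmus test: each thread stores to its own flag, then loads
// the other thread's flag. Returns how many runs ended with both loads
// observing 0 (i.e. each load ran ahead of the other thread's store).
int count_both_zero(int iterations, bool use_fence) {
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            // Full barrier: the later load waits for the earlier store.
            if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With use_fence set to true the result must be 0. With it set to false a non-zero count is allowed (StoreLoad reordering through the store buffer), although thread start-up overhead makes it rare in a test this simple. Note that neither thread is ever made to wait for the other; each fence only orders that thread's own operations.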

On Skylake, mfence (unlike locked operations) is implemented in a way that stalls the whole core until the store buffer drains. See my answer on "Are loads and stores the only instructions that gets reordered?" for details. Locked operations and xchg aren't like that; they're full memory barriers but they still allow out-of-order execution of imul eax, edx, so we have proof that they don't stall the whole core.

With hyperthreading, I think this stalling happens per logical thread, not the whole core.

But note that the mfence manual entry doesn't say anything about stalling the core, so future x86 implementations are free to make it more efficient (like a lock or dword [rsp], 0), and only prevent later loads from reading L1d cache without blocking later non-load instructions.

sfence only does anything if there are any NT stores in flight. It doesn't order loads at all, so it doesn't have to stop later instructions from executing. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?.

It just places a barrier in the store buffer that stops NT stores from reordering across it, and forces earlier NT stores to be globally visible before the sfence barrier can leave the store buffer (i.e. write-combining buffers have to flush). But it can already have retired from the out-of-order execution part of the core (the ROB, or ReOrder Buffer) before it reaches the end of the store buffer.
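For illustration, here is a minimal sketch of the one case where sfence matters: publishing data written with NT stores. It assumes an x86-64 target and the SSE2 intrinsics _mm_stream_si32 and _mm_sfence; the buffer, flag, and function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <emmintrin.h>  // _mm_stream_si32 (SSE2); also provides _mm_sfence
#include <thread>

alignas(64) static int nt_buffer[16];
static std::atomic<bool> nt_ready{false};

void nt_producer() {
    for (int i = 0; i < 16; ++i)
        _mm_stream_si32(&nt_buffer[i], i * 10);  // NT store (movnti)
    // NT stores are weakly ordered: without this, the flag store below
    // could become globally visible before the buffer contents.
    _mm_sfence();
    nt_ready.store(true, std::memory_order_release);
}

int nt_consumer_sum() {
    while (!nt_ready.load(std::memory_order_acquire)) { /* spin */ }
    int sum = 0;
    for (int v : nt_buffer) sum += v;
    return sum;
}

// Runs one producer and one consumer; returns the sum the consumer saw.
int run_nt_demo() {
    nt_ready.store(false, std::memory_order_relaxed);
    int sum = 0;
    std::thread consumer([&] { sum = nt_consumer_sum(); });
    std::thread producer(nt_producer);
    producer.join();
    consumer.join();
    assert(sum == 1200);  // 10 * (0 + 1 + ... + 15)
    return sum;
}
```

The consumer needs no fence instruction of its own here; the acquire load compiles to a plain load on x86. If the producer used ordinary stores instead of _mm_stream_si32, the _mm_sfence would be unnecessary.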

See also Does a memory barrier ensure that the cache coherence has been completed?

lfence as a memory barrier is nearly useless: it only prevents movntdqa loads from WC memory from reordering with later loads/stores. You almost never need that.

The actual use-cases for lfence mostly involve its Intel (but not AMD) behaviour of not allowing later instructions to execute until it has itself retired. So lfence; rdtsc on Intel CPUs lets you avoid having rdtsc read the clock too soon, as a cheaper alternative to cpuid; rdtsc.
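As a rough sketch of the lfence; rdtsc idiom from C++, assuming GCC or Clang on x86-64 (where <x86intrin.h> provides __rdtsc and _mm_lfence; MSVC uses <intrin.h> instead). The helper names are hypothetical, and the second lfence after rdtsc is my addition, a common practice to stop later work from starting before the timestamp is read.

```cpp
#include <cassert>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc, _mm_lfence (GCC/Clang)

// Hypothetical helper: read the TSC without letting it run ahead of
// earlier instructions (relies on Intel's documented lfence behaviour).
inline std::uint64_t fenced_rdtsc() {
    _mm_lfence();                 // earlier instructions complete first
    std::uint64_t t = __rdtsc();
    _mm_lfence();                 // later instructions wait for rdtsc
    return t;
}

// Returns elapsed TSC ticks (reference cycles, not core clock cycles)
// for a simple loop of `iterations` volatile updates.
std::uint64_t time_loop_ticks(int iterations) {
    assert(iterations > 0);
    volatile int sink = 0;
    const std::uint64_t start = fenced_rdtsc();
    for (int i = 0; i < iterations; ++i) sink = sink + i;
    const std::uint64_t end = fenced_rdtsc();
    return end - start;
}
```

Nothing here orders memory between threads; the fences are used purely for their effect on instruction execution within this one core.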

Another important recent use-case for lfence is to block speculative execution (e.g. before a conditional or indirect branch), for Spectre mitigation. This is completely based on its Intel-guaranteed side effect of being partially serializing, and has nothing to do with its LoadLoad + LoadStore barrier effect.

lfence does not have to wait for the store buffer to drain before it can retire from the ROB, so no combination of LFENCE + SFENCE is as strong as MFENCE. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?

Related: When should I use _mm_sfence _mm_lfence and _mm_mfence (when writing in C++ instead of asm).

Note that the C++ intrinsics like _mm_sfence also block compile-time memory ordering. This is often necessary even when the asm instruction itself isn't, because C++ compile-time reordering happens based on C++'s very weak memory model, not the strong x86 memory model which applies to the compiler-generated asm.

So _mm_sfence may make your code work, but unless you're using NT stores it's overkill. A more efficient option would be std::atomic_thread_fence(std::memory_order_release) (which turns into zero instructions, just a compiler barrier.) See http://preshing.com/20120625/memory-ordering-at-compile-time/.
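A minimal sketch of that alternative, with hypothetical names (payload, ready, publish_and_read): a plain store is published with a release fence plus a relaxed flag store, and the reader pairs it with an acquire fence. On x86 both fences are expected to compile to no instructions and only restrict compile-time reordering.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static int payload = 0;                 // ordinary (non-atomic) data
static std::atomic<bool> ready{false};  // publication flag

void producer(int value) {
    payload = value;
    // Keeps the payload store from being reordered after the flag store.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
    // Pairs with the release fence: the payload read cannot move above it.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Runs one producer and one consumer; returns what the consumer read.
int publish_and_read(int value) {
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread reader([&] { seen = consumer(); });
    std::thread writer(producer, value);
    writer.join();
    reader.join();
    assert(seen == value);
    return seen;
}
```

Calling publish_and_read(42) should return 42 on any conforming implementation, since the fence pairing establishes a happens-before edge from the payload write to the payload read.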

RE "lfence as a memory barrier is nearly useless": lfence is now the mainstream way of dealing with most Spectre-like vulnerabilities in software. Anyway, the question seems to me too broad because a detailed discussion of each fence is a lot to write. But this answer should resolve the main misunderstanding of the OP I think.
– Hadi Brais
Aug 12 at 22:08

@HadiBrais: Exactly. That use case has nothing to do with ordering between two data accesses to block LoadLoad or LoadStore reordering. It's for the Intel-guaranteed side-effect of blocking OoO exec.
– Peter Cordes
Aug 12 at 22:11

@HadiBrais: That sounds like a description of why the store buffer exists in the first place, to decouple in-order commit from the execution pipeline, and from loads. I haven't heard of intentionally delaying commit. Would that help for a store/reload that's split across a cache-line boundary? L1d load/use latency is about the same as store-forward latency, and SF latency doesn't include address-generation latency. Maybe if a store-forwarding was already detected and lined up? If it's possible for that to happen in the same cycle that the data could have otherwise committed?
– Peter Cordes
Aug 12 at 22:24

@HadiBrais: I think the obvious reason is to prevent future stalls from the store buffer being full, defeating the decoupling of OoO exec from store commit. It's only safe to delay commit if you can see the future and see there won't be any cache-miss stores that prevent you from doing later commits at 1 per clock. (Remember x86's strong memory model requires in-order commit). Any possible downside from commit-as-fast-as-possible is pretty small, so it doesn't seem worth it to build extra logic to consider delaying it.
– Peter Cordes
Aug 13 at 0:14

This AMD/lfence thing comes up enough that maybe it deserves a canonical question (and hopefully one day a canonical answer).
– BeeOnRope
Aug 14 at 15:29
