linux/Documentation/memory-barriers.txt

			 ============================
			 LINUX KERNEL MEMORY BARRIERS
			 ============================

By: David Howells <dhowells@redhat.com>
    Paul E. McKenney <paulmck@linux.ibm.com>
    Will Deacon <will.deacon@arm.com>
    Peter Zijlstra <peterz@infradead.org>

==========
DISCLAIMER
==========

This document is not a specification; it is intentionally (for the sake of
brevity) and unintentionally (due to being human) incomplete. This document is
meant as a guide to using the various memory barriers provided by Linux, but
in case of any doubt (and there are many) please ask.  Some doubts may be
resolved by referring to the formal memory consistency model and related
documentation at tools/memory-model/.  Nevertheless, even this memory
model should be viewed as the collective opinion of its maintainers rather
than as an infallible oracle.

To repeat, this document is not a specification of what Linux expects from
hardware.

The purpose of this document is twofold:

 (1) to specify the minimum functionality that one can rely on for any
     particular barrier, and

 (2) to provide a guide as to how to use the barriers that are available.

Note that an architecture can provide more than the minimum requirement
for any particular barrier, but if the architecture provides less than
that, that architecture is incorrect.

Note also that it is possible that a barrier may be a no-op for an
architecture because the way that arch works renders an explicit barrier
unnecessary in that case.


========
CONTENTS
========

 (*) Abstract memory access model.

     - Device operations.
     - Guarantees.

 (*) What are memory barriers?

     - Varieties of memory barrier.
     - What may not be assumed about memory barriers?
     - Address-dependency barriers (historical).
     - Control dependencies.
     - SMP barrier pairing.
     - Examples of memory barrier sequences.
     - Read memory barriers vs load speculation.
     - Multicopy atomicity.

 (*) Explicit kernel barriers.

     - Compiler barrier.
     - CPU memory barriers.

 (*) Implicit kernel memory barriers.

     - Lock acquisition functions.
     - Interrupt disabling functions.
     - Sleep and wake-up functions.
     - Miscellaneous functions.

 (*) Inter-CPU acquiring barrier effects.

     - Acquires vs memory accesses.

 (*) Where are memory barriers needed?

     - Interprocessor interaction.
     - Atomic operations.
     - Accessing devices.
     - Interrupts.

 (*) Kernel I/O barrier effects.

 (*) Assumed minimum execution ordering model.

 (*) The effects of the cpu cache.

     - Cache coherency vs DMA.
     - Cache coherency vs MMIO.

 (*) The things CPUs get up to.

     - And then there's the Alpha.
     - Virtual Machine Guests.

 (*) Example uses.

     - Circular buffers.

 (*) References.


============================
ABSTRACT MEMORY ACCESS MODEL
============================

Consider the following abstract model of the system:

		            :                :
		            :                :
		            :                :
		+-------+   :   +--------+   :   +-------+
		|       |   :   |        |   :   |       |
		|       |   :   |        |   :   |       |
		| CPU 1 |<----->| Memory |<----->| CPU 2 |
		|       |   :   |        |   :   |       |
		|       |   :   |        |   :   |       |
		+-------+   :   +--------+   :   +-------+
		    ^       :       ^        :       ^
		    |       :       |        :       |
		    |       :       |        :       |
		    |       :       v        :       |
		    |       :   +--------+   :       |
		    |       :   |        |   :       |
		    |       :   |        |   :       |
		    +---------->| Device |<----------+
		            :   |        |   :
		            :   |        |   :
		            :   +--------+   :
		            :                :

Each CPU executes a program that generates memory access operations.  In the
abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
perform the memory operations in any order it likes, provided program causality
appears to be maintained.  Similarly, the compiler may also arrange the
instructions it emits in any order it likes, provided it doesn't affect the
apparent operation of the program.

So in the above diagram, the effects of the memory operations performed by a
CPU are perceived by the rest of the system as the operations cross the
interface between the CPU and rest of the system (the dotted lines).


For example, consider the following sequence of events:

	CPU 1		CPU 2
	===============	===============
	{ A == 1; B == 2 }
	A = 3;		x = B;
	B = 4;		y = A;

The set of accesses as seen by the memory system in the middle can be arranged
in 24 different combinations:

	STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
	STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
	STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
	STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
	STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
	STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
	STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
	STORE B=4, ...
	...

and can thus result in four different combinations of values:

	x == 2, y == 1
	x == 2, y == 3
	x == 4, y == 1
	x == 4, y == 3
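
The counting above can be checked mechanically.  The following is a minimal
userspace sketch, not kernel code: it assumes the memory system applies the
four accesses one at a time in some order, tries every permutation, and
tallies the distinct (x, y) results.  The names enumerate_orderings(),
run_order() and struct tally are hypothetical and exist only for this
illustration.

```c
#include <stdbool.h>

/* The four accesses from the example above (hypothetical names). */
enum op { STORE_A, STORE_B, LOAD_B_TO_X, LOAD_A_TO_Y };

struct tally {
	int orderings;		/* arrangements of the four accesses */
	int distinct_results;	/* distinct (x, y) pairs observed */
};

/* Apply the four accesses in the given order and report x and y. */
static void run_order(const enum op order[4], int *x, int *y)
{
	int A = 1, B = 2;	/* initial state: { A == 1; B == 2 } */

	for (int i = 0; i < 4; i++) {
		switch (order[i]) {
		case STORE_A:     A = 3;  break;
		case STORE_B:     B = 4;  break;
		case LOAD_B_TO_X: *x = B; break;
		case LOAD_A_TO_Y: *y = A; break;
		}
	}
}

/* Try every permutation of the four accesses. */
struct tally enumerate_orderings(void)
{
	struct tally t = { 0, 0 };
	bool seen[5][5] = { { false } };	/* indexed by x and y (1..4) */

	for (int a = 0; a < 4; a++)
	for (int b = 0; b < 4; b++)
	for (int c = 0; c < 4; c++)
	for (int d = 0; d < 4; d++) {
		/* Skip index tuples that are not permutations. */
		if (a == b || a == c || a == d || b == c || b == d || c == d)
			continue;

		enum op order[4] = { (enum op)a, (enum op)b,
				     (enum op)c, (enum op)d };
		int x = 0, y = 0;

		run_order(order, &x, &y);
		t.orderings++;
		if (!seen[x][y]) {
			seen[x][y] = true;
			t.distinct_results++;
		}
	}
	return t;
}
```

Under this model, enumerate_orderings() reports 24 orderings and four distinct
results, matching the figures quoted above.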


Furthermore, the stores committed by a CPU to the memory system may not be
perceived by the loads made by another CPU in the same order as the stores were
committed.


As a further example, consider this sequence of events:

	CPU 1		CPU 2
	===============	===============
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
	B = 4;		Q = P;
	P = &B;		D = *Q;

There is an obvious address dependency here, as the value loaded into D depends
on the address retrieved from P by CPU 2.  At the end of the sequence, any of
the following results are possible:

	(Q == &A) and (D == 1)
	(Q == &B) and (D == 2)
	(Q == &B) and (D == 4)

Note that CPU 2 will never try to load C into D because the CPU will load P
into Q before issuing the load of *Q.
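
The three results can be reproduced with a small simulation.  This is only a
sketch under stated assumptions: CPU 1's two stores may become visible in
either order, CPU 2's dependent loads stay in program order, and the two CPUs
interleave arbitrarily.  The names run(), count_results(), struct result and
enum cell are hypothetical.

```c
#include <stdbool.h>

enum cell { CELL_A, CELL_B, CELL_C };	/* stand-ins for &A, &B, &C */

struct result {
	enum cell q;	/* final value of Q */
	int d;		/* final value of D */
};

/*
 * Run one execution.  cpu1_swapped means "P = &B" becomes visible before
 * "B = 4".  Bit i of mask set means step i (0..3) belongs to CPU 2; the
 * caller passes masks with exactly two bits set.
 */
static struct result run(bool cpu1_swapped, unsigned int mask)
{
	int mem[3] = { 1, 2, 3 };	/* A == 1, B == 2, C == 3 */
	enum cell P = CELL_A, Q = CELL_C;
	int D = 0;
	int s1 = 0, s2 = 0;		/* next step of CPU 1 and CPU 2 */

	for (int i = 0; i < 4; i++) {
		if (mask & (1u << i)) {
			if (s2++ == 0)
				Q = P;		/* Q = LOAD P */
			else
				D = mem[Q];	/* D = LOAD *Q */
		} else {
			bool store_b = (s1++ == 0) != cpu1_swapped;

			if (store_b)
				mem[CELL_B] = 4;	/* B = 4 */
			else
				P = CELL_B;		/* P = &B */
		}
	}
	return (struct result){ Q, D };
}

/* Count distinct (Q, D) results; *saw_c reports whether Q ever ended as &C. */
int count_results(bool *saw_c)
{
	bool seen[3][5] = { { false } };	/* indexed by Q and D (0..4) */
	int distinct = 0;

	*saw_c = false;
	for (int swapped = 0; swapped < 2; swapped++) {
		for (unsigned int mask = 0; mask < 16; mask++) {
			int bits = (mask & 1) + ((mask >> 1) & 1) +
				   ((mask >> 2) & 1) + ((mask >> 3) & 1);

			if (bits != 2)
				continue;

			struct result r = run(swapped != 0, mask);

			if (r.q == CELL_C)
				*saw_c = true;
			if (!seen[r.q][r.d]) {
				seen[r.q][r.d] = true;
				distinct++;
			}
		}
	}
	return distinct;
}
```

In this model count_results() finds exactly the three results listed above,
and Q never ends up pointing at C, because the load of P always precedes the
load of *Q.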


DEVICE OPERATIONS
-----------------

Some devices present their control interfaces as collections of memory
locations, but the order in which the control registers are accessed is very
important.  For instance, imagine an ethernet card with a set of internal
registers that are accessed through an address port register (A) and a data
port register (D).  To read internal register 5, the following code might then
be used:

	*A = 5;
	x = *D;

but this might show up as either of the following two sequences:

	STORE *A = 5, x = LOAD *D
	x = LOAD *D, STORE *A = 5

the second of which will almost certainly result in a malfunction, since it sets
the address _after_ attempting to read the register.
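
To make the malfunction concrete, here is a minimal sketch of a made-up device
model.  It does not touch real hardware and is not how a driver would be
written; struct fake_nic and the helper names are hypothetical.  It only shows
that the two access sequences above return different values.

```c
#include <stdint.h>

/*
 * Hypothetical device model: the address port (A) selects which internal
 * register the data port (D) reads.
 */
struct fake_nic {
	uint8_t  selected;	/* last value written to the address port */
	uint32_t regs[8];	/* internal registers */
};

static void write_addr_port(struct fake_nic *nic, uint8_t index)
{
	nic->selected = index & 7;
}

static uint32_t read_data_port(const struct fake_nic *nic)
{
	return nic->regs[nic->selected];
}

/* Intended order: STORE *A = 5, then x = LOAD *D. */
uint32_t read_reg5_in_order(struct fake_nic *nic)
{
	write_addr_port(nic, 5);
	return read_data_port(nic);
}

/* Reordered: x = LOAD *D, then STORE *A = 5. */
uint32_t read_reg5_reordered(struct fake_nic *nic)
{
	uint32_t x = read_data_port(nic);	/* reads the old selection */

	write_addr_port(nic, 5);
	return x;
}
```

With register 0 selected beforehand, the in-order version returns the contents
of register 5, while the reordered version returns the contents of register 0.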


GUARANTEES
----------

There are some minimal guarantees that may be expected of a CPU:

 (*) On any given CPU, dependent memory accesses will be issued in order, with
     respect to itself.  This means that for:

	Q = READ_ONCE(P); D = READ_ONCE(*Q);

     the CPU will issue the following memory operations:

	Q = LOAD P, D = LOAD *Q

     and always in that order.  However, on DEC Alpha, READ_ONCE() also
     emits a memory-barrier instruction, so that a DEC Alpha CPU will
     instead issue the following memory operations:

	Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER

     Whether on DEC Alpha or not, the READ_ONCE() also prevents compiler
     mischief.

 (*) Overlapping loads and stores within a particular CPU will appear to be
     ordered within that CPU.  This means that for:

	a = READ_ONCE(*X); WRITE_ONCE(*X, b);

     the CPU will only issue the following sequence of memory operations:

	a = LOAD *X, STORE *X = b

     And for:

	WRITE_ONCE(*X, c); d = READ_ONCE(*X);

     the CPU will only issue:

	STORE *X = c, d = LOAD *X

     (Loads and stores overlap if they are targeted at overlapping pieces of
     memory).
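
Outside the kernel tree, the overlapping-access examples above can be
approximated with volatile casts.  The macros below are a rough userspace
sketch, not the kernel's definitions of READ_ONCE() and WRITE_ONCE(), which
are more elaborate.  The _SKETCH names and the two helper functions are
hypothetical, and the GNU __typeof__ extension is assumed.

```c
/*
 * Userspace approximations of marked accesses.  The volatile cast asks the
 * compiler to perform exactly one access of the given size, without merging,
 * repeating or eliding it.
 */
#define READ_ONCE_SKETCH(x)	(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, val) \
	do { *(volatile __typeof__(x) *)&(x) = (val); } while (0)

/* a = LOAD *X, STORE *X = b */
int load_then_store(int *X, int b)
{
	int a = READ_ONCE_SKETCH(*X);

	WRITE_ONCE_SKETCH(*X, b);
	return a;
}

/* STORE *X = c, d = LOAD *X */
int store_then_load(int *X, int c)
{
	WRITE_ONCE_SKETCH(*X, c);
	return READ_ONCE_SKETCH(*X);
}
```

Within a single thread, load_then_store() returns the old value of *X and
store_then_load() returns the value just stored, which is the program-order
behaviour the second guarantee describes.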

And there are a number of things that _must_ or _must_not_ be assumed:

 (*) It _must_not_ be assumed that the compiler will do what you want
     with memory references that are not protected by READ_ONCE() and
     WRITE_ONCE().  Without them, the compiler is within its rights to
     do all sorts of "creative" transformations, which are covered in
     the COMPILER BARRIER section.

 (*) It _must_not_ be assumed that independent loads and stores will be issued
     in the order given.  This means that for:

	X = *A; Y = *B; *D = Z;

     we may get any of the following sequences:

	X = LOAD *A,  Y = LOAD *B,  STORE *D = Z
	X = LOAD *A,  STORE *D = Z, Y = LOAD *B
	Y = LOAD *B,  X = LOAD *A,  STORE *D = Z
	Y = LOAD *B,  STORE *D = Z, X = LOAD *A
	STORE *D = Z, X = LOAD *A,  Y = LOAD *B
	STORE *D = Z, Y = LOAD *B,  X = LOAD *A

 (*) It _must_ be assumed that overlapping memory accesses may be merged or
     discarded.  This means that for:

	X = *A; Y = *(A + 4);

     we may get any one of the following sequences:

	X = LOAD *A; Y = LOAD *(A + 4);
	Y = LOAD *(A + 4); X = LOAD *A;
	{X, Y} = LOAD {*A, *(A + 4) };

     And for:

	*A = X; *(A + 4) = Y;

     we may get any of:

	STORE *A = X; STORE *(A + 4) = Y;
	STORE *(A + 4) = Y; STORE *A = X;
	STORE {*A, *(A + 4) } = {X, Y};

And there are anti-guarantees:

 (*) These guarantees do not apply to bitfields, because compilers often
     generate code to modify these using non-atomic read-modify-write
     sequences.  Do not attempt to use bitfields to synchronize parallel
     algorithms.

 (*) Even in cases where bitfields are protected by locks, all fields
     in a given bitfield must be protected by one lock.  If two fields
     in a given bitfield are protected by different locks, the compiler's
     non-atomic read-modify-write sequences can cause an update to one
     field to corrupt the value of an adjacent field.

 (*) These guarantees apply only to properly aligned and sized scalar
     variables.  "Properly sized" currently means variables that are
     the same size as "char", "short", "int" and "long".  "Properly
     aligned" means the natural alignment, thus no constraints for
     "char", two-byte alignment for "short", four-byte alignment for
     "int", and either four-byte or eight-byte alignment for "long",
     on 32-bit and 64-bit systems, respectively.  Note that these
     guarantees were introduced into the C11 standard, so beware when
     using older pre-C11 compilers (for example, gcc 4.6).  The portion
     of the standard containing this guarantee is Section 3.14, which
     defines "memory location" as follows:

     	memory location
		either an object of scalar type, or a maximal sequence
		of adjacent bit-fields all having nonzero width

		NOTE 1: Two threads of execution can update and access
		separate memory locations without interfering with
		each other.

		NOTE 2: A bit-field and an adjacent non-bit-field member
		are in separate memory locations. The same applies
		to two bit-fields, if one is declared inside a nested
		structure declaration and the other is not, or if the two
		are separated by a zero-length bit-field declaration,
		or if they are separated by a non-bit-field member
		declaration. It is not safe to concurrently update two
		bit-fields in the same structure if all members declared
		between them are also bit-fields, no matter what the
		sizes of those intervening bit-fields happen to be.
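
The bitfield hazard can be illustrated without threads by writing out the
read-modify-write steps by hand.  The sketch below is deterministic and makes
an assumption about layout: two 4-bit fields sharing one byte, as a compiler
might arrange "unsigned lo:4, hi:4".  The function and macro names are
hypothetical.

```c
#include <stdint.h>

/* Assumed layout: "lo" in bits 0-3 and "hi" in bits 4-7 of one byte. */
#define LO_MASK 0x0Fu
#define HI_MASK 0xF0u

/* Serialized updates: each read-modify-write finishes before the next. */
uint8_t update_serialized(uint8_t word, uint8_t lo, uint8_t hi)
{
	word = (uint8_t)((word & ~LO_MASK) | (lo & LO_MASK));
	word = (uint8_t)((word & ~HI_MASK) | ((unsigned int)(hi << 4) & HI_MASK));
	return word;
}

/* Interleaved updates: both writers load the byte before either stores. */
uint8_t update_interleaved(uint8_t word, uint8_t lo, uint8_t hi)
{
	uint8_t seen_by_lo_writer = word;	/* writer 1: LOAD */
	uint8_t seen_by_hi_writer = word;	/* writer 2: LOAD */

	/* writer 1: modify "lo", STORE the whole byte */
	word = (uint8_t)((seen_by_lo_writer & ~LO_MASK) | (lo & LO_MASK));

	/* writer 2: modify "hi", STORE the whole byte using its stale copy */
	word = (uint8_t)((seen_by_hi_writer & ~HI_MASK) |
			 ((unsigned int)(hi << 4) & HI_MASK));

	return word;	/* writer 1's update to "lo" has been overwritten */
}
```

Starting from zero with lo = 0x3 and hi = 0xA, the serialized version yields
0xA3, while the interleaved version yields 0xA0: the update to the adjacent
field is lost, even if each writer held its own lock.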

=========================
WHAT ARE MEMORY BARRIERS?
=========================

As can be seen above, independent memory operations are effectively performed
in random order, but this can be a problem for CPU-CPU interaction and for I/O.
What is required is some way of intervening to instruct the compiler and the
CPU to restrict the order.

Memory barriers are such interventions.  They impose a perceived partial
ordering over the memory operations on either side of the barrier.

Such enforcement is important because the CPUs and other devices in a system
can use a variety of tricks to improve performance, including reordering,
deferral and combination of memory operations; speculative loads; speculative
branch prediction and various types of caching.  Memory barriers are used to
override or suppress these tricks, allowing the code to sanely control the
interaction of multiple CPUs and/or devices.


VARIETIES OF MEMORY BARRIER
---------------------------

Memory barriers come in four basic varieties:

 (1) Write (or store) memory barriers.

     A write memory barrier gives a guarantee that all the STORE operations
     specified before the barrier will appear to happen before all the STORE
     operations specified after the barrier with respect to the other
     components of the system.

     A write barrier is a partial ordering on stores only; it is not required
     to have any effect on loads.

     A CPU can be viewed as committing a sequence of store operations to the
     memory system as time progresses.  All stores _before_ a write barrier
     will occur _before_ all the stores after the write barrier.

     [!] Note that write barriers should normally be paired with read or
     address-dependency barriers; see the "SMP barrier pairing" subsection.


 (2) Address-dependency barriers (historical).
     [!] This section is marked as HISTORICAL: it covers the long-obsolete
     smp_read_barrier_depends() macro, the semantics of which are now
     implicit in all marked accesses.  For more up-to-date information,
     including how compiler transformations can sometimes break address
     dependencies, see Documentation/RCU/rcu_dereference.rst.

     An address-dependency barrier is a weaker form of read barrier.  In the
     case where two loads are performed such that the second depends on the
     result of the first (eg: the first load retrieves the address to which
     the second load will be directed), an address-dependency barrier would
     be required to make sure that the target of the second load is updated
     after the address obtained by the first load is accessed.

     An address-dependency barrier is a partial ordering on interdependent
     loads only; it is not required to have any effect on stores, independent
     loads or overlapping loads.

     As mentioned in (1), the other CPUs in the system can be viewed as
     committing sequences of stores to the memory system that the CPU being
     considered can then perceive.  An address-dependency barrier issued by
     the CPU under consideration guarantees that for any load preceding it,
     if that load touches one of a sequence of stores from another CPU, then
     by the time the barrier completes, the effects of all the stores prior to
     that touched by the load will be perceptible to any loads issued after
     the address-dependency barrier.

     See the "Examples of memory barrier sequences" subsection for diagrams
     showing the ordering constraints.

     [!] Note that the first load really has to have an _address_ dependency and
     not a control dependency.  If the address for the second load is dependent
     on the first load, but the dependency is through a conditional rather than
     actually loading the address itself, then it's a _control_ dependency and
     a full read barrier or better is required.  See the "Control dependencies"
     subsection for more information.

     [!] Note that address-dependency barriers should normally be paired with
     write barriers; see the "SMP barrier pairing" subsection.

     [!] Kernel release v5.9 removed kernel APIs for explicit address-
     dependency barriers.  Nowadays, APIs for marking loads from shared
     variables such as READ_ONCE() and rcu_dereference() provide implicit
     address-dependency barriers.

 (3) Read (or load) memory barriers.

     A read barrier is an address-dependency barrier plus a guarantee that all
     the LOAD operations specified before the barrier will appear to happen
     before all the LOAD operations specified after the barrier with respect to
     the other components of the system.

     A read barrier is a partial ordering on loads only; it is not required to
     have any effect on stores.

     Read memory barriers imply address-dependency barriers, and so can
     substitute for them.

     [!] Note that read barriers should normally be paired with write barriers;
     see the "SMP barrier pairing" subsection.


 (4) General memory barriers.

     A general memory barrier gives a guarantee that all the LOAD and STORE
     operations specified before the barrier will appear to happen before all
     the LOAD and STORE operations specified after the barrier with respect to
     the other components of the system.

     A general memory barrier is a partial ordering over both loads and stores.

     General memory barriers imply both read and write memory barriers, and so
     can substitute for either.
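
The pairing of a write barrier with a read barrier can be sketched in
userspace C11.  This is an analogy under stated assumptions, not kernel code:
the C11 release and acquire fences used here are stronger than, and defined
differently from, the kernel's write and read barriers, and the names
publish(), try_consume(), payload and ready are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;		/* plain data, published via "ready" */
static atomic_int ready;	/* flag, zero-initialized */

/* Writer side: the release fence stands in for a write barrier. */
void publish(int value)
{
	payload = value;				/* STORE data */
	atomic_thread_fence(memory_order_release);	/* order the two stores */
	atomic_store_explicit(&ready, 1, memory_order_relaxed);	/* STORE flag */
}

/* Reader side: the acquire fence stands in for a read barrier. */
bool try_consume(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_relaxed))	/* LOAD flag */
		return false;
	atomic_thread_fence(memory_order_acquire);	/* order the two loads */
	*out = payload;					/* LOAD data */
	return true;
}
```

In real use publish() and try_consume() would run on different threads.  The
writer orders its data store before its flag store, and the reader orders its
flag load before its data load, so a reader that sees the flag set also sees
the data.  Dropping either fence would break that pairing.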


And a couple of implicit varieties:

 (5) ACQUIRE operations.

     This acts as a one-way permeable barrier.  It guarantees that all memory
     operations after the ACQUIRE operation will appear to happen after the
     ACQUIRE operation with respect to the other components of the system.
     ACQUIRE operations include LOCK operations and both smp_load_acquire()
     and smp_cond_load_acquire() operations.

     Memory operations that occur before an ACQUIRE operation may appear to
     happen after it completes.

     An ACQUIRE operation should almost always be paired with a RELEASE
     operation.


 (6) RELEASE operations.

     This also acts as a one-way permeable barrier.  It guarantees that all
     memory operations before the RELEASE operation will appear to happen
     before the RELEASE operation with respect to the other components of the
     system. RELEASE operations include UNLOCK operations and
     smp_store_release() operations.

     Memory operations that occur after a RELEASE operation may appear to
     happen before it completes.

     The use of ACQUIRE and RELEASE operations generally precludes the need
     for other sorts of memory barrier.  In addition, a RELEASE+ACQUIRE pair is
     -not- guaranteed to act as a full memory barrier.  However, after an
     ACQUIRE on a given variable, all memory accesses preceding any prior
     RELEASE on that same variable are guaranteed to be visible.  In other
     words, within a given variable's critical section, all accesses of all
     previous critical sections for that variable are guaranteed to have
     completed.

     This means that ACQUIRE acts as a minimal "acquire" operation and
     RELEASE acts as a minimal "release" operation.

A subset of the atomic operations described in atomic_t.txt have ACQUIRE and
RELEASE variants in addition to fully-ordered and relaxed (no barrier
semantics) definitions.  For compound atomics performing both a load and a
store, ACQUIRE semantics apply only to the load and RELEASE semantics apply
only to the store portion of the operation.
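
As a rough userspace illustration of LOCK as an ACQUIRE operation and UNLOCK
as a RELEASE operation, the sketch below builds a trylock from a C11
atomic_flag.  It is not how kernel locks are implemented, C11 ordering is only
an approximation of the kernel's semantics, and struct sketch_lock and its
helpers are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_lock {
	atomic_flag held;
};

/*
 * ACQUIRE: accesses after a successful trylock cannot appear to happen
 * before it.  The acquire ordering applies to the load half of the
 * test-and-set.
 */
bool sketch_trylock(struct sketch_lock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

/*
 * RELEASE: accesses before the unlock cannot appear to happen after it.
 */
void sketch_unlock(struct sketch_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A caller would initialize the lock with ATOMIC_FLAG_INIT, retry
sketch_trylock() until it succeeds, touch the protected data, and then call
sketch_unlock().  Accesses made inside one critical section are then visible
to the next holder of the lock, but accesses outside the critical section may
still move into it from either side.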

Memory barriers are only required where there's a possibility of interaction
between two CPUs or between a CPU and a device.  If it can be guaranteed that
there won't be any such interaction in any particular piece of code, then
memory barriers are unnecessary in that piece of code.


Note that these are the _minimum_ guarantees.  Different architectures may give
more substantial guarantees, but they may _not_ be relied upon outside of arch
specific code.


WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
----------------------------------------------

There are certain things that the Linux kernel memory barriers do not guarantee:

 (*) There is no guarantee that any of the memory accesses specified before a
     memory barrier will be _complete_ by the completion of a memory barrier
     instruction; the barrier can be considered to draw a line in that CPU's
     access queue that accesses of the appropriate type may not cross.

 (*) There is no guarantee that issuing a memory barrier on one CPU will have
     any direct effect on another CPU or any other hardware in the system.  The
     indirect effect will be the order in which the second CPU sees the effects
     of the first CPU's accesses occur, but see the next point:

 (*) There is no guarantee that a CPU will see the correct order of effects
     from a second CPU's accesses, even _if_ the second CPU uses a memory
     barrier, unless the first CPU _also_ uses a matching memory barrier (see
     the subsection on "SMP Barrier Pairing").

 (*) There is no guarantee that some intervening piece of off-the-CPU
     hardware[*] will not reorder the memory accesses.  CPU cache coherency
     mechanisms should propagate the indirect effects of a memory barrier
     between CPUs, but might not do so in order.

	[*] For information on bus mastering DMA and coherency please read:

	    Documentation/driver-api/pci/pci.rst
	    Documentation/core-api/dma-api-howto.rst
	    Documentation/core-api/dma-api.rst


ADDRESS-DEPENDENCY BARRIERS (HISTORICAL)
----------------------------------------
[!] This section is marked as HISTORICAL: it covers the long-obsolete
smp_read_barrier_depends() macro, the semantics of which are now implicit
in all marked accesses.  For more up-to-date information, including
how compiler transformations can sometimes break address dependencies,
see Documentation/RCU/rcu_dereference.rst.
f28f0868SPaul E. McKenney
8ca924aeSWill DeaconAs of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for
8ca924aeSWill DeaconDEC Alpha, which means that about the only people who need to pay attention
8ca924aeSWill Deaconto this section are those working on DEC Alpha architecture-specific code
8ca924aeSWill Deaconand those working on READ_ONCE() itself.  For those who need it, and for
8ca924aeSWill Deaconthose who are interested in the history, here is the story of
203185f6SAkira Yokosawaaddress-dependency barriers.
108b42b4SDavid Howells
203185f6SAkira Yokosawa[!] While address dependencies are observed in both load-to-load and
203185f6SAkira Yokosawaload-to-store relations, address-dependency barriers are not necessary
203185f6SAkira Yokosawafor load-to-store situations.
203185f6SAkira Yokosawa
203185f6SAkira YokosawaThe requirement of address-dependency barriers is a little subtle, and
108b42b4SDavid Howellsit's not always obvious that they're needed.  To illustrate, consider the
108b42b4SDavid Howellsfollowing sequence of events:
108b42b4SDavid Howells
108b42b4SDavid Howells	CPU 1		      CPU 2
108b42b4SDavid Howells	===============	      ===============
3dbf0913SSeongJae Park	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
108b42b4SDavid Howells	B = 4;
108b42b4SDavid Howells	<write barrier>
8149b5cbSSeongJae Park	WRITE_ONCE(P, &B);
203185f6SAkira Yokosawa			      Q = READ_ONCE_OLD(P);
108b42b4SDavid Howells			      D = *Q;
108b42b4SDavid Howells
203185f6SAkira Yokosawa[!] READ_ONCE_OLD() corresponds to READ_ONCE() of pre-4.15 kernels, which
203185f6SAkira Yokosawadoesn't imply an address-dependency barrier.
203185f6SAkira Yokosawa
f556082dSAkira YokosawaThere's a clear address dependency here, and it would seem that by the end of
f556082dSAkira Yokosawathe sequence, Q must be either &A or &B, and that:
108b42b4SDavid Howells
108b42b4SDavid Howells	(Q == &A) implies (D == 1)
108b42b4SDavid Howells	(Q == &B) implies (D == 4)
108b42b4SDavid Howells
108b42b4SDavid HowellsBut!  CPU 2's perception of P may be updated _before_ its perception of B, thus
108b42b4SDavid Howellsleading to the following situation:
108b42b4SDavid Howells
108b42b4SDavid Howells	(Q == &B) and (D == 2) ????
108b42b4SDavid Howells
806654a9SWill DeaconWhile this may seem like a failure of coherency or causality maintenance, it
108b42b4SDavid Howellsisn't, and this behaviour can be observed on certain real CPUs (such as the DEC
108b42b4SDavid HowellsAlpha).
108b42b4SDavid Howells
f556082dSAkira YokosawaTo deal with this, READ_ONCE() provides an implicit address-dependency barrier
f556082dSAkira Yokosawasince kernel release v4.15:
108b42b4SDavid Howells
108b42b4SDavid Howells	CPU 1		      CPU 2
108b42b4SDavid Howells	===============	      ===============
3dbf0913SSeongJae Park	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
108b42b4SDavid Howells	B = 4;
108b42b4SDavid Howells	<write barrier>
9af194ceSPaul E. McKenney	WRITE_ONCE(P, &B);
9af194ceSPaul E. McKenney			      Q = READ_ONCE(P);
203185f6SAkira Yokosawa			      <implicit address-dependency barrier>
108b42b4SDavid Howells			      D = *Q;
108b42b4SDavid Howells
108b42b4SDavid HowellsThis enforces the occurrence of one of the two implications, and prevents the
108b42b4SDavid Howellsthird possibility from arising.
108b42b4SDavid Howells
92a84dd2SPaul E. McKenney
108b42b4SDavid Howells[!] Note that this extremely counterintuitive situation arises most easily on
108b42b4SDavid Howellsmachines with split caches, so that, for example, one cache bank processes
108b42b4SDavid Howellseven-numbered cache lines and the other bank processes odd-numbered cache
108b42b4SDavid Howellslines.  The pointer P might be stored in an odd-numbered cache line, and the
108b42b4SDavid Howellsvariable B might be stored in an even-numbered cache line.  Then, if the
108b42b4SDavid Howellseven-numbered bank of the reading CPU's cache is extremely busy while the
108b42b4SDavid Howellsodd-numbered bank is idle, one can see the new value of the pointer P (&B),
6bc39274SDavid Howellsbut the old value of the variable B (2).
108b42b4SDavid Howells
108b42b4SDavid Howells
203185f6SAkira YokosawaAn address-dependency barrier is not required to order dependent writes
f556082dSAkira Yokosawabecause the CPUs that the Linux kernel supports don't do writes until they
f556082dSAkira Yokosawaare certain (1) that the write will actually happen, (2) of the location of
f556082dSAkira Yokosawathe write, and (3) of the value to be written.
66ce3a4dSPaul E. McKenneyBut please carefully read the "CONTROL DEPENDENCIES" section and the
f556082dSAkira YokosawaDocumentation/RCU/rcu_dereference.rst file:  The compiler can and does break
f556082dSAkira Yokosawadependencies in a great many highly creative ways.
66ce3a4dSPaul E. McKenney
66ce3a4dSPaul E. McKenney	CPU 1		      CPU 2
66ce3a4dSPaul E. McKenney	===============	      ===============
66ce3a4dSPaul E. McKenney	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
66ce3a4dSPaul E. McKenney	B = 4;
66ce3a4dSPaul E. McKenney	<write barrier>
66ce3a4dSPaul E. McKenney	WRITE_ONCE(P, &B);
203185f6SAkira Yokosawa			      Q = READ_ONCE_OLD(P);
66ce3a4dSPaul E. McKenney			      WRITE_ONCE(*Q, 5);
66ce3a4dSPaul E. McKenney
203185f6SAkira YokosawaTherefore, no address-dependency barrier is required to order the read into
66ce3a4dSPaul E. McKenneyQ with the store into *Q.  In other words, this outcome is prohibited,
203185f6SAkira Yokosawaeven without the implicit address-dependency barrier of modern READ_ONCE():
66ce3a4dSPaul E. McKenney
66ce3a4dSPaul E. McKenney	(Q == &B) && (B == 4)
66ce3a4dSPaul E. McKenney
66ce3a4dSPaul E. McKenneyPlease note that this pattern should be rare.  After all, the whole point
66ce3a4dSPaul E. McKenneyof dependency ordering is to -prevent- writes to the data structure, along
66ce3a4dSPaul E. McKenneywith the expensive cache misses associated with those writes.  This pattern
66ce3a4dSPaul E. McKenneycan be used to record rare error conditions and the like, and the CPUs'
66ce3a4dSPaul E. McKenneynaturally occurring ordering prevents such records from being lost.
66ce3a4dSPaul E. McKenney
66ce3a4dSPaul E. McKenney
203185f6SAkira YokosawaNote well that the ordering provided by an address dependency is local to
f1ab25a3SPaul E. McKenneythe CPU containing it.  See the section on "Multicopy atomicity" for
f1ab25a3SPaul E. McKenneymore information.
f1ab25a3SPaul E. McKenney
f1ab25a3SPaul E. McKenney
203185f6SAkira YokosawaThe address-dependency barrier is very important to the RCU system,
2ecf8101SPaul E. McKenneyfor example.  See rcu_assign_pointer() and rcu_dereference() in
2ecf8101SPaul E. McKenneyinclude/linux/rcupdate.h.  This permits the current target of an RCU'd
2ecf8101SPaul E. McKenneypointer to be replaced with a new modified target, without the replacement
2ecf8101SPaul E. McKenneytarget appearing to be incompletely initialised.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsCONTROL DEPENDENCIES
108b42b4SDavid Howells--------------------
108b42b4SDavid Howells
c8241f85SPaul E. McKenneyControl dependencies can be a bit tricky because current compilers do
c8241f85SPaul E. McKenneynot understand them.  The purpose of this section is to help you prevent
c8241f85SPaul E. McKenneythe compiler's ignorance from breaking your code.
c8241f85SPaul E. McKenney
ff382810SPaul E. McKenneyA load-load control dependency requires a full read memory barrier, not
f556082dSAkira Yokosawasimply an (implicit) address-dependency barrier, to make it work correctly.
f556082dSAkira YokosawaConsider the following bit of code:
108b42b4SDavid Howells
9af194ceSPaul E. McKenney	q = READ_ONCE(a);
203185f6SAkira Yokosawa	<implicit address-dependency barrier>
18c03c61SPeter Zijlstra	if (q) {
203185f6SAkira Yokosawa		/* BUG: No address dependency!!! */
9af194ceSPaul E. McKenney		p = READ_ONCE(b);
45c8a36aSPaul E. McKenney	}
108b42b4SDavid Howells
203185f6SAkira YokosawaThis will not have the desired effect because there is no actual address
2ecf8101SPaul E. McKenneydependency, but rather a control dependency that the CPU may short-circuit
2ecf8101SPaul E. McKenneyby attempting to predict the outcome in advance, so that other CPUs see
f556082dSAkira Yokosawathe load from b as having happened before the load from a.  In such a case
f556082dSAkira Yokosawawhat's actually required is:
108b42b4SDavid Howells
9af194ceSPaul E. McKenney	q = READ_ONCE(a);
18c03c61SPeter Zijlstra	if (q) {
108b42b4SDavid Howells		<read barrier>
9af194ceSPaul E. McKenney		p = READ_ONCE(b);
45c8a36aSPaul E. McKenney	}
18c03c61SPeter Zijlstra
18c03c61SPeter ZijlstraHowever, stores are not speculated.  This means that ordering -is- provided
ff382810SPaul E. McKenneyfor load-store control dependencies, as in the following example:
18c03c61SPeter Zijlstra
105ff3cbSLinus Torvalds	q = READ_ONCE(a);
2456d2a6SPaul E. McKenney	if (q) {
c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
18c03c61SPeter Zijlstra	}
18c03c61SPeter Zijlstra
c8241f85SPaul E. McKenneyControl dependencies pair normally with other types of barriers.
c8241f85SPaul E. McKenneyThat said, please note that neither READ_ONCE() nor WRITE_ONCE()
c8241f85SPaul E. McKenneyare optional! Without the READ_ONCE(), the compiler might combine the
c8241f85SPaul E. McKenneyload from 'a' with other loads from 'a'.  Without the WRITE_ONCE(),
c8241f85SPaul E. McKenneythe compiler might combine the store to 'b' with other stores to 'b'.
c8241f85SPaul E. McKenneyEither can result in highly counterintuitive effects on ordering.
18c03c61SPeter Zijlstra
18c03c61SPeter ZijlstraWorse yet, if the compiler is able to prove (say) that the value of
18c03c61SPeter Zijlstravariable 'a' is always non-zero, it would be well within its rights
18c03c61SPeter Zijlstrato optimize the original example by eliminating the "if" statement
18c03c61SPeter Zijlstraas follows:
18c03c61SPeter Zijlstra
18c03c61SPeter Zijlstra	q = a;
c8241f85SPaul E. McKenney	b = 1;  /* BUG: Compiler and CPU can both reorder!!! */
18c03c61SPeter Zijlstra
105ff3cbSLinus TorvaldsSo don't leave out the READ_ONCE().
2456d2a6SPaul E. McKenney
2456d2a6SPaul E. McKenneyIt is tempting to try to enforce ordering on identical stores on both
2456d2a6SPaul E. McKenneybranches of the "if" statement as follows:
18c03c61SPeter Zijlstra
105ff3cbSLinus Torvalds	q = READ_ONCE(a);
18c03c61SPeter Zijlstra	if (q) {
9b2b3bf5SPaul E. McKenney		barrier();
c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
18c03c61SPeter Zijlstra		do_something();
18c03c61SPeter Zijlstra	} else {
9b2b3bf5SPaul E. McKenney		barrier();
c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
18c03c61SPeter Zijlstra		do_something_else();
18c03c61SPeter Zijlstra	}
18c03c61SPeter Zijlstra
2456d2a6SPaul E. McKenneyUnfortunately, current compilers will transform this as follows at high
2456d2a6SPaul E. McKenneyoptimization levels:
18c03c61SPeter Zijlstra
105ff3cbSLinus Torvalds	q = READ_ONCE(a);
2456d2a6SPaul E. McKenney	barrier();
c8241f85SPaul E. McKenney	WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
18c03c61SPeter Zijlstra	if (q) {
c8241f85SPaul E. McKenney		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
18c03c61SPeter Zijlstra		do_something();
18c03c61SPeter Zijlstra	} else {
c8241f85SPaul E. McKenney		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
18c03c61SPeter Zijlstra		do_something_else();
18c03c61SPeter Zijlstra	}
18c03c61SPeter Zijlstra
2456d2a6SPaul E. McKenneyNow there is no conditional between the load from 'a' and the store to
2456d2a6SPaul E. McKenney'b', which means that the CPU is within its rights to reorder them:
2456d2a6SPaul E. McKenneyThe conditional is absolutely required, and must be present in the
2456d2a6SPaul E. McKenneyassembly code even after all compiler optimizations have been applied.
2456d2a6SPaul E. McKenneyTherefore, if you need ordering in this example, you need explicit
2456d2a6SPaul E. McKenneymemory barriers, for example, smp_store_release():
18c03c61SPeter Zijlstra
9af194ceSPaul E. McKenney	q = READ_ONCE(a);
2456d2a6SPaul E. McKenney	if (q) {
c8241f85SPaul E. McKenney		smp_store_release(&b, 1);
18c03c61SPeter Zijlstra		do_something();
18c03c61SPeter Zijlstra	} else {
c8241f85SPaul E. McKenney		smp_store_release(&b, 1);
18c03c61SPeter Zijlstra		do_something_else();
18c03c61SPeter Zijlstra	}
18c03c61SPeter Zijlstra
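A userspace approximation of the same idea, assuming C11 atomics in place
of the kernel primitives: a release store keeps the earlier load from 'a'
ordered before the store to 'b' even if the compiler hoists the identical
stores out of the branches.  The stub helpers and the function name are
hypothetical.

```c
#include <stdatomic.h>

static atomic_int a, b;

/* Stubs standing in for the helpers used in the example above. */
static void do_something(void) { }
static void do_something_else(void) { }

/* Returns the value loaded from a. */
int store_after_load(void)
{
	int q = atomic_load_explicit(&a, memory_order_relaxed);

	if (q) {
		/* Release: prior accesses, including the load, come first. */
		atomic_store_explicit(&b, 1, memory_order_release);
		do_something();
	} else {
		atomic_store_explicit(&b, 1, memory_order_release);
		do_something_else();
	}
	return q;
}
```
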
2456d2a6SPaul E. McKenneyIn contrast, without explicit memory barriers, two-legged-if control
2456d2a6SPaul E. McKenneyordering is guaranteed only when the stores differ, for example:
2456d2a6SPaul E. McKenney
105ff3cbSLinus Torvalds	q = READ_ONCE(a);
2456d2a6SPaul E. McKenney	if (q) {
c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
2456d2a6SPaul E. McKenney		do_something();
2456d2a6SPaul E. McKenney	} else {
c8241f85SPaul E. McKenney		WRITE_ONCE(b, 2);
2456d2a6SPaul E. McKenney		do_something_else();
2456d2a6SPaul E. McKenney	}
2456d2a6SPaul E. McKenney
105ff3cbSLinus TorvaldsThe initial READ_ONCE() is still required to prevent the compiler from
105ff3cbSLinus Torvaldsproving the value of 'a'.
18c03c61SPeter Zijlstra
18c03c61SPeter ZijlstraIn addition, you need to be careful what you do with the local variable 'q',
18c03c61SPeter Zijlstraotherwise the compiler might be able to guess the value and again remove
18c03c61SPeter Zijlstrathe needed conditional.  For example:
18c03c61SPeter Zijlstra
105ff3cbSLinus Torvalds	q = READ_ONCE(a);
18c03c61SPeter Zijlstra	if (q % MAX) {
c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
18c03c61SPeter Zijlstra		do_something();
18c03c61SPeter Zijlstra	} else {
c8241f85SPaul E. McKenney		WRITE_ONCE(b, 2);
18c03c61SPeter Zijlstra		do_something_else();
18c03c61SPeter Zijlstra	}
18c03c61SPeter Zijlstra
18c03c61SPeter ZijlstraIf MAX is defined to be 1, then the compiler knows that (q % MAX) is
18c03c61SPeter Zijlstraequal to zero, in which case the compiler is within its rights to
18c03c61SPeter Zijlstratransform the above code into the following:
18c03c61SPeter Zijlstra
105ff3cbSLinus Torvalds	q = READ_ONCE(a);
b26cfc48Spierre Kuo	WRITE_ONCE(b, 2);
18c03c61SPeter Zijlstra	do_something_else();
18c03c61SPeter Zijlstra
2456d2a6SPaul E. McKenneyGiven this transformation, the CPU is not required to respect the ordering
2456d2a6SPaul E. McKenneybetween the load from variable 'a' and the store to variable 'b'.  It is
2456d2a6SPaul E. McKenneytempting to add a barrier(), but this does not help.  The conditional
2456d2a6SPaul E. McKenneyis gone, and the barrier won't bring it back.  Therefore, if you are
2456d2a6SPaul E. McKenneyrelying on this ordering, you should make sure that MAX is greater than
2456d2a6SPaul E. McKenneyone, perhaps as follows:
18c03c61SPeter Zijlstra
105ff3cbSLinus Torvalds	q = READ_ONCE(a);
18c03c61SPeter Zijlstra	BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
18c03c61SPeter Zijlstra	if (q % MAX) {
c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
18c03c61SPeter Zijlstra		do_something();
18c03c61SPeter Zijlstra	} else {
c8241f85SPaul E. McKenney		WRITE_ONCE(b, 2);
18c03c61SPeter Zijlstra		do_something_else();
18c03c61SPeter Zijlstra	}
18c03c61SPeter Zijlstra
2456d2a6SPaul E. McKenneyPlease note once again that the stores to 'b' differ.  If they were
2456d2a6SPaul E. McKenneyidentical, as noted earlier, the compiler could pull this store outside
2456d2a6SPaul E. McKenneyof the 'if' statement.
2456d2a6SPaul E. McKenney
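Outside the kernel, the same compile-time guard can be sketched with C11's
_Static_assert.  The value of MAX and the function names here are
hypothetical.  Note the assumption: C11 relaxed atomics promise nothing
about control dependencies, so this only illustrates how to keep the
conditional from becoming a compile-time constant, not the ordering itself.

```c
#include <stdatomic.h>

#define MAX 4	/* hypothetical value; the check below rejects MAX <= 1 */

/* Userspace stand-in for BUILD_BUG_ON(MAX <= 1). */
_Static_assert(MAX > 1, "MAX must exceed 1 or (q % MAX) is constant");

static atomic_int a, b;

/* Another CPU updating a (hypothetical helper). */
void cpu1_update(int v)
{
	atomic_store_explicit(&a, v, memory_order_relaxed);
}

/* Stores 1 or 2 to b depending on a; returns the value stored. */
int store_by_remainder(void)
{
	int q = atomic_load_explicit(&a, memory_order_relaxed);

	if (q % MAX) {
		atomic_store_explicit(&b, 1, memory_order_relaxed);
		return 1;
	}
	atomic_store_explicit(&b, 2, memory_order_relaxed);
	return 2;
}
```

If MAX were redefined as 1, the build would fail at the _Static_assert
instead of silently losing the conditional.
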
8b19d1deSPaul E. McKenneyYou must also be careful not to rely too much on boolean short-circuit
8b19d1deSPaul E. McKenneyevaluation.  Consider this example:
8b19d1deSPaul E. McKenney
105ff3cbSLinus Torvalds	q = READ_ONCE(a);
57aecae9SPaul E. McKenney	if (q || 1 > 0)
9af194ceSPaul E. McKenney		WRITE_ONCE(b, 1);
8b19d1deSPaul E. McKenney
5af4692aSPaul E. McKenneyBecause the first condition cannot fault and the second condition is
5af4692aSPaul E. McKenneyalways true, the compiler can transform this example as follows,
5af4692aSPaul E. McKenneydefeating control dependency:
8b19d1deSPaul E. McKenney
105ff3cbSLinus Torvalds	q = READ_ONCE(a);
9af194ceSPaul E. McKenney	WRITE_ONCE(b, 1);
8b19d1deSPaul E. McKenney
8b19d1deSPaul E. McKenneyThis example underscores the need to ensure that the compiler cannot
9af194ceSPaul E. McKenneyout-guess your code.  More generally, although READ_ONCE() does force
8b19d1deSPaul E. McKenneythe compiler to actually emit code for a given load, it does not force
8b19d1deSPaul E. McKenneythe compiler to use the results.
8b19d1deSPaul E. McKenney
ebff09a6SPaul E. McKenneyIn addition, control dependencies apply only to the then-clause and
ebff09a6SPaul E. McKenneyelse-clause of the if-statement in question.  In particular, they do
ebff09a6SPaul E. McKenneynot necessarily apply to code following the if-statement:
ebff09a6SPaul E. McKenney
ebff09a6SPaul E. McKenney	q = READ_ONCE(a);
ebff09a6SPaul E. McKenney	if (q) {
c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
ebff09a6SPaul E. McKenney	} else {
c8241f85SPaul E. McKenney		WRITE_ONCE(b, 2);
ebff09a6SPaul E. McKenney	}
c8241f85SPaul E. McKenney	WRITE_ONCE(c, 1);  /* BUG: No ordering against the read from 'a'. */
ebff09a6SPaul E. McKenney
ebff09a6SPaul E. McKenneyIt is tempting to argue that there in fact is ordering because the
ebff09a6SPaul E. McKenneycompiler cannot reorder volatile accesses and also cannot reorder
c8241f85SPaul E. McKenneythe writes to 'b' with the condition.  Unfortunately for this line
c8241f85SPaul E. McKenneyof reasoning, the compiler might compile the two writes to 'b' as
ebff09a6SPaul E. McKenneyconditional-move instructions, as in this fanciful pseudo-assembly
ebff09a6SPaul E. McKenneylanguage:
ebff09a6SPaul E. McKenney
ebff09a6SPaul E. McKenney	ld r1,a
ebff09a6SPaul E. McKenney	cmp r1,$0
c8241f85SPaul E. McKenney	cmov,ne r4,$1
c8241f85SPaul E. McKenney	cmov,eq r4,$2
ebff09a6SPaul E. McKenney	st r4,b
ebff09a6SPaul E. McKenney	st $1,c
ebff09a6SPaul E. McKenney
ebff09a6SPaul E. McKenneyA weakly ordered CPU would have no dependency of any sort between the load
c8241f85SPaul E. McKenneyfrom 'a' and the store to 'c'.  The control dependencies would extend
ebff09a6SPaul E. McKenneyonly to the pair of cmov instructions and the store depending on them.
ebff09a6SPaul E. McKenneyIn short, control dependencies apply only to the stores in the then-clause
ebff09a6SPaul E. McKenneyand else-clause of the if-statement in question (including functions
ebff09a6SPaul E. McKenneyinvoked by those two clauses), not to code following that if-statement.
ebff09a6SPaul E. McKenney
18c03c61SPeter Zijlstra
f1ab25a3SPaul E. McKenneyNote well that the ordering provided by a control dependency is local
f1ab25a3SPaul E. McKenneyto the CPU containing it.  See the section on "Multicopy atomicity"
f1ab25a3SPaul E. McKenneyfor more information.
18c03c61SPeter Zijlstra
18c03c61SPeter Zijlstra
18c03c61SPeter ZijlstraIn summary:
18c03c61SPeter Zijlstra
18c03c61SPeter Zijlstra  (*) Control dependencies can order prior loads against later stores.
18c03c61SPeter Zijlstra      However, they do -not- guarantee any other sort of ordering:
18c03c61SPeter Zijlstra      Not prior loads against later loads, nor prior stores against
18c03c61SPeter Zijlstra      later anything.  If you need these other forms of ordering,
d87510c5SDavidlohr Bueso      use smp_rmb(), smp_wmb(), or, in the case of prior stores and
18c03c61SPeter Zijlstra      later loads, smp_mb().
18c03c61SPeter Zijlstra
7817b799SPaul E. McKenney  (*) If both legs of the "if" statement begin with identical stores to
7817b799SPaul E. McKenney      the same variable, then those stores must be ordered, either by
7817b799SPaul E. McKenney      preceding both of them with smp_mb() or by using smp_store_release()
7817b799SPaul E. McKenney      to carry out the stores.  Please note that it is -not- sufficient
a5052657SPaul E. McKenney      to use barrier() at the beginning of each leg of the "if" statement
a5052657SPaul E. McKenney      because, as shown by the example above, optimizing compilers can
a5052657SPaul E. McKenney      destroy the control dependency while respecting the letter of the
a5052657SPaul E. McKenney      barrier() law.
9b2b3bf5SPaul E. McKenney
18c03c61SPeter Zijlstra  (*) Control dependencies require at least one run-time conditional
586dd56aSPaul E. McKenney      between the prior load and the subsequent store, and this
9af194ceSPaul E. McKenney      conditional must involve the prior load.  If the compiler is able
9af194ceSPaul E. McKenney      to optimize the conditional away, it will have also optimized
105ff3cbSLinus Torvalds      away the ordering.  Careful use of READ_ONCE() and WRITE_ONCE()
105ff3cbSLinus Torvalds      can help to preserve the needed conditional.
18c03c61SPeter Zijlstra
18c03c61SPeter Zijlstra  (*) Control dependencies require that the compiler avoid reordering the
105ff3cbSLinus Torvalds      dependency into nonexistence.  Careful use of READ_ONCE() or
105ff3cbSLinus Torvalds      atomic{,64}_read() can help to preserve your control dependency.
895f5542SPaul E. McKenney      Please see the COMPILER BARRIER section for more information.
18c03c61SPeter Zijlstra
ebff09a6SPaul E. McKenney  (*) Control dependencies apply only to the then-clause and else-clause
ebff09a6SPaul E. McKenney      of the if-statement containing the control dependency, including
ebff09a6SPaul E. McKenney      any functions that these two clauses call.  Control dependencies
ebff09a6SPaul E. McKenney      do -not- apply to code following the if-statement containing the
ebff09a6SPaul E. McKenney      control dependency.
ebff09a6SPaul E. McKenney
ff382810SPaul E. McKenney  (*) Control dependencies pair normally with other types of barriers.
ff382810SPaul E. McKenney
f1ab25a3SPaul E. McKenney  (*) Control dependencies do -not- provide multicopy atomicity.  If you
f1ab25a3SPaul E. McKenney      need all the CPUs to see a given store at the same time, use smp_mb().
108b42b4SDavid Howells
c8241f85SPaul E. McKenney  (*) Compilers do not understand control dependencies.  It is therefore
c8241f85SPaul E. McKenney      your job to ensure that they do not break your code.
c8241f85SPaul E. McKenney
108b42b4SDavid Howells
108b42b4SDavid HowellsSMP BARRIER PAIRING
108b42b4SDavid Howells-------------------
108b42b4SDavid Howells
108b42b4SDavid HowellsWhen dealing with CPU-CPU interactions, certain types of memory barrier should
108b42b4SDavid Howellsalways be paired.  A lack of appropriate pairing is almost certainly an error.
108b42b4SDavid Howells
ff382810SPaul E. McKenneyGeneral barriers pair with each other, though they also pair with most
f1ab25a3SPaul E. McKenneyother types of barriers, albeit without multicopy atomicity.  An acquire
f1ab25a3SPaul E. McKenneybarrier pairs with a release barrier, but both may also pair with other
f1ab25a3SPaul E. McKenneybarriers, including of course general barriers.  A write barrier pairs
203185f6SAkira Yokosawawith an address-dependency barrier, a control dependency, an acquire barrier,
f1ab25a3SPaul E. McKenneya release barrier, a read barrier, or a general barrier.  Similarly a
203185f6SAkira Yokosawaread barrier, control dependency, or an address-dependency barrier pairs
f1ab25a3SPaul E. McKenneywith a write barrier, an acquire barrier, a release barrier, or a
f1ab25a3SPaul E. McKenneygeneral barrier:
108b42b4SDavid Howells
108b42b4SDavid Howells	CPU 1		      CPU 2
108b42b4SDavid Howells	===============	      ===============
9af194ceSPaul E. McKenney	WRITE_ONCE(a, 1);
108b42b4SDavid Howells	<write barrier>
9af194ceSPaul E. McKenney	WRITE_ONCE(b, 2);     x = READ_ONCE(b);
108b42b4SDavid Howells			      <read barrier>
9af194ceSPaul E. McKenney			      y = READ_ONCE(a);
108b42b4SDavid Howells
108b42b4SDavid HowellsOr:
108b42b4SDavid Howells
108b42b4SDavid Howells	CPU 1		      CPU 2
108b42b4SDavid Howells	===============	      ===============================
108b42b4SDavid Howells	a = 1;
108b42b4SDavid Howells	<write barrier>
9af194ceSPaul E. McKenney	WRITE_ONCE(b, &a);    x = READ_ONCE(b);
203185f6SAkira Yokosawa			      <implicit address-dependency barrier>
108b42b4SDavid Howells			      y = *x;
108b42b4SDavid Howells
ff382810SPaul E. McKenneyOr even:
ff382810SPaul E. McKenney
ff382810SPaul E. McKenney	CPU 1		      CPU 2
ff382810SPaul E. McKenney	===============	      ===============================
9af194ceSPaul E. McKenney	r1 = READ_ONCE(y);
ff382810SPaul E. McKenney	<general barrier>
d92f842bSScott Tsai	WRITE_ONCE(x, 1);     if (r2 = READ_ONCE(x)) {
ff382810SPaul E. McKenney			         <implicit control dependency>
9af194ceSPaul E. McKenney			         WRITE_ONCE(y, 1);
ff382810SPaul E. McKenney			      }
ff382810SPaul E. McKenney
ff382810SPaul E. McKenney	assert(r1 == 0 || r2 == 0);
ff382810SPaul E. McKenney
108b42b4SDavid HowellsBasically, the read barrier always has to be there, even though it can be of
108b42b4SDavid Howellsthe "weaker" type.
108b42b4SDavid Howells
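The first and third pairings above can be approximated with C11 fences.
This is a sketch under stated assumptions, not kernel code: release and
acquire fences are stronger than write and read barriers, a seq_cst fence
only approximates a general barrier, and because C11 does not honor
control dependencies, the third pairing uses an acquire load instead.  All
function names are hypothetical.

```c
#include <stdatomic.h>

static atomic_int a, b;	/* first pairing */
static atomic_int x, y;	/* third pairing */

/* CPU 1, first pairing: two stores separated by a write-side fence. */
void wr_cpu1(void)
{
	atomic_store_explicit(&a, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* ~ <write barrier> */
	atomic_store_explicit(&b, 2, memory_order_relaxed);
}

/* CPU 2, first pairing: returns 1 unless b == 2 is seen without a == 1. */
int wr_cpu2_ok(void)
{
	int rb = atomic_load_explicit(&b, memory_order_relaxed);

	atomic_thread_fence(memory_order_acquire);	/* ~ <read barrier> */
	int ra = atomic_load_explicit(&a, memory_order_relaxed);

	return rb != 2 || ra == 1;
}

/* CPU 1, third pairing: load, full fence, store.  Returns r1. */
int lb_cpu1(void)
{
	int r1 = atomic_load_explicit(&y, memory_order_relaxed);

	atomic_thread_fence(memory_order_seq_cst);	/* ~ <general barrier> */
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	return r1;
}

/* CPU 2, third pairing: conditional store.  Returns r2. */
int lb_cpu2(void)
{
	/* Acquire replaces the control dependency (assumption, see above). */
	int r2 = atomic_load_explicit(&x, memory_order_acquire);

	if (r2)
		atomic_store_explicit(&y, 1, memory_order_relaxed);
	return r2;
}
```

When the two halves of each pairing run on different threads, wr_cpu2_ok()
should always return 1, and the values returned by lb_cpu1() and lb_cpu2()
should never both be non-zero.
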
670bd95eSDavid Howells[!] Note that the stores before the write barrier would normally be expected to
f556082dSAkira Yokosawamatch the loads after the read barrier or the address-dependency barrier, and
f556082dSAkira Yokosawavice versa:
670bd95eSDavid Howells
670bd95eSDavid Howells	CPU 1                               CPU 2
2ecf8101SPaul E. McKenney	===================                 ===================
9af194ceSPaul E. McKenney	WRITE_ONCE(a, 1);    }----   --->{  v = READ_ONCE(c);
9af194ceSPaul E. McKenney	WRITE_ONCE(b, 2);    }    \ /    {  w = READ_ONCE(d);
670bd95eSDavid Howells	<write barrier>            \        <read barrier>
9af194ceSPaul E. McKenney	WRITE_ONCE(c, 3);    }    / \    {  x = READ_ONCE(a);
9af194ceSPaul E. McKenney	WRITE_ONCE(d, 4);    }----   --->{  y = READ_ONCE(b);
670bd95eSDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsEXAMPLES OF MEMORY BARRIER SEQUENCES
108b42b4SDavid Howells------------------------------------
108b42b4SDavid Howells
81fc6323SJarek PoplawskiFirstly, write barriers act as partial orderings on store operations.
108b42b4SDavid HowellsConsider the following sequence of events:
108b42b4SDavid Howells
108b42b4SDavid Howells	CPU 1
108b42b4SDavid Howells	=======================
108b42b4SDavid Howells	STORE A = 1
108b42b4SDavid Howells	STORE B = 2
108b42b4SDavid Howells	STORE C = 3
108b42b4SDavid Howells	<write barrier>
108b42b4SDavid Howells	STORE D = 4
108b42b4SDavid Howells	STORE E = 5
108b42b4SDavid Howells
108b42b4SDavid HowellsThis sequence of events is committed to the memory coherence system in an order
108b42b4SDavid Howellsthat the rest of the system might perceive as the unordered set of { STORE A,
80f7228bSAdrian BunkSTORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
108b42b4SDavid Howells}:
108b42b4SDavid Howells
108b42b4SDavid Howells	+-------+       :      :
108b42b4SDavid Howells	|       |       +------+
108b42b4SDavid Howells	|       |------>| C=3  |     }     /\
81fc6323SJarek Poplawski	|       |  :    +------+     }-----  \  -----> Events perceptible to
81fc6323SJarek Poplawski	|       |  :    | A=1  |     }        \/       the rest of the system
108b42b4SDavid Howells	|       |  :    +------+     }
108b42b4SDavid Howells	| CPU 1 |  :    | B=2  |     }
108b42b4SDavid Howells	|       |       +------+     }
108b42b4SDavid Howells	|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
108b42b4SDavid Howells	|       |       +------+     }        requires all stores prior to the
108b42b4SDavid Howells	|       |  :    | E=5  |     }        barrier to be committed before
81fc6323SJarek Poplawski	|       |  :    +------+     }        further stores may take place
108b42b4SDavid Howells	|       |------>| D=4  |     }
108b42b4SDavid Howells	|       |       +------+
108b42b4SDavid Howells	+-------+       :      :
108b42b4SDavid Howells	                   |
670bd95eSDavid Howells	                   | Sequence in which stores are committed to the
670bd95eSDavid Howells	                   | memory system by CPU 1
108b42b4SDavid Howells	                   V
108b42b4SDavid Howells
108b42b4SDavid Howells
f556082dSAkira YokosawaSecondly, address-dependency barriers act as partial orderings on address-
f556082dSAkira Yokosawadependent loads.  Consider the following sequence of events:
108b42b4SDavid Howells
108b42b4SDavid Howells	CPU 1			CPU 2
108b42b4SDavid Howells	=======================	=======================
c14038c3SDavid Howells		{ B = 7; X = 9; Y = 8; C = &Y }
108b42b4SDavid Howells	STORE A = 1
108b42b4SDavid Howells	STORE B = 2
108b42b4SDavid Howells	<write barrier>
108b42b4SDavid Howells	STORE C = &B		LOAD X
108b42b4SDavid Howells	STORE D = 4		LOAD C (gets &B)
108b42b4SDavid Howells				LOAD *C (reads B)
108b42b4SDavid Howells
108b42b4SDavid HowellsWithout intervention, CPU 2 may perceive the events on CPU 1 in some
108b42b4SDavid Howellseffectively random order, despite the write barrier issued by CPU 1:
108b42b4SDavid Howells
108b42b4SDavid Howells	+-------+       :      :                :       :
108b42b4SDavid Howells	|       |       +------+                +-------+  | Sequence of update
108b42b4SDavid Howells	|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
108b42b4SDavid Howells	|       |  :    +------+     \          +-------+  | CPU 2
108b42b4SDavid Howells	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
108b42b4SDavid Howells	|       |       +------+       |        +-------+
108b42b4SDavid Howells	|       |   wwwwwwwwwwwwwwww   |        :       :
108b42b4SDavid Howells	|       |       +------+       |        :       :
108b42b4SDavid Howells	|       |  :    | C=&B |---    |        :       :       +-------+
108b42b4SDavid Howells	|       |  :    +------+   \   |        +-------+       |       |
108b42b4SDavid Howells	|       |------>| D=4  |    ----------->| C->&B |------>|       |
108b42b4SDavid Howells	|       |       +------+       |        +-------+       |       |
108b42b4SDavid Howells	+-------+       :      :       |        :       :       |       |
108b42b4SDavid Howells	                               |        :       :       |       |
108b42b4SDavid Howells	                               |        :       :       | CPU 2 |
108b42b4SDavid Howells	                               |        +-------+       |       |
108b42b4SDavid Howells	    Apparently incorrect --->  |        | B->7  |------>|       |
108b42b4SDavid Howells	    perception of B (!)        |        +-------+       |       |
108b42b4SDavid Howells	                               |        :       :       |       |
108b42b4SDavid Howells	                               |        +-------+       |       |
108b42b4SDavid Howells	    The load of X holds --->    \       | X->9  |------>|       |
108b42b4SDavid Howells	    up the maintenance           \      +-------+       |       |
108b42b4SDavid Howells	    of coherence of B             ----->| B->2  |       +-------+
108b42b4SDavid Howells	                                        +-------+
108b42b4SDavid Howells	                                        :       :
108b42b4SDavid Howells
108b42b4SDavid Howells
In the above example, CPU 2 perceives that B is 7, even though the load of *C
(which is a load of B) comes after the load of C that returned &B.
108b42b4SDavid Howells
If, however, an address-dependency barrier were to be placed between the load
of C and the load of *C (i.e., B) on CPU 2:
c14038c3SDavid Howells
c14038c3SDavid Howells	CPU 1			CPU 2
c14038c3SDavid Howells	=======================	=======================
c14038c3SDavid Howells		{ B = 7; X = 9; Y = 8; C = &Y }
c14038c3SDavid Howells	STORE A = 1
c14038c3SDavid Howells	STORE B = 2
c14038c3SDavid Howells	<write barrier>
c14038c3SDavid Howells	STORE C = &B		LOAD X
c14038c3SDavid Howells	STORE D = 4		LOAD C (gets &B)
203185f6SAkira Yokosawa				<address-dependency barrier>
c14038c3SDavid Howells				LOAD *C (reads B)
c14038c3SDavid Howells
c14038c3SDavid Howellsthen the following will occur:
108b42b4SDavid Howells
108b42b4SDavid Howells	+-------+       :      :                :       :
108b42b4SDavid Howells	|       |       +------+                +-------+
108b42b4SDavid Howells	|       |------>| B=2  |-----       --->| Y->8  |
108b42b4SDavid Howells	|       |  :    +------+     \          +-------+
108b42b4SDavid Howells	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
108b42b4SDavid Howells	|       |       +------+       |        +-------+
108b42b4SDavid Howells	|       |   wwwwwwwwwwwwwwww   |        :       :
108b42b4SDavid Howells	|       |       +------+       |        :       :
108b42b4SDavid Howells	|       |  :    | C=&B |---    |        :       :       +-------+
108b42b4SDavid Howells	|       |  :    +------+   \   |        +-------+       |       |
108b42b4SDavid Howells	|       |------>| D=4  |    ----------->| C->&B |------>|       |
108b42b4SDavid Howells	|       |       +------+       |        +-------+       |       |
108b42b4SDavid Howells	+-------+       :      :       |        :       :       |       |
108b42b4SDavid Howells	                               |        :       :       |       |
108b42b4SDavid Howells	                               |        :       :       | CPU 2 |
108b42b4SDavid Howells	                               |        +-------+       |       |
670bd95eSDavid Howells	                               |        | X->9  |------>|       |
670bd95eSDavid Howells	                               |        +-------+       |       |
203185f6SAkira Yokosawa	  Makes sure all effects --->   \   aaaaaaaaaaaaaaaaa   |       |
670bd95eSDavid Howells	  prior to the store of C        \      +-------+       |       |
670bd95eSDavid Howells	  are perceptible to              ----->| B->2  |------>|       |
670bd95eSDavid Howells	  subsequent loads                      +-------+       |       |
108b42b4SDavid Howells	                                        :       :       +-------+
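
The publish-then-dereference pattern above can be sketched in userspace C.
This is only an illustration under stated assumptions: C11 atomics stand in
for the kernel primitives, a release store plays the role of the write
barrier, and an acquire load (which is at least as strong as an
address-dependency barrier) plays the role of the barrier on CPU 2.  The
function names cpu1_publish() and cpu2_read() are hypothetical.

```c
#include <stdatomic.h>

static int B = 7;
static int Y = 8;
static _Atomic(int *) C = &Y;	/* initially C = &Y, as in the example */

/* CPU 1: STORE B = 2; <write barrier>; STORE C = &B */
static void cpu1_publish(void)
{
	B = 2;
	/* Assumption: release ordering stands in for the write barrier. */
	atomic_store_explicit(&C, &B, memory_order_release);
}

/* CPU 2: LOAD C; <address-dependency barrier>; LOAD *C */
static int cpu2_read(void)
{
	/* Assumption: acquire is stronger than the barrier it replaces. */
	int *p = atomic_load_explicit(&C, memory_order_acquire);

	return *p;	/* if p == &B, this is guaranteed to read 2 */
}
```

With this pairing, a reader that obtains &B from C cannot then read the
stale value 7 from B; a reader that still obtains &Y simply reads Y.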
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsAnd thirdly, a read barrier acts as a partial order on loads.  Consider the
108b42b4SDavid Howellsfollowing sequence of events:
108b42b4SDavid Howells
108b42b4SDavid Howells	CPU 1			CPU 2
108b42b4SDavid Howells	=======================	=======================
670bd95eSDavid Howells		{ A = 0, B = 9 }
108b42b4SDavid Howells	STORE A=1
108b42b4SDavid Howells	<write barrier>
670bd95eSDavid Howells	STORE B=2
108b42b4SDavid Howells				LOAD B
670bd95eSDavid Howells				LOAD A
108b42b4SDavid Howells
108b42b4SDavid HowellsWithout intervention, CPU 2 may then choose to perceive the events on CPU 1 in
108b42b4SDavid Howellssome effectively random order, despite the write barrier issued by CPU 1:
108b42b4SDavid Howells
670bd95eSDavid Howells	+-------+       :      :                :       :
670bd95eSDavid Howells	|       |       +------+                +-------+
670bd95eSDavid Howells	|       |------>| A=1  |------      --->| A->0  |
670bd95eSDavid Howells	|       |       +------+      \         +-------+
670bd95eSDavid Howells	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
670bd95eSDavid Howells	|       |       +------+        |       +-------+
670bd95eSDavid Howells	|       |------>| B=2  |---     |       :       :
670bd95eSDavid Howells	|       |       +------+   \    |       :       :       +-------+
670bd95eSDavid Howells	+-------+       :      :    \   |       +-------+       |       |
670bd95eSDavid Howells	                             ---------->| B->2  |------>|       |
670bd95eSDavid Howells	                                |       +-------+       | CPU 2 |
670bd95eSDavid Howells	                                |       | A->0  |------>|       |
670bd95eSDavid Howells	                                |       +-------+       |       |
670bd95eSDavid Howells	                                |       :       :       +-------+
670bd95eSDavid Howells	                                 \      :       :
670bd95eSDavid Howells	                                  \     +-------+
670bd95eSDavid Howells	                                   ---->| A->1  |
670bd95eSDavid Howells	                                        +-------+
670bd95eSDavid Howells	                                        :       :
108b42b4SDavid Howells
108b42b4SDavid Howells
6bc39274SDavid HowellsIf, however, a read barrier were to be placed between the load of B and the
670bd95eSDavid Howellsload of A on CPU 2:
108b42b4SDavid Howells
670bd95eSDavid Howells	CPU 1			CPU 2
670bd95eSDavid Howells	=======================	=======================
670bd95eSDavid Howells		{ A = 0, B = 9 }
670bd95eSDavid Howells	STORE A=1
670bd95eSDavid Howells	<write barrier>
670bd95eSDavid Howells	STORE B=2
670bd95eSDavid Howells				LOAD B
670bd95eSDavid Howells				<read barrier>
670bd95eSDavid Howells				LOAD A
670bd95eSDavid Howells
670bd95eSDavid Howellsthen the partial ordering imposed by CPU 1 will be perceived correctly by CPU
670bd95eSDavid Howells2:
670bd95eSDavid Howells
670bd95eSDavid Howells	+-------+       :      :                :       :
670bd95eSDavid Howells	|       |       +------+                +-------+
670bd95eSDavid Howells	|       |------>| A=1  |------      --->| A->0  |
670bd95eSDavid Howells	|       |       +------+      \         +-------+
670bd95eSDavid Howells	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
670bd95eSDavid Howells	|       |       +------+        |       +-------+
670bd95eSDavid Howells	|       |------>| B=2  |---     |       :       :
670bd95eSDavid Howells	|       |       +------+   \    |       :       :       +-------+
670bd95eSDavid Howells	+-------+       :      :    \   |       +-------+       |       |
670bd95eSDavid Howells	                             ---------->| B->2  |------>|       |
670bd95eSDavid Howells	                                |       +-------+       | CPU 2 |
670bd95eSDavid Howells	                                |       :       :       |       |
670bd95eSDavid Howells	                                |       :       :       |       |
108b42b4SDavid Howells	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
108b42b4SDavid Howells	  barrier causes all effects      \     +-------+       |       |
670bd95eSDavid Howells	  prior to the storage of B        ---->| A->1  |------>|       |
670bd95eSDavid Howells	  to be perceptible to CPU 2            +-------+       |       |
670bd95eSDavid Howells	                                        :       :       +-------+
670bd95eSDavid Howells
670bd95eSDavid Howells
To illustrate this more completely, consider what could happen if the code
contained a load of A on either side of the read barrier:
670bd95eSDavid Howells
670bd95eSDavid Howells	CPU 1			CPU 2
670bd95eSDavid Howells	=======================	=======================
670bd95eSDavid Howells		{ A = 0, B = 9 }
670bd95eSDavid Howells	STORE A=1
670bd95eSDavid Howells	<write barrier>
670bd95eSDavid Howells	STORE B=2
670bd95eSDavid Howells				LOAD B
670bd95eSDavid Howells				LOAD A [first load of A]
670bd95eSDavid Howells				<read barrier>
670bd95eSDavid Howells				LOAD A [second load of A]
670bd95eSDavid Howells
Even though the two loads of A both occur after the load of B, they may
come up with different values:
670bd95eSDavid Howells
670bd95eSDavid Howells	+-------+       :      :                :       :
670bd95eSDavid Howells	|       |       +------+                +-------+
670bd95eSDavid Howells	|       |------>| A=1  |------      --->| A->0  |
670bd95eSDavid Howells	|       |       +------+      \         +-------+
670bd95eSDavid Howells	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
670bd95eSDavid Howells	|       |       +------+        |       +-------+
670bd95eSDavid Howells	|       |------>| B=2  |---     |       :       :
670bd95eSDavid Howells	|       |       +------+   \    |       :       :       +-------+
670bd95eSDavid Howells	+-------+       :      :    \   |       +-------+       |       |
670bd95eSDavid Howells	                             ---------->| B->2  |------>|       |
670bd95eSDavid Howells	                                |       +-------+       | CPU 2 |
670bd95eSDavid Howells	                                |       :       :       |       |
670bd95eSDavid Howells	                                |       :       :       |       |
670bd95eSDavid Howells	                                |       +-------+       |       |
670bd95eSDavid Howells	                                |       | A->0  |------>| 1st   |
670bd95eSDavid Howells	                                |       +-------+       |       |
670bd95eSDavid Howells	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
670bd95eSDavid Howells	  barrier causes all effects      \     +-------+       |       |
670bd95eSDavid Howells	  prior to the storage of B        ---->| A->1  |------>| 2nd   |
670bd95eSDavid Howells	  to be perceptible to CPU 2            +-------+       |       |
670bd95eSDavid Howells	                                        :       :       +-------+
670bd95eSDavid Howells
670bd95eSDavid Howells
670bd95eSDavid HowellsBut it may be that the update to A from CPU 1 becomes perceptible to CPU 2
670bd95eSDavid Howellsbefore the read barrier completes anyway:
670bd95eSDavid Howells
670bd95eSDavid Howells	+-------+       :      :                :       :
670bd95eSDavid Howells	|       |       +------+                +-------+
670bd95eSDavid Howells	|       |------>| A=1  |------      --->| A->0  |
670bd95eSDavid Howells	|       |       +------+      \         +-------+
670bd95eSDavid Howells	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
670bd95eSDavid Howells	|       |       +------+        |       +-------+
670bd95eSDavid Howells	|       |------>| B=2  |---     |       :       :
670bd95eSDavid Howells	|       |       +------+   \    |       :       :       +-------+
670bd95eSDavid Howells	+-------+       :      :    \   |       +-------+       |       |
670bd95eSDavid Howells	                             ---------->| B->2  |------>|       |
670bd95eSDavid Howells	                                |       +-------+       | CPU 2 |
670bd95eSDavid Howells	                                |       :       :       |       |
670bd95eSDavid Howells	                                 \      :       :       |       |
670bd95eSDavid Howells	                                  \     +-------+       |       |
670bd95eSDavid Howells	                                   ---->| A->1  |------>| 1st   |
670bd95eSDavid Howells	                                        +-------+       |       |
670bd95eSDavid Howells	                                    rrrrrrrrrrrrrrrrr   |       |
670bd95eSDavid Howells	                                        +-------+       |       |
670bd95eSDavid Howells	                                        | A->1  |------>| 2nd   |
108b42b4SDavid Howells	                                        +-------+       |       |
108b42b4SDavid Howells	                                        :       :       +-------+
108b42b4SDavid Howells
108b42b4SDavid Howells
670bd95eSDavid HowellsThe guarantee is that the second load will always come up with A == 1 if the
670bd95eSDavid Howellsload of B came up with B == 2.  No such guarantee exists for the first load of
670bd95eSDavid HowellsA; that may come up with either A == 0 or A == 1.
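
The guarantee can be written down as a small userspace sketch.  It assumes
C11 fences as stand-ins: a release fence for the write barrier and an
acquire fence for the read barrier, both of which are somewhat stronger
than the kernel barriers they imitate.  The names cpu1_writer(),
cpu2_reader(), struct observed and guarantee_holds() are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int A = 0;
static atomic_int B = 9;

/* CPU 1: STORE A=1; <write barrier>; STORE B=2 */
static void cpu1_writer(void)
{
	atomic_store_explicit(&A, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* stand-in: write barrier */
	atomic_store_explicit(&B, 2, memory_order_relaxed);
}

struct observed {
	int b;
	int a_first;	/* load of A before the read barrier */
	int a_second;	/* load of A after the read barrier */
};

/* CPU 2: LOAD B; LOAD A; <read barrier>; LOAD A */
static struct observed cpu2_reader(void)
{
	struct observed o;

	o.b = atomic_load_explicit(&B, memory_order_relaxed);
	o.a_first = atomic_load_explicit(&A, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* stand-in: read barrier */
	o.a_second = atomic_load_explicit(&A, memory_order_relaxed);
	return o;
}

/* B == 2 implies that the second load of A returned 1. */
static bool guarantee_holds(struct observed o)
{
	return o.b != 2 || o.a_second == 1;
}
```

Note that guarantee_holds() deliberately says nothing about a_first, which
may be either 0 or 1 when b is 2.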
670bd95eSDavid Howells
670bd95eSDavid Howells
670bd95eSDavid HowellsREAD MEMORY BARRIERS VS LOAD SPECULATION
670bd95eSDavid Howells----------------------------------------
670bd95eSDavid Howells
Many CPUs speculate with loads: that is, they see that they will need to load
an item from memory, find a time when they're not using the bus for any other
loads, and do the load in advance, even though they haven't actually reached
that point in the instruction execution flow yet.  This permits the actual
load instruction to potentially complete immediately because the CPU already
has the value to hand.
670bd95eSDavid Howells
670bd95eSDavid HowellsIt may turn out that the CPU didn't actually need the value - perhaps because a
670bd95eSDavid Howellsbranch circumvented the load - in which case it can discard the value or just
670bd95eSDavid Howellscache it for later use.
670bd95eSDavid Howells
670bd95eSDavid HowellsConsider:
670bd95eSDavid Howells
670bd95eSDavid Howells	CPU 1			CPU 2
670bd95eSDavid Howells	=======================	=======================
670bd95eSDavid Howells				LOAD B
670bd95eSDavid Howells				DIVIDE		} Divide instructions generally
670bd95eSDavid Howells				DIVIDE		} take a long time to perform
670bd95eSDavid Howells				LOAD A
670bd95eSDavid Howells
670bd95eSDavid HowellsWhich might appear as this:
670bd95eSDavid Howells
670bd95eSDavid Howells	                                        :       :       +-------+
670bd95eSDavid Howells	                                        +-------+       |       |
670bd95eSDavid Howells	                                    --->| B->2  |------>|       |
670bd95eSDavid Howells	                                        +-------+       | CPU 2 |
670bd95eSDavid Howells	                                        :       :DIVIDE |       |
670bd95eSDavid Howells	                                        +-------+       |       |
670bd95eSDavid Howells	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
670bd95eSDavid Howells	division speculates on the              +-------+   ~   |       |
670bd95eSDavid Howells	LOAD of A                               :       :   ~   |       |
670bd95eSDavid Howells	                                        :       :DIVIDE |       |
670bd95eSDavid Howells	                                        :       :   ~   |       |
670bd95eSDavid Howells	Once the divisions are complete -->     :       :   ~-->|       |
670bd95eSDavid Howells	the CPU can then perform the            :       :       |       |
670bd95eSDavid Howells	LOAD with immediate effect              :       :       +-------+
670bd95eSDavid Howells
670bd95eSDavid Howells
203185f6SAkira YokosawaPlacing a read barrier or an address-dependency barrier just before the second
670bd95eSDavid Howellsload:
670bd95eSDavid Howells
670bd95eSDavid Howells	CPU 1			CPU 2
670bd95eSDavid Howells	=======================	=======================
670bd95eSDavid Howells				LOAD B
670bd95eSDavid Howells				DIVIDE
670bd95eSDavid Howells				DIVIDE
670bd95eSDavid Howells				<read barrier>
670bd95eSDavid Howells				LOAD A
670bd95eSDavid Howells
670bd95eSDavid Howellswill force any value speculatively obtained to be reconsidered to an extent
670bd95eSDavid Howellsdependent on the type of barrier used.  If there was no change made to the
670bd95eSDavid Howellsspeculated memory location, then the speculated value will just be used:
670bd95eSDavid Howells
670bd95eSDavid Howells	                                        :       :       +-------+
670bd95eSDavid Howells	                                        +-------+       |       |
670bd95eSDavid Howells	                                    --->| B->2  |------>|       |
670bd95eSDavid Howells	                                        +-------+       | CPU 2 |
670bd95eSDavid Howells	                                        :       :DIVIDE |       |
670bd95eSDavid Howells	                                        +-------+       |       |
670bd95eSDavid Howells	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
670bd95eSDavid Howells	division speculates on the              +-------+   ~   |       |
670bd95eSDavid Howells	LOAD of A                               :       :   ~   |       |
670bd95eSDavid Howells	                                        :       :DIVIDE |       |
670bd95eSDavid Howells	                                        :       :   ~   |       |
670bd95eSDavid Howells	                                        :       :   ~   |       |
670bd95eSDavid Howells	                                    rrrrrrrrrrrrrrrr~   |       |
670bd95eSDavid Howells	                                        :       :   ~   |       |
670bd95eSDavid Howells	                                        :       :   ~-->|       |
670bd95eSDavid Howells	                                        :       :       |       |
670bd95eSDavid Howells	                                        :       :       +-------+
670bd95eSDavid Howells
670bd95eSDavid Howells
670bd95eSDavid Howellsbut if there was an update or an invalidation from another CPU pending, then
670bd95eSDavid Howellsthe speculation will be cancelled and the value reloaded:
670bd95eSDavid Howells
670bd95eSDavid Howells	                                        :       :       +-------+
670bd95eSDavid Howells	                                        +-------+       |       |
670bd95eSDavid Howells	                                    --->| B->2  |------>|       |
670bd95eSDavid Howells	                                        +-------+       | CPU 2 |
670bd95eSDavid Howells	                                        :       :DIVIDE |       |
670bd95eSDavid Howells	                                        +-------+       |       |
670bd95eSDavid Howells	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
670bd95eSDavid Howells	division speculates on the              +-------+   ~   |       |
670bd95eSDavid Howells	LOAD of A                               :       :   ~   |       |
670bd95eSDavid Howells	                                        :       :DIVIDE |       |
670bd95eSDavid Howells	                                        :       :   ~   |       |
670bd95eSDavid Howells	                                        :       :   ~   |       |
670bd95eSDavid Howells	                                    rrrrrrrrrrrrrrrrr   |       |
670bd95eSDavid Howells	                                        +-------+       |       |
670bd95eSDavid Howells	The speculation is discarded --->   --->| A->1  |------>|       |
670bd95eSDavid Howells	and an updated value is                 +-------+       |       |
670bd95eSDavid Howells	retrieved                               :       :       +-------+
670bd95eSDavid Howells
670bd95eSDavid Howells
MULTICOPY ATOMICITY
-------------------
241e6663SPaul E. McKenney
f1ab25a3SPaul E. McKenneyMulticopy atomicity is a deeply intuitive notion about ordering that is
f1ab25a3SPaul E. McKenneynot always provided by real computer systems, namely that a given store
0902b1f4SAlan Sternbecomes visible at the same time to all CPUs, or, alternatively, that all
0902b1f4SAlan SternCPUs agree on the order in which all stores become visible.  However,
0902b1f4SAlan Sternsupport of full multicopy atomicity would rule out valuable hardware
0902b1f4SAlan Sternoptimizations, so a weaker form called ``other multicopy atomicity''
0902b1f4SAlan Sterninstead guarantees only that a given store becomes visible at the same
0902b1f4SAlan Sterntime to all -other- CPUs.  The remainder of this document discusses this
0902b1f4SAlan Sternweaker form, but for brevity will call it simply ``multicopy atomicity''.
f1ab25a3SPaul E. McKenney
f1ab25a3SPaul E. McKenneyThe following example demonstrates multicopy atomicity:
241e6663SPaul E. McKenney
241e6663SPaul E. McKenney	CPU 1			CPU 2			CPU 3
241e6663SPaul E. McKenney	=======================	=======================	=======================
241e6663SPaul E. McKenney		{ X = 0, Y = 0 }
f1ab25a3SPaul E. McKenney	STORE X=1		r1=LOAD X (reads 1)	LOAD Y (reads 1)
f1ab25a3SPaul E. McKenney				<general barrier>	<read barrier>
f1ab25a3SPaul E. McKenney				STORE Y=r1		LOAD X
241e6663SPaul E. McKenney
0902b1f4SAlan SternSuppose that CPU 2's load from X returns 1, which it then stores to Y,
0902b1f4SAlan Sternand CPU 3's load from Y returns 1.  This indicates that CPU 1's store
0902b1f4SAlan Sternto X precedes CPU 2's load from X and that CPU 2's store to Y precedes
0902b1f4SAlan SternCPU 3's load from Y.  In addition, the memory barriers guarantee that
0902b1f4SAlan SternCPU 2 executes its load before its store, and CPU 3 loads from Y before
0902b1f4SAlan Sternit loads from X.  The question is then "Can CPU 3's load from X return 0?"
241e6663SPaul E. McKenney
0902b1f4SAlan SternBecause CPU 3's load from X in some sense comes after CPU 2's load, it
241e6663SPaul E. McKenneyis natural to expect that CPU 3's load from X must therefore return 1.
0902b1f4SAlan SternThis expectation follows from multicopy atomicity: if a load executing
0902b1f4SAlan Sternon CPU B follows a load from the same variable executing on CPU A (and
0902b1f4SAlan SternCPU A did not originally store the value which it read), then on
0902b1f4SAlan Sternmulticopy-atomic systems, CPU B's load must return either the same value
0902b1f4SAlan Sternthat CPU A's load did or some later value.  However, the Linux kernel
0902b1f4SAlan Sterndoes not require systems to be multicopy atomic.
241e6663SPaul E. McKenney
0902b1f4SAlan SternThe use of a general memory barrier in the example above compensates
0902b1f4SAlan Sternfor any lack of multicopy atomicity.  In the example, if CPU 2's load
0902b1f4SAlan Sternfrom X returns 1 and CPU 3's load from Y returns 1, then CPU 3's load
0902b1f4SAlan Sternfrom X must indeed also return 1.
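
As a rough userspace sketch of the three-CPU example, assuming C11 atomics
in place of the kernel primitives: a sequentially consistent fence stands
in for the general barrier on CPU 2 and an acquire fence stands in for the
read barrier on CPU 3.  The function and structure names are hypothetical,
and the mapping to kernel barriers is approximate rather than exact.

```c
#include <stdatomic.h>

static atomic_int X = 0;
static atomic_int Y = 0;

/* CPU 1: STORE X=1 */
static void cpu1(void)
{
	atomic_store_explicit(&X, 1, memory_order_relaxed);
}

/* CPU 2: r1=LOAD X; <general barrier>; STORE Y=r1 */
static int cpu2(void)
{
	int r1 = atomic_load_explicit(&X, memory_order_relaxed);

	atomic_thread_fence(memory_order_seq_cst);	/* stand-in: general barrier */
	atomic_store_explicit(&Y, r1, memory_order_relaxed);
	return r1;
}

struct cpu3_result {
	int y;
	int x;
};

/* CPU 3: LOAD Y; <read barrier>; LOAD X */
static struct cpu3_result cpu3(void)
{
	struct cpu3_result r;

	r.y = atomic_load_explicit(&Y, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* stand-in: read barrier */
	r.x = atomic_load_explicit(&X, memory_order_relaxed);
	return r;
}
```

If cpu2() returns 1 and cpu3() observes y == 1, then cpu3() must also
observe x == 1, whichever threads the three functions run on.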
241e6663SPaul E. McKenney
f1ab25a3SPaul E. McKenneyHowever, dependencies, read barriers, and write barriers are not always
f1ab25a3SPaul E. McKenneyable to compensate for non-multicopy atomicity.  For example, suppose
f1ab25a3SPaul E. McKenneythat CPU 2's general barrier is removed from the above example, leaving
f1ab25a3SPaul E. McKenneyonly the data dependency shown below:
241e6663SPaul E. McKenney
241e6663SPaul E. McKenney	CPU 1			CPU 2			CPU 3
241e6663SPaul E. McKenney	=======================	=======================	=======================
241e6663SPaul E. McKenney		{ X = 0, Y = 0 }
f1ab25a3SPaul E. McKenney	STORE X=1		r1=LOAD X (reads 1)	LOAD Y (reads 1)
f1ab25a3SPaul E. McKenney				<data dependency>	<read barrier>
f1ab25a3SPaul E. McKenney				STORE Y=r1		LOAD X (reads 0)
241e6663SPaul E. McKenney
f1ab25a3SPaul E. McKenneyThis substitution allows non-multicopy atomicity to run rampant: in
f1ab25a3SPaul E. McKenneythis example, it is perfectly legal for CPU 2's load from X to return 1,
f1ab25a3SPaul E. McKenneyCPU 3's load from Y to return 1, and its load from X to return 0.
241e6663SPaul E. McKenney
f1ab25a3SPaul E. McKenneyThe key point is that although CPU 2's data dependency orders its load
0902b1f4SAlan Sternand store, it does not guarantee to order CPU 1's store.  Thus, if this
0902b1f4SAlan Sternexample runs on a non-multicopy-atomic system where CPUs 1 and 2 share a
0902b1f4SAlan Sternstore buffer or a level of cache, CPU 2 might have early access to CPU 1's
0902b1f4SAlan Sternwrites.  General barriers are therefore required to ensure that all CPUs
0902b1f4SAlan Sternagree on the combined order of multiple accesses.
241e6663SPaul E. McKenney
General barriers not only compensate for non-multicopy atomicity, but
can also generate additional ordering that can ensure that -all- CPUs
will perceive the same order of -all- operations.  In contrast, a chain
of release-acquire pairs does not provide this additional ordering,
which means that only those CPUs on the chain are guaranteed to agree
on the combined order of the accesses.  For example, switching to C code
in deference to the ghost of Herman Hollerith:
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenney	int u, v, x, y, z;
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenney	void cpu0(void)
c535cc92SPaul E. McKenney	{
c535cc92SPaul E. McKenney		r0 = smp_load_acquire(&x);
c535cc92SPaul E. McKenney		WRITE_ONCE(u, 1);
c535cc92SPaul E. McKenney		smp_store_release(&y, 1);
c535cc92SPaul E. McKenney	}
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenney	void cpu1(void)
c535cc92SPaul E. McKenney	{
c535cc92SPaul E. McKenney		r1 = smp_load_acquire(&y);
c535cc92SPaul E. McKenney		r4 = READ_ONCE(v);
c535cc92SPaul E. McKenney		r5 = READ_ONCE(u);
c535cc92SPaul E. McKenney		smp_store_release(&z, 1);
c535cc92SPaul E. McKenney	}
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenney	void cpu2(void)
c535cc92SPaul E. McKenney	{
c535cc92SPaul E. McKenney		r2 = smp_load_acquire(&z);
c535cc92SPaul E. McKenney		smp_store_release(&x, 1);
c535cc92SPaul E. McKenney	}
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenney	void cpu3(void)
c535cc92SPaul E. McKenney	{
c535cc92SPaul E. McKenney		WRITE_ONCE(v, 1);
c535cc92SPaul E. McKenney		smp_mb();
c535cc92SPaul E. McKenney		r3 = READ_ONCE(u);
c535cc92SPaul E. McKenney	}
c535cc92SPaul E. McKenney
f1ab25a3SPaul E. McKenneyBecause cpu0(), cpu1(), and cpu2() participate in a chain of
f1ab25a3SPaul E. McKenneysmp_store_release()/smp_load_acquire() pairs, the following outcome
f1ab25a3SPaul E. McKenneyis prohibited:
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenney	r0 == 1 && r1 == 1 && r2 == 1
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenneyFurthermore, because of the release-acquire relationship between cpu0()
c535cc92SPaul E. McKenneyand cpu1(), cpu1() must see cpu0()'s writes, so that the following
c535cc92SPaul E. McKenneyoutcome is prohibited:
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenney	r1 == 1 && r5 == 0
c535cc92SPaul E. McKenney
f1ab25a3SPaul E. McKenneyHowever, the ordering provided by a release-acquire chain is local
f1ab25a3SPaul E. McKenneyto the CPUs participating in that chain and does not apply to cpu3(),
f1ab25a3SPaul E. McKenneyat least aside from stores.  Therefore, the following outcome is possible:
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenney	r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
c535cc92SPaul E. McKenney
37ef0341SPaul E. McKenneyAs an aside, the following outcome is also possible:
37ef0341SPaul E. McKenney
37ef0341SPaul E. McKenney	r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1
37ef0341SPaul E. McKenney
c535cc92SPaul E. McKenneyAlthough cpu0(), cpu1(), and cpu2() will see their respective reads and
c535cc92SPaul E. McKenneywrites in order, CPUs not involved in the release-acquire chain might
c535cc92SPaul E. McKenneywell disagree on the order.  This disagreement stems from the fact that
c535cc92SPaul E. McKenneythe weak memory-barrier instructions used to implement smp_load_acquire()
c535cc92SPaul E. McKenneyand smp_store_release() are not required to order prior stores against
c535cc92SPaul E. McKenneysubsequent loads in all cases.  This means that cpu3() can see cpu0()'s
c535cc92SPaul E. McKenneystore to u as happening -after- cpu1()'s load from v, even though
c535cc92SPaul E. McKenneyboth cpu0() and cpu1() agree that these two operations occurred in the
c535cc92SPaul E. McKenneyintended order.
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenneyHowever, please keep in mind that smp_load_acquire() is not magic.
c535cc92SPaul E. McKenneyIn particular, it simply reads from its argument with ordering.  It does
c535cc92SPaul E. McKenney-not- ensure that any particular value will be read.  Therefore, the
c535cc92SPaul E. McKenneyfollowing outcome is possible:
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenney	r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
c535cc92SPaul E. McKenney
c535cc92SPaul E. McKenneyNote that this outcome can happen even on a mythical sequentially
c535cc92SPaul E. McKenneyconsistent system where nothing is ever reordered.
c535cc92SPaul E. McKenney
f1ab25a3SPaul E. McKenneyTo reiterate, if your code requires full ordering of all operations,
f1ab25a3SPaul E. McKenneyuse general barriers throughout.
241e6663SPaul E. McKenney
241e6663SPaul E. McKenney
108b42b4SDavid Howells========================
108b42b4SDavid HowellsEXPLICIT KERNEL BARRIERS
108b42b4SDavid Howells========================
108b42b4SDavid Howells
108b42b4SDavid HowellsThe Linux kernel has a variety of different barriers that act at different
108b42b4SDavid Howellslevels:
108b42b4SDavid Howells
108b42b4SDavid Howells  (*) Compiler barrier.
108b42b4SDavid Howells
108b42b4SDavid Howells  (*) CPU memory barriers.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsCOMPILER BARRIER
108b42b4SDavid Howells----------------
108b42b4SDavid Howells
The Linux kernel has an explicit compiler barrier function that prevents
the compiler from moving the memory accesses on either side of it to the
other side:
108b42b4SDavid Howells
108b42b4SDavid Howells	barrier();
108b42b4SDavid Howells
9af194ceSPaul E. McKenneyThis is a general barrier -- there are no read-read or write-write
9af194ceSPaul E. McKenneyvariants of barrier().  However, READ_ONCE() and WRITE_ONCE() can be
9af194ceSPaul E. McKenneythought of as weak forms of barrier() that affect only the specific
9af194ceSPaul E. McKenneyaccesses flagged by the READ_ONCE() or WRITE_ONCE().
108b42b4SDavid Howells
692118daSPaul E. McKenneyThe barrier() function has the following effects:
692118daSPaul E. McKenney
692118daSPaul E. McKenney (*) Prevents the compiler from reordering accesses following the
692118daSPaul E. McKenney     barrier() to precede any accesses preceding the barrier().
692118daSPaul E. McKenney     One example use for this property is to ease communication between
692118daSPaul E. McKenney     interrupt-handler code and the code that was interrupted.
692118daSPaul E. McKenney
692118daSPaul E. McKenney (*) Within a loop, forces the compiler to load the variables used
692118daSPaul E. McKenney     in that loop's conditional on each pass through that loop.
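
The second effect can be illustrated with a minimal userspace sketch.  It
assumes a GCC or Clang toolchain and defines its own barrier() as an empty
asm statement with a "memory" clobber; stop_requested, request_stop() and
spin_until_stopped() are hypothetical names.  The flag is meant to model
communication with an interrupt handler on the same CPU, not a substitute
for CPU memory barriers between CPUs.

```c
/* Userspace stand-in for barrier(); relies on GCC/Clang extended asm. */
#define barrier() __asm__ __volatile__("" : : : "memory")

static int stop_requested;

/* Models the interrupt handler that asks the loop to stop. */
static void request_stop(void)
{
	stop_requested = 1;
}

/*
 * Spin until stop_requested becomes nonzero or max_spins passes have
 * been made.  Returns the number of passes through the loop body.
 */
static unsigned long spin_until_stopped(unsigned long max_spins)
{
	unsigned long spins = 0;

	while (!stop_requested && spins < max_spins) {
		spins++;
		/* Forces stop_requested to be reloaded on the next pass. */
		barrier();
	}
	return spins;
}
```

Without the barrier(), the compiler would be entitled to load
stop_requested once and reuse that value for every pass through the loop.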
692118daSPaul E. McKenney
9af194ceSPaul E. McKenneyThe READ_ONCE() and WRITE_ONCE() functions can prevent any number of
9af194ceSPaul E. McKenneyoptimizations that, while perfectly safe in single-threaded code, can
9af194ceSPaul E. McKenneybe fatal in concurrent code.  Here are some examples of these sorts
9af194ceSPaul E. McKenneyof optimizations:
692118daSPaul E. McKenney
449f7413SPaul E. McKenney (*) The compiler is within its rights to reorder loads and stores
449f7413SPaul E. McKenney     to the same variable, and in some cases, the CPU is within its
449f7413SPaul E. McKenney     rights to reorder loads to the same variable.  This means that
449f7413SPaul E. McKenney     the following code:
449f7413SPaul E. McKenney
449f7413SPaul E. McKenney	a[0] = x;
449f7413SPaul E. McKenney	a[1] = x;
449f7413SPaul E. McKenney
449f7413SPaul E. McKenney     Might result in an older value of x stored in a[1] than in a[0].
449f7413SPaul E. McKenney     Prevent both the compiler and the CPU from doing this as follows:
449f7413SPaul E. McKenney
9af194ceSPaul E. McKenney	a[0] = READ_ONCE(x);
9af194ceSPaul E. McKenney	a[1] = READ_ONCE(x);
449f7413SPaul E. McKenney
9af194ceSPaul E. McKenney     In short, READ_ONCE() and WRITE_ONCE() provide cache coherence for
9af194ceSPaul E. McKenney     accesses from multiple CPUs to a single variable.
449f7413SPaul E. McKenney
692118daSPaul E. McKenney (*) The compiler is within its rights to merge successive loads from
692118daSPaul E. McKenney     the same variable.  Such merging can cause the compiler to "optimize"
692118daSPaul E. McKenney     the following code:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	while (tmp = a)
692118daSPaul E. McKenney		do_something_with(tmp);
692118daSPaul E. McKenney
692118daSPaul E. McKenney     into the following code, which, although in some sense legitimate
692118daSPaul E. McKenney     for single-threaded code, is almost certainly not what the developer
692118daSPaul E. McKenney     intended:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	if (tmp = a)
692118daSPaul E. McKenney		for (;;)
692118daSPaul E. McKenney			do_something_with(tmp);
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney     Use READ_ONCE() to prevent the compiler from doing this to you:
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney	while (tmp = READ_ONCE(a))
692118daSPaul E. McKenney		do_something_with(tmp);
692118daSPaul E. McKenney
692118daSPaul E. McKenney (*) The compiler is within its rights to reload a variable, for example,
692118daSPaul E. McKenney     in cases where high register pressure prevents the compiler from
692118daSPaul E. McKenney     keeping all data of interest in registers.  The compiler might
692118daSPaul E. McKenney     therefore optimize the variable 'tmp' out of our previous example:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	while (tmp = a)
692118daSPaul E. McKenney		do_something_with(tmp);
692118daSPaul E. McKenney
692118daSPaul E. McKenney     This could result in the following code, which is perfectly safe in
692118daSPaul E. McKenney     single-threaded code, but can be fatal in concurrent code:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	while (a)
692118daSPaul E. McKenney		do_something_with(a);
692118daSPaul E. McKenney
692118daSPaul E. McKenney     For example, the optimized version of this code could result in
692118daSPaul E. McKenney     passing a zero to do_something_with() in the case where the variable
692118daSPaul E. McKenney     a was modified by some other CPU between the "while" statement and
692118daSPaul E. McKenney     the call to do_something_with().
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney     Again, use READ_ONCE() to prevent the compiler from doing this:
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney	while (tmp = READ_ONCE(a))
692118daSPaul E. McKenney		do_something_with(tmp);
692118daSPaul E. McKenney
692118daSPaul E. McKenney     Note that if the compiler runs short of registers, it might save
692118daSPaul E. McKenney     tmp onto the stack.  The overhead of this saving and later restoring
692118daSPaul E. McKenney     is why compilers reload variables.  Doing so is perfectly safe for
692118daSPaul E. McKenney     single-threaded code, so you need to tell the compiler about cases
692118daSPaul E. McKenney     where it is not safe.
692118daSPaul E. McKenney
692118daSPaul E. McKenney (*) The compiler is within its rights to omit a load entirely if it knows
692118daSPaul E. McKenney     what the value will be.  For example, if the compiler can prove that
692118daSPaul E. McKenney     the value of variable 'a' is always zero, it can optimize this code:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	while (tmp = a)
692118daSPaul E. McKenney		do_something_with(tmp);
692118daSPaul E. McKenney
692118daSPaul E. McKenney     Into this:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	do { } while (0);
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney     This transformation is a win for single-threaded code because it
9af194ceSPaul E. McKenney     gets rid of a load and a branch.  The problem is that the compiler
9af194ceSPaul E. McKenney     will carry out its proof assuming that the current CPU is the only
9af194ceSPaul E. McKenney     one updating variable 'a'.  If variable 'a' is shared, then the
9af194ceSPaul E. McKenney     compiler's proof will be erroneous.  Use READ_ONCE() to tell the
9af194ceSPaul E. McKenney     compiler that it doesn't know as much as it thinks it does:
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney	while (tmp = READ_ONCE(a))
692118daSPaul E. McKenney		do_something_with(tmp);
692118daSPaul E. McKenney
692118daSPaul E. McKenney     But please note that the compiler is also closely watching what you
9af194ceSPaul E. McKenney     do with the value after the READ_ONCE().  For example, suppose you
692118daSPaul E. McKenney     do the following and MAX is a preprocessor macro with the value 1:
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney	while ((tmp = READ_ONCE(a)) % MAX)
692118daSPaul E. McKenney		do_something_with(tmp);
692118daSPaul E. McKenney
692118daSPaul E. McKenney     Then the compiler knows that the result of the "%" operator applied
692118daSPaul E. McKenney     to MAX will always be zero, again allowing the compiler to optimize
692118daSPaul E. McKenney     the code into near-nonexistence.  (It will still load from the
692118daSPaul E. McKenney     variable 'a'.)
692118daSPaul E. McKenney
692118daSPaul E. McKenney (*) Similarly, the compiler is within its rights to omit a store entirely
692118daSPaul E. McKenney     if it knows that the variable already has the value being stored.
692118daSPaul E. McKenney     Again, the compiler assumes that the current CPU is the only one
692118daSPaul E. McKenney     storing into the variable, which can cause the compiler to do the
692118daSPaul E. McKenney     wrong thing for shared variables.  For example, suppose you have
692118daSPaul E. McKenney     the following:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	a = 0;
65f95ff2SSeongJae Park	... Code that does not store to variable a ...
692118daSPaul E. McKenney	a = 0;
692118daSPaul E. McKenney
692118daSPaul E. McKenney     The compiler sees that the value of variable 'a' is already zero, so
692118daSPaul E. McKenney     it might well omit the second store.  This would come as a fatal
692118daSPaul E. McKenney     surprise if some other CPU might have stored to variable 'a' in the
692118daSPaul E. McKenney     meantime.
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney     Use WRITE_ONCE() to prevent the compiler from making this sort of
692118daSPaul E. McKenney     wrong guess:
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney	WRITE_ONCE(a, 0);
65f95ff2SSeongJae Park	... Code that does not store to variable a ...
9af194ceSPaul E. McKenney	WRITE_ONCE(a, 0);
692118daSPaul E. McKenney
692118daSPaul E. McKenney (*) The compiler is within its rights to reorder memory accesses unless
692118daSPaul E. McKenney     you tell it not to.  For example, consider the following interaction
692118daSPaul E. McKenney     between process-level code and an interrupt handler:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	void process_level(void)
692118daSPaul E. McKenney	{
692118daSPaul E. McKenney		msg = get_message();
692118daSPaul E. McKenney		flag = true;
692118daSPaul E. McKenney	}
692118daSPaul E. McKenney
692118daSPaul E. McKenney	void interrupt_handler(void)
692118daSPaul E. McKenney	{
692118daSPaul E. McKenney		if (flag)
692118daSPaul E. McKenney			process_message(msg);
692118daSPaul E. McKenney	}
692118daSPaul E. McKenney
df5cbb27SMasanari Iida     There is nothing to prevent the compiler from transforming
692118daSPaul E. McKenney     process_level() to the following, in fact, this might well be a
692118daSPaul E. McKenney     win for single-threaded code:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	void process_level(void)
692118daSPaul E. McKenney	{
692118daSPaul E. McKenney		flag = true;
692118daSPaul E. McKenney		msg = get_message();
692118daSPaul E. McKenney	}
692118daSPaul E. McKenney
692118daSPaul E. McKenney     If the interrupt occurs between these two statement, then
9af194ceSPaul E. McKenney     interrupt_handler() might be passed a garbled msg.  Use WRITE_ONCE()
692118daSPaul E. McKenney     to prevent this as follows:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	void process_level(void)
692118daSPaul E. McKenney	{
9af194ceSPaul E. McKenney		WRITE_ONCE(msg, get_message());
9af194ceSPaul E. McKenney		WRITE_ONCE(flag, true);
692118daSPaul E. McKenney	}
692118daSPaul E. McKenney
692118daSPaul E. McKenney	void interrupt_handler(void)
692118daSPaul E. McKenney	{
9af194ceSPaul E. McKenney		if (READ_ONCE(flag))
9af194ceSPaul E. McKenney			process_message(READ_ONCE(msg));
692118daSPaul E. McKenney	}
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney     Note that the READ_ONCE() and WRITE_ONCE() wrappers in
9af194ceSPaul E. McKenney     interrupt_handler() are needed if this interrupt handler can itself
9af194ceSPaul E. McKenney     be interrupted by something that also accesses 'flag' and 'msg',
9af194ceSPaul E. McKenney     for example, a nested interrupt or an NMI.  Otherwise, READ_ONCE()
9af194ceSPaul E. McKenney     and WRITE_ONCE() are not needed in interrupt_handler() other than
9af194ceSPaul E. McKenney     for documentation purposes.  (Note also that nested interrupts
9af194ceSPaul E. McKenney     do not typically occur in modern Linux kernels; in fact, if an
9af194ceSPaul E. McKenney     interrupt handler returns with interrupts enabled, you will get a
9af194ceSPaul E. McKenney     WARN_ONCE() splat.)
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney     You should assume that the compiler can move READ_ONCE() and
9af194ceSPaul E. McKenney     WRITE_ONCE() past code not containing READ_ONCE(), WRITE_ONCE(),
9af194ceSPaul E. McKenney     barrier(), or similar primitives.
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney     This effect could also be achieved using barrier(), but READ_ONCE()
9af194ceSPaul E. McKenney     and WRITE_ONCE() are more selective:  With READ_ONCE() and
9af194ceSPaul E. McKenney     WRITE_ONCE(), the compiler need only forget the contents of the
9af194ceSPaul E. McKenney     indicated memory locations, while with barrier() the compiler must
8149b5cbSSeongJae Park     discard the value of all memory locations that it has currently
9af194ceSPaul E. McKenney     cached in any machine registers.  Of course, the compiler must also
9af194ceSPaul E. McKenney     respect the order in which the READ_ONCE()s and WRITE_ONCE()s occur,
9af194ceSPaul E. McKenney     though the CPU of course need not do so.
692118daSPaul E. McKenney
692118daSPaul E. McKenney (*) The compiler is within its rights to invent stores to a variable,
692118daSPaul E. McKenney     as in the following example:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	if (a)
692118daSPaul E. McKenney		b = a;
692118daSPaul E. McKenney	else
692118daSPaul E. McKenney		b = 42;
692118daSPaul E. McKenney
692118daSPaul E. McKenney     The compiler might save a branch by optimizing this as follows:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	b = 42;
692118daSPaul E. McKenney	if (a)
692118daSPaul E. McKenney		b = a;
692118daSPaul E. McKenney
692118daSPaul E. McKenney     In single-threaded code, this is not only safe, but also saves
692118daSPaul E. McKenney     a branch.  Unfortunately, in concurrent code, this optimization
692118daSPaul E. McKenney     could cause some other CPU to see a spurious value of 42 -- even
692118daSPaul E. McKenney     if variable 'a' was never zero -- when loading variable 'b'.
9af194ceSPaul E. McKenney     Use WRITE_ONCE() to prevent this as follows:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	if (a)
9af194ceSPaul E. McKenney		WRITE_ONCE(b, a);
692118daSPaul E. McKenney	else
9af194ceSPaul E. McKenney		WRITE_ONCE(b, 42);
692118daSPaul E. McKenney
692118daSPaul E. McKenney     The compiler can also invent loads.  These are usually less
692118daSPaul E. McKenney     damaging, but they can result in cache-line bouncing and thus in
9af194ceSPaul E. McKenney     poor performance and scalability.  Use READ_ONCE() to prevent
692118daSPaul E. McKenney     invented loads.
692118daSPaul E. McKenney
692118daSPaul E. McKenney (*) For aligned memory locations whose size allows them to be accessed
692118daSPaul E. McKenney     with a single memory-reference instruction, prevents "load tearing"
692118daSPaul E. McKenney     and "store tearing," in which a single large access is replaced by
692118daSPaul E. McKenney     multiple smaller accesses.  For example, given an architecture having
692118daSPaul E. McKenney     16-bit store instructions with 7-bit immediate fields, the compiler
692118daSPaul E. McKenney     might be tempted to use two 16-bit store-immediate instructions to
692118daSPaul E. McKenney     implement the following 32-bit store:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	p = 0x00010002;
692118daSPaul E. McKenney
692118daSPaul E. McKenney     Please note that GCC really does use this sort of optimization,
692118daSPaul E. McKenney     which is not surprising given that it would likely take more
692118daSPaul E. McKenney     than two instructions to build the constant and then store it.
692118daSPaul E. McKenney     This optimization can therefore be a win in single-threaded code.
692118daSPaul E. McKenney     In fact, a recent bug (since fixed) caused GCC to incorrectly use
692118daSPaul E. McKenney     this optimization in a volatile store.  In the absence of such bugs,
9af194ceSPaul E. McKenney     use of WRITE_ONCE() prevents store tearing in the following example:
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney	WRITE_ONCE(p, 0x00010002);
692118daSPaul E. McKenney
692118daSPaul E. McKenney     Use of packed structures can also result in load and store tearing,
692118daSPaul E. McKenney     as in this example:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	struct __attribute__((__packed__)) foo {
692118daSPaul E. McKenney		short a;
692118daSPaul E. McKenney		int b;
692118daSPaul E. McKenney		short c;
692118daSPaul E. McKenney	};
692118daSPaul E. McKenney	struct foo foo1, foo2;
692118daSPaul E. McKenney	...
692118daSPaul E. McKenney
692118daSPaul E. McKenney	foo2.a = foo1.a;
692118daSPaul E. McKenney	foo2.b = foo1.b;
692118daSPaul E. McKenney	foo2.c = foo1.c;
692118daSPaul E. McKenney
9af194ceSPaul E. McKenney     Because there are no READ_ONCE() or WRITE_ONCE() wrappers and no
9af194ceSPaul E. McKenney     volatile markings, the compiler would be well within its rights to
9af194ceSPaul E. McKenney     implement these three assignment statements as a pair of 32-bit
9af194ceSPaul E. McKenney     loads followed by a pair of 32-bit stores.  This would result in
9af194ceSPaul E. McKenney     load tearing on 'foo1.b' and store tearing on 'foo2.b'.  READ_ONCE()
9af194ceSPaul E. McKenney     and WRITE_ONCE() again prevent tearing in this example:
692118daSPaul E. McKenney
692118daSPaul E. McKenney	foo2.a = foo1.a;
9af194ceSPaul E. McKenney	WRITE_ONCE(foo2.b, READ_ONCE(foo1.b));
692118daSPaul E. McKenney	foo2.c = foo1.c;
692118daSPaul E. McKenney
9af194ceSPaul E. McKenneyAll that aside, it is never necessary to use READ_ONCE() and
9af194ceSPaul E. McKenneyWRITE_ONCE() on a variable that has been marked volatile.  For example,
9af194ceSPaul E. McKenneybecause 'jiffies' is marked volatile, it is never necessary to
9af194ceSPaul E. McKenneysay READ_ONCE(jiffies).  The reason for this is that READ_ONCE() and
9af194ceSPaul E. McKenneyWRITE_ONCE() are implemented as volatile casts, which have no effect when
9af194ceSPaul E. McKenneytheir argument is already marked volatile.
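
The volatile-cast idea can be sketched as follows.  The two macros are
deliberately simplified stand-ins and are not the kernel's definitions
(which carry additional type checking); they rely on the GCC/Clang
__typeof__ extension.  The variable 'ticks' is a hypothetical
jiffies-like counter introduced only for this illustration.

```c
#include <assert.h>

/*
 * Simplified stand-ins, NOT the kernel's definitions: each access is
 * made through a volatile-qualified lvalue of the variable's own type.
 */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

int a;			/* plain variable: accesses need the wrappers */
volatile int ticks;	/* hypothetical jiffies-like variable */

/* Adds up the values of 'a' seen while counting it down to zero. */
int sum_until_zero(void)
{
	int tmp, sum = 0;

	assert(READ_ONCE(a) >= 0);	/* otherwise the loop would not end */
	/* One real load of 'a' per pass; tmp is never re-derived from 'a'. */
	while ((tmp = READ_ONCE(a)) != 0) {
		sum += tmp;
		WRITE_ONCE(a, tmp - 1);
	}
	return sum;
}

/* 'ticks' is already volatile, so a plain read needs no wrapper. */
int ticks_elapsed(int start)
{
	return ticks - start;
}

void demo(void)
{
	WRITE_ONCE(a, 3);
	assert(sum_until_zero() == 6);	/* 3 + 2 + 1 */
	ticks = 10;
	assert(ticks_elapsed(4) == 6);
}
```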
692118daSPaul E. McKenney
692118daSPaul E. McKenneyPlease note that these compiler barriers have no direct effect on the CPU,
692118daSPaul E. McKenneywhich may then reorder things however it wishes.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsCPU MEMORY BARRIERS
108b42b4SDavid Howells-------------------
108b42b4SDavid Howells
203185f6SAkira YokosawaThe Linux kernel has seven basic CPU memory barriers:
108b42b4SDavid Howells
108b42b4SDavid Howells	TYPE			MANDATORY	SMP CONDITIONAL
203185f6SAkira Yokosawa	=======================	===============	===============
108b42b4SDavid Howells	GENERAL			mb()		smp_mb()
108b42b4SDavid Howells	WRITE			wmb()		smp_wmb()
108b42b4SDavid Howells	READ			rmb()		smp_rmb()
203185f6SAkira Yokosawa	ADDRESS DEPENDENCY			READ_ONCE()
108b42b4SDavid Howells
108b42b4SDavid Howells
203185f6SAkira YokosawaAll memory barriers except the address-dependency barriers imply a compiler
203185f6SAkira Yokosawabarrier.  Address dependencies do not impose any additional compiler ordering.
73f10281SNick Piggin
203185f6SAkira YokosawaAside: In the case of address dependencies, the compiler would be expected
9af194ceSPaul E. McKenneyto issue the loads in the correct order (eg. `a[b]` would have to load
9af194ceSPaul E. McKenneythe value of b before loading a[b]), however there is no guarantee in
9af194ceSPaul E. McKenneythe C specification that the compiler may not speculate the value of b
8149b5cbSSeongJae Park(eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1)
9af194ceSPaul E. McKenneytmp = a[b]; ).  There is also the problem of a compiler reloading b after
9af194ceSPaul E. McKenneyhaving loaded a[b], thus having a newer copy of b than a[b].  A consensus
9af194ceSPaul E. McKenneyhas not yet been reached about these problems, however the READ_ONCE()
9af194ceSPaul E. McKenneymacro is a good place to start looking.
108b42b4SDavid Howells
108b42b4SDavid HowellsSMP memory barriers are reduced to compiler barriers on uniprocessor compiled
81fc6323SJarek Poplawskisystems because it is assumed that a CPU will appear to be self-consistent,
108b42b4SDavid Howellsand will order overlapping accesses correctly with respect to itself.
6a65d263SMichael S. TsirkinHowever, see the subsection on "Virtual Machine Guests" below.
108b42b4SDavid Howells
108b42b4SDavid Howells[!] Note that SMP memory barriers _must_ be used to control the ordering of
108b42b4SDavid Howellsreferences to shared memory on SMP systems, though the use of locking instead
108b42b4SDavid Howellsis sufficient.
108b42b4SDavid Howells
108b42b4SDavid HowellsMandatory barriers should not be used to control SMP effects, since mandatory
6a65d263SMichael S. Tsirkinbarriers impose unnecessary overhead on both SMP and UP systems. They may,
6a65d263SMichael S. Tsirkinhowever, be used to control MMIO effects on accesses through relaxed memory I/O
6a65d263SMichael S. Tsirkinwindows.  These barriers are required even on non-SMP systems as they affect
6a65d263SMichael S. Tsirkinthe order in which memory operations appear to a device by prohibiting both the
6a65d263SMichael S. Tsirkincompiler and the CPU from reordering them.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsThere are some more advanced barrier functions:
108b42b4SDavid Howells
b92b8b35SPeter Zijlstra (*) smp_store_mb(var, value)
108b42b4SDavid Howells
75b2bd55SOleg Nesterov     This assigns the value to the variable and then inserts a full memory
2d142e59SDavidlohr Bueso     barrier after it.  It isn't guaranteed to insert anything more than a
2d142e59SDavidlohr Bueso     compiler barrier in a UP compilation.
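
     A rough userspace analogue of the store-then-full-barrier pattern is
     sketched below using C11 atomics.  The macro store_then_full_fence()
     and the variables are hypothetical names for this illustration only,
     and a C11 seq_cst fence is assumed here to be an acceptable stand-in
     for a full memory barrier.  The sleeper publishes its state and then
     checks for an event; the waker publishes the event and then checks
     the state.  With a full fence on each side, the two cannot both miss
     each other's store.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace analogue of smp_store_mb(var, value). */
#define store_then_full_fence(var, value)				\
	do {								\
		atomic_store_explicit(&(var), (value),			\
				      memory_order_relaxed);		\
		atomic_thread_fence(memory_order_seq_cst);		\
	} while (0)

atomic_int sleeper_state;	/* 0 = running, 1 = about to sleep */
atomic_int event_pending;

/* Returns 1 if the caller may go to sleep, 0 if an event is pending. */
int prepare_to_sleep(void)
{
	store_then_full_fence(sleeper_state, 1);
	/* The fence orders the state store before this load. */
	return !atomic_load_explicit(&event_pending, memory_order_relaxed);
}

/* Returns 1 if the sleeper was seen about to sleep and needs a wakeup. */
int post_event(void)
{
	store_then_full_fence(event_pending, 1);
	return atomic_load_explicit(&sleeper_state,
				    memory_order_relaxed) == 1;
}

/* Single-threaded walk through the three possible observations. */
void demo(void)
{
	assert(prepare_to_sleep() == 1);	/* no event yet */
	assert(post_event() == 1);		/* sleeper state is visible */
	assert(prepare_to_sleep() == 0);	/* event is now pending */
}
```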
108b42b4SDavid Howells
108b42b4SDavid Howells
1b15611eSPeter Zijlstra (*) smp_mb__before_atomic();
1b15611eSPeter Zijlstra (*) smp_mb__after_atomic();
108b42b4SDavid Howells
39323c64SManfred Spraul     These are for use with atomic RMW functions that do not imply memory
39323c64SManfred Spraul     barriers, but where the code needs a memory barrier. Examples for atomic
d8566f15SFox Chen     RMW functions that do not imply a memory barrier are e.g. add,
39323c64SManfred Spraul     subtract, (failed) conditional operations, _relaxed functions,
39323c64SManfred Spraul     but not atomic_read or atomic_set. A common example where a memory
39323c64SManfred Spraul     barrier may be required is when atomic ops are used for reference
39323c64SManfred Spraul     counting.
1b15611eSPeter Zijlstra
39323c64SManfred Spraul     These are also used for atomic RMW bitop functions that do not imply a
39323c64SManfred Spraul     memory barrier (such as set_bit and clear_bit).
108b42b4SDavid Howells
108b42b4SDavid Howells     As an example, consider a piece of code that marks an object as being dead
108b42b4SDavid Howells     and then decrements the object's reference count:
108b42b4SDavid Howells
108b42b4SDavid Howells	obj->dead = 1;
1b15611eSPeter Zijlstra	smp_mb__before_atomic();
108b42b4SDavid Howells	atomic_dec(&obj->ref_count);
108b42b4SDavid Howells
108b42b4SDavid Howells     This makes sure that the death mark on the object is perceived to be set
108b42b4SDavid Howells     *before* the reference counter is decremented.
108b42b4SDavid Howells
706eeb3eSPeter Zijlstra     See Documentation/atomic_{t,bitops}.txt for more information.
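
     The same shape can be sketched in userspace with C11 atomics, under
     the assumption that a seq_cst fence followed by a relaxed
     read-modify-write is a fair stand-in for smp_mb__before_atomic()
     followed by a barrier-free atomic_dec().  The structure and function
     names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

struct obj {
	atomic_int dead;
	atomic_int ref_count;
};

/* Marks the object dead, drops one reference, returns the new count. */
int mark_dead_and_put(struct obj *o)
{
	atomic_store_explicit(&o->dead, 1, memory_order_relaxed);
	/* Stand-in for smp_mb__before_atomic(): full fence before the RMW. */
	atomic_thread_fence(memory_order_seq_cst);
	/* Relaxed RMW: by itself it would imply no ordering at all. */
	return atomic_fetch_sub_explicit(&o->ref_count, 1,
					 memory_order_relaxed) - 1;
}

void demo(void)
{
	struct obj o;

	atomic_init(&o.dead, 0);
	atomic_init(&o.ref_count, 2);
	assert(mark_dead_and_put(&o) == 1);
	assert(atomic_load(&o.dead) == 1);
}
```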
108b42b4SDavid Howells
108b42b4SDavid Howells
1077fa36SAlexander Duyck (*) dma_wmb();
1077fa36SAlexander Duyck (*) dma_rmb();
ed59dfd9SKefeng Wang (*) dma_mb();
1077fa36SAlexander Duyck
1077fa36SAlexander Duyck     These are for use with consistent memory to guarantee the ordering
1077fa36SAlexander Duyck     of writes or reads of shared memory accessible to both the CPU and a
289e1c89SParav Pandit     DMA capable device. See Documentation/core-api/dma-api.rst file for more
289e1c89SParav Pandit     information about consistent memory.
1077fa36SAlexander Duyck
1077fa36SAlexander Duyck     For example, consider a device driver that shares memory with a device
1077fa36SAlexander Duyck     and uses a descriptor status value to indicate if the descriptor belongs
1077fa36SAlexander Duyck     to the device or the CPU, and a doorbell to notify it when new
1077fa36SAlexander Duyck     descriptors are available:
1077fa36SAlexander Duyck
1077fa36SAlexander Duyck	if (desc->status != DEVICE_OWN) {
1077fa36SAlexander Duyck		/* do not read data until we own descriptor */
1077fa36SAlexander Duyck		dma_rmb();
1077fa36SAlexander Duyck
1077fa36SAlexander Duyck		/* read/modify data */
1077fa36SAlexander Duyck		read_data = desc->data;
1077fa36SAlexander Duyck		desc->data = write_data;
1077fa36SAlexander Duyck
1077fa36SAlexander Duyck		/* flush modifications before status update */
1077fa36SAlexander Duyck		dma_wmb();
1077fa36SAlexander Duyck
1077fa36SAlexander Duyck		/* assign ownership */
1077fa36SAlexander Duyck		desc->status = DEVICE_OWN;
1077fa36SAlexander Duyck
289e1c89SParav Pandit		/* Make descriptor status visible to the device followed by
289e1c89SParav Pandit		 * notify device of new descriptor
289e1c89SParav Pandit		 */
1077fa36SAlexander Duyck		writel(DESC_NOTIFY, doorbell);
1077fa36SAlexander Duyck	}
1077fa36SAlexander Duyck
289e1c89SParav Pandit     The dma_rmb() allows us to guarantee that the device has released ownership
7a458007SSylvain Trias     before we read the data from the descriptor, and the dma_wmb() allows
1077fa36SAlexander Duyck     us to guarantee the data is written to the descriptor before the device
ed59dfd9SKefeng Wang     can see it now has ownership.  The dma_mb() implies both a dma_rmb() and
289e1c89SParav Pandit     a dma_wmb().
1077fa36SAlexander Duyck
289e1c89SParav Pandit     Note that the dma_*() barriers do not provide any ordering guarantees for
289e1c89SParav Pandit     accesses to MMIO regions.  See the later "KERNEL I/O BARRIER EFFECTS"
289e1c89SParav Pandit     subsection for more information about I/O accessors and MMIO ordering.
1077fa36SAlexander Duyck
3e79f082SAneesh Kumar K.V (*) pmem_wmb();
3e79f082SAneesh Kumar K.V
3e79f082SAneesh Kumar K.V     This is for use with persistent memory to ensure that stores for which
3e79f082SAneesh Kumar K.V     modifications are written to persistent storage have reached a platform
3e79f082SAneesh Kumar K.V     durability domain.
3e79f082SAneesh Kumar K.V
3e79f082SAneesh Kumar K.V     For example, after a non-temporal write to a pmem region, we use pmem_wmb()
3e79f082SAneesh Kumar K.V     to ensure that stores have reached a platform durability domain. This ensures
3e79f082SAneesh Kumar K.V     that stores have updated persistent storage before any data access or
3e79f082SAneesh Kumar K.V     data transfer caused by subsequent instructions is initiated. This is
3e79f082SAneesh Kumar K.V     in addition to the ordering done by wmb().
3e79f082SAneesh Kumar K.V
3e79f082SAneesh Kumar K.V     For loads from persistent memory, existing read memory barriers are sufficient
3e79f082SAneesh Kumar K.V     to ensure read ordering.
dfeccea6SSeongJae Park
d5624bb2SXiongfeng Wang (*) io_stop_wc();
d5624bb2SXiongfeng Wang
d5624bb2SXiongfeng Wang     For memory accesses with write-combining attributes (e.g. those returned
1ab8f248SSeongJae Park     by ioremap_wc()), the CPU may wait for prior accesses to be merged with
d5624bb2SXiongfeng Wang     subsequent ones. io_stop_wc() can be used to prevent the merging of
d5624bb2SXiongfeng Wang     write-combining memory accesses before this macro with those after it when
d5624bb2SXiongfeng Wang     such wait has performance implications.
d5624bb2SXiongfeng Wang
108b42b4SDavid Howells===============================
108b42b4SDavid HowellsIMPLICIT KERNEL MEMORY BARRIERS
108b42b4SDavid Howells===============================
108b42b4SDavid Howells
108b42b4SDavid HowellsSome of the other functions in the Linux kernel imply memory barriers, amongst
670bd95eSDavid Howellswhich are locking and scheduling functions.
108b42b4SDavid Howells
108b42b4SDavid HowellsThis specification is a _minimum_ guarantee; any particular architecture may
108b42b4SDavid Howellsprovide more substantial guarantees, but these may not be relied upon outside
108b42b4SDavid Howellsof arch specific code.
108b42b4SDavid Howells
108b42b4SDavid Howells
166bda71SSeongJae ParkLOCK ACQUISITION FUNCTIONS
166bda71SSeongJae Park--------------------------
108b42b4SDavid Howells
108b42b4SDavid HowellsThe Linux kernel has a number of locking constructs:
108b42b4SDavid Howells
108b42b4SDavid Howells (*) spin locks
108b42b4SDavid Howells (*) R/W spin locks
108b42b4SDavid Howells (*) mutexes
108b42b4SDavid Howells (*) semaphores
108b42b4SDavid Howells (*) R/W semaphores
108b42b4SDavid Howells
2e4f5382SPeter ZijlstraIn all cases there are variants on "ACQUIRE" operations and "RELEASE" operations
108b42b4SDavid Howellsfor each construct.  These operations all imply certain barriers:
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra (1) ACQUIRE operation implication:
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra     Memory operations issued after the ACQUIRE will be completed after the
2e4f5382SPeter Zijlstra     ACQUIRE operation has completed.
108b42b4SDavid Howells
8dd853d7SPaul E. McKenney     Memory operations issued before the ACQUIRE may be completed after
a9668cd6SPeter Zijlstra     the ACQUIRE operation has completed.
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra (2) RELEASE operation implication:
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra     Memory operations issued before the RELEASE will be completed before the
2e4f5382SPeter Zijlstra     RELEASE operation has completed.
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra     Memory operations issued after the RELEASE may be completed before the
2e4f5382SPeter Zijlstra     RELEASE operation has completed.
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra (3) ACQUIRE vs ACQUIRE implication:
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra     All ACQUIRE operations issued before another ACQUIRE operation will be
2e4f5382SPeter Zijlstra     completed before that ACQUIRE operation.
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra (4) ACQUIRE vs RELEASE implication:
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra     All ACQUIRE operations issued before a RELEASE operation will be
2e4f5382SPeter Zijlstra     completed before the RELEASE operation.
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra (5) Failed conditional ACQUIRE implication:
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra     Certain locking variants of the ACQUIRE operation may fail, either due to
2e4f5382SPeter Zijlstra     being unable to get the lock immediately, or due to receiving an unblocked
806654a9SWill Deacon     signal while asleep waiting for the lock to become available.  Failed
108b42b4SDavid Howells     locks do not imply any sort of barrier.
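
The one-way nature of ACQUIRE and RELEASE can be illustrated with a
minimal userspace spinlock built on C11 atomics.  This is a sketch under
the assumption that C11 acquire and release orderings are close enough
to the kernel's ACQUIRE and RELEASE for illustration; it is not how any
kernel lock is implemented, and all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_lock {
	atomic_flag held;
};

/*
 * Conditional ACQUIRE.  When this returns false the lock was not taken,
 * and the caller must not rely on any ordering having been provided.
 */
bool sketch_tryacquire(struct sketch_lock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

/* ACQUIRE: later accesses cannot be completed before the lock is taken. */
void sketch_acquire(struct sketch_lock *l)
{
	while (!sketch_tryacquire(l))
		;	/* spin until the holder releases */
}

/* RELEASE: earlier accesses are completed before the lock is dropped. */
void sketch_release(struct sketch_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct sketch_lock m = { ATOMIC_FLAG_INIT };
int shared;

/* Usage: one critical section, with a failed conditional ACQUIRE inside. */
int demo(void)
{
	sketch_acquire(&m);
	shared++;
	assert(!sketch_tryacquire(&m));	/* already held, so this fails */
	sketch_release(&m);
	return shared;
}
```

Neither operation is a full barrier: accesses before sketch_acquire() may
still move into the critical section, as may accesses after
sketch_release().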
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra[!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only
2e4f5382SPeter Zijlstraone-way barriers is that the effects of instructions outside of a critical
2e4f5382SPeter Zijlstrasection may seep into the inside of the critical section.
108b42b4SDavid Howells
2e4f5382SPeter ZijlstraAn ACQUIRE followed by a RELEASE may not be assumed to be a full memory
2e4f5382SPeter Zijlstrabarrier because it is possible for an access preceding the ACQUIRE to happen
2e4f5382SPeter Zijlstraafter the ACQUIRE, and an access following the RELEASE to happen before the
2e4f5382SPeter ZijlstraRELEASE, and the two accesses can themselves then cross:
670bd95eSDavid Howells
670bd95eSDavid Howells	*A = a;
2e4f5382SPeter Zijlstra	ACQUIRE M
2e4f5382SPeter Zijlstra	RELEASE M
670bd95eSDavid Howells	*B = b;
670bd95eSDavid Howells
670bd95eSDavid Howellsmay occur as:
670bd95eSDavid Howells
2e4f5382SPeter Zijlstra	ACQUIRE M, STORE *B, STORE *A, RELEASE M
17eb88e0SPaul E. McKenney
8dd853d7SPaul E. McKenneyWhen the ACQUIRE and RELEASE are a lock acquisition and release,
8dd853d7SPaul E. McKenneyrespectively, this same reordering can occur if the lock's ACQUIRE and
8dd853d7SPaul E. McKenneyRELEASE are to the same lock variable, but only from the perspective of
8dd853d7SPaul E. McKenneyanother CPU not holding that lock.  In short, an ACQUIRE followed by a
8dd853d7SPaul E. McKenneyRELEASE may -not- be assumed to be a full memory barrier.
17eb88e0SPaul E. McKenney
12d560f4SPaul E. McKenneySimilarly, the reverse case of a RELEASE followed by an ACQUIRE does
12d560f4SPaul E. McKenneynot imply a full memory barrier.  Therefore, the CPU's execution of the
12d560f4SPaul E. McKenneycritical sections corresponding to the RELEASE and the ACQUIRE can cross,
12d560f4SPaul E. McKenneyso that:
17eb88e0SPaul E. McKenney
17eb88e0SPaul E. McKenney	*A = a;
2e4f5382SPeter Zijlstra	RELEASE M
2e4f5382SPeter Zijlstra	ACQUIRE N
17eb88e0SPaul E. McKenney	*B = b;
17eb88e0SPaul E. McKenney
17eb88e0SPaul E. McKenneycould occur as:
17eb88e0SPaul E. McKenney
2e4f5382SPeter Zijlstra	ACQUIRE N, STORE *B, STORE *A, RELEASE M
17eb88e0SPaul E. McKenney
8dd853d7SPaul E. McKenneyIt might appear that this reordering could introduce a deadlock.
8dd853d7SPaul E. McKenneyHowever, this cannot happen because if such a deadlock threatened,
8dd853d7SPaul E. McKenneythe RELEASE would simply complete, thereby avoiding the deadlock.
8dd853d7SPaul E. McKenney
8dd853d7SPaul E. McKenney	Why does this work?
8dd853d7SPaul E. McKenney
8dd853d7SPaul E. McKenney	One key point is that we are only talking about the CPU doing
8dd853d7SPaul E. McKenney	the reordering, not the compiler.  If the compiler (or, for
8dd853d7SPaul E. McKenney	that matter, the developer) switched the operations, deadlock
8dd853d7SPaul E. McKenney	-could- occur.
8dd853d7SPaul E. McKenney
8dd853d7SPaul E. McKenney	But suppose the CPU reordered the operations.  In this case,
8dd853d7SPaul E. McKenney	the unlock precedes the lock in the assembly code.  The CPU
8dd853d7SPaul E. McKenney	simply elected to try executing the later lock operation first.
8dd853d7SPaul E. McKenney	If there is a deadlock, this lock operation will simply spin (or
8dd853d7SPaul E. McKenney	try to sleep, but more on that later).	The CPU will eventually
8dd853d7SPaul E. McKenney	execute the unlock operation (which preceded the lock operation
8dd853d7SPaul E. McKenney	in the assembly code), which will unravel the potential deadlock,
8dd853d7SPaul E. McKenney	allowing the lock operation to succeed.
8dd853d7SPaul E. McKenney
8dd853d7SPaul E. McKenney	But what if the lock is a sleeplock?  In that case, the code will
8dd853d7SPaul E. McKenney	try to enter the scheduler, where it will eventually encounter
8dd853d7SPaul E. McKenney	a memory barrier, which will force the earlier unlock operation
8dd853d7SPaul E. McKenney	to complete, again unraveling the deadlock.  There might be
8dd853d7SPaul E. McKenney	a sleep-unlock race, but the locking primitive needs to resolve
8dd853d7SPaul E. McKenney	such races properly in any case.
8dd853d7SPaul E. McKenney
108b42b4SDavid HowellsLocks and semaphores may not provide any guarantee of ordering on UP compiled
108b42b4SDavid Howellssystems, and so cannot be counted on in such a situation to actually achieve
108b42b4SDavid Howellsanything at all - especially with respect to I/O accesses - unless combined
108b42b4SDavid Howellswith interrupt disabling operations.
108b42b4SDavid Howells
d7cab36dSSeongJae ParkSee also the section on "Inter-CPU acquiring barrier effects".
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsAs an example, consider the following:
108b42b4SDavid Howells
108b42b4SDavid Howells	*A = a;
108b42b4SDavid Howells	*B = b;
2e4f5382SPeter Zijlstra	ACQUIRE
108b42b4SDavid Howells	*C = c;
108b42b4SDavid Howells	*D = d;
2e4f5382SPeter Zijlstra	RELEASE
108b42b4SDavid Howells	*E = e;
108b42b4SDavid Howells	*F = f;
108b42b4SDavid Howells
108b42b4SDavid HowellsThe following sequence of events is acceptable:
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra	ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
108b42b4SDavid Howells
108b42b4SDavid Howells	[+] Note that {*F,*A} indicates a combined access.
108b42b4SDavid Howells
108b42b4SDavid HowellsBut none of the following are:
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra	{*F,*A}, *B,	ACQUIRE, *C, *D,	RELEASE, *E
2e4f5382SPeter Zijlstra	*A, *B, *C,	ACQUIRE, *D,		RELEASE, *E, *F
2e4f5382SPeter Zijlstra	*A, *B,		ACQUIRE, *C,		RELEASE, *D, *E, *F
2e4f5382SPeter Zijlstra	*B,		ACQUIRE, *C, *D,	RELEASE, {*F,*A}, *E
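
The one-way rules behind these lists can be written down as a small checker.
The following is only an illustrative sketch in userspace C, not kernel code:
the function names and the string encoding ('[' for ACQUIRE, ']' for RELEASE,
letters 'A' to 'F' for the stores, a combined access written as two adjacent
letters) are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Position of event ev in seq, or -1 if it does not appear. */
static int pos(const char *seq, char ev)
{
	const char *p = strchr(seq, ev);

	return p ? (int)(p - seq) : -1;
}

/*
 * Check an observed order against the program order
 *	*A, *B, ACQUIRE, *C, *D, RELEASE, *E, *F
 * ACQUIRE only stops later accesses from moving before it, and RELEASE
 * only stops earlier accesses from moving after it.
 */
bool sequence_ok(const char *seq)
{
	int acq, rel, i;

	assert(seq != NULL);
	acq = pos(seq, '[');
	rel = pos(seq, ']');
	if (acq < 0 || rel < 0 || acq > rel)
		return false;

	for (i = 0; i < 6; i++) {
		char ev = "ABCDEF"[i];
		int p = pos(seq, ev);

		if (p < 0)
			return false;	/* every store must appear */
		if (ev <= 'D' && p > rel)
			return false;	/* *A..*D may not follow RELEASE */
		if (ev >= 'C' && p < acq)
			return false;	/* *C..*F may not precede ACQUIRE */
	}
	return true;
}
```

With this encoding the acceptable sequence above is "[FAECDB]", which the
checker accepts, while the four unacceptable ones ("FAB[CD]E", "ABC[D]EF",
"AB[C]DEF" and "B[CD]FAE") are each rejected by one of the two rules.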
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsINTERRUPT DISABLING FUNCTIONS
108b42b4SDavid Howells-----------------------------
108b42b4SDavid Howells
2e4f5382SPeter ZijlstraFunctions that disable interrupts (ACQUIRE equivalent) and enable interrupts
2e4f5382SPeter Zijlstra(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
108b42b4SDavid Howellsbarriers are required in such a situation, they must be provided by some
108b42b4SDavid Howellsother means.
108b42b4SDavid Howells
108b42b4SDavid Howells
50fa610aSDavid HowellsSLEEP AND WAKE-UP FUNCTIONS
50fa610aSDavid Howells---------------------------
50fa610aSDavid Howells
50fa610aSDavid HowellsSleeping and waking on an event flagged in global data can be viewed as an
50fa610aSDavid Howellsinteraction between two pieces of data: the task state of the task waiting for
50fa610aSDavid Howellsthe event and the global data used to indicate the event.  To make sure that
50fa610aSDavid Howellsthese appear to happen in the right order, the primitives to begin the process
50fa610aSDavid Howellsof going to sleep, and the primitives to initiate a wake up imply certain
50fa610aSDavid Howellsbarriers.
50fa610aSDavid Howells
50fa610aSDavid HowellsFirstly, the sleeper normally follows something like this sequence of events:
50fa610aSDavid Howells
50fa610aSDavid Howells	for (;;) {
50fa610aSDavid Howells		set_current_state(TASK_UNINTERRUPTIBLE);
50fa610aSDavid Howells		if (event_indicated)
50fa610aSDavid Howells			break;
50fa610aSDavid Howells		schedule();
50fa610aSDavid Howells	}
50fa610aSDavid Howells
50fa610aSDavid HowellsA general memory barrier is interpolated automatically by set_current_state()
50fa610aSDavid Howellsafter it has altered the task state:
50fa610aSDavid Howells
50fa610aSDavid Howells	CPU 1
50fa610aSDavid Howells	===============================
50fa610aSDavid Howells	set_current_state();
b92b8b35SPeter Zijlstra	  smp_store_mb();
50fa610aSDavid Howells	    STORE current->state
50fa610aSDavid Howells	    <general barrier>
50fa610aSDavid Howells	LOAD event_indicated
50fa610aSDavid Howells
50fa610aSDavid Howellsset_current_state() may be wrapped by:
50fa610aSDavid Howells
50fa610aSDavid Howells	prepare_to_wait();
50fa610aSDavid Howells	prepare_to_wait_exclusive();
50fa610aSDavid Howells
50fa610aSDavid Howellswhich therefore also imply a general memory barrier after setting the state.
50fa610aSDavid HowellsThe whole sequence above is available in various canned forms, all of which
*7f8fcc6fSAkira Yokosawainterpolate the memory barrier in the right place, for example:
50fa610aSDavid Howells
50fa610aSDavid Howells	wait_event();
*7f8fcc6fSAkira Yokosawa	wait_event_cmd();
*7f8fcc6fSAkira Yokosawa	wait_event_exclusive_cmd();
50fa610aSDavid Howells	wait_event_interruptible();
50fa610aSDavid Howells	wait_event_interruptible_exclusive();
50fa610aSDavid Howells	wait_event_interruptible_timeout();
50fa610aSDavid Howells	wait_event_killable();
50fa610aSDavid Howells	wait_event_timeout();
50fa610aSDavid Howells	wait_on_bit();
50fa610aSDavid Howells	wait_on_bit_lock();
50fa610aSDavid Howells
50fa610aSDavid Howells
50fa610aSDavid HowellsSecondly, code that performs a wake up normally follows something like this:
50fa610aSDavid Howells
50fa610aSDavid Howells	event_indicated = 1;
50fa610aSDavid Howells	wake_up(&event_wait_queue);
50fa610aSDavid Howells
50fa610aSDavid Howellsor:
50fa610aSDavid Howells
50fa610aSDavid Howells	event_indicated = 1;
50fa610aSDavid Howells	wake_up_process(event_daemon);
50fa610aSDavid Howells
7696f991SAndrea ParriA general memory barrier is executed by wake_up() if it wakes something up.
7696f991SAndrea ParriIf it doesn't wake anything up then a memory barrier may or may not be
7696f991SAndrea Parriexecuted; you must not rely on it.  The barrier occurs before the task state
7696f991SAndrea Parriis accessed; in particular, it sits between the STORE to indicate the event
7696f991SAndrea Parriand the STORE to set TASK_RUNNING:
50fa610aSDavid Howells
7696f991SAndrea Parri	CPU 1 (Sleeper)			CPU 2 (Waker)
50fa610aSDavid Howells	===============================	===============================
50fa610aSDavid Howells	set_current_state();		STORE event_indicated
b92b8b35SPeter Zijlstra	  smp_store_mb();		wake_up();
7696f991SAndrea Parri	    STORE current->state	  ...
7696f991SAndrea Parri	    <general barrier>		  <general barrier>
7696f991SAndrea Parri	LOAD event_indicated		  if ((LOAD task->state) & TASK_NORMAL)
7696f991SAndrea Parri					    STORE task->state
50fa610aSDavid Howells
7696f991SAndrea Parriwhere "task" is the thread being woken up and it equals CPU 1's "current".
7696f991SAndrea Parri
7696f991SAndrea ParriTo repeat, a general memory barrier is guaranteed to be executed by wake_up()
7696f991SAndrea Parriif something is actually awakened, but otherwise there is no such guarantee.
7696f991SAndrea ParriTo see this, consider the following sequence of events, where X and Y are both
7696f991SAndrea Parriinitially zero:
5726ce06SPaul E. McKenney
5726ce06SPaul E. McKenney	CPU 1				CPU 2
5726ce06SPaul E. McKenney	===============================	===============================
7696f991SAndrea Parri	X = 1;				Y = 1;
5726ce06SPaul E. McKenney	smp_mb();			wake_up();
7696f991SAndrea Parri	LOAD Y				LOAD X
5726ce06SPaul E. McKenney
7696f991SAndrea ParriIf a wakeup does occur, one (at least) of the two loads must see 1.  If, on
7696f991SAndrea Parrithe other hand, a wakeup does not occur, both loads might see 0.
7696f991SAndrea Parri
7696f991SAndrea Parriwake_up_process() always executes a general memory barrier.  The barrier again
7696f991SAndrea Parrioccurs before the task state is accessed.  In particular, if the wake_up() in
7696f991SAndrea Parrithe previous snippet were replaced by a call to wake_up_process() then one of
7696f991SAndrea Parrithe two loads would be guaranteed to see 1.
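
The guaranteed-barrier case can be approximated in userspace with C11 atomics.
This is a hedged analogy rather than kernel code: the sequentially consistent
fences merely stand in for smp_mb() on CPU 1 and for the barrier that
wake_up_process() is documented to execute on CPU 2, and the names cpu1(),
cpu2() and sb_trial() are invented for the sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int X, Y;
static int seen_y, seen_x;	/* written by one thread each, read after join */

static void *cpu1(void *unused)
{
	(void)unused;
	atomic_store_explicit(&X, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* stands in for smp_mb() */
	seen_y = atomic_load_explicit(&Y, memory_order_relaxed);
	return NULL;
}

static void *cpu2(void *unused)
{
	(void)unused;
	atomic_store_explicit(&Y, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* the waker's barrier */
	seen_x = atomic_load_explicit(&X, memory_order_relaxed);
	return NULL;
}

/* Run both sides once; nonzero means at least one load saw 1. */
int sb_trial(void)
{
	pthread_t t1, t2;
	int err;

	atomic_store(&X, 0);
	atomic_store(&Y, 0);

	err = pthread_create(&t1, NULL, cpu1, NULL);
	assert(err == 0);
	err = pthread_create(&t2, NULL, cpu2, NULL);
	assert(err == 0);
	(void)err;

	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return seen_y == 1 || seen_x == 1;
}
```

With both fences present, the C11 model forbids the outcome in which both
loads return 0.  Removing the fence in cpu2() corresponds to a wake_up() that
wakes nothing, where that outcome becomes possible.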
5726ce06SPaul E. McKenney
50fa610aSDavid HowellsThe available waker functions include:
50fa610aSDavid Howells
50fa610aSDavid Howells	complete();
50fa610aSDavid Howells	wake_up();
50fa610aSDavid Howells	wake_up_all();
50fa610aSDavid Howells	wake_up_bit();
50fa610aSDavid Howells	wake_up_interruptible();
50fa610aSDavid Howells	wake_up_interruptible_all();
50fa610aSDavid Howells	wake_up_interruptible_nr();
50fa610aSDavid Howells	wake_up_interruptible_poll();
50fa610aSDavid Howells	wake_up_interruptible_sync();
50fa610aSDavid Howells	wake_up_interruptible_sync_poll();
50fa610aSDavid Howells	wake_up_locked();
50fa610aSDavid Howells	wake_up_locked_poll();
50fa610aSDavid Howells	wake_up_nr();
50fa610aSDavid Howells	wake_up_poll();
50fa610aSDavid Howells	wake_up_process();
50fa610aSDavid Howells
7696f991SAndrea ParriIn terms of memory ordering, these functions all provide the same guarantees as
7696f991SAndrea Parria wake_up() (or stronger).
50fa610aSDavid Howells
50fa610aSDavid Howells[!] Note that the memory barriers implied by the sleeper and the waker do _not_
50fa610aSDavid Howellsorder multiple stores before the wake-up with respect to loads of those stored
50fa610aSDavid Howellsvalues after the sleeper has called set_current_state().  For instance, if the
50fa610aSDavid Howellssleeper does:
50fa610aSDavid Howells
50fa610aSDavid Howells	set_current_state(TASK_INTERRUPTIBLE);
50fa610aSDavid Howells	if (event_indicated)
50fa610aSDavid Howells		break;
50fa610aSDavid Howells	__set_current_state(TASK_RUNNING);
50fa610aSDavid Howells	do_something(my_data);
50fa610aSDavid Howells
50fa610aSDavid Howellsand the waker does:
50fa610aSDavid Howells
50fa610aSDavid Howells	my_data = value;
50fa610aSDavid Howells	event_indicated = 1;
50fa610aSDavid Howells	wake_up(&event_wait_queue);
50fa610aSDavid Howells
50fa610aSDavid Howellsthere's no guarantee that the change to event_indicated will be perceived by
50fa610aSDavid Howellsthe sleeper as coming after the change to my_data.  In such a circumstance, the
50fa610aSDavid Howellscode on both sides must interpolate its own memory barriers between the
50fa610aSDavid Howellsseparate data accesses.  Thus the above sleeper ought to do:
50fa610aSDavid Howells
50fa610aSDavid Howells	set_current_state(TASK_INTERRUPTIBLE);
50fa610aSDavid Howells	if (event_indicated) {
50fa610aSDavid Howells		smp_rmb();
50fa610aSDavid Howells		do_something(my_data);
50fa610aSDavid Howells	}
50fa610aSDavid Howells
50fa610aSDavid Howellsand the waker should do:
50fa610aSDavid Howells
50fa610aSDavid Howells	my_data = value;
50fa610aSDavid Howells	smp_wmb();
50fa610aSDavid Howells	event_indicated = 1;
50fa610aSDavid Howells	wake_up(&event_wait_queue);
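
The same data-then-flag pairing can be tried out in userspace.  The sketch
below is an analogy under stated assumptions, not the kernel implementation:
C11 release and acquire fences play roughly the roles of smp_wmb() and
smp_rmb() (they are somewhat stronger), a busy-wait loop replaces the
set_current_state()/schedule() loop, and sleeper_receive() is a name invented
for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int my_data;			/* plain payload */
static atomic_int event_indicated;	/* flag published after the payload */

static void *waker(void *arg)
{
	my_data = *(int *)arg;
	atomic_thread_fence(memory_order_release);	/* role of smp_wmb() */
	atomic_store_explicit(&event_indicated, 1, memory_order_relaxed);
	return NULL;
}

/* Start a waker that publishes value, wait for the flag, return the data. */
int sleeper_receive(int value)
{
	pthread_t t;
	int err, seen;

	atomic_store(&event_indicated, 0);
	err = pthread_create(&t, NULL, waker, &value);
	assert(err == 0);
	(void)err;

	/* Busy-wait in place of actually going to sleep. */
	while (!atomic_load_explicit(&event_indicated, memory_order_relaxed))
		;
	atomic_thread_fence(memory_order_acquire);	/* role of smp_rmb() */
	seen = my_data;

	pthread_join(t, NULL);
	return seen;
}
```

Because the payload is written before the release fence and read after the
acquire fence, a sleeper that observes the flag also observes the payload.
Dropping either fence would reintroduce the problem described above.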
50fa610aSDavid Howells
50fa610aSDavid Howells
108b42b4SDavid HowellsMISCELLANEOUS FUNCTIONS
108b42b4SDavid Howells-----------------------
108b42b4SDavid Howells
108b42b4SDavid HowellsOther functions that imply barriers:
108b42b4SDavid Howells
108b42b4SDavid Howells (*) schedule() and similar imply full memory barriers.
108b42b4SDavid Howells
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra===================================
2e4f5382SPeter ZijlstraINTER-CPU ACQUIRING BARRIER EFFECTS
2e4f5382SPeter Zijlstra===================================
108b42b4SDavid Howells
108b42b4SDavid HowellsOn SMP systems locking primitives give a more substantial form of barrier: one
108b42b4SDavid Howellsthat does affect memory access ordering on other CPUs, within the context of
108b42b4SDavid Howellsconflict on any particular lock.
108b42b4SDavid Howells
108b42b4SDavid Howells
2e4f5382SPeter ZijlstraACQUIRES VS MEMORY ACCESSES
2e4f5382SPeter Zijlstra---------------------------
108b42b4SDavid Howells
79afecfaSAneesh KumarConsider the following: the system has a pair of spinlocks (M) and (Q), and
108b42b4SDavid Howellsthree CPUs; then should the following sequence of events occur:
108b42b4SDavid Howells
108b42b4SDavid Howells	CPU 1				CPU 2
108b42b4SDavid Howells	===============================	===============================
9af194ceSPaul E. McKenney	WRITE_ONCE(*A, a);		WRITE_ONCE(*E, e);
2e4f5382SPeter Zijlstra	ACQUIRE M			ACQUIRE Q
9af194ceSPaul E. McKenney	WRITE_ONCE(*B, b);		WRITE_ONCE(*F, f);
9af194ceSPaul E. McKenney	WRITE_ONCE(*C, c);		WRITE_ONCE(*G, g);
2e4f5382SPeter Zijlstra	RELEASE M			RELEASE Q
9af194ceSPaul E. McKenney	WRITE_ONCE(*D, d);		WRITE_ONCE(*H, h);
108b42b4SDavid Howells
81fc6323SJarek PoplawskiThen there is no guarantee as to what order CPU 3 will see the accesses to *A
108b42b4SDavid Howellsthrough *H occur in, other than the constraints imposed by the separate locks
108b42b4SDavid Howellson the separate CPUs.  It might, for example, see:
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra	*E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M
108b42b4SDavid Howells
108b42b4SDavid HowellsBut it won't see any of:
108b42b4SDavid Howells
2e4f5382SPeter Zijlstra	*B, *C or *D preceding ACQUIRE M
2e4f5382SPeter Zijlstra	*A, *B or *C following RELEASE M
2e4f5382SPeter Zijlstra	*F, *G or *H preceding ACQUIRE Q
2e4f5382SPeter Zijlstra	*E, *F or *G following RELEASE Q
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid Howells=================================
108b42b4SDavid HowellsWHERE ARE MEMORY BARRIERS NEEDED?
108b42b4SDavid Howells=================================
108b42b4SDavid Howells
108b42b4SDavid HowellsUnder normal operation, memory operation reordering is generally not going to
108b42b4SDavid Howellsbe a problem as a single-threaded linear piece of code will still appear to
50fa610aSDavid Howellswork correctly, even if it's in an SMP kernel.  There are, however, four
108b42b4SDavid Howellscircumstances in which reordering definitely _could_ be a problem:
108b42b4SDavid Howells
108b42b4SDavid Howells (*) Interprocessor interaction.
108b42b4SDavid Howells
108b42b4SDavid Howells (*) Atomic operations.
108b42b4SDavid Howells
81fc6323SJarek Poplawski (*) Accessing devices.
108b42b4SDavid Howells
108b42b4SDavid Howells (*) Interrupts.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsINTERPROCESSOR INTERACTION
108b42b4SDavid Howells--------------------------
108b42b4SDavid Howells
108b42b4SDavid HowellsWhen there's a system with more than one processor, more than one CPU in the
108b42b4SDavid Howellssystem may be working on the same data set at the same time.  This can cause
108b42b4SDavid Howellssynchronisation problems, and the usual way of dealing with them is to use
108b42b4SDavid Howellslocks.  Locks, however, are quite expensive, and so it may be preferable to
108b42b4SDavid Howellsoperate without the use of a lock if at all possible.  In such a case
108b42b4SDavid Howellsoperations that affect both CPUs may have to be carefully ordered to prevent
108b42b4SDavid Howellsa malfunction.
108b42b4SDavid Howells
108b42b4SDavid HowellsConsider, for example, the R/W semaphore slow path.  Here a waiting process is
108b42b4SDavid Howellsqueued on the semaphore, by virtue of it having a piece of its stack linked to
108b42b4SDavid Howellsthe semaphore's list of waiting processes:
108b42b4SDavid Howells
108b42b4SDavid Howells	struct rw_semaphore {
108b42b4SDavid Howells		...
108b42b4SDavid Howells		spinlock_t lock;
108b42b4SDavid Howells		struct list_head waiters;
108b42b4SDavid Howells	};
108b42b4SDavid Howells
108b42b4SDavid Howells	struct rwsem_waiter {
108b42b4SDavid Howells		struct list_head list;
108b42b4SDavid Howells		struct task_struct *task;
108b42b4SDavid Howells	};
108b42b4SDavid Howells
108b42b4SDavid HowellsTo wake up a particular waiter, the up_read() or up_write() functions have to:
108b42b4SDavid Howells
108b42b4SDavid Howells (1) read the next pointer from this waiter's record to find out where the
108b42b4SDavid Howells     next waiter record is;
108b42b4SDavid Howells
81fc6323SJarek Poplawski (2) read the pointer to the waiter's task structure;
108b42b4SDavid Howells
108b42b4SDavid Howells (3) clear the task pointer to tell the waiter it has been given the semaphore;
108b42b4SDavid Howells
108b42b4SDavid Howells (4) call wake_up_process() on the task; and
108b42b4SDavid Howells
108b42b4SDavid Howells (5) release the reference held on the waiter's task struct.
108b42b4SDavid Howells
108b42b4SDavid HowellsIn other words, it has to perform this sequence of events:
108b42b4SDavid Howells
108b42b4SDavid Howells	LOAD waiter->list.next;
108b42b4SDavid Howells	LOAD waiter->task;
108b42b4SDavid Howells	STORE waiter->task;
108b42b4SDavid Howells	CALL wakeup
108b42b4SDavid Howells	RELEASE task
108b42b4SDavid Howells
108b42b4SDavid Howellsand if any of these steps occur out of order, then the whole thing may
108b42b4SDavid Howellsmalfunction.
108b42b4SDavid Howells
108b42b4SDavid HowellsOnce it has queued itself and dropped the semaphore lock, the waiter does not
108b42b4SDavid Howellsget the lock again; it instead just waits for its task pointer to be cleared
108b42b4SDavid Howellsbefore proceeding.  Since the record is on the waiter's stack, this means that
108b42b4SDavid Howellsif the task pointer is cleared _before_ the next pointer in the list is read,
108b42b4SDavid Howellsanother CPU might start processing the waiter and might clobber the waiter's
108b42b4SDavid Howellsstack before the up*() function has a chance to read the next pointer.
108b42b4SDavid Howells
108b42b4SDavid HowellsConsider then what might happen to the above sequence of events:
108b42b4SDavid Howells
108b42b4SDavid Howells	CPU 1				CPU 2
108b42b4SDavid Howells	===============================	===============================
108b42b4SDavid Howells					down_xxx()
108b42b4SDavid Howells					Queue waiter
108b42b4SDavid Howells					Sleep
108b42b4SDavid Howells	up_yyy()
108b42b4SDavid Howells	LOAD waiter->task;
108b42b4SDavid Howells	STORE waiter->task;
108b42b4SDavid Howells					Woken up by other event
108b42b4SDavid Howells	<preempt>
108b42b4SDavid Howells					Resume processing
108b42b4SDavid Howells					down_xxx() returns
108b42b4SDavid Howells					call foo()
108b42b4SDavid Howells					foo() clobbers *waiter
108b42b4SDavid Howells	</preempt>
108b42b4SDavid Howells	LOAD waiter->list.next;
108b42b4SDavid Howells	--- OOPS ---
108b42b4SDavid Howells
108b42b4SDavid HowellsThis could be dealt with using the semaphore lock, but then the down_xxx()
108b42b4SDavid Howellsfunction has to needlessly get the spinlock again after being woken up.
108b42b4SDavid Howells
108b42b4SDavid HowellsThe way to deal with this is to insert a general SMP memory barrier:
108b42b4SDavid Howells
108b42b4SDavid Howells	LOAD waiter->list.next;
108b42b4SDavid Howells	LOAD waiter->task;
108b42b4SDavid Howells	smp_mb();
108b42b4SDavid Howells	STORE waiter->task;
108b42b4SDavid Howells	CALL wakeup
108b42b4SDavid Howells	RELEASE task
108b42b4SDavid Howells
108b42b4SDavid HowellsIn this case, the barrier makes a guarantee that all memory accesses before the
108b42b4SDavid Howellsbarrier will appear to happen before all the memory accesses after the barrier
108b42b4SDavid Howellswith respect to the other CPUs on the system.  It does _not_ guarantee that all
108b42b4SDavid Howellsthe memory accesses before the barrier will be complete by the time the barrier
108b42b4SDavid Howellsinstruction itself is complete.
108b42b4SDavid Howells
108b42b4SDavid HowellsOn a UP system - where this wouldn't be a problem - the smp_mb() is just a
108b42b4SDavid Howellscompiler barrier, thus making sure the compiler emits the instructions in the
6bc39274SDavid Howellsright order without actually intervening in the CPU.  Since there's only one
6bc39274SDavid HowellsCPU, that CPU's dependency ordering logic will take care of everything else.
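
The waker's side of this sequence might be sketched in userspace C11 as
follows.  This is not the rwsem implementation: struct waiter, its fields and
grant_and_advance() are hypothetical stand-ins, and a sequentially consistent
fence is assumed to play the role of smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct waiter {
	struct waiter *next;	/* stands in for waiter->list.next */
	_Atomic(void *) task;	/* non-NULL while the waiter is still waiting */
};

/*
 * Grant the semaphore to w and return the next waiter record.  Both loads
 * are performed before the fence; once task has been cleared, *w belongs
 * to the waiter again and must not be touched.
 */
struct waiter *grant_and_advance(struct waiter *w, void **task_out)
{
	struct waiter *next = w->next;		/* LOAD waiter->list.next */
	void *task = atomic_load_explicit(&w->task,
					  memory_order_relaxed);
						/* LOAD waiter->task */

	assert(task != NULL);
	atomic_thread_fence(memory_order_seq_cst);	/* role of smp_mb() */
	atomic_store_explicit(&w->task, NULL, memory_order_relaxed);
						/* STORE waiter->task */

	*task_out = task;	/* caller would wake it, then drop its reference */
	return next;
}
```

The point of the sketch is the shape of the code: everything that reads from
the waiter's record sits above the fence, and the store that releases the
waiter sits below it.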
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsATOMIC OPERATIONS
108b42b4SDavid Howells-----------------
108b42b4SDavid Howells
806654a9SWill DeaconWhile they are technically interprocessor interaction considerations, atomic
dbc8700eSDavid Howellsoperations are noted specially as some of them imply full memory barriers and
dbc8700eSDavid Howellssome don't, but they're very heavily relied on as a group throughout the
dbc8700eSDavid Howellskernel.
dbc8700eSDavid Howells
706eeb3eSPeter ZijlstraSee Documentation/atomic_t.txt for more information.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsACCESSING DEVICES
108b42b4SDavid Howells-----------------
108b42b4SDavid Howells
108b42b4SDavid HowellsMany devices can be memory mapped, and so appear to the CPU as if they're just
108b42b4SDavid Howellsa set of memory locations.  To control such a device, the driver usually has to
108b42b4SDavid Howellsmake the right memory accesses in exactly the right order.
108b42b4SDavid Howells
108b42b4SDavid HowellsHowever, having a clever CPU or a clever compiler creates a potential problem
108b42b4SDavid Howellsin that the carefully sequenced accesses in the driver code won't reach the
108b42b4SDavid Howellsdevice in the requisite order if the CPU or the compiler thinks it is more
108b42b4SDavid Howellsefficient to reorder, combine or merge accesses - something that would cause
108b42b4SDavid Howellsthe device to malfunction.
108b42b4SDavid Howells
108b42b4SDavid HowellsInside of the Linux kernel, I/O should be done through the appropriate accessor
108b42b4SDavid Howellsroutines - such as inb() or writel() - which know how to make such accesses
806654a9SWill Deaconappropriately sequential.  While this, for the most part, renders the explicit
91553039SWill Deaconuse of memory barriers unnecessary, if the accessor functions are used to refer
91553039SWill Deaconto an I/O memory window with relaxed memory access properties, then _mandatory_
91553039SWill Deaconmemory barriers are required to enforce ordering.
108b42b4SDavid Howells
0fe397f0SHelmut GrohneSee Documentation/driver-api/device-io.rst for more information.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsINTERRUPTS
108b42b4SDavid Howells----------
108b42b4SDavid Howells
108b42b4SDavid HowellsA driver may be interrupted by its own interrupt service routine, and thus the
108b42b4SDavid Howellstwo parts of the driver may interfere with each other's attempts to control or
108b42b4SDavid Howellsaccess the device.
108b42b4SDavid Howells
108b42b4SDavid HowellsThis may be alleviated - at least in part - by disabling local interrupts (a
108b42b4SDavid Howellsform of locking), such that the critical operations are all contained within
806654a9SWill Deaconthe interrupt-disabled section in the driver.  While the driver's interrupt
108b42b4SDavid Howellsroutine is executing, the driver's core may not run on the same CPU, and its
108b42b4SDavid Howellsinterrupt is not permitted to happen again until the current interrupt has been
108b42b4SDavid Howellshandled, thus the interrupt handler does not need to lock against that.
108b42b4SDavid Howells
108b42b4SDavid HowellsHowever, consider a driver that was talking to an ethernet card that sports an
108b42b4SDavid Howellsaddress register and a data register.  If that driver's core talks to the card
108b42b4SDavid Howellsunder interrupt-disablement and then the driver's interrupt handler is invoked:
108b42b4SDavid Howells
108b42b4SDavid Howells	LOCAL IRQ DISABLE
108b42b4SDavid Howells	writew(ADDR, 3);
108b42b4SDavid Howells	writew(DATA, y);
108b42b4SDavid Howells	LOCAL IRQ ENABLE
108b42b4SDavid Howells	<interrupt>
108b42b4SDavid Howells	writew(ADDR, 4);
108b42b4SDavid Howells	q = readw(DATA);
108b42b4SDavid Howells	</interrupt>
108b42b4SDavid Howells
108b42b4SDavid HowellsThe store to the data register might happen after the second store to the
108b42b4SDavid Howellsaddress register if ordering rules are sufficiently relaxed:
108b42b4SDavid Howells
108b42b4SDavid Howells	STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsIf ordering rules are relaxed, it must be assumed that accesses done inside an
108b42b4SDavid Howellsinterrupt disabled section may leak outside of it and may interleave with
108b42b4SDavid Howellsaccesses performed in an interrupt - and vice versa - unless implicit or
108b42b4SDavid Howellsexplicit barriers are used.
108b42b4SDavid Howells
108b42b4SDavid HowellsNormally this won't be a problem because the I/O accesses done inside such
108b42b4SDavid Howellssections will include synchronous load operations on strictly ordered I/O
91553039SWill Deaconregisters that form implicit I/O barriers.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsA similar situation may occur between an interrupt routine and two routines
108b42b4SDavid Howellsrunning on separate CPUs that communicate with each other.  If such a case is
108b42b4SDavid Howellslikely, then interrupt-disabling locks should be used to guarantee ordering.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid Howells==========================
108b42b4SDavid HowellsKERNEL I/O BARRIER EFFECTS
108b42b4SDavid Howells==========================
108b42b4SDavid Howells
4614bbdeSWill DeaconInterfacing with peripherals via I/O accesses is deeply architecture and device
4614bbdeSWill Deaconspecific. Therefore, drivers which are inherently non-portable may rely on
4614bbdeSWill Deaconspecific behaviours of their target systems in order to achieve synchronization
4614bbdeSWill Deaconin the most lightweight manner possible. For drivers intending to be portable
4614bbdeSWill Deaconbetween multiple architectures and bus implementations, the kernel offers a
4614bbdeSWill Deaconseries of accessor functions that provide various degrees of ordering
4614bbdeSWill Deaconguarantees:
108b42b4SDavid Howells
108b42b4SDavid Howells (*) readX(), writeX():
108b42b4SDavid Howells
0cde62a4SWill Deacon	The readX() and writeX() MMIO accessors take a pointer to the
0cde62a4SWill Deacon	peripheral being accessed as an __iomem * parameter. For pointers
0cde62a4SWill Deacon	mapped with the default I/O attributes (e.g. those returned by
0cde62a4SWill Deacon	ioremap()), the ordering guarantees are as follows:
108b42b4SDavid Howells
4614bbdeSWill Deacon	1. All readX() and writeX() accesses to the same peripheral are ordered
9726840dSWill Deacon	   with respect to each other. This ensures that MMIO register accesses
9726840dSWill Deacon	   by the same CPU thread to a particular device will arrive in program
9726840dSWill Deacon	   order.
108b42b4SDavid Howells
9726840dSWill Deacon	2. A writeX() issued by a CPU thread holding a spinlock is ordered
9726840dSWill Deacon	   before a writeX() to the same peripheral from another CPU thread
9726840dSWill Deacon	   issued after a later acquisition of the same spinlock. This ensures
9726840dSWill Deacon	   that MMIO register writes to a particular device issued while holding
9726840dSWill Deacon	   a spinlock will arrive in an order consistent with acquisitions of
9726840dSWill Deacon	   the lock.
108b42b4SDavid Howells
9726840dSWill Deacon	3. A writeX() by a CPU thread to the peripheral will first wait for the
9726840dSWill Deacon	   completion of all prior writes to memory either issued by, or
9726840dSWill Deacon	   propagated to, the same thread. This ensures that writes by the CPU
9726840dSWill Deacon	   to an outbound DMA buffer allocated by dma_alloc_coherent() will be
9726840dSWill Deacon	   visible to a DMA engine when the CPU writes to its MMIO control
9726840dSWill Deacon	   register to trigger the transfer.
108b42b4SDavid Howells
9726840dSWill Deacon	4. A readX() by a CPU thread from the peripheral will complete before
9726840dSWill Deacon	   any subsequent reads from memory by the same thread can begin. This
9726840dSWill Deacon	   ensures that reads by the CPU from an incoming DMA buffer allocated
9726840dSWill Deacon	   by dma_alloc_coherent() will not see stale data after reading from
9726840dSWill Deacon	   the DMA engine's MMIO status register to establish that the DMA
9726840dSWill Deacon	   transfer has completed.
9726840dSWill Deacon
9726840dSWill Deacon	5. A readX() by a CPU thread from the peripheral will complete before
9726840dSWill Deacon	   any subsequent delay() loop can begin execution on the same thread.
9726840dSWill Deacon	   This ensures that two MMIO register writes by the CPU to a peripheral
9726840dSWill Deacon	   will arrive at least 1us apart if the first write is immediately read
9726840dSWill Deacon	   back with readX() and udelay(1) is called prior to the second
9726840dSWill Deacon	   writeX():
108b42b4SDavid Howells
0cde62a4SWill Deacon		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
0cde62a4SWill Deacon		readl(DEVICE_REGISTER_0);
0cde62a4SWill Deacon		udelay(1);
0cde62a4SWill Deacon		writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.
0cde62a4SWill Deacon
0cde62a4SWill Deacon	The ordering properties of __iomem pointers obtained with non-default
0cde62a4SWill Deacon	attributes (e.g. those returned by ioremap_wc()) are specific to the
0cde62a4SWill Deacon	underlying architecture and therefore the guarantees listed above cannot
0cde62a4SWill Deacon	generally be relied upon for accesses to these types of mappings.
108b42b4SDavid Howells
4614bbdeSWill Deacon (*) readX_relaxed(), writeX_relaxed():
108b42b4SDavid Howells
a8e0aeadSWill Deacon	These are similar to readX() and writeX(), but provide weaker memory
a8e0aeadSWill Deacon	ordering guarantees. Specifically, they do not guarantee ordering with
9726840dSWill Deacon	respect to locking, normal memory accesses or delay() loops (i.e.
9726840dSWill Deacon	bullets 2-5 above) but they are still guaranteed to be ordered with
9726840dSWill Deacon	respect to other accesses from the same CPU thread to the same
9726840dSWill Deacon	peripheral when operating on __iomem pointers mapped with the default
9726840dSWill Deacon	I/O attributes.
4614bbdeSWill Deacon
4614bbdeSWill Deacon (*) readsX(), writesX():
4614bbdeSWill Deacon
4614bbdeSWill Deacon	The readsX() and writesX() MMIO accessors are designed for accessing
4614bbdeSWill Deacon	register-based, memory-mapped FIFOs residing on peripherals that are not
4614bbdeSWill Deacon	capable of performing DMA. Consequently, they provide only the ordering
4614bbdeSWill Deacon	guarantees of readX_relaxed() and writeX_relaxed(), as documented above.
4614bbdeSWill Deacon
4614bbdeSWill Deacon (*) inX(), outX():
4614bbdeSWill Deacon
4614bbdeSWill Deacon	The inX() and outX() accessors are intended to access legacy port-mapped
4614bbdeSWill Deacon	I/O peripherals, which may require special instructions on some
4614bbdeSWill Deacon	architectures (notably x86). The port number of the peripheral being
4614bbdeSWill Deacon	accessed is passed as an argument.
4614bbdeSWill Deacon
4614bbdeSWill Deacon	Since many CPU architectures ultimately access these peripherals via an
0cde62a4SWill Deacon	internal virtual memory mapping, the portable ordering guarantees
0cde62a4SWill Deacon	provided by inX() and outX() are the same as those provided by readX()
0cde62a4SWill Deacon	and writeX() respectively when accessing a mapping with the default I/O
0cde62a4SWill Deacon	attributes.
4614bbdeSWill Deacon
4614bbdeSWill Deacon	Device drivers may expect outX() to emit a non-posted write transaction
4614bbdeSWill Deacon	that waits for a completion response from the I/O peripheral before
4614bbdeSWill Deacon	returning. This is not guaranteed by all architectures and is therefore
4614bbdeSWill Deacon	not part of the portable ordering semantics.
4614bbdeSWill Deacon
4614bbdeSWill Deacon (*) insX(), outsX():
4614bbdeSWill Deacon
4614bbdeSWill Deacon	As above, the insX() and outsX() accessors provide the same ordering
0cde62a4SWill Deacon	guarantees as readsX() and writesX() respectively when accessing a
0cde62a4SWill Deacon	mapping with the default I/O attributes.
108b42b4SDavid Howells
0cde62a4SWill Deacon (*) ioreadX(), iowriteX():
108b42b4SDavid Howells
81fc6323SJarek Poplawski	These will perform appropriately for the type of access they're actually
108b42b4SDavid Howells	doing, be it inX()/outX() or readX()/writeX().
108b42b4SDavid Howells
9726840dSWill DeaconWith the exception of the string accessors (insX(), outsX(), readsX() and
9726840dSWill DeaconwritesX()), all of the above assume that the underlying peripheral is
9726840dSWill Deaconlittle-endian and will therefore perform byte-swapping operations on big-endian
9726840dSWill Deaconarchitectures.
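
The little-endian assumption can be illustrated outside the kernel.  What
follows is a userspace sketch only: reg_read32_le() and reg_write32_le() are
hypothetical helper names, and a byte array stands in for a device register.
Real drivers should use the accessors listed above.  The helpers assemble the
value one byte at a time, so the result is the same on little-endian and
big-endian hosts, which is the effect that the byte-swapping in the non-string
accessors is meant to achieve:

```c
#include <stdint.h>

/*
 * Hypothetical userspace stand-ins, not kernel APIs.  "reg" points at four
 * bytes laid out the way a little-endian device register would be.
 */
static inline uint32_t reg_read32_le(const volatile uint8_t *reg)
{
	/* The least significant byte lives at the lowest address. */
	return (uint32_t)reg[0] |
	       ((uint32_t)reg[1] << 8) |
	       ((uint32_t)reg[2] << 16) |
	       ((uint32_t)reg[3] << 24);
}

static inline void reg_write32_le(volatile uint8_t *reg, uint32_t val)
{
	reg[0] = (uint8_t)(val & 0xff);
	reg[1] = (uint8_t)((val >> 8) & 0xff);
	reg[2] = (uint8_t)((val >> 16) & 0xff);
	reg[3] = (uint8_t)((val >> 24) & 0xff);
}
```

The string accessors, by contrast, would copy the bytes through unchanged.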
4614bbdeSWill Deacon
108b42b4SDavid Howells
108b42b4SDavid Howells========================================
108b42b4SDavid HowellsASSUMED MINIMUM EXECUTION ORDERING MODEL
108b42b4SDavid Howells========================================
108b42b4SDavid Howells
108b42b4SDavid HowellsIt has to be assumed that the conceptual CPU is weakly-ordered but that it will
108b42b4SDavid Howellsmaintain the appearance of program causality with respect to itself.  Some CPUs
108b42b4SDavid Howells(such as i386 or x86_64) are more constrained than others (such as powerpc or
108b42b4SDavid Howellsfrv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
108b42b4SDavid Howellsof arch-specific code.
108b42b4SDavid Howells
108b42b4SDavid HowellsThis means that it must be considered that the CPU will execute its instruction
108b42b4SDavid Howellsstream in any order it feels like - or even in parallel - provided that if an
81fc6323SJarek Poplawskiinstruction in the stream depends on an earlier instruction, then that
108b42b4SDavid Howellsearlier instruction must be sufficiently complete[*] before the later
108b42b4SDavid Howellsinstruction may proceed; in other words: provided that the appearance of
108b42b4SDavid Howellscausality is maintained.
108b42b4SDavid Howells
108b42b4SDavid Howells [*] Some instructions have more than one effect - such as changing the
108b42b4SDavid Howells     condition codes, changing registers or changing memory - and different
108b42b4SDavid Howells     instructions may depend on different effects.
108b42b4SDavid Howells
108b42b4SDavid HowellsA CPU may also discard any instruction sequence that winds up having no
108b42b4SDavid Howellsultimate effect.  For example, if two adjacent instructions both load an
108b42b4SDavid Howellsimmediate value into the same register, the first may be discarded.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsSimilarly, it has to be assumed that the compiler might reorder the
108b42b4SDavid Howellsinstruction stream in any way it sees fit, again provided the appearance of
108b42b4SDavid Howellscausality is maintained.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid Howells============================
108b42b4SDavid HowellsTHE EFFECTS OF THE CPU CACHE
108b42b4SDavid Howells============================
108b42b4SDavid Howells
108b42b4SDavid HowellsThe way cached memory operations are perceived across the system is affected to
108b42b4SDavid Howellsa certain extent by the caches that lie between CPUs and memory, and by the
108b42b4SDavid Howellsmemory coherence system that maintains the consistency of state in the system.
108b42b4SDavid Howells
108b42b4SDavid HowellsAs far as a CPU's interaction with another part of the system through the
108b42b4SDavid Howellscaches is concerned, the memory system has to include the CPU's caches, and
108b42b4SDavid Howellsmemory barriers for the most part act at the interface between the CPU and its
108b42b4SDavid Howellscache (memory barriers logically act on the dotted line in the following
108b42b4SDavid Howellsdiagram):
108b42b4SDavid Howells
108b42b4SDavid Howells	    <--- CPU --->         :       <----------- Memory ----------->
108b42b4SDavid Howells	                          :
108b42b4SDavid Howells	+--------+    +--------+  :   +--------+    +-----------+
108b42b4SDavid Howells	|        |    |        |  :   |        |    |           |    +--------+
108b42b4SDavid Howells	|  CPU   |    | Memory |  :   | CPU    |    |           |    |        |
108b42b4SDavid Howells	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
108b42b4SDavid Howells	|        |    | Queue  |  :   |        |    |           |--->| Memory |
108b42b4SDavid Howells	|        |    |        |  :   |        |    |           |    |        |
108b42b4SDavid Howells	+--------+    +--------+  :   +--------+    |           |    |        |
108b42b4SDavid Howells	                          :                 | Cache     |    +--------+
108b42b4SDavid Howells	                          :                 | Coherency |
108b42b4SDavid Howells	                          :                 | Mechanism |    +--------+
108b42b4SDavid Howells	+--------+    +--------+  :   +--------+    |           |    |        |
108b42b4SDavid Howells	|        |    |        |  :   |        |    |           |    |        |
108b42b4SDavid Howells	|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
108b42b4SDavid Howells	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
108b42b4SDavid Howells	|        |    | Queue  |  :   |        |    |           |    |        |
108b42b4SDavid Howells	|        |    |        |  :   |        |    |           |    +--------+
108b42b4SDavid Howells	+--------+    +--------+  :   +--------+    +-----------+
108b42b4SDavid Howells	                          :
108b42b4SDavid Howells	                          :
108b42b4SDavid Howells
108b42b4SDavid HowellsAlthough any particular load or store may not actually appear outside of the
108b42b4SDavid HowellsCPU that issued it since it may have been satisfied within the CPU's own cache,
108b42b4SDavid Howellsit will still appear as if the full memory access had taken place as far as the
108b42b4SDavid Howellsother CPUs are concerned since the cache coherency mechanisms will migrate the
108b42b4SDavid Howellscacheline over to the accessing CPU and propagate the effects upon conflict.
108b42b4SDavid Howells
108b42b4SDavid HowellsThe CPU core may execute instructions in any order it deems fit, provided the
108b42b4SDavid Howellsexpected program causality appears to be maintained.  Some of the instructions
108b42b4SDavid Howellsgenerate load and store operations which then go into the queue of memory
108b42b4SDavid Howellsaccesses to be performed.  The core may place these in the queue in any order
108b42b4SDavid Howellsit wishes, and continue execution until it is forced to wait for an instruction
108b42b4SDavid Howellsto complete.
108b42b4SDavid Howells
108b42b4SDavid HowellsWhat memory barriers are concerned with is controlling the order in which
108b42b4SDavid Howellsaccesses cross from the CPU side of things to the memory side of things, and
108b42b4SDavid Howellsthe order in which the effects are perceived to happen by the other observers
108b42b4SDavid Howellsin the system.
108b42b4SDavid Howells
108b42b4SDavid Howells[!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
108b42b4SDavid Howellstheir own loads and stores as if they had happened in program order.
108b42b4SDavid Howells
108b42b4SDavid Howells[!] MMIO or other device accesses may bypass the cache system.  This depends on
108b42b4SDavid Howellsthe properties of the memory window through which devices are accessed and/or
108b42b4SDavid Howellsthe use of any special device communication instructions the CPU may have.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsCACHE COHERENCY VS DMA
108b42b4SDavid Howells----------------------
108b42b4SDavid Howells
108b42b4SDavid HowellsNot all systems maintain cache coherency with respect to devices doing DMA.  In
108b42b4SDavid Howellssuch cases, a device attempting DMA may obtain stale data from RAM because
108b42b4SDavid Howellsdirty cache lines may be resident in the caches of various CPUs, and may not
108b42b4SDavid Howellshave been written back to RAM yet.  To deal with this, the appropriate part of
108b42b4SDavid Howellsthe kernel must flush the overlapping bits of cache on each CPU (and maybe
108b42b4SDavid Howellsinvalidate them as well).
108b42b4SDavid Howells
108b42b4SDavid HowellsIn addition, the data DMA'd to RAM by a device may be overwritten by dirty
108b42b4SDavid Howellscache lines being written back to RAM from a CPU's cache after the device has
81fc6323SJarek Poplawskiinstalled its own data, or cache lines present in the CPU's cache may simply
81fc6323SJarek Poplawskiobscure the fact that RAM has been updated, until such time as the cacheline
81fc6323SJarek Poplawskiis discarded from the CPU's cache and reloaded.  To deal with this, the
81fc6323SJarek Poplawskiappropriate part of the kernel must invalidate the overlapping bits of the
108b42b4SDavid Howellscache on each CPU.
108b42b4SDavid Howells
f556082dSAkira YokosawaSee Documentation/core-api/cachetlb.rst for more information on cache
f556082dSAkira Yokosawamanagement.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsCACHE COHERENCY VS MMIO
108b42b4SDavid Howells-----------------------
108b42b4SDavid Howells
108b42b4SDavid HowellsMemory mapped I/O usually takes place through memory locations that are part of
81fc6323SJarek Poplawskia window in the CPU's memory space that has different properties assigned than
108b42b4SDavid Howellsthe usual RAM directed window.
108b42b4SDavid Howells
108b42b4SDavid HowellsAmongst these properties is usually the fact that such accesses bypass the
108b42b4SDavid Howellscaching entirely and go directly to the device buses.  This means MMIO accesses
108b42b4SDavid Howellsmay, in effect, overtake accesses to cached memory that were emitted earlier.
108b42b4SDavid HowellsA memory barrier isn't sufficient in such a case, but rather the cache must be
108b42b4SDavid Howellsflushed between the cached memory write and the MMIO access if the two are in
108b42b4SDavid Howellsany way dependent.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid Howells=========================
108b42b4SDavid HowellsTHE THINGS CPUS GET UP TO
108b42b4SDavid Howells=========================
108b42b4SDavid Howells
108b42b4SDavid HowellsA programmer might take it for granted that the CPU will perform memory
81fc6323SJarek Poplawskioperations in exactly the order specified, so that if the CPU is, for example,
108b42b4SDavid Howellsgiven the following piece of code to execute:
108b42b4SDavid Howells
9af194ceSPaul E. McKenney	a = READ_ONCE(*A);
9af194ceSPaul E. McKenney	WRITE_ONCE(*B, b);
9af194ceSPaul E. McKenney	c = READ_ONCE(*C);
9af194ceSPaul E. McKenney	d = READ_ONCE(*D);
9af194ceSPaul E. McKenney	WRITE_ONCE(*E, e);
108b42b4SDavid Howells
81fc6323SJarek Poplawskithey would then expect that the CPU will complete the memory operation for each
108b42b4SDavid Howellsinstruction before moving on to the next one, leading to a definite sequence of
108b42b4SDavid Howellsoperations as seen by external observers in the system:
108b42b4SDavid Howells
108b42b4SDavid Howells	LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsReality is, of course, much messier.  With many CPUs and compilers, the above
108b42b4SDavid Howellsassumption doesn't hold because:
108b42b4SDavid Howells
108b42b4SDavid Howells (*) loads are more likely to need to be completed immediately to permit
108b42b4SDavid Howells     execution progress, whereas stores can often be deferred without a
108b42b4SDavid Howells     problem;
108b42b4SDavid Howells
108b42b4SDavid Howells (*) loads may be done speculatively, and the result discarded should it prove
108b42b4SDavid Howells     to have been unnecessary;
108b42b4SDavid Howells
81fc6323SJarek Poplawski (*) loads may be done speculatively, leading to the result having been fetched
81fc6323SJarek Poplawski     at the wrong time in the expected sequence of events;
108b42b4SDavid Howells
108b42b4SDavid Howells (*) the order of the memory accesses may be rearranged to promote better use
108b42b4SDavid Howells     of the CPU buses and caches;
108b42b4SDavid Howells
108b42b4SDavid Howells (*) loads and stores may be combined to improve performance when talking to
108b42b4SDavid Howells     memory or I/O hardware that can do batched accesses of adjacent locations,
108b42b4SDavid Howells     thus cutting down on transaction setup costs (memory and PCI devices may
108b42b4SDavid Howells     both be able to do this); and
108b42b4SDavid Howells
806654a9SWill Deacon (*) the CPU's data cache may affect the ordering, and while cache-coherency
108b42b4SDavid Howells     mechanisms may alleviate this - once the store has actually hit the cache
108b42b4SDavid Howells     - there's no guarantee that the coherency management will be propagated in
108b42b4SDavid Howells     order to other CPUs.
108b42b4SDavid Howells
108b42b4SDavid HowellsSo what another CPU, say, might actually observe from the above piece of code
108b42b4SDavid Howellsis:
108b42b4SDavid Howells
108b42b4SDavid Howells	LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B
108b42b4SDavid Howells
108b42b4SDavid Howells	(Where "LOAD {*C,*D}" is a combined load)
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsHowever, it is guaranteed that a CPU will be self-consistent: it will see its
108b42b4SDavid Howells_own_ accesses appear to be correctly ordered, without the need for a memory
108b42b4SDavid Howellsbarrier.  For instance, with the following code:
108b42b4SDavid Howells
9af194ceSPaul E. McKenney	U = READ_ONCE(*A);
9af194ceSPaul E. McKenney	WRITE_ONCE(*A, V);
9af194ceSPaul E. McKenney	WRITE_ONCE(*A, W);
9af194ceSPaul E. McKenney	X = READ_ONCE(*A);
9af194ceSPaul E. McKenney	WRITE_ONCE(*A, Y);
9af194ceSPaul E. McKenney	Z = READ_ONCE(*A);
108b42b4SDavid Howells
108b42b4SDavid Howellsand assuming no intervention by an external influence, it can be assumed that
108b42b4SDavid Howellsthe final result will appear to be:
108b42b4SDavid Howells
108b42b4SDavid Howells	U == the original value of *A
108b42b4SDavid Howells	X == W
108b42b4SDavid Howells	Z == Y
108b42b4SDavid Howells	*A == Y
108b42b4SDavid Howells
108b42b4SDavid HowellsThe code above may cause the CPU to generate the full sequence of memory
108b42b4SDavid Howellsaccesses:
108b42b4SDavid Howells
108b42b4SDavid Howells	U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A
108b42b4SDavid Howells
108b42b4SDavid Howellsin that order, but, without intervention, the sequence may have almost any
9af194ceSPaul E. McKenneycombination of elements combined or discarded, provided the program's view
9af194ceSPaul E. McKenneyof the world remains consistent.  Note that READ_ONCE() and WRITE_ONCE()
9af194ceSPaul E. McKenneyare -not- optional in the above example, as there are architectures
9af194ceSPaul E. McKenneywhere a given CPU might reorder successive loads to the same location.
9af194ceSPaul E. McKenneyOn such architectures, READ_ONCE() and WRITE_ONCE() do whatever is
9af194ceSPaul E. McKenneynecessary to prevent this.  For example, on Itanium the volatile casts
9af194ceSPaul E. McKenneyused by READ_ONCE() and WRITE_ONCE() cause GCC to emit the special ld.acq
9af194ceSPaul E. McKenneyand st.rel instructions (respectively) that prevent such reordering.
108b42b4SDavid Howells
108b42b4SDavid HowellsThe compiler may also combine, discard or defer elements of the sequence before
108b42b4SDavid Howellsthe CPU even sees them.
108b42b4SDavid Howells
108b42b4SDavid HowellsFor instance:
108b42b4SDavid Howells
108b42b4SDavid Howells	*A = V;
108b42b4SDavid Howells	*A = W;
108b42b4SDavid Howells
108b42b4SDavid Howellsmay be reduced to:
108b42b4SDavid Howells
108b42b4SDavid Howells	*A = W;
108b42b4SDavid Howells
9af194ceSPaul E. McKenneysince, without either a write barrier or a WRITE_ONCE(), it can be
2ecf8101SPaul E. McKenneyassumed that the effect of the storage of V to *A is lost.  Similarly:
108b42b4SDavid Howells
108b42b4SDavid Howells	*A = Y;
108b42b4SDavid Howells	Z = *A;
108b42b4SDavid Howells
9af194ceSPaul E. McKenneymay, without a memory barrier or READ_ONCE() and WRITE_ONCE(), be
9af194ceSPaul E. McKenneyreduced to:
108b42b4SDavid Howells
108b42b4SDavid Howells	*A = Y;
108b42b4SDavid Howells	Z = Y;
108b42b4SDavid Howells
108b42b4SDavid Howellsand the LOAD operation may never appear outside of the CPU.
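
The self-consistency guarantee for marked accesses described earlier in this
section can be sketched in userspace C.  This is an illustration under stated
assumptions, not the kernel's implementation: read_once_int() and
write_once_int() are hypothetical stand-ins that use only the volatile casts
mentioned above, and self_consistent() replays the six-access sequence on a
single location with no other thread involved:

```c
/*
 * Hypothetical stand-ins for READ_ONCE()/WRITE_ONCE() on an int.  They use
 * plain volatile casts, so the compiler must emit each access exactly once
 * and may not merge or drop any of them.
 */
static inline int read_once_int(const int *p)
{
	return *(const volatile int *)p;
}

static inline void write_once_int(int *p, int v)
{
	*(volatile int *)p = v;
}

struct result {
	int u, x, z;
};

/* Replays U=LOAD, STORE V, STORE W, X=LOAD, STORE Y, Z=LOAD on *a. */
static struct result self_consistent(int *a, int v, int w, int y)
{
	struct result r;

	r.u = read_once_int(a);		/* U == original value of *A */
	write_once_int(a, v);
	write_once_int(a, w);
	r.x = read_once_int(a);		/* X == W */
	write_once_int(a, y);
	r.z = read_once_int(a);		/* Z == Y */
	return r;
}
```

With plain accesses in place of the helpers, the compiler would be free to
drop the store of v and to forward w and y directly to the loads, which is the
reduction shown above.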
108b42b4SDavid Howells
108b42b4SDavid Howells
108b42b4SDavid HowellsAND THEN THERE'S THE ALPHA
108b42b4SDavid Howells--------------------------
108b42b4SDavid Howells
108b42b4SDavid HowellsThe DEC Alpha CPU is one of the most relaxed CPUs there is.  Not only that,
108b42b4SDavid Howellssome versions of the Alpha CPU have a split data cache, permitting them to have
81fc6323SJarek Poplawskitwo semantically-related cache lines updated at separate times.  This is where
f556082dSAkira Yokosawathe address-dependency barrier really becomes necessary as this synchronises
f556082dSAkira Yokosawaboth caches with the memory coherence system, thus making it seem like pointer
108b42b4SDavid Howellschanges vs new data occur in the right order.
108b42b4SDavid Howells
f28f0868SPaul E. McKenneyThe Alpha defines the Linux kernel's memory model, although as of v4.15
8ca924aeSWill Deaconthe Linux kernel's addition of smp_mb() to READ_ONCE() on Alpha greatly
8ca924aeSWill Deaconreduced its impact on the memory model.
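
The "pointer changes vs new data" ordering discussed above can be sketched in
userspace C11.  This is an analogy rather than kernel code: struct item, slot,
published, publish_item() and read_item() are hypothetical names, and C11
release/acquire atomics stand in for the kernel's barriers.  The acquire load
is a deliberately conservative choice that would remain correct even on a CPU
that does not honour address dependencies:

```c
#include <stdatomic.h>
#include <stddef.h>

struct item {
	int payload;
};

static struct item slot;			/* storage for the item */
static _Atomic(struct item *) published;	/* NULL until published */

/* Writer side: initialise the data, then publish the pointer. */
static void publish_item(int value)
{
	slot.payload = value;
	/* Release: the payload store is ordered before the pointer store. */
	atomic_store_explicit(&published, &slot, memory_order_release);
}

/* Reader side: returns the payload, or -1 if nothing is published yet. */
static int read_item(void)
{
	/* Acquire: the load of ->payload cannot be satisfied before this. */
	struct item *p = atomic_load_explicit(&published,
					      memory_order_acquire);

	return p ? p->payload : -1;
}
```

If one thread calls publish_item() while another calls read_item(), the reader
either finds no item or finds the fully initialised payload; it should never
observe the new pointer together with stale data.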
108b42b4SDavid Howells
0b6fa347SSeongJae Park
6a65d263SMichael S. TsirkinVIRTUAL MACHINE GUESTS
3dbf0913SSeongJae Park----------------------
6a65d263SMichael S. Tsirkin
6a65d263SMichael S. TsirkinGuests running within virtual machines might be affected by SMP effects even if
6a65d263SMichael S. Tsirkinthe guest itself is compiled without SMP support.  This is an artifact of
6a65d263SMichael S. Tsirkininterfacing with an SMP host while running a UP kernel.  Using mandatory
6a65d263SMichael S. Tsirkinbarriers for this use-case would be possible but is often suboptimal.
6a65d263SMichael S. Tsirkin
6a65d263SMichael S. TsirkinTo handle this case optimally, low-level virt_mb() etc macros are available.
6a65d263SMichael S. TsirkinThese have the same effect as smp_mb() etc when SMP is enabled, but generate
6a65d263SMichael S. Tsirkinidentical code for SMP and non-SMP systems.  For example, virtual machine guests
6a65d263SMichael S. Tsirkinshould use virt_mb() rather than smp_mb() when synchronizing against a
6a65d263SMichael S. Tsirkin(possibly SMP) host.
6a65d263SMichael S. Tsirkin
6a65d263SMichael S. TsirkinThese are equivalent to their smp_mb() etc counterparts in all other respects;
6a65d263SMichael S. Tsirkinin particular, they do not control MMIO effects.  To control MMIO effects, use
6a65d263SMichael S. Tsirkinmandatory barriers.
108b42b4SDavid Howells
0b6fa347SSeongJae Park
90fddabfSDavid Howells============
90fddabfSDavid HowellsEXAMPLE USES
90fddabfSDavid Howells============
90fddabfSDavid Howells
90fddabfSDavid HowellsCIRCULAR BUFFERS
90fddabfSDavid Howells----------------
90fddabfSDavid Howells
90fddabfSDavid HowellsMemory barriers can be used to implement circular buffering without the need
90fddabfSDavid Howellsfor a lock to serialise the producer with the consumer.  See:
90fddabfSDavid Howells
d8a121e3SMauro Carvalho Chehab	Documentation/core-api/circular-buffers.rst
90fddabfSDavid Howells
90fddabfSDavid Howellsfor details.
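
As a rough illustration of the idea, here is a userspace C11 sketch of a
lockless ring for exactly one producer and one consumer.  It is not taken from
the referenced document: struct ring, ring_push(), ring_pop() and RING_SIZE are
hypothetical names, and C11 acquire/release atomics are assumed as stand-ins
for the kernel's barrier primitives:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8	/* assumed capacity; must be a power of two */

struct ring {
	int buf[RING_SIZE];
	atomic_size_t head;	/* written only by the producer */
	atomic_size_t tail;	/* written only by the consumer */
};

/* Producer side: returns false if the ring is full. */
static bool ring_push(struct ring *r, int item)
{
	size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	/* Acquire pairs with the consumer's release store to tail. */
	size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;

	r->buf[head & (RING_SIZE - 1)] = item;
	/* Release: the item is written before the new head is visible. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_pop(struct ring *r, int *item)
{
	size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	/* Acquire pairs with the producer's release store to head. */
	size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return false;

	*item = r->buf[tail & (RING_SIZE - 1)];
	/* Release: the slot is read before it is handed back. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

The indices run freely and are masked only when indexing the array, so all
RING_SIZE slots are usable and "full" is distinguished from "empty" without a
separate flag.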
90fddabfSDavid Howells
90fddabfSDavid Howells
108b42b4SDavid Howells==========
108b42b4SDavid HowellsREFERENCES
108b42b4SDavid Howells==========
108b42b4SDavid Howells
108b42b4SDavid HowellsAlpha AXP Architecture Reference Manual, Second Edition (Sites & Witek,
108b42b4SDavid HowellsDigital Press)
108b42b4SDavid Howells	Chapter 5.2: Physical Address Space Characteristics
108b42b4SDavid Howells	Chapter 5.4: Caches and Write Buffers
108b42b4SDavid Howells	Chapter 5.5: Data Sharing
108b42b4SDavid Howells	Chapter 5.6: Read/Write Ordering
108b42b4SDavid Howells
108b42b4SDavid HowellsAMD64 Architecture Programmer's Manual Volume 2: System Programming
108b42b4SDavid Howells	Chapter 7.1: Memory-Access Ordering
108b42b4SDavid Howells	Chapter 7.4: Buffering and Combining Memory Writes
108b42b4SDavid Howells
f1ab25a3SPaul E. McKenneyARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile)
f1ab25a3SPaul E. McKenney	Chapter B2: The AArch64 Application Level Memory Model
f1ab25a3SPaul E. McKenney
108b42b4SDavid HowellsIA-32 Intel Architecture Software Developer's Manual, Volume 3:
108b42b4SDavid HowellsSystem Programming Guide
108b42b4SDavid Howells	Chapter 7.1: Locked Atomic Operations
108b42b4SDavid Howells	Chapter 7.2: Memory Ordering
108b42b4SDavid Howells	Chapter 7.4: Serializing Instructions
108b42b4SDavid Howells
108b42b4SDavid HowellsThe SPARC Architecture Manual, Version 9
108b42b4SDavid Howells	Chapter 8: Memory Models
108b42b4SDavid Howells	Appendix D: Formal Specification of the Memory Models
108b42b4SDavid Howells	Appendix J: Programming with the Memory Models
108b42b4SDavid Howells
f1ab25a3SPaul E. McKenneyStorage in the PowerPC (Stone and Fitzgerald)
f1ab25a3SPaul E. McKenney
108b42b4SDavid HowellsUltraSPARC Programmer Reference Manual
108b42b4SDavid Howells	Chapter 5: Memory Accesses and Cacheability
108b42b4SDavid Howells	Chapter 15: Sparc-V9 Memory Models
108b42b4SDavid Howells
108b42b4SDavid HowellsUltraSPARC III Cu User's Manual
108b42b4SDavid Howells	Chapter 9: Memory Models
108b42b4SDavid Howells
108b42b4SDavid HowellsUltraSPARC IIIi Processor User's Manual
108b42b4SDavid Howells	Chapter 8: Memory Models
108b42b4SDavid Howells
108b42b4SDavid HowellsUltraSPARC Architecture 2005
108b42b4SDavid Howells	Chapter 9: Memory
108b42b4SDavid Howells	Appendix D: Formal Specifications of the Memory Models
108b42b4SDavid Howells
108b42b4SDavid HowellsUltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
108b42b4SDavid Howells	Chapter 8: Memory Models
108b42b4SDavid Howells	Appendix F: Caches and Cache Coherency
108b42b4SDavid Howells
108b42b4SDavid HowellsSolaris Internals, Core Kernel Architecture, p63-68:
108b42b4SDavid Howells	Chapter 3.3: Hardware Considerations for Locks and
108b42b4SDavid Howells			Synchronization
108b42b4SDavid Howells
108b42b4SDavid HowellsUnix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
108b42b4SDavid Howellsfor Kernel Programmers:
108b42b4SDavid Howells	Chapter 13: Other Memory Models
108b42b4SDavid Howells
108b42b4SDavid HowellsIntel Itanium Architecture Software Developer's Manual: Volume 1:
108b42b4SDavid Howells	Section 2.6: Speculation
108b42b4SDavid Howells	Section 4.4: Memory Access