xref: /linux/Documentation/memory-barriers.txt (revision f4b369c6fe0ceaba2da2daff8c9eb415f85926dd)
1108b42b4SDavid Howells			 ============================
2108b42b4SDavid Howells			 LINUX KERNEL MEMORY BARRIERS
3108b42b4SDavid Howells			 ============================
4108b42b4SDavid Howells
5108b42b4SDavid HowellsBy: David Howells <dhowells@redhat.com>
6714b6904SPaul E. McKenney    Paul E. McKenney <paulmck@linux.ibm.com>
7e7720af5SPeter Zijlstra    Will Deacon <will.deacon@arm.com>
8e7720af5SPeter Zijlstra    Peter Zijlstra <peterz@infradead.org>
9108b42b4SDavid Howells
10e7720af5SPeter Zijlstra==========
11e7720af5SPeter ZijlstraDISCLAIMER
12e7720af5SPeter Zijlstra==========
13e7720af5SPeter Zijlstra
14e7720af5SPeter ZijlstraThis document is not a specification; it is intentionally (for the sake of
15e7720af5SPeter Zijlstrabrevity) and unintentionally (due to being human) incomplete. This document is
16e7720af5SPeter Zijlstrameant as a guide to using the various memory barriers provided by Linux, but
17621df431SAndrea Parriin case of any doubt (and there are many) please ask.  Some doubts may be
18621df431SAndrea Parriresolved by referring to the formal memory consistency model and related
19621df431SAndrea Parridocumentation at tools/memory-model/.  Nevertheless, even this memory
20621df431SAndrea Parrimodel should be viewed as the collective opinion of its maintainers rather
21621df431SAndrea Parrithan as an infallible oracle.
22e7720af5SPeter Zijlstra
23e7720af5SPeter ZijlstraTo repeat, this document is not a specification of what Linux expects from
24e7720af5SPeter Zijlstrahardware.
25e7720af5SPeter Zijlstra
268d4840e8SDavid HowellsThe purpose of this document is twofold:
278d4840e8SDavid Howells
288d4840e8SDavid Howells (1) to specify the minimum functionality that one can rely on for any
298d4840e8SDavid Howells     particular barrier, and
308d4840e8SDavid Howells
318d4840e8SDavid Howells (2) to provide a guide as to how to use the barriers that are available.
328d4840e8SDavid Howells
338d4840e8SDavid HowellsNote that an architecture can provide more than the minimum requirement
3435bdc72aSStan Drozdfor any particular barrier, but if the architecture provides less than
358d4840e8SDavid Howellsthat, that architecture is incorrect.
368d4840e8SDavid Howells
378d4840e8SDavid HowellsNote also that it is possible that a barrier may be a no-op for an
388d4840e8SDavid Howellsarchitecture because the way that arch works renders an explicit barrier
398d4840e8SDavid Howellsunnecessary in that case.
408d4840e8SDavid Howells
418d4840e8SDavid Howells
42e7720af5SPeter Zijlstra========
43e7720af5SPeter ZijlstraCONTENTS
44e7720af5SPeter Zijlstra========
45108b42b4SDavid Howells
46108b42b4SDavid Howells (*) Abstract memory access model.
47108b42b4SDavid Howells
48108b42b4SDavid Howells     - Device operations.
49108b42b4SDavid Howells     - Guarantees.
50108b42b4SDavid Howells
51108b42b4SDavid Howells (*) What are memory barriers?
52108b42b4SDavid Howells
53108b42b4SDavid Howells     - Varieties of memory barrier.
54108b42b4SDavid Howells     - What may not be assumed about memory barriers?
55203185f6SAkira Yokosawa     - Address-dependency barriers (historical).
56108b42b4SDavid Howells     - Control dependencies.
57108b42b4SDavid Howells     - SMP barrier pairing.
58108b42b4SDavid Howells     - Examples of memory barrier sequences.
59670bd95eSDavid Howells     - Read memory barriers vs load speculation.
60f1ab25a3SPaul E. McKenney     - Multicopy atomicity.
61108b42b4SDavid Howells
62108b42b4SDavid Howells (*) Explicit kernel barriers.
63108b42b4SDavid Howells
64108b42b4SDavid Howells     - Compiler barrier.
6581fc6323SJarek Poplawski     - CPU memory barriers.
66108b42b4SDavid Howells
67108b42b4SDavid Howells (*) Implicit kernel memory barriers.
68108b42b4SDavid Howells
69166bda71SSeongJae Park     - Lock acquisition functions.
70108b42b4SDavid Howells     - Interrupt disabling functions.
7150fa610aSDavid Howells     - Sleep and wake-up functions.
72108b42b4SDavid Howells     - Miscellaneous functions.
73108b42b4SDavid Howells
74166bda71SSeongJae Park (*) Inter-CPU acquiring barrier effects.
75108b42b4SDavid Howells
76166bda71SSeongJae Park     - Acquires vs memory accesses.
77108b42b4SDavid Howells
78108b42b4SDavid Howells (*) Where are memory barriers needed?
79108b42b4SDavid Howells
80108b42b4SDavid Howells     - Interprocessor interaction.
81108b42b4SDavid Howells     - Atomic operations.
82108b42b4SDavid Howells     - Accessing devices.
83108b42b4SDavid Howells     - Interrupts.
84108b42b4SDavid Howells
85108b42b4SDavid Howells (*) Kernel I/O barrier effects.
86108b42b4SDavid Howells
87108b42b4SDavid Howells (*) Assumed minimum execution ordering model.
88108b42b4SDavid Howells
89108b42b4SDavid Howells (*) The effects of the cpu cache.
90108b42b4SDavid Howells
91108b42b4SDavid Howells     - Cache coherency vs DMA.
92108b42b4SDavid Howells     - Cache coherency vs MMIO.
93108b42b4SDavid Howells
94108b42b4SDavid Howells (*) The things CPUs get up to.
95108b42b4SDavid Howells
96108b42b4SDavid Howells     - And then there's the Alpha.
9701e1cd6dSSeongJae Park     - Virtual Machine Guests.
98108b42b4SDavid Howells
9990fddabfSDavid Howells (*) Example uses.
10090fddabfSDavid Howells
10190fddabfSDavid Howells     - Circular buffers.
10290fddabfSDavid Howells
103108b42b4SDavid Howells (*) References.
104108b42b4SDavid Howells
105108b42b4SDavid Howells
106108b42b4SDavid Howells============================
107108b42b4SDavid HowellsABSTRACT MEMORY ACCESS MODEL
108108b42b4SDavid Howells============================
109108b42b4SDavid Howells
110108b42b4SDavid HowellsConsider the following abstract model of the system:
111108b42b4SDavid Howells
112108b42b4SDavid Howells		            :                :
113108b42b4SDavid Howells		            :                :
114108b42b4SDavid Howells		            :                :
115108b42b4SDavid Howells		+-------+   :   +--------+   :   +-------+
116108b42b4SDavid Howells		|       |   :   |        |   :   |       |
117108b42b4SDavid Howells		|       |   :   |        |   :   |       |
118108b42b4SDavid Howells		| CPU 1 |<----->| Memory |<----->| CPU 2 |
119108b42b4SDavid Howells		|       |   :   |        |   :   |       |
120108b42b4SDavid Howells		|       |   :   |        |   :   |       |
121108b42b4SDavid Howells		+-------+   :   +--------+   :   +-------+
122108b42b4SDavid Howells		    ^       :       ^        :       ^
123108b42b4SDavid Howells		    |       :       |        :       |
124108b42b4SDavid Howells		    |       :       |        :       |
125108b42b4SDavid Howells		    |       :       v        :       |
126108b42b4SDavid Howells		    |       :   +--------+   :       |
127108b42b4SDavid Howells		    |       :   |        |   :       |
128108b42b4SDavid Howells		    |       :   |        |   :       |
129108b42b4SDavid Howells		    +---------->| Device |<----------+
130108b42b4SDavid Howells		            :   |        |   :
131108b42b4SDavid Howells		            :   |        |   :
132108b42b4SDavid Howells		            :   +--------+   :
133108b42b4SDavid Howells		            :                :
134108b42b4SDavid Howells
135108b42b4SDavid HowellsEach CPU executes a program that generates memory access operations.  In the
136108b42b4SDavid Howellsabstract CPU, memory operation ordering is very relaxed, and a CPU may actually
137108b42b4SDavid Howellsperform the memory operations in any order it likes, provided program causality
138108b42b4SDavid Howellsappears to be maintained.  Similarly, the compiler may also arrange the
139108b42b4SDavid Howellsinstructions it emits in any order it likes, provided it doesn't affect the
140108b42b4SDavid Howellsapparent operation of the program.
141108b42b4SDavid Howells
142108b42b4SDavid HowellsSo in the above diagram, the effects of the memory operations performed by a
143108b42b4SDavid HowellsCPU are perceived by the rest of the system as the operations cross the
144108b42b4SDavid Howellsinterface between the CPU and rest of the system (the dotted lines).
145108b42b4SDavid Howells
146108b42b4SDavid Howells
147108b42b4SDavid HowellsFor example, consider the following sequence of events:
148108b42b4SDavid Howells
149108b42b4SDavid Howells	CPU 1		CPU 2
150108b42b4SDavid Howells	===============	===============
151108b42b4SDavid Howells	{ A == 1; B == 2 }
152615cc2c9SAlexey Dobriyan	A = 3;		x = B;
153615cc2c9SAlexey Dobriyan	B = 4;		y = A;
154108b42b4SDavid Howells
155108b42b4SDavid HowellsThe set of accesses as seen by the memory system in the middle can be arranged
156108b42b4SDavid Howellsin 24 different combinations:
157108b42b4SDavid Howells
1588ab8b3e1SPranith Kumar	STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
1598ab8b3e1SPranith Kumar	STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
1608ab8b3e1SPranith Kumar	STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
1618ab8b3e1SPranith Kumar	STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
1628ab8b3e1SPranith Kumar	STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
1638ab8b3e1SPranith Kumar	STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
1648ab8b3e1SPranith Kumar	STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
165108b42b4SDavid Howells	STORE B=4, ...
166108b42b4SDavid Howells	...
167108b42b4SDavid Howells
168108b42b4SDavid Howellsand can thus result in four different combinations of values:
169108b42b4SDavid Howells
1708ab8b3e1SPranith Kumar	x == 2, y == 1
1718ab8b3e1SPranith Kumar	x == 2, y == 3
1728ab8b3e1SPranith Kumar	x == 4, y == 1
1738ab8b3e1SPranith Kumar	x == 4, y == 3
174108b42b4SDavid Howells
175108b42b4SDavid Howells
176108b42b4SDavid HowellsFurthermore, the stores committed by a CPU to the memory system may not be
177108b42b4SDavid Howellsperceived by the loads made by another CPU in the same order as the stores were
178108b42b4SDavid Howellscommitted.
179108b42b4SDavid Howells
180108b42b4SDavid Howells
181108b42b4SDavid HowellsAs a further example, consider this sequence of events:
182108b42b4SDavid Howells
183108b42b4SDavid Howells	CPU 1		CPU 2
184108b42b4SDavid Howells	===============	===============
1853dbf0913SSeongJae Park	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
186108b42b4SDavid Howells	B = 4;		Q = P;
1878149b5cbSSeongJae Park	P = &B;		D = *Q;
188108b42b4SDavid Howells
189f556082dSAkira YokosawaThere is an obvious address dependency here, as the value loaded into D depends
190f556082dSAkira Yokosawaon the address retrieved from P by CPU 2.  At the end of the sequence, any of
191f556082dSAkira Yokosawathe following results are possible:
192108b42b4SDavid Howells
193108b42b4SDavid Howells	(Q == &A) and (D == 1)
194108b42b4SDavid Howells	(Q == &B) and (D == 2)
195108b42b4SDavid Howells	(Q == &B) and (D == 4)
196108b42b4SDavid Howells
197108b42b4SDavid HowellsNote that CPU 2 will never try and load C into D because the CPU will load P
198108b42b4SDavid Howellsinto Q before issuing the load of *Q.
199108b42b4SDavid Howells
200108b42b4SDavid Howells
201108b42b4SDavid HowellsDEVICE OPERATIONS
202108b42b4SDavid Howells-----------------
203108b42b4SDavid Howells
204108b42b4SDavid HowellsSome devices present their control interfaces as collections of memory
205108b42b4SDavid Howellslocations, but the order in which the control registers are accessed is very
206108b42b4SDavid Howellsimportant.  For instance, imagine an ethernet card with a set of internal
207108b42b4SDavid Howellsregisters that are accessed through an address port register (A) and a data
208108b42b4SDavid Howellsport register (D).  To read internal register 5, the following code might then
209108b42b4SDavid Howellsbe used:
210108b42b4SDavid Howells
211108b42b4SDavid Howells	*A = 5;
212108b42b4SDavid Howells	x = *D;
213108b42b4SDavid Howells
214108b42b4SDavid Howellsbut this might show up as either of the following two sequences:
215108b42b4SDavid Howells
216108b42b4SDavid Howells	STORE *A = 5, x = LOAD *D
217108b42b4SDavid Howells	x = LOAD *D, STORE *A = 5
218108b42b4SDavid Howells
219108b42b4SDavid Howellsthe second of which will almost certainly result in a malfunction, since it set
220108b42b4SDavid Howellsthe address _after_ attempting to read the register.
221108b42b4SDavid Howells
222108b42b4SDavid Howells
223108b42b4SDavid HowellsGUARANTEES
224108b42b4SDavid Howells----------
225108b42b4SDavid Howells
226108b42b4SDavid HowellsThere are some minimal guarantees that may be expected of a CPU:
227108b42b4SDavid Howells
228108b42b4SDavid Howells (*) On any given CPU, dependent memory accesses will be issued in order, with
229108b42b4SDavid Howells     respect to itself.  This means that for:
230108b42b4SDavid Howells
23140555946SPaul E. McKenney	Q = READ_ONCE(P); D = READ_ONCE(*Q);
232108b42b4SDavid Howells
233108b42b4SDavid Howells     the CPU will issue the following memory operations:
234108b42b4SDavid Howells
235108b42b4SDavid Howells	Q = LOAD P, D = LOAD *Q
236108b42b4SDavid Howells
23740555946SPaul E. McKenney     and always in that order.  However, on DEC Alpha, READ_ONCE() also
23840555946SPaul E. McKenney     emits a memory-barrier instruction, so that a DEC Alpha CPU will
23940555946SPaul E. McKenney     instead issue the following memory operations:
24040555946SPaul E. McKenney
24140555946SPaul E. McKenney	Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER
24240555946SPaul E. McKenney
24340555946SPaul E. McKenney     Whether on DEC Alpha or not, the READ_ONCE() also prevents compiler
24440555946SPaul E. McKenney     mischief.
245108b42b4SDavid Howells
246108b42b4SDavid Howells (*) Overlapping loads and stores within a particular CPU will appear to be
247108b42b4SDavid Howells     ordered within that CPU.  This means that for:
248108b42b4SDavid Howells
2499af194ceSPaul E. McKenney	a = READ_ONCE(*X); WRITE_ONCE(*X, b);
250108b42b4SDavid Howells
251108b42b4SDavid Howells     the CPU will only issue the following sequence of memory operations:
252108b42b4SDavid Howells
253108b42b4SDavid Howells	a = LOAD *X, STORE *X = b
254108b42b4SDavid Howells
255108b42b4SDavid Howells     And for:
256108b42b4SDavid Howells
2579af194ceSPaul E. McKenney	WRITE_ONCE(*X, c); d = READ_ONCE(*X);
258108b42b4SDavid Howells
259108b42b4SDavid Howells     the CPU will only issue:
260108b42b4SDavid Howells
261108b42b4SDavid Howells	STORE *X = c, d = LOAD *X
262108b42b4SDavid Howells
263fa00e7e1SMatt LaPlante     (Loads and stores overlap if they are targeted at overlapping pieces of
264108b42b4SDavid Howells     memory).
265108b42b4SDavid Howells
266108b42b4SDavid HowellsAnd there are a number of things that _must_ or _must_not_ be assumed:
267108b42b4SDavid Howells
2689af194ceSPaul E. McKenney (*) It _must_not_ be assumed that the compiler will do what you want
2699af194ceSPaul E. McKenney     with memory references that are not protected by READ_ONCE() and
2709af194ceSPaul E. McKenney     WRITE_ONCE().  Without them, the compiler is within its rights to
2719af194ceSPaul E. McKenney     do all sorts of "creative" transformations, which are covered in
272895f5542SPaul E. McKenney     the COMPILER BARRIER section.
2732ecf8101SPaul E. McKenney
274108b42b4SDavid Howells (*) It _must_not_ be assumed that independent loads and stores will be issued
275108b42b4SDavid Howells     in the order given.  This means that for:
276108b42b4SDavid Howells
277108b42b4SDavid Howells	X = *A; Y = *B; *D = Z;
278108b42b4SDavid Howells
279108b42b4SDavid Howells     we may get any of the following sequences:
280108b42b4SDavid Howells
281108b42b4SDavid Howells	X = LOAD *A,  Y = LOAD *B,  STORE *D = Z
282108b42b4SDavid Howells	X = LOAD *A,  STORE *D = Z, Y = LOAD *B
283108b42b4SDavid Howells	Y = LOAD *B,  X = LOAD *A,  STORE *D = Z
284108b42b4SDavid Howells	Y = LOAD *B,  STORE *D = Z, X = LOAD *A
285108b42b4SDavid Howells	STORE *D = Z, X = LOAD *A,  Y = LOAD *B
286108b42b4SDavid Howells	STORE *D = Z, Y = LOAD *B,  X = LOAD *A
287108b42b4SDavid Howells
288108b42b4SDavid Howells (*) It _must_ be assumed that overlapping memory accesses may be merged or
289108b42b4SDavid Howells     discarded.  This means that for:
290108b42b4SDavid Howells
291108b42b4SDavid Howells	X = *A; Y = *(A + 4);
292108b42b4SDavid Howells
293108b42b4SDavid Howells     we may get any one of the following sequences:
294108b42b4SDavid Howells
295108b42b4SDavid Howells	X = LOAD *A; Y = LOAD *(A + 4);
296108b42b4SDavid Howells	Y = LOAD *(A + 4); X = LOAD *A;
297108b42b4SDavid Howells	{X, Y} = LOAD {*A, *(A + 4) };
298108b42b4SDavid Howells
299108b42b4SDavid Howells     And for:
300108b42b4SDavid Howells
301f191eec5SPaul E. McKenney	*A = X; *(A + 4) = Y;
302108b42b4SDavid Howells
303f191eec5SPaul E. McKenney     we may get any of:
304108b42b4SDavid Howells
305f191eec5SPaul E. McKenney	STORE *A = X; STORE *(A + 4) = Y;
306f191eec5SPaul E. McKenney	STORE *(A + 4) = Y; STORE *A = X;
307f191eec5SPaul E. McKenney	STORE {*A, *(A + 4) } = {X, Y};
308108b42b4SDavid Howells
309432fbf3cSPaul E. McKenneyAnd there are anti-guarantees:
310432fbf3cSPaul E. McKenney
311432fbf3cSPaul E. McKenney (*) These guarantees do not apply to bitfields, because compilers often
312432fbf3cSPaul E. McKenney     generate code to modify these using non-atomic read-modify-write
313432fbf3cSPaul E. McKenney     sequences.  Do not attempt to use bitfields to synchronize parallel
314432fbf3cSPaul E. McKenney     algorithms.
315432fbf3cSPaul E. McKenney
316432fbf3cSPaul E. McKenney (*) Even in cases where bitfields are protected by locks, all fields
317432fbf3cSPaul E. McKenney     in a given bitfield must be protected by one lock.  If two fields
318432fbf3cSPaul E. McKenney     in a given bitfield are protected by different locks, the compiler's
319432fbf3cSPaul E. McKenney     non-atomic read-modify-write sequences can cause an update to one
320432fbf3cSPaul E. McKenney     field to corrupt the value of an adjacent field.
321432fbf3cSPaul E. McKenney
322432fbf3cSPaul E. McKenney (*) These guarantees apply only to properly aligned and sized scalar
323432fbf3cSPaul E. McKenney     variables.  "Properly sized" currently means variables that are
324432fbf3cSPaul E. McKenney     the same size as "char", "short", "int" and "long".  "Properly
325432fbf3cSPaul E. McKenney     aligned" means the natural alignment, thus no constraints for
326432fbf3cSPaul E. McKenney     "char", two-byte alignment for "short", four-byte alignment for
327432fbf3cSPaul E. McKenney     "int", and either four-byte or eight-byte alignment for "long",
328432fbf3cSPaul E. McKenney     on 32-bit and 64-bit systems, respectively.  Note that these
329432fbf3cSPaul E. McKenney     guarantees were introduced into the C11 standard, so beware when
330432fbf3cSPaul E. McKenney     using older pre-C11 compilers (for example, gcc 4.6).  The portion
331432fbf3cSPaul E. McKenney     of the standard containing this guarantee is Section 3.14, which
332432fbf3cSPaul E. McKenney     defines "memory location" as follows:
333432fbf3cSPaul E. McKenney
334432fbf3cSPaul E. McKenney     	memory location
335432fbf3cSPaul E. McKenney		either an object of scalar type, or a maximal sequence
336432fbf3cSPaul E. McKenney		of adjacent bit-fields all having nonzero width
337432fbf3cSPaul E. McKenney
338432fbf3cSPaul E. McKenney		NOTE 1: Two threads of execution can update and access
339432fbf3cSPaul E. McKenney		separate memory locations without interfering with
340432fbf3cSPaul E. McKenney		each other.
341432fbf3cSPaul E. McKenney
342432fbf3cSPaul E. McKenney		NOTE 2: A bit-field and an adjacent non-bit-field member
343432fbf3cSPaul E. McKenney		are in separate memory locations. The same applies
344432fbf3cSPaul E. McKenney		to two bit-fields, if one is declared inside a nested
345432fbf3cSPaul E. McKenney		structure declaration and the other is not, or if the two
346432fbf3cSPaul E. McKenney		are separated by a zero-length bit-field declaration,
347432fbf3cSPaul E. McKenney		or if they are separated by a non-bit-field member
348432fbf3cSPaul E. McKenney		declaration. It is not safe to concurrently update two
349432fbf3cSPaul E. McKenney		bit-fields in the same structure if all members declared
350432fbf3cSPaul E. McKenney		between them are also bit-fields, no matter what the
351432fbf3cSPaul E. McKenney		sizes of those intervening bit-fields happen to be.
352432fbf3cSPaul E. McKenney
353108b42b4SDavid Howells
354108b42b4SDavid Howells=========================
355108b42b4SDavid HowellsWHAT ARE MEMORY BARRIERS?
356108b42b4SDavid Howells=========================
357108b42b4SDavid Howells
358108b42b4SDavid HowellsAs can be seen above, independent memory operations are effectively performed
359108b42b4SDavid Howellsin random order, but this can be a problem for CPU-CPU interaction and for I/O.
360108b42b4SDavid HowellsWhat is required is some way of intervening to instruct the compiler and the
361108b42b4SDavid HowellsCPU to restrict the order.
362108b42b4SDavid Howells
363108b42b4SDavid HowellsMemory barriers are such interventions.  They impose a perceived partial
3642b94895bSDavid Howellsordering over the memory operations on either side of the barrier.
3652b94895bSDavid Howells
3662b94895bSDavid HowellsSuch enforcement is important because the CPUs and other devices in a system
36781fc6323SJarek Poplawskican use a variety of tricks to improve performance, including reordering,
3682b94895bSDavid Howellsdeferral and combination of memory operations; speculative loads; speculative
3692b94895bSDavid Howellsbranch prediction and various types of caching.  Memory barriers are used to
3702b94895bSDavid Howellsoverride or suppress these tricks, allowing the code to sanely control the
3712b94895bSDavid Howellsinteraction of multiple CPUs and/or devices.
372108b42b4SDavid Howells
373108b42b4SDavid Howells
374108b42b4SDavid HowellsVARIETIES OF MEMORY BARRIER
375108b42b4SDavid Howells---------------------------
376108b42b4SDavid Howells
377108b42b4SDavid HowellsMemory barriers come in four basic varieties:
378108b42b4SDavid Howells
379108b42b4SDavid Howells (1) Write (or store) memory barriers.
380108b42b4SDavid Howells
381108b42b4SDavid Howells     A write memory barrier gives a guarantee that all the STORE operations
382108b42b4SDavid Howells     specified before the barrier will appear to happen before all the STORE
383108b42b4SDavid Howells     operations specified after the barrier with respect to the other
384108b42b4SDavid Howells     components of the system.
385108b42b4SDavid Howells
386108b42b4SDavid Howells     A write barrier is a partial ordering on stores only; it is not required
387108b42b4SDavid Howells     to have any effect on loads.
388108b42b4SDavid Howells
3896bc39274SDavid Howells     A CPU can be viewed as committing a sequence of store operations to the
3905692fcc6SGuilherme G. Piccoli     memory system as time progresses.  All stores _before_ a write barrier
3915692fcc6SGuilherme G. Piccoli     will occur _before_ all the stores after the write barrier.
392108b42b4SDavid Howells
393203185f6SAkira Yokosawa     [!] Note that write barriers should normally be paired with read or
394203185f6SAkira Yokosawa     address-dependency barriers; see the "SMP barrier pairing" subsection.
395108b42b4SDavid Howells
396108b42b4SDavid Howells
397203185f6SAkira Yokosawa (2) Address-dependency barriers (historical).
398ad944630SPaul E. McKenney     [!] This section is marked as HISTORICAL: it covers the long-obsolete
399ad944630SPaul E. McKenney     smp_read_barrier_depends() macro, the semantics of which are now
400ad944630SPaul E. McKenney     implicit in all marked accesses.  For more up-to-date information,
401ad944630SPaul E. McKenney     including how compiler transformations can sometimes break address
402ad944630SPaul E. McKenney     dependencies, see Documentation/RCU/rcu_dereference.rst.
403108b42b4SDavid Howells
404f556082dSAkira Yokosawa     An address-dependency barrier is a weaker form of read barrier.  In the
405f556082dSAkira Yokosawa     case where two loads are performed such that the second depends on the
406f556082dSAkira Yokosawa     result of the first (eg: the first load retrieves the address to which
407f556082dSAkira Yokosawa     the second load will be directed), an address-dependency barrier would
408f556082dSAkira Yokosawa     be required to make sure that the target of the second load is updated
409f556082dSAkira Yokosawa     after the address obtained by the first load is accessed.
410108b42b4SDavid Howells
411f556082dSAkira Yokosawa     An address-dependency barrier is a partial ordering on interdependent
412f556082dSAkira Yokosawa     loads only; it is not required to have any effect on stores, independent
413f556082dSAkira Yokosawa     loads or overlapping loads.
414108b42b4SDavid Howells
415108b42b4SDavid Howells     As mentioned in (1), the other CPUs in the system can be viewed as
416108b42b4SDavid Howells     committing sequences of stores to the memory system that the CPU being
417f556082dSAkira Yokosawa     considered can then perceive.  An address-dependency barrier issued by
418f556082dSAkira Yokosawa     the CPU under consideration guarantees that for any load preceding it,
419f556082dSAkira Yokosawa     if that load touches one of a sequence of stores from another CPU, then
420f556082dSAkira Yokosawa     by the time the barrier completes, the effects of all the stores prior to
421f556082dSAkira Yokosawa     that touched by the load will be perceptible to any loads issued after
422f556082dSAkira Yokosawa     the address-dependency barrier.
423108b42b4SDavid Howells
424108b42b4SDavid Howells     See the "Examples of memory barrier sequences" subsection for diagrams
425108b42b4SDavid Howells     showing the ordering constraints.
426108b42b4SDavid Howells
427203185f6SAkira Yokosawa     [!] Note that the first load really has to have an _address_ dependency and
428108b42b4SDavid Howells     not a control dependency.  If the address for the second load is dependent
429108b42b4SDavid Howells     on the first load, but the dependency is through a conditional rather than
430108b42b4SDavid Howells     actually loading the address itself, then it's a _control_ dependency and
431108b42b4SDavid Howells     a full read barrier or better is required.  See the "Control dependencies"
432108b42b4SDavid Howells     subsection for more information.
433108b42b4SDavid Howells
434203185f6SAkira Yokosawa     [!] Note that address-dependency barriers should normally be paired with
435108b42b4SDavid Howells     write barriers; see the "SMP barrier pairing" subsection.
436108b42b4SDavid Howells
437203185f6SAkira Yokosawa     [!] Kernel release v5.9 removed kernel APIs for explicit address-
438203185f6SAkira Yokosawa     dependency barriers.  Nowadays, APIs for marking loads from shared
439203185f6SAkira Yokosawa     variables such as READ_ONCE() and rcu_dereference() provide implicit
440203185f6SAkira Yokosawa     address-dependency barriers.
441108b42b4SDavid Howells
442108b42b4SDavid Howells (3) Read (or load) memory barriers.
443108b42b4SDavid Howells
444f556082dSAkira Yokosawa     A read barrier is an address-dependency barrier plus a guarantee that all
445f556082dSAkira Yokosawa     the LOAD operations specified before the barrier will appear to happen
446f556082dSAkira Yokosawa     before all the LOAD operations specified after the barrier with respect to
447f556082dSAkira Yokosawa     the other components of the system.
448108b42b4SDavid Howells
449108b42b4SDavid Howells     A read barrier is a partial ordering on loads only; it is not required to
450108b42b4SDavid Howells     have any effect on stores.
451108b42b4SDavid Howells
452f556082dSAkira Yokosawa     Read memory barriers imply address-dependency barriers, and so can
453f556082dSAkira Yokosawa     substitute for them.
454108b42b4SDavid Howells
455108b42b4SDavid Howells     [!] Note that read barriers should normally be paired with write barriers;
456108b42b4SDavid Howells     see the "SMP barrier pairing" subsection.
457108b42b4SDavid Howells
458108b42b4SDavid Howells
459108b42b4SDavid Howells (4) General memory barriers.
460108b42b4SDavid Howells
461670bd95eSDavid Howells     A general memory barrier gives a guarantee that all the LOAD and STORE
462670bd95eSDavid Howells     operations specified before the barrier will appear to happen before all
463670bd95eSDavid Howells     the LOAD and STORE operations specified after the barrier with respect to
464670bd95eSDavid Howells     the other components of the system.
465670bd95eSDavid Howells
466670bd95eSDavid Howells     A general memory barrier is a partial ordering over both loads and stores.
467108b42b4SDavid Howells
468108b42b4SDavid Howells     General memory barriers imply both read and write memory barriers, and so
469108b42b4SDavid Howells     can substitute for either.
470108b42b4SDavid Howells
471108b42b4SDavid Howells
472108b42b4SDavid HowellsAnd a couple of implicit varieties:
473108b42b4SDavid Howells
4742e4f5382SPeter Zijlstra (5) ACQUIRE operations.
475108b42b4SDavid Howells
476108b42b4SDavid Howells     This acts as a one-way permeable barrier.  It guarantees that all memory
4772e4f5382SPeter Zijlstra     operations after the ACQUIRE operation will appear to happen after the
4782e4f5382SPeter Zijlstra     ACQUIRE operation with respect to the other components of the system.
479787df638SDavidlohr Bueso     ACQUIRE operations include LOCK operations and both smp_load_acquire()
4802f359c7eSAndrea Parri     and smp_cond_load_acquire() operations.
481108b42b4SDavid Howells
4822e4f5382SPeter Zijlstra     Memory operations that occur before an ACQUIRE operation may appear to
4832e4f5382SPeter Zijlstra     happen after it completes.
484108b42b4SDavid Howells
4852e4f5382SPeter Zijlstra     An ACQUIRE operation should almost always be paired with a RELEASE
4862e4f5382SPeter Zijlstra     operation.
487108b42b4SDavid Howells
488108b42b4SDavid Howells
4892e4f5382SPeter Zijlstra (6) RELEASE operations.
490108b42b4SDavid Howells
491108b42b4SDavid Howells     This also acts as a one-way permeable barrier.  It guarantees that all
4922e4f5382SPeter Zijlstra     memory operations before the RELEASE operation will appear to happen
4932e4f5382SPeter Zijlstra     before the RELEASE operation with respect to the other components of the
4942e4f5382SPeter Zijlstra     system. RELEASE operations include UNLOCK operations and
4952e4f5382SPeter Zijlstra     smp_store_release() operations.
496108b42b4SDavid Howells
4972e4f5382SPeter Zijlstra     Memory operations that occur after a RELEASE operation may appear to
498108b42b4SDavid Howells     happen before it completes.
499108b42b4SDavid Howells
5002e4f5382SPeter Zijlstra     The use of ACQUIRE and RELEASE operations generally precludes the need
501a897b13dSSeongJae Park     for other sorts of memory barrier.  In addition, a RELEASE+ACQUIRE pair is
502a897b13dSSeongJae Park     -not- guaranteed to act as a full memory barrier.  However, after an
503a897b13dSSeongJae Park     ACQUIRE on a given variable, all memory accesses preceding any prior
5042e4f5382SPeter Zijlstra     RELEASE on that same variable are guaranteed to be visible.  In other
5052e4f5382SPeter Zijlstra     words, within a given variable's critical section, all accesses of all
5062e4f5382SPeter Zijlstra     previous critical sections for that variable are guaranteed to have
5072e4f5382SPeter Zijlstra     completed.
50817eb88e0SPaul E. McKenney
5092e4f5382SPeter Zijlstra     This means that ACQUIRE acts as a minimal "acquire" operation and
5102e4f5382SPeter Zijlstra     RELEASE acts as a minimal "release" operation.
511108b42b4SDavid Howells
512706eeb3eSPeter ZijlstraA subset of the atomic operations described in atomic_t.txt have ACQUIRE and
513706eeb3eSPeter ZijlstraRELEASE variants in addition to fully-ordered and relaxed (no barrier
514706eeb3eSPeter Zijlstrasemantics) definitions.  For compound atomics performing both a load and a
515706eeb3eSPeter Zijlstrastore, ACQUIRE semantics apply only to the load and RELEASE semantics apply
516706eeb3eSPeter Zijlstraonly to the store portion of the operation.
517108b42b4SDavid Howells
518108b42b4SDavid HowellsMemory barriers are only required where there's a possibility of interaction
519108b42b4SDavid Howellsbetween two CPUs or between a CPU and a device.  If it can be guaranteed that
520108b42b4SDavid Howellsthere won't be any such interaction in any particular piece of code, then
521108b42b4SDavid Howellsmemory barriers are unnecessary in that piece of code.
522108b42b4SDavid Howells
523108b42b4SDavid Howells
524108b42b4SDavid HowellsNote that these are the _minimum_ guarantees.  Different architectures may give
525108b42b4SDavid Howellsmore substantial guarantees, but they may _not_ be relied upon outside of arch
526108b42b4SDavid Howellsspecific code.
527108b42b4SDavid Howells
528108b42b4SDavid Howells
529108b42b4SDavid HowellsWHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
530108b42b4SDavid Howells----------------------------------------------
531108b42b4SDavid Howells
532108b42b4SDavid HowellsThere are certain things that the Linux kernel memory barriers do not guarantee:
533108b42b4SDavid Howells
534108b42b4SDavid Howells (*) There is no guarantee that any of the memory accesses specified before a
535108b42b4SDavid Howells     memory barrier will be _complete_ by the completion of a memory barrier
536108b42b4SDavid Howells     instruction; the barrier can be considered to draw a line in that CPU's
537108b42b4SDavid Howells     access queue that accesses of the appropriate type may not cross.
538108b42b4SDavid Howells
539108b42b4SDavid Howells (*) There is no guarantee that issuing a memory barrier on one CPU will have
540108b42b4SDavid Howells     any direct effect on another CPU or any other hardware in the system.  The
541108b42b4SDavid Howells     indirect effect will be the order in which the second CPU sees the effects
542108b42b4SDavid Howells     of the first CPU's accesses occur, but see the next point:
543108b42b4SDavid Howells
5446bc39274SDavid Howells (*) There is no guarantee that a CPU will see the correct order of effects
545108b42b4SDavid Howells     from a second CPU's accesses, even _if_ the second CPU uses a memory
546108b42b4SDavid Howells     barrier, unless the first CPU _also_ uses a matching memory barrier (see
547108b42b4SDavid Howells     the subsection on "SMP Barrier Pairing").
548108b42b4SDavid Howells
549108b42b4SDavid Howells (*) There is no guarantee that some intervening piece of off-the-CPU
550108b42b4SDavid Howells     hardware[*] will not reorder the memory accesses.  CPU cache coherency
551108b42b4SDavid Howells     mechanisms should propagate the indirect effects of a memory barrier
552108b42b4SDavid Howells     between CPUs, but might not do so in order.
553108b42b4SDavid Howells
554108b42b4SDavid Howells	[*] For information on bus mastering DMA and coherency please read:
555108b42b4SDavid Howells
556bff9e34cSMauro Carvalho Chehab	    Documentation/driver-api/pci/pci.rst
557537f3a7cSSeongJae Park	    Documentation/core-api/dma-api-howto.rst
558537f3a7cSSeongJae Park	    Documentation/core-api/dma-api.rst
559108b42b4SDavid Howells
560108b42b4SDavid Howells
561203185f6SAkira YokosawaADDRESS-DEPENDENCY BARRIERS (HISTORICAL)
562203185f6SAkira Yokosawa----------------------------------------
563ad944630SPaul E. McKenney[!] This section is marked as HISTORICAL: it covers the long-obsolete
564ad944630SPaul E. McKenneysmp_read_barrier_depends() macro, the semantics of which are now implicit
565ad944630SPaul E. McKenneyin all marked accesses.  For more up-to-date information, including
566ad944630SPaul E. McKenneyhow compiler transformations can sometimes break address dependencies,
567ad944630SPaul E. McKenneysee Documentation/RCU/rcu_dereference.rst.
568f28f0868SPaul E. McKenney
5698ca924aeSWill DeaconAs of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for
5708ca924aeSWill DeaconDEC Alpha, which means that about the only people who need to pay attention
5718ca924aeSWill Deaconto this section are those working on DEC Alpha architecture-specific code
5728ca924aeSWill Deaconand those working on READ_ONCE() itself.  For those who need it, and for
5738ca924aeSWill Deaconthose who are interested in the history, here is the story of
574203185f6SAkira Yokosawaaddress-dependency barriers.
575108b42b4SDavid Howells
576203185f6SAkira Yokosawa[!] While address dependencies are observed in both load-to-load and
577203185f6SAkira Yokosawaload-to-store relations, address-dependency barriers are not necessary
578203185f6SAkira Yokosawafor load-to-store situations.
579203185f6SAkira Yokosawa
580203185f6SAkira YokosawaThe requirement of address-dependency barriers is a little subtle, and
581108b42b4SDavid Howellsit's not always obvious that they're needed.  To illustrate, consider the
582108b42b4SDavid Howellsfollowing sequence of events:
583108b42b4SDavid Howells
584108b42b4SDavid Howells	CPU 1		      CPU 2
585108b42b4SDavid Howells	===============	      ===============
5863dbf0913SSeongJae Park	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
587108b42b4SDavid Howells	B = 4;
588108b42b4SDavid Howells	<write barrier>
5898149b5cbSSeongJae Park	WRITE_ONCE(P, &B);
590203185f6SAkira Yokosawa			      Q = READ_ONCE_OLD(P);
591108b42b4SDavid Howells			      D = *Q;
592108b42b4SDavid Howells
593203185f6SAkira Yokosawa[!] READ_ONCE_OLD() corresponds to READ_ONCE() of pre-4.15 kernel, which
594203185f6SAkira Yokosawadoesn't imply an address-dependency barrier.
595203185f6SAkira Yokosawa
596f556082dSAkira YokosawaThere's a clear address dependency here, and it would seem that by the end of
597f556082dSAkira Yokosawathe sequence, Q must be either &A or &B, and that:
598108b42b4SDavid Howells
599108b42b4SDavid Howells	(Q == &A) implies (D == 1)
600108b42b4SDavid Howells	(Q == &B) implies (D == 4)
601108b42b4SDavid Howells
602108b42b4SDavid HowellsBut!  CPU 2's perception of P may be updated _before_ its perception of B, thus
603108b42b4SDavid Howellsleading to the following situation:
604108b42b4SDavid Howells
605108b42b4SDavid Howells	(Q == &B) and (D == 2) ????
606108b42b4SDavid Howells
607806654a9SWill DeaconWhile this may seem like a failure of coherency or causality maintenance, it
608108b42b4SDavid Howellsisn't, and this behaviour can be observed on certain real CPUs (such as the DEC
609108b42b4SDavid HowellsAlpha).
610108b42b4SDavid Howells
611f556082dSAkira YokosawaTo deal with this, READ_ONCE() provides an implicit address-dependency barrier
612f556082dSAkira Yokosawasince kernel release v4.15:
613108b42b4SDavid Howells
614108b42b4SDavid Howells	CPU 1		      CPU 2
615108b42b4SDavid Howells	===============	      ===============
6163dbf0913SSeongJae Park	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
617108b42b4SDavid Howells	B = 4;
618108b42b4SDavid Howells	<write barrier>
6199af194ceSPaul E. McKenney	WRITE_ONCE(P, &B);
6209af194ceSPaul E. McKenney			      Q = READ_ONCE(P);
621203185f6SAkira Yokosawa			      <implicit address-dependency barrier>
622108b42b4SDavid Howells			      D = *Q;
623108b42b4SDavid Howells
624108b42b4SDavid HowellsThis enforces the occurrence of one of the two implications, and prevents the
625108b42b4SDavid Howellsthird possibility from arising.
626108b42b4SDavid Howells
62792a84dd2SPaul E. McKenney
628108b42b4SDavid Howells[!] Note that this extremely counterintuitive situation arises most easily on
629108b42b4SDavid Howellsmachines with split caches, so that, for example, one cache bank processes
630108b42b4SDavid Howellseven-numbered cache lines and the other bank processes odd-numbered cache
631108b42b4SDavid Howellslines.  The pointer P might be stored in an odd-numbered cache line, and the
632108b42b4SDavid Howellsvariable B might be stored in an even-numbered cache line.  Then, if the
633108b42b4SDavid Howellseven-numbered bank of the reading CPU's cache is extremely busy while the
634108b42b4SDavid Howellsodd-numbered bank is idle, one can see the new value of the pointer P (&B),
6356bc39274SDavid Howellsbut the old value of the variable B (2).
636108b42b4SDavid Howells
637108b42b4SDavid Howells
638203185f6SAkira YokosawaAn address-dependency barrier is not required to order dependent writes
639f556082dSAkira Yokosawabecause the CPUs that the Linux kernel supports don't do writes until they
640f556082dSAkira Yokosawaare certain (1) that the write will actually happen, (2) of the location of
641f556082dSAkira Yokosawathe write, and (3) of the value to be written.
64266ce3a4dSPaul E. McKenneyBut please carefully read the "CONTROL DEPENDENCIES" section and the
643f556082dSAkira YokosawaDocumentation/RCU/rcu_dereference.rst file:  The compiler can and does break
644f556082dSAkira Yokosawadependencies in a great many highly creative ways.
64566ce3a4dSPaul E. McKenney
64666ce3a4dSPaul E. McKenney	CPU 1		      CPU 2
64766ce3a4dSPaul E. McKenney	===============	      ===============
64866ce3a4dSPaul E. McKenney	{ A == 1, B == 2, C = 3, P == &A, Q == &C }
64966ce3a4dSPaul E. McKenney	B = 4;
65066ce3a4dSPaul E. McKenney	<write barrier>
65166ce3a4dSPaul E. McKenney	WRITE_ONCE(P, &B);
652203185f6SAkira Yokosawa			      Q = READ_ONCE_OLD(P);
65366ce3a4dSPaul E. McKenney			      WRITE_ONCE(*Q, 5);
65466ce3a4dSPaul E. McKenney
655203185f6SAkira YokosawaTherefore, no address-dependency barrier is required to order the read into
65666ce3a4dSPaul E. McKenneyQ with the store into *Q.  In other words, this outcome is prohibited,
657203185f6SAkira Yokosawaeven without an implicit address-dependency barrier of modern READ_ONCE():
65866ce3a4dSPaul E. McKenney
65966ce3a4dSPaul E. McKenney	(Q == &B) && (B == 4)
66066ce3a4dSPaul E. McKenney
66166ce3a4dSPaul E. McKenneyPlease note that this pattern should be rare.  After all, the whole point
66266ce3a4dSPaul E. McKenneyof dependency ordering is to -prevent- writes to the data structure, along
66366ce3a4dSPaul E. McKenneywith the expensive cache misses associated with those writes.  This pattern
66466ce3a4dSPaul E. McKenneycan be used to record rare error conditions and the like, and the CPUs'
66566ce3a4dSPaul E. McKenneynaturally occurring ordering prevents such records from being lost.
66666ce3a4dSPaul E. McKenney
66766ce3a4dSPaul E. McKenney
668203185f6SAkira YokosawaNote well that the ordering provided by an address dependency is local to
669f1ab25a3SPaul E. McKenneythe CPU containing it.  See the section on "Multicopy atomicity" for
670f1ab25a3SPaul E. McKenneymore information.
671f1ab25a3SPaul E. McKenney
672f1ab25a3SPaul E. McKenney
673203185f6SAkira YokosawaThe address-dependency barrier is very important to the RCU system,
6742ecf8101SPaul E. McKenneyfor example.  See rcu_assign_pointer() and rcu_dereference() in
6752ecf8101SPaul E. McKenneyinclude/linux/rcupdate.h.  This permits the current target of an RCU'd
6762ecf8101SPaul E. McKenneypointer to be replaced with a new modified target, without the replacement
6772ecf8101SPaul E. McKenneytarget appearing to be incompletely initialised.
678108b42b4SDavid Howells
679108b42b4SDavid Howells
680108b42b4SDavid HowellsCONTROL DEPENDENCIES
681108b42b4SDavid Howells--------------------
682108b42b4SDavid Howells
683c8241f85SPaul E. McKenneyControl dependencies can be a bit tricky because current compilers do
684c8241f85SPaul E. McKenneynot understand them.  The purpose of this section is to help you prevent
685c8241f85SPaul E. McKenneythe compiler's ignorance from breaking your code.
686c8241f85SPaul E. McKenney
687ff382810SPaul E. McKenneyA load-load control dependency requires a full read memory barrier, not
688f556082dSAkira Yokosawasimply an (implicit) address-dependency barrier to make it work correctly.
689f556082dSAkira YokosawaConsider the following bit of code:
690108b42b4SDavid Howells
6919af194ceSPaul E. McKenney	q = READ_ONCE(a);
692203185f6SAkira Yokosawa	<implicit address-dependency barrier>
69318c03c61SPeter Zijlstra	if (q) {
694203185f6SAkira Yokosawa		/* BUG: No address dependency!!! */
6959af194ceSPaul E. McKenney		p = READ_ONCE(b);
69645c8a36aSPaul E. McKenney	}
697108b42b4SDavid Howells
698203185f6SAkira YokosawaThis will not have the desired effect because there is no actual address
6992ecf8101SPaul E. McKenneydependency, but rather a control dependency that the CPU may short-circuit
7002ecf8101SPaul E. McKenneyby attempting to predict the outcome in advance, so that other CPUs see
701f556082dSAkira Yokosawathe load from b as having happened before the load from a.  In such a case
702f556082dSAkira Yokosawawhat's actually required is:
703108b42b4SDavid Howells
7049af194ceSPaul E. McKenney	q = READ_ONCE(a);
70518c03c61SPeter Zijlstra	if (q) {
706108b42b4SDavid Howells		<read barrier>
7079af194ceSPaul E. McKenney		p = READ_ONCE(b);
70845c8a36aSPaul E. McKenney	}
70918c03c61SPeter Zijlstra
71018c03c61SPeter ZijlstraHowever, stores are not speculated.  This means that ordering -is- provided
711ff382810SPaul E. McKenneyfor load-store control dependencies, as in the following example:
71218c03c61SPeter Zijlstra
713105ff3cbSLinus Torvalds	q = READ_ONCE(a);
7142456d2a6SPaul E. McKenney	if (q) {
715c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
71618c03c61SPeter Zijlstra	}
71718c03c61SPeter Zijlstra
718c8241f85SPaul E. McKenneyControl dependencies pair normally with other types of barriers.
719c8241f85SPaul E. McKenneyThat said, please note that neither READ_ONCE() nor WRITE_ONCE()
720c8241f85SPaul E. McKenneyare optional! Without the READ_ONCE(), the compiler might combine the
721c8241f85SPaul E. McKenneyload from 'a' with other loads from 'a'.  Without the WRITE_ONCE(),
722c8241f85SPaul E. McKenneythe compiler might combine the store to 'b' with other stores to 'b'.
723c8241f85SPaul E. McKenneyEither can result in highly counterintuitive effects on ordering.
72418c03c61SPeter Zijlstra
72518c03c61SPeter ZijlstraWorse yet, if the compiler is able to prove (say) that the value of
72618c03c61SPeter Zijlstravariable 'a' is always non-zero, it would be well within its rights
72718c03c61SPeter Zijlstrato optimize the original example by eliminating the "if" statement
72818c03c61SPeter Zijlstraas follows:
72918c03c61SPeter Zijlstra
73018c03c61SPeter Zijlstra	q = a;
731c8241f85SPaul E. McKenney	b = 1;  /* BUG: Compiler and CPU can both reorder!!! */
73218c03c61SPeter Zijlstra
733105ff3cbSLinus TorvaldsSo don't leave out the READ_ONCE().
7342456d2a6SPaul E. McKenney
7352456d2a6SPaul E. McKenneyIt is tempting to try to enforce ordering on identical stores on both
7362456d2a6SPaul E. McKenneybranches of the "if" statement as follows:
73718c03c61SPeter Zijlstra
738105ff3cbSLinus Torvalds	q = READ_ONCE(a);
73918c03c61SPeter Zijlstra	if (q) {
7409b2b3bf5SPaul E. McKenney		barrier();
741c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
74218c03c61SPeter Zijlstra		do_something();
74318c03c61SPeter Zijlstra	} else {
7449b2b3bf5SPaul E. McKenney		barrier();
745c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
74618c03c61SPeter Zijlstra		do_something_else();
74718c03c61SPeter Zijlstra	}
74818c03c61SPeter Zijlstra
7492456d2a6SPaul E. McKenneyUnfortunately, current compilers will transform this as follows at high
7502456d2a6SPaul E. McKenneyoptimization levels:
75118c03c61SPeter Zijlstra
752105ff3cbSLinus Torvalds	q = READ_ONCE(a);
7532456d2a6SPaul E. McKenney	barrier();
754c8241f85SPaul E. McKenney	WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
75518c03c61SPeter Zijlstra	if (q) {
756c8241f85SPaul E. McKenney		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
75718c03c61SPeter Zijlstra		do_something();
75818c03c61SPeter Zijlstra	} else {
759c8241f85SPaul E. McKenney		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
76018c03c61SPeter Zijlstra		do_something_else();
76118c03c61SPeter Zijlstra	}
76218c03c61SPeter Zijlstra
7632456d2a6SPaul E. McKenneyNow there is no conditional between the load from 'a' and the store to
7642456d2a6SPaul E. McKenney'b', which means that the CPU is within its rights to reorder them:
7652456d2a6SPaul E. McKenneyThe conditional is absolutely required, and must be present in the
7662456d2a6SPaul E. McKenneyassembly code even after all compiler optimizations have been applied.
7672456d2a6SPaul E. McKenneyTherefore, if you need ordering in this example, you need explicit
7682456d2a6SPaul E. McKenneymemory barriers, for example, smp_store_release():
76918c03c61SPeter Zijlstra
7709af194ceSPaul E. McKenney	q = READ_ONCE(a);
7712456d2a6SPaul E. McKenney	if (q) {
772c8241f85SPaul E. McKenney		smp_store_release(&b, 1);
77318c03c61SPeter Zijlstra		do_something();
77418c03c61SPeter Zijlstra	} else {
775c8241f85SPaul E. McKenney		smp_store_release(&b, 1);
77618c03c61SPeter Zijlstra		do_something_else();
77718c03c61SPeter Zijlstra	}
77818c03c61SPeter Zijlstra
7792456d2a6SPaul E. McKenneyIn contrast, without explicit memory barriers, two-legged-if control
7802456d2a6SPaul E. McKenneyordering is guaranteed only when the stores differ, for example:
7812456d2a6SPaul E. McKenney
782105ff3cbSLinus Torvalds	q = READ_ONCE(a);
7832456d2a6SPaul E. McKenney	if (q) {
784c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
7852456d2a6SPaul E. McKenney		do_something();
7862456d2a6SPaul E. McKenney	} else {
787c8241f85SPaul E. McKenney		WRITE_ONCE(b, 2);
7882456d2a6SPaul E. McKenney		do_something_else();
7892456d2a6SPaul E. McKenney	}
7902456d2a6SPaul E. McKenney
791105ff3cbSLinus TorvaldsThe initial READ_ONCE() is still required to prevent the compiler from
792105ff3cbSLinus Torvaldsproving the value of 'a'.
79318c03c61SPeter Zijlstra
79418c03c61SPeter ZijlstraIn addition, you need to be careful what you do with the local variable 'q',
79518c03c61SPeter Zijlstraotherwise the compiler might be able to guess the value and again remove
79618c03c61SPeter Zijlstrathe needed conditional.  For example:
79718c03c61SPeter Zijlstra
798105ff3cbSLinus Torvalds	q = READ_ONCE(a);
79918c03c61SPeter Zijlstra	if (q % MAX) {
800c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
80118c03c61SPeter Zijlstra		do_something();
80218c03c61SPeter Zijlstra	} else {
803c8241f85SPaul E. McKenney		WRITE_ONCE(b, 2);
80418c03c61SPeter Zijlstra		do_something_else();
80518c03c61SPeter Zijlstra	}
80618c03c61SPeter Zijlstra
80718c03c61SPeter ZijlstraIf MAX is defined to be 1, then the compiler knows that (q % MAX) is
80818c03c61SPeter Zijlstraequal to zero, in which case the compiler is within its rights to
80918c03c61SPeter Zijlstratransform the above code into the following:
81018c03c61SPeter Zijlstra
811105ff3cbSLinus Torvalds	q = READ_ONCE(a);
812b26cfc48Spierre Kuo	WRITE_ONCE(b, 2);
81318c03c61SPeter Zijlstra	do_something_else();
81418c03c61SPeter Zijlstra
8152456d2a6SPaul E. McKenneyGiven this transformation, the CPU is not required to respect the ordering
8162456d2a6SPaul E. McKenneybetween the load from variable 'a' and the store to variable 'b'.  It is
8172456d2a6SPaul E. McKenneytempting to add a barrier(), but this does not help.  The conditional
8182456d2a6SPaul E. McKenneyis gone, and the barrier won't bring it back.  Therefore, if you are
8192456d2a6SPaul E. McKenneyrelying on this ordering, you should make sure that MAX is greater than
8202456d2a6SPaul E. McKenneyone, perhaps as follows:
82118c03c61SPeter Zijlstra
822105ff3cbSLinus Torvalds	q = READ_ONCE(a);
82318c03c61SPeter Zijlstra	BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
82418c03c61SPeter Zijlstra	if (q % MAX) {
825c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
82618c03c61SPeter Zijlstra		do_something();
82718c03c61SPeter Zijlstra	} else {
828c8241f85SPaul E. McKenney		WRITE_ONCE(b, 2);
82918c03c61SPeter Zijlstra		do_something_else();
83018c03c61SPeter Zijlstra	}
83118c03c61SPeter Zijlstra
8322456d2a6SPaul E. McKenneyPlease note once again that the stores to 'b' differ.  If they were
8332456d2a6SPaul E. McKenneyidentical, as noted earlier, the compiler could pull this store outside
8342456d2a6SPaul E. McKenneyof the 'if' statement.
8352456d2a6SPaul E. McKenney
8368b19d1deSPaul E. McKenneyYou must also be careful not to rely too much on boolean short-circuit
8378b19d1deSPaul E. McKenneyevaluation.  Consider this example:
8388b19d1deSPaul E. McKenney
839105ff3cbSLinus Torvalds	q = READ_ONCE(a);
84057aecae9SPaul E. McKenney	if (q || 1 > 0)
8419af194ceSPaul E. McKenney		WRITE_ONCE(b, 1);
8428b19d1deSPaul E. McKenney
8435af4692aSPaul E. McKenneyBecause the first condition cannot fault and the second condition is
8445af4692aSPaul E. McKenneyalways true, the compiler can transform this example as following,
8455af4692aSPaul E. McKenneydefeating control dependency:
8468b19d1deSPaul E. McKenney
847105ff3cbSLinus Torvalds	q = READ_ONCE(a);
8489af194ceSPaul E. McKenney	WRITE_ONCE(b, 1);
8498b19d1deSPaul E. McKenney
8508b19d1deSPaul E. McKenneyThis example underscores the need to ensure that the compiler cannot
8519af194ceSPaul E. McKenneyout-guess your code.  More generally, although READ_ONCE() does force
8528b19d1deSPaul E. McKenneythe compiler to actually emit code for a given load, it does not force
8538b19d1deSPaul E. McKenneythe compiler to use the results.
8548b19d1deSPaul E. McKenney
855ebff09a6SPaul E. McKenneyIn addition, control dependencies apply only to the then-clause and
856ebff09a6SPaul E. McKenneyelse-clause of the if-statement in question.  In particular, it does
857ebff09a6SPaul E. McKenneynot necessarily apply to code following the if-statement:
858ebff09a6SPaul E. McKenney
859ebff09a6SPaul E. McKenney	q = READ_ONCE(a);
860ebff09a6SPaul E. McKenney	if (q) {
861c8241f85SPaul E. McKenney		WRITE_ONCE(b, 1);
862ebff09a6SPaul E. McKenney	} else {
863c8241f85SPaul E. McKenney		WRITE_ONCE(b, 2);
864ebff09a6SPaul E. McKenney	}
865c8241f85SPaul E. McKenney	WRITE_ONCE(c, 1);  /* BUG: No ordering against the read from 'a'. */
866ebff09a6SPaul E. McKenney
867ebff09a6SPaul E. McKenneyIt is tempting to argue that there in fact is ordering because the
868ebff09a6SPaul E. McKenneycompiler cannot reorder volatile accesses and also cannot reorder
869c8241f85SPaul E. McKenneythe writes to 'b' with the condition.  Unfortunately for this line
870c8241f85SPaul E. McKenneyof reasoning, the compiler might compile the two writes to 'b' as
871ebff09a6SPaul E. McKenneyconditional-move instructions, as in this fanciful pseudo-assembly
872ebff09a6SPaul E. McKenneylanguage:
873ebff09a6SPaul E. McKenney
874ebff09a6SPaul E. McKenney	ld r1,a
875ebff09a6SPaul E. McKenney	cmp r1,$0
876c8241f85SPaul E. McKenney	cmov,ne r4,$1
877c8241f85SPaul E. McKenney	cmov,eq r4,$2
878ebff09a6SPaul E. McKenney	st r4,b
879ebff09a6SPaul E. McKenney	st $1,c
880ebff09a6SPaul E. McKenney
881ebff09a6SPaul E. McKenneyA weakly ordered CPU would have no dependency of any sort between the load
882c8241f85SPaul E. McKenneyfrom 'a' and the store to 'c'.  The control dependencies would extend
883ebff09a6SPaul E. McKenneyonly to the pair of cmov instructions and the store depending on them.
884ebff09a6SPaul E. McKenneyIn short, control dependencies apply only to the stores in the then-clause
885ebff09a6SPaul E. McKenneyand else-clause of the if-statement in question (including functions
886ebff09a6SPaul E. McKenneyinvoked by those two clauses), not to code following that if-statement.
887ebff09a6SPaul E. McKenney
88818c03c61SPeter Zijlstra
889f1ab25a3SPaul E. McKenneyNote well that the ordering provided by a control dependency is local
890f1ab25a3SPaul E. McKenneyto the CPU containing it.  See the section on "Multicopy atomicity"
891f1ab25a3SPaul E. McKenneyfor more information.
89218c03c61SPeter Zijlstra
89318c03c61SPeter Zijlstra
89418c03c61SPeter ZijlstraIn summary:
89518c03c61SPeter Zijlstra
89618c03c61SPeter Zijlstra  (*) Control dependencies can order prior loads against later stores.
89718c03c61SPeter Zijlstra      However, they do -not- guarantee any other sort of ordering:
89818c03c61SPeter Zijlstra      Not prior loads against later loads, nor prior stores against
89918c03c61SPeter Zijlstra      later anything.  If you need these other forms of ordering,
900d87510c5SDavidlohr Bueso      use smp_rmb(), smp_wmb(), or, in the case of prior stores and
90118c03c61SPeter Zijlstra      later loads, smp_mb().
90218c03c61SPeter Zijlstra
9037817b799SPaul E. McKenney  (*) If both legs of the "if" statement begin with identical stores to
9047817b799SPaul E. McKenney      the same variable, then those stores must be ordered, either by
9057817b799SPaul E. McKenney      preceding both of them with smp_mb() or by using smp_store_release()
9067817b799SPaul E. McKenney      to carry out the stores.  Please note that it is -not- sufficient
907a5052657SPaul E. McKenney      to use barrier() at beginning of each leg of the "if" statement
908a5052657SPaul E. McKenney      because, as shown by the example above, optimizing compilers can
909a5052657SPaul E. McKenney      destroy the control dependency while respecting the letter of the
910a5052657SPaul E. McKenney      barrier() law.
9119b2b3bf5SPaul E. McKenney
91218c03c61SPeter Zijlstra  (*) Control dependencies require at least one run-time conditional
913586dd56aSPaul E. McKenney      between the prior load and the subsequent store, and this
9149af194ceSPaul E. McKenney      conditional must involve the prior load.  If the compiler is able
9159af194ceSPaul E. McKenney      to optimize the conditional away, it will have also optimized
916105ff3cbSLinus Torvalds      away the ordering.  Careful use of READ_ONCE() and WRITE_ONCE()
917105ff3cbSLinus Torvalds      can help to preserve the needed conditional.
91818c03c61SPeter Zijlstra
91918c03c61SPeter Zijlstra  (*) Control dependencies require that the compiler avoid reordering the
920105ff3cbSLinus Torvalds      dependency into nonexistence.  Careful use of READ_ONCE() or
921105ff3cbSLinus Torvalds      atomic{,64}_read() can help to preserve your control dependency.
922895f5542SPaul E. McKenney      Please see the COMPILER BARRIER section for more information.
92318c03c61SPeter Zijlstra
924ebff09a6SPaul E. McKenney  (*) Control dependencies apply only to the then-clause and else-clause
925ebff09a6SPaul E. McKenney      of the if-statement containing the control dependency, including
926ebff09a6SPaul E. McKenney      any functions that these two clauses call.  Control dependencies
927ebff09a6SPaul E. McKenney      do -not- apply to code following the if-statement containing the
928ebff09a6SPaul E. McKenney      control dependency.
929ebff09a6SPaul E. McKenney
930ff382810SPaul E. McKenney  (*) Control dependencies pair normally with other types of barriers.
931ff382810SPaul E. McKenney
932f1ab25a3SPaul E. McKenney  (*) Control dependencies do -not- provide multicopy atomicity.  If you
933f1ab25a3SPaul E. McKenney      need all the CPUs to see a given store at the same time, use smp_mb().
934108b42b4SDavid Howells
935c8241f85SPaul E. McKenney  (*) Compilers do not understand control dependencies.  It is therefore
936c8241f85SPaul E. McKenney      your job to ensure that they do not break your code.
937c8241f85SPaul E. McKenney
938108b42b4SDavid Howells
939108b42b4SDavid HowellsSMP BARRIER PAIRING
940108b42b4SDavid Howells-------------------
941108b42b4SDavid Howells
942108b42b4SDavid HowellsWhen dealing with CPU-CPU interactions, certain types of memory barrier should
943108b42b4SDavid Howellsalways be paired.  A lack of appropriate pairing is almost certainly an error.
944108b42b4SDavid Howells
945ff382810SPaul E. McKenneyGeneral barriers pair with each other, though they also pair with most
946f1ab25a3SPaul E. McKenneyother types of barriers, albeit without multicopy atomicity.  An acquire
947f1ab25a3SPaul E. McKenneybarrier pairs with a release barrier, but both may also pair with other
948f1ab25a3SPaul E. McKenneybarriers, including of course general barriers.  A write barrier pairs
949203185f6SAkira Yokosawawith an address-dependency barrier, a control dependency, an acquire barrier,
950f1ab25a3SPaul E. McKenneya release barrier, a read barrier, or a general barrier.  Similarly a
951203185f6SAkira Yokosawaread barrier, control dependency, or an address-dependency barrier pairs
952f1ab25a3SPaul E. McKenneywith a write barrier, an acquire barrier, a release barrier, or a
953f1ab25a3SPaul E. McKenneygeneral barrier:
954108b42b4SDavid Howells
955108b42b4SDavid Howells	CPU 1		      CPU 2
956108b42b4SDavid Howells	===============	      ===============
9579af194ceSPaul E. McKenney	WRITE_ONCE(a, 1);
958108b42b4SDavid Howells	<write barrier>
9599af194ceSPaul E. McKenney	WRITE_ONCE(b, 2);     x = READ_ONCE(b);
960108b42b4SDavid Howells			      <read barrier>
9619af194ceSPaul E. McKenney			      y = READ_ONCE(a);
962108b42b4SDavid Howells
963108b42b4SDavid HowellsOr:
964108b42b4SDavid Howells
965108b42b4SDavid Howells	CPU 1		      CPU 2
966108b42b4SDavid Howells	===============	      ===============================
967108b42b4SDavid Howells	a = 1;
968108b42b4SDavid Howells	<write barrier>
9699af194ceSPaul E. McKenney	WRITE_ONCE(b, &a);    x = READ_ONCE(b);
970203185f6SAkira Yokosawa			      <implicit address-dependency barrier>
971108b42b4SDavid Howells			      y = *x;
972108b42b4SDavid Howells
973ff382810SPaul E. McKenneyOr even:
974ff382810SPaul E. McKenney
975ff382810SPaul E. McKenney	CPU 1		      CPU 2
976ff382810SPaul E. McKenney	===============	      ===============================
9779af194ceSPaul E. McKenney	r1 = READ_ONCE(y);
978ff382810SPaul E. McKenney	<general barrier>
979d92f842bSScott Tsai	WRITE_ONCE(x, 1);     if (r2 = READ_ONCE(x)) {
980ff382810SPaul E. McKenney			         <implicit control dependency>
9819af194ceSPaul E. McKenney			         WRITE_ONCE(y, 1);
982ff382810SPaul E. McKenney			      }
983ff382810SPaul E. McKenney
984ff382810SPaul E. McKenney	assert(r1 == 0 || r2 == 0);
985ff382810SPaul E. McKenney
986108b42b4SDavid HowellsBasically, the read barrier always has to be there, even though it can be of
987108b42b4SDavid Howellsthe "weaker" type.
988108b42b4SDavid Howells
989670bd95eSDavid Howells[!] Note that the stores before the write barrier would normally be expected to
990f556082dSAkira Yokosawamatch the loads after the read barrier or the address-dependency barrier, and
991f556082dSAkira Yokosawavice versa:
992670bd95eSDavid Howells
993670bd95eSDavid Howells	CPU 1                               CPU 2
9942ecf8101SPaul E. McKenney	===================                 ===================
9959af194ceSPaul E. McKenney	WRITE_ONCE(a, 1);    }----   --->{  v = READ_ONCE(c);
9969af194ceSPaul E. McKenney	WRITE_ONCE(b, 2);    }    \ /    {  w = READ_ONCE(d);
997670bd95eSDavid Howells	<write barrier>            \        <read barrier>
9989af194ceSPaul E. McKenney	WRITE_ONCE(c, 3);    }    / \    {  x = READ_ONCE(a);
9999af194ceSPaul E. McKenney	WRITE_ONCE(d, 4);    }----   --->{  y = READ_ONCE(b);
1000670bd95eSDavid Howells
1001108b42b4SDavid Howells
1002108b42b4SDavid HowellsEXAMPLES OF MEMORY BARRIER SEQUENCES
1003108b42b4SDavid Howells------------------------------------
1004108b42b4SDavid Howells
100581fc6323SJarek PoplawskiFirstly, write barriers act as partial orderings on store operations.
1006108b42b4SDavid HowellsConsider the following sequence of events:
1007108b42b4SDavid Howells
1008108b42b4SDavid Howells	CPU 1
1009108b42b4SDavid Howells	=======================
1010108b42b4SDavid Howells	STORE A = 1
1011108b42b4SDavid Howells	STORE B = 2
1012108b42b4SDavid Howells	STORE C = 3
1013108b42b4SDavid Howells	<write barrier>
1014108b42b4SDavid Howells	STORE D = 4
1015108b42b4SDavid Howells	STORE E = 5
1016108b42b4SDavid Howells
1017108b42b4SDavid HowellsThis sequence of events is committed to the memory coherence system in an order
1018108b42b4SDavid Howellsthat the rest of the system might perceive as the unordered set of { STORE A,
101980f7228bSAdrian BunkSTORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
1020108b42b4SDavid Howells}:
1021108b42b4SDavid Howells
1022108b42b4SDavid Howells	+-------+       :      :
1023108b42b4SDavid Howells	|       |       +------+
1024108b42b4SDavid Howells	|       |------>| C=3  |     }     /\
102581fc6323SJarek Poplawski	|       |  :    +------+     }-----  \  -----> Events perceptible to
102681fc6323SJarek Poplawski	|       |  :    | A=1  |     }        \/       the rest of the system
1027108b42b4SDavid Howells	|       |  :    +------+     }
1028108b42b4SDavid Howells	| CPU 1 |  :    | B=2  |     }
1029108b42b4SDavid Howells	|       |       +------+     }
1030108b42b4SDavid Howells	|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
1031108b42b4SDavid Howells	|       |       +------+     }        requires all stores prior to the
1032108b42b4SDavid Howells	|       |  :    | E=5  |     }        barrier to be committed before
103381fc6323SJarek Poplawski	|       |  :    +------+     }        further stores may take place
1034108b42b4SDavid Howells	|       |------>| D=4  |     }
1035108b42b4SDavid Howells	|       |       +------+
1036108b42b4SDavid Howells	+-------+       :      :
1037108b42b4SDavid Howells	                   |
1038670bd95eSDavid Howells	                   | Sequence in which stores are committed to the
1039670bd95eSDavid Howells	                   | memory system by CPU 1
1040108b42b4SDavid Howells	                   V
1041108b42b4SDavid Howells
1042108b42b4SDavid Howells
1043f556082dSAkira YokosawaSecondly, address-dependency barriers act as partial orderings on address-
1044f556082dSAkira Yokosawadependent loads.  Consider the following sequence of events:
1045108b42b4SDavid Howells
1046108b42b4SDavid Howells	CPU 1			CPU 2
1047108b42b4SDavid Howells	=======================	=======================
1048c14038c3SDavid Howells		{ B = 7; X = 9; Y = 8; C = &Y }
1049108b42b4SDavid Howells	STORE A = 1
1050108b42b4SDavid Howells	STORE B = 2
1051108b42b4SDavid Howells	<write barrier>
1052108b42b4SDavid Howells	STORE C = &B		LOAD X
1053108b42b4SDavid Howells	STORE D = 4		LOAD C (gets &B)
1054108b42b4SDavid Howells				LOAD *C (reads B)
1055108b42b4SDavid Howells
1056108b42b4SDavid HowellsWithout intervention, CPU 2 may perceive the events on CPU 1 in some
1057108b42b4SDavid Howellseffectively random order, despite the write barrier issued by CPU 1:
1058108b42b4SDavid Howells
1059108b42b4SDavid Howells	+-------+       :      :                :       :
1060108b42b4SDavid Howells	|       |       +------+                +-------+  | Sequence of update
1061108b42b4SDavid Howells	|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
1062108b42b4SDavid Howells	|       |  :    +------+     \          +-------+  | CPU 2
1063108b42b4SDavid Howells	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
1064108b42b4SDavid Howells	|       |       +------+       |        +-------+
1065108b42b4SDavid Howells	|       |   wwwwwwwwwwwwwwww   |        :       :
1066108b42b4SDavid Howells	|       |       +------+       |        :       :
1067108b42b4SDavid Howells	|       |  :    | C=&B |---    |        :       :       +-------+
1068108b42b4SDavid Howells	|       |  :    +------+   \   |        +-------+       |       |
1069108b42b4SDavid Howells	|       |------>| D=4  |    ----------->| C->&B |------>|       |
1070108b42b4SDavid Howells	|       |       +------+       |        +-------+       |       |
1071108b42b4SDavid Howells	+-------+       :      :       |        :       :       |       |
1072108b42b4SDavid Howells	                               |        :       :       |       |
1073108b42b4SDavid Howells	                               |        :       :       | CPU 2 |
1074108b42b4SDavid Howells	                               |        +-------+       |       |
1075108b42b4SDavid Howells	    Apparently incorrect --->  |        | B->7  |------>|       |
1076108b42b4SDavid Howells	    perception of B (!)        |        +-------+       |       |
1077108b42b4SDavid Howells	                               |        :       :       |       |
1078108b42b4SDavid Howells	                               |        +-------+       |       |
1079108b42b4SDavid Howells	    The load of X holds --->    \       | X->9  |------>|       |
1080108b42b4SDavid Howells	    up the maintenance           \      +-------+       |       |
1081108b42b4SDavid Howells	    of coherence of B             ----->| B->2  |       +-------+
1082108b42b4SDavid Howells	                                        +-------+
1083108b42b4SDavid Howells	                                        :       :
1084108b42b4SDavid Howells
1085108b42b4SDavid Howells
1086108b42b4SDavid HowellsIn the above example, CPU 2 perceives that B is 7, despite the load of *C
1087670e9f34SPaolo Ornati(which would be B) coming after the LOAD of C.
1088108b42b4SDavid Howells
1089f556082dSAkira YokosawaIf, however, an address-dependency barrier were to be placed between the load
1090f556082dSAkira Yokosawaof C and the load of *C (ie: B) on CPU 2:
1091c14038c3SDavid Howells
1092c14038c3SDavid Howells	CPU 1			CPU 2
1093c14038c3SDavid Howells	=======================	=======================
1094c14038c3SDavid Howells		{ B = 7; X = 9; Y = 8; C = &Y }
1095c14038c3SDavid Howells	STORE A = 1
1096c14038c3SDavid Howells	STORE B = 2
1097c14038c3SDavid Howells	<write barrier>
1098c14038c3SDavid Howells	STORE C = &B		LOAD X
1099c14038c3SDavid Howells	STORE D = 4		LOAD C (gets &B)
1100203185f6SAkira Yokosawa				<address-dependency barrier>
1101c14038c3SDavid Howells				LOAD *C (reads B)
1102c14038c3SDavid Howells
1103c14038c3SDavid Howellsthen the following will occur:
1104108b42b4SDavid Howells
1105108b42b4SDavid Howells	+-------+       :      :                :       :
1106108b42b4SDavid Howells	|       |       +------+                +-------+
1107108b42b4SDavid Howells	|       |------>| B=2  |-----       --->| Y->8  |
1108108b42b4SDavid Howells	|       |  :    +------+     \          +-------+
1109108b42b4SDavid Howells	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
1110108b42b4SDavid Howells	|       |       +------+       |        +-------+
1111108b42b4SDavid Howells	|       |   wwwwwwwwwwwwwwww   |        :       :
1112108b42b4SDavid Howells	|       |       +------+       |        :       :
1113108b42b4SDavid Howells	|       |  :    | C=&B |---    |        :       :       +-------+
1114108b42b4SDavid Howells	|       |  :    +------+   \   |        +-------+       |       |
1115108b42b4SDavid Howells	|       |------>| D=4  |    ----------->| C->&B |------>|       |
1116108b42b4SDavid Howells	|       |       +------+       |        +-------+       |       |
1117108b42b4SDavid Howells	+-------+       :      :       |        :       :       |       |
1118108b42b4SDavid Howells	                               |        :       :       |       |
1119108b42b4SDavid Howells	                               |        :       :       | CPU 2 |
1120108b42b4SDavid Howells	                               |        +-------+       |       |
1121670bd95eSDavid Howells	                               |        | X->9  |------>|       |
1122670bd95eSDavid Howells	                               |        +-------+       |       |
1123203185f6SAkira Yokosawa	  Makes sure all effects --->   \   aaaaaaaaaaaaaaaaa   |       |
1124670bd95eSDavid Howells	  prior to the store of C        \      +-------+       |       |
1125670bd95eSDavid Howells	  are perceptible to              ----->| B->2  |------>|       |
1126670bd95eSDavid Howells	  subsequent loads                      +-------+       |       |
1127108b42b4SDavid Howells	                                        :       :       +-------+
1128108b42b4SDavid Howells
1129108b42b4SDavid Howells
1130108b42b4SDavid HowellsAnd thirdly, a read barrier acts as a partial order on loads.  Consider the
1131108b42b4SDavid Howellsfollowing sequence of events:
1132108b42b4SDavid Howells
1133108b42b4SDavid Howells	CPU 1			CPU 2
1134108b42b4SDavid Howells	=======================	=======================
1135670bd95eSDavid Howells		{ A = 0, B = 9 }
1136108b42b4SDavid Howells	STORE A=1
1137108b42b4SDavid Howells	<write barrier>
1138670bd95eSDavid Howells	STORE B=2
1139108b42b4SDavid Howells				LOAD B
1140670bd95eSDavid Howells				LOAD A
1141108b42b4SDavid Howells
1142108b42b4SDavid HowellsWithout intervention, CPU 2 may then choose to perceive the events on CPU 1 in
1143108b42b4SDavid Howellssome effectively random order, despite the write barrier issued by CPU 1:
1144108b42b4SDavid Howells
1145670bd95eSDavid Howells	+-------+       :      :                :       :
1146670bd95eSDavid Howells	|       |       +------+                +-------+
1147670bd95eSDavid Howells	|       |------>| A=1  |------      --->| A->0  |
1148670bd95eSDavid Howells	|       |       +------+      \         +-------+
1149670bd95eSDavid Howells	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1150670bd95eSDavid Howells	|       |       +------+        |       +-------+
1151670bd95eSDavid Howells	|       |------>| B=2  |---     |       :       :
1152670bd95eSDavid Howells	|       |       +------+   \    |       :       :       +-------+
1153670bd95eSDavid Howells	+-------+       :      :    \   |       +-------+       |       |
1154670bd95eSDavid Howells	                             ---------->| B->2  |------>|       |
1155670bd95eSDavid Howells	                                |       +-------+       | CPU 2 |
1156670bd95eSDavid Howells	                                |       | A->0  |------>|       |
1157670bd95eSDavid Howells	                                |       +-------+       |       |
1158670bd95eSDavid Howells	                                |       :       :       +-------+
1159670bd95eSDavid Howells	                                 \      :       :
1160670bd95eSDavid Howells	                                  \     +-------+
1161670bd95eSDavid Howells	                                   ---->| A->1  |
1162670bd95eSDavid Howells	                                        +-------+
1163670bd95eSDavid Howells	                                        :       :
1164108b42b4SDavid Howells
1165108b42b4SDavid Howells
11666bc39274SDavid HowellsIf, however, a read barrier were to be placed between the load of B and the
1167670bd95eSDavid Howellsload of A on CPU 2:
1168108b42b4SDavid Howells
1169670bd95eSDavid Howells	CPU 1			CPU 2
1170670bd95eSDavid Howells	=======================	=======================
1171670bd95eSDavid Howells		{ A = 0, B = 9 }
1172670bd95eSDavid Howells	STORE A=1
1173670bd95eSDavid Howells	<write barrier>
1174670bd95eSDavid Howells	STORE B=2
1175670bd95eSDavid Howells				LOAD B
1176670bd95eSDavid Howells				<read barrier>
1177670bd95eSDavid Howells				LOAD A
1178670bd95eSDavid Howells
1179670bd95eSDavid Howellsthen the partial ordering imposed by CPU 1 will be perceived correctly by CPU
1180670bd95eSDavid Howells2:
1181670bd95eSDavid Howells
1182670bd95eSDavid Howells	+-------+       :      :                :       :
1183670bd95eSDavid Howells	|       |       +------+                +-------+
1184670bd95eSDavid Howells	|       |------>| A=1  |------      --->| A->0  |
1185670bd95eSDavid Howells	|       |       +------+      \         +-------+
1186670bd95eSDavid Howells	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1187670bd95eSDavid Howells	|       |       +------+        |       +-------+
1188670bd95eSDavid Howells	|       |------>| B=2  |---     |       :       :
1189670bd95eSDavid Howells	|       |       +------+   \    |       :       :       +-------+
1190670bd95eSDavid Howells	+-------+       :      :    \   |       +-------+       |       |
1191670bd95eSDavid Howells	                             ---------->| B->2  |------>|       |
1192670bd95eSDavid Howells	                                |       +-------+       | CPU 2 |
1193670bd95eSDavid Howells	                                |       :       :       |       |
1194670bd95eSDavid Howells	                                |       :       :       |       |
1195108b42b4SDavid Howells	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1196108b42b4SDavid Howells	  barrier causes all effects      \     +-------+       |       |
1197670bd95eSDavid Howells	  prior to the storage of B        ---->| A->1  |------>|       |
1198670bd95eSDavid Howells	  to be perceptible to CPU 2            +-------+       |       |
1199670bd95eSDavid Howells	                                        :       :       +-------+
1200670bd95eSDavid Howells
1201670bd95eSDavid Howells
1202670bd95eSDavid HowellsTo illustrate this more completely, consider what could happen if the code
1203670bd95eSDavid Howellscontained a load of A either side of the read barrier:
1204670bd95eSDavid Howells
1205670bd95eSDavid Howells	CPU 1			CPU 2
1206670bd95eSDavid Howells	=======================	=======================
1207670bd95eSDavid Howells		{ A = 0, B = 9 }
1208670bd95eSDavid Howells	STORE A=1
1209670bd95eSDavid Howells	<write barrier>
1210670bd95eSDavid Howells	STORE B=2
1211670bd95eSDavid Howells				LOAD B
1212670bd95eSDavid Howells				LOAD A [first load of A]
1213670bd95eSDavid Howells				<read barrier>
1214670bd95eSDavid Howells				LOAD A [second load of A]
1215670bd95eSDavid Howells
1216670bd95eSDavid HowellsEven though the two loads of A both occur after the load of B, they may both
1217670bd95eSDavid Howellscome up with different values:
1218670bd95eSDavid Howells
1219670bd95eSDavid Howells	+-------+       :      :                :       :
1220670bd95eSDavid Howells	|       |       +------+                +-------+
1221670bd95eSDavid Howells	|       |------>| A=1  |------      --->| A->0  |
1222670bd95eSDavid Howells	|       |       +------+      \         +-------+
1223670bd95eSDavid Howells	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1224670bd95eSDavid Howells	|       |       +------+        |       +-------+
1225670bd95eSDavid Howells	|       |------>| B=2  |---     |       :       :
1226670bd95eSDavid Howells	|       |       +------+   \    |       :       :       +-------+
1227670bd95eSDavid Howells	+-------+       :      :    \   |       +-------+       |       |
1228670bd95eSDavid Howells	                             ---------->| B->2  |------>|       |
1229670bd95eSDavid Howells	                                |       +-------+       | CPU 2 |
1230670bd95eSDavid Howells	                                |       :       :       |       |
1231670bd95eSDavid Howells	                                |       :       :       |       |
1232670bd95eSDavid Howells	                                |       +-------+       |       |
1233670bd95eSDavid Howells	                                |       | A->0  |------>| 1st   |
1234670bd95eSDavid Howells	                                |       +-------+       |       |
1235670bd95eSDavid Howells	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1236670bd95eSDavid Howells	  barrier causes all effects      \     +-------+       |       |
1237670bd95eSDavid Howells	  prior to the storage of B        ---->| A->1  |------>| 2nd   |
1238670bd95eSDavid Howells	  to be perceptible to CPU 2            +-------+       |       |
1239670bd95eSDavid Howells	                                        :       :       +-------+
1240670bd95eSDavid Howells
1241670bd95eSDavid Howells
1242670bd95eSDavid HowellsBut it may be that the update to A from CPU 1 becomes perceptible to CPU 2
1243670bd95eSDavid Howellsbefore the read barrier completes anyway:
1244670bd95eSDavid Howells
1245670bd95eSDavid Howells	+-------+       :      :                :       :
1246670bd95eSDavid Howells	|       |       +------+                +-------+
1247670bd95eSDavid Howells	|       |------>| A=1  |------      --->| A->0  |
1248670bd95eSDavid Howells	|       |       +------+      \         +-------+
1249670bd95eSDavid Howells	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1250670bd95eSDavid Howells	|       |       +------+        |       +-------+
1251670bd95eSDavid Howells	|       |------>| B=2  |---     |       :       :
1252670bd95eSDavid Howells	|       |       +------+   \    |       :       :       +-------+
1253670bd95eSDavid Howells	+-------+       :      :    \   |       +-------+       |       |
1254670bd95eSDavid Howells	                             ---------->| B->2  |------>|       |
1255670bd95eSDavid Howells	                                |       +-------+       | CPU 2 |
1256670bd95eSDavid Howells	                                |       :       :       |       |
1257670bd95eSDavid Howells	                                 \      :       :       |       |
1258670bd95eSDavid Howells	                                  \     +-------+       |       |
1259670bd95eSDavid Howells	                                   ---->| A->1  |------>| 1st   |
1260670bd95eSDavid Howells	                                        +-------+       |       |
1261670bd95eSDavid Howells	                                    rrrrrrrrrrrrrrrrr   |       |
1262670bd95eSDavid Howells	                                        +-------+       |       |
1263670bd95eSDavid Howells	                                        | A->1  |------>| 2nd   |
1264108b42b4SDavid Howells	                                        +-------+       |       |
1265108b42b4SDavid Howells	                                        :       :       +-------+
1266108b42b4SDavid Howells
1267108b42b4SDavid Howells
1268670bd95eSDavid HowellsThe guarantee is that the second load will always come up with A == 1 if the
1269670bd95eSDavid Howellsload of B came up with B == 2.  No such guarantee exists for the first load of
1270670bd95eSDavid HowellsA; that may come up with either A == 0 or A == 1.
1271670bd95eSDavid Howells
1272670bd95eSDavid Howells
1273670bd95eSDavid HowellsREAD MEMORY BARRIERS VS LOAD SPECULATION
1274670bd95eSDavid Howells----------------------------------------
1275670bd95eSDavid Howells
1276670bd95eSDavid HowellsMany CPUs speculate with loads: that is they see that they will need to load an
1277670bd95eSDavid Howellsitem from memory, and they find a time where they're not using the bus for any
1278670bd95eSDavid Howellsother loads, and so do the load in advance - even though they haven't actually
1279670bd95eSDavid Howellsgot to that point in the instruction execution flow yet.  This permits the
1280670bd95eSDavid Howellsactual load instruction to potentially complete immediately because the CPU
1281670bd95eSDavid Howellsalready has the value to hand.
1282670bd95eSDavid Howells
1283670bd95eSDavid HowellsIt may turn out that the CPU didn't actually need the value - perhaps because a
1284670bd95eSDavid Howellsbranch circumvented the load - in which case it can discard the value or just
1285670bd95eSDavid Howellscache it for later use.
1286670bd95eSDavid Howells
1287670bd95eSDavid HowellsConsider:
1288670bd95eSDavid Howells
1289670bd95eSDavid Howells	CPU 1			CPU 2
1290670bd95eSDavid Howells	=======================	=======================
1291670bd95eSDavid Howells				LOAD B
1292670bd95eSDavid Howells				DIVIDE		} Divide instructions generally
1293670bd95eSDavid Howells				DIVIDE		} take a long time to perform
1294670bd95eSDavid Howells				LOAD A
1295670bd95eSDavid Howells
1296670bd95eSDavid HowellsWhich might appear as this:
1297670bd95eSDavid Howells
1298670bd95eSDavid Howells	                                        :       :       +-------+
1299670bd95eSDavid Howells	                                        +-------+       |       |
1300670bd95eSDavid Howells	                                    --->| B->2  |------>|       |
1301670bd95eSDavid Howells	                                        +-------+       | CPU 2 |
1302670bd95eSDavid Howells	                                        :       :DIVIDE |       |
1303670bd95eSDavid Howells	                                        +-------+       |       |
1304670bd95eSDavid Howells	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1305670bd95eSDavid Howells	division speculates on the              +-------+   ~   |       |
1306670bd95eSDavid Howells	LOAD of A                               :       :   ~   |       |
1307670bd95eSDavid Howells	                                        :       :DIVIDE |       |
1308670bd95eSDavid Howells	                                        :       :   ~   |       |
1309670bd95eSDavid Howells	Once the divisions are complete -->     :       :   ~-->|       |
1310670bd95eSDavid Howells	the CPU can then perform the            :       :       |       |
1311670bd95eSDavid Howells	LOAD with immediate effect              :       :       +-------+
1312670bd95eSDavid Howells
1313670bd95eSDavid Howells
1314203185f6SAkira YokosawaPlacing a read barrier or an address-dependency barrier just before the second
1315670bd95eSDavid Howellsload:
1316670bd95eSDavid Howells
1317670bd95eSDavid Howells	CPU 1			CPU 2
1318670bd95eSDavid Howells	=======================	=======================
1319670bd95eSDavid Howells				LOAD B
1320670bd95eSDavid Howells				DIVIDE
1321670bd95eSDavid Howells				DIVIDE
1322670bd95eSDavid Howells				<read barrier>
1323670bd95eSDavid Howells				LOAD A
1324670bd95eSDavid Howells
1325670bd95eSDavid Howellswill force any value speculatively obtained to be reconsidered to an extent
1326670bd95eSDavid Howellsdependent on the type of barrier used.  If there was no change made to the
1327670bd95eSDavid Howellsspeculated memory location, then the speculated value will just be used:
1328670bd95eSDavid Howells
1329670bd95eSDavid Howells	                                        :       :       +-------+
1330670bd95eSDavid Howells	                                        +-------+       |       |
1331670bd95eSDavid Howells	                                    --->| B->2  |------>|       |
1332670bd95eSDavid Howells	                                        +-------+       | CPU 2 |
1333670bd95eSDavid Howells	                                        :       :DIVIDE |       |
1334670bd95eSDavid Howells	                                        +-------+       |       |
1335670bd95eSDavid Howells	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1336670bd95eSDavid Howells	division speculates on the              +-------+   ~   |       |
1337670bd95eSDavid Howells	LOAD of A                               :       :   ~   |       |
1338670bd95eSDavid Howells	                                        :       :DIVIDE |       |
1339670bd95eSDavid Howells	                                        :       :   ~   |       |
1340670bd95eSDavid Howells	                                        :       :   ~   |       |
1341670bd95eSDavid Howells	                                    rrrrrrrrrrrrrrrr~   |       |
1342670bd95eSDavid Howells	                                        :       :   ~   |       |
1343670bd95eSDavid Howells	                                        :       :   ~-->|       |
1344670bd95eSDavid Howells	                                        :       :       |       |
1345670bd95eSDavid Howells	                                        :       :       +-------+
1346670bd95eSDavid Howells
1347670bd95eSDavid Howells
1348670bd95eSDavid Howellsbut if there was an update or an invalidation from another CPU pending, then
1349670bd95eSDavid Howellsthe speculation will be cancelled and the value reloaded:
1350670bd95eSDavid Howells
1351670bd95eSDavid Howells	                                        :       :       +-------+
1352670bd95eSDavid Howells	                                        +-------+       |       |
1353670bd95eSDavid Howells	                                    --->| B->2  |------>|       |
1354670bd95eSDavid Howells	                                        +-------+       | CPU 2 |
1355670bd95eSDavid Howells	                                        :       :DIVIDE |       |
1356670bd95eSDavid Howells	                                        +-------+       |       |
1357670bd95eSDavid Howells	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1358670bd95eSDavid Howells	division speculates on the              +-------+   ~   |       |
1359670bd95eSDavid Howells	LOAD of A                               :       :   ~   |       |
1360670bd95eSDavid Howells	                                        :       :DIVIDE |       |
1361670bd95eSDavid Howells	                                        :       :   ~   |       |
1362670bd95eSDavid Howells	                                        :       :   ~   |       |
1363670bd95eSDavid Howells	                                    rrrrrrrrrrrrrrrrr   |       |
1364670bd95eSDavid Howells	                                        +-------+       |       |
1365670bd95eSDavid Howells	The speculation is discarded --->   --->| A->1  |------>|       |
1366670bd95eSDavid Howells	and an updated value is                 +-------+       |       |
1367670bd95eSDavid Howells	retrieved                               :       :       +-------+
1368670bd95eSDavid Howells
1369670bd95eSDavid Howells
1370f1ab25a3SPaul E. McKenneyMULTICOPY ATOMICITY
1371f1ab25a3SPaul E. McKenney--------------------
1372241e6663SPaul E. McKenney
1373f1ab25a3SPaul E. McKenneyMulticopy atomicity is a deeply intuitive notion about ordering that is
1374f1ab25a3SPaul E. McKenneynot always provided by real computer systems, namely that a given store
13750902b1f4SAlan Sternbecomes visible at the same time to all CPUs, or, alternatively, that all
13760902b1f4SAlan SternCPUs agree on the order in which all stores become visible.  However,
13770902b1f4SAlan Sternsupport of full multicopy atomicity would rule out valuable hardware
13780902b1f4SAlan Sternoptimizations, so a weaker form called ``other multicopy atomicity''
13790902b1f4SAlan Sterninstead guarantees only that a given store becomes visible at the same
13800902b1f4SAlan Sterntime to all -other- CPUs.  The remainder of this document discusses this
13810902b1f4SAlan Sternweaker form, but for brevity will call it simply ``multicopy atomicity''.
1382f1ab25a3SPaul E. McKenney
1383f1ab25a3SPaul E. McKenneyThe following example demonstrates multicopy atomicity:
1384241e6663SPaul E. McKenney
1385241e6663SPaul E. McKenney	CPU 1			CPU 2			CPU 3
1386241e6663SPaul E. McKenney	=======================	=======================	=======================
1387241e6663SPaul E. McKenney		{ X = 0, Y = 0 }
1388f1ab25a3SPaul E. McKenney	STORE X=1		r1=LOAD X (reads 1)	LOAD Y (reads 1)
1389f1ab25a3SPaul E. McKenney				<general barrier>	<read barrier>
1390f1ab25a3SPaul E. McKenney				STORE Y=r1		LOAD X
1391241e6663SPaul E. McKenney
13920902b1f4SAlan SternSuppose that CPU 2's load from X returns 1, which it then stores to Y,
13930902b1f4SAlan Sternand CPU 3's load from Y returns 1.  This indicates that CPU 1's store
13940902b1f4SAlan Sternto X precedes CPU 2's load from X and that CPU 2's store to Y precedes
13950902b1f4SAlan SternCPU 3's load from Y.  In addition, the memory barriers guarantee that
13960902b1f4SAlan SternCPU 2 executes its load before its store, and CPU 3 loads from Y before
13970902b1f4SAlan Sternit loads from X.  The question is then "Can CPU 3's load from X return 0?"
1398241e6663SPaul E. McKenney
13990902b1f4SAlan SternBecause CPU 3's load from X in some sense comes after CPU 2's load, it
1400241e6663SPaul E. McKenneyis natural to expect that CPU 3's load from X must therefore return 1.
14010902b1f4SAlan SternThis expectation follows from multicopy atomicity: if a load executing
14020902b1f4SAlan Sternon CPU B follows a load from the same variable executing on CPU A (and
14030902b1f4SAlan SternCPU A did not originally store the value which it read), then on
14040902b1f4SAlan Sternmulticopy-atomic systems, CPU B's load must return either the same value
14050902b1f4SAlan Sternthat CPU A's load did or some later value.  However, the Linux kernel
14060902b1f4SAlan Sterndoes not require systems to be multicopy atomic.
1407241e6663SPaul E. McKenney
14080902b1f4SAlan SternThe use of a general memory barrier in the example above compensates
14090902b1f4SAlan Sternfor any lack of multicopy atomicity.  In the example, if CPU 2's load
14100902b1f4SAlan Sternfrom X returns 1 and CPU 3's load from Y returns 1, then CPU 3's load
14110902b1f4SAlan Sternfrom X must indeed also return 1.
1412241e6663SPaul E. McKenney
1413f1ab25a3SPaul E. McKenneyHowever, dependencies, read barriers, and write barriers are not always
1414f1ab25a3SPaul E. McKenneyable to compensate for non-multicopy atomicity.  For example, suppose
1415f1ab25a3SPaul E. McKenneythat CPU 2's general barrier is removed from the above example, leaving
1416f1ab25a3SPaul E. McKenneyonly the data dependency shown below:
1417241e6663SPaul E. McKenney
1418241e6663SPaul E. McKenney	CPU 1			CPU 2			CPU 3
1419241e6663SPaul E. McKenney	=======================	=======================	=======================
1420241e6663SPaul E. McKenney		{ X = 0, Y = 0 }
1421f1ab25a3SPaul E. McKenney	STORE X=1		r1=LOAD X (reads 1)	LOAD Y (reads 1)
1422f1ab25a3SPaul E. McKenney				<data dependency>	<read barrier>
1423f1ab25a3SPaul E. McKenney				STORE Y=r1		LOAD X (reads 0)
1424241e6663SPaul E. McKenney
1425f1ab25a3SPaul E. McKenneyThis substitution allows non-multicopy atomicity to run rampant: in
1426f1ab25a3SPaul E. McKenneythis example, it is perfectly legal for CPU 2's load from X to return 1,
1427f1ab25a3SPaul E. McKenneyCPU 3's load from Y to return 1, and its load from X to return 0.
1428241e6663SPaul E. McKenney
1429f1ab25a3SPaul E. McKenneyThe key point is that although CPU 2's data dependency orders its load
14300902b1f4SAlan Sternand store, it does not guarantee to order CPU 1's store.  Thus, if this
14310902b1f4SAlan Sternexample runs on a non-multicopy-atomic system where CPUs 1 and 2 share a
14320902b1f4SAlan Sternstore buffer or a level of cache, CPU 2 might have early access to CPU 1's
14330902b1f4SAlan Sternwrites.  General barriers are therefore required to ensure that all CPUs
14340902b1f4SAlan Sternagree on the combined order of multiple accesses.
1435241e6663SPaul E. McKenney
1436f1ab25a3SPaul E. McKenneyGeneral barriers can compensate not only for non-multicopy atomicity,
1437f1ab25a3SPaul E. McKenneybut can also generate additional ordering that can ensure that -all-
1438f1ab25a3SPaul E. McKenneyCPUs will perceive the same order of -all- operations.  In contrast, a
1439f1ab25a3SPaul E. McKenneychain of release-acquire pairs do not provide this additional ordering,
1440f1ab25a3SPaul E. McKenneywhich means that only those CPUs on the chain are guaranteed to agree
1441f1ab25a3SPaul E. McKenneyon the combined order of the accesses.  For example, switching to C code
1442f1ab25a3SPaul E. McKenneyin deference to the ghost of Herman Hollerith:
1443c535cc92SPaul E. McKenney
1444c535cc92SPaul E. McKenney	int u, v, x, y, z;
1445c535cc92SPaul E. McKenney
1446c535cc92SPaul E. McKenney	void cpu0(void)
1447c535cc92SPaul E. McKenney	{
1448c535cc92SPaul E. McKenney		r0 = smp_load_acquire(&x);
1449c535cc92SPaul E. McKenney		WRITE_ONCE(u, 1);
1450c535cc92SPaul E. McKenney		smp_store_release(&y, 1);
1451c535cc92SPaul E. McKenney	}
1452c535cc92SPaul E. McKenney
1453c535cc92SPaul E. McKenney	void cpu1(void)
1454c535cc92SPaul E. McKenney	{
1455c535cc92SPaul E. McKenney		r1 = smp_load_acquire(&y);
1456c535cc92SPaul E. McKenney		r4 = READ_ONCE(v);
1457c535cc92SPaul E. McKenney		r5 = READ_ONCE(u);
1458c535cc92SPaul E. McKenney		smp_store_release(&z, 1);
1459c535cc92SPaul E. McKenney	}
1460c535cc92SPaul E. McKenney
1461c535cc92SPaul E. McKenney	void cpu2(void)
1462c535cc92SPaul E. McKenney	{
1463c535cc92SPaul E. McKenney		r2 = smp_load_acquire(&z);
1464c535cc92SPaul E. McKenney		smp_store_release(&x, 1);
1465c535cc92SPaul E. McKenney	}
1466c535cc92SPaul E. McKenney
1467c535cc92SPaul E. McKenney	void cpu3(void)
1468c535cc92SPaul E. McKenney	{
1469c535cc92SPaul E. McKenney		WRITE_ONCE(v, 1);
1470c535cc92SPaul E. McKenney		smp_mb();
1471c535cc92SPaul E. McKenney		r3 = READ_ONCE(u);
1472c535cc92SPaul E. McKenney	}
1473c535cc92SPaul E. McKenney
1474f1ab25a3SPaul E. McKenneyBecause cpu0(), cpu1(), and cpu2() participate in a chain of
1475f1ab25a3SPaul E. McKenneysmp_store_release()/smp_load_acquire() pairs, the following outcome
1476f1ab25a3SPaul E. McKenneyis prohibited:
1477c535cc92SPaul E. McKenney
1478c535cc92SPaul E. McKenney	r0 == 1 && r1 == 1 && r2 == 1
1479c535cc92SPaul E. McKenney
1480c535cc92SPaul E. McKenneyFurthermore, because of the release-acquire relationship between cpu0()
1481c535cc92SPaul E. McKenneyand cpu1(), cpu1() must see cpu0()'s writes, so that the following
1482c535cc92SPaul E. McKenneyoutcome is prohibited:
1483c535cc92SPaul E. McKenney
1484c535cc92SPaul E. McKenney	r1 == 1 && r5 == 0
1485c535cc92SPaul E. McKenney
1486f1ab25a3SPaul E. McKenneyHowever, the ordering provided by a release-acquire chain is local
1487f1ab25a3SPaul E. McKenneyto the CPUs participating in that chain and does not apply to cpu3(),
1488f1ab25a3SPaul E. McKenneyat least aside from stores.  Therefore, the following outcome is possible:
1489c535cc92SPaul E. McKenney
1490c535cc92SPaul E. McKenney	r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
1491c535cc92SPaul E. McKenney
149237ef0341SPaul E. McKenneyAs an aside, the following outcome is also possible:
149337ef0341SPaul E. McKenney
149437ef0341SPaul E. McKenney	r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1
149537ef0341SPaul E. McKenney
1496c535cc92SPaul E. McKenneyAlthough cpu0(), cpu1(), and cpu2() will see their respective reads and
1497c535cc92SPaul E. McKenneywrites in order, CPUs not involved in the release-acquire chain might
1498c535cc92SPaul E. McKenneywell disagree on the order.  This disagreement stems from the fact that
1499c535cc92SPaul E. McKenneythe weak memory-barrier instructions used to implement smp_load_acquire()
1500c535cc92SPaul E. McKenneyand smp_store_release() are not required to order prior stores against
1501c535cc92SPaul E. McKenneysubsequent loads in all cases.  This means that cpu3() can see cpu0()'s
1502c535cc92SPaul E. McKenneystore to u as happening -after- cpu1()'s load from v, even though
1503c535cc92SPaul E. McKenneyboth cpu0() and cpu1() agree that these two operations occurred in the
1504c535cc92SPaul E. McKenneyintended order.
1505c535cc92SPaul E. McKenney
1506c535cc92SPaul E. McKenneyHowever, please keep in mind that smp_load_acquire() is not magic.
1507c535cc92SPaul E. McKenneyIn particular, it simply reads from its argument with ordering.  It does
1508c535cc92SPaul E. McKenney-not- ensure that any particular value will be read.  Therefore, the
1509c535cc92SPaul E. McKenneyfollowing outcome is possible:
1510c535cc92SPaul E. McKenney
1511c535cc92SPaul E. McKenney	r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
1512c535cc92SPaul E. McKenney
1513c535cc92SPaul E. McKenneyNote that this outcome can happen even on a mythical sequentially
1514c535cc92SPaul E. McKenneyconsistent system where nothing is ever reordered.
1515c535cc92SPaul E. McKenney
1516f1ab25a3SPaul E. McKenneyTo reiterate, if your code requires full ordering of all operations,
1517f1ab25a3SPaul E. McKenneyuse general barriers throughout.
1518241e6663SPaul E. McKenney
1519241e6663SPaul E. McKenney
1520108b42b4SDavid Howells========================
1521108b42b4SDavid HowellsEXPLICIT KERNEL BARRIERS
1522108b42b4SDavid Howells========================
1523108b42b4SDavid Howells
1524108b42b4SDavid HowellsThe Linux kernel has a variety of different barriers that act at different
1525108b42b4SDavid Howellslevels:
1526108b42b4SDavid Howells
1527108b42b4SDavid Howells  (*) Compiler barrier.
1528108b42b4SDavid Howells
1529108b42b4SDavid Howells  (*) CPU memory barriers.
1530108b42b4SDavid Howells
1531108b42b4SDavid Howells
1532108b42b4SDavid HowellsCOMPILER BARRIER
1533108b42b4SDavid Howells----------------
1534108b42b4SDavid Howells
1535108b42b4SDavid HowellsThe Linux kernel has an explicit compiler barrier function that prevents the
1536108b42b4SDavid Howellscompiler from moving the memory accesses either side of it to the other side:
1537108b42b4SDavid Howells
1538108b42b4SDavid Howells	barrier();
1539108b42b4SDavid Howells
15409af194ceSPaul E. McKenneyThis is a general barrier -- there are no read-read or write-write
15419af194ceSPaul E. McKenneyvariants of barrier().  However, READ_ONCE() and WRITE_ONCE() can be
15429af194ceSPaul E. McKenneythought of as weak forms of barrier() that affect only the specific
15439af194ceSPaul E. McKenneyaccesses flagged by the READ_ONCE() or WRITE_ONCE().
1544108b42b4SDavid Howells
1545692118daSPaul E. McKenneyThe barrier() function has the following effects:
1546692118daSPaul E. McKenney
1547692118daSPaul E. McKenney (*) Prevents the compiler from reordering accesses following the
1548692118daSPaul E. McKenney     barrier() to precede any accesses preceding the barrier().
1549692118daSPaul E. McKenney     One example use for this property is to ease communication between
1550692118daSPaul E. McKenney     interrupt-handler code and the code that was interrupted.
1551692118daSPaul E. McKenney
1552692118daSPaul E. McKenney (*) Within a loop, forces the compiler to load the variables used
1553692118daSPaul E. McKenney     in that loop's conditional on each pass through that loop.
1554692118daSPaul E. McKenney
15559af194ceSPaul E. McKenneyThe READ_ONCE() and WRITE_ONCE() functions can prevent any number of
15569af194ceSPaul E. McKenneyoptimizations that, while perfectly safe in single-threaded code, can
15579af194ceSPaul E. McKenneybe fatal in concurrent code.  Here are some examples of these sorts
15589af194ceSPaul E. McKenneyof optimizations:
1559692118daSPaul E. McKenney
1560449f7413SPaul E. McKenney (*) The compiler is within its rights to reorder loads and stores
1561449f7413SPaul E. McKenney     to the same variable, and in some cases, the CPU is within its
1562449f7413SPaul E. McKenney     rights to reorder loads to the same variable.  This means that
1563449f7413SPaul E. McKenney     the following code:
1564449f7413SPaul E. McKenney
1565449f7413SPaul E. McKenney	a[0] = x;
1566449f7413SPaul E. McKenney	a[1] = x;
1567449f7413SPaul E. McKenney
1568449f7413SPaul E. McKenney     Might result in an older value of x stored in a[1] than in a[0].
1569449f7413SPaul E. McKenney     Prevent both the compiler and the CPU from doing this as follows:
1570449f7413SPaul E. McKenney
15719af194ceSPaul E. McKenney	a[0] = READ_ONCE(x);
15729af194ceSPaul E. McKenney	a[1] = READ_ONCE(x);
1573449f7413SPaul E. McKenney
15749af194ceSPaul E. McKenney     In short, READ_ONCE() and WRITE_ONCE() provide cache coherence for
15759af194ceSPaul E. McKenney     accesses from multiple CPUs to a single variable.
1576449f7413SPaul E. McKenney
1577692118daSPaul E. McKenney (*) The compiler is within its rights to merge successive loads from
1578692118daSPaul E. McKenney     the same variable.  Such merging can cause the compiler to "optimize"
1579692118daSPaul E. McKenney     the following code:
1580692118daSPaul E. McKenney
1581692118daSPaul E. McKenney	while (tmp = a)
1582692118daSPaul E. McKenney		do_something_with(tmp);
1583692118daSPaul E. McKenney
1584692118daSPaul E. McKenney     into the following code, which, although in some sense legitimate
1585692118daSPaul E. McKenney     for single-threaded code, is almost certainly not what the developer
1586692118daSPaul E. McKenney     intended:
1587692118daSPaul E. McKenney
1588692118daSPaul E. McKenney	if (tmp = a)
1589692118daSPaul E. McKenney		for (;;)
1590692118daSPaul E. McKenney			do_something_with(tmp);
1591692118daSPaul E. McKenney
15929af194ceSPaul E. McKenney     Use READ_ONCE() to prevent the compiler from doing this to you:
1593692118daSPaul E. McKenney
15949af194ceSPaul E. McKenney	while (tmp = READ_ONCE(a))
1595692118daSPaul E. McKenney		do_something_with(tmp);
1596692118daSPaul E. McKenney
1597692118daSPaul E. McKenney (*) The compiler is within its rights to reload a variable, for example,
1598692118daSPaul E. McKenney     in cases where high register pressure prevents the compiler from
1599692118daSPaul E. McKenney     keeping all data of interest in registers.  The compiler might
1600692118daSPaul E. McKenney     therefore optimize the variable 'tmp' out of our previous example:
1601692118daSPaul E. McKenney
1602692118daSPaul E. McKenney	while (tmp = a)
1603692118daSPaul E. McKenney		do_something_with(tmp);
1604692118daSPaul E. McKenney
1605692118daSPaul E. McKenney     This could result in the following code, which is perfectly safe in
1606692118daSPaul E. McKenney     single-threaded code, but can be fatal in concurrent code:
1607692118daSPaul E. McKenney
1608692118daSPaul E. McKenney	while (a)
1609692118daSPaul E. McKenney		do_something_with(a);
1610692118daSPaul E. McKenney
1611692118daSPaul E. McKenney     For example, the optimized version of this code could result in
1612692118daSPaul E. McKenney     passing a zero to do_something_with() in the case where the variable
1613692118daSPaul E. McKenney     a was modified by some other CPU between the "while" statement and
1614692118daSPaul E. McKenney     the call to do_something_with().
1615692118daSPaul E. McKenney
16169af194ceSPaul E. McKenney     Again, use READ_ONCE() to prevent the compiler from doing this:
1617692118daSPaul E. McKenney
16189af194ceSPaul E. McKenney	while (tmp = READ_ONCE(a))
1619692118daSPaul E. McKenney		do_something_with(tmp);
1620692118daSPaul E. McKenney
1621692118daSPaul E. McKenney     Note that if the compiler runs short of registers, it might save
1622692118daSPaul E. McKenney     tmp onto the stack.  The overhead of this saving and later restoring
1623692118daSPaul E. McKenney     is why compilers reload variables.  Doing so is perfectly safe for
1624692118daSPaul E. McKenney     single-threaded code, so you need to tell the compiler about cases
1625692118daSPaul E. McKenney     where it is not safe.
1626692118daSPaul E. McKenney
1627692118daSPaul E. McKenney (*) The compiler is within its rights to omit a load entirely if it knows
1628692118daSPaul E. McKenney     what the value will be.  For example, if the compiler can prove that
1629692118daSPaul E. McKenney     the value of variable 'a' is always zero, it can optimize this code:
1630692118daSPaul E. McKenney
1631692118daSPaul E. McKenney	while (tmp = a)
1632692118daSPaul E. McKenney		do_something_with(tmp);
1633692118daSPaul E. McKenney
1634692118daSPaul E. McKenney     Into this:
1635692118daSPaul E. McKenney
1636692118daSPaul E. McKenney	do { } while (0);
1637692118daSPaul E. McKenney
16389af194ceSPaul E. McKenney     This transformation is a win for single-threaded code because it
16399af194ceSPaul E. McKenney     gets rid of a load and a branch.  The problem is that the compiler
16409af194ceSPaul E. McKenney     will carry out its proof assuming that the current CPU is the only
16419af194ceSPaul E. McKenney     one updating variable 'a'.  If variable 'a' is shared, then the
16429af194ceSPaul E. McKenney     compiler's proof will be erroneous.  Use READ_ONCE() to tell the
16439af194ceSPaul E. McKenney     compiler that it doesn't know as much as it thinks it does:
1644692118daSPaul E. McKenney
16459af194ceSPaul E. McKenney	while (tmp = READ_ONCE(a))
1646692118daSPaul E. McKenney		do_something_with(tmp);
1647692118daSPaul E. McKenney
1648692118daSPaul E. McKenney     But please note that the compiler is also closely watching what you
16499af194ceSPaul E. McKenney     do with the value after the READ_ONCE().  For example, suppose you
1650692118daSPaul E. McKenney     do the following and MAX is a preprocessor macro with the value 1:
1651692118daSPaul E. McKenney
16529af194ceSPaul E. McKenney	while ((tmp = READ_ONCE(a)) % MAX)
1653692118daSPaul E. McKenney		do_something_with(tmp);
1654692118daSPaul E. McKenney
1655692118daSPaul E. McKenney     Then the compiler knows that the result of the "%" operator applied
1656692118daSPaul E. McKenney     to MAX will always be zero, again allowing the compiler to optimize
1657692118daSPaul E. McKenney     the code into near-nonexistence.  (It will still load from the
1658692118daSPaul E. McKenney     variable 'a'.)
1659692118daSPaul E. McKenney
1660692118daSPaul E. McKenney (*) Similarly, the compiler is within its rights to omit a store entirely
1661692118daSPaul E. McKenney     if it knows that the variable already has the value being stored.
1662692118daSPaul E. McKenney     Again, the compiler assumes that the current CPU is the only one
1663692118daSPaul E. McKenney     storing into the variable, which can cause the compiler to do the
1664692118daSPaul E. McKenney     wrong thing for shared variables.  For example, suppose you have
1665692118daSPaul E. McKenney     the following:
1666692118daSPaul E. McKenney
1667692118daSPaul E. McKenney	a = 0;
166865f95ff2SSeongJae Park	... Code that does not store to variable a ...
1669692118daSPaul E. McKenney	a = 0;
1670692118daSPaul E. McKenney
1671692118daSPaul E. McKenney     The compiler sees that the value of variable 'a' is already zero, so
1672692118daSPaul E. McKenney     it might well omit the second store.  This would come as a fatal
1673692118daSPaul E. McKenney     surprise if some other CPU might have stored to variable 'a' in the
1674692118daSPaul E. McKenney     meantime.
1675692118daSPaul E. McKenney
16769af194ceSPaul E. McKenney     Use WRITE_ONCE() to prevent the compiler from making this sort of
1677692118daSPaul E. McKenney     wrong guess:
1678692118daSPaul E. McKenney
16799af194ceSPaul E. McKenney	WRITE_ONCE(a, 0);
168065f95ff2SSeongJae Park	... Code that does not store to variable a ...
16819af194ceSPaul E. McKenney	WRITE_ONCE(a, 0);
1682692118daSPaul E. McKenney
1683692118daSPaul E. McKenney (*) The compiler is within its rights to reorder memory accesses unless
1684692118daSPaul E. McKenney     you tell it not to.  For example, consider the following interaction
1685692118daSPaul E. McKenney     between process-level code and an interrupt handler:
1686692118daSPaul E. McKenney
1687692118daSPaul E. McKenney	void process_level(void)
1688692118daSPaul E. McKenney	{
1689692118daSPaul E. McKenney		msg = get_message();
1690692118daSPaul E. McKenney		flag = true;
1691692118daSPaul E. McKenney	}
1692692118daSPaul E. McKenney
1693692118daSPaul E. McKenney	void interrupt_handler(void)
1694692118daSPaul E. McKenney	{
1695692118daSPaul E. McKenney		if (flag)
1696692118daSPaul E. McKenney			process_message(msg);
1697692118daSPaul E. McKenney	}
1698692118daSPaul E. McKenney
1699df5cbb27SMasanari Iida     There is nothing to prevent the compiler from transforming
1700692118daSPaul E. McKenney     process_level() to the following, in fact, this might well be a
1701692118daSPaul E. McKenney     win for single-threaded code:
1702692118daSPaul E. McKenney
1703692118daSPaul E. McKenney	void process_level(void)
1704692118daSPaul E. McKenney	{
1705692118daSPaul E. McKenney		flag = true;
1706692118daSPaul E. McKenney		msg = get_message();
1707692118daSPaul E. McKenney	}
1708692118daSPaul E. McKenney
1709692118daSPaul E. McKenney     If the interrupt occurs between these two statement, then
17109af194ceSPaul E. McKenney     interrupt_handler() might be passed a garbled msg.  Use WRITE_ONCE()
1711692118daSPaul E. McKenney     to prevent this as follows:
1712692118daSPaul E. McKenney
1713692118daSPaul E. McKenney	void process_level(void)
1714692118daSPaul E. McKenney	{
17159af194ceSPaul E. McKenney		WRITE_ONCE(msg, get_message());
17169af194ceSPaul E. McKenney		WRITE_ONCE(flag, true);
1717692118daSPaul E. McKenney	}
1718692118daSPaul E. McKenney
1719692118daSPaul E. McKenney	void interrupt_handler(void)
1720692118daSPaul E. McKenney	{
17219af194ceSPaul E. McKenney		if (READ_ONCE(flag))
17229af194ceSPaul E. McKenney			process_message(READ_ONCE(msg));
1723692118daSPaul E. McKenney	}
1724692118daSPaul E. McKenney
17259af194ceSPaul E. McKenney     Note that the READ_ONCE() and WRITE_ONCE() wrappers in
17269af194ceSPaul E. McKenney     interrupt_handler() are needed if this interrupt handler can itself
17279af194ceSPaul E. McKenney     be interrupted by something that also accesses 'flag' and 'msg',
17289af194ceSPaul E. McKenney     for example, a nested interrupt or an NMI.  Otherwise, READ_ONCE()
17299af194ceSPaul E. McKenney     and WRITE_ONCE() are not needed in interrupt_handler() other than
17309af194ceSPaul E. McKenney     for documentation purposes.  (Note also that nested interrupts
17319af194ceSPaul E. McKenney     do not typically occur in modern Linux kernels, in fact, if an
17329af194ceSPaul E. McKenney     interrupt handler returns with interrupts enabled, you will get a
17339af194ceSPaul E. McKenney     WARN_ONCE() splat.)
1734692118daSPaul E. McKenney
17359af194ceSPaul E. McKenney     You should assume that the compiler can move READ_ONCE() and
17369af194ceSPaul E. McKenney     WRITE_ONCE() past code not containing READ_ONCE(), WRITE_ONCE(),
17379af194ceSPaul E. McKenney     barrier(), or similar primitives.
1738692118daSPaul E. McKenney
17399af194ceSPaul E. McKenney     This effect could also be achieved using barrier(), but READ_ONCE()
17409af194ceSPaul E. McKenney     and WRITE_ONCE() are more selective:  With READ_ONCE() and
17419af194ceSPaul E. McKenney     WRITE_ONCE(), the compiler need only forget the contents of the
17429af194ceSPaul E. McKenney     indicated memory locations, while with barrier() the compiler must
17438149b5cbSSeongJae Park     discard the value of all memory locations that it has currently
17449af194ceSPaul E. McKenney     cached in any machine registers.  Of course, the compiler must also
17459af194ceSPaul E. McKenney     respect the order in which the READ_ONCE()s and WRITE_ONCE()s occur,
17469af194ceSPaul E. McKenney     though the CPU of course need not do so.
1747692118daSPaul E. McKenney
1748692118daSPaul E. McKenney (*) The compiler is within its rights to invent stores to a variable,
1749692118daSPaul E. McKenney     as in the following example:
1750692118daSPaul E. McKenney
1751692118daSPaul E. McKenney	if (a)
1752692118daSPaul E. McKenney		b = a;
1753692118daSPaul E. McKenney	else
1754692118daSPaul E. McKenney		b = 42;
1755692118daSPaul E. McKenney
1756692118daSPaul E. McKenney     The compiler might save a branch by optimizing this as follows:
1757692118daSPaul E. McKenney
1758692118daSPaul E. McKenney	b = 42;
1759692118daSPaul E. McKenney	if (a)
1760692118daSPaul E. McKenney		b = a;
1761692118daSPaul E. McKenney
1762692118daSPaul E. McKenney     In single-threaded code, this is not only safe, but also saves
1763692118daSPaul E. McKenney     a branch.  Unfortunately, in concurrent code, this optimization
1764692118daSPaul E. McKenney     could cause some other CPU to see a spurious value of 42 -- even
1765692118daSPaul E. McKenney     if variable 'a' was never zero -- when loading variable 'b'.
17669af194ceSPaul E. McKenney     Use WRITE_ONCE() to prevent this as follows:
1767692118daSPaul E. McKenney
1768692118daSPaul E. McKenney	if (a)
17699af194ceSPaul E. McKenney		WRITE_ONCE(b, a);
1770692118daSPaul E. McKenney	else
17719af194ceSPaul E. McKenney		WRITE_ONCE(b, 42);
1772692118daSPaul E. McKenney
1773692118daSPaul E. McKenney     The compiler can also invent loads.  These are usually less
1774692118daSPaul E. McKenney     damaging, but they can result in cache-line bouncing and thus in
17759af194ceSPaul E. McKenney     poor performance and scalability.  Use READ_ONCE() to prevent
1776692118daSPaul E. McKenney     invented loads.
1777692118daSPaul E. McKenney
1778692118daSPaul E. McKenney (*) For aligned memory locations whose size allows them to be accessed
1779692118daSPaul E. McKenney     with a single memory-reference instruction, prevents "load tearing"
1780692118daSPaul E. McKenney     and "store tearing," in which a single large access is replaced by
1781692118daSPaul E. McKenney     multiple smaller accesses.  For example, given an architecture having
1782692118daSPaul E. McKenney     16-bit store instructions with 7-bit immediate fields, the compiler
1783692118daSPaul E. McKenney     might be tempted to use two 16-bit store-immediate instructions to
1784692118daSPaul E. McKenney     implement the following 32-bit store:
1785692118daSPaul E. McKenney
1786692118daSPaul E. McKenney	p = 0x00010002;
1787692118daSPaul E. McKenney
1788692118daSPaul E. McKenney     Please note that GCC really does use this sort of optimization,
1789692118daSPaul E. McKenney     which is not surprising given that it would likely take more
1790692118daSPaul E. McKenney     than two instructions to build the constant and then store it.
1791692118daSPaul E. McKenney     This optimization can therefore be a win in single-threaded code.
1792692118daSPaul E. McKenney     In fact, a recent bug (since fixed) caused GCC to incorrectly use
1793692118daSPaul E. McKenney     this optimization in a volatile store.  In the absence of such bugs,
17949af194ceSPaul E. McKenney     use of WRITE_ONCE() prevents store tearing in the following example:
1795692118daSPaul E. McKenney
17969af194ceSPaul E. McKenney	WRITE_ONCE(p, 0x00010002);
1797692118daSPaul E. McKenney
1798692118daSPaul E. McKenney     Use of packed structures can also result in load and store tearing,
1799692118daSPaul E. McKenney     as in this example:
1800692118daSPaul E. McKenney
1801692118daSPaul E. McKenney	struct __attribute__((__packed__)) foo {
1802692118daSPaul E. McKenney		short a;
1803692118daSPaul E. McKenney		int b;
1804692118daSPaul E. McKenney		short c;
1805692118daSPaul E. McKenney	};
1806692118daSPaul E. McKenney	struct foo foo1, foo2;
1807692118daSPaul E. McKenney	...
1808692118daSPaul E. McKenney
1809692118daSPaul E. McKenney	foo2.a = foo1.a;
1810692118daSPaul E. McKenney	foo2.b = foo1.b;
1811692118daSPaul E. McKenney	foo2.c = foo1.c;
1812692118daSPaul E. McKenney
18139af194ceSPaul E. McKenney     Because there are no READ_ONCE() or WRITE_ONCE() wrappers and no
18149af194ceSPaul E. McKenney     volatile markings, the compiler would be well within its rights to
18159af194ceSPaul E. McKenney     implement these three assignment statements as a pair of 32-bit
18169af194ceSPaul E. McKenney     loads followed by a pair of 32-bit stores.  This would result in
18179af194ceSPaul E. McKenney     load tearing on 'foo1.b' and store tearing on 'foo2.b'.  READ_ONCE()
18189af194ceSPaul E. McKenney     and WRITE_ONCE() again prevent tearing in this example:
1819692118daSPaul E. McKenney
1820692118daSPaul E. McKenney	foo2.a = foo1.a;
18219af194ceSPaul E. McKenney	WRITE_ONCE(foo2.b, READ_ONCE(foo1.b));
1822692118daSPaul E. McKenney	foo2.c = foo1.c;
1823692118daSPaul E. McKenney
18249af194ceSPaul E. McKenneyAll that aside, it is never necessary to use READ_ONCE() and
18259af194ceSPaul E. McKenneyWRITE_ONCE() on a variable that has been marked volatile.  For example,
18269af194ceSPaul E. McKenneybecause 'jiffies' is marked volatile, it is never necessary to
18279af194ceSPaul E. McKenneysay READ_ONCE(jiffies).  The reason for this is that READ_ONCE() and
18289af194ceSPaul E. McKenneyWRITE_ONCE() are implemented as volatile casts, which has no effect when
18299af194ceSPaul E. McKenneyits argument is already marked volatile.
1830692118daSPaul E. McKenney
1831692118daSPaul E. McKenneyPlease note that these compiler barriers have no direct effect on the CPU,
1832692118daSPaul E. McKenneywhich may then reorder things however it wishes.
1833108b42b4SDavid Howells
1834108b42b4SDavid Howells
1835108b42b4SDavid HowellsCPU MEMORY BARRIERS
1836108b42b4SDavid Howells-------------------
1837108b42b4SDavid Howells
1838203185f6SAkira YokosawaThe Linux kernel has seven basic CPU memory barriers:
1839108b42b4SDavid Howells
1840108b42b4SDavid Howells	TYPE			MANDATORY	SMP CONDITIONAL
1841203185f6SAkira Yokosawa	=======================	===============	===============
1842108b42b4SDavid Howells	GENERAL			mb()		smp_mb()
1843108b42b4SDavid Howells	WRITE			wmb()		smp_wmb()
1844108b42b4SDavid Howells	READ			rmb()		smp_rmb()
1845203185f6SAkira Yokosawa	ADDRESS DEPENDENCY			READ_ONCE()
1846108b42b4SDavid Howells
1847108b42b4SDavid Howells
1848203185f6SAkira YokosawaAll memory barriers except the address-dependency barriers imply a compiler
1849203185f6SAkira Yokosawabarrier.  Address dependencies do not impose any additional compiler ordering.
185073f10281SNick Piggin
1851203185f6SAkira YokosawaAside: In the case of address dependencies, the compiler would be expected
18529af194ceSPaul E. McKenneyto issue the loads in the correct order (eg. `a[b]` would have to load
18539af194ceSPaul E. McKenneythe value of b before loading a[b]), however there is no guarantee in
18549af194ceSPaul E. McKenneythe C specification that the compiler may not speculate the value of b
18558149b5cbSSeongJae Park(eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1)
18569af194ceSPaul E. McKenneytmp = a[b]; ).  There is also the problem of a compiler reloading b after
18579af194ceSPaul E. McKenneyhaving loaded a[b], thus having a newer copy of b than a[b].  A consensus
18589af194ceSPaul E. McKenneyhas not yet been reached about these problems, however the READ_ONCE()
18599af194ceSPaul E. McKenneymacro is a good place to start looking.
1860108b42b4SDavid Howells
1861108b42b4SDavid HowellsSMP memory barriers are reduced to compiler barriers on uniprocessor compiled
186281fc6323SJarek Poplawskisystems because it is assumed that a CPU will appear to be self-consistent,
1863108b42b4SDavid Howellsand will order overlapping accesses correctly with respect to itself.
18646a65d263SMichael S. TsirkinHowever, see the subsection on "Virtual Machine Guests" below.
1865108b42b4SDavid Howells
1866108b42b4SDavid Howells[!] Note that SMP memory barriers _must_ be used to control the ordering of
1867108b42b4SDavid Howellsreferences to shared memory on SMP systems, though the use of locking instead
1868108b42b4SDavid Howellsis sufficient.
1869108b42b4SDavid Howells
1870108b42b4SDavid HowellsMandatory barriers should not be used to control SMP effects, since mandatory
18716a65d263SMichael S. Tsirkinbarriers impose unnecessary overhead on both SMP and UP systems. They may,
18726a65d263SMichael S. Tsirkinhowever, be used to control MMIO effects on accesses through relaxed memory I/O
18736a65d263SMichael S. Tsirkinwindows.  These barriers are required even on non-SMP systems as they affect
18746a65d263SMichael S. Tsirkinthe order in which memory operations appear to a device by prohibiting both the
18756a65d263SMichael S. Tsirkincompiler and the CPU from reordering them.
1876108b42b4SDavid Howells
1877108b42b4SDavid Howells
1878108b42b4SDavid HowellsThere are some more advanced barrier functions:
1879108b42b4SDavid Howells
1880b92b8b35SPeter Zijlstra (*) smp_store_mb(var, value)
1881108b42b4SDavid Howells
188275b2bd55SOleg Nesterov     This assigns the value to the variable and then inserts a full memory
18832d142e59SDavidlohr Bueso     barrier after it.  It isn't guaranteed to insert anything more than a
18842d142e59SDavidlohr Bueso     compiler barrier in a UP compilation.
1885108b42b4SDavid Howells
1886108b42b4SDavid Howells
18871b15611eSPeter Zijlstra (*) smp_mb__before_atomic();
18881b15611eSPeter Zijlstra (*) smp_mb__after_atomic();
1889108b42b4SDavid Howells
189039323c64SManfred Spraul     These are for use with atomic RMW functions that do not imply memory
189139323c64SManfred Spraul     barriers, but where the code needs a memory barrier. Examples for atomic
1892d8566f15SFox Chen     RMW functions that do not imply a memory barrier are e.g. add,
189339323c64SManfred Spraul     subtract, (failed) conditional operations, _relaxed functions,
189439323c64SManfred Spraul     but not atomic_read or atomic_set. A common example where a memory
189539323c64SManfred Spraul     barrier may be required is when atomic ops are used for reference
189639323c64SManfred Spraul     counting.
18971b15611eSPeter Zijlstra
189839323c64SManfred Spraul     These are also used for atomic RMW bitop functions that do not imply a
189939323c64SManfred Spraul     memory barrier (such as set_bit and clear_bit).
1900108b42b4SDavid Howells
1901108b42b4SDavid Howells     As an example, consider a piece of code that marks an object as being dead
1902108b42b4SDavid Howells     and then decrements the object's reference count:
1903108b42b4SDavid Howells
1904108b42b4SDavid Howells	obj->dead = 1;
19051b15611eSPeter Zijlstra	smp_mb__before_atomic();
1906108b42b4SDavid Howells	atomic_dec(&obj->ref_count);
1907108b42b4SDavid Howells
1908108b42b4SDavid Howells     This makes sure that the death mark on the object is perceived to be set
1909108b42b4SDavid Howells     *before* the reference counter is decremented.
1910108b42b4SDavid Howells
1911706eeb3eSPeter Zijlstra     See Documentation/atomic_{t,bitops}.txt for more information.
1912108b42b4SDavid Howells
1913108b42b4SDavid Howells
19141077fa36SAlexander Duyck (*) dma_wmb();
19151077fa36SAlexander Duyck (*) dma_rmb();
1916ed59dfd9SKefeng Wang (*) dma_mb();
19171077fa36SAlexander Duyck
19181077fa36SAlexander Duyck     These are for use with consistent memory to guarantee the ordering
19191077fa36SAlexander Duyck     of writes or reads of shared memory accessible to both the CPU and a
1920289e1c89SParav Pandit     DMA capable device. See Documentation/core-api/dma-api.rst file for more
1921289e1c89SParav Pandit     information about consistent memory.
19221077fa36SAlexander Duyck
19231077fa36SAlexander Duyck     For example, consider a device driver that shares memory with a device
19241077fa36SAlexander Duyck     and uses a descriptor status value to indicate if the descriptor belongs
19251077fa36SAlexander Duyck     to the device or the CPU, and a doorbell to notify it when new
19261077fa36SAlexander Duyck     descriptors are available:
19271077fa36SAlexander Duyck
19281077fa36SAlexander Duyck	if (desc->status != DEVICE_OWN) {
19291077fa36SAlexander Duyck		/* do not read data until we own descriptor */
19301077fa36SAlexander Duyck		dma_rmb();
19311077fa36SAlexander Duyck
19321077fa36SAlexander Duyck		/* read/modify data */
19331077fa36SAlexander Duyck		read_data = desc->data;
19341077fa36SAlexander Duyck		desc->data = write_data;
19351077fa36SAlexander Duyck
19361077fa36SAlexander Duyck		/* flush modifications before status update */
19371077fa36SAlexander Duyck		dma_wmb();
19381077fa36SAlexander Duyck
19391077fa36SAlexander Duyck		/* assign ownership */
19401077fa36SAlexander Duyck		desc->status = DEVICE_OWN;
19411077fa36SAlexander Duyck
1942289e1c89SParav Pandit		/* Make descriptor status visible to the device followed by
1943289e1c89SParav Pandit		 * notify device of new descriptor
1944289e1c89SParav Pandit		 */
19451077fa36SAlexander Duyck		writel(DESC_NOTIFY, doorbell);
19461077fa36SAlexander Duyck	}
19471077fa36SAlexander Duyck
1948289e1c89SParav Pandit     The dma_rmb() allows us to guarantee that the device has released ownership
19497a458007SSylvain Trias     before we read the data from the descriptor, and the dma_wmb() allows
19501077fa36SAlexander Duyck     us to guarantee the data is written to the descriptor before the device
1951ed59dfd9SKefeng Wang     can see it now has ownership.  The dma_mb() implies both a dma_rmb() and
1952289e1c89SParav Pandit     a dma_wmb().
19531077fa36SAlexander Duyck
1954289e1c89SParav Pandit     Note that the dma_*() barriers do not provide any ordering guarantees for
1955289e1c89SParav Pandit     accesses to MMIO regions.  See the later "KERNEL I/O BARRIER EFFECTS"
1956289e1c89SParav Pandit     subsection for more information about I/O accessors and MMIO ordering.
19571077fa36SAlexander Duyck
19583e79f082SAneesh Kumar K.V (*) pmem_wmb();
19593e79f082SAneesh Kumar K.V
19603e79f082SAneesh Kumar K.V     This is for use with persistent memory to ensure that stores for which
19613e79f082SAneesh Kumar K.V     modifications are written to persistent storage reached a platform
19623e79f082SAneesh Kumar K.V     durability domain.
19633e79f082SAneesh Kumar K.V
19643e79f082SAneesh Kumar K.V     For example, after a non-temporal write to pmem region, we use pmem_wmb()
19653e79f082SAneesh Kumar K.V     to ensure that stores have reached a platform durability domain. This ensures
19663e79f082SAneesh Kumar K.V     that stores have updated persistent storage before any data access or
19673e79f082SAneesh Kumar K.V     data transfer caused by subsequent instructions is initiated. This is
19683e79f082SAneesh Kumar K.V     in addition to the ordering done by wmb().
19693e79f082SAneesh Kumar K.V
19703e79f082SAneesh Kumar K.V     For load from persistent memory, existing read memory barriers are sufficient
19713e79f082SAneesh Kumar K.V     to ensure read ordering.
1972dfeccea6SSeongJae Park
1973d5624bb2SXiongfeng Wang (*) io_stop_wc();
1974d5624bb2SXiongfeng Wang
1975d5624bb2SXiongfeng Wang     For memory accesses with write-combining attributes (e.g. those returned
19761ab8f248SSeongJae Park     by ioremap_wc()), the CPU may wait for prior accesses to be merged with
1977d5624bb2SXiongfeng Wang     subsequent ones. io_stop_wc() can be used to prevent the merging of
1978d5624bb2SXiongfeng Wang     write-combining memory accesses before this macro with those after it when
1979d5624bb2SXiongfeng Wang     such wait has performance implications.
1980d5624bb2SXiongfeng Wang
1981108b42b4SDavid Howells===============================
1982108b42b4SDavid HowellsIMPLICIT KERNEL MEMORY BARRIERS
1983108b42b4SDavid Howells===============================
1984108b42b4SDavid Howells
1985108b42b4SDavid HowellsSome of the other functions in the linux kernel imply memory barriers, amongst
1986670bd95eSDavid Howellswhich are locking and scheduling functions.
1987108b42b4SDavid Howells
1988108b42b4SDavid HowellsThis specification is a _minimum_ guarantee; any particular architecture may
1989108b42b4SDavid Howellsprovide more substantial guarantees, but these may not be relied upon outside
1990108b42b4SDavid Howellsof arch specific code.
1991108b42b4SDavid Howells
1992108b42b4SDavid Howells
1993166bda71SSeongJae ParkLOCK ACQUISITION FUNCTIONS
1994166bda71SSeongJae Park--------------------------
1995108b42b4SDavid Howells
1996108b42b4SDavid HowellsThe Linux kernel has a number of locking constructs:
1997108b42b4SDavid Howells
1998108b42b4SDavid Howells (*) spin locks
1999108b42b4SDavid Howells (*) R/W spin locks
2000108b42b4SDavid Howells (*) mutexes
2001108b42b4SDavid Howells (*) semaphores
2002108b42b4SDavid Howells (*) R/W semaphores
2003108b42b4SDavid Howells
20042e4f5382SPeter ZijlstraIn all cases there are variants on "ACQUIRE" operations and "RELEASE" operations
2005108b42b4SDavid Howellsfor each construct.  These operations all imply certain barriers:
2006108b42b4SDavid Howells
20072e4f5382SPeter Zijlstra (1) ACQUIRE operation implication:
2008108b42b4SDavid Howells
20092e4f5382SPeter Zijlstra     Memory operations issued after the ACQUIRE will be completed after the
20102e4f5382SPeter Zijlstra     ACQUIRE operation has completed.
2011108b42b4SDavid Howells
20128dd853d7SPaul E. McKenney     Memory operations issued before the ACQUIRE may be completed after
2013a9668cd6SPeter Zijlstra     the ACQUIRE operation has completed.
2014108b42b4SDavid Howells
20152e4f5382SPeter Zijlstra (2) RELEASE operation implication:
2016108b42b4SDavid Howells
20172e4f5382SPeter Zijlstra     Memory operations issued before the RELEASE will be completed before the
20182e4f5382SPeter Zijlstra     RELEASE operation has completed.
2019108b42b4SDavid Howells
20202e4f5382SPeter Zijlstra     Memory operations issued after the RELEASE may be completed before the
20212e4f5382SPeter Zijlstra     RELEASE operation has completed.
2022108b42b4SDavid Howells
20232e4f5382SPeter Zijlstra (3) ACQUIRE vs ACQUIRE implication:
2024108b42b4SDavid Howells
20252e4f5382SPeter Zijlstra     All ACQUIRE operations issued before another ACQUIRE operation will be
20262e4f5382SPeter Zijlstra     completed before that ACQUIRE operation.
2027108b42b4SDavid Howells
20282e4f5382SPeter Zijlstra (4) ACQUIRE vs RELEASE implication:
2029108b42b4SDavid Howells
20302e4f5382SPeter Zijlstra     All ACQUIRE operations issued before a RELEASE operation will be
20312e4f5382SPeter Zijlstra     completed before the RELEASE operation.
2032108b42b4SDavid Howells
20332e4f5382SPeter Zijlstra (5) Failed conditional ACQUIRE implication:
2034108b42b4SDavid Howells
20352e4f5382SPeter Zijlstra     Certain locking variants of the ACQUIRE operation may fail, either due to
20362e4f5382SPeter Zijlstra     being unable to get the lock immediately, or due to receiving an unblocked
2037806654a9SWill Deacon     signal while asleep waiting for the lock to become available.  Failed
2038108b42b4SDavid Howells     locks do not imply any sort of barrier.
2039108b42b4SDavid Howells
20402e4f5382SPeter Zijlstra[!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only
20412e4f5382SPeter Zijlstraone-way barriers is that the effects of instructions outside of a critical
20422e4f5382SPeter Zijlstrasection may seep into the inside of the critical section.
2043108b42b4SDavid Howells
20442e4f5382SPeter ZijlstraAn ACQUIRE followed by a RELEASE may not be assumed to be full memory barrier
20452e4f5382SPeter Zijlstrabecause it is possible for an access preceding the ACQUIRE to happen after the
20462e4f5382SPeter ZijlstraACQUIRE, and an access following the RELEASE to happen before the RELEASE, and
20472e4f5382SPeter Zijlstrathe two accesses can themselves then cross:
2048670bd95eSDavid Howells
2049670bd95eSDavid Howells	*A = a;
20502e4f5382SPeter Zijlstra	ACQUIRE M
20512e4f5382SPeter Zijlstra	RELEASE M
2052670bd95eSDavid Howells	*B = b;
2053670bd95eSDavid Howells
2054670bd95eSDavid Howellsmay occur as:
2055670bd95eSDavid Howells
20562e4f5382SPeter Zijlstra	ACQUIRE M, STORE *B, STORE *A, RELEASE M
205717eb88e0SPaul E. McKenney
20588dd853d7SPaul E. McKenneyWhen the ACQUIRE and RELEASE are a lock acquisition and release,
20598dd853d7SPaul E. McKenneyrespectively, this same reordering can occur if the lock's ACQUIRE and
20608dd853d7SPaul E. McKenneyRELEASE are to the same lock variable, but only from the perspective of
20618dd853d7SPaul E. McKenneyanother CPU not holding that lock.  In short, a ACQUIRE followed by an
20628dd853d7SPaul E. McKenneyRELEASE may -not- be assumed to be a full memory barrier.
206317eb88e0SPaul E. McKenney
206412d560f4SPaul E. McKenneySimilarly, the reverse case of a RELEASE followed by an ACQUIRE does
206512d560f4SPaul E. McKenneynot imply a full memory barrier.  Therefore, the CPU's execution of the
206612d560f4SPaul E. McKenneycritical sections corresponding to the RELEASE and the ACQUIRE can cross,
206712d560f4SPaul E. McKenneyso that:
206817eb88e0SPaul E. McKenney
206917eb88e0SPaul E. McKenney	*A = a;
20702e4f5382SPeter Zijlstra	RELEASE M
20712e4f5382SPeter Zijlstra	ACQUIRE N
207217eb88e0SPaul E. McKenney	*B = b;
207317eb88e0SPaul E. McKenney
207417eb88e0SPaul E. McKenneycould occur as:
207517eb88e0SPaul E. McKenney
20762e4f5382SPeter Zijlstra	ACQUIRE N, STORE *B, STORE *A, RELEASE M
207717eb88e0SPaul E. McKenney
20788dd853d7SPaul E. McKenneyIt might appear that this reordering could introduce a deadlock.
20798dd853d7SPaul E. McKenneyHowever, this cannot happen because if such a deadlock threatened,
20808dd853d7SPaul E. McKenneythe RELEASE would simply complete, thereby avoiding the deadlock.
20818dd853d7SPaul E. McKenney
20828dd853d7SPaul E. McKenney	Why does this work?
20838dd853d7SPaul E. McKenney
20848dd853d7SPaul E. McKenney	One key point is that we are only talking about the CPU doing
20858dd853d7SPaul E. McKenney	the reordering, not the compiler.  If the compiler (or, for
20868dd853d7SPaul E. McKenney	that matter, the developer) switched the operations, deadlock
20878dd853d7SPaul E. McKenney	-could- occur.
20888dd853d7SPaul E. McKenney
20898dd853d7SPaul E. McKenney	But suppose the CPU reordered the operations.  In this case,
20908dd853d7SPaul E. McKenney	the unlock precedes the lock in the assembly code.  The CPU
20918dd853d7SPaul E. McKenney	simply elected to try executing the later lock operation first.
20928dd853d7SPaul E. McKenney	If there is a deadlock, this lock operation will simply spin (or
20938dd853d7SPaul E. McKenney	try to sleep, but more on that later).	The CPU will eventually
20948dd853d7SPaul E. McKenney	execute the unlock operation (which preceded the lock operation
20958dd853d7SPaul E. McKenney	in the assembly code), which will unravel the potential deadlock,
20968dd853d7SPaul E. McKenney	allowing the lock operation to succeed.
20978dd853d7SPaul E. McKenney
20988dd853d7SPaul E. McKenney	But what if the lock is a sleeplock?  In that case, the code will
20998dd853d7SPaul E. McKenney	try to enter the scheduler, where it will eventually encounter
21008dd853d7SPaul E. McKenney	a memory barrier, which will force the earlier unlock operation
21018dd853d7SPaul E. McKenney	to complete, again unraveling the deadlock.  There might be
21028dd853d7SPaul E. McKenney	a sleep-unlock race, but the locking primitive needs to resolve
21038dd853d7SPaul E. McKenney	such races properly in any case.
21048dd853d7SPaul E. McKenney
2105108b42b4SDavid HowellsLocks and semaphores may not provide any guarantee of ordering on UP compiled
2106108b42b4SDavid Howellssystems, and so cannot be counted on in such a situation to actually achieve
2107108b42b4SDavid Howellsanything at all - especially with respect to I/O accesses - unless combined
2108108b42b4SDavid Howellswith interrupt disabling operations.
2109108b42b4SDavid Howells
2110d7cab36dSSeongJae ParkSee also the section on "Inter-CPU acquiring barrier effects".
2111108b42b4SDavid Howells
2112108b42b4SDavid Howells
2113108b42b4SDavid HowellsAs an example, consider the following:
2114108b42b4SDavid Howells
2115108b42b4SDavid Howells	*A = a;
2116108b42b4SDavid Howells	*B = b;
21172e4f5382SPeter Zijlstra	ACQUIRE
2118108b42b4SDavid Howells	*C = c;
2119108b42b4SDavid Howells	*D = d;
21202e4f5382SPeter Zijlstra	RELEASE
2121108b42b4SDavid Howells	*E = e;
2122108b42b4SDavid Howells	*F = f;
2123108b42b4SDavid Howells
2124108b42b4SDavid HowellsThe following sequence of events is acceptable:
2125108b42b4SDavid Howells
21262e4f5382SPeter Zijlstra	ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
2127108b42b4SDavid Howells
2128108b42b4SDavid Howells	[+] Note that {*F,*A} indicates a combined access.
2129108b42b4SDavid Howells
2130108b42b4SDavid HowellsBut none of the following are:
2131108b42b4SDavid Howells
21322e4f5382SPeter Zijlstra	{*F,*A}, *B,	ACQUIRE, *C, *D,	RELEASE, *E
21332e4f5382SPeter Zijlstra	*A, *B, *C,	ACQUIRE, *D,		RELEASE, *E, *F
21342e4f5382SPeter Zijlstra	*A, *B,		ACQUIRE, *C,		RELEASE, *D, *E, *F
21352e4f5382SPeter Zijlstra	*B,		ACQUIRE, *C, *D,	RELEASE, {*F,*A}, *E
2136108b42b4SDavid Howells
2137108b42b4SDavid Howells
2138108b42b4SDavid Howells
2139108b42b4SDavid HowellsINTERRUPT DISABLING FUNCTIONS
2140108b42b4SDavid Howells-----------------------------
2141108b42b4SDavid Howells
21422e4f5382SPeter ZijlstraFunctions that disable interrupts (ACQUIRE equivalent) and enable interrupts
21432e4f5382SPeter Zijlstra(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
2144108b42b4SDavid Howellsbarriers are required in such a situation, they must be provided from some
2145108b42b4SDavid Howellsother means.
2146108b42b4SDavid Howells
2147108b42b4SDavid Howells
214850fa610aSDavid HowellsSLEEP AND WAKE-UP FUNCTIONS
214950fa610aSDavid Howells---------------------------
215050fa610aSDavid Howells
215150fa610aSDavid HowellsSleeping and waking on an event flagged in global data can be viewed as an
215250fa610aSDavid Howellsinteraction between two pieces of data: the task state of the task waiting for
215350fa610aSDavid Howellsthe event and the global data used to indicate the event.  To make sure that
215450fa610aSDavid Howellsthese appear to happen in the right order, the primitives to begin the process
215550fa610aSDavid Howellsof going to sleep, and the primitives to initiate a wake up imply certain
215650fa610aSDavid Howellsbarriers.
215750fa610aSDavid Howells
215850fa610aSDavid HowellsFirstly, the sleeper normally follows something like this sequence of events:
215950fa610aSDavid Howells
216050fa610aSDavid Howells	for (;;) {
216150fa610aSDavid Howells		set_current_state(TASK_UNINTERRUPTIBLE);
216250fa610aSDavid Howells		if (event_indicated)
216350fa610aSDavid Howells			break;
216450fa610aSDavid Howells		schedule();
216550fa610aSDavid Howells	}
216650fa610aSDavid Howells
216750fa610aSDavid HowellsA general memory barrier is interpolated automatically by set_current_state()
216850fa610aSDavid Howellsafter it has altered the task state:
216950fa610aSDavid Howells
217050fa610aSDavid Howells	CPU 1
217150fa610aSDavid Howells	===============================
217250fa610aSDavid Howells	set_current_state();
2173b92b8b35SPeter Zijlstra	  smp_store_mb();
217450fa610aSDavid Howells	    STORE current->state
217550fa610aSDavid Howells	    <general barrier>
217650fa610aSDavid Howells	LOAD event_indicated
217750fa610aSDavid Howells
217850fa610aSDavid Howellsset_current_state() may be wrapped by:
217950fa610aSDavid Howells
218050fa610aSDavid Howells	prepare_to_wait();
218150fa610aSDavid Howells	prepare_to_wait_exclusive();
218250fa610aSDavid Howells
218350fa610aSDavid Howellswhich therefore also imply a general memory barrier after setting the state.
218450fa610aSDavid HowellsThe whole sequence above is available in various canned forms, all of which
2185*7f8fcc6fSAkira Yokosawainterpolate the memory barrier in the right place, for example:
218650fa610aSDavid Howells
218750fa610aSDavid Howells	wait_event();
2188*7f8fcc6fSAkira Yokosawa	wait_event_cmd();
2189*7f8fcc6fSAkira Yokosawa	wait_event_exclusive_cmd();
219050fa610aSDavid Howells	wait_event_interruptible();
219150fa610aSDavid Howells	wait_event_interruptible_exclusive();
219250fa610aSDavid Howells	wait_event_interruptible_timeout();
219350fa610aSDavid Howells	wait_event_killable();
219450fa610aSDavid Howells	wait_event_timeout();
219550fa610aSDavid Howells	wait_on_bit();
219650fa610aSDavid Howells	wait_on_bit_lock();
219750fa610aSDavid Howells
219850fa610aSDavid Howells
219950fa610aSDavid HowellsSecondly, code that performs a wake up normally follows something like this:
220050fa610aSDavid Howells
220150fa610aSDavid Howells	event_indicated = 1;
220250fa610aSDavid Howells	wake_up(&event_wait_queue);
220350fa610aSDavid Howells
220450fa610aSDavid Howellsor:
220550fa610aSDavid Howells
220650fa610aSDavid Howells	event_indicated = 1;
220750fa610aSDavid Howells	wake_up_process(event_daemon);
220850fa610aSDavid Howells
22097696f991SAndrea ParriA general memory barrier is executed by wake_up() if it wakes something up.
22107696f991SAndrea ParriIf it doesn't wake anything up then a memory barrier may or may not be
22117696f991SAndrea Parriexecuted; you must not rely on it.  The barrier occurs before the task state
22127696f991SAndrea Parriis accessed, in particular, it sits between the STORE to indicate the event
22137696f991SAndrea Parriand the STORE to set TASK_RUNNING:
221450fa610aSDavid Howells
22157696f991SAndrea Parri	CPU 1 (Sleeper)			CPU 2 (Waker)
221650fa610aSDavid Howells	===============================	===============================
221750fa610aSDavid Howells	set_current_state();		STORE event_indicated
2218b92b8b35SPeter Zijlstra	  smp_store_mb();		wake_up();
22197696f991SAndrea Parri	    STORE current->state	  ...
22207696f991SAndrea Parri	    <general barrier>		  <general barrier>
22217696f991SAndrea Parri	LOAD event_indicated		  if ((LOAD task->state) & TASK_NORMAL)
22227696f991SAndrea Parri					    STORE task->state
222350fa610aSDavid Howells
22247696f991SAndrea Parriwhere "task" is the thread being woken up and it equals CPU 1's "current".
22257696f991SAndrea Parri
22267696f991SAndrea ParriTo repeat, a general memory barrier is guaranteed to be executed by wake_up()
22277696f991SAndrea Parriif something is actually awakened, but otherwise there is no such guarantee.
22287696f991SAndrea ParriTo see this, consider the following sequence of events, where X and Y are both
22297696f991SAndrea Parriinitially zero:
22305726ce06SPaul E. McKenney
22315726ce06SPaul E. McKenney	CPU 1				CPU 2
22325726ce06SPaul E. McKenney	===============================	===============================
22337696f991SAndrea Parri	X = 1;				Y = 1;
22345726ce06SPaul E. McKenney	smp_mb();			wake_up();
22357696f991SAndrea Parri	LOAD Y				LOAD X
22365726ce06SPaul E. McKenney
22377696f991SAndrea ParriIf a wakeup does occur, one (at least) of the two loads must see 1.  If, on
22387696f991SAndrea Parrithe other hand, a wakeup does not occur, both loads might see 0.
22397696f991SAndrea Parri
22407696f991SAndrea Parriwake_up_process() always executes a general memory barrier.  The barrier again
22417696f991SAndrea Parrioccurs before the task state is accessed.  In particular, if the wake_up() in
22427696f991SAndrea Parrithe previous snippet were replaced by a call to wake_up_process() then one of
22437696f991SAndrea Parrithe two loads would be guaranteed to see 1.
22445726ce06SPaul E. McKenney
224550fa610aSDavid HowellsThe available waker functions include:
224650fa610aSDavid Howells
224750fa610aSDavid Howells	complete();
224850fa610aSDavid Howells	wake_up();
224950fa610aSDavid Howells	wake_up_all();
225050fa610aSDavid Howells	wake_up_bit();
225150fa610aSDavid Howells	wake_up_interruptible();
225250fa610aSDavid Howells	wake_up_interruptible_all();
225350fa610aSDavid Howells	wake_up_interruptible_nr();
225450fa610aSDavid Howells	wake_up_interruptible_poll();
225550fa610aSDavid Howells	wake_up_interruptible_sync();
225650fa610aSDavid Howells	wake_up_interruptible_sync_poll();
225750fa610aSDavid Howells	wake_up_locked();
225850fa610aSDavid Howells	wake_up_locked_poll();
225950fa610aSDavid Howells	wake_up_nr();
226050fa610aSDavid Howells	wake_up_poll();
226150fa610aSDavid Howells	wake_up_process();
226250fa610aSDavid Howells
22637696f991SAndrea ParriIn terms of memory ordering, these functions all provide the same guarantees of
22647696f991SAndrea Parria wake_up() (or stronger).
226550fa610aSDavid Howells
226650fa610aSDavid Howells[!] Note that the memory barriers implied by the sleeper and the waker do _not_
226750fa610aSDavid Howellsorder multiple stores before the wake-up with respect to loads of those stored
226850fa610aSDavid Howellsvalues after the sleeper has called set_current_state().  For instance, if the
226950fa610aSDavid Howellssleeper does:
227050fa610aSDavid Howells
227150fa610aSDavid Howells	set_current_state(TASK_INTERRUPTIBLE);
227250fa610aSDavid Howells	if (event_indicated)
227350fa610aSDavid Howells		break;
227450fa610aSDavid Howells	__set_current_state(TASK_RUNNING);
227550fa610aSDavid Howells	do_something(my_data);
227650fa610aSDavid Howells
227750fa610aSDavid Howellsand the waker does:
227850fa610aSDavid Howells
227950fa610aSDavid Howells	my_data = value;
228050fa610aSDavid Howells	event_indicated = 1;
228150fa610aSDavid Howells	wake_up(&event_wait_queue);
228250fa610aSDavid Howells
228350fa610aSDavid Howellsthere's no guarantee that the change to event_indicated will be perceived by
228450fa610aSDavid Howellsthe sleeper as coming after the change to my_data.  In such a circumstance, the
228550fa610aSDavid Howellscode on both sides must interpolate its own memory barriers between the
228650fa610aSDavid Howellsseparate data accesses.  Thus the above sleeper ought to do:
228750fa610aSDavid Howells
228850fa610aSDavid Howells	set_current_state(TASK_INTERRUPTIBLE);
228950fa610aSDavid Howells	if (event_indicated) {
229050fa610aSDavid Howells		smp_rmb();
229150fa610aSDavid Howells		do_something(my_data);
229250fa610aSDavid Howells	}
229350fa610aSDavid Howells
229450fa610aSDavid Howellsand the waker should do:
229550fa610aSDavid Howells
229650fa610aSDavid Howells	my_data = value;
229750fa610aSDavid Howells	smp_wmb();
229850fa610aSDavid Howells	event_indicated = 1;
229950fa610aSDavid Howells	wake_up(&event_wait_queue);
230050fa610aSDavid Howells
230150fa610aSDavid Howells
2302108b42b4SDavid HowellsMISCELLANEOUS FUNCTIONS
2303108b42b4SDavid Howells-----------------------
2304108b42b4SDavid Howells
2305108b42b4SDavid HowellsOther functions that imply barriers:
2306108b42b4SDavid Howells
2307108b42b4SDavid Howells (*) schedule() and similar imply full memory barriers.
2308108b42b4SDavid Howells
2309108b42b4SDavid Howells
23102e4f5382SPeter Zijlstra===================================
23112e4f5382SPeter ZijlstraINTER-CPU ACQUIRING BARRIER EFFECTS
23122e4f5382SPeter Zijlstra===================================
2313108b42b4SDavid Howells
2314108b42b4SDavid HowellsOn SMP systems locking primitives give a more substantial form of barrier: one
2315108b42b4SDavid Howellsthat does affect memory access ordering on other CPUs, within the context of
2316108b42b4SDavid Howellsconflict on any particular lock.
2317108b42b4SDavid Howells
2318108b42b4SDavid Howells
23192e4f5382SPeter ZijlstraACQUIRES VS MEMORY ACCESSES
23202e4f5382SPeter Zijlstra---------------------------
2321108b42b4SDavid Howells
232279afecfaSAneesh KumarConsider the following: the system has a pair of spinlocks (M) and (Q), and
2323108b42b4SDavid Howellsthree CPUs; then should the following sequence of events occur:
2324108b42b4SDavid Howells
2325108b42b4SDavid Howells	CPU 1				CPU 2
2326108b42b4SDavid Howells	===============================	===============================
23279af194ceSPaul E. McKenney	WRITE_ONCE(*A, a);		WRITE_ONCE(*E, e);
23282e4f5382SPeter Zijlstra	ACQUIRE M			ACQUIRE Q
23299af194ceSPaul E. McKenney	WRITE_ONCE(*B, b);		WRITE_ONCE(*F, f);
23309af194ceSPaul E. McKenney	WRITE_ONCE(*C, c);		WRITE_ONCE(*G, g);
23312e4f5382SPeter Zijlstra	RELEASE M			RELEASE Q
23329af194ceSPaul E. McKenney	WRITE_ONCE(*D, d);		WRITE_ONCE(*H, h);
2333108b42b4SDavid Howells
233481fc6323SJarek PoplawskiThen there is no guarantee as to what order CPU 3 will see the accesses to *A
2335108b42b4SDavid Howellsthrough *H occur in, other than the constraints imposed by the separate locks
2336108b42b4SDavid Howellson the separate CPUs.  It might, for example, see:
2337108b42b4SDavid Howells
23382e4f5382SPeter Zijlstra	*E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M
2339108b42b4SDavid Howells
2340108b42b4SDavid HowellsBut it won't see any of:
2341108b42b4SDavid Howells
23422e4f5382SPeter Zijlstra	*B, *C or *D preceding ACQUIRE M
23432e4f5382SPeter Zijlstra	*A, *B or *C following RELEASE M
23442e4f5382SPeter Zijlstra	*F, *G or *H preceding ACQUIRE Q
23452e4f5382SPeter Zijlstra	*E, *F or *G following RELEASE Q
2346108b42b4SDavid Howells
2347108b42b4SDavid Howells
2348108b42b4SDavid Howells=================================
2349108b42b4SDavid HowellsWHERE ARE MEMORY BARRIERS NEEDED?
2350108b42b4SDavid Howells=================================
2351108b42b4SDavid Howells
2352108b42b4SDavid HowellsUnder normal operation, memory operation reordering is generally not going to
2353108b42b4SDavid Howellsbe a problem as a single-threaded linear piece of code will still appear to
235450fa610aSDavid Howellswork correctly, even if it's in an SMP kernel.  There are, however, four
2355108b42b4SDavid Howellscircumstances in which reordering definitely _could_ be a problem:
2356108b42b4SDavid Howells
2357108b42b4SDavid Howells (*) Interprocessor interaction.
2358108b42b4SDavid Howells
2359108b42b4SDavid Howells (*) Atomic operations.
2360108b42b4SDavid Howells
236181fc6323SJarek Poplawski (*) Accessing devices.
2362108b42b4SDavid Howells
2363108b42b4SDavid Howells (*) Interrupts.
2364108b42b4SDavid Howells
2365108b42b4SDavid Howells
2366108b42b4SDavid HowellsINTERPROCESSOR INTERACTION
2367108b42b4SDavid Howells--------------------------
2368108b42b4SDavid Howells
2369108b42b4SDavid HowellsWhen there's a system with more than one processor, more than one CPU in the
2370108b42b4SDavid Howellssystem may be working on the same data set at the same time.  This can cause
2371108b42b4SDavid Howellssynchronisation problems, and the usual way of dealing with them is to use
2372108b42b4SDavid Howellslocks.  Locks, however, are quite expensive, and so it may be preferable to
2373108b42b4SDavid Howellsoperate without the use of a lock if at all possible.  In such a case
2374108b42b4SDavid Howellsoperations that affect both CPUs may have to be carefully ordered to prevent
2375108b42b4SDavid Howellsa malfunction.
2376108b42b4SDavid Howells
2377108b42b4SDavid HowellsConsider, for example, the R/W semaphore slow path.  Here a waiting process is
2378108b42b4SDavid Howellsqueued on the semaphore, by virtue of it having a piece of its stack linked to
2379108b42b4SDavid Howellsthe semaphore's list of waiting processes:
2380108b42b4SDavid Howells
2381108b42b4SDavid Howells	struct rw_semaphore {
2382108b42b4SDavid Howells		...
2383108b42b4SDavid Howells		spinlock_t lock;
2384108b42b4SDavid Howells		struct list_head waiters;
2385108b42b4SDavid Howells	};
2386108b42b4SDavid Howells
2387108b42b4SDavid Howells	struct rwsem_waiter {
2388108b42b4SDavid Howells		struct list_head list;
2389108b42b4SDavid Howells		struct task_struct *task;
2390108b42b4SDavid Howells	};
2391108b42b4SDavid Howells
2392108b42b4SDavid HowellsTo wake up a particular waiter, the up_read() or up_write() functions have to:
2393108b42b4SDavid Howells
2394108b42b4SDavid Howells (1) read the next pointer from this waiter's record to know as to where the
2395108b42b4SDavid Howells     next waiter record is;
2396108b42b4SDavid Howells
239781fc6323SJarek Poplawski (2) read the pointer to the waiter's task structure;
2398108b42b4SDavid Howells
2399108b42b4SDavid Howells (3) clear the task pointer to tell the waiter it has been given the semaphore;
2400108b42b4SDavid Howells
2401108b42b4SDavid Howells (4) call wake_up_process() on the task; and
2402108b42b4SDavid Howells
2403108b42b4SDavid Howells (5) release the reference held on the waiter's task struct.
2404108b42b4SDavid Howells
2405108b42b4SDavid HowellsIn other words, it has to perform this sequence of events:
2406108b42b4SDavid Howells
2407108b42b4SDavid Howells	LOAD waiter->list.next;
2408108b42b4SDavid Howells	LOAD waiter->task;
2409108b42b4SDavid Howells	STORE waiter->task;
2410108b42b4SDavid Howells	CALL wakeup
2411108b42b4SDavid Howells	RELEASE task
2412108b42b4SDavid Howells
2413108b42b4SDavid Howellsand if any of these steps occur out of order, then the whole thing may
2414108b42b4SDavid Howellsmalfunction.
2415108b42b4SDavid Howells
2416108b42b4SDavid HowellsOnce it has queued itself and dropped the semaphore lock, the waiter does not
2417108b42b4SDavid Howellsget the lock again; it instead just waits for its task pointer to be cleared
2418108b42b4SDavid Howellsbefore proceeding.  Since the record is on the waiter's stack, this means that
2419108b42b4SDavid Howellsif the task pointer is cleared _before_ the next pointer in the list is read,
2420108b42b4SDavid Howellsanother CPU might start processing the waiter and might clobber the waiter's
2421108b42b4SDavid Howellsstack before the up*() function has a chance to read the next pointer.
2422108b42b4SDavid Howells
2423108b42b4SDavid HowellsConsider then what might happen to the above sequence of events:
2424108b42b4SDavid Howells
2425108b42b4SDavid Howells	CPU 1				CPU 2
2426108b42b4SDavid Howells	===============================	===============================
2427108b42b4SDavid Howells					down_xxx()
2428108b42b4SDavid Howells					Queue waiter
2429108b42b4SDavid Howells					Sleep
2430108b42b4SDavid Howells	up_yyy()
2431108b42b4SDavid Howells	LOAD waiter->task;
2432108b42b4SDavid Howells	STORE waiter->task;
2433108b42b4SDavid Howells					Woken up by other event
2434108b42b4SDavid Howells	<preempt>
2435108b42b4SDavid Howells					Resume processing
2436108b42b4SDavid Howells					down_xxx() returns
2437108b42b4SDavid Howells					call foo()
2438108b42b4SDavid Howells					foo() clobbers *waiter
2439108b42b4SDavid Howells	</preempt>
2440108b42b4SDavid Howells	LOAD waiter->list.next;
2441108b42b4SDavid Howells	--- OOPS ---
2442108b42b4SDavid Howells
2443108b42b4SDavid HowellsThis could be dealt with using the semaphore lock, but then the down_xxx()
2444108b42b4SDavid Howellsfunction has to needlessly get the spinlock again after being woken up.
2445108b42b4SDavid Howells
2446108b42b4SDavid HowellsThe way to deal with this is to insert a general SMP memory barrier:
2447108b42b4SDavid Howells
2448108b42b4SDavid Howells	LOAD waiter->list.next;
2449108b42b4SDavid Howells	LOAD waiter->task;
2450108b42b4SDavid Howells	smp_mb();
2451108b42b4SDavid Howells	STORE waiter->task;
2452108b42b4SDavid Howells	CALL wakeup
2453108b42b4SDavid Howells	RELEASE task
2454108b42b4SDavid Howells
2455108b42b4SDavid HowellsIn this case, the barrier makes a guarantee that all memory accesses before the
2456108b42b4SDavid Howellsbarrier will appear to happen before all the memory accesses after the barrier
2457108b42b4SDavid Howellswith respect to the other CPUs on the system.  It does _not_ guarantee that all
2458108b42b4SDavid Howellsthe memory accesses before the barrier will be complete by the time the barrier
2459108b42b4SDavid Howellsinstruction itself is complete.
2460108b42b4SDavid Howells
2461108b42b4SDavid HowellsOn a UP system - where this wouldn't be a problem - the smp_mb() is just a
2462108b42b4SDavid Howellscompiler barrier, thus making sure the compiler emits the instructions in the
24636bc39274SDavid Howellsright order without actually intervening in the CPU.  Since there's only one
24646bc39274SDavid HowellsCPU, that CPU's dependency ordering logic will take care of everything else.
2465108b42b4SDavid Howells
2466108b42b4SDavid Howells
2467108b42b4SDavid HowellsATOMIC OPERATIONS
2468108b42b4SDavid Howells-----------------
2469108b42b4SDavid Howells
2470806654a9SWill DeaconWhile they are technically interprocessor interaction considerations, atomic
2471dbc8700eSDavid Howellsoperations are noted specially as some of them imply full memory barriers and
2472dbc8700eSDavid Howellssome don't, but they're very heavily relied on as a group throughout the
2473dbc8700eSDavid Howellskernel.
2474dbc8700eSDavid Howells
2475706eeb3eSPeter ZijlstraSee Documentation/atomic_t.txt for more information.
2476108b42b4SDavid Howells
2477108b42b4SDavid Howells
2478108b42b4SDavid HowellsACCESSING DEVICES
2479108b42b4SDavid Howells-----------------
2480108b42b4SDavid Howells
2481108b42b4SDavid HowellsMany devices can be memory mapped, and so appear to the CPU as if they're just
2482108b42b4SDavid Howellsa set of memory locations.  To control such a device, the driver usually has to
2483108b42b4SDavid Howellsmake the right memory accesses in exactly the right order.
2484108b42b4SDavid Howells
2485108b42b4SDavid HowellsHowever, having a clever CPU or a clever compiler creates a potential problem
2486108b42b4SDavid Howellsin that the carefully sequenced accesses in the driver code won't reach the
2487108b42b4SDavid Howellsdevice in the requisite order if the CPU or the compiler thinks it is more
2488108b42b4SDavid Howellsefficient to reorder, combine or merge accesses - something that would cause
2489108b42b4SDavid Howellsthe device to malfunction.
2490108b42b4SDavid Howells
2491108b42b4SDavid HowellsInside of the Linux kernel, I/O should be done through the appropriate accessor
2492108b42b4SDavid Howellsroutines - such as inb() or writel() - which know how to make such accesses
2493806654a9SWill Deaconappropriately sequential.  While this, for the most part, renders the explicit
249491553039SWill Deaconuse of memory barriers unnecessary, if the accessor functions are used to refer
249591553039SWill Deaconto an I/O memory window with relaxed memory access properties, then _mandatory_
249691553039SWill Deaconmemory barriers are required to enforce ordering.
2497108b42b4SDavid Howells
24980fe397f0SHelmut GrohneSee Documentation/driver-api/device-io.rst for more information.
2499108b42b4SDavid Howells
2500108b42b4SDavid Howells
2501108b42b4SDavid HowellsINTERRUPTS
2502108b42b4SDavid Howells----------
2503108b42b4SDavid Howells
2504108b42b4SDavid HowellsA driver may be interrupted by its own interrupt service routine, and thus the
2505108b42b4SDavid Howellstwo parts of the driver may interfere with each other's attempts to control or
2506108b42b4SDavid Howellsaccess the device.
2507108b42b4SDavid Howells
2508108b42b4SDavid HowellsThis may be alleviated - at least in part - by disabling local interrupts (a
2509108b42b4SDavid Howellsform of locking), such that the critical operations are all contained within
2510806654a9SWill Deaconthe interrupt-disabled section in the driver.  While the driver's interrupt
2511108b42b4SDavid Howellsroutine is executing, the driver's core may not run on the same CPU, and its
2512108b42b4SDavid Howellsinterrupt is not permitted to happen again until the current interrupt has been
2513108b42b4SDavid Howellshandled, thus the interrupt handler does not need to lock against that.
2514108b42b4SDavid Howells
2515108b42b4SDavid HowellsHowever, consider a driver that was talking to an ethernet card that sports an
2516108b42b4SDavid Howellsaddress register and a data register.  If that driver's core talks to the card
2517108b42b4SDavid Howellsunder interrupt-disablement and then the driver's interrupt handler is invoked:
2518108b42b4SDavid Howells
2519108b42b4SDavid Howells	LOCAL IRQ DISABLE
2520108b42b4SDavid Howells	writew(ADDR, 3);
2521108b42b4SDavid Howells	writew(DATA, y);
2522108b42b4SDavid Howells	LOCAL IRQ ENABLE
2523108b42b4SDavid Howells	<interrupt>
2524108b42b4SDavid Howells	writew(ADDR, 4);
2525108b42b4SDavid Howells	q = readw(DATA);
2526108b42b4SDavid Howells	</interrupt>
2527108b42b4SDavid Howells
2528108b42b4SDavid HowellsThe store to the data register might happen after the second store to the
2529108b42b4SDavid Howellsaddress register if ordering rules are sufficiently relaxed:
2530108b42b4SDavid Howells
2531108b42b4SDavid Howells	STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA
2532108b42b4SDavid Howells
2533108b42b4SDavid Howells
2534108b42b4SDavid HowellsIf ordering rules are relaxed, it must be assumed that accesses done inside an
2535108b42b4SDavid Howellsinterrupt disabled section may leak outside of it and may interleave with
2536108b42b4SDavid Howellsaccesses performed in an interrupt - and vice versa - unless implicit or
2537108b42b4SDavid Howellsexplicit barriers are used.
2538108b42b4SDavid Howells
2539108b42b4SDavid HowellsNormally this won't be a problem because the I/O accesses done inside such
2540108b42b4SDavid Howellssections will include synchronous load operations on strictly ordered I/O
254191553039SWill Deaconregisters that form implicit I/O barriers.
2542108b42b4SDavid Howells
2543108b42b4SDavid Howells
2544108b42b4SDavid HowellsA similar situation may occur between an interrupt routine and two routines
2545108b42b4SDavid Howellsrunning on separate CPUs that communicate with each other.  If such a case is
2546108b42b4SDavid Howellslikely, then interrupt-disabling locks should be used to guarantee ordering.
2547108b42b4SDavid Howells
2548108b42b4SDavid Howells
2549108b42b4SDavid Howells==========================
2550108b42b4SDavid HowellsKERNEL I/O BARRIER EFFECTS
2551108b42b4SDavid Howells==========================
2552108b42b4SDavid Howells
25534614bbdeSWill DeaconInterfacing with peripherals via I/O accesses is deeply architecture and device
25544614bbdeSWill Deaconspecific. Therefore, drivers which are inherently non-portable may rely on
25554614bbdeSWill Deaconspecific behaviours of their target systems in order to achieve synchronization
25564614bbdeSWill Deaconin the most lightweight manner possible. For drivers intending to be portable
25574614bbdeSWill Deaconbetween multiple architectures and bus implementations, the kernel offers a
25584614bbdeSWill Deaconseries of accessor functions that provide various degrees of ordering
25594614bbdeSWill Deaconguarantees:
2560108b42b4SDavid Howells
2561108b42b4SDavid Howells (*) readX(), writeX():
2562108b42b4SDavid Howells
25630cde62a4SWill Deacon	The readX() and writeX() MMIO accessors take a pointer to the
25640cde62a4SWill Deacon	peripheral being accessed as an __iomem * parameter. For pointers
25650cde62a4SWill Deacon	mapped with the default I/O attributes (e.g. those returned by
25660cde62a4SWill Deacon	ioremap()), the ordering guarantees are as follows:
2567108b42b4SDavid Howells
25684614bbdeSWill Deacon	1. All readX() and writeX() accesses to the same peripheral are ordered
25699726840dSWill Deacon	   with respect to each other. This ensures that MMIO register accesses
25709726840dSWill Deacon	   by the same CPU thread to a particular device will arrive in program
25719726840dSWill Deacon	   order.
2572108b42b4SDavid Howells
25739726840dSWill Deacon	2. A writeX() issued by a CPU thread holding a spinlock is ordered
25749726840dSWill Deacon	   before a writeX() to the same peripheral from another CPU thread
25759726840dSWill Deacon	   issued after a later acquisition of the same spinlock. This ensures
25769726840dSWill Deacon	   that MMIO register writes to a particular device issued while holding
25779726840dSWill Deacon	   a spinlock will arrive in an order consistent with acquisitions of
25789726840dSWill Deacon	   the lock.
2579108b42b4SDavid Howells
25809726840dSWill Deacon	3. A writeX() by a CPU thread to the peripheral will first wait for the
25819726840dSWill Deacon	   completion of all prior writes to memory either issued by, or
25829726840dSWill Deacon	   propagated to, the same thread. This ensures that writes by the CPU
25839726840dSWill Deacon	   to an outbound DMA buffer allocated by dma_alloc_coherent() will be
25849726840dSWill Deacon	   visible to a DMA engine when the CPU writes to its MMIO control
25859726840dSWill Deacon	   register to trigger the transfer.
2586108b42b4SDavid Howells
25879726840dSWill Deacon	4. A readX() by a CPU thread from the peripheral will complete before
25889726840dSWill Deacon	   any subsequent reads from memory by the same thread can begin. This
25899726840dSWill Deacon	   ensures that reads by the CPU from an incoming DMA buffer allocated
25909726840dSWill Deacon	   by dma_alloc_coherent() will not see stale data after reading from
25919726840dSWill Deacon	   the DMA engine's MMIO status register to establish that the DMA
25929726840dSWill Deacon	   transfer has completed.
25939726840dSWill Deacon
25949726840dSWill Deacon	5. A readX() by a CPU thread from the peripheral will complete before
25959726840dSWill Deacon	   any subsequent delay() loop can begin execution on the same thread.
25969726840dSWill Deacon	   This ensures that two MMIO register writes by the CPU to a peripheral
25979726840dSWill Deacon	   will arrive at least 1us apart if the first write is immediately read
25989726840dSWill Deacon	   back with readX() and udelay(1) is called prior to the second
25999726840dSWill Deacon	   writeX():
2600108b42b4SDavid Howells
26010cde62a4SWill Deacon		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
26020cde62a4SWill Deacon		readl(DEVICE_REGISTER_0);
26030cde62a4SWill Deacon		udelay(1);
26040cde62a4SWill Deacon		writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.
26050cde62a4SWill Deacon
26060cde62a4SWill Deacon	The ordering properties of __iomem pointers obtained with non-default
26070cde62a4SWill Deacon	attributes (e.g. those returned by ioremap_wc()) are specific to the
26080cde62a4SWill Deacon	underlying architecture and therefore the guarantees listed above cannot
26090cde62a4SWill Deacon	generally be relied upon for accesses to these types of mappings.
2610108b42b4SDavid Howells
26114614bbdeSWill Deacon (*) readX_relaxed(), writeX_relaxed():
2612108b42b4SDavid Howells
2613a8e0aeadSWill Deacon	These are similar to readX() and writeX(), but provide weaker memory
2614a8e0aeadSWill Deacon	ordering guarantees. Specifically, they do not guarantee ordering with
26159726840dSWill Deacon	respect to locking, normal memory accesses or delay() loops (i.e.
26169726840dSWill Deacon	bullets 2-5 above) but they are still guaranteed to be ordered with
26179726840dSWill Deacon	respect to other accesses from the same CPU thread to the same
26189726840dSWill Deacon	peripheral when operating on __iomem pointers mapped with the default
26199726840dSWill Deacon	I/O attributes.
26204614bbdeSWill Deacon
26214614bbdeSWill Deacon (*) readsX(), writesX():
26224614bbdeSWill Deacon
26234614bbdeSWill Deacon	The readsX() and writesX() MMIO accessors are designed for accessing
26244614bbdeSWill Deacon	register-based, memory-mapped FIFOs residing on peripherals that are not
26254614bbdeSWill Deacon	capable of performing DMA. Consequently, they provide only the ordering
26264614bbdeSWill Deacon	guarantees of readX_relaxed() and writeX_relaxed(), as documented above.
26274614bbdeSWill Deacon
26284614bbdeSWill Deacon (*) inX(), outX():
26294614bbdeSWill Deacon
26304614bbdeSWill Deacon	The inX() and outX() accessors are intended to access legacy port-mapped
26314614bbdeSWill Deacon	I/O peripherals, which may require special instructions on some
26324614bbdeSWill Deacon	architectures (notably x86). The port number of the peripheral being
26334614bbdeSWill Deacon	accessed is passed as an argument.
26344614bbdeSWill Deacon
26354614bbdeSWill Deacon	Since many CPU architectures ultimately access these peripherals via an
26360cde62a4SWill Deacon	internal virtual memory mapping, the portable ordering guarantees
26370cde62a4SWill Deacon	provided by inX() and outX() are the same as those provided by readX()
26380cde62a4SWill Deacon	and writeX() respectively when accessing a mapping with the default I/O
26390cde62a4SWill Deacon	attributes.
26404614bbdeSWill Deacon
26414614bbdeSWill Deacon	Device drivers may expect outX() to emit a non-posted write transaction
26424614bbdeSWill Deacon	that waits for a completion response from the I/O peripheral before
26434614bbdeSWill Deacon	returning. This is not guaranteed by all architectures and is therefore
26444614bbdeSWill Deacon	not part of the portable ordering semantics.
26454614bbdeSWill Deacon
26464614bbdeSWill Deacon (*) insX(), outsX():
26474614bbdeSWill Deacon
26484614bbdeSWill Deacon	As above, the insX() and outsX() accessors provide the same ordering
26490cde62a4SWill Deacon	guarantees as readsX() and writesX() respectively when accessing a
26500cde62a4SWill Deacon	mapping with the default I/O attributes.
2651108b42b4SDavid Howells
26520cde62a4SWill Deacon (*) ioreadX(), iowriteX():
2653108b42b4SDavid Howells
265481fc6323SJarek Poplawski	These will perform appropriately for the type of access they're actually
2655108b42b4SDavid Howells	doing, be it inX()/outX() or readX()/writeX().
2656108b42b4SDavid Howells
26579726840dSWill DeaconWith the exception of the string accessors (insX(), outsX(), readsX() and
26589726840dSWill DeaconwritesX()), all of the above assume that the underlying peripheral is
26599726840dSWill Deaconlittle-endian and will therefore perform byte-swapping operations on big-endian
26609726840dSWill Deaconarchitectures.
26614614bbdeSWill Deacon
2662108b42b4SDavid Howells
2663108b42b4SDavid Howells========================================
2664108b42b4SDavid HowellsASSUMED MINIMUM EXECUTION ORDERING MODEL
2665108b42b4SDavid Howells========================================
2666108b42b4SDavid Howells
2667108b42b4SDavid HowellsIt has to be assumed that the conceptual CPU is weakly-ordered but that it will
2668108b42b4SDavid Howellsmaintain the appearance of program causality with respect to itself.  Some CPUs
2669108b42b4SDavid Howells(such as i386 or x86_64) are more constrained than others (such as powerpc or
2670108b42b4SDavid Howellsfrv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
2671108b42b4SDavid Howellsof arch-specific code.
2672108b42b4SDavid Howells
2673108b42b4SDavid HowellsThis means that it must be considered that the CPU will execute its instruction
2674108b42b4SDavid Howellsstream in any order it feels like - or even in parallel - provided that if an
267581fc6323SJarek Poplawskiinstruction in the stream depends on an earlier instruction, then that
2676108b42b4SDavid Howellsearlier instruction must be sufficiently complete[*] before the later
2677108b42b4SDavid Howellsinstruction may proceed; in other words: provided that the appearance of
2678108b42b4SDavid Howellscausality is maintained.
2679108b42b4SDavid Howells
2680108b42b4SDavid Howells [*] Some instructions have more than one effect - such as changing the
2681108b42b4SDavid Howells     condition codes, changing registers or changing memory - and different
2682108b42b4SDavid Howells     instructions may depend on different effects.
2683108b42b4SDavid Howells
2684108b42b4SDavid HowellsA CPU may also discard any instruction sequence that winds up having no
2685108b42b4SDavid Howellsultimate effect.  For example, if two adjacent instructions both load an
2686108b42b4SDavid Howellsimmediate value into the same register, the first may be discarded.
2687108b42b4SDavid Howells
2688108b42b4SDavid Howells
2689108b42b4SDavid HowellsSimilarly, it has to be assumed that compiler might reorder the instruction
2690108b42b4SDavid Howellsstream in any way it sees fit, again provided the appearance of causality is
2691108b42b4SDavid Howellsmaintained.
2692108b42b4SDavid Howells
2693108b42b4SDavid Howells
2694108b42b4SDavid Howells============================
2695108b42b4SDavid HowellsTHE EFFECTS OF THE CPU CACHE
2696108b42b4SDavid Howells============================
2697108b42b4SDavid Howells
2698108b42b4SDavid HowellsThe way cached memory operations are perceived across the system is affected to
2699108b42b4SDavid Howellsa certain extent by the caches that lie between CPUs and memory, and by the
2700108b42b4SDavid Howellsmemory coherence system that maintains the consistency of state in the system.
2701108b42b4SDavid Howells
2702108b42b4SDavid HowellsAs far as the way a CPU interacts with another part of the system through the
2703108b42b4SDavid Howellscaches goes, the memory system has to include the CPU's caches, and memory
2704108b42b4SDavid Howellsbarriers for the most part act at the interface between the CPU and its cache
2705108b42b4SDavid Howells(memory barriers logically act on the dotted line in the following diagram):
2706108b42b4SDavid Howells
2707108b42b4SDavid Howells	    <--- CPU --->         :       <----------- Memory ----------->
2708108b42b4SDavid Howells	                          :
2709108b42b4SDavid Howells	+--------+    +--------+  :   +--------+    +-----------+
2710108b42b4SDavid Howells	|        |    |        |  :   |        |    |           |    +--------+
2711108b42b4SDavid Howells	|  CPU   |    | Memory |  :   | CPU    |    |           |    |        |
2712108b42b4SDavid Howells	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2713108b42b4SDavid Howells	|        |    | Queue  |  :   |        |    |           |--->| Memory |
2714108b42b4SDavid Howells	|        |    |        |  :   |        |    |           |    |        |
2715108b42b4SDavid Howells	+--------+    +--------+  :   +--------+    |           |    |        |
2716108b42b4SDavid Howells	                          :                 | Cache     |    +--------+
2717108b42b4SDavid Howells	                          :                 | Coherency |
2718108b42b4SDavid Howells	                          :                 | Mechanism |    +--------+
2719108b42b4SDavid Howells	+--------+    +--------+  :   +--------+    |           |    |	      |
2720108b42b4SDavid Howells	|        |    |        |  :   |        |    |           |    |        |
2721108b42b4SDavid Howells	|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
2722108b42b4SDavid Howells	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2723108b42b4SDavid Howells	|        |    | Queue  |  :   |        |    |           |    |        |
2724108b42b4SDavid Howells	|        |    |        |  :   |        |    |           |    +--------+
2725108b42b4SDavid Howells	+--------+    +--------+  :   +--------+    +-----------+
2726108b42b4SDavid Howells	                          :
2727108b42b4SDavid Howells	                          :
2728108b42b4SDavid Howells
2729108b42b4SDavid HowellsAlthough any particular load or store may not actually appear outside of the
2730108b42b4SDavid HowellsCPU that issued it since it may have been satisfied within the CPU's own cache,
2731108b42b4SDavid Howellsit will still appear as if the full memory access had taken place as far as the
2732108b42b4SDavid Howellsother CPUs are concerned since the cache coherency mechanisms will migrate the
2733108b42b4SDavid Howellscacheline over to the accessing CPU and propagate the effects upon conflict.
2734108b42b4SDavid Howells
2735108b42b4SDavid HowellsThe CPU core may execute instructions in any order it deems fit, provided the
2736108b42b4SDavid Howellsexpected program causality appears to be maintained.  Some of the instructions
2737108b42b4SDavid Howellsgenerate load and store operations which then go into the queue of memory
2738108b42b4SDavid Howellsaccesses to be performed.  The core may place these in the queue in any order
2739108b42b4SDavid Howellsit wishes, and continue execution until it is forced to wait for an instruction
2740108b42b4SDavid Howellsto complete.
2741108b42b4SDavid Howells
2742108b42b4SDavid HowellsWhat memory barriers are concerned with is controlling the order in which
2743108b42b4SDavid Howellsaccesses cross from the CPU side of things to the memory side of things, and
2744108b42b4SDavid Howellsthe order in which the effects are perceived to happen by the other observers
2745108b42b4SDavid Howellsin the system.
2746108b42b4SDavid Howells
2747108b42b4SDavid Howells[!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
2748108b42b4SDavid Howellstheir own loads and stores as if they had happened in program order.
2749108b42b4SDavid Howells
2750108b42b4SDavid Howells[!] MMIO or other device accesses may bypass the cache system.  This depends on
2751108b42b4SDavid Howellsthe properties of the memory window through which devices are accessed and/or
2752108b42b4SDavid Howellsthe use of any special device communication instructions the CPU may have.
2753108b42b4SDavid Howells
2754108b42b4SDavid Howells
2755108b42b4SDavid HowellsCACHE COHERENCY VS DMA
2756108b42b4SDavid Howells----------------------
2757108b42b4SDavid Howells
2758108b42b4SDavid HowellsNot all systems maintain cache coherency with respect to devices doing DMA.  In
2759108b42b4SDavid Howellssuch cases, a device attempting DMA may obtain stale data from RAM because
2760108b42b4SDavid Howellsdirty cache lines may be resident in the caches of various CPUs, and may not
2761108b42b4SDavid Howellshave been written back to RAM yet.  To deal with this, the appropriate part of
2762108b42b4SDavid Howellsthe kernel must flush the overlapping bits of cache on each CPU (and maybe
2763108b42b4SDavid Howellsinvalidate them as well).
2764108b42b4SDavid Howells
2765108b42b4SDavid HowellsIn addition, the data DMA'd to RAM by a device may be overwritten by dirty
2766108b42b4SDavid Howellscache lines being written back to RAM from a CPU's cache after the device has
276781fc6323SJarek Poplawskiinstalled its own data, or cache lines present in the CPU's cache may simply
276881fc6323SJarek Poplawskiobscure the fact that RAM has been updated, until at such time as the cacheline
276981fc6323SJarek Poplawskiis discarded from the CPU's cache and reloaded.  To deal with this, the
277081fc6323SJarek Poplawskiappropriate part of the kernel must invalidate the overlapping bits of the
2771108b42b4SDavid Howellscache on each CPU.
2772108b42b4SDavid Howells
2773f556082dSAkira YokosawaSee Documentation/core-api/cachetlb.rst for more information on cache
2774f556082dSAkira Yokosawamanagement.
2775108b42b4SDavid Howells
2776108b42b4SDavid Howells
2777108b42b4SDavid HowellsCACHE COHERENCY VS MMIO
2778108b42b4SDavid Howells-----------------------
2779108b42b4SDavid Howells
2780108b42b4SDavid HowellsMemory mapped I/O usually takes place through memory locations that are part of
278181fc6323SJarek Poplawskia window in the CPU's memory space that has different properties assigned than
2782108b42b4SDavid Howellsthe usual RAM directed window.
2783108b42b4SDavid Howells
2784108b42b4SDavid HowellsAmongst these properties is usually the fact that such accesses bypass the
2785108b42b4SDavid Howellscaching entirely and go directly to the device buses.  This means MMIO accesses
2786108b42b4SDavid Howellsmay, in effect, overtake accesses to cached memory that were emitted earlier.
2787108b42b4SDavid HowellsA memory barrier isn't sufficient in such a case, but rather the cache must be
2788108b42b4SDavid Howellsflushed between the cached memory write and the MMIO access if the two are in
2789108b42b4SDavid Howellsany way dependent.
2790108b42b4SDavid Howells
2791108b42b4SDavid Howells
2792108b42b4SDavid Howells=========================
2793108b42b4SDavid HowellsTHE THINGS CPUS GET UP TO
2794108b42b4SDavid Howells=========================
2795108b42b4SDavid Howells
2796108b42b4SDavid HowellsA programmer might take it for granted that the CPU will perform memory
279781fc6323SJarek Poplawskioperations in exactly the order specified, so that if the CPU is, for example,
2798108b42b4SDavid Howellsgiven the following piece of code to execute:
2799108b42b4SDavid Howells
28009af194ceSPaul E. McKenney	a = READ_ONCE(*A);
28019af194ceSPaul E. McKenney	WRITE_ONCE(*B, b);
28029af194ceSPaul E. McKenney	c = READ_ONCE(*C);
28039af194ceSPaul E. McKenney	d = READ_ONCE(*D);
28049af194ceSPaul E. McKenney	WRITE_ONCE(*E, e);
2805108b42b4SDavid Howells
280681fc6323SJarek Poplawskithey would then expect that the CPU will complete the memory operation for each
2807108b42b4SDavid Howellsinstruction before moving on to the next one, leading to a definite sequence of
2808108b42b4SDavid Howellsoperations as seen by external observers in the system:
2809108b42b4SDavid Howells
2810108b42b4SDavid Howells	LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.
2811108b42b4SDavid Howells
2812108b42b4SDavid Howells
2813108b42b4SDavid HowellsReality is, of course, much messier.  With many CPUs and compilers, the above
2814108b42b4SDavid Howellsassumption doesn't hold because:
2815108b42b4SDavid Howells
2816108b42b4SDavid Howells (*) loads are more likely to need to be completed immediately to permit
2817108b42b4SDavid Howells     execution progress, whereas stores can often be deferred without a
2818108b42b4SDavid Howells     problem;
2819108b42b4SDavid Howells
2820108b42b4SDavid Howells (*) loads may be done speculatively, and the result discarded should it prove
2821108b42b4SDavid Howells     to have been unnecessary;
2822108b42b4SDavid Howells
282381fc6323SJarek Poplawski (*) loads may be done speculatively, leading to the result having been fetched
282481fc6323SJarek Poplawski     at the wrong time in the expected sequence of events;
2825108b42b4SDavid Howells
2826108b42b4SDavid Howells (*) the order of the memory accesses may be rearranged to promote better use
2827108b42b4SDavid Howells     of the CPU buses and caches;
2828108b42b4SDavid Howells
2829108b42b4SDavid Howells (*) loads and stores may be combined to improve performance when talking to
2830108b42b4SDavid Howells     memory or I/O hardware that can do batched accesses of adjacent locations,
2831108b42b4SDavid Howells     thus cutting down on transaction setup costs (memory and PCI devices may
2832108b42b4SDavid Howells     both be able to do this); and
2833108b42b4SDavid Howells
2834806654a9SWill Deacon (*) the CPU's data cache may affect the ordering, and while cache-coherency
2835108b42b4SDavid Howells     mechanisms may alleviate this - once the store has actually hit the cache
2836108b42b4SDavid Howells     - there's no guarantee that the coherency management will be propagated in
2837108b42b4SDavid Howells     order to other CPUs.
2838108b42b4SDavid Howells
2839108b42b4SDavid HowellsSo what another CPU, say, might actually observe from the above piece of code
2840108b42b4SDavid Howellsis:
2841108b42b4SDavid Howells
2842108b42b4SDavid Howells	LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B
2843108b42b4SDavid Howells
2844108b42b4SDavid Howells	(Where "LOAD {*C,*D}" is a combined load)
2845108b42b4SDavid Howells
2846108b42b4SDavid Howells
2847108b42b4SDavid HowellsHowever, it is guaranteed that a CPU will be self-consistent: it will see its
2848108b42b4SDavid Howells_own_ accesses appear to be correctly ordered, without the need for a memory
2849108b42b4SDavid Howellsbarrier.  For instance with the following code:
2850108b42b4SDavid Howells
28519af194ceSPaul E. McKenney	U = READ_ONCE(*A);
28529af194ceSPaul E. McKenney	WRITE_ONCE(*A, V);
28539af194ceSPaul E. McKenney	WRITE_ONCE(*A, W);
28549af194ceSPaul E. McKenney	X = READ_ONCE(*A);
28559af194ceSPaul E. McKenney	WRITE_ONCE(*A, Y);
28569af194ceSPaul E. McKenney	Z = READ_ONCE(*A);
2857108b42b4SDavid Howells
2858108b42b4SDavid Howellsand assuming no intervention by an external influence, it can be assumed that
2859108b42b4SDavid Howellsthe final result will appear to be:
2860108b42b4SDavid Howells
2861108b42b4SDavid Howells	U == the original value of *A
2862108b42b4SDavid Howells	X == W
2863108b42b4SDavid Howells	Z == Y
2864108b42b4SDavid Howells	*A == Y
2865108b42b4SDavid Howells
2866108b42b4SDavid HowellsThe code above may cause the CPU to generate the full sequence of memory
2867108b42b4SDavid Howellsaccesses:
2868108b42b4SDavid Howells
2869108b42b4SDavid Howells	U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A
2870108b42b4SDavid Howells
2871108b42b4SDavid Howellsin that order, but, without intervention, the sequence may have almost any
28729af194ceSPaul E. McKenneycombination of elements combined or discarded, provided the program's view
28739af194ceSPaul E. McKenneyof the world remains consistent.  Note that READ_ONCE() and WRITE_ONCE()
28749af194ceSPaul E. McKenneyare -not- optional in the above example, as there are architectures
28759af194ceSPaul E. McKenneywhere a given CPU might reorder successive loads to the same location.
28769af194ceSPaul E. McKenneyOn such architectures, READ_ONCE() and WRITE_ONCE() do whatever is
28779af194ceSPaul E. McKenneynecessary to prevent this, for example, on Itanium the volatile casts
28789af194ceSPaul E. McKenneyused by READ_ONCE() and WRITE_ONCE() cause GCC to emit the special ld.acq
28799af194ceSPaul E. McKenneyand st.rel instructions (respectively) that prevent such reordering.
2880108b42b4SDavid Howells
2881108b42b4SDavid HowellsThe compiler may also combine, discard or defer elements of the sequence before
2882108b42b4SDavid Howellsthe CPU even sees them.
2883108b42b4SDavid Howells
2884108b42b4SDavid HowellsFor instance:
2885108b42b4SDavid Howells
2886108b42b4SDavid Howells	*A = V;
2887108b42b4SDavid Howells	*A = W;
2888108b42b4SDavid Howells
2889108b42b4SDavid Howellsmay be reduced to:
2890108b42b4SDavid Howells
2891108b42b4SDavid Howells	*A = W;
2892108b42b4SDavid Howells
28939af194ceSPaul E. McKenneysince, without either a write barrier or an WRITE_ONCE(), it can be
28942ecf8101SPaul E. McKenneyassumed that the effect of the storage of V to *A is lost.  Similarly:
2895108b42b4SDavid Howells
2896108b42b4SDavid Howells	*A = Y;
2897108b42b4SDavid Howells	Z = *A;
2898108b42b4SDavid Howells
28999af194ceSPaul E. McKenneymay, without a memory barrier or an READ_ONCE() and WRITE_ONCE(), be
29009af194ceSPaul E. McKenneyreduced to:
2901108b42b4SDavid Howells
2902108b42b4SDavid Howells	*A = Y;
2903108b42b4SDavid Howells	Z = Y;
2904108b42b4SDavid Howells
2905108b42b4SDavid Howellsand the LOAD operation never appear outside of the CPU.
2906108b42b4SDavid Howells
2907108b42b4SDavid Howells
2908108b42b4SDavid HowellsAND THEN THERE'S THE ALPHA
2909108b42b4SDavid Howells--------------------------
2910108b42b4SDavid Howells
2911108b42b4SDavid HowellsThe DEC Alpha CPU is one of the most relaxed CPUs there is.  Not only that,
2912108b42b4SDavid Howellssome versions of the Alpha CPU have a split data cache, permitting them to have
291381fc6323SJarek Poplawskitwo semantically-related cache lines updated at separate times.  This is where
2914f556082dSAkira Yokosawathe address-dependency barrier really becomes necessary as this synchronises
2915f556082dSAkira Yokosawaboth caches with the memory coherence system, thus making it seem like pointer
2916108b42b4SDavid Howellschanges vs new data occur in the right order.
2917108b42b4SDavid Howells
2918f28f0868SPaul E. McKenneyThe Alpha defines the Linux kernel's memory model, although as of v4.15
29198ca924aeSWill Deaconthe Linux kernel's addition of smp_mb() to READ_ONCE() on Alpha greatly
29208ca924aeSWill Deaconreduced its impact on the memory model.
2921108b42b4SDavid Howells
29220b6fa347SSeongJae Park
29236a65d263SMichael S. TsirkinVIRTUAL MACHINE GUESTS
29243dbf0913SSeongJae Park----------------------
29256a65d263SMichael S. Tsirkin
29266a65d263SMichael S. TsirkinGuests running within virtual machines might be affected by SMP effects even if
29276a65d263SMichael S. Tsirkinthe guest itself is compiled without SMP support.  This is an artifact of
29286a65d263SMichael S. Tsirkininterfacing with an SMP host while running an UP kernel.  Using mandatory
29296a65d263SMichael S. Tsirkinbarriers for this use-case would be possible but is often suboptimal.
29306a65d263SMichael S. Tsirkin
29316a65d263SMichael S. TsirkinTo handle this case optimally, low-level virt_mb() etc macros are available.
29326a65d263SMichael S. TsirkinThese have the same effect as smp_mb() etc when SMP is enabled, but generate
29336a65d263SMichael S. Tsirkinidentical code for SMP and non-SMP systems.  For example, virtual machine guests
29346a65d263SMichael S. Tsirkinshould use virt_mb() rather than smp_mb() when synchronizing against a
29356a65d263SMichael S. Tsirkin(possibly SMP) host.
29366a65d263SMichael S. Tsirkin
29376a65d263SMichael S. TsirkinThese are equivalent to smp_mb() etc counterparts in all other respects,
29386a65d263SMichael S. Tsirkinin particular, they do not control MMIO effects: to control
29396a65d263SMichael S. TsirkinMMIO effects, use mandatory barriers.
2940108b42b4SDavid Howells
29410b6fa347SSeongJae Park
294290fddabfSDavid Howells============
294390fddabfSDavid HowellsEXAMPLE USES
294490fddabfSDavid Howells============
294590fddabfSDavid Howells
294690fddabfSDavid HowellsCIRCULAR BUFFERS
294790fddabfSDavid Howells----------------
294890fddabfSDavid Howells
294990fddabfSDavid HowellsMemory barriers can be used to implement circular buffering without the need
295090fddabfSDavid Howellsof a lock to serialise the producer with the consumer.  See:
295190fddabfSDavid Howells
2952d8a121e3SMauro Carvalho Chehab	Documentation/core-api/circular-buffers.rst
295390fddabfSDavid Howells
295490fddabfSDavid Howellsfor details.
295590fddabfSDavid Howells
295690fddabfSDavid Howells
2957108b42b4SDavid Howells==========
2958108b42b4SDavid HowellsREFERENCES
2959108b42b4SDavid Howells==========
2960108b42b4SDavid Howells
2961108b42b4SDavid HowellsAlpha AXP Architecture Reference Manual, Second Edition (Sites & Witek,
2962108b42b4SDavid HowellsDigital Press)
2963108b42b4SDavid Howells	Chapter 5.2: Physical Address Space Characteristics
2964108b42b4SDavid Howells	Chapter 5.4: Caches and Write Buffers
2965108b42b4SDavid Howells	Chapter 5.5: Data Sharing
2966108b42b4SDavid Howells	Chapter 5.6: Read/Write Ordering
2967108b42b4SDavid Howells
2968108b42b4SDavid HowellsAMD64 Architecture Programmer's Manual Volume 2: System Programming
2969108b42b4SDavid Howells	Chapter 7.1: Memory-Access Ordering
2970108b42b4SDavid Howells	Chapter 7.4: Buffering and Combining Memory Writes
2971108b42b4SDavid Howells
2972f1ab25a3SPaul E. McKenneyARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile)
2973f1ab25a3SPaul E. McKenney	Chapter B2: The AArch64 Application Level Memory Model
2974f1ab25a3SPaul E. McKenney
2975108b42b4SDavid HowellsIA-32 Intel Architecture Software Developer's Manual, Volume 3:
2976108b42b4SDavid HowellsSystem Programming Guide
2977108b42b4SDavid Howells	Chapter 7.1: Locked Atomic Operations
2978108b42b4SDavid Howells	Chapter 7.2: Memory Ordering
2979108b42b4SDavid Howells	Chapter 7.4: Serializing Instructions
2980108b42b4SDavid Howells
2981108b42b4SDavid HowellsThe SPARC Architecture Manual, Version 9
2982108b42b4SDavid Howells	Chapter 8: Memory Models
2983108b42b4SDavid Howells	Appendix D: Formal Specification of the Memory Models
2984108b42b4SDavid Howells	Appendix J: Programming with the Memory Models
2985108b42b4SDavid Howells
2986f1ab25a3SPaul E. McKenneyStorage in the PowerPC (Stone and Fitzgerald)
2987f1ab25a3SPaul E. McKenney
2988108b42b4SDavid HowellsUltraSPARC Programmer Reference Manual
2989108b42b4SDavid Howells	Chapter 5: Memory Accesses and Cacheability
2990108b42b4SDavid Howells	Chapter 15: Sparc-V9 Memory Models
2991108b42b4SDavid Howells
2992108b42b4SDavid HowellsUltraSPARC III Cu User's Manual
2993108b42b4SDavid Howells	Chapter 9: Memory Models
2994108b42b4SDavid Howells
2995108b42b4SDavid HowellsUltraSPARC IIIi Processor User's Manual
2996108b42b4SDavid Howells	Chapter 8: Memory Models
2997108b42b4SDavid Howells
2998108b42b4SDavid HowellsUltraSPARC Architecture 2005
2999108b42b4SDavid Howells	Chapter 9: Memory
3000108b42b4SDavid Howells	Appendix D: Formal Specifications of the Memory Models
3001108b42b4SDavid Howells
3002108b42b4SDavid HowellsUltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
3003108b42b4SDavid Howells	Chapter 8: Memory Models
3004108b42b4SDavid Howells	Appendix F: Caches and Cache Coherency
3005108b42b4SDavid Howells
3006108b42b4SDavid HowellsSolaris Internals, Core Kernel Architecture, p63-68:
3007108b42b4SDavid Howells	Chapter 3.3: Hardware Considerations for Locks and
3008108b42b4SDavid Howells			Synchronization
3009108b42b4SDavid Howells
3010108b42b4SDavid HowellsUnix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
3011108b42b4SDavid Howellsfor Kernel Programmers:
3012108b42b4SDavid Howells	Chapter 13: Other Memory Models
3013108b42b4SDavid Howells
3014108b42b4SDavid HowellsIntel Itanium Architecture Software Developer's Manual: Volume 1:
3015108b42b4SDavid Howells	Section 2.6: Speculation
3016108b42b4SDavid Howells	Section 4.4: Memory Access
3017