1108b42b4SDavid Howells ============================ 2108b42b4SDavid Howells LINUX KERNEL MEMORY BARRIERS 3108b42b4SDavid Howells ============================ 4108b42b4SDavid Howells 5108b42b4SDavid HowellsBy: David Howells <dhowells@redhat.com> 6714b6904SPaul E. McKenney Paul E. McKenney <paulmck@linux.ibm.com> 7e7720af5SPeter Zijlstra Will Deacon <will.deacon@arm.com> 8e7720af5SPeter Zijlstra Peter Zijlstra <peterz@infradead.org> 9108b42b4SDavid Howells 10e7720af5SPeter Zijlstra========== 11e7720af5SPeter ZijlstraDISCLAIMER 12e7720af5SPeter Zijlstra========== 13e7720af5SPeter Zijlstra 14e7720af5SPeter ZijlstraThis document is not a specification; it is intentionally (for the sake of 15e7720af5SPeter Zijlstrabrevity) and unintentionally (due to being human) incomplete. This document is 16e7720af5SPeter Zijlstrameant as a guide to using the various memory barriers provided by Linux, but 17621df431SAndrea Parriin case of any doubt (and there are many) please ask. Some doubts may be 18621df431SAndrea Parriresolved by referring to the formal memory consistency model and related 19621df431SAndrea Parridocumentation at tools/memory-model/. Nevertheless, even this memory 20621df431SAndrea Parrimodel should be viewed as the collective opinion of its maintainers rather 21621df431SAndrea Parrithan as an infallible oracle. 22e7720af5SPeter Zijlstra 23e7720af5SPeter ZijlstraTo repeat, this document is not a specification of what Linux expects from 24e7720af5SPeter Zijlstrahardware. 25e7720af5SPeter Zijlstra 268d4840e8SDavid HowellsThe purpose of this document is twofold: 278d4840e8SDavid Howells 288d4840e8SDavid Howells (1) to specify the minimum functionality that one can rely on for any 298d4840e8SDavid Howells particular barrier, and 308d4840e8SDavid Howells 318d4840e8SDavid Howells (2) to provide a guide as to how to use the barriers that are available. 328d4840e8SDavid Howells 338d4840e8SDavid HowellsNote that an architecture can provide more than the minimum requirement 3435bdc72aSStan Drozdfor any particular barrier, but if the architecture provides less than 358d4840e8SDavid Howellsthat, that architecture is incorrect. 368d4840e8SDavid Howells 378d4840e8SDavid HowellsNote also that it is possible that a barrier may be a no-op for an 388d4840e8SDavid Howellsarchitecture because the way that arch works renders an explicit barrier 398d4840e8SDavid Howellsunnecessary in that case. 408d4840e8SDavid Howells 418d4840e8SDavid Howells 42e7720af5SPeter Zijlstra======== 43e7720af5SPeter ZijlstraCONTENTS 44e7720af5SPeter Zijlstra======== 45108b42b4SDavid Howells 46108b42b4SDavid Howells (*) Abstract memory access model. 47108b42b4SDavid Howells 48108b42b4SDavid Howells - Device operations. 49108b42b4SDavid Howells - Guarantees. 50108b42b4SDavid Howells 51108b42b4SDavid Howells (*) What are memory barriers? 52108b42b4SDavid Howells 53108b42b4SDavid Howells - Varieties of memory barrier. 54108b42b4SDavid Howells - What may not be assumed about memory barriers? 55203185f6SAkira Yokosawa - Address-dependency barriers (historical). 56108b42b4SDavid Howells - Control dependencies. 57108b42b4SDavid Howells - SMP barrier pairing. 58108b42b4SDavid Howells - Examples of memory barrier sequences. 59670bd95eSDavid Howells - Read memory barriers vs load speculation. 60f1ab25a3SPaul E. McKenney - Multicopy atomicity. 61108b42b4SDavid Howells 62108b42b4SDavid Howells (*) Explicit kernel barriers. 63108b42b4SDavid Howells 64108b42b4SDavid Howells - Compiler barrier. 6581fc6323SJarek Poplawski - CPU memory barriers. 66108b42b4SDavid Howells 67108b42b4SDavid Howells (*) Implicit kernel memory barriers. 68108b42b4SDavid Howells 69166bda71SSeongJae Park - Lock acquisition functions. 70108b42b4SDavid Howells - Interrupt disabling functions. 7150fa610aSDavid Howells - Sleep and wake-up functions. 72108b42b4SDavid Howells - Miscellaneous functions. 73108b42b4SDavid Howells 74166bda71SSeongJae Park (*) Inter-CPU acquiring barrier effects. 75108b42b4SDavid Howells 76166bda71SSeongJae Park - Acquires vs memory accesses. 77108b42b4SDavid Howells 78108b42b4SDavid Howells (*) Where are memory barriers needed? 79108b42b4SDavid Howells 80108b42b4SDavid Howells - Interprocessor interaction. 81108b42b4SDavid Howells - Atomic operations. 82108b42b4SDavid Howells - Accessing devices. 83108b42b4SDavid Howells - Interrupts. 84108b42b4SDavid Howells 85108b42b4SDavid Howells (*) Kernel I/O barrier effects. 86108b42b4SDavid Howells 87108b42b4SDavid Howells (*) Assumed minimum execution ordering model. 88108b42b4SDavid Howells 89108b42b4SDavid Howells (*) The effects of the cpu cache. 90108b42b4SDavid Howells 91108b42b4SDavid Howells - Cache coherency vs DMA. 92108b42b4SDavid Howells - Cache coherency vs MMIO. 93108b42b4SDavid Howells 94108b42b4SDavid Howells (*) The things CPUs get up to. 95108b42b4SDavid Howells 96108b42b4SDavid Howells - And then there's the Alpha. 9701e1cd6dSSeongJae Park - Virtual Machine Guests. 98108b42b4SDavid Howells 9990fddabfSDavid Howells (*) Example uses. 10090fddabfSDavid Howells 10190fddabfSDavid Howells - Circular buffers. 10290fddabfSDavid Howells 103108b42b4SDavid Howells (*) References. 104108b42b4SDavid Howells 105108b42b4SDavid Howells 106108b42b4SDavid Howells============================ 107108b42b4SDavid HowellsABSTRACT MEMORY ACCESS MODEL 108108b42b4SDavid Howells============================ 109108b42b4SDavid Howells 110108b42b4SDavid HowellsConsider the following abstract model of the system: 111108b42b4SDavid Howells 112108b42b4SDavid Howells : : 113108b42b4SDavid Howells : : 114108b42b4SDavid Howells : : 115108b42b4SDavid Howells +-------+ : +--------+ : +-------+ 116108b42b4SDavid Howells | | : | | : | | 117108b42b4SDavid Howells | | : | | : | | 118108b42b4SDavid Howells | CPU 1 |<----->| Memory |<----->| CPU 2 | 119108b42b4SDavid Howells | | : | | : | | 120108b42b4SDavid Howells | | : | | : | | 121108b42b4SDavid Howells +-------+ : +--------+ : +-------+ 122108b42b4SDavid Howells ^ : ^ : ^ 123108b42b4SDavid Howells | : | : | 124108b42b4SDavid Howells | : | : | 125108b42b4SDavid Howells | : v : | 126108b42b4SDavid Howells | : +--------+ : | 127108b42b4SDavid Howells | : | | : | 128108b42b4SDavid Howells | : | | : | 129108b42b4SDavid Howells +---------->| Device |<----------+ 130108b42b4SDavid Howells : | | : 131108b42b4SDavid Howells : | | : 132108b42b4SDavid Howells : +--------+ : 133108b42b4SDavid Howells : : 134108b42b4SDavid Howells 135108b42b4SDavid HowellsEach CPU executes a program that generates memory access operations. In the 136108b42b4SDavid Howellsabstract CPU, memory operation ordering is very relaxed, and a CPU may actually 137108b42b4SDavid Howellsperform the memory operations in any order it likes, provided program causality 138108b42b4SDavid Howellsappears to be maintained. Similarly, the compiler may also arrange the 139108b42b4SDavid Howellsinstructions it emits in any order it likes, provided it doesn't affect the 140108b42b4SDavid Howellsapparent operation of the program. 141108b42b4SDavid Howells 142108b42b4SDavid HowellsSo in the above diagram, the effects of the memory operations performed by a 143108b42b4SDavid HowellsCPU are perceived by the rest of the system as the operations cross the 144108b42b4SDavid Howellsinterface between the CPU and rest of the system (the dotted lines). 145108b42b4SDavid Howells 146108b42b4SDavid Howells 147108b42b4SDavid HowellsFor example, consider the following sequence of events: 148108b42b4SDavid Howells 149108b42b4SDavid Howells CPU 1 CPU 2 150108b42b4SDavid Howells =============== =============== 151108b42b4SDavid Howells { A == 1; B == 2 } 152615cc2c9SAlexey Dobriyan A = 3; x = B; 153615cc2c9SAlexey Dobriyan B = 4; y = A; 154108b42b4SDavid Howells 155108b42b4SDavid HowellsThe set of accesses as seen by the memory system in the middle can be arranged 156108b42b4SDavid Howellsin 24 different combinations: 157108b42b4SDavid Howells 1588ab8b3e1SPranith Kumar STORE A=3, STORE B=4, y=LOAD A->3, x=LOAD B->4 1598ab8b3e1SPranith Kumar STORE A=3, STORE B=4, x=LOAD B->4, y=LOAD A->3 1608ab8b3e1SPranith Kumar STORE A=3, y=LOAD A->3, STORE B=4, x=LOAD B->4 1618ab8b3e1SPranith Kumar STORE A=3, y=LOAD A->3, x=LOAD B->2, STORE B=4 1628ab8b3e1SPranith Kumar STORE A=3, x=LOAD B->2, STORE B=4, y=LOAD A->3 1638ab8b3e1SPranith Kumar STORE A=3, x=LOAD B->2, y=LOAD A->3, STORE B=4 1648ab8b3e1SPranith Kumar STORE B=4, STORE A=3, y=LOAD A->3, x=LOAD B->4 165108b42b4SDavid Howells STORE B=4, ... 166108b42b4SDavid Howells ... 167108b42b4SDavid Howells 168108b42b4SDavid Howellsand can thus result in four different combinations of values: 169108b42b4SDavid Howells 1708ab8b3e1SPranith Kumar x == 2, y == 1 1718ab8b3e1SPranith Kumar x == 2, y == 3 1728ab8b3e1SPranith Kumar x == 4, y == 1 1738ab8b3e1SPranith Kumar x == 4, y == 3 174108b42b4SDavid Howells 175108b42b4SDavid Howells 176108b42b4SDavid HowellsFurthermore, the stores committed by a CPU to the memory system may not be 177108b42b4SDavid Howellsperceived by the loads made by another CPU in the same order as the stores were 178108b42b4SDavid Howellscommitted. 179108b42b4SDavid Howells 180108b42b4SDavid Howells 181108b42b4SDavid HowellsAs a further example, consider this sequence of events: 182108b42b4SDavid Howells 183108b42b4SDavid Howells CPU 1 CPU 2 184108b42b4SDavid Howells =============== =============== 1853dbf0913SSeongJae Park { A == 1, B == 2, C == 3, P == &A, Q == &C } 186108b42b4SDavid Howells B = 4; Q = P; 1878149b5cbSSeongJae Park P = &B; D = *Q; 188108b42b4SDavid Howells 189f556082dSAkira YokosawaThere is an obvious address dependency here, as the value loaded into D depends 190f556082dSAkira Yokosawaon the address retrieved from P by CPU 2. At the end of the sequence, any of 191f556082dSAkira Yokosawathe following results are possible: 192108b42b4SDavid Howells 193108b42b4SDavid Howells (Q == &A) and (D == 1) 194108b42b4SDavid Howells (Q == &B) and (D == 2) 195108b42b4SDavid Howells (Q == &B) and (D == 4) 196108b42b4SDavid Howells 197108b42b4SDavid HowellsNote that CPU 2 will never try and load C into D because the CPU will load P 198108b42b4SDavid Howellsinto Q before issuing the load of *Q. 199108b42b4SDavid Howells 200108b42b4SDavid Howells 201108b42b4SDavid HowellsDEVICE OPERATIONS 202108b42b4SDavid Howells----------------- 203108b42b4SDavid Howells 204108b42b4SDavid HowellsSome devices present their control interfaces as collections of memory 205108b42b4SDavid Howellslocations, but the order in which the control registers are accessed is very 206108b42b4SDavid Howellsimportant. For instance, imagine an ethernet card with a set of internal 207108b42b4SDavid Howellsregisters that are accessed through an address port register (A) and a data 208108b42b4SDavid Howellsport register (D). To read internal register 5, the following code might then 209108b42b4SDavid Howellsbe used: 210108b42b4SDavid Howells 211108b42b4SDavid Howells *A = 5; 212108b42b4SDavid Howells x = *D; 213108b42b4SDavid Howells 214108b42b4SDavid Howellsbut this might show up as either of the following two sequences: 215108b42b4SDavid Howells 216108b42b4SDavid Howells STORE *A = 5, x = LOAD *D 217108b42b4SDavid Howells x = LOAD *D, STORE *A = 5 218108b42b4SDavid Howells 219108b42b4SDavid Howellsthe second of which will almost certainly result in a malfunction, since it set 220108b42b4SDavid Howellsthe address _after_ attempting to read the register. 221108b42b4SDavid Howells 222108b42b4SDavid Howells 223108b42b4SDavid HowellsGUARANTEES 224108b42b4SDavid Howells---------- 225108b42b4SDavid Howells 226108b42b4SDavid HowellsThere are some minimal guarantees that may be expected of a CPU: 227108b42b4SDavid Howells 228108b42b4SDavid Howells (*) On any given CPU, dependent memory accesses will be issued in order, with 229108b42b4SDavid Howells respect to itself. This means that for: 230108b42b4SDavid Howells 23140555946SPaul E. McKenney Q = READ_ONCE(P); D = READ_ONCE(*Q); 232108b42b4SDavid Howells 233108b42b4SDavid Howells the CPU will issue the following memory operations: 234108b42b4SDavid Howells 235108b42b4SDavid Howells Q = LOAD P, D = LOAD *Q 236108b42b4SDavid Howells 23740555946SPaul E. McKenney and always in that order. However, on DEC Alpha, READ_ONCE() also 23840555946SPaul E. McKenney emits a memory-barrier instruction, so that a DEC Alpha CPU will 23940555946SPaul E. McKenney instead issue the following memory operations: 24040555946SPaul E. McKenney 24140555946SPaul E. McKenney Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER 24240555946SPaul E. McKenney 24340555946SPaul E. McKenney Whether on DEC Alpha or not, the READ_ONCE() also prevents compiler 24440555946SPaul E. McKenney mischief. 245108b42b4SDavid Howells 246108b42b4SDavid Howells (*) Overlapping loads and stores within a particular CPU will appear to be 247108b42b4SDavid Howells ordered within that CPU. This means that for: 248108b42b4SDavid Howells 2499af194ceSPaul E. McKenney a = READ_ONCE(*X); WRITE_ONCE(*X, b); 250108b42b4SDavid Howells 251108b42b4SDavid Howells the CPU will only issue the following sequence of memory operations: 252108b42b4SDavid Howells 253108b42b4SDavid Howells a = LOAD *X, STORE *X = b 254108b42b4SDavid Howells 255108b42b4SDavid Howells And for: 256108b42b4SDavid Howells 2579af194ceSPaul E. McKenney WRITE_ONCE(*X, c); d = READ_ONCE(*X); 258108b42b4SDavid Howells 259108b42b4SDavid Howells the CPU will only issue: 260108b42b4SDavid Howells 261108b42b4SDavid Howells STORE *X = c, d = LOAD *X 262108b42b4SDavid Howells 263fa00e7e1SMatt LaPlante (Loads and stores overlap if they are targeted at overlapping pieces of 264108b42b4SDavid Howells memory). 265108b42b4SDavid Howells 266108b42b4SDavid HowellsAnd there are a number of things that _must_ or _must_not_ be assumed: 267108b42b4SDavid Howells 2689af194ceSPaul E. McKenney (*) It _must_not_ be assumed that the compiler will do what you want 2699af194ceSPaul E. McKenney with memory references that are not protected by READ_ONCE() and 2709af194ceSPaul E. McKenney WRITE_ONCE(). Without them, the compiler is within its rights to 2719af194ceSPaul E. McKenney do all sorts of "creative" transformations, which are covered in 272895f5542SPaul E. McKenney the COMPILER BARRIER section. 2732ecf8101SPaul E. McKenney 274108b42b4SDavid Howells (*) It _must_not_ be assumed that independent loads and stores will be issued 275108b42b4SDavid Howells in the order given. This means that for: 276108b42b4SDavid Howells 277108b42b4SDavid Howells X = *A; Y = *B; *D = Z; 278108b42b4SDavid Howells 279108b42b4SDavid Howells we may get any of the following sequences: 280108b42b4SDavid Howells 281108b42b4SDavid Howells X = LOAD *A, Y = LOAD *B, STORE *D = Z 282108b42b4SDavid Howells X = LOAD *A, STORE *D = Z, Y = LOAD *B 283108b42b4SDavid Howells Y = LOAD *B, X = LOAD *A, STORE *D = Z 284108b42b4SDavid Howells Y = LOAD *B, STORE *D = Z, X = LOAD *A 285108b42b4SDavid Howells STORE *D = Z, X = LOAD *A, Y = LOAD *B 286108b42b4SDavid Howells STORE *D = Z, Y = LOAD *B, X = LOAD *A 287108b42b4SDavid Howells 288108b42b4SDavid Howells (*) It _must_ be assumed that overlapping memory accesses may be merged or 289108b42b4SDavid Howells discarded. This means that for: 290108b42b4SDavid Howells 291108b42b4SDavid Howells X = *A; Y = *(A + 4); 292108b42b4SDavid Howells 293108b42b4SDavid Howells we may get any one of the following sequences: 294108b42b4SDavid Howells 295108b42b4SDavid Howells X = LOAD *A; Y = LOAD *(A + 4); 296108b42b4SDavid Howells Y = LOAD *(A + 4); X = LOAD *A; 297108b42b4SDavid Howells {X, Y} = LOAD {*A, *(A + 4) }; 298108b42b4SDavid Howells 299108b42b4SDavid Howells And for: 300108b42b4SDavid Howells 301f191eec5SPaul E. McKenney *A = X; *(A + 4) = Y; 302108b42b4SDavid Howells 303f191eec5SPaul E. McKenney we may get any of: 304108b42b4SDavid Howells 305f191eec5SPaul E. McKenney STORE *A = X; STORE *(A + 4) = Y; 306f191eec5SPaul E. McKenney STORE *(A + 4) = Y; STORE *A = X; 307f191eec5SPaul E. McKenney STORE {*A, *(A + 4) } = {X, Y}; 308108b42b4SDavid Howells 309432fbf3cSPaul E. McKenneyAnd there are anti-guarantees: 310432fbf3cSPaul E. McKenney 311432fbf3cSPaul E. McKenney (*) These guarantees do not apply to bitfields, because compilers often 312432fbf3cSPaul E. McKenney generate code to modify these using non-atomic read-modify-write 313432fbf3cSPaul E. McKenney sequences. Do not attempt to use bitfields to synchronize parallel 314432fbf3cSPaul E. McKenney algorithms. 315432fbf3cSPaul E. McKenney 316432fbf3cSPaul E. McKenney (*) Even in cases where bitfields are protected by locks, all fields 317432fbf3cSPaul E. McKenney in a given bitfield must be protected by one lock. If two fields 318432fbf3cSPaul E. McKenney in a given bitfield are protected by different locks, the compiler's 319432fbf3cSPaul E. McKenney non-atomic read-modify-write sequences can cause an update to one 320432fbf3cSPaul E. McKenney field to corrupt the value of an adjacent field. 321432fbf3cSPaul E. McKenney 322432fbf3cSPaul E. McKenney (*) These guarantees apply only to properly aligned and sized scalar 323432fbf3cSPaul E. McKenney variables. "Properly sized" currently means variables that are 324432fbf3cSPaul E. McKenney the same size as "char", "short", "int" and "long". "Properly 325432fbf3cSPaul E. McKenney aligned" means the natural alignment, thus no constraints for 326432fbf3cSPaul E. McKenney "char", two-byte alignment for "short", four-byte alignment for 327432fbf3cSPaul E. McKenney "int", and either four-byte or eight-byte alignment for "long", 328432fbf3cSPaul E. McKenney on 32-bit and 64-bit systems, respectively. Note that these 329432fbf3cSPaul E. McKenney guarantees were introduced into the C11 standard, so beware when 330432fbf3cSPaul E. McKenney using older pre-C11 compilers (for example, gcc 4.6). The portion 331432fbf3cSPaul E. McKenney of the standard containing this guarantee is Section 3.14, which 332432fbf3cSPaul E. McKenney defines "memory location" as follows: 333432fbf3cSPaul E. McKenney 334432fbf3cSPaul E. McKenney memory location 335432fbf3cSPaul E. McKenney either an object of scalar type, or a maximal sequence 336432fbf3cSPaul E. McKenney of adjacent bit-fields all having nonzero width 337432fbf3cSPaul E. McKenney 338432fbf3cSPaul E. McKenney NOTE 1: Two threads of execution can update and access 339432fbf3cSPaul E. McKenney separate memory locations without interfering with 340432fbf3cSPaul E. McKenney each other. 341432fbf3cSPaul E. McKenney 342432fbf3cSPaul E. McKenney NOTE 2: A bit-field and an adjacent non-bit-field member 343432fbf3cSPaul E. McKenney are in separate memory locations. The same applies 344432fbf3cSPaul E. McKenney to two bit-fields, if one is declared inside a nested 345432fbf3cSPaul E. McKenney structure declaration and the other is not, or if the two 346432fbf3cSPaul E. McKenney are separated by a zero-length bit-field declaration, 347432fbf3cSPaul E. McKenney or if they are separated by a non-bit-field member 348432fbf3cSPaul E. McKenney declaration. It is not safe to concurrently update two 349432fbf3cSPaul E. McKenney bit-fields in the same structure if all members declared 350432fbf3cSPaul E. McKenney between them are also bit-fields, no matter what the 351432fbf3cSPaul E. McKenney sizes of those intervening bit-fields happen to be. 352432fbf3cSPaul E. McKenney 353108b42b4SDavid Howells 354108b42b4SDavid Howells========================= 355108b42b4SDavid HowellsWHAT ARE MEMORY BARRIERS? 356108b42b4SDavid Howells========================= 357108b42b4SDavid Howells 358108b42b4SDavid HowellsAs can be seen above, independent memory operations are effectively performed 359108b42b4SDavid Howellsin random order, but this can be a problem for CPU-CPU interaction and for I/O. 360108b42b4SDavid HowellsWhat is required is some way of intervening to instruct the compiler and the 361108b42b4SDavid HowellsCPU to restrict the order. 362108b42b4SDavid Howells 363108b42b4SDavid HowellsMemory barriers are such interventions. They impose a perceived partial 3642b94895bSDavid Howellsordering over the memory operations on either side of the barrier. 3652b94895bSDavid Howells 3662b94895bSDavid HowellsSuch enforcement is important because the CPUs and other devices in a system 36781fc6323SJarek Poplawskican use a variety of tricks to improve performance, including reordering, 3682b94895bSDavid Howellsdeferral and combination of memory operations; speculative loads; speculative 3692b94895bSDavid Howellsbranch prediction and various types of caching. Memory barriers are used to 3702b94895bSDavid Howellsoverride or suppress these tricks, allowing the code to sanely control the 3712b94895bSDavid Howellsinteraction of multiple CPUs and/or devices. 372108b42b4SDavid Howells 373108b42b4SDavid Howells 374108b42b4SDavid HowellsVARIETIES OF MEMORY BARRIER 375108b42b4SDavid Howells--------------------------- 376108b42b4SDavid Howells 377108b42b4SDavid HowellsMemory barriers come in four basic varieties: 378108b42b4SDavid Howells 379108b42b4SDavid Howells (1) Write (or store) memory barriers. 380108b42b4SDavid Howells 381108b42b4SDavid Howells A write memory barrier gives a guarantee that all the STORE operations 382108b42b4SDavid Howells specified before the barrier will appear to happen before all the STORE 383108b42b4SDavid Howells operations specified after the barrier with respect to the other 384108b42b4SDavid Howells components of the system. 385108b42b4SDavid Howells 386108b42b4SDavid Howells A write barrier is a partial ordering on stores only; it is not required 387108b42b4SDavid Howells to have any effect on loads. 388108b42b4SDavid Howells 3896bc39274SDavid Howells A CPU can be viewed as committing a sequence of store operations to the 3905692fcc6SGuilherme G. Piccoli memory system as time progresses. All stores _before_ a write barrier 3915692fcc6SGuilherme G. Piccoli will occur _before_ all the stores after the write barrier. 392108b42b4SDavid Howells 393203185f6SAkira Yokosawa [!] Note that write barriers should normally be paired with read or 394203185f6SAkira Yokosawa address-dependency barriers; see the "SMP barrier pairing" subsection. 395108b42b4SDavid Howells 396108b42b4SDavid Howells 397203185f6SAkira Yokosawa (2) Address-dependency barriers (historical). 398ad944630SPaul E. McKenney [!] This section is marked as HISTORICAL: it covers the long-obsolete 399ad944630SPaul E. McKenney smp_read_barrier_depends() macro, the semantics of which are now 400ad944630SPaul E. McKenney implicit in all marked accesses. For more up-to-date information, 401ad944630SPaul E. McKenney including how compiler transformations can sometimes break address 402ad944630SPaul E. McKenney dependencies, see Documentation/RCU/rcu_dereference.rst. 403108b42b4SDavid Howells 404f556082dSAkira Yokosawa An address-dependency barrier is a weaker form of read barrier. In the 405f556082dSAkira Yokosawa case where two loads are performed such that the second depends on the 406f556082dSAkira Yokosawa result of the first (eg: the first load retrieves the address to which 407f556082dSAkira Yokosawa the second load will be directed), an address-dependency barrier would 408f556082dSAkira Yokosawa be required to make sure that the target of the second load is updated 409f556082dSAkira Yokosawa after the address obtained by the first load is accessed. 410108b42b4SDavid Howells 411f556082dSAkira Yokosawa An address-dependency barrier is a partial ordering on interdependent 412f556082dSAkira Yokosawa loads only; it is not required to have any effect on stores, independent 413f556082dSAkira Yokosawa loads or overlapping loads. 414108b42b4SDavid Howells 415108b42b4SDavid Howells As mentioned in (1), the other CPUs in the system can be viewed as 416108b42b4SDavid Howells committing sequences of stores to the memory system that the CPU being 417f556082dSAkira Yokosawa considered can then perceive. An address-dependency barrier issued by 418f556082dSAkira Yokosawa the CPU under consideration guarantees that for any load preceding it, 419f556082dSAkira Yokosawa if that load touches one of a sequence of stores from another CPU, then 420f556082dSAkira Yokosawa by the time the barrier completes, the effects of all the stores prior to 421f556082dSAkira Yokosawa that touched by the load will be perceptible to any loads issued after 422f556082dSAkira Yokosawa the address-dependency barrier. 423108b42b4SDavid Howells 424108b42b4SDavid Howells See the "Examples of memory barrier sequences" subsection for diagrams 425108b42b4SDavid Howells showing the ordering constraints. 426108b42b4SDavid Howells 427203185f6SAkira Yokosawa [!] Note that the first load really has to have an _address_ dependency and 428108b42b4SDavid Howells not a control dependency. If the address for the second load is dependent 429108b42b4SDavid Howells on the first load, but the dependency is through a conditional rather than 430108b42b4SDavid Howells actually loading the address itself, then it's a _control_ dependency and 431108b42b4SDavid Howells a full read barrier or better is required. See the "Control dependencies" 432108b42b4SDavid Howells subsection for more information. 433108b42b4SDavid Howells 434203185f6SAkira Yokosawa [!] Note that address-dependency barriers should normally be paired with 435108b42b4SDavid Howells write barriers; see the "SMP barrier pairing" subsection. 436108b42b4SDavid Howells 437203185f6SAkira Yokosawa [!] Kernel release v5.9 removed kernel APIs for explicit address- 438203185f6SAkira Yokosawa dependency barriers. Nowadays, APIs for marking loads from shared 439203185f6SAkira Yokosawa variables such as READ_ONCE() and rcu_dereference() provide implicit 440203185f6SAkira Yokosawa address-dependency barriers. 441108b42b4SDavid Howells 442108b42b4SDavid Howells (3) Read (or load) memory barriers. 443108b42b4SDavid Howells 444f556082dSAkira Yokosawa A read barrier is an address-dependency barrier plus a guarantee that all 445f556082dSAkira Yokosawa the LOAD operations specified before the barrier will appear to happen 446f556082dSAkira Yokosawa before all the LOAD operations specified after the barrier with respect to 447f556082dSAkira Yokosawa the other components of the system. 448108b42b4SDavid Howells 449108b42b4SDavid Howells A read barrier is a partial ordering on loads only; it is not required to 450108b42b4SDavid Howells have any effect on stores. 451108b42b4SDavid Howells 452f556082dSAkira Yokosawa Read memory barriers imply address-dependency barriers, and so can 453f556082dSAkira Yokosawa substitute for them. 454108b42b4SDavid Howells 455108b42b4SDavid Howells [!] Note that read barriers should normally be paired with write barriers; 456108b42b4SDavid Howells see the "SMP barrier pairing" subsection. 457108b42b4SDavid Howells 458108b42b4SDavid Howells 459108b42b4SDavid Howells (4) General memory barriers. 460108b42b4SDavid Howells 461670bd95eSDavid Howells A general memory barrier gives a guarantee that all the LOAD and STORE 462670bd95eSDavid Howells operations specified before the barrier will appear to happen before all 463670bd95eSDavid Howells the LOAD and STORE operations specified after the barrier with respect to 464670bd95eSDavid Howells the other components of the system. 465670bd95eSDavid Howells 466670bd95eSDavid Howells A general memory barrier is a partial ordering over both loads and stores. 467108b42b4SDavid Howells 468108b42b4SDavid Howells General memory barriers imply both read and write memory barriers, and so 469108b42b4SDavid Howells can substitute for either. 470108b42b4SDavid Howells 471108b42b4SDavid Howells 472108b42b4SDavid HowellsAnd a couple of implicit varieties: 473108b42b4SDavid Howells 4742e4f5382SPeter Zijlstra (5) ACQUIRE operations. 475108b42b4SDavid Howells 476108b42b4SDavid Howells This acts as a one-way permeable barrier. It guarantees that all memory 4772e4f5382SPeter Zijlstra operations after the ACQUIRE operation will appear to happen after the 4782e4f5382SPeter Zijlstra ACQUIRE operation with respect to the other components of the system. 479787df638SDavidlohr Bueso ACQUIRE operations include LOCK operations and both smp_load_acquire() 4802f359c7eSAndrea Parri and smp_cond_load_acquire() operations. 481108b42b4SDavid Howells 4822e4f5382SPeter Zijlstra Memory operations that occur before an ACQUIRE operation may appear to 4832e4f5382SPeter Zijlstra happen after it completes. 484108b42b4SDavid Howells 4852e4f5382SPeter Zijlstra An ACQUIRE operation should almost always be paired with a RELEASE 4862e4f5382SPeter Zijlstra operation. 487108b42b4SDavid Howells 488108b42b4SDavid Howells 4892e4f5382SPeter Zijlstra (6) RELEASE operations. 490108b42b4SDavid Howells 491108b42b4SDavid Howells This also acts as a one-way permeable barrier. It guarantees that all 4922e4f5382SPeter Zijlstra memory operations before the RELEASE operation will appear to happen 4932e4f5382SPeter Zijlstra before the RELEASE operation with respect to the other components of the 4942e4f5382SPeter Zijlstra system. RELEASE operations include UNLOCK operations and 4952e4f5382SPeter Zijlstra smp_store_release() operations. 496108b42b4SDavid Howells 4972e4f5382SPeter Zijlstra Memory operations that occur after a RELEASE operation may appear to 498108b42b4SDavid Howells happen before it completes. 499108b42b4SDavid Howells 5002e4f5382SPeter Zijlstra The use of ACQUIRE and RELEASE operations generally precludes the need 501a897b13dSSeongJae Park for other sorts of memory barrier. In addition, a RELEASE+ACQUIRE pair is 502a897b13dSSeongJae Park -not- guaranteed to act as a full memory barrier. However, after an 503a897b13dSSeongJae Park ACQUIRE on a given variable, all memory accesses preceding any prior 5042e4f5382SPeter Zijlstra RELEASE on that same variable are guaranteed to be visible. In other 5052e4f5382SPeter Zijlstra words, within a given variable's critical section, all accesses of all 5062e4f5382SPeter Zijlstra previous critical sections for that variable are guaranteed to have 5072e4f5382SPeter Zijlstra completed. 50817eb88e0SPaul E. McKenney 5092e4f5382SPeter Zijlstra This means that ACQUIRE acts as a minimal "acquire" operation and 5102e4f5382SPeter Zijlstra RELEASE acts as a minimal "release" operation. 511108b42b4SDavid Howells 512706eeb3eSPeter ZijlstraA subset of the atomic operations described in atomic_t.txt have ACQUIRE and 513706eeb3eSPeter ZijlstraRELEASE variants in addition to fully-ordered and relaxed (no barrier 514706eeb3eSPeter Zijlstrasemantics) definitions. For compound atomics performing both a load and a 515706eeb3eSPeter Zijlstrastore, ACQUIRE semantics apply only to the load and RELEASE semantics apply 516706eeb3eSPeter Zijlstraonly to the store portion of the operation. 517108b42b4SDavid Howells 518108b42b4SDavid HowellsMemory barriers are only required where there's a possibility of interaction 519108b42b4SDavid Howellsbetween two CPUs or between a CPU and a device. If it can be guaranteed that 520108b42b4SDavid Howellsthere won't be any such interaction in any particular piece of code, then 521108b42b4SDavid Howellsmemory barriers are unnecessary in that piece of code. 522108b42b4SDavid Howells 523108b42b4SDavid Howells 524108b42b4SDavid HowellsNote that these are the _minimum_ guarantees. Different architectures may give 525108b42b4SDavid Howellsmore substantial guarantees, but they may _not_ be relied upon outside of arch 526108b42b4SDavid Howellsspecific code. 527108b42b4SDavid Howells 528108b42b4SDavid Howells 529108b42b4SDavid HowellsWHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS? 530108b42b4SDavid Howells---------------------------------------------- 531108b42b4SDavid Howells 532108b42b4SDavid HowellsThere are certain things that the Linux kernel memory barriers do not guarantee: 533108b42b4SDavid Howells 534108b42b4SDavid Howells (*) There is no guarantee that any of the memory accesses specified before a 535108b42b4SDavid Howells memory barrier will be _complete_ by the completion of a memory barrier 536108b42b4SDavid Howells instruction; the barrier can be considered to draw a line in that CPU's 537108b42b4SDavid Howells access queue that accesses of the appropriate type may not cross. 538108b42b4SDavid Howells 539108b42b4SDavid Howells (*) There is no guarantee that issuing a memory barrier on one CPU will have 540108b42b4SDavid Howells any direct effect on another CPU or any other hardware in the system. The 541108b42b4SDavid Howells indirect effect will be the order in which the second CPU sees the effects 542108b42b4SDavid Howells of the first CPU's accesses occur, but see the next point: 543108b42b4SDavid Howells 5446bc39274SDavid Howells (*) There is no guarantee that a CPU will see the correct order of effects 545108b42b4SDavid Howells from a second CPU's accesses, even _if_ the second CPU uses a memory 546108b42b4SDavid Howells barrier, unless the first CPU _also_ uses a matching memory barrier (see 547108b42b4SDavid Howells the subsection on "SMP Barrier Pairing"). 548108b42b4SDavid Howells 549108b42b4SDavid Howells (*) There is no guarantee that some intervening piece of off-the-CPU 550108b42b4SDavid Howells hardware[*] will not reorder the memory accesses. CPU cache coherency 551108b42b4SDavid Howells mechanisms should propagate the indirect effects of a memory barrier 552108b42b4SDavid Howells between CPUs, but might not do so in order. 553108b42b4SDavid Howells 554108b42b4SDavid Howells [*] For information on bus mastering DMA and coherency please read: 555108b42b4SDavid Howells 556bff9e34cSMauro Carvalho Chehab Documentation/driver-api/pci/pci.rst 557537f3a7cSSeongJae Park Documentation/core-api/dma-api-howto.rst 558537f3a7cSSeongJae Park Documentation/core-api/dma-api.rst 559108b42b4SDavid Howells 560108b42b4SDavid Howells 561203185f6SAkira YokosawaADDRESS-DEPENDENCY BARRIERS (HISTORICAL) 562203185f6SAkira Yokosawa---------------------------------------- 563ad944630SPaul E. McKenney[!] This section is marked as HISTORICAL: it covers the long-obsolete 564ad944630SPaul E. McKenneysmp_read_barrier_depends() macro, the semantics of which are now implicit 565ad944630SPaul E. McKenneyin all marked accesses. For more up-to-date information, including 566ad944630SPaul E. McKenneyhow compiler transformations can sometimes break address dependencies, 567ad944630SPaul E. McKenneysee Documentation/RCU/rcu_dereference.rst. 568f28f0868SPaul E. McKenney 5698ca924aeSWill DeaconAs of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for 5708ca924aeSWill DeaconDEC Alpha, which means that about the only people who need to pay attention 5718ca924aeSWill Deaconto this section are those working on DEC Alpha architecture-specific code 5728ca924aeSWill Deaconand those working on READ_ONCE() itself. For those who need it, and for 5738ca924aeSWill Deaconthose who are interested in the history, here is the story of 574203185f6SAkira Yokosawaaddress-dependency barriers. 575108b42b4SDavid Howells 576203185f6SAkira Yokosawa[!] While address dependencies are observed in both load-to-load and 577203185f6SAkira Yokosawaload-to-store relations, address-dependency barriers are not necessary 578203185f6SAkira Yokosawafor load-to-store situations. 579203185f6SAkira Yokosawa 580203185f6SAkira YokosawaThe requirement of address-dependency barriers is a little subtle, and 581108b42b4SDavid Howellsit's not always obvious that they're needed. To illustrate, consider the 582108b42b4SDavid Howellsfollowing sequence of events: 583108b42b4SDavid Howells 584108b42b4SDavid Howells CPU 1 CPU 2 585108b42b4SDavid Howells =============== =============== 5863dbf0913SSeongJae Park { A == 1, B == 2, C == 3, P == &A, Q == &C } 587108b42b4SDavid Howells B = 4; 588108b42b4SDavid Howells <write barrier> 5898149b5cbSSeongJae Park WRITE_ONCE(P, &B); 590203185f6SAkira Yokosawa Q = READ_ONCE_OLD(P); 591108b42b4SDavid Howells D = *Q; 592108b42b4SDavid Howells 593203185f6SAkira Yokosawa[!] READ_ONCE_OLD() corresponds to READ_ONCE() of pre-4.15 kernel, which 594203185f6SAkira Yokosawadoesn't imply an address-dependency barrier. 595203185f6SAkira Yokosawa 596f556082dSAkira YokosawaThere's a clear address dependency here, and it would seem that by the end of 597f556082dSAkira Yokosawathe sequence, Q must be either &A or &B, and that: 598108b42b4SDavid Howells 599108b42b4SDavid Howells (Q == &A) implies (D == 1) 600108b42b4SDavid Howells (Q == &B) implies (D == 4) 601108b42b4SDavid Howells 602108b42b4SDavid HowellsBut! CPU 2's perception of P may be updated _before_ its perception of B, thus 603108b42b4SDavid Howellsleading to the following situation: 604108b42b4SDavid Howells 605108b42b4SDavid Howells (Q == &B) and (D == 2) ???? 606108b42b4SDavid Howells 607806654a9SWill DeaconWhile this may seem like a failure of coherency or causality maintenance, it 608108b42b4SDavid Howellsisn't, and this behaviour can be observed on certain real CPUs (such as the DEC 609108b42b4SDavid HowellsAlpha). 610108b42b4SDavid Howells 611f556082dSAkira YokosawaTo deal with this, READ_ONCE() provides an implicit address-dependency barrier 612f556082dSAkira Yokosawasince kernel release v4.15: 613108b42b4SDavid Howells 614108b42b4SDavid Howells CPU 1 CPU 2 615108b42b4SDavid Howells =============== =============== 6163dbf0913SSeongJae Park { A == 1, B == 2, C == 3, P == &A, Q == &C } 617108b42b4SDavid Howells B = 4; 618108b42b4SDavid Howells <write barrier> 6199af194ceSPaul E. McKenney WRITE_ONCE(P, &B); 6209af194ceSPaul E. McKenney Q = READ_ONCE(P); 621203185f6SAkira Yokosawa <implicit address-dependency barrier> 622108b42b4SDavid Howells D = *Q; 623108b42b4SDavid Howells 624108b42b4SDavid HowellsThis enforces the occurrence of one of the two implications, and prevents the 625108b42b4SDavid Howellsthird possibility from arising. 626108b42b4SDavid Howells 62792a84dd2SPaul E. McKenney 628108b42b4SDavid Howells[!] Note that this extremely counterintuitive situation arises most easily on 629108b42b4SDavid Howellsmachines with split caches, so that, for example, one cache bank processes 630108b42b4SDavid Howellseven-numbered cache lines and the other bank processes odd-numbered cache 631108b42b4SDavid Howellslines. The pointer P might be stored in an odd-numbered cache line, and the 632108b42b4SDavid Howellsvariable B might be stored in an even-numbered cache line. Then, if the 633108b42b4SDavid Howellseven-numbered bank of the reading CPU's cache is extremely busy while the 634108b42b4SDavid Howellsodd-numbered bank is idle, one can see the new value of the pointer P (&B), 6356bc39274SDavid Howellsbut the old value of the variable B (2). 636108b42b4SDavid Howells 637108b42b4SDavid Howells 638203185f6SAkira YokosawaAn address-dependency barrier is not required to order dependent writes 639f556082dSAkira Yokosawabecause the CPUs that the Linux kernel supports don't do writes until they 640f556082dSAkira Yokosawaare certain (1) that the write will actually happen, (2) of the location of 641f556082dSAkira Yokosawathe write, and (3) of the value to be written. 64266ce3a4dSPaul E. McKenneyBut please carefully read the "CONTROL DEPENDENCIES" section and the 643f556082dSAkira YokosawaDocumentation/RCU/rcu_dereference.rst file: The compiler can and does break 644f556082dSAkira Yokosawadependencies in a great many highly creative ways. 64566ce3a4dSPaul E. McKenney 64666ce3a4dSPaul E. McKenney CPU 1 CPU 2 64766ce3a4dSPaul E. McKenney =============== =============== 64866ce3a4dSPaul E. McKenney { A == 1, B == 2, C = 3, P == &A, Q == &C } 64966ce3a4dSPaul E. McKenney B = 4; 65066ce3a4dSPaul E. McKenney <write barrier> 65166ce3a4dSPaul E. McKenney WRITE_ONCE(P, &B); 652203185f6SAkira Yokosawa Q = READ_ONCE_OLD(P); 65366ce3a4dSPaul E. McKenney WRITE_ONCE(*Q, 5); 65466ce3a4dSPaul E. McKenney 655203185f6SAkira YokosawaTherefore, no address-dependency barrier is required to order the read into 65666ce3a4dSPaul E. McKenneyQ with the store into *Q. In other words, this outcome is prohibited, 657203185f6SAkira Yokosawaeven without an implicit address-dependency barrier of modern READ_ONCE(): 65866ce3a4dSPaul E. McKenney 65966ce3a4dSPaul E. McKenney (Q == &B) && (B == 4) 66066ce3a4dSPaul E. McKenney 66166ce3a4dSPaul E. McKenneyPlease note that this pattern should be rare. After all, the whole point 66266ce3a4dSPaul E. McKenneyof dependency ordering is to -prevent- writes to the data structure, along 66366ce3a4dSPaul E. McKenneywith the expensive cache misses associated with those writes. This pattern 66466ce3a4dSPaul E. McKenneycan be used to record rare error conditions and the like, and the CPUs' 66566ce3a4dSPaul E. McKenneynaturally occurring ordering prevents such records from being lost. 66666ce3a4dSPaul E. McKenney 66766ce3a4dSPaul E. McKenney 668203185f6SAkira YokosawaNote well that the ordering provided by an address dependency is local to 669f1ab25a3SPaul E. McKenneythe CPU containing it. See the section on "Multicopy atomicity" for 670f1ab25a3SPaul E. McKenneymore information. 671f1ab25a3SPaul E. McKenney 672f1ab25a3SPaul E. McKenney 673203185f6SAkira YokosawaThe address-dependency barrier is very important to the RCU system, 6742ecf8101SPaul E. McKenneyfor example. See rcu_assign_pointer() and rcu_dereference() in 6752ecf8101SPaul E. McKenneyinclude/linux/rcupdate.h. This permits the current target of an RCU'd 6762ecf8101SPaul E. McKenneypointer to be replaced with a new modified target, without the replacement 6772ecf8101SPaul E. McKenneytarget appearing to be incompletely initialised. 678108b42b4SDavid Howells 679108b42b4SDavid Howells 680108b42b4SDavid HowellsCONTROL DEPENDENCIES 681108b42b4SDavid Howells-------------------- 682108b42b4SDavid Howells 683c8241f85SPaul E. McKenneyControl dependencies can be a bit tricky because current compilers do 684c8241f85SPaul E. McKenneynot understand them. The purpose of this section is to help you prevent 685c8241f85SPaul E. McKenneythe compiler's ignorance from breaking your code. 686c8241f85SPaul E. McKenney 687ff382810SPaul E. McKenneyA load-load control dependency requires a full read memory barrier, not 688f556082dSAkira Yokosawasimply an (implicit) address-dependency barrier to make it work correctly. 689f556082dSAkira YokosawaConsider the following bit of code: 690108b42b4SDavid Howells 6919af194ceSPaul E. McKenney q = READ_ONCE(a); 692203185f6SAkira Yokosawa <implicit address-dependency barrier> 69318c03c61SPeter Zijlstra if (q) { 694203185f6SAkira Yokosawa /* BUG: No address dependency!!! */ 6959af194ceSPaul E. McKenney p = READ_ONCE(b); 69645c8a36aSPaul E. McKenney } 697108b42b4SDavid Howells 698203185f6SAkira YokosawaThis will not have the desired effect because there is no actual address 6992ecf8101SPaul E. McKenneydependency, but rather a control dependency that the CPU may short-circuit 7002ecf8101SPaul E. McKenneyby attempting to predict the outcome in advance, so that other CPUs see 701f556082dSAkira Yokosawathe load from b as having happened before the load from a. In such a case 702f556082dSAkira Yokosawawhat's actually required is: 703108b42b4SDavid Howells 7049af194ceSPaul E. McKenney q = READ_ONCE(a); 70518c03c61SPeter Zijlstra if (q) { 706108b42b4SDavid Howells <read barrier> 7079af194ceSPaul E. McKenney p = READ_ONCE(b); 70845c8a36aSPaul E. McKenney } 70918c03c61SPeter Zijlstra 71018c03c61SPeter ZijlstraHowever, stores are not speculated. This means that ordering -is- provided 711ff382810SPaul E. McKenneyfor load-store control dependencies, as in the following example: 71218c03c61SPeter Zijlstra 713105ff3cbSLinus Torvalds q = READ_ONCE(a); 7142456d2a6SPaul E. McKenney if (q) { 715c8241f85SPaul E. McKenney WRITE_ONCE(b, 1); 71618c03c61SPeter Zijlstra } 71718c03c61SPeter Zijlstra 718c8241f85SPaul E. McKenneyControl dependencies pair normally with other types of barriers. 719c8241f85SPaul E. McKenneyThat said, please note that neither READ_ONCE() nor WRITE_ONCE() 720c8241f85SPaul E. McKenneyare optional! Without the READ_ONCE(), the compiler might combine the 721c8241f85SPaul E. McKenneyload from 'a' with other loads from 'a'. Without the WRITE_ONCE(), 722c8241f85SPaul E. McKenneythe compiler might combine the store to 'b' with other stores to 'b'. 723c8241f85SPaul E. McKenneyEither can result in highly counterintuitive effects on ordering. 72418c03c61SPeter Zijlstra 72518c03c61SPeter ZijlstraWorse yet, if the compiler is able to prove (say) that the value of 72618c03c61SPeter Zijlstravariable 'a' is always non-zero, it would be well within its rights 72718c03c61SPeter Zijlstrato optimize the original example by eliminating the "if" statement 72818c03c61SPeter Zijlstraas follows: 72918c03c61SPeter Zijlstra 73018c03c61SPeter Zijlstra q = a; 731c8241f85SPaul E. McKenney b = 1; /* BUG: Compiler and CPU can both reorder!!! */ 73218c03c61SPeter Zijlstra 733105ff3cbSLinus TorvaldsSo don't leave out the READ_ONCE(). 7342456d2a6SPaul E. McKenney 7352456d2a6SPaul E. McKenneyIt is tempting to try to enforce ordering on identical stores on both 7362456d2a6SPaul E. McKenneybranches of the "if" statement as follows: 73718c03c61SPeter Zijlstra 738105ff3cbSLinus Torvalds q = READ_ONCE(a); 73918c03c61SPeter Zijlstra if (q) { 7409b2b3bf5SPaul E. McKenney barrier(); 741c8241f85SPaul E. McKenney WRITE_ONCE(b, 1); 74218c03c61SPeter Zijlstra do_something(); 74318c03c61SPeter Zijlstra } else { 7449b2b3bf5SPaul E. McKenney barrier(); 745c8241f85SPaul E. McKenney WRITE_ONCE(b, 1); 74618c03c61SPeter Zijlstra do_something_else(); 74718c03c61SPeter Zijlstra } 74818c03c61SPeter Zijlstra 7492456d2a6SPaul E. McKenneyUnfortunately, current compilers will transform this as follows at high 7502456d2a6SPaul E. McKenneyoptimization levels: 75118c03c61SPeter Zijlstra 752105ff3cbSLinus Torvalds q = READ_ONCE(a); 7532456d2a6SPaul E. McKenney barrier(); 754c8241f85SPaul E. McKenney WRITE_ONCE(b, 1); /* BUG: No ordering vs. load from a!!! */ 75518c03c61SPeter Zijlstra if (q) { 756c8241f85SPaul E. McKenney /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */ 75718c03c61SPeter Zijlstra do_something(); 75818c03c61SPeter Zijlstra } else { 759c8241f85SPaul E. McKenney /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */ 76018c03c61SPeter Zijlstra do_something_else(); 76118c03c61SPeter Zijlstra } 76218c03c61SPeter Zijlstra 7632456d2a6SPaul E. McKenneyNow there is no conditional between the load from 'a' and the store to 7642456d2a6SPaul E. McKenney'b', which means that the CPU is within its rights to reorder them: 7652456d2a6SPaul E. McKenneyThe conditional is absolutely required, and must be present in the 7662456d2a6SPaul E. McKenneyassembly code even after all compiler optimizations have been applied. 7672456d2a6SPaul E. McKenneyTherefore, if you need ordering in this example, you need explicit 7682456d2a6SPaul E. McKenneymemory barriers, for example, smp_store_release(): 76918c03c61SPeter Zijlstra 7709af194ceSPaul E. McKenney q = READ_ONCE(a); 7712456d2a6SPaul E. McKenney if (q) { 772c8241f85SPaul E. McKenney smp_store_release(&b, 1); 77318c03c61SPeter Zijlstra do_something(); 77418c03c61SPeter Zijlstra } else { 775c8241f85SPaul E. McKenney smp_store_release(&b, 1); 77618c03c61SPeter Zijlstra do_something_else(); 77718c03c61SPeter Zijlstra } 77818c03c61SPeter Zijlstra 7792456d2a6SPaul E. McKenneyIn contrast, without explicit memory barriers, two-legged-if control 7802456d2a6SPaul E. McKenneyordering is guaranteed only when the stores differ, for example: 7812456d2a6SPaul E. McKenney 782105ff3cbSLinus Torvalds q = READ_ONCE(a); 7832456d2a6SPaul E. McKenney if (q) { 784c8241f85SPaul E. McKenney WRITE_ONCE(b, 1); 7852456d2a6SPaul E. McKenney do_something(); 7862456d2a6SPaul E. McKenney } else { 787c8241f85SPaul E. McKenney WRITE_ONCE(b, 2); 7882456d2a6SPaul E. McKenney do_something_else(); 7892456d2a6SPaul E. McKenney } 7902456d2a6SPaul E. McKenney 791105ff3cbSLinus TorvaldsThe initial READ_ONCE() is still required to prevent the compiler from 792105ff3cbSLinus Torvaldsproving the value of 'a'. 79318c03c61SPeter Zijlstra 79418c03c61SPeter ZijlstraIn addition, you need to be careful what you do with the local variable 'q', 79518c03c61SPeter Zijlstraotherwise the compiler might be able to guess the value and again remove 79618c03c61SPeter Zijlstrathe needed conditional. For example: 79718c03c61SPeter Zijlstra 798105ff3cbSLinus Torvalds q = READ_ONCE(a); 79918c03c61SPeter Zijlstra if (q % MAX) { 800c8241f85SPaul E. McKenney WRITE_ONCE(b, 1); 80118c03c61SPeter Zijlstra do_something(); 80218c03c61SPeter Zijlstra } else { 803c8241f85SPaul E. McKenney WRITE_ONCE(b, 2); 80418c03c61SPeter Zijlstra do_something_else(); 80518c03c61SPeter Zijlstra } 80618c03c61SPeter Zijlstra 80718c03c61SPeter ZijlstraIf MAX is defined to be 1, then the compiler knows that (q % MAX) is 80818c03c61SPeter Zijlstraequal to zero, in which case the compiler is within its rights to 80918c03c61SPeter Zijlstratransform the above code into the following: 81018c03c61SPeter Zijlstra 811105ff3cbSLinus Torvalds q = READ_ONCE(a); 812b26cfc48Spierre Kuo WRITE_ONCE(b, 2); 81318c03c61SPeter Zijlstra do_something_else(); 81418c03c61SPeter Zijlstra 8152456d2a6SPaul E. McKenneyGiven this transformation, the CPU is not required to respect the ordering 8162456d2a6SPaul E. McKenneybetween the load from variable 'a' and the store to variable 'b'. It is 8172456d2a6SPaul E. McKenneytempting to add a barrier(), but this does not help. The conditional 8182456d2a6SPaul E. McKenneyis gone, and the barrier won't bring it back. Therefore, if you are 8192456d2a6SPaul E. McKenneyrelying on this ordering, you should make sure that MAX is greater than 8202456d2a6SPaul E. McKenneyone, perhaps as follows: 82118c03c61SPeter Zijlstra 822105ff3cbSLinus Torvalds q = READ_ONCE(a); 82318c03c61SPeter Zijlstra BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */ 82418c03c61SPeter Zijlstra if (q % MAX) { 825c8241f85SPaul E. McKenney WRITE_ONCE(b, 1); 82618c03c61SPeter Zijlstra do_something(); 82718c03c61SPeter Zijlstra } else { 828c8241f85SPaul E. McKenney WRITE_ONCE(b, 2); 82918c03c61SPeter Zijlstra do_something_else(); 83018c03c61SPeter Zijlstra } 83118c03c61SPeter Zijlstra 8322456d2a6SPaul E. McKenneyPlease note once again that the stores to 'b' differ. If they were 8332456d2a6SPaul E. McKenneyidentical, as noted earlier, the compiler could pull this store outside 8342456d2a6SPaul E. McKenneyof the 'if' statement. 8352456d2a6SPaul E. McKenney 8368b19d1deSPaul E. McKenneyYou must also be careful not to rely too much on boolean short-circuit 8378b19d1deSPaul E. McKenneyevaluation. Consider this example: 8388b19d1deSPaul E. McKenney 839105ff3cbSLinus Torvalds q = READ_ONCE(a); 84057aecae9SPaul E. McKenney if (q || 1 > 0) 8419af194ceSPaul E. McKenney WRITE_ONCE(b, 1); 8428b19d1deSPaul E. McKenney 8435af4692aSPaul E. McKenneyBecause the first condition cannot fault and the second condition is 8445af4692aSPaul E. McKenneyalways true, the compiler can transform this example as following, 8455af4692aSPaul E. McKenneydefeating control dependency: 8468b19d1deSPaul E. McKenney 847105ff3cbSLinus Torvalds q = READ_ONCE(a); 8489af194ceSPaul E. McKenney WRITE_ONCE(b, 1); 8498b19d1deSPaul E. McKenney 8508b19d1deSPaul E. McKenneyThis example underscores the need to ensure that the compiler cannot 8519af194ceSPaul E. McKenneyout-guess your code. More generally, although READ_ONCE() does force 8528b19d1deSPaul E. McKenneythe compiler to actually emit code for a given load, it does not force 8538b19d1deSPaul E. McKenneythe compiler to use the results. 8548b19d1deSPaul E. McKenney 855ebff09a6SPaul E. McKenneyIn addition, control dependencies apply only to the then-clause and 856ebff09a6SPaul E. McKenneyelse-clause of the if-statement in question. In particular, it does 857ebff09a6SPaul E. McKenneynot necessarily apply to code following the if-statement: 858ebff09a6SPaul E. McKenney 859ebff09a6SPaul E. McKenney q = READ_ONCE(a); 860ebff09a6SPaul E. McKenney if (q) { 861c8241f85SPaul E. McKenney WRITE_ONCE(b, 1); 862ebff09a6SPaul E. McKenney } else { 863c8241f85SPaul E. McKenney WRITE_ONCE(b, 2); 864ebff09a6SPaul E. McKenney } 865c8241f85SPaul E. McKenney WRITE_ONCE(c, 1); /* BUG: No ordering against the read from 'a'. */ 866ebff09a6SPaul E. McKenney 867ebff09a6SPaul E. McKenneyIt is tempting to argue that there in fact is ordering because the 868ebff09a6SPaul E. McKenneycompiler cannot reorder volatile accesses and also cannot reorder 869c8241f85SPaul E. McKenneythe writes to 'b' with the condition. Unfortunately for this line 870c8241f85SPaul E. McKenneyof reasoning, the compiler might compile the two writes to 'b' as 871ebff09a6SPaul E. McKenneyconditional-move instructions, as in this fanciful pseudo-assembly 872ebff09a6SPaul E. McKenneylanguage: 873ebff09a6SPaul E. McKenney 874ebff09a6SPaul E. McKenney ld r1,a 875ebff09a6SPaul E. McKenney cmp r1,$0 876c8241f85SPaul E. McKenney cmov,ne r4,$1 877c8241f85SPaul E. McKenney cmov,eq r4,$2 878ebff09a6SPaul E. McKenney st r4,b 879ebff09a6SPaul E. McKenney st $1,c 880ebff09a6SPaul E. McKenney 881ebff09a6SPaul E. McKenneyA weakly ordered CPU would have no dependency of any sort between the load 882c8241f85SPaul E. McKenneyfrom 'a' and the store to 'c'. The control dependencies would extend 883ebff09a6SPaul E. McKenneyonly to the pair of cmov instructions and the store depending on them. 884ebff09a6SPaul E. McKenneyIn short, control dependencies apply only to the stores in the then-clause 885ebff09a6SPaul E. McKenneyand else-clause of the if-statement in question (including functions 886ebff09a6SPaul E. McKenneyinvoked by those two clauses), not to code following that if-statement. 887ebff09a6SPaul E. McKenney 88818c03c61SPeter Zijlstra 889f1ab25a3SPaul E. McKenneyNote well that the ordering provided by a control dependency is local 890f1ab25a3SPaul E. McKenneyto the CPU containing it. See the section on "Multicopy atomicity" 891f1ab25a3SPaul E. McKenneyfor more information. 89218c03c61SPeter Zijlstra 89318c03c61SPeter Zijlstra 89418c03c61SPeter ZijlstraIn summary: 89518c03c61SPeter Zijlstra 89618c03c61SPeter Zijlstra (*) Control dependencies can order prior loads against later stores. 89718c03c61SPeter Zijlstra However, they do -not- guarantee any other sort of ordering: 89818c03c61SPeter Zijlstra Not prior loads against later loads, nor prior stores against 89918c03c61SPeter Zijlstra later anything. If you need these other forms of ordering, 900d87510c5SDavidlohr Bueso use smp_rmb(), smp_wmb(), or, in the case of prior stores and 90118c03c61SPeter Zijlstra later loads, smp_mb(). 90218c03c61SPeter Zijlstra 9037817b799SPaul E. McKenney (*) If both legs of the "if" statement begin with identical stores to 9047817b799SPaul E. McKenney the same variable, then those stores must be ordered, either by 9057817b799SPaul E. McKenney preceding both of them with smp_mb() or by using smp_store_release() 9067817b799SPaul E. McKenney to carry out the stores. Please note that it is -not- sufficient 907a5052657SPaul E. McKenney to use barrier() at beginning of each leg of the "if" statement 908a5052657SPaul E. McKenney because, as shown by the example above, optimizing compilers can 909a5052657SPaul E. McKenney destroy the control dependency while respecting the letter of the 910a5052657SPaul E. McKenney barrier() law. 9119b2b3bf5SPaul E. McKenney 91218c03c61SPeter Zijlstra (*) Control dependencies require at least one run-time conditional 913586dd56aSPaul E. McKenney between the prior load and the subsequent store, and this 9149af194ceSPaul E. McKenney conditional must involve the prior load. If the compiler is able 9159af194ceSPaul E. McKenney to optimize the conditional away, it will have also optimized 916105ff3cbSLinus Torvalds away the ordering. Careful use of READ_ONCE() and WRITE_ONCE() 917105ff3cbSLinus Torvalds can help to preserve the needed conditional. 91818c03c61SPeter Zijlstra 91918c03c61SPeter Zijlstra (*) Control dependencies require that the compiler avoid reordering the 920105ff3cbSLinus Torvalds dependency into nonexistence. Careful use of READ_ONCE() or 921105ff3cbSLinus Torvalds atomic{,64}_read() can help to preserve your control dependency. 922895f5542SPaul E. McKenney Please see the COMPILER BARRIER section for more information. 92318c03c61SPeter Zijlstra 924ebff09a6SPaul E. McKenney (*) Control dependencies apply only to the then-clause and else-clause 925ebff09a6SPaul E. McKenney of the if-statement containing the control dependency, including 926ebff09a6SPaul E. McKenney any functions that these two clauses call. Control dependencies 927ebff09a6SPaul E. McKenney do -not- apply to code following the if-statement containing the 928ebff09a6SPaul E. McKenney control dependency. 929ebff09a6SPaul E. McKenney 930ff382810SPaul E. McKenney (*) Control dependencies pair normally with other types of barriers. 931ff382810SPaul E. McKenney 932f1ab25a3SPaul E. McKenney (*) Control dependencies do -not- provide multicopy atomicity. If you 933f1ab25a3SPaul E. McKenney need all the CPUs to see a given store at the same time, use smp_mb(). 934108b42b4SDavid Howells 935c8241f85SPaul E. McKenney (*) Compilers do not understand control dependencies. It is therefore 936c8241f85SPaul E. McKenney your job to ensure that they do not break your code. 937c8241f85SPaul E. McKenney 938108b42b4SDavid Howells 939108b42b4SDavid HowellsSMP BARRIER PAIRING 940108b42b4SDavid Howells------------------- 941108b42b4SDavid Howells 942108b42b4SDavid HowellsWhen dealing with CPU-CPU interactions, certain types of memory barrier should 943108b42b4SDavid Howellsalways be paired. A lack of appropriate pairing is almost certainly an error. 944108b42b4SDavid Howells 945ff382810SPaul E. McKenneyGeneral barriers pair with each other, though they also pair with most 946f1ab25a3SPaul E. McKenneyother types of barriers, albeit without multicopy atomicity. An acquire 947f1ab25a3SPaul E. McKenneybarrier pairs with a release barrier, but both may also pair with other 948f1ab25a3SPaul E. McKenneybarriers, including of course general barriers. A write barrier pairs 949203185f6SAkira Yokosawawith an address-dependency barrier, a control dependency, an acquire barrier, 950f1ab25a3SPaul E. McKenneya release barrier, a read barrier, or a general barrier. Similarly a 951203185f6SAkira Yokosawaread barrier, control dependency, or an address-dependency barrier pairs 952f1ab25a3SPaul E. McKenneywith a write barrier, an acquire barrier, a release barrier, or a 953f1ab25a3SPaul E. McKenneygeneral barrier: 954108b42b4SDavid Howells 955108b42b4SDavid Howells CPU 1 CPU 2 956108b42b4SDavid Howells =============== =============== 9579af194ceSPaul E. McKenney WRITE_ONCE(a, 1); 958108b42b4SDavid Howells <write barrier> 9599af194ceSPaul E. McKenney WRITE_ONCE(b, 2); x = READ_ONCE(b); 960108b42b4SDavid Howells <read barrier> 9619af194ceSPaul E. McKenney y = READ_ONCE(a); 962108b42b4SDavid Howells 963108b42b4SDavid HowellsOr: 964108b42b4SDavid Howells 965108b42b4SDavid Howells CPU 1 CPU 2 966108b42b4SDavid Howells =============== =============================== 967108b42b4SDavid Howells a = 1; 968108b42b4SDavid Howells <write barrier> 9699af194ceSPaul E. McKenney WRITE_ONCE(b, &a); x = READ_ONCE(b); 970203185f6SAkira Yokosawa <implicit address-dependency barrier> 971108b42b4SDavid Howells y = *x; 972108b42b4SDavid Howells 973ff382810SPaul E. McKenneyOr even: 974ff382810SPaul E. McKenney 975ff382810SPaul E. McKenney CPU 1 CPU 2 976ff382810SPaul E. McKenney =============== =============================== 9779af194ceSPaul E. McKenney r1 = READ_ONCE(y); 978ff382810SPaul E. McKenney <general barrier> 979d92f842bSScott Tsai WRITE_ONCE(x, 1); if (r2 = READ_ONCE(x)) { 980ff382810SPaul E. McKenney <implicit control dependency> 9819af194ceSPaul E. McKenney WRITE_ONCE(y, 1); 982ff382810SPaul E. McKenney } 983ff382810SPaul E. McKenney 984ff382810SPaul E. McKenney assert(r1 == 0 || r2 == 0); 985ff382810SPaul E. McKenney 986108b42b4SDavid HowellsBasically, the read barrier always has to be there, even though it can be of 987108b42b4SDavid Howellsthe "weaker" type. 988108b42b4SDavid Howells 989670bd95eSDavid Howells[!] Note that the stores before the write barrier would normally be expected to 990f556082dSAkira Yokosawamatch the loads after the read barrier or the address-dependency barrier, and 991f556082dSAkira Yokosawavice versa: 992670bd95eSDavid Howells 993670bd95eSDavid Howells CPU 1 CPU 2 9942ecf8101SPaul E. McKenney =================== =================== 9959af194ceSPaul E. McKenney WRITE_ONCE(a, 1); }---- --->{ v = READ_ONCE(c); 9969af194ceSPaul E. McKenney WRITE_ONCE(b, 2); } \ / { w = READ_ONCE(d); 997670bd95eSDavid Howells <write barrier> \ <read barrier> 9989af194ceSPaul E. McKenney WRITE_ONCE(c, 3); } / \ { x = READ_ONCE(a); 9999af194ceSPaul E. McKenney WRITE_ONCE(d, 4); }---- --->{ y = READ_ONCE(b); 1000670bd95eSDavid Howells 1001108b42b4SDavid Howells 1002108b42b4SDavid HowellsEXAMPLES OF MEMORY BARRIER SEQUENCES 1003108b42b4SDavid Howells------------------------------------ 1004108b42b4SDavid Howells 100581fc6323SJarek PoplawskiFirstly, write barriers act as partial orderings on store operations. 1006108b42b4SDavid HowellsConsider the following sequence of events: 1007108b42b4SDavid Howells 1008108b42b4SDavid Howells CPU 1 1009108b42b4SDavid Howells ======================= 1010108b42b4SDavid Howells STORE A = 1 1011108b42b4SDavid Howells STORE B = 2 1012108b42b4SDavid Howells STORE C = 3 1013108b42b4SDavid Howells <write barrier> 1014108b42b4SDavid Howells STORE D = 4 1015108b42b4SDavid Howells STORE E = 5 1016108b42b4SDavid Howells 1017108b42b4SDavid HowellsThis sequence of events is committed to the memory coherence system in an order 1018108b42b4SDavid Howellsthat the rest of the system might perceive as the unordered set of { STORE A, 101980f7228bSAdrian BunkSTORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E 1020108b42b4SDavid Howells}: 1021108b42b4SDavid Howells 1022108b42b4SDavid Howells +-------+ : : 1023108b42b4SDavid Howells | | +------+ 1024108b42b4SDavid Howells | |------>| C=3 | } /\ 102581fc6323SJarek Poplawski | | : +------+ }----- \ -----> Events perceptible to 102681fc6323SJarek Poplawski | | : | A=1 | } \/ the rest of the system 1027108b42b4SDavid Howells | | : +------+ } 1028108b42b4SDavid Howells | CPU 1 | : | B=2 | } 1029108b42b4SDavid Howells | | +------+ } 1030108b42b4SDavid Howells | | wwwwwwwwwwwwwwww } <--- At this point the write barrier 1031108b42b4SDavid Howells | | +------+ } requires all stores prior to the 1032108b42b4SDavid Howells | | : | E=5 | } barrier to be committed before 103381fc6323SJarek Poplawski | | : +------+ } further stores may take place 1034108b42b4SDavid Howells | |------>| D=4 | } 1035108b42b4SDavid Howells | | +------+ 1036108b42b4SDavid Howells +-------+ : : 1037108b42b4SDavid Howells | 1038670bd95eSDavid Howells | Sequence in which stores are committed to the 1039670bd95eSDavid Howells | memory system by CPU 1 1040108b42b4SDavid Howells V 1041108b42b4SDavid Howells 1042108b42b4SDavid Howells 1043f556082dSAkira YokosawaSecondly, address-dependency barriers act as partial orderings on address- 1044f556082dSAkira Yokosawadependent loads. Consider the following sequence of events: 1045108b42b4SDavid Howells 1046108b42b4SDavid Howells CPU 1 CPU 2 1047108b42b4SDavid Howells ======================= ======================= 1048c14038c3SDavid Howells { B = 7; X = 9; Y = 8; C = &Y } 1049108b42b4SDavid Howells STORE A = 1 1050108b42b4SDavid Howells STORE B = 2 1051108b42b4SDavid Howells <write barrier> 1052108b42b4SDavid Howells STORE C = &B LOAD X 1053108b42b4SDavid Howells STORE D = 4 LOAD C (gets &B) 1054108b42b4SDavid Howells LOAD *C (reads B) 1055108b42b4SDavid Howells 1056108b42b4SDavid HowellsWithout intervention, CPU 2 may perceive the events on CPU 1 in some 1057108b42b4SDavid Howellseffectively random order, despite the write barrier issued by CPU 1: 1058108b42b4SDavid Howells 1059108b42b4SDavid Howells +-------+ : : : : 1060108b42b4SDavid Howells | | +------+ +-------+ | Sequence of update 1061108b42b4SDavid Howells | |------>| B=2 |----- --->| Y->8 | | of perception on 1062108b42b4SDavid Howells | | : +------+ \ +-------+ | CPU 2 1063108b42b4SDavid Howells | CPU 1 | : | A=1 | \ --->| C->&Y | V 1064108b42b4SDavid Howells | | +------+ | +-------+ 1065108b42b4SDavid Howells | | wwwwwwwwwwwwwwww | : : 1066108b42b4SDavid Howells | | +------+ | : : 1067108b42b4SDavid Howells | | : | C=&B |--- | : : +-------+ 1068108b42b4SDavid Howells | | : +------+ \ | +-------+ | | 1069108b42b4SDavid Howells | |------>| D=4 | ----------->| C->&B |------>| | 1070108b42b4SDavid Howells | | +------+ | +-------+ | | 1071108b42b4SDavid Howells +-------+ : : | : : | | 1072108b42b4SDavid Howells | : : | | 1073108b42b4SDavid Howells | : : | CPU 2 | 1074108b42b4SDavid Howells | +-------+ | | 1075108b42b4SDavid Howells Apparently incorrect ---> | | B->7 |------>| | 1076108b42b4SDavid Howells perception of B (!) | +-------+ | | 1077108b42b4SDavid Howells | : : | | 1078108b42b4SDavid Howells | +-------+ | | 1079108b42b4SDavid Howells The load of X holds ---> \ | X->9 |------>| | 1080108b42b4SDavid Howells up the maintenance \ +-------+ | | 1081108b42b4SDavid Howells of coherence of B ----->| B->2 | +-------+ 1082108b42b4SDavid Howells +-------+ 1083108b42b4SDavid Howells : : 1084108b42b4SDavid Howells 1085108b42b4SDavid Howells 1086108b42b4SDavid HowellsIn the above example, CPU 2 perceives that B is 7, despite the load of *C 1087670e9f34SPaolo Ornati(which would be B) coming after the LOAD of C. 1088108b42b4SDavid Howells 1089f556082dSAkira YokosawaIf, however, an address-dependency barrier were to be placed between the load 1090f556082dSAkira Yokosawaof C and the load of *C (ie: B) on CPU 2: 1091c14038c3SDavid Howells 1092c14038c3SDavid Howells CPU 1 CPU 2 1093c14038c3SDavid Howells ======================= ======================= 1094c14038c3SDavid Howells { B = 7; X = 9; Y = 8; C = &Y } 1095c14038c3SDavid Howells STORE A = 1 1096c14038c3SDavid Howells STORE B = 2 1097c14038c3SDavid Howells <write barrier> 1098c14038c3SDavid Howells STORE C = &B LOAD X 1099c14038c3SDavid Howells STORE D = 4 LOAD C (gets &B) 1100203185f6SAkira Yokosawa <address-dependency barrier> 1101c14038c3SDavid Howells LOAD *C (reads B) 1102c14038c3SDavid Howells 1103c14038c3SDavid Howellsthen the following will occur: 1104108b42b4SDavid Howells 1105108b42b4SDavid Howells +-------+ : : : : 1106108b42b4SDavid Howells | | +------+ +-------+ 1107108b42b4SDavid Howells | |------>| B=2 |----- --->| Y->8 | 1108108b42b4SDavid Howells | | : +------+ \ +-------+ 1109108b42b4SDavid Howells | CPU 1 | : | A=1 | \ --->| C->&Y | 1110108b42b4SDavid Howells | | +------+ | +-------+ 1111108b42b4SDavid Howells | | wwwwwwwwwwwwwwww | : : 1112108b42b4SDavid Howells | | +------+ | : : 1113108b42b4SDavid Howells | | : | C=&B |--- | : : +-------+ 1114108b42b4SDavid Howells | | : +------+ \ | +-------+ | | 1115108b42b4SDavid Howells | |------>| D=4 | ----------->| C->&B |------>| | 1116108b42b4SDavid Howells | | +------+ | +-------+ | | 1117108b42b4SDavid Howells +-------+ : : | : : | | 1118108b42b4SDavid Howells | : : | | 1119108b42b4SDavid Howells | : : | CPU 2 | 1120108b42b4SDavid Howells | +-------+ | | 1121670bd95eSDavid Howells | | X->9 |------>| | 1122670bd95eSDavid Howells | +-------+ | | 1123203185f6SAkira Yokosawa Makes sure all effects ---> \ aaaaaaaaaaaaaaaaa | | 1124670bd95eSDavid Howells prior to the store of C \ +-------+ | | 1125670bd95eSDavid Howells are perceptible to ----->| B->2 |------>| | 1126670bd95eSDavid Howells subsequent loads +-------+ | | 1127108b42b4SDavid Howells : : +-------+ 1128108b42b4SDavid Howells 1129108b42b4SDavid Howells 1130108b42b4SDavid HowellsAnd thirdly, a read barrier acts as a partial order on loads. Consider the 1131108b42b4SDavid Howellsfollowing sequence of events: 1132108b42b4SDavid Howells 1133108b42b4SDavid Howells CPU 1 CPU 2 1134108b42b4SDavid Howells ======================= ======================= 1135670bd95eSDavid Howells { A = 0, B = 9 } 1136108b42b4SDavid Howells STORE A=1 1137108b42b4SDavid Howells <write barrier> 1138670bd95eSDavid Howells STORE B=2 1139108b42b4SDavid Howells LOAD B 1140670bd95eSDavid Howells LOAD A 1141108b42b4SDavid Howells 1142108b42b4SDavid HowellsWithout intervention, CPU 2 may then choose to perceive the events on CPU 1 in 1143108b42b4SDavid Howellssome effectively random order, despite the write barrier issued by CPU 1: 1144108b42b4SDavid Howells 1145670bd95eSDavid Howells +-------+ : : : : 1146670bd95eSDavid Howells | | +------+ +-------+ 1147670bd95eSDavid Howells | |------>| A=1 |------ --->| A->0 | 1148670bd95eSDavid Howells | | +------+ \ +-------+ 1149670bd95eSDavid Howells | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 1150670bd95eSDavid Howells | | +------+ | +-------+ 1151670bd95eSDavid Howells | |------>| B=2 |--- | : : 1152670bd95eSDavid Howells | | +------+ \ | : : +-------+ 1153670bd95eSDavid Howells +-------+ : : \ | +-------+ | | 1154670bd95eSDavid Howells ---------->| B->2 |------>| | 1155670bd95eSDavid Howells | +-------+ | CPU 2 | 1156670bd95eSDavid Howells | | A->0 |------>| | 1157670bd95eSDavid Howells | +-------+ | | 1158670bd95eSDavid Howells | : : +-------+ 1159670bd95eSDavid Howells \ : : 1160670bd95eSDavid Howells \ +-------+ 1161670bd95eSDavid Howells ---->| A->1 | 1162670bd95eSDavid Howells +-------+ 1163670bd95eSDavid Howells : : 1164108b42b4SDavid Howells 1165108b42b4SDavid Howells 11666bc39274SDavid HowellsIf, however, a read barrier were to be placed between the load of B and the 1167670bd95eSDavid Howellsload of A on CPU 2: 1168108b42b4SDavid Howells 1169670bd95eSDavid Howells CPU 1 CPU 2 1170670bd95eSDavid Howells ======================= ======================= 1171670bd95eSDavid Howells { A = 0, B = 9 } 1172670bd95eSDavid Howells STORE A=1 1173670bd95eSDavid Howells <write barrier> 1174670bd95eSDavid Howells STORE B=2 1175670bd95eSDavid Howells LOAD B 1176670bd95eSDavid Howells <read barrier> 1177670bd95eSDavid Howells LOAD A 1178670bd95eSDavid Howells 1179670bd95eSDavid Howellsthen the partial ordering imposed by CPU 1 will be perceived correctly by CPU 1180670bd95eSDavid Howells2: 1181670bd95eSDavid Howells 1182670bd95eSDavid Howells +-------+ : : : : 1183670bd95eSDavid Howells | | +------+ +-------+ 1184670bd95eSDavid Howells | |------>| A=1 |------ --->| A->0 | 1185670bd95eSDavid Howells | | +------+ \ +-------+ 1186670bd95eSDavid Howells | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 1187670bd95eSDavid Howells | | +------+ | +-------+ 1188670bd95eSDavid Howells | |------>| B=2 |--- | : : 1189670bd95eSDavid Howells | | +------+ \ | : : +-------+ 1190670bd95eSDavid Howells +-------+ : : \ | +-------+ | | 1191670bd95eSDavid Howells ---------->| B->2 |------>| | 1192670bd95eSDavid Howells | +-------+ | CPU 2 | 1193670bd95eSDavid Howells | : : | | 1194670bd95eSDavid Howells | : : | | 1195108b42b4SDavid Howells At this point the read ----> \ rrrrrrrrrrrrrrrrr | | 1196108b42b4SDavid Howells barrier causes all effects \ +-------+ | | 1197670bd95eSDavid Howells prior to the storage of B ---->| A->1 |------>| | 1198670bd95eSDavid Howells to be perceptible to CPU 2 +-------+ | | 1199670bd95eSDavid Howells : : +-------+ 1200670bd95eSDavid Howells 1201670bd95eSDavid Howells 1202670bd95eSDavid HowellsTo illustrate this more completely, consider what could happen if the code 1203670bd95eSDavid Howellscontained a load of A either side of the read barrier: 1204670bd95eSDavid Howells 1205670bd95eSDavid Howells CPU 1 CPU 2 1206670bd95eSDavid Howells ======================= ======================= 1207670bd95eSDavid Howells { A = 0, B = 9 } 1208670bd95eSDavid Howells STORE A=1 1209670bd95eSDavid Howells <write barrier> 1210670bd95eSDavid Howells STORE B=2 1211670bd95eSDavid Howells LOAD B 1212670bd95eSDavid Howells LOAD A [first load of A] 1213670bd95eSDavid Howells <read barrier> 1214670bd95eSDavid Howells LOAD A [second load of A] 1215670bd95eSDavid Howells 1216670bd95eSDavid HowellsEven though the two loads of A both occur after the load of B, they may both 1217670bd95eSDavid Howellscome up with different values: 1218670bd95eSDavid Howells 1219670bd95eSDavid Howells +-------+ : : : : 1220670bd95eSDavid Howells | | +------+ +-------+ 1221670bd95eSDavid Howells | |------>| A=1 |------ --->| A->0 | 1222670bd95eSDavid Howells | | +------+ \ +-------+ 1223670bd95eSDavid Howells | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 1224670bd95eSDavid Howells | | +------+ | +-------+ 1225670bd95eSDavid Howells | |------>| B=2 |--- | : : 1226670bd95eSDavid Howells | | +------+ \ | : : +-------+ 1227670bd95eSDavid Howells +-------+ : : \ | +-------+ | | 1228670bd95eSDavid Howells ---------->| B->2 |------>| | 1229670bd95eSDavid Howells | +-------+ | CPU 2 | 1230670bd95eSDavid Howells | : : | | 1231670bd95eSDavid Howells | : : | | 1232670bd95eSDavid Howells | +-------+ | | 1233670bd95eSDavid Howells | | A->0 |------>| 1st | 1234670bd95eSDavid Howells | +-------+ | | 1235670bd95eSDavid Howells At this point the read ----> \ rrrrrrrrrrrrrrrrr | | 1236670bd95eSDavid Howells barrier causes all effects \ +-------+ | | 1237670bd95eSDavid Howells prior to the storage of B ---->| A->1 |------>| 2nd | 1238670bd95eSDavid Howells to be perceptible to CPU 2 +-------+ | | 1239670bd95eSDavid Howells : : +-------+ 1240670bd95eSDavid Howells 1241670bd95eSDavid Howells 1242670bd95eSDavid HowellsBut it may be that the update to A from CPU 1 becomes perceptible to CPU 2 1243670bd95eSDavid Howellsbefore the read barrier completes anyway: 1244670bd95eSDavid Howells 1245670bd95eSDavid Howells +-------+ : : : : 1246670bd95eSDavid Howells | | +------+ +-------+ 1247670bd95eSDavid Howells | |------>| A=1 |------ --->| A->0 | 1248670bd95eSDavid Howells | | +------+ \ +-------+ 1249670bd95eSDavid Howells | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 1250670bd95eSDavid Howells | | +------+ | +-------+ 1251670bd95eSDavid Howells | |------>| B=2 |--- | : : 1252670bd95eSDavid Howells | | +------+ \ | : : +-------+ 1253670bd95eSDavid Howells +-------+ : : \ | +-------+ | | 1254670bd95eSDavid Howells ---------->| B->2 |------>| | 1255670bd95eSDavid Howells | +-------+ | CPU 2 | 1256670bd95eSDavid Howells | : : | | 1257670bd95eSDavid Howells \ : : | | 1258670bd95eSDavid Howells \ +-------+ | | 1259670bd95eSDavid Howells ---->| A->1 |------>| 1st | 1260670bd95eSDavid Howells +-------+ | | 1261670bd95eSDavid Howells rrrrrrrrrrrrrrrrr | | 1262670bd95eSDavid Howells +-------+ | | 1263670bd95eSDavid Howells | A->1 |------>| 2nd | 1264108b42b4SDavid Howells +-------+ | | 1265108b42b4SDavid Howells : : +-------+ 1266108b42b4SDavid Howells 1267108b42b4SDavid Howells 1268670bd95eSDavid HowellsThe guarantee is that the second load will always come up with A == 1 if the 1269670bd95eSDavid Howellsload of B came up with B == 2. No such guarantee exists for the first load of 1270670bd95eSDavid HowellsA; that may come up with either A == 0 or A == 1. 1271670bd95eSDavid Howells 1272670bd95eSDavid Howells 1273670bd95eSDavid HowellsREAD MEMORY BARRIERS VS LOAD SPECULATION 1274670bd95eSDavid Howells---------------------------------------- 1275670bd95eSDavid Howells 1276670bd95eSDavid HowellsMany CPUs speculate with loads: that is they see that they will need to load an 1277670bd95eSDavid Howellsitem from memory, and they find a time where they're not using the bus for any 1278670bd95eSDavid Howellsother loads, and so do the load in advance - even though they haven't actually 1279670bd95eSDavid Howellsgot to that point in the instruction execution flow yet. This permits the 1280670bd95eSDavid Howellsactual load instruction to potentially complete immediately because the CPU 1281670bd95eSDavid Howellsalready has the value to hand. 1282670bd95eSDavid Howells 1283670bd95eSDavid HowellsIt may turn out that the CPU didn't actually need the value - perhaps because a 1284670bd95eSDavid Howellsbranch circumvented the load - in which case it can discard the value or just 1285670bd95eSDavid Howellscache it for later use. 1286670bd95eSDavid Howells 1287670bd95eSDavid HowellsConsider: 1288670bd95eSDavid Howells 1289670bd95eSDavid Howells CPU 1 CPU 2 1290670bd95eSDavid Howells ======================= ======================= 1291670bd95eSDavid Howells LOAD B 1292670bd95eSDavid Howells DIVIDE } Divide instructions generally 1293670bd95eSDavid Howells DIVIDE } take a long time to perform 1294670bd95eSDavid Howells LOAD A 1295670bd95eSDavid Howells 1296670bd95eSDavid HowellsWhich might appear as this: 1297670bd95eSDavid Howells 1298670bd95eSDavid Howells : : +-------+ 1299670bd95eSDavid Howells +-------+ | | 1300670bd95eSDavid Howells --->| B->2 |------>| | 1301670bd95eSDavid Howells +-------+ | CPU 2 | 1302670bd95eSDavid Howells : :DIVIDE | | 1303670bd95eSDavid Howells +-------+ | | 1304670bd95eSDavid Howells The CPU being busy doing a ---> --->| A->0 |~~~~ | | 1305670bd95eSDavid Howells division speculates on the +-------+ ~ | | 1306670bd95eSDavid Howells LOAD of A : : ~ | | 1307670bd95eSDavid Howells : :DIVIDE | | 1308670bd95eSDavid Howells : : ~ | | 1309670bd95eSDavid Howells Once the divisions are complete --> : : ~-->| | 1310670bd95eSDavid Howells the CPU can then perform the : : | | 1311670bd95eSDavid Howells LOAD with immediate effect : : +-------+ 1312670bd95eSDavid Howells 1313670bd95eSDavid Howells 1314203185f6SAkira YokosawaPlacing a read barrier or an address-dependency barrier just before the second 1315670bd95eSDavid Howellsload: 1316670bd95eSDavid Howells 1317670bd95eSDavid Howells CPU 1 CPU 2 1318670bd95eSDavid Howells ======================= ======================= 1319670bd95eSDavid Howells LOAD B 1320670bd95eSDavid Howells DIVIDE 1321670bd95eSDavid Howells DIVIDE 1322670bd95eSDavid Howells <read barrier> 1323670bd95eSDavid Howells LOAD A 1324670bd95eSDavid Howells 1325670bd95eSDavid Howellswill force any value speculatively obtained to be reconsidered to an extent 1326670bd95eSDavid Howellsdependent on the type of barrier used. If there was no change made to the 1327670bd95eSDavid Howellsspeculated memory location, then the speculated value will just be used: 1328670bd95eSDavid Howells 1329670bd95eSDavid Howells : : +-------+ 1330670bd95eSDavid Howells +-------+ | | 1331670bd95eSDavid Howells --->| B->2 |------>| | 1332670bd95eSDavid Howells +-------+ | CPU 2 | 1333670bd95eSDavid Howells : :DIVIDE | | 1334670bd95eSDavid Howells +-------+ | | 1335670bd95eSDavid Howells The CPU being busy doing a ---> --->| A->0 |~~~~ | | 1336670bd95eSDavid Howells division speculates on the +-------+ ~ | | 1337670bd95eSDavid Howells LOAD of A : : ~ | | 1338670bd95eSDavid Howells : :DIVIDE | | 1339670bd95eSDavid Howells : : ~ | | 1340670bd95eSDavid Howells : : ~ | | 1341670bd95eSDavid Howells rrrrrrrrrrrrrrrr~ | | 1342670bd95eSDavid Howells : : ~ | | 1343670bd95eSDavid Howells : : ~-->| | 1344670bd95eSDavid Howells : : | | 1345670bd95eSDavid Howells : : +-------+ 1346670bd95eSDavid Howells 1347670bd95eSDavid Howells 1348670bd95eSDavid Howellsbut if there was an update or an invalidation from another CPU pending, then 1349670bd95eSDavid Howellsthe speculation will be cancelled and the value reloaded: 1350670bd95eSDavid Howells 1351670bd95eSDavid Howells : : +-------+ 1352670bd95eSDavid Howells +-------+ | | 1353670bd95eSDavid Howells --->| B->2 |------>| | 1354670bd95eSDavid Howells +-------+ | CPU 2 | 1355670bd95eSDavid Howells : :DIVIDE | | 1356670bd95eSDavid Howells +-------+ | | 1357670bd95eSDavid Howells The CPU being busy doing a ---> --->| A->0 |~~~~ | | 1358670bd95eSDavid Howells division speculates on the +-------+ ~ | | 1359670bd95eSDavid Howells LOAD of A : : ~ | | 1360670bd95eSDavid Howells : :DIVIDE | | 1361670bd95eSDavid Howells : : ~ | | 1362670bd95eSDavid Howells : : ~ | | 1363670bd95eSDavid Howells rrrrrrrrrrrrrrrrr | | 1364670bd95eSDavid Howells +-------+ | | 1365670bd95eSDavid Howells The speculation is discarded ---> --->| A->1 |------>| | 1366670bd95eSDavid Howells and an updated value is +-------+ | | 1367670bd95eSDavid Howells retrieved : : +-------+ 1368670bd95eSDavid Howells 1369670bd95eSDavid Howells 1370f1ab25a3SPaul E. McKenneyMULTICOPY ATOMICITY 1371f1ab25a3SPaul E. McKenney-------------------- 1372241e6663SPaul E. McKenney 1373f1ab25a3SPaul E. McKenneyMulticopy atomicity is a deeply intuitive notion about ordering that is 1374f1ab25a3SPaul E. McKenneynot always provided by real computer systems, namely that a given store 13750902b1f4SAlan Sternbecomes visible at the same time to all CPUs, or, alternatively, that all 13760902b1f4SAlan SternCPUs agree on the order in which all stores become visible. However, 13770902b1f4SAlan Sternsupport of full multicopy atomicity would rule out valuable hardware 13780902b1f4SAlan Sternoptimizations, so a weaker form called ``other multicopy atomicity'' 13790902b1f4SAlan Sterninstead guarantees only that a given store becomes visible at the same 13800902b1f4SAlan Sterntime to all -other- CPUs. The remainder of this document discusses this 13810902b1f4SAlan Sternweaker form, but for brevity will call it simply ``multicopy atomicity''. 1382f1ab25a3SPaul E. McKenney 1383f1ab25a3SPaul E. McKenneyThe following example demonstrates multicopy atomicity: 1384241e6663SPaul E. McKenney 1385241e6663SPaul E. McKenney CPU 1 CPU 2 CPU 3 1386241e6663SPaul E. McKenney ======================= ======================= ======================= 1387241e6663SPaul E. McKenney { X = 0, Y = 0 } 1388f1ab25a3SPaul E. McKenney STORE X=1 r1=LOAD X (reads 1) LOAD Y (reads 1) 1389f1ab25a3SPaul E. McKenney <general barrier> <read barrier> 1390f1ab25a3SPaul E. McKenney STORE Y=r1 LOAD X 1391241e6663SPaul E. McKenney 13920902b1f4SAlan SternSuppose that CPU 2's load from X returns 1, which it then stores to Y, 13930902b1f4SAlan Sternand CPU 3's load from Y returns 1. This indicates that CPU 1's store 13940902b1f4SAlan Sternto X precedes CPU 2's load from X and that CPU 2's store to Y precedes 13950902b1f4SAlan SternCPU 3's load from Y. In addition, the memory barriers guarantee that 13960902b1f4SAlan SternCPU 2 executes its load before its store, and CPU 3 loads from Y before 13970902b1f4SAlan Sternit loads from X. The question is then "Can CPU 3's load from X return 0?" 1398241e6663SPaul E. McKenney 13990902b1f4SAlan SternBecause CPU 3's load from X in some sense comes after CPU 2's load, it 1400241e6663SPaul E. McKenneyis natural to expect that CPU 3's load from X must therefore return 1. 14010902b1f4SAlan SternThis expectation follows from multicopy atomicity: if a load executing 14020902b1f4SAlan Sternon CPU B follows a load from the same variable executing on CPU A (and 14030902b1f4SAlan SternCPU A did not originally store the value which it read), then on 14040902b1f4SAlan Sternmulticopy-atomic systems, CPU B's load must return either the same value 14050902b1f4SAlan Sternthat CPU A's load did or some later value. However, the Linux kernel 14060902b1f4SAlan Sterndoes not require systems to be multicopy atomic. 1407241e6663SPaul E. McKenney 14080902b1f4SAlan SternThe use of a general memory barrier in the example above compensates 14090902b1f4SAlan Sternfor any lack of multicopy atomicity. In the example, if CPU 2's load 14100902b1f4SAlan Sternfrom X returns 1 and CPU 3's load from Y returns 1, then CPU 3's load 14110902b1f4SAlan Sternfrom X must indeed also return 1. 1412241e6663SPaul E. McKenney 1413f1ab25a3SPaul E. McKenneyHowever, dependencies, read barriers, and write barriers are not always 1414f1ab25a3SPaul E. McKenneyable to compensate for non-multicopy atomicity. For example, suppose 1415f1ab25a3SPaul E. McKenneythat CPU 2's general barrier is removed from the above example, leaving 1416f1ab25a3SPaul E. McKenneyonly the data dependency shown below: 1417241e6663SPaul E. McKenney 1418241e6663SPaul E. McKenney CPU 1 CPU 2 CPU 3 1419241e6663SPaul E. McKenney ======================= ======================= ======================= 1420241e6663SPaul E. McKenney { X = 0, Y = 0 } 1421f1ab25a3SPaul E. McKenney STORE X=1 r1=LOAD X (reads 1) LOAD Y (reads 1) 1422f1ab25a3SPaul E. McKenney <data dependency> <read barrier> 1423f1ab25a3SPaul E. McKenney STORE Y=r1 LOAD X (reads 0) 1424241e6663SPaul E. McKenney 1425f1ab25a3SPaul E. McKenneyThis substitution allows non-multicopy atomicity to run rampant: in 1426f1ab25a3SPaul E. McKenneythis example, it is perfectly legal for CPU 2's load from X to return 1, 1427f1ab25a3SPaul E. McKenneyCPU 3's load from Y to return 1, and its load from X to return 0. 1428241e6663SPaul E. McKenney 1429f1ab25a3SPaul E. McKenneyThe key point is that although CPU 2's data dependency orders its load 14300902b1f4SAlan Sternand store, it does not guarantee to order CPU 1's store. Thus, if this 14310902b1f4SAlan Sternexample runs on a non-multicopy-atomic system where CPUs 1 and 2 share a 14320902b1f4SAlan Sternstore buffer or a level of cache, CPU 2 might have early access to CPU 1's 14330902b1f4SAlan Sternwrites. General barriers are therefore required to ensure that all CPUs 14340902b1f4SAlan Sternagree on the combined order of multiple accesses. 1435241e6663SPaul E. McKenney 1436f1ab25a3SPaul E. McKenneyGeneral barriers can compensate not only for non-multicopy atomicity, 1437f1ab25a3SPaul E. McKenneybut can also generate additional ordering that can ensure that -all- 1438f1ab25a3SPaul E. McKenneyCPUs will perceive the same order of -all- operations. In contrast, a 1439f1ab25a3SPaul E. McKenneychain of release-acquire pairs do not provide this additional ordering, 1440f1ab25a3SPaul E. McKenneywhich means that only those CPUs on the chain are guaranteed to agree 1441f1ab25a3SPaul E. McKenneyon the combined order of the accesses. For example, switching to C code 1442f1ab25a3SPaul E. McKenneyin deference to the ghost of Herman Hollerith: 1443c535cc92SPaul E. McKenney 1444c535cc92SPaul E. McKenney int u, v, x, y, z; 1445c535cc92SPaul E. McKenney 1446c535cc92SPaul E. McKenney void cpu0(void) 1447c535cc92SPaul E. McKenney { 1448c535cc92SPaul E. McKenney r0 = smp_load_acquire(&x); 1449c535cc92SPaul E. McKenney WRITE_ONCE(u, 1); 1450c535cc92SPaul E. McKenney smp_store_release(&y, 1); 1451c535cc92SPaul E. McKenney } 1452c535cc92SPaul E. McKenney 1453c535cc92SPaul E. McKenney void cpu1(void) 1454c535cc92SPaul E. McKenney { 1455c535cc92SPaul E. McKenney r1 = smp_load_acquire(&y); 1456c535cc92SPaul E. McKenney r4 = READ_ONCE(v); 1457c535cc92SPaul E. McKenney r5 = READ_ONCE(u); 1458c535cc92SPaul E. McKenney smp_store_release(&z, 1); 1459c535cc92SPaul E. McKenney } 1460c535cc92SPaul E. McKenney 1461c535cc92SPaul E. McKenney void cpu2(void) 1462c535cc92SPaul E. McKenney { 1463c535cc92SPaul E. McKenney r2 = smp_load_acquire(&z); 1464c535cc92SPaul E. McKenney smp_store_release(&x, 1); 1465c535cc92SPaul E. McKenney } 1466c535cc92SPaul E. McKenney 1467c535cc92SPaul E. McKenney void cpu3(void) 1468c535cc92SPaul E. McKenney { 1469c535cc92SPaul E. McKenney WRITE_ONCE(v, 1); 1470c535cc92SPaul E. McKenney smp_mb(); 1471c535cc92SPaul E. McKenney r3 = READ_ONCE(u); 1472c535cc92SPaul E. McKenney } 1473c535cc92SPaul E. McKenney 1474f1ab25a3SPaul E. McKenneyBecause cpu0(), cpu1(), and cpu2() participate in a chain of 1475f1ab25a3SPaul E. McKenneysmp_store_release()/smp_load_acquire() pairs, the following outcome 1476f1ab25a3SPaul E. McKenneyis prohibited: 1477c535cc92SPaul E. McKenney 1478c535cc92SPaul E. McKenney r0 == 1 && r1 == 1 && r2 == 1 1479c535cc92SPaul E. McKenney 1480c535cc92SPaul E. McKenneyFurthermore, because of the release-acquire relationship between cpu0() 1481c535cc92SPaul E. McKenneyand cpu1(), cpu1() must see cpu0()'s writes, so that the following 1482c535cc92SPaul E. McKenneyoutcome is prohibited: 1483c535cc92SPaul E. McKenney 1484c535cc92SPaul E. McKenney r1 == 1 && r5 == 0 1485c535cc92SPaul E. McKenney 1486f1ab25a3SPaul E. McKenneyHowever, the ordering provided by a release-acquire chain is local 1487f1ab25a3SPaul E. McKenneyto the CPUs participating in that chain and does not apply to cpu3(), 1488f1ab25a3SPaul E. McKenneyat least aside from stores. Therefore, the following outcome is possible: 1489c535cc92SPaul E. McKenney 1490c535cc92SPaul E. McKenney r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 1491c535cc92SPaul E. McKenney 149237ef0341SPaul E. McKenneyAs an aside, the following outcome is also possible: 149337ef0341SPaul E. McKenney 149437ef0341SPaul E. McKenney r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1 149537ef0341SPaul E. McKenney 1496c535cc92SPaul E. McKenneyAlthough cpu0(), cpu1(), and cpu2() will see their respective reads and 1497c535cc92SPaul E. McKenneywrites in order, CPUs not involved in the release-acquire chain might 1498c535cc92SPaul E. McKenneywell disagree on the order. This disagreement stems from the fact that 1499c535cc92SPaul E. McKenneythe weak memory-barrier instructions used to implement smp_load_acquire() 1500c535cc92SPaul E. McKenneyand smp_store_release() are not required to order prior stores against 1501c535cc92SPaul E. McKenneysubsequent loads in all cases. This means that cpu3() can see cpu0()'s 1502c535cc92SPaul E. McKenneystore to u as happening -after- cpu1()'s load from v, even though 1503c535cc92SPaul E. McKenneyboth cpu0() and cpu1() agree that these two operations occurred in the 1504c535cc92SPaul E. McKenneyintended order. 1505c535cc92SPaul E. McKenney 1506c535cc92SPaul E. McKenneyHowever, please keep in mind that smp_load_acquire() is not magic. 1507c535cc92SPaul E. McKenneyIn particular, it simply reads from its argument with ordering. It does 1508c535cc92SPaul E. McKenney-not- ensure that any particular value will be read. Therefore, the 1509c535cc92SPaul E. McKenneyfollowing outcome is possible: 1510c535cc92SPaul E. McKenney 1511c535cc92SPaul E. McKenney r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0 1512c535cc92SPaul E. McKenney 1513c535cc92SPaul E. McKenneyNote that this outcome can happen even on a mythical sequentially 1514c535cc92SPaul E. McKenneyconsistent system where nothing is ever reordered. 1515c535cc92SPaul E. McKenney 1516f1ab25a3SPaul E. McKenneyTo reiterate, if your code requires full ordering of all operations, 1517f1ab25a3SPaul E. McKenneyuse general barriers throughout. 1518241e6663SPaul E. McKenney 1519241e6663SPaul E. McKenney 1520108b42b4SDavid Howells======================== 1521108b42b4SDavid HowellsEXPLICIT KERNEL BARRIERS 1522108b42b4SDavid Howells======================== 1523108b42b4SDavid Howells 1524108b42b4SDavid HowellsThe Linux kernel has a variety of different barriers that act at different 1525108b42b4SDavid Howellslevels: 1526108b42b4SDavid Howells 1527108b42b4SDavid Howells (*) Compiler barrier. 1528108b42b4SDavid Howells 1529108b42b4SDavid Howells (*) CPU memory barriers. 1530108b42b4SDavid Howells 1531108b42b4SDavid Howells 1532108b42b4SDavid HowellsCOMPILER BARRIER 1533108b42b4SDavid Howells---------------- 1534108b42b4SDavid Howells 1535108b42b4SDavid HowellsThe Linux kernel has an explicit compiler barrier function that prevents the 1536108b42b4SDavid Howellscompiler from moving the memory accesses either side of it to the other side: 1537108b42b4SDavid Howells 1538108b42b4SDavid Howells barrier(); 1539108b42b4SDavid Howells 15409af194ceSPaul E. McKenneyThis is a general barrier -- there are no read-read or write-write 15419af194ceSPaul E. McKenneyvariants of barrier(). However, READ_ONCE() and WRITE_ONCE() can be 15429af194ceSPaul E. McKenneythought of as weak forms of barrier() that affect only the specific 15439af194ceSPaul E. McKenneyaccesses flagged by the READ_ONCE() or WRITE_ONCE(). 1544108b42b4SDavid Howells 1545692118daSPaul E. McKenneyThe barrier() function has the following effects: 1546692118daSPaul E. McKenney 1547692118daSPaul E. McKenney (*) Prevents the compiler from reordering accesses following the 1548692118daSPaul E. McKenney barrier() to precede any accesses preceding the barrier(). 1549692118daSPaul E. McKenney One example use for this property is to ease communication between 1550692118daSPaul E. McKenney interrupt-handler code and the code that was interrupted. 1551692118daSPaul E. McKenney 1552692118daSPaul E. McKenney (*) Within a loop, forces the compiler to load the variables used 1553692118daSPaul E. McKenney in that loop's conditional on each pass through that loop. 1554692118daSPaul E. McKenney 15559af194ceSPaul E. McKenneyThe READ_ONCE() and WRITE_ONCE() functions can prevent any number of 15569af194ceSPaul E. McKenneyoptimizations that, while perfectly safe in single-threaded code, can 15579af194ceSPaul E. McKenneybe fatal in concurrent code. Here are some examples of these sorts 15589af194ceSPaul E. McKenneyof optimizations: 1559692118daSPaul E. McKenney 1560449f7413SPaul E. McKenney (*) The compiler is within its rights to reorder loads and stores 1561449f7413SPaul E. McKenney to the same variable, and in some cases, the CPU is within its 1562449f7413SPaul E. McKenney rights to reorder loads to the same variable. This means that 1563449f7413SPaul E. McKenney the following code: 1564449f7413SPaul E. McKenney 1565449f7413SPaul E. McKenney a[0] = x; 1566449f7413SPaul E. McKenney a[1] = x; 1567449f7413SPaul E. McKenney 1568449f7413SPaul E. McKenney Might result in an older value of x stored in a[1] than in a[0]. 1569449f7413SPaul E. McKenney Prevent both the compiler and the CPU from doing this as follows: 1570449f7413SPaul E. McKenney 15719af194ceSPaul E. McKenney a[0] = READ_ONCE(x); 15729af194ceSPaul E. McKenney a[1] = READ_ONCE(x); 1573449f7413SPaul E. McKenney 15749af194ceSPaul E. McKenney In short, READ_ONCE() and WRITE_ONCE() provide cache coherence for 15759af194ceSPaul E. McKenney accesses from multiple CPUs to a single variable. 1576449f7413SPaul E. McKenney 1577692118daSPaul E. McKenney (*) The compiler is within its rights to merge successive loads from 1578692118daSPaul E. McKenney the same variable. Such merging can cause the compiler to "optimize" 1579692118daSPaul E. McKenney the following code: 1580692118daSPaul E. McKenney 1581692118daSPaul E. McKenney while (tmp = a) 1582692118daSPaul E. McKenney do_something_with(tmp); 1583692118daSPaul E. McKenney 1584692118daSPaul E. McKenney into the following code, which, although in some sense legitimate 1585692118daSPaul E. McKenney for single-threaded code, is almost certainly not what the developer 1586692118daSPaul E. McKenney intended: 1587692118daSPaul E. McKenney 1588692118daSPaul E. McKenney if (tmp = a) 1589692118daSPaul E. McKenney for (;;) 1590692118daSPaul E. McKenney do_something_with(tmp); 1591692118daSPaul E. McKenney 15929af194ceSPaul E. McKenney Use READ_ONCE() to prevent the compiler from doing this to you: 1593692118daSPaul E. McKenney 15949af194ceSPaul E. McKenney while (tmp = READ_ONCE(a)) 1595692118daSPaul E. McKenney do_something_with(tmp); 1596692118daSPaul E. McKenney 1597692118daSPaul E. McKenney (*) The compiler is within its rights to reload a variable, for example, 1598692118daSPaul E. McKenney in cases where high register pressure prevents the compiler from 1599692118daSPaul E. McKenney keeping all data of interest in registers. The compiler might 1600692118daSPaul E. McKenney therefore optimize the variable 'tmp' out of our previous example: 1601692118daSPaul E. McKenney 1602692118daSPaul E. McKenney while (tmp = a) 1603692118daSPaul E. McKenney do_something_with(tmp); 1604692118daSPaul E. McKenney 1605692118daSPaul E. McKenney This could result in the following code, which is perfectly safe in 1606692118daSPaul E. McKenney single-threaded code, but can be fatal in concurrent code: 1607692118daSPaul E. McKenney 1608692118daSPaul E. McKenney while (a) 1609692118daSPaul E. McKenney do_something_with(a); 1610692118daSPaul E. McKenney 1611692118daSPaul E. McKenney For example, the optimized version of this code could result in 1612692118daSPaul E. McKenney passing a zero to do_something_with() in the case where the variable 1613692118daSPaul E. McKenney a was modified by some other CPU between the "while" statement and 1614692118daSPaul E. McKenney the call to do_something_with(). 1615692118daSPaul E. McKenney 16169af194ceSPaul E. McKenney Again, use READ_ONCE() to prevent the compiler from doing this: 1617692118daSPaul E. McKenney 16189af194ceSPaul E. McKenney while (tmp = READ_ONCE(a)) 1619692118daSPaul E. McKenney do_something_with(tmp); 1620692118daSPaul E. McKenney 1621692118daSPaul E. McKenney Note that if the compiler runs short of registers, it might save 1622692118daSPaul E. McKenney tmp onto the stack. The overhead of this saving and later restoring 1623692118daSPaul E. McKenney is why compilers reload variables. Doing so is perfectly safe for 1624692118daSPaul E. McKenney single-threaded code, so you need to tell the compiler about cases 1625692118daSPaul E. McKenney where it is not safe. 1626692118daSPaul E. McKenney 1627692118daSPaul E. McKenney (*) The compiler is within its rights to omit a load entirely if it knows 1628692118daSPaul E. McKenney what the value will be. For example, if the compiler can prove that 1629692118daSPaul E. McKenney the value of variable 'a' is always zero, it can optimize this code: 1630692118daSPaul E. McKenney 1631692118daSPaul E. McKenney while (tmp = a) 1632692118daSPaul E. McKenney do_something_with(tmp); 1633692118daSPaul E. McKenney 1634692118daSPaul E. McKenney Into this: 1635692118daSPaul E. McKenney 1636692118daSPaul E. McKenney do { } while (0); 1637692118daSPaul E. McKenney 16389af194ceSPaul E. McKenney This transformation is a win for single-threaded code because it 16399af194ceSPaul E. McKenney gets rid of a load and a branch. The problem is that the compiler 16409af194ceSPaul E. McKenney will carry out its proof assuming that the current CPU is the only 16419af194ceSPaul E. McKenney one updating variable 'a'. If variable 'a' is shared, then the 16429af194ceSPaul E. McKenney compiler's proof will be erroneous. Use READ_ONCE() to tell the 16439af194ceSPaul E. McKenney compiler that it doesn't know as much as it thinks it does: 1644692118daSPaul E. McKenney 16459af194ceSPaul E. McKenney while (tmp = READ_ONCE(a)) 1646692118daSPaul E. McKenney do_something_with(tmp); 1647692118daSPaul E. McKenney 1648692118daSPaul E. McKenney But please note that the compiler is also closely watching what you 16499af194ceSPaul E. McKenney do with the value after the READ_ONCE(). For example, suppose you 1650692118daSPaul E. McKenney do the following and MAX is a preprocessor macro with the value 1: 1651692118daSPaul E. McKenney 16529af194ceSPaul E. McKenney while ((tmp = READ_ONCE(a)) % MAX) 1653692118daSPaul E. McKenney do_something_with(tmp); 1654692118daSPaul E. McKenney 1655692118daSPaul E. McKenney Then the compiler knows that the result of the "%" operator applied 1656692118daSPaul E. McKenney to MAX will always be zero, again allowing the compiler to optimize 1657692118daSPaul E. McKenney the code into near-nonexistence. (It will still load from the 1658692118daSPaul E. McKenney variable 'a'.) 1659692118daSPaul E. McKenney 1660692118daSPaul E. McKenney (*) Similarly, the compiler is within its rights to omit a store entirely 1661692118daSPaul E. McKenney if it knows that the variable already has the value being stored. 1662692118daSPaul E. McKenney Again, the compiler assumes that the current CPU is the only one 1663692118daSPaul E. McKenney storing into the variable, which can cause the compiler to do the 1664692118daSPaul E. McKenney wrong thing for shared variables. For example, suppose you have 1665692118daSPaul E. McKenney the following: 1666692118daSPaul E. McKenney 1667692118daSPaul E. McKenney a = 0; 166865f95ff2SSeongJae Park ... Code that does not store to variable a ... 1669692118daSPaul E. McKenney a = 0; 1670692118daSPaul E. McKenney 1671692118daSPaul E. McKenney The compiler sees that the value of variable 'a' is already zero, so 1672692118daSPaul E. McKenney it might well omit the second store. This would come as a fatal 1673692118daSPaul E. McKenney surprise if some other CPU might have stored to variable 'a' in the 1674692118daSPaul E. McKenney meantime. 1675692118daSPaul E. McKenney 16769af194ceSPaul E. McKenney Use WRITE_ONCE() to prevent the compiler from making this sort of 1677692118daSPaul E. McKenney wrong guess: 1678692118daSPaul E. McKenney 16799af194ceSPaul E. McKenney WRITE_ONCE(a, 0); 168065f95ff2SSeongJae Park ... Code that does not store to variable a ... 16819af194ceSPaul E. McKenney WRITE_ONCE(a, 0); 1682692118daSPaul E. McKenney 1683692118daSPaul E. McKenney (*) The compiler is within its rights to reorder memory accesses unless 1684692118daSPaul E. McKenney you tell it not to. For example, consider the following interaction 1685692118daSPaul E. McKenney between process-level code and an interrupt handler: 1686692118daSPaul E. McKenney 1687692118daSPaul E. McKenney void process_level(void) 1688692118daSPaul E. McKenney { 1689692118daSPaul E. McKenney msg = get_message(); 1690692118daSPaul E. McKenney flag = true; 1691692118daSPaul E. McKenney } 1692692118daSPaul E. McKenney 1693692118daSPaul E. McKenney void interrupt_handler(void) 1694692118daSPaul E. McKenney { 1695692118daSPaul E. McKenney if (flag) 1696692118daSPaul E. McKenney process_message(msg); 1697692118daSPaul E. McKenney } 1698692118daSPaul E. McKenney 1699df5cbb27SMasanari Iida There is nothing to prevent the compiler from transforming 1700692118daSPaul E. McKenney process_level() to the following, in fact, this might well be a 1701692118daSPaul E. McKenney win for single-threaded code: 1702692118daSPaul E. McKenney 1703692118daSPaul E. McKenney void process_level(void) 1704692118daSPaul E. McKenney { 1705692118daSPaul E. McKenney flag = true; 1706692118daSPaul E. McKenney msg = get_message(); 1707692118daSPaul E. McKenney } 1708692118daSPaul E. McKenney 1709692118daSPaul E. McKenney If the interrupt occurs between these two statement, then 17109af194ceSPaul E. McKenney interrupt_handler() might be passed a garbled msg. Use WRITE_ONCE() 1711692118daSPaul E. McKenney to prevent this as follows: 1712692118daSPaul E. McKenney 1713692118daSPaul E. McKenney void process_level(void) 1714692118daSPaul E. McKenney { 17159af194ceSPaul E. McKenney WRITE_ONCE(msg, get_message()); 17169af194ceSPaul E. McKenney WRITE_ONCE(flag, true); 1717692118daSPaul E. McKenney } 1718692118daSPaul E. McKenney 1719692118daSPaul E. McKenney void interrupt_handler(void) 1720692118daSPaul E. McKenney { 17219af194ceSPaul E. McKenney if (READ_ONCE(flag)) 17229af194ceSPaul E. McKenney process_message(READ_ONCE(msg)); 1723692118daSPaul E. McKenney } 1724692118daSPaul E. McKenney 17259af194ceSPaul E. McKenney Note that the READ_ONCE() and WRITE_ONCE() wrappers in 17269af194ceSPaul E. McKenney interrupt_handler() are needed if this interrupt handler can itself 17279af194ceSPaul E. McKenney be interrupted by something that also accesses 'flag' and 'msg', 17289af194ceSPaul E. McKenney for example, a nested interrupt or an NMI. Otherwise, READ_ONCE() 17299af194ceSPaul E. McKenney and WRITE_ONCE() are not needed in interrupt_handler() other than 17309af194ceSPaul E. McKenney for documentation purposes. (Note also that nested interrupts 17319af194ceSPaul E. McKenney do not typically occur in modern Linux kernels, in fact, if an 17329af194ceSPaul E. McKenney interrupt handler returns with interrupts enabled, you will get a 17339af194ceSPaul E. McKenney WARN_ONCE() splat.) 1734692118daSPaul E. McKenney 17359af194ceSPaul E. McKenney You should assume that the compiler can move READ_ONCE() and 17369af194ceSPaul E. McKenney WRITE_ONCE() past code not containing READ_ONCE(), WRITE_ONCE(), 17379af194ceSPaul E. McKenney barrier(), or similar primitives. 1738692118daSPaul E. McKenney 17399af194ceSPaul E. McKenney This effect could also be achieved using barrier(), but READ_ONCE() 17409af194ceSPaul E. McKenney and WRITE_ONCE() are more selective: With READ_ONCE() and 17419af194ceSPaul E. McKenney WRITE_ONCE(), the compiler need only forget the contents of the 17429af194ceSPaul E. McKenney indicated memory locations, while with barrier() the compiler must 17438149b5cbSSeongJae Park discard the value of all memory locations that it has currently 17449af194ceSPaul E. McKenney cached in any machine registers. Of course, the compiler must also 17459af194ceSPaul E. McKenney respect the order in which the READ_ONCE()s and WRITE_ONCE()s occur, 17469af194ceSPaul E. McKenney though the CPU of course need not do so. 1747692118daSPaul E. McKenney 1748692118daSPaul E. McKenney (*) The compiler is within its rights to invent stores to a variable, 1749692118daSPaul E. McKenney as in the following example: 1750692118daSPaul E. McKenney 1751692118daSPaul E. McKenney if (a) 1752692118daSPaul E. McKenney b = a; 1753692118daSPaul E. McKenney else 1754692118daSPaul E. McKenney b = 42; 1755692118daSPaul E. McKenney 1756692118daSPaul E. McKenney The compiler might save a branch by optimizing this as follows: 1757692118daSPaul E. McKenney 1758692118daSPaul E. McKenney b = 42; 1759692118daSPaul E. McKenney if (a) 1760692118daSPaul E. McKenney b = a; 1761692118daSPaul E. McKenney 1762692118daSPaul E. McKenney In single-threaded code, this is not only safe, but also saves 1763692118daSPaul E. McKenney a branch. Unfortunately, in concurrent code, this optimization 1764692118daSPaul E. McKenney could cause some other CPU to see a spurious value of 42 -- even 1765692118daSPaul E. McKenney if variable 'a' was never zero -- when loading variable 'b'. 17669af194ceSPaul E. McKenney Use WRITE_ONCE() to prevent this as follows: 1767692118daSPaul E. McKenney 1768692118daSPaul E. McKenney if (a) 17699af194ceSPaul E. McKenney WRITE_ONCE(b, a); 1770692118daSPaul E. McKenney else 17719af194ceSPaul E. McKenney WRITE_ONCE(b, 42); 1772692118daSPaul E. McKenney 1773692118daSPaul E. McKenney The compiler can also invent loads. These are usually less 1774692118daSPaul E. McKenney damaging, but they can result in cache-line bouncing and thus in 17759af194ceSPaul E. McKenney poor performance and scalability. Use READ_ONCE() to prevent 1776692118daSPaul E. McKenney invented loads. 1777692118daSPaul E. McKenney 1778692118daSPaul E. McKenney (*) For aligned memory locations whose size allows them to be accessed 1779692118daSPaul E. McKenney with a single memory-reference instruction, prevents "load tearing" 1780692118daSPaul E. McKenney and "store tearing," in which a single large access is replaced by 1781692118daSPaul E. McKenney multiple smaller accesses. For example, given an architecture having 1782692118daSPaul E. McKenney 16-bit store instructions with 7-bit immediate fields, the compiler 1783692118daSPaul E. McKenney might be tempted to use two 16-bit store-immediate instructions to 1784692118daSPaul E. McKenney implement the following 32-bit store: 1785692118daSPaul E. McKenney 1786692118daSPaul E. McKenney p = 0x00010002; 1787692118daSPaul E. McKenney 1788692118daSPaul E. McKenney Please note that GCC really does use this sort of optimization, 1789692118daSPaul E. McKenney which is not surprising given that it would likely take more 1790692118daSPaul E. McKenney than two instructions to build the constant and then store it. 1791692118daSPaul E. McKenney This optimization can therefore be a win in single-threaded code. 1792692118daSPaul E. McKenney In fact, a recent bug (since fixed) caused GCC to incorrectly use 1793692118daSPaul E. McKenney this optimization in a volatile store. In the absence of such bugs, 17949af194ceSPaul E. McKenney use of WRITE_ONCE() prevents store tearing in the following example: 1795692118daSPaul E. McKenney 17969af194ceSPaul E. McKenney WRITE_ONCE(p, 0x00010002); 1797692118daSPaul E. McKenney 1798692118daSPaul E. McKenney Use of packed structures can also result in load and store tearing, 1799692118daSPaul E. McKenney as in this example: 1800692118daSPaul E. McKenney 1801692118daSPaul E. McKenney struct __attribute__((__packed__)) foo { 1802692118daSPaul E. McKenney short a; 1803692118daSPaul E. McKenney int b; 1804692118daSPaul E. McKenney short c; 1805692118daSPaul E. McKenney }; 1806692118daSPaul E. McKenney struct foo foo1, foo2; 1807692118daSPaul E. McKenney ... 1808692118daSPaul E. McKenney 1809692118daSPaul E. McKenney foo2.a = foo1.a; 1810692118daSPaul E. McKenney foo2.b = foo1.b; 1811692118daSPaul E. McKenney foo2.c = foo1.c; 1812692118daSPaul E. McKenney 18139af194ceSPaul E. McKenney Because there are no READ_ONCE() or WRITE_ONCE() wrappers and no 18149af194ceSPaul E. McKenney volatile markings, the compiler would be well within its rights to 18159af194ceSPaul E. McKenney implement these three assignment statements as a pair of 32-bit 18169af194ceSPaul E. McKenney loads followed by a pair of 32-bit stores. This would result in 18179af194ceSPaul E. McKenney load tearing on 'foo1.b' and store tearing on 'foo2.b'. READ_ONCE() 18189af194ceSPaul E. McKenney and WRITE_ONCE() again prevent tearing in this example: 1819692118daSPaul E. McKenney 1820692118daSPaul E. McKenney foo2.a = foo1.a; 18219af194ceSPaul E. McKenney WRITE_ONCE(foo2.b, READ_ONCE(foo1.b)); 1822692118daSPaul E. McKenney foo2.c = foo1.c; 1823692118daSPaul E. McKenney 18249af194ceSPaul E. McKenneyAll that aside, it is never necessary to use READ_ONCE() and 18259af194ceSPaul E. McKenneyWRITE_ONCE() on a variable that has been marked volatile. For example, 18269af194ceSPaul E. McKenneybecause 'jiffies' is marked volatile, it is never necessary to 18279af194ceSPaul E. McKenneysay READ_ONCE(jiffies). The reason for this is that READ_ONCE() and 18289af194ceSPaul E. McKenneyWRITE_ONCE() are implemented as volatile casts, which has no effect when 18299af194ceSPaul E. McKenneyits argument is already marked volatile. 1830692118daSPaul E. McKenney 1831692118daSPaul E. McKenneyPlease note that these compiler barriers have no direct effect on the CPU, 1832692118daSPaul E. McKenneywhich may then reorder things however it wishes. 1833108b42b4SDavid Howells 1834108b42b4SDavid Howells 1835108b42b4SDavid HowellsCPU MEMORY BARRIERS 1836108b42b4SDavid Howells------------------- 1837108b42b4SDavid Howells 1838203185f6SAkira YokosawaThe Linux kernel has seven basic CPU memory barriers: 1839108b42b4SDavid Howells 1840108b42b4SDavid Howells TYPE MANDATORY SMP CONDITIONAL 1841203185f6SAkira Yokosawa ======================= =============== =============== 1842108b42b4SDavid Howells GENERAL mb() smp_mb() 1843108b42b4SDavid Howells WRITE wmb() smp_wmb() 1844108b42b4SDavid Howells READ rmb() smp_rmb() 1845203185f6SAkira Yokosawa ADDRESS DEPENDENCY READ_ONCE() 1846108b42b4SDavid Howells 1847108b42b4SDavid Howells 1848203185f6SAkira YokosawaAll memory barriers except the address-dependency barriers imply a compiler 1849203185f6SAkira Yokosawabarrier. Address dependencies do not impose any additional compiler ordering. 185073f10281SNick Piggin 1851203185f6SAkira YokosawaAside: In the case of address dependencies, the compiler would be expected 18529af194ceSPaul E. McKenneyto issue the loads in the correct order (eg. `a[b]` would have to load 18539af194ceSPaul E. McKenneythe value of b before loading a[b]), however there is no guarantee in 18549af194ceSPaul E. McKenneythe C specification that the compiler may not speculate the value of b 18558149b5cbSSeongJae Park(eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1) 18569af194ceSPaul E. McKenneytmp = a[b]; ). There is also the problem of a compiler reloading b after 18579af194ceSPaul E. McKenneyhaving loaded a[b], thus having a newer copy of b than a[b]. A consensus 18589af194ceSPaul E. McKenneyhas not yet been reached about these problems, however the READ_ONCE() 18599af194ceSPaul E. McKenneymacro is a good place to start looking. 1860108b42b4SDavid Howells 1861108b42b4SDavid HowellsSMP memory barriers are reduced to compiler barriers on uniprocessor compiled 186281fc6323SJarek Poplawskisystems because it is assumed that a CPU will appear to be self-consistent, 1863108b42b4SDavid Howellsand will order overlapping accesses correctly with respect to itself. 18646a65d263SMichael S. TsirkinHowever, see the subsection on "Virtual Machine Guests" below. 1865108b42b4SDavid Howells 1866108b42b4SDavid Howells[!] Note that SMP memory barriers _must_ be used to control the ordering of 1867108b42b4SDavid Howellsreferences to shared memory on SMP systems, though the use of locking instead 1868108b42b4SDavid Howellsis sufficient. 1869108b42b4SDavid Howells 1870108b42b4SDavid HowellsMandatory barriers should not be used to control SMP effects, since mandatory 18716a65d263SMichael S. Tsirkinbarriers impose unnecessary overhead on both SMP and UP systems. They may, 18726a65d263SMichael S. Tsirkinhowever, be used to control MMIO effects on accesses through relaxed memory I/O 18736a65d263SMichael S. Tsirkinwindows. These barriers are required even on non-SMP systems as they affect 18746a65d263SMichael S. Tsirkinthe order in which memory operations appear to a device by prohibiting both the 18756a65d263SMichael S. Tsirkincompiler and the CPU from reordering them. 1876108b42b4SDavid Howells 1877108b42b4SDavid Howells 1878108b42b4SDavid HowellsThere are some more advanced barrier functions: 1879108b42b4SDavid Howells 1880b92b8b35SPeter Zijlstra (*) smp_store_mb(var, value) 1881108b42b4SDavid Howells 188275b2bd55SOleg Nesterov This assigns the value to the variable and then inserts a full memory 18832d142e59SDavidlohr Bueso barrier after it. It isn't guaranteed to insert anything more than a 18842d142e59SDavidlohr Bueso compiler barrier in a UP compilation. 1885108b42b4SDavid Howells 1886108b42b4SDavid Howells 18871b15611eSPeter Zijlstra (*) smp_mb__before_atomic(); 18881b15611eSPeter Zijlstra (*) smp_mb__after_atomic(); 1889108b42b4SDavid Howells 189039323c64SManfred Spraul These are for use with atomic RMW functions that do not imply memory 189139323c64SManfred Spraul barriers, but where the code needs a memory barrier. Examples for atomic 1892d8566f15SFox Chen RMW functions that do not imply a memory barrier are e.g. add, 189339323c64SManfred Spraul subtract, (failed) conditional operations, _relaxed functions, 189439323c64SManfred Spraul but not atomic_read or atomic_set. A common example where a memory 189539323c64SManfred Spraul barrier may be required is when atomic ops are used for reference 189639323c64SManfred Spraul counting. 18971b15611eSPeter Zijlstra 189839323c64SManfred Spraul These are also used for atomic RMW bitop functions that do not imply a 189939323c64SManfred Spraul memory barrier (such as set_bit and clear_bit). 1900108b42b4SDavid Howells 1901108b42b4SDavid Howells As an example, consider a piece of code that marks an object as being dead 1902108b42b4SDavid Howells and then decrements the object's reference count: 1903108b42b4SDavid Howells 1904108b42b4SDavid Howells obj->dead = 1; 19051b15611eSPeter Zijlstra smp_mb__before_atomic(); 1906108b42b4SDavid Howells atomic_dec(&obj->ref_count); 1907108b42b4SDavid Howells 1908108b42b4SDavid Howells This makes sure that the death mark on the object is perceived to be set 1909108b42b4SDavid Howells *before* the reference counter is decremented. 1910108b42b4SDavid Howells 1911706eeb3eSPeter Zijlstra See Documentation/atomic_{t,bitops}.txt for more information. 1912108b42b4SDavid Howells 1913108b42b4SDavid Howells 19141077fa36SAlexander Duyck (*) dma_wmb(); 19151077fa36SAlexander Duyck (*) dma_rmb(); 1916ed59dfd9SKefeng Wang (*) dma_mb(); 19171077fa36SAlexander Duyck 19181077fa36SAlexander Duyck These are for use with consistent memory to guarantee the ordering 19191077fa36SAlexander Duyck of writes or reads of shared memory accessible to both the CPU and a 1920289e1c89SParav Pandit DMA capable device. See Documentation/core-api/dma-api.rst file for more 1921289e1c89SParav Pandit information about consistent memory. 19221077fa36SAlexander Duyck 19231077fa36SAlexander Duyck For example, consider a device driver that shares memory with a device 19241077fa36SAlexander Duyck and uses a descriptor status value to indicate if the descriptor belongs 19251077fa36SAlexander Duyck to the device or the CPU, and a doorbell to notify it when new 19261077fa36SAlexander Duyck descriptors are available: 19271077fa36SAlexander Duyck 19281077fa36SAlexander Duyck if (desc->status != DEVICE_OWN) { 19291077fa36SAlexander Duyck /* do not read data until we own descriptor */ 19301077fa36SAlexander Duyck dma_rmb(); 19311077fa36SAlexander Duyck 19321077fa36SAlexander Duyck /* read/modify data */ 19331077fa36SAlexander Duyck read_data = desc->data; 19341077fa36SAlexander Duyck desc->data = write_data; 19351077fa36SAlexander Duyck 19361077fa36SAlexander Duyck /* flush modifications before status update */ 19371077fa36SAlexander Duyck dma_wmb(); 19381077fa36SAlexander Duyck 19391077fa36SAlexander Duyck /* assign ownership */ 19401077fa36SAlexander Duyck desc->status = DEVICE_OWN; 19411077fa36SAlexander Duyck 1942289e1c89SParav Pandit /* Make descriptor status visible to the device followed by 1943289e1c89SParav Pandit * notify device of new descriptor 1944289e1c89SParav Pandit */ 19451077fa36SAlexander Duyck writel(DESC_NOTIFY, doorbell); 19461077fa36SAlexander Duyck } 19471077fa36SAlexander Duyck 1948289e1c89SParav Pandit The dma_rmb() allows us to guarantee that the device has released ownership 19497a458007SSylvain Trias before we read the data from the descriptor, and the dma_wmb() allows 19501077fa36SAlexander Duyck us to guarantee the data is written to the descriptor before the device 1951ed59dfd9SKefeng Wang can see it now has ownership. The dma_mb() implies both a dma_rmb() and 1952289e1c89SParav Pandit a dma_wmb(). 19531077fa36SAlexander Duyck 1954289e1c89SParav Pandit Note that the dma_*() barriers do not provide any ordering guarantees for 1955289e1c89SParav Pandit accesses to MMIO regions. See the later "KERNEL I/O BARRIER EFFECTS" 1956289e1c89SParav Pandit subsection for more information about I/O accessors and MMIO ordering. 19571077fa36SAlexander Duyck 19583e79f082SAneesh Kumar K.V (*) pmem_wmb(); 19593e79f082SAneesh Kumar K.V 19603e79f082SAneesh Kumar K.V This is for use with persistent memory to ensure that stores for which 19613e79f082SAneesh Kumar K.V modifications are written to persistent storage reached a platform 19623e79f082SAneesh Kumar K.V durability domain. 19633e79f082SAneesh Kumar K.V 19643e79f082SAneesh Kumar K.V For example, after a non-temporal write to pmem region, we use pmem_wmb() 19653e79f082SAneesh Kumar K.V to ensure that stores have reached a platform durability domain. This ensures 19663e79f082SAneesh Kumar K.V that stores have updated persistent storage before any data access or 19673e79f082SAneesh Kumar K.V data transfer caused by subsequent instructions is initiated. This is 19683e79f082SAneesh Kumar K.V in addition to the ordering done by wmb(). 19693e79f082SAneesh Kumar K.V 19703e79f082SAneesh Kumar K.V For load from persistent memory, existing read memory barriers are sufficient 19713e79f082SAneesh Kumar K.V to ensure read ordering. 1972dfeccea6SSeongJae Park 1973d5624bb2SXiongfeng Wang (*) io_stop_wc(); 1974d5624bb2SXiongfeng Wang 1975d5624bb2SXiongfeng Wang For memory accesses with write-combining attributes (e.g. those returned 19761ab8f248SSeongJae Park by ioremap_wc()), the CPU may wait for prior accesses to be merged with 1977d5624bb2SXiongfeng Wang subsequent ones. io_stop_wc() can be used to prevent the merging of 1978d5624bb2SXiongfeng Wang write-combining memory accesses before this macro with those after it when 1979d5624bb2SXiongfeng Wang such wait has performance implications. 1980d5624bb2SXiongfeng Wang 1981108b42b4SDavid Howells=============================== 1982108b42b4SDavid HowellsIMPLICIT KERNEL MEMORY BARRIERS 1983108b42b4SDavid Howells=============================== 1984108b42b4SDavid Howells 1985108b42b4SDavid HowellsSome of the other functions in the linux kernel imply memory barriers, amongst 1986670bd95eSDavid Howellswhich are locking and scheduling functions. 1987108b42b4SDavid Howells 1988108b42b4SDavid HowellsThis specification is a _minimum_ guarantee; any particular architecture may 1989108b42b4SDavid Howellsprovide more substantial guarantees, but these may not be relied upon outside 1990108b42b4SDavid Howellsof arch specific code. 1991108b42b4SDavid Howells 1992108b42b4SDavid Howells 1993166bda71SSeongJae ParkLOCK ACQUISITION FUNCTIONS 1994166bda71SSeongJae Park-------------------------- 1995108b42b4SDavid Howells 1996108b42b4SDavid HowellsThe Linux kernel has a number of locking constructs: 1997108b42b4SDavid Howells 1998108b42b4SDavid Howells (*) spin locks 1999108b42b4SDavid Howells (*) R/W spin locks 2000108b42b4SDavid Howells (*) mutexes 2001108b42b4SDavid Howells (*) semaphores 2002108b42b4SDavid Howells (*) R/W semaphores 2003108b42b4SDavid Howells 20042e4f5382SPeter ZijlstraIn all cases there are variants on "ACQUIRE" operations and "RELEASE" operations 2005108b42b4SDavid Howellsfor each construct. These operations all imply certain barriers: 2006108b42b4SDavid Howells 20072e4f5382SPeter Zijlstra (1) ACQUIRE operation implication: 2008108b42b4SDavid Howells 20092e4f5382SPeter Zijlstra Memory operations issued after the ACQUIRE will be completed after the 20102e4f5382SPeter Zijlstra ACQUIRE operation has completed. 2011108b42b4SDavid Howells 20128dd853d7SPaul E. McKenney Memory operations issued before the ACQUIRE may be completed after 2013a9668cd6SPeter Zijlstra the ACQUIRE operation has completed. 2014108b42b4SDavid Howells 20152e4f5382SPeter Zijlstra (2) RELEASE operation implication: 2016108b42b4SDavid Howells 20172e4f5382SPeter Zijlstra Memory operations issued before the RELEASE will be completed before the 20182e4f5382SPeter Zijlstra RELEASE operation has completed. 2019108b42b4SDavid Howells 20202e4f5382SPeter Zijlstra Memory operations issued after the RELEASE may be completed before the 20212e4f5382SPeter Zijlstra RELEASE operation has completed. 2022108b42b4SDavid Howells 20232e4f5382SPeter Zijlstra (3) ACQUIRE vs ACQUIRE implication: 2024108b42b4SDavid Howells 20252e4f5382SPeter Zijlstra All ACQUIRE operations issued before another ACQUIRE operation will be 20262e4f5382SPeter Zijlstra completed before that ACQUIRE operation. 2027108b42b4SDavid Howells 20282e4f5382SPeter Zijlstra (4) ACQUIRE vs RELEASE implication: 2029108b42b4SDavid Howells 20302e4f5382SPeter Zijlstra All ACQUIRE operations issued before a RELEASE operation will be 20312e4f5382SPeter Zijlstra completed before the RELEASE operation. 2032108b42b4SDavid Howells 20332e4f5382SPeter Zijlstra (5) Failed conditional ACQUIRE implication: 2034108b42b4SDavid Howells 20352e4f5382SPeter Zijlstra Certain locking variants of the ACQUIRE operation may fail, either due to 20362e4f5382SPeter Zijlstra being unable to get the lock immediately, or due to receiving an unblocked 2037806654a9SWill Deacon signal while asleep waiting for the lock to become available. Failed 2038108b42b4SDavid Howells locks do not imply any sort of barrier. 2039108b42b4SDavid Howells 20402e4f5382SPeter Zijlstra[!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only 20412e4f5382SPeter Zijlstraone-way barriers is that the effects of instructions outside of a critical 20422e4f5382SPeter Zijlstrasection may seep into the inside of the critical section. 2043108b42b4SDavid Howells 20442e4f5382SPeter ZijlstraAn ACQUIRE followed by a RELEASE may not be assumed to be full memory barrier 20452e4f5382SPeter Zijlstrabecause it is possible for an access preceding the ACQUIRE to happen after the 20462e4f5382SPeter ZijlstraACQUIRE, and an access following the RELEASE to happen before the RELEASE, and 20472e4f5382SPeter Zijlstrathe two accesses can themselves then cross: 2048670bd95eSDavid Howells 2049670bd95eSDavid Howells *A = a; 20502e4f5382SPeter Zijlstra ACQUIRE M 20512e4f5382SPeter Zijlstra RELEASE M 2052670bd95eSDavid Howells *B = b; 2053670bd95eSDavid Howells 2054670bd95eSDavid Howellsmay occur as: 2055670bd95eSDavid Howells 20562e4f5382SPeter Zijlstra ACQUIRE M, STORE *B, STORE *A, RELEASE M 205717eb88e0SPaul E. McKenney 20588dd853d7SPaul E. McKenneyWhen the ACQUIRE and RELEASE are a lock acquisition and release, 20598dd853d7SPaul E. McKenneyrespectively, this same reordering can occur if the lock's ACQUIRE and 20608dd853d7SPaul E. McKenneyRELEASE are to the same lock variable, but only from the perspective of 20618dd853d7SPaul E. McKenneyanother CPU not holding that lock. In short, a ACQUIRE followed by an 20628dd853d7SPaul E. McKenneyRELEASE may -not- be assumed to be a full memory barrier. 206317eb88e0SPaul E. McKenney 206412d560f4SPaul E. McKenneySimilarly, the reverse case of a RELEASE followed by an ACQUIRE does 206512d560f4SPaul E. McKenneynot imply a full memory barrier. Therefore, the CPU's execution of the 206612d560f4SPaul E. McKenneycritical sections corresponding to the RELEASE and the ACQUIRE can cross, 206712d560f4SPaul E. McKenneyso that: 206817eb88e0SPaul E. McKenney 206917eb88e0SPaul E. McKenney *A = a; 20702e4f5382SPeter Zijlstra RELEASE M 20712e4f5382SPeter Zijlstra ACQUIRE N 207217eb88e0SPaul E. McKenney *B = b; 207317eb88e0SPaul E. McKenney 207417eb88e0SPaul E. McKenneycould occur as: 207517eb88e0SPaul E. McKenney 20762e4f5382SPeter Zijlstra ACQUIRE N, STORE *B, STORE *A, RELEASE M 207717eb88e0SPaul E. McKenney 20788dd853d7SPaul E. McKenneyIt might appear that this reordering could introduce a deadlock. 20798dd853d7SPaul E. McKenneyHowever, this cannot happen because if such a deadlock threatened, 20808dd853d7SPaul E. McKenneythe RELEASE would simply complete, thereby avoiding the deadlock. 20818dd853d7SPaul E. McKenney 20828dd853d7SPaul E. McKenney Why does this work? 20838dd853d7SPaul E. McKenney 20848dd853d7SPaul E. McKenney One key point is that we are only talking about the CPU doing 20858dd853d7SPaul E. McKenney the reordering, not the compiler. If the compiler (or, for 20868dd853d7SPaul E. McKenney that matter, the developer) switched the operations, deadlock 20878dd853d7SPaul E. McKenney -could- occur. 20888dd853d7SPaul E. McKenney 20898dd853d7SPaul E. McKenney But suppose the CPU reordered the operations. In this case, 20908dd853d7SPaul E. McKenney the unlock precedes the lock in the assembly code. The CPU 20918dd853d7SPaul E. McKenney simply elected to try executing the later lock operation first. 20928dd853d7SPaul E. McKenney If there is a deadlock, this lock operation will simply spin (or 20938dd853d7SPaul E. McKenney try to sleep, but more on that later). The CPU will eventually 20948dd853d7SPaul E. McKenney execute the unlock operation (which preceded the lock operation 20958dd853d7SPaul E. McKenney in the assembly code), which will unravel the potential deadlock, 20968dd853d7SPaul E. McKenney allowing the lock operation to succeed. 20978dd853d7SPaul E. McKenney 20988dd853d7SPaul E. McKenney But what if the lock is a sleeplock? In that case, the code will 20998dd853d7SPaul E. McKenney try to enter the scheduler, where it will eventually encounter 21008dd853d7SPaul E. McKenney a memory barrier, which will force the earlier unlock operation 21018dd853d7SPaul E. McKenney to complete, again unraveling the deadlock. There might be 21028dd853d7SPaul E. McKenney a sleep-unlock race, but the locking primitive needs to resolve 21038dd853d7SPaul E. McKenney such races properly in any case. 21048dd853d7SPaul E. McKenney 2105108b42b4SDavid HowellsLocks and semaphores may not provide any guarantee of ordering on UP compiled 2106108b42b4SDavid Howellssystems, and so cannot be counted on in such a situation to actually achieve 2107108b42b4SDavid Howellsanything at all - especially with respect to I/O accesses - unless combined 2108108b42b4SDavid Howellswith interrupt disabling operations. 2109108b42b4SDavid Howells 2110d7cab36dSSeongJae ParkSee also the section on "Inter-CPU acquiring barrier effects". 2111108b42b4SDavid Howells 2112108b42b4SDavid Howells 2113108b42b4SDavid HowellsAs an example, consider the following: 2114108b42b4SDavid Howells 2115108b42b4SDavid Howells *A = a; 2116108b42b4SDavid Howells *B = b; 21172e4f5382SPeter Zijlstra ACQUIRE 2118108b42b4SDavid Howells *C = c; 2119108b42b4SDavid Howells *D = d; 21202e4f5382SPeter Zijlstra RELEASE 2121108b42b4SDavid Howells *E = e; 2122108b42b4SDavid Howells *F = f; 2123108b42b4SDavid Howells 2124108b42b4SDavid HowellsThe following sequence of events is acceptable: 2125108b42b4SDavid Howells 21262e4f5382SPeter Zijlstra ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE 2127108b42b4SDavid Howells 2128108b42b4SDavid Howells [+] Note that {*F,*A} indicates a combined access. 2129108b42b4SDavid Howells 2130108b42b4SDavid HowellsBut none of the following are: 2131108b42b4SDavid Howells 21322e4f5382SPeter Zijlstra {*F,*A}, *B, ACQUIRE, *C, *D, RELEASE, *E 21332e4f5382SPeter Zijlstra *A, *B, *C, ACQUIRE, *D, RELEASE, *E, *F 21342e4f5382SPeter Zijlstra *A, *B, ACQUIRE, *C, RELEASE, *D, *E, *F 21352e4f5382SPeter Zijlstra *B, ACQUIRE, *C, *D, RELEASE, {*F,*A}, *E 2136108b42b4SDavid Howells 2137108b42b4SDavid Howells 2138108b42b4SDavid Howells 2139108b42b4SDavid HowellsINTERRUPT DISABLING FUNCTIONS 2140108b42b4SDavid Howells----------------------------- 2141108b42b4SDavid Howells 21422e4f5382SPeter ZijlstraFunctions that disable interrupts (ACQUIRE equivalent) and enable interrupts 21432e4f5382SPeter Zijlstra(RELEASE equivalent) will act as compiler barriers only. So if memory or I/O 2144108b42b4SDavid Howellsbarriers are required in such a situation, they must be provided from some 2145108b42b4SDavid Howellsother means. 2146108b42b4SDavid Howells 2147108b42b4SDavid Howells 214850fa610aSDavid HowellsSLEEP AND WAKE-UP FUNCTIONS 214950fa610aSDavid Howells--------------------------- 215050fa610aSDavid Howells 215150fa610aSDavid HowellsSleeping and waking on an event flagged in global data can be viewed as an 215250fa610aSDavid Howellsinteraction between two pieces of data: the task state of the task waiting for 215350fa610aSDavid Howellsthe event and the global data used to indicate the event. To make sure that 215450fa610aSDavid Howellsthese appear to happen in the right order, the primitives to begin the process 215550fa610aSDavid Howellsof going to sleep, and the primitives to initiate a wake up imply certain 215650fa610aSDavid Howellsbarriers. 215750fa610aSDavid Howells 215850fa610aSDavid HowellsFirstly, the sleeper normally follows something like this sequence of events: 215950fa610aSDavid Howells 216050fa610aSDavid Howells for (;;) { 216150fa610aSDavid Howells set_current_state(TASK_UNINTERRUPTIBLE); 216250fa610aSDavid Howells if (event_indicated) 216350fa610aSDavid Howells break; 216450fa610aSDavid Howells schedule(); 216550fa610aSDavid Howells } 216650fa610aSDavid Howells 216750fa610aSDavid HowellsA general memory barrier is interpolated automatically by set_current_state() 216850fa610aSDavid Howellsafter it has altered the task state: 216950fa610aSDavid Howells 217050fa610aSDavid Howells CPU 1 217150fa610aSDavid Howells =============================== 217250fa610aSDavid Howells set_current_state(); 2173b92b8b35SPeter Zijlstra smp_store_mb(); 217450fa610aSDavid Howells STORE current->state 217550fa610aSDavid Howells <general barrier> 217650fa610aSDavid Howells LOAD event_indicated 217750fa610aSDavid Howells 217850fa610aSDavid Howellsset_current_state() may be wrapped by: 217950fa610aSDavid Howells 218050fa610aSDavid Howells prepare_to_wait(); 218150fa610aSDavid Howells prepare_to_wait_exclusive(); 218250fa610aSDavid Howells 218350fa610aSDavid Howellswhich therefore also imply a general memory barrier after setting the state. 218450fa610aSDavid HowellsThe whole sequence above is available in various canned forms, all of which 2185*7f8fcc6fSAkira Yokosawainterpolate the memory barrier in the right place, for example: 218650fa610aSDavid Howells 218750fa610aSDavid Howells wait_event(); 2188*7f8fcc6fSAkira Yokosawa wait_event_cmd(); 2189*7f8fcc6fSAkira Yokosawa wait_event_exclusive_cmd(); 219050fa610aSDavid Howells wait_event_interruptible(); 219150fa610aSDavid Howells wait_event_interruptible_exclusive(); 219250fa610aSDavid Howells wait_event_interruptible_timeout(); 219350fa610aSDavid Howells wait_event_killable(); 219450fa610aSDavid Howells wait_event_timeout(); 219550fa610aSDavid Howells wait_on_bit(); 219650fa610aSDavid Howells wait_on_bit_lock(); 219750fa610aSDavid Howells 219850fa610aSDavid Howells 219950fa610aSDavid HowellsSecondly, code that performs a wake up normally follows something like this: 220050fa610aSDavid Howells 220150fa610aSDavid Howells event_indicated = 1; 220250fa610aSDavid Howells wake_up(&event_wait_queue); 220350fa610aSDavid Howells 220450fa610aSDavid Howellsor: 220550fa610aSDavid Howells 220650fa610aSDavid Howells event_indicated = 1; 220750fa610aSDavid Howells wake_up_process(event_daemon); 220850fa610aSDavid Howells 22097696f991SAndrea ParriA general memory barrier is executed by wake_up() if it wakes something up. 22107696f991SAndrea ParriIf it doesn't wake anything up then a memory barrier may or may not be 22117696f991SAndrea Parriexecuted; you must not rely on it. The barrier occurs before the task state 22127696f991SAndrea Parriis accessed, in particular, it sits between the STORE to indicate the event 22137696f991SAndrea Parriand the STORE to set TASK_RUNNING: 221450fa610aSDavid Howells 22157696f991SAndrea Parri CPU 1 (Sleeper) CPU 2 (Waker) 221650fa610aSDavid Howells =============================== =============================== 221750fa610aSDavid Howells set_current_state(); STORE event_indicated 2218b92b8b35SPeter Zijlstra smp_store_mb(); wake_up(); 22197696f991SAndrea Parri STORE current->state ... 22207696f991SAndrea Parri <general barrier> <general barrier> 22217696f991SAndrea Parri LOAD event_indicated if ((LOAD task->state) & TASK_NORMAL) 22227696f991SAndrea Parri STORE task->state 222350fa610aSDavid Howells 22247696f991SAndrea Parriwhere "task" is the thread being woken up and it equals CPU 1's "current". 22257696f991SAndrea Parri 22267696f991SAndrea ParriTo repeat, a general memory barrier is guaranteed to be executed by wake_up() 22277696f991SAndrea Parriif something is actually awakened, but otherwise there is no such guarantee. 22287696f991SAndrea ParriTo see this, consider the following sequence of events, where X and Y are both 22297696f991SAndrea Parriinitially zero: 22305726ce06SPaul E. McKenney 22315726ce06SPaul E. McKenney CPU 1 CPU 2 22325726ce06SPaul E. McKenney =============================== =============================== 22337696f991SAndrea Parri X = 1; Y = 1; 22345726ce06SPaul E. McKenney smp_mb(); wake_up(); 22357696f991SAndrea Parri LOAD Y LOAD X 22365726ce06SPaul E. McKenney 22377696f991SAndrea ParriIf a wakeup does occur, one (at least) of the two loads must see 1. If, on 22387696f991SAndrea Parrithe other hand, a wakeup does not occur, both loads might see 0. 22397696f991SAndrea Parri 22407696f991SAndrea Parriwake_up_process() always executes a general memory barrier. The barrier again 22417696f991SAndrea Parrioccurs before the task state is accessed. In particular, if the wake_up() in 22427696f991SAndrea Parrithe previous snippet were replaced by a call to wake_up_process() then one of 22437696f991SAndrea Parrithe two loads would be guaranteed to see 1. 22445726ce06SPaul E. McKenney 224550fa610aSDavid HowellsThe available waker functions include: 224650fa610aSDavid Howells 224750fa610aSDavid Howells complete(); 224850fa610aSDavid Howells wake_up(); 224950fa610aSDavid Howells wake_up_all(); 225050fa610aSDavid Howells wake_up_bit(); 225150fa610aSDavid Howells wake_up_interruptible(); 225250fa610aSDavid Howells wake_up_interruptible_all(); 225350fa610aSDavid Howells wake_up_interruptible_nr(); 225450fa610aSDavid Howells wake_up_interruptible_poll(); 225550fa610aSDavid Howells wake_up_interruptible_sync(); 225650fa610aSDavid Howells wake_up_interruptible_sync_poll(); 225750fa610aSDavid Howells wake_up_locked(); 225850fa610aSDavid Howells wake_up_locked_poll(); 225950fa610aSDavid Howells wake_up_nr(); 226050fa610aSDavid Howells wake_up_poll(); 226150fa610aSDavid Howells wake_up_process(); 226250fa610aSDavid Howells 22637696f991SAndrea ParriIn terms of memory ordering, these functions all provide the same guarantees of 22647696f991SAndrea Parria wake_up() (or stronger). 226550fa610aSDavid Howells 226650fa610aSDavid Howells[!] Note that the memory barriers implied by the sleeper and the waker do _not_ 226750fa610aSDavid Howellsorder multiple stores before the wake-up with respect to loads of those stored 226850fa610aSDavid Howellsvalues after the sleeper has called set_current_state(). For instance, if the 226950fa610aSDavid Howellssleeper does: 227050fa610aSDavid Howells 227150fa610aSDavid Howells set_current_state(TASK_INTERRUPTIBLE); 227250fa610aSDavid Howells if (event_indicated) 227350fa610aSDavid Howells break; 227450fa610aSDavid Howells __set_current_state(TASK_RUNNING); 227550fa610aSDavid Howells do_something(my_data); 227650fa610aSDavid Howells 227750fa610aSDavid Howellsand the waker does: 227850fa610aSDavid Howells 227950fa610aSDavid Howells my_data = value; 228050fa610aSDavid Howells event_indicated = 1; 228150fa610aSDavid Howells wake_up(&event_wait_queue); 228250fa610aSDavid Howells 228350fa610aSDavid Howellsthere's no guarantee that the change to event_indicated will be perceived by 228450fa610aSDavid Howellsthe sleeper as coming after the change to my_data. In such a circumstance, the 228550fa610aSDavid Howellscode on both sides must interpolate its own memory barriers between the 228650fa610aSDavid Howellsseparate data accesses. Thus the above sleeper ought to do: 228750fa610aSDavid Howells 228850fa610aSDavid Howells set_current_state(TASK_INTERRUPTIBLE); 228950fa610aSDavid Howells if (event_indicated) { 229050fa610aSDavid Howells smp_rmb(); 229150fa610aSDavid Howells do_something(my_data); 229250fa610aSDavid Howells } 229350fa610aSDavid Howells 229450fa610aSDavid Howellsand the waker should do: 229550fa610aSDavid Howells 229650fa610aSDavid Howells my_data = value; 229750fa610aSDavid Howells smp_wmb(); 229850fa610aSDavid Howells event_indicated = 1; 229950fa610aSDavid Howells wake_up(&event_wait_queue); 230050fa610aSDavid Howells 230150fa610aSDavid Howells 2302108b42b4SDavid HowellsMISCELLANEOUS FUNCTIONS 2303108b42b4SDavid Howells----------------------- 2304108b42b4SDavid Howells 2305108b42b4SDavid HowellsOther functions that imply barriers: 2306108b42b4SDavid Howells 2307108b42b4SDavid Howells (*) schedule() and similar imply full memory barriers. 2308108b42b4SDavid Howells 2309108b42b4SDavid Howells 23102e4f5382SPeter Zijlstra=================================== 23112e4f5382SPeter ZijlstraINTER-CPU ACQUIRING BARRIER EFFECTS 23122e4f5382SPeter Zijlstra=================================== 2313108b42b4SDavid Howells 2314108b42b4SDavid HowellsOn SMP systems locking primitives give a more substantial form of barrier: one 2315108b42b4SDavid Howellsthat does affect memory access ordering on other CPUs, within the context of 2316108b42b4SDavid Howellsconflict on any particular lock. 2317108b42b4SDavid Howells 2318108b42b4SDavid Howells 23192e4f5382SPeter ZijlstraACQUIRES VS MEMORY ACCESSES 23202e4f5382SPeter Zijlstra--------------------------- 2321108b42b4SDavid Howells 232279afecfaSAneesh KumarConsider the following: the system has a pair of spinlocks (M) and (Q), and 2323108b42b4SDavid Howellsthree CPUs; then should the following sequence of events occur: 2324108b42b4SDavid Howells 2325108b42b4SDavid Howells CPU 1 CPU 2 2326108b42b4SDavid Howells =============================== =============================== 23279af194ceSPaul E. McKenney WRITE_ONCE(*A, a); WRITE_ONCE(*E, e); 23282e4f5382SPeter Zijlstra ACQUIRE M ACQUIRE Q 23299af194ceSPaul E. McKenney WRITE_ONCE(*B, b); WRITE_ONCE(*F, f); 23309af194ceSPaul E. McKenney WRITE_ONCE(*C, c); WRITE_ONCE(*G, g); 23312e4f5382SPeter Zijlstra RELEASE M RELEASE Q 23329af194ceSPaul E. McKenney WRITE_ONCE(*D, d); WRITE_ONCE(*H, h); 2333108b42b4SDavid Howells 233481fc6323SJarek PoplawskiThen there is no guarantee as to what order CPU 3 will see the accesses to *A 2335108b42b4SDavid Howellsthrough *H occur in, other than the constraints imposed by the separate locks 2336108b42b4SDavid Howellson the separate CPUs. It might, for example, see: 2337108b42b4SDavid Howells 23382e4f5382SPeter Zijlstra *E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M 2339108b42b4SDavid Howells 2340108b42b4SDavid HowellsBut it won't see any of: 2341108b42b4SDavid Howells 23422e4f5382SPeter Zijlstra *B, *C or *D preceding ACQUIRE M 23432e4f5382SPeter Zijlstra *A, *B or *C following RELEASE M 23442e4f5382SPeter Zijlstra *F, *G or *H preceding ACQUIRE Q 23452e4f5382SPeter Zijlstra *E, *F or *G following RELEASE Q 2346108b42b4SDavid Howells 2347108b42b4SDavid Howells 2348108b42b4SDavid Howells================================= 2349108b42b4SDavid HowellsWHERE ARE MEMORY BARRIERS NEEDED? 2350108b42b4SDavid Howells================================= 2351108b42b4SDavid Howells 2352108b42b4SDavid HowellsUnder normal operation, memory operation reordering is generally not going to 2353108b42b4SDavid Howellsbe a problem as a single-threaded linear piece of code will still appear to 235450fa610aSDavid Howellswork correctly, even if it's in an SMP kernel. There are, however, four 2355108b42b4SDavid Howellscircumstances in which reordering definitely _could_ be a problem: 2356108b42b4SDavid Howells 2357108b42b4SDavid Howells (*) Interprocessor interaction. 2358108b42b4SDavid Howells 2359108b42b4SDavid Howells (*) Atomic operations. 2360108b42b4SDavid Howells 236181fc6323SJarek Poplawski (*) Accessing devices. 2362108b42b4SDavid Howells 2363108b42b4SDavid Howells (*) Interrupts. 2364108b42b4SDavid Howells 2365108b42b4SDavid Howells 2366108b42b4SDavid HowellsINTERPROCESSOR INTERACTION 2367108b42b4SDavid Howells-------------------------- 2368108b42b4SDavid Howells 2369108b42b4SDavid HowellsWhen there's a system with more than one processor, more than one CPU in the 2370108b42b4SDavid Howellssystem may be working on the same data set at the same time. This can cause 2371108b42b4SDavid Howellssynchronisation problems, and the usual way of dealing with them is to use 2372108b42b4SDavid Howellslocks. Locks, however, are quite expensive, and so it may be preferable to 2373108b42b4SDavid Howellsoperate without the use of a lock if at all possible. In such a case 2374108b42b4SDavid Howellsoperations that affect both CPUs may have to be carefully ordered to prevent 2375108b42b4SDavid Howellsa malfunction. 2376108b42b4SDavid Howells 2377108b42b4SDavid HowellsConsider, for example, the R/W semaphore slow path. Here a waiting process is 2378108b42b4SDavid Howellsqueued on the semaphore, by virtue of it having a piece of its stack linked to 2379108b42b4SDavid Howellsthe semaphore's list of waiting processes: 2380108b42b4SDavid Howells 2381108b42b4SDavid Howells struct rw_semaphore { 2382108b42b4SDavid Howells ... 2383108b42b4SDavid Howells spinlock_t lock; 2384108b42b4SDavid Howells struct list_head waiters; 2385108b42b4SDavid Howells }; 2386108b42b4SDavid Howells 2387108b42b4SDavid Howells struct rwsem_waiter { 2388108b42b4SDavid Howells struct list_head list; 2389108b42b4SDavid Howells struct task_struct *task; 2390108b42b4SDavid Howells }; 2391108b42b4SDavid Howells 2392108b42b4SDavid HowellsTo wake up a particular waiter, the up_read() or up_write() functions have to: 2393108b42b4SDavid Howells 2394108b42b4SDavid Howells (1) read the next pointer from this waiter's record to know as to where the 2395108b42b4SDavid Howells next waiter record is; 2396108b42b4SDavid Howells 239781fc6323SJarek Poplawski (2) read the pointer to the waiter's task structure; 2398108b42b4SDavid Howells 2399108b42b4SDavid Howells (3) clear the task pointer to tell the waiter it has been given the semaphore; 2400108b42b4SDavid Howells 2401108b42b4SDavid Howells (4) call wake_up_process() on the task; and 2402108b42b4SDavid Howells 2403108b42b4SDavid Howells (5) release the reference held on the waiter's task struct. 2404108b42b4SDavid Howells 2405108b42b4SDavid HowellsIn other words, it has to perform this sequence of events: 2406108b42b4SDavid Howells 2407108b42b4SDavid Howells LOAD waiter->list.next; 2408108b42b4SDavid Howells LOAD waiter->task; 2409108b42b4SDavid Howells STORE waiter->task; 2410108b42b4SDavid Howells CALL wakeup 2411108b42b4SDavid Howells RELEASE task 2412108b42b4SDavid Howells 2413108b42b4SDavid Howellsand if any of these steps occur out of order, then the whole thing may 2414108b42b4SDavid Howellsmalfunction. 2415108b42b4SDavid Howells 2416108b42b4SDavid HowellsOnce it has queued itself and dropped the semaphore lock, the waiter does not 2417108b42b4SDavid Howellsget the lock again; it instead just waits for its task pointer to be cleared 2418108b42b4SDavid Howellsbefore proceeding. Since the record is on the waiter's stack, this means that 2419108b42b4SDavid Howellsif the task pointer is cleared _before_ the next pointer in the list is read, 2420108b42b4SDavid Howellsanother CPU might start processing the waiter and might clobber the waiter's 2421108b42b4SDavid Howellsstack before the up*() function has a chance to read the next pointer. 2422108b42b4SDavid Howells 2423108b42b4SDavid HowellsConsider then what might happen to the above sequence of events: 2424108b42b4SDavid Howells 2425108b42b4SDavid Howells CPU 1 CPU 2 2426108b42b4SDavid Howells =============================== =============================== 2427108b42b4SDavid Howells down_xxx() 2428108b42b4SDavid Howells Queue waiter 2429108b42b4SDavid Howells Sleep 2430108b42b4SDavid Howells up_yyy() 2431108b42b4SDavid Howells LOAD waiter->task; 2432108b42b4SDavid Howells STORE waiter->task; 2433108b42b4SDavid Howells Woken up by other event 2434108b42b4SDavid Howells <preempt> 2435108b42b4SDavid Howells Resume processing 2436108b42b4SDavid Howells down_xxx() returns 2437108b42b4SDavid Howells call foo() 2438108b42b4SDavid Howells foo() clobbers *waiter 2439108b42b4SDavid Howells </preempt> 2440108b42b4SDavid Howells LOAD waiter->list.next; 2441108b42b4SDavid Howells --- OOPS --- 2442108b42b4SDavid Howells 2443108b42b4SDavid HowellsThis could be dealt with using the semaphore lock, but then the down_xxx() 2444108b42b4SDavid Howellsfunction has to needlessly get the spinlock again after being woken up. 2445108b42b4SDavid Howells 2446108b42b4SDavid HowellsThe way to deal with this is to insert a general SMP memory barrier: 2447108b42b4SDavid Howells 2448108b42b4SDavid Howells LOAD waiter->list.next; 2449108b42b4SDavid Howells LOAD waiter->task; 2450108b42b4SDavid Howells smp_mb(); 2451108b42b4SDavid Howells STORE waiter->task; 2452108b42b4SDavid Howells CALL wakeup 2453108b42b4SDavid Howells RELEASE task 2454108b42b4SDavid Howells 2455108b42b4SDavid HowellsIn this case, the barrier makes a guarantee that all memory accesses before the 2456108b42b4SDavid Howellsbarrier will appear to happen before all the memory accesses after the barrier 2457108b42b4SDavid Howellswith respect to the other CPUs on the system. It does _not_ guarantee that all 2458108b42b4SDavid Howellsthe memory accesses before the barrier will be complete by the time the barrier 2459108b42b4SDavid Howellsinstruction itself is complete. 2460108b42b4SDavid Howells 2461108b42b4SDavid HowellsOn a UP system - where this wouldn't be a problem - the smp_mb() is just a 2462108b42b4SDavid Howellscompiler barrier, thus making sure the compiler emits the instructions in the 24636bc39274SDavid Howellsright order without actually intervening in the CPU. Since there's only one 24646bc39274SDavid HowellsCPU, that CPU's dependency ordering logic will take care of everything else. 2465108b42b4SDavid Howells 2466108b42b4SDavid Howells 2467108b42b4SDavid HowellsATOMIC OPERATIONS 2468108b42b4SDavid Howells----------------- 2469108b42b4SDavid Howells 2470806654a9SWill DeaconWhile they are technically interprocessor interaction considerations, atomic 2471dbc8700eSDavid Howellsoperations are noted specially as some of them imply full memory barriers and 2472dbc8700eSDavid Howellssome don't, but they're very heavily relied on as a group throughout the 2473dbc8700eSDavid Howellskernel. 2474dbc8700eSDavid Howells 2475706eeb3eSPeter ZijlstraSee Documentation/atomic_t.txt for more information. 2476108b42b4SDavid Howells 2477108b42b4SDavid Howells 2478108b42b4SDavid HowellsACCESSING DEVICES 2479108b42b4SDavid Howells----------------- 2480108b42b4SDavid Howells 2481108b42b4SDavid HowellsMany devices can be memory mapped, and so appear to the CPU as if they're just 2482108b42b4SDavid Howellsa set of memory locations. To control such a device, the driver usually has to 2483108b42b4SDavid Howellsmake the right memory accesses in exactly the right order. 2484108b42b4SDavid Howells 2485108b42b4SDavid HowellsHowever, having a clever CPU or a clever compiler creates a potential problem 2486108b42b4SDavid Howellsin that the carefully sequenced accesses in the driver code won't reach the 2487108b42b4SDavid Howellsdevice in the requisite order if the CPU or the compiler thinks it is more 2488108b42b4SDavid Howellsefficient to reorder, combine or merge accesses - something that would cause 2489108b42b4SDavid Howellsthe device to malfunction. 2490108b42b4SDavid Howells 2491108b42b4SDavid HowellsInside of the Linux kernel, I/O should be done through the appropriate accessor 2492108b42b4SDavid Howellsroutines - such as inb() or writel() - which know how to make such accesses 2493806654a9SWill Deaconappropriately sequential. While this, for the most part, renders the explicit 249491553039SWill Deaconuse of memory barriers unnecessary, if the accessor functions are used to refer 249591553039SWill Deaconto an I/O memory window with relaxed memory access properties, then _mandatory_ 249691553039SWill Deaconmemory barriers are required to enforce ordering. 2497108b42b4SDavid Howells 24980fe397f0SHelmut GrohneSee Documentation/driver-api/device-io.rst for more information. 2499108b42b4SDavid Howells 2500108b42b4SDavid Howells 2501108b42b4SDavid HowellsINTERRUPTS 2502108b42b4SDavid Howells---------- 2503108b42b4SDavid Howells 2504108b42b4SDavid HowellsA driver may be interrupted by its own interrupt service routine, and thus the 2505108b42b4SDavid Howellstwo parts of the driver may interfere with each other's attempts to control or 2506108b42b4SDavid Howellsaccess the device. 2507108b42b4SDavid Howells 2508108b42b4SDavid HowellsThis may be alleviated - at least in part - by disabling local interrupts (a 2509108b42b4SDavid Howellsform of locking), such that the critical operations are all contained within 2510806654a9SWill Deaconthe interrupt-disabled section in the driver. While the driver's interrupt 2511108b42b4SDavid Howellsroutine is executing, the driver's core may not run on the same CPU, and its 2512108b42b4SDavid Howellsinterrupt is not permitted to happen again until the current interrupt has been 2513108b42b4SDavid Howellshandled, thus the interrupt handler does not need to lock against that. 2514108b42b4SDavid Howells 2515108b42b4SDavid HowellsHowever, consider a driver that was talking to an ethernet card that sports an 2516108b42b4SDavid Howellsaddress register and a data register. If that driver's core talks to the card 2517108b42b4SDavid Howellsunder interrupt-disablement and then the driver's interrupt handler is invoked: 2518108b42b4SDavid Howells 2519108b42b4SDavid Howells LOCAL IRQ DISABLE 2520108b42b4SDavid Howells writew(ADDR, 3); 2521108b42b4SDavid Howells writew(DATA, y); 2522108b42b4SDavid Howells LOCAL IRQ ENABLE 2523108b42b4SDavid Howells <interrupt> 2524108b42b4SDavid Howells writew(ADDR, 4); 2525108b42b4SDavid Howells q = readw(DATA); 2526108b42b4SDavid Howells </interrupt> 2527108b42b4SDavid Howells 2528108b42b4SDavid HowellsThe store to the data register might happen after the second store to the 2529108b42b4SDavid Howellsaddress register if ordering rules are sufficiently relaxed: 2530108b42b4SDavid Howells 2531108b42b4SDavid Howells STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA 2532108b42b4SDavid Howells 2533108b42b4SDavid Howells 2534108b42b4SDavid HowellsIf ordering rules are relaxed, it must be assumed that accesses done inside an 2535108b42b4SDavid Howellsinterrupt disabled section may leak outside of it and may interleave with 2536108b42b4SDavid Howellsaccesses performed in an interrupt - and vice versa - unless implicit or 2537108b42b4SDavid Howellsexplicit barriers are used. 2538108b42b4SDavid Howells 2539108b42b4SDavid HowellsNormally this won't be a problem because the I/O accesses done inside such 2540108b42b4SDavid Howellssections will include synchronous load operations on strictly ordered I/O 254191553039SWill Deaconregisters that form implicit I/O barriers. 2542108b42b4SDavid Howells 2543108b42b4SDavid Howells 2544108b42b4SDavid HowellsA similar situation may occur between an interrupt routine and two routines 2545108b42b4SDavid Howellsrunning on separate CPUs that communicate with each other. If such a case is 2546108b42b4SDavid Howellslikely, then interrupt-disabling locks should be used to guarantee ordering. 2547108b42b4SDavid Howells 2548108b42b4SDavid Howells 2549108b42b4SDavid Howells========================== 2550108b42b4SDavid HowellsKERNEL I/O BARRIER EFFECTS 2551108b42b4SDavid Howells========================== 2552108b42b4SDavid Howells 25534614bbdeSWill DeaconInterfacing with peripherals via I/O accesses is deeply architecture and device 25544614bbdeSWill Deaconspecific. Therefore, drivers which are inherently non-portable may rely on 25554614bbdeSWill Deaconspecific behaviours of their target systems in order to achieve synchronization 25564614bbdeSWill Deaconin the most lightweight manner possible. For drivers intending to be portable 25574614bbdeSWill Deaconbetween multiple architectures and bus implementations, the kernel offers a 25584614bbdeSWill Deaconseries of accessor functions that provide various degrees of ordering 25594614bbdeSWill Deaconguarantees: 2560108b42b4SDavid Howells 2561108b42b4SDavid Howells (*) readX(), writeX(): 2562108b42b4SDavid Howells 25630cde62a4SWill Deacon The readX() and writeX() MMIO accessors take a pointer to the 25640cde62a4SWill Deacon peripheral being accessed as an __iomem * parameter. For pointers 25650cde62a4SWill Deacon mapped with the default I/O attributes (e.g. those returned by 25660cde62a4SWill Deacon ioremap()), the ordering guarantees are as follows: 2567108b42b4SDavid Howells 25684614bbdeSWill Deacon 1. All readX() and writeX() accesses to the same peripheral are ordered 25699726840dSWill Deacon with respect to each other. This ensures that MMIO register accesses 25709726840dSWill Deacon by the same CPU thread to a particular device will arrive in program 25719726840dSWill Deacon order. 2572108b42b4SDavid Howells 25739726840dSWill Deacon 2. A writeX() issued by a CPU thread holding a spinlock is ordered 25749726840dSWill Deacon before a writeX() to the same peripheral from another CPU thread 25759726840dSWill Deacon issued after a later acquisition of the same spinlock. This ensures 25769726840dSWill Deacon that MMIO register writes to a particular device issued while holding 25779726840dSWill Deacon a spinlock will arrive in an order consistent with acquisitions of 25789726840dSWill Deacon the lock. 2579108b42b4SDavid Howells 25809726840dSWill Deacon 3. A writeX() by a CPU thread to the peripheral will first wait for the 25819726840dSWill Deacon completion of all prior writes to memory either issued by, or 25829726840dSWill Deacon propagated to, the same thread. This ensures that writes by the CPU 25839726840dSWill Deacon to an outbound DMA buffer allocated by dma_alloc_coherent() will be 25849726840dSWill Deacon visible to a DMA engine when the CPU writes to its MMIO control 25859726840dSWill Deacon register to trigger the transfer. 2586108b42b4SDavid Howells 25879726840dSWill Deacon 4. A readX() by a CPU thread from the peripheral will complete before 25889726840dSWill Deacon any subsequent reads from memory by the same thread can begin. This 25899726840dSWill Deacon ensures that reads by the CPU from an incoming DMA buffer allocated 25909726840dSWill Deacon by dma_alloc_coherent() will not see stale data after reading from 25919726840dSWill Deacon the DMA engine's MMIO status register to establish that the DMA 25929726840dSWill Deacon transfer has completed. 25939726840dSWill Deacon 25949726840dSWill Deacon 5. A readX() by a CPU thread from the peripheral will complete before 25959726840dSWill Deacon any subsequent delay() loop can begin execution on the same thread. 25969726840dSWill Deacon This ensures that two MMIO register writes by the CPU to a peripheral 25979726840dSWill Deacon will arrive at least 1us apart if the first write is immediately read 25989726840dSWill Deacon back with readX() and udelay(1) is called prior to the second 25999726840dSWill Deacon writeX(): 2600108b42b4SDavid Howells 26010cde62a4SWill Deacon writel(42, DEVICE_REGISTER_0); // Arrives at the device... 26020cde62a4SWill Deacon readl(DEVICE_REGISTER_0); 26030cde62a4SWill Deacon udelay(1); 26040cde62a4SWill Deacon writel(42, DEVICE_REGISTER_1); // ...at least 1us before this. 26050cde62a4SWill Deacon 26060cde62a4SWill Deacon The ordering properties of __iomem pointers obtained with non-default 26070cde62a4SWill Deacon attributes (e.g. those returned by ioremap_wc()) are specific to the 26080cde62a4SWill Deacon underlying architecture and therefore the guarantees listed above cannot 26090cde62a4SWill Deacon generally be relied upon for accesses to these types of mappings. 2610108b42b4SDavid Howells 26114614bbdeSWill Deacon (*) readX_relaxed(), writeX_relaxed(): 2612108b42b4SDavid Howells 2613a8e0aeadSWill Deacon These are similar to readX() and writeX(), but provide weaker memory 2614a8e0aeadSWill Deacon ordering guarantees. Specifically, they do not guarantee ordering with 26159726840dSWill Deacon respect to locking, normal memory accesses or delay() loops (i.e. 26169726840dSWill Deacon bullets 2-5 above) but they are still guaranteed to be ordered with 26179726840dSWill Deacon respect to other accesses from the same CPU thread to the same 26189726840dSWill Deacon peripheral when operating on __iomem pointers mapped with the default 26199726840dSWill Deacon I/O attributes. 26204614bbdeSWill Deacon 26214614bbdeSWill Deacon (*) readsX(), writesX(): 26224614bbdeSWill Deacon 26234614bbdeSWill Deacon The readsX() and writesX() MMIO accessors are designed for accessing 26244614bbdeSWill Deacon register-based, memory-mapped FIFOs residing on peripherals that are not 26254614bbdeSWill Deacon capable of performing DMA. Consequently, they provide only the ordering 26264614bbdeSWill Deacon guarantees of readX_relaxed() and writeX_relaxed(), as documented above. 26274614bbdeSWill Deacon 26284614bbdeSWill Deacon (*) inX(), outX(): 26294614bbdeSWill Deacon 26304614bbdeSWill Deacon The inX() and outX() accessors are intended to access legacy port-mapped 26314614bbdeSWill Deacon I/O peripherals, which may require special instructions on some 26324614bbdeSWill Deacon architectures (notably x86). The port number of the peripheral being 26334614bbdeSWill Deacon accessed is passed as an argument. 26344614bbdeSWill Deacon 26354614bbdeSWill Deacon Since many CPU architectures ultimately access these peripherals via an 26360cde62a4SWill Deacon internal virtual memory mapping, the portable ordering guarantees 26370cde62a4SWill Deacon provided by inX() and outX() are the same as those provided by readX() 26380cde62a4SWill Deacon and writeX() respectively when accessing a mapping with the default I/O 26390cde62a4SWill Deacon attributes. 26404614bbdeSWill Deacon 26414614bbdeSWill Deacon Device drivers may expect outX() to emit a non-posted write transaction 26424614bbdeSWill Deacon that waits for a completion response from the I/O peripheral before 26434614bbdeSWill Deacon returning. This is not guaranteed by all architectures and is therefore 26444614bbdeSWill Deacon not part of the portable ordering semantics. 26454614bbdeSWill Deacon 26464614bbdeSWill Deacon (*) insX(), outsX(): 26474614bbdeSWill Deacon 26484614bbdeSWill Deacon As above, the insX() and outsX() accessors provide the same ordering 26490cde62a4SWill Deacon guarantees as readsX() and writesX() respectively when accessing a 26500cde62a4SWill Deacon mapping with the default I/O attributes. 2651108b42b4SDavid Howells 26520cde62a4SWill Deacon (*) ioreadX(), iowriteX(): 2653108b42b4SDavid Howells 265481fc6323SJarek Poplawski These will perform appropriately for the type of access they're actually 2655108b42b4SDavid Howells doing, be it inX()/outX() or readX()/writeX(). 2656108b42b4SDavid Howells 26579726840dSWill DeaconWith the exception of the string accessors (insX(), outsX(), readsX() and 26589726840dSWill DeaconwritesX()), all of the above assume that the underlying peripheral is 26599726840dSWill Deaconlittle-endian and will therefore perform byte-swapping operations on big-endian 26609726840dSWill Deaconarchitectures. 26614614bbdeSWill Deacon 2662108b42b4SDavid Howells 2663108b42b4SDavid Howells======================================== 2664108b42b4SDavid HowellsASSUMED MINIMUM EXECUTION ORDERING MODEL 2665108b42b4SDavid Howells======================================== 2666108b42b4SDavid Howells 2667108b42b4SDavid HowellsIt has to be assumed that the conceptual CPU is weakly-ordered but that it will 2668108b42b4SDavid Howellsmaintain the appearance of program causality with respect to itself. Some CPUs 2669108b42b4SDavid Howells(such as i386 or x86_64) are more constrained than others (such as powerpc or 2670108b42b4SDavid Howellsfrv), and so the most relaxed case (namely DEC Alpha) must be assumed outside 2671108b42b4SDavid Howellsof arch-specific code. 2672108b42b4SDavid Howells 2673108b42b4SDavid HowellsThis means that it must be considered that the CPU will execute its instruction 2674108b42b4SDavid Howellsstream in any order it feels like - or even in parallel - provided that if an 267581fc6323SJarek Poplawskiinstruction in the stream depends on an earlier instruction, then that 2676108b42b4SDavid Howellsearlier instruction must be sufficiently complete[*] before the later 2677108b42b4SDavid Howellsinstruction may proceed; in other words: provided that the appearance of 2678108b42b4SDavid Howellscausality is maintained. 2679108b42b4SDavid Howells 2680108b42b4SDavid Howells [*] Some instructions have more than one effect - such as changing the 2681108b42b4SDavid Howells condition codes, changing registers or changing memory - and different 2682108b42b4SDavid Howells instructions may depend on different effects. 2683108b42b4SDavid Howells 2684108b42b4SDavid HowellsA CPU may also discard any instruction sequence that winds up having no 2685108b42b4SDavid Howellsultimate effect. For example, if two adjacent instructions both load an 2686108b42b4SDavid Howellsimmediate value into the same register, the first may be discarded. 2687108b42b4SDavid Howells 2688108b42b4SDavid Howells 2689108b42b4SDavid HowellsSimilarly, it has to be assumed that compiler might reorder the instruction 2690108b42b4SDavid Howellsstream in any way it sees fit, again provided the appearance of causality is 2691108b42b4SDavid Howellsmaintained. 2692108b42b4SDavid Howells 2693108b42b4SDavid Howells 2694108b42b4SDavid Howells============================ 2695108b42b4SDavid HowellsTHE EFFECTS OF THE CPU CACHE 2696108b42b4SDavid Howells============================ 2697108b42b4SDavid Howells 2698108b42b4SDavid HowellsThe way cached memory operations are perceived across the system is affected to 2699108b42b4SDavid Howellsa certain extent by the caches that lie between CPUs and memory, and by the 2700108b42b4SDavid Howellsmemory coherence system that maintains the consistency of state in the system. 2701108b42b4SDavid Howells 2702108b42b4SDavid HowellsAs far as the way a CPU interacts with another part of the system through the 2703108b42b4SDavid Howellscaches goes, the memory system has to include the CPU's caches, and memory 2704108b42b4SDavid Howellsbarriers for the most part act at the interface between the CPU and its cache 2705108b42b4SDavid Howells(memory barriers logically act on the dotted line in the following diagram): 2706108b42b4SDavid Howells 2707108b42b4SDavid Howells <--- CPU ---> : <----------- Memory -----------> 2708108b42b4SDavid Howells : 2709108b42b4SDavid Howells +--------+ +--------+ : +--------+ +-----------+ 2710108b42b4SDavid Howells | | | | : | | | | +--------+ 2711108b42b4SDavid Howells | CPU | | Memory | : | CPU | | | | | 2712108b42b4SDavid Howells | Core |--->| Access |----->| Cache |<-->| | | | 2713108b42b4SDavid Howells | | | Queue | : | | | |--->| Memory | 2714108b42b4SDavid Howells | | | | : | | | | | | 2715108b42b4SDavid Howells +--------+ +--------+ : +--------+ | | | | 2716108b42b4SDavid Howells : | Cache | +--------+ 2717108b42b4SDavid Howells : | Coherency | 2718108b42b4SDavid Howells : | Mechanism | +--------+ 2719108b42b4SDavid Howells +--------+ +--------+ : +--------+ | | | | 2720108b42b4SDavid Howells | | | | : | | | | | | 2721108b42b4SDavid Howells | CPU | | Memory | : | CPU | | |--->| Device | 2722108b42b4SDavid Howells | Core |--->| Access |----->| Cache |<-->| | | | 2723108b42b4SDavid Howells | | | Queue | : | | | | | | 2724108b42b4SDavid Howells | | | | : | | | | +--------+ 2725108b42b4SDavid Howells +--------+ +--------+ : +--------+ +-----------+ 2726108b42b4SDavid Howells : 2727108b42b4SDavid Howells : 2728108b42b4SDavid Howells 2729108b42b4SDavid HowellsAlthough any particular load or store may not actually appear outside of the 2730108b42b4SDavid HowellsCPU that issued it since it may have been satisfied within the CPU's own cache, 2731108b42b4SDavid Howellsit will still appear as if the full memory access had taken place as far as the 2732108b42b4SDavid Howellsother CPUs are concerned since the cache coherency mechanisms will migrate the 2733108b42b4SDavid Howellscacheline over to the accessing CPU and propagate the effects upon conflict. 2734108b42b4SDavid Howells 2735108b42b4SDavid HowellsThe CPU core may execute instructions in any order it deems fit, provided the 2736108b42b4SDavid Howellsexpected program causality appears to be maintained. Some of the instructions 2737108b42b4SDavid Howellsgenerate load and store operations which then go into the queue of memory 2738108b42b4SDavid Howellsaccesses to be performed. The core may place these in the queue in any order 2739108b42b4SDavid Howellsit wishes, and continue execution until it is forced to wait for an instruction 2740108b42b4SDavid Howellsto complete. 2741108b42b4SDavid Howells 2742108b42b4SDavid HowellsWhat memory barriers are concerned with is controlling the order in which 2743108b42b4SDavid Howellsaccesses cross from the CPU side of things to the memory side of things, and 2744108b42b4SDavid Howellsthe order in which the effects are perceived to happen by the other observers 2745108b42b4SDavid Howellsin the system. 2746108b42b4SDavid Howells 2747108b42b4SDavid Howells[!] Memory barriers are _not_ needed within a given CPU, as CPUs always see 2748108b42b4SDavid Howellstheir own loads and stores as if they had happened in program order. 2749108b42b4SDavid Howells 2750108b42b4SDavid Howells[!] MMIO or other device accesses may bypass the cache system. This depends on 2751108b42b4SDavid Howellsthe properties of the memory window through which devices are accessed and/or 2752108b42b4SDavid Howellsthe use of any special device communication instructions the CPU may have. 2753108b42b4SDavid Howells 2754108b42b4SDavid Howells 2755108b42b4SDavid HowellsCACHE COHERENCY VS DMA 2756108b42b4SDavid Howells---------------------- 2757108b42b4SDavid Howells 2758108b42b4SDavid HowellsNot all systems maintain cache coherency with respect to devices doing DMA. In 2759108b42b4SDavid Howellssuch cases, a device attempting DMA may obtain stale data from RAM because 2760108b42b4SDavid Howellsdirty cache lines may be resident in the caches of various CPUs, and may not 2761108b42b4SDavid Howellshave been written back to RAM yet. To deal with this, the appropriate part of 2762108b42b4SDavid Howellsthe kernel must flush the overlapping bits of cache on each CPU (and maybe 2763108b42b4SDavid Howellsinvalidate them as well). 2764108b42b4SDavid Howells 2765108b42b4SDavid HowellsIn addition, the data DMA'd to RAM by a device may be overwritten by dirty 2766108b42b4SDavid Howellscache lines being written back to RAM from a CPU's cache after the device has 276781fc6323SJarek Poplawskiinstalled its own data, or cache lines present in the CPU's cache may simply 276881fc6323SJarek Poplawskiobscure the fact that RAM has been updated, until at such time as the cacheline 276981fc6323SJarek Poplawskiis discarded from the CPU's cache and reloaded. To deal with this, the 277081fc6323SJarek Poplawskiappropriate part of the kernel must invalidate the overlapping bits of the 2771108b42b4SDavid Howellscache on each CPU. 2772108b42b4SDavid Howells 2773f556082dSAkira YokosawaSee Documentation/core-api/cachetlb.rst for more information on cache 2774f556082dSAkira Yokosawamanagement. 2775108b42b4SDavid Howells 2776108b42b4SDavid Howells 2777108b42b4SDavid HowellsCACHE COHERENCY VS MMIO 2778108b42b4SDavid Howells----------------------- 2779108b42b4SDavid Howells 2780108b42b4SDavid HowellsMemory mapped I/O usually takes place through memory locations that are part of 278181fc6323SJarek Poplawskia window in the CPU's memory space that has different properties assigned than 2782108b42b4SDavid Howellsthe usual RAM directed window. 2783108b42b4SDavid Howells 2784108b42b4SDavid HowellsAmongst these properties is usually the fact that such accesses bypass the 2785108b42b4SDavid Howellscaching entirely and go directly to the device buses. This means MMIO accesses 2786108b42b4SDavid Howellsmay, in effect, overtake accesses to cached memory that were emitted earlier. 2787108b42b4SDavid HowellsA memory barrier isn't sufficient in such a case, but rather the cache must be 2788108b42b4SDavid Howellsflushed between the cached memory write and the MMIO access if the two are in 2789108b42b4SDavid Howellsany way dependent. 2790108b42b4SDavid Howells 2791108b42b4SDavid Howells 2792108b42b4SDavid Howells========================= 2793108b42b4SDavid HowellsTHE THINGS CPUS GET UP TO 2794108b42b4SDavid Howells========================= 2795108b42b4SDavid Howells 2796108b42b4SDavid HowellsA programmer might take it for granted that the CPU will perform memory 279781fc6323SJarek Poplawskioperations in exactly the order specified, so that if the CPU is, for example, 2798108b42b4SDavid Howellsgiven the following piece of code to execute: 2799108b42b4SDavid Howells 28009af194ceSPaul E. McKenney a = READ_ONCE(*A); 28019af194ceSPaul E. McKenney WRITE_ONCE(*B, b); 28029af194ceSPaul E. McKenney c = READ_ONCE(*C); 28039af194ceSPaul E. McKenney d = READ_ONCE(*D); 28049af194ceSPaul E. McKenney WRITE_ONCE(*E, e); 2805108b42b4SDavid Howells 280681fc6323SJarek Poplawskithey would then expect that the CPU will complete the memory operation for each 2807108b42b4SDavid Howellsinstruction before moving on to the next one, leading to a definite sequence of 2808108b42b4SDavid Howellsoperations as seen by external observers in the system: 2809108b42b4SDavid Howells 2810108b42b4SDavid Howells LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E. 2811108b42b4SDavid Howells 2812108b42b4SDavid Howells 2813108b42b4SDavid HowellsReality is, of course, much messier. With many CPUs and compilers, the above 2814108b42b4SDavid Howellsassumption doesn't hold because: 2815108b42b4SDavid Howells 2816108b42b4SDavid Howells (*) loads are more likely to need to be completed immediately to permit 2817108b42b4SDavid Howells execution progress, whereas stores can often be deferred without a 2818108b42b4SDavid Howells problem; 2819108b42b4SDavid Howells 2820108b42b4SDavid Howells (*) loads may be done speculatively, and the result discarded should it prove 2821108b42b4SDavid Howells to have been unnecessary; 2822108b42b4SDavid Howells 282381fc6323SJarek Poplawski (*) loads may be done speculatively, leading to the result having been fetched 282481fc6323SJarek Poplawski at the wrong time in the expected sequence of events; 2825108b42b4SDavid Howells 2826108b42b4SDavid Howells (*) the order of the memory accesses may be rearranged to promote better use 2827108b42b4SDavid Howells of the CPU buses and caches; 2828108b42b4SDavid Howells 2829108b42b4SDavid Howells (*) loads and stores may be combined to improve performance when talking to 2830108b42b4SDavid Howells memory or I/O hardware that can do batched accesses of adjacent locations, 2831108b42b4SDavid Howells thus cutting down on transaction setup costs (memory and PCI devices may 2832108b42b4SDavid Howells both be able to do this); and 2833108b42b4SDavid Howells 2834806654a9SWill Deacon (*) the CPU's data cache may affect the ordering, and while cache-coherency 2835108b42b4SDavid Howells mechanisms may alleviate this - once the store has actually hit the cache 2836108b42b4SDavid Howells - there's no guarantee that the coherency management will be propagated in 2837108b42b4SDavid Howells order to other CPUs. 2838108b42b4SDavid Howells 2839108b42b4SDavid HowellsSo what another CPU, say, might actually observe from the above piece of code 2840108b42b4SDavid Howellsis: 2841108b42b4SDavid Howells 2842108b42b4SDavid Howells LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B 2843108b42b4SDavid Howells 2844108b42b4SDavid Howells (Where "LOAD {*C,*D}" is a combined load) 2845108b42b4SDavid Howells 2846108b42b4SDavid Howells 2847108b42b4SDavid HowellsHowever, it is guaranteed that a CPU will be self-consistent: it will see its 2848108b42b4SDavid Howells_own_ accesses appear to be correctly ordered, without the need for a memory 2849108b42b4SDavid Howellsbarrier. For instance with the following code: 2850108b42b4SDavid Howells 28519af194ceSPaul E. McKenney U = READ_ONCE(*A); 28529af194ceSPaul E. McKenney WRITE_ONCE(*A, V); 28539af194ceSPaul E. McKenney WRITE_ONCE(*A, W); 28549af194ceSPaul E. McKenney X = READ_ONCE(*A); 28559af194ceSPaul E. McKenney WRITE_ONCE(*A, Y); 28569af194ceSPaul E. McKenney Z = READ_ONCE(*A); 2857108b42b4SDavid Howells 2858108b42b4SDavid Howellsand assuming no intervention by an external influence, it can be assumed that 2859108b42b4SDavid Howellsthe final result will appear to be: 2860108b42b4SDavid Howells 2861108b42b4SDavid Howells U == the original value of *A 2862108b42b4SDavid Howells X == W 2863108b42b4SDavid Howells Z == Y 2864108b42b4SDavid Howells *A == Y 2865108b42b4SDavid Howells 2866108b42b4SDavid HowellsThe code above may cause the CPU to generate the full sequence of memory 2867108b42b4SDavid Howellsaccesses: 2868108b42b4SDavid Howells 2869108b42b4SDavid Howells U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A 2870108b42b4SDavid Howells 2871108b42b4SDavid Howellsin that order, but, without intervention, the sequence may have almost any 28729af194ceSPaul E. McKenneycombination of elements combined or discarded, provided the program's view 28739af194ceSPaul E. McKenneyof the world remains consistent. Note that READ_ONCE() and WRITE_ONCE() 28749af194ceSPaul E. McKenneyare -not- optional in the above example, as there are architectures 28759af194ceSPaul E. McKenneywhere a given CPU might reorder successive loads to the same location. 28769af194ceSPaul E. McKenneyOn such architectures, READ_ONCE() and WRITE_ONCE() do whatever is 28779af194ceSPaul E. McKenneynecessary to prevent this, for example, on Itanium the volatile casts 28789af194ceSPaul E. McKenneyused by READ_ONCE() and WRITE_ONCE() cause GCC to emit the special ld.acq 28799af194ceSPaul E. McKenneyand st.rel instructions (respectively) that prevent such reordering. 2880108b42b4SDavid Howells 2881108b42b4SDavid HowellsThe compiler may also combine, discard or defer elements of the sequence before 2882108b42b4SDavid Howellsthe CPU even sees them. 2883108b42b4SDavid Howells 2884108b42b4SDavid HowellsFor instance: 2885108b42b4SDavid Howells 2886108b42b4SDavid Howells *A = V; 2887108b42b4SDavid Howells *A = W; 2888108b42b4SDavid Howells 2889108b42b4SDavid Howellsmay be reduced to: 2890108b42b4SDavid Howells 2891108b42b4SDavid Howells *A = W; 2892108b42b4SDavid Howells 28939af194ceSPaul E. McKenneysince, without either a write barrier or an WRITE_ONCE(), it can be 28942ecf8101SPaul E. McKenneyassumed that the effect of the storage of V to *A is lost. Similarly: 2895108b42b4SDavid Howells 2896108b42b4SDavid Howells *A = Y; 2897108b42b4SDavid Howells Z = *A; 2898108b42b4SDavid Howells 28999af194ceSPaul E. McKenneymay, without a memory barrier or an READ_ONCE() and WRITE_ONCE(), be 29009af194ceSPaul E. McKenneyreduced to: 2901108b42b4SDavid Howells 2902108b42b4SDavid Howells *A = Y; 2903108b42b4SDavid Howells Z = Y; 2904108b42b4SDavid Howells 2905108b42b4SDavid Howellsand the LOAD operation never appear outside of the CPU. 2906108b42b4SDavid Howells 2907108b42b4SDavid Howells 2908108b42b4SDavid HowellsAND THEN THERE'S THE ALPHA 2909108b42b4SDavid Howells-------------------------- 2910108b42b4SDavid Howells 2911108b42b4SDavid HowellsThe DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that, 2912108b42b4SDavid Howellssome versions of the Alpha CPU have a split data cache, permitting them to have 291381fc6323SJarek Poplawskitwo semantically-related cache lines updated at separate times. This is where 2914f556082dSAkira Yokosawathe address-dependency barrier really becomes necessary as this synchronises 2915f556082dSAkira Yokosawaboth caches with the memory coherence system, thus making it seem like pointer 2916108b42b4SDavid Howellschanges vs new data occur in the right order. 2917108b42b4SDavid Howells 2918f28f0868SPaul E. McKenneyThe Alpha defines the Linux kernel's memory model, although as of v4.15 29198ca924aeSWill Deaconthe Linux kernel's addition of smp_mb() to READ_ONCE() on Alpha greatly 29208ca924aeSWill Deaconreduced its impact on the memory model. 2921108b42b4SDavid Howells 29220b6fa347SSeongJae Park 29236a65d263SMichael S. TsirkinVIRTUAL MACHINE GUESTS 29243dbf0913SSeongJae Park---------------------- 29256a65d263SMichael S. Tsirkin 29266a65d263SMichael S. TsirkinGuests running within virtual machines might be affected by SMP effects even if 29276a65d263SMichael S. Tsirkinthe guest itself is compiled without SMP support. This is an artifact of 29286a65d263SMichael S. Tsirkininterfacing with an SMP host while running an UP kernel. Using mandatory 29296a65d263SMichael S. Tsirkinbarriers for this use-case would be possible but is often suboptimal. 29306a65d263SMichael S. Tsirkin 29316a65d263SMichael S. TsirkinTo handle this case optimally, low-level virt_mb() etc macros are available. 29326a65d263SMichael S. TsirkinThese have the same effect as smp_mb() etc when SMP is enabled, but generate 29336a65d263SMichael S. Tsirkinidentical code for SMP and non-SMP systems. For example, virtual machine guests 29346a65d263SMichael S. Tsirkinshould use virt_mb() rather than smp_mb() when synchronizing against a 29356a65d263SMichael S. Tsirkin(possibly SMP) host. 29366a65d263SMichael S. Tsirkin 29376a65d263SMichael S. TsirkinThese are equivalent to smp_mb() etc counterparts in all other respects, 29386a65d263SMichael S. Tsirkinin particular, they do not control MMIO effects: to control 29396a65d263SMichael S. TsirkinMMIO effects, use mandatory barriers. 2940108b42b4SDavid Howells 29410b6fa347SSeongJae Park 294290fddabfSDavid Howells============ 294390fddabfSDavid HowellsEXAMPLE USES 294490fddabfSDavid Howells============ 294590fddabfSDavid Howells 294690fddabfSDavid HowellsCIRCULAR BUFFERS 294790fddabfSDavid Howells---------------- 294890fddabfSDavid Howells 294990fddabfSDavid HowellsMemory barriers can be used to implement circular buffering without the need 295090fddabfSDavid Howellsof a lock to serialise the producer with the consumer. See: 295190fddabfSDavid Howells 2952d8a121e3SMauro Carvalho Chehab Documentation/core-api/circular-buffers.rst 295390fddabfSDavid Howells 295490fddabfSDavid Howellsfor details. 295590fddabfSDavid Howells 295690fddabfSDavid Howells 2957108b42b4SDavid Howells========== 2958108b42b4SDavid HowellsREFERENCES 2959108b42b4SDavid Howells========== 2960108b42b4SDavid Howells 2961108b42b4SDavid HowellsAlpha AXP Architecture Reference Manual, Second Edition (Sites & Witek, 2962108b42b4SDavid HowellsDigital Press) 2963108b42b4SDavid Howells Chapter 5.2: Physical Address Space Characteristics 2964108b42b4SDavid Howells Chapter 5.4: Caches and Write Buffers 2965108b42b4SDavid Howells Chapter 5.5: Data Sharing 2966108b42b4SDavid Howells Chapter 5.6: Read/Write Ordering 2967108b42b4SDavid Howells 2968108b42b4SDavid HowellsAMD64 Architecture Programmer's Manual Volume 2: System Programming 2969108b42b4SDavid Howells Chapter 7.1: Memory-Access Ordering 2970108b42b4SDavid Howells Chapter 7.4: Buffering and Combining Memory Writes 2971108b42b4SDavid Howells 2972f1ab25a3SPaul E. McKenneyARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile) 2973f1ab25a3SPaul E. McKenney Chapter B2: The AArch64 Application Level Memory Model 2974f1ab25a3SPaul E. McKenney 2975108b42b4SDavid HowellsIA-32 Intel Architecture Software Developer's Manual, Volume 3: 2976108b42b4SDavid HowellsSystem Programming Guide 2977108b42b4SDavid Howells Chapter 7.1: Locked Atomic Operations 2978108b42b4SDavid Howells Chapter 7.2: Memory Ordering 2979108b42b4SDavid Howells Chapter 7.4: Serializing Instructions 2980108b42b4SDavid Howells 2981108b42b4SDavid HowellsThe SPARC Architecture Manual, Version 9 2982108b42b4SDavid Howells Chapter 8: Memory Models 2983108b42b4SDavid Howells Appendix D: Formal Specification of the Memory Models 2984108b42b4SDavid Howells Appendix J: Programming with the Memory Models 2985108b42b4SDavid Howells 2986f1ab25a3SPaul E. McKenneyStorage in the PowerPC (Stone and Fitzgerald) 2987f1ab25a3SPaul E. McKenney 2988108b42b4SDavid HowellsUltraSPARC Programmer Reference Manual 2989108b42b4SDavid Howells Chapter 5: Memory Accesses and Cacheability 2990108b42b4SDavid Howells Chapter 15: Sparc-V9 Memory Models 2991108b42b4SDavid Howells 2992108b42b4SDavid HowellsUltraSPARC III Cu User's Manual 2993108b42b4SDavid Howells Chapter 9: Memory Models 2994108b42b4SDavid Howells 2995108b42b4SDavid HowellsUltraSPARC IIIi Processor User's Manual 2996108b42b4SDavid Howells Chapter 8: Memory Models 2997108b42b4SDavid Howells 2998108b42b4SDavid HowellsUltraSPARC Architecture 2005 2999108b42b4SDavid Howells Chapter 9: Memory 3000108b42b4SDavid Howells Appendix D: Formal Specifications of the Memory Models 3001108b42b4SDavid Howells 3002108b42b4SDavid HowellsUltraSPARC T1 Supplement to the UltraSPARC Architecture 2005 3003108b42b4SDavid Howells Chapter 8: Memory Models 3004108b42b4SDavid Howells Appendix F: Caches and Cache Coherency 3005108b42b4SDavid Howells 3006108b42b4SDavid HowellsSolaris Internals, Core Kernel Architecture, p63-68: 3007108b42b4SDavid Howells Chapter 3.3: Hardware Considerations for Locks and 3008108b42b4SDavid Howells Synchronization 3009108b42b4SDavid Howells 3010108b42b4SDavid HowellsUnix Systems for Modern Architectures, Symmetric Multiprocessing and Caching 3011108b42b4SDavid Howellsfor Kernel Programmers: 3012108b42b4SDavid Howells Chapter 13: Other Memory Models 3013108b42b4SDavid Howells 3014108b42b4SDavid HowellsIntel Itanium Architecture Software Developer's Manual: Volume 1: 3015108b42b4SDavid Howells Section 2.6: Speculation 3016108b42b4SDavid Howells Section 4.4: Memory Access 3017