Java compare and swap semantics and performance

axel22 picture axel22 · Nov 15, 2010 · Viewed 11.3k times · Source

What is the semantics of compare and swap in Java? Namely, does the compare and swap method of an AtomicInteger just guarantee ordered access between different threads to the particular memory location of the atomic integer instance, or does it guarantee ordered access to all the locations in memory, i.e. it acts as if it were a volatile (a memory fence).

From the docs:

  • weakCompareAndSet atomically reads and conditionally writes a variable but does not create any happens-before orderings, so provides no guarantees with respect to previous or subsequent reads and writes of any variables other than the target of the weakCompareAndSet.
  • compareAndSet and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing volatile variables.

It's apparent from the API documentation that compareAndSet acts as if it were a volatile variable. However, weakCompareAndSet is supposed to just change its specific memory location. Thus, if that memory location is exclusive to the cache of a single processor, weakCompareAndSet is supposed to be much faster than the regular compareAndSet.

I'm asking this because I've benchmarked the following methods by running threadnum different threads, varying threadnum from 1 to 8, and having totalwork=1e9 (the code is written in Scala, a statically compiled JVM language, but both its meaning and bytecode translation are isomorphic to that of Java in this case - this short snippets should be clear):

val atomic_cnt = new AtomicInteger(0)
val atomic_tlocal_cnt = new java.lang.ThreadLocal[AtomicInteger] {
  override def initialValue = new AtomicInteger(0)
}

def loop_atomic_tlocal_cas = {
  var i = 0
  val until = totalwork / threadnum
  val acnt = atomic_tlocal_cnt.get
  while (i < until) {
    i += 1
    acnt.compareAndSet(i - 1, i)
  }
  acnt.get + i
}

def loop_atomic_weakcas = {
  var i = 0
  val until = totalwork / threadnum
  val acnt = atomic_cnt
  while (i < until) {
    i += 1
    acnt.weakCompareAndSet(i - 1, i)
  }
  acnt.get + i
}

def loop_atomic_tlocal_weakcas = {
  var i = 0
  val until = totalwork / threadnum
  val acnt = atomic_tlocal_cnt.get
  while (i < until) {
    i += 1
    acnt.weakCompareAndSet(i - 1, i)
  }
  acnt.get + i
}

on an AMD with 4 dual 2.8 GHz cores, and a 2.67 GHz 4-core i7 processor. The JVM is Sun Server Hotspot JVM 1.6. The results show no performance difference.

Specs: AMD 8220 4x dual-core @ 2.8 GHz

Test name: loop_atomic_tlocal_cas

  • Thread num.: 1

Run times: (showing last 3) 7504.562 7502.817 7504.626 (avg = 7415.637 min = 7147.628 max = 7504.886 )

  • Thread num.: 2

Run times: (showing last 3) 3751.553 3752.589 3751.519 (avg = 3713.5513 min = 3574.708 max = 3752.949 )

  • Thread num.: 4

Run times: (showing last 3) 1890.055 1889.813 1890.047 (avg = 2065.7207 min = 1804.652 max = 3755.852 )

  • Thread num.: 8

Run times: (showing last 3) 960.12 989.453 970.842 (avg = 1058.8776 min = 940.492 max = 1893.127 )


Test name: loop_atomic_weakcas

  • Thread num.: 1

Run times: (showing last 3) 7325.425 7057.03 7325.407 (avg = 7231.8682 min = 7057.03 max = 7325.45 )

  • Thread num.: 2

Run times: (showing last 3) 3663.21 3665.838 3533.406 (avg = 3607.2149 min = 3529.177 max = 3665.838 )

  • Thread num.: 4

Run times: (showing last 3) 3664.163 1831.979 1835.07 (avg = 2014.2086 min = 1797.997 max = 3664.163 )

  • Thread num.: 8

Run times: (showing last 3) 940.504 928.467 921.376 (avg = 943.665 min = 919.985 max = 997.681 )


Test name: loop_atomic_tlocal_weakcas

  • Thread num.: 1

Run times: (showing last 3) 7502.876 7502.857 7502.933 (avg = 7414.8132 min = 7145.869 max = 7502.933 )

  • Thread num.: 2

Run times: (showing last 3) 3752.623 3751.53 3752.434 (avg = 3710.1782 min = 3574.398 max = 3752.623 )

  • Thread num.: 4

Run times: (showing last 3) 1876.723 1881.069 1876.538 (avg = 4110.4221 min = 1804.62 max = 12467.351 )

  • Thread num.: 8

Run times: (showing last 3) 959.329 1010.53 969.767 (avg = 1072.8444 min = 959.329 max = 1880.049 )

Specs: Intel i7 quad-core @ 2.67 GHz

Test name: loop_atomic_tlocal_cas

  • Thread num.: 1

Run times: (showing last 3) 8138.3175 8130.0044 8130.1535 (avg = 8119.2888 min = 8049.6497 max = 8150.1950 )

  • Thread num.: 2

Run times: (showing last 3) 4067.7399 4067.5403 4068.3747 (avg = 4059.6344 min = 4026.2739 max = 4068.5455 )

  • Thread num.: 4

Run times: (showing last 3) 2033.4389 2033.2695 2033.2918 (avg = 2030.5825 min = 2017.6880 max = 2035.0352 )


Test name: loop_atomic_weakcas

  • Thread num.: 1

Run times: (showing last 3) 8130.5620 8129.9963 8132.3382 (avg = 8114.0052 min = 8042.0742 max = 8132.8542 )

  • Thread num.: 2

Run times: (showing last 3) 4066.9559 4067.0414 4067.2080 (avg = 4086.0608 min = 4023.6822 max = 4335.1791 )

  • Thread num.: 4

Run times: (showing last 3) 2034.6084 2169.8127 2034.5625 (avg = 2047.7025 min = 2032.8131 max = 2169.8127 )


Test name: loop_atomic_tlocal_weakcas

  • Thread num.: 1

Run times: (showing last 3) 8132.5267 8132.0299 8132.2415 (avg = 8114.9328 min = 8043.3674 max = 8134.0418 )

  • Thread num.: 2

Run times: (showing last 3) 4066.5924 4066.5797 4066.6519 (avg = 4059.1911 min = 4025.0703 max = 4066.8547 )

  • Thread num.: 4

Run times: (showing last 3) 2033.2614 2035.5754 2036.9110 (avg = 2033.2958 min = 2023.5082 max = 2038.8750 )


While it's possible that thread locals in the example above end up in the same cache lines, it seems to me that there is no observable performance difference between regular CAS and its weak version.

This could mean that, in fact, a weak compare and swap acts as fully fledged memory fence, i.e. acts as if it were a volatile variable.

Question: Is this observation correct? Also, is there a known architecture or Java distribution for which a weak compare and set is actually faster? If not, what is the advantage of using a weak CAS in the first place?

Answer

Daniel picture Daniel · Nov 15, 2010

The x86 instruction for "atomically compare and swap" is LOCK CMPXCHG. This instruction creates a full memory fence.

There is no instruction that does this job without creating a memory fence, so it is very likely that both compareAndSet and weakCompareAndSet map to LOCK CMPXCHG and perform a full memory fence.

But that's for x86, other architectures (including future variants of x86) may do things differently.