Are some implementations better than others for specific applications? Is there anything to gain by rolling your own?
Check out the description of the test-and-set machine instruction on Wikipedia, which explains how atomic operations are achieved at the machine level. I would expect most language-level mutex implementations to rely on machine-level support such as test-and-set or a similar atomic primitive like compare-and-swap.
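To make the idea concrete, here is a minimal sketch of a lock built directly on test-and-set, using C++'s `std::atomic_flag`. The names `SpinLock` and `run_counter_demo` are made up for illustration, and this is not how any particular library implements its mutex; it only shows the primitive at work.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical spinlock built on an atomic test-and-set flag.
class SpinLock {
public:
    void lock() {
        // test_and_set atomically sets the flag and returns its previous value.
        // A previous value of true means another thread holds the lock,
        // so keep retrying until we observe false.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();  // be polite while waiting
        }
    }

    void unlock() {
        // Clearing the flag releases the lock for the next waiter.
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical demo: several threads increment a shared counter under the lock.
// Returns the final counter value.
long run_counter_demo(int num_threads, int increments_per_thread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++counter;  // critical section
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

Calling `run_counter_demo(4, 10000)` should return 40000, since every increment happens inside the critical section. As for rolling your own: a spinlock like this burns CPU while it waits, whereas a typical `std::mutex` can put the waiting thread to sleep via the operating system, so a hand-rolled lock tends to pay off only when critical sections are very short and contention is low.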