The mystery of stuck inactive msbuild.exe processes, locked Stylecop.dll, Nuget AccessViolationException and CI builds clashing with each other

Jon Rea · Nov 22, 2012 · Viewed 17.7k times

Observations:

  • On our Jenkins build server, we were seeing lots of msbuild.exe processes (~100) hanging around after job completion, each with around 20 MB of memory usage and 0% CPU activity.

  • Builds using different versions of StyleCop were intermittently failing:

    workspace\packages\StyleCop.MSBuild.4.7.41.0\tools\StyleCop.targets(109,7): error MSB4131: The "ViolationCount" parameter is not supported by the "StyleCopTask" task. Verify the parameter exists on the task, and it is a gettable public instance property.

  • Nuget.exe was intermittently exiting with the following access violation error (0xC0000005):

    .\workspace\.nuget\nuget install .\workspace\packages.config -o .\workspace\packages" exited with code -1073741819.
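
The exit code above is easier to recognise in hexadecimal. As a small sketch (the helper name `to_ntstatus` is made up for illustration), a signed 32-bit exit code can be reinterpreted as unsigned to recover the Windows status value:

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit process exit code as an unsigned hex value."""
    # Masking with 0xFFFFFFFF maps negative values onto their 32-bit
    # two's-complement representation.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073741819))  # prints 0xC0000005
```

0xC0000005 is the Windows status code for an access violation, which matches the crash described above.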

MsBuild was launched in the following way via a Jenkins Matrix job, with 'BuildInParallel' enabled:

    msbuild /t:%Targets% /m
    /p:Client=%Client%;LOCAL_BUILD=%LOCAL_BUILD%;BUILD_NUMBER=%BUILD_NUMBER%;
    JOB_NAME=%JOB_NAME%;Env=%Env%;Configuration=%Configuration%;Platform=%Platform%;
    Clean=%Clean%; %~dp0\_Jenkins\Build.proj

Answer

Jon Rea · Nov 22, 2012

After a lot of digging around and trying various things to no effect, I eventually ended up creating a new minimal solution which reproduced the issue with very little else going on. The issue turned out to be caused by msbuild's multi-core parallelisation - the 'm' parameter.

  • The 'm' parameter tells msbuild to spawn "nodes". These remain alive after the build has ended and are then re-used by new builds!
  • The StyleCop 'ViolationCount' error was caused by a build re-using an old version of StyleCop.dll that had been loaded from another build's workspace, in which ViolationCount was not supported. This was odd, because the CI workspace contained only the new version. It seems that once StyleCop.dll was loaded into a given MSBuild node, it remained loaded for the next build. I can only assume this is because StyleCop loads some sort of singleton into the node's process. This also explains the file locking between builds.
  • The NuGet access violation crash has now gone (with no other changes), so it was evidently related to the node re-use issue above.
  • As the 'm' parameter, when given no value, defaults to the number of cores, we were seeing 24 msbuild instances created on our build server for a given job.
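
One way to check whether nodes are lingering after a job is to count msbuild.exe processes once the build has finished. The following is only a rough sketch: the function names are made up for illustration, and it assumes the Windows `tasklist` command is available on the build server.

```python
import csv
import io
import subprocess


def count_image(tasklist_csv: str, image_name: str = "msbuild.exe") -> int:
    """Count rows in `tasklist /FO CSV /NH` output whose image name matches."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    # The first CSV column is the image name; compare case-insensitively.
    return sum(1 for row in rows if row and row[0].lower() == image_name.lower())


def lingering_msbuild_nodes() -> int:
    """Ask tasklist for msbuild.exe processes and count them (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq msbuild.exe", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_image(result.stdout)
```

On an otherwise idle build server, a non-zero result from `lingering_msbuild_nodes()` after a job has completed would suggest that nodes are being kept alive for re-use.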


The fix:

  • Add the line 'set MSBUILDDISABLENODEREUSE=1' to the batch file that launches msbuild
  • Launch msbuild with /m:4 /nr:false
  • The 'nr' parameter tells msbuild not to use "Node Reuse", so msbuild instances are closed after the build completes and no longer clash with each other, which is what caused the errors above.
  • The 'm' parameter is set to 4 to stop too many nodes spawning per job
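
If the build is launched from a script rather than directly from a batch file, the same settings can be applied there. This is a minimal sketch assuming a Python wrapper; the function names, the "Build" target and the project path are placeholders, not part of the original setup.

```python
import os
import subprocess


def msbuild_command(project, targets, max_nodes=4):
    """Build an msbuild argument list with capped parallelism and node reuse off."""
    return [
        "msbuild",
        project,
        f"/t:{targets}",
        f"/m:{max_nodes}",  # cap worker nodes instead of one per core
        "/nr:false",        # nodes exit when the build finishes
    ]


def msbuild_environment(base=None):
    """Copy the environment and add the variable that disables node reuse."""
    env = dict(os.environ if base is None else base)
    env["MSBUILDDISABLENODEREUSE"] = "1"
    return env


if __name__ == "__main__":
    cmd = msbuild_command(r"_Jenkins\Build.proj", "Build")  # placeholder values
    print(" ".join(cmd))
    # On a machine with msbuild on the PATH, the build could then be run with:
    # subprocess.run(cmd, env=msbuild_environment(), check=True)
```

Setting both the environment variable and /nr:false mirrors the belt-and-braces approach in the fix above; either one on its own may be enough, but that was not tested here.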