
Possible lifecycle race condition #7074

Closed
Qard opened this issue Jan 6, 2015 · 21 comments

Comments

@Qard

Qard commented Jan 6, 2015

As reported in IRC, I've been having issues with the install lifecycle step getting run for leveldown. The module folder is populated, but it doesn't even attempt to build the native addon. The issue was discovered on appneta/node-traceview. I've created a standalone package.json to reproduce the issue.

If I start removing deps from that package.json, the issue becomes intermittent, and it disappears entirely once I remove enough. It seems very likely that it's caused by some sort of race condition; I'm not yet sure where.

npm version: 2.1.17, 2.1.18
node version: 0.10.35
os: Ubuntu 14.10

@smikes
Contributor

smikes commented Jan 7, 2015

Is the symptom that npm install sometimes fails? I tried downloading the package.json and running npm install and (apart from the absence of postgresql) it worked fine.

Could you please expand on this bug report a bit more?

@Qard
Author

Qard commented Jan 7, 2015

It appears to install fine, but doesn't build the native binding. If you try require('leveldown') it will blow up.

@smikes
Contributor

smikes commented Jan 7, 2015

Ah, excellent. I will look at that later today. (Hi from -25° Edmonton, btw)


@smikes
Contributor

smikes commented Jan 8, 2015

Here is a smaller repro set that produces failure for me:

mkdir test; cd test
cat >package.json <<EOF
{
  "name": "leveldown-repro",
  "version": "1.0.0",
  "dependencies": {
    "cls-bluebird": "~1.0.0",
    "shimmer": "~1.0.0",
    "leveldown": "~1.0.0"
  }
}
EOF
rm -fr node_modules; npm install --silent && node -e 'console.log(require("leveldown"))'

fails with:

...
leveldown@1.0.0 node_modules/leveldown
├── bindings@1.2.1
├── fast-future@1.0.1
├── nan@1.3.0
└── abstract-leveldown@2.0.3 (xtend@3.0.0)

/Users/smikes/src/github/bar/node_modules/leveldown/node_modules/bindings/bindings.js:91
  throw err
        ^
Error: Could not locate the bindings file. Tried:
 → /Users/smikes/src/github/bar/node_modules/leveldown/build/leveldown.node
 → /Users/smikes/src/github/bar/node_modules/leveldown/build/Debug/leveldown.node
 → /Users/smikes/src/github/bar/node_modules/leveldown/build/Release/leveldown.node
 → /Users/smikes/src/github/bar/node_modules/leveldown/out/Debug/leveldown.node
 → /Users/smikes/src/github/bar/node_modules/leveldown/Debug/leveldown.node
 → /Users/smikes/src/github/bar/node_modules/leveldown/out/Release/leveldown.node
 → /Users/smikes/src/github/bar/node_modules/leveldown/Release/leveldown.node
 → /Users/smikes/src/github/bar/node_modules/leveldown/build/default/leveldown.node
 → /Users/smikes/src/github/bar/node_modules/leveldown/compiled/0.10.35/darwin/x64/leveldown.node
    at bindings (/Users/smikes/src/github/bar/node_modules/leveldown/node_modules/bindings/bindings.js:88:9)
    at Object.<anonymous> (/Users/smikes/src/github/bar/node_modules/leveldown/leveldown.js:4:46)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at [eval]:1:13
    at Object.<anonymous> ([eval]-wrapper:6:22)

Gist of npm install -ddd is here: https://gist.github.com/smikes/154a668f26520b6fe9f6

Fails reproducibly on OS X Yosemite (bash), with npm 2.1.17 and node 0.10.35.

@Qard
Author

Qard commented Jan 12, 2015

I can confirm that the smaller repro also fails to locate bindings on Ubuntu 14.10.

It currently fails on all npm 2.x versions, including next, and on both node 0.10.35 and 0.11.14.

@Qard
Author

Qard commented Apr 2, 2015

I'm pretty sure this is the same as #5400 and the several other issues referenced there. It still appears to be an issue as of npm 2.7.4.

I did some more testing and it seems to happen when a package.json depends on all of:

  • a native module
  • a module with peerDependencies
  • shimmer

Simple repro:

{
  "dependencies": {
    "shimmer": "~1.0.0",
    "cls-q": "~1.0.0",
    "weak": "*"
  }
}

Weirdly, the only constant seems to be shimmer. I can swap in any other native module, or any other module with peerDependencies, and still get the same issue.
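
A quick way to check whether the race was lost after an install (a minimal sketch; weak is the native module from this repro):

// check.js — run after npm install in the repro directory
try {
  require('weak')
  console.log('native binding built fine')
} catch (er) {
  // the "Could not locate the bindings file" error from earlier in this thread lands here
  console.error('install step was skipped:', er.message)
}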

@othiym23
Contributor

othiym23 commented Apr 2, 2015

I have literally no idea why shimmer could be causing this failure. It has no dependencies, no lifecycle scripts, and its devDependencies are themselves pretty simple. I'll dig into this a little and see what I can turn up. Thanks for the reduced repro case!

@Qard
Author

Qard commented Apr 2, 2015

Yeah, it's super weird. I suspect it's a race condition, and shimmer happens to take just long enough to install for it to surface. It may vary on other systems?

@othiym23
Contributor

othiym23 commented Apr 2, 2015

I can reliably reproduce this in my dev environment, so it's probably not a race condition. My current guess is that it's a peerDependency problem and it's mostly the presence of a regular dependency (with no dependencies of its own???) that's causing the build step to get skipped.

@othiym23
Contributor

othiym23 commented Apr 2, 2015

Wow. This is an entry for the WTF files:

bauchelain% curl $(npm view weak dist.tarball) | tar xf - package/package.json ; json scripts -f package/package.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7814  100  7814    0     0   184k      0 --:--:-- --:--:-- --:--:--  186k
{
  "test": "mocha -gc --reporter spec"
}

vs

bauchelain% npm view weak scripts --json

{
  "test": "mocha -gc --reporter spec",
  "install": "node-gyp rebuild"
}

That is, the scripts differ between the package.json in the tarball and the one on the registry. I'm very curious whether @TooTallNate has any idea how that happened, but the long and the short of it is that it's a weird edge case, and the only reason it's being hit is that you were right the first time: having enough other dependencies causes a race that leads to npm using the package.json from the tarball rather than the copy of the registry's version stored in the cache folder.
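
For reference, the same check scripted for an arbitrary package — a rough sketch, assuming npm, curl, and tar are on the PATH; it extracts package/package.json into the current directory:

// compare-scripts.js
'use strict'
const fs = require('fs')
const { execSync } = require('child_process')

const pkg = process.argv[2] || 'weak'

// what the registry metadata says
const out = execSync(`npm view ${pkg} scripts --json`).toString().trim()
const registryScripts = out ? JSON.parse(out) : {}

// what the published tarball actually contains
const tarball = execSync(`npm view ${pkg} dist.tarball`).toString().trim()
execSync(`curl -sL ${tarball} | tar xf - package/package.json`)
const tarballScripts = JSON.parse(fs.readFileSync('package/package.json', 'utf8')).scripts || {}

console.log('registry scripts:', registryScripts)
console.log('tarball scripts:', tarballScripts)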

I still need to understand how that happens before I can fix it, but at least now I no longer feel like I'm losing my marbles trying to figure out which piece of npm was surreptitiously deleting stuff from the scripts object.

@iarna
Contributor

iarna commented Apr 2, 2015

Oh-ho, I bet that's related to this bug: #7243

@othiym23
Contributor

othiym23 commented Apr 2, 2015

@iarna Maybe. The especially confounding detail I forgot to include is that you have to observe this happening in flight: by the end of the install process, it's the registry's package.json data that ends up in node_modules, even though mid-install, when the race is lost, what's on disk is the tarball's. I had to instrument npm to catch it in the act before I had enough of a clue to even check the tarball, because the data on disk was always correct after npm finished. This bug sucks.

@othiym23
Contributor

othiym23 commented Apr 2, 2015

I think @Qard's reduction includes exactly the set of moving pieces necessary to cause this bug to manifest:

  • a package where there's a discrepancy between what's on the registry and what's in the tarball, so that we can observe the build step reading from the package before it's completely written (i.e. before the registry's package.json data lands in node_modules, which marks the end of writing the package from the cache)
  • a package with peerDependencies, which cause new packages to be added to the install process mid-stream, which seems to be scrambling the installer's brain somehow
  • one or more ordinary packages to bump the timing around a little and cause the race to be lost

Now I get to read a bunch of debug logging output and try to figure out exactly what's happening.

@Qard
Author

Qard commented Apr 2, 2015

It seems every native module I try shows the same discrepancy: the install script is missing from the tarball's package.json.

@othiym23
Contributor

othiym23 commented Apr 2, 2015

Ah! OK, that probably explains it – there's a thing in normalize-package-data that looks for binding.gyp and adds an "install": "node-gyp rebuild" stanza if one is there. Unfortunately, that means I really have to fix this, because it's not some obscure corner case but a very important one.

We should also probably run normalize-package-data on publish so that the package/package.json in the tarball and what gets published to the registry match, but that's part 2 of this.

@Qard
Author

Qard commented Apr 2, 2015

From what I can tell, the readJson call here runs normalize-package-data internally, changing the data sent to the registry, but the changes are never saved back to the file, so the tar stream here does not include them.
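
A sketch of that behavior (assuming, as described above, that read-package-json's gyp detection fires on a plain readJson() call; the demo files are illustrative — run this in an empty directory):

// normalize-demo.js
'use strict'
const fs = require('fs')
const readJson = require('read-package-json')

// a manifest with no install script, next to a gyp file
fs.writeFileSync('package.json', JSON.stringify({ name: 'demo', version: '1.0.0' }))
fs.writeFileSync('binding.gyp', '{ "targets": [] }')

readJson('./package.json', function (er, data) {
  if (er) throw er
  console.log('in memory:', data.scripts) // { install: 'node-gyp rebuild' } added during normalization
  console.log('on disk:', fs.readFileSync('./package.json', 'utf8')) // unchanged
})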

@othiym23
Contributor

othiym23 commented Apr 3, 2015

I have figured this out! 🎉 🚑 🏥

Here is some log output from an instrumented build of npm. Note that lines 1–4 show npm writing node_modules/weak/package.json and then immediately calling readJSON() on the written file. What this shows is that read-package-json memoizes each package.json for the lifetime of a CLI process: the first version of package.json it reads is what it holds onto forever.

Most of the time this is fine, because the several versions of package.json that are heading to disk are functionally identical. But because of the way read-package-json automatically adds an "install": "node-gyp rebuild" stanza to scripts before package.json metadata is sent to the registry during npm publish, packages that include native bindings end up with different scripts stanzas in the tarball's package.json and the registry's. This sets up a race between the two different package.jsons to see which one makes it to disk before the first time readJSON() is called, and the winner of the race determines whether node-gyp gets run for the native bindings. So it's a race condition and a cache invalidation bug.
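
A minimal sketch of the failure mode (illustrative code, not read-package-json's internals; pkg.json here stands in for node_modules/weak/package.json):

// stale-read.js
'use strict'
const fs = require('fs')

const cache = Object.create(null)

// first-read-wins memoization, as described above
function readJSONMemoized (file) {
  if (!(file in cache)) cache[file] = JSON.parse(fs.readFileSync(file, 'utf8'))
  return cache[file]
}

// the tarball's package.json (no install script) wins the race to disk...
fs.writeFileSync('pkg.json', JSON.stringify({ scripts: {} }))
console.log(readJSONMemoized('pkg.json').scripts) // {}

// ...so when the registry's version (with the install script) lands later,
fs.writeFileSync('pkg.json', JSON.stringify({ scripts: { install: 'node-gyp rebuild' } }))
// the stale cached copy is still returned and node-gyp never runs
console.log(readJSONMemoized('pkg.json').scripts) // still {}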

There are two bugs here:

  1. npm should be writing the normalized package.json to the tarball as part of npm pack.
  2. package.json files in node_modules shouldn't be cached by readJSON(), or the cache should be invalidated when package.json is written.

Bug 2 is more important to fix, because there's a large overhang of these weird packages (basically every package with a native binding included) in the wild. Given the complexity of doing cache invalidation properly, I'm just going to rip the whole cache out of read-package-json and let the OS's filesystem-level cache take responsibility for holding onto package.json files.

othiym23 added a commit to npm/read-package-json that referenced this issue Apr 3, 2015
It counts if I fixed it by removing the cache, right? Part of taking
care of npm/npm#7074.
@Qard
Author

Qard commented Apr 3, 2015

Yikes! Those are definitely some good bugs to squash. Thanks for figuring this one out. 👍

othiym23 added a commit that referenced this issue Apr 3, 2015
A bunch of the tests are now getting more complete from / resolved
information because the contents of rewritten package.json files on disk
are no longer masked by `read-package-json` memoizing fs reads of
package.json files. Part of fixing #7074.
@othiym23
Contributor

othiym23 commented Apr 3, 2015

I think I may have inadvertently fixed bug 1 described above when doing some unrelated changes to the caching logic a while ago, but I will verify and make sure that any changes refer to this bug. Thanks to all, especially @Qard, for their help with this.

@smikes
Contributor

smikes commented Apr 3, 2015

Epic.


@othiym23
Contributor

othiym23 commented Apr 3, 2015

Bug 1 is not fixed, and fixing it will require either some reasonably extensive changes to fstream-npm to allow it to replace the package.json in the folder being packed with what comes out of read-package-json, or making the packing process two-phase, so that the new package.json can be put in place before being handed off to fstream-npm. I'll have to come back to this later.
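
For what it's worth, a rough sketch of what the two-phase option might look like — this is not npm's code; it assumes a modern Node for fs.cpSync and leans on npm pack from the command line:

// two-phase-pack.js — a sketch, not npm's implementation
'use strict'
const fs = require('fs')
const os = require('os')
const path = require('path')
const { execSync } = require('child_process')
const readJson = require('read-package-json')

const src = process.cwd()
const staging = fs.mkdtempSync(path.join(os.tmpdir(), 'npm-pack-'))

readJson(path.join(src, 'package.json'), function (er, normalized) {
  if (er) throw er
  // phase 1: stage a copy of the folder being packed
  fs.cpSync(src, staging, { recursive: true, filter: (p) => !p.includes('node_modules') })
  // phase 2: replace the staged manifest with the normalized data, then pack
  fs.writeFileSync(path.join(staging, 'package.json'), JSON.stringify(normalized, null, 2))
  execSync('npm pack', { cwd: staging, stdio: 'inherit' })
})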
