Skip to content

Fixes to integrate pythonnet into Unity #745

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

benoithudson
Copy link
Contributor

What does this implement/fix? Explain your changes.

This patch implements a fix to two successive crashes when integrating pythonnet into Unity3d, see #714

(1) On domain reload, shut down pythonnet. Otherwise there'll be a bunch of objects in python that point to functions implemented in C# code that isn't there anymore. In particular, the trick of replacing the import function means that on an import statement in the second domain, python is calling into deallocated memory.

Having fixed that crash we see a second crash, fixed by:
(2) Implement the garbage collection support in native code in memory that is not deallocated. Python's garbage collector keeps pointers to python objects that were allocated in the first domain and leaked on Py_Finalize. Python doesn't clear that list in Py_Initialize. This means it will try to collect these objects in the second domain. This led to a crash because garbage collection support for the python objects was implemented in C#, so on the first gc in the second domain, python was calling into deallocated memory.

Support for number 2:
(3) Use python's platform module to determine the platform at runtime so we know what native code to build and how to allocate an executable page. I'm preferring to check at runtime rather than at compile time so we can get away from shipping a .net DLL per configuration.

(4) Unit tests for domain reload, the platform module, and for allocating a page of memory.

Support is currently windows, linux, darwin; i386 and x86_64. Unit tests will fail with a descriptive message on other platforms.

Does this close any currently open issues?

#714 .

Any other comments?

It will be nice to fix a lot of the leaks, which I see some action on. But we can't rely on there being no leaks: client C# code can always XIncref some of their own stuff one too many times (or client C++ code). Best not to have everything come crashing down with inexplicable errors if client code made that mistake.

It seems like the only reason we really need the native code page hack is that we use the size field for special magic stuff in metatype.cs:

            // Hmm - the standard subtype_traverse, clear look at ob_size to
            // do things, so to allow gc to work correctly we need to move
            // our hidden handle out of ob_size. Then, in theory we can
            // comment this out and still not crash.

If we can fix that, we can avoid leaking a page per domain reload, and it'll be easier to port to other platforms.

Checklist

Check all those that are applicable and complete.

  • [ x ] Make sure to include one or more tests for your change
  • [ n/a ] If an enhancement PR, please create docs and at best an example
  • [ x ] Add yourself to AUTHORS
  • [ x ] Updated the CHANGELOG

Benoit Hudson and others added 12 commits September 14, 2018 15:20
When this test is not ignored, we get a crash.

The bug fixes are coming soon (there's three bugs).
Shut down python when the domain that loaded python reloads.
Also added comments explaining what the hashset is about.
1. Implement tp_traverse, tp_clear, and tp_is_gc in native code in memory that
   is *not* released upon domain reload.

   Effect: fixes the crash on domain reload.
   Side effect: leaks a page every domain reload.

2. Use python's platform package to determine what we're running on, so we can
   use the right mmap/mprotect or VirtualAlloc/VirtualProtect routines, the
   right flags for mmap, and (theoretically) the right native code.

   It should be easy to add another system, as long as it has the same kinds of
   functionality.

3. Enable the domain reload test. Added some unit tests for the new stuff.

4. Remove tp_traverse, tp_clear, and tp_is_gc where it was implemented.

Note: I haven't actually tested on windows and linux yet, I'm checking in so I
can do that easily.
Nothing major:
- python platform.machine() returns x86_64 on mac, and AMD64 on windows,
	for the same platform. Parse them properly.
- Microsoft's runtime C# compiler wants language version 2.0, so no
	$"{foo}" syntax in our debug printing.
- Microsoft has decided you can't catch SEGV even in a unit test, so
	disable that test on windows.

Most of my time was spent trying to convince VS to see the unit tests.
mac mmap allows it, so just set it on both platforms.
@benoithudson
Copy link
Contributor Author

Travis uses netstandard rather than 4.0 so it won't compile the unit test for domain reloading (everything else should go fine).

Suggestion on how best to proceed? Disabling the unit test on travis seems sad, but netstandard doesn't seem to fully support application domains... which is what I'm testing.

@den-run-ai
Copy link
Contributor

@benoithudson does AppDomain parts that you use work in Mono? By "netstandard" I presume you tested it in .NET Core? I'm ok with skipping otherwise.

What is relation of this code to Unity Technologies?

Note that at least 2 core developers should review and approve before this can be merged.

This test doesn't compile on appveyor and travis, because we use APIs from .NET
Framework 4.0 rather than .NET Standard. Even if it were fixed to compile, it
would throw exceptions because .NET Core doesn't have application domains.
@codecov
Copy link

codecov bot commented Oct 3, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@b6417ca). Click here to learn what that means.
The diff coverage is 84.11%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #745   +/-   ##
=========================================
  Coverage          ?   77.16%           
=========================================
  Files             ?       62           
  Lines             ?     5675           
  Branches          ?      897           
=========================================
  Hits              ?     4379           
  Misses            ?     1001           
  Partials          ?      295
Flag Coverage Δ
#setup_linux 69.42% <100%> (?)
#setup_windows 76.35% <84.11%> (?)
Impacted Files Coverage Δ
src/runtime/extensiontype.cs 79.16% <ø> (ø)
src/runtime/classbase.cs 81.37% <ø> (ø)
src/runtime/pythonengine.cs 78.69% <100%> (ø)
setup.py 87.41% <100%> (ø)
src/runtime/typemanager.cs 82.65% <79.68%> (ø)
src/runtime/runtime.cs 91.15% <89.47%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6417ca...386a034. Read the comment docs.

@benoithudson
Copy link
Contributor Author

This works in Mono, indeed -- initial dev was on Darwin. I then ported the fix to windows using the microsoft toolchain, and to linux using mono.

I didn't test on .NET Core myself, but that's what continuous integration is using. CI is failing to compile the test case (and the test case would fail trying to create domains even if it did compile). I've just pushed up disabling the test case. It's just unfortunate that this test case will have to rely on humans.

Unity funded my team at Imaginary Spaces to integrate pythonnet for an internal project and to ship the results to the public by posting a PR to pythonnet, so anyone handy enough can use python in Unity. If you need anything from their legal I can ask them to provide it.

Benoit Hudson added 4 commits October 3, 2018 10:02
Also disable the domain reload test under either NETSTANDARD or NETCOREAPP (the
former was used in the runtime, the latter in embed_tests).
When run with mono or in visual studio, DllImport("__Internal", ...) works; but
when run via dotnet, it fails to load the library somehow. mmap is in libc, so
get it that way now.

Exceptions thrown in NUnit setup/teardown get silently swallowed, so I also
added try/catch to dynamic.cs to help debug future initialization exceptions.
We can't check memory protections in an NUnit tests under Microsoft .NET
runtimes. I was assuming that meant windows only, but the travis CI pointed out
that it can mean linux as well.
@benoithudson benoithudson changed the title Pythonnet master Fixes to integrate pythonnet into Unity Oct 3, 2018
@filmor
Copy link
Member

filmor commented Oct 4, 2018

Seeing in the CPython code that this apparently a known problem (essentially https://github.com/python/cpython/blob/5903296045b586b9cd1fce0b1e02caf896028d1d/Python/pylifecycle.c#L84-L89), I don't think we have another choice but doing something hacky.

Did you check if it was feasible to iterate over all still unreleased .NET objects and just overwriting their tp_traverse etc. slots with 0?

@benoithudson
Copy link
Contributor Author

That crashes. I didn't investigate whether it was calling 0 or calling a default function, which then failed on metatype only.

python3 is crashing in a new and interesting way to me; I initially developed on python3/net40/mac but apparently the netcore path has some important differences.

The mess was caused by:
1. run a test that runs PythonEngine.Initialize() (probably implicitly)
2. run a test that does Runtime.Shutdown()
3. run a test that runs PythonEngine.Initialize() (probably implicitly)
Test number 3 ends up with an engine that's initialized but a runtime that's
shut down.

Somehow that was OK under python27.

Solution for now: don't shut down the runtime in tests.
@benoithudson
Copy link
Contributor Author

I'll need help debugging those remaining CI errors.

Appveyor fails the python34 tests on pip install --upgrade -r requirements.txt --quiet which is before pythonnet even gets its boots on. It passes all other test cases.

Travis is having a parse error on the pythonnet.sln file for the non-xplat tests. The solution file hasn't changed since January. The xplat tests pass.

Is there a way to check if the master get a green light right now, or whether some new version of something killed it? The most recent green-light test run seems to be July.

@benoithudson
Copy link
Contributor Author

I get the same error running the commands from he travis CI manually on the master branch.

Has something changed in the version of packages we download since the last PR merge in July?

Benoit Hudson added 3 commits October 11, 2018 12:48
I can't repro the error that appveyor is having on python34 at home.
Fixes a build break on travis. Seems the latest nuget is finding msbuild 15, whereas pythonnet.sln is for msbuild 14.
I'm hoping this will allow finding MSVC for python 3.4?
Benoit Hudson added 6 commits October 11, 2018 15:24
The sub-domain writelines were coming all at once, not interwoven as expected.

This seems to be the cause of an occasional TCP transport error on travis (we're writing to a closed pipe).
The previous attempt got further but got stuck trying to build psutil from source, with the infamous "error: INCLUDE environment variable is empty" error that means your windows setup won't be building wheels from source today.
This reverts commit f947e3b.

It didn't work because conda isn't apparently available.
Someone else can try to fix the problem that psutil is distributed as source, but appveyor won't build it.
Having the domains both writing to the console seems to have been causing trouble.
@benoithudson
Copy link
Contributor Author

Hallelujah!

I can make a separate PR for the CI fixes if that's helpful.

Travis is fixed by 14bc2e2. Use msbuild 14 rather than 15, now that travis has 15 available.

Appveyor is "fixed" by making python 3.4 an allowed-fail in ec6f6c5 and 386a034. The issue is that psutil is a source distribution on py34 and there's some error compiling it.

@filmor
Copy link
Member

filmor commented Oct 12, 2018

Awesome, great work! If you can split this out into a separate, squashed PR, I'll merge both. @denfromufa, fine with you as well?

@den-run-ai
Copy link
Contributor

den-run-ai commented Oct 12, 2018 via email

@benoithudson
Copy link
Contributor Author

OK, I've created a PR #751 just for the travis/appveyor.

Once that's merged, I'll squash the domain reload work into a nice form.

@benoithudson
Copy link
Contributor Author

I'll close this PR, and create a new one with the squashed code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants