-
Notifications
You must be signed in to change notification settings - Fork 747
Fixes to integrate pythonnet into Unity #745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes to integrate pythonnet into Unity #745
Conversation
When this test is not ignored, we get a crash. The bug fixes are coming soon (there's three bugs).
Shut down python when the domain that loaded python reloads.
Also added comments explaining what the hashset is about.
1. Implement tp_traverse, tp_clear, and tp_is_gc in native code in memory that is *not* released upon domain reload. Effect: fixes the crash on domain reload. Side effect: leaks a page every domain reload. 2. Use python's platform package to determine what we're running on, so we can use the right mmap/mprotect or VirtualAlloc/VirtualProtect routines, the right flags for mmap, and (theoretically) the right native code. It should be easy to add another system, as long as it has the same kinds of functionality. 3. Enable the domain reload test. Added some unit tests for the new stuff. 4. Remove tp_traverse, tp_clear, and tp_is_gc where it was implemented. Note: I haven't actually tested on windows and linux yet, I'm checking in so I can do that easily.
Nothing major: - python platform.machine() returns x86_64 on mac, and AMD64 on windows, for the same platform. Parse them properly. - Microsoft's runtime C# compiler wants language version 2.0, so no $"{foo}" syntax in our debug printing. - Microsoft has decided you can't catch SEGV even in a unit test, so disable that test on windows. Most of my time was spent trying to convince VS to see the unit tests.
mac mmap allows it, so just set it on both platforms.
…nity-Technologies/pythonnet into uni-63112-hotreload-crash-test
…sh-test Uni 63112 hotreload crash test
Travis uses netstandard rather than 4.0 so it won't compile the unit test for domain reloading (everything else should go fine). Suggestion on how best to proceed? Disabling the unit test on travis seems sad, but netstandard doesn't seem to fully support application domains... which is what I'm testing. |
@benoithudson does AppDomain parts that you use work in Mono? By "netstandard" I presume you tested it in .NET Core? I'm ok with skipping otherwise. What is relation of this code to Unity Technologies? Note that at least 2 core developers should review and approve before this can be merged. |
This test doesn't compile on appveyor and travis, because we use APIs from .NET Framework 4.0 rather than .NET Standard. Even if it were fixed to compile, it would throw exceptions because .NET Core doesn't have application domains.
Codecov Report
@@ Coverage Diff @@
## master #745 +/- ##
=========================================
Coverage ? 77.16%
=========================================
Files ? 62
Lines ? 5675
Branches ? 897
=========================================
Hits ? 4379
Misses ? 1001
Partials ? 295
Continue to review full report at Codecov.
|
This works in Mono, indeed -- initial dev was on Darwin. I then ported the fix to windows using the microsoft toolchain, and to linux using mono. I didn't test on .NET Core myself, but that's what continuous integration is using. CI is failing to compile the test case (and the test case would fail trying to create domains even if it did compile). I've just pushed up disabling the test case. It's just unfortunate that this test case will have to rely on humans. Unity funded my team at Imaginary Spaces to integrate pythonnet for an internal project and to ship the results to the public by posting a PR to pythonnet, so anyone handy enough can use python in Unity. If you need anything from their legal I can ask them to provide it. |
Also disable the domain reload test under either NETSTANDARD or NETCOREAPP (the former was used in the runtime, the latter in embed_tests).
When run with mono or in visual studio, DllImport("__Internal", ...) works; but when run via dotnet, it fails to load the library somehow. mmap is in libc, so get it that way now. Exceptions thrown in NUnit setup/teardown get silently swallowed, so I also added try/catch to dynamic.cs to help debug future initialization exceptions.
We can't check memory protections in an NUnit tests under Microsoft .NET runtimes. I was assuming that meant windows only, but the travis CI pointed out that it can mean linux as well.
Seeing in the CPython code that this apparently a known problem (essentially https://github.com/python/cpython/blob/5903296045b586b9cd1fce0b1e02caf896028d1d/Python/pylifecycle.c#L84-L89), I don't think we have another choice but doing something hacky. Did you check if it was feasible to iterate over all still unreleased .NET objects and just overwriting their |
That crashes. I didn't investigate whether it was calling 0 or calling a default function, which then failed on metatype only. python3 is crashing in a new and interesting way to me; I initially developed on python3/net40/mac but apparently the netcore path has some important differences. |
The mess was caused by: 1. run a test that runs PythonEngine.Initialize() (probably implicitly) 2. run a test that does Runtime.Shutdown() 3. run a test that runs PythonEngine.Initialize() (probably implicitly) Test number 3 ends up with an engine that's initialized but a runtime that's shut down. Somehow that was OK under python27. Solution for now: don't shut down the runtime in tests.
I'll need help debugging those remaining CI errors. Appveyor fails the python34 tests on Travis is having a parse error on the pythonnet.sln file for the non-xplat tests. The solution file hasn't changed since January. The xplat tests pass. Is there a way to check if the master get a green light right now, or whether some new version of something killed it? The most recent green-light test run seems to be July. |
I get the same error running the commands from he travis CI manually on the master branch. Has something changed in the version of packages we download since the last PR merge in July? |
I can't repro the error that appveyor is having on python34 at home.
Fixes a build break on travis. Seems the latest nuget is finding msbuild 15, whereas pythonnet.sln is for msbuild 14.
I'm hoping this will allow finding MSVC for python 3.4?
The sub-domain writelines were coming all at once, not interwoven as expected. This seems to be the cause of an occasional TCP transport error on travis (we're writing to a closed pipe).
The previous attempt got further but got stuck trying to build psutil from source, with the infamous "error: INCLUDE environment variable is empty" error that means your windows setup won't be building wheels from source today.
This reverts commit f947e3b. It didn't work because conda isn't apparently available.
Someone else can try to fix the problem that psutil is distributed as source, but appveyor won't build it.
Having the domains both writing to the console seems to have been causing trouble.
Hallelujah! I can make a separate PR for the CI fixes if that's helpful. Travis is fixed by 14bc2e2. Use msbuild 14 rather than 15, now that travis has 15 available. Appveyor is "fixed" by making python 3.4 an allowed-fail in ec6f6c5 and 386a034. The issue is that psutil is a source distribution on py34 and there's some error compiling it. |
Awesome, great work! If you can split this out into a separate, squashed PR, I'll merge both. @denfromufa, fine with you as well? |
Let me review tomorrow
…On Thu, Oct 11, 2018, 11:50 PM Benedikt Reinartz ***@***.***> wrote:
Awesome, great work! If you can split this out into a separate, squashed
PR, I'll merge both. @denfromufa <https://github.com/denfromufa>, fine
with you as well?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#745 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AHgZ5VgqopZNFKkd1z11SoMsKtYBX4huks5ukB-RgaJpZM4XE3Kj>
.
|
OK, I've created a PR #751 just for the travis/appveyor. Once that's merged, I'll squash the domain reload work into a nice form. |
I'll close this PR, and create a new one with the squashed code. |
What does this implement/fix? Explain your changes.
This patch implements a fix to two successive crashes when integrating pythonnet into Unity3d, see #714
(1) On domain reload, shut down pythonnet. Otherwise there'll be a bunch of objects in python that point to functions implemented in C# code that isn't there anymore. In particular, the trick of replacing the import function means that on an import statement in the second domain, python is calling into deallocated memory.
Having fixed that crash we see a second crash, fixed by:
(2) Implement the garbage collection support in native code in memory that is not deallocated. Python's garbage collector keeps pointers to python objects that were allocated in the first domain and leaked on Py_Finalize. Python doesn't clear that list in Py_Initialize. This means it will try to collect these objects in the second domain. This led to a crash because garbage collection support for the python objects was implemented in C#, so on the first gc in the second domain, python was calling into deallocated memory.
Support for number 2:
(3) Use python's platform module to determine the platform at runtime so we know what native code to build and how to allocate an executable page. I'm preferring to check at runtime rather than at compile time so we can get away from shipping a .net DLL per configuration.
(4) Unit tests for domain reload, the platform module, and for allocating a page of memory.
Support is currently windows, linux, darwin; i386 and x86_64. Unit tests will fail with a descriptive message on other platforms.
Does this close any currently open issues?
#714 .
Any other comments?
It will be nice to fix a lot of the leaks, which I see some action on. But we can't rely on there being no leaks: client C# code can always XIncref some of their own stuff one too many times (or client C++ code). Best not to have everything come crashing down with inexplicable errors if client code made that mistake.
It seems like the only reason we really need the native code page hack is that we use the size field for special magic stuff in metatype.cs:
If we can fix that, we can avoid leaking a page per domain reload, and it'll be easier to port to other platforms.
Checklist
Check all those that are applicable and complete.
AUTHORS
CHANGELOG