Skip to content

Add soft shutdown #958

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 163 commits into from
Oct 8, 2020
Merged

Add soft shutdown #958

merged 163 commits into from
Oct 8, 2020

Conversation

amos402
Copy link
Member

@amos402 amos402 commented Sep 17, 2019

What does this implement/fix? Explain your changes.

An upgraded version for #754
An implementation for #957
Also included some minor fixes.

Still simple and roughly, reset all GC objects by a save state, the old domain objects will be leaking and unaccessible.
Here's my next plan, I think I can replace all the clr objects slots to solid callbacks, that may help the memory leaking.

How to use: use PythonEngine.Initialize(softShutdown: true) for startup.

Updated Details:

  • Changes from above: objects in GC chains will not remove by default, it can be setting by RuntimeState.ShouldRestoreObjects, and modules which not in sys.modules after just call initial CPython will be remove from sys.modules on soft shutdown mode.
  • Multiple fixes about memory leaks and reference count errors.
  • The heap types which represent managed types will degenerate to normal type objects by set their slot callbacks to some function that builtin CPython regardless of normal mode or soft shutdown mode. Before their degenerations, through using tp_travese and tp_clear to release their sub objects, then use SlotsHolder.ReleaseTypeSlots reset their slots.
    Points can be optimize for the above mechanism:
    • tp_travese and tp_clear are not necessary for all time, they will become overhead for CPython's GC, maybe we can just use C# methods instead.
    • Maybe name of SlotsHolder is not suitable because it also handle the memory release.
  • tp_is_gc and some other slots are no need to be override at start, they can inherit from PyType_Type by default.
  • Interop do not holding the delegate objects by calling GetThunk anymore, the SlotsHolder will take care of them.
  • __import__ will restore at shutdown.

Does this close any currently open issues?

#957

Any other comments?

Checklist

Check all those that are applicable and complete.

  • Make sure to include one or more tests for your change
  • If an enhancement PR, please create docs and at best an example
  • Add yourself to AUTHORS
  • Updated the CHANGELOG

@lostmsu
Copy link
Member

lostmsu commented Sep 17, 2019

Can you explain how does soft shutdown work in more details?

@amos402
Copy link
Member Author

amos402 commented Sep 17, 2019

The base concept is, after Py_Initialize, save all GC objects and modules, and reset them when you shutting down.
Since the domain will be unload, the saved state cannot use any C# data types, also we need a dummy GC for storing the old objects for make sure some dealloc operations can call _PyObject_GC_UnTrack macro without a nullptr error. You can see the details in runtime_state.cs.
After discuss with @mjmvisser, he points some problems that relate restart Python will crash the PyQt. So I try to make a solution for that. I was intend to create two modes, one for reset all GC objects and modules(for now) and one for reset all slots of CLR types after previous mode, but I just saw this issue(#957), so I push it in advance.
Well, although my computer pass all tests, but the CI still not, I still need some fixes for the CI.

@benoithudson
Copy link
Contributor

I'm unsure how this relates to #957. Seems to be more about improving the GC handling?

The prototype that @BadSingleton did was, in implementation, fairly simple (took us a while to figure it out of course):

  1. Don't Py_Finalize in Shutdown
  2. On Shutdown, go through the TypeManager and set all the slots to zero (except the gc ones).

@amos402
Copy link
Member Author

amos402 commented Sep 18, 2019

Well, Python crash is connect with GC handling, after domain unload, when the GC collect occurred, it will traverse all GC objects, but the memory from old domain has been invalid, any access will result to crash.
If you just take them from GC chains, you can avoid access those invalid memory access to prevent crash. It maybe simple, but lots of memory will leak. So I need to do a lot of work to reduce the leaking, but it still cannot avoid them at all.

@@ -39,6 +39,48 @@ public static void Reset()
cache = new Dictionary<Type, ClassBase>(128);
}

public static IList<ClassBase> GetManagedTypes()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be internal. Also, I don't see it used anywhere.

This method also needs a XML doc note, that it is not thread-safe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why internal? The class is a internal already, and as a getter, I prefer use public.
Hummm, seems I only use it in my previous version, I will check it latter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to address thread safety. If called from different threads this can cause reading stale values from memory.

@@ -28,7 +28,7 @@ public ExtensionType()

IntPtr py = Runtime.PyType_GenericAlloc(tp, 0);

GCHandle gc = GCHandle.Alloc(this);
GCHandle gc = AllocGCHandle(true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AllocGCHandle(true) - specify the parameter name here - it is unclear what true means.
E.g. AllocGCHandle(owned: true)

Comment on lines 267 to 275
private static IntPtr GetObjectDict(IntPtr ob)
{
return Marshal.ReadIntPtr(ob, ObjectOffset.DictOffset(ob));
}

private static void SetObjectDict(IntPtr ob, IntPtr value)
{
Marshal.WriteIntPtr(ob, ObjectOffset.DictOffset(ob), value);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add a XML doc to both methods, mentioning, that neither of them affect refcounts.

var gc = PyImport_ImportModule("gc");
var get_objects = PyObject_GetAttrString(gc, "get_objects");
var objs = PyObject_CallObject(get_objects, IntPtr.Zero);
var length = PyList_Size(objs);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, check the return values of these calls.

Also, wrap code, operating on objs and gc in try .. finally ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, checking return values is necessary although using these builtin functions will be failed by less probability.
But the try is not necessary, they're all C functions actually.

@benoithudson
Copy link
Contributor

benoithudson commented Nov 7, 2019

I've summarized my understanding in this document: https://docs.google.com/document/d/1a9OLsdKHXJ6MxHjo0WlVRcfjAiaP5hONNKUKKH6Tr-o

I see two PRs that would be easy to pull out of amos' code and check in right now:

I don't yet understand what the gc chains thing is trying to address. I also don't understand how amos handles the modules. I suspect they're related.

It seems like amos is turning off python gc for the proxy objects. If that works, we should just turn it all off and then we can eliminate the native code page. But that's yet another PR.

There seems to be some refcount fixes as well. If we can get those fixes in separate PRs that makes it easier to review.

* Drop C module dependency when getting _PyObject_NextNotImplemented
* Exception details for SetNoSiteFlag
filmor pushed a commit that referenced this pull request Dec 2, 2019
* Add exception helper
* Fixed refcnt error in ExtensionType.FinalizeObject
* Fixed typename leaking
* Fix refcnt error by using `using`
@@ -11,8 +11,7 @@ namespace Python.Runtime
public bool IsNull => this.pointer == IntPtr.Zero;

/// <summary>Gets a raw pointer to the Python object</summary>
public IntPtr DangerousGetAddress()
=> this.IsNull ? throw new NullReferenceException() : this.pointer;
public IntPtr DangerousGetAddress() => this.pointer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be changed. If a function is needed, that allows access to the raw pointer when it is null, it should be a separate function for clarity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add an unchecked version

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That checking is not necessary, because the method already named DangerousXXX, why it even can't obtain a null pointer. If it needs some null validation, it should do by the invoker not the internal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amos402 if the check is missed by accident by the invoker, the bug would be much easier to trace if it fails at the first access rather than much later on, when the raw null pointer is actually dereferenced.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I concern is the method conformed the single responsibility or not. It is unfit that a method like this can't obtain a null pointer. For example, std::shared_ptr<T>::get can illustrate this point.
Anyway, I don't want to argue about this, if you insist, just check it, I don't mind it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree with @amos402, a method prefixed with Dangerous should let me do dangerous things. However, I think this modification should go in another PR as it modifies the current behaviour, and adding a checked version would add up a lot of changes all over the place.

Comment on lines +134 to +138
if (_isInitialized)
{
return;
}
_isInitialized = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not an issue in this PR, but static methods generally should be thread-safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressing this is not necessary for acceptance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's to avoid double initialization of the runtime. With domain reloads we can't rely anymore on Py_IsInitialized.


// Initialize modules that depend on the runtime class.
AssemblyManager.Initialize();
#if !NETSTANDARD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid compile-time checks, as we eventually want to produce a single .NET Standard DLL. Can you replace this with a runtime check for AppDomain unloading support.

Marshal.WriteIntPtr(mem, (IntPtr)ms.Length);
Marshal.Copy(data, 0, mem + IntPtr.Size, (int)ms.Length);

IntPtr capsule = PySys_GetObject("clr_data").DangerousGetAddress();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't this work with BorrowedReference directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could, but it would not improve readability. While we do get a borrowed reference, we really do own this reference.

Copy link
Member

@lostmsu lostmsu Sep 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not so sure. It took some time to figure out, that in 73-80 capsule is borrowed, in 80-82 it is a new object, and that in 81 the reference to it is not being stolen, and required decref in 82.

Maybe extract clearing of existing clr_data into a method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I get what you are saying now. I'll factor out the clearing.

BadSingleton and others added 10 commits September 15, 2020 14:31
Even if netstandard lacks (for now) the necessary APIs for domain
(un)loading, it is still possible to (un)load domains via the Mono C
runtime.
…e pointer"

We Don't support python 2 anymore, but the CI machines may still be
using it to build.
Also adds implicit IntPtr conversion operators to simplify their use.
…mments-3

* Remove compile-time check on NETSTANDARD
* Extract method `ClearCLRData`
@lostmsu
Copy link
Member

lostmsu commented Sep 29, 2020

@BadSingleton / @amos402 can you please merge from master?

@codecov-io
Copy link

codecov-io commented Oct 8, 2020

Codecov Report

Merging #958 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #958   +/-   ##
=======================================
  Coverage   86.25%   86.25%           
=======================================
  Files           1        1           
  Lines         291      291           
=======================================
  Hits          251      251           
  Misses         40       40           
Flag Coverage Δ
#setup_linux 64.94% <ø> (ø)
#setup_windows 72.50% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c79be84...8d00e4c. Read the comment docs.

Comment on lines +112 to +121
if (!FullGCCollect())
{
Assert.IsTrue(WaitForCollected(op, hash, 10000));
}
Assert.IsFalse(shortWeak.IsAlive);
garbage = Finalizer.Instance.GetCollectedObjects();
Assert.IsNotEmpty(garbage, "The garbage object should be collected");
Assert.IsTrue(garbage.Any(r => ReferenceEquals(r.Target, longWeak.Target)),
"Garbage should contains the collected object");

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, avoid leaving partial changes in the history.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not partial, it's a complete one, although it can't pass on all runtimes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants