Improve Python.NET interop performance #31
Conversation
Nice performance improvement!
- It would be nice to add more comments for the different changes, so it's easier to understand and verify that the new implementation is equivalent to the previous one.
- As with any performance change, testing and measuring are key. It would be good to add some testing information and results to the PR showcasing the improvements, and to note how these changes were tested (Windows/Linux, with what data, etc.).
- Wondering if these changes have been presented at https://github.com/pythonnet/pythonnet?
src/runtime/Util.cs
{
    byte* address = (byte*)ptr + byteOffset;
    *((IntPtr*)address) = value;
}
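For reference, here is a self-contained sketch of how such an unchecked write helper might look in full (the class and method names are assumptions for illustration, not necessarily the PR's exact code):

using System;

// Sketch only; compiling it requires the unsafe option.
internal static unsafe class FastMemorySketch
{
    // Writes a pointer-sized value at ptr + byteOffset directly, skipping the
    // null handling and unaligned-write fallback that Marshal.WriteIntPtr does.
    public static void WriteIntPtr(IntPtr ptr, int byteOffset, IntPtr value)
    {
        byte* address = (byte*)ptr + byteOffset;
        *((IntPtr*)address) = value;
    }
}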
I believe the Marshal implementation referenced at https://github.com/dotnet/coreclr/blob/e083b2a4ab3045450005645dab8c009574a75d58/src/System.Private.CoreLib/shared/System/Runtime/InteropServices/Marshal.cs#L488 corresponds to .NET Core, but PythonNet is using .NET Framework. The decompiled implementation on my Windows machine (at `Windows\Microsoft.NET\Framework\v4.0.30319\mscorlib.dll`) points to the following code, which doesn't seem to be doing any alignment check:
[SecurityCritical]
public static unsafe void WriteInt64(IntPtr ptr, int ofs, long val)
{
    try
    {
        byte* numPtr1 = (byte*) ((IntPtr) (void*) ptr + ofs);
        if (((int) numPtr1 & 7) == 0)
        {
            *(long*) numPtr1 = val;
        }
        else
        {
            byte* numPtr2 = (byte*) &val;
            *numPtr1 = *numPtr2;
            numPtr1[1] = numPtr2[1];
            numPtr1[2] = numPtr2[2];
            numPtr1[3] = numPtr2[3];
            numPtr1[4] = numPtr2[4];
            numPtr1[5] = numPtr2[5];
            numPtr1[6] = numPtr2[6];
            numPtr1[7] = numPtr2[7];
        }
    }
    catch (NullReferenceException ex)
    {
        throw new AccessViolationException();
    }
}
Maybe these changes were tested using .NET Core instead?
This looks like an alignment check: if (((int) numPtr1 & 7) == 0)
}

public static readonly int ob_data;
public static readonly int ob_dict;
I think `ob_dict` and `ob_data` are never being assigned?
These are assigned via reflection in the static class constructor.
These should likely be `private`, and perhaps a comment stating that they're assigned/accessed via reflection.
@mchandschuh the visibility of these follows the visibility of similar fields in other *Offset classes. One of the reasons for that is so reflection can discern between fields that map to Python object fields and fields that are necessary for this class's methods to function properly.
I don't think a comment here will do them any good, as in the next implementation they might be set by different means, and comments like this tend to be overlooked.
I intend to keep this part as is, if there are no further objections.
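For illustration, here is a rough sketch of the pattern being discussed, i.e. public offset fields populated via reflection in a static constructor (names and values are assumed placeholders, not the PR's actual code):

using System;
using System.Reflection;

static class OffsetsSketch
{
    // Public fields map to Python object fields and are picked up by the
    // reflection scan below; helper state is kept non-public so the scan
    // does not see it.
    public static readonly int ob_data;
    public static readonly int ob_dict;

    // Assumed base size of the native object header (placeholder value).
    private static readonly int baseSize = 4 * IntPtr.Size;

    static OffsetsSketch()
    {
        // Assign each public offset field a pointer-sized slot past baseSize.
        // Setting the readonly fields via reflection is allowed here because
        // we are still inside the type initializer.
        FieldInfo[] fields = typeof(OffsetsSketch)
            .GetFields(BindingFlags.Public | BindingFlags.Static);
        for (int i = 0; i < fields.Length; i++)
        {
            fields[i].SetValue(null, baseSize + i * IntPtr.Size);
        }
    }
}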
src/runtime/interop.cs
public static int Size { get { return size; } }

static readonly int size;
Maybe, to improve simplicity, this could be turned into an auto-property.
I wanted to keep the implementation explicit to ensure that the backing field is not found by reflection (though I agree that would keep it `private`).
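As a side note, a small self-contained sketch of the concern (type names here are hypothetical): an auto-property still has a compiler-generated backing field, which a reflection scan over the type's fields can pick up.

using System;
using System.Reflection;

static class AutoPropertySketch
{
    // An auto-property: the compiler emits a hidden static backing field
    // named "<Size>k__BackingField" behind it.
    public static int Size { get; } = IntPtr.Size;
}

static class Demo
{
    static void Main()
    {
        // A scan over non-public static fields finds the compiler-generated
        // backing field, so offset-scanning code would have to filter it out.
        foreach (FieldInfo f in typeof(AutoPropertySketch)
                     .GetFields(BindingFlags.NonPublic | BindingFlags.Static))
        {
            Console.WriteLine(f.Name); // "<Size>k__BackingField"
        }
    }
}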
public static int TypeDictOffset(IntPtr type)
{
    return ManagedDataOffsets.DictOffset(type);
Could you please provide more detailed comments on why the new `TypeDictOffset()` and `magic()` implementations are equivalent to the previous ones?
I mentioned that in the PR: Python.NET always adds these fields to the end of the objects it allocates. So offsets to them can be computed as `objAddr + objSize - const`, no matter the type of the object.
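To make that concrete, here is a minimal sketch of the idea (the constant values and method names are assumptions, not necessarily the PR's exact code):

using System;

internal static class ManagedDataOffsetsSketch
{
    // Python.NET is assumed here to append two pointer-sized managed slots to
    // the end of every object it allocates: one for the managed data handle
    // and one for the instance dictionary.
    private static readonly int DataSlot = 2 * IntPtr.Size;
    private static readonly int DictSlot = 1 * IntPtr.Size;

    // objectSize would be read from the wrapper's type at runtime
    // (e.g. tp_basicsize); no per-call queries to the interpreter are needed.
    public static int DataOffset(int objectSize) => objectSize - DataSlot;
    public static int DictOffset(int objectSize) => objectSize - DictSlot;
}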
It's nice to leave comments describing the why in the source code, and even in the commit messages, as it makes understanding the code later much easier. Consider a time, far, far away, when GitHub ceases to exist: this perfect comment describing the why of these offsets would be lost forever :)
That's why I put the explanation into the commit message of the squashed commit.
I don't think it makes sense to do that in the code, as the old code will be gone after making this change. I clarified it a bit in the review itself, though.
I only tested on a Windows box with sample data from https://github.com/QuantConnect/Lean/tree/master/Data . As suggested by Jared, I ran the random data generator to fill the date range 2010-2018. I put the following into the configuration:

// algorithm class selector
"algorithm-type-name": "C:\\Users\\SCRAMBLED\\Projects\\QuantConnect\\Lean\\Algorithm.Python\\Benchmarks\\IndicatorRibbonBenchmark",
// Algorithm language selector - options CSharp, FSharp, VisualBasic, Python, Java
"algorithm-language": "Python",
//Physical DLL location
"algorithm-location": "../../../Algorithm.Python/Benchmarks/IndicatorRibbonBenchmark.py",

Without the changes I got ~13k dps, and ~17k dps with the changes. I have also tested (initially) by adding a 128-iteration loop around the body.
I will upstream them as soon as I have time for that.
P.S. Completed answering this pass of review; ready for a new iteration.
Executed my own performance and stability tests; the PR looks good. I think it would be very useful to get these commits into the root repo of PythonNet so more expert eyes can review them.
Next step: release new QC/PythonNet and give it a round of tests in the cloud too.
Stability tests
- Windows Lean unit and regression tests and PythonNet unit tests [multiple times].
- Linux tested Lean regression tests:
BasicTemplateAlgorithm, BasicTemplateFrameworkAlgorithm, RegressionAlgorithm, MACDTrendAlgorithm, IndicatorSuiteAlgorithm, CustomDataRegressionAlgorithm, CustomChartingAlgorithm, CustomModelsAlgorithm, DropboxBaseDataUniverseSelectionAlgorithm, AddAlphaModelAlgorithm, RenkoConsolidatorAlgorithm
Windows performance tests - IndicatorRibbonBenchmark
PythonNet pr - Lean mode debug - 9K
Completed in 87.40 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 88.85 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 88.66 seconds at 9k data points per second. Processing total of 782,223 data points.
PythonNet pr - Lean mode debug - no Marshal changes - 8-9K
Completed in 93.39 seconds at 8k data points per second. Processing total of 782,223 data points.
Completed in 90.00 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 92.21 seconds at 8k data points per second. Processing total of 782,223 data points.
PythonNet pr - Lean mode release - 12K
Completed in 68.01 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 67.23 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 64.39 seconds at 12k data points per second. Processing total of 782,223 data points.
PythonNet pr - Lean mode release - no Marshal changes - 12K
Completed in 64.20 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 65.38 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 66.80 seconds at 12k data points per second. Processing total of 782,223 data points.
PythonNet master - Lean mode debug - 8K
Completed in 101.67 seconds at 8k data points per second. Processing total of 782,223 data points.
Completed in 98.48 seconds at 8k data points per second. Processing total of 782,223 data points.
Completed in 103.05 seconds at 8k data points per second. Processing total of 782,223 data points.
PythonNet master - Lean mode release - 9K
Completed in 84.99 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 84.16 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 85.19 seconds at 9k data points per second. Processing total of 782,223 data points.
@Martin-Molinero dropping Marshal changes improves performance? That looks suspicious.
I'd rather say, based on these test results, that it's exposing the fact that the Marshal changes didn't improve performance significantly.
This addresses the following scenario:
1. A C# object `a` is created and filled with some data.
2. `a` is passed via Python.NET to Python. To do that, Python.NET creates a wrapper object `w`, and stores a reference to `a` in one of its fields.
3. Python code later passes `w` back to C#, e.g. calls `SomeCSharpMethod(w)`.
4. Python.NET has to unwrap `w`, so it reads the reference to `a` from it.
Prior to this change, in step 4 Python.NET had to determine what kind of an object `a` is. If it is an exception, a different offset needed to be used. That check was very expensive (up to 4 calls into the Python interpreter). This change replaces that check with computing the offset unconditionally by subtracting a constant from the object size (which is read from the wrapper), thus avoiding calls into the Python interpreter.
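As a usage-level illustration of that round trip, here is a rough sketch against Python.NET's public API (exact conversion helpers may vary by version; SomeCSharpMethod is only a placeholder):

using System;
using System.Collections.Generic;
using Python.Runtime;

class RoundTripSketch
{
    // Placeholder for the C# method that Python eventually calls with `w`.
    static int SomeCSharpMethod(List<int> list) => list.Count;

    static void Main()
    {
        PythonEngine.Initialize();
        using (Py.GIL())
        {
            // Step 1: a C# object `a` is created and filled with some data.
            var a = new List<int> { 1, 2, 3 };

            // Step 2: wrapping `a` for Python creates a wrapper object `w`
            // that stores a reference back to `a`.
            PyObject w = a.ToPython();

            // Steps 3-4: when `w` comes back from Python, it must be unwrapped
            // to recover `a`; this is the hot path the offset change speeds up.
            var unwrapped = (List<int>)w.AsManagedObject(typeof(List<int>));
            Console.WriteLine(SomeCSharpMethod(unwrapped)); // prints 3
        }
        PythonEngine.Shutdown();
    }
}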
@Martin-Molinero I removed the
@Martin-Molinero I recommend merging it using Squash and Merge, otherwise a merge commit will be created. Or manually cherry-pick c70724a
food for thought -- It's great seeing the macro-level performance tests, but this repository may also benefit from micro-level speed tests. This will help us track impacts of performance changes on individual operations. For example, this PR seems to directly impact performance of invoking a member (method/property/etc) on a native C# object that has been returned from the python runtime. We could have a performance test that specifically measures this case. Also, we would be able to make better informed decisions regarding improvements on the micro-scale, such as the
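A rough sketch of the kind of micro-level test suggested above (not part of this PR; it simply times repeated member access on a wrapped CLR object using Python.NET's public API):

using System;
using System.Diagnostics;
using Python.Runtime;

class MemberAccessMicroBenchmark
{
    static void Main()
    {
        PythonEngine.Initialize();
        using (Py.GIL())
        {
            // Wrap a plain C# object; every attribute access on the wrapper
            // has to unwrap it back to the CLR instance.
            PyObject wrapped = new Uri("https://example.com/").ToPython();

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1_000_000; i++)
            {
                using (PyObject host = wrapped.GetAttr("Host"))
                {
                }
            }
            sw.Stop();
            Console.WriteLine($"1M attribute reads: {sw.ElapsedMilliseconds} ms");
        }
        PythonEngine.Shutdown();
    }
}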
Posted minor questions -- I think this PR would greatly benefit from a performance test that documents what exactly was improved here. From reading the code, it appears the main performance improvement is caching these offsets instead of computing them on each access (love it!), but it would still be nice to have some sort of benchmark value for things like a property access or method invocation for C# objects returned from the python runtime.
@@ -39,6 +39,7 @@
- Sam Winstanley ([@swinstanley](https://github.com/swinstanley))
- Sean Freitag ([@cowboygneox](https://github.com/cowboygneox))
- Serge Weinstock ([@sweinst](https://github.com/sweinst))
+ Victor Milovanov ([@lostmsu](https://github.com/lostmsu))
👍 -- always nice to get credit!
private static int BaseOffset(IntPtr type)
{
    Debug.Assert(type != IntPtr.Zero);
@Martin-Molinero -- Do we build w/ the #DEBUG compiler flag in production?
No 👍 we build targeting the ReleaseWin/Mono configuration, which doesn't use the #Debug flag, nor do we add it.
for (int i = 0; i < fi.Length; i++)
{
-   fi[i].SetValue(null, (i * size) + ObjectOffset.ob_type + size);
+   fi[i].SetValue(null, (i * IntPtr.Size) + OriginalObjectOffsets.Size);
nit - why inline `IntPtr.Size` here -- it's certainly safe to assume that the value doesn't change over the course of the loop.
Saves a jump to the original line to figure out what this size is supposed to be.
@mchandschuh, I created a project in the main Python.NET repo to track benchmarking: https://github.com/pythonnet/pythonnet/projects/5
Done with this iteration. Note: I haven't made any code changes. If you strongly believe some are required, please respond to one of the conversations.
Hey guys! Any changes required for this to get in?
Various team members are getting back from vacation and travels - we will be able to test and get this in soon.
Merging so we can start cloud integration and performance testing.
What does this implement/fix? Explain your changes.
This improves the performance of Python -> C# calls by up to 50%.
Offset calculation for instance unwrapping
Commits 3344110 and 0d158f5 can actually be squashed into one change.
They address the following scenario:
1. A C# object `a` is created and filled with some data.
2. `a` is passed via Python.NET to Python. To do that, Python.NET creates a wrapper object `w`, and stores a reference to `a` in one of its fields.
3. Python code later passes `w` back to C#, e.g. calls `SomeCSharpMethod(w)`.
4. Python.NET has to unwrap `w`, so it reads the reference to `a` from it.
Prior to this change, in step 4 Python.NET had to determine what kind of an object `a` is. If it is an exception, a different offset needed to be used. That check was very expensive (up to 4 calls into the Python interpreter). The change replaces that check with computing the offset unconditionally from the object size it reads from the wrapper.
Replacing Marshal.Read*/Marshal.Write* methods in hot paths (12d1378)
`Marshal` class methods provide a safe way to read and write primitive values from unmanaged memory. Safety involves two checks:
- an alignment check (unaligned values are written byte by byte);
- a `NullReferenceException` check in case the pointer is `null`: the contract provided by the `Marshal` class requires that to be converted to `AccessViolationException`.
Both are expensive, and not needed in Python.NET's use of `Marshal`. So a couple of methods are added to the `Util` class to perform faster unmanaged memory access, and `Marshal` calls are replaced with them in hot paths (mostly in the `CLRObject` constructor).
Does this close any currently open issues?
N/A
Any other comments?
N/A
Checklist
Check all those that are applicable and complete.
AUTHORS
CHANGELOG