Improve Python.NET interop performance #31


Merged: 1 commit, Jun 24, 2019

Conversation

@lostmsu commented Jun 17, 2019

What does this implement/fix? Explain your changes.

This improves the performance of Python -> C# calls by up to 50%.

Offset calculation for instance unwrapping

Commits 3344110 and 0d158f5 can actually be squashed into one change.
They address the following scenario:

  1. A C# object a is created and filled with some data.
  2. a is passed via Python.NET to Python. To do that, Python.NET creates a wrapper object w and stores a reference to a in one of its fields.
  3. Python code later passes w back to C#, e.g. calls SomeCSharpMethod(w).
  4. Python.NET has to unwrap w, so it reads the reference to a from it.

Prior to this change, in step 4 Python.NET had to determine what kind of object a is: if it is an exception, a different offset needed to be used. That check was very expensive (up to 4 calls into the Python interpreter).

The change replaces that check with an unconditional offset computation based on the object size read from the wrapper.
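
To illustrate the idea, here is a minimal sketch in Python with hypothetical names and sizes (not the actual Python.NET layout): because the managed-data slot is appended at the end of every wrapper, its offset falls out of the wrapper's size alone, and no type check is needed.

```python
# Hypothetical sketch: Python.NET appends its managed-data slot at the
# end of each wrapper object, so the slot's offset can be computed from
# the wrapper's size instead of querying the interpreter for its type.

POINTER_SIZE = 8  # assumption: 64-bit process

def data_offset(object_size: int) -> int:
    # The managed handle occupies the last pointer-sized slot of the wrapper.
    return object_size - POINTER_SIZE

# The same formula works whether the wrapper is a plain object or an
# exception wrapper with a larger base size: no "is this an exception?"
# check is required, only the size stored in the wrapper.
assert data_offset(48) == 40
assert data_offset(56) == 48
```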

Replacing Marshal.Read*/Marshal.Write* methods in hot paths 12d1378

Marshal class methods provide a safe way to read and write primitive values in unmanaged memory. Safety involves two checks:

  1. An exception handler for NullReferenceException in case the pointer is null: the contract provided by the Marshal class requires that to be converted to an AccessViolationException.
  2. Checking whether the primitive is properly aligned in memory, and using a different (slower) procedure to access it otherwise.

Both are expensive, and neither is needed in Python.NET:

  1. Python.NET never handles exceptions from Marshal.
  2. All allocations in Python are aligned, as it uses the standard C allocator, which ensures alignment suitable for large structures (e.g. PythonObject instances).

So a couple of methods were added to the Util class to perform faster unmanaged memory access, and Marshal calls were replaced with them in hot paths (mostly in the CLRObject constructor).

Does this close any currently open issues?

N/A

Any other comments?

N/A

Checklist

Check all those that are applicable and complete.

  • Make sure to include one or more tests for your change (N/A: no new features)
  • If an enhancement PR, please create docs and at best an example (N/A: no new features)
  • Add yourself to AUTHORS
  • Updated the CHANGELOG

@Martin-Molinero (Member) left a comment

Nice performance improvement!

  • It would be nice to add more comments for the different changes so it's easier to understand and assert that the new implementation is equivalent to the previous one.
  • As for any performance change, testing and measuring are the keystone; it would be good to add some testing information and results to the PR showcasing the improvements, also noting how these changes were tested (Windows/Linux), with what, etc.
  • Wondering if these changes have been presented at https://github.com/pythonnet/pythonnet ?

{
    byte* address = (byte*)ptr + byteOffset;
    *((IntPtr*)address) = value;
}
Member:

I believe the Marshal implementation pointed to (https://github.com/dotnet/coreclr/blob/e083b2a4ab3045450005645dab8c009574a75d58/src/System.Private.CoreLib/shared/System/Runtime/InteropServices/Marshal.cs#L488) corresponds to .NET Core, but Python.NET is using .NET Framework. The decompiled implementation on my Windows machine (at Windows\Microsoft.NET\Framework\v4.0.30319\mscorlib.dll) points to the following code, which doesn't seem to be doing any alignment check.

[SecurityCritical]
    public static unsafe void WriteInt64(IntPtr ptr, int ofs, long val)
    {
      try
      {
        byte* numPtr1 = (byte*) ((IntPtr) (void*) ptr + ofs);
        if (((int) numPtr1 & 7) == 0)
        {
          *(long*) numPtr1 = val;
        }
        else
        {
          byte* numPtr2 = (byte*) &val;
          *numPtr1 = *numPtr2;
          numPtr1[1] = numPtr2[1];
          numPtr1[2] = numPtr2[2];
          numPtr1[3] = numPtr2[3];
          numPtr1[4] = numPtr2[4];
          numPtr1[5] = numPtr2[5];
          numPtr1[6] = numPtr2[6];
          numPtr1[7] = numPtr2[7];
        }
      }
      catch (NullReferenceException ex)
      {
        throw new AccessViolationException();
      }
    }

Maybe these changes were tested using .Net Core instead?

Author:

This looks like an alignment check: if (((int) numPtr1 & 7) == 0)
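
Indeed, masking with 7 tests the low three address bits, i.e. divisibility by 8, which is a standard 8-byte alignment check:

```python
# (ptr & 7) == 0 is true exactly when ptr is a multiple of 8,
# i.e. when an 8-byte value at address ptr is naturally aligned.
for ptr in (0x1000, 0x1008, 0x2A0):
    assert ptr & 7 == 0 and ptr % 8 == 0
assert (0x1003 & 7) != 0  # unaligned address fails the check
```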

}

public static readonly int ob_data;
public static readonly int ob_dict;
Member:

I think ob_dict and ob_data are never being assigned?

Author:

These are assigned via reflection in the static class constructor.

@mchandschuh:

These should likely be private and perhaps a comment stating that they're assigned/accessed via reflection.

Author:

@mchandschuh the visibility on these follows the visibility of similar fields in other *Offset classes. One reason for that is so reflection can distinguish fields mapping to Python object fields from fields that are necessary for this class's methods to function properly.

I don't think a comment here will do them any good, as in the next implementation they might be set by different means, and comments like this tend to be overlooked.

I intend to keep this part as is, if there are no further objections.


public static int Size { get { return size; } }

static readonly int size;
Member:

Maybe, to improve simplicity, this could be turned into an auto-property.

Author:

I wanted to keep the implementation explicit to ensure that the backing field is not found by reflection (though I agree an auto-property would keep it private).


public static int TypeDictOffset(IntPtr type)
{
    return ManagedDataOffsets.DictOffset(type);
}
Member:

Could you please provide more detailed comments on why the new TypeDictOffset() and magic() implementations are equivalent to the previous ones?

Author:

I mentioned that in the PR: Python.NET always adds these fields to the end of the objects it allocates. So offsets to them can be computed as objAddr + objSize - const, no matter the type of the object.

Reply:

It's nice to leave comments describing the why in the source code and even in the commit messages, as it makes understanding the code later much easier. Consider a time, far far away, when GitHub ceases to exist; this perfect comment describing the why of these offsets would be lost forever :)

Author:

That's why I put the explanation into the commit message of the squashed commit.

@lostmsu (Author) commented Jun 18, 2019

It would be nice to add more comments for the different changes so it's easier to understand and assert that the new implementation is equivalent to the previous one.

I don't think it makes sense to do that in the code, as the old code will be gone after making this change. Clarified a bit in the review itself though.

As for any performance change, testing and measuring are the keystone; it would be good to add some testing information and results to the PR showcasing the improvements, also noting how these changes were tested (Windows/Linux), with what, etc.

I only tested on a Windows box with sample data from https://github.com/QuantConnect/Lean/tree/master/Data . As suggested by Jared, I ran a random data generator to fill the date range 2010-2018.

I put the following into config.json, and ran the Launcher.

// algorithm class selector
"algorithm-type-name": "C:\\Users\\SCRAMBLED\\Projects\\QuantConnect\\Lean\\Algorithm.Python\\Benchmarks\\IndicatorRibbonBenchmark",

// Algorithm language selector - options CSharp, FSharp, VisualBasic, Python, Java
"algorithm-language": "Python",

//Physical DLL location
"algorithm-location": "../../../Algorithm.Python/Benchmarks/IndicatorRibbonBenchmark.py",

Without the changes I got ~13k dps, and ~17k dps with the changes.

I have also tested (initially) by adding a 128-iteration loop around the body of AlgorithmPythonWrapper.OnData to reduce the effect of data loading on overall run time, and got a decrease from ~35s to ~25s run time as reported by the Launcher (about 5s of both numbers is spent in data reading).

Wondering if these changes have been presented at https://github.com/pythonnet/pythonnet ?

I will upstream them as soon as I have time for that.

P.S. Done answering this pass of the review; ready for a new iteration.

@Martin-Molinero (Member) left a comment

Executed my own performance and stability tests; the PR looks good. I think it would be very useful to get these commits into the root repo of PythonNet so more expert eyes can review them.
Next step: release a new QC/PythonNet and give it a round of tests in the cloud too.

Stability tests

  • Windows Lean unit and regression tests and PythonNet unit tests [multiple times].
  • Linux tested Lean regression tests:
    BasicTemplateAlgorithm, BasicTemplateFrameworkAlgorithm, RegressionAlgorithm, MACDTrendAlgorithm, IndicatorSuiteAlgorithm, CustomDataRegressionAlgorithm, CustomChartingAlgorithm, CustomModelsAlgorithm, DropboxBaseDataUniverseSelectionAlgorithm, AddAlphaModelAlgorithm, RenkoConsolidatorAlgorithm

Windows performance tests - IndicatorRibbonBenchmark

PythonNet pr - Lean mode debug - 9K
Completed in 87.40 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 88.85 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 88.66 seconds at 9k data points per second. Processing total of 782,223 data points.
PythonNet pr - Lean mode debug - no Marshal changes - 8-9K
Completed in 93.39 seconds at 8k data points per second. Processing total of 782,223 data points.
Completed in 90.00 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 92.21 seconds at 8k data points per second. Processing total of 782,223 data points.
PythonNet pr - Lean mode release - 12K
Completed in 68.01 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 67.23 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 64.39 seconds at 12k data points per second. Processing total of 782,223 data points.
PythonNet pr - Lean mode release - no Marshal changes - 12K
Completed in 64.20 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 65.38 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 66.80 seconds at 12k data points per second. Processing total of 782,223 data points.
PythonNet master - Lean mode debug - 8K
Completed in 101.67 seconds at 8k data points per second. Processing total of 782,223 data points.
Completed in 98.48 seconds at 8k data points per second. Processing total of 782,223 data points.
Completed in 103.05 seconds at 8k data points per second. Processing total of 782,223 data points
PythonNet master - Lean mode release - 9K
Completed in 84.99 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 84.16 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 85.19 seconds at 9k data points per second. Processing total of 782,223 data points.

@lostmsu (Author) commented Jun 18, 2019

@Martin-Molinero dropping Marshal changes improves performance? That looks suspicious.

@Martin-Molinero (Member):

@Martin-Molinero dropping Marshal changes improves performance? That looks suspicious.

I'd rather say, based on these test results, that it's exposing the fact that the Marshal changes didn't improve performance significantly.
There is clearly dispersion in the result values.

This addresses the following scenario:

1. A C# object `a` is created and filled with some data.
2. `a` is passed via Python.NET to Python. To do that Python.NET
   creates a wrapper object `w`, and stores reference to `a` in one of
   its fields.
3. Python code later passes `w` back to C#, e.g. calls `SomeCSharpMethod(w)`.
4. Python.NET has to unwrap `w`, so it reads the reference to `a` from it.

Prior to this change in 4. Python.NET had to determine what kind of an
object `a` is. If it is an exception, a different offset needed to be
used. That check was very expensive (up to 4 calls into Python
interpreter).

This change replaces that check with computing offset unconditionally
by subtracting a constant from the object size (which is read from the wrapper),
thus avoiding calls to Python interpreter.
@lostmsu (Author) commented Jun 18, 2019

@Martin-Molinero I removed the Marshal-related change, and squashed the rest to a single commit.

@lostmsu (Author) commented Jun 18, 2019

@Martin-Molinero I recommend merging it using Squash and Merge, otherwise a merge commit will be created. Or manually cherry-pick c70724a

@mchandschuh commented Jun 19, 2019

Food for thought: it's great seeing the macro-level performance tests, but this repository may also benefit from micro-level speed tests. These would help us track the impact of performance changes on individual operations. For example, this PR seems to directly impact the performance of invoking a member (method/property/etc.) on a native C# object that has been returned from the Python runtime. We could have a performance test that specifically measures this case. We would also be able to make better-informed decisions about improvements on the micro scale, such as the Marshal changes (since removed).
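
As a sketch of what such a micro-benchmark could look like (Python's timeit with a stand-in workload; nothing here reflects actual Python.NET APIs):

```python
# Micro-benchmark sketch: time one hot operation in isolation so a
# per-operation regression is visible independently of macro runs.
import timeit

def unwrap(wrapper):
    # Hypothetical stand-in for the Python -> C# unwrap path under test.
    return wrapper["handle"]

w = {"handle": object()}
n = 100_000
per_call_s = timeit.timeit(lambda: unwrap(w), number=n) / n
print(f"unwrap: {per_call_s * 1e9:.1f} ns/call")
```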

@mchandschuh left a comment

Posted minor questions. I think this PR would greatly benefit from a performance test that documents exactly what was improved here. From reading the code, it appears the main performance improvement is caching these offsets instead of computing them on each access (love it!), but it would still be nice to have some sort of benchmark value for things like a property access or method invocation on C# objects returned from the Python runtime.

@@ -39,6 +39,7 @@
- Sam Winstanley ([@swinstanley](https://github.com/swinstanley))
- Sean Freitag ([@cowboygneox](https://github.com/cowboygneox))
- Serge Weinstock ([@sweinst](https://github.com/sweinst))
- Victor Milovanov ([@lostmsu](https://github.com/lostmsu))


👍 -- always nice to get credit!


private static int BaseOffset(IntPtr type)
{
Debug.Assert(type != IntPtr.Zero);


@Martin-Molinero -- do we build with the DEBUG compiler flag in production?

@Martin-Molinero (Member) Jun 19, 2019

No 👍 we build targeting the ReleaseWin/Mono configuration, which doesn't use the DEBUG flag, nor do we add it.


for (int i = 0; i < fi.Length; i++)
{
-   fi[i].SetValue(null, (i * size) + ObjectOffset.ob_type + size);
+   fi[i].SetValue(null, (i * IntPtr.Size) + OriginalObjectOffsets.Size);


nit - why inline IntPtr.Size here? It's certainly safe to assume that the value doesn't change over the course of the loop.

Author:

Saves a jump to the original line to figure out what this size is supposed to be.

@lostmsu (Author) commented Jun 19, 2019

@mchandschuh , I created a project in the main Python.NET repo to track benchmarking: https://github.com/pythonnet/pythonnet/projects/5
Since there's no existing infra for it, I think it is out of scope for this change.

Done with this iteration. Note: I haven't made any code changes. If you strongly believe some are required, please respond to one of the conversations.

@lostmsu (Author) commented Jun 22, 2019

Hey guys! Any changes required for this to get in?

@jaredbroad (Member):

Various team members are getting back from vacation and travel; we will be able to test and get this in soon.

@Martin-Molinero Martin-Molinero merged commit 2e56c59 into QuantConnect:master Jun 24, 2019
@Martin-Molinero (Member):

Merging so we can start cloud integration and performance testing.
Note: adding a new merge commit to follow existing merging policy.
