Improve Python.NET interop performance #31


Merged: 1 commit, Jun 24, 2019

Conversation

@lostmsu commented Jun 17, 2019

What does this implement/fix? Explain your changes.

This improves the performance of Python -> C# calls by up to 50%.

Offset calculation for instance unwrapping

Commits 3344110 and 0d158f5 can actually be squashed into one change.
They address the following scenario:

  1. A C# object a is created and filled with some data.
  2. a is passed via Python.NET to Python. To do that, Python.NET creates a wrapper object w and stores a reference to a in one of its fields.
  3. Python code later passes w back to C#, e.g. calls SomeCSharpMethod(w).
  4. Python.NET has to unwrap w, so it reads the reference to a from it.

Prior to this change, in step 4 Python.NET had to determine what kind of object a is: if it is an exception, a different offset needed to be used. That check was very expensive (up to 4 calls into the Python interpreter).

The change replaces that check with an unconditional offset computation based on the object size read from the wrapper.
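
To illustrate the idea, here is a minimal sketch in Python with hypothetical names and sizes (not the actual Python.NET layout): because the managed-data slot is appended at the end of every wrapper, its offset falls out of the wrapper's size alone, and no type check is needed.

```python
# Hypothetical sketch: Python.NET appends its managed-data slot at the
# end of each wrapper object, so the slot's offset can be computed from
# the wrapper's size instead of querying the interpreter for its type.

POINTER_SIZE = 8  # assumption: 64-bit process

def data_offset(object_size: int) -> int:
    # The managed handle occupies the last pointer-sized slot of the wrapper.
    return object_size - POINTER_SIZE

# The same formula works whether the wrapper is a plain object or an
# exception wrapper with a larger base size: no "is this an exception?"
# check is required, only the size stored in the wrapper.
assert data_offset(48) == 40
assert data_offset(56) == 48
```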

Replacing Marshal.Read*/Marshal.Write* methods in hot paths 12d1378

Marshal class methods provide a safe way to read and write primitive values in unmanaged memory. Safety involves two checks:

  1. An exception handler for NullReferenceException in case the pointer is null: the contract provided by the Marshal class requires that to be converted to an AccessViolationException.
  2. Checking whether the primitive is properly aligned in memory, and using a different (slower) procedure to access it otherwise.

Both are expensive, and neither is needed in Python.NET:

  1. Python.NET never handles exceptions from Marshal.
  2. All allocations in Python are aligned, as it uses the standard C allocator, which ensures alignment suitable for large structures (e.g. PythonObject instances).

So a couple of methods were added to the Util class to perform faster unmanaged memory access, and Marshal calls were replaced with them in hot paths (mostly in the CLRObject constructor).

Does this close any currently open issues?

N/A

Any other comments?

N/A

Checklist

Check all those that are applicable and complete.

  • Make sure to include one or more tests for your change (N/A: no new features)
  • If an enhancement PR, please create docs and at best an example (N/A: no new features)
  • Add yourself to AUTHORS
  • Updated the CHANGELOG

@Martin-Molinero (Member) left a comment

Nice performance improvement!

  • It would be nice to add more comments for the different changes so it's easier to understand and assert that the new implementation is equivalent to the previous one.
  • As for any performance change, testing and measuring are the keystone; it would be good to add some testing information and results to the PR showcasing the improvements, also noting how these changes were tested (Windows/Linux), with what, etc.
  • Wondering if these changes have been presented at https://github.com/pythonnet/pythonnet ?

{
    byte* address = (byte*)ptr + byteOffset;
    *((IntPtr*)address) = value;
}
Member:

I believe the Marshal implementation pointed to (https://github.com/dotnet/coreclr/blob/e083b2a4ab3045450005645dab8c009574a75d58/src/System.Private.CoreLib/shared/System/Runtime/InteropServices/Marshal.cs#L488) corresponds to .NET Core, but Python.NET is using .NET Framework. The decompiled implementation on my Windows machine (at Windows\Microsoft.NET\Framework\v4.0.30319\mscorlib.dll) points to the following code, which doesn't seem to be doing any alignment check.

[SecurityCritical]
    public static unsafe void WriteInt64(IntPtr ptr, int ofs, long val)
    {
      try
      {
        byte* numPtr1 = (byte*) ((IntPtr) (void*) ptr + ofs);
        if (((int) numPtr1 & 7) == 0)
        {
          *(long*) numPtr1 = val;
        }
        else
        {
          byte* numPtr2 = (byte*) &val;
          *numPtr1 = *numPtr2;
          numPtr1[1] = numPtr2[1];
          numPtr1[2] = numPtr2[2];
          numPtr1[3] = numPtr2[3];
          numPtr1[4] = numPtr2[4];
          numPtr1[5] = numPtr2[5];
          numPtr1[6] = numPtr2[6];
          numPtr1[7] = numPtr2[7];
        }
      }
      catch (NullReferenceException ex)
      {
        throw new AccessViolationException();
      }
    }

Maybe these changes were tested using .Net Core instead?

Author:

This looks like an alignment check: if (((int) numPtr1 & 7) == 0)
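
Indeed, masking with 7 tests the low three address bits, i.e. divisibility by 8, which is a standard 8-byte alignment check:

```python
# (ptr & 7) == 0 is true exactly when ptr is a multiple of 8,
# i.e. when an 8-byte value at address ptr is naturally aligned.
for ptr in (0x1000, 0x1008, 0x2A0):
    assert ptr & 7 == 0 and ptr % 8 == 0
assert (0x1003 & 7) != 0  # unaligned address fails the check
```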

}

public static readonly int ob_data;
public static readonly int ob_dict;
Member:

I think ob_dict and ob_data are never being assigned?

Author:

These are assigned via reflection in the static class constructor.

@mchandschuh:

These should likely be private and perhaps a comment stating that they're assigned/accessed via reflection.

Author:

@mchandschuh the visibility on these follows the visibility of similar fields in other *Offset classes. One reason for that is so reflection can distinguish fields mapping to Python object fields from fields that are necessary for this class's methods to function properly.

I don't think a comment here will do them any good, as in the next implementation they might be set by different means, and comments like this tend to be overlooked.

I intend to keep this part as is, if there are no further objections.


public static int Size { get { return size; } }

static readonly int size;
Member:

Maybe, to improve simplicity, this could be turned into an auto-property.

Author:

I wanted to keep the implementation explicit to ensure that the backing field is not found by reflection (though I agree an auto-property would keep it private).


public static int TypeDictOffset(IntPtr type)
{
    return ManagedDataOffsets.DictOffset(type);
}
Member:

Could you please provide more detailed comments on why the new TypeDictOffset() and magic() implementations are equivalent to the previous ones?

Author:

I mentioned that in the PR: Python.NET always adds these fields to the end of the objects it allocates. So offsets to them can be computed as objAddr + objSize - const, no matter the type of the object.

Reply:

It's nice to leave comments describing the why in the source code and even in the commit messages, as it makes understanding the code later much easier. Consider a time, far far away, when GitHub ceases to exist; this perfect comment describing the why of these offsets would be lost forever :)

Author:

That's why I put the explanation into the commit message of the squashed commit.

@lostmsu (Author) commented Jun 18, 2019

It would be nice to add more comments for the different changes so it's easier to understand and assert that the new implementation is equivalent to the previous one.

I don't think it makes sense to do that in the code, as the old code will be gone after making this change. Clarified a bit in the review itself though.

As for any performance change, testing and measuring are the keystone; it would be good to add some testing information and results to the PR showcasing the improvements, also noting how these changes were tested (Windows/Linux), with what, etc.

I only tested on a Windows box with sample data from https://github.com/QuantConnect/Lean/tree/master/Data . As suggested by Jared, I ran a random data generator to fill the date range 2010-2018.

I put the following into config.json, and ran the Launcher.

// algorithm class selector
"algorithm-type-name": "C:\\Users\\SCRAMBLED\\Projects\\QuantConnect\\Lean\\Algorithm.Python\\Benchmarks\\IndicatorRibbonBenchmark",

// Algorithm language selector - options CSharp, FSharp, VisualBasic, Python, Java
"algorithm-language": "Python",

//Physical DLL location
"algorithm-location": "../../../Algorithm.Python/Benchmarks/IndicatorRibbonBenchmark.py",

Without the changes I got ~13k dps, and ~17k dps with the changes.

I have also tested (initially) by adding a 128-iteration loop around the body of AlgorithmPythonWrapper.OnData to reduce the effect of data loading on overall run time, and got a decrease from ~35s to ~25s run time as reported by the Launcher (about 5s of both numbers is spent in data reading).

Wondering if these changes have been presented at https://github.com/pythonnet/pythonnet ?

I will upstream them as soon as I have time for that.

P.S. Done answering this pass of the review; ready for a new iteration.

@Martin-Molinero (Member) left a comment

Executed my own performance and stability tests; the PR looks good. I think it would be very useful to get these commits into the root repo of PythonNet so more expert eyes can review them.
Next step: release a new QC/PythonNet and give it a round of tests in the cloud too.

Stability tests

  • Windows Lean unit and regression tests and PythonNet unit tests [multiple times].
  • Linux tested Lean regression tests:
    BasicTemplateAlgorithm, BasicTemplateFrameworkAlgorithm, RegressionAlgorithm, MACDTrendAlgorithm, IndicatorSuiteAlgorithm, CustomDataRegressionAlgorithm, CustomChartingAlgorithm, CustomModelsAlgorithm, DropboxBaseDataUniverseSelectionAlgorithm, AddAlphaModelAlgorithm, RenkoConsolidatorAlgorithm

Windows performance tests - IndicatorRibbonBenchmark

PythonNet pr - Lean mode debug - 9K
Completed in 87.40 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 88.85 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 88.66 seconds at 9k data points per second. Processing total of 782,223 data points.
PythonNet pr - Lean mode debug - no Marshal changes - 8-9K
Completed in 93.39 seconds at 8k data points per second. Processing total of 782,223 data points.
Completed in 90.00 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 92.21 seconds at 8k data points per second. Processing total of 782,223 data points.
PythonNet pr - Lean mode release - 12K
Completed in 68.01 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 67.23 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 64.39 seconds at 12k data points per second. Processing total of 782,223 data points.
PythonNet pr - Lean mode release - no Marshal changes - 12K
Completed in 64.20 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 65.38 seconds at 12k data points per second. Processing total of 782,223 data points.
Completed in 66.80 seconds at 12k data points per second. Processing total of 782,223 data points.
PythonNet master - Lean mode debug - 8K
Completed in 101.67 seconds at 8k data points per second. Processing total of 782,223 data points.
Completed in 98.48 seconds at 8k data points per second. Processing total of 782,223 data points.
Completed in 103.05 seconds at 8k data points per second. Processing total of 782,223 data points
PythonNet master - Lean mode release - 9K
Completed in 84.99 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 84.16 seconds at 9k data points per second. Processing total of 782,223 data points.
Completed in 85.19 seconds at 9k data points per second. Processing total of 782,223 data points.

@lostmsu (Author) commented Jun 18, 2019

@Martin-Molinero dropping Marshal changes improves performance? That looks suspicious.

@Martin-Molinero (Member):

@Martin-Molinero dropping Marshal changes improves performance? That looks suspicious.

I'd rather say, based on these test results, that it's exposing the fact that the Marshal changes didn't improve performance significantly.
There is clearly dispersion in the result values.

This addresses the following scenario:

1. A C# object `a` is created and filled with some data.
2. `a` is passed via Python.NET to Python. To do that Python.NET
   creates a wrapper object `w`, and stores reference to `a` in one of
   its fields.
3. Python code later passes `w` back to C#, e.g. calls `SomeCSharpMethod(w)`.
4. Python.NET has to unwrap `w`, so it reads the reference to `a` from it.

Prior to this change in 4. Python.NET had to determine what kind of an
object `a` is. If it is an exception, a different offset needed to be
used. That check was very expensive (up to 4 calls into Python
interpreter).

This change replaces that check with computing offset unconditionally
by subtracting a constant from the object size (which is read from the wrapper),
thus avoiding calls to Python interpreter.
@lostmsu (Author) commented Jun 18, 2019

@Martin-Molinero I removed the Marshal-related change, and squashed the rest to a single commit.

@lostmsu (Author) commented Jun 18, 2019

@Martin-Molinero I recommend merging it using Squash and Merge, otherwise a merge commit will be created. Or manually cherry-pick c70724a

@mchandschuh commented Jun 19, 2019

Food for thought: it's great seeing the macro-level performance tests, but this repository may also benefit from micro-level speed tests. These would help us track the impact of performance changes on individual operations. For example, this PR seems to directly impact the performance of invoking a member (method/property/etc.) on a native C# object that has been returned from the Python runtime. We could have a performance test that specifically measures this case. We would also be able to make better-informed decisions about improvements on the micro scale, such as the Marshal changes (since removed).
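
As a sketch of what such a micro-benchmark could look like (Python's timeit with a stand-in workload; nothing here reflects actual Python.NET APIs):

```python
# Micro-benchmark sketch: time one hot operation in isolation so a
# per-operation regression is visible independently of macro runs.
import timeit

def unwrap(wrapper):
    # Hypothetical stand-in for the Python -> C# unwrap path under test.
    return wrapper["handle"]

w = {"handle": object()}
n = 100_000
per_call_s = timeit.timeit(lambda: unwrap(w), number=n) / n
print(f"unwrap: {per_call_s * 1e9:.1f} ns/call")
```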

@mchandschuh left a comment

Posted minor questions. I think this PR would greatly benefit from a performance test that documents exactly what was improved here. From reading the code, it appears the main performance improvement is caching these offsets instead of computing them on each access (love it!), but it would still be nice to have some sort of benchmark value for things like a property access or method invocation on C# objects returned from the Python runtime.

@@ -39,6 +39,7 @@
- Sam Winstanley ([@swinstanley](https://github.com/swinstanley))
- Sean Freitag ([@cowboygneox](https://github.com/cowboygneox))
- Serge Weinstock ([@sweinst](https://github.com/sweinst))
- Victor Milovanov ([@lostmsu](https://github.com/lostmsu))


👍 -- always nice to get credit!


private static int BaseOffset(IntPtr type)
{
Debug.Assert(type != IntPtr.Zero);


@Martin-Molinero -- do we build with the DEBUG compiler flag in production?

@Martin-Molinero (Member) Jun 19, 2019

No 👍 we build targeting the ReleaseWin/Mono configuration, which doesn't use the DEBUG flag, nor do we add it.


for (int i = 0; i < fi.Length; i++)
{
-   fi[i].SetValue(null, (i * size) + ObjectOffset.ob_type + size);
+   fi[i].SetValue(null, (i * IntPtr.Size) + OriginalObjectOffsets.Size);


nit - why inline IntPtr.Size here? It's certainly safe to assume that the value doesn't change over the course of the loop.

Author:

Saves a jump to the original line to figure out what this size is supposed to be.

@lostmsu (Author) commented Jun 19, 2019

@mchandschuh , I created a project in the main Python.NET repo to track benchmarking: https://github.com/pythonnet/pythonnet/projects/5
Since there's no existing infra for it, I think it is out of scope for this change.

Done with this iteration. Note: I haven't made any code changes. If you strongly believe some are required, please respond to one of the conversations.

@lostmsu (Author) commented Jun 22, 2019

Hey guys! Any changes required for this to get in?

@jaredbroad (Member):

Various team members are getting back from vacation and travel; we will be able to test and get this in soon.

@Martin-Molinero Martin-Molinero merged commit 2e56c59 into QuantConnect:master Jun 24, 2019
@Martin-Molinero (Member):

Merging so we can start cloud integration and performance testing.
Note: adding a new merge commit to follow existing merging policy.
