Improve performance of deletions in initial fine-grained cache load #4701


Merged
7 commits merged into master, Mar 9, 2018

Conversation

msullivan
Collaborator

@msullivan msullivan commented Mar 8, 2018

To do this we just fully ditch process_graph for doing initial cache loads, since we don't want to do anything but load the graph and trees from the cache.

This has the side effect of disabling any writes to the cache while doing a successful cache load, so for consistency we also disable cache writes when a fine-grained cache load fails and we give up on it.
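The change described above can be pictured as a branch in the build dispatcher: initial fine-grained cache loads skip full processing (and hence all cache writes) entirely. The sketch below is a minimal illustration with invented names, not mypy's actual code:

```python
def dispatch(sources, manager):
    """Sketch: choose between a plain fine-grained cache load and a
    full process_graph-style pass. All names are illustrative."""
    graph = {src: {} for src in sources}  # stand-in for load_graph()
    if manager["options"]["use_fine_grained_cache"]:
        # Initial fine-grained load: just deserialize graph and trees;
        # no type checking and no cache writes.
        for state in graph.values():
            state["loaded_from_cache"] = True
    else:
        # Normal path: full processing of the graph.
        for state in graph.values():
            state["type_checked"] = True
    return graph
```

The point of the PR is that the first branch does strictly less work: nothing is re-checked and nothing is written back to the cache.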

@msullivan msullivan requested a review from JukkaL March 8, 2018 01:34
Collaborator

@JukkaL JukkaL left a comment

Looks good, but I'm a little worried about test coverage -- cache load performance seems like a tricky area that could regress in the future (see my comments for some more concrete ideas).

@@ -280,6 +280,9 @@ def initialize_fine_grained(self, sources: List[mypy.build.BuildSource]) -> Dict
result = mypy.build.build(sources=sources,
options=self.options,
alt_lib_path=self.alt_lib_path)
# build will clear use_fine_grained_cache if it needs to give up
Collaborator

Style nit: looks like it would be better to move this outside the try statement, since this can't generate a CompileError exception.
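The restructure being suggested could look roughly like this (a sketch with invented names, not the actual mypy-daemon code): since reading an attribute cannot raise CompileError, the read belongs after the try statement.

```python
class CompileError(Exception):
    """Stand-in for mypy's CompileError."""

def initialize(options, build):
    """Sketch: read the flag after the try block rather than inside
    it, since an attribute read can't raise CompileError."""
    try:
        result = build(options)
    except CompileError:
        return None, False
    # Moved outside the try statement, per the review comment; build
    # clears the flag if it had to give up on the cache load.
    cache_load_succeeded = options["use_fine_grained_cache"]
    return result, cache_load_succeeded
```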

@@ -174,8 +174,8 @@ def __init__(self,
# for the cache. This is kind of a hack and it might be better to have
# this directly reflected in load_graph's interface.
self.options.cache_dir = os.devnull
self.options.use_fine_grained_cache = False
Collaborator

Modifying the options object is kind of ugly. Maybe at least mention that it gets mutated in the docstring?

@@ -280,6 +280,9 @@ def initialize_fine_grained(self, sources: List[mypy.build.BuildSource]) -> Dict
result = mypy.build.build(sources=sources,
options=self.options,
alt_lib_path=self.alt_lib_path)
# build will clear use_fine_grained_cache if it needs to give up
# on doing a cache load.
cache_load_succeeded = self.options.use_fine_grained_cache
Collaborator

Instead of mutating the use_fine_grained_cache attribute, would it be cleaner to add another attribute to the build result? Having a side channel like this seems questionable.
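The alternative being suggested might look like this: a hypothetical result object that carries the cache-load outcome explicitly instead of smuggling it through the mutated options. Names here are invented for illustration, not mypy's actual BuildResult API.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class BuildResult:
    """Hypothetical result object: report the cache-load outcome on
    the result itself instead of via a mutated options attribute."""
    graph: Dict[str, dict] = field(default_factory=dict)
    used_fine_grained_cache: bool = False

def build(sources, cache_loaded_ok):
    result = BuildResult(graph={s: {} for s in sources})
    # The outcome travels with the result; options stay untouched.
    result.used_fine_grained_cache = cache_loaded_ok
    return result
```

This removes the side channel: callers inspect the result rather than checking whether build quietly flipped a flag on the shared options object.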

# we process all SCCs as fresh SCCs so that we have all of the symbol
# tables and fine-grained dependencies available.
# We fail the loading of any SCC that we can't load a meta for, so we
# don't have anything *but* fresh SCCs.
Collaborator

Have you tested what happens if we fail loading some SCCs? Will they be picked up by the build later?

Collaborator Author

Yeah, that is tested by the testAddModuleAfterCache tests.

mypy/build.py Outdated
if manager.options.use_fine_grained_cache and len(graph) < 0.50 * len(sources):
manager.log("Redoing load_graph because too much was missing")
manager.only_load_from_cache = False
manager.log("Redoing load_graph without cache because too much was missing")
Collaborator

Did you test that this still gets triggered correctly? What about setting a flag when we get here and having a test case where we test the existence of the flag (enabled using a magic comment in test description, for example)? The test could also delete some subset of cache files (maybe based on module name or something).
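The test-hook idea above could be sketched like this: expose a flag recording whether the no-cache fallback fired, so a test that deletes some cache files can assert on it afterwards. All names are invented for illustration.

```python
class Manager:
    """Sketch of the proposed test hook: record when the fallback
    (redoing load_graph without the cache) triggers."""
    def __init__(self, use_cache):
        self.use_cache = use_cache
        self.cache_fallback_triggered = False  # the proposed hook

def load_graph(sources, cache, manager):
    # Only modules with cache entries make it into the initial graph.
    graph = {s: cache[s] for s in sources if s in cache}
    if manager.use_cache and len(graph) < 0.50 * len(sources):
        manager.cache_fallback_triggered = True  # visible to tests
        manager.use_cache = False
        graph = {s: {} for s in sources}  # redo the load without cache
    return graph
```

A test would then delete a subset of cache files (by module name, say), run the load, and check `cache_fallback_triggered`.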

mypy/build.py Outdated
@@ -2128,7 +2133,10 @@ def dispatch(sources: List[BuildSource], manager: BuildManager) -> Graph:
if manager.options.dump_graph:
dump_graph(graph)
return graph
process_graph(graph, manager)
if manager.options.use_fine_grained_cache:
Collaborator

Add comment here describing why we have two different ways of processing the cache.

@@ -624,6 +624,22 @@ c.f(1)
==
a.py:2: error: Too many arguments for "f"

[case testRenameAndDeleteModule]
Collaborator

Was this test case failing on master? Do we have sufficient test coverage? The details of loading the import graph are pretty subtle. Maybe we could, say, keep track of the number of files parsed during load_graph and test that it is zero in certain cases when using the cache (and non-zero when not using the cache)?
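The instrumentation suggested here (counting files parsed during load_graph) might look roughly like the following sketch; the helper and stats dict are hypothetical, not mypy's API:

```python
def load_trees(sources, cache, stats):
    """Sketch: count how many files had to be parsed (cache misses)
    while loading the graph, so tests can assert the count is zero on
    a clean cache load and non-zero otherwise."""
    trees = {}
    for src in sources:
        if src in cache:
            trees[src] = cache[src]  # deserialized; no parsing needed
        else:
            stats["files_parsed"] = stats.get("files_parsed", 0) + 1
            trees[src] = {"parsed": True}  # fell back to parsing
    return trees
```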

Collaborator Author

It wasn't failing, no. I ran this test with verbosity on to track that the right thing was happening. Which, of course, is a pretty good sign that there should be some automated scheme.

@msullivan msullivan changed the base branch from dmypy-fg-testing to master March 8, 2018 23:06
@msullivan
Collaborator Author

I added some test hooks that I think should cover things, and fixed a bug in the fallback to doing a full load that it revealed.

Collaborator

@JukkaL JukkaL left a comment

Thanks for the updates! I have some remaining questions and ideas about comments.

@@ -15,6 +15,11 @@
-- <optional output from batch pass>
-- ==
-- <optional output from first incremental pass>
--
--
-- Modules that are expected to be detected as changed can be checked with [stale ...]
Collaborator

Please make the description more detailed. Say, does changed mean that source code is changed or the file is added/deleted, or only some of these? How does it differ for the first run and successive runs, and what about tests using cache or not using cache?

Also mention [stale2 ...], etc.

--
-- Modules that are expected to be detected as changed can be checked with [stale ...]
-- while modules that are reprocessed by update (which can include cached files
-- that need to be loaded) can be checked with [rechecked ...]
Collaborator

Make this more explicit -- here reprocessing only means full module reprocessing, not that we process only some target in the module when following fine-grained dependencies.


-- in no cache mode, there is no way to know about b yet,
-- but a should still get triggered
[stale a]
Collaborator

I don't fully understand why a is included here, since only b had file changes (it was deleted). Please explain in the comment why a is triggered.

Collaborator Author

The initial cache-mode run flags any file that is not "fresh" as changed, in order to pick up modules with changed dependencies.

-- regular tests.
[case testAddModuleAfterCache1]
-- doesn't appear in the cache, for cache mode. They are run in
-- cache mode oly because stale and rechecked differ heavily between
Collaborator

Typo: oly

[file b.py.3]
def foo(x: int) -> None: pass

[stale2 b]
Collaborator

Include [rechecked2 ...]?


-- No files should be stale or reprocessed in the first step since the large number
-- of missing files will force build to give up on cache loading.
[stale]
Collaborator

Should we check for [rechecked ...] here as well?

Collaborator Author

When there is no rechecked, the value of stale is used. This is behavior inherited from the regular incremental test suite. I can document this better in the block comment or just add all the recheckeds in.
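The defaulting rule described in the reply above can be sketched as a small helper; the dict-based representation is a stand-in for the test-suite internals:

```python
def expected_rechecked(case):
    """Sketch of the rule: if a test case gives no [rechecked ...]
    assertion, the [stale ...] set doubles as the expected rechecked
    set. Keys here are illustrative."""
    if "rechecked" in case:
        return set(case["rechecked"])
    return set(case.get("stale", []))
```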


[file b.py.3]
def foo(x: int) -> None: pass
[stale2 b]
Collaborator

Similar to above.

@@ -1064,14 +1093,20 @@ def foo() -> None: pass
[file b.py.2]
import a
a.foo(10)

[stale a, b]
Collaborator

Here and in various other places below we don't check for [rechecked ...]. Is this intentional?

mypy/build.py Outdated
@@ -569,6 +571,7 @@ class BuildManager:
flush_errors: A function for processing errors after each SCC
saved_cache: Dict with saved cache state for coarse-grained dmypy
(read-write!)
cache_enabled: Whether cache usage is enabled
Collaborator

Could be worth mentioning how this relates to use_fine_grained_cache -- the latter means that we attempt to use the cache if it exists, and this indicates whether we actually used it or something.

@msullivan msullivan merged commit 9374acf into master Mar 9, 2018
@msullivan msullivan deleted the delete_optimize branch March 9, 2018 22:22