
TST: Azure suddenly missing from PR CI lists #12158


Closed
tylerjereddy opened this issue Oct 12, 2018 · 44 comments

@tylerjereddy
Contributor

The Azure service is suddenly (and silently!) missing from some recent PRs that don't have much in common and shouldn't be doing anything that could disable Azure:

#12154
#12157
#12153

@charris
Member

charris commented Oct 12, 2018

Yes, I was just looking at that. Either our account was temporary (one week?) or GitHub notifications have died.

@charris
Member

charris commented Oct 12, 2018

I was able to manually start a rebuild of an older test, so it looks like notifications have stalled.

@charris
Member

charris commented Oct 12, 2018

I noticed a stall a couple of days ago, but Azure recovered.

@tylerjereddy
Contributor Author

cc @chrisrpatterson @ericsciple @jimlamb

@tylerjereddy
Contributor Author

Yeah, your manual trigger on the set-numeric-ops deprecation PR was the first run in 7 hours or so.

@charris
Member

charris commented Oct 12, 2018

Looks like the two coverage tests are also gone, so we are missing three overall.

@charris
Member

charris commented Oct 12, 2018

Maybe a network problem?

@tylerjereddy
Contributor Author

The coverage reports will often not show up if anything else is wrong or stalled. We have !appveyor and !circle set in the .codecov.yml file to ignore those services as a gate before reporting coverage, but I'm not sure it was respecting those flags; we could try adding the other services to that list too.

I think that's supposed to be a feature, since the coverage report is intended to accumulate results from various platforms/services.
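
(For reference, a minimal sketch of the kind of .codecov.yml gating stanza being described here; the exact contents of numpy's file may differ, and the quotes are only there to keep the YAML parser happy with the leading "!".)

```yaml
# Sketch of a .codecov.yml fragment: the "!" prefix tells Codecov to
# ignore a CI provider when deciding whether it can post coverage.
codecov:
  ci:
    - "!appveyor"   # don't wait on AppVeyor
    - "!circle"     # don't wait on CircleCI
```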

The Azure failure seems more surprising, though!

@chrispat

That is strange. I see some of the interim PRs have been triggered and report status. I will send this issue to one of the engineers on the team.

@tylerjereddy
Contributor Author

@chrisrpatterson Thanks for checking on this for us

@lkillgore

Hi @tylerjereddy, could you point me to a specific PR / Pipeline pair that you expect should have been triggered?

@tylerjereddy
Contributor Author

@lkillgore this is a recent one: #12159

You can find the pipeline that stopped triggering off PRs 7+ hours ago here: https://dev.azure.com/numpy/numpy/_build?definitionId=4

There's one manual trigger we tried 3h ago, but that's it.

@lkillgore

Thanks, Tyler. Could you double-check a couple of things for me? Could you verify that the webhook is still there on the repository? (Should be here: https://github.com/numpy/numpy/settings/hooks)
You should have two: one for push, and one for pull_requests

Could you also verify that the payload for the webhook was successfully delivered to Azure DevOps from that webhook? If you click on edit, you'll see a list of payloads at the bottom. One of them should have the SHA for your PR as the merge_commit_sha. The response back from Azure DevOps should be 200.

Thanks!

@charris
Member

charris commented Oct 12, 2018

@lkillgore We don't use the webhook; the recommended method is the app from the GitHub Marketplace, which shows up under "Integrations & services" in the repository settings. I checked that and it is still there. What I am wondering is whether it was a "try it for one week" sort of thing.

@charris
Member

charris commented Oct 12, 2018

Also, since we are on a free plan, there is no payment method on file. I can see that sort of thing possibly causing problems...

@lkillgore

Hi @charris, the good news is that this is a full product, so it should be working. I added some logging to your account so I can hopefully get more visibility into what's going on. Could you please try to open another PR to generate some logs for me to look through? Thanks!

@charris
Member

charris commented Oct 12, 2018

Let me try close/reopen on an existing PR; that would previously issue a new request... OK, that did nothing: the old test was still there, but it didn't retest. New PR at #12160.

@lkillgore

Thanks!

I don't see any GitHub App events coming through for your repository/org (I do see plenty from other users/orgs). This suggests to me that there's something misconfigured with the app, but I don't know what that could be unless someone uninstalled the application or changed it in some way.

You mentioned that you checked to see that it's still installed. Could you double check that numpy (the repository) still has access?
Either 'all repositories' should be selected or 'numpy/numpy' should be in the list of granted repositories.

@tylerjereddy
Contributor Author

I think @charris will have to do that as an owner of the repo.

@charris
Member

charris commented Oct 12, 2018

It is configured for all repositories, so none are individually selected. The configuration hasn't changed since this pipeline started running. I can try being selective, but as I said, it was working.

@charris
Member

charris commented Oct 12, 2018

I note that the time it quit working doesn't really fit with someone fooling with the configuration. BTW, we can probably give you permissions if you need them to work on this, although I would have thought you had "root" permissions :)

@charris
Member

charris commented Oct 12, 2018

Hmm, the last triggered run is interesting: it covered two merges that took place close together. I would have thought the app would queue them up separately by commit hash, but apparently the pipeline just downloads the master branch as it exists at run time. Maybe there was some sort of collision/timing problem that messed things up?

@charris
Member

charris commented Oct 12, 2018

I have a feeling that reinstalling the app would fix things, at least temporarily, but the potential problem would still be there, waiting.

@mattip
Member

mattip commented Oct 13, 2018

Now that the Azure team has more extensive logging in place, perhaps getting things going again will let them find the problem.

@charris
Member

charris commented Oct 14, 2018

@mattip It's time. @lkillgore I'm going to delete the app from numpy and reinstall it.

@charris
Member

charris commented Oct 14, 2018

The newly re-installed app isn't triggering either. I think NumPy has been blocked somewhere.

@charris
Member

charris commented Oct 14, 2018

Note that there doesn't seem to be any way to use an existing pipeline when the app is installed. Going to turn on AppVeyor again until this is corrected.

@charris
Member

charris commented Oct 14, 2018

I think we have reached the "enemy action" stage :) It might be worth checking the nature of the numpy purchase of the app; there might be a built-in time limitation. The one week we got still seems suspicious to me.

@lkillgore

Hi @charris, I'm beginning to think your assessment of a time limitation may be correct :( . Thanks for letting us know about this, and I'll let you know as soon as I have more info on my side. Thanks!

@tylerjereddy
Contributor Author

Azure CI is in the list and running all of a sudden in #12159 after I revised & force-pushed to that branch a few minutes ago.

@mattip
Member

mattip commented Oct 16, 2018

@lkillgore, @chrisrpatterson the builds seem to be back. Please follow up and let us know what happened so we can close the issue.

@lkillgore

Hi @mattip, I patched a bug yesterday; I'm not done testing it yet, but I'll let you know when I'm confident we have this matter resolved. And, thank you all for your help in drawing our attention to this problem.

@charris
Member

charris commented Oct 16, 2018

Azure is not currently triggering on the maintenance/1.15.x branch, although it is configured to do so.

@lkillgore

Hi @charris, I was just about to write back to say that I think this bug is fixed.

Something I see that's odd in your definition (and I really hope this isn't the problem) is that there's a trailing space in your trigger filter. Could you try removing that space and see if that fixes it? If this is what's causing the issue, I'll get that fixed ASAP!

I'll take a look at this further tomorrow.

@charris
Member

charris commented Oct 16, 2018

@lkillgore The good news is that it wasn't the trailing space. The bad news is that it still doesn't trigger :)

@charris
Member

charris commented Oct 17, 2018

Looks like merges to the branch are working now.

@charris
Member

charris commented Oct 17, 2018

Found the problem: the PR trigger also needs the branch filter added, not just the CI trigger. Note that there is no obvious way to get back to the pipelines from editing mode, no easy way to finish with it.

@charris
Member

charris commented Oct 17, 2018

@tylerjereddy We should add a trigger: section to the *.yml.
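
(For the record, a minimal sketch of what such a section could look like in azure-pipelines.yml; the branch names are just the ones discussed in this thread, and any triggers configured in the pipeline's web UI can still override the YAML.)

```yaml
# Sketch of explicit CI (push) and PR triggers in azure-pipelines.yml.
trigger:
  branches:
    include:
      - master
      - maintenance/1.15.x   # maintenance branch mentioned above

pr:
  branches:
    include:
      - master
      - maintenance/1.15.x
```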

@charris
Member

charris commented Oct 17, 2018

Going to disable the AppVeyor builds again; they are just too slow.

@tylerjereddy
Contributor Author

@charris re: triggers -- noted

@lkillgore

Thanks for letting me know, @charris. It sounds to me like everything is working, but there could be some improvements to the UX.

@charris
Member

charris commented Oct 17, 2018

@lkillgore Thanks for your rapid response to this issue. I expect we would all like to know what the problem was if you can manage to tell us in a couple of lines :)

@charris charris closed this as completed Oct 17, 2018
@lkillgore

Sure thing! We added a feature. Formerly, if your GitHub repo numpy/numpy was using dev.azure.com/numpy, then numpy/numpy2 also had to use dev.azure.com/numpy. Now numpy/numpy2 can be used with a different Azure DevOps account. Because we stage rollouts of new features to increasing sets of users, not all of our servers had this code yet, so our servers didn't communicate with each other correctly, which caused the bug.
