
C++: New Query: missing return-value check for scanf-like functions #10163


Merged · 21 commits · Sep 1, 2022

Conversation

@d10c (Contributor) commented Aug 24, 2022

Introduces cpp/missing-check-scanf, a query with a similar idea to this external contribution: find scanf output arguments that are read without checking that they were written in the first place. The main differences are an IR-based flow analysis and a more conservative treatment of unguarded reads of already-initialized scanf output variables.

This PR squashes and supersedes draft PR #10033.
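As a hedged illustration of the pattern the query targets (a hypothetical C snippet, not code from the PR; sscanf is used so the outcome is deterministic):

```c
#include <stdio.h>

/* BAD: if `input` contains no integer, sscanf assigns nothing and
 * `value` stays uninitialized, yet it is read unconditionally. This is
 * the shape of result the query reports. */
int parse_unchecked(const char *input) {
    int value;
    sscanf(input, "%d", &value);  /* return value ignored */
    return value;                 /* may read an indeterminate value */
}

/* GOOD: the read is guarded by the return value, which counts the
 * number of successful conversions-and-assignments. */
int parse_checked(const char *input, int fallback) {
    int value;
    if (sscanf(input, "%d", &value) == 1)
        return value;
    return fallback;
}
```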

d10c and others added 6 commits August 11, 2022 12:18
These keep getting added by the Makefile extension, I believe.
…le was initialized, as in real world cases we haven't seen it done safely.
There are still some remaining FPs (haven't fully tested them)
that should be ironed out in a follow-up to increase the precision, e.g.:

  * if scanf(&i) != 1 return
    if maybe() && scanf(&i) != 1 return
    use(i) // should be OK on both counts

  * The minimum guard constant for the *_s variants may not be right.

  * int i[2]
    scanf(i, i+1) // second i is flagged as a use of the first

  * Maybe loosen the "unguarded or badly guarded use() = bad" policy to
    "unguarded but already-initialized = good" and "badly guarded = bad",
    since a lot of FPs in MRVA fall into the "unguarded but already-
    initialized" bucket.
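To make the first and last buckets concrete, here is a hypothetical C sketch (not from the PR) of an unguarded read of an already-initialized output variable, and of the adjacent-arguments aliasing case:

```c
#include <stdio.h>

/* Unguarded but already-initialized: even when sscanf matches nothing,
 * `n` keeps its initializer, so the unguarded read is well defined.
 * Under the loosened policy this would not be flagged. */
int with_default(const char *input) {
    int n = 0;                /* initialized before the call */
    sscanf(input, "%d", &n);  /* unguarded, but `n` is never indeterminate */
    return n;
}

/* Aliasing case: `buf` and `buf + 1` are two distinct output arguments
 * into one array; the second argument is not a *read* of the first, but
 * it can be conflated with one. */
int adjacent_outputs(const char *input) {
    int buf[2] = {0, 0};
    if (sscanf(input, "%d %d", buf, buf + 1) == 2)
        return buf[0] + buf[1];
    return -1;
}
```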
@MathiasVP (Contributor) left a comment:

Looks really good! Here is my first round of comments.

exists(Instruction dst |
this.bigStep(instr) = dst and
dst.getAst() = result and
valueNumber(dst) = valNum
@MathiasVP commented Aug 24, 2022:

Could we move this conjunct into smallStep instead? i.e., require that every small step preserves the value number? I think that will reduce the size of the smallStep relation (and more importantly, the bigStep relation!) quite a bit.

@d10c (Contributor, author) replied:

That would be tricky, because while we do want the final Accesses to have the same ValueNumber, we don't want all intermediate instructions to have to have the same ValueNumber too, because then some crucial paths would be severed. Currently, bigStep is the same as the .getASuccessor+() closure starting from instr, minus all paths that go through a barrier as defined by isBarrier(). I would hesitantly wager that this is as compact as it gets.

@d10c replied:

Actually, on second thought, maybe bigStep(i) can be optimized by having i = instr, always :) Because none of the i != instr tuples are ever needed.

@MathiasVP replied:

> That would be tricky, because while we do want the final Accesses to have the same ValueNumber, we don't want all intermediate instructions to have to have the same ValueNumber too, because then some crucial paths would be severed. Currently, bigStep is the same as the .getASuccessor+() closure starting from instr, minus all paths that go through a barrier as defined by isBarrier(). I would hesitantly wager that this is as compact as it gets.

Ah, right. Yeah, I do see what you mean here.

@MathiasVP commented Aug 24, 2022:

> Actually, on second thought, maybe bigStep(i) can be optimized by having i = instr, always :) Because none of the i != instr tuples are ever needed.

Yes, I think that's a good idea.

Speaking of optimizations: currently, your bigStep is defined something like:

predicate bigStep(Instruction from, Instruction to) { source(from) and smallStep*(from, to) and sink(to) }

The smallStep*(from, to) computation produces a lot of intermediate tuples that you don't really care about. We have a lot of experience in how to do this efficiently in QL, and I'd be happy to chat tomorrow about how to compute this properly. Let me just dump it here for you to read at your own pace:

The trick is to define a unary predicate to avoid producing a lot of unnecessary tuples:

// Holds if `i` can be reached starting at a source.
predicate reachedFromSource(Instruction i) {
  source(i)
  or
  exists(Instruction mid | reachedFromSource(mid) | smallStep(mid, i))
}

// Holds if `i` can be reached starting at a source, and can eventually reach a sink.
predicate reachesSink(Instruction i) {
  reachedFromSource(i) and
  (
    sink(i)
    or
    exists(Instruction mid | reachesSink(mid) | smallStep(i, mid))
  )
}

Then, you can use this predicate to define a smaller step relation:

predicate prunedSmallStep(Instruction from, Instruction to) {
  reachesSink(from) and reachesSink(to) and smallStep(from, to)
}

and finally, you can rewrite your bigStep as:

predicate bigStep(Instruction from, Instruction to) { source(from) and prunedSmallStep*(from, to) and sink(to) }

@d10c commented Aug 25, 2022:

I have saved the best for last, implementing your proposed optimizations ^. Here are the before-and-after MRVA runs... Let's see how much of a difference it makes :)
Before: https://github.com/github/semmle-code/actions/runs/2927651707
After: https://github.com/github/semmle-code/actions/runs/2927920561

@d10c replied:

Hmm... Nope, I'm calling it. It's not worth it. I've had to inline some predicates to avoid the runaway you see in the "After", and still it's not a significant difference. This is my WIP, so it could be improved in the future: 7cc0d71

P.S. MRVA top 100 run on that WIP code: https://github.com/github/semmle-code/actions/runs/2929372003

@MathiasVP replied:

The fact that the new code is not timing out on opencv suggests that it's a worthwhile improvement to make. I've left a comment on your branch. We can also discuss it tomorrow if you want :)

@d10c replied:

Yeah, adding the pragma[inline] to the cartesian-product-prone predicate removed the timeout on opencv 😛 Maybe a bindingset[this] annotation would have done the trick too..? And you're right, I missed the first line from reachesSink, which might make all the difference...

So along with removing the this argument from these predicates (which might be tricky in some cases, since isBarrier does depend on this), I'll leave the exploration of this type of optimization for a follow-up issue, unless you think the performance improvements would be so drastic that they would be worth having in the initial release of this query (that is, if you think we can analyze opencv in significantly less than an hour on GitHub Actions; locally, it's just a few minutes, fyi).

@MathiasVP replied:

> Yeah, adding the pragma[inline] to the cartesian-product-prone predicate removed the timeout on opencv 😛 Maybe a bindingset[this] annotation would have done the trick too..? And you're right, I missed the first line from reachesSink, which might make all the difference...

Yep, I think bindingset would do the same thing, as a bindingset annotation implies inlining (as far as I know, at least). I also think the missing conjunct in reachesSink might make all the difference 😄, indeed.

I'm still not completely sure how to read the MRVA run output. I assume the 50 minutes of runtime on opencv is because we also have to compute the IR and all the other underlying things that the query makes use of. If it only takes a few minutes once the cached libraries are computed that sounds fine.

@d10c d10c added ready-for-doc-review This PR requires and is ready for review from the GitHub docs team. and removed ready-for-doc-review This PR requires and is ready for review from the GitHub docs team. labels Aug 25, 2022
d10c added 7 commits August 25, 2022 14:32
Extract some of the logic from the `cpp/missing-check-scanf` query into
the more generally useful `getOutputArgument(int index)`, `getAnOutputArgument()`,
and `getNumberOfOutputArguments()` predicates.
because they are potentially too global, belong in a separate PR.
@d10c d10c added the ready-for-doc-review This PR requires and is ready for review from the GitHub docs team. label Aug 25, 2022
@geoffw0 (Contributor) left a comment:

Query and .qhelp look great. It looks like this query became a fair bit more complicated than we had expected!

result =
varargIndex + 1 -
count(ScanfFormatLiteral f, int n |
// Special case: %n writes to an argument without reading any input.
A contributor replied:

This is great attention to detail.

@d10c replied:

Thanks! I think there are even more special cases that I'm parking for a follow-up issue for now (namely, the *_s variants of scanf introduced in C11).
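The %n special case from the snippet above can be observed directly in C (hypothetical example): %n stores the number of characters consumed so far, but that assignment is not counted in the return value, which tallies only input conversions.

```c
#include <stdio.h>

/* %n writes to an output argument without reading any input, and that
 * write does not count toward sscanf's return value, so the query's
 * argument counting has to treat it specially. */
int matched_count(void) {
    int a = 0, b = 0, consumed = -1;
    int matched = sscanf("12 34", "%d%n %d", &a, &consumed, &b);
    /* three output arguments written, but only two conversions counted */
    return matched;
}
```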

mchammer01 previously approved these changes Aug 30, 2022
@mchammer01 (Contributor) left a comment:

@d10c 👋🏻 - I've made a few comments and suggestions, but in general this looks great to me ✨
It would be awesome if we could simplify the query description. Feel free to ignore any feedback you don't agree with 😃

Thanks to @mchammer01 and @geoffw0 for the latest suggestions.
@MathiasVP commented:

I'm quite happy with this PR now. I don't see a DCA run anywhere, though. I think we should run one to be safe.

@d10c (Contributor) commented Aug 30, 2022:

Another thing, currently the query metadata reads:

@problem.severity recommendation
@security-severity 4.5
@precision medium

I wasn't clear on how to set these numbers, so I just winged the security-severity to "some number in the medium range", and the severity as "recommendation" because the MissingNullCheck query was also a recommendation. But maybe the security properties of this query are sufficiently different that these numbers merit more scrutiny. Maybe @geoffw0 can chime in?

@geoffw0 (Contributor) commented Aug 30, 2022:

@problem.severity recommendation

Severity is about how bad the problem is when you have a true positive.

  • recommendation is for bad coding style, hard to maintain code, something that's not quite ideal but there is no reason to believe the code actually behaves incorrectly as a result. Often subjective.
  • warning is for something that's generally considered a bad idea and might cause problems. Anything that doesn't fit the above or below categories.
  • error is for an issue that objectively causes incorrect program behaviour, crash, or vulnerability. No sensible developer thinks doing this thing is OK.

I think I could justify error in this case but recommendation is fine.

@security-severity 4.5

Discussed elsewhere.

@precision medium

This is about what proportion of results can be expected to be true positives. The exact meanings are somewhat controversial; I tend to think of them as:

  • medium: in most projects at least 50% of results are plausibly true positives.
  • high: at least 75% (some people will probably give a higher number here) of results are plausibly true positives, and some effort has been made to confirm this.
  • very high: basically all (95%+) results are true positive and this has been confirmed extensively in real world code.

It's a bit tricky, because it isn't always easy to determine whether a result in someone else's code is a false positive, and in high-quality projects FP rates will be higher (as there are fewer TPs).

We usually target high precision but sometimes fall short and end up at medium. Given the work you've done on false positives I suspect you can set this to high, or maybe upgrade it to high after fixing a few more false positives in a follow-up issue.

See https://github.com/github/codeql/blob/main/docs/query-metadata-style-guide.md for more info about this stuff.

Though the codeql/cwe-scores update-queries.py script did not make any changes on its own,
I looked up the score of the CWEs that @geoffw0 suggested using the explain.py script.

As discussed elsewhere, this should be more of a warning than a recommendation.
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@d10c d10c merged commit 7584434 into github:main Sep 1, 2022
@d10c d10c deleted the missing-check-scanf-squashed branch September 22, 2022 14:08
Labels: C++ · documentation · enhancement · ready-for-doc-review
5 participants