Skip to content

C++: New Query: missing return-value check for scanf-like functions #10163

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Sep 1, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
94c43c0
Update .gitignore for .vscode/*.log temporaries
d10c Aug 11, 2022
76ef779
C++: Add test and placeholder query.
geoffw0 Jul 28, 2022
c62ae3b
C++: First working. We now prefer flagging the cases where the variab…
geoffw0 Jul 28, 2022
69911d4
.clang-format: do not autoformat test.cpp
d10c Aug 17, 2022
ca162a4
C++: complete initial implementation of `cpp/missing-check-scanf`
d10c Aug 11, 2022
170d12b
Write MissingCheckScanf.qhelp
d10c Aug 24, 2022
6158ee1
Change note
d10c Aug 24, 2022
5c894ae
Merge branch 'main' into missing-check-scanf-squashed
d10c Aug 24, 2022
d8800c0
C++: new helper predicates in ScanfFunctionCall
d10c Aug 25, 2022
e39229d
C++: Remove unique-Instruction kludge in ScanfOutput
d10c Aug 25, 2022
a6a30b3
C++: clarify ScanfOutput.getMinimumGuardConstant()
d10c Aug 25, 2022
ad56274
C++: Small improvements to query qldoc and message
d10c Aug 25, 2022
2bd866c
C++: improve change note and move to right place
d10c Aug 25, 2022
02772ed
Revert changes to .gitignore and .clang-format
d10c Aug 25, 2022
7d24d96
C++: Optimize MissingCheckScanf/bigStep()
d10c Aug 25, 2022
e10042b
C++: Improve docs based on doc-review
d10c Aug 30, 2022
ce1e4ad
Merge branch 'main' into missing-check-scanf-squashed
d10c Aug 30, 2022
0729e42
C++: Update metadata based on cwe-scores
d10c Aug 31, 2022
38f185b
C++: Correct CWE tags in metadata
d10c Aug 31, 2022
f5a30c7
C++: Add correctness tag
d10c Aug 31, 2022
f956999
Merge branch 'main' into missing-check-scanf-squashed
d10c Sep 1, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 22 additions & 0 deletions cpp/ql/lib/semmle/code/cpp/commons/Scanf.qll
Original file line number Diff line number Diff line change
Expand Up @@ -143,6 +143,28 @@ class ScanfFunctionCall extends FunctionCall {
* (rather than a `char*`).
*/
predicate isWideCharDefault() { this.getScanfFunction().isWideCharDefault() }

/**
* Gets the output argument at position `n` in the vararg list of this call.
*
* The range of `n` is from `0` to `this.getNumberOfOutputArguments() - 1`.
*/
Expr getOutputArgument(int n) {
result = this.getArgument(this.getTarget().getNumberOfParameters() + n) and
n >= 0
}

/**
* Gets an output argument given to this call in vararg position.
*/
Expr getAnOutputArgument() { result = this.getOutputArgument(_) }

/**
* Gets the number of output arguments present in this call.
*/
int getNumberOfOutputArguments() {
result = this.getNumberOfArguments() - this.getTarget().getNumberOfParameters()
}
}

/**
Expand Down
17 changes: 17 additions & 0 deletions cpp/ql/src/Critical/MissingCheckScanf.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
{
int i, j, r;

r = scanf("%d %d", &i, &j);

use(i); // BAD: i is not guarded

if (r >= 1) {
use(i); // GOOD: i is guarded correctly
use(j); // BAD: j is guarded incorrectly
}

if (r != 2)
return;

use(j); // GOOD: j is guarded correctly
}
51 changes: 51 additions & 0 deletions cpp/ql/src/Critical/MissingCheckScanf.qhelp
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
<!DOCTYPE qhelp PUBLIC
"-//Semmle//qhelp//EN"
"qhelp.dtd">
<qhelp>


<overview>
<p>
This query finds calls of <tt>scanf</tt>-like functions with missing or
improper return-value checking.
</p>
<p>
Specifically, the query flags uses of variables that may have been modified by
<tt>scanf</tt> and subsequently are used without being guarded by a correct
return-value check. A proper check is one that ensures that the corresponding
<tt>scanf</tt> has returned (at least) a certain minimum constant.
</p>
<p>
Functions in the <tt>scanf</tt> family return either EOF (a negative value)
in case of IO failure, or the number of items successfully read from the
input. Consequently, a simple check that the return value is truthy (nonzero)
is not enough.
</p>
<warning>
This query has medium precision because, in the current implementation, it
takes a strict stance on unguarded uses of output variables, and flags them
as problematic even if they have already been initialized.
</warning>
</overview>

<recommendation>
<p>
Ensure that all subsequent uses of <tt>scanf</tt> output arguments occur in a
branch of an <tt>if</tt> statement (or similar), in which it is known that the
corresponding <tt>scanf</tt> call has in fact read all possible items from its
input. This can be done by comparing the return value to a numerical constant.
</p>
</recommendation>

<example>
<p>This example shows different ways of guarding a <tt>scanf</tt> output:
</p>
<sample src="MissingCheckScanf.cpp" />
</example>

<references>
<li>SEI CERT C++ Coding Standard: <a href="https://wiki.sei.cmu.edu/confluence/display/cplusplus/ERR62-CPP.+Detect+errors+when+converting+a+string+to+a+number">ERR62-CPP. Detect errors when converting a string to a number</a>.</li>
<li>SEI CERT C Coding Standard: <a href="https://wiki.sei.cmu.edu/confluence/display/c/ERR33-C.+Detect+and+handle+standard+library+errors">ERR33-C. Detect and handle standard library errors</a>.</li>
<li>cppreference.com: <a href="https://en.cppreference.com/w/c/io/fscanf">scanf, fscanf, sscanf, scanf_s, fscanf_s, sscanf_s</a>.</li>
</references>
</qhelp>
122 changes: 122 additions & 0 deletions cpp/ql/src/Critical/MissingCheckScanf.ql
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
/**
* @name Missing return-value check for a 'scanf'-like function
* @description Failing to check that a call to 'scanf' actually writes to an
* output variable can lead to unexpected behavior at reading time.
* @kind problem
* @problem.severity warning
* @security-severity 7.5
* @precision medium
* @id cpp/missing-check-scanf
* @tags security
* correctness
* external/cwe/cwe-252
* external/cwe/cwe-253
*/

import cpp
import semmle.code.cpp.commons.Scanf
import semmle.code.cpp.controlflow.Guards
import semmle.code.cpp.dataflow.DataFlow
import semmle.code.cpp.ir.IR
import semmle.code.cpp.ir.ValueNumbering

/** An expression appearing as an output argument to a `scanf`-like call */
class ScanfOutput extends Expr {
ScanfFunctionCall call;
int varargIndex;
Instruction instr;
ValueNumber valNum;

ScanfOutput() {
this = call.getOutputArgument(varargIndex).getFullyConverted() and
instr.getConvertedResultExpression() = this and
valueNumber(instr) = valNum
}

ScanfFunctionCall getCall() { result = call }

/**
* Returns the smallest possible `scanf` return value that would indicate
* success in writing this output argument.
*/
int getMinimumGuardConstant() {
result =
varargIndex + 1 -
count(ScanfFormatLiteral f, int n |
// Special case: %n writes to an argument without reading any input.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great attention to detail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think there are even more special cases, that I'm parking for a follow-up issue for now (namely, the *_s variants of scanf introduced in C11).

// It does not increase the count returned by `scanf`.
n <= varargIndex and f.getUse() = call and f.getConversionChar(n) = "n"
)
}

predicate hasGuardedAccess(Access e, boolean isGuarded) {
e = this.getAnAccess() and
if
exists(int value, int minGuard | minGuard = this.getMinimumGuardConstant() |
e.getBasicBlock() = blockGuardedBy(value, "==", call) and minGuard <= value
or
e.getBasicBlock() = blockGuardedBy(value, "<", call) and minGuard - 1 <= value
or
e.getBasicBlock() = blockGuardedBy(value, "<=", call) and minGuard <= value
)
then isGuarded = true
else isGuarded = false
}

/**
* Get a subsequent access of the same underlying storage,
* but before it gets reset or reused in another `scanf` call.
*/
Access getAnAccess() {
exists(Instruction dst |
this.bigStep() = dst and
dst.getAst() = result and
valueNumber(dst) = valNum
Copy link
Contributor

@MathiasVP MathiasVP Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this conjunct into smallStep instead? i.e., require that every small step preserves the value number? I think that will reduce the size of the smallStep relation (and more importantly, the bigStep relation!) quite a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be tricky, because while we do want the final Accesses to have the same ValueNumber, we don't want all intermediate instructions to have to have the same ValueNumber too, because then some crucial paths will be severed. Currently, bigStep is the same the .getASuccessor+() closure starting from instr, minus all paths that go through a barrier as defined by isBarrier(). I would hesitantly wager that this is as compact as it gets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, on second thought, maybe bigStep(i) can be optimized by having i = instr, always :) Because none of the i != instr tuples are ever needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be tricky, because while we do want the final Accesses to have the same ValueNumber, we don't want all intermediate instructions to have to have the same ValueNumber too, because then some crucial paths will be severed. Currently, bigStep is the same the .getASuccessor+() closure starting from instr, minus all paths that go through a barrier as defined by isBarrier(). I would hesitantly wager that this is as compact as it gets.

Ah, right. Yeah, I do see what you mean here.

Copy link
Contributor

@MathiasVP MathiasVP Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, on second thought, maybe bigStep(i) can be optimized by having i = instr, always :) Because none of the i != instr tuples are ever needed.

Yes, I think that's a good idea.

Speaking of optimizations: Currently, you're big step is defined something like:

predicate bigStep(Instruction from, Instruction to) { source(from) and smallStep*(from, to) and sink(to) }

The smallStep*(from, to) computation produces a lot of intermediate tuples that you don't really care about. We have a lot of experience in how to do this efficiently in QL, and I'd be happy to chat tomorrow about how to compute this properly. Let me just dump it here for you to read at your own pace:

The trick is to define a unary predicate to avoid producing a lot of unnecessary tuples:

// Holds if `i` can be reached starting at a source.
predicate reachedFromSource(Instruction i) {
  source(i)
  or
  exists(Instruction mid | reachedFromSource(mid) | step(mid, i))
}

// Holds if `i` can be reached starting at a source, and can eventually reach a sink.
predicate reachesSink(Instruction i) {
  reachedFromSource(i) and
  (
    sink(i)
    or
    exists(Instruction mid | reachesSink(mid) | step(i, mid))
  )
}

Then, you can use this predicate to define a smaller step relation:

predicate prunedSmallStep(Instruction from, Instruction to) {
  reachesSink(from) and reachesSink(to) and smallStep(from, to)
}

and finally, you can rewrite your bigStep as:

predicate bigStep(Instruction from, Instruction to) { source(from) and prunedSmallStep*(from, to) and sink(to) }

Copy link
Contributor Author

@d10c d10c Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have saved the best for last, implementing your proposed optimizations ^. Here are the before-and-after MRVA runs... Let's see how much of a difference it makes :)
Before: https://github.com/github/semmle-code/actions/runs/2927651707
After: https://github.com/github/semmle-code/actions/runs/2927920561

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... Nope, I'm calling it. It's not worth it. I've had to inline some predicates to avoid the runaway you see in the "After", and still it's not a significant difference. This is my WIP, so it could be improved in the future: 7cc0d71

P.S. MRVA top 100 run on that WIP code: https://github.com/github/semmle-code/actions/runs/2929372003

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that the new code is not timing out on opencv suggest that's it's a worthwhile improvement to make. I've left a comment on your branch. We can also discuss them tomorrow if you want :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, adding the pragma[inline] to the cartesian-product-prone predicate removed the timeout on opencv 😛 Maybe a bindingset[this] annotation would have done the trick too..? And you're right, I missed the first line from reachesSink, which might make all the difference...

So along with removing the this argument from these predicates (which might be tricky in some cases, since isBarrier does depend on this), I'll leave the exploration of this type of optimization for a follow-up issue, unless you think the performance improvements would be so drastic that they would be worth having in the initial release of this query (that is, if you think we can analyze opencv in significantly less than an hour on GitHub Actions [Locally, it's just a few minutes, fyi])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, adding the pragma[inline] to the cartesian-product-prone predicate removed the timeout on opencv 😛 Maybe a bindingset[this] annotation would have done the trick too..? And you're right, I missed the first line from reachesSink, which might make all the difference...

Yep, I think bindingset would do the same thing as a bindingset annotation implies inlining (as far as I know, at least). I also think the missing conjunct in reachesSink might make all the difference 😄, indeed.

I'm still not completely sure how to read the MRVA run output. I assume the 50 minutes of runtime on opencv is because we also have to compute the IR and all the other underlying things that the query makes use of. If it only takes a few minutes once the cached libraries are computed that sounds fine.

)
}

private Instruction bigStep() {
result = this.smallStep(instr)
or
exists(Instruction i | i = this.bigStep() | result = this.smallStep(i))
}

private Instruction smallStep(Instruction i) {
instr.getASuccessor*() = i and
i.getASuccessor() = result and
not this.isBarrier(result)
}

private predicate isBarrier(Instruction i) {
valueNumber(i) = valNum and
exists(Expr e | i.getAst() = e |
i = any(StoreInstruction s).getDestinationAddress()
or
[e, e.getParent().(AddressOfExpr)] instanceof ScanfOutput
)
}
}

/** Returns a block guarded by the assertion of `value op call` */
BasicBlock blockGuardedBy(int value, string op, ScanfFunctionCall call) {
exists(GuardCondition g, Expr left, Expr right |
right = g.getAChild() and
value = left.getValue().toInt() and
DataFlow::localExprFlow(call, right)
|
g.ensuresEq(left, right, 0, result, true) and op = "=="
or
g.ensuresLt(left, right, 0, result, true) and op = "<"
or
g.ensuresLt(left, right, 1, result, true) and op = "<="
)
}

from ScanfOutput output, ScanfFunctionCall call, Access access
where
output.getCall() = call and
output.hasGuardedAccess(access, false)
select access,
"$@ is read here, but may not have been written. " +
"It should be guarded by a check that the $@ returns at least " +
output.getMinimumGuardConstant() + ".", access, access.toString(), call, call.toString()
4 changes: 4 additions & 0 deletions cpp/ql/src/change-notes/2022-08-24-missing-check-scanf.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
---
category: newQuery
---
* Added a new medium-precision query, `cpp/missing-check-scanf`, which detects `scanf` output variables that are used without a proper return-value check to see that they were actually written. A variation of this query was originally contributed as an [experimental query by @ihsinme](https://github.com/github/codeql/pull/8246).
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
| test.cpp:30:7:30:7 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:30:7:30:7 | i | i | test.cpp:29:3:29:7 | call to scanf | call to scanf |
| test.cpp:46:7:46:7 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:46:7:46:7 | i | i | test.cpp:45:3:45:7 | call to scanf | call to scanf |
| test.cpp:63:7:63:7 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:63:7:63:7 | i | i | test.cpp:62:3:62:7 | call to scanf | call to scanf |
| test.cpp:75:7:75:7 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:75:7:75:7 | i | i | test.cpp:74:3:74:7 | call to scanf | call to scanf |
| test.cpp:87:7:87:7 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:87:7:87:7 | i | i | test.cpp:86:3:86:8 | call to fscanf | call to fscanf |
| test.cpp:94:7:94:7 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:94:7:94:7 | i | i | test.cpp:93:3:93:8 | call to sscanf | call to sscanf |
| test.cpp:143:8:143:8 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:143:8:143:8 | i | i | test.cpp:141:7:141:11 | call to scanf | call to scanf |
| test.cpp:152:8:152:8 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:152:8:152:8 | i | i | test.cpp:150:7:150:11 | call to scanf | call to scanf |
| test.cpp:184:8:184:8 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:184:8:184:8 | i | i | test.cpp:183:7:183:11 | call to scanf | call to scanf |
| test.cpp:203:8:203:8 | j | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 2. | test.cpp:203:8:203:8 | j | j | test.cpp:200:7:200:11 | call to scanf | call to scanf |
| test.cpp:227:9:227:9 | d | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 2. | test.cpp:227:9:227:9 | d | d | test.cpp:225:25:225:29 | call to scanf | call to scanf |
| test.cpp:231:9:231:9 | d | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 2. | test.cpp:231:9:231:9 | d | d | test.cpp:229:14:229:18 | call to scanf | call to scanf |
| test.cpp:243:7:243:7 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:243:7:243:7 | i | i | test.cpp:242:3:242:7 | call to scanf | call to scanf |
| test.cpp:251:7:251:7 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:251:7:251:7 | i | i | test.cpp:250:3:250:7 | call to scanf | call to scanf |
| test.cpp:259:7:259:7 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:259:7:259:7 | i | i | test.cpp:258:3:258:7 | call to scanf | call to scanf |
| test.cpp:271:7:271:7 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:271:7:271:7 | i | i | test.cpp:270:3:270:7 | call to scanf | call to scanf |
| test.cpp:281:8:281:12 | ptr_i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:281:8:281:12 | ptr_i | ptr_i | test.cpp:280:3:280:7 | call to scanf | call to scanf |
| test.cpp:289:7:289:7 | i | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:289:7:289:7 | i | i | test.cpp:288:3:288:7 | call to scanf | call to scanf |
| test.cpp:383:25:383:25 | u | $@ is read here, but may not have been written. It should be guarded by a check that the $@ returns at least 1. | test.cpp:383:25:383:25 | u | u | test.cpp:382:6:382:11 | call to sscanf | call to sscanf |
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Critical/MissingCheckScanf.ql
Loading