Skip to content

Commit 2deaa30

Browse files
doc: add new blog
1 parent c71d5bc commit 2deaa30

File tree

3 files changed

+391
-138
lines changed

3 files changed

+391
-138
lines changed
Lines changed: 391 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,391 @@
1+
---
2+
author:
3+
- name: Herrington Darkholme
4+
sidebar: false
5+
date: 2024-12-25
6+
head:
7+
- - meta
8+
- property: og:type
9+
content: website
10+
- - meta
11+
- property: og:title
12+
content: Design Space for Code Search Query
13+
- - meta
14+
- property: og:url
15+
content: https://ast-grep.github.io/blog/code-search-design-space.html
16+
- - meta
17+
- property: og:description
18+
content: A review of the design space for code search tools.
19+
- - meta
20+
- property: og:image
21+
content: https://ast-grep.github.io/image/blog/query-design.png
22+
---
23+
24+
# Design Space for Code Search Query
25+
26+
Code search is a critical tool for modern software development. It enables developers to quickly locate, understand, and reuse existing code, boosting productivity and ensuring code consistency across projects.
27+
28+
At its core, ast-grep is a [code search](/guide/introduction.html#motivation) tool. Its other features, such as [linting](/guide/scan-project.html) and code [rewriting](/guide/rewrite-code.html), are built upon the foundation of its code search capabilities.
29+
30+
This blog post delves into the design space of code search, with a particular focus on how queries are designed and used. We'll be drawing inspiration from the excellent paper, "[Code Search: A Survey of Techniques for Finding Code](https://www.lucadigrazia.com/papers/acmcsur2022.pdf)". But we won't be covering every single detail from that paper. Instead, our focus will be on the diverse ways that code search tools allow users to express their search intent.
31+
32+
33+
## Query Design and Query Types
34+
35+
Every code search begins with a query, which is simply a way to tell the search engine what kind of code we're looking for. The way these queries are designed is crucial. Code search tool designers aim to achieve several key goals:
36+
37+
#### Easy
38+
A query should be easy to write, allowing users to quickly search without needing extensive learning. If it's too difficult to write a query, people might get discouraged from using the tool altogether.
39+
#### Expressive
40+
Users should be able to express whatever they're looking for. If the query language is too limited, you simply cannot find some results.
41+
#### Precise
42+
43+
The query should be specific enough to yield relevant results, avoiding irrelevant findings. An imprecise query will lead to a lot of noise.
44+
45+
---
46+
47+
Achieving all three of these goals simultaneously is challenging, as they often pull in opposing directions. For example, a very simple and easy query language might be expressive enough, or a very precise query language might be too complex for the average user.
48+
49+
How do code search tools balance these goals? The blog categorizes code search queries into a few main types, each with its own characteristics: informal queries, formal queries, and hybrid queries.
50+
51+
![query design](/image/blog/query-design.png)
52+
53+
## Informal Queries
54+
55+
These queries are closest to how we naturally express ourselves, and can be further divided into:
56+
57+
### Free-Form Queries
58+
59+
These are often free-form, using natural language to describe the desired code functionality, like web search. For example, "read file line by line" or "FileReader close."
60+
61+
* **Pros:** Easy for users to formulate, similar to using a web search engine, and highly expressive.
62+
* **Cons:** Can be ambiguous and less precise due to the nature of natural language and potential vocabulary mismatches between the query and the code base.
63+
64+
Tools like [GitHub Copilot](https://docs.github.com/en/enterprise-cloud@latest/copilot/using-github-copilot/asking-github-copilot-questions-in-github) use this approach.
65+
66+
### Input-Output Examples
67+
These queries specify the desired behavior of the code by providing input-output pairs. You specify the desired behavior using pairs of inputs and their corresponding outputs. For example, the input "susie@mail.com" should result in the output "susie".
68+
69+
* **Pros**: Allows to precisely specify desired behavior
70+
* **Cons**: May require some effort to provide sufficient examples
71+
72+
This approach is more common in academic research than practical tools. This blog has not been aware of open source tools that use this approach.
73+
74+
75+
_We will not discuss informal queries in detail, as it is not precise._
76+
77+
## **Formal Queries Based on Existing Programming Languages**
78+
79+
Formal queries use a structured approach, making them more precise. They can be further divided into several subcategories.
80+
81+
### Plain Code
82+
83+
The simplest version involves providing an exact code snippet that needs to be matched in the codebase. For instance, a user might search for instances of the following Java snippet:
84+
85+
```java
86+
try {
87+
File file = File.createTempFile("foo", "bar");
88+
} catch (IOException e) {
89+
}
90+
```
91+
92+
Not many tools directly support plain code search. They usually break search queries into smaller parts through the tokenization process, like traditional search engines.
93+
94+
A notable example may be [grep.app](https://grep.app).
95+
96+
### Code with Holes
97+
98+
This approach involves providing code snippets with placeholders to search for code fragments. For example, a user might search for the following pattern in Java:
99+
100+
```java
101+
public void actionClose (JButton a, JFrame f) {
102+
$$$BODY
103+
}
104+
```
105+
106+
Here, `$$$BODY` is a placeholder, and the code search engine will try to locate all matching code. ast-grep falls into this category, treating the query as an Abstract Syntax Tree (AST) with holes. The holes in ast-grep are called metavariables.
107+
108+
Other tools like gritql and the [structural search feature](https://www.jetbrains.com/help/idea/tutorial-work-with-structural-search-and-replace.html) in IntelliJ IDEA also use this technique.
109+
110+
### Code with Pattern Matching Symbols
111+
112+
These queries make use of special symbols to represent and match code structures. For example, the following query in [Comby](https://comby.dev/docs/basic-usage#how-matching-works) attempts to find all if statements where the condition is a comparison.
113+
114+
```comby
115+
if (:[var] <= :[rest])
116+
```
117+
118+
In Comby, the `:[var]` and `:[rest]` are special markers that match strings of code.
119+
120+
```java{1}
121+
if (width <= 1280 && height <= 800) {
122+
return 1;
123+
}
124+
```
125+
126+
The `:[var]` matches any string until a `<=` character is found and in this case is `width`. `:[rest]` matches everything that follows, `1280 && height <= 800`. Unlike ast-grep, Comby is not AST-aware, as the `:[rest]` in the example spans across multiple AST nodes. Tools like [Comby](https://comby.dev/) and [Shisho](https://github.com/flatt-security/shisho) use this approach.
127+
128+
129+
### Pros and Cons
130+
131+
**Pros:** Easy to formulate for developers familiar with programming languages.
132+
133+
**Cons:** Parsing incomplete code snippets can be a challenge.
134+
135+
The downside of using existing languages is also emphasized in the IntelliJ IDEA documentation:
136+
137+
> Any (SSR) template entered should be a well formed Java construction ...
138+
139+
An off-the-shelf grammar of the programming language may not be able to parse a query because the query is [incomplete or ambiguous](/advanced/pattern-parse.html#pattern-is-ast-based).
140+
For example, `"key": "value"` is not a valid JSON object, a JSON parser will reject and will fail to create a query. Maybe it is clear to a human that it is a key-value pair, but the parser does not know that. Other examples will be like [distinguishing function calls](/catalog/c/) and macro invocation in C/C++.
141+
142+
:::tip
143+
ast-grep takes a unique approach to this problem. It uses a [pattern object](/guide/rule-config/atomic-rule.html#pattern-object) to represent and disambiguate a complete and valid code snippet, and then leverages a [`selector`](/reference/rule.html#pattern) to extract the part that matches the query.
144+
:::
145+
146+
147+
## Formal Queries using Custom Languages
148+
149+
### Significant Extensions of Existing Programming Languages
150+
151+
These languages extend existing programming languages with features like wildcard tokens or regular expression operators. For example, the pattern `$(if $$ else $) $+` might be used to find all nested if-else statements in a codebase. [Coccinelle](https://coccinelle.gitlabpages.inria.fr/website/) and [Semgrep](https://semgrep.dev/) are tools that take this approach.
152+
153+
154+
Semgrep's pattern-syntax, for example, has extensive features such as [ellipsis metavariables](https://semgrep.dev/docs/writing-rules/pattern-syntax#ellipsis-metavariables), [typed metavariables](https://semgrep.dev/docs/writing-rules/pattern-syntax#typed-metavariables), and [deep expression operators](https://semgrep.dev/docs/writing-rules/pattern-syntax#deep-expression-operator), that cannot be parsed by a standard programming language' implementation.
155+
156+
:::code-group
157+
```yaml [Ellipsis Metavariables]
158+
# combine ellipses and metavariables to match a sequence of ASTs
159+
# note the ellipsis is not valid programming language syntax
160+
pattern: foo($...ARGS, 3, $...ARGS)
161+
# this pattern will match foo(1, 2, 3, 4, 5)
162+
```
163+
164+
```yaml [Typed Metavariables]
165+
# look for calls to the log method on Logger objects.
166+
# A simple pattern like this will match `Math.log()` as well
167+
pattern: $LOGGER.log(...)
168+
# typed metavariable can put a type constraint on the metavariable
169+
# but it is no longer valid Java code
170+
pattern: (java.util.logging.Logger $LOGGER).log(...)
171+
```
172+
173+
```yaml [Deep Expression operators]
174+
# Use the deep expression operator <... [your_pattern] ...>
175+
# to match an expression that
176+
# could be deeply nested within another expression
177+
pattern: |
178+
if <... $USER.is_admin() ...>:
179+
...
180+
```
181+
:::
182+
183+
**Pros**: These languages can be more expressive than plain programming languages.
184+
185+
**Cons**: Users need to learn new syntax and semantics and tool developers to support the extension
186+
187+
:::warning Difference from ast-grep
188+
Note ast-grep also supports multi meta variables in the form of `$$$VARS`. Compared to Semgrep, ast-grep's metavariables still produce valid code snippets.
189+
:::
190+
191+
We can represent also search query using **Domain Specific Language**
192+
193+
### Logic-based Querying Languages
194+
195+
These languages utilize first-order logic or languages like Datalog to express code properties. For example, a user can find all classes with the name "HelloWorld". Some of these languages also resemble SQL. [CodeQL](https://codeql.github.com/) and [Glean](https://glean.software/docs/angle/intro/) are two notable examples. Here is an example from CodeQL:
196+
197+
```sql
198+
from If ifstmt, Stmt pass
199+
where pass = ifstmt.getStmt(0) and
200+
pass instanceof Pass
201+
select ifstmt, "This 'if' statement is redundant."
202+
```
203+
204+
This CodeQL query will identify redundant if statements in Python, where the first statement within the if block is a pass statement.
205+
206+
:::details Explaination of the query
207+
* `from If ifstmt, Stmt pass`: This part of the query defines two variables, `ifstmt` and `pass`, which will be used in the query.
208+
* `where pass = ifstmt.getStmt(0) and pass instanceof Pass`: This part of the query filters the results. It checks if the first statement in the `ifstmt` is a `Pass` statement.
209+
* `select ifstmt, "This 'if' statement is redundant."`: This part of the query selects the results. It returns the `ifstmt` and a message.
210+
:::
211+
212+
**Pros:** These languages can precisely express complex code properties beyond syntax.
213+
214+
**Cons:** Learning curve is steep.
215+
216+
### Embedded Domain Specific Language
217+
218+
Embedded DSLs are using the host language to express the query. The query is embedded in the host language, and the host language provides the necessary constructs to express the query. The query is then parsed and interpreted by the tool.
219+
220+
There are further two flavors of embedded DSLs: configuration-based and program-based.
221+
222+
#### Configuration-based eDSL
223+
224+
Configuration-based eDSLs allow user to provide configuration objects that describes the query. The tool then interprets this configuration object to perform the search. ast-grep CLI and semgrep CLI both adopt this approach using YAML files.
225+
226+
:::code-group
227+
```yaml [ast-grep YAML rule]
228+
id: match-function-call
229+
language: c
230+
rule:
231+
pattern:
232+
context: $M($$$);
233+
selector: call_expression
234+
```
235+
236+
```yaml [Semgrep YAML rule]
237+
rules:
238+
- id: my-pattern-name
239+
pattern: |
240+
TODO
241+
message: "Some message to display to the user"
242+
languages: [python]
243+
severity: ERROR
244+
```
245+
246+
:::
247+
248+
Configuration files are more expressive than patterns and still relatively easy to write. Users usually already know the host language (YAML) and can leverage its constructs to express the query.
249+
250+
251+
#### Program-based eDSL
252+
253+
Program-based eDSLs provide direct access to the AST through AST node objects.
254+
255+
Examples of programmatic APIs include [JSCodeshift](https://jscodeshift.com/build/api-reference/), the [Code Property Graph](https://docs.joern.io/code-property-graph/) from [Joern](https://joern.io/), and ast-grep's [NAPI](https://ast-grep.github.io/guide/api-usage.html).
256+
257+
:::code-group
258+
```typescript [@ast-grep/napi]
259+
import { parse, Lang } from '@ast-grep/napi'
260+
261+
let source = `console.log("hello world")`
262+
const ast = parse(Lang.JavaScript, source) // 1. parse the source
263+
const root = ast.root() // 2. get the root
264+
const node = root.find('console.log($A)') // 3. find the node
265+
node.getMatch('A').text() // 4. collect the info
266+
// "hello world"
267+
```
268+
269+
```javascript [JSCodeshift]
270+
const j = require('jscodeshift');
271+
272+
const root = j(`const a = 1; const b = 2;`);
273+
274+
const types = root.find(j.VariableDeclarator).getTypes();
275+
console.log(types); // Set { 'VariableDeclarator' }
276+
```
277+
278+
```scala [Code Property Graph]
279+
import io.shiftleft.codepropertygraph.Cpg
280+
import io.shiftleft.semanticcpg.language._
281+
282+
object FindExecCalls {
283+
def main(args: Array[String]): Unit = {
284+
// Load the C codebase
285+
val cpg: Cpg = Cpg.apply("path/to/your/codebase")
286+
287+
// Find all `exec` function calls and print their locations
288+
cpg.call("exec").location.l.foreach(println)
289+
}
290+
}
291+
```
292+
:::
293+
294+
295+
**Pros:** Offer more precision and expressiveness and are relatively easy to write.
296+
297+
**Cons**: The overhead to communicate between the host language and the search tool can be high.
298+
299+
### General Purpose Like Programming Language
300+
301+
Finally, tools can also design their own general purpose programming languages. These languages provide a full programming language to describe code properties. [GritQL](https://about.grit.io/) is an example of this approach.
302+
303+
For example, this GritQL query rewrites all `console.log` calls to `winston.debug` and all `console.error` calls to `winston.warn`:
304+
305+
```gritql
306+
`console.$method($msg)` => `winston.$method($msg)` where {
307+
$method <: or {
308+
`log` => `debug`,
309+
`error` => `warn`
310+
}
311+
}
312+
```
313+
314+
:::details Explaination of the Query
315+
316+
1. **Pattern Matching**: The pattern `console.$method($msg)` is used to match code where there is a `console` object with a method (`$method`) and an argument (`$msg`). Here, `$method` and `$msg` are placeholders for any method and argument, respectively.
317+
318+
2. **Rewrite**: The rewrite symbole `=>` specifies that the matched `console` code should be transformed to use `winston`, followed by the method (`$method`) and the argument (`$msg`).
319+
320+
3. **Method Mapping**: The `where` clause specifies additional constraints on the rewrite. Specifically, `$method <: or { 'log' => 'debug', 'error' => 'warn' }` means:
321+
- If `$method` is `log`, it should be transformed to `debug`.
322+
- If `$method` is `error`, it should be transformed to `warn`.
323+
324+
In sum, this rule replaces console logging methods with their corresponding Winston logging methods:
325+
- `console.log('message')` becomes `winston.debug('message')`
326+
- `console.error('message')` becomes `winston.warn('message')`
327+
:::
328+
329+
**Pros:** It offers more precision and expressiveness compared to simple patterns and configuration-based embedded DSLs. But it may not be as flexible as program-based eDSL nor as powerful as logic-based languages.
330+
331+
**Cons:** Have the drawback of requiring users to learn the custom language first. It is easier to learn than logic-based languages, but still requires some learning compared to using embedded DSL.
332+
333+
334+
## Hybrid Queries
335+
336+
Hybrid queries combine multiple query types. For example, you can combine free-form queries with input-output examples, or combine natural language queries with program element references.
337+
338+
339+
ast-grep is a great example of a tool that uses hybrid queries. You can define patterns directly in a YAML rule or use a programmatic API.
340+
341+
First, you can embed the pattern in the YAML rule, like this:
342+
343+
```yaml
344+
rule:
345+
pattern: console.log($A)
346+
inside:
347+
kind: function_declaration
348+
```
349+
350+
You can also use the similar concept in the programmatic API
351+
352+
```typescript
353+
import { Lang, parse } from '@ast-grep/napi'
354+
355+
const sg = parse(Lang.JavaScript, code)
356+
sg.root().find({
357+
rule: {
358+
pattern: 'console.log($A)',
359+
inside: {
360+
kind: 'function_declaration'
361+
}
362+
}
363+
})
364+
```
365+
366+
This flexible design allows you to combine basic queries into larger, more complex ones, and you can always use a general-purpose language for very complex and specific searches.
367+
368+
:::warning ast-grep favors existing programming languages
369+
We don't want the user to learn a new language, but rather use the existing language constructs to describe the query. We also think TypeScript is a great language with [great type system](/blog/typed-napi.html). There is no need to reinvent a new language to express code search logic.
370+
:::
371+
372+
## ast-grep's Design Choices
373+
374+
Designing a code search tool involves a delicate balancing act. It's challenging to simultaneously achieve ease of use, expressiveness, and precision, as these goals often conflict. Code search tools must carefully navigate these trade-offs to meet the diverse needs of their users.
375+
376+
ast-grep makes specific choices to address this challenge:
377+
378+
* **Prioritizing Familiarity**: It uses pattern matching based on existing programming language syntax, making it easy for developers to start using the tool with familiar coding structures.
379+
* **Extending with Flexibility**: It incorporates configuration-based (YAML) and program-based (NAPI) embedded DSLs, providing additional expressiveness for complex searches.
380+
* **Hybrid, and Progressive, Design**: Its pattern matching, YAML rules, and NAPI are designed for hybrid use, allowing users to start simple and gradually add complexity. The concepts in each API are also transferable, enabling users to progressively learn more advanced techniques.
381+
* **AST-Based Precision**: It emphasizes precision by requiring all queries to be AST-based, ensuring accurate results. Though it comes with the trade-off that queries should be carefully crafted.
382+
* **Multi-language Support**: Instead of creating a new query language for all programming languages or significantly extending existing ones for code search purposes, which would be an enormous undertaking, ast-grep reuses the familiar syntax of the existing programming languages in its patterns. This makes the tool more approachable for developers working across multiple languages.
383+
384+
## Additional Considerations
385+
386+
While we've focused on query design, there are other factors that influence the effectiveness of code search tools. These include:
387+
388+
* Offline Indexing: This is crucial for rapid offline searching. Currently, ast-grep always builds an AST in memory for each query, meaning it doesn't support offline indexing. Tools like grep.app, which do use indexing, is faster for searching across millions of repositories.
389+
* Information Indexing: Code search can index various kinds of information besides just code elements. Variable scopes, type information, definitions, and control and data flow are all valuable data for code search. Currently, ast-grep only indexes the AST itself.
390+
* Retrieval Techniques: How a tool finds matching code given a query is a critical aspect. Various algorithmic and machine learning approaches exist for this. ast-grep uses a manual implementation that compares the query's AST with the code's AST.
391+
* Ranking and Pruning: How search results are ordered is also a critical factor in providing good search results.

0 commit comments

Comments
 (0)