Serialize StateToReproduce to JSON #1261

Open
wants to merge 4 commits into
base: main
from

Conversation

jinhuix
Contributor

@jinhuix jinhuix commented Jul 27, 2025

Motivation

To support a standalone reducer in SQLancer, we need a way to serialize the StateToReproduce object, which encapsulates all the information needed to reproduce a bug.

Solution

  • This PR introduces a mechanism to serialize StateToReproduce into a JSON file when a bug is found.
  • The serialization captures all essential fields: statements, errorType (EXCEPTION, NOREC, TLP_WHERE), databaseName, databaseProvider, reproducerData, and exception.
  • Serialization is controlled via a new command-line option: --serialize-reproduce-state, and serialized .json files are written to logs/<dbms>/reproduce/.
  • A private static inner class StateToReproduceSerializor is introduced to control which fields are included in the output.
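
As an illustration of the option-gated output described above, here is a minimal sketch, not the PR's actual code. The class name, the method names, the boolean parameter standing in for --serialize-reproduce-state, the "sqlite3" directory name, and the <databaseName>.json file name are all assumptions made for the example; only the logs/<dbms>/reproduce/ layout comes from the description.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReproduceStateWriterSketch {

    // Mirrors the layout named in the PR description: logs/<dbms>/reproduce/.
    static Path reproduceDir(Path logsRoot, String dbms) {
        return logsRoot.resolve(dbms).resolve("reproduce");
    }

    // Writes the payload only when the option is enabled.
    // Returns the written file, or null when serialization is switched off.
    static Path writeIfEnabled(boolean serializeReproduceState, Path logsRoot, String dbms,
            String databaseName, String json) throws IOException {
        if (!serializeReproduceState) {
            return null;
        }
        Path dir = reproduceDir(logsRoot, dbms);
        Files.createDirectories(dir);
        // Assumption: one file per database, named after it.
        Path file = dir.resolve(databaseName + ".json");
        return Files.write(file, json.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for the real logs root.
        Path root = Files.createTempDirectory("logs");
        Path written = writeIfEnabled(true, root, "sqlite3", "database6",
                "{\"errorType\": \"NOREC\"}");
        System.out.println(written);
    }
}
```

Returning null when the option is off keeps the default behavior unchanged for users who do not pass the flag.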

Example Output

{
  "statements": [
    "CREATE VIRTUAL TABLE rt0 USING rtree(c0, c1, c2, c3, c4);",
    ...
  ],
  "databaseName": "database6",
  "databaseProvider": "sqlancer.sqlite3.SQLite3Provider",
  "errorType": "NOREC",
  "expectedErrorsMap": {
    "INSERT OR FAIL INTO rt0 VALUES (NULL, 0.9105867290640974, \u0027\u0027, x\u00270a74\u0027, NULL), (x\u0027\u0027, NULL, 0x3cef0261, NULL, NULL), (\u0027280862872\u0027, \u00271022296673\u0027, \u0027n\u0027, x\u0027c9c3f71d\u0027, NULL);": [
      "[SQLITE_CONSTRAINT]  Abort due to constraint violation (rtree constraint failed: rt0.(c1\u003c\u003dc2))"
    ],
    ...
  },
  "reproducerData": {
    "optimizedQueryString": "SELECT COUNT(*) FROM rt0 WHERE (((SQLITE_SOURCE_ID())\u003c\u003d(rt0.c4)))",
    "unoptimizedQueryString": "SELECT SUM(count) FROM (SELECT ((((SQLITE_SOURCE_ID())\u003c\u003d(rt0.c4))) IS TRUE)  as count FROM rt0)",
    "shouldUseAggregate": "true"
  }
}
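
The \u0027, \u003c, and \u003d sequences in the output above are standard JSON escapes for the apostrophe, "<", and "=". This is typical of JSON writers that escape HTML-sensitive characters by default, and any JSON parser restores the original SQL text on load. The sketch below shows only the decoding step, for reading the sample by eye. The class and method names are hypothetical, and this is not a full JSON parser (for instance, it does not treat an escaped backslash specially).

```java
final class JsonEscapeSketch {

    // Replaces each backslash-u escape (four hex digits) with the character it denotes.
    // All other text is copied through unchanged.
    static String decodeUnicodeEscapes(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length()) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragment taken from the expectedErrorsMap entry in the sample output.
        System.out.println(decodeUnicodeEscapes("rt0.(c1\\u003c\\u003dc2)")); // prints rt0.(c1<=c2)
    }
}
```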

@jinhuix jinhuix force-pushed the serialize-state-to-reproduce branch from f3352e8 to a9cd21c on July 27, 2025 13:13
@mrigger
Contributor

mrigger commented Jul 28, 2025

Why do we convert it to JSON? Can we not just use the Java serialization API?

@jinhuix
Contributor Author

jinhuix commented Jul 28, 2025

> Why do we convert it to JSON? Can we not just use the Java serialization API?

Yes. I used JSON for readability, but Java serialization is indeed simpler. I’m fine switching to that.

- Replaced JSON with Java serialization in StateToReproduce
- Removed unused getters from StateToReproduce
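
For context, a round trip through the Java serialization API looks roughly like the sketch below. ReproState is a hypothetical, much-reduced stand-in for StateToReproduce (the real class holds statement objects, not plain strings), and the helper names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class JavaSerializationSketch {

    // Hypothetical stand-in for StateToReproduce with only serializable fields.
    static final class ReproState implements Serializable {
        private static final long serialVersionUID = 1L;
        final String databaseName;
        final List<String> statements;

        ReproState(String databaseName, List<String> statements) {
            this.databaseName = databaseName;
            this.statements = new ArrayList<>(statements);
        }
    }

    // Serializes any Serializable object graph into a byte array.
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        return buffer.toByteArray();
    }

    // Reads back whatever object was written by toBytes.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ReproState state = new ReproState("database6",
                Arrays.asList("CREATE VIRTUAL TABLE rt0 USING rtree(c0, c1, c2, c3, c4);"));
        ReproState copy = (ReproState) fromBytes(toBytes(state));
        System.out.println(copy.databaseName + ": " + copy.statements.size() + " statement(s)");
    }
}
```

The trade-off raised in the thread shows up here: no field-selection code is needed, but the output is a binary stream and not a human-readable file.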

@KabilanMA KabilanMA left a comment

LGTM. @mrigger

@mrigger
Contributor

mrigger commented Aug 7, 2025

Sorry for the delay and thanks for pinging me, @KabilanMA.

I think this could be simplified a lot. At a high level, I see that we are adding an additional class, StateToReproduceSerializor, as well as various fields and methods that we use to store the data and information about the test oracles. Would it not be possible to just serialize the existing objects, without creating any extra classes or fields?

@jinhuix
Contributor Author

jinhuix commented Aug 8, 2025

Yes, we can serialize StateToReproduce directly by making it Serializable, marking non-serializable fields (statements, databaseProvider, localState) as transient, and handling them manually.
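
The transient-plus-manual-handling approach described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: ReproState and Statement are hypothetical stand-ins, and only one transient field is shown. The private writeObject and readObject hooks are the standard java.io.Serializable mechanism for customizing what gets written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

final class TransientFieldsSketch {

    // Hypothetical statement type that does NOT implement Serializable.
    static final class Statement {
        final String sql;

        Statement(String sql) {
            this.sql = sql;
        }
    }

    // Hypothetical stand-in for StateToReproduce.
    static final class ReproState implements Serializable {
        private static final long serialVersionUID = 1L;
        final String databaseName;
        // Skipped by default serialization; written manually below.
        transient List<Statement> statements = new ArrayList<>();

        ReproState(String databaseName) {
            this.databaseName = databaseName;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject(); // writes databaseName
            out.writeInt(statements.size());
            for (Statement s : statements) {
                out.writeObject(s.sql); // store only the SQL text
            }
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject(); // restores databaseName
            int count = in.readInt();
            statements = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                statements.add(new Statement((String) in.readObject()));
            }
        }
    }

    // Serializes and immediately deserializes the state in memory.
    static ReproState roundTrip(ReproState state) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(state);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (ReproState) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ReproState state = new ReproState("database6");
        state.statements.add(new Statement("SELECT COUNT(*) FROM rt0;"));
        ReproState copy = roundTrip(state);
        System.out.println(copy.databaseName + " -> " + copy.statements.get(0).sql);
    }
}
```

Without the custom hooks, a transient field would simply come back as null after deserialization, which is why each transient field needs its own manual write and read.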

I added errorType, expectedErrorsMap, and reproducerData to StateToReproduce as they seem essential for reducing bugs, so I would prefer to keep them.

@KabilanMA @mrigger Does this sound good to you?

@KabilanMA

We can serialize the StateToReproduce class directly to store the necessary information, unless it's used in other contexts with different requirements. If that's the case, it's cleaner to keep the serialization in a separate class and include only the required fields.

Either approach is valid; it ultimately comes down to maintainability. Keeping a separate serializable class provides better clarity and separation of concerns, especially if StateToReproduce serves multiple purposes.

@mrigger
Contributor

mrigger commented Aug 11, 2025

I think we'd add a lot of technical debt by proceeding with these additional fields and classes. Essentially, we already have all the information we need in the relevant objects. Converting things to a more low-level representation and having fields such as errorType would go against object-oriented programming principles, I think.

@KabilanMA

Agreed.

@jinhuix
Contributor Author

jinhuix commented Aug 13, 2025

This update simplifies serialization by serializing the StateToReproduce class directly, and it removes all previously added fields and classes.


3 participants