Skip to content

[PECO-1260] Support results compression #216

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jan 30, 2024

Conversation

kravets-levko
Copy link
Contributor

@kravets-levko kravets-levko commented Jan 11, 2024

This PR adds a support for LZ4 compressed results (both Arrow-based and CloudFetch). This almost doesn't affect CPU and memory usage (LZ4 is focused on compression and decompression speed), but noticeably decreases amount of data being downloaded. For example, on synthetic tests uncompressed CloudFetch batches have size of 15371496 bytes each, and with LZ4 compression enabled their size varies between 5041380 and 5042153 bytes (approx. 33% from uncompressed size). Of course this ratio depends on actual data.

Profiler reports (synthetic tests, 10000000 records, CloudFetch with single download thread):

Before

image

After

image

Signed-off-by: Levko Kravets <levko.ne@gmail.com>
Signed-off-by: Levko Kravets <levko.ne@gmail.com>
Signed-off-by: Levko Kravets <levko.ne@gmail.com>
@codecov-commenter
Copy link

codecov-commenter commented Jan 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5b01d59) 90.55% compared to head (384886e) 90.59%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #216      +/-   ##
==========================================
+ Coverage   90.55%   90.59%   +0.03%     
==========================================
  Files          62       62              
  Lines        1429     1435       +6     
  Branches      241      245       +4     
==========================================
+ Hits         1294     1300       +6     
  Misses         84       84              
  Partials       51       51              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@nithinkdb nithinkdb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link

@susodapop susodapop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good. A few comments about unifying with pysql and golang plus a thought about making lz4 on/off part of a test matrix so we have more complete test coverage.

@@ -84,6 +84,8 @@ export default class DBSQLClient extends EventEmitter implements IDBSQLClient, I

useCloudFetch: false,
cloudFetchConcurrentDownloads: 10,

useResultsCompression: true,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For unification between golang, nodejs, and pysql can you call this either useLz4Compression or just lz4Compression? In pysql we call this flag explicitly lz4_compression. In golang we call it useLz4Compression

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NP, will rename 👍

this.context = context;
this.source = source;
this.arrowSchema = arrowSchema;
this.isLZ4Compressed = isLZ4Compressed ?? false;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's a difference between pysql and golang: pysql enables lz4 compression by default. golang does not. Is there a compelling reason to not enable this by default in nodejs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No good reason. I'll change it to be enabled by default

Copy link
Contributor Author

@kravets-levko kravets-levko Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I just realized that it already defaults to true - the default value is set in DBSQLClient.getDefaultConfig. This ?? false handles a case when server responds with metadata.lz4Compressed = undefined (which means that actual result is not compressed)

async (session) => {
const operation = await session.executeStatement(`SELECT * FROM ${tableName}`);
const result = await operation.fetchAll();
expect(fixArrowResult(result)).to.deep.equal(expectedArrow);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming: this is the assertion that verifies the decompressed output generated by fixArrowResult(result) is equivalent to expectedArrow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. fixArrowResult does very minor processing of the result to fix floating point fields

const session = await openSession({ cloudFetchConcurrentDownloads });
const session = await openSession({
cloudFetchConcurrentDownloads,
useResultsCompression: false,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a way to run the other e2e tests with lz4 both true and false like a test matrix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a test for useResultsCompression: true.

I agree that it's a good idea to have some sort of test matrix, an even use it more widely in tests. But I think I won't do it right now. I have some tests improvements tasks in my backlog, will add this one as well

@@ -86,4 +89,38 @@ describe('CloudFetch', () => {

expect(fetchedRowCount).to.be.equal(queriedRowsCount);
});

it('should handle LZ4 compressed data', async () => {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo it would be useful to make lz4=true|false part of a matrix for this test so that the setup/teardown is not duplicated between tests. Less copy and paste, and a more trustworthy test result.

Signed-off-by: Levko Kravets <levko.ne@gmail.com>
Copy link

@susodapop susodapop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@kravets-levko kravets-levko merged commit f3c53a5 into main Jan 30, 2024
@kravets-levko kravets-levko deleted the PECO-1260-results-compression branch January 30, 2024 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants