[PECO-1260] Support results compression #216
Conversation
Signed-off-by: Levko Kravets <levko.ne@gmail.com>
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##              main     #216      +/-   ##
==========================================
+ Coverage    90.55%   90.59%   +0.03%
==========================================
  Files           62       62
  Lines         1429     1435       +6
  Branches       241      245       +4
==========================================
+ Hits          1294     1300       +6
  Misses          84       84
  Partials        51       51

View full report in Codecov by Sentry.
Looks good!
Overall looks good. A few comments about unifying with pysql and golang, plus a thought about making lz4 on/off part of a test matrix so we have more complete test coverage.
lib/DBSQLClient.ts
Outdated
@@ -84,6 +84,8 @@ export default class DBSQLClient extends EventEmitter implements IDBSQLClient, I
   useCloudFetch: false,
   cloudFetchConcurrentDownloads: 10,
+  useResultsCompression: true,
NP, will rename 👍
this.context = context;
this.source = source;
this.arrowSchema = arrowSchema;
this.isLZ4Compressed = isLZ4Compressed ?? false;
Here's a difference between pysql and golang: pysql enables lz4 compression by default, golang does not. Is there a compelling reason not to enable this by default in nodejs?
No good reason. I'll change it to be enabled by default
Actually, I just realized that it already defaults to true: the default value is set in DBSQLClient.getDefaultConfig. This ?? false handles the case when the server responds with metadata.lz4Compressed = undefined (which means the actual result is not compressed).
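A minimal sketch of that fallback behavior (the helper name is hypothetical; only the ?? false expression comes from the PR itself):

```javascript
// Hypothetical helper illustrating the `?? false` fallback: when the server
// omits metadata.lz4Compressed (undefined), the result is treated as uncompressed.
function resolveCompression(lz4Compressed) {
  return lz4Compressed ?? false;
}

console.log(resolveCompression(true));      // true: server says compressed
console.log(resolveCompression(false));     // false: server says uncompressed
console.log(resolveCompression(undefined)); // false: flag omitted, treat as uncompressed
```

Note that ??, unlike ||, only falls back for null/undefined, so an explicit false from the server is preserved as-is.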
async (session) => {
  const operation = await session.executeStatement(`SELECT * FROM ${tableName}`);
  const result = await operation.fetchAll();
  expect(fixArrowResult(result)).to.deep.equal(expectedArrow);
Just confirming: this is the assertion that verifies the decompressed output generated by fixArrowResult(result) is equivalent to expectedArrow?
Yes, exactly. fixArrowResult does very minor processing of the result to fix floating point fields.
tests/e2e/cloudfetch.test.js
Outdated
- const session = await openSession({ cloudFetchConcurrentDownloads });
+ const session = await openSession({
+   cloudFetchConcurrentDownloads,
+   useResultsCompression: false,
Do we have a way to run the other e2e tests with lz4 both true and false, like a test matrix?
There is a test for useResultsCompression: true. I agree that it's a good idea to have some sort of test matrix, and even to use it more widely in tests. But I think I won't do it right now. I have some test improvement tasks in my backlog and will add this one as well.
@@ -86,4 +89,38 @@ describe('CloudFetch', () => {
     expect(fetchedRowCount).to.be.equal(queriedRowsCount);
   });

+  it('should handle LZ4 compressed data', async () => {
imo it would be useful to make lz4=true|false part of a matrix for this test so that the setup/teardown is not duplicated between tests. Less copy and paste, and a more trustworthy test result.
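A rough sketch of the matrix idea. Mocha's it and the e2e openSession helper are stubbed out with a generic registerTest callback so the snippet is self-contained; only the useResultsCompression option name comes from the PR:

```javascript
// Sketch: generate one test case per compression setting instead of
// copy-pasting setup/teardown. `registerTest` stands in for Mocha's `it`;
// a real suite would open a session with the given options inside each case.
function registerCloudFetchTests(registerTest) {
  for (const useResultsCompression of [true, false]) {
    registerTest(
      `should fetch all rows (useResultsCompression=${useResultsCompression})`,
      { useResultsCompression },
    );
  }
}

// Demo: collect the generated cases rather than opening real sessions.
const generated = [];
registerCloudFetchTests((name, options) => generated.push({ name, options }));
console.log(generated.map((t) => t.name));
```

With real Mocha the same loop works directly around describe/it, since Mocha registers tests at load time.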
Signed-off-by: Levko Kravets <levko.ne@gmail.com>
LGTM.
This PR adds support for LZ4-compressed results (both Arrow-based and CloudFetch). This almost doesn't affect CPU and memory usage (LZ4 is focused on compression and decompression speed), but noticeably decreases the amount of data being downloaded. For example, in synthetic tests uncompressed CloudFetch batches are 15371496 bytes each, while with LZ4 compression enabled their size varies between 5041380 and 5042153 bytes (approx. 33% of the uncompressed size). Of course, this ratio depends on the actual data.
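The quoted ratio can be sanity-checked from the batch sizes above (all numbers taken from the PR description):

```javascript
// Sanity check of the ~33% compression ratio quoted in the PR description.
const uncompressedBytes = 15371496; // size of one uncompressed CloudFetch batch
const compressedMin = 5041380;      // smallest observed compressed batch
const compressedMax = 5042153;      // largest observed compressed batch

const ratioMin = compressedMin / uncompressedBytes;
const ratioMax = compressedMax / uncompressedBytes;
console.log(ratioMin.toFixed(3), ratioMax.toFixed(3)); // both round to 0.328, i.e. ~33%
```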
Profiler reports (synthetic tests, 10000000 records, CloudFetch with single download thread):
Before
After