Optimize LZ4_memcpy_using_offset #1222

Status: Open, 4 commits to merge into base branch dev

Nicoshev (Contributor) commented Apr 8, 2023

LZ4_memcpy_using_offset is modified as follows:

When the offset is a power of 2 less than or equal to sizeof(reg_t), a variable of type reg_t is filled using a single multiplication, instead of loading sizeof(reg_t) bytes onto the stack and calling memcpy. The variable is then written to the output, memset-style.
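The multiplication trick can be sketched as follows. This is a minimal illustration, not the code from the PR: the names pattern_from_offset and fill_with_pattern are hypothetical, reg_t is assumed to be a 64-bit unsigned integer, and the write loop stops exactly at length rather than over-copying as a real wildcopy would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t reg_t; /* assumption: register-sized type is 64 bits */

/* Hypothetical helper: build a reg_t whose bytes repeat the `offset`
 * bytes found at `src`. Each multiplier has a 1 in every lane where
 * the pattern must be replicated, so one multiplication fills the
 * whole register. Loading through a type of the pattern's own width
 * keeps the result independent of endianness. */
static reg_t pattern_from_offset(const unsigned char* src, size_t offset)
{
    switch (offset) {
    case 1:
        return (reg_t)src[0] * 0x0101010101010101ULL;
    case 2: {
        uint16_t p;
        memcpy(&p, src, sizeof p);
        return (reg_t)p * 0x0001000100010001ULL;
    }
    case 4: {
        uint32_t p;
        memcpy(&p, src, sizeof p);
        return (reg_t)p * 0x0000000100000001ULL;
    }
    case 8: {
        reg_t p;
        memcpy(&p, src, sizeof p); /* pattern already fills the register */
        return p;
    }
    default:
        assert(0 && "offset must be 1, 2, 4 or 8");
        return 0;
    }
}

/* Hypothetical helper: write the pattern memset-style. Every full
 * store starts at a multiple of sizeof(reg_t), which is itself a
 * multiple of the offset, so the pattern phase stays correct and the
 * tail is simply a prefix of the pattern. */
static void fill_with_pattern(unsigned char* dst, reg_t pattern, size_t length)
{
    size_t i = 0;
    for (; i + sizeof(reg_t) <= length; i += sizeof(reg_t))
        memcpy(dst + i, &pattern, sizeof(reg_t));
    if (i < length)
        memcpy(dst + i, &pattern, length - i);
}
```

For a match at offset 2, calling pattern_from_offset(dst - 2, 2) and then fill_with_pattern(dst, pattern, length) extends the two preceding bytes across the next length bytes of output.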

The third parameter, BYTE* dstEnd, was changed to size_t length. This has two advantages. First, in the stop conditions of the copy loops, the source pointer is compared against an immediate value instead of another variable. Second, in LZ4_decompress_generic, at the end of the unsafe decode loop (lines ~2130 through ~2135), it is no longer necessary to set and maintain the variable cpy while copying to the output.
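To illustrate the general idea behind the signature change, here is a small sketch of the two loop shapes. It is an assumption about the kind of loop involved, not the PR's actual code: copy_until_end and copy_by_length are hypothetical names, both copy in 8-byte blocks and may write up to 7 bytes past the requested end, and src and dst are assumed to be at least 8 bytes apart.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* End-pointer form: the caller computes dstEnd, and the stop
 * condition compares two variables (dst and dstEnd). */
static void copy_until_end(unsigned char* dst, const unsigned char* src,
                           const unsigned char* dstEnd)
{
    assert(dst < dstEnd);
    do {
        memcpy(dst, src, 8);
        dst += 8;
        src += 8;
    } while (dst < dstEnd);
}

/* Length form: the remaining length is counted down, and the stop
 * condition compares it against the constant 8. No end pointer has
 * to be computed or kept alive by the caller. */
static void copy_by_length(unsigned char* dst, const unsigned char* src,
                           size_t length)
{
    assert(length > 0);
    for (;;) {
        memcpy(dst, src, 8);
        if (length <= 8)
            break;
        length -= 8;
        dst += 8;
        src += 8;
    }
}
```

Both functions perform the same number of 8-byte copies for any length greater than zero, so the difference lies only in what the caller must compute and what the loop compares.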

Nicoshev force-pushed the optimize_LZ4_memcpy_using_offset branch from 2109a5c to f176796 on September 6, 2023 19:33
Cyan4973 (Member) commented Sep 7, 2023

Just as a data point,
this is more of the same:
the benchmark is unable to establish a clear win
due to a larger amount of "noise", attributed to instruction alignment,
so the variation in results is too large to confirm the impact of a small speed improvement.

 PR1222                                                                         │ dev

compile LZ4 with gcc-7                                                          │compile LZ4 with gcc-7
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3024.9 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3023.2 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2866.9 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2880.3 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2650.0 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2634.2 MB/s
compile LZ4 with gcc-8                                                          │compile LZ4 with gcc-8
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2932.6 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3387.0 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2717.5 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3339.1 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2524.9 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3041.8 MB/s
compile LZ4 with gcc-9                                                          │compile LZ4 with gcc-9
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3113.8 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2907.8 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2919.5 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2701.1 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2697.6 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2503.7 MB/s
compile LZ4 with gcc-10                                                         │compile LZ4 with gcc-10
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3205.7 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3554.8 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3091.0 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3524.6 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2832.1 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3217.8 MB/s
compile LZ4 with clang-6.0                                                      │compile LZ4 with clang-6.0
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3014.4 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3123.9 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2887.4 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3052.0 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2656.3 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2798.6 MB/s
compile LZ4 with clang-7                                                        │compile LZ4 with clang-7
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2913.8 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3195.7 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2831.2 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3192.6 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2590.9 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2922.1 MB/s
compile LZ4 with clang-8                                                        │compile LZ4 with clang-8
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3086.9 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2916.8 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3041.8 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2824.3 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2775.7 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2586.4 MB/s
compile LZ4 with clang-9                                                        │compile LZ4 with clang-9
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3215.6 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3245.0 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3098.9 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3109.7 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2846.0 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2862.1 MB/s
compile LZ4 with clang-10                                                       │compile LZ4 with clang-10
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2958.2 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3048.3 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2732.4 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2883.3 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2527.4 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2664.8 MB/s
compile LZ4 with clang-11                                                       │compile LZ4 with clang-11
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3135.1 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3192.4 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2935.9 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2955.8 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2700.6 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2742.0 MB/s
compile LZ4 with clang-12                                                       │compile LZ4 with clang-12
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3330.2 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2991.4 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3253.5 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2850.6 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2986.2 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2629.8 MB/s

I will try to find a better way to establish the improvements produced by a PR.
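One way to see the noise in the table above is to compute the relative difference per compiler and check how often its sign flips. The sketch below does this for the lesia.tar.L12.lz4 rows only, with the decompression speeds copied from the table; the type and function names (bench_row, bench_summary, summarize) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char* compiler;
    double pr;  /* PR1222 decompression speed, MB/s */
    double dev; /* dev decompression speed, MB/s */
} bench_row;

/* lesia.tar.L12.lz4 rows, copied from the benchmark table. */
static const bench_row lesia_rows[] = {
    {"gcc-7",     3024.9, 3023.2},
    {"gcc-8",     2932.6, 3387.0},
    {"gcc-9",     3113.8, 2907.8},
    {"gcc-10",    3205.7, 3554.8},
    {"clang-6.0", 3014.4, 3123.9},
    {"clang-7",   2913.8, 3195.7},
    {"clang-8",   3086.9, 2916.8},
    {"clang-9",   3215.6, 3245.0},
    {"clang-10",  2958.2, 3048.3},
    {"clang-11",  3135.1, 3192.4},
    {"clang-12",  3330.2, 2991.4},
};
#define LESIA_ROW_COUNT (sizeof(lesia_rows) / sizeof(lesia_rows[0]))

typedef struct {
    size_t pr_wins;   /* rows where the PR build is faster */
    double min_delta; /* most negative (PR - dev) / dev, in percent */
    double max_delta; /* most positive (PR - dev) / dev, in percent */
} bench_summary;

static bench_summary summarize(const bench_row* rows, size_t n)
{
    bench_summary s = {0, 0.0, 0.0};
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        double delta = (rows[i].pr - rows[i].dev) / rows[i].dev * 100.0;
        if (delta > 0.0)
            s.pr_wins++;
        if (i == 0 || delta < s.min_delta)
            s.min_delta = delta;
        if (i == 0 || delta > s.max_delta)
            s.max_delta = delta;
    }
    return s;
}
```

For these rows, summarize(lesia_rows, LESIA_ROW_COUNT) reports the PR build ahead on 4 of 11 compilers, with deltas ranging from roughly -13% to +11%. A swing of that size between compiler versions is far larger than the small gain the change is expected to produce.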
