Diffs recognizing less similarity since 4.10 #129

Closed
stnagel opened this issue Oct 17, 2021 · 4 comments

stnagel commented Oct 17, 2021

Since version 4.10 of the library, diffs are recognizing less similarity between texts.

In the attached program, under version 4.9, the library correctly recognizes that there is only a 5-character difference between the texts (3 letters + 2 whitespace characters). Under version 4.10, the library reports that a large block of identical text has been deleted and then added.

TestCase.txt

4.9 output:

   1 EQUAL    apple1                                                                           apple1                                                                          
   2 EQUAL    apple2                                                                           apple2                                                                          
   3 EQUAL    apple3                                                                           apple3                                                                          
   4 CHANGE   A man named Frankenstein==oldCHANGE==> abc <==old==to Switzerland for cookies!   A man named Frankenstein                                                        
   5 CHANGE                                                                                    ==newCHANGE==>xyz<==new==                                                       
   6 CHANGE                                                                                    to Switzerland for cookies!                                                     
   7 EQUAL    banana1                                                                          banana1                                                                         
   8 EQUAL    banana2                                                                          banana2                                                                         
   9 EQUAL    banana3                                                                          banana3                                                                         

4.10 output:

   1 EQUAL    apple1                                                                           apple1                                                                          
   2 EQUAL    apple2                                                                           apple2                                                                          
   3 EQUAL    apple3                                                                           apple3                                                                          
   4 CHANGE   A man named Frankenstein==oldDELETE==> abc to Switzerland for cookies!<==old==   A man named Frankenstein                                                        
   5 INSERT                                                                                    ==newINSERT==>xyz<==new==                                                       
   6 INSERT                                                                                    ==newINSERT==>to Switzerland for cookies!<==new==                               
   7 EQUAL    banana1                                                                          banana1                                                                         
   8 EQUAL    banana2                                                                          banana2                                                                         
   9 EQUAL    banana3                                                                          banana3                                                                         

Admittedly, there are aspects of the 4.10 output that are improved over the 4.9 output. For example, the fact that line 6 of the 4.9 output is indicated as a CHANGE, but there is no oldLine text and no changes in the newLine text can be confusing. However, the sacrifice in accuracy in 4.10 is far less desirable. In 4.9, the line 6 difference is indeed a change, but it's almost like a new tag is needed to indicate a group (?) change to make it clear that the change is a continuation of the line 4 difference.

wumpz pinned this issue Oct 19, 2021
wumpz self-assigned this Oct 19, 2021

wumpz commented Oct 20, 2021

There was a change that decompresses computed deltas whose source and target sizes differ. That was needed to correct some issues with multi-line diffs. But I will look into it. Maybe we could make this decompression skippable.

wumpz closed this as completed in 0fd38db Oct 21, 2021

wumpz commented Oct 21, 2021

Could you check it? I added an option to skip the delta decompression. With it switched off, you get your old results.

stnagel commented Nov 2, 2021

This does indeed appear to address my issue. Thank you very much for the quick turn-around (quicker than I've even checked back). I worry a little about yet-another-builder-option, but I defer to your understanding of the code.

Do you have some idea when the next release will be? In my production environment, due to security and network restrictions, I can only download published maven artifacts.

If you are curious, here is sample data for a test case that is closer to my real-world scenario:
issue129_1.txt
issue129_2.txt
My users are highly interested in seeing only substantive changes in text; in this data, nothing but white-space formatting has changed. With decompressDeltas(false), the text is recognized as a series of CHANGEs with small spacing differences; with decompressDeltas(true), the code outputs a series of CHANGEs and then an INSERT, which would be counterproductive for my users.

Thank you again!

wumpz reopened this Nov 8, 2021
wumpz unpinned this issue Mar 10, 2022

wumpz commented May 14, 2022

Sorry about the late answer.

What about .ignoreWhiteSpaces(true)?
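
As a rough illustration of what a whitespace-insensitive comparison means for the use case above, here is a minimal standalone sketch. It uses only the Java standard library and is not the library's implementation; the class `WhitespaceInsensitiveCompare` and its methods are hypothetical names introduced for this example, and the sample lines are made up from the text in the report.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of the diff library.
class WhitespaceInsensitiveCompare {

    // Trim both ends and collapse every internal run of whitespace to one space.
    static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    // Two lines count as equal when they differ only in whitespace layout.
    static boolean sameIgnoringWhitespace(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        // Made-up sample data: same words, different spacing.
        List<String> oldLines = List.of("A man  named   Frankenstein", "to Switzerland for cookies!");
        List<String> newLines = List.of("A man named Frankenstein ", "to  Switzerland for cookies!");

        // Compare lines pairwise by index and label each pair.
        for (int i = 0; i < oldLines.size(); i++) {
            boolean same = sameIgnoringWhitespace(oldLines.get(i), newLines.get(i));
            System.out.println((same ? "EQUAL  " : "CHANGE ") + normalize(newLines.get(i)));
        }
    }
}
```

Note that a line-by-line normalization like this only hides spacing differences within a line. It would not, by itself, cover text whose line breaks have moved, so whether it fits the attached data depends on how that data was reformatted.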

wumpz closed this as completed Sep 8, 2022