-
-
Notifications
You must be signed in to change notification settings - Fork 31.8k
remove/delete method for zipfile/tarfile objects #51067
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
It would be most helpful if a method could be included in the TarFile Usage to remove a single file from an archive would be as follows: import zipfile
zipFileObject = zipfile.ZipFile(archiveName,'a')
zipFileObject.remove(fileToRemove)
zipFileObject.close() Such a method should probably only apply to archives that are in append One possible extra to be included is to allow a list of file names or |
+1 |
Slight change to: "Such a method should probably only apply to archives that are in append The method should probably still apply to an archive in write mode. It |
-1. I don't think this can be implemented in a reasonable way, assuming |
-1, although I can only speak for tarfile. Removing members from a tar |
In light of Lars's comment on the matter, perhaps this functionality |
Are you sure they are not creating a new file in order to delete
Please do - you'll find that deletion from zipfiles comes in a can |
I done it In a very *violent* way. |
In the form in which you have done it, it is clearly In addition, notice that code is for tarfile, whereas
This is not how this works. If you want us to take |
I have attempted to implement a ZipFile.remove function. It seems to work fine. I have submitted a patch. The method of implementation is: find the file's index in the file list, then sum the lengths of the file entries before it to find its location in the archive. Then simply read in all the bytes after it, write them out at that location, and truncate the file x bytes shorter, where x is the length of the record. This works because the directory listing is created when the file is closed, so there's no harm in truncating. I've also made it truncate the zip file after reading in the existing files upon creation, because the directory information is not used after this point. This could use some testing on large files. This is my first patch, so let me know if I've done anything wrong. |
My patch had some bugs, I'll need to do some more testing. Sorry about that. |
What's the status with this patch? If nobody's looking at it I can try to see if it works and write the test and documentation for it. |
Please feel free to test, revise, and write. Though 'removed', the file is still accessible via the history list. (Click 'zipfile_remove.patch' and then 'download'.) |
I fixed the bugs I found, added tests and documentation. What do you guys think? |
Fixed the bugs Martin pointed out and added the relevant tests. Sadly I had to move some stuff around, but I think the changes are all for the better. I wasn't sure about the right convention for the 2 constants I added btw. |
Martin did a review of the newer patch; maybe you didn’t get the mail (there’s a Rietveld bug when a user name without email is given to the Cc field). |
I'm not sure I understand how http://bugs.python.org/review/6818/show works. I've looked all over and only found remarks for "zipfile.remove.patch" and not for "zipfile.remove.2.patch" which addressed all the aforementioned issues. Also, I don't understand how to add myself to the CC of this issue's review page. |
Yuval, can you please submit a contributor agreement? See |
As for adding yourself to the CC list: notice the string "ubershmekel" appearing in the "CC" field of http://bugs.python.org/review/6818/show. It means that you are already on the CC list. |
Yuval has submitted a CLA. I'm moving the proposal to 3.4 as 3.3 is in feature freeze mode. |
Ping. Has this been postponed? |
I agree with Martin and Lars, this issue is not so easy at looks at first glance. For ZIP files we should distinct two different operations.
2a. The safer way is to create temporary file in the same directory, copy the content of original ZIP file excluding deleted file, and then replace original ZIP file by modified copy. Be aware about file and parent directory permissions, owners, and disk space. 2b. The faster but less safe way is to "shift" the content of the ZIP file after deleted file by reading it and writing back in the same ZIP file at different position. This way is not safe because when something bad happen at writing, we can lost all data. And of course there are crafty ZIP files in which the order of files doesn't match the order in central directory or even files data overlap. For performance may be we should implement (2) not as a method to remove single file, but as a method which takes the filter function and then left in the ZIP file only files for which it returns true. Or may be implement (1) and then add a method which cleans up the ZIP archive be removing all files removed from the central directory. We should discuss alternatives. And as for concrete patch, zipfile.remove.2.patch can read the content of all ZIP file in the memory. This is not appropriate, because ZIP file can be very large. |
I'd be interested in taking up the zip portion at Pycon 2015 this year. I recently had need to delete file(s) from a zipfile. To do it today with the existing API requires you to unpack the zip and repack it. The unpack is slow and you need enough free disk space for the uncompressed files. My strategy is essentially exactly what msg229893 2a said: copy binary blobs to a tempfile, then overwrite the original when complete. I would use a name filter function to decide what to delete and optional parameter for the temp file (falling back to tempfile.tempfile if None). IIRC, this is the same strategy used in the dotNet zip library. |
I think it would be better if new method would copy filtered content to other ZIP file. Not alway you want to modifying the original file. The question is what to do with the data before the start of the ZIP file or between files in the ZIP file, or if files in the ZIP file are overlapped. Here is a sample of such file. |
Hi, I've recently been working on a Python module for the Adobe universal container format (UCF) which extends the zip specification - as part of this I wanted to be able to remove and rename files in an archive. I discovered this issue when writing the module so realised there wasn't currently a solution - so I went down the rabbit hole. I've attached a patch which supports the removal and renaming of files in a zip archive. You can also look at this python module in a git-repo which is a the same code but separated out into a class that extends ZipFile: https://github.com/gambl/zipextended. The patch provides 4 main new "public" functions for the zipfile library:
The patch is in part modelled on the rubyzip solution. Remove and rename will initially only update the ZipFile's infolist. Changes are then persisted via a commit function which can be called manually - or will be called automatically upon close. Commit will then clone the zipfile with the necessary changes to a temporary file and replace the original file when that operation has completed successfully. An alternative to remove files without modifying the original is via the clone method directly. This is in the spirit of Serhiy's suggestion of filtering the content and not modifying the original. You can pass a list of filenames or fileinfos of the files to be included in the clone. I have also attempted to address Serhiy's concern with respect to the tricky.zip - "hidden files" in between members of the archive. The clone method will by default retain any hidden files and maintain the same relative order in the archive. You can also elect to ignore the hidden files, and clone with just the files listed in the central directory. I did have to modify the tricky.zip attached to this issue manually as the CRC of file two (with file three embedded) was incorrect - and would therefore fail testzip(). I'm not actually sure how one would create such an archive - but I think that it's valid according to the zip spec. I've actually included the modified version in the patch for a few of the tests. I appreciate that this is a large-ish patch and may take some time to review - but as suggested in the comments - this wasn't as straight forward as is seems! Look forward to your comments. The signatures of the main functions are described below: remove(self, zinfo_or_arcname):
--- rename(self, zinfo_or_arcname, filename):
clone(self, file, filenames_or_infolist=None, ignore_hidden_files=False):
commit(self):
|
All of you who have or might submit patches -- Victorlee, Troy, Mathew, or anyone else, please sign a PSF contributor agreement. We should not even look at a patch from you before you do. Info: https://www.python.org/psf/contrib/ Receipt is official when a * appears after your name. This usually takes about a week. |
Sorry, typo, here's the new PR - |
I'm a bit overwhelmed about all the comments and linked Issues and PR's. |
Read the four longest from recent comments and ignore the small ones about process. issues with possible approaches: #51067 (comment) (which summarizes previous comments) The one PR that’s linked and open is from the author of the latest comment: #19358 So the status is to convince core devs that the approach is worth it, then ensure the PR contains complete docs and tests and applies to the current main branch, and get reviews. |
I have updated the code based on yudilevi2's version and added tests and docs. Should I discuss in that thread or open a new PR? |
Discussions about the bug itself and the general idea of how to fix it should be here. |
Some folks are concerned about that, if a I will argue that this will also happen in most in-place file modification scenarios. A good similar example is appending a file to a ZIP, in which case the appended file data will first overwrite the central directory, and the ZIP will be corrupted if there is an interruption before the final new central directory data is written. In other words: if one isn't so concerned that appending a file may corrupt the ZIP, he shouldn't be so concerned about removing, either. Although there is always a more sophisticated (and likely more indirect) technique to make an "in-place" file modification more secure, such as cloning data to an intermediate temporary file, it should be reasonable enough for a basic official implementation to work in this way. |
Pardon. Is there still something that prevents the patch from implemented? It seems that lacking tests and docs is the only thing for the previous one? |
can someone share the lastest PR for this? I also have a copy applied to the latest version of zipfile that I can create a new PR for if we don't have anything concrete |
See the top post for PR link |
@merwok yup, sorry - so indeed what else is missing? I can work on that |
This code not work for me on python3.8.1 ubuntu20.04 When you use mode import zipfile
# just copied from current commit(f3450f1)'s change
class ZFile(zipfile.ZipFile):
def remove(self, zinfo_or_arcname):
"""Remove a member from the archive."""
if self.mode not in ('w', 'x', 'a'):
raise ValueError("remove() requires mode 'w', 'x', or 'a'")
if not self.fp:
raise ValueError(
"Attempt to write to ZIP archive that was already closed")
if self._writing:
raise ValueError(
"Can't write to ZIP archive while an open writing handle exists"
)
# Make sure we have an existing info object
if isinstance(zinfo_or_arcname, zipfile.ZipInfo):
zinfo = zinfo_or_arcname
# make sure zinfo exists
if zinfo not in self.filelist:
raise KeyError(
'There is no item %r in the archive' % zinfo_or_arcname)
else:
# get the info object
zinfo = self.getinfo(zinfo_or_arcname)
return self._remove_members({zinfo})
def _remove_members(self, members, *, remove_physical=True, chunk_size=2 ** 20):
"""Remove members in a zip file.
All members (as zinfo) should exist in the zip; otherwise the zip file
will erroneously end in an inconsistent state.
"""
fp = self.fp
entry_offset = 0
member_seen = False
# get a sorted filelist by header offset, in case the dir order
# doesn't match the actual entry order
filelist = sorted(self.filelist, key=lambda x: x.header_offset)
for i in range(len(filelist)):
info = filelist[i]
is_member = info in members
if not (member_seen or is_member):
continue
# get the total size of the entry
try:
offset = filelist[i + 1].header_offset
except IndexError:
offset = self.start_dir
entry_size = offset - info.header_offset
if is_member:
member_seen = True
entry_offset += entry_size
# update caches
self.filelist.remove(info)
try:
del self.NameToInfo[info.filename]
except KeyError:
pass
continue
# update the header and move entry data to the new position
if remove_physical:
old_header_offset = info.header_offset
info.header_offset -= entry_offset
read_size = 0
while read_size < entry_size:
fp.seek(old_header_offset + read_size)
data = fp.read(min(entry_size - read_size, chunk_size))
fp.seek(info.header_offset + read_size)
fp.write(data)
fp.flush()
read_size += len(data)
# Avoid missing entry if entries have a duplicated name.
# Reverse the order as NameToInfo normally stores the last added one.
for info in reversed(self.filelist):
self.NameToInfo.setdefault(info.filename, info)
# update state
if remove_physical:
self.start_dir -= entry_offset
self._didModify = True
# seek to the start of the central dir
fp.seek(self.start_dir)
with ZFile('test_zip.zip', 'w') as f:
f.write('README.md', '1/R.md')
f.write('README.md', 'FFF.md')
with ZFile('test_zip.zip', 'a') as f:
f.remove('FFF.md')
with ZFile('test_zip.zip', 'r') as f:
print(f.getinfo('1/R.md'))
print(f.getinfo('FFF.md')) It should raise a But when I remove a file in the first with block, it's okay. with ZFile('test_zip.zip', 'w') as f:
f.write('README.md', '1/R.md')
f.write('README.md', 'FFF.md')
f.remove('FFF.md')
with ZFile('test_zip.zip', 'r') as f:
print(f.filelist)
print(f.getinfo('1/R.md'))
print(f.getinfo('FFF.md')) # FFF.md is no longer here this time |
@narugo1992 This is due to a bug in Python < 3.8.7 that if sys.version_info < (3, 8, 7):
# Fix an issue causing zip not truncated
class ZipFile(zipfile.ZipFile):
def _write_end_record(self):
super()._write_end_record()
if self.mode == "a":
self.fp.truncate()
self.fp.flush()
zipfile.ZipFile = ZipFile |
This issue has been open long time. Is there any appetite for someone to get this working? |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
Linked PRs
The text was updated successfully, but these errors were encountered: