httpreserve/tikalinkextract


tikalinkextract

Tika client for httpreserve.

About

tikalinkextract requires a running Tika HTTP server. Once the server is started, the tool automates batch submission of your files to Tika's text extraction mechanism. The extracted text is then scanned for hyperlinks, which are written to stdout. There are examples you can try below.
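The extract-then-scan flow described above can be approximated by hand, which may help when checking what the tool reports for a single file. The sketch below is not tikalinkextract's own code: it assumes a Tika server listening on its default address `http://localhost:9998`, uses Tika's `PUT /tika` endpoint through `curl`, and matches URLs with a deliberately simple regular expression. The function names `extract_text` and `extract_links` are hypothetical.

```shell
#!/bin/sh
# Sketch only: approximates the flow described above, not the tool's implementation.

# Assumed address of a locally started Tika server (9998 is Tika's default port).
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"

# extract_text FILE: upload one file to Tika with PUT and print the plain text it returns.
extract_text() {
    curl -s -T "$1" -H "Accept: text/plain" "$TIKA_URL"
}

# extract_links: read text on stdin and print each distinct http(s) URL once.
# The pattern is a simplification; it stops at whitespace, quotes, and angle brackets.
extract_links() {
    grep -Eo 'https?://[^[:space:]<>"]+' | LC_ALL=C sort -u
}
```

With the server running, `extract_text report.docx | extract_links` would print the links found in one file (`report.docx` is a placeholder name). A simple pattern like this one can include trailing punctuation in a match, so treat its output as a rough check rather than a replacement for the tool.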

More information is available on the OPF website: Hyperlinks in your files? How to get them out using tikalinkextract

Demo

asciicast

Use with Wget

Extract the links from your files using the -seeds option

./tikalinkextract -seeds -file archives-nz-demo/ > transferlinks.txt

Use the seeds to generate a WARC file

wget -T 10 --tries=1 --page-requisites --span-hosts --convert-links --execute robots=off --adjust-extension --no-directories --directory-prefix=output --warc-cdx --warc-file=accession --wait=0.1 --user-agent=httpreserve-wget/0.0.1 -i transferlinks.txt

See explainshell.com
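Before handing the seed list to Wget, it can be worth removing duplicates and stray lines so each URL is fetched only once. The helper below is a minimal sketch under the assumption that the seeds file holds one URL per line, which is the format `wget -i` expects. The function name `clean_seeds` and the output file `seeds.txt` are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes one URL per line in the seeds file.

# clean_seeds FILE: print the unique lines of FILE that start with http:// or https://.
# Blank lines and anything that does not look like a web URL are dropped.
clean_seeds() {
    grep -E '^https?://' "$1" | LC_ALL=C sort -u
}
```

For example, `clean_seeds transferlinks.txt > seeds.txt` writes the filtered list, which can then be passed to the Wget command above with `-i seeds.txt` in place of `-i transferlinks.txt`.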

Resources that might be useful

License

Tika is licensed under the Apache License 2.0.

This tool is licensed under the GNU General Public License Version 3.