Tika client for httpreserve.
Tikalinkextract requires that you start the Tika HTTP server first. It then automates batch submission of your files to Tika's text extraction mechanism. The extracted text is scanned for hyperlinks, which are written to stdout. Example commands you can try are given below.
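The flow described above (send a file to Tika, get text back, pull out hyperlinks) can be sketched as follows. This is an illustrative Python sketch, not the tool's own implementation. It assumes a Tika server is listening on its default address `http://localhost:9998` and uses the server's `/tika` endpoint with `Accept: text/plain`. The names `extract_text`, `find_links`, and `URL_PATTERN` are hypothetical, and the regular expression is a simplification that only matches `http` and `https` links.

```python
import re
import sys
import urllib.request

# Assumption: Tika server running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"

# Hypothetical, simplified pattern: matches http(s) links up to whitespace
# or a common closing delimiter. Not the pattern tikalinkextract uses.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_text(path, tika_url=TIKA_URL):
    """PUT a file to the Tika server and return the plain text it extracts."""
    with open(path, "rb") as handle:
        request = urllib.request.Request(
            tika_url,
            data=handle.read(),
            method="PUT",
            headers={"Accept": "text/plain"},
        )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def find_links(text):
    """Return the unique hyperlinks in text, in order of first appearance."""
    seen = []
    for match in URL_PATTERN.findall(text):
        # Drop trailing sentence punctuation picked up by the pattern.
        link = match.rstrip(".,;:")
        if link and link not in seen:
            seen.append(link)
    return seen


if __name__ == "__main__":
    # Print one link per line for every file given on the command line.
    for path in sys.argv[1:]:
        for link in find_links(extract_text(path)):
            print(link)
```

Only `find_links` can be tried without a running server; `extract_text` needs Tika to be reachable at the assumed address.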
More information is available on the OPF website: "Hyperlinks in your files? How to get them out using tikalinkextract".
./tikalinkextract -seeds -file archives-nz-demo/ > transferlinks.txt
wget -T 10 --tries=1 --page-requisites --span-hosts --convert-links --execute robots=off --adjust-extension --no-directories --directory-prefix=output --warc-cdx --warc-file=accession --wait=0.1 --user-agent=httpreserve-wget/0.0.1 -i transferlinks.txt
For an explanation of the wget options used above, see explainshell.com.
Tika is licensed under the Apache License 2.0.
This tool is licensed under the GNU General Public License Version 3.