CSCI572 HW: Apache Nutch
1. Objective
In Homework: Tika you used Tika parsers to extract text from the provided PDF files and helped search
for UFOs in the FBI's Vault dataset.
But it's time to take a step back. Let's consider: how did we obtain all those PDF files in the first place?
The short answer is that we crawled the FBI's Vault site http://vault.fbi.gov/, downloaded all of the
relevant PDFs from that site, dropped them in a directory, and then packaged the whole directory into a
tarball.
We used Apache Nutch (http://nutch.apache.org/) to collect the vault.tar.gz dataset. Now, you get to
do it too! You'd think getting all the PDF files would be a snap. It would be, of course, if they were named
with URLs that ended in .pdf, and if the PDFs were all available under a common directory structure,
like http://vault.fbi.gov/pdfs/xxx.pdf. Unfortunately, they're not.
Your goal in this assignment is to download, install, and leverage the Apache Nutch web crawling
framework to obtain all of the PDFs from a tiny subset of the FBI's Vault website that we have mirrored
at http://fs.vsoeclassrooms.usc.edu/minivault/, and then to package those PDFs in order to create your
own vault.tar.gz file.
2. Preliminaries
We will be making use of the UNIX servers of the Student Compute Facility (SCF) at aludra.usc.edu for
this exercise. We will assume a working knowledge of the SCF from CSCI571 or another prerequisite
course; if you are not familiar with it, please look up the ITS documentation available at:
http://www.usc.edu/its/unix/servers/scf.html
http://www.usc.edu/its/unix/
http://www.usc.edu/its/ssh/putty.html
http://www.usc.edu/its/ssh/
http://www.usc.edu/its/sftp/
If you are an experienced Unix/Linux user with the OS already installed, you may want to consider doing this on your
own computer. Check with the TA first to obtain permission before proceeding.
The disk space required is roughly as follows:
Nutch: 65 MB unpacked
Crawl data: up to 200 MB
Extracted PDFs: approximately 100 MB
plus other assorted logs and outputs
We recommend that you have more than 400 MB free on your account before beginning this exercise!
You can check your current disk usage using the command:
quota -v
To see how much space is occupied by the contents of each of your directories, you can use:
du -k
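For example, to see which of your directories are taking up the most space, you could pipe the du output through sort; this particular pipeline is just an illustration, not a required step:

du -k ~ | sort -n | tail

The largest directories appear at the bottom of the listing.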
Further information on managing the storage space available on your SCF account can be found at:
http://www.usc.edu/its/accounts/unixquotas.html
WARNING: Even when done correctly, this exercise requires a very large portion of your SCF disk quota. Be very
careful with your settings; if you botch something, you may end up exceeding your available disk space
and/or encountering other technical issues that can potentially be painful to fix. In particular, pay close
attention to the scope of the crawl to ensure that it is properly confined to our mirror site and does not
extend beyond it to the FBI websites or, worse, the entire Internet!
Check which version of Java your account uses; if you get anything lower than 1.6.0, you will need to change it! Full instructions are in Appendix A.
It is probably a good idea to back all your data up before you begin this exercise anyway, in case something goes wrong.
Make sure to chmod +x bin/nutch, and then test it out by running bin/nutch and making sure
you see the command-line usage output.
Note: Once you have finished unpacking Nutch, you should delete the bin.tar.gz to free up disk space.
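As a rough sketch, the unpack-test-cleanup sequence might look like the following; the archive name apache-nutch-1.6-bin.tar.gz is an assumption, so adjust it to match the file you actually downloaded:

gzip -dc apache-nutch-1.6-bin.tar.gz | tar -xf -    # unpack the binary distribution
cd apache-nutch-1.6
chmod +x bin/nutch                                  # make the launcher executable
bin/nutch                                           # should print the list of available commands
cd ..
rm apache-nutch-1.6-bin.tar.gz                      # reclaim the quota once unpacking succeeds

If bin/nutch prints its list of commands (crawl, inject, fetch, and so on), your installation is working.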
3.2.1 Politeness
Crawlers can retrieve data much more quickly, and in greater depth, than human users, so they can have a
crippling impact on the performance of a site. A server would naturally have a hard time keeping up with
requests from multiple crawlers, as some of you may have already discovered while doing the earlier
crawler assignment!
Except this time, all of you are going to be torturing the same server. Please be kind to it; if it dies under
your bombardment, none of you will be able to complete the assignment!
To avoid overloading the server, you should set an appropriate interval between successive requests to the
same server. (Do not exceed the default settings in Nutch!)
For sanity's sake, please use a wired connection, if possible, to do your crawling.
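For reference, the delay between successive requests to the same host is governed by the fetcher.server.delay property, which defaults to 5 seconds in Nutch 1.x. The conf/nutch-site.xml override below simply restates that default as an illustration; do not set it any lower:

<?xml version="1.0"?>
<configuration>
  <property>
    <!-- Seconds to wait between requests to the same server; keep at or above the default. -->
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
</configuration>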
Configuring Nutch to perform this task will likely be one of the trickiest parts of the assignment.
Pay particular attention to the RegexURLFilter configuration required to crawl the site, and to the crawl command
arguments/parameters; a sketch of both appears after this list.
It may be helpful to do manual inspection of the website to gain a better understanding of what you
are dealing with, and what settings you need to adjust.
o The site that you are crawling, http://fs.vsoeclassrooms.usc.edu/minivault/, was created by
deleting a large portion of our original mirror site in order to reduce its size. There is no need
to be alarmed by the 404 errors that you encounter; those files were intentionally deleted.
o Consequently, you may find it more useful to inspect the larger version of the mirror at
http://fs.vsoeclassrooms.usc.edu/vault/ or the original site at http://vault.fbi.gov/. However, DO
NOT CRAWL THESE SITES! They will exceed the space you have.
You should observe that some of the URLs include spaces; these need to be percent-encoded to
normalize the URL to a valid form that Nutch can work with (for example, a path containing "some file.pdf" becomes "some%20file.pdf").
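As a starting point, a conf/regex-urlfilter.txt confined to the mirror might end with something like the lines below, in place of the catch-all "+." rule at the bottom of the stock file. Treat this as a sketch to adapt, not the required configuration:

# accept anything under the minivault mirror
+^http://fs\.vsoeclassrooms\.usc\.edu/minivault/
# reject everything else
-.

The Nutch 1.x one-step crawl command then takes a seed directory plus depth and topN limits, for example (the numbers here are placeholders that you will need to tune, and urls/ is assumed to contain a seed file listing the minivault URL):

bin/nutch crawl urls -dir crawl -depth 5 -topN 1000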
The result of this command will be to extract all PDF files from <crawl dir> and to output them as individually
named PDF files in the directory identified by <output dir>.
Once you have completed your crawl, the PDF extraction step may be done in any environment you
prefer; the Java program PDFExtractor.java ought to be cross-platform, so there is no strict
requirement for this step to be done on aludra.
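One possible shape for such an extractor is sketched below. It assumes the fetched content lives in the Hadoop SequenceFile at <crawl dir>/segments/<segment>/content/part-00000/data and that the Nutch and Hadoop jars are on the classpath; your own PDFExtractor.java may be organized quite differently (for instance, it should loop over all segments rather than taking a single data file as an argument):

// Sketch only: reads one segment content file and writes out the PDFs it contains.
import java.io.File;
import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.nutch.protocol.Content;

public class PDFExtractorSketch {
  public static void main(String[] args) throws Exception {
    Path data = new Path(args[0]);   // e.g. crawl/segments/<segment>/content/part-00000/data
    File outDir = new File(args[1]); // directory to write the extracted PDFs into
    outDir.mkdirs();

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);

    // Keys are the fetched URLs; values are Nutch Content records (raw bytes plus metadata).
    Text url = new Text();
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

    int count = 0;
    while (reader.next(url, value)) {
      Content content = (Content) value;
      if ("application/pdf".equals(content.getContentType())) {
        // Derive a file name from the URL; real code should sanitize this more carefully.
        String name = url.toString().replaceAll("[^A-Za-z0-9.]+", "_") + ".pdf";
        FileOutputStream out = new FileOutputStream(new File(outDir, name));
        out.write(content.getContent());
        out.close();
        count++;
      }
    }
    reader.close();
    System.out.println("Extracted " + count + " PDF(s) to " + outDir);
  }
}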
A successful PDF extraction should yield 20 files. If you don't have all the files, it may be because
you failed to extract them, but it could also be because you failed to crawl them in the first place.
This should be obvious, but you can test your extracted PDF files by opening them in Adobe Reader.
6. Submission Guidelines
This assignment is to be submitted electronically, before the start of class on the specified due date, via
https://blackboard.usc.edu/ under Homework: Apache Nutch. A how-to movie is available at:
http://www.usc.edu/its/blackboard/support/bb9/assets/movies/bb91/submittingassignment/index.html
Include all Nutch configuration files and/or code that you needed to change or implement in order to
crawl the FBI Vault website. Avoid including unnecessary files. Place these in a directory named nutchconfig.
Include all source code and external libraries needed to extract the PDFs from the compressed
SequenceFile format. Put these in a directory named pdfextract.
o All source code is expected to be commented, to compile, and to run. You should have (at least)
one Java source file, PDFExtractor.java, containing your main() function. You should
also include any other Java source files that you added. Do not submit *.class files. We
will compile your program from the submitted source.
o There is no need to submit jar files for Tika, Nutch and/or Hadoop. If you have used any other
external libraries, you should include those jar files in your submission.
o Prepare a readme.txt containing a detailed explanation of how to compile and execute your
program. Be especially specific about how to include other external libraries if you have used them.
Also include your name, USC ID number, and email address in the readme.txt.
Compress all of the above (nutchconfig folder, pdfextract folder and readme.txt) into a
single zip archive and name it according to the following filename convention:
<lastname>_<firstname>_CSCI572_HW_NUTCH.zip
Use only standard zip format. Do not use other formats such as zipx, rar, ace, etc.
Important Notes:
Make sure that you have actually attached the file when submitting. Failure to do so will be treated as
non-submission.
Successful submission will be indicated in the assignment's submission history. We advise that
you check to verify the timestamp, and download and double-check your zip file for good measure.
Academic Integrity Notice: Your code will be compared to all other students' code via the MOSS
code comparison tool to detect plagiarism.
Appendix A: Setting Up Java
Check which version of Java your session is using; if you get anything lower than 1.6.0, you need to change it.
In order to set up Java properly, you need to know which shell you are using. By default, aludra accounts
are set up to use TC Shell (tcsh), but users have the option to change that. Here is how to check which shell you
are using:
aludra.usc.edu(3): ps -p $$
  PID TTY      TIME CMD
28945 pts/168  0:00 tcsh
aludra.usc.edu(4):
Alternatively, if the above command does not work for you, use:
aludra.usc.edu(5): finger $USER
Login name: ttrojan
In real life: Tommy Trojan
Directory: /home/scf-XX/ttrojan
Shell: /bin/bash
On since Sep 5 07:28:12 on pts/472 from xxx.usc.edu
No unread mail
No Plan.
aludra.usc.edu(6):
If neither of the above commands works for you, it is safe to assume that you are using tcsh, since it is
the default shell.
If you are using tcsh or csh, append the following two lines to your ~/.cshrc file:
source /usr/usc/jdk/1.6.0_23/setup.csh
setenv CLASSPATH .
If you are using bash, append the following two lines to your ~/.bashrc file:
source /usr/usc/jdk/1.6.0_23/setup.sh
export CLASSPATH=.
Once you have completed the steps outlined in that guide, check again to make sure that you now have
the correct version of Java set up. Note that you may need to log out and sign back in again for the
changes to take effect.
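A quick way to verify is to run:

java -version

After sourcing the setup files above, the first line of the output should report version 1.6.0_23 (or, at minimum, nothing lower than 1.6.0).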
If you are still not able to get the desired Java version after following the above instructions, post on
Piazza to obtain TA assistance.