|
| 1 | +<title>Hitlist Highlighting with Oracle Text</title> |
| 2 | +<H1><FONT COLOR=BLUE>Hitlist Highlighting with Oracle Text</FONT></H1> |
| 3 | + |
| 4 | +<b>Get a <a HREF="KWIC.zip">zip archive</a> of all the files for this example.</b> |
| 5 | + |
| 6 | +<H3><FONT COLOR=BLUE>Introduction</FONT></H3> |
| 7 | + |
| 8 | +The result of a free-text search is commonly a list of documents - or |
| 9 | +"hit list" - which match the search. The user will typically scan this |
| 10 | +list looking for items which are relevent. To make this process |
| 11 | +efficient, we must present as much useful information as possible in |
| 12 | +the hitlist. Sometimes the best way to do this is to present |
| 13 | +information such as the title of the document. But often it helps to |
| 14 | +be able to highlight a section, or sections, of the main text of the |
| 15 | +document which best matches the search terms, with those search terms |
| 16 | +highlighted. |
| 17 | + |
| 18 | +<P> |
| 19 | +This techique is sometimes known as "Key Words In Context", or KWIC. |
| 20 | + |
| 21 | +<P> |
| 22 | +This paper will demonstrate how to generate such a KWIC hitlist. |
| 23 | +It uses a Java bean to do the main work, and demonstrates calling that |
| 24 | +bean from a Java Server Page (JSP). The bean, or its component code, |
| 25 | +could easily be called from other Java environments, or from a PL/SQL |
| 26 | +procedure as a database procedure stored in Java. |
| 27 | + |
| 28 | +<P> |
| 29 | +The application illustrated is a simplified version of one used within |
| 30 | +Oracle to search a mailing list archive. It should be easily adaptable |
| 31 | +to customers own requirements. |
| 32 | + |
| 33 | +<H3><FONT COLOR=BLUE>The Algorithm</FONT></H3> |
| 34 | + |
| 35 | +The full algorith is presented in the file <code>algorithm.html</code> |
| 36 | +in the zip file (see the top of this document). It uses the concepts |
| 37 | +of "relevance" and "novelty". Relevance is a measure of the number of |
| 38 | +words which match in a particular segment. Novelty is the number of |
| 39 | +new words which have not been found in previous segments. |
| 40 | + |
| 41 | +<H3><FONT COLOR=BLUE>Testing the Application</FONT></H3> |
| 42 | + |
| 43 | +Create and populate a simple test table using <code>Create.sql</code> |
| 44 | +from the download zip file. You will need to run it initially as a |
| 45 | +DBA user such as SYS or SYSTEM. It will create a user called |
| 46 | +"testuser", then it will log on as that user, then create a table and |
| 47 | +populate it with two rows of nonsense text. Note that if your |
| 48 | +environment requires a connect string, you will need to add it to the |
| 49 | +testuser/testuser username/password pair. |
| 50 | + |
| 51 | + |
| 52 | +<P> |
| 53 | +Then edit test.jsp, substituting in the necessary parameters for your |
| 54 | +database in the lines: |
| 55 | + |
| 56 | +<pre><font color=blue> ods.setServerName ("eddie"); |
| 57 | + ods.setPortNumber (1521); |
| 58 | + ods.setDatabaseName ("eddi10b"); |
| 59 | +</font></pre> |
| 60 | + |
| 61 | +The first value is the server machine name, the second the SQL*Net |
| 62 | +port number (default is 1521) and the third is the database SID. |
| 63 | +If you are unsure of these values, check in your "<code>sqlnet.ora</code>" file or |
| 64 | +consult your database administrator. |
| 65 | + |
| 66 | +<P> |
| 67 | +Now copy test.jsp into your Oracle webserver's HTDOCS directory. The |
| 68 | +default, on Unix, will be <code>$ORACLE_HOME/Apache/Apache/htdocs</code>. |
| 69 | + |
| 70 | +<P> |
| 71 | +Now compile the Java source: |
| 72 | + |
| 73 | +<pre><font color=blue> javac KWIC.java |
| 74 | +</font></pre> |
| 75 | + |
| 76 | +If you don't already have "<code>javac</code>" in your PATH, you will find it |
| 77 | +(on Unix) in <code>$ORACLE_HOME/jdk/bin</code>. |
| 78 | + |
| 79 | +<P> |
| 80 | +If all goes well, this will generate KWIC.class. This must be |
| 81 | +copied into the directory |
| 82 | +<code>$ORACLE_HOME/Apache/Apache/htdocs/WEB-INF/classes</code> (or the equivalent |
| 83 | +on non-Unix systems). It will probably also be necessary to copy the |
| 84 | +file <code>$ORACLE_HOME/jdbc/lib/classes12.jar</code> into the same directory. |
| 85 | + |
| 86 | +Now go to your web browser, and type in |
| 87 | + |
| 88 | +<pre><font color=blue> http://yourservername/test.jsp |
| 89 | +</font></pre> |
| 90 | + |
| 91 | +If your Apache web server is running on a different port, such as |
| 92 | +8888, you would need to say: |
| 93 | + |
| 94 | +<pre><font color=blue> http://yourservername:8888/test.jsp |
| 95 | +</font></pre> |
| 96 | + |
| 97 | +If all goes well you should get a web page consisting of a search |
| 98 | +box and a "number of rows" box. Enter "quick and brown" in the search |
| 99 | +box and hit "submit". |
| 100 | + |
| 101 | +<P> |
| 102 | +You should now see a hitlist with two items, and the search terms |
| 103 | +highlighted in context. |
| 104 | + |
| 105 | +<H3><FONT COLOR=BLUE>Customizing</FONT></H3> |
| 106 | + |
| 107 | +<P> |
| 108 | +To work with your own table, it shouldn't be necessary to change the |
| 109 | +KWIC bean at all. You should be able to make all the necessary |
| 110 | +modifications in the JSP file. |
| 111 | + |
| 112 | +<P> |
| 113 | +Important: if you change the name of "test.jsp" you MUST change the line |
| 114 | + |
| 115 | +<pre><font color=blue><form action="test.jsp" method="GET" > |
| 116 | +</font></pre> |
| 117 | + |
| 118 | +to refer to the new |
| 119 | + name. Otherwise you'll find that when you hit SUBMIT you'll be |
| 120 | + running the old version rather than the new one. |
| 121 | + |
| 122 | +<H3><FONT COLOR=BLUE>Dealing with Long or Formatted Documents</FONT></H3> |
| 123 | + |
| 124 | +<p> The example above uses very short pieces of text. It is important to |
| 125 | +realise that this code will <em>not</em> work well where the text is |
| 126 | +very large, and/or the text is derived from formatted documents (eg |
| 127 | +Word or PDF format). This is because it uses the CTX_DOC.HIGHLIGHT |
| 128 | +document service, which must fetch (and perhaps filter) the entire |
| 129 | +document, even if you only want to deal with the first few kilobytes |
| 130 | +of text. |
| 131 | + |
| 132 | +<p> In this case, it will be necessary to fetch, and maybe pre-filter |
| 133 | +the text and store the first few KB of each row in a "shadow" |
| 134 | +table. An index would be created on the shadow table, but need not be |
| 135 | +populated (use the NOPOPULATE parameter). The concordance bean should |
| 136 | +then be called with the name of the NOPOPULATE index, and the |
| 137 | +appropriate rowid from the shadow table. |
| 138 | + |
| 139 | +<p> |
| 140 | +We'll illustrate this with an example. It is up to you to provide the |
| 141 | +documents for this example - any Word, PDF or other formatted text |
| 142 | +documents will be suitable. The table will contain the file names of |
| 143 | +the documents, and the FILE_DATASTORE will be used in the index on them. |
| 144 | + |
| 145 | +Run this as the TESTUSER user created in the previous example: |
| 146 | + |
| 147 | +<pre><font color=blue> create table mydocs ( |
| 148 | + file_name varchar2(255)); |
| 149 | + |
| 150 | + -- change the following to represent your own files |
| 151 | + |
| 152 | + insert into mydocs values ('c:\MyDocs\hello1.doc'); |
| 153 | + insert into mydocs values ('c:\MyDocs\hello2.doc'); |
| 154 | + |
| 155 | + commit; |
| 156 | + |
| 157 | + create index mydocs_index on mydocs(file_name) |
| 158 | + indextype is ctxsys.context |
| 159 | + parameters ('datastore ctxsys.file_datastore'); |
| 160 | + |
| 161 | + -- create the "shadow" table |
| 162 | + |
| 163 | + create table mydocs_shadow( |
| 164 | + mydocs_rowid rowid, -- foreign key to rowid in mydocs |
| 165 | + text_extract clob); |
| 166 | + |
| 167 | + -- now create a dummy index on the shadow table |
| 168 | + |
| 169 | + create index shadow_index on mydocs_shadow(text_extract) |
| 170 | + indextype is ctxsys.context |
| 171 | + parameters ('nopopulate'); |
| 172 | + |
| 173 | + -- this procedure will loop through mydocs, and will filter the document |
| 174 | + -- for each row and insert the first 6000 characters into the shadow table |
| 175 | + |
| 176 | + declare |
| 177 | + extract_size integer := 6000; -- amount to use as extract (max 32K) |
| 178 | + fetch_size integer; -- amount actually fetched |
| 179 | + the_text clob; -- filtered text |
| 180 | + extract varchar2(32767); -- extract of filtered text |
| 181 | + begin |
| 182 | + for c in (select rowid from mydocs) loop |
| 183 | + ctx_doc.set_key_type('ROWID'); -- use rowids rather than primary keys |
| 184 | + ctx_doc.filter( |
| 185 | + index_name => 'mydocs_index', |
| 186 | + textkey => c.rowid, |
| 187 | + restab => the_text, |
| 188 | + plaintext => TRUE); |
| 189 | + fetch_size := extract_size; |
| 190 | + dbms_lob.read(the_text, fetch_size, 1, extract); |
| 191 | + insert into mydocs_shadow values (c.rowid, extract); |
| 192 | + end loop; |
| 193 | + end; |
| 194 | + / |
| 195 | +</font></pre> |
| 196 | + |
| 197 | +So after running that lot, you should have two tables - one containing |
| 198 | +the file names of some formatted documents, and the other (the |
| 199 | +"shadow" table) containing the first 6000 characters of the filtered |
| 200 | +text from those documents. |
| 201 | + |
| 202 | +<P> |
| 203 | + |
| 204 | +<code>Test2.jsp</code> (in the zip file) is a variant on the earlier Test.jsp, |
| 205 | +modified to suit this two-table approach. Rather than a simple query with |
| 206 | +a CONTAINS clause, it uses a join so it can fetch the rowid from the |
| 207 | +shadow table. |
| 208 | + |
| 209 | +<P> |
| 210 | +When it calls the KWIC java bean, it now passes the rowid from the shadow |
| 211 | +table, and specifies the index that was created on the shadow table. Since |
| 212 | +the shadow table contains a short, pre-filtered summary, it is very fast to |
| 213 | +aceess and process that table. |
| 214 | + |
| 215 | +<P> |
| 216 | + |
| 217 | +<P><HR><P> |
| 218 | +Last Modified: 08-DEC-2003 by <a href="mailto:roger.ford@oracle.com">Roger Ford</a> |
| 219 | +<P><HR><P> |
| 220 | + |
0 commit comments