Skip to content

Commit 3f77208

Browse files
RogerFordgvenzl
andauthored
Added readme and removed archive (oracle-samples#296)
* Oracle Text query parser * New Oracle Text examples * Split archive into individual directories * Cleaned up, removed passwords and IP addresses * Removed archive, added readme * Update readme.md --------- Co-authored-by: Gerald Venzl <gerald.venzl@oracle.com>
1 parent f817037 commit 3f77208

File tree

864 files changed

+213618
-0
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

864 files changed

+213618
-0
lines changed
Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
-- this should be a test of german decompounding, but it doesn't seem to
2+
-- work
3+
-- Although when we index 'hauptbahnhof' we get tokens of type 9 for
4+
-- haupt and bahnhof, we can't seem to search on them
5+
6+
drop table detext;
7+
8+
create table detext
9+
( id number(4) not null
10+
, name varchar2(30)
11+
)
12+
/
13+
14+
insert into detext values (1, 'Hauptbahnhof');
15+
16+
exec ctx_ddl.drop_preference('mylex')
17+
18+
begin
19+
ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
20+
ctx_ddl.set_attribute ( 'mylex', 'index_stems', 'GERMAN');
21+
ctx_ddl.set_attribute ( 'mylex', 'composite', 'GERMAN');
22+
ctx_ddl.set_attribute ( 'mylex', 'alternate_spelling', 'GERMAN');
23+
end;
24+
/
25+
26+
exec ctx_ddl.drop_preference('myword')
27+
28+
begin
29+
ctx_ddl.create_preference('myword', 'BASIC_WORDLIST');
30+
ctx_ddl.set_attribute('myword','FUZZY_MATCH','GERMAN');
31+
end;
32+
/
33+
34+
create index detext_idx1 on detext(name)
35+
indextype is ctxsys.context
36+
parameters ('WORDLIST myword DATASTORE ctxsys.default_datastore LEXER mylex SYNC (on commit)')
37+
/
38+
39+
select token_text, token_type from dr$detext_idx1$i;
40+
41+
-- this doesn't work
42+
SELECT id, name, score(1) as score FROM detext WHERE contains(name, '$bahnhof', 1) > 0;
43+
44+
-- but this does
45+
SELECT id, name, score(1) as score FROM detext WHERE contains(name, '$haupt', 1) > 0;
Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
create user testuser identified by testuser default tablespace users temporary tablespace temp;
2+
3+
grant connect, resource, ctxapp to testuser;
4+
5+
connect testuser/testuser
6+
7+
create table EmailArchive (
8+
PK NUMBER,
9+
FOLDER VARCHAR2(255),
10+
SUBJECT VARCHAR2(255),
11+
MAILDATE DATE,
12+
MAILTO VARCHAR2(255),
13+
MAILFROM VARCHAR2(255),
14+
MAILCC VARCHAR2(255),
15+
TEXT CLOB);
16+
17+
insert into emailarchive values (1, 'INBOX', 'The QB Fox', '10-JUL-2002','john.smith@oracle.com', 'roger.ford@oracle.com', null,
18+
'The quick brown fox jumps over the lazy dog.
19+
The lazy dog was surprised to see a quick brown
20+
fox jumping over it, and thought to itself "goodness,
21+
that looked like a wolf - or at least a brown fox -
22+
jumping over me".');
23+
24+
insert into emailarchive values (1, 'INBOX', 'Now is the time...', '11-JUL-2002','roger.ford@oracle.com', 'john.smith@oracle.com', null,
25+
'Now is the time for all good men to come to the aid of the party.
26+
The party is not doing very well at the moment as it has run out of
27+
drink, and most of the guests have gone home. What we need is
28+
for someone to come along with more drink, and hopefully with a
29+
quick brown fox as a companion.
30+
');
31+
32+
commit;
33+
34+
create index email_text on emailarchive(text)
35+
indextype is ctxsys.context;
36+
Lines changed: 220 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,220 @@
1+
<title>Hitlist Highlighting with Oracle Text</title>
2+
<H1><FONT COLOR=BLUE>Hitlist Highlighting with Oracle Text</FONT></H1>
3+
4+
<b>Get a <a HREF="KWIC.zip">zip archive</a> of all the files for this example.</b>
5+
6+
<H3><FONT COLOR=BLUE>Introduction</FONT></H3>
7+
8+
The result of a free-text search is commonly a list of documents - or
9+
"hit list" - which match the search. The user will typically scan this
10+
list looking for items which are relevent. To make this process
11+
efficient, we must present as much useful information as possible in
12+
the hitlist. Sometimes the best way to do this is to present
13+
information such as the title of the document. But often it helps to
14+
be able to highlight a section, or sections, of the main text of the
15+
document which best matches the search terms, with those search terms
16+
highlighted.
17+
18+
<P>
19+
This techique is sometimes known as "Key Words In Context", or KWIC.
20+
21+
<P>
22+
This paper will demonstrate how to generate such a KWIC hitlist.
23+
It uses a Java bean to do the main work, and demonstrates calling that
24+
bean from a Java Server Page (JSP). The bean, or its component code,
25+
could easily be called from other Java environments, or from a PL/SQL
26+
procedure as a database procedure stored in Java.
27+
28+
<P>
29+
The application illustrated is a simplified version of one used within
30+
Oracle to search a mailing list archive. It should be easily adaptable
31+
to customers own requirements.
32+
33+
<H3><FONT COLOR=BLUE>The Algorithm</FONT></H3>
34+
35+
The full algorith is presented in the file <code>algorithm.html</code>
36+
in the zip file (see the top of this document). It uses the concepts
37+
of "relevance" and "novelty". Relevance is a measure of the number of
38+
words which match in a particular segment. Novelty is the number of
39+
new words which have not been found in previous segments.
40+
41+
<H3><FONT COLOR=BLUE>Testing the Application</FONT></H3>
42+
43+
Create and populate a simple test table using <code>Create.sql</code>
44+
from the download zip file. You will need to run it initially as a
45+
DBA user such as SYS or SYSTEM. It will create a user called
46+
"testuser", then it will log on as that user, then create a table and
47+
populate it with two rows of nonsense text. Note that if your
48+
environment requires a connect string, you will need to add it to the
49+
testuser/testuser username/password pair.
50+
51+
52+
<P>
53+
Then edit test.jsp, substituting in the necessary parameters for your
54+
database in the lines:
55+
56+
<pre><font color=blue> ods.setServerName ("eddie");
57+
ods.setPortNumber (1521);
58+
ods.setDatabaseName ("eddi10b");
59+
</font></pre>
60+
61+
The first value is the server machine name, the second the SQL*Net
62+
port number (default is 1521) and the third is the database SID.
63+
If you are unsure of these values, check in your "<code>sqlnet.ora</code>" file or
64+
consult your database administrator.
65+
66+
<P>
67+
Now copy test.jsp into your Oracle webserver's HTDOCS directory. The
68+
default, on Unix, will be <code>$ORACLE_HOME/Apache/Apache/htdocs</code>.
69+
70+
<P>
71+
Now compile the Java source:
72+
73+
<pre><font color=blue> javac KWIC.java
74+
</font></pre>
75+
76+
If you don't already have "<code>javac</code>" in your PATH, you will find it
77+
(on Unix) in <code>$ORACLE_HOME/jdk/bin</code>.
78+
79+
<P>
80+
If all goes well, this will generate KWIC.class. This must be
81+
copied into the directory
82+
<code>$ORACLE_HOME/Apache/Apache/htdocs/WEB-INF/classes</code> (or the equivalent
83+
on non-Unix systems). It will probably also be necessary to copy the
84+
file <code>$ORACLE_HOME/jdbc/lib/classes12.jar</code> into the same directory.
85+
86+
Now go to your web browser, and type in
87+
88+
<pre><font color=blue> http://yourservername/test.jsp
89+
</font></pre>
90+
91+
If your Apache web server is running on a different port, such as
92+
8888, you would need to say:
93+
94+
<pre><font color=blue> http://yourservername:8888/test.jsp
95+
</font></pre>
96+
97+
If all goes well you should get a web page consisting of a search
98+
box and a "number of rows" box. Enter "quick and brown" in the search
99+
box and hit "submit".
100+
101+
<P>
102+
You should now see a hitlist with two items, and the search terms
103+
highlighted in context.
104+
105+
<H3><FONT COLOR=BLUE>Customizing</FONT></H3>
106+
107+
<P>
108+
To work with your own table, it shouldn't be necessary to change the
109+
KWIC bean at all. You should be able to make all the necessary
110+
modifications in the JSP file.
111+
112+
<P>
113+
Important: if you change the name of "test.jsp" you MUST change the line
114+
115+
<pre><font color=blue>&lt;form action="test.jsp" method="GET" &gt;
116+
</font></pre>
117+
118+
to refer to the new
119+
name. Otherwise you'll find that when you hit SUBMIT you'll be
120+
running the old version rather than the new one.
121+
122+
<H3><FONT COLOR=BLUE>Dealing with Long or Formatted Documents</FONT></H3>
123+
124+
<p> The example above uses very short pieces of text. It is important to
125+
realise that this code will <em>not</em> work well where the text is
126+
very large, and/or the text is derived from formatted documents (eg
127+
Word or PDF format). This is because it uses the CTX_DOC.HIGHLIGHT
128+
document service, which must fetch (and perhaps filter) the entire
129+
document, even if you only want to deal with the first few kilobytes
130+
of text.
131+
132+
<p> In this case, it will be necessary to fetch, and maybe pre-filter
133+
the text and store the first few KB of each row in a "shadow"
134+
table. An index would be created on the shadow table, but need not be
135+
populated (use the NOPOPULATE parameter). The concordance bean should
136+
then be called with the name of the NOPOPULATE index, and the
137+
appropriate rowid from the shadow table.
138+
139+
<p>
140+
We'll illustrate this with an example. It is up to you to provide the
141+
documents for this example - any Word, PDF or other formatted text
142+
documents will be suitable. The table will contain the file names of
143+
the documents, and the FILE_DATASTORE will be used in the index on them.
144+
145+
Run this as the TESTUSER user created in the previous example:
146+
147+
<pre><font color=blue> create table mydocs (
148+
file_name varchar2(255));
149+
150+
-- change the following to represent your own files
151+
152+
insert into mydocs values ('c:\MyDocs\hello1.doc');
153+
insert into mydocs values ('c:\MyDocs\hello2.doc');
154+
155+
commit;
156+
157+
create index mydocs_index on mydocs(file_name)
158+
indextype is ctxsys.context
159+
parameters ('datastore ctxsys.file_datastore');
160+
161+
-- create the "shadow" table
162+
163+
create table mydocs_shadow(
164+
mydocs_rowid rowid, -- foreign key to rowid in mydocs
165+
text_extract clob);
166+
167+
-- now create a dummy index on the shadow table
168+
169+
create index shadow_index on mydocs_shadow(text_extract)
170+
indextype is ctxsys.context
171+
parameters ('nopopulate');
172+
173+
-- this procedure will loop through mydocs, and will filter the document
174+
-- for each row and insert the first 6000 characters into the shadow table
175+
176+
declare
177+
extract_size integer := 6000; -- amount to use as extract (max 32K)
178+
fetch_size integer; -- amount actually fetched
179+
the_text clob; -- filtered text
180+
extract varchar2(32767); -- extract of filtered text
181+
begin
182+
for c in (select rowid from mydocs) loop
183+
ctx_doc.set_key_type('ROWID'); -- use rowids rather than primary keys
184+
ctx_doc.filter(
185+
index_name => 'mydocs_index',
186+
textkey => c.rowid,
187+
restab => the_text,
188+
plaintext => TRUE);
189+
fetch_size := extract_size;
190+
dbms_lob.read(the_text, fetch_size, 1, extract);
191+
insert into mydocs_shadow values (c.rowid, extract);
192+
end loop;
193+
end;
194+
/
195+
</font></pre>
196+
197+
So after running that lot, you should have two tables - one containing
198+
the file names of some formatted documents, and the other (the
199+
"shadow" table) containing the first 6000 characters of the filtered
200+
text from those documents.
201+
202+
<P>
203+
204+
<code>Test2.jsp</code> (in the zip file) is a variant on the earlier Test.jsp,
205+
modified to suit this two-table approach. Rather than a simple query with
206+
a CONTAINS clause, it uses a join so it can fetch the rowid from the
207+
shadow table.
208+
209+
<P>
210+
When it calls the KWIC java bean, it now passes the rowid from the shadow
211+
table, and specifies the index that was created on the shadow table. Since
212+
the shadow table contains a short, pre-filtered summary, it is very fast to
213+
aceess and process that table.
214+
215+
<P>
216+
217+
<P><HR><P>
218+
Last Modified: 08-DEC-2003 by <a href="mailto:roger.ford@oracle.com">Roger Ford</a>
219+
<P><HR><P>
220+

0 commit comments

Comments
 (0)