Full-text search (with emphasis on Japanese)

FULL-TEXT SEARCH
Nick Zadrozny
websolr.com :: bonsai.io

id title …

1 hello, world!

2 hello, 東京!

3 東京こんにちは！

4 こんにちはシニアプロジェクトマネージャー

5 関西国際空港から出発した

6 成田国際空港から出発した

id title LIKE "hello"; …

1 hello, world!

2 hello, 東京!





SQL LIKE
O(N) = SLOW ☹

SQL LIKE is SLOW

- You are scanning through your entire database table and checking each record.

hello Search

• hello, world!
• hello, 東京!

However, you do get results, and that's as far as some go.

But there is another reason why SQL LIKE is a bad idea…

hello world Search

Let's pretend we are the
customer or a user, and
start entering more
queries. This should
work, right?

id title LIKE "hello world"; …

1 hello, world!

2 hello, 東京!



5 関西国際空港から出発した

Here's the same table again.



1 hello, world!

2 hello, 東京!



5 関西国際空港から出発した

6 成田国際空港から出発した

Uh-oh… That didn't work, because SQL LIKE is basically just testing the exact equality of bytes here. That comma breaks our query.
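The failure is easy to reproduce. Here is a minimal sketch, assuming an in-memory SQLite table named `docs` (a hypothetical name) that holds the six example titles. The `%` wildcards are also an assumption; the slides only show `LIKE "hello"`.

```python
import sqlite3

# Hypothetical table holding the six example rows from the slides.
ROWS = [
    (1, "hello, world!"),
    (2, "hello, 東京!"),
    (3, "東京こんにちは！"),
    (4, "こんにちはシニアプロジェクトマネージャー"),
    (5, "関西国際空港から出発した"),
    (6, "成田国際空港から出発した"),
]

def like_search(pattern):
    """Return the ids whose title contains `pattern`, using SQL LIKE."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany("INSERT INTO docs VALUES (?, ?)", ROWS)
    # With a leading % wildcard an ordinary index cannot help:
    # every title has to be examined.
    cur = conn.execute(
        "SELECT id FROM docs WHERE title LIKE ? ORDER BY id",
        ("%" + pattern + "%",),
    )
    ids = [row[0] for row in cur]
    conn.close()
    return ids

print(like_search("hello"))        # the two "hello" rows
print(like_search("hello world"))  # nothing: the comma gets in the way
```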


1 hello, world!

2 hello, 東京!



5 関西国際空港から出発した

And it gets worse: we're not done with our search yet. We still have to check the rest of the table! Again, we're back to being slow.


1 hello, world!

2 hello, 東京!





hello world Search

No results found to match your query.
に一致する情報は見つかりませんでした。

We just scanned through our entire table of data without getting any useful results, and the user has no idea why.

And it gets worse.

hello 東京 Search

に一致する情報は見つかりませんでした。 (No results found to match your query.)

It's easy to make up queries here that seem like they should match something, but they don't.

際空こんにちは Search

に一致する情報は見つかりませんでした。 (No results found to match your query.)

Ultimately, if you want flexible searches, you need to parse your queries and combine the results of multiple searches.

But that is slow!

QUERY PARSING

Flexible queries combine the results of many separate
queries.

Required and optional terms

Flexible order of terms

But many slow searches are even slower!

Users expect flexible queries, but making queries flexible will make a slow search much slower.
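To see why, here is a rough sketch of a "flexible" search built on top of scanning. It assumes the query is split on whitespace and that any matching term is good enough; the names `scan` and `flexible_search` are made up for this illustration.

```python
# The six example titles from the slides, keyed by id.
TITLES = {
    1: "hello, world!",
    2: "hello, 東京!",
    3: "東京こんにちは！",
    4: "こんにちはシニアプロジェクトマネージャー",
    5: "関西国際空港から出発した",
    6: "成田国際空港から出発した",
}

def scan(term, stats):
    """One LIKE-style pass: check every title for the substring."""
    matches = set()
    for doc_id, title in TITLES.items():
        stats["rows_checked"] += 1
        if term in title:
            matches.add(doc_id)
    return matches

def flexible_search(query):
    """Split the query on whitespace and union the per-term scans."""
    stats = {"rows_checked": 0}
    results = set()
    for term in query.split():
        results |= scan(term, stats)
    return sorted(results), stats["rows_checked"]

ids, rows_checked = flexible_search("hello world")
print(ids, rows_checked)
```

Each extra term costs another full pass over the table, so a two-term query against six rows already checks twelve rows.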

STEP ONE:
MAKE IT FAST
So let's revisit the slow
search problem and try
to make it faster

id title …

1 hello, world!

2 hello, 東京!



5 関西国際空港から出発した

6 成田国際空港から出発した

This is our original data. It's stored sensibly enough for SQL, but it's not really optimized for searching. Let's improve it.

id term

1 hello

1 world

2 hello

2 東京

This is one step in a better direction. Separate each term, and maintain its association with the original record that it appears in.
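As a sketch of that step, the snippet below turns the first two records into (id, term) rows. Splitting with the regular expression `\w+` is an assumption that only suits text with whitespace and punctuation between words; a Japanese title would come out as one long term.

```python
import re

# The first two example records from the slides.
RECORDS = {1: "hello, world!", 2: "hello, 東京!"}

def term_rows(records):
    """Return (id, term) pairs, one per word occurrence."""
    rows = []
    for doc_id, title in records.items():
        # \w+ grabs runs of letters and digits, so spaces and
        # punctuation act as separators between terms.
        for term in re.findall(r"\w+", title.lower()):
            rows.append((doc_id, term))
    return rows

for row in term_rows(RECORDS):
    print(row)
```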

id term

1 hello

2 hello

1 world

2 東京

This means we can do something clever, like sort by term instead, which will let us run a faster binary search.

This hypothetical index looks a bit like a normal database index.
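A small sketch of the sorted table, using Python's standard `bisect` module to do the binary search. The `lookup` helper is a made-up name, and Python's default string ordering (by code point) is assumed to be good enough for the demonstration.

```python
from bisect import bisect_left, bisect_right

# (term, id) rows sorted by term, as in the table above.
ROWS = sorted([("hello", 1), ("world", 1), ("hello", 2), ("東京", 2)])
TERMS = [term for term, _ in ROWS]

def lookup(term):
    """Binary-search the sorted terms and return the matching ids."""
    lo = bisect_left(TERMS, term)   # first row with this term
    hi = bisect_right(TERMS, term)  # one past the last row with it
    return [doc_id for _, doc_id in ROWS[lo:hi]]

print(lookup("hello"))
print(lookup("東京"))
```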

PROBLEM:
WHAT IS A TERM?

But we have a problem. How do you decide what makes a "term"?

This is easy in English, where words are separated by whitespace and punctuation.

But languages like Japanese don't use whitespace, and have relatively little punctuation.

N-GRAM
One approach is to split the text into "n-grams".

We can take the original text and break it into two-character "pairs".

シニアプロジェクトマネージャー
シニ ニア アプ プロ …
関西国際空港
関西 西国 国際 際空 空港

The results would look something like this.

It's not a very good technique, but it's better than nothing.
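Splitting into overlapping pairs takes only a few lines. This is a minimal sketch; the function name `ngrams` is made up for the illustration.

```python
def ngrams(text, n=2):
    """Return every overlapping run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("関西国際空港"))
print(ngrams("シニアプロジェクトマネージャー")[:4])
```

A string of six characters produces five pairs, and several of them (西国, 際空) straddle word boundaries.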

N-GRAM

Generates too many “terms”

Terms don't preserve meaning

Bad for index size and relevancy

We can do better!

The problem:
1. It generates a lot of terms
2. Many of these "terms" don't have meaning
3. Bad for index size and performance
4. Bad for relevancy

MORPHOLOGICAL
ANALYSIS
Morphological analysis
uses a dictionary and
statistical modeling of
the language to identify
terms

KUROMOJI
NEW IN LUCENE 3.6.0

Happily, Lucene 3.6.0 was
released one week ago
with an EXCELLENT
Japanese morphological
analyzer package called
Kuromoji

シニアプロジェクトマネージャー
シニア プロジェクト マネージャ
関西国際空港
関西 国際 空港

We can see right away—if you read
Japanese—that we get much better
terms from this kind of analysis.

Since we are confident that we can
tokenize Japanese, let's continue
building our hypothetical index
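To get a feel for dictionary-driven tokenizing, here is a toy sketch. It is not Kuromoji and has no statistical model: it assumes a tiny hand-made dictionary and simply takes the longest dictionary word at each position. The names `DICTIONARY` and `tokenize` are made up for the illustration.

```python
# Tiny hand-made dictionary: an assumption for illustration only.
DICTIONARY = {
    "こんにちは", "シニア", "プロジェクト", "マネージャー",
    "関西", "成田", "国際", "空港", "東京", "から", "出発", "し", "た",
}
MAX_LEN = max(len(word) for word in DICTIONARY)

def tokenize(text):
    """Greedy longest-match segmentation against DICTIONARY."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink. A single
        # unknown character is emitted as its own token.
        for size in range(min(MAX_LEN, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in DICTIONARY or size == 1:
                tokens.append(candidate)
                pos += size
                break
    return tokens

print(tokenize("関西国際空港から出発した"))
print(tokenize("シニアプロジェクトマネージャー"))
```

The output shown on the slide ends in マネージャ, without the trailing ー; this sketch makes no attempt to reproduce that kind of normalization.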

term id
hello 1
hello 2
world 1
東京 2
東京 3
こんにちは 3
こんにちは 4
シニア 4
プロジェクト 4
マネージャ 4

Our table now includes a few of the tokenized Japanese terms.

term id
hello 1
hello 2
world 1
から 5
から 6
こんにちは 3
こんにちは 4
し 5
し 6
シニア 4
た 5
た 6
プロジェクト 4
マネージャ 4
出発 5
出発 6
国際 5
国際 6
成田 6
東京 2
東京 3
空港 5
空港 6
関西 5

When we finish tokenizing all the text, we end up with a table that looks something like this.

This is progress, but we can do better.

term id
hello 1, 2
world 1
から 5, 6
こんにちは 3, 4
し 5, 6
シニア 4
た 5, 6
プロジェクト 4
マネージャ 4
出発 5, 6
国際 5, 6
成田 6
東京 2, 3
空港 5, 6
関西 5

This table maintains one entry per term, associated with the set of IDs of the records that include it.
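Building that structure from tokenized records is short. In this sketch the token lists are copied by hand from the tables in the slides rather than produced by a real analyzer, and `build_index` is a made-up name.

```python
from collections import defaultdict

# Terms per record, copied from the tokenized table in the slides.
TOKENIZED = {
    1: ["hello", "world"],
    2: ["hello", "東京"],
    3: ["東京", "こんにちは"],
    4: ["こんにちは", "シニア", "プロジェクト", "マネージャ"],
    5: ["関西", "国際", "空港", "から", "出発", "し", "た"],
    6: ["成田", "国際", "空港", "から", "出発", "し", "た"],
}

def build_index(tokenized):
    """Map each term to the sorted list of record ids containing it."""
    postings = defaultdict(set)
    for doc_id, terms in tokenized.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_index(TOKENIZED)
print(len(index))
print(index["空港"])
```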

LUCENE
“INVERTED INDEX”
We have been building a structure that is similar to the "inverted index" built by Lucene.

Lucene is a library that
specializes in creating and
maintaining efficient data
structures for your index.

空港 Search

Let's try a few more
searches against this
new data structure to
compare it to our earlier
slow searches

term id
hello 1, 2
world 1
から 5, 6
こんにちは 3, 4
し 5, 6
シニア 4
た 5, 6
プロジェクト 4
マネージャ 4
出発 5, 6
国際 5, 6
成田 6
東京 2, 3
空港 5, 6
関西 5

Because we now have a sorted list of terms, we can perform a binary search.

We check the middle.

Check the middle again.

And check the middle again. Three operations to find our matching records, from a list of 15 terms!
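The walk-through can be checked with a hand-written binary search that counts how many terms it looks at. This is a sketch under one assumption worth flagging: Python sorts strings by code point, which may order the kana differently from the slides, but any consistent order works for binary search. The `find` helper is a made-up name.

```python
# The 15-term index from the slides: term -> record ids.
INDEX = {
    "hello": [1, 2], "world": [1], "から": [5, 6], "こんにちは": [3, 4],
    "し": [5, 6], "シニア": [4], "た": [5, 6], "プロジェクト": [4],
    "マネージャ": [4], "出発": [5, 6], "国際": [5, 6], "成田": [6],
    "東京": [2, 3], "空港": [5, 6], "関西": [5],
}
# Sorted by code point (an assumption; see the note above).
TERMS = sorted(INDEX)

def find(term):
    """Binary-search TERMS; return (ids, number of terms probed)."""
    lo, hi = 0, len(TERMS) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if TERMS[mid] == term:
            return INDEX[term], probes
        if TERMS[mid] < term:
            lo = mid + 1   # discard the lower half
        else:
            hi = mid - 1   # discard the upper half
    return [], probes

ids, probes = find("空港")
print(ids, probes)
```

With 15 terms, no lookup needs more than four probes, whether or not the term is present.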

id title …

1 hello, world!

2 hello, 東京!



5 関西国際空港から出発した

Now that we have the matching IDs, it is a simple matter for SQL to fetch the matching rows.

空港 Search

• 関西国際空港から出発した
• 成田国際空港から出発した

And we have our search results, in much less time.

FAST!
O(LOG N)

So the bottom line is that an inverted index is very fast!

We can take advantage of this speed for better queries.

空港のマネージャ Search

A good search engine
processes your query
into tokens, the same as
it does your data, and
runs a separate "query"
for each term

空港

term id
hello 1, 2
world 1
から 5, 6
こんにちは 3, 4
し 5, 6
シニア 4
た 5, 6
プロジェクト 4
マネージャ 4
出発 5, 6
国際 5, 6
成田 6
東京 2, 3
空港 5, 6
関西 5

Let's find documents matching our first term.

の

Now let's look for the second term.

We didn't find it, but that's okay.

マネージャ

Now let's look for documents matching the third term.

We can combine the matched documents in many different ways using set theory.

id title …

1 hello, world!

2 hello, 東京!



5 関西国際空港から出発した

In this case, let's just fetch all the documents that match any of the terms.


空港のマネージャ Search

• こんにちはシニアプロジェクトマネージャー
• 関西国際空港から出発した
• 成田国際空港から出発した

And we have our results!
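The whole query can be sketched with plain Python sets. The query is assumed to be already tokenized into 空港, の and マネージャ, as in the walk-through; `search` and its `require_all` flag are made-up names for the illustration.

```python
# The 15-term index from the slides: term -> record ids.
INDEX = {
    "hello": [1, 2], "world": [1], "から": [5, 6], "こんにちは": [3, 4],
    "し": [5, 6], "シニア": [4], "た": [5, 6], "プロジェクト": [4],
    "マネージャ": [4], "出発": [5, 6], "国際": [5, 6], "成田": [6],
    "東京": [2, 3], "空港": [5, 6], "関西": [5],
}

def search(query_terms, require_all=False):
    """Look up each term, then combine the sets of record ids."""
    # A term missing from the index contributes an empty set.
    postings = [set(INDEX.get(term, ())) for term in query_terms]
    if not postings:
        return []
    if require_all:
        combined = set.intersection(*postings)  # every term must match
    else:
        combined = set.union(*postings)         # any term may match
    return sorted(combined)

print(search(["空港", "の", "マネージャ"]))
print(search(["国際", "関西"], require_all=True))
```

Union gives the "match any term" behaviour used here; intersection would give required terms instead, and with の missing from the index it would return nothing for this query.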

REVIEW

Using an index is FASTER

Using an index is more FLEXIBLE

Lucene creates and manages efficient index structures

USING LUCENE:
SOLR,
ELASTICSEARCH

SOLR, ELASTICSEARCH

HTTP interface to Lucene.

Scale separately from your application.

Use with any language or framework.

Abstract away low-level Lucene implementation details.
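As a rough idea of what "HTTP interface" means in practice, the sketch below only builds the requests and does not send them. The hosts, the index name `articles` and the field `title` are hypothetical; the `/select` handler with `q` and `wt` parameters (Solr) and the `_search` endpoint with a `match` query (Elasticsearch) are taken from the documented APIs of current versions, so details may differ on older releases.

```python
import json
from urllib.parse import urlencode

def solr_select_url(base_url, query):
    """Build a GET URL for Solr's /select handler, asking for JSON."""
    params = urlencode({"q": query, "wt": "json"})
    return base_url.rstrip("/") + "/select?" + params

def elasticsearch_search(base_url, index, field, text):
    """Build the URL and JSON body for an Elasticsearch match query."""
    url = "{}/{}/_search".format(base_url.rstrip("/"), index)
    body = json.dumps({"query": {"match": {field: text}}}, ensure_ascii=False)
    return url, body

# Hypothetical hosts, index and field names.
print(solr_select_url("http://localhost:8983/solr", "title:空港"))
print(elasticsearch_search("http://localhost:9200", "articles", "title", "空港"))
```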

Solr

Created in 2004.
Well-established and widely adopted.
Pre-RESTful API design.
XML configuration files.
Distribution & real-time a work in progress.
More features.
Many developers.

ElasticSearch

Created in 2010.
Growing quickly with early adopters.
Modern RESTful JSON.
Minimal JSON/YAML configuration.
Distributed & real-time by design.
More minimalist.
Just one “benevolent dictator.”

GIVE IT A TRY!
https://devcenter.heroku.com/articles/websolr

https://devcenter.heroku.com/articles/bonsai
