@@ -7,16 +7,13 @@ via a regular expression. Accepts the following settings:
The following are settings that can be set for a `pattern` analyzer
type:

- [cols="<,<",options="header",]
- |===================================================================
- |Setting |Description
- |`lowercase` |Should terms be lowercased or not. Defaults to `true`.
- |`pattern` |The regular expression pattern, defaults to `\W+`.
- |`flags` |The regular expression flags.
- |`stopwords` |A list of stopwords to initialize the stop filter with.
- Defaults to an 'empty' stopword list Check
- <<analysis-stop-analyzer,Stop Analyzer>> for more details.
- |===================================================================
+ [horizontal]
+ `lowercase`:: Should terms be lowercased or not. Defaults to `true`.
+ `pattern`:: The regular expression pattern, defaults to `\W+`.
+ `flags`:: The regular expression flags.
+ `stopwords`:: A list of stopwords to initialize the stop filter with.
+ Defaults to an 'empty' stopword list. Check
+ <<analysis-stop-analyzer,Stop Analyzer>> for more details.


*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.
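
As an editorial aside, a minimal sketch of an analyzer that sets all four of the options above at once may help make them concrete. The analyzer name and the specific pattern, flag, and stopword values are illustrative assumptions, not taken from the original page:

[source,js]
--------------------------------------------------
# illustrative sketch - the analyzer name and values below are assumptions
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern",
          "lowercase": true,
          "pattern": "\\W+",
          "flags": "CASE_INSENSITIVE",
          "stopwords": ["and", "is", "the"]
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE

Here `\\W+` (the documented default) matches the separators between tokens, in keeping with the note above.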
@@ -29,101 +26,103 @@ Pattern API] for more details about `flags` options.
==== Pattern Analyzer Examples

In order to try out these examples, you should delete the `test` index
- before running each example:
-
- [source,js]
- --------------------------------------------------
- curl -XDELETE localhost:9200/test
- --------------------------------------------------
+ before running each example.

[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
- curl -XPUT 'localhost:9200/test' -d '
- {
- "settings":{
- "analysis": {
- "analyzer": {
- "whitespace":{
- "type": "pattern",
- "pattern":"\\\\s+"
- }
- }
- }
+ DELETE test
+
+ PUT /test
+ {
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "whitespace": {
+ "type": "pattern",
+ "pattern": "\\s+"
}
- }'
+ }
+ }
+ }
+ }

- curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
- # "foo,bar", "baz"
+ GET /test/_analyze?analyzer=whitespace&text=foo,bar baz
+
+ # "foo,bar", "baz"
--------------------------------------------------
+ // AUTOSENSE

[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------
-
- curl -XPUT 'localhost:9200/test' -d '
- {
- "settings":{
- "analysis": {
- "analyzer": {
- "nonword":{
- "type": "pattern",
- "pattern":"[^\\\\w]+"
- }
- }
- }
+ DELETE test
+
+ PUT /test
+ {
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "nonword": {
+ "type": "pattern",
+ "pattern": "[^\\w]+" <1>
}
- }'
+ }
+ }
+ }
+ }

- curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
- # "foo,bar baz" becomes "foo", "bar", "baz"
+ GET /test/_analyze?analyzer=nonword&text=foo,bar baz
+ # "foo,bar baz" becomes "foo", "bar", "baz"

- curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
- # "type_1","type_4"
+ GET /test/_analyze?analyzer=nonword&text=type_1-type_4
+ # "type_1","type_4"
--------------------------------------------------
+ // AUTOSENSE
+

[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------
-
- curl -XPUT 'localhost:9200/test?pretty=1' -d '
- {
- "settings":{
- "analysis": {
- "analyzer": {
- "camel":{
- "type": "pattern",
- "pattern":"([^\\\\p{L}\\\\d]+)|(?<=\\\\D)(?=\\\\d)|(?<=\\\\d)(?=\\\\D)|(?<=[\\\\p{L}&&[^\\\\p{Lu}]])(?=\\\\p{Lu})|(?<=\\\\p{Lu})(?=\\\\p{Lu}[\\\\p{L}&&[^\\\\p{Lu}]])"
- }
- }
- }
+ DELETE test
+
+ PUT /test
+ {
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "camel": {
+ "type": "pattern",
+ "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
}
- }'
+ }
+ }
+ }
+ }

- curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
- MooseX::FTPClass2_beta
- '
- # "moose","x","ftp","class","2","beta"
+ GET /test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
+ # "moose","x","ftp","class","2","beta"
--------------------------------------------------
+ // AUTOSENSE

The regex above is easier to understand as:

[source,js]
--------------------------------------------------

- ([^\\p{L}\\d]+) # swallow non letters and numbers,
- | (?<=\\D)(?=\\d) # or non-number followed by number,
- | (?<=\\d)(?=\\D) # or number followed by non-number,
- | (?<=[ \\p{L} && [^\\p{Lu}]]) # or lower case
- (?=\\p{Lu}) # followed by upper case,
- | (?<=\\p{Lu}) # or upper case
- (?=\\p{Lu} # followed by upper case
- [\\p{L}&&[^\\p{Lu}]] # then lower case
- )
+ ([^\p{L}\d]+) # swallow non letters and numbers,
+ | (?<=\D)(?=\d) # or non-number followed by number,
+ | (?<=\d)(?=\D) # or number followed by non-number,
+ | (?<=[ \p{L} && [^\p{Lu}]]) # or lower case
+ (?=\p{Lu}) # followed by upper case,
+ | (?<=\p{Lu}) # or upper case
+ (?=\p{Lu} # followed by upper case
+ [\p{L}&&[^\p{Lu}]] # then lower case
+ )
--------------------------------------------------
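
One hedged footnote on the expanded layout above: a regex written with whitespace and `#` comments like this is only valid if the engine is told to ignore them, which for `java.util.regex.Pattern` means the `COMMENTS` flag. A hypothetical sketch of passing flags to a pattern analyzer follows; the analyzer name, flag combination, and commented pattern are illustrative only, not taken from the original page:

[source,js]
--------------------------------------------------
# illustrative sketch - names and values here are assumptions
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_commented_pattern": {
          "type": "pattern",
          "flags": "CASE_INSENSITIVE|COMMENTS",
          "pattern": "\\W+ # with COMMENTS set, this trailing note is ignored"
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE

This is only meant to connect the readable layout to the `flags` setting mentioned earlier; the single-line pattern in the example above is what the page actually configures.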