Module:headword utilities
Exported functions
export.parse_term_with_modifiers
function export.parse_term_with_modifiers(data)
Parse a single inflection form that may have inline modifiers attached. `data` is an object with the following fields:
- `val`: The raw value to parse. Required.
- `paramname`: The name of the parameter from which the value was taken; used in error messages. Required.
- `frob`: An optional function of one value to apply to the form after inline modifiers have been removed (i.e. to apply to the `.term` field of the returned object).
- `include_mods`: List of extra inline modifiers to include, besides the default ones (see below). Each list item is either a string specifying a recognized extra inline modifier (see `optional_param_mods` in the code), or a two-item list of modifier name and modifier spec, where the spec should follow the syntax for modifier specs in `parse_inline_modifiers` in Module:parse utilities.
- `exclude_mods`: List of default inline modifiers to not include.
Returns an object suitable for storing as one element of one of the lists in `headdata.inflections`, where `headdata` is the structure passed to Module:headword.
The following default inline modifiers are currently recognized:
- `q`: Left qualifier.
- `qq`: Right qualifier.
- `l`: Comma-separated list of left labels. No space should follow the comma.
- `ll`: Comma-separated list of right labels. No space should follow the comma.
- `ref`: Reference or references. See {{IPA}} for the syntax.
- `id`: Sense ID, in case there are multiple senses. See {{l}}.
The following are the recognized additional inline modifiers:
- `g`: Comma-separated list of genders.
- `alt`: Display text.
- `lang`: Language code of language of the form, if different from the language of the headword.
- `sc`: Script code of script of the form. Almost never needed.
- `t`: Gloss for the form.
- `gloss`: Gloss for the form (alias for `t`).
- `pos`: Part of speech of the form.
- `lit`: Literal meaning of the form.
- `tr`: Manual transliteration of the form.
- `ts`: Transcription of the form, for languages where the transliteration differs markedly from the pronunciation.
- `face`: Face to display the form in, e.g. "hypothetical" for a hypothetical form (unlinkable and displayed in italics).
- `nolinkinfl`: Make the form unlinkable.
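The interaction of `include_mods` and `exclude_mods` with the default modifiers can be sketched in plain Lua as follows. This is an illustration only, not the module's code: `default_mods`, `optional_mods` and `effective_mods` are hypothetical names, and the two tables are shortened stand-ins for the module's own modifier tables.

```lua
-- Hypothetical, shortened stand-ins for the default and optional modifier tables.
local default_mods = {id = {}, q = {type = "qualifier"}, qq = {type = "qualifier"}}
local optional_mods = {g = {item_dest = "genders", sublist = true}, tr = {item_dest = "translit"}}

-- Build the effective modifier table from the defaults plus the include/exclude lists.
local function effective_mods(include_mods, exclude_mods)
	local mods = {}
	for key, spec in pairs(default_mods) do
		mods[key] = spec
	end
	for _, mod in ipairs(include_mods or {}) do
		if type(mod) == "table" then
			-- Two-item list: {name, spec}.
			assert(#mod == 2, "modifier spec should be of length 2")
			mods[mod[1]] = mod[2]
		else
			-- String: must name a recognized optional modifier.
			assert(optional_mods[mod], "unrecognized modifier " .. tostring(mod))
			mods[mod] = optional_mods[mod]
		end
	end
	for _, mod in ipairs(exclude_mods or {}) do
		-- Only modifiers that are currently present can be excluded.
		assert(mods[mod], "modifier " .. tostring(mod) .. " not found")
		mods[mod] = nil
	end
	return mods
end

-- Add the optional `g` modifier and a custom `x` modifier; drop the default `id`.
local mods = effective_mods({"g", {"x", {type = "boolean"}}}, {"id"})
```

After this call, `mods` contains `q`, `qq`, `g` and `x` but no longer `id`.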
export.parse_term_list_with_modifiers
function export.parse_term_list_with_modifiers(data)
Parse a list of inflection forms that may have inline modifiers attached. `data` is an object with the following fields:
- `forms`: The list of raw values to parse. Required.
- `paramname`: The name of the first parameter from which the value was taken; used in error messages. If this is a two-element list, the first element is the first parameter and the second element is the prefix of the remaining parameters. Parameter names that are numbers are handled correctly, as are those with \1 in them marking where the parameter index goes. Required.
- `qualifiers`: If specified, a possibly gappy list of left qualifiers to add to the parsed terms (for compatibility purposes).
- `frob`, `include_mods`, `exclude_mods`: As in `parse_term_with_modifiers()`.
Returns a list of objects, suitable for storing as one of the lists in `headdata.inflections` (once a label is added), where `headdata` is the structure passed to Module:headword.
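How the parameter name for each form is derived from `paramname` can be sketched in plain Lua. The helper name `nth_paramname` is hypothetical and the logic follows the description above, so treat it as an illustration rather than the module's exact code.

```lua
-- Compute the parameter name reported in error messages for the i-th form (1-based).
local function nth_paramname(paramname, i)
	local first, restpref
	if type(paramname) == "table" then
		first, restpref = paramname[1], paramname[2]
	else
		first, restpref = paramname, paramname
	end
	if i == 1 then
		return first
	elseif type(restpref) == "number" then
		-- Numbered parameters: 2, 3, 4, ...
		return restpref + i - 1
	elseif restpref:find("\1") then
		-- \1 marks where the index goes; the parentheses drop gsub's second return value.
		return (restpref:gsub("\1", tostring(i)))
	else
		return restpref .. i
	end
end

print(nth_paramname("pl", 1))                 --> pl
print(nth_paramname("pl", 3))                 --> pl3
print(nth_paramname(2, 3))                    --> 4
print(nth_paramname({"f", "f\1_qual"}, 2))    --> f2_qual
```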
export.check_term_list_missing
function export.check_term_list_missing(data)
Check if any of a list of parsed terms (as returned by `parse_term_list_with_modifiers()`) are red links (i.e. nonexistent pages). If so, a category such as Category:Spanish nouns with red links in their headword lines is added to `headdata.categories`. `data` is an object with the following fields:
- `headdata`: The headword structure passed to Module:headword. Required.
- `terms`: The list of parsed terms. Required.
- `lang`: The language object for the language of the terms. Required.
- `plpos`: The plural part of speech, for the category name. Required.
export.glossary_link
function export.glossary_link(entry, text)
Construct a link to Appendix:Glossary for `entry`. If `text` is specified, it is the display text; otherwise, `entry` is used.
export.insert_inflection
function export.insert_inflection(data)
Insert previously-parsed terms into `headdata.inflections`. `data` is an object with the following fields:
- `headdata`: The headword structure passed to Module:headword. Required.
- `terms`: The list of parsed terms. If `nil` or omitted, nothing happens.
- `label`: The label that the inflections are given; any parts of the label surrounded in <<...>> are linked to the glossary. (If the contents of <<...>> contain a | in them, they are a two-part link.) Required.
- `accel`: If specified, a full accelerator object to add to the inflections.
- `check_missing`: If specified, check the parsed terms for red links, and if so, add a category such as Category:Spanish nouns with red links in their headword lines to `headdata.categories`. If this is given, so must `lang` and `plpos`.
- `lang`: The language object for the language of the terms. Required if `check_missing` is given.
- `plpos`: The plural part of speech, for the category name. Required if `check_missing` is given.
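The label handling and the special treatment of a lone `-` term can be sketched in plain Lua. This is a simplified stand-in that leaves out the `accel` and `check_missing` handling; `expand_label` is a hypothetical helper name, and the positional arguments replace the `data` object purely for brevity.

```lua
-- Same shape as export.glossary_link().
local function glossary_link(entry, text)
	text = text or entry
	return "[[Appendix:Glossary#" .. entry .. "|" .. text .. "]]"
end

-- Expand <<entry|text>> first, then <<entry>>; the parentheses drop gsub's match count.
local function expand_label(label)
	return (label:gsub("<<(.-)|(.-)>>", glossary_link):gsub("<<(.-)>>", glossary_link))
end

-- Simplified sketch of insert_inflection().
local function insert_inflection(headdata, terms, label)
	if not (terms and terms[1]) then
		return -- nothing to insert
	end
	label = expand_label(label)
	if terms[1].term == "-" then
		-- A lone "-" means the inflection does not exist.
		table.insert(headdata.inflections, {label = "no " .. label})
	else
		terms.label = label
		table.insert(headdata.inflections, terms)
	end
end

local headdata = {inflections = {}}
insert_inflection(headdata, {{term = "gatos"}}, "<<plural>>")
insert_inflection(headdata, {{term = "-"}}, "<<plural|pl.>>")
print(headdata.inflections[1].label) --> [[Appendix:Glossary#plural|plural]]
print(headdata.inflections[2].label) --> no [[Appendix:Glossary#plural|pl.]]
```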
export.parse_and_insert_inflection
function export.parse_and_insert_inflection(data)
Parse raw arguments from `forms` for inline modifiers, and insert the resulting terms (which should not require significant additional processing) into `headdata.inflections`. `data` is an object with the following fields:
- `forms`: The list of raw values to parse. If `nil` or omitted, nothing happens.
- `headdata`: The headword structure passed to Module:headword. Required.
- `paramname`: As in `parse_term_list_with_modifiers()`. Required.
- `qualifiers`, `frob`, `include_mods`, `exclude_mods`: As in `parse_term_list_with_modifiers()`.
- `label`: As in `insert_inflection()`. Required.
- `accel`, `check_missing`, `lang`, `plpos`: As in `insert_inflection()`.
export.combine_qualifiers_or_labels
function export.combine_qualifiers_or_labels(quals1, quals2)
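Combine two sets of qualifiers or labels. If either is `nil`, just return the other, and if both are `nil`, return `nil`. Otherwise, return a new list consisting of the items of the first set followed by those items of the second set that are not already present.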
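A plain-Lua sketch of this behavior follows. It assumes the items are strings and compares them with `==`, whereas the module delegates to helpers in Module:table, so this is an approximation rather than the actual implementation.

```lua
-- Sketch of combine_qualifiers_or_labels(), assuming plain string items.
local function combine_qualifiers_or_labels(quals1, quals2)
	if not quals1 then
		return quals2 -- also returns nil when both are nil
	end
	if not quals2 then
		return quals1
	end
	-- Copy the first list so that neither input is modified.
	local combined = {}
	for i, item in ipairs(quals1) do
		combined[i] = item
	end
	-- Append items of the second list that are not already present.
	for _, item in ipairs(quals2) do
		local present = false
		for _, existing in ipairs(combined) do
			if existing == item then
				present = true
				break
			end
		end
		if not present then
			table.insert(combined, item)
		end
	end
	return combined
end

local combined = combine_qualifiers_or_labels({"rare", "dated"}, {"dated", "formal"})
print(table.concat(combined, ", ")) --> rare, dated, formal
```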
export.combine_termobj_qualifiers_labels
function export.combine_termobj_qualifiers_labels(destobj, srcobj)
Merge the left and right qualifiers (`q`, `qq`) and left and right labels (`l`, `ll`) of `srcobj` into `destobj`, combining each pair of lists with `combine_qualifiers_or_labels()`. `destobj` is modified in place and returned.
export.termobj_has_qualifiers_or_labels
function export.termobj_has_qualifiers_or_labels(obj)
Return a truthy value if the term object `obj` has at least one left or right qualifier (`q`, `qq`), left or right label (`l`, `ll`) or reference (`refs`); otherwise return a falsy value.
export.default_split_apostrophe
function export.default_split_apostrophe(word, data)
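Default function to split a word on apostrophes and link the resulting parts. Apostrophes at the beginning or end of a word are not split on (e.g. 'ndrangheta or po'), and multiple apostrophes are handled, with the apostrophe included in the link to its left (e.g. l'altr'ieri becomes [[l']][[altr']][[ieri]]). Each part is linked using `data.link_hyphen_split_component` if that function is given, and as a plain link otherwise.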
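The behavior described in the source comments for this function (no splitting on leading or trailing apostrophes, and each apostrophe kept in the link to its left) can be sketched in plain Lua. The names `split_on_apostrophe` and `split_apostrophe` are hypothetical, a simple `gmatch` loop stands in for `mw.text.split`, and plain links are always used; this shows the intended result, not the module's code.

```lua
-- Split a string on apostrophes, keeping empty parts (stand-in for mw.text.split).
local function split_on_apostrophe(s)
	local parts = {}
	for part in (s .. "'"):gmatch("(.-)'") do
		table.insert(parts, part)
	end
	return parts
end

local function split_apostrophe(word)
	-- Set aside leading and trailing apostrophes so they are never split points.
	local begapo, inner, endapo = word:match("^('*)(.-)('*)$")
	local parts = split_on_apostrophe(inner)
	local linked = {}
	for i, part in ipairs(parts) do
		if part == "" and linked[1] then
			-- Several apostrophes in a row: attach to the preceding part.
			linked[#linked] = linked[#linked] .. "'"
		elseif i == #parts then
			table.insert(linked, part)
		else
			table.insert(linked, part .. "'")
		end
	end
	-- Reattach the apostrophes that were set aside.
	linked[1] = begapo .. linked[1]
	linked[#linked] = linked[#linked] .. endapo
	for i, tolink in ipairs(linked) do
		linked[i] = "[[" .. tolink .. "]]"
	end
	return table.concat(linked)
end

print(split_apostrophe("l'eau"))       --> [[l']][[eau]]
print(split_apostrophe("l'altr'ieri")) --> [[l']][[altr']][[ieri]]
print(split_apostrophe("'ndrangheta")) --> [['ndrangheta]]
print(split_apostrophe("po'"))         --> [[po']]
```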
export.add_links_to_multiword_term
function export.add_links_to_multiword_term(term, data)
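Auto-add links to a multiword term. `data` contains fields customizing how to do this. If the term already has embedded links, it is returned unchanged. Otherwise, the term is split on spaces and each word is linked separately; punctuation at the end of a word (by default one of , ; : ? !) is split off before linking and pasted back on afterwards. Depending on the settings in `data`, words may be split further on hyphens and apostrophes. If the result is a single link covering the entire term, the link is removed.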
export.add_lemma_links
function export.add_lemma_links(lemma, split_hyphen_when_space)
Badly named older entry point, slated to be made obsolete. Equivalent to calling `add_links_to_multiword_term(lemma, {split_hyphen_when_space = split_hyphen_when_space})`, except that the call is also recorded for tracking.
export.apply_link_modifiers
function export.apply_link_modifiers(linked_term, modifier_spec)
Given a `linked_term` that is the output of `add_links_to_multiword_term()`, apply modifications as given in `modifier_spec` to change the link destination of subterms (normally single-word non-lemma forms; sometimes collections of adjacent words). This is usually used to link non-lemma forms to their corresponding lemma, but can also be used to replace a span of adjacent separately-linked words with a single multiword lemma. The format of `modifier_spec` is one or more semicolon-separated subterm specs, where each such spec is of the form SUBTERM:DEST, where SUBTERM is one or more words in the `linked_term` but without brackets in them, and DEST is the corresponding link destination to link the subterm to. Any occurrence of ~ in DEST is replaced with SUBTERM. Alternatively, a single modifier spec can be of the form BEGIN[FROM:TO], which is equivalent to writing BEGINFROM:BEGINTO (see example below).
For example, given the source phrase il bue che dice cornuto all'asino "the pot calling the kettle black" (literally "the ox that calls the donkey horned/cuckolded"), the result of calling `add_links_to_multiword_term()` is [[il]] [[bue]] [[che]] [[dice]] [[cornuto]] [[all']][[asino]]. With a `modifier_spec` of 'dice:dire', the result is [[il]] [[bue]] [[che]] [[dire|dice]] [[cornuto]] [[all']][[asino]]. Here, based on the modifier spec, the non-lemma form [[dice]] is replaced with the two-part link [[dire|dice]].
Another example: given the source phrase chi semina vento raccoglie tempesta "sow the wind, reap the whirlwind" (literally "(he) who sows wind gathers [the] tempest"), the result of calling `add_links_to_multiword_term()` is [[chi]] [[semina]] [[vento]] [[raccoglie]] [[tempesta]], and with a `modifier_spec` of 'semina:~re; raccoglie:~re', the result is [[chi]] [[seminare|semina]] [[vento]] [[raccogliere|raccoglie]] [[tempesta]]. Here we use the ~ notation to stand for the non-lemma form in the destination link.
A more complex example is se non hai altri moccoli puoi andare a letto al buio, which becomes [[se]] [[non]] [[hai]] [[altri]] [[moccoli]] [[puoi]] [[andare]] [[a]] [[letto]] [[al]] [[buio]] after calling `add_links_to_multiword_term()`. With the following `modifier_spec`: 'hai:avere; altr[i:o]; moccol[i:o]; puoi: potere; andare a letto:~; al buio:~', the result of applying the spec is [[se]] [[non]] [[avere|hai]] [[altro|altri]] [[moccolo|moccoli]] [[potere|puoi]] [[andare a letto]] [[al buio]]. Here, we rely on the alternative notation mentioned above for e.g. 'altr[i:o]', which is equivalent to 'altri:altro', and link multiword subterms using e.g. 'andare a letto:~'. (The code knows how to handle multiword subexpressions properly, and if the link text and destination are the same, only a single-part link is formed.)
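The `modifier_spec` notation can be illustrated with a small plain-Lua parser that normalizes each subterm spec into a SUBTERM/DEST pair. This is a sketch of the notation only: `trim` and `expand_modifier_spec` are hypothetical names, trimming of whitespace around the parts is an assumption based on the 'puoi: potere' example, and the step of rewriting the links in `linked_term` is not shown.

```lua
-- Remove leading and trailing whitespace; the parentheses drop gsub's match count.
local function trim(s)
	return (s:gsub("^%s+", ""):gsub("%s+$", ""))
end

-- Turn a modifier spec into a list of {subterm = ..., dest = ...} pairs.
local function expand_modifier_spec(modifier_spec)
	local result = {}
	for spec in modifier_spec:gmatch("[^;]+") do
		spec = trim(spec)
		local subterm, dest
		-- Alternative notation BEGIN[FROM:TO] == BEGINFROM:BEGINTO.
		local begin, from, to = spec:match("^(.-)%[(.-):(.-)%]$")
		if begin then
			subterm, dest = begin .. from, begin .. to
		else
			subterm, dest = spec:match("^(.-):(.*)$")
			if not subterm then
				error("Expected SUBTERM:DEST in modifier spec '" .. spec .. "'")
			end
			subterm, dest = trim(subterm), trim(dest)
			-- Replace ~ with the subterm; escape % so the replacement is taken literally.
			dest = dest:gsub("~", (subterm:gsub("%%", "%%%%")))
		end
		table.insert(result, {subterm = subterm, dest = dest})
	end
	return result
end

local specs = expand_modifier_spec("hai:avere; altr[i:o]; puoi: potere; andare a letto:~; semina:~re")
for _, spec in ipairs(specs) do
	print(spec.subterm .. " -> " .. spec.dest)
end
```

Under these assumptions the loop prints `hai -> avere`, `altri -> altro`, `puoi -> potere`, `andare a letto -> andare a letto` and `semina -> seminare`.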
local export = {}
local table_module = "Module:table"
local string_utilities_module = "Module:string utilities"
local parse_utilities_module = "Module:parse utilities"
local rfind = mw.ustring.find
local rmatch = mw.ustring.match
local rsplit = mw.text.split
local rsubn = mw.ustring.gsub
local dump = mw.dumpObject
-- version of rsubn() that discards all but the first return value
local function rsub(term, foo, bar)
local retval = rsubn(term, foo, bar)
return retval
end
local function track(track_id)
require("Module:debug/track")("headword utilities/" .. track_id)
return true
end
local param_mods = {
id = {},
q = {type = "qualifier"},
qq = {type = "qualifier"},
l = {type = "labels"},
ll = {type = "labels"},
-- [[Module:headword]] expects part references in `.refs`.
ref = {item_dest = "refs", type = "references"},
}
local optional_param_mods = {
g = {item_dest = "genders", sublist = true},
alt = {},
lang = {type = "language"},
sc = {type = "script"},
t = {item_dest = "gloss"},
gloss = {},
pos = {},
lit = {},
tr = {item_dest = "translit"},
ts = {item_dest = "transcription"},
face = {},
nolinkinfl = {type = "boolean"},
}
--[==[
Parse a single inflection form that may have inline modifiers attached. `data` is an object with the following fields:
* `val`: The raw value to parse. Required.
* `paramname`: The name of the parameter from which the value was taken; used in error messages. Required.
* `frob`: An optional function of one value to apply to the form after inline modifiers have been removed (i.e. to
apply to the `.term` field of the returned object).
* `include_mods`: List of extra inline modifiers to include, besides the default ones (see below). Each list item is
either a string specifying a recognized extra inline modifier (see `optional_param_mods` in the code), or a two-item
list of modifier name and modifier spec, where the spec should follow the syntax for modifier specs in
`parse_inline_modifiers` in [[Module:parse utilities]].
* `exclude_mods`: List of default inline modifiers to not include.
Returns an object suitable for storing as one element of one of the lists in `headdata.inflections`, where `headdata`
is the structure passed to [[Module:headword]].
The following default inline modifiers are currently recognized:
* `q`: Left qualifier.
* `qq`: Right qualifier.
* `l`: Comma-separated list of left labels. No space should follow the comma.
* `ll`: Comma-separated list of right labels. No space should follow the comma.
* `ref`: Reference or references. See {{tl|IPA}} for the syntax.
* `id`: Sense ID, in case there are multiple senses. See {{tl|l}}.
The following are the recognized additional inline modifiers:
* `g`: Comma-separated list of genders.
* `alt`: Display text.
* `lang`: Language code of language of the form, if different from the language of the headword.
* `sc`: Script code of script of the form. Almost never needed.
* `t`: Gloss for the form.
* `gloss`: Gloss for the form (alias for `t`).
* `pos`: Part of speech of the form.
* `lit`: Literal meaning of the form.
* `tr`: Manual transliteration of the form.
* `ts`: Transcription of the form, for languages where the transliteration differs markedly from the pronunciation.
* `face`: Face to display the form in, e.g. {"hypothetical"} for a hypothetical form (unlinkable and displayed in italics).
* `nolinkinfl`: Make the form unlinkable.
]==]
function export.parse_term_with_modifiers(data)
local paramname, val, frob = data.paramname, data.val, data.frob
local function generate_obj(term, parse_err)
if frob then
term = frob(term, parse_err)
end
return {term = term}
end
-- Check for inline modifier, e.g. מרים<tr:Miryem>. But exclude top-level HTML entry with <span ...>,
-- <sup> or similar in it.
if val:find("<") and not require(parse_utilities_module).term_contains_top_level_html(val) then
local param_mods = param_mods
if data.include_mods or data.exclude_mods then
-- Copy the defaults so that the module-level table is not modified.
param_mods = require(table_module).shallowcopy(param_mods)
if data.include_mods then
for _, mod in ipairs(data.include_mods) do
if type(mod) == "table" then
if #mod ~= 2 then
error(("Internal error: Modifier spec %s in `include_mods` should be of length 2"):format(
dump(mod)))
end
local modkey, modvalue = unpack(mod)
param_mods[modkey] = modvalue
elseif not optional_param_mods[mod] then
error(("Internal error: Unrecognized modifier spec %s in `include_mods`"):format(
dump(mod)))
else
param_mods[mod] = optional_param_mods[mod]
end
end
end
if data.exclude_mods then
for _, mod in ipairs(data.exclude_mods) do
if not param_mods[mod] then
error(("Internal error: Modifier spec %s in `exclude_mods` not found among existing modifiers"
):format(dump(mod)))
else
param_mods[mod] = nil
end
end
end
end
return require(parse_utilities_module).parse_inline_modifiers(val, {
paramname = paramname,
param_mods = param_mods,
generate_obj = generate_obj,
})
else
return generate_obj(val)
end
end
--[==[
Parse a list of inflection forms that may have inline modifiers attached. `data` is an object with the following fields:
* `forms`: The list of raw values to parse. Required.
* `paramname`: The name of the first parameter from which the value was taken; used in error messages. If this is a
two-element list, the first element is the first parameter and the second element is the prefix of the remaining
parameters. Parameter names that are numbers are handled correctly, as are those with \1 in them marking where the
parameter index goes. Required.
* `qualifiers`: If specified, a possibly gappy list of left qualifiers to add to the parsed terms (for compatibility
purposes).
* `frob`, `include_mods`, `exclude_mods`: As in `parse_term_with_modifiers()`.
Returns a list of objects, suitable for storing as one of the lists in `headdata.inflections` (once a label is added),
where `headdata` is the structure passed to [[Module:headword]].
]==]
function export.parse_term_list_with_modifiers(data)
local paramname, forms = data.paramname, data.forms
local qualifiers = data.qualifiers
local first, restpref
if type(paramname) == "table" then
first = paramname[1]
restpref = paramname[2]
else
first = paramname
restpref = paramname
end
local terms = {}
for i, val in ipairs(forms) do
terms[i] = export.parse_term_with_modifiers {
paramname = i == 1 and first or type(restpref) == "number" and restpref + i - 1 or
restpref:find("\1") and restpref:gsub("\1", tostring(i)) or restpref .. i,
val = val,
frob = data.frob,
include_mods = data.include_mods,
exclude_mods = data.exclude_mods,
}
if qualifiers and qualifiers[i] then
terms[i].q = {qualifiers[i]}
end
end
return terms
end
--[==[
Check if any of a list of parsed terms (as returned by `parse_term_list_with_modifiers()`) are red links (i.e.
nonexistent pages). If so, a category such as [[Category:Spanish nouns with red links in their headword lines]] is added
to `headdata.categories`. `data` is an object with the following fields:
* `headdata`: The headword structure passed to [[Module:headword]]. Required.
* `terms`: The list of parsed terms. Required.
* `lang`: The language object for the language of the terms. Required.
* `plpos`: The plural part of speech, for the category name. Required.
]==]
function export.check_term_list_missing(data)
local headdata, terms, lang, plpos = data.headdata, data.terms, data.lang, data.plpos
for _, term in ipairs(terms) do
if type(term) == "table" then
term = term.term
end
if term then
local title = mw.title.new(term)
if title and not title:getContent() then
table.insert(headdata.categories, lang:getFullName() .. " " .. plpos ..
" with red links in their headword lines")
end
end
end
end
--[==[
Construct a link to [[Appendix:Glossary]] for `entry`. If `text` is specified, it is the display text; otherwise,
`entry` is used.
]==]
function export.glossary_link(entry, text)
text = text or entry
return "[[Appendix:Glossary#" .. entry .. "|" .. text .. "]]"
end
--[==[
Insert previously-parsed terms into `headdata.inflections`. `data` is an object with the following fields:
* `headdata`: The headword structure passed to [[Module:headword]]. Required.
* `terms`: The list of parsed terms. If {nil} or omitted, nothing happens.
* `label`: The label that the inflections are given; any parts of the label surrounded in <<...>> are linked to the
glossary. (If the contents of <<...>> contain a | in them, they are a two-part link.) Required.
* `accel`: If specified, a full accelerator object to add to the inflections.
* `check_missing`: If specified, check the parsed terms for red links, and if so, add a category such as
[[Category:Spanish nouns with red links in their headword lines]] to `headdata.categories`. If this is given, so must
`lang` and `plpos`.
* `lang`: The language object for the language of the terms. Required if `check_missing` is given.
* `plpos`: The plural part of speech, for the category name. Required if `check_missing` is given.
]==]
function export.insert_inflection(data)
local headdata, terms, label = data.headdata, data.terms, data.label
if terms and terms[1] then
if label:find("<<") then
label = label:gsub("<<(.-)|(.-)>>", export.glossary_link):gsub("<<(.-)>>", export.glossary_link)
end
if terms[1].term == "-" then
-- FIXME: Generate an error if there is more than one term or qualifiers or labels specified?
table.insert(headdata.inflections, {label = "no " .. label})
else
if data.check_missing then
export.check_term_list_missing {
headdata = headdata,
terms = terms,
lang = data.lang,
plpos = data.plpos,
}
end
terms.label = label
if data.accel then
terms.accel = data.accel
end
table.insert(headdata.inflections, terms)
end
end
end
--[==[
Parse raw arguments from `forms` for inline modifiers, and insert the resulting terms (which should not require
significant additional processing) into `headdata.inflections`. `data` is an object with the following fields:
* `forms`: The list of raw values to parse. If {nil} or omitted, nothing happens.
* `headdata`: The headword structure passed to [[Module:headword]]. Required.
* `paramname`: As in `parse_term_list_with_modifiers()`. Required.
* `qualifiers`, `frob`, `include_mods`, `exclude_mods`: As in `parse_term_list_with_modifiers()`.
* `label`: As in `insert_inflection()`. Required.
* `accel`, `check_missing`, `lang`, `plpos`: As in `insert_inflection()`.
]==]
function export.parse_and_insert_inflection(data)
local forms = data.forms
if forms and forms[1] then
data = require(table_module).shallowcopy(data)
data.forms = forms
data.terms = export.parse_term_list_with_modifiers(data)
export.insert_inflection(data)
end
end
--[==[
Combine two sets of qualifiers or labels. If either is {nil}, just return the other, and if both are {nil}, return
{nil}.
]==]
function export.combine_qualifiers_or_labels(quals1, quals2)
if not quals1 and not quals2 then
return nil
end
if not quals1 then
return quals2
end
if not quals2 then
return quals1
end
local m_table = require(table_module)
local combined = m_table.shallowcopy(quals1)
for _, note in ipairs(quals2) do
m_table.insertIfNot(combined, note)
end
return combined
end
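--[==[
Merge the left and right qualifiers (`q`, `qq`) and left and right labels (`l`, `ll`) of `srcobj` into `destobj`,
combining each pair of lists with `combine_qualifiers_or_labels()`. `destobj` is modified in place and returned.
]==]
function export.combine_termobj_qualifiers_labels(destobj, srcobj)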
destobj.q = export.combine_qualifiers_or_labels(destobj.q, srcobj.q)
destobj.qq = export.combine_qualifiers_or_labels(destobj.qq, srcobj.qq)
destobj.l = export.combine_qualifiers_or_labels(destobj.l, srcobj.l)
destobj.ll = export.combine_qualifiers_or_labels(destobj.ll, srcobj.ll)
return destobj
end
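--[==[
Return a truthy value if the term object `obj` has at least one left or right qualifier (`q`, `qq`), left or right
label (`l`, `ll`) or reference (`refs`); otherwise return a falsy value.
]==]
function export.termobj_has_qualifiers_or_labels(obj)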
return obj.q and obj.q[1] or obj.qq and obj.qq[1] or obj.l and obj.l[1] or obj.ll and obj.ll[1] or
obj.refs and obj.refs[1]
end
local function link_hyphen_split_component(word, data)
if data.link_hyphen_split_component then
return data.link_hyphen_split_component(word)
else
return "[[" .. word .. "]]"
end
end
-- Default function to split a word on apostrophes. Don't split apostrophes at the beginning or end of a word (e.g.
-- [['ndrangheta]] or [[po']]). Handle multiple apostrophes correctly, e.g. [[l'altr'ieri]] -> [[l']][[altr']][[ieri]].
function export.default_split_apostrophe(word, data)
-- Set aside apostrophes at the beginning and end of the word so that they are never treated as split points.
local begapo, inner_word, endapo = word:match("^('*)(.-)('*)$")
local apostrophe_parts = rsplit(inner_word, "'")
local linked_apostrophe_parts = {}
for i, apostrophe_part in ipairs(apostrophe_parts) do
if apostrophe_part == "" and linked_apostrophe_parts[1] then
-- A blank part indicates more than one apostrophe in a row; we join all of them to the preceding word.
linked_apostrophe_parts[#linked_apostrophe_parts] =
linked_apostrophe_parts[#linked_apostrophe_parts] .. "'"
elseif i == #apostrophe_parts then
table.insert(linked_apostrophe_parts, apostrophe_part)
else
-- The apostrophe is included in the link to its left.
table.insert(linked_apostrophe_parts, apostrophe_part .. "'")
end
end
-- Reattach the apostrophes at the beginning and end to the first and last parts, respectively (these are the same,
-- blank part if the word consists only of apostrophes).
linked_apostrophe_parts[1] = begapo .. (linked_apostrophe_parts[1] or "")
linked_apostrophe_parts[#linked_apostrophe_parts] = linked_apostrophe_parts[#linked_apostrophe_parts] .. endapo
for i, tolink in ipairs(linked_apostrophe_parts) do
linked_apostrophe_parts[i] = link_hyphen_split_component(tolink, data)
end
return table.concat(linked_apostrophe_parts)
end
--[=[
Auto-add links to a word that should not have spaces but may have hyphens and/or apostrophes. We split off final
punctuation, then split on hyphens if `data.split_hyphen` is given, and also split on apostrophes if
`data.split_apostrophe` is given. We only split on hyphens if they are in the middle of the word, not at the beginning
or end (hyphens at the beginning or end indicate suffixes or prefixes, respectively). `include_hyphen_prefixes`, if
given, is a set of prefixes (not including the final hyphen) where we should include the final hyphen in the prefix.
Hence, e.g. if "anti" is in the set, a Portuguese word like [[anti-herói]] "anti-hero" will be split [[anti-]][[herói]]
(whereas a word like [[código-fonte]] "source code" will be split as [[código]]-[[fonte]]).
If `data.split_apostrophe` is specified, we split on apostrophes unless `data.no_split_apostrophe_words` is given and
the word is in the specified set, such as French [[c'est]] and [[quelqu'un]]. If `data.split_apostrophe` is true, the
default algorithm applies, which splits on all apostrophes except those at the beginning and end of a word (as in
Italian [['ndrangheta]] or [[po']]), and includes the apostrophe in the link to its left (so we auto-split French
[[l'eau]] as [[l']][[eau]] and [[l'altr'ieri]] as [[l']][[altr']][[ieri]]). If `data.split_apostrophe` is specified
but not `true`, it should be a function of one argument that does custom apostrophe-splitting. The argument is the word
to split, and the return value should be the split and linked word.
]=]
local function add_single_word_links(space_word, data, term_has_spaces)
local space_word_no_punct, punct
local punct_pattern = data.punctuation
if punct_pattern == nil then
punct_pattern = "[,;:?!]"
end
if type(punct_pattern) == "function" then
space_word_no_punct, punct = punct_pattern(space_word)
elseif type(punct_pattern) == "string" then
space_word_no_punct, punct = rmatch(space_word, "^(.*)(" .. punct_pattern .. ")$")
end
space_word_no_punct = space_word_no_punct or space_word
punct = punct or ""
local words
if space_word_no_punct:find("^%-") or space_word_no_punct:find("%-$") then
-- don't split prefixes and suffixes
words = {space_word_no_punct}
else
local splitter
if term_has_spaces then
splitter = data.split_hyphen_when_space
else
splitter = data.split_hyphen_when_no_space
end
if type(splitter) == "function" then
words = splitter(space_word_no_punct)
if type(words) == "string" then
return words .. punct
end
end
end
if not words then
local split_hyphen
if term_has_spaces then
split_hyphen = data.split_hyphen_when_space
else
split_hyphen = data.split_hyphen_when_no_space
if split_hyphen == nil then -- default to true; use `false` to avoid this
split_hyphen = true
end
end
if split_hyphen then
words = rsplit(space_word_no_punct, "%-")
else
words = {space_word_no_punct}
end
end
local linked_words = {}
for j, word in ipairs(words) do
if j < #words and data.include_hyphen_prefixes and data.include_hyphen_prefixes[word] then
word = "[[" .. word .. "-]]"
elseif j > 1 and data.include_hyphen_suffixes and data.include_hyphen_suffixes[word] then
word = "[[-" .. word .. "]]"
else
-- Don't split on apostrophes if the word is in `no_split_apostrophe_words`.
if (not data.no_split_apostrophe_words or not data.no_split_apostrophe_words[word]) and
data.split_apostrophe and word:find("'") then
if data.split_apostrophe == true then
word = export.default_split_apostrophe(word, data)
else -- custom apostrophe splitter/linker
word = data.split_apostrophe(word)
end
else
word = link_hyphen_split_component(word, data)
end
if j < #words then
word = word .. "-"
end
end
table.insert(linked_words, word)
end
return table.concat(linked_words) .. punct
end
--[=[
Auto-add links to a multiword term. `data` contains fields customizing how to do this. By default we proceed as follows:
(1) If the term already has embedded links in it, they are left unchanged.
(2) Otherwise, if there are spaces present, we split on spaces and link each word separately.
(3) If a given space-separated component ends in punctuation (defaulting to [,;:?!]), it is separated off, the remainder
of the algorithm run, and the punctuation pasted back on.
(4) If there are hyphens in a given space-separated component, we may link each hyphenated term separately depending
on the settings in `data`. Normally the hyphens are not included in the linked terms, but this can be overridden
for specific prefixes and/or suffixes. By default, if there are spaces in the multiword term, we do not link
hyphenated components (because of cases like "boire du petit-lait" where "petit-lait" should be linked as a whole),
but do so otherwise (e.g. for "avant-avant-hier"); this can be overridden for cases like "croyez-le ou non".
Cases where only some of the hyphens should be split can always be handled by explicitly specifying the head (e.g.
"Nord-Pas-de-Calais" given as head=[[Nord]]-[[Pas-de-Calais]]).
(5) If there are apostrophes in a given component, we may link each apostrophe-separated term separately depending
on the settings in `data`, including the apostrophe in the link to its left (so we split "de l'eau" as
"[[de]] [[l']][[eau]]").
The settings in `data` are as follows:
`split_hyphen_when_no_space`: Whether to split on hyphens when the term has no spaces. Defaults to true if set to `nil`.
This can be a function of one argument, to implement a custom splitting algorithm for hyphen-separated terms. If
this returns [FIXME: FINISH ME ...]
If `data.split_apostrophe` is specified, we split on apostrophes unless `data.no_split_apostrophe_words` is given and
the word is in the specified set, such as French [[c'est]] and [[quelqu'un]]. If `data.split_apostrophe` is true, the
default algorithm applies, which splits on all apostrophes except those at the beginning and end of a word (as in
Italian [['ndrangheta]] or [[po']]), and includes the apostrophe in the link to its left (so we auto-split French
[[l'eau]] as [[l']][[eau]] and [[l'altr'ieri]] as [[l']][[altr']][[ieri]]). If `data.split_apostrophe` is specified
but not `true`, it should be a function of one argument that does custom apostrophe-splitting. The argument is the word
to split, and the return value should be the split and linked word.
We don't always split on hyphens because of cases like "boire du petit-lait" where "petit-lait" should be linked as a
whole, but provide the option to do it for cases like "croyez-le ou non". If there's no space, however, then it makes
sense to split on hyphens by default. `no_split_apostrophe_words` and `include_hyphen_prefixes` allow for special-case
handling of particular words and are as described in the comment above add_single_word_links().
]=]
function export.add_links_to_multiword_term(term, data)
if rfind(term, "[%[%]]") then
return term
end
local words = rsplit(term, " ")
local term_has_spaces = #words > 1
local linked_words = {}
for _, word in ipairs(words) do
table.insert(linked_words, add_single_word_links(word, data, term_has_spaces))
end
local retval = table.concat(linked_words, " ")
-- If we ended up with a single link consisting of the entire term,
-- remove the link.
local unlinked_retval = rmatch(retval, "^%[%[([^%[%]]*)%]%]$")
return unlinked_retval or retval
end
-- Badly named older entry point. FIXME: Obsolete me!
function export.add_lemma_links(lemma, split_hyphen_when_space)
track("add-lemma-links")
return export.add_links_to_multiword_term(lemma, {split_hyphen_when_space = split_hyphen_when_space})
end
-- Ensure that brackets display literally in error messages. Replacing with equivalent HTML escapes doesn't work
-- because they are displayed literally; but inserting a Unicode word-joiner symbol works.
local function escape_wikicode(term)
return require(parse_utilities_module).escape_wikicode(term)
end
--[==[
Given a `linked_term` that is the output of add_links_to_multiword_term(), apply modifications as given in
`modifier_spec` to change the link destination of subterms (normally single-word non-lemma forms; sometimes
collections of adjacent words). This is usually used to link non-lemma forms to their corresponding lemma, but can
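also be used to replace a span of adjacent separately-linked words with a link to a single multiword lemma. The format
of `modifier_spec` is one or more semicolon-separated subterm specs, where each such spec is of the form
SUBTERM:DEST, where SUBTERM is one or more words in the `linked_term` but without brackets in them, and DEST is the
corresponding link destination to link the subterm to. Any occurrence of ~ in DEST is replaced with SUBTERM. DEST can
begin with a language code (e.g. 'la:minūtia') to link to that language's section of the destination page, or with
a Wikipedia prefix (e.g. 'w:Abatemarco' or 'w:it:Colle Val d'Elsa'). If DEST itself contains brackets, it is used
verbatim as the replacement. Alternatively, a single modifier spec can be of the form BEGIN[FROM:TO], which is
equivalent to writing BEGINFROM:BEGINTO (see example below); likewise [FROM:TO]END is equivalent to FROMEND:TOEND.
Finally, a SUBTERM of ^ or $ prepends or appends DEST (with underscores converted to spaces) to the linked term.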
For example, given the source phrase [[il bue che dice cornuto all'asino]] "the pot calling the kettle black"
(literally "the ox that calls the donkey horned/cuckolded"), the result of calling add_links_to_multiword_term()
is [[il]] [[bue]] [[che]] [[dice]] [[cornuto]] [[all']][[asino]]. With a modifier_spec of 'dice:dire', the result
is [[il]] [[bue]] [[che]] [[dire|dice]] [[cornuto]] [[all']][[asino]]. Here, based on the modifier spec, the
non-lemma form [[dice]] is replaced with the two-part link [[dire|dice]].
Another example: given the source phrase [[chi semina vento raccoglie tempesta]] "sow the wind, reap the whirlwind"
(literally "(he) who sows wind gathers [the] tempest"), the result of calling add_links_to_multiword_term() is
[[chi]] [[semina]] [[vento]] [[raccoglie]] [[tempesta]], and with a modifier_spec of 'semina:~re; raccoglie:~re',
the result is [[chi]] [[seminare|semina]] [[vento]] [[raccogliere|raccoglie]] [[tempesta]]. Here we use the ~
notation to stand for the non-lemma form in the destination link.
A more complex example is [[se non hai altri moccoli puoi andare a letto al buio]], which becomes
[[se]] [[non]] [[hai]] [[altri]] [[moccoli]] [[puoi]] [[andare]] [[a]] [[letto]] [[al]] [[buio]] after calling
add_links_to_multiword_term(). With the following modifier_spec:
'hai:avere; altr[i:o]; moccol[i:o]; puoi: potere; andare a letto:~; al buio:~', the result of applying the spec is
[[se]] [[non]] [[avere|hai]] [[altro|altri]] [[moccolo|moccoli]] [[potere|puoi]] [[andare a letto]] [[al buio]].
Here, we rely on the alternative notation mentioned above for e.g. 'altr[i:o]', which is equivalent to 'altri:altro',
and link multiword subterms using e.g. 'andare a letto:~'. (The code knows how to handle multiword subexpressions
properly, and if the link text and destination are the same, only a single-part link is formed.)
]==]
function export.apply_link_modifiers(linked_term, modifier_spec)
	local split_modspecs = rsplit(modifier_spec, "%s*;%s*")
	for j, modspec in ipairs(split_modspecs) do
		local subterm, dest, otherlang
		-- Form [FROM:TO]REST[FROM:TO]: both the beginning and the end differ between subterm and destination.
		local begin_from, begin_to, rest, end_from, end_to = modspec:match("^%[(.-):(.*)%]([^:]*)%[(.-):(.*)%]$")
		if begin_from then
			subterm = begin_from .. rest .. end_from
			dest = begin_to .. rest .. end_to
		end
		if not subterm then
			-- Form BEGIN[FROM:TO]: only the ending differs.
			rest, end_from, end_to = modspec:match("^([^:]*)%[(.-):(.*)%]$")
			if rest then
				subterm = rest .. end_from
				dest = rest .. end_to
			end
		end
		if not subterm then
			-- Form [FROM:TO]END: only the beginning differs.
			begin_from, begin_to, rest = modspec:match("^%[(.-):(.*)%]([^:]*)$")
			if begin_from then
				subterm = begin_from .. rest
				dest = begin_to .. rest
			end
		end
		if not subterm then
			-- Plain form SUBTERM:DEST.
			subterm, dest = modspec:match("^(.-)%s*:%s*(.*)$")
			if subterm and subterm ~= "^" and subterm ~= "$" then
				local langdest
				-- Parse off an initial language code (e.g. 'en:Higgs', 'la:minūtia' or 'grc:σκατός'). Also handle
				-- Wikipedia prefixes ('w:Abatemarco' or 'w:it:Colle Val d'Elsa').
				otherlang, langdest = dest:match("^([A-Za-z0-9._-]+):([^ ].*)$")
				if otherlang == "w" then
					local foreign_wikipedia, foreign_term = langdest:match("^([A-Za-z0-9._-]+):([^ ].*)$")
					if foreign_wikipedia then
						otherlang = otherlang .. ":" .. foreign_wikipedia
						langdest = foreign_term
					end
					dest = ("%s:%s"):format(otherlang, langdest)
					otherlang = nil
				elseif otherlang then
					otherlang = require("Module:languages").getByCode(otherlang, true, "allow etym")
					dest = langdest
				end
			end
		end
		if not subterm then
			error(("Single modifier spec %s should be of the form SUBTERM:DEST where SUBTERM is one or more words in a multiword "
				.. "term and DEST is the destination to link the subterm to (possibly prefixed by a language code); or of "
				.. "the form BEGIN[FROM:TO], which is equivalent to BEGINFROM:BEGINTO; or similarly [FROM:TO]END, which is "
				.. "equivalent to FROMEND:TOEND"):
				format(modspec))
		end
		-- A subterm of ^ or $ prepends or appends DEST to the whole term; underscores in DEST stand for spaces.
		if subterm == "^" then
			linked_term = dest:gsub("_", " ") .. linked_term
		elseif subterm == "$" then
			linked_term = linked_term .. dest:gsub("_", " ")
		else
			if subterm:find("%[") then
				error(("Subterm '%s' in modifier spec '%s' cannot have brackets in it"):format(
					escape_wikicode(subterm), escape_wikicode(modspec)))
			end
			local strutil = require(string_utilities_module)
			local escaped_subterm = strutil.pattern_escape(subterm)
			-- Let the subterm span several adjacent links: at each space, apostrophe or hyphen, allow closing
			-- brackets before it and opening brackets after it.
			local subterm_re = "%[%[" .. escaped_subterm:gsub("(%%?[ '%-])", "%%]*%1%%[*") .. "%]%]"
			local expanded_dest
			if dest:find("~") then
				expanded_dest = dest:gsub("~", strutil.replacement_escape(subterm))
			else
				expanded_dest = dest
			end
			if otherlang then
				expanded_dest = expanded_dest .. "#" .. otherlang:getCanonicalName()
			end
			local subterm_replacement
			if expanded_dest:find("%[") then
				-- Use the destination directly if it has brackets in it (e.g. to put brackets around parts of a word).
				subterm_replacement = expanded_dest
			elseif expanded_dest == subterm then
				subterm_replacement = "[[" .. subterm .. "]]"
			else
				subterm_replacement = "[[" .. expanded_dest .. "|" .. subterm .. "]]"
			end
			local replaced_linked_term = rsub(linked_term, subterm_re, strutil.replacement_escape(subterm_replacement))
			if replaced_linked_term == linked_term then
				error(("Subterm '%s' could not be located in %slinked expression %s, or replacement same as subterm"):format(
					subterm, j > 1 and "intermediate " or "", escape_wikicode(linked_term)))
			else
				linked_term = replaced_linked_term
			end
		end
	end
	return linked_term
end
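-- Usage sketch for apply_link_modifiers(), kept as comments so that nothing runs when the module is loaded. It assumes
-- a Scribunto environment in which this module is loadable as "Module:headword utilities"; the variable names
-- `m_headutil`, `linked` and `result` are hypothetical. The phrase and the expected result are the ones given in the
-- documentation comment of apply_link_modifiers().
--
--   local m_headutil = require("Module:headword utilities")
--   local linked = "[[chi]] [[semina]] [[vento]] [[raccoglie]] [[tempesta]]"
--   local result = m_headutil.apply_link_modifiers(linked, "semina:~re; raccoglie:~re")
--   -- result: [[chi]] [[seminare|semina]] [[vento]] [[raccogliere|raccoglie]] [[tempesta]]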
return export