
URL: https://en.wiktionary.org/wiki/Module:languages

Module:languages - Wiktionary, the free dictionary



The following documentation is located at Module:languages/documentation.
Useful links: subpage list, links, transclusions, testcases, sandbox

This module is used to retrieve and manage the languages that can have Wiktionary entries, and the information associated with them. See Wiktionary:Languages for more information.

For the languages and language varieties that may be used in etymologies, see Module:etymology languages. For language families, which sometimes also appear in etymologies, see Module:families.

This module provides access to other modules. To access the information from within a template, see Module:languages/templates.

The information itself is stored in the various data modules that are subpages of this module. These modules should not be used directly by any other module; the data should only be accessed through the functions provided by this module.

Data submodules:

Extra data submodules (for less frequently used data):

Finding and retrieving languages

The module exports a number of functions that are used to find languages.

This module implements fetching of language-specific information and processing text in a given language.

Types of languages

There are two types of languages: full languages and etymology-only languages. The essential difference is that only full languages appear in L2 headings in vocabulary entries; hence categories like Category:French nouns exist only for full languages.

Etymology-only languages have either a full language or another etymology-only language as their parent (in the parent-child inheritance sense), and for etymology-only languages with another etymology-only language as their parent, a full language can always be derived by following the parent links upwards. For example, "Canadian French", code fr-CA, is an etymology-only language whose parent is the full language "French", code fr. An example of an etymology-only language with another etymology-only parent is "Northumbrian Old English", code ang-nor, whose parent is "Anglian Old English", code ang-ang; the latter is in turn an etymology-only language whose parent is "Old English", code ang, which is a full language. (This is because Northumbrian Old English is considered a variety of Anglian Old English.) Sometimes the parent is the "Undetermined" language, code und; this is the case, for example, for "substrate" languages such as "Pre-Greek", code qsb-grc, and "the BMAC substrate", code qsb-bma.

It is important to distinguish language parents from language ancestors. The parent-child relationship is one of containment: if X is a child of Y, X is considered a variety of Y. The ancestor-descendant relationship, on the other hand, is one of descent in time. For example, "Classical Latin", code la-cla, and "Late Latin", code la-lat, are both etymology-only languages with "Latin", code la, as their parent, because both are varieties of Latin. However, Late Latin does *not* have Classical Latin as its parent, because Late Latin is not a variety of Classical Latin; rather, it is a descendant. A separate ancestors field is used to express the ancestor-descendant relationship, and Late Latin's ancestor is given as Classical Latin.

It is also important to note that sometimes an etymology-only language is actually the conceptual ancestor of its parent language. This happens, for example, with "Old Italian" (code roa-oit), an etymology-only variant of the full language "Italian" (code it), and with "Old Latin" (code itc-ola), an etymology-only variant of Latin. In both cases, the full language has the etymology-only variant listed as an ancestor. This allows a Latin term to inherit from Old Latin using the {{inh}} template (where "inheritance" refers to ancestral inheritance, i.e. inheritance in time, rather than the parent-child sense); likewise for Italian and Old Italian.
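The parent/ancestor distinction can be seen by querying Late Latin, per the example above. A minimal Lua sketch, assuming the wiki's Scribunto environment and that etymology-only codes are enabled in the lookup:

```lua
local m_languages = require("Module:languages")

-- "Late Latin" (la-lat) is etymology-only; the third argument to getByCode
-- allows such codes to be looked up.
local late_latin = m_languages.getByCode("la-lat", nil, true)

-- Parent = containment: Late Latin is a variety of Latin (la).
local parent = late_latin:getParent()

-- Ancestors = descent in time: Late Latin descends from Classical Latin (la-cla).
local ancestors = late_latin:getAncestors()
```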

Full languages come in three subtypes:

  • regular: This indicates a full language that is attested according to WT:CFI and therefore permitted in the main namespace. There may also be reconstructed terms for the language, which are placed in the Reconstruction namespace and must be prefixed with * to indicate a reconstruction. Most full languages are natural (not constructed) languages, but a few constructed languages (e.g. Esperanto and Volapük, among others) are also allowed in the mainspace and considered regular languages.
  • reconstructed: This language is not attested according to WT:CFI, and therefore is allowed only in the Reconstruction namespace. All terms in this language are reconstructed, and must be prefixed with *. Languages such as Proto-Indo-European and Proto-Germanic are in this category.
  • appendix-constructed: This language is attested but does not meet the additional requirements set out for constructed languages (WT:CFI#Constructed languages). Its entries must therefore be in the Appendix namespace, but they are not reconstructed and therefore should not have * prefixed in links. Most constructed languages are of this subtype.

Both full languages and etymology-only languages have a Language object associated with them, which is fetched using the getByCode function in this module to convert a language code to a Language object. Depending on the options supplied to this function, etymology-only languages may or may not be accepted, and family codes may be accepted (returning a Family object as described in Module:families). There are also separate getByCanonicalName functions in this module and Module:etymology languages to convert a language's canonical name to a Language object (depending on whether the canonical name refers to a full or etymology-only language).
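For example, converting between codes, canonical names and Language objects might look like this (a sketch assuming the wiki's Scribunto environment):

```lua
local m_languages = require("Module:languages")

-- Code to object:
local fr = m_languages.getByCode("fr")                   -- full language "French"
local fr_ca = m_languages.getByCode("fr-CA", nil, true)  -- etymology-only; third arg allows such codes

-- Canonical name to object:
local also_fr = m_languages.getByCanonicalName("French")
```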

Textual representations

Textual strings belonging to a given language come in several different text variants:

  1. The input text is what the user supplies in wikitext, in the parameters to {{m}}, {{l}}, {{ux}}, {{t}}, {{lang}} and the like.
  2. The corrected input text is the input text with some corrections and/or normalizations applied, such as bad-character replacements for certain languages, like replacing l or 1 with a palochka in some languages written in Cyrillic. (FIXME: This currently goes under the name display text but that will be repurposed below. Also, User:Surjection suggests renaming this to normalized input text, but "normalized" is used in a different sense in Module:usex.)
  3. The display text is the text in the form as it will be displayed to the user. This is what appears in headwords, in usexes, in displayed internal links, etc. This can include accent marks that are removed to form the stripped display text (see below), as well as embedded bracketed links that are variously processed further. The display text is generated from the corrected input text by applying language-specific transformations; for most languages, there will be no such transformations. The general reason for having a difference between input and display text is to allow for extra information in the input text that is not displayed to the user but is sent to the transliteration module. Note that having different display and input text is only supported currently through special-casing but will be generalized. Examples of transformations are: (1) Removing the ^ that is used in certain East Asian (and possibly other unicameral) languages to indicate capitalization of the transliteration (which is currently special-cased); (2) for Korean, removing or otherwise processing hyphens (which is currently special-cased); (3) for Arabic, removing a sukūn diacritic placed over a tāʔ marbūṭa (like this: ةْ) to indicate that the tāʔ marbūṭa is pronounced and transliterated as /t/ instead of being silent [NOTE, NOT IMPLEMENTED YET]; (4) for Thai and Khmer, converting space-separated words to bracketed words and resolving respelling substitutions such as [กรีน/กฺรีน], which indicate how to transliterate given words [NOTE, NOT IMPLEMENTED YET except in language-specific templates like {{th-usex}}].
    1. The right-resolved display text is the result of removing brackets around one-part embedded links and resolving two-part embedded links into their right-hand components (i.e. converting two-part links into the displayed form). The process of right-resolution is what happens when you call remove_links() in Module:links on some text. When applied to the display text, it produces exactly what the user sees, without any link markup.
  4. The stripped display text is the result of applying diacritic-stripping to the display text.
    1. The left-resolved stripped display text [NEED BETTER NAME] is the result of applying left-resolution to the stripped display text, i.e. similar to right-resolution but resolving two-part embedded links into their left-hand components (i.e. the linked-to page). If the display text refers to a single page, the result of applying diacritic stripping and left-resolution is the logical pagename.
  5. The physical pagename text is the result of converting the stripped display text into physical page links. If the stripped display text contains embedded links, the left side of those links is converted into physical page links; otherwise, the entire text is considered a pagename and converted in the same fashion. The conversion does three things: (1) converts characters not allowed in pagenames into their "unsupported title" representation, e.g. Unsupported titles/gt in place of the logical name >; (2) handles certain special-cased unsupported-title logical pagenames, such as Unsupported titles/Space in place of [space] and Unsupported titles/Ancient Greek dish in place of a very long Greek name for a gourmet dish as found in Aristophanes; (3) converts "mammoth" pagenames such as a into their appropriate split component, e.g. a/languages A to L.
  6. The source translit text is the text as supplied to the language-specific transliterate() method. The form of the source translit text may need to be language-specific, e.g. Thai and Khmer will need the corrected input text, whereas other languages may need to work off the display text. [FIXME: It's still unclear to me how embedded bracketed links are handled in the existing code.] In general, embedded links need to be right-resolved (see above), but when this happens is unclear to me [FIXME]. Some languages have a chop-up-and-paste-together scheme that sends parts of the text through the transliterate mechanism, while others (those listed with "cont" in substitution in Module:languages/data) receive the full input text, but preprocessed in certain ways. (The wisdom of this is still unclear to me.)
  7. The transliterated text (or transliteration) is the result of transliterating the source translit text. Unlike for all the other text variants except the transcribed text, it is always in the Latin script.
  8. The transcribed text (or transcription) is the result of transcribing the source translit text, where "transcription" here means a close approximation to the phonetic form in languages (e.g. Akkadian, Sumerian, Ancient Egyptian, maybe Tibetan) that have a wide difference between the written letters and spoken form. Unlike for all the other text variants other than the transliterated text, it is always in the Latin script. Currently, the transcribed text is always supplied manually by the user; there is no such thing as a transcribe() method on language objects.
  9. The sort key is the text used in sort keys for determining the placing of pages in categories they belong to. The sort key is generated from the pagename or a specified sort base by lowercasing, doing language-specific transformations and then uppercasing the result. If the sort base is supplied and is generated from input text, it needs to be converted to display text, have embedded links removed through right-resolution and have diacritic-stripping applied.
  10. There are other text variants that occur in usexes (specifically, there are normalized variants of several of the above text variants), but we can skip them for now.

The following methods exist on Language objects to convert between different text variants:

  1. correctInputText (currently called makeDisplayText): This converts input text to corrected input text.
  2. stripDiacritics: This converts display text to stripped display text. [FIXME: This needs some rethinking. In particular, stripDiacritics is sometimes called on input text, corrected input text or display text (in various paths inside of Module:links, and, in the case of input text, usually from other modules). We need to make sure we don't try to convert input text to display text twice, but at the same time we need to support calling it directly on input text since so many modules do this. This means we need to add a parameter indicating whether the passed-in text is input, corrected input, or display text; if the former two, we call correctInputText ourselves.]
  3. logicalToPhysical: This converts logical pagenames to physical pagenames.
  4. transliterate: This appears to convert input text with embedded brackets removed into a transliteration. [FIXME: This needs some rethinking. In particular, it calls processDisplayText on its input, which won't work for Thai and Khmer, so we may need language-specific flags indicating whether to pass the input text directly to the language transliterate method. In addition, I'm not sure how embedded links are handled in the existing translit code; a lot of callers remove the links themselves before calling transliterate(), which I assume is wrong.]
  5. makeSortKey: This converts display text (?) to a sort key. [FIXME: Clarify this.]

export.getDataModuleName

function export.getDataModuleName(code)

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

export.getExtraDataModuleName

function export.getExtraDataModuleName(code)

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

export.makeObject

function export.makeObject(code, data, dontCanonicalizeAliases)

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

export.getByCode

function export.getByCode(code, paramForError, allowEtymLang, allowFamily)

Finds the language whose code matches the one provided. If it exists, it returns a Language object representing the language. Otherwise, it returns nil, unless paramForError is given, in which case an error is generated. If paramForError is true, a generic error message mentioning the bad code is generated; otherwise paramForError should be a string or number specifying the parameter that the code came from, and this parameter will be mentioned in the error message along with the bad code. If allowEtymLang is specified, etymology-only language codes are allowed and looked up along with normal language codes. If allowFamily is specified, language family codes are allowed and looked up along with normal language codes.
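A minimal sketch of the error-handling behavior described above (assuming the Scribunto environment; "xyzzy" is a hypothetical invalid code):

```lua
local m_languages = require("Module:languages")

-- No paramForError: an unrecognized code simply yields nil.
local lang = m_languages.getByCode("xyzzy")

-- paramForError = true: throws a generic error mentioning the bad code.
-- m_languages.getByCode("xyzzy", true)

-- paramForError = 1: throws an error mentioning parameter 1 and the bad code.
-- m_languages.getByCode("xyzzy", 1)

-- allowEtymLang / allowFamily widen the lookup to etymology-only and family codes.
local fr_ca = m_languages.getByCode("fr-CA", nil, true)
```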

export.getByCanonicalName

function export.getByCanonicalName(name, errorIfInvalid, allowEtymLang, allowFamily)

Finds the language whose canonical name (the name used to represent that language on Wiktionary) or other name matches the one provided. If it exists, it returns a Language object representing the language. Otherwise, it returns nil, unless errorIfInvalid is given, in which case an error is generated. If allowEtymLang is specified, etymology-only language codes are allowed and looked up along with normal language codes. If allowFamily is specified, language family codes are allowed and looked up along with normal language codes. The canonical name of languages should always be unique (it is an error for two languages on Wiktionary to share the same canonical name), so this is guaranteed to give at most one result. This function is powered by Module:languages/canonical names, which contains a pre-generated mapping of full-language canonical names to codes. It is generated by going through the Category:Language data modules for full languages. When allowEtymLang is specified for the above function, Module:etymology languages/canonical names may also be used, and when allowFamily is specified for the above function, Module:families/canonical names may also be used.
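Since canonical names are unique, a lookup returns at most one Language object. A sketch, assuming the Scribunto environment:

```lua
local m_languages = require("Module:languages")

local de = m_languages.getByCanonicalName("German")

-- With allowEtymLang set, etymology-only canonical names (looked up via
-- Module:etymology languages/canonical names) are accepted as well:
local late_latin = m_languages.getByCanonicalName("Late Latin", nil, true)
```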

export.finalizeData

function export.finalizeData(data, main_type, variety)

Used by Module:languages/data/2 (et al.) and Module:etymology languages/data, Module:families/data, Module:scripts/data and Module:writing systems/data to finalize the data into the format that is actually returned.

export.err

function export.err(lang_code, param, code_desc, template_tag, not_real_lang)

For backwards compatibility only; modules should require the error themselves.

Language objects

A Language object is returned from one of the functions above. It is a Lua representation of a language and the data associated with it. It has a number of methods that can be called on it, using the : syntax. For example:

local m_languages = require("Module:languages")
local lang = m_languages.getByCode("fr")
local name = lang:getCanonicalName()
-- "name" will now be "French"

Language:getCode

function Language:getCode()

Returns the language code of the language. Example: "fr" for French.

Language:getCanonicalName

function Language:getCanonicalName()

Returns the canonical name of the language. This is the name used to represent that language on Wiktionary, and is guaranteed to be unique to that language alone. Example: "French" for French.

Language:getDisplayForm

function Language:getDisplayForm()

Return the display form of the language. The display form of a language, family or script is the form it takes when appearing as the source in categories such as English terms derived from source or English given names from source, and is also the displayed text in makeCategoryLink() links. For full and etymology-only languages, this is the same as the canonical name, but for families, it reads "name languages" (e.g. "Indo-Iranian languages"), and for scripts, it reads "name script" (e.g. "Arabic script").

Language:getHTMLAttribute

function Language:getHTMLAttribute(sc, region)

Returns the value which should be used in the HTML lang= attribute for tagged text in the language.

Language:getAliases

function Language:getAliases()

Returns a table of the aliases that the language is known by, excluding the canonical name. Aliases are synonyms for the language in question. The names are not guaranteed to be unique, in that sometimes more than one language is known by the same name. Example: {"High German","New High German","Deutsch"} for German.

Language:getVarieties

function Language:getVarieties(flatten)

Return a table of the known subvarieties of a given language, excluding subvarieties that have been given explicit etymology-only language codes. The names are not guaranteed to be unique, in that sometimes a given name refers to a subvariety of more than one language. Example: {"Southern Aymara","Central Aymara"} for Aymara. Note that the returned value can have nested tables in it, when a subvariety goes by more than one name. Example: {"North Azerbaijani","South Azerbaijani",{"Afshar","Afshari","Afshar Azerbaijani","Afchar"},{"Qashqa'i","Qashqai","Kashkay"},"Sonqor"} for Azerbaijani. Here, for example, Afshar, Afshari, Afshar Azerbaijani and Afchar all refer to the same subvariety, whose preferred name is Afshar (the one listed first). To avoid a return value with nested tables in it, specify a non-nil value for the flatten parameter; in that case, the return value would be {"North Azerbaijani","South Azerbaijani","Afshar","Afshari","Afshar Azerbaijani","Afchar","Qashqa'i","Qashqai","Kashkay","Sonqor"}.
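Using the Azerbaijani example above, the flatten parameter changes the shape of the result (a sketch, assuming the Scribunto environment):

```lua
local m_languages = require("Module:languages")
local az = m_languages.getByCode("az")

-- Nested form: synonyms of a single subvariety are grouped in inner tables,
-- with the preferred name listed first.
local nested = az:getVarieties()

-- Flattened form: a single flat list of all names.
local flat = az:getVarieties(true)
```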

Language:getOtherNames

function Language:getOtherNames()

Returns a table of the "other names" that the language is known by, which are listed in the otherNames field. It should be noted that the otherNames field itself is deprecated, and entries listed there should eventually be moved to either aliases or varieties.

Language:getAllNames

function Language:getAllNames()

Return a combined table of the canonical name, aliases, varieties and other names of a given language.

Language:getTypes

function Language:getTypes()

Returns a table of types as a lookup table (with the types as keys).

The possible types are

  • language: This is a language, either full or etymology-only.
  • full: This is a "full" (not etymology-only) language, i.e. the union of regular, reconstructed and appendix-constructed. Note that the types full and etymology-only also exist for families, so if you want to check specifically for a full language and you have an object that might be a family, you should use hasType("language","full") and not simply hasType("full").
  • etymology-only: This is an etymology-only (not full) language, whose parent is another etymology-only language or a full language. Note that the types full and etymology-only also exist for families, so if you want to check specifically for an etymology-only language and you have an object that might be a family, you should use hasType("language","etymology-only") and not simply hasType("etymology-only").
  • regular: This indicates a full language that is attested according to WT:CFI and therefore permitted in the main namespace. There may also be reconstructed terms for the language, which are placed in the Reconstruction namespace and must be prefixed with * to indicate a reconstruction. Most full languages are natural (not constructed) languages, but a few constructed languages (e.g. Esperanto and Volapük, among others) are also allowed in the mainspace and considered regular languages.
  • reconstructed: This language is not attested according to WT:CFI, and therefore is allowed only in the Reconstruction namespace. All terms in this language are reconstructed, and must be prefixed with *. Languages such as Proto-Indo-European and Proto-Germanic are in this category.
  • appendix-constructed: This language is attested but does not meet the additional requirements set out for constructed languages (WT:CFI#Constructed languages). Its entries must therefore be in the Appendix namespace, but they are not reconstructed and therefore should not have * prefixed in links.

Language:hasType

function Language:hasType(...)

Given a list of types as strings, returns true if the language has all of them.
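For example (a sketch, assuming the Scribunto environment):

```lua
local m_languages = require("Module:languages")
local fr = m_languages.getByCode("fr")

-- Check for a full language specifically, as recommended above, since the
-- types "full" and "etymology-only" also exist for families:
fr:hasType("language", "full")   -- true for French
fr:hasType("reconstructed")      -- false: French is a regular, attested language
```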

Language:getWikimediaLanguages

function Language:getWikimediaLanguages()

Returns a table containing WikimediaLanguage objects (see Module:wikimedia languages), which represent languages and their codes as they are used in Wikimedia projects for interwiki linking and such. More than one object may be returned, as a single Wiktionary language may correspond to multiple Wikimedia languages. For example, Wiktionary's single code sh (Serbo-Croatian) maps to four Wikimedia codes: sh (Serbo-Croatian), bs (Bosnian), hr (Croatian) and sr (Serbian). The code for the Wikimedia language is retrieved from the wikimedia_codes property in the data modules. If that property is not present, the code of the current language is used. If none of the available codes is actually a valid Wikimedia code, an empty table is returned.
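The Serbo-Croatian example can be sketched as follows (assuming the Scribunto environment):

```lua
local m_languages = require("Module:languages")
local sh = m_languages.getByCode("sh")

-- One Wiktionary language may correspond to several Wikimedia languages;
-- per the description above, "sh" maps to the Wikimedia codes sh, bs, hr and sr.
local wm_langs = sh:getWikimediaLanguages()
-- Each entry is a WikimediaLanguage object (see Module:wikimedia languages).
```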

Language:getWikimediaLanguageCodes

function Language:getWikimediaLanguageCodes()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getWikipediaArticle

function Language:getWikipediaArticle(noCategoryFallback, project)

Returns the name of the Wikipedia article for the language. project specifies the language and project to retrieve the article from, defaulting to "enwiki" for the English Wikipedia. Normally if specified it should be the project code for a specific-language Wikipedia e.g. "zhwiki" for the Chinese Wikipedia, but it can be any project, including non-Wikipedia ones. If the project is the English Wikipedia and the property wikipedia_article is present in the data module it will be used first. In all other cases, a sitelink will be generated from :getWikidataItem (if set). The resulting value (or lack of value) is cached so that subsequent calls are fast. If no value could be determined, and noCategoryFallback is false, :getCategoryName is used as fallback; otherwise, nil is returned. Note that if noCategoryFallback is nil or omitted, it defaults to false if the project is the English Wikipedia, otherwise to true. In other words, under normal circumstances, if the English Wikipedia article couldn't be retrieved, the return value will fall back to a link to the language's category, but this won't normally happen for any other project.

Language:makeWikipediaLink

function Language:makeWikipediaLink()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getCommonsCategory

function Language:getCommonsCategory()

Returns the name of the Wikimedia Commons category page for the language.

Language:getWikidataItem

function Language:getWikidataItem()

Returns the Wikidata item id for the language or nil. This corresponds to the second field in the data modules.

Language:getScripts

function Language:getScripts()

Returns a table of Script objects for all scripts that the language is written in. See Module:scripts.

Language:getScriptCodes

function Language:getScriptCodes()

Returns the table of script codes in the language's data file.

Language:findBestScript

function Language:findBestScript(text, forceDetect)

Given some text, this function iterates through the scripts of a given language and tries to find the script that best matches the text. It returns a Script object representing the script. If no match is found at all, it returns the None script object.

Language:getFamily

function Language:getFamily()

Returns a Family object for the language family that the language belongs to. See Module:families.

Language:getFamilyCode

function Language:getFamilyCode()

Returns the family code in the language's data file.

Language:getFamilyName

function Language:getFamilyName()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:inFamily

function Language:inFamily(...)

Checks whether the language belongs to a family (which can be given as a family code or object). A list of codes or objects can be given in place of a single family; in that case, returns true if the language belongs to any of the specified families. Note that some languages (in particular, certain creoles) can have multiple immediate ancestors potentially belonging to different families; the same rule applies in that case.

Language:getParent

function Language:getParent()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getParentCode

function Language:getParentCode()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getParentName

function Language:getParentName()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getParentChain

function Language:getParentChain()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:hasParent

function Language:hasParent(...)

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getFull

function Language:getFull()

If the language is etymology-only, this iterates through parents until a full language or family is found, and the corresponding object is returned. If the language is a full language, then it simply returns itself.

Language:getFullCode

function Language:getFullCode()

If the language is an etymology-only language, this iterates through parents until a full language or family is found, and the corresponding code is returned. If the language is a full language, then it simply returns the language code.

Language:getFullName

function Language:getFullName()

If the language is an etymology-only language, this iterates through parents until a full language or family is found, and the corresponding canonical name is returned. If the language is a full language, then it simply returns the canonical name of the language.
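Using the Canadian French example from earlier (a sketch, assuming the Scribunto environment):

```lua
local m_languages = require("Module:languages")

-- "Canadian French" (fr-CA) is etymology-only; its full language is French.
local fr_ca = m_languages.getByCode("fr-CA", nil, true)
local code = fr_ca:getFullCode()   -- "fr"
local name = fr_ca:getFullName()   -- "French"

-- A full language simply returns its own code and name.
local fr = m_languages.getByCode("fr")
fr:getFullCode()                   -- "fr"
```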

Language:getAncestors

function Language:getAncestors()

Returns a table of Language objects for all languages that this language is directly descended from. Generally this is only a single language, but creoles, pidgins and mixed languages can have multiple ancestors.

Language:getAncestorCodes

function Language:getAncestorCodes()

Returns a table of Language codes for all languages that this language is directly descended from. Generally this is only a single language, but creoles, pidgins and mixed languages can have multiple ancestors.

Language:hasAncestor

function Language:hasAncestor(...)

Given a list of language objects or codes, returns true if at least one of them is an ancestor. This includes any etymology-only children of that ancestor. If the language's ancestor(s) are etymology-only languages, it will also return true for the parent(s) of those ancestors (e.g. if Vulgar Latin is the ancestor, it will also return true for its parent, Latin). However, a parent is excluded from this if the ancestor is also ancestral to that parent (e.g. if Classical Persian is the ancestor, Persian would return false, because Classical Persian is also ancestral to Persian).
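For instance (a sketch assuming the Scribunto environment; whether each call returns true depends on the ancestors data recorded for the language):

```lua
local m_languages = require("Module:languages")
local fr = m_languages.getByCode("fr")

-- Codes or Language objects may be passed; true if at least one is an ancestor.
local from_latin = fr:hasAncestor("la")
local from_either = fr:hasAncestor("la", "grc")
```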

Language:getAncestorChain

function Language:getAncestorChain()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getAncestorChainOld

function Language:getAncestorChainOld()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getDescendants

function Language:getDescendants()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getDescendantCodes

function Language:getDescendantCodes()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getDescendantNames

function Language:getDescendantNames()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:hasDescendant

function Language:hasDescendant(...)

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getChildren

function Language:getChildren()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getChildrenCodes

function Language:getChildrenCodes()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getChildrenNames

function Language:getChildrenNames()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:hasChild

function Language:hasChild(...)

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getCategoryName

function Language:getCategoryName(nocap)

Returns the name of the main category of that language. Example: "French language" for French, whose category is at Category:French language. Unless optional argument nocap is given, the language name at the beginning of the returned value will be capitalized. This capitalization is correct for category names, but not if the language name is lowercase and the returned value of this function is used in the middle of a sentence.
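For example (a sketch assuming the Scribunto environment; for a language like French, whose canonical name is already capitalized, nocap makes no visible difference):

```lua
local fr = require("Module:languages").getByCode("fr")
-- Category form, with the language name capitalized:
mw.log(fr:getCategoryName())
-- Pass a truthy nocap argument to leave the initial letter of the
-- language name as-is, for use mid-sentence with lowercase names:
mw.log(fr:getCategoryName(true))
```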

Language:makeCategoryLink

function Language:makeCategoryLink()

Creates a link to the category; the link text is the canonical name.

Language:getStandardCharacters

function Language:getStandardCharacters(sc)

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:stripDiacritics

function Language:stripDiacritics(text, sc)

Strip diacritics from display text text (in a language-specific fashion), which is in the script sc. If sc is omitted or nil, the script is autodetected. This also strips certain punctuation characters from the end and (in the case of Spanish upside-down question mark and exclamation points) from the beginning; strips any whitespace at the end of the text or between the text and final stripped punctuation characters; and applies some language-specific Unicode normalizations to replace discouraged characters with their prescribed alternatives. Return the stripped text.
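A minimal sketch (assuming the Scribunto environment; Latin is a convenient example because macrons appear in display text but are excluded from entry names):

```lua
local la = require("Module:languages").getByCode("la")
-- Macrons are stripped from Latin display text:
mw.log(la:stripDiacritics("vīta"))
-- The script can be passed explicitly instead of being autodetected:
local Latn = require("Module:scripts").getByCode("Latn")
mw.log(la:stripDiacritics("amīcus", Latn))
```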

Language:logicalToPhysical

function Language:logicalToPhysical(pagename, is_reconstructed_or_appendix)

Convert a logical pagename (the pagename as it appears to the user, after diacritics and punctuation have been stripped) to a physical pagename (the pagename as it appears in the MediaWiki database). Reasons for a difference between the two are (a) unsupported titles such as [ ] (with square brackets in them), # (pound/hash sign) and ¯\_(ツ)_/¯ (with underscores), as well as overly long titles of various sorts; (b) "mammoth" pages that are split into parts (e.g. a, which is split into physical pagenames a/languages A to L and a/languages M to Z). For almost all purposes, you should work with logical and not physical pagenames. But there are certain use cases that require physical pagenames, such as checking the existence of a page or retrieving a page's contents.

pagename is the logical pagename to be converted. is_reconstructed_or_appendix indicates whether the page is in the Reconstruction or Appendix namespaces. If it is omitted or nil, the pagename is checked for an initial asterisk, and if one is found, the page is assumed to be a Reconstruction page. Explicitly setting is_reconstructed_or_appendix to false or true disables this check and allows for mainspace pagenames that begin with an asterisk.
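A sketch of both cases (assuming the Scribunto environment; the exact physical names are determined by the unsupported-title data):

```lua
local en = require("Module:languages").getByCode("en")
-- "#" cannot appear in a MediaWiki title, so the physical pagename
-- is an "Unsupported titles/..." page:
mw.log(en:logicalToPhysical("#"))
-- The initial asterisk marks this as a Reconstruction page, since
-- is_reconstructed_or_appendix is omitted:
local gem = require("Module:languages").getByCode("gem-pro")
mw.log(gem:logicalToPhysical("*wulfaz"))
```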

Language:makeEntryName

function Language:makeEntryName(text, sc, is_reconstructed_or_appendix)

Strip the diacritics from a display pagename and convert the resulting logical pagename into a physical pagename. This allows you, for example, to retrieve the contents of the page or check its existence. WARNING: This is deprecated and will be going away. It is a simple composition of self:stripDiacritics and self:logicalToPhysical; most callers only want the former, and if you need both, call them both yourself.

text and sc are as in self:stripDiacritics, and is_reconstructed_or_appendix is as in self:logicalToPhysical.

Language:generateForms

function Language:generateForms(text, sc)

Generates alternative forms using a specified method, and returns them as a table. If no method is specified, returns a table containing only the input term.

Language:makeSortKey

function Language:makeSortKey(text, sc)

Creates a sort key for the given stripped text, following the rules appropriate for the language. This removes diacritical marks from the stripped text if they are not considered significant for sorting, and may perform some other changes. Any initial hyphen is also removed, and anything in parentheses is removed as well. The sort_key setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the stripped text and returns a sortkey.
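For instance (a sketch assuming the Scribunto environment; the input should already be stripped text, as noted above):

```lua
local fr = require("Module:languages").getByCode("fr")
-- Diacritics that are insignificant for French sorting are removed:
mw.log(fr:makeSortKey("déjà"))
-- An initial hyphen (as in suffix entries) is also removed:
mw.log(fr:makeSortKey("-esque"))
```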

Language:makeDisplayText

function Language:makeDisplayText(text, sc, keepPrefixes)

Make the display text (i.e. what is displayed on the page).

Language:transliterate

function Language:transliterate(text, sc, module_override)

Transliterates the text from the given script into the Latin script (see Wiktionary:Transliteration and romanization). The language must have the translit property for this to work; if it is not present, nil is returned.

The sc parameter is handled by the transliteration module, and how it is handled is specific to that module. Some transliteration modules may tolerate nil as the script, others require it to be one of the possible scripts that the module can transliterate, and will throw an error if it's not one of them. For this reason, the sc parameter should always be provided when writing non-language-specific code.

The module_override parameter is used to override the default module that is used to provide the transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no default module yet, or you want to demonstrate an alternative version of a transliteration module before making it official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked by Wiktionary:Tracking/languages/module_override. Known bugs:

  • This function assumes tr(s1)..tr(s2)==tr(s1..s2). When this assumption fails, wikitext markup like ''' can cause wrong transliterations.
  • HTML entities like &apos;, often used to escape wikitext markups, do not work.
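A sketch (assuming the Scribunto environment; per the advice above, the script is passed explicitly):

```lua
local m_languages = require("Module:languages")
local ru = m_languages.getByCode("ru")
local Cyrl = require("Module:scripts").getByCode("Cyrl")
-- Russian has a transliteration module, so this returns Latin text:
mw.log(ru:transliterate("слово", Cyrl))
-- English has no translit property, so nil is returned:
mw.log(m_languages.getByCode("en"):transliterate("word"))
```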

Language:overrideManualTranslit

function Language:overrideManualTranslit(sc)

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:link_tr

function Language:link_tr(sc)

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:hasTranslit

function Language:hasTranslit()

Returns true if the language has a transliteration module, or false if it doesn't.

Language:hasDottedDotlessI

function Language:hasDottedDotlessI()

Returns true if the language uses the letters I/ı and İ/i, or false if it doesn't.

Language:toJSON

function Language:toJSON(opts)

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getData

function Language:getData(extra, raw)

This function is not for use in entries or other content pages. Returns a blob of data about the language. The format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes. If extra is set, any extra data in the relevant /extra module will be included. (Note that it will be included anyway if it has already been loaded into the language object.) If raw is set, then the returned data will not contain any data inherited from parent objects. Do NOT use these methods! All uses should be pre-approved on the talk page!

Language:loadInExtraData

function Language:loadInExtraData()

This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.

Language:getDataModuleName

function Language:getDataModuleName()

Returns the name of the module containing the language's data (one of the data submodules of this module).

Language:getExtraDataModuleName

function Language:getExtraDataModuleName()

Returns the name of the module containing the language's extra data (one of the extra data submodules of this module).

Error function

See Module:languages/error.

Subpages

See also


--[==[ intro:
This module implements fetching of language-specific information and processing text in a given language.
===Types of languages===
There are two types of languages: full languages and etymology-only languages. The essential difference is that only
full languages appear in L2 headings in vocabulary entries, and hence categories like [[:Category:French nouns]] exist
only for full languages. Etymology-only languages have either a full language or another etymology-only language as
their parent (in the parent-child inheritance sense), and for etymology-only languages with another etymology-only
language as their parent, a full language can always be derived by following the parent links upwards. For example,
"Canadian French", code `fr-CA`, is an etymology-only language whose parent is the full language "French", code `fr`.
An example of an etymology-only language with another etymology-only parent is "Northumbrian Old English", code
`ang-nor`, which has "Anglian Old English", code `ang-ang` as its parent; this is an etymology-only language whose
parent is "Old English", code `ang`, which is a full language. (This is because Northumbrian Old English is considered
a variety of Anglian Old English.) Sometimes the parent is the "Undetermined" language, code `und`; this is the case,
for example, for "substrate" languages such as "Pre-Greek", code `qsb-grc`, and "the BMAC substrate", code `qsb-bma`.
It is important to distinguish language ''parents'' from language ''ancestors''. The parent-child relationship is one
of containment, i.e. if X is a child of Y, X is considered a variety of Y. On the other hand, the ancestor-descendant
relationship is one of descent in time. For example, "Classical Latin", code `la-cla`, and "Late Latin", code `la-lat`,
are both etymology-only languages with "Latin", code `la`, as their parents, because both of the former are varieties
of Latin. However, Late Latin does *NOT* have Classical Latin as its parent because Late Latin is *not* a variety of
Classical Latin; rather, it is a descendant. There is in fact a separate `ancestors` field that is used to express the
ancestor-descendant relationship, and Late Latin's ancestor is given as Classical Latin. It is also important to note
that sometimes an etymology-only language is actually the conceptual ancestor of its parent language. This happens,
for example, with "Old Italian" (code `roa-oit`), which is an etymology-only variant of full language "Italian" (code
`it`), and with "Old Latin" (code `itc-ola`), which is an etymology-only variant of Latin. In both cases, the full
language has the etymology-only variant listed as an ancestor. This allows a Latin term to inherit from Old Latin
using the {{tl|inh}} template (where in this template, "inheritance" refers to ancestral inheritance, i.e. inheritance
in time, rather than in the parent-child sense); likewise for Italian and Old Italian.
Full languages come in three subtypes:
* {regular}: This indicates a full language that is attested according to [[WT:CFI]] and therefore permitted in the
			main namespace. There may also be reconstructed terms for the language, which are placed in the
			{Reconstruction} namespace and must be prefixed with * to indicate a reconstruction. Most full languages
			are natural (not constructed) languages, but a few constructed languages (e.g. Esperanto and Volapük,
			among others) are also allowed in the mainspace and considered regular languages.
* {reconstructed}: This language is not attested according to [[WT:CFI]], and therefore is allowed only in the
				{Reconstruction} namespace. All terms in this language are reconstructed, and must be prefixed with
				*. Languages such as Proto-Indo-European and Proto-Germanic are in this category.
* {appendix-constructed}: This language is attested but does not meet the additional requirements set out for
						constructed languages ([[WT:CFI#Constructed languages]]). Its entries must therefore be in
						the Appendix namespace, but they are not reconstructed and therefore should not have *
						prefixed in links. Most constructed languages are of this subtype.
Both full languages and etymology-only languages have a {Language} object associated with them, which is fetched using
the {getByCode} function in [[Module:languages]] to convert a language code to a {Language} object. Depending on the
options supplied to this function, etymology-only languages may or may not be accepted, and family codes may be
accepted (returning a {Family} object as described in [[Module:families]]). There are also separate {getByCanonicalName}
functions in [[Module:languages]] and [[Module:etymology languages]] to convert a language's canonical name to a
{Language} object (depending on whether the canonical name refers to a full or etymology-only language).
===Textual representations===
Textual strings belonging to a given language come in several different ''text variants'':
# The ''input text'' is what the user supplies in wikitext, in the parameters to {{tl|m}}, {{tl|l}}, {{tl|ux}},
 {{tl|t}}, {{tl|lang}} and the like.
# The ''corrected input text'' is the input text with some corrections and/or normalizations applied, such as
 bad-character replacements for certain languages, like replacing `l` or `1` with [[palochka]] in some languages written
 in Cyrillic. (FIXME: This currently goes under the name ''display text'' but that will be repurposed below. Also,
 [[User:Surjection]] suggests renaming this to ''normalized input text'', but "normalized" is used in a different sense
 in [[Module:usex]].)
# The ''display text'' is the text in the form as it will be displayed to the user. This is what appears in headwords,
 in usexes, in displayed internal links, etc. This can include accent marks that are removed to form the stripped
 display text (see below), as well as embedded bracketed links that are variously processed further. The display text
 is generated from the corrected input text by applying language-specific transformations; for most languages, there
 will be no such transformations. The general reason for having a difference between input and display text is to allow
 for extra information in the input text that is not displayed to the user but is sent to the transliteration module.
 Note that having different display and input text is only supported currently through special-casing but will be
 generalized. Examples of transformations are: (1) Removing the {{cd|^}} that is used in certain East Asian (and
 possibly other unicameral) languages to indicate capitalization of the transliteration (which is currently
 special-cased); (2) for Korean, removing or otherwise processing hyphens (which is currently special-cased); (3) for
 Arabic, removing a ''sukūn'' diacritic placed over a ''tāʔ marbūṭa'' (like this: ةْ) to indicate that the
 ''tāʔ marbūṭa'' is pronounced and transliterated as /t/ instead of being silent [NOTE, NOT IMPLEMENTED YET]; (4) for
 Thai and Khmer, converting space-separated words to bracketed words and resolving respelling substitutions such as
 `[กรีน/กฺรีน]`, which indicate how to transliterate given words [NOTE, NOT IMPLEMENTED YET except in language-specific
 templates like {{tl|th-usex}}].
## The ''right-resolved display text'' is the result of removing brackets around one-part embedded links and resolving
 two-part embedded links into their right-hand components (i.e. converting two-part links into the displayed form).
 The process of right-resolution is what happens when you call {{cd|remove_links()}} in [[Module:links]] on some text.
 When applied to the display text, it produces exactly what the user sees, without any link markup.
# The ''stripped display text'' is the result of applying diacritic-stripping to the display text.
## The ''left-resolved stripped display text'' [NEED BETTER NAME] is the result of applying left-resolution to the
 stripped display text, i.e. similar to right-resolution but resolving two-part embedded links into their left-hand
 components (i.e. the linked-to page). If the display text refers to a single page, the result of applying
 diacritic stripping and left-resolution produces the ''logical pagename''.
# The ''physical pagename text'' is the result of converting the stripped display text into physical page links. If the
 stripped display text contains embedded links, the left side of those links is converted into physical page links;
 otherwise, the entire text is considered a pagename and converted in the same fashion. The conversion does three
 things: (1) converts characters not allowed in pagenames into their "unsupported title" representation, e.g.
 {{cd|Unsupported titles/`gt`}} in place of the logical name {{cd|>}}; (2) handles certain special-cased
 unsupported-title logical pagenames, such as {{cd|Unsupported titles/Space}} in place of {{cd|[space]}} and
 {{cd|Unsupported titles/Ancient Greek dish}} in place of a very long Greek name for a gourmet dish as found in
 Aristophanes; (3) converts "mammoth" pagenames such as [[a]] into their appropriate split component, e.g.
 [[a/languages A to L]].
# The ''source translit text'' is the text as supplied to the language-specific {{cd|transliterate()}} method. The form
 of the source translit text may need to be language-specific, e.g. Thai and Khmer will need the corrected input text,
 whereas other languages may need to work off the display text. [FIXME: It's still unclear to me how embedded bracketed
 links are handled in the existing code.] In general, embedded links need to be right-resolved (see above), but when
 this happens is unclear to me [FIXME]. Some languages have a chop-up-and-paste-together scheme that sends parts of the
 text through the transliterate mechanism, and for others (those listed with "cont" in {{cd|substitution}} in
 [[Module:languages/data]]) they receive the full input text, but preprocessed in certain ways. (The wisdom of this is
 still unclear to me.)
# The ''transliterated text'' (or ''transliteration'') is the result of transliterating the source translit text. Unlike
 for all the other text variants except the transcribed text, it is always in the Latin script.
# The ''transcribed text'' (or ''transcription'') is the result of transcribing the source translit text, where
 "transcription" here means a close approximation to the phonetic form of the language in languages (e.g. Akkadian,
 Sumerian, Ancient Egyptian, maybe Tibetan) that have a wide difference between the written letters and spoken form.
 Unlike for all the other text variants other than the transliterated text, it is always in the Latin script.
 Currently, the transcribed text is always supplied manually by the user; there is no such thing as a
 {{cd|transcribe()}} method on language objects.
# The ''sort key'' is the text used in sort keys for determining the placing of pages in categories they belong to. The
 sort key is generated from the pagename or a specified ''sort base'' by lowercasing, doing language-specific
 transformations and then uppercasing the result. If the sort base is supplied and is generated from input text, it
 needs to be converted to display text, have embedded links removed through right-resolution and have
 diacritic-stripping applied.
# There are other text variants that occur in usexes (specifically, there are normalized variants of several of the
 above text variants), but we can skip them for now.
The following methods exist on {Language} objects to convert between different text variants:
# {correctInputText} (currently called {makeDisplayText}): This converts input text to corrected input text.
# {stripDiacritics}: This converts to stripped display text. [FIXME: This needs some rethinking. In particular,
 {stripDiacritics} is sometimes called on input text, corrected input text or display text (in various paths inside of
 [[Module:links]], and, in the case of input text, usually from other modules). We need to make sure we don't try to
 convert input text to display text twice, but at the same time we need to support calling it directly on input text
 since so many modules do this. This means we need to add a parameter indicating whether the passed-in text is input,
 corrected input, or display text; if the former two, we call {correctInputText} ourselves.]
# {logicalToPhysical}: This converts logical pagenames to physical pagenames.
# {transliterate}: This appears to convert input text with embedded brackets removed into a transliteration.
 [FIXME: This needs some rethinking. In particular, it calls {processDisplayText} on its input, which won't work
 for Thai and Khmer, so we may need language-specific flags indicating whether to pass the input text directly to the
 language transliterate method. In addition, I'm not sure how embedded links are handled in the existing translit code;
 a lot of callers remove the links themselves before calling {transliterate()}, which I assume is wrong.]
# {makeSortKey}: This converts display text (?) to a sort key. [FIXME: Clarify this.]
]==]
local export = {}

local debug_track_module = "Module:debug/track"
local etymology_languages_data_module = "Module:etymology languages/data"
local families_module = "Module:families"
local headword_page_module = "Module:headword/page"
local json_module = "Module:JSON"
local language_like_module = "Module:language-like"
local languages_data_module = "Module:languages/data"
local languages_data_patterns_module = "Module:languages/data/patterns"
local links_data_module = "Module:links/data"
local load_module = "Module:load"
local scripts_module = "Module:scripts"
local scripts_data_module = "Module:scripts/data"
local string_encode_entities_module = "Module:string/encode entities"
local string_pattern_escape_module = "Module:string/patternEscape"
local string_replacement_escape_module = "Module:string/replacementEscape"
local string_utilities_module = "Module:string utilities"
local table_module = "Module:table"
local utilities_module = "Module:utilities"
local wikimedia_languages_module = "Module:wikimedia languages"
local mw = mw
local string = string
local table = table

local char = string.char
local concat = table.concat
local find = string.find
local floor = math.floor
local get_by_code -- Defined below.
local get_data_module_name -- Defined below.
local get_extra_data_module_name -- Defined below.
local getmetatable = getmetatable
local gmatch = string.gmatch
local gsub = string.gsub
local insert = table.insert
local ipairs = ipairs
local is_known_language_tag = mw.language.isKnownLanguageTag
local make_object -- Defined below.
local match = string.match
local next = next
local pairs = pairs
local remove = table.remove
local require = require
local select = select
local setmetatable = setmetatable
local sub = string.sub
local type = type
local unstrip = mw.text.unstrip

-- Loaded as needed by findBestScript.
local Hans_chars
local Hant_chars
local function check_object(...)
	check_object = require(utilities_module).check_object
	return check_object(...)
end

local function debug_track(...)
	debug_track = require(debug_track_module)
	return debug_track(...)
end

local function decode_entities(...)
	decode_entities = require(string_utilities_module).decode_entities
	return decode_entities(...)
end

local function decode_uri(...)
	decode_uri = require(string_utilities_module).decode_uri
	return decode_uri(...)
end

local function deep_copy(...)
	deep_copy = require(table_module).deepCopy
	return deep_copy(...)
end

local function encode_entities(...)
	encode_entities = require(string_encode_entities_module)
	return encode_entities(...)
end

local function get_L2_sort_key(...)
	get_L2_sort_key = require(headword_page_module).get_L2_sort_key
	return get_L2_sort_key(...)
end

local function get_script(...)
	get_script = require(scripts_module).getByCode
	return get_script(...)
end

local function find_best_script_without_lang(...)
	find_best_script_without_lang = require(scripts_module).findBestScriptWithoutLang
	return find_best_script_without_lang(...)
end

local function get_family(...)
	get_family = require(families_module).getByCode
	return get_family(...)
end

local function get_plaintext(...)
	get_plaintext = require(utilities_module).get_plaintext
	return get_plaintext(...)
end

local function get_wikimedia_lang(...)
	get_wikimedia_lang = require(wikimedia_languages_module).getByCode
	return get_wikimedia_lang(...)
end

local function keys_to_list(...)
	keys_to_list = require(table_module).keysToList
	return keys_to_list(...)
end

local function list_to_set(...)
	list_to_set = require(table_module).listToSet
	return list_to_set(...)
end

local function load_data(...)
	load_data = require(load_module).load_data
	return load_data(...)
end

local function make_family_object(...)
	make_family_object = require(families_module).makeObject
	return make_family_object(...)
end

local function pattern_escape(...)
	pattern_escape = require(string_pattern_escape_module)
	return pattern_escape(...)
end

local function replacement_escape(...)
	replacement_escape = require(string_replacement_escape_module)
	return replacement_escape(...)
end

local function safe_require(...)
	safe_require = require(load_module).safe_require
	return safe_require(...)
end

local function shallow_copy(...)
	shallow_copy = require(table_module).shallowCopy
	return shallow_copy(...)
end

local function split(...)
	split = require(string_utilities_module).split
	return split(...)
end

local function to_json(...)
	to_json = require(json_module).toJSON
	return to_json(...)
end

local function u(...)
	u = require(string_utilities_module).char
	return u(...)
end

local function ugsub(...)
	ugsub = require(string_utilities_module).gsub
	return ugsub(...)
end

local function ulen(...)
	ulen = require(string_utilities_module).len
	return ulen(...)
end

local function ulower(...)
	ulower = require(string_utilities_module).lower
	return ulower(...)
end

local function umatch(...)
	umatch = require(string_utilities_module).match
	return umatch(...)
end

local function uupper(...)
	uupper = require(string_utilities_module).upper
	return uupper(...)
end

local function track(page)
	debug_track("languages/" .. page)
	return true
end

local function normalize_code(code)
	return load_data(languages_data_module).aliases[code] or code
end
local function check_inputs(self, check, default, ...)
	local n = select("#", ...)
	if n == 0 then
		return false
	end
	local ret = check(self, (...))
	if ret ~= nil then
		return ret
	elseif n > 1 then
		local inputs = {...}
		for i = 2, n do
			ret = check(self, inputs[i])
			if ret ~= nil then
				return ret
			end
		end
	end
	return default
end

local function make_link(self, target, display)
	local prefix, main
	if self:getFamilyCode() == "qfa-sub" then
		prefix, main = display:match("^(the )(.*)")
		if not prefix then
			prefix, main = display:match("^(a )(.*)")
		end
	end
	return (prefix or "") .. "[[" .. target .. "|" .. (main or display) .. "]]"
end
-- Convert risky characters to HTML entities, which minimizes interference once returned (e.g. for "sms:a", "<!-- -->" etc.).
local function escape_risky_characters(text)
	-- Spacing characters in isolation generally need to be escaped in order to be properly processed by the MediaWiki software.
	if umatch(text, "^%s*$") then
		return encode_entities(text, text)
	end
	return encode_entities(text, "!#%&*+/:;<=>?@[\\]_{|}")
end
-- Temporarily convert various formatting characters to PUA to prevent them from being disrupted by the substitution process.
local function doTempSubstitutions(text, subbedChars, keepCarets, noTrim)
	-- Clone so that we don't insert any extra patterns into the table in package.loaded. For some reason, using require seems to keep memory use down; probably because the table is always cloned.
	local patterns = shallow_copy(require(languages_data_patterns_module))
	if keepCarets then
		insert(patterns, "((\\+)%^)")
		insert(patterns, "((%^))")
	end
	-- Ensure any whitespace at the beginning and end is temp substituted, to prevent it from being accidentally trimmed. We only want to trim any final spaces added during the substitution process (e.g. by a module), which means we only do this during the first round of temp substitutions.
	if not noTrim then
		insert(patterns, "^([\128-\191\244]*(%s+))")
		insert(patterns, "((%s+)[\128-\191\244]*)$")
	end
	-- Pre-substitution of "[[" and "]]", which makes pattern matching more accurate.
	text = gsub(text, "%f[%[]%[%[", "\1"):gsub("%f[%]]%]%]", "\2")
	local i = #subbedChars
	for _, pattern in ipairs(patterns) do
		-- Patterns ending in \0 are for things like "[[" or "]]", so the inserted PUA are treated as breaks between terms by modules that scrape info from pages.
		local term_divider
		pattern = gsub(pattern, "%z$", function(divider)
			term_divider = divider == "\0"
			return ""
		end)
		text = gsub(text, pattern, function(...)
			local m = {...}
			local m1New = m[1]
			for k = 2, #m do
				local n = i + k - 1
				subbedChars[n] = m[k]
				local byte2 = floor(n / 4096) % 64 + (term_divider and 128 or 136)
				local byte3 = floor(n / 64) % 64 + 128
				local byte4 = n % 64 + 128
				m1New = gsub(m1New, pattern_escape(m[k]), "\244" .. char(byte2) .. char(byte3) .. char(byte4), 1)
			end
			i = i + #m - 1
			return m1New
		end)
	end
	text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]")
	return text, subbedChars
end
-- Reinsert any formatting that was temporarily substituted.
local function undoTempSubstitutions(text, subbedChars)
	for i = 1, #subbedChars do
		local byte2 = floor(i / 4096) % 64 + 128
		local byte3 = floor(i / 64) % 64 + 128
		local byte4 = i % 64 + 128
		text = gsub(text, "\244[" .. char(byte2) .. char(byte2 + 8) .. "]" .. char(byte3) .. char(byte4),
			replacement_escape(subbedChars[i]))
	end
	text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]")
	return text
end
-- Check if the raw text is an unsupported title, and if so return that. Otherwise, remove HTML entities. We do the pre-conversion to avoid loading the unsupported title list unnecessarily.
local function checkNoEntities(self, text)
	local textNoEnc = decode_entities(text)
	if textNoEnc ~= text and load_data(links_data_module).unsupported_titles[text] then
		return text
	else
		return textNoEnc
	end
end

-- If no script object is provided (or if it's invalid or None), get one.
local function checkScript(text, self, sc)
	if not check_object("script", true, sc) or sc:getCode() == "None" then
		return self:findBestScript(text)
	end
	return sc
end

local function normalize(text, sc)
	text = sc:fixDiscouragedSequences(text)
	return sc:toFixedNFD(text)
end
-- Subfunction of iterateSectionSubstitutions(). Process an individual chunk of text according to the specifications in
-- `substitution_data`. The input parameters are all as in the documentation of iterateSectionSubstitutions() except for
-- `recursed`, which is set to true if we called ourselves recursively to process a script-specific setting or
-- script-wide fallback. Returns two values: the processed text and the actual substitution data used to do the
-- substitutions (same as the `actual_substitution_data` return value of iterateSectionSubstitutions()).
local function doSubstitutions(self, text, sc, substitution_data, data_field, function_name, recursed)
	-- BE CAREFUL in this function because the value at any level can be `false`, which causes no processing to be done
	-- and blocks any further fallback processing.
	local actual_substitution_data = substitution_data
	-- If there are language-specific substitutes given in the data module, use those.
	if type(substitution_data) == "table" then
		-- If a script is specified, run this function with the script-specific data before continuing.
		local sc_code = sc:getCode()
		local has_substitution_data = false
		if substitution_data[sc_code] ~= nil then
			has_substitution_data = true
			if substitution_data[sc_code] then
				text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data[sc_code], data_field,
					function_name, true)
			end
		-- Hant, Hans and Hani are usually treated the same, so add a special case to avoid having to specify each one
		-- separately.
		elseif sc_code:match("^Han") and substitution_data.Hani ~= nil then
			has_substitution_data = true
			if substitution_data.Hani then
				text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data.Hani, data_field,
					function_name, true)
			end
		-- Substitution data with key 1 in the outer table may be given as a fallback.
		elseif substitution_data[1] ~= nil then
			has_substitution_data = true
			if substitution_data[1] then
				text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data[1], data_field,
					function_name, true)
			end
		end
		-- Iterate over all strings in the "from" subtable, and gsub with the corresponding string in "to". We work with
		-- the NFD decomposed forms, as this simplifies many substitutions.
		if substitution_data.from then
			has_substitution_data = true
			for i, from in ipairs(substitution_data.from) do
				-- Normalize each loop, to ensure multi-stage substitutions work correctly.
				text = sc:toFixedNFD(text)
				text = ugsub(text, sc:toFixedNFD(from), substitution_data.to[i] or "")
			end
		end
		if substitution_data.remove_diacritics then
			has_substitution_data = true
			text = sc:toFixedNFD(text)
			-- Convert exceptions to PUA.
			local remove_exceptions, substitutes = substitution_data.remove_exceptions
			if remove_exceptions then
				substitutes = {}
				local i = 0
				for _, exception in ipairs(remove_exceptions) do
					exception = sc:toFixedNFD(exception)
					text = ugsub(text, exception, function(m)
						i = i + 1
						local subst = u(0x80000 + i)
						substitutes[subst] = m
						return subst
					end)
				end
			end
			-- Strip diacritics.
			text = ugsub(text, "[" .. substitution_data.remove_diacritics .. "]", "")
			-- Convert exceptions back.
			if remove_exceptions then
				text = text:gsub("\242[\128-\191]*", substitutes)
			end
		end
		if not has_substitution_data and sc._data[data_field] then
			-- If language-specific sort key (etc.) is nil, fall back to script-wide sort key (etc.).
			text, actual_substitution_data = doSubstitutions(self, text, sc, sc._data[data_field], data_field,
				function_name, true)
		end
	elseif type(substitution_data) == "string" then
		-- If there is a dedicated function module, use that.
		local module = safe_require("Module:" .. substitution_data)
		if module then
			-- TODO: translit functions should take objects, not codes.
			-- TODO: translit functions should be called with form NFD.
			if function_name == "tr" then
				if not module[function_name] then
					error(("Internal error: Module [[%s]] has no function named 'tr'"):format(substitution_data))
				end
				text = module[function_name](text, self._code, sc:getCode())
			elseif function_name == "stripDiacritics" then
				-- FIXME: get rid of this arm after renaming makeEntryName -> stripDiacritics.
				if module[function_name] then
					text = module[function_name](sc:toFixedNFD(text), self, sc)
				elseif module.makeEntryName then
					text = module.makeEntryName(sc:toFixedNFD(text), self, sc)
				else
					error(("Internal error: Module [[%s]] has no function named 'stripDiacritics' or 'makeEntryName'"
						):format(substitution_data))
				end
			else
				if not module[function_name] then
					error(("Internal error: Module [[%s]] has no function named '%s'"):format(
						substitution_data, function_name))
				end
				text = module[function_name](sc:toFixedNFD(text), self, sc)
			end
		else
			error("Substitution data '" .. substitution_data .. "' does not match an existing module.")
		end
	elseif substitution_data == nil and sc._data[data_field] then
		-- If language-specific sort key (etc.) is nil, fall back to script-wide sort key (etc.).
		text, actual_substitution_data = doSubstitutions(self, text, sc, sc._data[data_field], data_field,
			function_name, true)
	end
	-- Don't normalize to NFC if this is the inner loop or if a module returned nil.
	if recursed or not text then
		return text, actual_substitution_data
	end
	-- Fix any discouraged sequences created during the substitution process, and normalize into the final form.
	return sc:toFixedNFC(sc:fixDiscouragedSequences(text)), actual_substitution_data
end
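-- As an illustration (hypothetical data entries, not taken from any real data module), the value stored under a
-- field such as `sort_key` can take any of the shapes handled above:
--   sort_key = "sa-sortkey"                           -- name of a function module (assumed name)
--   sort_key = {from = {"á"}, to = {"a"}}             -- pattern replacements, applied in NFD form
--   sort_key = {remove_diacritics = "\204\128-\204\137"} -- strip a range of combining marks
--   sort_key = {Cyrl = {from = {"ё"}, to = {"е"}}, [1] = {remove_diacritics = "\204\129"}}
--                                                     -- script-specific data, with key 1 as a fallback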
-- Split the text into sections, based on the presence of temporarily substituted formatting characters, then iterate
-- over each section to apply substitutions (e.g. transliteration or diacritic stripping). This avoids putting PUA
-- characters through language-specific modules, which may be unequipped for them. This function is passed the following
-- values:
-- * `self` (the Language object);
-- * `text` (the text to process);
-- * `sc` (the script of the text, which must be specified; callers should call checkScript() as needed to autodetect the
--   script of the text if not given explicitly by the user);
-- * `subbedChars` (an array of the same length as the text, indicating which characters have been substituted and by
--   what, or {nil} if no substitutions are to happen);
-- * `keepCarets` (DOCUMENT ME);
-- * `substitution_data` (the data indicating which substitutions to apply, taken directly from `data_field` in the
--   language's data structure in a submodule of [[Module:languages/data]]);
-- * `data_field` (the data field from which `substitution_data` was fetched, such as "sort_key" or "strip_diacritics");
-- * `function_name` (the name of the function to call to do the substitution, in case `substitution_data` specifies a
--   module to do the substitution);
-- * `notrim` (don't trim whitespace at the edges of `text`; set when computing the sort key, because whitespace at the
--   beginning of a sort key is significant and causes the resulting page to be sorted at the beginning of the category
--   it's in).
-- Returns three values:
-- (1) the processed text;
-- (2) the value of `subbedChars` that was passed in, possibly modified with additional character substitutions; will be
--     {nil} if {nil} was passed in;
-- (3) the actual substitution data that was used to apply substitutions to `text`; this may be different from the value
--     of `substitution_data` passed in if that value recursively specified script-specific substitutions or if no
--     substitution data could be found in the language-specific data (e.g. {nil} was passed in or a structure was passed
--     in that had no setting for the script given in `sc`), but a script-wide fallback value was set; currently it is
--     only used by makeSortKey().
local function iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, substitution_data, data_field,
	function_name, notrim)
	local sections
	-- See [[Module:languages/data]].
	if not find(text, "\244") or load_data(languages_data_module).substitution[self._code] == "cont" then
		sections = {text}
	else
		sections = split(text, "\244[\128-\143][\128-\191]*", true)
	end
	local actual_substitution_data
	for _, section in ipairs(sections) do
		-- Don't bother processing empty strings or whitespace (which may also not be handled well by dedicated
		-- modules).
		if gsub(section, "%s+", "") ~= "" then
			local sub, this_actual_substitution_data = doSubstitutions(self, section, sc, substitution_data, data_field,
				function_name)
			actual_substitution_data = this_actual_substitution_data
			-- Second round of temporary substitutions, in case any formatting was added by the main substitution
			-- process. However, don't do this if the section contains formatting already (as it would have had to have
			-- been escaped to reach this stage, and therefore should be given as raw text).
			if sub and subbedChars then
				local noSub
				for _, pattern in ipairs(require(languages_data_patterns_module)) do
					if match(section, pattern .. "%z?") then
						noSub = true
					end
				end
				if not noSub then
					sub, subbedChars = doTempSubstitutions(sub, subbedChars, keepCarets, true)
				end
			end
			if not sub then
				text = sub
				break
			end
			text = sub and gsub(text, pattern_escape(section), replacement_escape(sub), 1) or text
		end
	end
	if not notrim then
		-- Trim, unless there are only spacing characters, while ignoring any final formatting characters.
		-- Do not trim sort keys because spaces at the beginning are significant.
		text = text and text:gsub("^([\128-\191\244]*)%s+(%S)", "%1%2"):gsub("(%S)%s+([\128-\191\244]*)$", "%1%2") or
			nil
	end
	return text, subbedChars, actual_substitution_data
end
-- Process carets (and any escapes). Default to simple removal, if no pattern/replacement is given.
local function processCarets(text, pattern, repl)
	local rep
	repeat
		text, rep = gsub(text, "\\\\(\\*^)", "\3%1")
	until rep == 0
	return (text:gsub("\\^", "\4")
		:gsub(pattern or "%^", repl or "")
		:gsub("\3", "\\")
		:gsub("\4", "^"))
end

-- Remove carets if they are used to capitalize parts of transliterations (unless they have been escaped).
local function removeCarets(text, sc)
	if not sc:hasCapitalization() and sc:isTransliterated() and text:find("^", 1, true) then
		return processCarets(text)
	else
		return text
	end
end
local Language = {}

--[==[Returns the language code of the language. Example: {{code|lua|"fr"}} for French.]==]
function Language:getCode()
	return self._code
end

--[==[Returns the canonical name of the language. This is the name used to represent that language on Wiktionary, and is guaranteed to be unique to that language alone. Example: {{code|lua|"French"}} for French.]==]
function Language:getCanonicalName()
	local name = self._name
	if name == nil then
		name = self._data[1]
		self._name = name
	end
	return name
end
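-- A minimal usage sketch (assuming the module is loaded through its standard entry point, e.g.
-- `local m_languages = require("Module:languages")`):
--   local fr = m_languages.getByCode("fr")
--   fr:getCode() -- "fr"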
--[==[
Return the display form of the language. The display form of a language, family or script is the form it takes when
appearing as the <code><var>source</var></code> in categories such as <code>English terms derived from
<var>source</var></code> or <code>English given names from <var>source</var></code>, and is also the displayed text
in {makeCategoryLink()} links. For full and etymology-only languages, this is the same as the canonical name, but
for families, it reads <code>"<var>name</var> languages"</code> (e.g. {"Indo-Iranian languages"}), and for scripts,
it reads <code>"<var>name</var> script"</code> (e.g. {"Arabic script"}).
]==]
function Language:getDisplayForm()
	local form = self._displayForm
	if form == nil then
		form = self:getCanonicalName()
		-- Add article and " substrate" to substrates that lack them.
		if self:getFamilyCode() == "qfa-sub" then
			if not (sub(form, 1, 4) == "the " or sub(form, 1, 2) == "a ") then
				form = "a " .. form
			end
			if not match(form, " [Ss]ubstrate") then
				form = form .. " substrate"
			end
		end
		self._displayForm = form
	end
	return form
end
--[==[Returns the value which should be used in the HTML lang= attribute for tagged text in the language.]==]
function Language:getHTMLAttribute(sc, region)
	local code = self._code
	if not find(code, "-", 1, true) then
		return code .. "-" .. sc:getCode() .. (region and "-" .. region or "")
	end
	local parent = self:getParent()
	region = region or match(code, "%f[%u][%u-]+%f[%U]")
	if parent then
		return parent:getHTMLAttribute(sc, region)
	end
	-- TODO: ISO family codes can also be used.
	return "mis-" .. sc:getCode() .. (region and "-" .. region or "")
end
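-- For instance (a sketch; the script object is assumed to come from [[Module:scripts]]): French with the Latin
-- script yields "fr-Latn", while an etymology-only code such as "fr-CA" has its region "CA" extracted and defers
-- to the parent, yielding "fr-Latn-CA".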
--[==[Returns a table of the aliases that the language is known by, excluding the canonical name. Aliases are synonyms for the language in question. The names are not guaranteed to be unique, in that sometimes more than one language is known by the same name. Example: {{code|lua|{"High German", "New High German", "Deutsch"} }} for [[:Category:German language|German]].]==]
function Language:getAliases()
	self:loadInExtraData()
	return require(language_like_module).getAliases(self)
end

--[==[
Return a table of the known subvarieties of a given language, excluding subvarieties that have been given
explicit etymology-only language codes. The names are not guaranteed to be unique, in that sometimes a given name
refers to a subvariety of more than one language. Example: {{code|lua|{"Southern Aymara", "Central Aymara"} }} for
[[:Category:Aymara language|Aymara]]. Note that the returned value can have nested tables in it, when a subvariety
goes by more than one name. Example: {{code|lua|{"North Azerbaijani", "South Azerbaijani", {"Afshar", "Afshari",
"Afshar Azerbaijani", "Afchar"}, {"Qashqa'i", "Qashqai", "Kashkay"}, "Sonqor"} }} for
[[:Category:Azerbaijani language|Azerbaijani]]. Here, for example, Afshar, Afshari, Afshar Azerbaijani and Afchar
all refer to the same subvariety, whose preferred name is Afshar (the one listed first). To avoid a return value
with nested tables in it, specify a non-{{code|lua|nil}} value for the <code>flatten</code> parameter; in that case,
the return value would be {{code|lua|{"North Azerbaijani", "South Azerbaijani", "Afshar", "Afshari",
"Afshar Azerbaijani", "Afchar", "Qashqa'i", "Qashqai", "Kashkay", "Sonqor"} }}.
]==]
function Language:getVarieties(flatten)
	self:loadInExtraData()
	return require(language_like_module).getVarieties(self, flatten)
end

--[==[Returns a table of the "other names" that the language is known by, which are listed in the <code>otherNames</code> field. It should be noted that the <code>otherNames</code> field itself is deprecated, and entries listed there should eventually be moved to either <code>aliases</code> or <code>varieties</code>.]==]
function Language:getOtherNames() -- To be eventually removed, once there are no more uses of the `otherNames` field.
	self:loadInExtraData()
	return require(language_like_module).getOtherNames(self)
end

--[==[
Return a combined table of the canonical name, aliases, varieties and other names of a given language.]==]
function Language:getAllNames()
	self:loadInExtraData()
	return require(language_like_module).getAllNames(self)
end
--[==[Returns a table of types as a lookup table (with the types as keys).
The possible types are
* {language}: This is a language, either full or etymology-only.
* {full}: This is a "full" (not etymology-only) language, i.e. the union of {regular}, {reconstructed} and
		{appendix-constructed}. Note that the types {full} and {etymology-only} also exist for families, so if you
		want to check specifically for a full language and you have an object that might be a family, you should
		use {{lua|hasType("language", "full")}} and not simply {{lua|hasType("full")}}.
* {etymology-only}: This is an etymology-only (not full) language, whose parent is another etymology-only
					language or a full language. Note that the types {full} and {etymology-only} also exist for
					families, so if you want to check specifically for an etymology-only language and you have an
					object that might be a family, you should use {{lua|hasType("language", "etymology-only")}}
					and not simply {{lua|hasType("etymology-only")}}.
* {regular}: This indicates a full language that is attested according to [[WT:CFI]] and therefore permitted
			in the main namespace. There may also be reconstructed terms for the language, which are placed in
			the {Reconstruction} namespace and must be prefixed with * to indicate a reconstruction. Most full
			languages are natural (not constructed) languages, but a few constructed languages (e.g. Esperanto
			and Volapük, among others) are also allowed in the mainspace and considered regular languages.
* {reconstructed}: This language is not attested according to [[WT:CFI]], and therefore is allowed only in the
				{Reconstruction} namespace. All terms in this language are reconstructed, and must be prefixed
				with *. Languages such as Proto-Indo-European and Proto-Germanic are in this category.
* {appendix-constructed}: This language is attested but does not meet the additional requirements set out for
						constructed languages ([[WT:CFI#Constructed languages]]). Its entries must therefore
						be in the Appendix namespace, but they are not reconstructed and therefore should
						not have * prefixed in links.
]==]
function Language:getTypes()
	local types = self._types
	if types == nil then
		types = {language = true}
		if self:getFullCode() == self._code then
			types.full = true
		else
			types["etymology-only"] = true
		end
		for t in gmatch(self._data.type, "[^,]+") do
			types[t] = true
		end
		self._types = types
	end
	return types
end

--[==[Given a list of types as strings, returns true if the language has all of them.]==]
function Language:hasType(...)
	Language.hasType = require(language_like_module).hasType
	return self:hasType(...)
end
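-- For example, because families also carry the "full" and "etymology-only" types, checking for a full language on
-- an object that might be a family should combine types, per the documentation of getTypes() above (a sketch):
--   if obj:hasType("language", "full") then
--       -- obj is a full (not etymology-only) language, never a family
--   end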
--[==[Returns a table containing <code>WikimediaLanguage</code> objects (see [[Module:wikimedia languages]]), which represent languages and their codes as they are used in Wikimedia projects for interwiki linking and such. More than one object may be returned, as a single Wiktionary language may correspond to multiple Wikimedia languages. For example, Wiktionary's single code <code>sh</code> (Serbo-Croatian) maps to four Wikimedia codes: <code>sh</code> (Serbo-Croatian), <code>bs</code> (Bosnian), <code>hr</code> (Croatian) and <code>sr</code> (Serbian).
The code for the Wikimedia language is retrieved from the <code>wikimedia_codes</code> property in the data modules. If that property is not present, the code of the current language is used. If none of the available codes is actually a valid Wikimedia code, an empty table is returned.]==]
function Language:getWikimediaLanguages()
	local wm_langs = self._wikimediaLanguageObjects
	if wm_langs == nil then
		local codes = self:getWikimediaLanguageCodes()
		wm_langs = {}
		for i = 1, #codes do
			wm_langs[i] = get_wikimedia_lang(codes[i])
		end
		self._wikimediaLanguageObjects = wm_langs
	end
	return wm_langs
end

function Language:getWikimediaLanguageCodes()
	local wm_langs = self._wikimediaLanguageCodes
	if wm_langs == nil then
		wm_langs = self._data.wikimedia_codes
		if wm_langs then
			wm_langs = split(wm_langs, ",", true, true)
		else
			local code = self._code
			if is_known_language_tag(code) then
				wm_langs = {code}
			else
				-- Inherit, but only if no codes are specified in the data *and*
				-- the language code isn't a valid Wikimedia language code.
				local parent = self:getParent()
				wm_langs = parent and parent:getWikimediaLanguageCodes() or {}
			end
		end
		self._wikimediaLanguageCodes = wm_langs
	end
	return wm_langs
end
--[==[
Returns the name of the Wikipedia article for the language. `project` specifies the language and project to retrieve
the article from, defaulting to {"enwiki"} for the English Wikipedia. Normally, if specified, it should be the project
code for a specific-language Wikipedia, e.g. "zhwiki" for the Chinese Wikipedia, but it can be any project, including
non-Wikipedia ones. If the project is the English Wikipedia and the property {wikipedia_article} is present in the data
module, it will be used first. In all other cases, a sitelink will be generated from {:getWikidataItem} (if set). The
resulting value (or lack of value) is cached so that subsequent calls are fast. If no value could be determined, and
`noCategoryFallback` is {false}, {:getCategoryName} is used as a fallback; otherwise, {nil} is returned. Note that if
`noCategoryFallback` is {nil} or omitted, it defaults to {false} if the project is the English Wikipedia, and otherwise
to {true}. In other words, under normal circumstances, if the English Wikipedia article couldn't be retrieved, the
return value will fall back to a link to the language's category, but this won't normally happen for any other project.
]==]
function Language:getWikipediaArticle(noCategoryFallback, project)
	Language.getWikipediaArticle = require(language_like_module).getWikipediaArticle
	return self:getWikipediaArticle(noCategoryFallback, project)
end

function Language:makeWikipediaLink()
	return make_link(self, "w:" .. self:getWikipediaArticle(), self:getCanonicalName())
end

--[==[Returns the name of the Wikimedia Commons category page for the language.]==]
function Language:getCommonsCategory()
	Language.getCommonsCategory = require(language_like_module).getCommonsCategory
	return self:getCommonsCategory()
end

--[==[Returns the Wikidata item id for the language, or <code>nil</code>. This corresponds to the second field in the data modules.]==]
function Language:getWikidataItem()
	Language.getWikidataItem = require(language_like_module).getWikidataItem
	return self:getWikidataItem()
end
--[==[Returns a table of <code>Script</code> objects for all scripts that the language is written in. See [[Module:scripts]].]==]
function Language:getScripts()
	local scripts = self._scriptObjects
	if scripts == nil then
		local codes = self:getScriptCodes()
		if codes[1] == "All" then
			scripts = load_data(scripts_data_module)
		else
			scripts = {}
			for i = 1, #codes do
				scripts[i] = get_script(codes[i])
			end
		end
		self._scriptObjects = scripts
	end
	return scripts
end

--[==[Returns the table of script codes in the language's data file.]==]
function Language:getScriptCodes()
	local scripts = self._scriptCodes
	if scripts == nil then
		scripts = self._data[4]
		if scripts then
			local codes, n = {}, 0
			for code in gmatch(scripts, "[^,]+") do
				n = n + 1
				-- Special handling of "Hants", which represents "Hani", "Hant" and "Hans" collectively.
				if code == "Hants" then
					codes[n] = "Hani"
					codes[n + 1] = "Hant"
					codes[n + 2] = "Hans"
					n = n + 2
				else
					codes[n] = code
				end
			end
			scripts = codes
		else
			scripts = {"None"}
		end
		self._scriptCodes = scripts
	end
	return scripts
end
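-- For example, a (hypothetical) data-file entry listing the scripts "Hants,Latn" yields
-- {"Hani", "Hant", "Hans", "Latn"} from getScriptCodes(), since the "Hants" shorthand is expanded in place.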
--[==[Given some text, this function iterates through the scripts of a given language and tries to find the script that best matches the text. It returns a {{code|lua|Script}} object representing the script. If no match is found at all, it returns the {{code|lua|None}} script object.]==]
function Language:findBestScript(text, forceDetect)
	if not text or text == "" or text == "-" then
		return get_script("None")
	end
	-- Differs from the table returned by getScriptCodes(), as "Hants" is not normalized into its constituents.
	local codes = self._bestScriptCodes
	if codes == nil then
		codes = self._data[4]
		codes = codes and split(codes, ",", true, true) or {"None"}
		self._bestScriptCodes = codes
	end
	local first_sc = codes[1]
	if first_sc == "All" then
		return find_best_script_without_lang(text)
	end
	local codes_len = #codes
	if not (forceDetect or first_sc == "Hants" or codes_len > 1) then
		first_sc = get_script(first_sc)
		local charset = first_sc.characters
		return charset and umatch(text, "[" .. charset .. "]") and first_sc or get_script("None")
	end
	-- Remove all formatting characters.
	text = get_plaintext(text)
	-- Remove all spaces and any ASCII punctuation. Some non-ASCII punctuation is script-specific, so can't be removed.
	text = ugsub(text, "[%s!\"#%%&'()*,%-./:;?@[\\%]_{}]+", "")
	if #text == 0 then
		return get_script("None")
	end
	-- Try to match every script against the text,
	-- and return the one with the most matching characters.
	local bestcount, bestscript, length = 0
	for i = 1, codes_len do
		local sc = codes[i]
		-- Special case for "Hants", which is a special code that represents whichever of "Hant" or "Hans" best
		-- matches, or "Hani" if they match equally. This avoids having to list all three. In addition, "Hants" will
		-- be treated as the best match if there is at least one matching character, under the assumption that a Han
		-- script is desirable in terms that contain a mix of Han and other scripts (not counting those which use
		-- Jpan or Kore).
		if sc == "Hants" then
			local Hani = get_script("Hani")
			if not Hant_chars then
				Hant_chars = load_data("Module:zh/data/ts")
				Hans_chars = load_data("Module:zh/data/st")
			end
			local t, s, found = 0, 0
			-- This is faster than using mw.ustring.gmatch directly.
			for ch in gmatch((ugsub(text, "[" .. Hani.characters .. "]", "\255%0")), "\255(.[\128-\191]*)") do
				found = true
				if Hant_chars[ch] then
					t = t + 1
					if Hans_chars[ch] then
						s = s + 1
					end
				elseif Hans_chars[ch] then
					s = s + 1
				else
					t, s = t + 1, s + 1
				end
			end
			if found then
				if t == s then
					return Hani
				end
				return get_script(t > s and "Hant" or "Hans")
			end
		else
			sc = get_script(sc)
			if not length then
				length = ulen(text)
			end
			-- Count characters by removing everything in the script's charset and comparing to the original length.
			local charset = sc.characters
			local count = charset and length - ulen((ugsub(text, "[" .. charset .. "]+", ""))) or 0
			if count >= length then
				return sc
			elseif count > bestcount then
				bestcount = count
				bestscript = sc
			end
		end
	end
	-- Return best matching script, or otherwise None.
	return bestscript or get_script("None")
end
--[==[Returns a <code>Family</code> object for the language family that the language belongs to. See [[Module:families]].]==]
function Language:getFamily()
	local family = self._familyObject
	if family == nil then
		family = self:getFamilyCode()
		-- If the value is nil, it's cached as false.
		family = family and get_family(family) or false
		self._familyObject = family
	end
	return family or nil
end

--[==[Returns the family code in the language's data file.]==]
function Language:getFamilyCode()
	local family = self._familyCode
	if family == nil then
		-- If the value is nil, it's cached as false.
		family = self._data[3] or false
		self._familyCode = family
	end
	return family or nil
end

function Language:getFamilyName()
	local family = self._familyName
	if family == nil then
		family = self:getFamily()
		-- If the value is nil, it's cached as false.
		family = family and family:getCanonicalName() or false
		self._familyName = family
	end
	return family or nil
end
do
	local function check_family(self, family)
		if type(family) == "table" then
			family = family:getCode()
		end
		if self:getFamilyCode() == family then
			return true
		end
		local self_family = self:getFamily()
		if self_family:inFamily(family) then
			return true
		-- If the family isn't a real family (e.g. creoles), check any ancestors.
		elseif self_family:inFamily("qfa-not") then
			local ancestors = self:getAncestors()
			for _, ancestor in ipairs(ancestors) do
				if ancestor:inFamily(family) then
					return true
				end
			end
		end
	end

	--[==[Check whether the language belongs to `family` (which can be a family code or object). A list of objects can be given in place of `family`; in that case, return true if the language belongs to any of the specified families. Note that some languages (in particular, certain creoles) can have multiple immediate ancestors potentially belonging to different families; in that case, return true if the language belongs to any of the specified families.]==]
	function Language:inFamily(...)
		if self:getFamilyCode() == nil then
			return false
		end
		return check_inputs(self, check_family, false, ...)
	end
end
function Language:getParent()
	local parent = self._parentObject
	if parent == nil then
		parent = self:getParentCode()
		-- If the value is nil, it's cached as false.
		parent = parent and get_by_code(parent, nil, true, true) or false
		self._parentObject = parent
	end
	return parent or nil
end

function Language:getParentCode()
	local parent = self._parentCode
	if parent == nil then
		-- If the value is nil, it's cached as false.
		parent = self._data.parent or false
		self._parentCode = parent
	end
	return parent or nil
end

function Language:getParentName()
	local parent = self._parentName
	if parent == nil then
		parent = self:getParent()
		-- If the value is nil, it's cached as false.
		parent = parent and parent:getCanonicalName() or false
		self._parentName = parent
	end
	return parent or nil
end

function Language:getParentChain()
	local chain = self._parentChain
	if chain == nil then
		chain = {}
		local parent, n = self:getParent(), 0
		while parent do
			n = n + 1
			chain[n] = parent
			parent = parent:getParent()
		end
		self._parentChain = chain
	end
	return chain
end

do
	local function check_lang(self, lang)
		for _, parent in ipairs(self:getParentChain()) do
			if (type(lang) == "string" and lang or lang:getCode()) == parent:getCode() then
				return true
			end
		end
	end

	function Language:hasParent(...)
		return check_inputs(self, check_lang, false, ...)
	end
end
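-- Sketch: for the etymology-only code "ang-nor" (Northumbrian Old English), the chain contains the objects for
-- "ang-ang" (Anglian Old English) and then "ang" (Old English), following the parent links upwards until a full
-- language is reached.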
--[==[
If the language is etymology-only, this iterates through parents until a full language or family is found, and the
corresponding object is returned. If the language is a full language, then it simply returns itself.
]==]
function Language:getFull()
	local full = self._fullObject
	if full == nil then
		full = self:getFullCode()
		full = full == self._code and self or get_by_code(full)
		self._fullObject = full
	end
	return full
end

--[==[
If the language is an etymology-only language, this iterates through parents until a full language or family is
found, and the corresponding code is returned. If the language is a full language, then it simply returns the
language code.
]==]
function Language:getFullCode()
	return self._fullCode or self._code
end

--[==[
If the language is an etymology-only language, this iterates through parents until a full language or family is
found, and the corresponding canonical name is returned. If the language is a full language, then it simply returns
the canonical name of the language.
]==]
function Language:getFullName()
	local full = self._fullName
	if full == nil then
		full = self:getFull():getCanonicalName()
		self._fullName = full
	end
	return full
end
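-- Sketch: for the etymology-only code "fr-CA" (Canadian French), getFullCode() returns "fr" and getFullName()
-- returns "French"; for the full language "fr" itself, these simply return its own code and canonical name.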
--[==[Returns a table of <code class="nf">Language</code> objects for all languages that this language is directly descended from. Generally this is only a single language, but creoles, pidgins and mixed languages can have multiple ancestors.]==]
function Language:getAncestors()
	local ancestors = self._ancestorObjects
	if ancestors == nil then
		ancestors = {}
		local ancestor_codes = self:getAncestorCodes()
		if #ancestor_codes > 0 then
			for _, ancestor in ipairs(ancestor_codes) do
				insert(ancestors, get_by_code(ancestor, nil, true))
			end
		else
			local fam = self:getFamily()
			local protoLang = fam and fam:getProtoLanguage() or nil
			-- For the cases where the current language is the proto-language
			-- of its family, or an etymology-only language that is ancestral to that
			-- proto-language, we need to step up a level higher right from the
			-- start.
			if protoLang and (
				protoLang:getCode() == self._code or
				(self:hasType("etymology-only") and protoLang:hasAncestor(self))
			) then
				fam = fam:getFamily()
				protoLang = fam and fam:getProtoLanguage() or nil
			end
			while not protoLang and not (not fam or fam:getCode() == "qfa-not") do
				fam = fam:getFamily()
				protoLang = fam and fam:getProtoLanguage() or nil
			end
			insert(ancestors, protoLang)
		end
		self._ancestorObjects = ancestors
	end
	return ancestors
end
do
	-- Avoid a language being its own ancestor via class inheritance. We only need to check for this if the language
	-- has inherited an ancestor table from its parent, because we never want to drop ancestors that have been
	-- explicitly set in the data.
	-- Recursively iterate over ancestors until we either find self or run out. If self is found, return true.
	local function check_ancestor(self, lang)
		local codes = lang:getAncestorCodes()
		if not codes then
			return nil
		end
		for i = 1, #codes do
			local code = codes[i]
			if code == self._code then
				return true
			end
			local anc = get_by_code(code, nil, true)
			if check_ancestor(self, anc) then
				return true
			end
		end
	end

	--[==[Returns a table of <code class="nf">Language</code> codes for all languages that this language is directly descended from. Generally this is only a single language, but creoles, pidgins and mixed languages can have multiple ancestors.]==]
	function Language:getAncestorCodes()
		if self._ancestorCodes then
			return self._ancestorCodes
		end
		local data = self._data
		local codes = data.ancestors
		if codes == nil then
			codes = {}
			self._ancestorCodes = codes
			return codes
		end
		codes = split(codes, ",", true, true)
		self._ancestorCodes = codes
		-- If there are no codes or the ancestors weren't inherited data, there's nothing left to check.
		if #codes == 0 or self:getData(false, "raw").ancestors ~= nil then
			return codes
		end
		local i, code = 1
		while i <= #codes do
			code = codes[i]
			if check_ancestor(self, self) then
				remove(codes, i)
			else
				i = i + 1
			end
		end
		return codes
	end
end
--[==[Given a list of language objects or codes, returns true if at least one of them is an ancestor. This includes any etymology-only children of that ancestor. If the language's ancestor(s) are etymology-only languages, it will also return true for those language parent(s) (e.g. if Vulgar Latin is the ancestor, it will also return true for its parent, Latin). However, a parent is excluded from this if the ancestor is also ancestral to that parent (e.g. if Classical Persian is the ancestor, Persian would return false, because Classical Persian is also ancestral to Persian).]==]
function Language:hasAncestor(...)
	local function iterateOverAncestorTree(node, func, parent_check)
		local ancestors = node:getAncestors()
		local ancestorsParents = {}
		for _, ancestor in ipairs(ancestors) do
			-- When checking the parents of the other language, and the ancestor is also a parent, skip to the next ancestor, so that we exclude any etymology-only children of that parent that are not directly related (see below).
			local ret = (parent_check or not node:hasParent(ancestor)) and
				func(ancestor) or iterateOverAncestorTree(ancestor, func, parent_check)
			if ret then
				return ret
			end
		end
		-- Check the parents of any ancestors. We don't do this if checking the parents of the other language, so that we exclude any etymology-only children of those parents that are not directly related (e.g. if the ancestor is Vulgar Latin and we are checking New Latin, we want it to return false because they are on different ancestral branches. As such, if we're already checking the parent of New Latin (Latin) we don't want to compare it to the parent of the ancestor (Latin), as this would be a false positive; it should be one or the other).
		if not parent_check then
			return nil
		end
		for _, ancestor in ipairs(ancestors) do
			local ancestorParents = ancestor:getParentChain()
			for _, ancestorParent in ipairs(ancestorParents) do
				if ancestorParent:getCode() == self._code or ancestorParent:hasAncestor(ancestor) then
					break
				else
					insert(ancestorsParents, ancestorParent)
				end
			end
		end
		for _, ancestorParent in ipairs(ancestorsParents) do
			local ret = func(ancestorParent)
			if ret then
				return ret
			end
		end
	end
	local function do_iteration(otherlang, parent_check)
		-- otherlang can't be self
		if (type(otherlang) == "string" and otherlang or otherlang:getCode()) == self._code then
			return false
		end
		repeat
			if iterateOverAncestorTree(
				self,
				function(ancestor)
					return ancestor:getCode() == (type(otherlang) == "string" and otherlang or otherlang:getCode())
				end,
				parent_check
			) then
				return true
			elseif type(otherlang) == "string" then
				otherlang = get_by_code(otherlang, nil, true)
			end
			otherlang = otherlang:getParent()
			parent_check = false
		until not otherlang
	end
	local parent_check = true
	for _, otherlang in ipairs{...} do
		local ret = do_iteration(otherlang, parent_check)
		if ret then
			return true
		end
	end
	return false
end
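-- Usage sketch (illustrative only; results depend on the current language data):
--   local m_languages = require("Module:languages")
--   m_languages.getByCode("en"):hasAncestor("ang") --> true (via Middle English)
--   m_languages.getByCode("en"):hasAncestor("fr")  --> false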
do
	local function construct_node(lang, memo)
		local branch, ancestors = {lang = lang:getCode()}
		memo[lang:getCode()] = branch
		for _, ancestor in ipairs(lang:getAncestors()) do
			if ancestors == nil then
				ancestors = {}
			end
			insert(ancestors, memo[ancestor:getCode()] or construct_node(ancestor, memo))
		end
		branch.ancestors = ancestors
		return branch
	end
	function Language:getAncestorChain()
		local chain = self._ancestorChain
		if chain == nil then
			chain = construct_node(self, {})
			self._ancestorChain = chain
		end
		return chain
	end
end
function Language:getAncestorChainOld()
	local chain = self._ancestorChain
	if chain == nil then
		chain = {}
		local step = self
		while true do
			local ancestors = step:getAncestors()
			step = #ancestors == 1 and ancestors[1] or nil
			if not step then
				break
			end
			insert(chain, step)
		end
		self._ancestorChain = chain
	end
	return chain
end
local function fetch_descendants(self, fmt)
	local descendants, family = {}, self:getFamily()
	-- Iterate over all three datasets.
	for _, data in ipairs{
		require("Module:languages/code to canonical name"),
		require("Module:etymology languages/code to canonical name"),
		require("Module:families/code to canonical name"),
	} do
		for code in pairs(data) do
			local lang = get_by_code(code, nil, true, true)
			-- Test for a descendant. Earlier tests weed out most candidates, while the more intensive tests are only used sparingly.
			if (
				code ~= self._code and -- Not self.
				lang:inFamily(family) and -- In the same family.
				(
					family:getProtoLanguageCode() == self._code or -- Self is the protolanguage.
					self:hasDescendant(lang) or -- Full hasDescendant check.
					(lang:getFullCode() == self._code and not self:hasAncestor(lang)) -- Etymology-only child which isn't an ancestor.
				)
			) then
				if fmt == "object" then
					insert(descendants, lang)
				elseif fmt == "code" then
					insert(descendants, code)
				elseif fmt == "name" then
					insert(descendants, lang:getCanonicalName())
				end
			end
		end
	end
	return descendants
end
function Language:getDescendants()
	local descendants = self._descendantObjects
	if descendants == nil then
		descendants = fetch_descendants(self, "object")
		self._descendantObjects = descendants
	end
	return descendants
end
function Language:getDescendantCodes()
	local descendants = self._descendantCodes
	if descendants == nil then
		descendants = fetch_descendants(self, "code")
		self._descendantCodes = descendants
	end
	return descendants
end
function Language:getDescendantNames()
	local descendants = self._descendantNames
	if descendants == nil then
		descendants = fetch_descendants(self, "name")
		self._descendantNames = descendants
	end
	return descendants
end
do
	local function check_lang(self, lang)
		if type(lang) == "string" then
			lang = get_by_code(lang, nil, true)
		end
		if lang:hasAncestor(self) then
			return true
		end
	end
	function Language:hasDescendant(...)
		return check_inputs(self, check_lang, false, ...)
	end
end
local function fetch_children(self, fmt)
	local m_etym_data = require(etymology_languages_data_module)
	local self_code, children = self._code, {}
	for code, lang in pairs(m_etym_data) do
		local _lang = lang
		repeat
			local parent = _lang.parent
			if parent == self_code then
				if fmt == "object" then
					insert(children, get_by_code(code, nil, true))
				elseif fmt == "code" then
					insert(children, code)
				elseif fmt == "name" then
					insert(children, lang[1])
				end
				break
			end
			_lang = m_etym_data[parent]
		until not _lang
	end
	return children
end
function Language:getChildren()
	local children = self._childObjects
	if children == nil then
		children = fetch_children(self, "object")
		self._childObjects = children
	end
	return children
end
function Language:getChildrenCodes()
	local children = self._childCodes
	if children == nil then
		children = fetch_children(self, "code")
		self._childCodes = children
	end
	return children
end
function Language:getChildrenNames()
	local children = self._childNames
	if children == nil then
		children = fetch_children(self, "name")
		self._childNames = children
	end
	return children
end
function Language:hasChild(...)
	local lang = ...
	if not lang then
		return false
	elseif type(lang) == "string" then
		lang = get_by_code(lang, nil, true)
	end
	if lang:hasParent(self) then
		return true
	end
	return self:hasChild(select(2, ...))
end
--[==[Returns the name of the main category of that language. Example: {{code|lua|"French language"}} for French, whose category is at [[:Category:French language]]. Unless optional argument <code>nocap</code> is given, the language name at the beginning of the returned value will be capitalized. This capitalization is correct for category names, but not if the language name is lowercase and the returned value of this function is used in the middle of a sentence.]==]
function Language:getCategoryName(nocap)
	local name = self._categoryName
	if name == nil then
		name = self:getCanonicalName()
		-- If a substrate, omit any leading article.
		if self:getFamilyCode() == "qfa-sub" then
			name = name:gsub("^the ", ""):gsub("^a ", "")
		end
		-- Only add " language" if a full language.
		if self:hasType("full") then
			-- Unless the canonical name already ends with "language", "lect" or their derivatives, add " language".
			if not (match(name, "[Ll]anguage$") or match(name, "[Ll]ect$")) then
				name = name .. " language"
			end
		end
		self._categoryName = name
	end
	if nocap then
		return name
	end
	return mw.getContentLanguage():ucfirst(name)
end
--[==[Creates a link to the category; the link text is the canonical name.]==]
function Language:makeCategoryLink()
	return make_link(self, ":Category:" .. self:getCategoryName(), self:getDisplayForm())
end
function Language:getStandardCharacters(sc)
	local standard_chars = self._data.standard_chars
	if type(standard_chars) ~= "table" then
		return standard_chars
	elseif sc and type(sc) ~= "string" then
		check_object("script", nil, sc)
		sc = sc:getCode()
	end
	if (not sc) or sc == "None" then
		local scripts = {}
		for _, script in pairs(standard_chars) do
			insert(scripts, script)
		end
		return concat(scripts)
	end
	if standard_chars[sc] then
		return standard_chars[sc] .. (standard_chars[1] or "")
	end
end
--[==[
Strip diacritics from display text `text` (in a language-specific fashion), which is in the script `sc`. If `sc` is
omitted or {nil}, the script is autodetected. This also strips certain punctuation characters from the end and (in the
case of Spanish upside-down question mark and exclamation points) from the beginning; strips any whitespace at the
end of the text or between the text and final stripped punctuation characters; and applies some language-specific
Unicode normalizations to replace discouraged characters with their prescribed alternatives. Return the stripped text.
]==]
function Language:stripDiacritics(text, sc)
	if (not text) or text == "" then
		return text
	end
	sc = checkScript(text, self, sc)
	text = normalize(text, sc)
	-- FIXME, rename makeEntryName to stripDiacritics and get rid of second and third return values
	-- everywhere
	text, _, _ = iterateSectionSubstitutions(self, text, sc, nil, nil,
		self._data.strip_diacritics or self._data.entry_name, "strip_diacritics", "stripDiacritics")
	text = umatch(text, "^[¿¡]?(.-[^%s%p].-)%s*[؟?!;՛՜ ՞ ՟?!︖︕।॥။၊་།]?$") or text
	return text
end
--[==[
Convert a ''logical'' pagename (the pagename as it appears to the user, after diacritics and punctuation have been
stripped) to a ''physical'' pagename (the pagename as it appears in the MediaWiki database). Reasons for a difference
between the two are (a) unsupported titles such as `[ ]` (with square brackets in them), `#` (pound/hash sign) and
`¯\_(ツ)_/¯` (with underscores), as well as overly long titles of various sorts; (b) "mammoth" pages that are split into
parts (e.g. `a`, which is split into physical pagenames `a/languages A to L` and `a/languages M to Z`). For almost all
purposes, you should work with logical and not physical pagenames. But there are certain use cases that require physical
pagenames, such as checking the existence of a page or retrieving a page's contents.
`pagename` is the logical pagename to be converted. `is_reconstructed_or_appendix` indicates whether the page is in the
`Reconstruction` or `Appendix` namespaces. If it is omitted or has the value {nil}, the pagename is checked for an
initial asterisk, and if found, the page is assumed to be a `Reconstruction` page. Setting a value of `false` or `true`
to `is_reconstructed_or_appendix` disables this check and allows for mainspace pagenames that begin with an asterisk.
]==]
function Language:logicalToPhysical(pagename, is_reconstructed_or_appendix)
	-- FIXME: This probably shouldn't happen but it happens when makeEntryName() receives nil.
	if pagename == nil then
		track("nil-passed-to-logicalToPhysical")
		return nil
	end
	local initial_asterisk
	if is_reconstructed_or_appendix == nil then
		local pagename_minus_initial_asterisk
		initial_asterisk, pagename_minus_initial_asterisk = pagename:match("^(%*)(.*)$")
		if pagename_minus_initial_asterisk then
			is_reconstructed_or_appendix = true
			pagename = pagename_minus_initial_asterisk
		elseif self:hasType("appendix-constructed") then
			is_reconstructed_or_appendix = true
		end
	end
	if not is_reconstructed_or_appendix then
		-- Check if the pagename is a listed unsupported title.
		local unsupportedTitles = load_data(links_data_module).unsupported_titles
		if unsupportedTitles[pagename] then
			return "Unsupported titles/" .. unsupportedTitles[pagename]
		end
	end
	-- Set `unsupported` as true if certain conditions are met.
	local unsupported
	-- Check if there's an unsupported character. \239\191\189 is the replacement character U+FFFD, which can't be typed
	-- directly here due to an abuse filter. Unix-style dot-slash notation is also unsupported, as it is used for
	-- relative paths in links, as are 3 or more consecutive tildes. Note: match is faster with magic
	-- characters/charsets; find is faster with plaintext.
	if (
		match(pagename, "[#<>%[%]_{|}]") or
		find(pagename, "\239\191\189") or
		match(pagename, "%f[^%z/]%.%.?%f[%z/]") or
		find(pagename, "~~~")
	) then
		unsupported = true
	-- If it looks like an interwiki link.
	elseif find(pagename, ":") then
		local prefix = gsub(pagename, "^:*(.-):.*", ulower)
		if (
			load_data("Module:data/namespaces")[prefix] or
			load_data("Module:data/interwikis")[prefix]
		) then
			unsupported = true
		end
	end
	-- Escape unsupported characters so they can be used in titles. ` is used as a delimiter for this, so a raw use of
	-- it in an unsupported title is also escaped here to prevent interference; this is only done with unsupported
	-- titles, though, so inclusion won't in itself mean a title is treated as unsupported (which is why it's excluded
	-- from the earlier test).
	if unsupported then
		-- FIXME: This conversion needs to be different for reconstructed pages with unsupported characters. There
		-- aren't any currently, but if there ever are, we need to fix this e.g. to put them in something like
		-- Reconstruction:Proto-Indo-European/Unsupported titles/`lowbar``num`.
		local unsupported_characters = load_data(links_data_module).unsupported_characters
		pagename = pagename:gsub("[#<>%[%]_`{|}\239]\191?\189?", unsupported_characters)
			:gsub("%f[^%z/]%.%.?%f[%z/]", function(m)
				return (gsub(m, "%.", "`period`"))
			end)
			:gsub("~~~+", function(m)
				return (gsub(m, "~", "`tilde`"))
			end)
		pagename = "Unsupported titles/" .. pagename
	elseif not is_reconstructed_or_appendix then
		-- Check if this is a mammoth page. If so, which subpage should we link to?
		local m_links_data = load_data(links_data_module)
		local mammoth_page_type = m_links_data.mammoth_pages[pagename]
		if mammoth_page_type then
			local canonical_name = self:getFullName()
			if canonical_name ~= "Translingual" and canonical_name ~= "English" then
				local this_subpage
				local L2_sort_key = get_L2_sort_key(canonical_name)
				for _, subpage_spec in ipairs(m_links_data.mammoth_page_subpage_types[mammoth_page_type]) do
					-- unpack() fails utterly on data loaded using mw.loadData() even if offsets are given
					local subpage, pattern = subpage_spec[1], subpage_spec[2]
					if pattern == true or L2_sort_key:match(pattern) then
						this_subpage = subpage
						break
					end
				end
				if not this_subpage then
					error(("Internal error: Bad data in mammoth_page_subpage_pages in [[Module:links/data]] for mammoth page %s, type %s; last entry didn't have 'true' in it"):format(
						pagename, mammoth_page_type))
				end
				pagename = pagename .. "/" .. this_subpage
			end
		end
	end
	return (initial_asterisk or "") .. pagename
end
--[==[
Strip the diacritics from a display pagename and convert the resulting logical pagename into a physical pagename.
This allows you, for example, to retrieve the contents of the page or check its existence. WARNING: This is deprecated
and will be going away. It is a simple composition of `self:stripDiacritics` and `self:logicalToPhysical`; most callers
only want the former, and if you need both, call them both yourself.
`text` and `sc` are as in `self:stripDiacritics`, and `is_reconstructed_or_appendix` is as in `self:logicalToPhysical`.
]==]
function Language:makeEntryName(text, sc, is_reconstructed_or_appendix)
	return self:logicalToPhysical(self:stripDiacritics(text, sc), is_reconstructed_or_appendix)
end
--[==[Generates alternative forms using a specified method, and returns them as a table. If no method is specified, returns a table containing only the input term.]==]
function Language:generateForms(text, sc)
	local generate_forms = self._data.generate_forms
	if generate_forms == nil then
		return {text}
	end
	sc = checkScript(text, self, sc)
	return require("Module:" .. generate_forms).generateForms(text, self, sc)
end
--[==[Creates a sort key for the given stripped text, following the rules appropriate for the language. This removes
diacritical marks from the stripped text if they are not considered significant for sorting, and may perform some other
changes. Any initial hyphen is also removed, and anything in parentheses is removed as well.
The <code>sort_key</code> setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the stripped text and returns a sortkey.]==]
function Language:makeSortKey(text, sc)
	if (not text) or text == "" then
		return text
	end
	if match(text, "<[^<>]+>") then
		track("track HTML tag")
	end
	-- Remove directional characters, bold, italics, soft hyphens, strip markers and HTML tags.
	-- FIXME: Partly duplicated with remove_formatting() in [[Module:links]].
	text = ugsub(text, "[\194\173\226\128\170-\226\128\174\226\129\166-\226\129\169]", "")
	text = text:gsub("('*)'''(.-'*)'''", "%1%2"):gsub("('*)''(.-'*)''", "%1%2")
	text = gsub(unstrip(text), "<[^<>]+>", "")
	text = decode_uri(text, "PATH")
	text = checkNoEntities(self, text)
	-- Remove initial hyphens and * unless the term only consists of spacing + punctuation characters.
	text = ugsub(text, "^([􀀀-􏿽]*)[-־ـ᠊*]+([􀀀-􏿽]*)(.*[^%s%p].*)", "%1%2%3")
	sc = checkScript(text, self, sc)
	text = normalize(text, sc)
	text = removeCarets(text, sc)
	-- For languages with dotted dotless i, ensure that "İ" is sorted as "i", and "I" is sorted as "ı".
	if self:hasDottedDotlessI() then
		text = gsub(text, "I\204\135", "i") -- decomposed "İ"
			:gsub("I", "ı")
		text = sc:toFixedNFD(text)
	end
	-- Convert to lowercase, make the sortkey, then convert to uppercase. Where the language has dotted dotless i, it is
	-- usually not necessary to convert "i" to "İ" and "ı" to "I" first, because "I" will always be interpreted as
	-- conventional "I" (not dotless "İ") by any sorting algorithms, which will have been taken into account by the
	-- sortkey substitutions themselves. However, if no sortkey substitutions have been specified, then conversion is
	-- necessary so as to prevent "i" and "ı" both being sorted as "I".
	--
	-- An exception is made for scripts that (sometimes) sort by scraping page content, as that means they are sensitive
	-- to changes in capitalization (as it changes the target page).
	if not sc:sortByScraping() then
		text = ulower(text)
	end
	local actual_substitution_data
	-- Don't trim whitespace here because it's significant at the beginning of a sort key or sort base.
	text, _, actual_substitution_data = iterateSectionSubstitutions(self, text, sc, nil, nil, self._data.sort_key,
		"sort_key", "makeSortKey", "notrim")
	if not sc:sortByScraping() then
		if self:hasDottedDotlessI() and not actual_substitution_data then
			text = text:gsub("ı", "I"):gsub("i", "İ")
			text = sc:toFixedNFC(text)
		end
		text = uupper(text)
	end
	-- Remove parentheses, as long as they are either preceded or followed by something.
	text = gsub(text, "(.)[()]+", "%1"):gsub("[()]+(.)", "%1")
	text = escape_risky_characters(text)
	return text
end
--[==[Create the form used as a basis for display text and transliteration. FIXME: Rename to correctInputText().]==]
local function processDisplayText(text, self, sc, keepCarets, keepPrefixes)
	local subbedChars = {}
	text, subbedChars = doTempSubstitutions(text, subbedChars, keepCarets)
	text = decode_uri(text, "PATH")
	text = checkNoEntities(self, text)
	sc = checkScript(text, self, sc)
	text = normalize(text, sc)
	text, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, self._data.display_text,
		"display_text", "makeDisplayText")
	text = removeCarets(text, sc)
	-- Remove any interwiki link prefixes (unless they have been escaped or this has been disabled).
	if find(text, ":") and not keepPrefixes then
		local rep
		repeat
			text, rep = gsub(text, "\\\\(\\*:)", "\3%1")
		until rep == 0
		text = gsub(text, "\\:", "\4")
		while true do
			local prefix = gsub(text, "^(.-):.+", function(m1)
				return (gsub(m1, "\244[\128-\191]*", ""))
			end)
			-- Check if the prefix is an interwiki, though ignore capitalised Wiktionary:, which is a namespace.
			if not prefix or prefix == text or prefix == "Wiktionary"
				or not (load_data("Module:data/interwikis")[ulower(prefix)] or prefix == "") then
				break
			end
			text = gsub(text, "^(.-):(.*)", function(m1, m2)
				local ret = {}
				for subbedChar in gmatch(m1, "\244[\128-\191]*") do
					insert(ret, subbedChar)
				end
				return concat(ret) .. m2
			end)
		end
		text = gsub(text, "\3", "\\"):gsub("\4", ":")
	end
	return text, subbedChars
end
--[==[Make the display text (i.e. what is displayed on the page).]==]
function Language:makeDisplayText(text, sc, keepPrefixes)
	if not text or text == "" then
		return text
	end
	local subbedChars
	text, subbedChars = processDisplayText(text, self, sc, nil, keepPrefixes)
	text = escape_risky_characters(text)
	return undoTempSubstitutions(text, subbedChars)
end
--[==[Transliterates the text from the given script into the Latin script (see
[[Wiktionary:Transliteration and romanization]]). The language must have the <code>translit</code> property for this to
work; if it is not present, {{code|lua|nil}} is returned.
The <code>sc</code> parameter is handled by the transliteration module, and how it is handled is specific to that
module. Some transliteration modules may tolerate {{code|lua|nil}} as the script, others require it to be one of the
possible scripts that the module can transliterate, and will throw an error if it's not one of them. For this reason,
the <code>sc</code> parameter should always be provided when writing non-language-specific code.
The <code>module_override</code> parameter is used to override the default module that is used to provide the
transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no
default module yet, or you want to demonstrate an alternative version of a transliteration module before making it
official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked
by [[Wiktionary:Tracking/languages/module_override]].
'''Known bugs''':
* This function assumes {tr(s1) .. tr(s2) == tr(s1 .. s2)}. When this assertion fails, wikitext markups like <nowiki>'''</nowiki> can cause wrong transliterations.
* HTML entities like <code>&amp;apos;</code>, often used to escape wikitext markups, do not work.
]==]
function Language:transliterate(text, sc, module_override)
	-- If there is no text, or the language doesn't have transliteration data and there's no override, return nil.
	if not text or text == "" or text == "-" then
		return text
	end
	-- If the script is not transliteratable (and no override is given), return nil.
	sc = checkScript(text, self, sc)
	if not (sc:isTransliterated() or module_override) then
		-- temporary tracking to see if/when this gets triggered
		track("non-transliterable")
		track("non-transliterable/" .. self._code)
		track("non-transliterable/" .. sc:getCode())
		track("non-transliterable/" .. sc:getCode() .. "/" .. self._code)
		return nil
	end
	-- Remove any strip markers.
	text = unstrip(text)
	-- Do not process the formatting into PUA characters for certain languages.
	local processed = load_data(languages_data_module).substitution[self._code] ~= "none"
	-- Get the display text with the keepCarets flag set.
	local subbedChars
	if processed then
		text, subbedChars = processDisplayText(text, self, sc, true)
	end
	-- Transliterate (using the module override if applicable).
	text, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, true, module_override or
		self._data.translit, "translit", "tr")
	if not text then
		return nil
	end
	-- Incomplete transliterations return nil.
	local charset = sc.characters
	if charset and umatch(text, "[" .. charset .. "]") then
		-- Remove any characters in Latin, which includes Latin characters also included in other scripts (as these are
		-- false positives), as well as any PUA substitutions. Anything remaining should only be script code "None"
		-- (e.g. numerals).
		local check_text = ugsub(text, "[" .. get_script("Latn").characters .. "􀀀-􏿽]+", "")
		-- Set none_is_last_resort_only flag, so that any non-None chars will cause a script other than "None" to be
		-- returned.
		if find_best_script_without_lang(check_text, true):getCode() ~= "None" then
			return nil
		end
	end
	if processed then
		text = escape_risky_characters(text)
		text = undoTempSubstitutions(text, subbedChars)
	end
	-- If the script does not use capitalization, then capitalize any letters of the transliteration which are
	-- immediately preceded by a caret (and remove the caret).
	if text and not sc:hasCapitalization() and text:find("^", 1, true) then
		text = processCarets(text, "%^([\128-\191\244]*%*?)([^\128-\191\244][\128-\191]*)", function(m1, m2)
			return m1 .. uupper(m2)
		end)
	end
	-- Track module overrides.
	if module_override ~= nil then
		track("module_override")
	end
	return text
end
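-- Usage sketch (illustrative only; the actual output depends entirely on the
-- language's transliteration module, here assumed to be the Russian one):
--   local m_languages = require("Module:languages")
--   m_languages.getByCode("ru"):transliterate("пример") --> a Latin-script romanization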
do
	local function handle_language_spec(self, spec, sc)
		local ret = self["_" .. spec]
		if ret == nil then
			ret = self._data[spec]
			if type(ret) == "string" then
				ret = list_to_set(split(ret, ",", true, true))
			end
			self["_" .. spec] = ret
		end
		if type(ret) == "table" then
			ret = ret[sc:getCode()]
		end
		return not not ret
	end
	function Language:overrideManualTranslit(sc)
		return handle_language_spec(self, "override_translit", sc)
	end
	function Language:link_tr(sc)
		return handle_language_spec(self, "link_tr", sc)
	end
end
--[==[Returns {{code|lua|true}} if the language has a transliteration module, or {{code|lua|false}} if it doesn't.]==]
function Language:hasTranslit()
	return not not self._data.translit
end
--[==[Returns {{code|lua|true}} if the language uses the letters I/ı and İ/i, or {{code|lua|false}} if it doesn't.]==]
function Language:hasDottedDotlessI()
	return not not self._data.dotted_dotless_i
end
function Language:toJSON(opts)
	local strip_diacritics, strip_diacritics_patterns, strip_diacritics_remove_diacritics = self._data.strip_diacritics
	if strip_diacritics then
		if strip_diacritics.from then
			strip_diacritics_patterns = {}
			for i, from in ipairs(strip_diacritics.from) do
				insert(strip_diacritics_patterns, {from = from, to = strip_diacritics.to[i] or ""})
			end
		end
		strip_diacritics_remove_diacritics = strip_diacritics.remove_diacritics
	end
	-- mainCode should only end up non-nil if dontCanonicalizeAliases is passed to make_object().
	-- props should either contain zero-argument functions to compute the value, or the value itself.
	local props = {
		ancestors = function() return self:getAncestorCodes() end,
		canonicalName = function() return self:getCanonicalName() end,
		categoryName = function() return self:getCategoryName("nocap") end,
		code = self._code,
		mainCode = self._mainCode,
		parent = function() return self:getParentCode() end,
		full = function() return self:getFullCode() end,
		stripDiacriticsPatterns = strip_diacritics_patterns,
		stripDiacriticsRemoveDiacritics = strip_diacritics_remove_diacritics,
		family = function() return self:getFamilyCode() end,
		aliases = function() return self:getAliases() end,
		varieties = function() return self:getVarieties() end,
		otherNames = function() return self:getOtherNames() end,
		scripts = function() return self:getScriptCodes() end,
		type = function() return keys_to_list(self:getTypes()) end,
		wikimediaLanguages = function() return self:getWikimediaLanguageCodes() end,
		wikidataItem = function() return self:getWikidataItem() end,
		wikipediaArticle = function() return self:getWikipediaArticle(true) end,
	}
	local ret = {}
	for prop, val in pairs(props) do
		if not opts.skip_fields or not opts.skip_fields[prop] then
			if type(val) == "function" then
				ret[prop] = val()
			else
				ret[prop] = val
			end
		end
	end
	-- Use `deep_copy` when returning a table, so that there are no editing restrictions imposed by `mw.loadData`.
	return opts and opts.lua_table and deep_copy(ret) or to_json(ret, opts)
end
function export.getDataModuleName(code)
	local letter = match(code, "^(%l)%l%l?$")
	return "Module:" .. (
		letter == nil and "languages/data/exceptional" or
		#code == 2 and "languages/data/2" or
		"languages/data/3/" .. letter
	)
end
get_data_module_name = export.getDataModuleName
function export.getExtraDataModuleName(code)
	return get_data_module_name(code) .. "/extra"
end
get_extra_data_module_name = export.getExtraDataModuleName
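-- The mapping implemented by the two functions above, as commented examples
-- (derivable directly from the pattern match on the code):
--   export.getDataModuleName("fr")      --> "Module:languages/data/2"
--   export.getDataModuleName("ang")     --> "Module:languages/data/3/a"
--   export.getDataModuleName("qsb-grc") --> "Module:languages/data/exceptional"
--   export.getExtraDataModuleName("fr") --> "Module:languages/data/2/extra"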
do
	local function make_stack(data)
		local key_types = {
			[2] = "unique",
			aliases = "unique",
			otherNames = "unique",
			type = "append",
			varieties = "unique",
			wikipedia_article = "unique",
			wikimedia_codes = "unique"
		}
		local function __index(self, k)
			local stack, key_type = getmetatable(self), key_types[k]
			-- Data that isn't inherited from the parent.
			if key_type == "unique" then
				local v = stack[stack[make_stack]][k]
				if v == nil then
					local layer = stack[0]
					if layer then -- Could be false if there's no extra data.
						v = layer[k]
					end
				end
				return v
			-- Data that is appended by each generation.
			elseif key_type == "append" then
				local parts, offset, n = {}, 0, stack[make_stack]
				for i = 1, n do
					local part = stack[i][k]
					if part == nil then
						offset = offset + 1
					else
						parts[i - offset] = part
					end
				end
				return offset ~= n and concat(parts, ",") or nil
			end
			local n = stack[make_stack]
			while true do
				local layer = stack[n]
				if not layer then -- Could be false if there's no extra data.
					return nil
				end
				local v = layer[k]
				if v ~= nil then
					return v
				end
				n = n - 1
			end
		end
		local function __newindex()
			error("table is read-only")
		end
		local function __pairs(self)
			-- Iterate down the stack, caching keys to avoid duplicate returns.
			local stack, seen = getmetatable(self), {}
			local n = stack[make_stack]
			local iter, state, k, v = pairs(stack[n])
			return function()
				repeat
					repeat
						k = iter(state, k)
						if k == nil then
							n = n - 1
							local layer = stack[n]
							if not layer then -- Could be false if there's no extra data.
								return nil
							end
							iter, state, k = pairs(layer)
						end
					until not (k == nil or seen[k])
					-- Get the value via a lookup, as the one returned by the
					-- iterator will be the raw value from the current layer,
					-- which may not be the one __index will return for that
					-- key. Also memoize the key in `seen` (even if the lookup
					-- returns nil) so that it doesn't get looked up again.
					-- TODO: store values in `self`, avoiding the need to create
					-- the `seen` table. The iterator will need to iterate over
					-- `self` with `next` first to find these on future loops.
					v, seen[k] = self[k], true
				until v ~= nil
				return k, v
			end
		end
		local __ipairs = require(table_module).indexIpairs
		function make_stack(data)
			local stack = {
				data,
				[make_stack] = 1, -- stores the length and acts as a sentinel to confirm a given metatable is a stack.
				__index = __index,
				__newindex = __newindex,
				__pairs = __pairs,
				__ipairs = __ipairs,
			}
			stack.__metatable = stack
			return setmetatable({}, stack), stack
		end
		return make_stack(data)
	end
	local function get_stack(data)
		local stack = getmetatable(data)
		return stack and type(stack) == "table" and stack[make_stack] and stack or nil
	end
	--[==[
	<span style="color: var(--wikt-palette-red,#BA0000)">This function is not for use in entries or other content pages.</span>
	Returns a blob of data about the language. The format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes. If `extra` is set, any extra data in the relevant `/extra` module will be included. (Note that it will be included anyway if it has already been loaded into the language object.) If `raw` is set, then the returned data will not contain any data inherited from parent objects.
	-- Do NOT use these methods!
	-- All uses should be pre-approved on the talk page!
	]==]
	function Language:getData(extra, raw)
		if extra then
			self:loadInExtraData()
		end
		local data = self._data
		-- If raw is not set, just return the data.
		if not raw then
			return data
		end
		local stack = get_stack(data)
		-- If there isn't a stack or its length is 1, return the data. Extra data (if any) will be included, as it's stored at key 0 and doesn't affect the reported length.
		if stack == nil then
			return data
		end
		local n = stack[make_stack]
		if n == 1 then
			return data
		end
		local extra = stack[0]
		-- If there isn't any extra data, return the top layer of the stack.
		if extra == nil then
			return stack[n]
		end
		-- If there is, return a new stack which has the top layer at key 1 and the extra data at key 0.
		data, stack = make_stack(stack[n])
		stack[0] = extra
		return data
	end
	function Language:loadInExtraData()
		-- Only full languages have extra data.
		if not self:hasType("language", "full") then
			return
		end
		local data = self._data
		-- If there's no stack, create one.
		local stack = get_stack(self._data)
		if stack == nil then
			data, stack = make_stack(data)
		-- If already loaded, return.
		elseif stack[0] ~= nil then
			return
		end
		self._data = data
		-- Load extra data from the relevant module and add it to the stack at key 0, so that the __index and __pairs metamethods will pick it up, since they iterate down the stack until they run out of layers.
		local code = self._code
		local modulename = get_extra_data_module_name(code)
		-- No data cached as false.
		stack[0] = modulename and load_data(modulename)[code] or false
	end
	--[==[Returns the name of the module containing the language's data.]==]
	function Language:getDataModuleName()
		local name = self._dataModuleName
		if name == nil then
			name = self:hasType("etymology-only") and etymology_languages_data_module or
				get_data_module_name(self._mainCode or self._code)
			self._dataModuleName = name
		end
		return name
	end
	--[==[Returns the name of the module containing the language's extra data, or {{code|lua|nil}} if there is none.]==]
	function Language:getExtraDataModuleName()
		local name = self._extraDataModuleName
		if name == nil then
			name = not self:hasType("etymology-only") and get_extra_data_module_name(self._mainCode or self._code) or false
			self._extraDataModuleName = name
		end
		return name or nil
	end
	function export.makeObject(code, data, dontCanonicalizeAliases)
		local data_type = type(data)
		if data_type ~= "table" then
			error(("bad argument #2 to 'makeObject' (table expected, got %s)"):format(data_type))
		end
		-- Convert any aliases.
		local input_code = code
		code = normalize_code(code)
		input_code = dontCanonicalizeAliases and input_code or code
		local parent
		if data.parent then
			parent = get_by_code(data.parent, nil, true, true)
		else
			parent = Language
		end
		parent.__index = parent
		local lang = {_code = input_code}
		-- This can only happen if dontCanonicalizeAliases is passed to make_object().
		if code ~= input_code then
			lang._mainCode = code
		end
		local parent_data = parent._data
		if parent_data == nil then
			-- Full code is the same as the code.
			lang._fullCode = parent._code or code
		else
			-- Copy full code.
			lang._fullCode = parent._fullCode
			local stack = get_stack(parent_data)
			if stack == nil then
				parent_data, stack = make_stack(parent_data)
			end
			-- Insert the input data as the new top layer of the stack.
			local n = stack[make_stack] + 1
			data, stack[n], stack[make_stack] = parent_data, data, n
		end
		lang._data = data
		return setmetatable(lang, parent)
	end
	make_object = export.makeObject
end
--[==[Finds the language whose code matches the one provided. If it exists, it returns a <code class="nf">Language</code> object representing the language. Otherwise, it returns {{code|lua|nil}}, unless <code class="n">paramForError</code> is given, in which case an error is generated. If <code class="n">paramForError</code> is {{code|lua|true}}, a generic error message mentioning the bad code is generated; otherwise <code class="n">paramForError</code> should be a string or number specifying the parameter that the code came from, and this parameter will be mentioned in the error message along with the bad code. If <code class="n">allowEtymLang</code> is specified, etymology-only language codes are allowed and looked up along with normal language codes. If <code class="n">allowFamily</code> is specified, language family codes are allowed and looked up along with normal language codes.]==]
function export.getByCode(code, paramForError, allowEtymLang, allowFamily)
	-- Track uses of paramForError, ultimately so it can be removed, as error-handling should be done by
	-- [[Module:parameters]], not here.
	if paramForError ~= nil then
		track("paramForError")
	end
	if type(code) ~= "string" then
		local typ
		if not code then
			typ = "nil"
		elseif check_object("language", true, code) then
			typ = "a language object"
		elseif check_object("family", true, code) then
			typ = "a family object"
		else
			typ = "a " .. type(code)
		end
		error("The function getByCode expects a string as its first argument, but received " .. typ .. ".")
	end
	local m_data = load_data(languages_data_module)
	if m_data.aliases[code] or m_data.track[code] then
		track(code)
	end
	local norm_code = normalize_code(code)
	-- Get the data, checking for etymology-only languages if allowEtymLang is set.
	local data = load_data(get_data_module_name(norm_code))[norm_code] or
		allowEtymLang and load_data(etymology_languages_data_module)[norm_code]
	-- If no data was found and allowFamily is set, check the family data. If the main family data was found,
	-- make the object with [[Module:families]] instead, as family objects have different methods. However, if
	-- it's an etymology-only family, use make_object in this module (which handles object inheritance), and
	-- the family-specific methods will be inherited from the parent object.
	if data == nil and allowFamily then
		data = load_data("Module:families/data")[norm_code]
		if data ~= nil then
			if data.parent == nil then
				return make_family_object(norm_code, data)
			elseif not allowEtymLang then
				data = nil
			end
		end
	end
	local retval = code and data and make_object(code, data)
	if not retval and paramForError then
		require("Module:languages/errorGetBy").code(code, paramForError, allowEtymLang, allowFamily)
	end
	return retval
end
get_by_code = export.getByCode
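--[==[ Usage sketch (illustrative only; actual results depend on the current data modules):
	local m_languages = require("Module:languages")
	local lang = m_languages.getByCode("fr")
	-- lang:getCanonicalName() returns "French".
	local etym = m_languages.getByCode("fr-CA", nil, true)
	-- etym is the etymology-only language "Canadian French"; without allowEtymLang, nil is returned.
]==]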
--[==[Finds the language whose canonical name (the name used to represent that language on Wiktionary) or other name matches the one provided. If it exists, it returns a <code class="nf">Language</code> object representing the language. Otherwise, it returns {{code|lua|nil}}, unless <code class="n">paramForError</code> is given, in which case an error is generated. If <code class="n">allowEtymLang</code> is specified, etymology-only language codes are allowed and looked up along with normal language codes. If <code class="n">allowFamily</code> is specified, language family codes are allowed and looked up along with normal language codes.
The canonical name of languages should always be unique (it is an error for two languages on Wiktionary to share the same canonical name), so this is guaranteed to give at most one result.
This function is powered by [[Module:languages/canonical names]], which contains a pre-generated mapping of full-language canonical names to codes. It is generated by going through the [[:Category:Language data modules]] for full languages. When <code class="n">allowEtymLang</code> is specified for the above function, [[Module:etymology languages/canonical names]] may also be used, and when <code class="n">allowFamily</code> is specified for the above function, [[Module:families/canonical names]] may also be used.]==]
function export.getByCanonicalName(name, errorIfInvalid, allowEtymLang, allowFamily)
	local byName = load_data("Module:languages/canonical names")
	local code = byName and byName[name]
	if not code and allowEtymLang then
		byName = load_data("Module:etymology languages/canonical names")
		code = byName and byName[name] or
			byName[gsub(name, " [Ss]ubstrate$", "")] or
			byName[gsub(name, "^a ", "")] or
			byName[gsub(name, "^a ", ""):gsub(" [Ss]ubstrate$", "")] or
			-- For etymology families like "ira-pro".
			-- FIXME: This is not ideal, as it allows " languages" to be appended to any etymology-only language, too.
			byName[match(name, "^(.*) languages$")]
	end
	if not code and allowFamily then
		byName = load_data("Module:families/canonical names")
		code = byName[name] or byName[match(name, "^(.*) languages$")]
	end
	local retval = code and get_by_code(code, errorIfInvalid, allowEtymLang, allowFamily)
	if not retval and errorIfInvalid then
		require("Module:languages/errorGetBy").canonicalName(name, allowEtymLang, allowFamily)
	end
	return retval
end
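--[==[ Usage sketch (illustrative only): canonical names are unique, so at most one result is possible.
	local lang = require("Module:languages").getByCanonicalName("French")
	-- lang:getCode() returns "fr"; an unrecognized name returns nil unless errorIfInvalid is set.
]==]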
--[==[Used by [[Module:languages/data/2]] (et al.) and [[Module:etymology languages/data]], [[Module:families/data]], [[Module:scripts/data]] and [[Module:writing systems/data]] to finalize the data into the format that is actually returned.]==]
function export.finalizeData(data, main_type, variety)
	local fields = {"type"}
	if main_type == "language" then
		insert(fields, 4) -- script codes
		insert(fields, "ancestors")
		insert(fields, "link_tr")
		insert(fields, "override_translit")
		insert(fields, "wikimedia_codes")
	elseif main_type == "script" then
		insert(fields, 3) -- writing system codes
	end -- Families and writing systems have no extra fields to process.
	local fields_len = #fields
	for _, entity in next, data do
		if variety then
			-- Move parent from 3 to "parent" and family from "family" to 3. These are different for the sake of
			-- convenience, since very few varieties have the family specified, whereas all of them have a parent.
			entity.parent, entity[3], entity.family = entity[3], entity.family
		-- Give the type "regular" iff not a variety and no other types are assigned.
		elseif not (entity.type or entity.parent) then
			entity.type = "regular"
		end
		for i = 1, fields_len do
			local key = fields[i]
			local field = entity[key]
			if field and type(field) == "string" then
				entity[key] = gsub(field, "%s*,%s*", ",")
			end
		end
	end
	return data
end
--[==[For backwards compatibility only; modules should require the error themselves.]==]
function export.err(lang_code, param, code_desc, template_tag, not_real_lang)
	return require("Module:languages/error")(lang_code, param, code_desc, template_tag, not_real_lang)
end

return export