This module is used to retrieve and manage the languages that can have Wiktionary entries, and the information associated with them. See Wiktionary:Languages for more information.
For the languages and language varieties that may be used in etymologies, see Module:etymology languages. For language families, which sometimes also appear in etymologies, see Module:families.
This module provides access to other modules. To access the information from within a template, see Module:languages/templates.
The information itself is stored in the various data modules that are subpages of this module. These modules should not be used directly by any other module; the data should only be accessed through the functions provided by this module.
Data submodules:
- Two-letter codes
- Three-letter codes by their first letter: a b c d e f g h i j k l m n o p q r s t u v w x y z
- Longer codes containing hyphens (-)
Extra data submodules (for less frequently used data):
- Two-letter codes
- Three-letter codes by their first letter: a b c d e f g h i j k l m n o p q r s t u v w x y z
- Longer codes containing hyphens (-)
Finding and retrieving languages
The module exports a number of functions that are used to find languages.
This module implements fetching of language-specific information and processing text in a given language.
Types of languages
There are two types of languages: full languages and etymology-only languages. The essential difference is that only full languages appear in L2 headings in vocabulary entries, and hence categories like Category:French nouns exist only for full languages. Etymology-only languages have either a full language or another etymology-only language as their parent (in the parent-child inheritance sense), and for etymology-only languages with another etymology-only language as their parent, a full language can always be derived by following the parent links upwards. For example, "Canadian French", code fr-CA, is an etymology-only language whose parent is the full language "French", code fr. An example of an etymology-only language with another etymology-only parent is "Northumbrian Old English", code ang-nor, which has "Anglian Old English", code ang-ang as its parent; this is an etymology-only language whose parent is "Old English", code ang, which is a full language. (This is because Northumbrian Old English is considered a variety of Anglian Old English.) Sometimes the parent is the "Undetermined" language, code und; this is the case, for example, for "substrate" languages such as "Pre-Greek", code qsb-grc, and "the BMAC substrate", code qsb-bma.
It is important to distinguish language parents from language ancestors. The parent-child relationship is one of containment, i.e. if X is a child of Y, X is considered a variety of Y. On the other hand, the ancestor-descendant relationship is one of descent in time. For example, "Classical Latin", code la-cla, and "Late Latin", code la-lat, are both etymology-only languages with "Latin", code la, as their parents, because both of the former are varieties of Latin. However, Late Latin does *NOT* have Classical Latin as its parent because Late Latin is *not* a variety of Classical Latin; rather, it is a descendant. There is in fact a separate ancestors field that is used to express the ancestor-descendant relationship, and Late Latin's ancestor is given as Classical Latin. It is also important to note that sometimes an etymology-only language is actually the conceptual ancestor of its parent language. This happens, for example, with "Old Italian" (code roa-oit), which is an etymology-only variant of full language "Italian" (code it), and with "Old Latin" (code itc-ola), which is an etymology-only variant of Latin. In both cases, the full language has the etymology-only variant listed as an ancestor. This allows a Latin term to inherit from Old Latin using the {{inh}} template (where in this template, "inheritance" refers to ancestral inheritance, i.e. inheritance in time, rather than in the parent-child sense); likewise for Italian and Old Italian.
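The difference between parents and ancestors can be illustrated with a small self-contained sketch. The `data` table and the `get_full_code` helper below are hypothetical stand-ins, not the module's actual data layout or API; they only mirror the relationships described above.

```lua
-- Hypothetical stand-in data; the real data modules use a different layout.
local data = {
	["fr"]      = { name = "French", full = true },
	["fr-CA"]   = { name = "Canadian French", parent = "fr" },
	["ang"]     = { name = "Old English", full = true },
	["ang-ang"] = { name = "Anglian Old English", parent = "ang" },
	["ang-nor"] = { name = "Northumbrian Old English", parent = "ang-ang" },
	["la"]      = { name = "Latin", full = true },
	["la-cla"]  = { name = "Classical Latin", parent = "la" },
	-- Late Latin is a variety of Latin (parent) but descends from Classical Latin (ancestor).
	["la-lat"]  = { name = "Late Latin", parent = "la", ancestors = { "la-cla" } },
}

-- Follow parent links upwards until a full language is reached.
-- Returns nil if the code is unknown or the chain is broken.
local function get_full_code(code)
	local entry = data[code]
	while entry and not entry.full do
		code = entry.parent
		entry = data[code]
	end
	return entry and code or nil
end

print(get_full_code("ang-nor")) --> ang
print(get_full_code("la-lat"))  --> la
```

Note that following parents from Late Latin never reaches Classical Latin; that relationship lives only in the separate `ancestors` field.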
Full languages come in three subtypes:
- regular: This indicates a full language that is attested according to WT:CFI and therefore permitted in the main namespace. There may also be reconstructed terms for the language, which are placed in the Reconstruction namespace and must be prefixed with * to indicate a reconstruction. Most full languages are natural (not constructed) languages, but a few constructed languages (e.g. Esperanto and Volapük, among others) are also allowed in the mainspace and considered regular languages.
- reconstructed: This language is not attested according to WT:CFI, and therefore is allowed only in the Reconstruction namespace. All terms in this language are reconstructed, and must be prefixed with *. Languages such as Proto-Indo-European and Proto-Germanic are in this category.
- appendix-constructed: This language is attested but does not meet the additional requirements set out for constructed languages (WT:CFI#Constructed languages). Its entries must therefore be in the Appendix namespace, but they are not reconstructed and therefore should not have * prefixed in links. Most constructed languages are of this subtype.
Both full languages and etymology-only languages have a Language object associated with them. The getByCode function in Module:languages converts a language code to a Language object. Depending on the options supplied to this function, etymology-only languages may or may not be accepted, and family codes may be accepted (returning a Family object as described in Module:families). There are also separate getByCanonicalName functions in Module:languages and Module:etymology languages to convert a language's canonical name to a Language object (depending on whether the canonical name refers to a full or etymology-only language).
Textual representations
Textual strings belonging to a given language come in several different text variants:
- The input text is what the user supplies in wikitext, in the parameters to {{m}}, {{l}}, {{ux}}, {{t}}, {{lang}} and the like.
- The corrected input text is the input text with some corrections and/or normalizations applied, such as bad-character replacements for certain languages, like replacing l or 1 with palochka in some languages written in Cyrillic. (FIXME: This currently goes under the name display text but that will be repurposed below. Also, User:Surjection suggests renaming this to normalized input text, but "normalized" is used in a different sense in Module:usex.)
- The display text is the text in the form in which it will be displayed to the user. This is what appears in headwords, in usexes, in displayed internal links, etc. This can include accent marks that are removed to form the stripped display text (see below), as well as embedded bracketed links that are variously processed further. The display text is generated from the corrected input text by applying language-specific transformations; for most languages, there will be no such transformations. The general reason for having a difference between input and display text is to allow for extra information in the input text that is not displayed to the user but is sent to the transliteration module. Note that having different display and input text is only supported currently through special-casing but will be generalized. Examples of transformations are: (1) removing the ^ that is used in certain East Asian (and possibly other unicameral) languages to indicate capitalization of the transliteration (which is currently special-cased); (2) for Korean, removing or otherwise processing hyphens (which is currently special-cased); (3) for Arabic, removing a sukūn diacritic placed over a tāʔ marbūṭa (like this: ةْ) to indicate that the tāʔ marbūṭa is pronounced and transliterated as /t/ instead of being silent [NOTE, NOT IMPLEMENTED YET]; (4) for Thai and Khmer, converting space-separated words to bracketed words and resolving respelling substitutions such as [กรีน/กฺรีน], which indicate how to transliterate given words [NOTE, NOT IMPLEMENTED YET except in language-specific templates like {{th-usex}}].
- The right-resolved display text is the result of removing brackets around one-part embedded links and resolving two-part embedded links into their right-hand components (i.e. converting two-part links into the displayed form). The process of right-resolution is what happens when you call remove_links() in Module:links on some text. When applied to the display text, it produces exactly what the user sees, without any link markup.
- The stripped display text is the result of applying diacritic-stripping to the display text.
- The left-resolved stripped display text [NEED BETTER NAME] is the result of applying left-resolution to the stripped display text, i.e. similar to right-resolution but resolving two-part embedded links into their left-hand components (i.e. the linked-to page). If the display text refers to a single page, the result of applying diacritic stripping and left-resolution is the logical pagename.
- The physical pagename text is the result of converting the stripped display text into physical page links. If the stripped display text contains embedded links, the left side of those links is converted into physical page links; otherwise, the entire text is considered a pagename and converted in the same fashion. The conversion does three things: (1) converts characters not allowed in pagenames into their "unsupported title" representation, e.g. Unsupported titles/`gt` in place of the logical name >; (2) handles certain special-cased unsupported-title logical pagenames, such as Unsupported titles/Space in place of [space] and Unsupported titles/Ancient Greek dish in place of a very long Greek name for a gourmet dish as found in Aristophanes; (3) converts "mammoth" pagenames such as a into their appropriate split component, e.g. a/languages A to L.
- The source translit text is the text as supplied to the language-specific transliterate() method. The form of the source translit text may need to be language-specific, e.g. Thai and Khmer will need the corrected input text, whereas other languages may need to work off the display text. [FIXME: It's still unclear to me how embedded bracketed links are handled in the existing code.] In general, embedded links need to be right-resolved (see above), but when this happens is unclear to me [FIXME]. Some languages have a chop-up-and-paste-together scheme that sends parts of the text through the transliterate mechanism, and others (those listed with "cont" in substitution in Module:languages/data) receive the full input text, but preprocessed in certain ways. (The wisdom of this is still unclear to me.)
- The transliterated text (or transliteration) is the result of transliterating the source translit text. Unlike for all the other text variants except the transcribed text, it is always in the Latin script.
- The transcribed text (or transcription) is the result of transcribing the source translit text, where "transcription" here means a close approximation to the phonetic form of the language in languages (e.g. Akkadian, Sumerian, Ancient Egyptian, maybe Tibetan) that have a wide difference between the written letters and spoken form. Unlike for all the other text variants other than the transliterated text, it is always in the Latin script. Currently, the transcribed text is always supplied manually by the user; there is no such thing as a transcribe() method on language objects.
- The sort key is the text used in sort keys for determining the placing of pages in categories they belong to. The sort key is generated from the pagename or a specified sort base by lowercasing, doing language-specific transformations and then uppercasing the result. If the sort base is supplied and is generated from input text, it needs to be converted to display text, have embedded links removed through right-resolution and have diacritic-stripping applied.
- There are other text variants that occur in usexes (specifically, there are normalized variants of several of the above text variants), but we can skip them for now.
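To make right-resolution and left-resolution from the list above concrete, here is a minimal sketch using plain Lua string patterns. The helper name `resolve_links` is hypothetical and this is not the implementation of remove_links() in Module:links; it assumes only simple one-part and two-part bracketed links with no nesting.

```lua
-- Hypothetical helper, not part of Module:links.
-- side == "left" keeps the linked-to page; anything else keeps the displayed text.
local function resolve_links(text, side)
	-- Two-part links first: [[target|displayed]]
	text = text:gsub("%[%[([^%[%]|]+)|([^%[%]|]+)%]%]", function(target, displayed)
		return side == "left" and target or displayed
	end)
	-- One-part links: [[page]] becomes page on either side
	text = text:gsub("%[%[([^%[%]|]+)%]%]", "%1")
	return text
end

local display = "a [[dog|dogs]] and a [[cat]]"
print(resolve_links(display, "right")) --> a dogs and a cat
print(resolve_links(display, "left"))  --> a dog and a cat
```

Right-resolution yields what the reader sees, while left-resolution yields the page names being linked to, which is why the latter feeds into pagename computation.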
The following methods exist on Language objects to convert between different text variants:
- correctInputText (currently called makeDisplayText): This converts input text to corrected input text.
- stripDiacritics: This converts to stripped display text. [FIXME: This needs some rethinking. In particular, stripDiacritics is sometimes called on input text, corrected input text or display text (in various paths inside of Module:links, and, in the case of input text, usually from other modules). We need to make sure we don't try to convert input text to display text twice, but at the same time we need to support calling it directly on input text since so many modules do this. This means we need to add a parameter indicating whether the passed-in text is input, corrected input, or display text; if the former two, we call correctInputText ourselves.]
- logicalToPhysical: This converts logical pagenames to physical pagenames.
- transliterate: This appears to convert input text with embedded brackets removed into a transliteration. [FIXME: This needs some rethinking. In particular, it calls processDisplayText on its input, which won't work for Thai and Khmer, so we may need language-specific flags indicating whether to pass the input text directly to the language transliterate method. In addition, I'm not sure how embedded links are handled in the existing translit code; a lot of callers remove the links themselves before calling transliterate(), which I assume is wrong.]
- makeSortKey: This converts display text (?) to a sort key. [FIXME: Clarify this.]
export.getDataModuleName
function export.getDataModuleName(code)
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
export.getExtraDataModuleName
function export.getExtraDataModuleName(code)
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
export.makeObject
function export.makeObject(code, data, dontCanonicalizeAliases)
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
export.getByCode
function export.getByCode(code, paramForError, allowEtymLang, allowFamily)
Finds the language whose code matches the one provided. If it exists, it returns a Language object representing the language. Otherwise, it returns nil, unless paramForError is given, in which case an error is generated. If paramForError is true, a generic error message mentioning the bad code is generated; otherwise paramForError should be a string or number specifying the parameter that the code came from, and this parameter will be mentioned in the error message along with the bad code. If allowEtymLang is specified, etymology-only language codes are allowed and looked up along with normal language codes. If allowFamily is specified, language family codes are allowed and looked up along with normal language codes.
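The three behaviours of paramForError (absent, true, or a parameter name/number) can be sketched as follows. This is a hypothetical stand-in, not the module's code: the `known` table, the `get_by_code` name, the returned table shape and the wording of the error messages are all assumptions made for illustration.

```lua
-- Hypothetical stand-in for the lookup table.
local known = { fr = "French", la = "Latin" }

local function get_by_code(code, param_for_error)
	local name = known[code]
	if name then
		return { code = code, name = name }
	end
	if not param_for_error then
		return nil -- silent failure: the caller must check for nil
	end
	if param_for_error == true then
		-- Generic message mentioning only the bad code (wording is illustrative).
		error("The language code '" .. tostring(code) .. "' is not valid.")
	end
	-- Message that also names the parameter the code came from.
	error("The language code '" .. tostring(code) .. "' in parameter " ..
		tostring(param_for_error) .. " is not valid.")
end

print(get_by_code("fr").name)          --> French
print(get_by_code("xx"))               --> nil
print(pcall(get_by_code, "xx", 1))     -- false, followed by the error message
```

Passing the parameter name or number is useful in template-invoked code, since the resulting message tells the editor which argument to fix.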
export.getByCanonicalName
function export.getByCanonicalName(name, errorIfInvalid, allowEtymLang, allowFamily)
Finds the language whose canonical name (the name used to represent that language on Wiktionary) or other name matches the one provided. If it exists, it returns a Language object representing the language. Otherwise, it returns nil, unless errorIfInvalid is given, in which case an error is generated. If allowEtymLang is specified, etymology-only languages are allowed and looked up along with normal languages. If allowFamily is specified, language families are allowed and looked up along with normal languages. The canonical name of languages should always be unique (it is an error for two languages on Wiktionary to share the same canonical name), so this is guaranteed to give at most one result. This function is powered by Module:languages/canonical names, which contains a pre-generated mapping of full-language canonical names to codes. It is generated by going through the Category:Language data modules for full languages. When allowEtymLang is specified for the above function, Module:etymology languages/canonical names may also be used, and when allowFamily is specified for the above function, Module:families/canonical names may also be used.
export.finalizeData
function export.finalizeData(data, main_type, variety)
Used by Module:languages/data/2 (et al.) and Module:etymology languages/data, Module:families/data, Module:scripts/data and Module:writing systems/data to finalize the data into the format that is actually returned.
export.err
function export.err(lang_code, param, code_desc, template_tag, not_real_lang)
For backwards compatibility only; modules should require the error themselves.
Language objects
A Language object is returned from one of the functions above. It is a Lua representation of a language and the data associated with it. It has a number of methods that can be called on it, using the : syntax. For example:
    local m_languages = require("Module:languages")
    local lang = m_languages.getByCode("fr")
    local name = lang:getCanonicalName()
    -- "name" will now be "French"
Language:getCode
function Language:getCode()
Returns the language code of the language. Example: "fr" for French.
Language:getCanonicalName
function Language:getCanonicalName()
Returns the canonical name of the language. This is the name used to represent that language on Wiktionary, and is guaranteed to be unique to that language alone. Example: "French" for French.
Language:getDisplayForm
function Language:getDisplayForm()
Return the display form of the language. The display form of a language, family or script is the form it takes when appearing as the source in categories such as English terms derived from source or English given names from source, and is also the displayed text in makeCategoryLink() links. For full and etymology-only languages, this is the same as the canonical name, but for families, it reads "name languages" (e.g. "Indo-Iranian languages"), and for scripts, it reads "name script" (e.g. "Arabic script").
Language:getHTMLAttribute
function Language:getHTMLAttribute(sc, region)
Returns the value which should be used in the HTML lang= attribute for tagged text in the language.
Language:getAliases
function Language:getAliases()
Returns a table of the aliases that the language is known by, excluding the canonical name. Aliases are synonyms for the language in question. The names are not guaranteed to be unique, in that sometimes more than one language is known by the same name. Example: {"High German","New High German","Deutsch"} for German.
Language:getVarieties
function Language:getVarieties(flatten)
Return a table of the known subvarieties of a given language, excluding subvarieties that have been given explicit etymology-only language codes. The names are not guaranteed to be unique, in that sometimes a given name refers to a subvariety of more than one language. Example: {"Southern Aymara","Central Aymara"} for Aymara. Note that the returned value can have nested tables in it, when a subvariety goes by more than one name. Example: {"North Azerbaijani","South Azerbaijani",{"Afshar","Afshari","Afshar Azerbaijani","Afchar"},{"Qashqa'i","Qashqai","Kashkay"},"Sonqor"} for Azerbaijani. Here, for example, Afshar, Afshari, Afshar Azerbaijani and Afchar all refer to the same subvariety, whose preferred name is Afshar (the one listed first). To avoid a return value with nested tables in it, specify a non-nil value for the flatten parameter; in that case, the return value would be {"North Azerbaijani","South Azerbaijani","Afshar","Afshari","Afshar Azerbaijani","Afchar","Qashqa'i","Qashqai","Kashkay","Sonqor"}.
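The effect of the flatten parameter can be sketched with a small standalone helper. The `flatten` function below is hypothetical and is not the module's own code; it assumes at most one level of nesting, as in the Azerbaijani example above.

```lua
-- Hypothetical helper showing what flattening means for a varieties table.
local function flatten(varieties)
	local result = {}
	for _, item in ipairs(varieties) do
		if type(item) == "table" then
			-- A nested table lists several names for one subvariety.
			for _, name in ipairs(item) do
				result[#result + 1] = name
			end
		else
			result[#result + 1] = item
		end
	end
	return result
end

local nested = {
	"North Azerbaijani", "South Azerbaijani",
	{ "Afshar", "Afshari", "Afshar Azerbaijani", "Afchar" },
	{ "Qashqa'i", "Qashqai", "Kashkay" },
	"Sonqor",
}
local flat = flatten(nested)
print(#flat)                    --> 10
print(table.concat(flat, ", "))
```

Keeping the nested form is useful when the preferred name of each subvariety matters, since it is the first entry of each inner table; the flat form is more convenient for simple membership checks.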
Language:getOtherNames
function Language:getOtherNames()
Returns a table of the "other names" that the language is known by, which are listed in the otherNames field. It should be noted that the otherNames field itself is deprecated, and entries listed there should eventually be moved to either aliases or varieties.
Language:getAllNames
function Language:getAllNames()
Return a combined table of the canonical name, aliases, varieties and other names of a given language.
Language:getTypes
function Language:getTypes()
Returns a table of types as a lookup table (with the types as keys).
The possible types are
- language: This is a language, either full or etymology-only.
- full: This is a "full" (not etymology-only) language, i.e. the union of regular, reconstructed and appendix-constructed. Note that the types full and etymology-only also exist for families, so if you want to check specifically for a full language and you have an object that might be a family, you should use hasType("language", "full") and not simply hasType("full").
- etymology-only: This is an etymology-only (not full) language, whose parent is another etymology-only language or a full language. Note that the types full and etymology-only also exist for families, so if you want to check specifically for an etymology-only language and you have an object that might be a family, you should use hasType("language", "etymology-only") and not simply hasType("etymology-only").
- regular: This indicates a full language that is attested according to WT:CFI and therefore permitted in the main namespace. There may also be reconstructed terms for the language, which are placed in the Reconstruction namespace and must be prefixed with * to indicate a reconstruction. Most full languages are natural (not constructed) languages, but a few constructed languages (e.g. Esperanto and Volapük, among others) are also allowed in the mainspace and considered regular languages.
- reconstructed: This language is not attested according to WT:CFI, and therefore is allowed only in the Reconstruction namespace. All terms in this language are reconstructed, and must be prefixed with *. Languages such as Proto-Indo-European and Proto-Germanic are in this category.
- appendix-constructed: This language is attested but does not meet the additional requirements set out for constructed languages (WT:CFI#Constructed languages). Its entries must therefore be in the Appendix namespace, but they are not reconstructed and therefore should not have * prefixed in links.
Language:hasType
function Language:hasType(...)
Given a list of types as strings, returns true if the language has all of them.
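The "all of them" semantics, and the reason for checking "language" together with "full", can be sketched with plain lookup tables. The `has_type` helper and both type tables are hypothetical illustrations (in particular, the "family" key is an assumption), not the module's code.

```lua
-- Hypothetical stand-in: returns true only if every requested type is present.
local function has_type(types, ...)
	for i = 1, select("#", ...) do
		if not types[(select(i, ...))] then
			return false
		end
	end
	return true
end

-- Lookup tables with the types as keys, as getTypes is described to return.
local french_types = { language = true, full = true, regular = true }
local family_types = { family = true, full = true } -- "family" key is assumed

print(has_type(french_types, "language", "full")) --> true
print(has_type(family_types, "full"))             --> true
print(has_type(family_types, "language", "full")) --> false
```

The last two calls show why checking "full" alone is not enough when the object might be a family.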
Language:getWikimediaLanguages
function Language:getWikimediaLanguages()
Returns a table containing WikimediaLanguage objects (see Module:wikimedia languages), which represent languages and their codes as they are used in Wikimedia projects for interwiki linking and such. More than one object may be returned, as a single Wiktionary language may correspond to multiple Wikimedia languages. For example, Wiktionary's single code sh (Serbo-Croatian) maps to four Wikimedia codes: sh (Serbo-Croatian), bs (Bosnian), hr (Croatian) and sr (Serbian). The code for the Wikimedia language is retrieved from the wikimedia_codes property in the data modules. If that property is not present, the code of the current language is used. If none of the available codes is actually a valid Wikimedia code, an empty table is returned.
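The fallback order described above can be sketched as follows. Everything here is a hypothetical stand-in: the `valid` set, the `wikimedia_codes` helper, and the choice to filter out invalid codes one by one are assumptions, and the real function returns WikimediaLanguage objects rather than code strings.

```lua
-- Hypothetical set of codes accepted by Wikimedia projects.
local valid = { sh = true, bs = true, hr = true, sr = true, fr = true }

-- code: the Wiktionary language code
-- wikimedia_codes_property: optional list taken from the data module
local function wikimedia_codes(code, wikimedia_codes_property)
	-- Fall back to the language's own code when the property is absent.
	local candidates = wikimedia_codes_property or { code }
	local result = {}
	for _, candidate in ipairs(candidates) do
		if valid[candidate] then
			result[#result + 1] = candidate
		end
	end
	return result -- empty if no candidate is a valid Wikimedia code
end

print(#wikimedia_codes("sh", { "sh", "bs", "hr", "sr" })) --> 4
print(wikimedia_codes("fr")[1])                           --> fr
print(#wikimedia_codes("qsb-grc"))                        --> 0
```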
Language:getWikimediaLanguageCodes
function Language:getWikimediaLanguageCodes()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getWikipediaArticle
function Language:getWikipediaArticle(noCategoryFallback, project)
Returns the name of the Wikipedia article for the language. project specifies the language and project to retrieve the article from, defaulting to "enwiki" for the English Wikipedia. Normally if specified it should be the project code for a specific-language Wikipedia e.g. "zhwiki" for the Chinese Wikipedia, but it can be any project, including non-Wikipedia ones. If the project is the English Wikipedia and the property wikipedia_article is present in the data module it will be used first. In all other cases, a sitelink will be generated from :getWikidataItem (if set). The resulting value (or lack of value) is cached so that subsequent calls are fast. If no value could be determined, and noCategoryFallback is false, :getCategoryName is used as fallback; otherwise, nil is returned. Note that if noCategoryFallback is nil or omitted, it defaults to false if the project is the English Wikipedia, otherwise to true. In other words, under normal circumstances, if the English Wikipedia article couldn't be retrieved, the return value will fall back to a link to the language's category, but this won't normally happen for any other project.
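The defaulting rule for noCategoryFallback is easy to misread, so here is a minimal sketch of just that rule. The helper name `resolve_no_category_fallback` is hypothetical and only mirrors the behaviour described above; it is not taken from the module.

```lua
-- Hypothetical helper: computes the effective value of noCategoryFallback.
local function resolve_no_category_fallback(noCategoryFallback, project)
	project = project or "enwiki" -- documented default project
	if noCategoryFallback == nil then
		-- Category fallback stays enabled only for the English Wikipedia.
		return project ~= "enwiki"
	end
	return noCategoryFallback
end

print(resolve_no_category_fallback(nil, nil))       --> false
print(resolve_no_category_fallback(nil, "zhwiki"))  --> true
print(resolve_no_category_fallback(false, "zhwiki")) --> false
```

An explicit true or false always wins; only nil triggers the project-dependent default.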
Language:makeWikipediaLink
function Language:makeWikipediaLink()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getCommonsCategory
function Language:getCommonsCategory()
Returns the name of the Wikimedia Commons category page for the language.
Language:getWikidataItem
function Language:getWikidataItem()
Returns the Wikidata item ID for the language, or nil. This corresponds to the second field in the data modules.
Language:getScripts
function Language:getScripts()
Returns a table of Script objects for all scripts that the language is written in. See Module:scripts.
Language:getScriptCodes
function Language:getScriptCodes()
Returns the table of script codes in the language's data file.
Language:findBestScript
function Language:findBestScript(text, forceDetect)
Given some text, this function iterates through the scripts of a given language and tries to find the script that best matches the text. It returns a Script object representing the script. If no match is found at all, it returns the None script object.
Language:getFamily
function Language:getFamily()
Returns a Family object for the language family that the language belongs to. See Module:families.
Language:getFamilyCode
function Language:getFamilyCode()
Returns the family code in the language's data file.
Language:getFamilyName
function Language:getFamilyName()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:inFamily
function Language:inFamily(...)
Check whether the language belongs to family (which can be a family code or object). A list of objects can be given in place of family; in that case, return true if the language belongs to any of the specified families. Note that some languages (in particular, certain creoles) can have multiple immediate ancestors potentially belonging to different families; in that case, return true if the language belongs to any of the specified families.
Language:getParent
function Language:getParent()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getParentCode
function Language:getParentCode()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getParentName
function Language:getParentName()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getParentChain
function Language:getParentChain()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:hasParent
function Language:hasParent(...)
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getFull
function Language:getFull()
If the language is etymology-only, this iterates through parents until a full language or family is found, and the corresponding object is returned. If the language is a full language, then it simply returns itself.
Language:getFullCode
function Language:getFullCode()
If the language is an etymology-only language, this iterates through parents until a full language or family is found, and the corresponding code is returned. If the language is a full language, then it simply returns the language code.
Language:getFullName
function Language:getFullName()
If the language is an etymology-only language, this iterates through parents until a full language or family is found, and the corresponding canonical name is returned. If the language is a full language, then it simply returns the canonical name of the language.
Language:getAncestors
function Language:getAncestors()
Returns a table of Language objects for all languages that this language is directly descended from. Generally this is only a single language, but creoles, pidgins and mixed languages can have multiple ancestors.
Language:getAncestorCodes
function Language:getAncestorCodes()
Returns a table of Language codes for all languages that this language is directly descended from. Generally this is only a single language, but creoles, pidgins and mixed languages can have multiple ancestors.
Language:hasAncestor
function Language:hasAncestor(...)
Given a list of language objects or codes, returns true if at least one of them is an ancestor. This includes any etymology-only children of that ancestor. If the language's ancestor(s) are etymology-only languages, it will also return true for the parent(s) of those languages (e.g. if Vulgar Latin is the ancestor, it will also return true for its parent, Latin). However, a parent is excluded from this if the ancestor is also ancestral to that parent (e.g. if Classical Persian is the ancestor, Persian would return false, because Classical Persian is also ancestral to Persian).
Language:getAncestorChain
function Language:getAncestorChain()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getAncestorChainOld
function Language:getAncestorChainOld()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getDescendants
function Language:getDescendants()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getDescendantCodes
function Language:getDescendantCodes()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getDescendantNames
function Language:getDescendantNames()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:hasDescendant
function Language:hasDescendant(...)
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getChildren
function Language:getChildren()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getChildrenCodes
function Language:getChildrenCodes()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getChildrenNames
function Language:getChildrenNames()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:hasChild
function Language:hasChild(...)
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getCategoryName
function Language:getCategoryName(nocap)
Returns the name of the main category of that language. Example: "French language" for French, whose category is at Category:French language. Unless optional argument nocap is given, the language name at the beginning of the returned value will be capitalized. This capitalization is correct for category names, but not if the language name is lowercase and the returned value of this function is used in the middle of a sentence.
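As a rough illustration of the nocap behaviour, the following standalone sketch mimics the capitalization rule in plain Lua. The function `category_name` and the language name "examplish" are hypothetical, the sketch assumes the category name is simply the canonical name followed by " language", and it handles only ASCII letters.

```lua
-- Hypothetical stand-in for the capitalization behaviour described above.
-- ASCII-only: real language names may need Unicode-aware casing.
local function category_name(canonical_name, nocap)
	local name = canonical_name .. " language"
	if not nocap then
		-- Uppercase a leading lowercase letter, as wanted for category titles.
		name = name:gsub("^%l", string.upper)
	end
	return name
end

print(category_name("French"))          --> French language
print(category_name("examplish"))       --> Examplish language
print(category_name("examplish", true)) --> examplish language
```

The third call shows the form suited to running text, where a lowercase language name should stay lowercase.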
Language:makeCategoryLink
function Language:makeCategoryLink()
Creates a link to the category; the link text is the canonical name.
Language:getStandardCharacters
function Language:getStandardCharacters(sc)
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:stripDiacritics
function Language:stripDiacritics(text, sc)
Strips diacritics from the display text text (in a language-specific fashion), which is in the script sc. If sc is omitted or nil, the script is autodetected. This also strips certain punctuation characters from the end and (in the case of Spanish inverted question marks and exclamation points) from the beginning; strips any whitespace at the end of the text or between the text and the final stripped punctuation characters; and applies some language-specific Unicode normalizations to replace discouraged characters with their prescribed alternatives. Returns the stripped text.
Language:logicalToPhysical
function Language:logicalToPhysical(pagename, is_reconstructed_or_appendix)
Convert a logical pagename (the pagename as it appears to the user, after diacritics and punctuation have been stripped) to a physical pagename (the pagename as it appears in the MediaWiki database). Reasons for a difference between the two are (a) unsupported titles such as [ ] (with square brackets in them), # (pound/hash sign) and ¯\_(ツ)_/¯ (with underscores), as well as overly long titles of various sorts; (b) "mammoth" pages that are split into parts (e.g. a, which is split into physical pagenames a/languages A to L and a/languages M to Z). For almost all purposes, you should work with logical and not physical pagenames. But there are certain use cases that require physical pagenames, such as checking the existence of a page or retrieving a page's contents.
pagename is the logical pagename to be converted. is_reconstructed_or_appendix indicates whether the page is in the Reconstruction or Appendix namespaces. If it is omitted or has the value nil, the pagename is checked for an initial asterisk, and if one is found, the page is assumed to be a Reconstruction page. Setting is_reconstructed_or_appendix to false or true disables this check and allows for mainspace pagenames that begin with an asterisk.
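The three-way behaviour of is_reconstructed_or_appendix (nil, false or true) can be sketched in plain Lua as follows. The helper `assume_reconstruction` is hypothetical and covers only the asterisk check described above, not the actual pagename conversion.

```lua
-- Hypothetical helper: decides whether a pagename is treated as a
-- Reconstruction/Appendix page, following the description above.
local function assume_reconstruction(pagename, is_reconstructed_or_appendix)
	if is_reconstructed_or_appendix == nil then
		-- No explicit value: fall back to the initial-asterisk check.
		return pagename:sub(1, 1) == "*"
	end
	-- An explicit true or false disables the check.
	return is_reconstructed_or_appendix
end

print(assume_reconstruction("*word"))        --> true
print(assume_reconstruction("*word", false)) --> false
print(assume_reconstruction("word", true))   --> true
```

The distinction between nil and false matters here: only nil triggers the asterisk check, so passing false is how a caller keeps a mainspace title that genuinely starts with an asterisk.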
Language:makeEntryName
function Language:makeEntryName(text, sc, is_reconstructed_or_appendix)
Strip the diacritics from a display pagename and convert the resulting logical pagename into a physical pagename. This allows you, for example, to retrieve the contents of the page or check its existence. WARNING: This is deprecated and will be going away. It is a simple composition of self:stripDiacritics and self:logicalToPhysical; most callers only want the former, and if you need both, call them both yourself.
text and sc are as in self:stripDiacritics, and is_reconstructed_or_appendix is as in self:logicalToPhysical.
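For callers that really do need both steps, the composition can be written out directly. The sketch below is only an illustration: `entry_name` is a hypothetical helper, and `stub_lang` is a stand-in object with placeholder behaviour so that the snippet runs outside the wiki; on the wiki a real Language object would be passed instead.

```lua
-- Hypothetical helper spelling out the composition described above.
local function entry_name(lang, text, sc, is_reconstructed_or_appendix)
	local logical = lang:stripDiacritics(text, sc)
	return lang:logicalToPhysical(logical, is_reconstructed_or_appendix)
end

-- Stand-in object (NOT a real Language object); both methods are placeholders.
local stub_lang = {
	stripDiacritics = function(self, text, sc)
		-- Placeholder: only drops a final full stop.
		return (text:gsub("%.$", ""))
	end,
	logicalToPhysical = function(self, pagename, is_reconstructed_or_appendix)
		-- Placeholder: marks the result so the call order is visible.
		return "physical:" .. pagename
	end,
}

print(entry_name(stub_lang, "abc.")) --> physical:abc
```

Callers that only need the logical pagename should stop after the stripDiacritics step.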
Language:generateForms
function Language:generateForms(text, sc)
Generates alternative forms using a specified method, and returns them as a table. If no method is specified, returns a table containing only the input term.
Language:makeSortKey
function Language:makeSortKey(text, sc)
Creates a sort key for the given stripped text, following the rules appropriate for the language. This removes diacritical marks from the stripped text if they are not considered significant for sorting, and may perform some other changes. Any initial hyphen is also removed, and anything in parentheses is removed as well. The sort_key setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the stripped text and returns a sortkey.
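The general shape of such a sort key can be sketched in plain Lua. This is a simplified toy, not the module's code: `toy_sort_key` is hypothetical, the `from` and `to` lists stand in for the replacements a language's sort_key setting might define, the order of the steps is an assumption, and only ASCII input is handled.

```lua
-- Toy sort-key builder (hypothetical; ASCII-only).
-- `from` is a list of Lua patterns, `to` the matching replacement strings.
local function toy_sort_key(text, from, to)
	text = text:lower()
	text = text:gsub("^%-", "")  -- remove an initial hyphen
	text = text:gsub("%b()", "") -- remove anything in parentheses
	for i, pattern in ipairs(from or {}) do
		text = text:gsub(pattern, (to and to[i]) or "")
	end
	return text:upper()
end

print(toy_sort_key("-ness"))                 --> NESS
print(toy_sort_key("colo(u)r"))              --> COLOR
print(toy_sort_key("Abc", { "b" }, { "x" })) --> AXC
```

A missing entry in `to` is treated as the empty string, so a pattern with no replacement simply deletes what it matches.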
Language:makeDisplayText
function Language:makeDisplayText(text, sc, keepPrefixes)
Make the display text (i.e. what is displayed on the page).
Language:transliterate
function Language:transliterate(text, sc, module_override)
Transliterates the text from the given script into the Latin script (see Wiktionary:Transliteration and romanization). The language must have the translit property for this to work; if it is not present, nil is returned.
The sc parameter is handled by the transliteration module, and how it is handled is specific to that module. Some transliteration modules may tolerate nil as the script, others require it to be one of the possible scripts that the module can transliterate, and will throw an error if it's not one of them. For this reason, the sc parameter should always be provided when writing non-language-specific code.
The module_override parameter is used to override the default module that is used to provide the transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no default module yet, or you want to demonstrate an alternative version of a transliteration module before making it official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked by Wiktionary:Tracking/languages/module_override.
Known bugs:
- This function assumes tr(s1) .. tr(s2) == tr(s1 .. s2). When this assumption fails, wikitext markup like ''' can cause wrong transliterations.
- HTML entities like &#39;, often used to escape wikitext markup, do not work.
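The first known bug is easier to see with a toy transliterator whose output depends on context. The function `toy_tr` below is purely hypothetical (it maps "ch" to "x" and any other "c" to "k"); it only demonstrates why transliterating the pieces around a markup boundary separately can differ from transliterating the whole string.

```lua
-- Hypothetical context-sensitive transliterator: "ch" -> "x", other "c" -> "k".
local function toy_tr(s)
	s = s:gsub("ch", "x")
	s = s:gsub("c", "k")
	return s
end

-- Whole string: the digraph is seen as a unit.
print(toy_tr("ch"))                --> x
-- Pieces split at a markup boundary (as in c'''h): the digraph is lost.
print(toy_tr("c") .. toy_tr("h"))  --> kh
```

Any transliteration scheme with digraphs or other context-dependent rules can break the assumption in the same way.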
Language:overrideManualTranslit
function Language:overrideManualTranslit(sc)
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:link_tr
function Language:link_tr(sc)
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:hasTranslit
function Language:hasTranslit()
Returns true if the language has a transliteration module, or false if it doesn't.
Language:hasDottedDotlessI
function Language:hasDottedDotlessI()
Returns true if the language uses the letters I/ı and İ/i, or false if it doesn't.
Language:toJSON
function Language:toJSON(opts)
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getData
function Language:getData(extra, raw)
This function is not for use in entries or other content pages. Returns a blob of data about the language. The format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes. If extra is set, any extra data in the relevant /extra module will be included. (Note that it will be included anyway if it has already been loaded into the language object.) If raw is set, then the returned data will not contain any data inherited from parent objects. Do NOT use these methods! All uses should be pre-approved on the talk page!
Language:loadInExtraData
function Language:loadInExtraData()
This function lacks documentation. Please add a description of its usages, inputs and outputs, or its difference from similar functions, or make it local to remove it from the function list.
Language:getDataModuleName
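function Language:getDataModuleName()
Returns the name of the module containing the language's data. This depends on the language code: the data submodules are split into two-letter codes, three-letter codes by their first letter, and longer codes containing hyphens.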
Language:getExtraDataModuleName
function Language:getExtraDataModuleName()
Returns the name of the module containing the language's extra data (the less frequently used data). The extra data submodules are split by language code in the same way as the main data submodules.
Error function
Subpages
See also
--[==[ intro:
This module implements fetching of language-specific information and processing text in a given language.

===Types of languages===

There are two types of languages: full languages and etymology-only languages. The essential difference is that only full languages appear in L2 headings in vocabulary entries, and hence categories like [[:Category:French nouns]] exist only for full languages. Etymology-only languages have either a full language or another etymology-only language as their parent (in the parent-child inheritance sense), and for etymology-only languages with another etymology-only language as their parent, a full language can always be derived by following the parent links upwards. For example, "Canadian French", code `fr-CA`, is an etymology-only language whose parent is the full language "French", code `fr`. An example of an etymology-only language with another etymology-only parent is "Northumbrian Old English", code `ang-nor`, which has "Anglian Old English", code `ang-ang` as its parent; this is an etymology-only language whose parent is "Old English", code `ang`, which is a full language. (This is because Northumbrian Old English is considered a variety of Anglian Old English.) Sometimes the parent is the "Undetermined" language, code `und`; this is the case, for example, for "substrate" languages such as "Pre-Greek", code `qsb-grc`, and "the BMAC substrate", code `qsb-bma`.

It is important to distinguish language ''parents'' from language ''ancestors''. The parent-child relationship is one of containment, i.e. if X is a child of Y, X is considered a variety of Y. On the other hand, the ancestor-descendant relationship is one of descent in time. For example, "Classical Latin", code `la-cla`, and "Late Latin", code `la-lat`, are both etymology-only languages with "Latin", code `la`, as their parents, because both of the former are varieties of Latin.
However, Late Latin does *NOT* have Classical Latin as its parent because Late Latin is *not* a variety of Classical Latin; rather, it is a descendant. There is in fact a separate `ancestors` field that is used to express the ancestor-descendant relationship, and Late Latin's ancestor is given as Classical Latin.

It is also important to note that sometimes an etymology-only language is actually the conceptual ancestor of its parent language. This happens, for example, with "Old Italian" (code `roa-oit`), which is an etymology-only variant of full language "Italian" (code `it`), and with "Old Latin" (code `itc-ola`), which is an etymology-only variant of Latin. In both cases, the full language has the etymology-only variant listed as an ancestor. This allows a Latin term to inherit from Old Latin using the {{tl|inh}} template (where in this template, "inheritance" refers to ancestral inheritance, i.e. inheritance in time, rather than in the parent-child sense); likewise for Italian and Old Italian.

Full languages come in three subtypes:
* {regular}: This indicates a full language that is attested according to [[WT:CFI]] and therefore permitted in the main namespace. There may also be reconstructed terms for the language, which are placed in the {Reconstruction} namespace and must be prefixed with * to indicate a reconstruction. Most full languages are natural (not constructed) languages, but a few constructed languages (e.g. Esperanto and Volapük, among others) are also allowed in the mainspace and considered regular languages.
* {reconstructed}: This language is not attested according to [[WT:CFI]], and therefore is allowed only in the {Reconstruction} namespace. All terms in this language are reconstructed, and must be prefixed with *. Languages such as Proto-Indo-European and Proto-Germanic are in this category.
* {appendix-constructed}: This language is attested but does not meet the additional requirements set out for constructed languages ([[WT:CFI#Constructed languages]]). Its entries must therefore be in the Appendix namespace, but they are not reconstructed and therefore should not have * prefixed in links. Most constructed languages are of this subtype.

Both full languages and etymology-only languages have a {Language} object associated with them, which is fetched using the {getByCode} function in [[Module:languages]] to convert a language code to a {Language} object. Depending on the options supplied to this function, etymology-only languages may or may not be accepted, and family codes may be accepted (returning a {Family} object as described in [[Module:families]]). There are also separate {getByCanonicalName} functions in [[Module:languages]] and [[Module:etymology languages]] to convert a language's canonical name to a {Language} object (depending on whether the canonical name refers to a full or etymology-only language).

===Textual representations===

Textual strings belonging to a given language come in several different ''text variants'':
# The ''input text'' is what the user supplies in wikitext, in the parameters to {{tl|m}}, {{tl|l}}, {{tl|ux}}, {{tl|t}}, {{tl|lang}} and the like.
# The ''corrected input text'' is the input text with some corrections and/or normalizations applied, such as bad-character replacements for certain languages, like replacing `l` or `1` with [[palochka]] in some languages written in Cyrillic. (FIXME: This currently goes under the name ''display text'' but that will be repurposed below. Also, [[User:Surjection]] suggests renaming this to ''normalized input text'', but "normalized" is used in a different sense in [[Module:usex]].)
# The ''display text'' is the text in the form as it will be displayed to the user. This is what appears in headwords, in usexes, in displayed internal links, etc.
This can include accent marks that are removed to form the stripped display text (see below), as well as embedded bracketed links that are variously processed further. The display text is generated from the corrected input text by applying language-specific transformations; for most languages, there will be no such transformations. The general reason for having a difference between input and display text is to allow for extra information in the input text that is not displayed to the user but is sent to the transliteration module. Note that having different display and input text is only supported currently through special-casing but will be generalized. Examples of transformations are: (1) Removing the {{cd|^}} that is used in certain East Asian (and possibly other unicameral) languages to indicate capitalization of the transliteration (which is currently special-cased); (2) for Korean, removing or otherwise processing hyphens (which is currently special-cased); (3) for Arabic, removing a ''sukūn'' diacritic placed over a ''tāʔ marbūṭa'' (like this: ةْ) to indicate that the ''tāʔ marbūṭa'' is pronounced and transliterated as /t/ instead of being silent [NOTE, NOT IMPLEMENTED YET]; (4) for Thai and Khmer, converting space-separated words to bracketed words and resolving respelling substitutions such as `[กรีน/กฺรีน]`, which indicate how to transliterate given words [NOTE, NOT IMPLEMENTED YET except in language-specific templates like {{tl|th-usex}}].
## The ''right-resolved display text'' is the result of removing brackets around one-part embedded links and resolving two-part embedded links into their right-hand components (i.e. converting two-part links into the displayed form). The process of right-resolution is what happens when you call {{cd|remove_links()}} in [[Module:links]] on some text. When applied to the display text, it produces exactly what the user sees, without any link markup.
# The ''stripped display text'' is the result of applying diacritic-stripping to the display text.
## The ''left-resolved stripped display text'' [NEED BETTER NAME] is the result of applying left-resolution to the stripped display text, i.e. similar to right-resolution but resolving two-part embedded links into their left-hand components (i.e. the linked-to page). If the display text refers to a single page, the result of applying diacritic stripping and left-resolution produces the ''logical pagename''.
# The ''physical pagename text'' is the result of converting the stripped display text into physical page links. If the stripped display text contains embedded links, the left side of those links is converted into physical page links; otherwise, the entire text is considered a pagename and converted in the same fashion. The conversion does three things: (1) converts characters not allowed in pagenames into their "unsupported title" representation, e.g. {{cd|Unsupported titles/`gt`}} in place of the logical name {{cd|>}}; (2) handles certain special-cased unsupported-title logical pagenames, such as {{cd|Unsupported titles/Space}} in place of {{cd|[space]}} and {{cd|Unsupported titles/Ancient Greek dish}} in place of a very long Greek name for a gourmet dish as found in Aristophanes; (3) converts "mammoth" pagenames such as [[a]] into their appropriate split component, e.g. [[a/languages A to L]].
# The ''source translit text'' is the text as supplied to the language-specific {{cd|transliterate()}} method. The form of the source translit text may need to be language-specific, e.g. Thai and Khmer will need the corrected input text, whereas other languages may need to work off the display text. [FIXME: It's still unclear to me how embedded bracketed links are handled in the existing code.] In general, embedded links need to be right-resolved (see above), but when this happens is unclear to me [FIXME].
Some languages have a chop-up-and-paste-together scheme that sends parts of the text through the transliterate mechanism, and for others (those listed with "cont" in {{cd|substitution}} in [[Module:languages/data]]) they receive the full input text, but preprocessed in certain ways. (The wisdom of this is still unclear to me.)
# The ''transliterated text'' (or ''transliteration'') is the result of transliterating the source translit text. Unlike for all the other text variants except the transcribed text, it is always in the Latin script.
# The ''transcribed text'' (or ''transcription'') is the result of transcribing the source translit text, where "transcription" here means a close approximation to the phonetic form of the language in languages (e.g. Akkadian, Sumerian, Ancient Egyptian, maybe Tibetan) that have a wide difference between the written letters and spoken form. Unlike for all the other text variants other than the transliterated text, it is always in the Latin script. Currently, the transcribed text is always supplied manually by the user; there is no such thing as a {{cd|transcribe()}} method on language objects.
# The ''sort key'' is the text used in sort keys for determining the placing of pages in categories they belong to. The sort key is generated from the pagename or a specified ''sort base'' by lowercasing, doing language-specific transformations and then uppercasing the result. If the sort base is supplied and is generated from input text, it needs to be converted to display text, have embedded links removed through right-resolution and have diacritic-stripping applied.
# There are other text variants that occur in usexes (specifically, there are normalized variants of several of the above text variants), but we can skip them for now.

The following methods exist on {Language} objects to convert between different text variants:
# {correctInputText} (currently called {makeDisplayText}): This converts input text to corrected input text.
# {stripDiacritics}: This converts to stripped display text. [FIXME: This needs some rethinking. In particular, {stripDiacritics} is sometimes called on input text, corrected input text or display text (in various paths inside of [[Module:links]], and, in the case of input text, usually from other modules). We need to make sure we don't try to convert input text to display text twice, but at the same time we need to support calling it directly on input text since so many modules do this. This means we need to add a parameter indicating whether the passed-in text is input, corrected input, or display text; if the former two, we call {correctInputText} ourselves.]
# {logicalToPhysical}: This converts logical pagenames to physical pagenames.
# {transliterate}: This appears to convert input text with embedded brackets removed into a transliteration. [FIXME: This needs some rethinking. In particular, it calls {processDisplayText} on its input, which won't work for Thai and Khmer, so we may need language-specific flags indicating whether to pass the input text directly to the language transliterate method. In addition, I'm not sure how embedded links are handled in the existing translit code; a lot of callers remove the links themselves before calling {transliterate()}, which I assume is wrong.]
# {makeSortKey}: This converts display text (?) to a sort key. [FIXME: Clarify this.]
]==]

local export = {}

local debug_track_module = "Module:debug/track"
local etymology_languages_data_module = "Module:etymology languages/data"
local families_module = "Module:families"
local headword_page_module = "Module:headword/page"
local json_module = "Module:JSON"
local language_like_module = "Module:language-like"
local languages_data_module = "Module:languages/data"
local languages_data_patterns_module = "Module:languages/data/patterns"
local links_data_module = "Module:links/data"
local load_module = "Module:load"
local scripts_module = "Module:scripts"
local scripts_data_module = "Module:scripts/data"
local string_encode_entities_module = "Module:string/encode entities"
local string_pattern_escape_module = "Module:string/patternEscape"
local string_replacement_escape_module = "Module:string/replacementEscape"
local string_utilities_module = "Module:string utilities"
local table_module = "Module:table"
local utilities_module = "Module:utilities"
local wikimedia_languages_module = "Module:wikimedia languages"

local mw = mw
local string = string
local table = table

local char = string.char
local concat = table.concat
local find = string.find
local floor = math.floor
local get_by_code -- Defined below.
local get_data_module_name -- Defined below.
local get_extra_data_module_name -- Defined below.
local getmetatable = getmetatable
local gmatch = string.gmatch
local gsub = string.gsub
local insert = table.insert
local ipairs = ipairs
local is_known_language_tag = mw.language.isKnownLanguageTag
local make_object -- Defined below.
local match = string.match
local next = next
local pairs = pairs
local remove = table.remove
local require = require
local select = select
local setmetatable = setmetatable
local sub = string.sub
local type = type
local unstrip = mw.text.unstrip

-- Loaded as needed by findBestScript.
local Hans_chars
local Hant_chars

local function check_object(...)
	check_object = require(utilities_module).check_object
	return check_object(...)
end

local function debug_track(...)
	debug_track = require(debug_track_module)
	return debug_track(...)
end

local function decode_entities(...)
	decode_entities = require(string_utilities_module).decode_entities
	return decode_entities(...)
end

local function decode_uri(...)
	decode_uri = require(string_utilities_module).decode_uri
	return decode_uri(...)
end

local function deep_copy(...)
	deep_copy = require(table_module).deepCopy
	return deep_copy(...)
end

local function encode_entities(...)
	encode_entities = require(string_encode_entities_module)
	return encode_entities(...)
end

local function get_L2_sort_key(...)
	get_L2_sort_key = require(headword_page_module).get_L2_sort_key
	return get_L2_sort_key(...)
end

local function get_script(...)
	get_script = require(scripts_module).getByCode
	return get_script(...)
end

local function find_best_script_without_lang(...)
	find_best_script_without_lang = require(scripts_module).findBestScriptWithoutLang
	return find_best_script_without_lang(...)
end

local function get_family(...)
	get_family = require(families_module).getByCode
	return get_family(...)
end

local function get_plaintext(...)
	get_plaintext = require(utilities_module).get_plaintext
	return get_plaintext(...)
end

local function get_wikimedia_lang(...)
	get_wikimedia_lang = require(wikimedia_languages_module).getByCode
	return get_wikimedia_lang(...)
end

local function keys_to_list(...)
	keys_to_list = require(table_module).keysToList
	return keys_to_list(...)
end

local function list_to_set(...)
	list_to_set = require(table_module).listToSet
	return list_to_set(...)
end

local function load_data(...)
	load_data = require(load_module).load_data
	return load_data(...)
end

local function make_family_object(...)
	make_family_object = require(families_module).makeObject
	return make_family_object(...)
end

local function pattern_escape(...)
	pattern_escape = require(string_pattern_escape_module)
	return pattern_escape(...)
end

local function replacement_escape(...)
	replacement_escape = require(string_replacement_escape_module)
	return replacement_escape(...)
end

local function safe_require(...)
	safe_require = require(load_module).safe_require
	return safe_require(...)
end

local function shallow_copy(...)
	shallow_copy = require(table_module).shallowCopy
	return shallow_copy(...)
end

local function split(...)
	split = require(string_utilities_module).split
	return split(...)
end

local function to_json(...)
	to_json = require(json_module).toJSON
	return to_json(...)
end

local function u(...)
	u = require(string_utilities_module).char
	return u(...)
end

local function ugsub(...)
	ugsub = require(string_utilities_module).gsub
	return ugsub(...)
end

local function ulen(...)
	ulen = require(string_utilities_module).len
	return ulen(...)
end

local function ulower(...)
	ulower = require(string_utilities_module).lower
	return ulower(...)
end

local function umatch(...)
	umatch = require(string_utilities_module).match
	return umatch(...)
end

local function uupper(...)
	uupper = require(string_utilities_module).upper
	return uupper(...)
end

local function track(page)
	debug_track("languages/" .. page)
	return true
end

local function normalize_code(code)
	return load_data(languages_data_module).aliases[code] or code
end

local function check_inputs(self, check, default, ...)
	local n = select("#", ...)
	if n == 0 then
		return false
	end
	local ret = check(self, (...))
	if ret ~= nil then
		return ret
	elseif n > 1 then
		local inputs = {...}
		for i = 2, n do
			ret = check(self, inputs[i])
			if ret ~= nil then
				return ret
			end
		end
	end
	return default
end

local function make_link(self, target, display)
	local prefix, main
	if self:getFamilyCode() == "qfa-sub" then
		prefix, main = display:match("^(the )(.*)")
		if not prefix then
			prefix, main = display:match("^(a )(.*)")
		end
	end
	return (prefix or "") .. "[[" .. target .. "|" .. (main or display) .. "]]"
end

-- Convert risky characters to HTML entities, which minimizes interference once returned (e.g. for "sms:a", "<!-- -->" etc.).
local function escape_risky_characters(text)
	-- Spacing characters in isolation generally need to be escaped in order to be properly processed by the MediaWiki software.
	if umatch(text, "^%s*$") then
		return encode_entities(text, text)
	end
	return encode_entities(text, "!#%&*+/:;<=>?@[\\]_{|}")
end

-- Temporarily convert various formatting characters to PUA to prevent them from being disrupted by the substitution process.
local function doTempSubstitutions(text, subbedChars, keepCarets, noTrim)
	-- Clone so that we don't insert any extra patterns into the table in package.loaded. For some reason, using require seems to keep memory use down; probably because the table is always cloned.
	local patterns = shallow_copy(require(languages_data_patterns_module))
	if keepCarets then
		insert(patterns, "((\\+)%^)")
		insert(patterns, "((%^))")
	end
	-- Ensure any whitespace at the beginning and end is temp substituted, to prevent it from being accidentally trimmed. We only want to trim any final spaces added during the substitution process (e.g. by a module), which means we only do this during the first round of temp substitutions.
	if not noTrim then
		insert(patterns, "^([\128-\191\244]*(%s+))")
		insert(patterns, "((%s+)[\128-\191\244]*)$")
	end
	-- Pre-substitution, of "[[" and "]]", which makes pattern matching more accurate.
	text = gsub(text, "%f[%[]%[%[", "\1"):gsub("%f[%]]%]%]", "\2")
	local i = #subbedChars
	for _, pattern in ipairs(patterns) do
		-- Patterns ending in \0 are for things like "[[" or "]]", so the inserted PUA are treated as breaks between terms by modules that scrape info from pages.
		local term_divider
		pattern = gsub(pattern, "%z$", function(divider)
			term_divider = divider == "\0"
			return ""
		end)
		text = gsub(text, pattern, function(...)
			local m = {...}
			local m1New = m[1]
			for k = 2, #m do
				local n = i + k - 1
				subbedChars[n] = m[k]
				local byte2 = floor(n / 4096) % 64 + (term_divider and 128 or 136)
				local byte3 = floor(n / 64) % 64 + 128
				local byte4 = n % 64 + 128
				m1New = gsub(m1New, pattern_escape(m[k]), "\244" .. char(byte2) .. char(byte3) .. char(byte4), 1)
			end
			i = i + #m - 1
			return m1New
		end)
	end
	text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]")
	return text, subbedChars
end

-- Reinsert any formatting that was temporarily substituted.
local function undoTempSubstitutions(text, subbedChars)
	for i = 1, #subbedChars do
		local byte2 = floor(i / 4096) % 64 + 128
		local byte3 = floor(i / 64) % 64 + 128
		local byte4 = i % 64 + 128
		text = gsub(text, "\244[" .. char(byte2) .. char(byte2 + 8) .. "]" .. char(byte3) .. char(byte4),
			replacement_escape(subbedChars[i]))
	end
	text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]")
	return text
end

-- Check if the raw text is an unsupported title, and if so return that. Otherwise, remove HTML entities. We do the pre-conversion to avoid loading the unsupported title list unnecessarily.
local function checkNoEntities(self, text)
	local textNoEnc = decode_entities(text)
	if textNoEnc ~= text and load_data(links_data_module).unsupported_titles[text] then
		return text
	else
		return textNoEnc
	end
end

-- If no script object is provided (or if it's invalid or None), get one.
local function checkScript(text, self, sc)
	if not check_object("script", true, sc) or sc:getCode() == "None" then
		return self:findBestScript(text)
	end
	return sc
end

local function normalize(text, sc)
	text = sc:fixDiscouragedSequences(text)
	return sc:toFixedNFD(text)
end

-- Subfunction of iterateSectionSubstitutions(). Process an individual chunk of text according to the specifications in
-- `substitution_data`. The input parameters are all as in the documentation of iterateSectionSubstitutions() except for
-- `recursed`, which is set to true if we called ourselves recursively to process a script-specific setting or
-- script-wide fallback. Returns two values: the processed text and the actual substitution data used to do the
-- substitutions (same as the `actual_substitution_data` return value to iterateSectionSubstitutions()).
local function doSubstitutions(self, text, sc, substitution_data, data_field, function_name, recursed)
	-- BE CAREFUL in this function because the value at any level can be `false`, which causes no processing to be done
	-- and blocks any further fallback processing.
	local actual_substitution_data = substitution_data
	-- If there are language-specific substitutes given in the data module, use those.
	if type(substitution_data) == "table" then
		-- If a script is specified, run this function with the script-specific data before continuing.
		local sc_code = sc:getCode()
		local has_substitution_data = false
		if substitution_data[sc_code] ~= nil then
			has_substitution_data = true
			if substitution_data[sc_code] then
				text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data[sc_code], data_field,
					function_name, true)
			end
		-- Hant, Hans and Hani are usually treated the same, so add a special case to avoid having to specify each one
		-- separately.
		elseif sc_code:match("^Han") and substitution_data.Hani ~= nil then
			has_substitution_data = true
			if substitution_data.Hani then
				text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data.Hani, data_field,
					function_name, true)
			end
		-- Substitution data with key 1 in the outer table may be given as a fallback.
		elseif substitution_data[1] ~= nil then
			has_substitution_data = true
			if substitution_data[1] then
				text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data[1], data_field,
					function_name, true)
			end
		end
		-- Iterate over all strings in the "from" subtable, and gsub with the corresponding string in "to". We work with
		-- the NFD decomposed forms, as this simplifies many substitutions.
		if substitution_data.from then
			has_substitution_data = true
			for i, from in ipairs(substitution_data.from) do
				-- Normalize each loop, to ensure multi-stage substitutions work correctly.
				text = sc:toFixedNFD(text)
				text = ugsub(text, sc:toFixedNFD(from), substitution_data.to[i] or "")
			end
		end

		if substitution_data.remove_diacritics then
			has_substitution_data = true
			text = sc:toFixedNFD(text)
			-- Convert exceptions to PUA.
			local remove_exceptions, substitutes = substitution_data.remove_exceptions
			if remove_exceptions then
				substitutes = {}
				local i = 0
				for _, exception in ipairs(remove_exceptions) do
					exception = sc:toFixedNFD(exception)
					text = ugsub(text, exception, function(m)
						i = i + 1
						local subst = u(0x80000 + i)
						substitutes[subst] = m
						return subst
					end)
				end
			end
			-- Strip diacritics.
			text = ugsub(text, "[" .. substitution_data.remove_diacritics .. "]", "")
			-- Convert exceptions back.
			if remove_exceptions then
				text = text:gsub("\242[\128-\191]*", substitutes)
			end
		end

		if not has_substitution_data and sc._data[data_field] then
			-- If language-specific sort key (etc.) is nil, fall back to script-wide sort key (etc.).
			text, actual_substitution_data = doSubstitutions(self, text, sc, sc._data[data_field], data_field,
				function_name, true)
		end
	elseif type(substitution_data) == "string" then
		-- If there is a dedicated function module, use that.
		local module = safe_require("Module:" .. substitution_data)
		if module then
			-- TODO: translit functions should take objects, not codes.
			-- TODO: translit functions should be called with form NFD.
			if function_name == "tr" then
				if not module[function_name] then
					error(("Internal error: Module [[%s]] has no function named 'tr'"):format(substitution_data))
				end
				text = module[function_name](text, self._code, sc:getCode())
			elseif function_name == "stripDiacritics" then
				-- FIXME, get rid of this arm after renaming makeEntryName -> stripDiacritics.
				if module[function_name] then
					text = module[function_name](sc:toFixedNFD(text), self, sc)
				elseif module.makeEntryName then
					text = module.makeEntryName(sc:toFixedNFD(text), self, sc)
				else
					error(("Internal error: Module [[%s]] has no function named 'stripDiacritics' or 'makeEntryName'"
						):format(substitution_data))
				end
			else
				if not module[function_name] then
					error(("Internal error: Module [[%s]] has no function named '%s'"):format(
						substitution_data, function_name))
				end
				text = module[function_name](sc:toFixedNFD(text), self, sc)
			end
		else
			error("Substitution data '" .. substitution_data .. "' does not match an existing module.")
		end
	elseif substitution_data == nil and sc._data[data_field] then
		-- If language-specific sort key (etc.) is nil, fall back to script-wide sort key (etc.).
		text, actual_substitution_data = doSubstitutions(self, text, sc, sc._data[data_field], data_field,
			function_name, true)
	end

	-- Don't normalize to NFC if this is the inner loop or if a module returned nil.
	if recursed or not text then
		return text, actual_substitution_data
	end
	-- Fix any discouraged sequences created during the substitution process, and normalize into the final form.
	return sc:toFixedNFC(sc:fixDiscouragedSequences(text)), actual_substitution_data
end

-- Split the text into sections, based on the presence of temporarily substituted formatting characters, then iterate
-- over each section to apply substitutions (e.g. transliteration or diacritic stripping). This avoids putting PUA
-- characters through language-specific modules, which may be unequipped for them.
-- This function is passed the following values:
-- * `self` (the Language object);
-- * `text` (the text to process);
-- * `sc` (the script of the text, which must be specified; callers should call checkScript() as needed to autodetect the
--   script of the text if not given explicitly by the user);
-- * `subbedChars` (an array of the same length as the text, indicating which characters have been substituted and by
--   what, or {nil} if no substitutions are to happen);
-- * `keepCarets` (passed through to doTempSubstitutions(); if set, carets, including backslash-escaped ones, are also
--   temporarily substituted);
-- * `substitution_data` (the data indicating which substitutions to apply, taken directly from `data_field` in the
--   language's data structure in a submodule of [[Module:languages/data]]);
-- * `data_field` (the data field from which `substitution_data` was fetched, such as "sort_key" or "strip_diacritics");
-- * `function_name` (the name of the function to call to do the substitution, in case `substitution_data` specifies a
--   module to do the substitution);
-- * `notrim` (don't trim whitespace at the edges of `text`; set when computing the sort key, because whitespace at the
--   beginning of a sort key is significant and causes the resulting page to be sorted at the beginning of the category
--   it's in).
-- Returns three values:
-- (1) the processed text;
-- (2) the value of `subbedChars` that was passed in, possibly modified with additional character substitutions; will be
--     {nil} if {nil} was passed in;
-- (3) the actual substitution data that was used to apply substitutions to `text`; this may be different from the value
--     of `substitution_data` passed in if that value recursively specified script-specific substitutions or if no
--     substitution data could be found in the language-specific data (e.g. {nil} was passed in or a structure was passed
--     in that had no setting for the script given in `sc`), but a script-wide fallback value was set; currently it is
--     only used by makeSortKey().
local function iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, substitution_data, data_field,
	function_name, notrim)
	local sections
	-- See [[Module:languages/data]].
	if not find(text, "\244") or load_data(languages_data_module).substitution[self._code] == "cont" then
		sections = {text}
	else
		sections = split(text, "\244[\128-\143][\128-\191]*", true)
	end

	local actual_substitution_data

	for _, section in ipairs(sections) do
		-- Don't bother processing empty strings or whitespace (which may also not be handled well by dedicated
		-- modules).
		if gsub(section, "%s+", "") ~= "" then
			local sub, this_actual_substitution_data = doSubstitutions(self, section, sc, substitution_data, data_field,
				function_name)
			actual_substitution_data = this_actual_substitution_data
			-- Second round of temporary substitutions, in case any formatting was added by the main substitution
			-- process. However, don't do this if the section contains formatting already (as it would have had to have
			-- been escaped to reach this stage, and therefore should be given as raw text).
			if sub and subbedChars then
				local noSub
				for _, pattern in ipairs(require(languages_data_patterns_module)) do
					if match(section, pattern .. "%z?") then
						noSub = true
					end
				end
				if not noSub then
					sub, subbedChars = doTempSubstitutions(sub, subbedChars, keepCarets, true)
				end
			end
			if not sub then
				text = sub
				break
			end
			text = sub and gsub(text, pattern_escape(section), replacement_escape(sub), 1) or text
		end
	end

	if not notrim then
		-- Trim, unless there are only spacing characters, while ignoring any final formatting characters.
		-- Do not trim sort keys because spaces at the beginning are significant.
		text = text and text:gsub("^([\128-\191\244]*)%s+(%S)", "%1%2"):gsub("(%S)%s+([\128-\191\244]*)$", "%1%2") or nil
	end

	return text, subbedChars, actual_substitution_data
end

-- Process carets (and any escapes). Default to simple removal, if no pattern/replacement is given.
local function processCarets(text, pattern, repl)
	local rep
	repeat
		text, rep = gsub(text, "\\\\(\\*^)", "\3%1")
	until rep == 0
	return (text:gsub("\\^", "\4")
		:gsub(pattern or "%^", repl or "")
		:gsub("\3", "\\")
		:gsub("\4", "^"))
end

-- Remove carets if they are used to capitalize parts of transliterations (unless they have been escaped).
local function removeCarets(text, sc)
	if not sc:hasCapitalization() and sc:isTransliterated() and text:find("^", 1, true) then
		return processCarets(text)
	else
		return text
	end
end

local Language = {}

--[==[Returns the language code of the language. Example: {{code|lua|"fr"}} for French.]==]
function Language:getCode()
	return self._code
end

--[==[Returns the canonical name of the language. This is the name used to represent that language on Wiktionary, and is guaranteed to be unique to that language alone. Example: {{code|lua|"French"}} for French.]==]
function Language:getCanonicalName()
	local name = self._name
	if name == nil then
		name = self._data[1]
		self._name = name
	end
	return name
end

--[==[
Return the display form of the language. The display form of a language, family or script is the form it takes when appearing as the <code><var>source</var></code> in categories such as <code>English terms derived from <var>source</var></code> or <code>English given names from <var>source</var></code>, and is also the displayed text in {makeCategoryLink()} links. For full and etymology-only languages, this is the same as the canonical name, but for families, it reads <code>"<var>name</var> languages"</code> (e.g. {"Indo-Iranian languages"}), and for scripts, it reads <code>"<var>name</var> script"</code> (e.g. {"Arabic script"}).
]==]
function Language:getDisplayForm()
	local form = self._displayForm
	if form == nil then
		form = self:getCanonicalName()
		-- Add article and " substrate" to substrates that lack them.
		if self:getFamilyCode() == "qfa-sub" then
			if not (sub(form, 1, 4) == "the " or sub(form, 1, 2) == "a ") then
				form = "a " .. form
			end
			if not match(form, " [Ss]ubstrate") then
				form = form .. " substrate"
			end
		end
		self._displayForm = form
	end
	return form
end

--[==[Returns the value which should be used in the HTML lang= attribute for tagged text in the language.]==]
function Language:getHTMLAttribute(sc, region)
	local code = self._code
	if not find(code, "-", 1, true) then
		return code .. "-" .. sc:getCode() .. (region and "-" .. region or "")
	end
	local parent = self:getParent()
	region = region or match(code, "%f[%u][%u-]+%f[%U]")
	if parent then
		return parent:getHTMLAttribute(sc, region)
	end
	-- TODO: ISO family codes can also be used.
	return "mis-" .. sc:getCode() .. (region and "-" .. region or "")
end

--[==[Returns a table of the aliases that the language is known by, excluding the canonical name. Aliases are synonyms for the language in question. The names are not guaranteed to be unique, in that sometimes more than one language is known by the same name. Example: {{code|lua|{"High German", "New High German", "Deutsch"} }} for [[:Category:German language|German]].]==]
function Language:getAliases()
	self:loadInExtraData()
	return require(language_like_module).getAliases(self)
end

--[==[
Return a table of the known subvarieties of a given language, excluding subvarieties that have been given explicit etymology-only language codes. The names are not guaranteed to be unique, in that sometimes a given name refers to a subvariety of more than one language. Example: {{code|lua|{"Southern Aymara", "Central Aymara"} }} for [[:Category:Aymara language|Aymara]]. Note that the returned value can have nested tables in it, when a subvariety goes by more than one name. Example: {{code|lua|{"North Azerbaijani", "South Azerbaijani", {"Afshar", "Afshari", "Afshar Azerbaijani", "Afchar"}, {"Qashqa'i", "Qashqai", "Kashkay"}, "Sonqor"} }} for [[:Category:Azerbaijani language|Azerbaijani]]. Here, for example, Afshar, Afshari, Afshar Azerbaijani and Afchar all refer to the same subvariety, whose preferred name is Afshar (the one listed first).
To avoid a return value with nested tables in it, specify a non-{{code|lua|nil}} value for the <code>flatten</code> parameter; in that case, the return value would be {{code|lua|{"North Azerbaijani", "South Azerbaijani", "Afshar", "Afshari", "Afshar Azerbaijani", "Afchar", "Qashqa'i", "Qashqai", "Kashkay", "Sonqor"} }}.
]==]
function Language:getVarieties(flatten)
	self:loadInExtraData()
	return require(language_like_module).getVarieties(self, flatten)
end

--[==[Returns a table of the "other names" that the language is known by, which are listed in the <code>otherNames</code> field. It should be noted that the <code>otherNames</code> field itself is deprecated, and entries listed there should eventually be moved to either <code>aliases</code> or <code>varieties</code>.]==]
function Language:getOtherNames() -- To be eventually removed, once there are no more uses of the `otherNames` field.
	self:loadInExtraData()
	return require(language_like_module).getOtherNames(self)
end

--[==[
Return a combined table of the canonical name, aliases, varieties and other names of a given language.]==]
function Language:getAllNames()
	self:loadInExtraData()
	return require(language_like_module).getAllNames(self)
end

--[==[Returns a table of types as a lookup table (with the types as keys).

The possible types are
* {language}: This is a language, either full or etymology-only.
* {full}: This is a "full" (not etymology-only) language, i.e. the union of {regular}, {reconstructed} and
  {appendix-constructed}. Note that the types {full} and {etymology-only} also exist for families, so if you
  want to check specifically for a full language and you have an object that might be a family, you should
  use {{lua|hasType("language", "full")}} and not simply {{lua|hasType("full")}}.
* {etymology-only}: This is an etymology-only (not full) language, whose parent is another etymology-only
  language or a full language.
  Note that the types {full} and {etymology-only} also exist for families, so if
  you want to check specifically for an etymology-only language and you have an object that might be a family,
  you should use {{lua|hasType("language", "etymology-only")}} and not simply {{lua|hasType("etymology-only")}}.
* {regular}: This indicates a full language that is attested according to [[WT:CFI]] and therefore permitted
  in the main namespace. There may also be reconstructed terms for the language, which are placed in the
  {Reconstruction} namespace and must be prefixed with * to indicate a reconstruction. Most full languages
  are natural (not constructed) languages, but a few constructed languages (e.g. Esperanto and Volapük,
  among others) are also allowed in the mainspace and considered regular languages.
* {reconstructed}: This language is not attested according to [[WT:CFI]], and therefore is allowed only in the
  {Reconstruction} namespace. All terms in this language are reconstructed, and must be prefixed with *.
  Languages such as Proto-Indo-European and Proto-Germanic are in this category.
* {appendix-constructed}: This language is attested but does not meet the additional requirements set out for
  constructed languages ([[WT:CFI#Constructed languages]]). Its entries must therefore be in the Appendix
  namespace, but they are not reconstructed and therefore should not have * prefixed in links.
]==]
function Language:getTypes()
	local types = self._types
	if types == nil then
		types = {language = true}
		if self:getFullCode() == self._code then
			types.full = true
		else
			types["etymology-only"] = true
		end
		for t in gmatch(self._data.type, "[^,]+") do
			types[t] = true
		end
		self._types = types
	end
	return types
end

--[==[Given a list of types as strings, returns true if the language has all of them.]==]
function Language:hasType(...)
	Language.hasType = require(language_like_module).hasType
	return self:hasType(...)
end

--[==[Returns a table containing <code>WikimediaLanguage</code> objects (see [[Module:wikimedia languages]]), which represent languages and their codes as they are used in Wikimedia projects for interwiki linking and such. More than one object may be returned, as a single Wiktionary language may correspond to multiple Wikimedia languages. For example, Wiktionary's single code <code>sh</code> (Serbo-Croatian) maps to four Wikimedia codes: <code>sh</code> (Serbo-Croatian), <code>bs</code> (Bosnian), <code>hr</code> (Croatian) and <code>sr</code> (Serbian).
The code for the Wikimedia language is retrieved from the <code>wikimedia_codes</code> property in the data modules. If that property is not present, the code of the current language is used. If none of the available codes is actually a valid Wikimedia code, an empty table is returned.]==]
function Language:getWikimediaLanguages()
	local wm_langs = self._wikimediaLanguageObjects
	if wm_langs == nil then
		local codes = self:getWikimediaLanguageCodes()
		wm_langs = {}
		for i = 1, #codes do
			wm_langs[i] = get_wikimedia_lang(codes[i])
		end
		self._wikimediaLanguageObjects = wm_langs
	end
	return wm_langs
end

function Language:getWikimediaLanguageCodes()
	local wm_langs = self._wikimediaLanguageCodes
	if wm_langs == nil then
		wm_langs = self._data.wikimedia_codes
		if wm_langs then
			wm_langs = split(wm_langs, ",", true, true)
		else
			local code = self._code
			if is_known_language_tag(code) then
				wm_langs = {code}
			else
				-- Inherit, but only if no codes are specified in the data *and*
				-- the language code isn't a valid Wikimedia language code.
				local parent = self:getParent()
				wm_langs = parent and parent:getWikimediaLanguageCodes() or {}
			end
		end
		self._wikimediaLanguageCodes = wm_langs
	end
	return wm_langs
end

--[==[
Returns the name of the Wikipedia article for the language. `project` specifies the language and project to retrieve the article from, defaulting to {"enwiki"} for the English Wikipedia. Normally if specified it should be the project code for a specific-language Wikipedia e.g. "zhwiki" for the Chinese Wikipedia, but it can be any project, including non-Wikipedia ones. If the project is the English Wikipedia and the property {wikipedia_article} is present in the data module it will be used first. In all other cases, a sitelink will be generated from {:getWikidataItem} (if set). The resulting value (or lack of value) is cached so that subsequent calls are fast. If no value could be determined, and `noCategoryFallback` is {false}, {:getCategoryName} is used as fallback; otherwise, {nil} is returned. Note that if `noCategoryFallback` is {nil} or omitted, it defaults to {false} if the project is the English Wikipedia, otherwise to {true}. In other words, under normal circumstances, if the English Wikipedia article couldn't be retrieved, the return value will fall back to a link to the language's category, but this won't normally happen for any other project.
]==]
function Language:getWikipediaArticle(noCategoryFallback, project)
	Language.getWikipediaArticle = require(language_like_module).getWikipediaArticle
	return self:getWikipediaArticle(noCategoryFallback, project)
end

function Language:makeWikipediaLink()
	return make_link(self, "w:" .. self:getWikipediaArticle(), self:getCanonicalName())
end

--[==[Returns the name of the Wikimedia Commons category page for the language.]==]
function Language:getCommonsCategory()
	Language.getCommonsCategory = require(language_like_module).getCommonsCategory
	return self:getCommonsCategory()
end

--[==[Returns the Wikidata item id for the language or <code>nil</code>. This corresponds to the second field in the data modules.]==]
function Language:getWikidataItem()
	Language.getWikidataItem = require(language_like_module).getWikidataItem
	return self:getWikidataItem()
end

--[==[Returns a table of <code>Script</code> objects for all scripts that the language is written in.
See [[Module:scripts]].]==]
function Language:getScripts()
	local scripts = self._scriptObjects
	if scripts == nil then
		local codes = self:getScriptCodes()
		if codes[1] == "All" then
			scripts = load_data(scripts_data_module)
		else
			scripts = {}
			for i = 1, #codes do
				scripts[i] = get_script(codes[i])
			end
		end
		self._scriptObjects = scripts
	end
	return scripts
end

--[==[Returns the table of script codes in the language's data file.]==]
function Language:getScriptCodes()
	local scripts = self._scriptCodes
	if scripts == nil then
		scripts = self._data[4]
		if scripts then
			local codes, n = {}, 0
			for code in gmatch(scripts, "[^,]+") do
				n = n + 1
				-- Special handling of "Hants", which represents "Hani", "Hant" and "Hans" collectively.
				if code == "Hants" then
					codes[n] = "Hani"
					codes[n + 1] = "Hant"
					codes[n + 2] = "Hans"
					n = n + 2
				else
					codes[n] = code
				end
			end
			scripts = codes
		else
			scripts = {"None"}
		end
		self._scriptCodes = scripts
	end
	return scripts
end

--[==[Given some text, this function iterates through the scripts of a given language and tries to find the script that best matches the text. It returns a {{code|lua|Script}} object representing the script. If no match is found at all, it returns the {{code|lua|None}} script object.]==]
function Language:findBestScript(text, forceDetect)
	if not text or text == "" or text == "-" then
		return get_script("None")
	end

	-- Differs from table returned by getScriptCodes, as Hants is not normalized into its constituents.
	local codes = self._bestScriptCodes
	if codes == nil then
		codes = self._data[4]
		codes = codes and split(codes, ",", true, true) or {"None"}
		self._bestScriptCodes = codes
	end

	local first_sc = codes[1]

	if first_sc == "All" then
		return find_best_script_without_lang(text)
	end

	local codes_len = #codes

	if not (forceDetect or first_sc == "Hants" or codes_len > 1) then
		first_sc = get_script(first_sc)
		local charset = first_sc.characters
		return charset and umatch(text, "[" .. charset .. "]") and first_sc or get_script("None")
	end

	-- Remove all formatting characters.
	text = get_plaintext(text)

	-- Remove all spaces and any ASCII punctuation.
	-- Some non-ASCII punctuation is script-specific, so can't be removed.
	text = ugsub(text, "[%s!\"#%%&'()*,%-./:;?@[\\%]_{}]+", "")
	if #text == 0 then
		return get_script("None")
	end

	-- Try to match every script against the text,
	-- and return the one with the most matching characters.
	local bestcount, bestscript, length = 0
	for i = 1, codes_len do
		local sc = codes[i]
		-- Special case for "Hants", which is a special code that represents whichever of "Hant" or "Hans" best matches, or "Hani" if they match equally. This avoids having to list all three. In addition, "Hants" will be treated as the best match if there is at least one matching character, under the assumption that a Han script is desirable in terms that contain a mix of Han and other scripts (not counting those which use Jpan or Kore).
		if sc == "Hants" then
			local Hani = get_script("Hani")
			if not Hant_chars then
				Hant_chars = load_data("Module:zh/data/ts")
				Hans_chars = load_data("Module:zh/data/st")
			end
			local t, s, found = 0, 0
			-- This is faster than using mw.ustring.gmatch directly.
			for ch in gmatch((ugsub(text, "[" .. Hani.characters .. "]", "\255%0")), "\255(.[\128-\191]*)") do
				found = true
				if Hant_chars[ch] then
					t = t + 1
					if Hans_chars[ch] then
						s = s + 1
					end
				elseif Hans_chars[ch] then
					s = s + 1
				else
					t, s = t + 1, s + 1
				end
			end

			if found then
				if t == s then
					return Hani
				end
				return get_script(t > s and "Hant" or "Hans")
			end
		else
			sc = get_script(sc)

			if not length then
				length = ulen(text)
			end

			-- Count characters by removing everything in the script's charset and comparing to the original length.
			local charset = sc.characters
			local count = charset and length - ulen((ugsub(text, "[" .. charset .. "]+", ""))) or 0

			if count >= length then
				return sc
			elseif count > bestcount then
				bestcount = count
				bestscript = sc
			end
		end
	end

	-- Return best matching script, or otherwise None.
	return bestscript or get_script("None")
end

--[==[Returns a <code>Family</code> object for the language family that the language belongs to.
See [[Module:families]].]==]
function Language:getFamily()
	local family = self._familyObject
	if family == nil then
		family = self:getFamilyCode()
		-- If the value is nil, it's cached as false.
		family = family and get_family(family) or false
		self._familyObject = family
	end
	return family or nil
end

--[==[Returns the family code in the language's data file.]==]
function Language:getFamilyCode()
	local family = self._familyCode
	if family == nil then
		-- If the value is nil, it's cached as false.
		family = self._data[3] or false
		self._familyCode = family
	end
	return family or nil
end

function Language:getFamilyName()
	local family = self._familyName
	if family == nil then
		family = self:getFamily()
		-- If the value is nil, it's cached as false.
		family = family and family:getCanonicalName() or false
		self._familyName = family
	end
	return family or nil
end

do
	local function check_family(self, family)
		if type(family) == "table" then
			family = family:getCode()
		end
		if self:getFamilyCode() == family then
			return true
		end
		local self_family = self:getFamily()
		if self_family:inFamily(family) then
			return true
		-- If the family isn't a real family (e.g. creoles) check any ancestors.
		elseif self_family:inFamily("qfa-not") then
			local ancestors = self:getAncestors()
			for _, ancestor in ipairs(ancestors) do
				if ancestor:inFamily(family) then
					return true
				end
			end
		end
	end

	--[==[Check whether the language belongs to `family` (which can be a family code or object). A list of objects can be given in place of `family`; in that case, return true if the language belongs to any of the specified families. Note that some languages (in particular, certain creoles) can have multiple immediate ancestors potentially belonging to different families; in that case, return true if the language belongs to any of the specified families.]==]
	function Language:inFamily(...)
		if self:getFamilyCode() == nil then
			return false
		end
		return check_inputs(self, check_family, false, ...)
	end
end

function Language:getParent()
	local parent = self._parentObject
	if parent == nil then
		parent = self:getParentCode()
		-- If the value is nil, it's cached as false.
		parent = parent and get_by_code(parent, nil, true, true) or false
		self._parentObject = parent
	end
	return parent or nil
end

function Language:getParentCode()
	local parent = self._parentCode
	if parent == nil then
		-- If the value is nil, it's cached as false.
		parent = self._data.parent or false
		self._parentCode = parent
	end
	return parent or nil
end

function Language:getParentName()
	local parent = self._parentName
	if parent == nil then
		parent = self:getParent()
		-- If the value is nil, it's cached as false.
		parent = parent and parent:getCanonicalName() or false
		self._parentName = parent
	end
	return parent or nil
end

function Language:getParentChain()
	local chain = self._parentChain
	if chain == nil then
		chain = {}
		local parent, n = self:getParent(), 0
		while parent do
			n = n + 1
			chain[n] = parent
			parent = parent:getParent()
		end
		self._parentChain = chain
	end
	return chain
end

do
	local function check_lang(self, lang)
		for _, parent in ipairs(self:getParentChain()) do
			if (type(lang) == "string" and lang or lang:getCode()) == parent:getCode() then
				return true
			end
		end
	end

	function Language:hasParent(...)
		return check_inputs(self, check_lang, false, ...)
	end
end

--[==[
If the language is etymology-only, this iterates through parents until a full language or family is found, and the corresponding object is returned. If the language is a full language, then it simply returns itself.
]==]
function Language:getFull()
	local full = self._fullObject
	if full == nil then
		full = self:getFullCode()
		full = full == self._code and self or get_by_code(full)
		self._fullObject = full
	end
	return full
end

--[==[
If the language is an etymology-only language, this iterates through parents until a full language or family is found, and the corresponding code is returned. If the language is a full language, then it simply returns the language code.
]==]
function Language:getFullCode()
	return self._fullCode or self._code
end

--[==[
If the language is an etymology-only language, this iterates through parents until a full language or family is found, and the corresponding canonical name is returned. If the language is a full language, then it simply returns the canonical name of the language.
]==]
function Language:getFullName()
	local full = self._fullName
	if full == nil then
		full = self:getFull():getCanonicalName()
		self._fullName = full
	end
	return full
end

--[==[Returns a table of <code class="nf">Language</code> objects for all languages that this language is directly descended from. Generally this is only a single language, but creoles, pidgins and mixed languages can have multiple ancestors.]==]
function Language:getAncestors()
	local ancestors = self._ancestorObjects
	if ancestors == nil then
		ancestors = {}
		local ancestor_codes = self:getAncestorCodes()
		if #ancestor_codes > 0 then
			for _, ancestor in ipairs(ancestor_codes) do
				insert(ancestors, get_by_code(ancestor, nil, true))
			end
		else
			local fam = self:getFamily()
			local protoLang = fam and fam:getProtoLanguage() or nil
			-- For the cases where the current language is the proto-language
			-- of its family, or an etymology-only language that is ancestral to that
			-- proto-language, we need to step up a level higher right from the
			-- start.
			if protoLang and (
				protoLang:getCode() == self._code or
				(self:hasType("etymology-only") and protoLang:hasAncestor(self))
			) then
				fam = fam:getFamily()
				protoLang = fam and fam:getProtoLanguage() or nil
			end
			while not protoLang and not (not fam or fam:getCode() == "qfa-not") do
				fam = fam:getFamily()
				protoLang = fam and fam:getProtoLanguage() or nil
			end
			insert(ancestors, protoLang)
		end
		self._ancestorObjects = ancestors
	end
	return ancestors
end

do
	-- Avoid a language being its own ancestor via class inheritance. We only need to check for this if the language has inherited an ancestor table from its parent, because we never want to drop ancestors that have been explicitly set in the data.
	-- Recursively iterate over ancestors until we either find self or run out. If self is found, return true.
	local function check_ancestor(self, lang)
		local codes = lang:getAncestorCodes()
		if not codes then
			return nil
		end
		for i = 1, #codes do
			local code = codes[i]
			if code == self._code then
				return true
			end
			local anc = get_by_code(code, nil, true)
			if check_ancestor(self, anc) then
				return true
			end
		end
	end

	--[==[Returns a table of <code class="nf">Language</code> codes for all languages that this language is directly descended from. Generally this is only a single language, but creoles, pidgins and mixed languages can have multiple ancestors.]==]
	function Language:getAncestorCodes()
		if self._ancestorCodes then
			return self._ancestorCodes
		end
		local data = self._data
		local codes = data.ancestors
		if codes == nil then
			codes = {}
			self._ancestorCodes = codes
			return codes
		end
		codes = split(codes, ",", true, true)
		self._ancestorCodes = codes
		-- If there are no codes or the ancestors weren't inherited data, there's nothing left to check.
		if #codes == 0 or self:getData(false, "raw").ancestors ~= nil then
			return codes
		end
		local i, code = 1
		while i <= #codes do
			code = codes[i]
			if check_ancestor(self, self) then
				remove(codes, i)
			else
				i = i + 1
			end
		end
		return codes
	end
end

--[==[Given a list of language objects or codes, returns true if at least one of them is an ancestor. This includes any etymology-only children of that ancestor. If the language's ancestor(s) are etymology-only languages, it will also return true for those language parent(s) (e.g. if Vulgar Latin is the ancestor, it will also return true for its parent, Latin). However, a parent is excluded from this if the ancestor is also ancestral to that parent (e.g. if Classical Persian is the ancestor, Persian would return false, because Classical Persian is also ancestral to Persian).]==]
function Language:hasAncestor(...)
	local function iterateOverAncestorTree(node, func, parent_check)
		local ancestors = node:getAncestors()
		local ancestorsParents = {}
		for _, ancestor in ipairs(ancestors) do
			-- When checking the parents of the other language, and the ancestor is also a parent, skip to the next ancestor, so that we exclude any etymology-only children of that parent that are not directly related (see below).
			local ret = (parent_check or not node:hasParent(ancestor)) and
				func(ancestor) or iterateOverAncestorTree(ancestor, func, parent_check)
			if ret then
				return ret
			end
		end
		-- Check the parents of any ancestors. We don't do this if checking the parents of the other language, so that we exclude any etymology-only children of those parents that are not directly related (e.g. if the ancestor is Vulgar Latin and we are checking New Latin, we want it to return false because they are on different ancestral branches. As such, if we're already checking the parent of New Latin (Latin) we don't want to compare it to the parent of the ancestor (Latin), as this would be a false positive; it should be one or the other).
		if not parent_check then
			return nil
		end
		for _, ancestor in ipairs(ancestors) do
			local ancestorParents = ancestor:getParentChain()
			for _, ancestorParent in ipairs(ancestorParents) do
				if ancestorParent:getCode() == self._code or ancestorParent:hasAncestor(ancestor) then
					break
				else
					insert(ancestorsParents, ancestorParent)
				end
			end
		end
		for _, ancestorParent in ipairs(ancestorsParents) do
			local ret = func(ancestorParent)
			if ret then
				return ret
			end
		end
	end

	local function do_iteration(otherlang, parent_check)
		-- otherlang can't be self
		if (type(otherlang) == "string" and otherlang or otherlang:getCode()) == self._code then
			return false
		end
		repeat
			if iterateOverAncestorTree(
				self,
				function(ancestor)
					return ancestor:getCode() == (type(otherlang) == "string" and otherlang or otherlang:getCode())
				end,
				parent_check
			) then
				return true
			elseif type(otherlang) == "string" then
				otherlang = get_by_code(otherlang, nil, true)
			end
			otherlang = otherlang:getParent()
			parent_check = false
		until not otherlang
	end

	local parent_check = true
	for _, otherlang in ipairs{...} do
		local ret = do_iteration(otherlang, parent_check)
		if ret then
			return true
		end
	end
	return false
end

do
	local function construct_node(lang, memo)
		local branch, ancestors = {lang = lang:getCode()}
		memo[lang:getCode()] = branch
		for _, ancestor in ipairs(lang:getAncestors()) do
			if ancestors == nil then
				ancestors = {}
			end
			insert(ancestors, memo[ancestor:getCode()] or construct_node(ancestor, memo))
		end
		branch.ancestors = ancestors
		return branch
	end

	function Language:getAncestorChain()
		local chain = self._ancestorChain
		if chain == nil then
			chain = construct_node(self, {})
			self._ancestorChain = chain
		end
		return chain
	end
end

function Language:getAncestorChainOld()
	local chain = self._ancestorChain
	if chain == nil then
		chain = {}
		local step = self
		while true do
			local ancestors = step:getAncestors()
			step = #ancestors == 1 and ancestors[1] or nil
			if not step then
				break
			end
			insert(chain, step)
		end
		self._ancestorChain = chain
	end
	return chain
end

local function fetch_descendants(self, fmt)
	local descendants, family = {}, self:getFamily()
	-- Iterate over all three datasets.
	for _, data in ipairs{
		require("Module:languages/code to canonical name"),
		require("Module:etymology languages/code to canonical name"),
		require("Module:families/code to canonical name"),
	} do
		for code in pairs(data) do
			local lang = get_by_code(code, nil, true, true)
			-- Test for a descendant. Earlier tests weed out most candidates, while the more intensive tests are only
			-- used sparingly.
			if (
				code ~= self._code and -- Not self.
				lang:inFamily(family) and -- In the same family.
				(
					family:getProtoLanguageCode() == self._code or -- Self is the protolanguage.
					self:hasDescendant(lang) or -- Full hasDescendant check.
					(lang:getFullCode() == self._code and not self:hasAncestor(lang)) -- Etymology-only child which isn't an ancestor.
				)
			) then
				if fmt == "object" then
					insert(descendants, lang)
				elseif fmt == "code" then
					insert(descendants, code)
				elseif fmt == "name" then
					insert(descendants, lang:getCanonicalName())
				end
			end
		end
	end
	return descendants
end

function Language:getDescendants()
	local descendants = self._descendantObjects
	if descendants == nil then
		descendants = fetch_descendants(self, "object")
		self._descendantObjects = descendants
	end
	return descendants
end

function Language:getDescendantCodes()
	local descendants = self._descendantCodes
	if descendants == nil then
		descendants = fetch_descendants(self, "code")
		self._descendantCodes = descendants
	end
	return descendants
end

function Language:getDescendantNames()
	local descendants = self._descendantNames
	if descendants == nil then
		descendants = fetch_descendants(self, "name")
		self._descendantNames = descendants
	end
	return descendants
end

do
	local function check_lang(self, lang)
		if type(lang) == "string" then
			lang = get_by_code(lang, nil, true)
		end
		if lang:hasAncestor(self) then
			return true
		end
	end

	function Language:hasDescendant(...)
		return check_inputs(self, check_lang, false, ...)
	end
end

local function fetch_children(self, fmt)
	local m_etym_data = require(etymology_languages_data_module)
	local self_code, children = self._code, {}
	for code, lang in pairs(m_etym_data) do
		local _lang = lang
		repeat
			local parent = _lang.parent
			if parent == self_code then
				if fmt == "object" then
					insert(children, get_by_code(code, nil, true))
				elseif fmt == "code" then
					insert(children, code)
				elseif fmt == "name" then
					insert(children, lang[1])
				end
				break
			end
			_lang = m_etym_data[parent]
		until not _lang
	end
	return children
end

function Language:getChildren()
	local children = self._childObjects
	if children == nil then
		children = fetch_children(self, "object")
		self._childObjects = children
	end
	return children
end

function Language:getChildrenCodes()
	local children = self._childCodes
	if children == nil then
		children = fetch_children(self, "code")
		self._childCodes = children
	end
	return children
end

function Language:getChildrenNames()
	local children = self._childNames
	if children == nil then
		children = fetch_children(self, "name")
		self._childNames = children
	end
	return children
end

function Language:hasChild(...)
	local lang = ...
	if not lang then
		return false
	elseif type(lang) == "string" then
		lang = get_by_code(lang, nil, true)
	end
	if lang:hasParent(self) then
		return true
	end
	return self:hasChild(select(2, ...))
end

--[==[Returns the name of the main category of that language. Example: {{code|lua|"French language"}} for French, whose category is at [[:Category:French language]]. Unless optional argument <code>nocap</code> is given, the language name at the beginning of the returned value will be capitalized. This capitalization is correct for category names, but not if the language name is lowercase and the returned value of this function is used in the middle of a sentence.]==]
function Language:getCategoryName(nocap)
	local name = self._categoryName
	if name == nil then
		name = self:getCanonicalName()
		-- If a substrate, omit any leading article.
		if self:getFamilyCode() == "qfa-sub" then
			name = name:gsub("^the ", ""):gsub("^a ", "")
		end
		-- Only add " language" if a full language.
		if self:hasType("full") then
			-- Unless the canonical name already ends with "language", "lect" or their derivatives, add " language".
			if not (match(name, "[Ll]anguage$") or match(name, "[Ll]ect$")) then
				name = name .. " language"
			end
		end
		self._categoryName = name
	end
	if nocap then
		return name
	end
	return mw.getContentLanguage():ucfirst(name)
end

--[==[Creates a link to the category; the link text is the canonical name.]==]
function Language:makeCategoryLink()
	return make_link(self, ":Category:" .. self:getCategoryName(), self:getDisplayForm())
end

function Language:getStandardCharacters(sc)
	local standard_chars = self._data.standard_chars
	if type(standard_chars) ~= "table" then
		return standard_chars
	elseif sc and type(sc) ~= "string" then
		check_object("script", nil, sc)
		sc = sc:getCode()
	end
	if (not sc) or sc == "None" then
		local scripts = {}
		for _, script in pairs(standard_chars) do
			insert(scripts, script)
		end
		return concat(scripts)
	end
	if standard_chars[sc] then
		return standard_chars[sc] .. (standard_chars[1] or "")
	end
end

--[==[
Strip diacritics from display text `text` (in a language-specific fashion), which is in the script `sc`. If `sc` is
omitted or {nil}, the script is autodetected. This also strips certain punctuation characters from the end and (in the
case of Spanish inverted question marks and exclamation points) from the beginning; strips any whitespace at the end of
the text or between the text and final stripped punctuation characters; and applies some language-specific Unicode
normalizations to replace discouraged characters with their prescribed alternatives. Return the stripped text.
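
The final trimming step described above can be illustrated with a minimal standalone sketch. It is an approximation
under stated assumptions, not the method itself: it uses plain byte-oriented Lua patterns rather than Unicode-aware
matching, it recognizes only ASCII `?`, `!` and `;` at the end, and the helper name `trim_edge_punctuation` is
hypothetical and does not exist in this module.

```lua
-- Hypothetical helper for illustration only; not part of this module.
local function trim_edge_punctuation(text)
	-- Drop one leading inverted question mark (bytes \194\191) or inverted
	-- exclamation mark (bytes \194\161), if present.
	local body = text:gsub("^\194[\191\161]", "", 1)
	-- Capture the shortest span that contains at least one character that is
	-- neither whitespace nor punctuation, then allow trailing whitespace and at
	-- most one final "?", "!" or ";". If nothing matches, return the input.
	return body:match("^(.-[^%s%p].-)%s*[?!;]?$") or text
end
```

Unlike the real method, this sketch skips script detection, normalization and the language-specific substitutions, so it
only shows how leading and trailing punctuation and whitespace are removed.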
]==]
function Language:stripDiacritics(text, sc)
	if (not text) or text == "" then
		return text
	end
	sc = checkScript(text, self, sc)
	text = normalize(text, sc)
	-- FIXME, rename makeEntryName to stripDiacritics and get rid of second and third return values
	-- everywhere
	text, _, _ = iterateSectionSubstitutions(self, text, sc, nil, nil,
		self._data.strip_diacritics or self._data.entry_name, "strip_diacritics", "stripDiacritics")
	text = umatch(text, "^[¿¡]?(.-[^%s%p].-)%s*[؟?!;՛՜ ՞ ՟?!︖︕।॥။၊་།]?$") or text
	return text
end

--[==[
Convert a ''logical'' pagename (the pagename as it appears to the user, after diacritics and punctuation have been
stripped) to a ''physical'' pagename (the pagename as it appears in the MediaWiki database). Reasons for a difference
between the two are (a) unsupported titles such as `[ ]` (with square brackets in them), `#` (pound/hash sign) and
`¯\_(ツ)_/¯` (with underscores), as well as overly long titles of various sorts; (b) "mammoth" pages that are split into
parts (e.g. `a`, which is split into physical pagenames `a/languages A to L` and `a/languages M to Z`). For almost all
purposes, you should work with logical and not physical pagenames. But there are certain use cases that require physical
pagenames, such as checking the existence of a page or retrieving a page's contents.

`pagename` is the logical pagename to be converted. `is_reconstructed_or_appendix` indicates whether the page is in the
`Reconstruction` or `Appendix` namespaces. If it is omitted or has the value {nil}, the pagename is checked for an
initial asterisk, and if found, the page is assumed to be a `Reconstruction` page. Setting a value of `false` or `true`
to `is_reconstructed_or_appendix` disables this check and allows for mainspace pagenames that begin with an asterisk.
]==]
function Language:logicalToPhysical(pagename, is_reconstructed_or_appendix)
	-- FIXME: This probably shouldn't happen but it happens when makeEntryName() receives nil.
	if pagename == nil then
		track("nil-passed-to-logicalToPhysical")
		return nil
	end
	local initial_asterisk
	if is_reconstructed_or_appendix == nil then
		local pagename_minus_initial_asterisk
		initial_asterisk, pagename_minus_initial_asterisk = pagename:match("^(%*)(.*)$")
		if pagename_minus_initial_asterisk then
			is_reconstructed_or_appendix = true
			pagename = pagename_minus_initial_asterisk
		elseif self:hasType("appendix-constructed") then
			is_reconstructed_or_appendix = true
		end
	end
	if not is_reconstructed_or_appendix then
		-- Check if the pagename is a listed unsupported title.
		local unsupportedTitles = load_data(links_data_module).unsupported_titles
		if unsupportedTitles[pagename] then
			return "Unsupported titles/" .. unsupportedTitles[pagename]
		end
	end
	-- Set `unsupported` as true if certain conditions are met.
	local unsupported
	-- Check if there's an unsupported character. \239\191\189 is the replacement character U+FFFD, which can't be typed
	-- directly here due to an abuse filter. Unix-style dot-slash notation is also unsupported, as it is used for
	-- relative paths in links, as are 3 or more consecutive tildes. Note: match is faster with magic
	-- characters/charsets; find is faster with plaintext.
	if (
		match(pagename, "[#<>%[%]_{|}]") or
		find(pagename, "\239\191\189") or
		match(pagename, "%f[^%z/]%.%.?%f[%z/]") or
		find(pagename, "~~~")
	) then
		unsupported = true
	-- If it looks like an interwiki link.
	elseif find(pagename, ":") then
		local prefix = gsub(pagename, "^:*(.-):.*", ulower)
		if (
			load_data("Module:data/namespaces")[prefix] or
			load_data("Module:data/interwikis")[prefix]
		) then
			unsupported = true
		end
	end
	-- Escape unsupported characters so they can be used in titles. ` is used as a delimiter for this, so a raw use of
	-- it in an unsupported title is also escaped here to prevent interference; this is only done with unsupported
	-- titles, though, so inclusion won't in itself mean a title is treated as unsupported (which is why it's excluded
	-- from the earlier test).
	if unsupported then
		-- FIXME: This conversion needs to be different for reconstructed pages with unsupported characters. There
		-- aren't any currently, but if there ever are, we need to fix this e.g. to put them in something like
		-- Reconstruction:Proto-Indo-European/Unsupported titles/`lowbar``num`.
		local unsupported_characters = load_data(links_data_module).unsupported_characters
		pagename = pagename:gsub("[#<>%[%]_`{|}\239]\191?\189?", unsupported_characters)
			:gsub("%f[^%z/]%.%.?%f[%z/]", function(m)
				return (gsub(m, "%.", "`period`"))
			end)
			:gsub("~~~+", function(m)
				return (gsub(m, "~", "`tilde`"))
			end)
		pagename = "Unsupported titles/" .. pagename
	elseif not is_reconstructed_or_appendix then
		-- Check if this is a mammoth page. If so, which subpage should we link to?
		local m_links_data = load_data(links_data_module)
		local mammoth_page_type = m_links_data.mammoth_pages[pagename]
		if mammoth_page_type then
			local canonical_name = self:getFullName()
			if canonical_name ~= "Translingual" and canonical_name ~= "English" then
				local this_subpage
				local L2_sort_key = get_L2_sort_key(canonical_name)
				for _, subpage_spec in ipairs(m_links_data.mammoth_page_subpage_types[mammoth_page_type]) do
					-- unpack() fails utterly on data loaded using mw.loadData() even if offsets are given
					local subpage, pattern = subpage_spec[1], subpage_spec[2]
					if pattern == true or L2_sort_key:match(pattern) then
						this_subpage = subpage
						break
					end
				end
				if not this_subpage then
					error(("Internal error: Bad data in mammoth_page_subpage_types in [[Module:links/data]] for mammoth page %s, type %s; last entry didn't have 'true' in it"):format(
						pagename, mammoth_page_type))
				end
				pagename = pagename .. "/" .. this_subpage
			end
		end
	end
	return (initial_asterisk or "") .. pagename
end

--[==[
Strip the diacritics from a display pagename and convert the resulting logical pagename into a physical pagename.
This allows you, for example, to retrieve the contents of the page or check its existence. WARNING: This is deprecated
and will be going away.
It is a simple composition of `self:stripDiacritics` and `self:logicalToPhysical`; most callers only want the former,
and if you need both, call them both yourself.

`text` and `sc` are as in `self:stripDiacritics`, and `is_reconstructed_or_appendix` is as in `self:logicalToPhysical`.
]==]
function Language:makeEntryName(text, sc, is_reconstructed_or_appendix)
	return self:logicalToPhysical(self:stripDiacritics(text, sc), is_reconstructed_or_appendix)
end

--[==[Generates alternative forms using a specified method, and returns them as a table. If no method is specified, returns a table containing only the input term.]==]
function Language:generateForms(text, sc)
	local generate_forms = self._data.generate_forms
	if generate_forms == nil then
		return {text}
	end
	sc = checkScript(text, self, sc)
	return require("Module:" .. self._data.generate_forms).generateForms(text, self, sc)
end

--[==[Creates a sort key for the given stripped text, following the rules appropriate for the language. This removes diacritical marks from the stripped text if they are not considered significant for sorting, and may perform some other changes. Any initial hyphen is also removed, and anything in parentheses is removed as well.
The <code>sort_key</code> setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the stripped text and returns a sortkey.]==]
function Language:makeSortKey(text, sc)
	if (not text) or text == "" then
		return text
	end
	if match(text, "<[^<>]+>") then
		track("track HTML tag")
	end
	-- Remove directional characters, bold, italics, soft hyphens, strip markers and HTML tags.
	-- FIXME: Partly duplicated with remove_formatting() in [[Module:links]].
	text = ugsub(text, "[\194\173\226\128\170-\226\128\174\226\129\166-\226\129\169]", "")
	text = text:gsub("('*)'''(.-'*)'''", "%1%2"):gsub("('*)''(.-'*)''", "%1%2")
	text = gsub(unstrip(text), "<[^<>]+>", "")
	text = decode_uri(text, "PATH")
	text = checkNoEntities(self, text)
	-- Remove initial hyphens and * unless the term only consists of spacing + punctuation characters.
	text = ugsub(text, "^([-]*)[-־ـ᠊*]+([-]*)(.*[^%s%p].*)", "%1%2%3")
	sc = checkScript(text, self, sc)
	text = normalize(text, sc)
	text = removeCarets(text, sc)
	-- For languages with dotted dotless i, ensure that "İ" is sorted as "i", and "I" is sorted as "ı".
	if self:hasDottedDotlessI() then
		text = gsub(text, "I\204\135", "i") -- decomposed "İ"
			:gsub("I", "ı")
		text = sc:toFixedNFD(text)
	end
	-- Convert to lowercase, make the sortkey, then convert to uppercase. Where the language has dotted dotless i, it is
	-- usually not necessary to convert "i" to "İ" and "ı" to "I" first, because "I" will always be interpreted as
	-- conventional "I" (not dotless "İ") by any sorting algorithms, which will have been taken into account by the
	-- sortkey substitutions themselves. However, if no sortkey substitutions have been specified, then conversion is
	-- necessary so as to prevent "i" and "ı" both being sorted as "I".
	--
	-- An exception is made for scripts that (sometimes) sort by scraping page content, as that means they are sensitive
	-- to changes in capitalization (as it changes the target page).
	if not sc:sortByScraping() then
		text = ulower(text)
	end
	local actual_substitution_data
	-- Don't trim whitespace here because it's significant at the beginning of a sort key or sort base.
	text, _, actual_substitution_data = iterateSectionSubstitutions(self, text, sc, nil, nil, self._data.sort_key,
		"sort_key", "makeSortKey", "notrim")
	if not sc:sortByScraping() then
		if self:hasDottedDotlessI() and not actual_substitution_data then
			text = text:gsub("ı", "I"):gsub("i", "İ")
			text = sc:toFixedNFC(text)
		end
		text = uupper(text)
	end
	-- Remove parentheses, as long as they are either preceded or followed by something.
	text = gsub(text, "(.)[()]+", "%1"):gsub("[()]+(.)", "%1")
	text = escape_risky_characters(text)
	return text
end

--[==[Create the form used as a basis for display text and transliteration. FIXME: Rename to correctInputText().]==]
local function processDisplayText(text, self, sc, keepCarets, keepPrefixes)
	local subbedChars = {}
	text, subbedChars = doTempSubstitutions(text, subbedChars, keepCarets)
	text = decode_uri(text, "PATH")
	text = checkNoEntities(self, text)
	sc = checkScript(text, self, sc)
	text = normalize(text, sc)
	text, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, self._data.display_text,
		"display_text", "makeDisplayText")
	text = removeCarets(text, sc)
	-- Remove any interwiki link prefixes (unless they have been escaped or this has been disabled).
	if find(text, ":") and not keepPrefixes then
		local rep
		repeat
			text, rep = gsub(text, "\\\\(\\*:)", "\3%1")
		until rep == 0
		text = gsub(text, "\\:", "\4")
		while true do
			local prefix = gsub(text, "^(.-):.+", function(m1)
				return (gsub(m1, "\244[\128-\191]*", ""))
			end)
			-- Check if the prefix is an interwiki, though ignore capitalised Wiktionary:, which is a namespace.
			if not prefix or prefix == text or prefix == "Wiktionary"
				or not (load_data("Module:data/interwikis")[ulower(prefix)] or prefix == "") then
				break
			end
			text = gsub(text, "^(.-):(.*)", function(m1, m2)
				local ret = {}
				for subbedChar in gmatch(m1, "\244[\128-\191]*") do
					insert(ret, subbedChar)
				end
				return concat(ret) .. m2
			end)
		end
		text = gsub(text, "\3", "\\"):gsub("\4", ":")
	end
	return text, subbedChars
end

--[==[Make the display text (i.e.
what is displayed on the page).]==]
function Language:makeDisplayText(text, sc, keepPrefixes)
	if not text or text == "" then
		return text
	end
	local subbedChars
	text, subbedChars = processDisplayText(text, self, sc, nil, keepPrefixes)
	text = escape_risky_characters(text)
	return undoTempSubstitutions(text, subbedChars)
end

--[==[Transliterates the text from the given script into the Latin script (see [[Wiktionary:Transliteration and romanization]]). The language must have the <code>translit</code> property for this to work; if it is not present, {{code|lua|nil}} is returned.

The <code>sc</code> parameter is handled by the transliteration module, and how it is handled is specific to that module. Some transliteration modules may tolerate {{code|lua|nil}} as the script, others require it to be one of the possible scripts that the module can transliterate, and will throw an error if it's not one of them. For this reason, the <code>sc</code> parameter should always be provided when writing non-language-specific code.

The <code>module_override</code> parameter is used to override the default module that is used to provide the transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no default module yet, or you want to demonstrate an alternative version of a transliteration module before making it official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked by [[Wiktionary:Tracking/languages/module_override]].

'''Known bugs''':
* This function assumes {tr(s1) .. tr(s2) == tr(s1 .. s2)}. When this assertion fails, wikitext markups like <nowiki>'''</nowiki> can cause wrong transliterations.
* HTML entities like <code>&apos;</code>, often used to escape wikitext markups, do not work.
]==]
function Language:transliterate(text, sc, module_override)
	-- If there is no text (nil, an empty string or a bare "-"), return it unchanged.
	if not text or text == "" or text == "-" then
		return text
	end
	-- If the script is not transliteratable (and no override is given), return nil.
	sc = checkScript(text, self, sc)
	if not (sc:isTransliterated() or module_override) then
		-- temporary tracking to see if/when this gets triggered
		track("non-transliterable")
		track("non-transliterable/" .. self._code)
		track("non-transliterable/" .. sc:getCode())
		track("non-transliterable/" .. sc:getCode() .. "/" .. self._code)
		return nil
	end
	-- Remove any strip markers.
	text = unstrip(text)
	-- Do not process the formatting into PUA characters for certain languages.
	local processed = load_data(languages_data_module).substitution[self._code] ~= "none"
	-- Get the display text with the keepCarets flag set.
	local subbedChars
	if processed then
		text, subbedChars = processDisplayText(text, self, sc, true)
	end
	-- Transliterate (using the module override if applicable).
	text, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, true, module_override or
		self._data.translit, "translit", "tr")
	if not text then
		return nil
	end
	-- Incomplete transliterations return nil.
	local charset = sc.characters
	if charset and umatch(text, "[" .. charset .. "]") then
		-- Remove any characters in Latin, which includes Latin characters also included in other scripts (as these are
		-- false positives), as well as any PUA substitutions. Anything remaining should only be script code "None"
		-- (e.g. numerals).
		local check_text = ugsub(text, "[" .. get_script("Latn").characters .. "-]+", "")
		-- Set none_is_last_resort_only flag, so that any non-None chars will cause a script other than "None" to be
		-- returned.
		if find_best_script_without_lang(check_text, true):getCode() ~= "None" then
			return nil
		end
	end
	if processed then
		text = escape_risky_characters(text)
		text = undoTempSubstitutions(text, subbedChars)
	end
	-- If the script does not use capitalization, then capitalize any letters of the transliteration which are
	-- immediately preceded by a caret (and remove the caret).
	if text and not sc:hasCapitalization() and text:find("^", 1, true) then
		text = processCarets(text, "%^([\128-\191\244]*%*?)([^\128-\191\244][\128-\191]*)", function(m1, m2)
			return m1 .. uupper(m2)
		end)
	end
	-- Track module overrides.
	if module_override ~= nil then
		track("module_override")
	end
	return text
end

do
	local function handle_language_spec(self, spec, sc)
		local ret = self["_" .. spec]
		if ret == nil then
			ret = self._data[spec]
			if type(ret) == "string" then
				ret = list_to_set(split(ret, ",", true, true))
			end
			self["_" .. spec] = ret
		end
		if type(ret) == "table" then
			ret = ret[sc:getCode()]
		end
		return not not ret
	end

	function Language:overrideManualTranslit(sc)
		return handle_language_spec(self, "override_translit", sc)
	end

	function Language:link_tr(sc)
		return handle_language_spec(self, "link_tr", sc)
	end
end

--[==[Returns {{code|lua|true}} if the language has a transliteration module, or {{code|lua|false}} if it doesn't.]==]
function Language:hasTranslit()
	return not not self._data.translit
end

--[==[Returns {{code|lua|true}} if the language uses the letters I/ı and İ/i, or {{code|lua|false}} if it doesn't.]==]
function Language:hasDottedDotlessI()
	return not not self._data.dotted_dotless_i
end

function Language:toJSON(opts)
	local strip_diacritics, strip_diacritics_patterns, strip_diacritics_remove_diacritics = self._data.strip_diacritics
	if strip_diacritics then
		if strip_diacritics.from then
			strip_diacritics_patterns = {}
			for i, from in ipairs(strip_diacritics.from) do
				insert(strip_diacritics_patterns, {from = from, to = strip_diacritics.to[i] or ""})
			end
		end
		strip_diacritics_remove_diacritics = strip_diacritics.remove_diacritics
	end
	-- mainCode should only end up non-nil if dontCanonicalizeAliases is passed to make_object().
	-- props should either contain zero-argument functions to compute the value, or the value itself.
	local props = {
		ancestors = function() return self:getAncestorCodes() end,
		canonicalName = function() return self:getCanonicalName() end,
		categoryName = function() return self:getCategoryName("nocap") end,
		code = self._code,
		mainCode = self._mainCode,
		parent = function() return self:getParentCode() end,
		full = function() return self:getFullCode() end,
		stripDiacriticsPatterns = strip_diacritics_patterns,
		stripDiacriticsRemoveDiacritics = strip_diacritics_remove_diacritics,
		family = function() return self:getFamilyCode() end,
		aliases = function() return self:getAliases() end,
		varieties = function() return self:getVarieties() end,
		otherNames = function() return self:getOtherNames() end,
		scripts = function() return self:getScriptCodes() end,
		type = function() return keys_to_list(self:getTypes()) end,
		wikimediaLanguages = function() return self:getWikimediaLanguageCodes() end,
		wikidataItem = function() return self:getWikidataItem() end,
		wikipediaArticle = function() return self:getWikipediaArticle(true) end,
	}
	local ret = {}
	for prop, val in pairs(props) do
		-- Guard against a nil `opts`, which the final return statement also allows for.
		if not (opts and opts.skip_fields and opts.skip_fields[prop]) then
			if type(val) == "function" then
				ret[prop] = val()
			else
				ret[prop] = val
			end
		end
	end
	-- Use `deep_copy` when returning a table, so that there are no editing restrictions imposed by `mw.loadData`.
	return opts and opts.lua_table and deep_copy(ret) or to_json(ret, opts)
end

function export.getDataModuleName(code)
	local letter = match(code, "^(%l)%l%l?$")
	return "Module:" .. (
		letter == nil and "languages/data/exceptional" or
		#code == 2 and "languages/data/2" or
		"languages/data/3/" .. letter
	)
end
get_data_module_name = export.getDataModuleName

function export.getExtraDataModuleName(code)
	return get_data_module_name(code) .. "/extra"
end
get_extra_data_module_name = export.getExtraDataModuleName

do
	local function make_stack(data)
		local key_types = {
			[2] = "unique",
			aliases = "unique",
			otherNames = "unique",
			type = "append",
			varieties = "unique",
			wikipedia_article = "unique",
			wikimedia_codes = "unique"
		}

		local function __index(self, k)
			local stack, key_type = getmetatable(self), key_types[k]
			-- Data that isn't inherited from the parent.
			if key_type == "unique" then
				local v = stack[stack[make_stack]][k]
				if v == nil then
					local layer = stack[0]
					if layer then -- Could be false if there's no extra data.
						v = layer[k]
					end
				end
				return v
			-- Data that is appended by each generation.
			elseif key_type == "append" then
				local parts, offset, n = {}, 0, stack[make_stack]
				for i = 1, n do
					local part = stack[i][k]
					if part == nil then
						offset = offset + 1
					else
						parts[i - offset] = part
					end
				end
				return offset ~= n and concat(parts, ",") or nil
			end
			local n = stack[make_stack]
			while true do
				local layer = stack[n]
				if not layer then -- Could be false if there's no extra data.
					return nil
				end
				local v = layer[k]
				if v ~= nil then
					return v
				end
				n = n - 1
			end
		end

		local function __newindex()
			error("table is read-only")
		end

		local function __pairs(self)
			-- Iterate down the stack, caching keys to avoid duplicate returns.
			local stack, seen = getmetatable(self), {}
			local n = stack[make_stack]
			local iter, state, k, v = pairs(stack[n])
			return function()
				repeat
					repeat
						k = iter(state, k)
						if k == nil then
							n = n - 1
							local layer = stack[n]
							if not layer then -- Could be false if there's no extra data.
								return nil
							end
							iter, state, k = pairs(layer)
						end
					until not (k == nil or seen[k])
					-- Get the value via a lookup, as the one returned by the
					-- iterator will be the raw value from the current layer,
					-- which may not be the one __index will return for that
					-- key. Also memoize the key in `seen` (even if the lookup
					-- returns nil) so that it doesn't get looked up again.
					-- TODO: store values in `self`, avoiding the need to create
					-- the `seen` table. The iterator will need to iterate over
					-- `self` with `next` first to find these on future loops.
					v, seen[k] = self[k], true
				until v ~= nil
				return k, v
			end
		end

		local __ipairs = require(table_module).indexIpairs

		function make_stack(data)
			local stack = {
				data,
				[make_stack] = 1, -- stores the length and acts as a sentinel to confirm a given metatable is a stack.
				__index = __index,
				__newindex = __newindex,
				__pairs = __pairs,
				__ipairs = __ipairs,
			}
			stack.__metatable = stack
			return setmetatable({}, stack), stack
		end

		return make_stack(data)
	end

	local function get_stack(data)
		local stack = getmetatable(data)
		return stack and type(stack) == "table" and stack[make_stack] and stack or nil
	end

	--[==[
	<span style="color: var(--wikt-palette-red,#BA0000)">This function is not for use in entries or other content pages.</span>

	Returns a blob of data about the language. The format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes. If `extra` is set, any extra data in the relevant `/extra` module will be included. (Note that it will be included anyway if it has already been loaded into the language object.) If `raw` is set, then the returned data will not contain any data inherited from parent objects.

	-- Do NOT use these methods!
	-- All uses should be pre-approved on the talk page!
	]==]
	function Language:getData(extra, raw)
		if extra then
			self:loadInExtraData()
		end
		local data = self._data
		-- If raw is not set, just return the data.
		if not raw then
			return data
		end
		local stack = get_stack(data)
		-- If there isn't a stack or its length is 1, return the data. Extra data (if any) will be included, as it's
		-- stored at key 0 and doesn't affect the reported length.
		if stack == nil then
			return data
		end
		local n = stack[make_stack]
		if n == 1 then
			return data
		end
		local extra = stack[0]
		-- If there isn't any extra data, return the top layer of the stack.
		if extra == nil then
			return stack[n]
		end
		-- If there is, return a new stack which has the top layer at key 1 and the extra data at key 0.
		data, stack = make_stack(stack[n])
		stack[0] = extra
		return data
	end

	function Language:loadInExtraData()
		-- Only full languages have extra data.
		if not self:hasType("language", "full") then
			return
		end
		local data = self._data
		-- If there's no stack, create one.
		local stack = get_stack(self._data)
		if stack == nil then
			data, stack = make_stack(data)
		-- If already loaded, return.
		elseif stack[0] ~= nil then
			return
		end
		self._data = data
		-- Load extra data from the relevant module and add it to the stack at key 0, so that the __index and __pairs
		-- metamethods will pick it up, since they iterate down the stack until they run out of layers.
		local code = self._code
		local modulename = get_extra_data_module_name(code)
		-- No data cached as false.
		stack[0] = modulename and load_data(modulename)[code] or false
	end

	--[==[Returns the name of the module containing the language's data. For etymology-only languages, this is the etymology languages data module; for all other languages, it is the data submodule selected from the language code by `export.getDataModuleName`.]==]
	function Language:getDataModuleName()
		local name = self._dataModuleName
		if name == nil then
			name = self:hasType("etymology-only") and etymology_languages_data_module or
				get_data_module_name(self._mainCode or self._code)
			self._dataModuleName = name
		end
		return name
	end

	--[==[Returns the name of the module containing the language's extra data.
	Etymology-only languages have no extra data module, so {nil} is returned for them.]==]
	function Language:getExtraDataModuleName()
		local name = self._extraDataModuleName
		if name == nil then
			name = not self:hasType("etymology-only") and get_extra_data_module_name(self._mainCode or self._code) or false
			self._extraDataModuleName = name
		end
		return name or nil
	end

	function export.makeObject(code, data, dontCanonicalizeAliases)
		local data_type = type(data)
		if data_type ~= "table" then
			error(("bad argument #2 to 'makeObject' (table expected, got %s)"):format(data_type))
		end
		-- Convert any aliases.
		local input_code = code
		code = normalize_code(code)
		input_code = dontCanonicalizeAliases and input_code or code
		local parent
		if data.parent then
			parent = get_by_code(data.parent, nil, true, true)
		else
			parent = Language
		end
		parent.__index = parent
		local lang = {_code = input_code}
		-- This can only happen if dontCanonicalizeAliases is passed to make_object().
		if code ~= input_code then
			lang._mainCode = code
		end
		local parent_data = parent._data
		if parent_data == nil then
			-- Full code is the same as the code.
			lang._fullCode = parent._code or code
		else
			-- Copy full code.
			lang._fullCode = parent._fullCode
			local stack = get_stack(parent_data)
			if stack == nil then
				parent_data, stack = make_stack(parent_data)
			end
			-- Insert the input data as the new top layer of the stack.
			local n = stack[make_stack] + 1
			data, stack[n], stack[make_stack] = parent_data, data, n
		end
		lang._data = data
		return setmetatable(lang, parent)
	end
	make_object = export.makeObject
end

--[==[Finds the language whose code matches the one provided. If it exists, it returns a <code class="nf">Language</code> object representing the language. Otherwise, it returns {{code|lua|nil}}, unless <code class="n">paramForError</code> is given, in which case an error is generated.
If <code class="n">paramForError</code> is {{code|lua|true}}, a generic error message mentioning the bad code is generated; otherwise <code class="n">paramForError</code> should be a string or number specifying the parameter that the code came from, and this parameter will be mentioned in the error message along with the bad code. If <code class="n">allowEtymLang</code> is specified, etymology-only language codes are allowed and looked up along with normal language codes. If <code class="n">allowFamily</code> is specified, language family codes are allowed and looked up along with normal language codes.]==]
function export.getByCode(code, paramForError, allowEtymLang, allowFamily)
	-- Track uses of paramForError, ultimately so it can be removed, as error-handling should be done by [[Module:parameters]], not here.
	if paramForError ~= nil then
		track("paramForError")
	end

	if type(code) ~= "string" then
		local typ
		if not code then
			typ = "nil"
		elseif check_object("language", true, code) then
			typ = "a language object"
		elseif check_object("family", true, code) then
			typ = "a family object"
		else
			typ = "a " .. type(code)
		end
		error("The function getByCode expects a string as its first argument, but received " .. typ .. ".")
	end

	local m_data = load_data(languages_data_module)
	if m_data.aliases[code] or m_data.track[code] then
		track(code)
	end

	local norm_code = normalize_code(code)

	-- Get the data, checking for etymology-only languages if allowEtymLang is set.
	local data = load_data(get_data_module_name(norm_code))[norm_code] or
		allowEtymLang and load_data(etymology_languages_data_module)[norm_code]

	-- If no data was found and allowFamily is set, check the family data. If the main family data was found, make the object with [[Module:families]] instead, as family objects have different methods. However, if it's an etymology-only family, use make_object in this module (which handles object inheritance), and the family-specific methods will be inherited from the parent object. 
	if data == nil and allowFamily then
		data = load_data("Module:families/data")[norm_code]
		if data ~= nil then
			if data.parent == nil then
				return make_family_object(norm_code, data)
			elseif not allowEtymLang then
				data = nil
			end
		end
	end

	local retval = code and data and make_object(code, data)

	if not retval and paramForError then
		require("Module:languages/errorGetBy").code(code, paramForError, allowEtymLang, allowFamily)
	end

	return retval
end
get_by_code = export.getByCode

--[==[Finds the language whose canonical name (the name used to represent that language on Wiktionary) or other name matches the one provided. If it exists, it returns a <code class="nf">Language</code> object representing the language. Otherwise, it returns {{code|lua|nil}}, unless <code class="n">errorIfInvalid</code> is given, in which case an error is generated. If <code class="n">allowEtymLang</code> is specified, etymology-only language names are allowed and looked up along with normal language names. If <code class="n">allowFamily</code> is specified, language family names are allowed and looked up along with normal language names. The canonical name of languages should always be unique (it is an error for two languages on Wiktionary to share the same canonical name), so this is guaranteed to give at most one result. This function is powered by [[Module:languages/canonical names]], which contains a pre-generated mapping of full-language canonical names to codes. It is generated by going through the [[:Category:Language data modules]] for full languages. 
When <code class="n">allowEtymLang</code> is specified for the above function, [[Module:etymology languages/canonical names]] may also be used, and when <code class="n">allowFamily</code> is specified for the above function, [[Module:families/canonical names]] may also be used.]==]
function export.getByCanonicalName(name, errorIfInvalid, allowEtymLang, allowFamily)
	local byName = load_data("Module:languages/canonical names")
	local code = byName and byName[name]

	if not code and allowEtymLang then
		byName = load_data("Module:etymology languages/canonical names")
		code = byName and byName[name] or
			byName[gsub(name, " [Ss]ubstrate$", "")] or
			byName[gsub(name, "^a ", "")] or
			byName[gsub(name, "^a ", ""):gsub(" [Ss]ubstrate$", "")] or
			-- For etymology families like "ira-pro".
			-- FIXME: This is not ideal, as it allows " languages" to be appended to any etymology-only language, too.
			byName[match(name, "^(.*) languages$")]
	end

	if not code and allowFamily then
		byName = load_data("Module:families/canonical names")
		code = byName[name] or byName[match(name, "^(.*) languages$")]
	end

	local retval = code and get_by_code(code, errorIfInvalid, allowEtymLang, allowFamily)

	if not retval and errorIfInvalid then
		require("Module:languages/errorGetBy").canonicalName(name, allowEtymLang, allowFamily)
	end

	return retval
end

--[==[Used by [[Module:languages/data/2]] (et al.) and [[Module:etymology languages/data]], [[Module:families/data]], [[Module:scripts/data]] and [[Module:writing systems/data]] to finalize the data into the format that is actually returned.]==]
function export.finalizeData(data, main_type, variety)
	local fields = {"type"}
	if main_type == "language" then
		insert(fields, 4) -- script codes
		insert(fields, "ancestors")
		insert(fields, "link_tr")
		insert(fields, "override_translit")
		insert(fields, "wikimedia_codes")
	elseif main_type == "script" then
		insert(fields, 3) -- writing system codes
	end -- Families and writing systems have no extra fields to process.
	local fields_len = #fields
	for _, entity in next, data do
		if variety then
			-- Move parent from 3 to "parent" and family from "family" to 3. These are different for the sake of convenience, since very few varieties have the family specified, whereas all of them have a parent.
			entity.parent, entity[3], entity.family = entity[3], entity.family
		-- Give the type "regular" iff not a variety and no other types are assigned.
		elseif not (entity.type or entity.parent) then
			entity.type = "regular"
		end
		for i = 1, fields_len do
			local key = fields[i]
			local field = entity[key]
			if field and type(field) == "string" then
				entity[key] = gsub(field, "%s*,%s*", ",")
			end
		end
	end
	return data
end

--[==[For backwards compatibility only; modules should require the error themselves.]==]
function export.err(lang_code, param, code_desc, template_tag, not_real_lang)
	return require("Module:languages/error")(lang_code, param, code_desc, template_tag, not_real_lang)
end

return export
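The normalisation performed by <code>finalizeData</code> can be tried outside Scribunto with a small standalone sketch. This is not the module's own code: the function name <code>finalize_sketch</code>, the shortened field list, and the codes <code>xx</code>, <code>xx-var</code> and <code>fam</code> are all hypothetical, and <code>string.gsub</code> is assumed to stand in for the module's local <code>gsub</code>.

```lua
-- Standalone sketch of the clean-up steps in finalizeData (hypothetical names throughout).
local function finalize_sketch(data, variety)
	-- Assumed subset of the language field list: "type", the script-code slot (key 4), "ancestors".
	local fields = {"type", 4, "ancestors"}
	for _, entity in next, data do
		if variety then
			-- Lua evaluates every right-hand value before assigning, so this moves
			-- entity[3] to entity.parent and entity.family to entity[3]; the missing
			-- third value is nil, which clears entity.family.
			entity.parent, entity[3], entity.family = entity[3], entity.family
		elseif not (entity.type or entity.parent) then
			entity.type = "regular"
		end
		for i = 1, #fields do
			local key = fields[i]
			local field = entity[key]
			if type(field) == "string" then
				-- Strip whitespace around commas; the parentheses keep only
				-- the first value returned by gsub.
				entity[key] = (field:gsub("%s*,%s*", ","))
			end
		end
	end
	return data
end

-- A hypothetical full language: key 3 holds the family, key 4 the script codes.
local full = finalize_sketch({xx = {"Example", nil, "fam", "Latn , Cyrl,Grek"}}, false)

-- A hypothetical variety: key 3 holds the parent code, "family" the family.
local variety = finalize_sketch({["xx-var"] = {"Example variety", nil, "xx", family = "fam"}}, true)
```

After the call, the full-language entry has a comma-separated script list without stray spaces and the default type, while the variety entry has its parent under <code>parent</code> and its family at key 3.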
