URL: https://www.geeksforgeeks.org/python/detect-an-unknown-language-using-python/

Detect an Unknown Language using Python

Last Updated : 22 Oct, 2025

Language detection is an essential task in Natural Language Processing (NLP). It involves identifying the language of a given text by analyzing its characters, words, and structure. Several third-party Python libraries make this process simple and reasonably accurate.

In this article, we’ll explore three popular libraries for language detection:

  • langdetect
  • textblob
  • langid

Using langdetect Library

The langdetect module is a port of Google’s language-detection library and supports 55 languages out of the box. It is not part of Python’s standard library, so you need to install it first.

Install the library using:

pip install langdetect
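
A minimal sketch of how detect() might be called on several sentences is given here. The sample sentences and the helper name detect_each are assumptions made for illustration, and the codes returned depend on the input text.

```python
def detect_each(texts, detector):
    """Apply a language detector to each text and return the detected codes."""
    return [detector(text) for text in texts]


def main():
    # Third-party dependency: pip install langdetect
    from langdetect import DetectorFactory, detect

    # langdetect is non-deterministic on short or ambiguous input;
    # fixing the seed makes repeated runs return the same result.
    DetectorFactory.seed = 0

    # Illustrative sample sentences (assumed, one per language).
    samples = [
        "Language detection is a common preprocessing step.",
        "Определение языка является распространённой задачей.",
        "La detección de idioma es una tarea común.",
        "语言检测是一项常见的任务。",
        "भाषा की पहचान एक सामान्य कार्य है।",
        "言語の検出は一般的なタスクです。",
    ]
    for code in detect_each(samples, detect):
        print(code)


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("langdetect is not installed; run: pip install langdetect")
```

Passing the detector as an argument keeps the loop independent of the library, so the same helper can be reused with any function that maps a string to a language code.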

Output

en
ru
es
zh-cn
hi
ja

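Explanation:

  • detect(): Returns the code of the most probable language for the given text, using a naive Bayes classifier over character n-grams.
  • Results for very short or ambiguous text can vary between runs because the algorithm is non-deterministic; setting DetectorFactory.seed = 0 before the first call makes them reproducible.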

Using textblob Library

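TextBlob is a library for common NLP tasks such as part-of-speech tagging, noun phrase extraction, and sentiment analysis. Older releases also offered translation and language detection through the Google Translate web API. These methods were deprecated in version 0.16.0 and removed in 0.18.0, so the example in this section applies only to older releases and requires an internet connection.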

Install the library using:

pip install textblob

Example:
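
A hedged sketch of the call pattern follows. It assumes a TextBlob release that still provides detect_language() and a working network connection. The helper detect_language_or_none and the sample sentences are hypothetical, introduced only for illustration.

```python
def detect_language_or_none(blob):
    """Return blob.detect_language() if the method exists, otherwise None."""
    method = getattr(blob, "detect_language", None)
    return method() if callable(method) else None


def main():
    # Third-party dependency: pip install textblob
    from textblob import TextBlob

    # Illustrative sample sentences (assumed).
    samples = [
        "Language detection is a common preprocessing step.",
        "La detección de idioma es una tarea común.",
    ]
    for text in samples:
        code = detect_language_or_none(TextBlob(text))
        if code is None:
            print("detect_language() is not available in this TextBlob version")
        else:
            print(code)


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("textblob is not installed; run: pip install textblob")
    except OSError as err:
        # Network failures (for example urllib errors) are subclasses of OSError.
        print(f"language detection request failed: {err}")
```

The getattr check lets the script report a missing method instead of raising AttributeError on releases where detect_language() no longer exists.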

Output

en
ru
es
zh-CN
hi
ja

Explanation:

  • TextBlob(): Creates a text processing object for each sentence.
  • .detect_language(): Detects the language of the text by querying the Google Translate web API, so it requires an internet connection.
  • The loop prints the detected language code for each sentence, for example: en (English), ru (Russian), es (Spanish), zh-CN (Chinese), hi (Hindi), ja (Japanese).

Using langid Library

langid is a standalone language identification tool pre-trained on 97 languages. It’s lightweight and doesn’t require an internet connection.

Install it using:

pip install langid

Example:
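
A minimal sketch of calling langid.classify() on several sentences is shown here. The sample sentences and the helper round_score are assumptions for illustration. The helper only rounds the score for display, and the actual scores depend on the input.

```python
def round_score(result, ndigits=2):
    """Round the score in a (language, score) pair for readable printing."""
    lang, score = result
    # float() converts NumPy scalars to a plain Python float before rounding.
    return (lang, round(float(score), ndigits))


def main():
    # Third-party dependency: pip install langid
    import langid

    # Illustrative sample sentences (assumed, one per language).
    samples = [
        "Language detection is a common preprocessing step.",
        "Определение языка является распространённой задачей.",
        "La detección de idioma es una tarea común.",
        "语言检测是一项常见的任务。",
        "भाषा की पहचान एक सामान्य कार्य है।",
        "言語の検出は一般的なタスクです。",
    ]
    for text in samples:
        print(round_score(langid.classify(text)))


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("langid is not installed; run: pip install langid")
```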

Output

('en', -119.93)
('ru', -641.34)
('es', -191.01)
('zh', -199.18)
('hi', -286.99)
('ja', -875.66)

Explanation:

  • langid.classify(text) returns a tuple containing two values: the detected language code and a score.
  • The first element (e.g., 'en') is the language code.
  • The second element is an unnormalized log-probability score. Its magnitude tends to grow with the length of the text, so scores are not comparable across inputs, and a more negative value does not by itself mean lower confidence.
  • For example, ('en', -119.93) means the model detected English with a log-probability score of -119.93.
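
To illustrate why a more negative score does not imply lower confidence, the following sketch normalizes log scores across candidate languages with a softmax. The function name normalize_log_scores and all score values are hypothetical. They are not output from langid, and only the standard library is used.

```python
import math


def normalize_log_scores(log_scores):
    """Convert per-language log scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical scores for a short text: the two candidates are close.
short_text = normalize_log_scores({"en": -20.0, "de": -21.0})
# Hypothetical scores for a long text: far more negative, but well separated.
long_text = normalize_log_scores({"en": -640.0, "ru": -700.0})

print(round(short_text["en"], 3))  # 0.731
print(round(long_text["en"], 3))   # 1.0
```

What matters is the gap between the best score and the alternatives, not the absolute value. The langid README also describes obtaining normalized probabilities directly by building an identifier with LanguageIdentifier.from_modelstring(model, norm_probs=True).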
