Text classification is a fundamental task in natural language processing (NLP): assigning predefined categories or labels to text documents. It enables automated sorting and organization of textual data, which makes it easier to extract information from large volumes of text. Common applications include sentiment analysis, spam detection, topic labeling, and document categorization.
In this article, we will use scikit-learn, a Python machine learning toolkit, to create a simple text categorization pipeline.
We'll categorize text using a straightforward example. For this example, we'll use the dataset returned by 'sklearn.datasets.fetch_20newsgroups', a collection of newsgroup documents, restricted to two of its categories: rec.sport.baseball (label 0) and sci.space (label 1).
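A minimal sketch of the loading step, assuming pandas is available and that the two categories are the ones named in the evaluation report later in this article. The helper names `to_frame` and `load_newsgroups_frame` are hypothetical, and `subset="all"` is an assumption rather than a setting stated in the text.

```python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Assumed category pair, taken from the class names in the evaluation report.
CATEGORIES = ("rec.sport.baseball", "sci.space")


def to_frame(texts, labels):
    """Pack raw documents and integer labels into a two-column DataFrame."""
    return pd.DataFrame({"text": list(texts), "label": list(labels)})


def load_newsgroups_frame(categories=CATEGORIES):
    """Load the selected newsgroups (downloads the data on first call).

    subset="all" is an assumption; "train" or "test" are also valid values.
    """
    bunch = fetch_20newsgroups(subset="all", categories=list(categories))
    return to_frame(bunch.data, bunch.target), bunch.target_names


# Usage (needs network access the first time, so it is left commented out):
# df, target_names = load_newsgroups_frame()
# print(df.head())
```

Each row holds one raw post, headers included, and an integer label that indexes into `target_names`.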
                                                text  label
0  From: mss@netcom.com (Mark Singer)\nSubject: R...      0
1  From: cuz@chaos.cs.brandeis.edu (Cousin It)\nS...      0
2  From: J019800@LMSC5.IS.LMSC.LOCKHEED.COM\nSubj...      0
3  From: tedward@cs.cornell.edu (Edward [Ted] Fis...      0
4  From: snichols@adobe.com (Sherri Nichols)\nSub...      0
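Term frequency-inverse document frequency (TF-IDF) will be used to convert the text into numerical vectors. Each document becomes a vector of word weights, where a word scores higher the more often it appears in that document and the rarer it is across the whole collection.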
We'll use a Support Vector Machine (SVM) with a linear kernel for classification.
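A sketch of the training step, under stated assumptions: the eight-sentence corpus is a hypothetical stand-in for the newsgroup data, and the split ratio and `random_state` are illustrative values. Only `SVC(kernel='linear')` comes from the article's own output.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy corpus: label 0 = baseball, label 1 = space.
texts = [
    "The pitcher struck out the batter in the ninth inning",
    "A home run over the fence won the baseball game",
    "The shortstop threw the runner out at first base",
    "Our team traded its best pitcher before the playoffs",
    "The rocket launch put a satellite into orbit",
    "Astronauts repaired the telescope during the shuttle mission",
    "The probe sent back images of the planet and its moons",
    "NASA plans a new mission to orbit the moon",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Hold out part of the data; stratify keeps both classes in each split.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit TF-IDF on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

# Train a linear-kernel support vector classifier on the TF-IDF vectors.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
print(clf)
```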
Output:
SVC(kernel='linear')
Evaluate the model using the accuracy score and a classification report, both computed on the held-out test set.
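The evaluation could be sketched as follows. The toy corpus, split ratio, and variable names are assumptions made so the block runs on its own. Its numbers will not match the article's figures, which come from the real newsgroup data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy corpus: label 0 = baseball, label 1 = space.
target_names = ["rec.sport.baseball", "sci.space"]
texts = [
    "The pitcher struck out the batter in the ninth inning",
    "A home run over the fence won the baseball game",
    "The shortstop threw the runner out at first base",
    "Our team traded its best pitcher before the playoffs",
    "The rocket launch put a satellite into orbit",
    "Astronauts repaired the telescope during the shuttle mission",
    "The probe sent back images of the planet and its moons",
    "NASA plans a new mission to orbit the moon",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)
vectorizer = TfidfVectorizer(stop_words="english")
clf = SVC(kernel="linear")
clf.fit(vectorizer.fit_transform(X_train_txt), y_train)

# Predict on the held-out documents and score the predictions.
y_pred = clf.predict(vectorizer.transform(X_test_txt))
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(
    y_test, y_pred, labels=[0, 1], target_names=target_names, zero_division=0
)

print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(report)
```

Passing `target_names` makes the report show readable category names instead of the integer labels.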
Accuracy: 0.9966
Classification Report:
                    precision    recall  f1-score   support
rec.sport.baseball       0.99      1.00      1.00       286
         sci.space       1.00      0.99      1.00       309
          accuracy                           1.00       595
         macro avg       1.00      1.00      1.00       595
      weighted avg       1.00      1.00      1.00       595
The final step is a function, predict_category, that takes a text input, vectorizes it with the already fitted vectorizer, and predicts its category with the trained classifier. The function then maps the predicted label to the corresponding category name from the newsgroups dataset. An example call classifies a sample text about exoplanets.
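One way such a function might look is sketched below. The toy training corpus and the sample sentence are hypothetical stand-ins, so the model here is far weaker than one trained on the full newsgroup data. Only the name `predict_category` and its three steps come from the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus: label 0 = baseball, label 1 = space.
CATEGORY_NAMES = ["rec.sport.baseball", "sci.space"]
texts = [
    "The pitcher struck out the batter in the ninth inning",
    "A home run over the fence won the baseball game",
    "The shortstop threw the runner out at first base",
    "Our team traded its best pitcher before the playoffs",
    "The rocket launch put a satellite into orbit",
    "Astronauts repaired the telescope during the shuttle mission",
    "The probe sent back images of the planet and its moons",
    "NASA plans a new mission to orbit the moon",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Fit the vectorizer and classifier once; the function below reuses them.
vectorizer = TfidfVectorizer(stop_words="english")
clf = SVC(kernel="linear")
clf.fit(vectorizer.fit_transform(texts), labels)


def predict_category(text):
    """Return the category name predicted for a single piece of text."""
    text_vec = vectorizer.transform([text])  # encode with the fitted vocabulary
    label = clf.predict(text_vec)[0]         # integer class label
    return CATEGORY_NAMES[label]             # map the label to its name


sample_text = "The telescope found a new planet in orbit around a distant star"
print(f"The predicted category is: {predict_category(sample_text)}")
```

Wrapping the input in a list matters because `transform` expects an iterable of documents, not a bare string.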
The predicted category is: sci.space
In this article, we showed how to use scikit-learn to create a simple text categorization pipeline. The first steps were loading and preparing the dataset and using TF-IDF to convert the text into numerical representations. We then trained an SVM classifier, assessed its effectiveness, and provided a function for categorizing new text. Depending on the dataset and the requirements, this method can be adapted to a variety of text classification tasks, including topic categorization, sentiment analysis, and spam detection.