The author selected the Open Internet/Free Speech fund to receive a donation as part of the Write for DOnations program.
Much of the data generated today is unstructured and must be processed before it yields insights. Examples of unstructured data include news articles, social media posts, and search history. Analyzing natural language and making sense of it falls under the field of Natural Language Processing (NLP). Sentiment analysis is a common NLP task that involves classifying texts, or parts of texts, into pre-defined sentiment categories. You will use the Natural Language Toolkit (NLTK), a commonly used NLP library in Python, to analyze textual data.
In this tutorial, you will prepare a dataset of sample tweets from the NLTK package for NLP, using several data cleaning methods. Once the dataset is ready for processing, you will train a model on pre-classified tweets and use the model to classify the sample tweets into negative and positive sentiments.
This article assumes that you are familiar with the basics of Python (see our How To Code in Python 3 series), primarily the use of data structures, classes, and methods. The tutorial assumes that you have no background in NLP or nltk, although some familiarity with them is an advantage.
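The cleaning and feature-preparation steps described above can be sketched in plain Python before bringing in NLTK. This is a minimal illustration, not the tutorial's own code: the sample tweets, the tiny `STOP_WORDS` set, and the helper names `clean_tokens` and `to_featureset` are all assumptions made for the example. It assumes token-presence features of the form `token = True`, which matches the feature output quoted in the comments further down.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set; NLTK ships a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "to", "i", "for", "this", "so"}

# Single punctuation characters only, so emoticons such as ":)" survive.
PUNCTUATION = set(string.punctuation)

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tokens(tokens):
    """Drop links, @mentions, lone punctuation and stop words; lowercase the rest."""
    cleaned = []
    for token in tokens:
        if URL_RE.fullmatch(token) or MENTION_RE.fullmatch(token):
            continue
        token = token.lower()
        if token in STOP_WORDS or token in PUNCTUATION:
            continue
        cleaned.append(token)
    return cleaned


def to_featureset(tokens):
    """Turn a token list into a {token: True} dictionary of presence features."""
    return {token: True for token in tokens}


# Made-up, pre-tokenized sample tweets with their labels.
raw = [
    (["@friend", "Thanks", "for", "the", "follow", ":)", "https://example.com"], "Positive"),
    (["I", "missed", "the", "bus", ":(", "!"], "Negative"),
]

dataset = [(to_featureset(clean_tokens(tokens)), label) for tokens, label in raw]

for features, label in dataset:
    print(label, features)
```

A list of `(feature dictionary, label)` pairs like `dataset` is the shape that `nltk.NaiveBayesClassifier.train` accepts, so a real run would pass the full cleaned corpus to it rather than these two toy examples.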
Shaumik is an optimist, but one who carries an umbrella. An undergrad at IITR, he loves writing, when he's not busy keeping the blue flag flying high.
Hi Shaumik,
Thank you very much for this brilliant tutorial. I'm in the process of developing a few custom tools for Alteryx and this tutorial was absolutely legendary!!
Really interesting read, I wonder about the speed though. Would having this hosted as a service as an API endpoint on lambda or cloud functions make the speed of feedback somewhat usable in real-world scenarios or you have any other tips on the matter?
So how can we alter the logic so that the training part only needs to run once, since it takes a lot of time and resources? In real-life scenarios, most of the time only the custom sentence will be changing.
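One common way to run the training step only once is to save the trained classifier to disk and reload it on later runs. The sketch below shows that caching pattern with Python's `pickle` module. The names `load_or_train` and `train_model` are hypothetical, and `train_model` is a stand-in that would be replaced by the tutorial's actual training code; the sketch also assumes the trained model object can be pickled.

```python
import os
import pickle
import tempfile


def load_or_train(path, train_fn):
    """Return the model cached at `path`, training and saving it on first use."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    model = train_fn()
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    return model


calls = []


def train_model():
    # Hypothetical placeholder for the slow training step.
    calls.append(1)
    return {"trained": True}


# A temporary directory keeps the example self-contained.
cache_path = os.path.join(tempfile.mkdtemp(), "classifier.pickle")

first = load_or_train(cache_path, train_model)   # trains and writes the file
second = load_or_train(cache_path, train_model)  # reads the file, no training

print("training runs:", len(calls))
```

With the classifier cached this way, only the cleaning and classification of the custom sentence has to run each time. Pickle files should only be loaded from sources you trust.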
I think there's a slice too much in this example:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]
print(tweet_tokens[0])
Seems to me you wanted to show a single example tweet, so makes sense to keep the [0] in your print() function, but remove it from the line above. Otherwise tweet_tokens becomes less useful.
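The effect described in this comment can be reproduced without downloading the corpus. The sketch below uses a made-up nested list, `tokenized_tweets`, as a stand-in for the value returned by `twitter_samples.tokenized('positive_tweets.json')`, assuming it holds one list of tokens per tweet.

```python
# Hypothetical stand-in for twitter_samples.tokenized('positive_tweets.json'):
# one list of tokens per tweet.
tokenized_tweets = [
    ["Thanks", "for", "the", "follow", ":)"],
    ["Good", "morning", "!"],
]

# As posted: indexing on assignment keeps only the first tweet,
# so indexing again in print() yields a single token.
first_only = tokenized_tweets[0]
print(first_only[0])

# As suggested: keep every tweet and index only when printing,
# which yields the full token list of the first tweet.
all_tweets = tokenized_tweets
print(all_tweets[0])
```

Keeping the whole collection in the variable means later steps can still loop over every tweet.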
Hi, Shaumik:
In the final code, what is
text = twitter_samples.strings('tweets.20150430-223406.json')
…for? It looks like "text" is never referenced or used after that.
Thank you, ~Todd
I tried the sentiment analysis with the positive and negative tweets, but I want to add more sentiments to it, like sarcasm or neutral. I tried adding 5000 neutral tweets and followed the same procedure as for positive and negative. If I do so, can I get the ratio of all three sentiments when I use the "classifier.show_most_informative_features(10)" command? Currently I am getting ratios of neutral with either only positive or only negative.
following is the output:
Most Informative Features
    :( = True         Negati : Neutra = 1864.7 : 1.0
    :) = True         Positi : Negati =  847.0 : 1.0
    rt = True         Neutra : Negati =  807.8 : 1.0
    :d = True         Positi : Neutra =  672.7 : 1.0
    :-) = True        Positi : Neutra =  215.0 : 1.0
    … = True          Neutra : Negati =  198.0 : 1.0
    tory = True       Neutra : Positi =  108.7 : 1.0
    morning = True    Positi : Neutra =  104.0 : 1.0
    rather = True     Neutra : Negati =   99.9 : 1.0
    deal = True       Neutra : Positi =   84.4 : 1.0
How do I compare all three together? Or, if I add more sentiments, how do I compare their ratios to each other?
The obtained accuracy is very high so I was wondering what made the model that accurate when it does not even handle double negation sentences. Does it consist of any outliers? Or Is there something else?
What other classifiers in nltk can we use here in place of Naive Bayes?
Great tutorial, this is very much appreciated!
One of, if not THE cleanest, well-thought-out tutorials I have seen! Thanks for taking the time and going to the trouble to get it right. Very helpful!..