Sentiment analysis for Twitter in Python

I'm looking for an open-source implementation of sentiment analysis (http://en.wikipedia.org/wiki/Sentiment_analysis), preferably in Python. Is anyone familiar with such an open-source implementation that I could use?

I'm writing an application that searches Twitter for some search term, say "youtube", and counts "happy" tweets vs. "sad" tweets. I'm using Google App Engine, so it's in Python. I'd like to be able to classify the search results coming back from Twitter, and I'd like to do that in Python. So far I haven't been able to find such a sentiment analyzer, especially not in Python. Are you familiar with such an open-source implementation that I could use? Preferably it would already be in Python, but if not, hopefully I can translate it into Python.

Note that the texts I'm analyzing are very short: they are tweets. So ideally the classifier would be optimized for such short texts.

BTW, Twitter does support the ":)" and ":(" operators in search, which are meant to do exactly this, but unfortunately the classification they provide isn't that great, so I figured I might as well try it myself.

Thanks!

By the way, an early demo is here, and the code I have so far is here; I'd like to open-source it with any interested developers.


I came across Natural Language Toolkit a while ago. You could probably use it as a starting point. It also has a lot of modules and addons, so maybe they already have something similar.

I think you may find it difficult to find what you're after. The closest thing that I know of is LingPipe, which has some sentiment analysis functionality and is available under a limited kind of open-source licence, but is written in Java.

Also, sentiment analysis systems are usually developed by training a system on product/movie review data which is significantly different from the average tweet. They are going to be optimised for text with several sentences, all about the same topic. I suspect you would do better coming up with a rule-based system yourself, perhaps based on a lexicon of sentiment terms like the one the University of Pittsburgh provide.

Check out We Feel Fine for an implementation of a similar idea with a really beautiful interface (and also twitrratr).

A lot of research papers indicate that a good starting point for sentiment analysis is looking at adjectives, e.g., whether they are positive or negative adjectives. For a short block of text this is pretty much your only option... There are papers that look at entire documents, or at sentence-level analysis, but, as you say, tweets are quite short... There is no real magic approach to understanding the sentiment of a sentence, so I think your best bet would be hunting down one of these research papers and trying to get their data set of positively/negatively oriented adjectives.
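
To make that concrete, here is a minimal sketch of the lexicon-counting idea. The tiny word sets are just placeholders, not a published lexicon; for anything serious you would swap in a real list such as the Pittsburgh one.

```python
# Minimal sketch of lexicon-based scoring: count positive vs. negative
# words in a tweet. The word sets below are placeholder examples only.
import re

POSITIVE = {"good", "great", "awesome", "happy", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "awful", "sad", "hate", "poor"}

def score(tweet):
    words = re.findall(r"[a-z']+", tweet.lower())
    # True/False count as 1/0, so this is (#positive - #negative)
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

print(score("I love this video, it's great"))   #  2 -> positive
print(score("this video is awful and sad"))     # -2 -> negative
```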

Now, that having been said, sentiment is domain specific, and you might find it difficult to get a high level of accuracy with a general-purpose data set.

Good luck.

With most of these kinds of applications, you'll have to roll much of your own code for a statistical classification task. As Lucka suggested, NLTK is the perfect tool for natural language manipulation in Python, so long as your goal doesn't interfere with the non-commercial nature of its license. However, I would suggest other software packages for modeling. I haven't found many strong advanced machine learning models available for Python, so I'm going to suggest some standalone binaries that easily cooperate with it.

You may be interested in The Toolkit for Advanced Discriminative Modeling, which can be easily interfaced with Python. It has been used for classification tasks in various areas of natural language processing. You also have your pick of a number of different models. I'd suggest starting with Maximum Entropy classification, so long as you're already familiar with implementing a Naive Bayes classifier. If not, you may want to look into it and code one up to really get a decent understanding of statistical classification as a machine learning task.
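
If you do go the Naive Bayes route first, a rough baseline with NLTK (rather than TADM) looks something like the sketch below; the handful of labeled tweets are made-up examples, and in practice you would substitute your own training data.

```python
# A rough Naive Bayes baseline with NLTK, using bag-of-words features.
# The labeled tweets below are invented examples for illustration only.
import nltk

def features(tweet):
    return {word: True for word in tweet.lower().split()}

train = [
    (features("i love youtube so much"), "happy"),
    (features("this video made my day"), "happy"),
    (features("i hate buffering on youtube"), "sad"),
    (features("worst video ever so boring"), "sad"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features("i love this boring video")))
classifier.show_most_informative_features(5)
```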

The University of Texas at Austin computational linguistics groups have held classes where most of the projects coming out of them have used this great tool. You can look at the course page for Computational Linguistics II to get an idea of how to make it work and what previous applications it has served.

Another great tool that works in the same vein is Mallet. The difference is that Mallet has a bit more documentation and some more models available, such as decision trees, and it's in Java, which, in my opinion, makes it a little slower. Weka is a whole suite of different machine learning models in one big package that includes some graphical stuff, but it's really mostly meant for pedagogical purposes, and isn't really something I'd put into production.

Good luck with your task. The real difficult part will probably be the amount of knowledge engineering required up front for you to classify the 'seed set' off of which your model will learn. It needs to be pretty sizeable, depending on whether you're doing binary classification (happy vs sad) or a whole range of emotions (which will require even more). Make sure to hold out some of this engineered data for testing, or run some tenfold or remove-one tests to make sure you're actually doing a good job predicting before you put it out there. And most of all, have fun! This is the best part of NLP and AI, in my opinion.
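
A sketch of that hold-out evaluation, again with NLTK; `labeled_tweets` is assumed to be a list of (text, label) pairs you have annotated yourself.

```python
# Hold out part of the labeled data and measure accuracy on it before
# trusting the model. `labeled_tweets` is an assumed list of (text, label).
import random
import nltk

def features(tweet):
    return {word: True for word in tweet.lower().split()}

def evaluate(labeled_tweets, holdout=0.2):
    data = [(features(text), label) for text, label in labeled_tweets]
    random.shuffle(data)
    cut = max(1, int(len(data) * holdout))
    test_set, train_set = data[:cut], data[cut:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return nltk.classify.accuracy(classifier, test_set)
```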

Good luck with that.

Sentiment is enormously contextual, and tweeting culture makes the problem worse because you aren't given the context for most tweets. The whole point of twitter is that you can leverage the huge amount of shared "real world" context to pack meaningful communication in a very short message.

If they say the video is bad, does that mean bad, or "bad" (as in good)?

A linguistics professor was lecturing to her class one day. "In English," she said, "A double negative forms a positive. In some languages, though, such as Russian, a double negative is still a negative. However, there is no language wherein a double positive can form a negative."

A voice from the back of the room piped up, "Yeah... right."

Somewhat wacky thought: you could try using the Twitter API to download a large set of tweets, and then classifying a subset of that set using emoticons: one positive group for ":)", ":]", ":D", etc, and another negative group with ":(", etc.

Once you have that crude classification, you could search for more clues with frequency or ngram analysis or something along those lines.
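
As a rough illustration of that bootstrapping idea (how you download the tweets is left out; `tweets` is assumed to be a list of strings you have already collected):

```python
# Label already-collected tweets by the emoticons they contain, then look
# at which words are frequent in each class as candidate sentiment clues.
from collections import Counter
import re

POS_EMOTICONS = (":)", ":]", ":D", ":-)")
NEG_EMOTICONS = (":(", ":[", ":-(")

def bootstrap_label(tweet):
    pos = any(e in tweet for e in POS_EMOTICONS)
    neg = any(e in tweet for e in NEG_EMOTICONS)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None  # ambiguous or no emoticon -> skip

def class_word_counts(tweets):
    counts = {"positive": Counter(), "negative": Counter()}
    for tweet in tweets:
        label = bootstrap_label(tweet)
        if label:
            counts[label].update(re.findall(r"[a-z']+", tweet.lower()))
    return counts
```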

It may seem silly, but serious research has been done on this (search for "sentiment analysis" and emoticon). Worth a look.

Thanks everyone for your suggestions, they were indeed very useful! I ended up using a Naive Bayesian classifier, which I borrowed from here. I started by feeding it with a list of good/bad keywords, and then added a "learn" feature using user feedback. It turned out to work pretty nicely.
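
Roughly, the setup looks something like this (a simplified sketch rather than my exact code): seed the classifier with the keyword lists, then retrain whenever a user corrects one of its guesses.

```python
# Sketch only: seed a Naive Bayes classifier with good/bad keywords, then
# fold user feedback back into the training data and retrain.
import nltk

GOOD_SEED = ["love", "great", "happy", "awesome"]
BAD_SEED = ["hate", "terrible", "sad", "awful"]

def features(text):
    return {w: True for w in text.lower().split()}

class FeedbackClassifier:
    def __init__(self):
        self.training = ([(features(w), "happy") for w in GOOD_SEED] +
                         [(features(w), "sad") for w in BAD_SEED])
        self.classifier = nltk.NaiveBayesClassifier.train(self.training)

    def classify(self, tweet):
        return self.classifier.classify(features(tweet))

    def learn(self, tweet, correct_label):
        # the "learn" feature: add the corrected example and retrain
        self.training.append((features(tweet), correct_label))
        self.classifier = nltk.NaiveBayesClassifier.train(self.training)
```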

The full details of my work are in a blog post.

Again, your help was very useful, so thank you!

There's a Twitter Sentiment API by TweetFeel that does advanced linguistic analysis of tweets, and can retrieve positive/negative tweets. See http://www.webservius.com/corp/docs/tweetfeel_sentiment.htm

Take a look at the Twitter sentiment analysis tool. It's written in Python, and it uses a Naive Bayes classifier with semi-supervised machine learning. The source can be found here.

I have constructed a word list labeled with sentiment. You can access it from here:

http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/6010/zip/imm6010.zip

You will find a short Python program on my blog:

http://finnaarupnielsen.wordpress.com/2011/06/20/simplest-sentiment-analysis-in-python-with-af/

This post displays how to use the word list with single sentences as well as with Twitter.
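
The core of the approach is just summing word valences, roughly along the lines of this simplified sketch (the filename AFINN-111.txt and the "word TAB score" line format are assumptions about the downloaded archive; adjust to what it actually contains).

```python
# Score a sentence with a valence word list in the AFINN style
# (one "word<TAB>integer score" entry per line; filename is assumed).
import re

def load_afinn(path="AFINN-111.txt"):
    scores = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            word, value = line.rstrip("\n").split("\t")
            scores[word] = int(value)
    return scores

def sentiment(text, scores):
    words = re.findall(r"\w+", text.lower())
    return sum(scores.get(w, 0) for w in words)

afinn = load_afinn()
print(sentiment("This is utterly excellent!", afinn))
```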

Word-list approaches have their limitations. You will find an investigation of the limitations of my word list in the article "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs". That article is available from my homepage.

Please note that a unicode(s, 'utf-8') call is missing from the code (for pedagogical reasons).

For those interested in coding Twitter sentiment analysis from scratch, there is a Coursera course, "Data Science", with Python code on GitHub (as part of assignment 1 - link). The sentiment scores come from the AFINN-111 word list.

You can find working solutions, for example here. In addition to the AFINN-111 sentiment list, there is a simple implementation of building a dynamic term list based on the frequency of terms in tweets that have a pos/neg score (see here).

Maybe TextBlob (based on NLTK and pattern) is the right sentiment analysis tool for you.
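
For example (assuming textblob is installed; polarity runs from -1 for negative to +1 for positive):

```python
# Quick TextBlob example: classify sample tweets by polarity sign.
from textblob import TextBlob

for tweet in ["I love the new youtube layout",
              "youtube keeps crashing, so annoying"]:
    polarity = TextBlob(tweet).sentiment.polarity
    label = "happy" if polarity > 0 else "sad" if polarity < 0 else "neutral"
    print(f"{polarity:+.2f}  {label}  {tweet}")
```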