Monday, November 10, 2014

Sentiment Analysis on Weibo Data

Sentiment analysis means analyzing a sentence to find out whether its sentiment is positive or negative. Part of our team's project is to analyze the sentiment of comments about 4 domestic mobile phones: Xiaomi M4, Smartisan T1, Huawei Honor6, and Meizu MX4.

Here are the key steps of the algorithm:
1.    Read the text and tokenize it.
2.    In every sentence, find each sentiment word and record its polarity (positive or negative, according to the sentiment dictionary) and its position.
3.    Search for an adverb of degree before the sentiment word, stopping at the first one found. Adverbs of different degrees have different weights, and the weight is multiplied by the sentiment value (the initial value of every sentiment word is assumed to be 1).
4.    Count all the negation words before the sentiment word. If the count is odd, multiply the sentiment value by -1; if it is even, multiply it by 1.
5.    For every ‘!’ in the sentence, add 2 to the sentiment value of the corresponding polarity.
6.    Print the positive and negative values of every sentence, with the corresponding percentages.
7.    Sum all the sentiment values and print the positive and negative values of the whole text, with the corresponding percentages.
8.    Calculate the mean and variance of the positive and negative sentiment over the text.
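
The steps above can be sketched in Python roughly as follows. This is only an illustration, not our actual program: the four small lexicons are hypothetical toy data, the tokens are assumed to come from a tokenizer already, and a few details the steps leave open are filled in by assumption (the backward search stops at the previous sentiment word or the start of the sentence, a negated value is counted under the opposite polarity, and each ‘!’ is credited to the polarity of the most recent sentiment word).

```python
from statistics import mean, pvariance

# Hypothetical toy lexicons; a real run would load full dictionaries.
POSITIVE = {"好", "喜欢", "高兴"}
NEGATIVE = {"差", "讨厌", "失望"}
DEGREE = {"非常": 2.0, "很": 1.5, "有点": 0.5}   # adverb -> assumed weight
NEGATION = {"不", "没", "没有"}


def score_sentence(tokens):
    """Return (positive, negative) scores for one tokenized sentence."""
    pos = neg = 0.0
    last_polarity = 0   # polarity of the most recent sentiment word
    start = 0           # index just after the previous sentiment word
    for i, tok in enumerate(tokens):
        if tok in POSITIVE or tok in NEGATIVE:
            value = 1.0 if tok in POSITIVE else -1.0
            window = tokens[start:i]
            # Step 3: nearest adverb of degree before the word, then stop.
            for w in reversed(window):
                if w in DEGREE:
                    value *= DEGREE[w]
                    break
            # Step 4: an odd number of negation words flips the sign.
            if sum(w in NEGATION for w in window) % 2 == 1:
                value = -value
            if value > 0:
                pos += value
            else:
                neg += -value
            last_polarity = 1 if value > 0 else -1
            start = i + 1
        elif tok in ("!", "！") and last_polarity != 0:
            # Step 5: each '!' adds 2 to the polarity seen most recently.
            if last_polarity > 0:
                pos += 2
            else:
                neg += 2
    return pos, neg


def percentages(pos, neg):
    """Steps 6 and 7: share of positive and negative value, in percent."""
    total = pos + neg
    if total == 0:
        return 0.0, 0.0
    return 100 * pos / total, 100 * neg / total


def score_text(sentences):
    """Steps 6 to 8 for a list of tokenized sentences."""
    scores = [score_sentence(s) for s in sentences]
    pos_values = [p for p, _ in scores]
    neg_values = [n for _, n in scores]
    return {
        "per_sentence": scores,
        "total": (sum(pos_values), sum(neg_values)),
        "mean": (mean(pos_values), mean(neg_values)),
        "variance": (pvariance(pos_values), pvariance(neg_values)),
    }


if __name__ == "__main__":
    sample = [["我", "非常", "喜欢", "小米", "!"], ["屏幕", "不", "好"]]
    result = score_text(sample)
    print(result["per_sentence"])
    print(result["total"], percentages(*result["total"]))
    print(result["mean"], result["variance"])
```

Here the population variance (`pvariance`) is used; if the sentences are treated as a sample, `statistics.variance` would be the alternative.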
 
While programming, I ran into some problems:
1.       Python is a little troublesome for processing Chinese characters. The text should be handled as Unicode.
2.       Python sometimes fails to read the data in a txt file completely.
3.       Internet slang differs a lot from the words in a standard sentiment dictionary. We should add and edit words in the sentiment dictionary after studying the language habits of netizens. That will improve the accuracy of the analysis.
4.       The sentiment value from analyzing the whole text differs somewhat from the sum of the values from analyzing each sentence separately. I think some unnecessary values arise at the boundary between two sentences. For example, a sentiment word at the beginning of a sentence will look for an adverb of degree at the end of the previous sentence.
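
One way to approach problems 1 and 4 is sketched below, as an assumption rather than a description of our program: read the file with an explicit encoding so the text is Unicode from the start, and split it into sentences before any scoring so that a backward search can never reach into the previous sentence. The function names and the set of sentence-ending punctuation marks are hypothetical, and the code assumes Python 3 and a UTF-8 file. Whether an explicit encoding also cures the incomplete reads in problem 2 is not something I have verified.

```python
import re

# A sentence is a run of characters that are not end marks or newlines,
# followed by any end marks that close it (assumed set of end marks).
SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*")


def read_text(path):
    """Read the whole file as Unicode text, assuming it is UTF-8 encoded."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("手机很好！屏幕不好。\n太贵了"):
        print(sentence)
```

Each sentence can then be tokenized and scored on its own, which keeps the whole-text total equal to the sum of the per-sentence values.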
 
We are still working on improving the results. I hope we can achieve our goals.

14 comments:

  1. Hello Jiang Yue! You shared the sentiment analysis part of your group project. I can see that your group has a very structured and well-organized algorithm. I hope you achieve your goals!

  2. I am interested in the fourth step of your algorithm. It has such a large impact that it can determine whether a post is positive or not.
    Is there any case where a sentence has an even number of negation words, so the sentiment value is not reversed, even though it should be?
    For example: 我真的不不不不高興! (roughly, "I am really not, not, not, not happy!")

    1. Thanks for your comment! That example is indeed a problem. Different people have different habits when expressing their feelings. We don't have accurate data to train classifiers, so I chose to use a sentiment dictionary to extract a general trend. If you have a better method, I would be glad to discuss it with you.

    2. Hi Yue.
      Given that this rule has such a great impact on the result, I think it is worthwhile to manually examine some of the samples to which the rule was applied.
      It may not matter if the problem rarely occurs.

  3. After trying to analyze Chinese Weibo posts, I found that you have to be familiar with the language or you cannot get accurate results. The dictionary is a key issue, but we still have to build it by hand. There is no end in NLP. This project is a great challenge. Thank you for sharing.

    1. Yes, you are right. Maybe machine learning would lead to more accurate results. But it would need more data pre-processing steps, which might be harder than building rules in our situation.

  4. An interesting project topic. But handling Chinese words may become a big problem for you, because there is no natural delimiter between Chinese words. Besides, Chinese corpora are also very scarce. I hope you can solve those problems and get a good grade on your project.

    1. I use 'jieba', a Python library for Chinese word segmentation, to tokenize the text. I tried it on some sentences and it performed well. We also worked out how 'jieba' segments words and adjusted our dictionary to suit it.

  5. Hi Yue,
    The steps of sentiment analysis are described in detail in this post, with a specific application to mobile phone comments. It helped me see some intractable problems that may occur when conducting sentiment analysis. I hope you can solve these problems and get the results you want.

  6. Hi, I am also trying to analyze Weibo data. I would like to know how to get the access token and the authorization to use the Weibo API. Could you share more details? Thank you very much.

  7. hi chenxi, um, I really appreciate that you list your opinions and thoughts as 1, 2, 3 ..., which helps us make up our minds. Sentiment analysis is important; every project group uses it to organize its data and reach its conclusions.

    1. sorry.. jiangyue...not chenxi..just edit too quickly.....

    2. Haha! orz Just can't help myself stop leaving a line.... Good job!

  8. You wrote about our project in your last blog post, so clever honey~ ^-^ Today's presentation was perfect! I cannot imagine how I could have done the project without you guys. You are all so brilliant~!
