Sentiment Analysis Study Notes 1.1: NLTK Naive-Bayes

NLTK Naive Bayes Classification

For sentiment analysis, NLTK provides a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers.

Bag of Words Feature Extraction

For text, we’ll use a simplified bag of words model where every word is a feature name with a value of True.
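
As a minimal sketch of this model, the snippet below turns a token list into a feature dictionary. The sample sentence is made up for illustration and is tokenized by whitespace for simplicity. Repeated words collapse into a single feature, so word counts and word order are discarded.

```python
def word_feats(words):
    """Map each distinct word to True (presence only; counts and order are dropped)."""
    return {word: True for word in words}

# Hypothetical example sentence, split on whitespace.
tokens = "a great film with a great cast".split()
feats = word_feats(tokens)
print(feats)
```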

Training and Testing the Naive Bayes Classifier

The classifier training method expects a list of labeled feature sets in the form [(feats, label)], where feats is a feature dictionary and label is the classification label.
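
To make the expected input shape concrete, here is a small sketch with hand-written toy data. The documents and labels are invented for illustration and do not come from the corpus. Each item pairs one document's feature dictionary with its label.

```python
def word_feats(words):
    return {word: True for word in words}

# Hypothetical toy documents, each given as (tokens, label).
toy_docs = [
    (["great", "fun", "movie"], "pos"),
    (["boring", "plot"], "neg"),
]

# Build the [(feats, label)] list: one pair per document.
labeled_feats = [(word_feats(tokens), label) for tokens, label in toy_docs]

for feats, label in labeled_feats:
    print(label, sorted(feats))
```

A list in this shape is what NaiveBayesClassifier.train takes as input.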

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

# Requires the corpus data: nltk.download('movie_reviews')

def word_feats(words):
    # Bag of words: every word becomes a feature with the value True.
    return {word: True for word in words}

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# One (feats, label) pair per review.
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Use the first 3/4 of each category for training and the rest for testing.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()
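
The 3/4 train/test split in the code above can be isolated into a small helper. This is only a sketch: split_by_fraction is a hypothetical name, not an NLTK function, and the integer lists stand in for the labeled feature sets. Splitting each category separately keeps the pos/neg balance the same in the training and test sets.

```python
def split_by_fraction(items, fraction=0.75):
    """Return (train, test), taking the first `fraction` of items for training."""
    cutoff = int(len(items) * fraction)
    return items[:cutoff], items[cutoff:]

# Placeholder data standing in for negfeats and posfeats.
neg_items = list(range(8))
pos_items = list(range(100, 108))

neg_train, neg_test = split_by_fraction(neg_items)
pos_train, pos_test = split_by_fraction(pos_items)

train = neg_train + pos_train
test = neg_test + pos_test
print(len(train), len(test))
```

Because the split takes items in their existing order rather than shuffling them, the result is reproducible but depends on that order.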


SentiWordNet

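from nltk.corpus import sentiwordnet as swn

# Requires the corpus data: nltk.download('sentiwordnet') and nltk.download('wordnet')

# Look up a single sense with senti_synset and read its scores.
breakdown = swn.senti_synset('breakdown.n.03')
print(breakdown)
print(breakdown.pos_score())  # positivity score
print(breakdown.neg_score())  # negativity score
print(breakdown.obj_score())  # objectivity score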
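
The three scores are related: in NLTK's SentiWordNet reader, the objectivity score is computed as 1 minus the sum of the positive and negative scores. The sketch below mirrors that relationship in plain Python so that it runs without the corpus. ToySentiScore is a hypothetical stand-in class, and the numbers are illustrative values rather than corpus lookups.

```python
class ToySentiScore:
    """Hypothetical stand-in that mimics the three score accessors."""

    def __init__(self, pos, neg):
        self._pos = pos
        self._neg = neg

    def pos_score(self):
        return self._pos

    def neg_score(self):
        return self._neg

    def obj_score(self):
        # Objectivity is whatever is left after positivity and negativity.
        return 1.0 - (self._pos + self._neg)

# Illustrative values only.
score = ToySentiScore(pos=0.0, neg=0.25)
total = score.pos_score() + score.neg_score() + score.obj_score()
print(score.obj_score(), total)
```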
