Sentiment Analysis Reference

Domains

Sources: Weibo (Twitter), Taobao, YouTube short comments; formats: audio, video, music, images

Applications

Product reviews, event evaluation, prediction (stocks; preferred products in advertising; hotel and airport reviews), recommender systems

Challenges

Language translation, misspellings, sarcasm
Chinese radicals (character components)

Steps

  1. Web crawling: collect raw text (tweets, reviews, etc.)
  2. Text cleaning: remove noise, normalize, and filter stopwords
  3. Topic extraction: identify the topics or aspects discussed
  4. Semantic analysis: Semantic analysis is the most important step to obtain investor sentiment data.

Methods

  • Lexicon (dictionary-based)

    NTUSD, HowNet, DataTang, SnowNLP

  • Corpus (corpus-based)

    Movie reviews, Amazon, Weibo, hotel reviews
    NLTK corpora (feature extraction → model training); vocabulary size; deep learning (LSTM, Long Short-Term Memory)

  • word2vec (CBOW + skip-gram)

    Context and word order. CBOW aims to predict the probability of the current word given its context; Skip-gram does the opposite, predicting the context given the current word.

  • doc2vec (dm+dbow)

    Distributed Memory (DM) and Distributed Bag of Words (DBOW). DM tries to predict the probability of a word given the context and the paragraph vector.
    Dimensionality reduction with scikit-learn t-SNE
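
A minimal sketch of the DM/DBOW idea and the t-SNE projection mentioned above, assuming gensim ≥ 4 and scikit-learn; the corpus and hyperparameters are illustrative only, not the original setup.

```python
# Sketch: train Doc2Vec in DM and DBOW modes, then project document vectors with t-SNE.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.manifold import TSNE

docs = ["great movie loved it", "terrible plot waste of time", "decent hotel friendly staff"]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

dm_model = Doc2Vec(tagged, vector_size=50, dm=1, min_count=1, epochs=40)    # Distributed Memory
dbow_model = Doc2Vec(tagged, vector_size=50, dm=0, min_count=1, epochs=40)  # Distributed Bag of Words

vectors = np.array([dm_model.dv[i] for i in range(len(docs))])
coords = TSNE(n_components=2, perplexity=2.0, init="random").fit_transform(vectors)
print(coords)  # 2-D points suitable for plotting
```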

Background

Text preprocessing techniques include word segmentation, part-of-speech tagging, syntactic parsing, and so on.
For example, ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), developed by the Institute of Computing Technology of the Chinese Academy of Sciences and based on a multi-layer hidden Markov model, provides Chinese word segmentation, POS tagging, named entity recognition, and unknown-word recognition, with a reported segmentation accuracy of 97.58%. LTP (Language Technology Platform), an open-source platform developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology, offers a complete XML-based set of Chinese processing modules, including segmentation, POS tagging, named entity recognition, and dependency parsing.

An ontology model built on PMI-IR and HowNet vocabulary visualizes polarity words on a "globe": the closer a word lies to either pole, the stronger its orientation. The polarity of a sentiment word is determined by computing word similarity and mapping the most similar word onto the ontology model.

Topic extraction

To build hierarchical structures automatically, domain features can be exploited, avoiding the effort of manually constructing an ontology. One study finds candidate opinion targets using syntactic-structure similarity and heuristic rules; another, incorporating the HowNet ontology, proposes a semantics-based topic classification method for short microblog messages.

Relation extraction

A common approach builds language templates from part-of-speech patterns. For example, one study first locates the target word in a sentence, analyzes the sentence's syntax to find the words that modify the target, and then classifies the relation using rule templates.

Blogs, forums, and discussion groups on the internet carry a large volume of subjective text posted by users: reviews of a product or service, or the public's opinions about a news event or government policy. Prospective consumers can consult such reviews for decision support before buying, and government departments can browse the public's views on news events or policies to gauge public opinion. Because these subjective texts grow at an exponential rate every day, analyzing them manually would require enormous human effort and time. Using computers to automatically analyze the sentiment expressed in these subjective texts has therefore become a hot research topic, known as text sentiment analysis or opinion mining.

Comparing the optimal feature combinations of the two methods shows that sentiment-word and negation-word features are useful to both models; part-of-speech features help the SVM model but interfere with the CRF model, while degree-adverb and special-symbol features help the CRF model but interfere with the SVM model.

Danmaku (bullet comments), visualized with word clouds
Sentiment analysis of online comments aims to classify comments by the sentiment orientation they express. Applied research currently concentrates on two areas: microblog opinion, where sentiment analysis extracts hot topics and public views, and online product reviews.
By classification method, applied research on microblogs and product reviews falls into two broad groups. Machine learning methods learn sentiment features from a training set, estimate the dependency between system input and output, and then classify the test set. Pang et al. classified movie reviews with support vector machines, naive Bayes, and maximum entropy, finding SVM best while maximum entropy and naive Bayes performed comparably. Liu Zhiming et al. found that combining an SVM learner, information-gain feature selection, and TF-IDF term weighting gave the best sentiment classification results for short microblog texts.

Applications

  • w2v

    Twitter analysis, logistic-regression classification, ROC curves, ANN (artificial neural network)

  • D2v

    Movie reviews, SGDClassifier (training)
    Stock market analysis, line segments / moving averages

Machine learning

TensorFlow (convolution)

Data science in Python

Data crawling, analysis, Kaggle, visualization, big data
requests for static pages; pandas preprocessing (election, bike-sharing datasets); matplotlib/seaborn; ML/DL (sklearn, Spark); Selenium for dynamic pages

Word cloud

MapReduce (Hadoop)

Other resources

• Stanford open courses

• WIKI Sentiment analysis (https://en.wikipedia.org/wiki/Sentiment_analysis)
• INFO Awesome Sentiment Analysis (https://github.com/xiamx/awesome-sentiment-analysis)

Methods + Fields

  • Absa

    (Multilingual) Aspect Based Sentiment Analysis (ABSA) refers to the systems that determine the opinions or sentiments expressed on different features or aspects of the products and services under evaluation (e.g., battery or performance for a laptop).

    An ABSA system should be capable of classifying each opinion according to the aspect categories relevant for each domain in addition to classifying its sentiment polarity

    Multilingual; determines opinions and sentiments on aspect-level features (e.g., battery, performance)

  • Anew

    Affective Norms for English Words

    The list is further propagated using WordNet ( Miller, 1995 ) by identifying their synonyms and antonyms.

    Propagated via WordNet by identifying synonyms and antonyms

  • Ann
    Artificial Neural Network

  • Anp

    SVMs were trained using a taxonomy of a semantic construct called adjective–noun pairs

    The proposed ANPs combined a “noun” for visual detectability and an “adjective” for sentiment modulation of the object described by noun semantics, resulting in pairs like cute dog, beautiful sunset, disgusting food and terrible accident.

    SVMs trained on adjective–noun pairs, e.g., "cute dog"

  • antusd

    Chinese WordNet and e-HowNet do provide more ontology semantics, and the augmented version of the NTUSD called ANTUSD was integrated into E-HowNet

    The augmented version of NTUSD, integrated into E-HowNet

  • apt (model for business studies; finance)

    this study extends the arbitrage pricing theory (APT) risk versus behavior factor debate in asset pricing by testing APT using a multivariate framework.

  • asean (intraday)

    Studies seeking factors to explain market returns in ASEAN have tested: Nasdaq; Dow Jones; S&P 500; Nikkei; Hang Seng; Straits Times; Industrial Index; gold prices; oil

    Trading markets

  • asum (short texts)

    aspect and sentiment unification model

    The experimental results indicate that the proposed technique performs well on short texts, where ASUM performs poorly.

    ASUM performs poorly on short texts


  • beautifulsoup
    Library for crawling and parsing data in Python

  • bns

    Bi-Normal Separation (BNS). They showed that a robust feature selection allows lifting classification accuracies significantly when combined with complex feature types.

    Combined with complex feature types, it lifts classification accuracy significantly

  • bow (Bag of Words)

    While CBOW aims to predict a word given its context, Skip-gram predicts the context given a word.

    Word2vec computes continuous vector representations of words from very large datasets.

    (i) the Continuous Bag-of-Words algorithm (CBOW), whose goal is to predict a word when the surrounding words are given, and (ii) the Continuous Skip-Gram algorithm (Skip-gram), which predicts a set of words when a single word is known.

    CBOW: guess the word from the context
    Skip-gram: guess the context from the word

    a document is regarded as a bag of words (BOW), mapped into a feature vector, and then classified by machine learning techniques such as naive Bayes (NB), maximum entropy (ME), or support vector machines (SVM)
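
As a rough illustration of the last paragraph (BOW features fed to a classifier such as NB), here is a scikit-learn sketch; the texts and labels are made up for illustration.

```python
# Sketch of the BOW pipeline: documents -> term-count vectors -> naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["good product works well", "awful quality broke fast", "excellent value", "worst purchase ever"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["works great", "broke immediately"]))
```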

  • btm (short texts)

    The sentiment classifier is built using a three-level classification approach, while the aspect extractor is built using extended biterm topic model (eBTM), an extension of LDA topic model for short texts.

    Three-level classification approach


  • chatter-grabber

    We used ChatterGrabber, a web-scraping tool that randomly samples public tweets of Twitter users in the United States.

    Parameters include tweet data, user data, and media data

    Randomly samples public tweets on Twitter

  • chi (CHI statistics)

    feature selection algorithms
    Feature selection

    GR >IG >CHI >DF,

  • clm-z

    A CLM-Z based method was used for feature extraction from the visual modality

    Used for visual/image feature extraction

  • cloud

    hybrid cloud intelligence infrastructure is used to conduct large-scale experiments to analyze user sentiments and associated emotions, using data from a million Facebook users.

    A hybrid cloud infrastructure runs large-scale experiments to analyze user sentiment and emotions

  • cnn

    a deep convolutional neural network (CNN) based architecture has been proposed for aspect term extraction task.

    The trained CNN features were then fed into a SVM for classification. So, in particular we used CNN as trainable feature extractor and SVM as a classifier.

    The deep CNN-SVM -based textual sentiment analysis component is found to be the key element for out-performing the state-of-the-art model’s accuracy.

    CNN used as a trainable feature extractor; SVM as the classifier

  • code-switching (multi-lingual)

    There are even more difficult and unexplored multilingual variants, such as code-switching texts (i.e. texts that contain terms in two or more different languages); related to the translation challenge.

  • crf

    hierarchical sequence learning algorithm similar to conditional random fields (CRF) to learn sentiment

    Learns sentiment

  • customer-preference

    Mapping from customer preference to product features
    Results showed that tourism product reviews available on web sites contain valuable information about customer preferences that can be extracted using an aspect-based opinion mining approach.

    Extracts valuable information about customer preferences from tourism product review websites


  • deep-learning

The main idea of deep learning techniques is to learn complex features extracted from data with minimum external contribution using deep neural networks
deep learning architectures such as convolutional neural networks on code-switching texts,

Deep neural networks learn complex features from data with minimal external contribution
  • df (document frequency)

DF denotes the frequency with which each term occurs across all texts (documents). If the frequency of a term is too low, the term is regarded as not typical enough to represent the texts.

If a term's frequency in the texts is too low, the term is not representative

  • dsl

Domain-Specific Languages

a small, usually declarative, language expressive over the distinguishing characteristics of a set of programs in a particular problem
Expressive over the distinguishing characteristics of programs in a particular problem domain

  • dt

graphical representation of training samples, where leaves represent class labels of training samples and branches represent conjunctions of features that lead to the class labels.

DT is an algorithm commonly used in data mining. The goal of DT is to construct a tree structure that predicts the values of test samples based on the training samples.

Built from training samples; leaves represent class labels. The goal of DT is a tree structure that predicts test samples from the training samples


  • eeg

Electroencephalogram (EEG) is considered as a promising tool for measuring cognitive workload at lower cost with easy handling, wireless connectivity and lower maintenance cost

EEG signals measure cognitive workload at low cost

  • em

exploited a semi-supervised learning algorithm which is based on Expectation Maximization (EM) using naive Bayes as a classifier to learn from a small set of labeled sentences and a large
set of unlabeled sentences

Uses a naive Bayes classifier to learn from a small set of labeled sentences and a large set of unlabeled sentences

  • emotions (cloud based, embedding, CNN)

There are a number of theories on emotion taxonomy which spans from Ekman’s emotion categorization model to the Hourglass of Emotion.

So far, approaches to text-based emotion and sentiment detection rely mainly on rule-based techniques, bag of words modeling using a large sentiment or emotion lexicon , or statistical approaches that assume the availability of a large dataset annotated with polarity or emotion labels

Emotion analysis, or affective analysis, can be considered a refined version of sentiment analysis, since it aims at a more detailed categorization of documents based on the emotions they express

  • environment (great reef)

  • event( company event, hurricane)

  • f-measure

The proposed method uses the created lexicon features alongside another lexicon and n-gram features to improve accuracy and F-measure.

In ALGA, it is shown that even using a small dataset (STS-Test) is enough for achieving high accuracy and F-measure values and the number of tweets for training phase does not need to be high.

In ALGA, the created lexicon features are combined with another lexicon and n-gram features to improve accuracy and F-measure

  • fcm (fuzzy)

The fuzzy C-means (FCM) clustering algorithm was used in order to define five original fuzzy sets. These fuzzy sets have triangular membership functions.

Every membership function has a degree of membership equal to 1 at the center previously calculated by the FCM, and a support that is defined as the space between the projections of the previous center and the next center on the horizontal axis.

Defines five fuzzy sets with triangular membership functions

  • fr-rank

Ranks existing nodes according to frequency of occurrence and richness (higher co-occurring PMI value); the FR-Rank approach extracts frequently used nodes as keywords

  • fuzzy

The fuzzy computational mechanism, along with the AV-AT model of emotion form the core components of the proposed emotion modeling methodology.

Fuzzy computation, combined with the AV-AT emotion model, forms the core of the emotion modeling methodology


  • gavam

GAVAM was also used to extract facial expression features from the face. In our experiment we used the features extracted by CLM-Z along with the features extracted using GAVAM.

Features extracted with GAVAM are combined with those extracted with CLM-Z from facial images

  • gc (Gauss-based cuckoo search algorithm)

  • generic

Those features can be either surface features (which stands for S ), generic automatic word vectors
( G ), or affect word vectors specifically trained for the sentiment analysis task ( A ).
S: surface features; G: generic automatic word vectors; A: affect word vectors trained for the sentiment analysis task

Generic word vectors, also denoted as pre-trained word vectors, can be captured by word embeddings techniques such as word2vec ( Mikolov, Chen et al., 2013 ) and GloVe ( Pennington, Socher, & Man-ning, 2014 ). Generic vectors are extracted in an unsupervised manner i.e., they are not trained for a specific task. These word vectors contain semantic and syntactic information, but do not enclose
any specific sentiment information.

Generic framework for stock analysis
Captured by word-embedding techniques such as word2vec (they carry semantic and syntactic information but no specific sentiment information)

  • genetic-pro

we find that the sentiment indicator based genetic programming optimization approach yields a superior trading performance.

Algorithm :

  1. Randomly create an initial population of individuals from the available function and terminal set;
    repeat
  2. Execute each individual and compute its fitness;
  3. Select one or two individual(s) from the population with a fitness-based probability to participate in genetic operations (i.e. crossover and mutation);
  4. Create new individual(s) by applying genetic operations with specified probabilities of crossover or
    mutation; until Stopping condition is met;

Randomly create an initial population from the available function and terminal set; then repeatedly evaluate each individual's fitness, select one or two individuals with fitness-based probability, and create new individuals by applying crossover and mutation, until the stopping condition is met

  • ghlz (stock trading)

Extending GHLZ model by exploring the relationship between intraday stock market returns and intraday sentiment, Sun et al. (2016) (SNS hereafter) find that the change in investor sentiment has predictive value for the intraday market returns.

Extends the GHLZ model by relating intraday stock returns to intraday sentiment; changes in investor sentiment have predictive value for intraday returns

  • gmm
    probabilistic neural network (PNN), Gaussian mixture model (GMM)

The use of Gaussian mixture models as a classification tool is motivated by the interpretation that the Gaussian components represent some general output dependent features and the capability of Gaussian mixtures to model arbitrary densities
A classification tool; Gaussian components represent general output-dependent features and can model arbitrary densities

  • gr

Gain ratio (GR)
GR is introduced in decision tree (C4.5) algorithms and is a type of feature selection algorithm based on the principle of IG ( Quinlan, 1993; Sharma & Dey, 2012b ).

The GR value of a text feature is calculated by normalizing the IG value of the text feature. The high GR value indicates that the text feature will be useful for classification.
The GR value normalizes a text feature's IG value; the higher the GR value, the more useful the feature for classification


  • hadoop

Data are stored in a NoSQL MongoDB database, which is located on a cluster computer with a Hadoop architecture.

  • heat-map

Heat map of geo-referenced tweets showing where tweets were posted from between March and October 2016.

  • histogram

shows the histogram of accuracies for our configuration-space in both training and test partitions.

  • hownet

crawled online reviews and the HowNet sentiment dictionary.

HowNet in simplified Chinese is not suitable for this study because of cultural differences and different radicals for the same character. It contains roughly 11,000 positive and negative word senses each, although it is not necessary to use every Chinese word's sentiment meaning to conduct an analysis.

HowNet targets simplified Chinese, so it was not applicable to that (traditional Chinese) study

  • human-behavior

focused on building multimodal human behavior analysis tools to extract sentiment in response to videos such as product advertisement.

Multimodal human-behavior analysis tools extract sentiment from responses to videos such as product advertisements

  • ifwa

intuitionistic fuzzy weighted averaging

the IFWA operator is suitable to be used in the problems that the weights are assigned to the product features, whereas the IFOWA( intuitionistic fuzzy ordered weighted averaging) operator is suitable to be used in the problems that the weights are assigned to the ranking positions of feature values

IFWA: used when weights are assigned to product features
IFOWA: used when weights are assigned to the ranking positions of feature values

  • ig
    information gain

IG is also one of the most used feature selection algorithms in sentiment classification ( Tan & Zhang, 2008; Yang & Ped-ersen, 1997 ). In IG algorithm, it is regarded that the larger uncertainty of a text is, the greater uncertainty of the sentiment class of the text will be.

The greater the uncertainty of a text, the greater the uncertainty of its sentiment class

  • igraph

NetworkX and iGraph were used in network construction and analysis; visualizations were created in Gephi

  • intraday

(stock return, news-driven, economic freedom, trade)

  • investor

(stock price, news ranking, Asian markets)

  • knn

K-nearest neighbor (KNN).

KNN is a popular instance-based learning algorithm ( Yang & Liu, 1999 ), which has been shown to be one of the most effective algorithms in sentiment classification ( Lam & Han, 2003 ). In KNN, to determine the sentiment class of a test text d_l, the K nearest neighbors are first selected from the training texts,

The K nearest neighbors are first selected from the training texts


  • language

(Spanish tweet, Chinese stock market, microblog, multilingual, arabic literature, Chinese words)

  • lda

Topic modeling techniques make use of latent Dirichlet allocation (LDA) or its variants to automatically extract aspects from text.

Key assumption of LDA model ( Fig. 1 (a)) is that a document is a compound distribution generated by different member probability distributions over words, and each member probability distribution corresponds to a topic.

LDA is a generative model introduced by Blei, Ng, and Jordan (2003) that quickly gained popularity because it is unsupervised, flexible and extensible. LDA models documents as multinomial distributions of so-called topics. Topics are multinomial distributions of words over a fixed vocabulary.

Automatically extracts aspects (topics) from text
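
A small gensim sketch of LDA-style topic extraction; the toy corpus and topic count are illustrative only.

```python
# Sketch: documents -> bag-of-words corpus -> LDA topics (multinomial distributions over words).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["battery", "life", "short"], ["screen", "bright", "battery"], ["service", "slow", "staff"]]
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # each topic is a weighted mixture of words
```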

  • learning-to-rank (stock)

First, learning-to-rank algorithms produce accurate predictions of expected return rankings, and the proposed stock selection approach works robustly under different financial market conditions. Secondly, the sentiment indicators [31,38] support ranking predictions consistently in reflecting individual stock’s future performances under different market conditions.
Predicts rankings of future stock returns

  • lexicon-based

Lexicon-based methods make use of known positive and negative terms to classify feature sentiments and thus are domain independent, whereas machine learning methods excel in optimizing system parameters with a large training data set.

Two sub-classifications can be found here: Dictionary-based and Corpus-based approaches.

Typically, a sentiment lexicon consists of a set of terms in a specific language, carrying some kind of emotion weight, annotated along a number of dimensions. The number of dimensions (emotions) is lexicon-dependent while, for each dimension, a given term can be scored either in a binary manner (e.g. the term is characterized by the anger emotion or not), or by using a specific rating scale. (Ref to sentiment analysis leveraging)
A sentiment lexicon consists of terms in a specific language, each carrying emotion weights
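
A minimal dictionary-based scorer in the spirit of the description above; the tiny lexicon, weights, and negation rule are made up for illustration.

```python
# Minimal lexicon-based polarity scorer; entries and weights are illustrative.
lexicon = {"good": 1.0, "great": 1.5, "bad": -1.0, "terrible": -1.5}

def score(text: str) -> float:
    tokens = text.lower().split()
    total = 0.0
    for i, tok in enumerate(tokens):
        weight = lexicon.get(tok, 0.0)
        # crude negation handling: flip the sign if the previous token is "not"
        if i > 0 and tokens[i - 1] == "not":
            weight = -weight
        total += weight
    return total

print(score("the food was great"))        # > 0 -> positive
print(score("the service was not good"))  # < 0 -> negative
```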

  • liblinear

We use an L2-loss L2-regularized support vector regression provided by LIBLINEAR, a library for large linear classification specifically good for document classification. Each support vector machine classifier receives a sparse matrix as an input and the sentiment score as the output and produces a real number between 0 and 1, inclusive.

A large linear classification library well suited to document classification; each SVM takes a sparse matrix as input and produces a sentiment score as output

  • loocv

Results are obtained by user-independent training with Leave-One-Out Cross-Validation
(LOOCV) scheme.

According to this scheme, the number of folds is equal to the number of instances in the dataset.
The number of folds equals the number of instances in the dataset

  • lrr

Latent Rating Regression (LRR) which is a kind of Latent Dirichlet Allocation to analyze both aspect ratings and aspect weights,
Analyzes both aspect ratings and aspect weights

  • lsa

Latent Semantic Analysis (LSA)

  • machine-learning

Machine learning based methods. They are divided also in groups: supervised and unsupervised techniques. In addition, some authors mention a hybrid between these both: semi-supervised learning

Supervised, unsupervised, and semi-supervised
(DT, NB, SVM, RBFNN, and KNN) refer to multi-class sentiment classification

DT (Decision tree)

graphical representation of training samples, where leaves represent class labels of training samples and branches represent conjunctions of features that lead to the class labels. DT is an algorithm commonly used in data mining. The goal of DT is to construct a tree structure that predicts the values of test samples based on the training samples.

NB

in binary sentiment classification The NB algorithm is based on the assumption that the probability of each term in a text is independence in the term’s context and position in the text.
For binary sentiment classification; assumes each term's probability is independent of its context and position in the text

SVM

Based on the structural risk minimization principle, the SVM seeks a hyperplane that separates the feature vectors of training texts into different sentiment classes with the maximum margin.

Radial basis function neural network (RBFNN)

RBFNN is an artificial neural network that uses radial basis functions as activation functions. RBFNN has strong nonlinear fitting ability and high reliability, and it performs well in many domains RBFNN is a three-layer (input layer, hidden layer and output layer) feedforward network,

RBFNN: strong nonlinear fitting ability; a three-layer feedforward network (input, hidden, and output layers)

K-nearest neighbor (KNN)

a popular instance-based learning algorithm ( Yang & Liu,1999 ), which has been shown as one of the most effective algorithms in sentiment classification ( Lam & Han, 2003 ). In KNN, to determine the sentiment class of a test text, K nearest neighbors are first selected from the training texts
In KNN, the K nearest neighbors are first selected from the training texts
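
A compact scikit-learn sketch comparing several of the classifiers listed above (DT, NB, SVM, KNN) on toy BOW features; scikit-learn has no RBFNN, so an RBF-kernel SVC stands in loosely here. Data is made up for illustration.

```python
# Compare several classifiers from the list above on toy BOW features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier

texts = ["love it", "hate it", "great service", "poor service", "very happy", "very sad"]
labels = [1, 0, 1, 0, 1, 0]

X = CountVectorizer().fit_transform(texts)
models = {
    "DT": DecisionTreeClassifier(),
    "NB": MultinomialNB(),
    "SVM": LinearSVC(),
    "RBF-SVC": SVC(kernel="rbf"),   # rough stand-in for RBFNN
    "KNN": KNeighborsClassifier(n_neighbors=3),
}
for name, model in models.items():
    model.fit(X, labels)
    print(name, model.score(X, labels))  # training accuracy only, for illustration
```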


  • map (aspect-level analysis)

Maximum A Posteriori (MAP) technique to tackle the aspect sparsity problem; estimates the rating on each aspect for each review.
Tackles the aspect sparsity problem; estimates the rating on each aspect of each review

  • me

Maximum Entropy
supervised techniques, support vector machines (SVM), Naive Bayes, Maximum Entropy

  • micro-blog (Twitter, hurricane, reef)

  • mjst (microblog)

Multimodal joint sentiment topic model (MJST) for weakly supervised sentiment analysis in microblogging, which applies latent Dirichlet allocation (LDA) to simultaneously analyze the sentiment and topics hidden in messages, based on the introduction of emoticons and microbloggers' personality.
Weakly supervised; analyzes the sentiment and topics hidden in microblog messages

  • mkl (multiple kernels)

multiple kernel learning (MKL) using support vector machine (SVM) as a classifier with different types of kernel.

MKL was used simultaneously for optimizing different modalities in Alzheimer’s disease. However, in order to deal with co-morbidity with other diseases, they used the hinge loss function to penalize misclassified samples that did not scale well with the number of kernels.

SVM with multiple kernels; the hinge loss penalizes misclassified samples; applied to Alzheimer's disease

  • mlp

MLP network is trained using a back-propagation algorithm which uses Empirical Risk Minimization. It tries to minimize the errors in training data. Once it finds the hyperplane, regardless of global or local optimum, the training process is stopped.

Trained with back-propagation to minimize errors on the training data

  • mongodb

Data are stored in a NoSQL MongoDB database, which is located on a cluster computer with a
Hadoop architecture.

  • multimodal (multimodal emotions)

multimodal sentiment analysis in different domains, including spoken reviews, images, video blogs, human–machine and human–human interactions.

multimodal framework combining physiological analysis of the user and global sentiment-rating available on the internet. We have fused Electroencephalogram (EEG) waves of user and corresponding global textual comments of the video to understand the user’s preference more precisely.
Spoken language, images, video

  • naive-bayes

Naive Bayes classifier is the most common probabilistic classifier and refers to a family of simple classifiers based on applying Bayes theorem with strong independence assumptions among the different variables or features.

This method incorporates the Naive Bayes (NB) algorithm to compute the sentiment index of online reviews and then employs the sentiment index to extend the imitation coefficient in the Bass/Norton
model.

The maximum accuracy was achieved by Hong Yu and Vasileios Hatzivassiloglou by using Naive Bayes, a commonly used supervised machine-learning algorithm.

This approach presupposes the availability of at least a collection of articles with pre-assigned opinion and fact labels at the document level. They used single words, without stemming or stopword removal, as features. Naive Bayes assigns a document d to the class c that maximizes P(c|d) by applying Bayes' rule: P(c|d) = P(c) P(d|c) / P(d)
Sentiment of online reviews

  • net-optimism

Net-Optimism provides not only robust results [32],but its simplicity makes later comparisons straightforward.

  • networkx

NetworkX [35] and iGraph [36] were used in network construction and analysis; visualizations were created in Gephi

  • neuroimaging
    (Eeg response)

Various neuroimaging techniques have already been applied to study the cortical and subcortical
portions of the brain that get activated while watching pleasant or unpleasant video contents
Neuroimaging of brain responses

  • nlp

(great reef, Stanford NLP, web services (tools included))
Natural Language Processing

DSocial NLP + DSocial NLP sentiment (in Python)

Emoticons are taken into account, since they are very common in social networks [60]. Thus, :-) or :-( express something positive or negative, respectively

Common preprocessing steps

• All capital letters are replaced by lower case letters.

• The presence of special elements and symbols (e.g., URLs, usernames, commas) are substituted.

• Additional white spaces are removed because they do not provide semantic information.

• Hashtags symbols (#) are removed from the words they precede in order to be part of the text.

• Informal intensifiers and character repetitions are also identified.

• A list of stopwords is also used to remove very common words that can reduce the performance of the classifier.
Lowercasing, emoticon handling, URLs, whitespace, hashtags

SharpNLP POS tagger for Dutch
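
The cleaning steps in the list above can be sketched with plain Python; the regular expressions and stopword list are illustrative, not the exact rules of the cited systems.

```python
# Sketch of the tweet-cleaning steps listed above; patterns and stopwords are illustrative.
import re

STOPWORDS = {"a", "the", "is", "at", "to", "and"}

def clean_tweet(text: str) -> list:
    text = text.lower()                                  # replace capital letters with lower case
    text = re.sub(r"https?://\S+", " URL ", text)        # substitute URLs
    text = re.sub(r"@\w+", " USER ", text)               # substitute usernames
    text = re.sub(r"#(\w+)", r"\1", text)                # drop the hashtag symbol, keep the word
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)           # squeeze character repetitions (soooo -> soo)
    text = re.sub(r"\s+", " ", text).strip()             # remove additional whitespace
    return [t for t in text.split() if t not in STOPWORDS]

print(clean_tweet("LOVE the new phone!!! sooooo good #happy http://example.com @shop"))
```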

  • norton-model(sales products forecast)

combines the Bass/Norton model and sentiment analysis while using historical sales data and online review data is developed for product sales forecasting.
Combines the Bass/Norton model with sentiment analysis to forecast product sales

  • nosql

future real-time assessments will require stable systems (including back up for power outage) to ensure no loss of data. Data are stored in a NoSQL MongoDB database, which is located on a cluster computer with a Hadoop architecture.
Stable storage (backed up) to ensure no data loss

  • nltk (hotel rating)
    NLP platform in Python
  • ntusd (Chinese use)
    (National Taiwan University Sentiment Dictionary)

A predefined wordlist of positive/negative terms that has been most commonly applied for text mining in a traditional Chinese environment. It includes 1,122 positive words and 4,525 negative words after CKIP part-of-speech processing.

If such a list existed, we could use it as a keyword list and apply the same analysis approach as the NTUSD method by searching for keywords in the extracted list and collocations as machine learning features.

A positive/negative lexicon for traditional Chinese, usable as a keyword list

  • opensmile

the openSMILE toolkit was used to extract various features from audio;
PLP –The Perceptual Linear Predictive Coefficients of the audio segment were calculated using the openSMILE toolkit.
Used to extract audio features

  • opinion

Tweets , short reviews, websites, airports, products, web services, arabic literature

  • ordinal

Ordinal-based integration
In set theory, an ordinal number is the order type of a well-ordered set. Like other kinds of numbers, ordinals can be added and multiplied.

In OIFV method, whole features are ranked and then sorted in descending order by feature selection methods in each feature vector respectively. After feature ranking by five feature selection methods, we obtain the ordinal-based features vector (OFV) using the OIFV method.
Feature selection; features are ranked and then sorted in descending order

  • pmi
    Point-wise Mutual Information (PMI)

Models the mutual information between the features and the classes; the measure is derived from information theory. The point-wise mutual information PMI_i(w) between the word w and the class i is defined on the basis of the level of co-occurrence between class i and word w. Under mutual independence, the expected co-occurrence of class i and word w is P_i · F(w), and the true co-occurrence is F(w) · p_i(w).

Models the mutual information between features and classes
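
A small numeric sketch of the PMI computation described above; the document counts are made up.

```python
# Point-wise mutual information between a word w and a class i:
# PMI(w, i) = log( P(w, i) / (P(w) * P(i)) ); counts below are illustrative.
import math

n_docs = 1000          # total documents
n_class_i = 400        # documents in class i
n_word = 120           # documents containing word w
n_word_and_class = 90  # documents in class i that contain word w

p_w = n_word / n_docs
p_i = n_class_i / n_docs
p_wi = n_word_and_class / n_docs

pmi = math.log(p_wi / (p_w * p_i))
print(pmi)  # > 0 means w co-occurs with class i more often than independence predicts
```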

  • polarity
    Products review rating neg/pos

In the test phase, for each record T_i, ALGA(D_m, T_i, L_kbest) is computed, where k_best is the best chromosome in the final iteration in terms of fitness. If the score is greater than zero, the test record is assigned to the positive class; otherwise, it is assigned to the negative class.
A score greater than zero means positive
Used on Twitter

SentimentAnalyzer is a simple web service which computes sentiment for English, German or French texts. It is able to classify the polarity (positive, negative or neutral) of a whole text and score it in the range
Three polarity classes; supports English, German, and French

SentiRate

Polarity is grouped into 11 categories: very_positive, quite_positive, positive, fairly_positive, a_little_positive, neutral, a_little_negative, fairly_negative, negative, quite_negative,
very_negative. It also scores each sentence or the whole text by number with two decimals.
Eleven polarity categories

  • pos
    Parts of speech (POS): finding adjectives, as they are important indicators of opinions.
    Find adjectives

  • predict
    Predict sales performance and advertising preferences
    Predict product and service ratings

  • product
    SUVs on Amazon, via online reviews

Crawl → preprocess (tokenization and POS tagging, stopword removal) → positive/negative lexicon

Relationship between sales and reviews
Movie reviews

  • pso (particle swarm optimization)
    effective selection of optimal parameter values for SVM.

  • python
    Trading

Using the Python library BeautifulSoup , we extract all messages published on StockTwits between January 1, 2012, and December 31, 2016, and we store them in a MongoDB NoSQL database.

Machine learning package scikit-learn
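
A hedged sketch of the crawl-and-store pattern described above; the URL, CSS selector, and database/collection names are hypothetical placeholders, not the actual StockTwits layout.

```python
# Sketch: fetch a page, parse it with BeautifulSoup, store records in MongoDB.
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

html = requests.get("https://example.com/messages").text                # hypothetical page
soup = BeautifulSoup(html, "html.parser")
messages = [p.get_text(strip=True) for p in soup.select("p.message")]   # hypothetical selector

client = MongoClient("mongodb://localhost:27017")
collection = client["sentiment_demo"]["messages"]                       # hypothetical names
if messages:
    collection.insert_many([{"text": m} for m in messages])
```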

  • radical (radical seeds)
    PMI-radical, PMI-word
    Compared with NTUSD
    The radical-based approach is especially useful for characters such as 鸟/凤 and for traditional Chinese characters

FR-Rank approach to extract frequently used nodes as keywords,

dataset: TripAdvisor restaurant reviews

  • real-time(trading, online-social media)

a real-time news analytics framework

developed a real-time quantitative trading system based on six technical indicators and it generates positive returns with statistical significance.

  • recommender (product-comparison web service)
    IMDb list ratings

  • regression (investor)

examine the heterogeneous effect of investor psychology on conditional
institutional trading behavior by applying a quantile regression.

We perform OLS regressions and QR estimations for the
conditional ETF return distribution quantiles.

  • reinforcement-learning
    to predict negation scopes.

learn a so-called agent from the outcome of its actions on the basis of past experience. This method tries to replicate human-like learning and thus appears well suited for natural language processing.

Introduces a so-called state-action function Q(s_i, a_i) that defines the expected value of each possible action a_i in each state s_i. If Q(s_i, a_i) is known, then the optimal policy p*(s_i, a_i) is given by the action a_i that maximizes Q(s_i, a_i) given the state s_i. Consequently, the learning problem of the agent is to maximize the expected reward.

Learns from past experience to make predictions (e.g., negation scopes)

  • rnn

recurrent neural network (RNN) has been discussed for the aspect term extraction task.

  • semeval (multilingual dataset)

topic modelling based approaches on the SemEval2016 task 5 dataset.

consists of restaurant reviews in several languages. Reviews are split by sentence and labelled with explicit aspect-term mentions, the coarse-grained domain aspect or category
Multilingual; a sentiment analysis evaluation campaign; restaurant reviews

  • sentislangnet

a novel unsupervised method based on linguistic sentiment propagation model to predict the sentiments in informal texts.
Unsupervised; predicts sentiment via linguistic sentiment propagation

  • sentistrength

This tool is a lexicon-based sentiment evaluator that is specially focused on short social web texts written in English. The classification will be in five different classes: positive, negative, neutral, extremely negative and extremely positive.

returns a positive score, from 1 (not positive) to 5 (extremely positive), a negative score from -1 (not negative) to -5 (extremely negative), and a neutral label taking the values: -1 (negative), 0 (neutral), and
1 (positive).

English: classifies short social-web texts into five classes

  • snowball

(Twitter Spanish sentiment)
The Snowball Stemmer for the Spanish language, implemented in the NLTK package
Supports many languages (Spanish here)

  • srs (Simple Random Sampling)

sample size should be less than 10% of the population size.
The sample size should be less than 10% of the population size

  • stanford-corenlp

tokenize, lemmatize, and POS tag the English documents and the Chinese documents.

Stanford CoreNLP created a sentiment classifier based on deep learning technique called recursive neural network that builds on top of grammatical structures.
Supports multiple languages (English and Chinese documents here)

  • state-wide

    Map of Indian currency denominations in circulation

  • stock

Stock return volatility (with heat map)
Stock volatility

  • stopwords (preprocessing)
    e.g., a, the, they, I
  • svm

(support vector machines, binary classes)

To be trained with a larger dataset
A vector representing the hyperplane that separates the feature vectors of training texts belonging to the two sentiment classes

Separates the feature vectors of the two sentiment classes in the training texts
(One of the five machine learning methods listed above)

  • textblob (in python)

The score polarity is a float in the range [-1.0,1.0] and subjectivity varies within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

Polarity ranges from -1 to 1; subjectivity from 0 to 1 (0 = very objective)
e.g., MySentimentAPI (Java), AFINN (ANEW with Python), phone comparison (company event)
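
A minimal TextBlob usage sketch matching the score ranges above; the example sentence is made up.

```python
# TextBlob returns polarity in [-1.0, 1.0] and subjectivity in [0.0, 1.0].
from textblob import TextBlob

blob = TextBlob("The battery life is amazing, but the screen is a bit dim.")
print(blob.sentiment.polarity)      # positive values -> positive sentiment
print(blob.sentiment.subjectivity)  # 0.0 = very objective, 1.0 = very subjective
```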

  • tfidf (weighting scheme)
    Compares TF and TF-IDF performance on movie data and product data
    Term Frequency–Inverse Document Frequency (TF-IDF).

If a term or phrase appears frequently (high TF) in one document but rarely in other documents, it is considered to have good discriminative power and is suitable for classification. TF-IDF is simply TF × IDF, where TF is term frequency and IDF is inverse document frequency.
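
A small scikit-learn sketch of TF-IDF weighting (recent scikit-learn assumed); the documents are illustrative.

```python
# TF-IDF: terms frequent in one document but rare elsewhere receive high weight.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great camera great battery", "poor battery", "average screen"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # each row is a document vector of TF * IDF weights
```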

  • tokenizers
    Word n-grams
    N-words are word sequences. To compute them, the text is tokenized and the n-words are built from the tokens. The NLTK tokenizer is used to identify word tokens.

For example, let T = "the lights and shadows of your future"; its 1-words (unigrams) are the individual words, and its 2-words (bigrams) are the sequences of two consecutive words, W_T2 = { the lights, lights and, and shadows, shadows of, of your, your future }. Given a text of m words, there are at most 127 possible combinations of tokenizers, i.e. the power set of { 1-words, 2-words, 3-grams, 4-grams, 5-grams, 6-grams, 7-grams }.

Token generators producing n-word tokens; bigrams, for example, are the set of consecutive two-word pairs extracted from a sentence

  • vader

(Classifies tweets as positive/negative)
Valence Aware Dictionary for Sentiment Reasoning (VADER) is a rule-based model
that combines a general lexicon and a series of intensifiers, punctuation transformation, emoticons, and many other heuristics to compute sentiment polarity of a review or text.

A lexicon-based, rule-based model combining a general lexicon with intensifiers, punctuation handling, and emoticons to compute the sentiment polarity of a text
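
A minimal sketch of using VADER through NLTK; the example text is made up, and the one-time lexicon download is assumed to be permitted.

```python
# VADER via NLTK: rule-based scorer using a lexicon plus intensifier, punctuation,
# and emoticon heuristics.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The movie was GREAT!!! :)"))
# -> dict with 'neg', 'neu', 'pos' proportions and a 'compound' score in [-1, 1]
```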

  • video
    EEG signal and comments (textblob predict ratings)
  1. To increase the number of data samples, the acquired EEG signals corresponding to each video were divided into 8 equal parts. Thus we achieved a total dataset of 3000 signals (i.e. 15×8×25) for 15 videos (regression analysis).
  2. Crawl comments via the YouTube API (Naive Bayes analyzer); a polarity score between −1.0 and 1.0 was assigned to each sentence.

EEG signals predict ratings

  1. To increase the number of data samples, the EEG signal for each video was divided into 8 equal parts, yielding 3000 signals for 15 videos
  2. Comments were crawled via the YouTube API (naive Bayes classification and polarity scoring)
  • visual-sentiment
    analyzing images and their associated tags posted on social media.
    Analyzes images and their associated tags posted on social media
    using CNN, used an AlexNet model pre-trained on ImageNet, simply as a feature extractor [123] with SVMs and logistic regression classifiers,
    Feature extractor: a CNN (an AlexNet model pre-trained on ImageNet), with SVM and logistic regression classifiers

  • VSM

Vector Space Model (VSM) is exploited to represent documents for sentiment analysis task. The weight of each term in a document’s vector is the key component of the VSM of document representation that measures the importance of the term in a document.

In the indexing process, two features are of main concern: statistical term weighting where term weighting is based on discriminative supremacy of a term that appears in a document or a group of documents and semantic term weighting where term weighting is based on a term’s meaning

Measures the importance of a term within a document: statistical term weighting (based on a term's discriminative supremacy in a document or group of documents) and semantic term weighting (based on the term's meaning)

  • W2LVDA

an almost unsupervised system based on topic modelling that, combined with some other unsupervised methods and a minimal configuration step, performs aspect category classification, aspect-term and opinion-word separation and sentiment polarity classification for any given domain and language.

Language- and domain-independent, almost unsupervised; combining other unsupervised methods with a minimal configuration step, it performs aspect-category classification, aspect-term/opinion-word separation, and sentiment polarity classification
Input: user reviews, aspect seed words (e.g., food, service), positive/negative seed words (e.g., excellent)
Output:

  1. Aspect terms and aspect-specific opinion words (e.g., yummy, tasteless)
  2. Weights and polarity over the whole domain
  • webservice
    Compares ratings across 15 specific web services, such as AlchemyAPI

  • wom
    Word of mouth: online reviews and microblog posts influence purchase decisions

  • word2vec

Word Embeddings – We employ the publicly available word2vec vectors that were trained on 100 billion words from Google News. The vectors have dimensionality 300 trained using the continuous bag-of-words architecture . Words not present in the set of pre-trained words are initialized randomly.

Word embeddings: the word2vec vectors were trained on 100 billion words from Google News; they are 300-dimensional and use the CBOW architecture
Unsupervised: meanings and relations are learned from the co-occurrence of words in context (cf. recurrent or deep neural networks in machine learning)

Two algorithms:
1. Continuous Bag-of-Words (CBOW), better for frequent words: predicts a word from the known surrounding words
2. Continuous Skip-Gram (Skip-gram), better for rare words: predicts a set of surrounding words from a single known word
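
A small gensim sketch contrasting the two algorithms (sg=0 selects CBOW, sg=1 selects Skip-gram), assuming gensim ≥ 4 parameter names; the corpus is a toy example, not the Google News data.

```python
# Word2Vec: CBOW predicts a word from its context; Skip-gram predicts the context from a word.
from gensim.models import Word2Vec

sentences = [["the", "battery", "is", "great"],
             ["the", "screen", "is", "terrible"],
             ["great", "battery", "life"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["battery"][:5])                      # first 5 values of a 50-d word vector
print(skipgram.wv.most_similar("battery", topn=2))  # nearest words under skip-gram
```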

  • wordcloud

Word cloud: extract word frequencies from hotel and movie reviews
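
A minimal sketch with the wordcloud package; the review text is made up.

```python
# Build a word cloud from term frequencies in review text.
from wordcloud import WordCloud

text = "great room great location clean friendly staff noisy street small bathroom great breakfast"
cloud = WordCloud(width=400, height=200, background_color="white").generate(text)
cloud.to_file("review_wordcloud.png")  # or display with matplotlib's imshow
```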

  • wordnet (commonly used dictionary)

A lexical ontology; radicals; synonyms and antonyms
Identifies which words are adjectives, like 'great', 'amazing', 'wonderful'

  • youtube

We use positive and negative as sentiment classes in the classification problem. In the annotations provided with the YouTube dataset, each video was segmented into utterances and each of the utterances has the length of a few seconds. Every utterance was annotated as either 1, 0 and −1, denoting positive, neutral and negative sentiment.

Using MATLAB code, we converted all videos in the dataset to image frames, after which we extracted facial features from each image frame. To extract facial characteristic points (FCPs) from the images, we used the facial recognition library CLM-Z

Positive and negative are used as the sentiment classes: in the YouTube dataset, each video is segmented into utterances a few seconds long, annotated as 1, 0, or −1 for positive, neutral, or negative. MATLAB code converts all videos into image frames, and facial characteristic points (FCPs) are then extracted from each frame using the CLM-Z facial recognition library.