In particular, we have identified three relevant perspectives to our project, all of which need to be considered simultaneously: The scientific/technical perspective focuses on questions such as which methods to apply for particular subproblems, how to best make use of the available data, which evaluation metrics to apply, etc.
The industrial perspective focuses on the operational realization and deals with topics like integration and interfaces, software quality, performance and scalability, privacy, security, backups, etc.
Finally, the user perspective is concerned with the benefits the system is able to deliver to the end-users, in our case the moderators working for the newspaper. For example, using evaluation metrics like precision, recall and F1-score for a classification problem is well-established in machine learning research (scientific perspective), but measuring the time savings for a well-defined moderation task tells us more adequately how well we are addressing the needs of the moderators (user perspective). It has proven beneficial to frequently switch between these perspectives during the project timespan, or to consider and address them simultaneously.

2.2. Specificity
Even though we have identified the requirement for holistic thinking as one challenge for our endeavor in the previous subsection, we can at the same time also name challenges that come from the highly specific practical needs in the given real-world setting. For example, the applied classification scheme (i.e., the annotated categories) could be criticized in a purely academic setting as being specifically tailored to the needs of one particular newspaper. Indeed, it is difficult to find related work that deals with text classification according to categories like “arguments used”, “personal stories” or “feedback” originating from the concrete moderation needs at DER STANDARD. And even for our category negative/positive sentiment, the related research literature in sentiment analysis often relates to online reviews of items like movies (Pang and Lee, 2005; Socher et al., 2013; Maas et al., 2011; Le and Mikolov, 2014), restaurants (Snyder and Barzilay, 2007) or books (Sakunkoo and Sakunkoo, 2009), where a clear indication of sentiment can be expected in every document, and the domain restriction can be expected to facilitate sentiment classification. In our data, sentiment has yet another specific meaning stemming from the application; in particular, the moderators are interested in locating negative sentiment in order to prevent escalation in user discussions. In general, in a concrete industrial setting it is likely that one has to deal with very specific phenomena, use cases and goals, and results from academic research might not carry over directly.

2.3. “Messy” Data
Data annotation by humans is time-consuming and thus costly, especially when specific domain expertise is necessary, as is the case with the categories the moderators have defined with their particular moderation goals at DER STANDARD in mind. We need to ensure efficient use of moderator time in data annotation for model training and evaluation. Furthermore, most categories can be considered rare anomalies, resulting in strongly unbalanced data (e.g., the binary category “discriminating” has a prevalence of about 8% in our data set). These two factors explain the somewhat complicated, exploratory annotation procedure described in our data set paper (Schabus et al., 2017): In the first attempt, where 1,000 user comments were selected randomly for annotation, some categories were very weakly represented. Subsequently making use of the moderators’ experience in selecting suitable topics (e.g., articles about the refugee crisis or gender mainstreaming for finding discriminating posts) turned out to be helpful for acquiring additional positive instances. However, this also has unwanted side-effects: First, it means that many posts are annotated only according to one particular category, i.e., the data sets for the categories are mostly disjoint, and consequently separate classification models must be trained, rather than a single multi-label model. Second, the class distributions in the labeled data are no longer necessarily indicative of the real class distributions in practice. And even if we accept these disadvantages, it still does not mean that we have vast amounts of data: In our data collection, even the better represented categories have fewer than 2,000 positive examples.
While in academic research settings methods are typically evaluated on carefully compiled benchmark data sets, which have reasonable balance and size, in practical industrial applied research settings these might not always be available in similar quality and quantity. Our situation also differs from what one might associate with an industrial setting typical for large companies, which face fewer limitations in terms of available data or capacities to conduct large-scale annotations. In the presented approach we thus focused on the efficient usage of the available data, acknowledging its characteristics, which are neither typical for academia nor for large enterprises.

2.4. Integration
The goal of the project is to deliver a prototype system applicable in practice, i.e., supporting the moderation of current online discussions. Therefore, a connection to the production forum system is required, such that the prototype works with the live stream of new postings in
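The class-imbalance point made in Section 2.3 (a rare binary category with roughly 8% prevalence) is also why per-class precision, recall and F1-score, rather than plain accuracy, are the natural evaluation metrics in this setting. The following is a minimal, self-contained sketch using synthetic labels (not the paper's actual data or code) to illustrate the effect:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Synthetic labels mimicking a rare category with 8% prevalence.
y_true = [1] * 80 + [0] * 920

# A degenerate classifier that always predicts the majority (negative)
# class reaches 92% accuracy, yet it is useless to the moderators:
y_pred = [0] * 1000
accuracy = sum(t == q for t, q in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                              # 0.92
print(precision_recall_f1(y_true, y_pred))   # (0.0, 0.0, 0.0)
```

On such skewed data a high accuracy can mask complete failure on the rare class, whereas the F1-score for the positive class immediately exposes it; this is one reason why the results in Table 1 are reported as per-category precision, recall and F1.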
4. Conclusions
In this “industry track” paper, we have shared our
experiences from a collaborative applied research project
involving a small research institution and a medium-sized
commercial enterprise. The goal of the project is to
develop and deploy a prototype system that supports the
moderation of user discussions on a large newspaper
website. A key building block of this system is a text
classification module predicting eight moderator-defined
category labels. We have described a number of
challenges faced in this context and
Method            1 (BL)   2       3       4       5       6       7       8       9
Concat                     ✗       ✗       ✓       ✓       ✗       ✗       ✓       ✓
Topics                     ✗       ✗       ✗       ✗       ✓       ✓       ✓       ✓
Kernel                     Linear  RBF     Linear  RBF     Linear  RBF     Linear  RBF
# > BL (Precision)         1       3       2       6       4       5       4       6
# > BL (Recall)            5       2       5       2       5       3       5       3
# > BL (F1)                2       2       4       3       4       4       5       5
Table 1: Classification results: precision, recall and F1-scores per method and category. BL indicates the baseline from Schabus et al. (2017). Values outperforming the baseline are underlined, the best value per row is in bold. The last three rows indicate the number of times the baseline was outperformed per method and measure.
grouped them under the terms holism, specificity, “messy” data and integration, highlighting identified differences between academic and industrial perspectives. Finally, we reported new results on our data set to illustrate some of these challenges and proposed solutions more concretely.

5. Acknowledgments
This research was partially funded by the Google Digital News Initiative. We thank DER STANDARD and their moderators for the interesting collaboration.
https://www.digitalnewsinitiative.com
6. Bibliographical References
Le, Q. and Mikolov, T. (2014). Distributed representations
of sentences and documents. In Proc. ICML, pages 1188–
1196, Beijing, China.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and
Potts, C. (2011). Learning word vectors for sentiment
analysis. In Proc. ACL, pages 142–150, Portland, OR,
USA.
Pang, B. and Lee, L. (2005). Seeing stars: Exploiting class
relationships for sentiment categorization with respect
to rating scales. In Proc. ACL, pages 115–124, Ann Arbor,
MI, USA.
Sakunkoo, P. and Sakunkoo, N. (2009). Analysis of social
influence in online book reviews. In Proc. AAAI, pages
308–310, San Jose, CA, USA.
Schabus, D., Skowron, M., and Trapp, M. (2017). One
million posts: A data set of German online discussions.
In Proc. SIGIR, pages 1241–1244, Tokyo, Japan.
Snyder, B. and Barzilay, R. (2007). Multiple aspect ranking
using the good grief algorithm. In Proc. ACL, pages 300–
307, Rochester, NY, USA.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D.,
Ng, A., and Potts, C. (2013). Recursive deep models for
semantic compositionality over a sentiment treebank. In
Proc. EMNLP, pages 1631–1642, Seattle, WA, USA.