
Academic-Industrial Perspective on the Development and Deployment of a Moderation System for a Newspaper Website

Dietmar Schabus, Marcin Skowron
Austrian Research Institute for Artificial Intelligence (OFAI)
Freyung 6/6, 1010 Vienna, Austria
{dietmar.schabus, marcin.skowron}@ofai.at
Abstract
This paper describes an approach and our experiences from the development, deployment and usability testing of a Natural Language
Processing (NLP) and Information Retrieval system that supports the moderation of user comments on a large newspaper website. We
highlight some of the differences between industry-oriented and academic research settings and their influence on the decisions made
in the data collection and annotation processes, selection of document representation and machine learning methods. We report on
classification results, where the problems to solve and the data to work with come from a commercial enterprise. In this context, typical for NLP research, we discuss relevant industrial aspects. We believe that the challenges faced as well as the solutions proposed for
addressing them can provide insights to others working in a similar setting. Data and experiment code related to this paper are available
for download at https://ofai.github.io/million-post-corpus.

1. Background
For about two years, we have been working on an applied research project in a collaboration between a research institute and a large Austrian broadsheet newspaper (DER STANDARD), which supports the moderation of the comments posted to the newspaper's website¹ by its readers. Like many newspaper websites, DER STANDARD's website features a comment section below each newspaper article, where users engage in discussion. In the year 2017, more than 9.5 million comments were posted by more than 55,000 distinct users. To ensure high quality in the discourse, the newspaper's community management department invests considerable effort in the moderation of the discussion fora, using both machine-learning-based tools and a team of professional human forum moderators. With the project goal of improving the moderation, the moderators have defined eight relevant categories of posts and have annotated a collection of posts with respect to these categories. The annotated categories are "negative sentiment", "positive sentiment", "off-topic", "inappropriate", "discriminating", "feedback", "personal stories" and "arguments used". A detailed description of the categories and the annotation process, as well as the resulting data set and the baseline classification results, is provided in Schabus et al. (2017). Both the data set and the classification experiment code are available online for research purposes.² We have designed, developed and deployed a moderator dashboard that provides various ways of searching, filtering, sorting and aggregating according to the introduced set of categories, to help the moderators find locations in the discussions where moderation actions are required. In this process, we have addressed the following tasks, which often interconnect predominantly research- and industry-oriented aspects:

• Automatic labeling of new user comments according to the defined categories – a text classification problem,
• Using these predictions and other (meta-)data for finding posts and/or entire discussions that require moderator attention – an information retrieval problem,
• Providing a user interface to the moderators so they can use the results of the above in their workflow – user interface design,
• Integrating the resulting system into the existing IT infrastructure – system integration.

Since we want to deliver a (prototype) system that is usable in practice, our setting differs considerably from academic research in several aspects. For example, both user interface design and system integration are not typically relevant in NLP research.

¹ https://derstandard.at
² https://ofai.github.io/million-post-corpus

2. Challenges
In this section, we describe in detail a few of the challenges we have faced and how we addressed them. We have grouped them under the four terms holism, specificity, "messy" data and integration, highlighting differences between academic and industrial settings.

2.1. Holism
In academic research we are often focused on a highly specific problem, and we can make extensive assumptions about aspects that are not in the immediate center of attention. In contrast, industrial endeavors require a more holistic view; they need to work with given practical settings and address the specific requirements of live systems.
In particular, we have identified three relevant perspectives on our project, all of which need to be considered simultaneously: The scientific/technical perspective focuses on questions such as which methods to apply for particular subproblems, how to best make use of the available data, which evaluation metrics to apply, etc.
The industrial perspective focuses on the operational realization and deals with topics like integration and interfaces, software quality, performance and scalability, privacy, security, backups, etc.
Finally, the user perspective is concerned with the benefits the system is able to deliver to the end-users, in our case the moderators working for the newspaper. For example, using evaluation metrics like precision, recall and F1-score for a classification problem is well-established in machine learning research (scientific perspective), but measuring the time savings for a well-defined moderation task tells us more adequately how well we are addressing the needs of the moderators (user perspective). It has proven beneficial to frequently switch between these perspectives during the project timespan, or to consider and address them simultaneously.

2.2. Specificity
Even though we have identified the requirement for holistic thinking as one challenge for our endeavor in the previous subsection, we can at the same time also name challenges that come from the highly specific practical needs in the given real-world setting. For example, the applied classification scheme (i.e., the annotated categories) could be criticized in a purely academic setting as being specifically tailored to the needs of one particular newspaper. Indeed, it is difficult to find related work that deals with text classification according to categories like "arguments used", "personal stories" or "feedback", originating as they do from the concrete moderation needs at DER STANDARD. And even for our categories negative/positive sentiment, the related research literature in sentiment analysis often relates to online reviews of different things, like movies (Pang and Lee, 2005; Socher et al., 2013; Maas et al., 2011; Le and Mikolov, 2014), restaurants (Snyder and Barzilay, 2007) or books (Sakunkoo and Sakunkoo, 2009), where a clear indication of sentiment can be expected in every document, and where the domain restriction can be expected to facilitate sentiment classification. In our data, sentiment has yet another specific meaning stemming from the application; in particular, the moderators are interested in locating negative sentiment in order to prevent escalation in user discussions. In general, in a concrete industrial setting it is likely that one has to deal with very specific phenomena, use cases and goals, and results from academic research might not carry over directly.

2.3. "Messy" Data
Data annotation by humans is time-consuming and thus costly, especially when specific domain expertise is necessary, as is the case with the categories the moderators have defined with their particular moderation goals at DER STANDARD in mind. We need to ensure efficient use of moderator time in data annotation for model training and evaluation. Furthermore, most categories can be considered rare anomalies, resulting in strongly unbalanced data (e.g., the binary category "discriminating" has a prevalence of about 8% in our data set). These two factors explain the somewhat complicated, exploratory annotation procedure described in our data set paper (Schabus et al., 2017): In the first attempt, where 1,000 user comments were selected randomly for annotation, some categories were very weakly represented. Subsequently making use of the moderators' experience in selecting suitable topics (e.g., articles about the refugee crisis or gender mainstreaming for finding discriminating posts) turned out to be helpful for acquiring additional positive instances.³ However, this also has unwanted side-effects: First, it means that many posts are annotated only according to one particular category, i.e., the data sets for the categories are mostly disjoint, and consequently separate classification models must be trained, rather than a single multi-label model. Second, the class distributions in the labeled data are no longer necessarily indicative of the real class distributions in practice. And even if we accept these disadvantages, it still does not mean that we have vast amounts of data: In our data collection, even the better represented categories have fewer than 2,000 positive examples.
While in academic research settings methods are typically evaluated on carefully compiled benchmark data sets, which have reasonable balance and size, in practical industry-applied research settings these might not always be available in similar quality and quantity. Our situation is also different from what one might associate with an industrial setting typical for large companies, which have fewer limitations in terms of available data or capacity to conduct large-scale annotations. In the presented approach we thus focused on the efficient usage of the available data, acknowledging its characteristics, which are neither typical for academia nor for large enterprises.

³ Positive here means that the property in question is present, e.g., that a given posting does exhibit the characteristics of the "discriminating" category.
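As an illustration of the per-category setup described above, the following minimal sketch trains and evaluates one binary classifier per category, with class weighting as one plausible way to counter the imbalance. It is purely hypothetical: the data and prevalences are synthetic placeholders, and the paper does not state that class weighting was used in the actual system.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))                # placeholder comment vectors
labels = {                                      # one binary target per category
    "discriminating": rng.random(1000) < 0.08,  # ~8% prevalence, as in our data
    "off-topic": rng.random(1000) < 0.15,       # placeholder prevalence
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for category, y in labels.items():
    # class_weight="balanced" upweights the rare positive class, so the SVM
    # does not simply collapse to always predicting the majority class
    clf = LinearSVC(class_weight="balanced")
    pred = cross_val_predict(clf, X, y, cv=cv)
    # minority-class F1 is the measure reported throughout this paper
    print(f"{category}: F1 = {f1_score(y, pred):.4f}")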
2.4. Integration
The goal of the project is to deliver a prototype system applicable in practice, i.e., supporting the moderation of current online discussions. Therefore, a connection to the production forum system is required, such that the prototype works with the live stream of new postings in near real-time. To achieve this, the prototype needed to be integrated into the existing IT infrastructure at the newspaper. The IT environments typically used in research institutions (operating systems, programming languages, software libraries, database systems, etc.) differ significantly from those used in commercial enterprises. In the former case, open-source libraries are often used, where new methodological advances become available quickly. In the latter case, systems from large commercial vendors are often preferred, with certifications, support services, etc. Letting researchers use the tools they are accustomed to is beneficial for flexibility and agility in experimental prototype development; on the other hand, a prototype using the same technologies as the existing environment facilitates integration. Our approach to this situation was to compromise: give the researchers flexibility in the core area of experimentation (e.g., machine learning frameworks), but adapt to the enterprise systems in other areas (e.g., the database system).
Another important aspect of adding an experimental prototype from a collaborative research project to a production environment is (data) security and privacy. No enterprise would tolerate the risks involved with a prototype system directly manipulating its production databases for a service used by thousands of users every day. Therefore, the data was mirrored to a dedicated database server for the prototype system, such that the production system is shielded from potential bugs. By restricting this mirroring to the data that is actually required, most privacy risks can be avoided (e.g., no personal data of users are mirrored).
In a practical setting, scalability and performance under peak loads become key factors. In our setting, with up to 200 new comments per minute and eight different labels to predict, the run-time performance requirements for prediction (time and memory) influenced the choice of methods. By keeping the models for comment representation and classification small enough to all fit into main memory simultaneously, and by processing new comments in batches, we achieve a throughput of almost 20,000 classified comments per minute on a machine with 16 cores. Finally, we need to keep in mind that the system needs to be completed on time before the end of the project, and operated and maintained by the industry partner after that. Therefore, clean code, suitable error handling and documentation are essential; these aspects typically can be, and often are, neglected in purely academic settings.
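To make the throughput considerations concrete, here is a rough sketch of such a batched prediction loop. The names fetch_batch, store_labels and the model objects are hypothetical stand-ins for the mirrored database and the trained models (using the representation described in Section 3.); this is not the deployed implementation.

import numpy as np

BATCH_SIZE = 500  # assumed value; the paper does not state the batch size

def build_vector(dm_model, dbow_model, tokens, topic_bits):
    # The representation is computed once per comment: two concatenated
    # paragraph vectors plus the binary topic indicators.
    return np.concatenate([dm_model.infer_vector(tokens),
                           dbow_model.infer_vector(tokens),
                           topic_bits])

def classify_stream(dm_model, dbow_model, svms, fetch_batch, store_labels):
    # All models stay in main memory; comments are processed in batches.
    while True:
        batch = fetch_batch(BATCH_SIZE)  # [(tokens, topic_bits), ...]
        if not batch:
            break
        X = np.stack([build_vector(dm_model, dbow_model, t, b)
                      for t, b in batch])
        # The single representation X is shared by all eight
        # per-category SVM classifiers.
        predictions = {cat: clf.predict(X) for cat, clf in svms.items()}
        store_labels(batch, predictions)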
stories”. Combining both representation extensions
3. Experimental Results results in further improvement, especially for the
To better illustrate some of the challenges we face in our “negative sentiment”, “off-topic” and “feedback”
concrete industrial setting, we report the results of new categories. Even when Methods 8 and 9 are not the best
experiments using our data set, which extend the performing, the differences are insignificant for practical
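A minimal sketch of this representation, assuming gensim's Doc2Vec implementation (the paper does not name the library, and tagged_corpus is a toy placeholder for the tokenized user comments):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged_corpus = [TaggedDocument(words=["ein", "beispiel", "kommentar"], tags=[0]),
                 TaggedDocument(words=["noch", "ein", "kommentar"], tags=[1])]

# dm=1 trains the distributed memory variant, dm=0 the distributed
# bag-of-words variant; 100 dimensions each, as in Methods 2-9
dm_model = Doc2Vec(tagged_corpus, dm=1, vector_size=100, min_count=1, epochs=20)
dbow_model = Doc2Vec(tagged_corpus, dm=0, vector_size=100, min_count=1, epochs=20)

def doc_vector(tokens):
    # concatenation yields a 200-dimensional document representation
    return np.concatenate([dm_model.infer_vector(tokens),
                           dbow_model.infer_vector(tokens)])

print(doc_vector(["ein", "test"]).shape)  # (200,)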
Secondly, we add topic features to the representation: Each of our user comments belongs to a news article, and for each news article we have meta-data including a topic path such as sports / motorsports / formula 1. We have selected 17 top-level topics (e.g., sports, economy, science, etc.) and added the resulting 17 binary dimensions indicating topic membership to the representation of each comment. Finally, we compare support vector machines with linear kernels against Radial Basis Function (RBF) kernels, motivated by the hypothesis that the new composite feature space requires more complex decision boundaries for accurate classification.
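The composite representation and the kernel comparison can be pictured roughly as follows; the document vectors, the labels and the reduced topic list are placeholders (only three of the 17 top-level topics are spelled out), so this is a sketch of the technique rather than the project's code.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

TOPICS = ["sports", "economy", "science"]  # illustrative subset of the 17 topics

def topic_bits(topic_path):
    # e.g. "sports / motorsports / formula 1" -> indicator vector for "sports"
    top = topic_path.split("/")[0].strip()
    return np.array([top == t for t in TOPICS], dtype=float)

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(400, 200))   # placeholder paragraph vectors
paths = rng.choice(["sports / motorsports", "economy / markets", "science"], 400)
X = np.hstack([doc_vectors, np.stack([topic_bits(p) for p in paths])])
y = rng.random(400) < 0.2                   # placeholder binary labels

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=cv, scoring="f1")
    print(f"{kernel}: mean F1 = {scores.mean():.4f}")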
The evaluation results of a 10-fold stratified cross-validation on the data set from Schabus et al. (2017) are given in Table 1, where Method 1 represents the baseline results from our prior study. Methods 2-9 represent all combinations of the three configuration options described above. Note that Method 2 is identical to the baseline except with regard to the number of dimensions (100 vs. 300).
In terms of F1-score, Method 9 (concatenation, topics and RBF kernel) outperforms the baseline on five of the eight categories, and Method 8 (concatenation, topics and linear kernel) outperforms the baseline on an additional category ("negative sentiment"). For the two remaining categories, "inappropriate" and "personal stories", the results of Method 9 are less than 0.01 below the baseline results. Using the concatenated representation generally helps to improve the prediction results (Methods 4 and 5 vs. Methods 2 and 3), most noticeably for the categories "feedback" and "personal stories". Adding topic information also generally improves the prediction results (Methods 6 and 7 vs. Methods 2 and 3), this time most noticeably for the categories "negative sentiment", "off-topic", "discriminating", "feedback" and "personal stories". Combining both representation extensions results in further improvement, especially for the "negative sentiment", "off-topic" and "feedback" categories. Even where Methods 8 and 9 are not the best performing, the differences are insignificant for practical settings, and we therefore choose one of these two for deployment, favoring a more uniform overall setup.
With respect to the challenges discussed in Section 2., the specificity of the data we work with becomes apparent in the context of the experiments. For example, there are no directly applicable baseline results to compare against for
most of our categories, with the exception of the two
“sentiment” categories where extensive prior work exists.
Here, however, the differences are in the definition and
scope of the labels, hindering direct comparison of
classification results. For example, Le and Mikolov (2014)
report sentiment classification accuracies above 90% on a
balanced data set of movie reviews, while our best result
for negative sentiment in terms of minority class F1-score
(0.6063) corresponds to only 63% accuracy on our set of
user comments, which are highly diverse in terms of topic,
style, length, author intention, etc.
Finally, the integration aspect also plays a role in selecting
the classification method to use in practice. For example,
with deep LSTM models, which achieved competitive
results in our previous work, we need to sequentially load
separate large models onto a GPU for efficient
classification, increasing the required efforts in operation
and maintenance of the system after the “hand-over” to
the industry partner. On the other hand, paragraph
vectors are an efficient representation in our scenario,
because they are computed only once for all eight
categories, and then fed into separate SVM models. The
resulting system is relatively light-weight and easier to
deploy and maintain in the long term.

Method                         1 (BL)   2        3        4        5        6        7        8        9
Concatenation                  ✗        ✗        ✗        ✓        ✓        ✗        ✗        ✓        ✓
Topics                         ✗        ✗        ✗        ✗        ✗        ✓        ✓        ✓        ✓
Kernel                         Linear   Linear   RBF      Linear   RBF      Linear   RBF      Linear   RBF

Negative sent.     Precision   0.5842   0.5653   0.5755   0.5832   0.5893   0.5975   0.6106   0.6112   0.6216
                   Recall      0.5624   0.5659   0.5228   0.5908   0.5192   0.5310   0.4684   0.6014   0.4837
                   F1          0.5731   0.5656   0.5479   0.5870   0.5520   0.5623   0.5301   0.6063   0.5441
Positive sent.     Precision   0.0397   0.0644   0.0707   0.0845   0.0618   0.1020   0.0851   0.0804   0.0977
                   Recall      0.4651   0.3488   0.3023   0.2791   0.2558   0.3488   0.2791   0.2093   0.3023
                   F1          0.0731   0.1087   0.1145   0.1297   0.0995   0.1579   0.1304   0.1161   0.1477
Off-topic          Precision   0.2065   0.1930   0.2039   0.2010   0.2090   0.2284   0.2579   0.2472   0.2524
                   Recall      0.6241   0.5897   0.4552   0.5707   0.4724   0.5741   0.4759   0.6086   0.4534
                   F1          0.3103   0.2908   0.2816   0.2973   0.2898   0.3268   0.3345   0.3516   0.3243
Inappropriate      Precision   0.1340   0.1074   0.1382   0.1218   0.1475   0.1203   0.1340   0.1179   0.1433
                   Recall      0.5776   0.5347   0.4059   0.5116   0.4059   0.5974   0.4158   0.5248   0.4125
                   F1          0.2175   0.1789   0.2062   0.1967   0.2164   0.2002   0.2027   0.1925   0.2128
Discriminating     Precision   0.1111   0.1038   0.1206   0.1115   0.1402   0.1207   0.1343   0.1223   0.1547
                   Recall      0.3936   0.4574   0.2057   0.4610   0.1844   0.5922   0.3440   0.5071   0.2837
                   F1          0.1733   0.1692   0.1520   0.1796   0.1593   0.2005   0.1932   0.1971   0.2003
Feedback           Precision   0.5240   0.4604   0.5039   0.4865   0.5393   0.4520   0.4743   0.4839   0.5311
                   Recall      0.7056   0.7233   0.6472   0.7317   0.7018   0.7679   0.7294   0.7633   0.7356
                   F1          0.6014   0.5626   0.5666   0.5844   0.6099   0.5691   0.5748   0.5923   0.6168
Personal stories   Precision   0.6247   0.5525   0.5462   0.5995   0.5835   0.5563   0.5771   0.5952   0.5898
                   Recall      0.8123   0.8160   0.8252   0.8271   0.8388   0.8394   0.8498   0.8332   0.8505
                   F1          0.7063   0.6589   0.6574   0.6951   0.6882   0.6691   0.6874   0.6944   0.6966
Arguments used     Precision   0.5657   0.5636   0.5398   0.5594   0.5457   0.5631   0.5434   0.5581   0.5458

Times > BL         Precision   -        1        3        2        6        4        5        4        6
                   Recall      -        5        2        5        2        5        3        5        3
                   F1          -        2        2        4        3        4        4        5        5

Table 1: Classification results: precision, recall and F1-score per method and category. BL indicates the baseline from Schabus et al. (2017). The last three rows indicate the number of times the baseline was outperformed per method and measure.

4. Conclusions
In this "industry track" paper, we have shared our experiences from a collaborative applied research project involving a small research institution and a medium-sized commercial enterprise. The goal of the project is to develop and deploy a prototype system that supports the moderation of user discussions on a large newspaper website. A key building block of this system is a text classification module predicting eight moderator-defined category labels. We have described a number of challenges faced in this context and grouped them under the terms holism, specificity, "messy" data and integration, highlighting identified differences between academic and industrial perspectives. Finally, we reported new results on our data set to illustrate some of these challenges and proposed solutions more concretely.
5. Acknowledgments
This research was partially funded by the Google Digital News Initiative (https://www.digitalnewsinitiative.com). We thank DER STANDARD and their moderators for the interesting collaboration.
6. Bibliographical References
Le, Q. and Mikolov, T. (2014). Distributed representations
of sentences and documents. In Proc. ICML, pages 1188–
1196, Beijing, China.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and
Potts, C. (2011). Learning word vectors for sentiment
analysis. In Proc. ACL, pages 142–150, Portland, OR,
USA.
Pang, B. and Lee, L. (2005). Seeing stars: Exploiting class
relationships for sentiment categorization with respect
to rating scales. In Proc. ACL, pages 115–124, Ann Arbor,
MI, USA.
Sakunkoo, P. and Sakunkoo, N. (2009). Analysis of social
influence in online book reviews. In Proc. AAAI, pages
308–310, San Jose, CA, USA.
Schabus, D., Skowron, M., and Trapp, M. (2017). One
million posts: A data set of German online discussions.
In Proc. SIGIR, pages 1241–1244, Tokyo, Japan.
Snyder, B. and Barzilay, R. (2007). Multiple aspect ranking
using the good grief algorithm. In Proc. ACL, pages 300–
307, Rochester, NY, USA.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D.,
Ng, A., and Potts, C. (2013). Recursive deep models for
semantic compositionality over a sentiment treebank. In
Proc. EMNLP, pages 1631–1642, Seattle, WA, USA.
