Informing Policy With Text Mining
Abstract

The fast development and adoption of ICT technologies and digital services have major social implications. Policy-makers often struggle to design appropriate regulations, defend the rights of citizens and ensure competition. The aim of this work is to present various methodologies that support stakeholders with the identification of emerging technologies and related social challenges based on the text mining of news articles. The analysis demonstrates that while each text mining algorithm provides insightful results, their combination yields a more detailed and more robust overview of media discussions. The results present early signals and trends, the relationships between technologies and social challenges, and changing attitudes towards selected tech issues.

Keywords: online news, web scraping, text mining, sentiment analysis, topic modelling
Funding

Acknowledgement: This research is part of a project that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780643 and grant agreement No 825652.
Preprint submitted to Technological Forecasting and Social Change 29th January 2020
Disclaimer: The information and views set out in this report are those of the author(s) and do not necessarily reflect the official opinion of the European Union. Neither the European Union institutions and bodies nor any person acting on their behalf may be held responsible for the use which may be made of the information contained herein.
1. Introduction

The fast development and adoption of ICT technologies and digital services have major social implications. Policy-makers often struggle to design appropriate regulations, defend the rights of citizens and ensure competition. The importance of text mining for informing policy has been recognised in recent years by different political institutions. A notable example is the European Commission, which has established the Competence Centre on Text Mining and Analysis. As its founders argue, text mining and analysis tools are necessary to address not only the problem of volume, but also of timeliness, in order to provide the right information in the proper format for the decision-making process, in a variety of contexts (European Commission, 2016).

The aim of our work is to present a methodology that supports policy-makers with the identification of emerging technologies and related social challenges. Using text mining techniques, we demonstrate how early signals and trends can be explored in news articles. Moreover, the methodology enables policy stakeholders to analyse the relationships between technologies and social challenges, and to gain deeper insights on the most relevant topics. The presented approach contributes to agenda-setting for policy, supporting problem recognition, definition and selection. Using the proposed solution, policy-makers are able to answer the following questions:
Our methodology implements various NLP techniques in a sequential order. The analysis demonstrates that while each text mining algorithm provides insightful results, their combination yields a more detailed and more robust overview of media discussions.

First, we identify emerging topics based on the analysis of changing term frequencies over time. The terms with the greatest increase in frequency per article can be filtered using regression analysis. The identified terms serve as input for further analysis. The connections between emerging social issues (such as fake news or privacy) and technologies are explored using co-occurrence and sentiment analysis techniques. The co-occurrence values are calculated between selected emerging trending topics, pinpointing pairs of terms that are most frequently mentioned together (such as hate speech and machine learning). In order to track the public perception of issues and to identify the positive and negative news stories related to a selected topic, sentiment analysis is performed.

To additionally verify our results, topic modelling is performed using Latent Dirichlet Allocation (Blei et al., 2003), a complementary approach to our topic identification strategy. LDA shows which terms define key topics across the text corpus, providing additional information on wide topics and co-occurrences. Finally, various robustness checks are carried out that are discussed in the Appendix.

The combination of these four techniques therefore provides a funnel. In the first stage, policy-makers gain a high-level overview of recent trends. In the second part of the process, various deep dives can be prepared, analysing the relationships between trending issues and the often changing public attitudes around them.
The results confirm that the methodology has huge potential in supporting informed policy-making and in decreasing the lag between technological changes and regulatory responses. Such a methodology has not been available to policy-makers before. The literature discussing the implementation of text mining for policy is very limited, with no studies presenting working solutions. Moreover, the methodology has significant advantages over prior efforts in the area of trend identification.

First, the method enables the automatic detection of trending technologies and issues. Unlike existing studies, the process does not require the prior selection of topics or keywords (e.g. Kim and Ju (2018)). In the absence of a complicated filtering process (e.g. expert groups), the methodology can be easily implemented, and it facilitates the exploration of new and unexpected areas by policy-makers. Second, by combining various NLP methods, the presented solution is not only more robust, but also provides insights beyond merely highlighting trends.
The presented results are available in the form of interactive visualisations at ngi.delabapps.eu. The raw results are stored in the Zenodo repository (https://zenodo.org/communities/engineroom/) and the code to replicate them is available at GitLab (https://gitlab.com/enginehouse/engineroom).
2. Literature Review

The identification of emerging technologies with text mining techniques has an established literature. Rotolo et al. (2015) provide a comprehensive literature review of methods applied for the analysis of trends in science and technology. Text mining also has a growing literature in forecasting methods, summarised by Kayser and Blind (2017). The implemented methods include the analysis of document frequencies, term frequencies, co-occurrence analysis, topic modelling and other approaches from the field of scientometrics (Morris et al., 2003; Bettencourt et al., 2008). Finally, a systematic literature review of text mining applications in policy-making was provided by Ngai and Lee (2016). The authors showed how text mining analyses can support different stages of the policy-making cycle, including (i) agenda setting, (ii) policy formulation, (iii) implementation and (iv) evaluation.
The identification of trends, especially technologies and research areas, is addressed in the following studies. Kim and Ju (2018) examined the trajectory of selected technologies in online news and blog posts based on daily frequencies of documents. Besides measuring the growth of frequencies, the authors calculated the skewness of the frequency distributions to examine the stability of trends. Yoon (2012) also focused on a set of pre-selected keywords in a defined field (solar energy). The described methodology is based on both term and document frequencies, also including a time dimension to differentiate between strong signals (high frequency and growing) and weak signals (low frequency but growing). Albert et al. (2015) analysed blog posts to identify whether a technology is basic (mature) or pacing (emerging). The authors created a list of terms associated with the two groups, calculated the term frequencies in various time periods, and implemented a fuzzy process to categorise selected technologies.

Several studies provide additional insight by analysing not only individual terms, but rather groups of keywords, e.g. by incorporating co-occurrences, topic modelling and various other clustering techniques. Lee and Park (2018) based their study on the work of Yoon (2012), concluding that a single keyword is not sufficient to identify a topic. In order to establish the meaning of word groups, the authors improved the methodology by incorporating co-occurrences. Bildosola et al. (2017) combined text mining methods (keyword co-occurrences, keyword counts) with various forecasting techniques to explore and map emerging technologies in the area of cloud computing. Lee and Jeong (2008) examined emerging areas in information security publications by hierarchical clustering of co-occurring keywords and network analysis. The authors also prepared a technology roadmap for robot technology using co-occurrence analysis. Kajikawa et al. (2008) tracked emerging energy research fields by clustering publications and analysing the growth in the number of publications by cluster. Roche et al. (2010) examined emerging fields in Biology, combining term frequencies with inverse document frequencies (the TF-IDF approach) in scientific literature. Li et al. (2019) analysed trends of topics generated with the Lingo algorithm in scientific documents and patents in a selected area. Xie et al. (2018) used the HDP topic modelling technique in various time periods to identify promising technologies related to solar cells. Based on the changes in topic-keyword probability distributions, emerging technologies were identified. Niemann et al. (2017) examined patent lanes in selected areas, analysing semantic similarities in patents and applying topic modelling. Li et al. (2015) proposed the usage of word networks to analyse two selected news events in Chinese online media.
Following the analysis of trends and the exploration of their topological features, further dimensions can be explored, such as the related public debate. Sentiment analysis is widely used in analysing the public perception of technologies based on news articles or social media. Choi et al. (2010) identified controversial issues and related sub-topics using query generating methods and sentiment models. Ku et al. (2006) presented algorithms for opinion extraction in news articles based on concept keywords and sentiment words. Kim et al. (2006) demonstrated an approach with semantic role labelling to extract opinions in news media.

Most relevant to our study are the works that explore multiple aspects of emerging technologies, combining various methods. The literature on such works is very limited: existing works mostly present a hybrid approach of topic modelling with sentiment analysis. Such an example is the work of Xie et al. (2018), who besides using topic modelling to identify emerging trends, also used sentiment analysis to calculate the share of news with positive or negative sentiment. Bian et al. (2016) analysed the public perception of topics related to IoT on Twitter. Similarly, Mejia and Kajikawa (2017) identified topics related to robots in newspaper articles and scientific papers, and examined sentiments.
The analyses of trends with text mining techniques are incorporated into a rather limited number of policy-oriented or economic studies. Baker et al. (2016) measured the uncertainty of economic policy based on the frequency of newspaper articles containing a set of keywords. Tobback et al. (2018) refined the methodology using modality annotation and a support vector machine classification model. Suh (2018) presented a framework for supporting agenda setting in multidisciplinary group discussions on energy policies using text mining techniques. Kralj Novak et al. (2014) demonstrated a method based on co-occurrence networks to extract relevant entities and time-varying relationships in business news. The robustness of the methodology is verified in a case study with financial data. Kim et al. (2017) predicted the bitcoin price with the implementation of topic modelling on bitcoin forum posts. Hisano et al. (2013) estimated the impact of business news on the stock market.

The presented studies demonstrate that various NLP methods can be used to analyse trends, identify the topologies of emerging technologies, and also extract opinions and sentiments of the public debate based on news articles. However, various limitations can be identified. We have identified three research gaps that are addressed by our study.

First, analyses focusing on trends and emerging technologies examined pre-defined, narrow areas or technologies. While some efforts identified areas gaining traction using e.g. topic modelling (Xie et al., 2018) or network analysis (Lee and Jeong, 2008), a significant share of the literature is restricted to the analysis of selected keywords (Kim and Ju, 2018; Yoon, 2012). Moreover, several works did not explore the full potential of text data; an example is the analysis of document counts instead of term frequencies (Kajikawa et al., 2008). In contrast to the existing literature, our approach does not require prior assumptions on trending technologies. Instead of reducing our sample or selecting keywords, the presented approach enables the automatic identification of trending terms. Such a methodology creates the opportunity to explore unexpected but highly relevant trending issues.

Second, the analysed studies mostly focus on the presentation of a selected text mining method. However, the combination of methods provides more insight, such as the application of sentiment analysis and topic modelling (Xie et al., 2018) or term frequencies and co-occurrences (Lee and Park, 2018). We implement all four methods in order to provide a comprehensive overview of trending technology news. Therefore, not only are the trending technologies and social issues identified (term frequencies), but the wider topics are also established (co-occurrences and topic modelling), along with the public perception and differing views (sentiment analysis).

Finally, there is a lack of studies using trend analysis for policy agenda setting. Although previous efforts show the potential of text mining in informing policy (Sun et al., 2016) or explaining economic processes (Kralj Novak et al., 2014), our study provides new tools to support policy-making and decrease the lag between technological change and regulations.
3. Dataset

3.1. Selection of sources

We have included 13 popular English-language online tech press sources in the analysis. The final sources have been selected in a four-stage process.

First, online sources were identified that covered and reported on early signals of technological change in the past. The feature of Google search to filter news from specific time periods was used to search for articles that covered promising technologies and business models of today early on. Additionally, the Google Trends tool was used to identify the particular time periods of early signs of a technology in online news. These periods are called the innovation trigger stage in the hype cycle literature, i.e. a time when awareness about the technology starts to spread and attracts first media coverage (Dedehayir and Steinert, 2016). A popular hype cycle model, introduced in the 1990s by the Gartner corporation, explains the evolution of a technology in terms of expectation or public visibility (y-axis) in relation to time (x-axis). The bell-shaped curve of hype around the blockchain technology, proxied by its online search popularity, is presented in Figure 1. The figure shows that the first mentions are likely to originate from the period 2010-2012. The chosen technologies were under intensive development (e.g. autonomous vehicles) or in the pursuit of practical application (e.g. blockchain) during 2018. The keywords used to identify relevant sources included "IoT", "virtual reality", "blockchain", "bitcoin", "smartwatch", "sharing economy" (articles from the period Jan 2010-Jan 2012), "autonomous cars", "big data" (news from Jan 2007-Jan 2008). This process helped us to identify around 25 sources that published articles at the beginning of the current decade on a set of promising technologies.

Second, the initial list of sources was supplemented with a set of relevant newer sources (e.g. Politico Europe, established in 2015), as well as high-quality non-US sources (e.g. Euractiv, The Guardian), in order to counterbalance the dominant American tech perspective in the analysis.

Third, those media outlets were prioritised that also covered tech news from a regulatory and social aspect. On the other hand, sources with a greater focus on consumer electronics or enterprise IT were not preferred.

Finally, some selected sources were excluded from the study due to technical reasons, such as a paid subscription business model. The final list of analysed sources, with the number of articles and the location of headquarters, is presented in Figure 2.
The articles and the accompanying metadata have been collected for a period of 38 months between 01.2016 and 03.2019. The collection process was conducted with the use of web-scraping tools. Web-scraping scripts are designed to recognise different types of content, and to extract and store only the ones specified by the user (Ignatow and Mihalcea, 2018). In this study, for each individual source, a separate script has been written in the Python programming language, using the web automation framework Selenium WebDriver. The scripts are available at the project repository: https://gitlab.com/enginehouse/engineroom/tree/master/scrapers. The collection of data had the following main elements:

1. The web crawler opened the news website and selected the category or column of news.
2. The metadata (title, author, short abstract, publishing date) of articles published between 01.01.2016 and 01.03.2019, along with hyperlinks directing to full texts, were collected and saved.
3. The full texts of articles were collected and saved.

The categories of scraped news sections, and the corresponding links, are included in the Appendix (Tables A.11 and A.12).
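The three collection steps can be sketched as below. Here `fetch_full_text` stands in for the Selenium WebDriver call that downloads one article body, and all function and field names are illustrative assumptions, not code from the project repository.

```python
from datetime import date

# Collection window used in the study (01.01.2016 - 01.03.2019).
START, END = date(2016, 1, 1), date(2019, 3, 1)

def in_window(published):
    """Step 2 filter: keep only articles published within the analysed period."""
    return START <= published <= END

def collect(metadata_rows, fetch_full_text):
    """Steps 2-3: filter the scraped metadata by date, then fetch full texts.

    `metadata_rows` is an iterable of dicts with 'title', 'date' and 'url'
    (the metadata saved in step 2); `fetch_full_text` abstracts the
    browser-driven download of a single article (step 3).
    """
    articles = []
    for row in metadata_rows:
        if in_window(row["date"]):
            articles.append({**row, "text": fetch_full_text(row["url"])})
    return articles
```

Keeping the metadata pass separate from the full-text pass mirrors the two save points described above, so a failed download does not lose the already-collected metadata.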
Figure 2: Number of articles per source
The results of the applied readability tests show that the analysed sources are rather homogeneous in terms of reading difficulty. Not surprisingly, among the top 3 most difficult sources, two focus on the political aspects of technologies (Euractiv and Politico Europe).
4. Methodology

4.1. Term-frequencies and regression analysis

For each source/month pair, the average number of term occurrences per article has been counted. The words have been transformed to their stemmed (base) form using SnowballStemmer. For readability purposes, all tables present the identified terms in human-readable form (e.g. 'election' instead of the stemmed 'elect'). Afterwards, the weighted average of frequencies by source has been calculated. Weights have been assigned to ensure that no source has excessive influence on the final results due to its number of articles, and to maintain a relative balance between American and other sources. Robustness checks and expert analysis confirmed that weighted sources perform slightly better than unweighted ones (see Table B.15 in the Appendix).
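The weighted averaging across sources can be sketched as follows; the weight values themselves are set as described above and are not reproduced here, and the names are illustrative.

```python
def weighted_frequency(freq_by_source, weight_by_source):
    """Weighted average of a term's per-article frequency across sources.

    `freq_by_source` maps a source to the term's average occurrences per
    article in a given month; `weight_by_source` carries the weights that
    keep large outlets and American sources from dominating the average.
    """
    total = sum(weight_by_source[s] for s in freq_by_source)
    return sum(f * weight_by_source[s] for s, f in freq_by_source.items()) / total
```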
For all terms which occurred at least once in the last two months of the analysis, an ordinary least squares regression has been performed for the entire time period, and also for the last 3, 6 and 12 months. The dependent variable of the estimation is the weighted frequency, while the number of months since the beginning of the analysed period is the independent variable. The result is a single coefficient β (referred to as coef). Terms with the highest β coefficients have grown the most. However, the top growing words are always stopwords, due to their sheer number of occurrences. Most lists of stopwords are not domain-specific: NLTK's list does not include words such as "internet", which should be regarded as a stopword in modern technological media. Instead of creating a domain-specific stopwords list, we divided the coefficient by the mean weighted frequency over all months of the regression. The resulting normalised coefficient (coef norm) delivers a number which can be used to window out irrelevant terms by setting a threshold a term needs to achieve to be included in further analysis. The threshold has been set to 0.025, a value high enough to remove stopwords (including domain-specific ones), but low enough to allow the capture of early signals of new technologies and of quickly growing established topics.

The 1000 most significantly growing terms (those with the largest coefficients that are above the threshold for the normalised coefficient), both words and bigrams, have been reviewed, and the relevant terms for further analysis have been selected.
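A minimal sketch of this filtering step, assuming the monthly weighted frequencies of a term have already been computed: the 0.025 threshold is the one stated above, while the function and variable names are illustrative.

```python
import numpy as np

def trend(monthly_freq, threshold=0.025):
    """OLS slope of weighted frequency on the month index (coef), the
    slope divided by the mean frequency (coef norm), and whether the
    term passes the stopword-removing threshold."""
    freqs = np.asarray(monthly_freq, dtype=float)
    months = np.arange(len(freqs))
    coef = np.polyfit(months, freqs, 1)[0]   # slope of the OLS fit
    mean = freqs.mean()
    coef_norm = coef / mean if mean > 0 else 0.0
    return coef, coef_norm, coef_norm >= threshold
```

Applied to every term that occurred in the last two months, this yields the ranking from which the top 1000 terms were reviewed; a frequent but flat term (a stopword-like series) has a slope near zero and falls below the threshold.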
but co-occur consistently with the analysed word. The sum of boolean values has been divided by the number of articles and – just like in the basic method – weighted by source. The 100% value for a term co-occurring with itself has been maintained.

Both for the basic and the boolean method, the results have been normalised by (i) the mean frequency and (ii) the square root of the mean frequency of the co-occurring term. The normalised values are useful for finding the terms which often co-occur with the analysed term, but are rarely present in other articles. The terms with the highest normalised values are strictly connected to the analysed topic.
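The boolean variant and its normalisation can be sketched as below. This is one plausible reading of the description (the share is taken among articles containing the focus term, which preserves the 100% self-co-occurrence value), and the per-source weighting from the text is omitted for brevity; all names are illustrative.

```python
import math

def boolean_cooccurrence(articles, focus, other, normalise=None):
    """Share of focus-term articles that also mention `other`, optionally
    normalised by the co-occurring term's overall document frequency
    ('mean') or its square root ('sqrt').

    `articles` is a list of sets of terms. A term co-occurring with
    itself yields 1.0 in the unnormalised variant.
    """
    with_focus = [a for a in articles if focus in a]
    if not with_focus:
        return 0.0
    both = sum(1 for a in with_focus if other in a) / len(with_focus)
    other_freq = sum(1 for a in articles if other in a) / len(articles)
    if normalise is None or other_freq == 0:
        return both
    if normalise == "mean":
        return both / other_freq
    return both / math.sqrt(other_freq)  # damps the penalty on common terms
```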
4.4. Topic modelling

Topic modelling refers to a combination of "algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents (...) topic models can organize the collection according to the discovered themes" (Blei, 2012a). The most popular method of topic assignment is the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003), a probabilistic topic model using a Bayesian formulation to reveal hidden (latent) topics in a given text corpus. Documents in the corpus are treated as bags-of-words, i.e. the word ordering is not taken into account. Being an unsupervised machine learning algorithm, LDA requires neither a training dataset nor the manual coding of documents. The topics obtained via LDA analysis are probability distributions over terms. Each topic consists of a different set of terms characterised by a certain probability of appearance in the given subset of texts (Blei, 2012b). Instead of beginning the analysis with a predefined set of terms and codes derived from domain expertise, the researcher specifies the number of topics the algorithm is supposed to find. The additional key parameters of the algorithm include α, the prior topic-document distribution, and β, the prior word-topic distribution.
Designating a specific number of topics is considered to be a challenging process in text mining analyses (Ignatow and Mihalcea, 2018). The risk is that "choosing too few topics will produce results that are overly broad, while choosing too many will result in the over-clustering of a corpus into many small, highly-similar topics" (Greene et al., 2014). Similarly, the concentration parameters of the Dirichlet distribution (α and β) highly influence the quality of the topics (Wallach et al., 2009).

Certain solutions are commonly deployed in order to obtain interpretable and coherent topics. Human-generated topic rankings are considered to be the gold standard for coherence evaluation, but their implementation is costly (Röder et al., 2015). LDA models are often evaluated based on the log-likelihood value for a held-out test set. However, Chang et al. (2009) showed that predictive likelihood (and hence perplexity) can differ significantly from the human assessment of the quality of topics. Numerous studies concentrated on the measurement of coherence of generated topics and assessed its correlation with human topic ranking data (Röder et al., 2015; Aletras and Stevenson, 2013; Newman et al., 2010).

In this analysis, we utilise the state-of-the-art coherence framework proposed by Röder et al. (2015). The method assesses the coherence of topics by analysing the pairs of the N most frequent topic words. The pipeline consists of four major steps: i) segmentation, ii) probability estimation, iii) confirmation measure and iv) aggregation. Segmentation creates the set of two-word combinations of the N top topic terms. Probability estimation provides the probability of a word pair: the number of documents in which the given pair occurs is divided by the total number of documents. The confirmation measure shows the level of agreement of a given word pair with the use of Pointwise Mutual Information (PMI). PMI quantifies the discrepancy between the probability of the two words' coincidence given their joint distribution and their individual distributions. Equation (2) normalises the values of PMI to between -1 and +1 (NPMI).
\[
PMI(w_i, w_j) = \log \frac{P(w_i, w_j) + \epsilon}{P(w_i) \cdot P(w_j)} \tag{1}
\]

\[
NPMI(w_i, w_j) = \frac{PMI(w_i, w_j)}{-\log\!\left(P(w_i, w_j) + \epsilon\right)} \tag{2}
\]
Finally, all confirmation measures are aggregated into a single score using the arithmetic mean.
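Under the document-share probability estimates described above, Equations (1)-(2) translate directly into code; ε is the small constant that avoids taking the logarithm of zero, and the function name is illustrative.

```python
import math

def npmi(n_both, n_i, n_j, n_docs, eps=1e-12):
    """NPMI of a word pair from document counts (Eqs. 1-2): probabilities
    are shares of documents containing the word(s)."""
    p_i, p_j = n_i / n_docs, n_j / n_docs
    p_both = n_both / n_docs
    pmi = math.log((p_both + eps) / (p_i * p_j))
    return pmi / -math.log(p_both + eps)
```

Two words that always appear together score close to +1, while statistically independent words score close to 0; the topic score is then the arithmetic mean over all top-word pairs of the topic.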
We optimise the coherence value with respect to three hyper-parameters of the LDA model: the number of topics, α and β (see Section 5.4). Standard text preprocessing consisting of four steps (removing stopwords, conversion to lowercase, tokenization and vectorization), as well as the measurement of topic coherence, was conducted using the Gensim package (Řehůřek and Sojka, 2010). Interactive visualizations of the obtained topics were prepared with the pyLDAvis package (Sievert and Shirley, 2014)¹.
5. Results

The described methods are implemented in a sequential order to identify and analyse trending news stories on technologies and related social challenges. First, the most trending terms are filtered out based on term-frequency analysis. Second, the wider topics are mapped and the relationships between trending terms are established with co-occurrence analysis. Third, the surrounding emotions and public perception of topics are examined, along with the most positive and negative associated terms. Finally, topic modelling is used to additionally verify the identified trending areas. Such an approach makes it possible to highlight the most important technological and social issues (Section 5.1) and to explore them in greater detail (Sections 5.2 and 5.3). In order to present the value of the methods in informing policy, 3 case studies are analysed: privacy, information in social media, and the technology sector in China. These trending areas have been chosen due to their high regulatory and social relevance.

¹ Both packages are open source Python tools.
We begin the analysis with the identification of emerging topics in the examined sources. The methodology is based on the regression analysis: the regression coefficient (coef) reveals the trend of growing terms. As the value of the coefficient is heavily dependent on the average frequency of a term, it highlights relatively frequent and trending words. However, the aim of the exercise is to capture early signals of technologies and social issues that may still have low frequencies. Therefore, a normalised coefficient (coef norm) is calculated, which is the coefficient divided by the average frequency of the term. This normalised coefficient is used to exclude terms that have growing, but already large, frequencies, such as stop words.

Table 3: Bigram Coefficients
Table 2 presents the results for unigrams and Table 3 for bigrams, sorting the terms by coef. The results show that various technologies (related to AI/ML, 5G), regulatory issues (tech giants, Cambridge Analytica) and consumer products (iPhone) gained traction in the online tech press. The top 20 words are all closely related to tech topics, with the exception of world series. This supports that adequate sources were selected and that the topic identification methodology performs well in finding hot topics.

We have reviewed the top 1000 trending uni- and bigrams (based on coef) and summarised the most relevant terms in Table 4. The results provide an overview of the most important topics the online tech press covered in recent years. The first half of the table includes topics related to computer science, consumer products and future technologies, while the lower section summarises various social and regulatory challenges.

The results demonstrate the importance of such technologies as AI/ML algorithms, robotics, autonomous vehicles, quantum computing, IoT or decentralised computing. Moreover, the identified terms reveal various lesser-known, domain-specific terms, such as massive MIMO from 5G technology, quantum key distribution (QKD) from quantum computing, or cloud native related to cloud computing.

The identified social issues reflect recent regulatory challenges: e.g. online privacy, fake news, cybersecurity, net neutrality, election meddling or the growing influence of China. Similarly to technologies, the analysis presents numerous terms outside of the mainstream: e.g. section 230, related to online content moderation, or the platform Gab, in relation to hate speech and political extremism.

Another desired attribute of the results is that they lack buzzwords from the past. As an example, big data was a hot topic in the past; however, this bigram has not been identified in the analysis. On the other hand, discussions have moved to technologies that exploit big data, such as machine learning algorithms.

In the case of policy-making, it is especially crucial to be informed about early signals of social issues. For an even easier filtering of relevant topics, the coef is calculated for the periods of the last 3, 6 and 12 months. The table is included in the Appendix (Table B.13).
Table 4: Trending topics based on top 1000 terms

AI: 'artificial intelligence', 'machine learning', 'facial recognition', 'ai system', 'conscious computing', 'general intelligence', 'autonomous weapon', 'ai research', 'reinforcement learning', 'project maven', 'neural network', 'training data', 'black box', 'false positive', 'lethal autonomous', 'convolutional neural', 'ai strategy', 'openai', 'tensorflow', 'duplex', 'alphazero', 'pytorch', 'ai-driven'
robot: 'ubtech', 'sphero', 'killer robot', 'smart robot', 'misty robot', 'sex robot', 'aibo'
autonomous vehicle: 'nuro', 'waymo'
5g: '5g network', '5g technology', '5g smartphone', 'massive mimo', '5g equipment', '5g spectrum', 'mmwave'
quantum computing: 'quantum computing', 'quantum technology', 'qubit', 'd-wave', 'qkd'
IoT: 'iot tech', 'smart speaker', 'voice assistant', 'wireless charger', 'foldable phone'
decentralised computing: 'edge computing', 'cloud native', 'serverless computing', 'edge device', 'iot edge', 'kubernet', 'multi-cloud', 'snowflake', 'databrick', 'rubrik', 'drivescale', 'gcp', 'cncf'
crypto: 'cryptocurrency exchange', 'cryptocurrency miner', 'cryptocurrency', 'blockchain', 'ico', 'monero', 'blockchain-based', 'cryptoasset', 'coinhive'
competition: 'tech giant', 'digital tax', 'digital market', 'network giant', 'tax break', 'surveillance capitalism', 'second headquarter', 'neutrality law', 'neutrality repeal', 'fcc commission', 'paid priority', 'gafa'
climate: 'climate change', 'global warming', 'circular economy', 'paris agreement', 'greenhouse gas', 'ipcc'
cybersecurity: 'cyber security', 'data management', 'data breach', 'fake account', 'password manager', 'vpn service', 'wannacry ransomware', 'phishing email', 'cybersecurity act', 'wire fraud', 'multi-factor authentication', 'cybersecurity standard', 'equifax', 'meltdown', 'notpetya', 'spyware', 'magecart', 'pwned', 'robocall'
content crisis: 'fake news', 'hate speech', 'conspiracy theory', 'media platform', 'political ads', 'copyright directive', 'article 13', 'terrorist content', 'false information', 'russian troll', 'disinformation campaign', 'illegal content', 'content moderation', 'deep fake', 'white supremacist', 'troll farm', 'section 230', 'media literacy', 'disinformation', 'anti-vaccine', 'infowars', 'fact-checking', 'anti-semitism', 'gab'
privacy: 'facebook user', 'cambridge analytica', 'user data', 'location data', 'data broker', 'privacy setting', 'privacy scandal', 'privacy standard', 'gdpr compliance', 'gdpr', 'aadhaar', 'duckduckgo'
democracy: 'european election', 'trade war', 'vote leave', 'yellow jacket', 'no-deal brexit', 'russian interference', 'election meddling', 'influence campaign', 'vote machines', 'netizen'
china: 'chinese government', 'chinese companies', 'chinese telecom', 'chinese tech', 'chinese intelligence'
health: 'health record', 'screen time', 'genome editing', 'gene drive', 'e-cigarettes', 'biohacking'
sexual harassment: 'sexual harassment', '# metoo', 'sexual misconduct', 'sex trafficking'
work: 'tech worker', 'work conditions', 'home office', 'wework'
Next, the evolution of the frequency of various terms is examined. Figure 3 compares the 3-month moving average of frequency per article for three terms: AI, 5G and GDPR.
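The moving average used for these comparisons can be sketched in a few lines, assuming a list of monthly frequency-per-article values; a trailing-window convention is one plausible choice, as the paper does not specify how series edges are handled:

```python
def moving_average(series, window=3):
    """Trailing moving average over `window` points; the first
    window-1 points average only the values available so far."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        chunk = series[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```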
Figure 4: Term frequencies for cryptocurrency topic
AI has been the second most rapidly growing unigram in the collected articles, showing a strong increase between 09.2016 and 01.2018. During 2018, its term frequency stagnated. In comparison, the term 5G has experienced periods of quick increase and decline. At the beginning of 2019, the average frequency of 5G even surpassed that of AI. A similar pattern can be observed for GDPR: its frequency first increased gradually, then skyrocketed around May 2018, when the regulation came into force. After a short period of intense media interest, the frequency declined but remained at a relatively high level.
Figure 4 presents the term frequencies for three closely related terms: cryptocurrency, ICO (initial coin offering) and digital currency. The figure reveals a massive growth of news stories in the second half of 2017 and a sharp decline during 2018. This was the period of rising Bitcoin prices and massive hype around startups using ICOs to raise funding. The episode underlines the need for policy-makers to identify regulatory areas at an early phase and to reduce the lag between regulations and actual events.
To conclude, the analysis of term frequencies helps to evaluate whether a technology or a related social issue is a temporary topic or remains relevant.
in the context of such words as Chinese government, 5G network, giant Huawei, Ren Zhengfei and security threat. These keywords aptly describe recent news about the US administration recommending to avoid Huawei networking equipment due to Huawei's strong connections to the Chinese government (Reichert, C., 2019).
A similar story emerges for 5G, which has been covered together with the terms Huawei, trade wars and security risk. Besides the security aspect, 5G news has also been closely related to new consumer tech, such as foldable phones with 5G support. In the same manner, the analysis maps the application areas of neural networks, including the Google Brain project, image classification, the analysis of Higgs bosons and medical diagnosis.
Finally, the exercise maps two social/regulatory issues: hate speech and GDPR. The co-occurrences for hate speech show the related technologies (machine learning), actors (Mark Zuckerberg, Cambridge Analytica, Alex Jones) and wider problems (e.g. terrorist content).
Additionally, the size of the bubbles reveals the number of analysed paragraphs. In each graph, the top 3 paragraph counts are shown.
Figure 5: Co-occurrences for selected terms
Public perception has been rather volatile for the analysed terms. In the case of GDPR, the overall positive sentiment declined around May 2018, possibly as a reaction to the complications faced by users and businesses during its introduction. The public sentiment on Chinese tech also declined, changing from positive to neutral in 2018, most probably due to issues related to cybersecurity. The increase in negative news stories on Facebook CEO Mark Zuckerberg is especially visible. The first sharp decline occurred at the end of 2016, possibly due to the scandals related to misinformation campaigns and fake news on Facebook (Goulard, H., 2016). The second rapid decline coincides with the start of the Cambridge Analytica scandal in March 2018.
Besides tracking the positive and negative sentiments for selected topics, the different shades of a topic can be further examined by combining the co-occurrence analysis with the sentiment analysis. Tables 5-7 demonstrate that technologies and social challenges are related to numerous news stories that are described differently by the media. The tables summarise the most positive and negative co-occurrences based on articles containing both expressions, with sentiment computed on paragraphs. The words have been selected from among the 30 most positive and negative co-occurrences. Besides the calculated sentiments, the tables also show the number of paragraphs.
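A minimal sketch of this combination: paragraph-level sentiment averaged over the paragraphs that mention both expressions. The tiny lexicon and all names here are illustrative stand-ins for the sentiment model actually used in the paper:

```python
# Toy lexicon standing in for the paper's sentiment scoring.
LEXICON = {"great": 1.0, "growth": 0.5, "scandal": -1.0, "breach": -0.8}

def paragraph_sentiment(paragraph):
    """Mean lexicon score over a paragraph's words (0 if none match)."""
    words = paragraph.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def cooccurrence_sentiment(paragraphs, focal, other):
    """Average sentiment and count of paragraphs mentioning both terms,
    mirroring the (sentiment, count) pairs reported in Tables 5-7."""
    hits = [p for p in paragraphs
            if focal in p.lower() and other in p.lower()]
    if not hits:
        return 0.0, 0
    return sum(paragraph_sentiment(p) for p in hits) / len(hits), len(hits)
```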
Figure 7: Chinese Tech: Sentiment over time
Accordingly, news stories on GDPR have been most positive in the context of data management and the digital market, negative when covering the British Airways data breach (Gibbs, 2018), and neutral in the case of electronic health records or hate speech. In the case of Chinese tech, media coverage has been positive in the context of technologies, while negative in the context of the scandals related to Huawei in the US. Finally, news stories related to Mark Zuckerberg have been more positive in relation to GDPR, negative when mentioning content moderation, and neutral in stories on election interference or on the work of the committee led by Damian Collins in the British Parliament (Cadwalladr, C., 2018).
Table 5: GDPR: Positive and negative sentiments

Term                    Sentiment   Count
data management           0.2556     2594
machine learning          0.2521     4355
digital market            0.2496      991
processing data           0.2052     1289
explicit consent          0.1635     1179
british airways          -0.1704      177
cybersecurity serious    -0.0516      120
electronic health        -0.0246      201
face recognition          0.0129      236
hate speech               0.0488     1004
Table 6: Chinese tech: Positive and negative sentiments

Term                    Sentiment   Count
ai capable                0.276       303
ai research               0.2438      459
machine learning          0.2304      429
artificial intelligence   0.198      1281
5g smartphone             0.1874      276
face extradition         -0.2212      114
meng wanzhou             -0.1062      522
trade secrets            -0.0916      221
ren zhengfei             -0.0625      591
against huawei           -0.0298      286
Table 7: Mark Zuckerberg: Positive and negative sentiments

Term                    Sentiment   Count
jeff bezos                0.1967     1921
machine learning          0.1796     2740
artificial intelligence   0.1372     5971
privacy law               0.0944     1738
tech giant                0.0941     4912
content moderation       -0.0745     1158
hate speech              -0.0424     5182
damian collins           -0.0274     2149
election interference     0.0057     1589
fake account              0.007      2332
topic distribution and β: a Dirichlet prior on the per-topic word distribution. The hyper-parameters chosen based on coherence score maximisation are presented in Table 8; for all calculated values see Table D.20. We obtained coherence values (c_v) ranging from 0.55 to 0.60.
Figure 9: LDA: Cambridge Analytica topic
repercussions ('campaign'). Topic 12 encompasses regulatory issues around cryptocurrencies in the EU and US ('US Securities and Exchange Commission'). It tackles the subjects of taxation and novel forms of financing such as ICOs.

The main technological topics identified are presented in Table 10. Topic 1 accounts for the largest share of tokens, relating to the Internet of Things and cloud computing services. A decentralised approach to these technologies is indicated by the appearance of the term 'blockchain'. Topic 8 features issues related to smart devices. It focuses on two renowned gadget manufacturers famous for their rivalry, Samsung and Apple. Both hardware issues (displays and battery lifetime) and software issues (mobile operating systems) are encompassed in the topic. Topic 10 discusses broader cybersecurity issues such as cyber fraud ('ransomware') and attacks exploiting critical vulnerabilities in modern processors ('meltdown'). Topic 13 addresses the 5G race, its main contestants, as well as regulatory and hardware challenges. Topic 20 touches upon issues related to autonomous vehicles, such as the controversies around crashes of self-driving cars and the use of the autopilot feature, as well as the Federal Communications Commission spectrum proposals for vehicle-related communications.
Table 9: LDA: Top-15 keywords in social topics

youtube, video, people, state, internet, law, rule, house, senate, social
security, cyber, research, attack, google, risk, intelligence, agency, departure, healthcare
breach, inform, app, encrypt, access, blackberry, protection, email, advertisement, ad
court, regulation, sec, tax, bitcoin, lawsuit, copyright, rule, law, proposal
data, nix, profile, zeroday, harvest, scandal, scl, user, schroepfer, campaign
Table 10: LDA: Top-10 keywords in technological topics

Topic 1: amazon, digital, business, technology, blockchain, storage, brand, product, service, industry
Topic 8: battery, ios, android, pixel, display, smartphone, broadcom, xs, device, mac
Topic 10: flaw, secure, bug, hacker, ransomware, malicious, infect, code, server, meltdown
Topic 13: chip, pai, qualcomm, broadband, mobile, lg, amd, ericsson, carrier, 4g
Topic 20: autonomous, tesla, driver, fcc, neutral, driverless, crash, road, cruise, autopilot
5.5. Robustness checks

Various robustness checks, included in Appendix B, were carried out to validate the topic identification methodology. The results suggest that the assumptions of the methodology are adequate, yielding consistent results.
As revealed by Figures 3 and 4, trends are not always stable over longer time periods. To better highlight the most recent trends, the regression analysis was recalculated for shorter time periods: the last 3, 6 and 12 months (Table B.13). The results suggest that the baseline regression performs well in capturing the most trending terms at shorter time periods (e.g. topics such as 5G, deepfakes, elections, fake news and consumer electronics).
As an additional robustness check, an exponential regression was calculated, which may better suit the dynamics of term frequencies. However, this functional form has various disadvantages, as discussed in Appendix B.4.
The approach presented in the trend analysis (Section 5.1) is based on the assumption that all words in an article are equally important. However, it can be argued that the most important terms are located in the title and at the beginning of the text. Therefore, the term frequencies were recalculated, with more weight assigned to the title (5) and the first paragraph (3). Following the regression analysis, 92 words of the original top 100 most trending terms were located in the new top 250 (Table B.14).
The second robustness check concerned the weights of the sources. Instead of taking into account the number of articles published by each source, articles from all sources were assigned equal weights. The results were similar to our baseline, with 87 words from the top 100 appearing in the new top 250 (Table B.15).
6. Conclusions

This study presented a methodology for identifying trending topics in online news media, enabling a deeper exploration of technologies and related social challenges. Although text mining has an established literature on the identification of trending topics, we address numerous research gaps.

Previous studies are narrowly specialised with regard to the applied methods and the examined technological areas. However, policy-makers need information on the overall technological landscape at the very beginning of a policy cycle. In this paper, we have proposed a sequential text mining framework apt to inform policy-makers about the fast-changing tech landscape. The presented methods give policy-makers tools to quickly process vast amounts of information and discover new knowledge at a low cost. The temporal dimension of the analysis makes it possible to select the most relevant issues and dismiss overhyped hot topics, characterised by a sudden increase and an immediate drop in public discussion.
Our methodology brings together a set of straightforward text mining methods that are easy to diagnose, tune, evaluate and interpret. The proposed sequence of methods enables the exploration of news stories at different levels of granularity. The term frequency analysis provides a bird's-eye view of the emerging technologies and interrelated social issues. The co-occurrence analysis helps build the topologies of the most relevant topics. The changing public perception is tracked by the sentiment analysis. The combination of the co-occurrence and sentiment analyses is used to unravel the positive and negative stories related to a topic. Finally, topic modelling provides a robustness check to identify the dominant themes of discussion.
The implementation of our methodology is illustrated with the exemplary path a policy-maker can take through the results obtained for the period 01.2016 to 03.2018 from 13 popular English-language online tech press sources. The topic identification exercise revealed that the most trending technologies include AI and ML algorithms, blockchain, robotics, quantum computing, autonomous vehicles and various consumer technologies such as IoT devices. Among the most debated social issues, we identified the content crisis and fake news, privacy, election meddling, the rising influence of China, cybersecurity and competition in the digital economy.
Following the presentation of the main trending topics, selected case studies were explored in greater detail, including privacy and GDPR, the Chinese tech sector and the content crisis in social media. The LDA analysis largely supports these results, highlighting technologies such as autonomous vehicles and regulatory challenges including online content and the Chinese tech sector.
The raw results, documented programming scripts and interactive visualisations available on the paper's accompanying website let users explore the tech landscape from different angles. A basic programming background is sufficient for users to reproduce the results for a different set of sources and different time periods.
We have demonstrated that simple and explicable text mining techniques enable the automatic identification of trending topics based on online news. Moreover, the combination of methods provides more nuanced details on the stories from the tech world. The methodology therefore has the potential to decrease the policy lag, i.e. the time between the recognition of a policy challenge and the implementation of its solution.
Acknowledgements

We would like to express our sincere gratitude to dr hab. Katarzyna Śledziewska and our partners from the Engineroom and NGI Forward projects for their constant encouragement and support during the research process.
Appendix A. Data set
Table A.12: Links to the web-scraped sources
Source Link
The Register https://www.theregister.co.uk/
ZDNet https://www.zdnet.com/
Gizmodo https://gizmodo.com/
Reuters https://www.reuters.com/
Arstechnica https://arstechnica.com/
The Guardian https://www.theguardian.com/uk
Fastcompany https://www.fastcompany.com/
Techforge https://www.techforge.pub/
IEEE Spectrum https://spectrum.ieee.org/
Politico Europe https://www.politico.eu/
The Conversation https://theconversation.com/
Euractiv https://www.euractiv.com/
Gigaom https://gigaom.com/
Appendix B. Robustness checks

Appendix B.1. Regression analysis with shorter time periods
Table B.13: Most growing terms in the last 3, 6 and 12 months among those significantly growing (coef norm > 0.025) over the whole analysis period
Appendix B.2. Increased importance of title and first paragraph

This research rests on multiple implicit assumptions, for example that all words in an article are equally important and that the assigned weights correctly represent the importance of a source. We recalculated all results changing one assumption at a time; the final results do not change significantly regardless of the method. Out of the top 100 most growing words in the base methodology, 92 and 87 occur among the top 250 most growing words after the first and second change respectively.

More weight has been given to the title and the first paragraph in the coefficient calculations (title weight: 5, first paragraph weight: 3). Terms which are more salient and important are assumed to appear at the beginning of an article. This may be misleading if the title and the first paragraph are "clickbait": there is a risk of capturing divisive and eye-catching words which are not emerging technologies. Some of the changes may be desirable, e.g. 2017 disappears from the growing list, mostly because describing the past is unlikely to be the main theme of an article. However, words like "artificial", "quantum" and "ethics" disappear too, even though they may warrant further investigation: in many articles "artificial" (intelligence) may be a solution to a problem described in the title and lead paragraph, just as "ethics" (boards, committees) represents attempts to address AI issues such as "killer robots".
Table B.14: Weighting methods comparison: main vs 531* method

Words among top 100 trending in the main method, not in the top 250 of the 531* method: amazon, 2017, artificial, quantum, ethical, upcoming, attend, beijing

Words among top 100 trending in the 531* method, not in the top 250 of the main method: recognition, pdf, cup

*531 method: title's weight 5, first paragraph's weight 3, remaining paragraphs' weight 1
Appendix B.3. Removed source weights

The weights of the sources have been removed, so that all articles are equally important. Although Fastcompany and ZDNet are similarly popular according to their Alexa.com rank, ZDNet has much more influence on the final results, as it publishes four times as many articles as Fastcompany. There have been more changes in the trending keywords than in the first robustness check, but the additional keywords mostly concern the investigation of Russian collusion, some specific cloud and AI technologies, as well as telecommunication networks.

As the goal of this research is to discover trends in a wide variety of sources and find emerging technologies, we have chosen the methodology of weighting by source, without changing a term's importance based on its position in an article.
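One plausible reading of the two weighting schemes, sketched below: in the baseline, each source receives the same total influence regardless of how many articles it publishes, while the robustness check gives every article weight 1. The exact normalisation used in the paper may differ:

```python
def source_weights(article_counts, equal=False):
    """Per-article weights from a {source: article_count} mapping.

    Baseline: scale so every source contributes the same total weight
    (prolific sources are down-weighted per article).
    equal=True: the robustness check, where every article counts once.
    """
    if equal:
        return {s: 1.0 for s in article_counts}
    mean_n = sum(article_counts.values()) / len(article_counts)
    return {s: mean_n / n for s, n in article_counts.items()}
```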
Table B.15: Weighting methods comparison: main vs equal source weight

Words among top 100 trending in the main method, not in the top 250 with equal source weights: election, artificial, fake, climate, voter, attend, episode, democracy, yeah, vaccine, reese, le, biz, neural, nvidia, 4g, ces, nhs, 365, actor, rollout, optus, gift, lte

Words among top 100 trending with equal source weights, not in the top 250 of the main method: u.s., russian, equipment, edge, nbn, gizmodo, chrome, speaker, azure, recognition, mine, telstra, parliament
The β parameter is highest for terms with very high values in the last analysed month, low in the second-to-last month, and zero in all previous months. Consequently, normalisation is required: we multiply the β parameter by the mean frequency raised to the power of a scaling parameter s, i.e. scaled coefficient = β · (mean frequency)^s.
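The scaling just described can be sketched in a single helper (the function name is illustrative):

```python
def scaled_coef(beta, freqs, s=0.5):
    """Scale an exponential-regression coefficient by the term's mean
    frequency raised to the scaling parameter s. s = 1 approaches the
    OLS ranking; smaller s favours rarer, fast-moving terms."""
    mean_freq = sum(freqs) / len(freqs)
    return beta * mean_freq ** s
```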
If the scaling parameter is 1, the top results are similar to our OLS results: facebook, ai, 2018, 2017 and 5g are the five most trending words (the 1st, 2nd, 3rd, 8th and 5th highest OLS coefficients respectively). The lower the scaling parameter, the more specific the results become. With a value of 0.8, view20 (a Huawei smartphone released in December 2018) enters the top 10, and it becomes the most trending term with a scaling parameter of 2/3. With a scaling parameter of 1/2, some other terms (like thunberg) enter the top 10 most trending terms:
Table B.16: Highest scaled exponential regression coefficients with scaling parameter 1/2
The trending terms change significantly with scaling parameters below 0.5; the most trending words for a scaling parameter of 1/3 are as follows:
Table B.17: Highest scaled exponential regression coefficients with scaling parameter 1/3
Appendix C. Co-occurrence analysis

The top 30 co-occurring words are presented for selected terms.
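A minimal paragraph-level co-occurrence counter in the spirit of these tables; note that the published scores are fractional, so the paper evidently applies a weighting or normalisation beyond the raw counts sketched here:

```python
from collections import Counter
from itertools import combinations

def cooccurrences(paragraphs, vocab):
    """Count paragraph-level co-occurrences of the given terms."""
    pairs = Counter()
    for p in paragraphs:
        text = p.lower()
        present = [t for t in vocab if t in text]
        # Store each unordered pair once, in sorted order.
        for a, b in combinations(sorted(present), 2):
            pairs[(a, b)] += 1
    return pairs

def top_cooccurring(pairs, term, k=30):
    """Top-k terms co-occurring with `term` (cf. Tables C.18-C.19)."""
    scored = [(other, n) for (a, b), n in pairs.items()
              for other in ((b,) if a == term else (a,) if b == term else ())]
    return sorted(scored, key=lambda x: -x[1])[:k]
```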
Table C.18: Calculated co-occurrences for the terms Chinese tech and neural network
13 kai-fu lee 9 tech giant 7.21792
14 security concern 8.54665 geoffrey hinton 6.97004
15 communist party 8.07109 sophisticated algorithm 6.08807
16 ren zhengfei 7.85066 certain diseases 6
17 security risk 7.65887 accuracy diagnosis 6
18 security threat 7.64338 higgs boson 5.83333
19 meng wanzhou 7.4464 new image 5.71583
20 chinese telecom 7.39499 image classification 5.29268
21 huawei equipment 7.36787 reinforcement learning 5.23159
22 machine learning 7.25645 recognition system 4.81285
23 other chinese 6.65046 black box 4.40454
24 president trump 6.46544 quantum computing 4.36231
25 chief financial 6.28317 new ai 4.2136
26 financial officer 6.28317 voice assistance 4.11935
27 translation system 6 ai into 4.04834
28 another language 6 machine-learning model 4.04509
29 babel fish 6 expo world 4
Table C.19: Calculated co-occurrences for the terms GDPR, hate speech and 5G
13 mate x 1.9152 tech giant 4.84816 privacy regulation 4.69564
14 telecom equipment 1.80113 terrorist content 4.66608 facial recognition 4.21035
15 security concern 1.75714 facebook user 4.03051 tech giant 3.85276
16 new 5g 1.67168 ceo mark 3.85351 data broker 3.72572
17 chinese telecom 1.63324 analytica data 3.85322 analytica scandal 3.70646
18 galaxy fold 1.59965 2016 us 3.85134 privacy setting 3.31546
19 ren zhengfei 1.58783 fake account 3.83033 big tech 3.30796
20 trump administration 1.56132 conspiracy theory 3.76995 under gdpr 3.13044
21 chief financial 1.5492 content moderation 3.74147 consent agreement 3.1126
22 financial officer 1.54848 pr firm 3.45908 protection act 2.9683
23 network equipment 1.51906 graphic violence 3.45055 protection rule 2.87675
24 chinese firm 1.46971 media giant 3.43518 gdpr compliance 2.80291
25 cyber security 1.41964 data leak 3.31173 new general 2.71066
26 huawei technology 1.38123 content appearance 3.27126 health record 2.55844
27 tech giant 1.36108 account over 3.19663 electronic health 2.54785
28 security risk 1.2847 negative press 3.0735 ceo mark 2.49394
29 ajit pai 1.27578 communication officer 3.0735 global turnover 2.4243
Appendix D. Topic modelling

Hyper-parameter optimization
Topics Alpha Beta Coherence
35 0.01 symmetric 0.60
35 0.01 0.01 0.59
35 symmetric symmetric 0.59
35 symmetric 0.01 0.58
40 0.003 0.003 0.58
40 0.003 0.01 0.58
35 0.003 0.01 0.58
35 0.003 symmetric 0.58
35 symmetric 0.003 0.57
40 0.01 0.01 0.57
40 0.01 symmetric 0.57
30 0.01 symmetric 0.57
40 0.003 symmetric 0.57
35 0.003 0.003 0.57
30 symmetric 0.003 0.57
30 0.003 symmetric 0.57
35 0.01 0.003 0.57
30 0.01 0.003 0.56
30 0.003 0.003 0.56
30 symmetric symmetric 0.56
30 symmetric 0.01 0.56
40 0.01 0.003 0.56
45 symmetric symmetric 0.56
45 0.002 symmetric 0.56
45 0.01 symmetric 0.56
40 symmetric 0.01 0.55
45 symmetric 0.002 0.55
45 symmetric 0.01 0.55
30 0.01 0.01 0.55
45 0.002 0.002 0.55
30 0.003 0.01 0.55
45 0.002 0.01 0.55
40 symmetric symmetric 0.55
45 0.01 0.01 0.55
10 0.01 symmetric 0.55
References

European Commission, Competence Centre on Text Mining and Analysis, 2016. Available online at https://ec.europa.eu/jrc/en/text-mining-and-analysis (Accessed 31.03.2019).
V. Kayser, K. Blind, Extending the knowledge base of foresight: The contribution of text mining, Technological Forecasting and Social Change (2017).

J. Yoon, Detecting weak signals for long-term business opportunities using text mining of Web news, Expert Systems with Applications 39 (2012) 12543–12550.
Y. J. Lee, J. Y. Park, Identification of future signal based on the quantitative and qualitative text mining: a case study on ethical issues in artificial intelligence, Quality and Quantity 52 (2018) 653–667.

I. Bildosola, R. M. Río-Bélver, G. Garechana, E. Cilleruelo, TeknoRoadmap, an approach for depicting emerging technologies, Technological Forecasting and Social Change 117 (2017) 25–37.

B. Lee, Y. I. Jeong, Mapping Korea's national R&D domain of robot technology by using the co-word analysis, Scientometrics 77 (2008) 3–19.

Y. Kajikawa, J. Yoshikawa, Y. Takeda, K. Matsushima, Tracking emerging technologies in energy research: Toward a roadmap for sustainable energy, Technological Forecasting and Social Change 75 (2008) 771–782.

I. Roche, D. Besagni, C. François, M. Hörlesberger, E. Schiebel, Identification and characterisation of technological topics in the field of Molecular Biology, Scientometrics 82 (2010) 663–676.

X. Li, Q. Xie, T. Daim, L. Huang, Forecasting technology trends using text mining of the gaps between science and technology: The case of perovskite solar cell technology, Technological Forecasting and Social Change (2019) 1–18.

Q.-q. Xie, X. Li, L.-c. Huang, Identifying the Development Trends of Emerging Technologies: A Social Awareness Analysis Method Using Web News Data Mining, 2018 Portland International Conference on Management of Engineering and Technology (PICMET) (2018) 1–12.

H. Niemann, M. G. Moehrle, J. Frischkorn, Use of a new patent text-mining and visualization method for identifying patenting patterns over time: Concept, method and test application, Technological Forecasting and Social Change 115 (2017) 210–220.

H. Li, W. Fang, H. An, X. Huang, Words analysis of online Chinese news headlines about trending events: A complex network perspective, PLoS ONE 10 (2015) 1–22.

Y. Choi, Y. Jung, S. H. Myaeng, Identifying controversial issues and their sub-topics in news articles, Lecture Notes in Computer Science 6122 LNCS (2010) 140–153.
L.-W. Ku, Y.-T. Liang, H.-H. Chen, Opinion extraction, summarization and tracking in news and blog corpora, in: Proceedings of the AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006, pp. 100–107.
A. Sun, M. Lachanski, F. J. Fabozzi, Trade the tweet: Social media text mining and sparse matrix factorization for stock market prediction, International Review of Financial Analysis 48 (2016) 272–281.

O. Dedehayir, M. Steinert, The hype cycle model: A review and future directions, Technological Forecasting and Social Change (2016).

G. Ignatow, R. Mihalcea, Text Mining: A Guidebook for the Social Sciences, 2018.

R. Gunning, The Fog Index After Twenty Years, International Journal of Business Communication (1968).
J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, D. M. Blei, Reading Tea Leaves: How Humans Interpret Topic Models, Advances in Neural Information Processing Systems (2009).

N. Aletras, M. Stevenson, Evaluating topic coherence using distributional semantics, in: Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, Association for Computational Linguistics, Potsdam, Germany, 2013, pp. 13–22.

D. Newman, J. H. Lau, K. Grieser, T. Baldwin, Automatic evaluation of topic coherence, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 100–108.

R. Řehůřek, P. Sojka, Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50. http://is.muni.cz/publication/884893/en.

C. Sievert, K. Shirley, LDAvis: A method for visualizing and interpreting topics, in: Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pp. 63–70.

Reichert, C., US tells Germany to ban Huawei on 5G or it will share less intelligence: Report, 2019. Available online at https://zd.net/2u261VC (Accessed 31.03.2019).

Goulard, H., Facebook boss Mark Zuckerberg sued over hate speech, 2016. Available online at https://www.politico.eu/article/facebook-boss-mark-zuckerberg-sued-over-hate-speech/ (Accessed 31.03.2019).

Gibbs, How did hackers manage to lift the details of BA customers?, 2018. Available online at https://www.politico.eu/article/facebook-boss-mark-zuckerberg-sued-over-hate-speech/ (Accessed 31.03.2019).

Cadwalladr, C., Parliament seizes cache of Facebook internal papers, 2018. Available online at https://www.theguardian.com/technology/2018/nov/24/mps-seize-cache-facebook-internal-papers (Accessed 31.03.2019).
J. Chuang, C. D. Manning, J. Heer, Termite: Visualization Techniques for Assessing Textual Topic Models, International Working Conference on Advanced Visual Interfaces (2012) 74.