Informing Policy With Text Mining
Abstract

The fast development and adoption of ICT technologies and digital services have major social implications. Policy-makers often struggle to design appropriate regulations, defend the rights of citizens and ensure competition. The aim of this work is to present various methodologies that support stakeholders with the identification of emerging technologies and related social challenges based on the text mining of news articles. The analysis demonstrates that while each text mining algorithm provides insightful results, their combination yields a more detailed and more robust overview of media discussions. The results present early signals and trends, the relationships between technologies and social challenges, and changing attitudes towards selected tech issues.

Keywords: online news, web scraping, text mining, sentiment analysis, topic modelling
Funding

Acknowledgement: This research is part of a project that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780643 and grant agreement No 825652.
Preprint submitted to Technological Forecasting and Social Change 29th January 2020
Disclaimer: The information and views set out in this report are those of the author(s) and do not necessarily reflect the official opinion of the European Union. Neither the European Union institutions and bodies nor any person acting on their behalf may be held responsible for the use which may be made of the information contained herein.
1. Introduction

The fast development and adoption of ICT technologies and digital services have major social implications. Policy-makers often struggle to design appropriate regulations, defend the rights of citizens and ensure competition. The importance of text mining for informing policy has been recognised in recent years by different political institutions. A notable example is the European Commission, which has established the Competence Centre on Text Mining and Analysis. As its founders argue, text mining and analysis tools are necessary to address not only the problem of volume, but also of timeliness, in order to provide the right information in the proper format for the decision-making process, in a variety of contexts (European Commission, 2016).

The aim of our work is to present a methodology that supports policy-makers with the identification of emerging technologies and related social challenges. Using text mining techniques, we demonstrate how early signals and trends can be explored in news articles. Moreover, the methodology enables policy stakeholders to analyse the relationships between technologies and social challenges, and to gain deeper insights on the most relevant topics. The presented approach contributes to agenda-setting for policy, supporting problem recognition, definition and selection. Using the proposed solution, policy-makers are able to answer the following questions:
Our methodology implements various NLP techniques in a sequential order. The analysis demonstrates that while each text mining algorithm provides insightful results, their combination yields a more detailed and more robust overview of media discussions.

First, we identify emerging topics based on the analysis of changing term frequencies over time. The terms with the greatest increase in frequency per article can be filtered using regression analysis. The identified terms serve as input for further analysis. The connections between emerging social issues (such as fake news or privacy) and technologies are explored using co-occurrence and sentiment analysis techniques. The co-occurrence values are calculated between selected emerging trending topics, pinpointing pairs of terms that are most frequently mentioned together (such as hate speech and machine learning). In order to track the public perception of issues and to identify the positive and negative news stories related to a selected topic, sentiment analysis is performed.

To additionally verify our results, topic modelling is performed using Latent Dirichlet Allocation (Blei et al., 2003), a complementary approach to our topic identification strategy. LDA shows which terms define key topics across the text corpus, providing additional information on wide topics and co-occurrences. Finally, various robustness checks are carried out that are discussed in the Appendix.

The combination of these four techniques therefore provides a funnel. In the first stage, policy-makers gain a high-level overview of recent trends. In the second part of the process, various deep dives can be prepared, analysing the relationships between trending issues and the often changing public attitudes around them.
The results confirm that the methodology has huge potential in supporting informed policy-making and in decreasing the lag between technological changes and regulatory responses. Such a methodology has not been available to policy-makers before. The literature discussing the implementation of text mining for policy is very limited, with no studies presenting working solutions. Moreover, the methodology has significant advantages over prior efforts in the area of trend identification.

First, the method enables the automatic detection of trending technologies and issues. Unlike existing studies, the process does not require the prior selection of topics or keywords (e.g. Kim and Ju (2018)). In the absence of a complicated filtering process (e.g. expert groups), the methodology can be easily implemented, and it facilitates the exploration of new and unexpected areas by policy-makers. Second, by combining various NLP methods, the presented solution is not only more robust, but also provides insights beyond merely highlighting trends.
The presented results are available in the form of interactive visualisations at ngi.delabapps.eu. The raw results are stored in the Zenodo repository (https://zenodo.org/communities/engineroom/) and the code to replicate them is available at GitLab (https://gitlab.com/enginehouse/engineroom).
2. Literature Review

The identification of emerging technologies with text mining techniques has an established literature. Rotolo et al. (2015) provide a comprehensive literature review of methods applied for the analysis of trends in science and technology. Text mining also has a growing literature in forecasting methods, summarised by Kayser and Blind (2017). The implemented methods include the analysis of document frequencies, term frequencies, co-occurrence analysis, topic modelling and other approaches from the field of scientometrics (Morris et al., 2003; Bettencourt et al., 2008). Finally, a systematic literature review of text mining applications in policy-making was provided by Ngai and Lee (2016). The authors showed how text mining analyses can support different stages of the policy-making cycle, including (i) agenda setting, (ii) policy formulation, (iii) implementation and (iv) evaluation.
The identification of trends, especially technologies and research areas, is addressed in the following studies. Kim and Ju (2018) examined the trajectory of selected technologies in online news and blog posts based on daily frequencies of documents. Besides measuring the growth of frequencies, the authors calculated the skewness of the frequency distributions to examine the stability of trends. Yoon (2012) also focused on a set of pre-selected keywords in a defined field (solar energy). The described methodology is based on both term and document frequencies, also including a time dimension to differentiate between strong signals (high frequency and growing) and weak signals (low frequency but growing). Albert et al. (2015) analysed blog posts to identify whether a technology is basic (mature) or pacing (emerging). The authors created a list of terms associated with the two groups, calculated the term frequencies in various time periods, and implemented a fuzzy process to categorise selected technologies.

Several studies provide additional insight by analysing not only individual terms, but rather groups of keywords, e.g. by incorporating co-occurrences, topic modelling and various other clustering techniques. Lee and Park (2018) based their study on the work of Yoon (2012), concluding that a single keyword is not sufficient to identify a topic. In order to establish the meaning of word groups, the authors improved the methodology by incorporating co-occurrences. Bildosola et al. (2017) combined text mining methods (keyword co-occurrences, keyword counts) with various forecasting techniques to explore and map emerging technologies in the area of cloud computing. Lee and Jeong (2008) examined emerging areas in information security publications by hierarchical clustering of co-occurring keywords and network analysis. The authors also prepared a technology roadmap for robot technology using co-occurrence analysis. Kajikawa et al. (2008) tracked emerging energy research fields by clustering publications and analysing the growth in the number of publications by cluster. Roche et al. (2010) examined emerging fields in Biology, combining term frequencies with inverse document frequencies (the TF-IDF approach) in scientific literature. Li et al. (2019) analysed trends of topics generated with the Lingo algorithm in scientific documents and patents in a selected area. Xie et al. (2018) used the HDP topic modelling technique in various time periods to identify promising technologies related to solar cells. Based on the changes in topic-keyword probability distributions, emerging technologies were identified. Niemann et al. (2017) examined patent lanes in selected areas, analysing semantic similarities in patents and applying topic modelling. Li et al. (2015) proposed the usage of word networks to analyse two selected news events in Chinese online media.
Following the analysis of trends and the exploration of their topological features, further dimensions can be explored, such as the related public debate. Sentiment analysis is widely used in analysing the public perception of technologies based on news articles or social media. Choi et al. (2010) identified controversial issues and related sub-topics using query generating methods and sentiment models. Ku et al. (2006) presented algorithms for opinion extraction in news articles based on concept keywords and sentiment words. Kim et al. (2006) demonstrated an approach with semantic role labelling to extract opinions in news media.

Most relevant to our study are the works that explore multiple aspects of emerging technologies, combining various methods. The literature on such works is very limited: existing works mostly present a hybrid approach of topic modelling with sentiment analysis. Such an example is the work of Xie et al. (2018), who besides using topic modelling to identify emerging trends, also used sentiment analysis to calculate the share of news with positive or negative sentiment. Bian et al. (2016) analysed the public perception of topics related to IoT on Twitter. Similarly, Mejia and Kajikawa (2017) identified topics related to robots in newspaper articles and scientific papers, and examined sentiments.
The analyses of trends with text mining techniques are incorporated into a rather limited number of policy-oriented or economic studies. Baker et al. (2016) measured the uncertainty of economic policy based on the frequency of newspaper articles containing a set of keywords. Tobback et al. (2018) refined the methodology using modality annotation and a support vector machine classification model. Suh (2018) presented a framework for supporting agenda setting in multidisciplinary group discussions on energy policies using text mining techniques. Kralj Novak et al. (2014) demonstrated a method based on co-occurrence networks to extract relevant entities and time-varying relationships in business news. The robustness of the methodology is verified in a case study with financial data. Kim et al. (2017) predicted the bitcoin price with the implementation of topic modelling on bitcoin forum posts. Hisano et al. (2013) estimated the impact of business news on the stock market.

The presented studies demonstrate that various NLP methods can be used to analyse trends, identify the topologies of emerging technologies, and also extract opinions and sentiments of the public debate based on news articles. However, various limitations can be identified. We have identified three research gaps that are addressed by our study.

First, analyses focusing on trends and emerging technologies examined pre-defined, narrow areas or technologies. While some efforts identified areas gaining traction using e.g. topic modelling (Xie et al., 2018) or network analysis (Lee and Jeong, 2008), a significant share of the literature is restricted to the analysis of selected keywords (Kim and Ju, 2018; Yoon, 2012). Moreover, several works did not explore the full potential of text data; an example is the analysis of document counts instead of term frequencies (Kajikawa et al., 2008). In contrast to the existing literature, our approach does not require prior assumptions on trending technologies. Instead of reducing our sample or selecting keywords, the presented approach enables the automatic identification of trending terms. Such a methodology creates the opportunity to explore unexpected but highly relevant trending issues.

Second, the analysed studies mostly focus on the presentation of a selected text mining method. However, the combination of methods provides more insight, such as the application of sentiment analysis and topic modelling (Xie et al., 2018) or term frequencies and co-occurrences (Lee and Park, 2018). We implement all four methods in order to provide a comprehensive overview of trending technology news. Therefore, not only are the trending technologies and social issues identified (term frequencies), but the wider topics are also established (co-occurrences and topic modelling), along with the public perception and differing views (sentiment analysis).

Finally, there is a lack of studies using trend analysis for policy agenda setting. Although previous efforts show the potential of text mining in informing policy (Sun et al., 2016) or explaining economic processes (Kralj Novak et al., 2014), our study provides new tools to support policy-making and decrease the lag between technological change and regulations.
3. Dataset

3.1. Selection of sources

We have included 13 popular English-language online tech press sources in the analysis. The final sources have been selected in a four-stage process.

First, online sources were identified that covered and reported on early signals of technological change in the past. The feature of Google search to filter news from specific time periods was used to search for articles that covered promising technologies and business models of today early on. Additionally, the Google Trends tool was used to identify the particular time periods of early signs of a technology in online news. These periods are called the innovation trigger stage in the hype cycle literature, i.e. a time when awareness about the technology starts to spread and attracts first media coverage (Dedehayir and Steinert, 2016). A popular hype cycle model, introduced in the 1990s by the Gartner corporation, explains the evolution of a technology in terms of expectation or public visibility (y-axis) in relation to time (x-axis). The bell-shaped curve of hype around the blockchain technology, proxied by its online search popularity, is presented in Figure 1. The figure shows that the first mentions are likely to originate from the period 2010-2012. The chosen technologies were under intensive development (e.g. autonomous vehicles) or in the pursuit of practical application (e.g. blockchain) during 2018. The keywords used to identify relevant sources included "IoT", "virtual reality", "blockchain", "bitcoin", "smartwatch", "sharing economy" (articles from the period Jan 2010-Jan 2012), "autonomous cars", "big data" (news from Jan 2007-Jan 2008). This process helped us to identify around 25 sources that published articles at the beginning of the current decade on a set of promising technologies.

Second, the initial list of sources was supplemented with a set of relevant newer sources (e.g. Politico Europe, established in 2015), as well as high-quality non-US sources (e.g. Euractiv, The Guardian), in order to counterbalance the dominant American tech perspective in the analysis.

Third, those media outlets were prioritised that also covered tech news from a regulatory and social aspect. On the other hand, sources with a greater focus on consumer electronics or enterprise IT were not preferred.

Finally, some selected sources were excluded from the study due to technical reasons, such as a paid subscription business model. The final list of analysed sources, with the number of articles and the location of headquarters, is presented in Figure 2.
The articles and the accompanying metadata have been collected for a period of 38 months between 01.2016 and 03.2019. The collection process was conducted with the use of web-scraping tools. Web-scraping scripts are designed to recognise different types of content, and to extract and store only the ones specified by the user (Ignatow and Mihalcea, 2018). In this study, for each individual source, a separate script has been written in the Python programming language, using the web automation framework Selenium WebDriver. The scripts are available at the project repository: https://gitlab.com/enginehouse/engineroom/tree/master/scrapers. The collection of data had the following main elements:

1. The web crawler opened the news website and selected the category or column of news.
2. The metadata (title, author, short abstract, publishing date) of articles published between 01.01.2016 and 01.03.2019, along with hyperlinks directing to full texts, were collected and saved.
3. The full texts of articles were collected and saved.

The categories of scraped news sections, and the corresponding links, are included in the Appendix (Tables A.11 and A.12).
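The three collection steps can be sketched as below. Here `fetch_full_text` stands in for the Selenium WebDriver call that downloads one article body, and all function and field names are illustrative assumptions, not code from the project repository.

```python
from datetime import date

# Collection window used in the study (01.01.2016 - 01.03.2019).
START, END = date(2016, 1, 1), date(2019, 3, 1)

def in_window(published):
    """Step 2 filter: keep only articles published within the analysed period."""
    return START <= published <= END

def collect(metadata_rows, fetch_full_text):
    """Steps 2-3: filter the scraped metadata by date, then fetch full texts.

    `metadata_rows` is an iterable of dicts with 'title', 'date' and 'url'
    (the metadata saved in step 2); `fetch_full_text` abstracts the
    browser-driven download of a single article (step 3).
    """
    articles = []
    for row in metadata_rows:
        if in_window(row["date"]):
            articles.append({**row, "text": fetch_full_text(row["url"])})
    return articles
```

Keeping the metadata pass separate from the full-text pass mirrors the two save points described above, so a failed download does not lose the already-collected metadata.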
Figure 2: Number of articles per source
The results of the applied readability tests show that the analysed sources are rather homogeneous in terms of reading difficulty. Not surprisingly, among the top 3 most difficult sources, two focus on the political aspects of technologies (Euractiv and Politico Europe).
4. Methodology

4.1. Term-frequencies and regression analysis

For each source/month pair, the average number of term occurrences per article has been counted. The words have been transformed to their stemmed (base) form using SnowballStemmer. For readability purposes, all tables present the identified terms in human-readable form (e.g. 'election' instead of the stemmed 'elect'). Afterwards, the weighted average of frequencies by source has been calculated. Weights have been assigned to ensure that no source has excessive influence on the final results due to its number of articles, and to maintain a relative balance between American and other sources. Robustness checks and expert analysis confirmed that weighted sources perform slightly better than unweighted ones (see Table B.15 in the Appendix).
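The weighted averaging across sources can be sketched as follows; the weight values themselves are set as described above and are not reproduced here, and the names are illustrative.

```python
def weighted_frequency(freq_by_source, weight_by_source):
    """Weighted average of a term's per-article frequency across sources.

    `freq_by_source` maps a source to the term's average occurrences per
    article in a given month; `weight_by_source` carries the weights that
    keep large outlets and American sources from dominating the average.
    """
    total = sum(weight_by_source[s] for s in freq_by_source)
    return sum(f * weight_by_source[s] for s, f in freq_by_source.items()) / total
```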
For all terms which occurred at least once in the last two months of the analysis, an ordinary least squares regression has been performed for the entire time period, and also for the last 3, 6 and 12 months. The dependent variable of the estimation is the weighted frequency, while the number of months since the beginning of the analysed period is the independent variable. The result is a single coefficient β (referred to as coef). Terms with the highest β coefficients have grown the most. However, the top growing words are always stopwords, due to their sheer number of occurrences. Most lists of stopwords are not domain-specific: NLTK's list does not include words such as "internet", which should be regarded as a stopword in modern technological media. Instead of creating a domain-specific stopwords list, we divided the coefficient by the mean weighted frequency over all months of the regression. The resulting normalised coefficient (coef norm) delivers a number which can be used to window out irrelevant terms by setting a threshold a term needs to achieve to be included in further analysis. The threshold has been set to 0.025, a value high enough to remove stopwords (including domain-specific ones), but low enough to allow the capture of early signals of new technologies and of quickly growing established topics.

The 1000 most significantly growing terms (those with the largest coefficients that are above the threshold for the normalised coefficient), both words and bigrams, have been reviewed, and the relevant terms for further analysis have been selected.
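A minimal sketch of this filtering step, assuming the monthly weighted frequencies of a term have already been computed: the 0.025 threshold is the one stated above, while the function and variable names are illustrative.

```python
import numpy as np

def trend(monthly_freq, threshold=0.025):
    """OLS slope of weighted frequency on the month index (coef), the
    slope divided by the mean frequency (coef norm), and whether the
    term passes the stopword-removing threshold."""
    freqs = np.asarray(monthly_freq, dtype=float)
    months = np.arange(len(freqs))
    coef = np.polyfit(months, freqs, 1)[0]   # slope of the OLS fit
    mean = freqs.mean()
    coef_norm = coef / mean if mean > 0 else 0.0
    return coef, coef_norm, coef_norm >= threshold
```

Applied to every term that occurred in the last two months, this yields the ranking from which the top 1000 terms were reviewed; a frequent but flat term (a stopword-like series) has a slope near zero and falls below the threshold.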
but co-occur consistently with the analysed word. The sum of boolean values has been divided by the number of articles and – just like in the basic method – weighted by source. The 100% value for a term co-occurring with itself has been maintained.

Both for the basic and the boolean method, the results have been normalised by (i) the mean frequency and (ii) the square root of the mean frequency of the co-occurring term. The normalised values are useful for finding the terms which often co-occur with the analysed term, but are rarely present in other articles. The terms with the highest normalised values are strictly connected to the analysed topic.
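The boolean variant and its normalisation can be sketched as below. This is one plausible reading of the description (the share is taken among articles containing the focus term, which preserves the 100% self-co-occurrence value), and the per-source weighting from the text is omitted for brevity; all names are illustrative.

```python
import math

def boolean_cooccurrence(articles, focus, other, normalise=None):
    """Share of focus-term articles that also mention `other`, optionally
    normalised by the co-occurring term's overall document frequency
    ('mean') or its square root ('sqrt').

    `articles` is a list of sets of terms. A term co-occurring with
    itself yields 1.0 in the unnormalised variant.
    """
    with_focus = [a for a in articles if focus in a]
    if not with_focus:
        return 0.0
    both = sum(1 for a in with_focus if other in a) / len(with_focus)
    other_freq = sum(1 for a in articles if other in a) / len(articles)
    if normalise is None or other_freq == 0:
        return both
    if normalise == "mean":
        return both / other_freq
    return both / math.sqrt(other_freq)  # damps the penalty on common terms
```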
4.4. Topic modelling

Topic modelling refers to a combination of "algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents (...) topic models can organize the collection according to the discovered themes" (Blei, 2012a). The most popular method of topic assignment is the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003), a probabilistic topic model using a Bayesian formulation to reveal hidden (latent) topics in a given text corpus. Documents in the corpus are treated as bags-of-words, i.e. the word ordering is not taken into account. Being an unsupervised machine learning algorithm, LDA requires neither a training dataset nor the manual coding of documents. The topics obtained via LDA analysis are probability distributions over terms. Each topic consists of a different set of terms characterised by a certain probability of appearance in the given subset of texts (Blei, 2012b). Instead of beginning the analysis with a predefined set of terms and codes derived from domain expertise, the researcher specifies the number of topics the algorithm is supposed to find. The additional key parameters of the algorithm include α, the prior topic-document distribution, and β, the prior word-topic distribution.
Designating a specific number of topics is considered to be a challenging process in text mining analyses (Ignatow and Mihalcea, 2018). The risk is that "choosing too few topics will produce results that are overly broad, while choosing too many will result in the over-clustering of a corpus into many small, highly-similar topics" (Greene et al., 2014). Similarly, the concentration parameters of the Dirichlet distribution (α and β) highly influence the quality of the topics (Wallach et al., 2009).

Certain solutions are commonly deployed in order to obtain interpretable and coherent topics. Human-generated topic rankings are considered to be the gold standard for coherence evaluation, but their implementation is costly (Röder et al., 2015). LDA models are often evaluated based on the log-likelihood value for a held-out test set. However, Chang et al. (2009) showed that predictive likelihood (and hence perplexity) can differ significantly from the human assessment of the quality of topics. Numerous studies concentrated on the measurement of coherence of generated topics and assessed its correlation with human topic ranking data (Röder et al., 2015; Aletras and Stevenson, 2013; Newman et al., 2010).

In this analysis, we utilise the state-of-the-art coherence framework proposed by Röder et al. (2015). The method assesses the coherence of topics by analysing the pairs of the N most frequent topic words. The pipeline consists of four major steps: i) segmentation, ii) probability estimation, iii) confirmation measure and iv) aggregation. Segmentation creates the set of two-word combinations of the N top topic terms. Probability estimation provides the probability of a word pair: the number of documents in which the given pair occurs is divided by the total number of documents. The confirmation measure shows the level of agreement of a given word pair with the use of Pointwise Mutual Information (PMI). PMI quantifies the discrepancy between the probability of the two words' coincidence given their joint distribution and their individual distributions. Equation (2) normalises the values of PMI to between -1 and +1 (NPMI).
\[
PMI(w_i, w_j) = \log \frac{P(w_i, w_j) + \epsilon}{P(w_i) \cdot P(w_j)} \tag{1}
\]

\[
NPMI(w_i, w_j) = \frac{PMI(w_i, w_j)}{-\log\!\left(P(w_i, w_j) + \epsilon\right)} \tag{2}
\]
Finally, all confirmation measures are aggregated into a single score using the arithmetic mean.
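Under the document-share probability estimates described above, Equations (1)-(2) translate directly into code; ε is the small constant that avoids taking the logarithm of zero, and the function name is illustrative.

```python
import math

def npmi(n_both, n_i, n_j, n_docs, eps=1e-12):
    """NPMI of a word pair from document counts (Eqs. 1-2): probabilities
    are shares of documents containing the word(s)."""
    p_i, p_j = n_i / n_docs, n_j / n_docs
    p_both = n_both / n_docs
    pmi = math.log((p_both + eps) / (p_i * p_j))
    return pmi / -math.log(p_both + eps)
```

Two words that always appear together score close to +1, while statistically independent words score close to 0; the topic score is then the arithmetic mean over all top-word pairs of the topic.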
We optimise the coherence value with respect to three hyper-parameters of the LDA model: the number of topics, α and β (see Section 5.4). Standard text preprocessing consisting of four steps (removing stopwords, conversion to lowercase, tokenization and vectorization), as well as the measurement of topic coherence, was conducted using the Gensim package (Řehůřek and Sojka, 2010). Interactive visualizations of the obtained topics were prepared with the pyLDAvis package (Sievert and Shirley, 2014)¹.
5. Results

The described methods are implemented in a sequential order to identify and analyse trending news stories on technologies and related social challenges. First, the most trending terms are filtered out based on term-frequency analysis. Second, the wider topics are mapped and the relationships between trending terms are established with co-occurrence analysis. Third, the surrounding emotions and public perception of topics are examined, along with the most positive and negative associated terms. Finally, topic modelling is used to additionally verify the identified trending areas. Such an approach makes it possible to highlight the most important technological and social issues (Section 5.1) and to explore them in greater detail (Sections 5.2 and 5.3). In order to present the value of the methods in informing policy, 3 case studies are analysed: privacy, information in social media, and the technology sector in China. These trending areas have been chosen due to their high regulatory and social relevance.

¹ Both packages are open source Python tools.
We begin the analysis with the identification of emerging topics in the examined sources. The methodology is based on the regression analysis: the regression coefficient (coef) reveals the trend of growing terms. As the value of the coefficient is heavily dependent on the average frequency of a term, it highlights relatively frequent and trending words. However, the aim of the exercise is to capture early signals of technologies and social issues that may still have low frequencies. Therefore, a normalised coefficient (coef norm) is calculated, which is the coefficient divided by the average frequency of the term. This normalised coefficient is used to exclude terms that have growing, but already large, frequencies, such as stop words.

Table 3: Bigram Coefficients
Table 2 presents the results for unigrams and Table 3 for bigrams, sorting the terms by coef. The results show that various technologies (related to AI/ML, 5G), regulatory issues (tech giants, Cambridge Analytica) and consumer products (iPhone) gained traction in the online tech press. The top 20 words are all closely related to tech topics, with the exception of world series. This supports that adequate sources were selected and that the topic identification methodology performs well in finding hot topics.

We have reviewed the top 1000 trending uni- and bigrams (based on coef) and summarised the most relevant terms in Table 4. The results provide an overview of the most important topics the online tech press covered in recent years. The first half of the table includes topics related to computer science, consumer products and future technologies, while the lower section summarises various social and regulatory challenges.

The results demonstrate the importance of such technologies as AI/ML algorithms, robotics, autonomous vehicles, quantum computing, IoT or decentralised computing. Moreover, the identified terms reveal various lesser-known, domain-specific terms, such as massive MIMO from 5G technology, quantum key distribution (QKD) from quantum computing, or cloud native related to cloud computing.

The identified social issues reflect recent regulatory challenges: e.g. online privacy, fake news, cybersecurity, net neutrality, election meddling or the growing influence of China. Similarly to technologies, the analysis presents numerous terms outside of the mainstream: e.g. section 230, related to online content moderation, or the platform Gab, in relation to hate speech and political extremism.

Another desired attribute of the results is that they lack buzzwords from the past. As an example, big data was a hot topic in the past; however, this bigram has not been identified in the analysis. On the other hand, discussions have moved to technologies that exploit big data, such as machine learning algorithms.

In the case of policy-making, it is especially crucial to be informed about early signals of social issues. For an even easier filtering of relevant topics, the coef is calculated for the periods of the last 3, 6 and 12 months. The table is included in the Appendix (Table B.13).
Table 4: Trending topics based on top 1000 terms

AI: 'artificial intelligence', 'machine learning', 'facial recognition', 'ai system', 'conscious computing', 'general intelligence', 'autonomous weapon', 'ai research', 'reinforcement learning', 'project maven', 'neural network', 'training data', 'black box', 'false positive', 'lethal autonomous', 'convolutional neural', 'ai strategy', 'openai', 'tensorflow', 'duplex', 'alphazero', 'pytorch', 'ai-driven'
robot: 'ubtech', 'sphero', 'killer robot', 'smart robot', 'misty robot', 'sex robot', 'aibo'
autonomous vehicle: 'nuro', 'waymo'
5g: '5g network', '5g technology', '5g smartphone', 'massive mimo', '5g equipment', '5g spectrum', 'mmwave'
quantum computing: 'quantum computing', 'quantum technology', 'qubit', 'd-wave', 'qkd'
IoT: 'iot tech', 'smart speaker', 'voice assistant', 'wireless charger', 'foldable phone'
decentralised computing: 'edge computing', 'cloud native', 'serverless computing', 'edge device', 'iot edge', 'kubernet', 'multi-cloud', 'snowflake', 'databrick', 'rubrik', 'drivescale', 'gcp', 'cncf'
crypto: 'cryptocurrency exchange', 'cryptocurrency miner', 'cryptocurrency', 'blockchain', 'ico', 'monero', 'blockchain-based', 'cryptoasset', 'coinhive'
competition: 'tech giant', 'digital tax', 'digital market', 'network giant', 'tax break', 'surveillance capitalism', 'second headquarter', 'neutrality law', 'neutrality repeal', 'fcc commission', 'paid priority', 'gafa'
climate: 'climate change', 'global warming', 'circular economy', 'paris agreement', 'greenhouse gas', 'ipcc'
cybersecurity: 'cyber security', 'data management', 'data breach', 'fake account', 'password manager', 'vpn service', 'wannacry ransomware', 'phishing email', 'cybersecurity act', 'wire fraud', 'multi-factor authentication', 'cybersecurity standard', 'equifax', 'meltdown', 'notpetya', 'spyware', 'magecart', 'pwned', 'robocall'
content crisis: 'fake news', 'hate speech', 'conspiracy theory', 'media platform', 'political ads', 'copyright directive', 'article 13', 'terrorist content', 'false information', 'russian troll', 'disinformation campaign', 'illegal content', 'content moderation', 'deep fake', 'white supremacist', 'troll farm', 'section 230', 'media literacy', 'disinformation', 'anti-vaccine', 'infowars', 'fact-checking', 'anti-semitism', 'gab'
privacy: 'facebook user', 'cambridge analytica', 'user data', 'location data', 'data broker', 'privacy setting', 'privacy scandal', 'privacy standard', 'gdpr compliance', 'gdpr', 'aadhaar', 'duckduckgo'
democracy: 'european election', 'trade war', 'vote leave', 'yellow jacket', 'no-deal brexit', 'russian interference', 'election meddling', 'influence campaign', 'vote machines', 'netizen'
china: 'chinese government', 'chinese companies', 'chinese telecom', 'chinese tech', 'chinese intelligence'
health: 'health record', 'screen time', 'genome editing', 'gene drive', 'e-cigarettes', 'biohacking'
sexual harassment: 'sexual harassment', '# metoo', 'sexual misconduct', 'sex trafficking'
work: 'tech worker', 'work conditions', 'home office', 'wework'
Next, the evolution of the frequency of various terms is examined. Figure 3 compares the 3-month moving average of frequency per article for three terms: AI, 5G and GDPR.
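The moving average used for these comparisons can be sketched in a few lines, assuming a list of monthly frequency-per-article values; a trailing-window convention is one plausible choice, as the paper does not specify how series edges are handled:

```python
def moving_average(series, window=3):
    """Trailing moving average over `window` points; the first
    window-1 points average only the values available so far."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        chunk = series[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```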
Figure 4: Term frequencies for cryptocurrency topic
AI has been the second most rapidly growing unigram in the collected articles, showing a strong increase between 09.2016 and 01.2018. During 2018, its term frequency stagnated. In comparison, the term 5G has experienced periods of quick increase and decline. At the beginning of 2019, the average frequency of 5G even surpassed that of AI. A similar pattern can be observed for GDPR: its frequency first increased gradually, then skyrocketed around May 2018, when the regulation came into force. After a short period of intense media interest, the frequency declined but remained at a relatively high level.
Figure 4 presents the term frequencies for three closely related terms: cryptocurrency, ICO (initial coin offering) and digital currency. The figure reveals a massive growth of news stories in the second half of 2017 and a sharp decline during 2018. This was the period of rising Bitcoin prices and massive hype around startups using ICOs to raise funding. The episode underlines the need for policy-makers to identify regulatory areas at an early phase and to reduce the lag between regulations and actual events.
To conclude, the analysis of term frequencies helps to evaluate whether a technology or a related social issue is a temporary topic or remains relevant.
in the context of such words as Chinese government, 5G network, giant Huawei, Ren Zhengfei and security threat. These keywords aptly describe recent news about the US administration recommending to avoid Huawei networking equipment due to Huawei's strong connections to the Chinese government (Reichert, C., 2019).
A similar story emerges for 5G, which has been covered together with the terms Huawei, trade wars and security risk. Besides the security aspect, 5G news has also been closely related to new consumer tech, such as foldable phones with 5G support. In the same manner, the analysis maps the application areas of neural networks, including the Google Brain project, image classification, the analysis of Higgs bosons and medical diagnosis.
Finally, the exercise maps two social/regulatory issues: hate speech and GDPR. The co-occurrences for hate speech show the related technologies (machine learning), actors (Mark Zuckerberg, Cambridge Analytica, Alex Jones) and wider problems (e.g. terrorist content).
Additionally, the size of the bubbles reveals the number of analysed paragraphs. In each graph, the top 3 paragraph counts are shown.
Figure 5: Co-occurrences for selected terms
Public perception has been rather volatile for the analysed terms. In the case of GDPR, the overall positive sentiment declined around May 2018, possibly as a reaction to the complications faced by users and businesses during its introduction. The public sentiment on Chinese tech also declined, changing from positive to neutral in 2018, most probably due to issues related to cybersecurity. The increase in negative news stories on Facebook CEO Mark Zuckerberg is especially visible. The first sharp decline occurred at the end of 2016, possibly due to the scandals related to misinformation campaigns and fake news on Facebook (Goulard, H., 2016). The second rapid decline coincides with the start of the Cambridge Analytica scandal in March 2018.
Besides tracking the positive and negative sentiments for selected topics, the different shades of a topic can be further examined by combining the co-occurrence analysis with the sentiment analysis. Tables 5-7 demonstrate that technologies and social challenges are related to numerous news stories that are described differently by the media. The tables summarise the most positive and negative co-occurrences based on articles containing both expressions, with sentiment computed on paragraphs. The words have been selected from among the 30 most positive and negative co-occurrences. Besides the calculated sentiments, the tables also show the number of paragraphs.
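A minimal sketch of this combination: paragraph-level sentiment averaged over the paragraphs that mention both expressions. The tiny lexicon and all names here are illustrative stand-ins for the sentiment model actually used in the paper:

```python
# Toy lexicon standing in for the paper's sentiment scoring.
LEXICON = {"great": 1.0, "growth": 0.5, "scandal": -1.0, "breach": -0.8}

def paragraph_sentiment(paragraph):
    """Mean lexicon score over a paragraph's words (0 if none match)."""
    words = paragraph.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def cooccurrence_sentiment(paragraphs, focal, other):
    """Average sentiment and count of paragraphs mentioning both terms,
    mirroring the (sentiment, count) pairs reported in Tables 5-7."""
    hits = [p for p in paragraphs
            if focal in p.lower() and other in p.lower()]
    if not hits:
        return 0.0, 0
    return sum(paragraph_sentiment(p) for p in hits) / len(hits), len(hits)
```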
Figure 7: Chinese Tech: Sentiment over time
Accordingly, news stories on GDPR have been most positive in the context of data management and the digital market, negative when covering the British Airways data breach (Gibbs, 2018), and neutral in the case of electronic health records or hate speech. In the case of Chinese tech, media coverage has been positive in the context of technologies, while negative in the context of the scandals related to Huawei in the US. Finally, news stories related to Mark Zuckerberg have been more positive in relation to GDPR, negative when mentioning content moderation, and neutral in stories on election interference or on the work of the committee led by Damian Collins in the British Parliament (Cadwalladr, C., 2018).
Table 5: GDPR: Positive and negative sentiments

Term                    Sentiment   Count
data management           0.2556     2594
machine learning          0.2521     4355
digital market            0.2496      991
processing data           0.2052     1289
explicit consent          0.1635     1179
british airways          -0.1704      177
cybersecurity serious    -0.0516      120
electronic health        -0.0246      201
face recognition          0.0129      236
hate speech               0.0488     1004
Table 6: Chinese tech: Positive and negative sentiments

Term                    Sentiment   Count
ai capable                0.276       303
ai research               0.2438      459
machine learning          0.2304      429
artificial intelligence   0.198      1281
5g smartphone             0.1874      276
face extradition         -0.2212      114
meng wanzhou             -0.1062      522
trade secrets            -0.0916      221
ren zhengfei             -0.0625      591
against huawei           -0.0298      286
Table 7: Mark Zuckerberg: Positive and negative sentiments

Term                    Sentiment   Count
jeff bezos                0.1967     1921
machine learning          0.1796     2740
artificial intelligence   0.1372     5971
privacy law               0.0944     1738
tech giant                0.0941     4912
content moderation       -0.0745     1158
hate speech              -0.0424     5182
damian collins           -0.0274     2149
election interference     0.0057     1589
fake account              0.007      2332
topic distribution and β: a Dirichlet prior on the per-topic word distribution. The hyper-parameters chosen based on coherence score maximisation are presented in Table 8; for all calculated values see Table D.20. We obtained coherence values (c_v) ranging from 0.55 to 0.60.
Figure 9: LDA: Cambridge Analytica topic
repercussions ('campaign'). Topic 12 encompasses regulatory issues around cryptocurrencies in the EU and US ('US Securities and Exchange Commission'). It tackles the subjects of taxation and novel forms of financing such as ICOs.

The main technological topics identified are presented in Table 10. Topic 1 accounts for the largest share of tokens, relating to the Internet of Things and cloud computing services. A decentralised approach to these technologies is indicated by the appearance of the term 'blockchain'. Topic 8 features issues related to smart devices. It focuses on two renowned gadget manufacturers famous for their rivalry, Samsung and Apple. Both hardware issues (displays and battery lifetime) and software issues (mobile operating systems) are encompassed in the topic. Topic 10 discusses broader cybersecurity issues such as cyber fraud ('ransomware') and attacks exploiting critical vulnerabilities in modern processors ('meltdown'). Topic 13 addresses the 5G race, its main contestants, as well as regulatory and hardware challenges. Topic 20 touches upon issues related to autonomous vehicles, such as the controversies around crashes of self-driving cars and the use of the autopilot feature, as well as the Federal Communications Commission spectrum proposals for vehicle-related communications.
Table 9: LDA: Top-15 keywords in social topics

youtube, video, people, state, internet, law, rule, house, senate, social
security, cyber, research, attack, google, risk, intelligence, agency, departure, healthcare
breach, inform, app, encrypt, access, blackberry, protection, email, advertisement, ad
court, regulation, sec, tax, bitcoin, lawsuit, copyright, rule, law, proposal
data, nix, profile, zeroday, harvest, scandal, scl, user, schroepfer, campaign
Table 10: LDA: Top-10 keywords in technological topics

Topic 1: amazon, digital, business, technology, blockchain, storage, brand, product, service, industry
Topic 8: battery, ios, android, pixel, display, smartphone, broadcom, xs, device, mac
Topic 10: flaw, secure, bug, hacker, ransomware, malicious, infect, code, server, meltdown
Topic 13: chip, pai, qualcomm, broadband, mobile, lg, amd, ericsson, carrier, 4g
Topic 20: autonomous, tesla, driver, fcc, neutral, driverless, crash, road, cruise, autopilot
5.5. Robustness checks

Various robustness checks, included in Appendix B, were carried out to validate the topic identification methodology. The results suggest that the assumptions of the methodology are adequate, yielding consistent results.
As revealed by Figures 3 and 4, trends are not always stable over longer time periods. To better highlight the most recent trends, the regression analysis was recalculated for shorter time periods: the last 3, 6 and 12 months (Table B.13). The results suggest that the baseline regression performs well in capturing the most trending terms at shorter time periods (e.g. topics such as 5G, deepfakes, elections, fake news and consumer electronics).
As an additional robustness check, an exponential regression was calculated, which may better suit the dynamics of term frequencies. However, this functional form has various disadvantages, as discussed in Appendix B.4.
The approach presented in the trend analysis (Section 5.1) is based on the assumption that all words in an article are equally important. However, it can be argued that the most important terms are located in the title and at the beginning of the text. Therefore, the term frequencies were recalculated, with more weight assigned to the title (5) and the first paragraph (3). Following the regression analysis, 92 words of the original top 100 most trending terms were located in the new top 250 (Table B.14).
The second robustness check concerned the weights of the sources. Instead of taking into account the number of articles published by each source, articles from all sources were assigned equal weights. The results were similar to our baseline, with 87 words from the top 100 appearing in the new top 250 (Table B.15).
6. Conclusions

This study presented a methodology for identifying trending topics in online news media, enabling a deeper exploration of technologies and related social challenges. Although text mining has an established literature on the identification of trending topics, we address numerous research gaps.

Previous studies are narrowly specialised with regard to the applied methods and the examined technological areas. However, policy-makers need information on the overall technological landscape at the very beginning of a policy cycle. In this paper, we have proposed a sequential text mining framework apt to inform policy-makers about the fast-changing tech landscape. The presented methods give policy-makers tools to quickly process vast amounts of information and discover new knowledge at a low cost. The temporal dimension of the analysis makes it possible to select the most relevant issues and dismiss overhyped hot topics, characterised by a sudden increase and an immediate drop in public discussion.
Our methodology brings together a set of straightforward text mining methods that are easy to diagnose, tune, evaluate and interpret. The proposed sequence of methods enables the exploration of news stories at different levels of granularity. The term frequency analysis provides a bird's-eye view of the emerging technologies and interrelated social issues. The co-occurrence analysis helps build the topologies of the most relevant topics. The changing public perception is tracked by the sentiment analysis. The combination of the co-occurrence and sentiment analyses is used to unravel the positive and negative stories related to a topic. Finally, topic modelling provides a robustness check to identify the dominant themes of discussion.
The implementation of our methodology is illustrated with the exemplary path a policy-maker can take through the results obtained for the period 01.2016 to 03.2018 from 13 popular English-language online tech press sources. The topic identification exercise revealed that the most trending technologies include AI and ML algorithms, blockchain, robotics, quantum computing, autonomous vehicles and various consumer technologies such as IoT devices. Among the most debated social issues, we identified the content crisis and fake news, privacy, election meddling, the rising influence of China, cybersecurity and competition in the digital economy.
Following the presentation of the main trending topics, selected case studies were explored in greater detail, including privacy and GDPR, the Chinese tech sector and the content crisis in social media. The LDA analysis largely supports these results, highlighting technologies such as autonomous vehicles and regulatory challenges including online content and the Chinese tech sector.
The raw results, documented programming scripts and interactive visualisations available on the paper's accompanying website let users explore the tech landscape from different angles. A basic programming background is sufficient for users to reproduce the results for a different set of sources and different time periods.
We have demonstrated that simple and explicable text mining techniques enable the automatic identification of trending topics based on online news. Moreover, the combination of methods provides more nuanced details on the stories from the tech world. The methodology therefore has the potential to decrease the policy lag, i.e. the time between the recognition of a policy challenge and the implementation of its solution.
Acknowledgements

We would like to express our sincere gratitude to dr hab. Katarzyna Śledziewska and our partners from the Engineroom and NGI Forward projects for their constant encouragement and support during the research process.
Appendix A. Data set
Table A.12: Links to the web-scraped sources
Source Link
The Register https://www.theregister.co.uk/
ZDNet https://www.zdnet.com/
Gizmodo https://gizmodo.com/
Reuters https://www.reuters.com/
Arstechnica https://arstechnica.com/
The Guardian https://www.theguardian.com/uk
Fastcompany https://www.fastcompany.com/
Techforge https://www.techforge.pub/
IEEE Spectrum https://spectrum.ieee.org/
Politico Europe https://www.politico.eu/
The Conversation https://theconversation.com/
Euractiv https://www.euractiv.com/
Gigaom https://gigaom.com/
Appendix B. Robustness checks

Appendix B.1. Regression analysis with shorter time periods
Table B.13: Most growing terms in the last 3, 6 and 12 months among those significantly growing (coef norm > 0.025) over the whole analysis period
Appendix B.2. Increased importance of title and first paragraph

This research rests on multiple implicit assumptions, for example that all words in an article are equally important and that the assigned weights correctly represent the importance of a source. We recalculated all results changing one assumption at a time; the final results do not change significantly regardless of the method. Out of the top 100 most growing words in the base methodology, 92 and 87 occur among the top 250 most growing words after the first and second change respectively.

More weight has been given to the title and the first paragraph in the coefficient calculations (title weight: 5, first paragraph weight: 3). Terms which are more salient and important are assumed to appear at the beginning of an article. This may be misleading if the title and the first paragraph are "clickbait": there is a risk of capturing divisive and eye-catching words which are not emerging technologies. Some of the changes may be desirable, e.g. 2017 disappears from the growing list, mostly because describing the past is unlikely to be the main theme of an article. However, words like "artificial", "quantum" and "ethics" disappear too, even though they may warrant further investigation: in many articles "artificial" (intelligence) may be a solution to a problem described in the title and lead paragraph, just as "ethics" (boards, committees) represents attempts to address AI issues such as "killer robots".
Table B.14: Weighting methods comparison: main vs 531* method

Words among top 100 trending in the main method, not in the top 250 of the 531* method: amazon, 2017, artificial, quantum, ethical, upcoming, attend, beijing

Words among top 100 trending in the 531* method, not in the top 250 of the main method: recognition, pdf, cup

*531 method: title's weight 5, first paragraph's weight 3, remaining paragraphs' weight 1
Appendix B.3. Removed source weights

The weights of the sources have been removed, so that all articles are equally important. Although Fastcompany and ZDNet are similarly popular according to their Alexa.com rank, ZDNet has much more influence on the final results, as it publishes four times as many articles as Fastcompany. There have been more changes in the trending keywords than in the first robustness check, but the additional keywords mostly concern the investigation of Russian collusion, some specific cloud and AI technologies, as well as telecommunication networks.

As the goal of this research is to discover trends in a wide variety of sources and find emerging technologies, we have chosen the methodology of weighting by source, without changing a term's importance based on its position in an article.
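One plausible reading of the two weighting schemes, sketched below: in the baseline, each source receives the same total influence regardless of how many articles it publishes, while the robustness check gives every article weight 1. The exact normalisation used in the paper may differ:

```python
def source_weights(article_counts, equal=False):
    """Per-article weights from a {source: article_count} mapping.

    Baseline: scale so every source contributes the same total weight
    (prolific sources are down-weighted per article).
    equal=True: the robustness check, where every article counts once.
    """
    if equal:
        return {s: 1.0 for s in article_counts}
    mean_n = sum(article_counts.values()) / len(article_counts)
    return {s: mean_n / n for s, n in article_counts.items()}
```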
Table B.15: Weighting methods comparison: main vs equal source weight

Words among top 100 trending in the main method, not in the top 250 with equal source weights: election, artificial, fake, climate, voter, attend, episode, democracy, yeah, vaccine, reese, le, biz, neural, nvidia, 4g, ces, nhs, 365, actor, rollout, optus, gift, lte

Words among top 100 trending with equal source weights, not in the top 250 of the main method: u.s., russian, equipment, edge, nbn, gizmodo, chrome, speaker, azure, recognition, mine, telstra, parliament
The β parameter is highest for terms with very high values in the last analysed month, low in the second-to-last month, and zero in all previous months. Consequently, normalisation is required: we multiply the β parameter by the mean frequency raised to the power of a scaling parameter s, i.e. scaled coefficient = β · (mean frequency)^s.
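The scaling just described can be sketched in a single helper (the function name is illustrative):

```python
def scaled_coef(beta, freqs, s=0.5):
    """Scale an exponential-regression coefficient by the term's mean
    frequency raised to the scaling parameter s. s = 1 approaches the
    OLS ranking; smaller s favours rarer, fast-moving terms."""
    mean_freq = sum(freqs) / len(freqs)
    return beta * mean_freq ** s
```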
If the scaling parameter is 1, the top results are similar to our OLS results: facebook, ai, 2018, 2017 and 5g are the five most trending words (the 1st, 2nd, 3rd, 8th and 5th highest OLS coefficients respectively). The lower the scaling parameter, the more specific the results become. With a value of 0.8, view20 (a Huawei smartphone released in December 2018) enters the top 10, and it becomes the most trending term with a scaling parameter of 2/3. With a scaling parameter of 1/2, some other terms (like thunberg) enter the top 10 most trending terms:
Table B.16: Highest scaled exponential regression coefficients with scaling parameter 1/2
The trending terms change significantly with scaling parameters below 0.5; the most trending words for a scaling parameter of 1/3 are as follows:
Table B.17: Highest scaled exponential regression coefficients with scaling parameter 1/3
Appendix C. Co-occurrence analysis

The top 30 co-occurring words are presented for selected terms.
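A minimal paragraph-level co-occurrence counter in the spirit of these tables; note that the published scores are fractional, so the paper evidently applies a weighting or normalisation beyond the raw counts sketched here:

```python
from collections import Counter
from itertools import combinations

def cooccurrences(paragraphs, vocab):
    """Count paragraph-level co-occurrences of the given terms."""
    pairs = Counter()
    for p in paragraphs:
        text = p.lower()
        present = [t for t in vocab if t in text]
        # Store each unordered pair once, in sorted order.
        for a, b in combinations(sorted(present), 2):
            pairs[(a, b)] += 1
    return pairs

def top_cooccurring(pairs, term, k=30):
    """Top-k terms co-occurring with `term` (cf. Tables C.18-C.19)."""
    scored = [(other, n) for (a, b), n in pairs.items()
              for other in ((b,) if a == term else (a,) if b == term else ())]
    return sorted(scored, key=lambda x: -x[1])[:k]
```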
Table C.18: Calculated co-occurrences for the terms Chinese tech and neural network
13 kai-fu lee 9 tech giant 7.21792
14 security concern 8.54665 geoffrey hinton 6.97004
15 communist party 8.07109 sophisticated algorithm 6.08807
16 ren zhengfei 7.85066 certain diseases 6
17 security risk 7.65887 accuracy diagnosis 6
18 security threat 7.64338 higgs boson 5.83333
19 meng wanzhou 7.4464 new image 5.71583
20 chinese telecom 7.39499 image classification 5.29268
21 huawei equipment 7.36787 reinforcement learning 5.23159
22 machine learning 7.25645 recognition system 4.81285
23 other chinese 6.65046 black box 4.40454
24 president trump 6.46544 quantum computing 4.36231
25 chief financial 6.28317 new ai 4.2136
26 financial officer 6.28317 voice assistance 4.11935
27 translation system 6 ai into 4.04834
28 another language 6 machine-learning model 4.04509
29 babel fish 6 expo world 4
Table C.19: Calculated co-occurrences for the terms GDPR, hate speech and 5G
13 mate x 1.9152 tech giant 4.84816 privacy regulation 4.69564
14 telecom equipment 1.80113 terrorist content 4.66608 facial recognition 4.21035
15 security concern 1.75714 facebook user 4.03051 tech giant 3.85276
16 new 5g 1.67168 ceo mark 3.85351 data broker 3.72572
17 chinese telecom 1.63324 analytica data 3.85322 analytica scandal 3.70646
18 galaxy fold 1.59965 2016 us 3.85134 privacy setting 3.31546
19 ren zhengfei 1.58783 fake account 3.83033 big tech 3.30796
20 trump administration 1.56132 conspiracy theory 3.76995 under gdpr 3.13044
21 chief financial 1.5492 content moderation 3.74147 consent agreement 3.1126
22 financial officer 1.54848 pr firm 3.45908 protection act 2.9683
23 network equipment 1.51906 graphic violence 3.45055 protection rule 2.87675
24 chinese firm 1.46971 media giant 3.43518 gdpr compliance 2.80291
25 cyber security 1.41964 data leak 3.31173 new general 2.71066
26 huawei technology 1.38123 content appearance 3.27126 health record 2.55844
27 tech giant 1.36108 account over 3.19663 electronic health 2.54785
28 security risk 1.2847 negative press 3.0735 ceo mark 2.49394
29 ajit pai 1.27578 communication officer 3.0735 global turnover 2.4243
Appendix D. Topic modelling

Hyper-parameter optimization
Topics Alpha Beta Coherence
35 0.01 symmetric 0.60
35 0.01 0.01 0.59
35 symmetric symmetric 0.59
35 symmetric 0.01 0.58
40 0.003 0.003 0.58
40 0.003 0.01 0.58
35 0.003 0.01 0.58
35 0.003 symmetric 0.58
35 symmetric 0.003 0.57
40 0.01 0.01 0.57
40 0.01 symmetric 0.57
30 0.01 symmetric 0.57
40 0.003 symmetric 0.57
35 0.003 0.003 0.57
30 symmetric 0.003 0.57
30 0.003 symmetric 0.57
35 0.01 0.003 0.57
30 0.01 0.003 0.56
30 0.003 0.003 0.56
30 symmetric symmetric 0.56
30 symmetric 0.01 0.56
40 0.01 0.003 0.56
45 symmetric symmetric 0.56
45 0.002 symmetric 0.56
45 0.01 symmetric 0.56
40 symmetric 0.01 0.55
45 symmetric 0.002 0.55
45 symmetric 0.01 0.55
30 0.01 0.01 0.55
45 0.002 0.002 0.55
30 0.003 0.01 0.55
45 0.002 0.01 0.55
40 symmetric symmetric 0.55
45 0.01 0.01 0.55
10 0.01 symmetric 0.55
References

European Commission, Competence Centre on Text Mining and Analysis, 2016. Available online at https://ec.europa.eu/jrc/en/text-mining-and-analysis (Accessed 31.03.2019).
V. Kayser, K. Blind, Extending the knowledge base of foresight: The contribution of text mining, Technological Forecasting and Social Change (2017).

J. Yoon, Detecting weak signals for long-term business opportunities using text mining of Web news, Expert Systems with Applications 39 (2012) 12543–12550.
Y. J. Lee, J. Y. Park, Identification of future signal based on the quantitative and qualitative text mining: a case study on ethical issues in artificial intelligence, Quality and Quantity 52 (2018) 653–667.

I. Bildosola, R. M. Río-Bélver, G. Garechana, E. Cilleruelo, TeknoRoadmap, an approach for depicting emerging technologies, Technological Forecasting and Social Change 117 (2017) 25–37.

B. Lee, Y. I. Jeong, Mapping Korea's national R&D domain of robot technology by using the co-word analysis, Scientometrics 77 (2008) 3–19.

Y. Kajikawa, J. Yoshikawa, Y. Takeda, K. Matsushima, Tracking emerging technologies in energy research: Toward a roadmap for sustainable energy, Technological Forecasting and Social Change 75 (2008) 771–782.

I. Roche, D. Besagni, C. François, M. Hörlesberger, E. Schiebel, Identification and characterisation of technological topics in the field of Molecular Biology, Scientometrics 82 (2010) 663–676.

X. Li, Q. Xie, T. Daim, L. Huang, Forecasting technology trends using text mining of the gaps between science and technology: The case of perovskite solar cell technology, Technological Forecasting and Social Change (2019) 1–18.

Q.-q. Xie, X. Li, L.-c. Huang, Identifying the Development Trends of Emerging Technologies: A Social Awareness Analysis Method Using Web News Data Mining, 2018 Portland International Conference on Management of Engineering and Technology (PICMET) (2018) 1–12.

H. Niemann, M. G. Moehrle, J. Frischkorn, Use of a new patent text-mining and visualization method for identifying patenting patterns over time: Concept, method and test application, Technological Forecasting and Social Change 115 (2017) 210–220.

H. Li, W. Fang, H. An, X. Huang, Words analysis of online Chinese news headlines about trending events: A complex network perspective, PLoS ONE 10 (2015) 1–22.

Y. Choi, Y. Jung, S. H. Myaeng, Identifying controversial issues and their sub-topics in news articles, Lecture Notes in Computer Science 6122 LNCS (2010) 140–153.
L.-W. Ku, Y.-T. Liang, H.-H. Chen, Opinion extraction, summarization and tracking in news and blog corpora, in: Proceedings of the AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006, pp. 100–107.
A. Sun, M. Lachanski, F. J. Fabozzi, Trade the tweet: Social media text mining and sparse matrix factorization for stock market prediction, International Review of Financial Analysis 48 (2016) 272–281.

O. Dedehayir, M. Steinert, The hype cycle model: A review and future directions, Technological Forecasting and Social Change (2016).

G. Ignatow, R. Mihalcea, Text Mining: A Guidebook for the Social Sciences, 2018.

R. Gunning, The Fog Index After Twenty Years, International Journal of Business Communication (1968).
J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, D. M. Blei, Reading Tea Leaves: How Humans Interpret Topic Models, Advances in Neural Information Processing Systems (2009).

N. Aletras, M. Stevenson, Evaluating topic coherence using distributional semantics, in: Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, Association for Computational Linguistics, Potsdam, Germany, 2013, pp. 13–22.

D. Newman, J. H. Lau, K. Grieser, T. Baldwin, Automatic evaluation of topic coherence, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 100–108.

R. Řehůřek, P. Sojka, Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50. http://is.muni.cz/publication/884893/en.

C. Sievert, K. Shirley, LDAvis: A method for visualizing and interpreting topics, in: Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pp. 63–70.

Reichert, C., US tells Germany to ban Huawei on 5G or it will share less intelligence: Report, 2019. Available online at https://zd.net/2u261VC (Accessed 31.03.2019).

Goulard, H., Facebook boss Mark Zuckerberg sued over hate speech, 2016. Available online at https://www.politico.eu/article/facebook-boss-mark-zuckerberg-sued-over-hate-speech/ (Accessed 31.03.2019).

Gibbs, How did hackers manage to lift the details of BA customers?, 2018. Available online at https://www.politico.eu/article/facebook-boss-mark-zuckerberg-sued-over-hate-speech/ (Accessed 31.03.2019).

Cadwalladr, C., Parliament seizes cache of Facebook internal papers, 2018. Available online at https://www.theguardian.com/technology/2018/nov/24/mps-seize-cache-facebook-internal-papers (Accessed 31.03.2019).
J. Chuang, C. D. Manning, J. Heer, Termite: Visualization Techniques for Assessing Textual Topic Models, International Working Conference on Advanced Visual Interfaces (2012) 74.