Slides for the talk about the paper:
Ziqi Zhang, Johann Petrak and Diana Maynard, 2018: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. Semantics-2018, Vienna, Austria
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms
1. Adapted TextRank for Term Extraction: A Generic Method of
Improving Automatic Term Extraction Algorithms
Semantics 2018 - 12 September 2018
Ziqi Zhang¹, Johann Petrak², Diana Maynard²
[email protected], [email protected], [email protected]
¹ Information School, The University of Sheffield, UK
² Department of Computer Science, The University of Sheffield, UK
2. The Task of ATE
● Input: a (reasonably large) domain specific corpus
● Output: a list of candidate terms from the corpus,
representing the domain
● Approach
■ Candidate extraction: domain-dependent, usually noun
phrases, n-grams, or sequences matched by PoS patterns (sketched below)
■ Candidate ranking & selection: scoring candidates
based on corpus statistics, selection by threshold, or
machine learning
[Pipeline diagram: domain-specific corpus → ATE (candidate extraction → candidate ranking & selection) → terms for the corpus, e.g. [semantic 0.67, ontology 0.34, nlp 0.33, text mining 0.12, …, web page 0.012]]
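For concreteness, a minimal sketch of the candidate-extraction step, assuming Penn Treebank PoS tags and the common adjective/noun pattern from the ATE literature; the pattern, function name, and toy sentence are illustrative, not the exact configuration of any particular ATE tool.

```python
from typing import List, Tuple

def extract_candidates(tagged: List[Tuple[str, str]], max_len: int = 4) -> List[str]:
    """Extract noun-phrase term candidates from a PoS-tagged sentence.

    Pattern (a common ATE choice): within a run of adjectives/nouns,
    every sub-span ending in a noun is a candidate, e.g.
    "automatic term extraction".
    """
    candidates = []
    i = 0
    while i < len(tagged):
        # grow a maximal run of adjective/noun tokens
        j = i
        while j < len(tagged) and tagged[j][1].startswith(("JJ", "NN")):
            j += 1
        run = tagged[i:j]
        # every sub-span of the run that ends in a noun is a candidate
        for a in range(len(run)):
            for b in range(a + 1, min(a + max_len, len(run)) + 1):
                if run[b - 1][1].startswith("NN"):
                    candidates.append(" ".join(tok for tok, _ in run[a:b]))
        i = max(j, i + 1)
    return candidates

# Toy sentence, pre-tagged with Penn Treebank tags
sent = [("automatic", "JJ"), ("term", "NN"), ("extraction", "NN"),
        ("is", "VBZ"), ("a", "DT"), ("classic", "JJ"), ("task", "NN")]
print(extract_candidates(sent))
# ['automatic term', 'automatic term extraction', 'term',
#  'term extraction', 'extraction', 'classic task', 'task']
```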
3. The Task of ATE
● A classic text mining problem
■ Dating back to the 1990s (Bourigault 1992)
■ To date still an active area of research
● A fundamental step in many complex tasks
■ Ontology engineering
■ Dictionary construction
■ Information Retrieval
■ Translation
■ …
● Context of this work: KNOWMAK (https://www.knowmak.eu/)
4. The Task of ATE
Differentiation from related tasks
● ATE vs. keyword extraction
- Keyword extraction: document specific; only a handful of keywords; mainly for indexing
- ATE: domain specific; number of terms depends on the corpus; mainly for knowledge acquisition
● ATE vs. NER
- NER: usually real-world named entities; sentence context is more important; semantic typing
- ATE: domain-specific terms; corpus-level statistics are more important; no typing
Source: https://imanage.com/blog/named-entity-recognition-ravn-part-1/
5. Motivation and Contribution
● ATE still an unsolved problem
■ No ‘all-rounder’ method
■ Performance always depends on data and domain
■ Is a ‘one-size-fits-all’ solution feasible?
● ATE methods are predominantly unsupervised
■ For many domains, potentially useful domain-specific
resources already exist, e.g., unlabelled corpora,
pre-compiled named entity lists, partial ontologies, etc.
■ Can we benefit from those?
6. Motivation and Contribution
A generic method that employs semantic relatedness to a set of
domain-specific seed words to potentially improve any ATE
algorithm (by up to 25 percentage points in average precision in
our experiments).
● ATE still an unsolved problem
■ No ‘all-rounder’ method
■ Performance always depends on data and domain
■ Is a ‘one-size-fits-all’ solution feasible?
● ATE methods are predominantly unsupervised
■ For many domains, potentially useful domain-specific
resources already exist, e.g., unlabelled corpora,
pre-compiled named entity lists, partial ontologies, etc.
■ Can we benefit from those?
7. AdaText - Overview
Adapted TextRank for Automatic Term Extraction
[Pipeline diagram:
Domain-specific corpus + domain-specific seed words/phrases
→ extract words → semantic relatedness → filter by threshold → [w1=0.67, w2=0.34, w3=0.22, …]
→ TextRank
ATE (any algorithm) → [t1=1.99, t2=1.21, t3=1.10, …]
→ re-rank → [t1=2.19, t3=1.41, t2=1.29, …]]
8. AdaText - Overview
Adapted TextRank for Automatic Term Extraction
[Same pipeline diagram, annotated with the three stages:
SEEDING: extract words from the corpus, compute semantic relatedness to the seed words/phrases, filter by threshold → [w1=0.67, w2=0.34, w3=0.22, …]
CORPUS-LEVEL TEXTRANK: run TextRank over the selected words
COMBINING WITH ATE: ATE (any algorithm) scores candidates [t1=1.99, t2=1.21, t3=1.10, …]; re-rank with the TextRank scores → [t1=2.19, t3=1.41, t2=1.29, …]]
9. AdaText - Seeding
● Input
■ C - the target corpus from which terms are extracted
■ S - a set of ‘seed’ words/phrases representing the domain
● taken from existing domain lexicons, or generated
in an unsupervised way from available corpora
● May not contain real terms from C
● Process
■ Extract words from C, as W
■ Compute pairwise semantic relatedness for S × W
● Cosine similarity using GloVe embedding vectors
● OOV words ignored; phrases represented by compositional averaging (Iyyer et al., 2015)
● Output
■ W_sub: a subset of W, satisfying relatedness > min
Intuitively, these words are more ‘relevant’ to the domain
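A minimal sketch of this seeding step, assuming embeddings are available as a plain word → numpy-vector dict (e.g. loaded from a GloVe file) and that a word is kept if its relatedness to at least one seed exceeds min; the aggregation over seeds and the toy vectors are assumptions, the paper defines the exact procedure.

```python
import numpy as np

def phrase_vector(phrase, emb):
    """Compositional averaging (Iyyer et al., 2015): mean of the
    in-vocabulary word vectors; None if every word is OOV."""
    vecs = [emb[w] for w in phrase.split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_words(corpus_words, seeds, emb, min_rel=0.5):
    """Return W_sub: corpus words whose relatedness to some seed > min_rel."""
    seed_vecs = [v for v in (phrase_vector(s, emb) for s in seeds) if v is not None]
    w_sub = {}
    for w in corpus_words:
        if w not in emb:      # OOV words are ignored
            continue
        rel = max(cosine(emb[w], sv) for sv in seed_vecs)
        if rel > min_rel:
            w_sub[w] = rel
    return w_sub

# Toy usage with made-up 3-d vectors (real use: load GloVe vectors)
emb = {"gene": np.array([1.0, 0.0, 0.0]),
       "protein": np.array([0.9, 0.1, 0.0]),
       "page": np.array([0.0, 1.0, 0.0])}
print(select_words(["protein", "page"], ["gene expression"], emb, min_rel=0.5))
# {'protein': 0.99...}  -- 'page' is filtered out; the OOV word
# 'expression' is skipped when averaging the seed phrase
```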
10. AdaText - Corpus Level TextRank
● Input
■ C - the target corpus from which terms are extracted
■ W_sub - the subset of words selected before
● Process
■ Apply TextRank to the graph created for W_sub to compute a TextRank (tr) score of every word w in W_sub
■ Traditional TextRank (Mihalcea et al., 2004) applies PageRank to a graph of words built from each document, where an edge is created if two words co-occur within a context window of win
(Example document used in the original TextRank paper to illustrate the method:)
“Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.”
11. AdaText - Corpus Level TextRank
● Input
■ C - the target corpus from which terms are extracted
■ W_sub - the subset of words selected before
● Process
■ Apply TextRank to the graph created for W_sub to compute a TextRank (tr) score of every word w in W_sub
■ Here it is adapted in two ways
● A graph of words from the entire corpus
● An edge is created if two words appear within win of each other anywhere in the corpus (in any document)
● Output
■ tr scores for every word w in W_sub
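A sketch of this corpus-level variant, assuming the networkx library for the PageRank step; unweighted edges and the default damping factor are assumptions. The two adaptations are visible directly: a single graph over the whole corpus, and edges for co-occurrence within win tokens in any document.

```python
import networkx as nx

def corpus_textrank(docs, w_sub, win=5):
    """Corpus-level TextRank over the selected words.

    docs:  list of documents, each a list of tokens.
    w_sub: set of words selected in the seeding step.
    Returns {word: tr score} for every word in w_sub.
    """
    g = nx.Graph()
    g.add_nodes_from(w_sub)       # one graph for the whole corpus
    for tokens in docs:
        for i, w in enumerate(tokens):
            if w not in w_sub:
                continue
            # edge for any pair of selected words within the window,
            # regardless of which document they appear in
            for v in tokens[i + 1: i + win]:
                if v in w_sub and v != w:
                    g.add_edge(w, v)
    return nx.pagerank(g)         # PageRank scores = tr scores

docs = [["gene", "expression", "in", "protein", "binding"],
        ["protein", "gene", "interaction"]]
tr = corpus_textrank(docs, {"gene", "expression", "protein", "binding"})
print(sorted(tr.items(), key=lambda kv: -kv[1]))
```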
12. AdaText - Combining with ATE
● Input
■ C - the target corpus from which terms are extracted
■ ATE - some ATE algorithm
■ tr scores for every word w in W_sub
● Process
■ Apply ATE to C to extract and score candidate terms
■ Revise each candidate term’s score using the tr scores of its component words
■ Then re-rank candidate terms by the new score
● Output
■ Re-ranked list of candidate terms
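The slides do not spell out the combination formula, so the sketch below assumes one simple variant: add the mean tr score of a term's scored component words to its base ATE score. This only reproduces the flavor of the overview example (scores rise slightly and ranks can swap); the paper defines the actual combination.

```python
def rerank(ate_scores, tr_scores):
    """Re-rank ATE candidates using corpus-level TextRank evidence.

    ate_scores: {term: base ATE score}; tr_scores: {word: tr score}.
    Assumed revision: base + mean tr of the term's scored words;
    terms with no scored words keep their base score.
    """
    revised = {}
    for term, base in ate_scores.items():
        trs = [tr_scores[w] for w in term.split() if w in tr_scores]
        revised[term] = base + (sum(trs) / len(trs) if trs else 0.0)
    return sorted(revised.items(), key=lambda kv: kv[1], reverse=True)

ate = {"text mining": 1.99, "web page": 1.21, "ontology": 1.10}
tr = {"text": 0.25, "mining": 0.15, "ontology": 0.31}
print(rerank(ate, tr))
# [('text mining', 2.19), ('ontology', 1.41), ('web page', 1.21)]
# (up to float rounding) -- 'ontology' overtakes 'web page'
```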
13. Experiment and Findings
● Base ATE methods (as AdaText needs ATE scores of
candidate terms)
■ Modified TFIDF (Zhang et al., 2016)
■ CValue (Ananiadou 1994)
■ Basic (Bordea et al., 2013)
■ RAKE (Rose et al., 2010)
■ Weirdness (Ahmad et al., 1999)
■ LinkProbability (LP, Astrakhantsev, 2016)
■ χ² (Matsuo et al., 2003)
■ GlossEx (Park et al., 2002)
■ Positive Unlabelled (PU) learning (Astrakhantsev,
2016)
■ AvgRel - average relatedness score with seeds
● Use implementations:
■ JATE (https://github.com/ziqizhang/jate)
■ ATR4S (https://github.com/ispras/atr4s)
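As one concrete representative of this family, here is plain corpus-level TF-IDF over extracted candidates; this is ordinary TF-IDF, not the modified variant of Zhang et al. (2016).

```python
import math
from collections import Counter

def tfidf_scores(doc_candidates):
    """Score candidate terms by TF-IDF over the corpus.

    doc_candidates: one list of extracted candidates per document.
    TF = total corpus frequency; IDF = log(N / document frequency).
    """
    n_docs = len(doc_candidates)
    tf = Counter(t for doc in doc_candidates for t in doc)
    df = Counter(t for doc in doc_candidates for t in set(doc))
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

docs = [["term extraction", "text mining"], ["term extraction"], ["web page"]]
print(tfidf_scores(docs))
```

Its familiar weakness, that a candidate occurring in every document gets IDF 0 and therefore score 0 no matter how frequent it is, is one reason ATE work modifies the classic formula.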
14. Experiment and Findings
Evaluation measures
■ Precision for top K ranked candidate terms
■ K = {50, 100, 500, 1000, 2000}
■ Average P@K for all five K’s
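Both measures are straightforward to compute; a minimal sketch, where ranked is the candidate list after (re-)ranking and gold is the set of target terms:

```python
def precision_at_k(ranked, gold, k):
    """Fraction of the top-k ranked candidates that are target terms."""
    top = ranked[:k]
    return sum(1 for t in top if t in gold) / len(top)

def avg_p_at_k(ranked, gold, ks=(50, 100, 500, 1000, 2000)):
    """Average P@K over the five K values used in the experiments."""
    return sum(precision_at_k(ranked, gold, k) for k in ks) / len(ks)
```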
15. Experiment and Findings
Datasets
● GENIA
■ 2,000 semantically annotated Medline abstracts
■ 434k words
■ 33k target terms
● ACLv2
■ 300 ACL paper abstracts
■ 32k words
■ 3k target terms
16. Experiment and Findings
Seeds and parameters
● For GENIA:
5,502 named entities from the BioNLP Shared Task
2011, of which only 25 match candidate terms
● For ACLv2:
1,301 noun phrases from the titles of ACL, NAACL, and
EACL papers (since 2000), none of which matches the
candidate terms
● Semantic relatedness threshold min = 0.5 to 0.85, in increments of 0.05
● TextRank context window win=5, 10
17. Result - Base ATE
- Base ATE performance varies significantly across datasets.
- No single method consistently wins at all five K’s.
- E.g., PU is the best-performing method in AvgP@K on the ACL corpus,
but the fourth-worst on the GENIA corpus.
19. - The min threshold: setting it too low (including too many weakly
related words) or too high (creating lots of isolated graphs) can harm
performance
- The win context window: no strong pattern as to which value (5 or 10) is better
- Within min = [0.6, 0.75], AvgP@K improves by 1 to 25 percentage
points, depending on the base ATE method and the dataset
20. Conclusion
● The takeaway message
■ There is probably never a ‘one-size-fits-all’ ATE method;
instead, think about improving existing ones
■ AdaText makes use of existing domain resources and
builds on the TextRank algorithm
■ Generic method able to improve, potentially, any ATE
method
● Future work
■ Whether and how the size and source of the seed lexicon
affect performance
■ Adapt TextRank to a graph of both words and phrases,
and see how this affects results
21. Resources and Software
● Data
■ GENIA corpus, ACL corpus available
■ GloVe embeddings available
● Software
■ JATE (https://github.com/ziqizhang/jate)
■ ATR4S (https://github.com/ispras/atr4s)
■ Code for this work: https://github.com/ziqizhang/texpr
22. References
1. Bourigault, D., 1992. Surface grammatical analysis for the extraction of terminological noun phrases, in: Proc. of COLING 1992, pp. 977–98.
2. Iyyer, M., Manjunatha, V., Boyd-Graber, J., Daumé III, H., 2015. Deep unordered composition rivals syntactic methods for text classification, in: Proc. of ACL 2015.
3. Mihalcea, R., Tarau, P., 2004. TextRank: Bringing order into texts, in: Proc. of EMNLP 2004.
4. Zhang, Z., Gao, J., Ciravegna, F., 2016. JATE 2.0: Java automatic term extraction with Apache Solr, in: Proc. of LREC 2016.
5. Ananiadou, S., 1994. A methodology for automatic term recognition, in: Proc. of COLING 1994, ACL, Stroudsburg, PA, USA, pp. 1034–1038.
6. Bordea, G., Buitelaar, P., Polajnar, T., 2013. Domain-independent term extraction through domain modelling, in: Proc. of the Conference on Terminology and Artificial Intelligence.
7. Astrakhantsev, N., 2015. Methods and software for terminology extraction from domain-specific text collections. Ph.D. thesis, Institute for System Programming of the Russian Academy of Sciences.
8. Rose, S., Engel, D., Cramer, N., Cowley, W., 2010. Automatic keyword extraction from individual documents. John Wiley and Sons.
9. Ahmad, K., Gillam, L., Tostevin, L., 1999. University of Surrey participation in TREC-8: Weirdness indexing for logical document extrapolation and retrieval (WILDER), in: Proc. of TREC 1999.
10. Astrakhantsev, N., 2016. ATR4S: Toolkit with state-of-the-art automatic terms recognition methods in Scala. arXiv preprint arXiv:1611.07804.
11. Matsuo, Y., Ishizuka, M., 2003. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools 13, 157–169.
12. Park, Y., Byrd, R., Boguraev, B., 2002. Automatic glossary extraction: Beyond terminology identification, in: Proc. of COLING 2002, Association for Computational Linguistics, pp. 1–7.
23. Acknowledgements
This work is supported by the European Union's Horizon 2020
research and innovation programme under grant agreement
No. 726992 (KNOWMAK project)
https://www.knowmak.eu/