2. please go to http://ADDRESS
and enter a sentence
interesting relationships?
gensim generated the data for those visualizations
by computing the semantic similarity of the input
3. who am I?
William Bert
developer at Carney Labs (teamcarney.com)
user of gensim
still new to the world of topic modelling, semantic similarity, etc
4. gensim: “topic modeling for humans”
topic modeling attempts to uncover the underlying semantic structure of text by identifying recurring patterns of terms in a set of data (topics).
topic modelling
does not parse sentences,
does not care about word order, and
does not "understand" grammar or syntax.
6. gensim isn't about topic modeling
(for me, anyway)
It's about similarity.
What is similarity?
Some types:
• String matching
• Stylometry
• Term frequency
• Semantic (meaning)
7. Is
"A seven-year quest to collect samples from the solar system's formation ended in triumph in a dark and wet Utah desert this weekend."
similar in meaning to
"For a month, a huge storm with massive lightning has been raging on Jupiter under the watchful eye of an orbiting spacecraft."
more or less than it is similar to
"One of Saturn's moons is spewing a giant plume of water vapour that is feeding the planet's rings, scientists say."
?
8. Who cares about semantic similarity?
Some use cases:
• Query large collections of text
• Automatic metadata
• Recommendations
• Better human-computer interaction
9. gensim.corpora
TextCorpus and other kinds of corpus classes
>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1),...]
a corpus = a stream of sparse vectors of document feature ids
for example, words in documents are features ("bag of words")
10. gensim.corpora
TextCorpus and other kinds of corpus classes
>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1),...]
Dictionary class
>>> print corpus.dictionary
Dictionary(8472 unique tokens)
dictionary maps features (words) to feature ids
(numbers)
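To make the corpus and dictionary ideas concrete, here is a minimal sketch of my own (not from the slides) that builds a Dictionary and a bag-of-words corpus from two tiny hand-made documents; the token lists are invented purely for illustration:

from gensim.corpora import Dictionary

# two toy documents, already tokenized (hypothetical example data)
docs = [["saturn", "rings", "water"], ["jupiter", "storm", "lightning"]]

dictionary = Dictionary(docs)             # maps each token to an integer feature id
print(dictionary.token2id)                # e.g. {'rings': 0, 'saturn': 1, 'water': 2, ...}

corpus = [dictionary.doc2bow(doc) for doc in docs]
print(corpus)                             # each document becomes a list of (feature id, count) pairs

Iterating over such a corpus yields exactly the kind of sparse (feature id, count) vectors shown on the slide above.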
11. need massive collection of documents that
ostensibly has meaning
sounds like a job for wikipedia
>>> wiki_corpus = WikiCorpus(articles)  # articles is a Wikipedia text dump bz2 file. several hours.
>>> wiki_corpus.dictionary.save("wiki_dict.dict")  # persist dictionary
>>> MmCorpus.serialize("wiki_corpus.mm", wiki_corpus)  # uses numpy to persist corpus in Matrix Market format. several GBs. can be BZ2'ed.
>>> wiki_corpus = MmCorpus("wiki_corpus.mm")  # revive a corpus
12. gensim.models
transform corpora using model classes
for example, the term frequency/inverse document frequency (TFIDF) transformation
reflects the importance of a term, not just presence/absence
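For reference, the TFIDF weighting the slides refer to can be written as follows (this formula is my addition, not part of the deck; gensim's TfidfModel defaults are close to it, using a base-2 logarithm and L2 normalization of each document vector):

w_{t,d} = \mathrm{tf}_{t,d} \cdot \log_2\!\left(\frac{N}{\mathrm{df}_t}\right)

where tf_{t,d} is the count of term t in document d, df_t is the number of documents containing t, and N is the total number of documents in the corpus.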
13. gensim.models
>>> tfidf_trans = models.TfidfModel(wiki_corpus, id2word=dictionary)  # TFIDF computes frequencies of all document features in the corpus. several hours.
TfidfModel(num_docs=3430645, num_nnz=547534266)
>>> tfidf_trans[documents]  # emits documents in TFIDF representation. documents must be in the same BOW vector space as wiki_corpus.
[[(40, 0.23), (6, 0.12), (78, 0.65)], [(39, ...]
>>> tfidf_corpus = MmCorpus(corpus=tfidf_trans[wiki_corpus], id2word=dictionary)  # builds new corpus by iterating over documents transformed to TFIDF
16. topics again for a bit
• SVD decomposes a matrix into three simpler matrices
• full rank SVD would be able to recreate the underlying matrix exactly from those three matrices
• lower-rank SVD provides the best (least square error) approximation of the matrix
• this approximation can find interesting relationships among data
• it preserves most information while reducing noise and merging dimensions associated with terms that have similar meanings
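A small numpy sketch of the low-rank idea (my addition, not part of the original deck): decompose a toy term-document matrix and rebuild it keeping only the top k singular values, which gives the least-squares-best rank-k approximation the bullets describe. The matrix values are made up.

import numpy as np

# toy term-document matrix: rows are terms, columns are documents (invented numbers)
A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A == U @ diag(s) @ Vt

k = 2                                              # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation (least square error)
print(np.round(A_k, 2))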
17. topics again for a bit
• SVD:
alias-i.com/lingpipe/demos/tutorial/svd/read-me.html
•Original paper:
www.cob.unt.edu/itds/faculty/evangelopoulos/dsci5910/LSA
_Deerwester1990.pdf
• General explanation:
tottdp.googlecode.com/files/LandauerFoltz-Laham1998.pdf
• Many more
18. gensim.models
>>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=400, decay=1.0, chunksize=20000)  # creates LSI transformation model from the TFIDF corpus representation
>>> print lsi_trans
LsiModel(num_terms=100000, num_topics=400, decay=1.0, chunksize=20000)
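Once the model is built, topics like the ones shown earlier can be inspected; a brief sketch of mine using LsiModel's show_topics (the variable name lsi_trans follows the slide above):

# each topic is a weighted combination of terms, with positive and negative factors
print(lsi_trans.show_topics(num_topics=3, num_words=10))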
19. gensim.similarities
(the best part)
>>> index = Similarity(corpus=lsi_trans[tfidf_trans[index_corpus]], num_features=400, output_prefix="/tmp/shard")
>>> index[lsi_trans[tfidf_trans[dictionary.doc2bow(tokenize(query))]]]  # similarity of each document in the index corpus to a new query document
>>> [s for s in index]  # a matrix of each document's similarities to all other documents
[array([ 1.  ,  0.  ,  0.08,  0.01]),
 array([ 0.  ,  1.  ,  0.02, -0.02]),
 array([ 0.08,  0.02,  1.  ,  0.15]),
 array([ 0.01, -0.02,  0.15,  1.  ])]
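Putting the whole pipeline together on a toy scale, here is a self-contained sketch of mine (not from the deck) that indexes three short documents, loosely based on the excerpts from slide 7, and scores a new query against them. It uses MatrixSimilarity, which keeps the index in memory, instead of the sharded Similarity class shown above; all names and data are illustrative.

from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LsiModel
from gensim.similarities import MatrixSimilarity

docs = [
    "a seven year quest to collect samples from the solar system's formation",
    "a huge storm with massive lightning has been raging on jupiter",
    "one of saturn's moons is spewing a giant plume of water vapour",
]
tokenized = [d.lower().split() for d in docs]

dictionary = Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(toks) for toks in tokenized]

tfidf = TfidfModel(bow_corpus, id2word=dictionary)
lsi = LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=2)
index = MatrixSimilarity(lsi[tfidf[bow_corpus]], num_features=2)

query = dictionary.doc2bow("giant storm on jupiter".lower().split())
print(index[lsi[tfidf[query]]])   # cosine similarity of the query to each indexed document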
20. about gensim
four additional models available
dependencies: numpy, scipy
optional: Pyro, Pattern
created by Radim Rehurek
•radimrehurek.com/gensim
•github.com/piskvorky/gensim
•groups.google.com/group/gensim
21. thank you
example code, visualization code, and ppt:
github.com/sandinmyjoints
interview with Radim:
williamjohnbert.com
23. gensim.models
• term frequency/inverse document frequency
(TFIDF)
• log entropy
• random projections
• latent Dirichlet allocation (LDA)
• hierarchical Dirichlet process (HDP)
• latent semantic analysis/indexing (LSA/LSI)
24. slightly more about gensim
Dependencies: numpy and scipy, and optionally Pyro for distributed computation and Pattern for lemmatization
data from Lee 2005 and other papers is available in gensim for tests
#3: Hi everyone, thanks for coming. I'm going to start off with a quick demo app. Please go to the address you see up there. Hookbox sometimes takes several seconds to connect to the channel, so give it some time if it's red. It will turn green. You'll be invited to submit a sentence, particularly a statement or fact that has a mixture of nouns, verbs, and adjectives. There are some examples of the kinds of sentence that might work well with this demo, but take a moment to think of a sentence of your own and go ahead and submit. We should see them pop up here on the visualization screen. The idea is to provide a bit of a concrete grounding for the talk, and this will serve as an example of one thing you can do with gensim, or at least the data generated by it. What do we see? We have a table comparing a number of submitted sentences with what I'm going to call similarity scores between them. The darker the green, the higher the score. Are there any interesting results? Here we also have some clustering visualizations that attempt to group the inputs that were found to have the highest scores together. How do they cluster together? [click] Hopefully that worked and showed some interesting relationships among the input. (If not, well, I blame the input.) gensim, which I'll be talking about today, was generating all the underlying similarity scores, measuring how similar each sentence was to the other ones. I'm going to explain how to get results like this from gensim.
#4: A few quick words about me: William Bert, developer at Carney Labs for about seven months. Carney Labs is basically a startup wholly owned by a larger company in Alexandria called Team Carney. I use gensim at work developing a conversational tutoring web app. Topic modelling is still pretty new to me and I'm constantly learning more about it, so my knowledge is still growing, but I'm really fascinated by it and trying to learn more by working with it a lot, and by doing things like this presentation.
#5: gensim is a free Python framework for doing topic modelling. I'm going to blaze through a quick overview of topic modelling, then discuss how gensim uses it to do semantic similarity, generating data like what we saw. Topic modelling attempts to uncover the underlying semantic structure of text (or other data) by using statistical techniques to identify abstract, recurring patterns of terms in a set of data. These patterns are called topics. They may or may not correspond to our intuitive notion of a topic. Topic modelling models documents as collections of features, representing the documents as long vectors that indicate the presence/absence of important features, for example, the presence or absence of words in a document. We can use those vectors to create spaces and plot the locations of documents in those spaces and use that as a kind of proxy for their meaning. What isn't topic modelling? Topic modelling does not parse sentences; in fact, it knows nothing about word order and makes no attempt to "understand" grammar or language syntax. What does a topic look like?
#6: Let's take a quick look at some topics now, and we'll come back to them again after I walk through how to generate them. [click] These three abbreviated topics were extracted from a large corpus of texts by gensim using a technique called latent semantic analysis (LSA). Just quickly note how they are collections of words that don't necessarily/intuitively seem to belong together. There are also positive and negative scalar factors for each word, which get smaller in magnitude as the topic goes on. We don't see it here, but each of these topics actually has thousands more terms; these are just the first ten. So that's what a topic looks like when I talk about topics, but the truth is...
#7: gensim isn't really about topic modeling, for me anyway. [click] It's really about similarity. Topics are a means to an end. [click] A few words about similarity, because it can be elusive. [click] There are different kinds of similarity. String matching: how many characters strings have in common. Stylometry: the similarity of style, which looks at, say, length of words or sentences, use of function words, ratio of nouns to verbs, etc.; used to identify authors, for example. Term frequency: do the documents use the same words the same number of times (when scaled and normalized)? The kind of similarity I'm interested in is semantic similarity, similarity of meanings. But what is that?
#8: Take a moment to read these three sentences. They might be said to share certain elements: non-earth planets, weather, duration, research and data collection. How would you quantify their similarity? How would you decide that two are more similar to each other than to the third? A study done in Australia in 2005 skipped over the question of defining semantic similarity formally and abstractly and instead defined it as what a sample of Australian college students think is similar. They had students read hundreds of paired short excerpts from news articles (and these sentences are excerpted from some of those) and rank the pairwise similarity on a scale. They then examined all the classifications and found that they had a correlation of 0.6. That's obviously a positive correlation, but not terribly high. So, humans don't necessarily agree with each other about semantic similarity; it's kind of a fuzzy notion. That said, we're going to try to put a number on it. In fact, the study I mentioned found that a particular topic modelling technique called latent semantic analysis (LSA) could also achieve a 0.6 correlation with the human ratings, correlating with the study participants' choices about as well as they correlated with each other.
#9: Why do we care about semantic similarity? Some use cases for document similarity comparison:
- Traditionally used on large document collections: legal discovery; answering questions or aiding search over huge corpora like government regulations, manuals, patent databases, etc.
- Automatic metadata: the system can intelligently suggest tags and categories for documents based on other documents they're similar to.
- Something that came up recently on the gensim Google group: in a CMS, when a user creates a new post, they want to see posts that may be similar in content. We can do that with semantic similarity, and in fact someone actually made this into a plugin for Plone using gensim.
- Recommendations, plagiarism detection, exam scoring. And there are a number of other use cases.
There are some new and fast online algorithms that work in real time, whereas previously work was often done in batches. This brings up another potential use:
- Better HCI. Matching on similarity rather than on words or with regexes allows us to accept broader ranges of input, in theory.
So to make it happen, enter... gensim.
#10: To get our topics that we can then use to compute semantic similarity, we're going to start by turning a large set of documents, which we'll call a training/background corpus, into numeric vectors. We'll use the tools in gensim's corpora package. When I say document, a document can be as short as one word, or as long as many pages of text, or anywhere in between. My examples and the demo app mostly use sentence-size documents. In gensim, a corpus is an iterable that returns its documents as sparse vectors. (A sparse vector is just a compact way of storing large vectors that are mostly zeroes.) A corpus can be made from a file, a database query, a network stream, etc., as long as you can stream the documents and emit vectors. gensim iterates over the documents in a corpus with generators, so it uses constant memory, which means you can have enormous corpora and indexes. How you generate those vectors from the documents is up to you. (So a corpus isn't inherently tied to words or even language; it could be constructed from features of anything, such as music or video, if you can figure out what to use as features [like amplitude or frequencies?] and how to extract them.) If your features are the presence or absence of words, your corpus is in what's called "bag of words" (BOW) format. gensim provides a convenience class called TextCorpus for creating such a corpus from a text file. So here we have a list of documents, where each document is a list of (feature id, count) tuples. Feature #40 appears one time in document #0, etc. For BOW, we also need a dictionary...
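Not on the slide, but to make the streaming idea concrete, here is a minimal sketch of a BOW corpus that reads one document per line from a file. The filename, tokenization, and class name are placeholders I've made up for illustration, not part of the talk:

    from gensim import corpora

    class LineCorpus(object):
        """Streams one document per line, yielding sparse BOW vectors."""
        def __init__(self, path, dictionary):
            self.path = path
            self.dictionary = dictionary

        def __iter__(self):
            with open(self.path) as f:
                for line in f:
                    # naive lowercase whitespace tokenization; a real app would preprocess more
                    yield self.dictionary.doc2bow(line.lower().split())

    # build the dictionary in one streaming pass, then iterate the corpus in constant memory
    dictionary = corpora.Dictionary(line.lower().split() for line in open('documents.txt'))
    bow_corpus = LineCorpus('documents.txt', dictionary)

Because the vectors are produced lazily inside __iter__, the whole corpus never has to fit in RAM at once, which is the property the talk keeps relying on.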
#11: The Dictionary maps feature ids back to features (words). The corpus class will generate this for me. So the vectors indicate the presence of words in particular documents, and the resulting matrix containing these vectors represents all the words appearing in all the documents.
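As a rough illustration of the id-to-word mapping (the ids and words below are made up, not the actual dictionary):

    print(dictionary[40])                    # e.g. 'weather'
    print(dictionary.token2id['weather'])    # e.g. 40

    # map a new document into the same id space
    print(dictionary.doc2bow("the weather on mars".split()))
    # e.g. [(12, 1), (40, 1)]  -- words not in the dictionary are silently dropped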
#12: To do interesting and useful things with semantic similarity, we need a good training or background corpus. Finding a good training corpus is something of an art. You want a large collection of documents (at least tens of thousands) that are representative of your problem domain. It can be difficult to find or build such a corpus. [click] Or, you can just use Wikipedia. Helpfully for experimenting, gensim comes with a WikiCorpus class and other code for building a corpus from a Wikipedia article dump. [click] WikiCorpus makes two passes, one to extract the dictionary, and another to create and store the sparse vectors. It takes about 10 hours on an i7 to generate and serialize the corpus and dictionary, though it uses constant memory. The resulting output vectors are about 15 GB uncompressed, about 5 GB compressed. So after these operations, wiki_corpus is now a BOW vector space representation of Wikipedia, embodied in a large corpus file in the Matrix Market format (a popular matrix file format) and a several-megabyte dictionary mapping ids to tokens (words). What can we do with our corpus?
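A minimal sketch of that step, assuming you've already downloaded a Wikipedia dump (the filenames below are placeholders):

    from gensim.corpora import WikiCorpus, MmCorpus

    # first pass: parse the dump and build the dictionary
    wiki_corpus = WikiCorpus('enwiki-latest-pages-articles.xml.bz2')
    dictionary = wiki_corpus.dictionary

    # second pass: stream the BOW vectors to disk in Matrix Market format
    MmCorpus.serialize('wiki_bow.mm', wiki_corpus)
    dictionary.save('wiki.dict')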
#13: We can transform corpora from one vector space to another using models. Transformations can bring out hidden structure in the corpus, such as revealing relationships between words and documents. They can also represent the corpus in a more compact way, preserving much of the information while consuming fewer resources. A gensim 'transformation' is any object which accepts a sparse document via dictionary notation and returns another sparse document. One useful transformation that we can generate from our BOW corpus is term frequency/inverse document frequency (TFIDF). Instead of a count of word appearances in a document, we get a score for each word that also takes into account the global frequency of that word. So a word's TFIDF value in a given document increases proportionally to the number of times the word appears in that particular document, but is offset by the frequency of the word in the entire corpus, which helps to control for the fact that some words are generally more common than others.
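The core idea in one simplified form (a common textbook formulation; gensim's default weighting and normalization may differ in details such as the log base and vector normalization):

    from math import log

    def tfidf_weight(term_count, doc_freq, num_docs):
        """term_count: occurrences of the term in this document
        doc_freq: number of documents containing the term
        num_docs: total number of documents in the corpus"""
        return term_count * log(num_docs / doc_freq)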
#14: Transformations are initialized with a training corpus, so we realize a TFIDF transformation from the corpus we just generated, wiki_corpus. This also takes several hours to generate for Wikipedia. num_docs is the number of documents in the dictionary; num_nnz is the number of non-zeroes in the matrix. [click] Once our model is generated, we can transform documents represented in one vector space model (the wiki_corpus BOW space) and emit them in another (the wiki_corpus TFIDF space) as (word_id, word_weight) tuples, where the weight is a positive, normalized float. These documents can be anything, new and unseen, as long as they have been tokenized and put into the BOW representation using the same tokenizer and dictionary word->id mappings that were used for the wiki corpus. [click] We can emit these new representations right into a fresh MmCorpus, which could also be serialized and persisted on disk (also requiring several GBs). However,
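Roughly, that step looks like this. This is a sketch reusing the wiki_corpus and dictionary names from above; 'wiki_tfidf.mm' is a placeholder filename:

    from gensim.models import TfidfModel
    from gensim.corpora import MmCorpus

    tfidf = TfidfModel(wiki_corpus)        # learn document frequencies from the BOW corpus

    # wrap the whole corpus in the transformation (lazy; evaluated as you iterate) ...
    tfidf_corpus = tfidf[wiki_corpus]

    # ... or transform a single new BOW document
    tfidf_doc = tfidf[dictionary.doc2bow("life on mars".split())]

    # optionally persist the transformed corpus
    MmCorpus.serialize('wiki_tfidf.mm', tfidf_corpus)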
#15: The TFIDF corpus itself is not all that interesting except as a stepping stone to another model called LSI. Latent semantic indexing/analysis (LSI/LSA) is the granddaddy of topic modelling similarity techniques; the original paper is from 1990. It produced the results we saw in the visualization. We can generate an LSI model almost the same way we did the TFIDF model, but we do need to provide an extra parameter called num_features, which brings us back to topics...
#16: num_features is a parameter to LSI telling it how many topics to make. Here again are the topics we saw. These were generated by LSI from Wikipedia articles. What are these topics? It's hard to say, exactly. The "themes" are unclear. But they are in some sense the corpus's "principal components" (and in fact, principal component analysis is similar, if you know what that is). Here's a brief rundown of how LSI works to calculate these topics...
#17: LSI uses a technique called singular value decomposition (SVD) to reduce the number of dimensions of the original term/document matrix while keeping the most information for a given number of topics. I understand the technique conceptually, but I'm not going to try to get into the math behind it because I don't really understand it well enough to explain it, and there are plenty of resources online that explain it in great and accurate detail. Nonetheless, I'll at least describe it briefly: SVD decomposes the term/document matrix into three simpler matrices. Full-rank SVD will recreate the underlying matrix exactly, but LSA uses lower-order SVD, which provides the best (in the sense of least square error) approximation of the matrix at lower dimensions. By lowering the rank, dimensions associated with terms that have similar meanings are merged together. This preserves the most important semantic information in the text while reducing noise, and can uncover interesting relationships among the data of the underlying matrix.

Still, the meaning of the terms and topics is not really apparent to us. This is because a single LSI topic is not about a single thing; the topics work as a set. They contain both positive and negative values, which cancel each other out delicately when generating vectors for documents. This is one of the reasons LSI topics are hard to interpret.

---
The original matrix can be too large for the computing resources; in this case, the approximated low-rank matrix is interpreted as an approximation (a "least and necessary evil").

The original matrix can be noisy: for example, anecdotal instances of terms are to be eliminated. From this point of view, the approximated matrix is interpreted as a de-noisified matrix (a better matrix than the original).

The original term-document matrix is presumed overly sparse relative to the "true" term-document matrix. That is, the original matrix lists only the words actually in each document, whereas we might be interested in all words related to each document (generally a much larger set, due to synonymy).

The consequence of the rank lowering is that some dimensions are combined and depend on more than one term:
{(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}
This mitigates the problem of identifying synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings. It also mitigates the problem with polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share a similar meaning. Conversely, components that point in other directions tend to either simply cancel out, or, at worst, to be smaller than components in the directions corresponding to the intended sense.
---

A is an m x n matrix, and SVD factors it as A = U * S * V^T, where U is an m x k matrix, V is an n x k matrix, S is a k x k diagonal matrix, and k is the rank of A.
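Not part of the talk, but here is a tiny numpy illustration of the low-rank idea, using a plain dense SVD on a toy matrix (gensim's actual LSI implementation is an incremental, streamed algorithm, not this):

    import numpy as np

    A = np.random.rand(6, 5)                   # toy term/document matrix (6 terms x 5 documents)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2                                      # keep only the top-k "topics"
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # A_k is the best rank-k approximation of A in the least-squares sense
    print(np.linalg.norm(A - A_k))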
#18: As I said, there are plenty of resources online to explain it, so for now I will direct your questions there.
#19: So we generate our somewhat mysterious LSI model, asking for 400 topics. Interesting note: no one has really figured out how to determine the best number of topics for a given corpus for LSI, but experimentally people have found good results between 200 and 500. This will also take several hours. LSI model generation can be distributed to multiple CPUs/machines through a library called Python Remote Objects (Pyro), leading to faster model generation times. When it's done, we have an LsiModel with 100,000 terms, which is the size of the dictionary we created. The decay parameter gives more emphasis to new documents if any are added to the model after initial generation. Because the SVD algorithm is incremental, the memory load is constant and can be controlled by a chunksize parameter that says how many documents are to be loaded into RAM at once. Larger chunks speed things up, but also require more RAM.
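A sketch of that call, again reusing names from earlier. Note that in recent gensim versions the number-of-topics parameter is called num_topics; the chunksize and decay values here are illustrative, not the talk's exact settings:

    from gensim.models import LsiModel

    lsi = LsiModel(
        corpus=tfidf_corpus,      # the TFIDF-transformed wiki corpus
        id2word=dictionary,
        num_topics=400,           # how many latent dimensions ("topics") to keep
        chunksize=20000,          # documents held in RAM per training chunk
        decay=1.0,                # weighting of old vs. newly added documents
    )

    # inspect a few of the resulting topics
    for topic in lsi.print_topics(3):
        print(topic)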
#20: Now we get to the best part. With our LSI transformation, we can now use the classes in gensim.similarities to create an index of all the documents that we want to compare subsequent queries against. The Similarity class uses fixed memory by splitting the index across shards on disk and mmap'ing them in as necessary; output_prefix is used for the filenames of the shards. What is index_corpus? It could be my original training/universe corpus, Wikipedia. Then the index would tell me which Wikipedia document any new query is most similar to. But the index corpus could also be a set of entirely different documents, for example arbitrary sentences typed in by a group of Python programmers, and the index will determine which of those documents my query is most similar to. You can even add new documents to an index in realtime. [click] So to calculate the similarity of a query, we tokenize and preprocess the query the same way we treated the wiki corpus, then convert it to BOW, then do the TFIDF transform, then the LSI transform, give that to the index, and we'll get a list of similarity scores between the query and each document in the index. [click] You can also calculate the similarity scores between all documents in the index and get back a 2-dimensional matrix, which is what the visualization app was doing every time a new document was added to the index. This is what makes realtime similarity comparisons possible, for some value of similar.
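Put together, indexing and querying look roughly like this. This is a sketch: index_docs, the shard prefix, and the query text are placeholders, and the pipeline reuses dictionary, tfidf, and lsi from the earlier sketches:

    from gensim.similarities import Similarity

    # project whatever documents we want to search over into LSI space and index them
    bow_docs = [dictionary.doc2bow(doc.lower().split()) for doc in index_docs]
    index_corpus = lsi[tfidf[bow_docs]]
    index = Similarity('/tmp/lsi_shards', index_corpus, num_features=400)

    # query: same pipeline -- tokenize -> BOW -> TFIDF -> LSI -> index
    query = "weather patterns on other planets"
    query_lsi = lsi[tfidf[dictionary.doc2bow(query.lower().split())]]
    sims = index[query_lsi]                  # one cosine similarity score per indexed document

    # similarity of every indexed document against the whole index (the 2-D matrix idea)
    all_sims = [row for row in index]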
#21: A few more things about gensim before I wrap up: TFIDF and LSI are only two of the six models it implements. There are a couple more weighting models and a couple more dimensionality reduction models, each with different properties, but I haven't had a chance to work with those very much. [click] [click] gensim's dependencies are numpy and scipy, and optionally Pyro for distributed model generation and Pattern for additional input processing. [click] To give credit where credit's due, I want to say a few words about where gensim comes from. It was created by a Czech guy named Radim Rehurek, and his work to make the algorithms scalable and online contributed to his PhD thesis. He is an active developer and is very helpful on the mailing list. He's working hard to build a community around gensim and make it into a robust open source project (LGPL license). I asked him a few questions about his work on gensim; the questions and the answers are up on my personal site, williamjohnbert.com. Radim says: "Gensim has no ambition to become an all-encompassing production level tool, with robust failure handling and error recoveries." But in my experience it has performed well, and Radim mentioned several commercial applications that are using it, in addition to universities.
#22: Thanks for listening. This presentation, some sample code, and the demo app are available on my github page, github.com/sandinmyjoints. I should note that the demo web app and the visualization are actually not part of gensim. gensim generated the data, but the app is Flask and hookbox, the clustering is scipy and scikit-learn, and the visualization is d3. Questions?
#24: In addition to TFIDF, gensim implements several VSM algorithms, most of which I know nothing about, but to do justice to gensim's capabilities:
- TFIDF: weights tokens according to importance (local vs. global frequency). Preserves dimensionality.
- Log Entropy: another term weighting function that uses log entropy normalization. Preserves dimensionality.
- Random Projections: approximates TFIDF distances but is less computationally expensive. Reduces dimensionality.
- Latent Dirichlet Allocation (LDA): a generative model that produces more human-readable topics. Reduces dimensionality.
- Hierarchical Dirichlet Process (HDP): very new, first described in a paper from 2006, but not all operations are fully implemented in gensim yet.
- Latent semantic indexing/analysis (LSI/LSA): granddaddy of topic modelling similarity techniques. Reduces dimensionality. Original paper is Deerwester et al. 1990. I have used it most.
#25: Dependencies: numpy and scipy, and optionally Pyro for distributed model generation and Pattern for lemmatization. Data from Lee 2005 and other papers is available in gensim for tests.
#26: These recurring patterns called topics may or may not correspond to our intuitive notion of a topic. The abbreviated ones printed here were extracted from a large corpus of texts by gensim using a technique called latent Dirichlet allocation (LDA), which actually does tend to produce human-readable topics (but not all the techniques do that, and latent semantic analysis, which is what the demo app used and what we'll be looking at soon, does not). These have some themes: they appear to be "about" something, with the terms having a decreasing weighting.