(Data Day 2016)
Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she 'used to wear scrubs to work', and distill 'taking a trip' into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I'll try to convince you that word vectors give us a simple and flexible platform for understanding text, walking through word2vec and LDA and introducing our hybrid algorithm, lda2vec.
word2vec, LDA, and introducing a new hybrid algorithm: lda2vec
Christopher Moody
This document summarizes the lda2vec model, which combines aspects of word2vec and LDA. Word2vec learns word embeddings based on local context, while LDA learns document-level topic mixtures. Lda2vec models words based on both their local context and global document topic mixtures to leverage both approaches. It represents documents as mixtures over sparse topic vectors similar to LDA to maintain interpretability. This allows it to predict words based on local context and global document content.
word2vec, LDA, and introducing a new hybrid algorithm: lda2vec
1. A word is worth a thousand vectors
(word2vec, lda, and introducing lda2vec)
Christopher Moody
@ Stitch Fix
Welcome,
thanks for coming, thanks for having me, and thanks to the organizers
NLP can be a messy affair because you have to teach a computer about the irregularities and ambiguities of the English language, as well as the hierarchical, sparse nature of words and grammar.
3rd trimester, pregnant
“wears scrubs” — medicine
taking a trip — a fix for vacation clothing
The promise of word vectors is to sweep away a lot of these issues.
2. About
@chrisemoody
Caltech Physics
PhD in astrostatistics, supercomputing
sklearn t-SNE contributor
Data Labs at Stitch Fix
github.com/cemoody
Gaussian Processes t-SNE
chainer
deep learning
Tensor Decomposition
3. Credit
Large swathes of this talk are from
previous presentations by:
• Tomas Mikolov
• David Blei
• Christopher Olah
• Radim Rehurek
• Omer Levy & Yoav Goldberg
• Richard Socher
• Xin Rong
• Tim Hopper
5. 1. king - man + woman = queen
2. Huge splash in NLP world
3. Learns from raw text
4. Pretty simple algorithm
5. Comes pretrained
word2vec
1. Learns what words mean — can solve analogies cleanly.
1. Not treating words as blocks, but instead modeling relationships
2. Distributed representations form the basis of more complicated deep learning systems
3. Shallow — not deep learning!
1. Power comes from this simplicity — super fast, lots of data
4. Get a lot of mileage out of this
1. Don't need to model the Wikipedia corpus before starting on your own
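To make the "comes pretrained" and analogy points above concrete, here is a minimal sketch using gensim; the library, the dataset name, and the exact neighbour returned are my choices and assumptions, not something from the talk.

```python
# Minimal sketch of "comes pretrained" and "king - man + woman = queen".
# The dataset name is one of gensim's hosted pretrained sets; any
# KeyedVectors model would work the same way.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads on first use

# Analogy by vector arithmetic: king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Commonly returns [('queen', ...)] for pretrained sets like this one.
```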
6. word2vec
1. Set up an objective function
2. Randomly initialize vectors
3. Do gradient descent
7. word2vec
word2vec: learn the word vector vin from its surrounding context
vin
1. Let’s talk about training first
2. In SVD and n-gram approaches we built co-occurrence and transition probability matrices
3. Here we will learn the embedded representation directly, with no intermediates, and update it with every example
8. word2vec
“The fox jumped over the lazy dog”
Maximize the likelihood of seeing the words given the word over.
P(the|over)
P(fox|over)
P(jumped|over)
P(the|over)
P(lazy|over)
P(dog|over)
…instead of maximizing the likelihood of co-occurrence counts.
1. Context — the words surrounding the training word
2. Naively assume each P(*|over) is conditionally independent given the training word
3. Still a pretty simple assumption!
Conditioning on just *over*, no other secret parameters or anything
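To make the window concrete, here is a small sketch (my own illustration, not code from the talk) that generates the (input, output) pairs the objective above scores, with 'over' as the input word and an assumed window of three words on each side.

```python
# Illustrative sketch: generate (input, output) pairs from a sentence with a
# symmetric context window, matching the P(the|over), P(fox|over), ... list.
def skipgram_pairs(tokens, window=3):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))  # (input word, output word)
    return pairs

sentence = "the fox jumped over the lazy dog".split()
print([pair for pair in skipgram_pairs(sentence) if pair[0] == "over"])
# [('over', 'the'), ('over', 'fox'), ('over', 'jumped'),
#  ('over', 'the'), ('over', 'lazy'), ('over', 'dog')]
```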
10. word2vec
P(vfox|vover)
Should depend on the word vectors.
P(fox|over)
Trying to learn the word vectors, so let’s start with those
(we’ll randomly initialize them to begin with)
11. word2vec
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
“The fox jumped over the lazy dog”
P(vOUT|vIN)
12. word2vec
“The fox jumped over the lazy dog”
vIN
P(vOUT|vIN)
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
IN = training word
13. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
14. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
15. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
16. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
17. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
18. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
19. word2vec
P(vOUT|vIN)
“The fox jumped over the lazy dog”
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
20. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
…So that at a high level is what we want word2vec to do.
21. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
…So that at a high level is what we want word2vec to do.
22. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
…So that at a high level is what we want word2vec to do.
23. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
…So that at a high level is what we want word2vec to do.
24. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
…So that at a high level is what we want word2vec to do.
25. word2vec
“The fox jumped over the lazy dog”
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it’s the input or the output.
Also a context window around every input word.
…So that at a high level is what we want word2vec to do.
two for loops
That's it! It's a bit disingenuous to call this a giant network
26. objective
Measure loss between
vIN and vOUT?
vin . vout
How should we define P(vOUT|vIN)?
Now we’ve defined the high-level update path for the algorithm.
Need to define this probability exactly in order to define our updates.
Boils down to the difference between the in and out vectors: we want to make them as similar as possible, and then the probability will go up.
Use cosine similarity.
27. word2vec
vin . vout ~ 1
objective
vin
vout
Dot product has these properties:
Similar vectors have a dot product near 1
30. word2vec
vin . vout ∈ [-1,1]
objective
But the inner product ranges from -1 to 1 (when normalized)
…and we’d like a probability
31. word2vec
But we’d like to measure a probability.
vin . vout ∈ [-1,1]
objective
But the inner product ranges from -1 to 1 (when normalized)
…and we’d like a probability
32. word2vec
But we’d like to measure a probability.
softmax(vin . vout ∈ [-1,1])
objective
∈ [0,1]
Transform again using softmax
33. word2vec
But we’d like to measure a probability.
softmax(vin . vout ∈ [-1,1])
Probability of choosing 1 of N discrete items.
Mapping from vector space to a multinomial over words.
objective
Similar to logistic function for binary outcomes, but instead for 1 of N outcomes.
So now we're modeling the probability of a word showing up as the combination of the training word vector and the target word vector, transformed into a 1-of-N probability.
34. word2vec
But we’d like to measure a probability.
softmax ~ exp(vin . vout ∈ [-1,1])
objective
So here’s the actual form of the equation — we normalize by the sum of all of the other possible pairs of word combinations
35. word2vec
But we’d like to measure a probability.
softmax = exp(vin . vout ∈ [-1,1]) / Σk∈V exp(vin . vk)
objective
Normalization term over all words: k ∈ V
So here’s the actual form of the equation — we normalize by the sum of all of the other possible pairs of word combinations
two effects
make vin and vout more similar
make vin and every other word less similar
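Here is a small numpy sketch of that softmax; the array names and sizes are my own assumptions for illustration. It returns P(vout|vin) for every candidate output word given one input word.

```python
import numpy as np

# Two vector tables per word, as described earlier; sizes are arbitrary here.
V, D = 10_000, 300
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input ("center") vectors
W_out = rng.normal(scale=0.1, size=(V, D))  # output ("context") vectors

def p_out_given_in(in_idx):
    scores = W_out @ W_in[in_idx]   # vin . vk for every word k in the vocabulary
    scores -= scores.max()          # for numerical stability
    e = np.exp(scores)
    return e / e.sum()              # normalization term over all words

probs = p_out_given_in(in_idx=42)
print(probs.shape, probs.sum())     # (10000,) 1.0
```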
36. word2vec
But we’d like to measure a probability.
softmax = exp(vin . vout ∈ [-1,1]) / Σk∈V exp(vin . vk) = P(vout|vin)
objective
This is the kernel of word2vec. We're just going to apply this operation every time we want to update the vectors.
For every word, we’re going to have a context window, and then for every pair of words in that window and the input word,
we’ll measure this probability.
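As a rough sketch of that kernel in numpy (array names W_in and W_out are hypothetical; they hold one input and one output vector per vocabulary word):

```python
import numpy as np

def skipgram_prob(in_idx, out_idx, W_in, W_out):
    """P(vout | vin) = exp(vin . vout) / sum over k in V of exp(vin . vk)."""
    v_in = W_in[in_idx]
    scores = W_out @ v_in                     # dot product with every word's output vector
    scores -= scores.max()                    # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[out_idx]
```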
37. word2vec
Learn by gradient descent on the softmax prob.
For every example we see update vin
vin := vin + P(vout|vin)
objective
vout := vout + P(vout|vin)
…I won’t go through the derivation of the gradient, but this is the general idea
relatively simple, fast — fast enough to read billions of words in a day
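The slide’s update rule is schematic; the real update follows the gradient of -log P(vout|vin). A minimal full-softmax sketch of one step (actual word2vec uses negative sampling or hierarchical softmax to avoid the sum over the whole vocabulary):

```python
import numpy as np

def sgd_step(in_idx, out_idx, W_in, W_out, lr=0.025):
    """One skip-gram gradient step on an (input word, context word) pair."""
    v_in = W_in[in_idx].copy()
    scores = W_out @ v_in
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # P(k | in) for every word k
    probs[out_idx] -= 1.0                     # gradient of -log P(out | in) w.r.t. the scores
    grad_in = W_out.T @ probs                 # gradient w.r.t. vin
    W_out -= lr * np.outer(probs, v_in)       # push vout toward vin, every other word away
    W_in[in_idx] -= lr * grad_in              # and move vin toward vout
```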
41. Showing just 2 of the ~500 dimensions. Effectively we’ve PCA’d it
52. If we only had locality and not regularity, this wouldn’t necessarily be true
55. So we live in a vector space where operations like addition and subtraction are meaningful.
So here are a few examples of this working.
Really get the idea of these vectors as being ‘mixes’ of other ideas & vectors
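For example, with gensim and any pretrained word2vec file (the path here is just a placeholder), the classic addition/subtraction demo looks like this:

```python
from gensim.models import KeyedVectors

# placeholder path; e.g. a GoogleNews-style word2vec binary
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# 'king' - 'man' + 'woman' lands near 'queen': addition and subtraction are meaningful
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```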
57. + ‘Pregnant’
I love the stripes and the cut around my neckline was amazing
someone else might write ‘grey and black’
there’s subtlety and nuance in that language
We have lots of this interaction, on the order of a Wikipedia’s worth of text, far too much to manually annotate anything
67. Latent style vectors from text
Pairwise gamma correlation
from style ratings
Diversity from ratings vs. diversity from text
Lots of structure in both, but the diversity is much higher in the text
Maybe obvious: but the way people describe items is fundamentally richer than the style ratings
69. word2vec is local:
one word predicts a nearby word
“I love finding new designer brands for jeans”
as if the world were one very long text string: no end of documents, no end of sentences, etc.
and a window across words
70. “I love finding new designer brands for jeans”
But text is usually organized.
as if the world were one very long text string: no end of documents, no end of sentences, etc.
72. “I love finding new designer brands for jeans”
In LDA, documents globally predict words.
doc 7681
these are client comments, which are short: only dozens of words to predict
but they could be legal documents or medical documents of 10k words; there the difference between global and local algorithms is much more important
74. typical LDA document vector
[ 0%, 9%, 78%, 11%]
typical word2vec vector
[ -0.75, -1.25, -0.55, -0.12, +2.2]
The LDA vector sums to 100%; the word2vec vector is all real values.
75. 5D LDA document vector
[ 0%, 9%, 78%, 11%]
Sparse. Sums to 100%. Dimensions are absolute.
5D word2vec vector
[ -0.75, -1.25, -0.55, -0.12, +2.2]
Dense. All real values. Dimensions are relative.
LDA is a *mixture*.
w2v is a bunch of real numbers, more like an *address*.
It’s much easier to say to another human that a document is 78% of something than that it is +2.2 of something and -1.25 of something else.
76. 100D LDA document vector
[ 0% 0% 0% 0% 0% … 0%, 9%, 78%, 11%]
Sparse. Sums to 100%. Dimensions are absolute.
100D word2vec vector
[ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2]
Dense. All real values. Dimensions are relative.
dense vs. sparse
78. can we do both? lda2vec
a series of experiments, so take it with a grain of salt
very new: no good quantitative results yet, only qualitative (but promising!)
79. The goal:
Use all of this context to learn
interpretable topics.
word2vec: P(vOUT | vIN)
@chrisemoody
We use this at Stitch Fix.
Here’s a typical table of our data.
word2vec will use the word-to-word relationships.
80. word2vec
LDA: P(vOUT | vDOC)
The goal:
Use all of this context to learn
interpretable topics.
this document is
80% high fashion
this document is
60% style
@chrisemoody
LDA will use that doc ID column
you can use this to steer the business as a whole
81. word2vec
LDA
The goal:
Use all of this context to learn
interpretable topics.
this zip code is
80% hot climate
this zip code is
60% outdoors wear
@chrisemoody
But doesn’t predict word-to-word relationships.
in Texas, maybe I want more lone stars & stirrup icons
in Austin, maybe I want more bats
82. word2vec
LDA
The goal:
Use all of this context to learn
interpretable topics.
this client is
80% sporty
this client is
60% casual wear
@chrisemoody
we’d love to learn client topics
are there ‘types’ of clients? A question every business asks.
so this is the promise of lda2vec
83. lda2vec
word2vec predicts locally:
one word predicts a nearby word
P(vOUT |vIN)
vIN vOUT
“PS! Thank you for such an awesome top”
But it doesn’t take document-level context into account.
84. lda2vec
LDA predicts a word from a global context
doc_id=1846
P(vOUT |vDOC)
vOUT
vDOC
“PS! Thank you for such an awesome top”
But doesn’t predict word-to-word relationships.
86. lda2vec
“PS! Thank you for such an awesome top”doc_id=1846
vIN vOUT
vDOC
can we predict a word both locally and globally ?
P(vOUT |vIN+ vDOC)
doc vector captures long-distance dependencies
word vector captures short-distance
87. lda2vec
doc_id=1846
vIN vOUT
vDOC
*very similar to the Paragraph Vectors / doc2vec
can we predict a word both locally and globally ?
“PS! Thank you for such an awesome top”
P(vOUT |vIN+ vDOC)
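In code the change from plain word2vec is tiny; a sketch with hypothetical arrays (W_doc holds one vector per document), reusing the same softmax as before:

```python
import numpy as np

def lda2vec_prob(in_idx, doc_idx, out_idx, W_in, W_doc, W_out):
    """P(vout | vin + vdoc): local word context plus global document context."""
    context = W_in[in_idx] + W_doc[doc_idx]   # short-distance + long-distance signal
    scores = W_out @ context
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs[out_idx]
```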
88. lda2vec
This works! 😀 But vDOC isn’t as
interpretable as the LDA topic vectors. 😔
Too many documents. I really like that document X is 70% in topic 0, 30% in topic1, …
89. lda2vec
This works! 😀 But vDOC isn’t as
interpretable as the LDA topic vectors. 😔
Too many documents. I really like that document X is 70% in topic 0, 30% in topic1, …
about as interpretable as a hash
91. lda2vec
This works! 😀 But vDOC isn’t as
interpretable as the LDA topic vectors. 😔
We’re missing mixtures & sparsity.
Too many documents. I really like that document X is 70% in topic 0, 30% in topic1, …
92. lda2vec
This works! 😀 But vDOC isn’t as
interpretable as the LDA topic vectors. 😔
Let’s make vDOC into a mixture…
Too many documents. I really like that document X is 70% in topic 0, 30% in topic1, …
93. lda2vec
Let’s make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 +… (up to k topics)
sum of other word vectors
the intuition here is that ‘Hanoi = Vietnam + capital’ and ‘Lufthansa = Germany + airlines’
so we think that document vectors should also be a sum of word-like vectors
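One way to sketch that constraint (names are illustrative; the real lda2vec implementation differs in its details): keep a small matrix of topic vectors plus per-document weights, and build vDOC as their softmax-weighted sum so the weights behave like proportions.

```python
import numpy as np

n_docs, n_topics, dim = 10_000, 20, 300
doc_weights = np.random.randn(n_docs, n_topics)   # unnormalized topic weights per document
topic_vecs  = np.random.randn(n_topics, dim)      # one vector per topic, in word-vector space

def doc_vector(doc_idx):
    w = doc_weights[doc_idx]
    p = np.exp(w - w.max())
    p /= p.sum()                                   # a, b, c, ... sum to 100%
    return p @ topic_vecs                          # vDOC = a*vtopic1 + b*vtopic2 + ...
```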
94. lda2vec
Let’s make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 +…
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
the 20 Newsgroups dataset: free and canonical
95. lda2vec
Let’s make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 +…
topic 1 = “religion”
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
96. lda2vec
Let’s make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 +…
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
topic 1 = “religion”
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
97. lda2vec
Let’s make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 +…
topic 1 = “religion”
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
topic 2 = “politics”
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
the (purple) a, b coefficients tell you how much of each topic it is
98. lda2vec
Let’s make vDOC into a mixture…
vDOC = 10% religion + 89% politics +…
topic 2 = “politics”
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
topic 1 = “religion”
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
Doc is now 10% religion 89% politics
mixture models are powerful for interpretability
99. lda2vec
Let’s make vDOC sparse
[ -0.75, -1.25, …]
vDOC = a vreligion + b vpolitics +…
Now, the first time I did this…
it was hard to interpret. What does -1.2 politics mean? The math works, but it’s not intuitive.
100. lda2vec
Let’s make vDOC sparse
vDOC = a vreligion + b vpolitics +…
How much of this doc is in religion, how much in politics?
but this doesn’t work when you have more than a few topics
101. lda2vec
Let’s make vDOC sparse
vDOC = a vreligion + b vpolitics +…
How much of this doc is in religion, how much in cars
but this doesn’t work when you have more than a few topics
102. lda2vec
Let’s make vDOC sparse
{a, b, c…} ~ dirichlet(alpha)
vDOC = a vreligion + b vpolitics +…
a trick we can steal from Bayesian methods:
make it Dirichlet
skipping the technical details, it
makes everything sum to 100%,
penalizes non-zero entries,
and forces the model to make a weight non-zero only with lots of evidence
103. lda2vec
Let’s make vDOC sparse
{a, b, c…} ~ dirichlet(alpha)
vDOC = a vreligion + b vpolitics +…
It has a sparsity-inducing effect,
similar to the lasso or L1 regularization, but Dirichlet:
few non-zero dimensions, and they sum to 100%.
I can say to the CEO: this set of docs could have been spread over 100 topics, but we picked only the best topics.
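A sketch of that prior term, continuing the doc_weights / topic-proportion arrays from the earlier sketch (the exact form and sign convention in the lda2vec code may differ):

```python
import numpy as np

def dirichlet_log_prior(doc_weights, alpha=0.7, lam=1.0):
    """Log Dirichlet density (up to a constant) of each document's topic proportions.
    Add it to the objective you maximize (or subtract it from the loss);
    with alpha < 1 it rewards putting most of the mass on just a few topics."""
    w = doc_weights - doc_weights.max(axis=1, keepdims=True)
    p = np.exp(w)
    p /= p.sum(axis=1, keepdims=True)
    return lam * ((alpha - 1.0) * np.log(p + 1e-12)).sum()
```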
104. word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC)
The goal:
Use all of this context to learn
interpretable topics.
@chrisemoody
this document is
80% high fashion
this document is
60% style
Going back to our problem: lda2vec is going to use all the info here.
105. word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC + vZIP)
The goal:
Use all of this context to learn
interpretable topics.
@chrisemoody
adding a column = adding a term,
like adding features in an ML model
106. word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC + vZIP)
The goal:
Use all of this context to learn
interpretable topics.
this zip code is
80% hot climate
this zip code is
60% outdoors wear
@chrisemoody
in addition to doc topics, like ‘rec SF’
107. word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC + vZIP + vCLIENTS)
The goal:
Use all of this context to learn
interpretable topics.
this client is
80% sporty
this client is
60% casual wear
@chrisemoody
client topics: sporty, casual.
This is where, if she writes ‘3rd trimester’, we identify a future mother,
and ‘scrubs’ points to medicine.
108. word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC + vZIP + vCLIENTS)
P(sold | vCLIENTS)
The goal:
Use all of this context to learn
interpretable topics.
@chrisemoody
Can also make the topics
supervised so that they predict
an outcome.
This helps fine-tune topics so that they correlate with your favorite business metric,
aligns topics with expectations,
and helps us guess, when revenue goes up, what the leading causes are.
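A sketch of the supervised piece (shapes and coefficients are made up, not the Stitch Fix model): a logistic layer on the client’s topic proportions, so the same proportions that explain her words also predict an outcome like ‘sold’.

```python
import numpy as np

def p_sold(client_topic_props, coef, bias=0.0):
    """P(sold | vCLIENTS): logistic regression on the client topic proportions."""
    z = client_topic_props @ coef + bias
    return 1.0 / (1.0 + np.exp(-z))

# e.g. a client who is 80% sporty and 20% casual, with made-up coefficients
print(p_sold(np.array([0.8, 0.2]), coef=np.array([1.5, -0.4])))
```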
110. “PS! Thank you for such an awesome idea”
@chrisemoody
doc_id=1846
Can we model topics to sentences?
lda2lstm
Stitch Fix is all about mixing cutting-edge algorithms, but we absolutely need interpretability; the human component of our algorithms is not negotiable.
Could we demand the model make us a sentence that is 80% religion, 10% politics?
classify word level, LSTM on sentence, LDA on document level
111. “PS! Thank you for such an awesome idea”
@chrisemoody
doc_id=1846
Can we represent the internal LSTM
states as a dirichlet mixture?
Dirichlet-squeeze the internal states and manipulations; maybe that will help us understand the science of LSTM dynamics, because seriously, WTF is going on in there.
112. Can we model topics to sentences?
lda2lstm
“PS! Thank you for such an awesome idea”doc_id=1846
@chrisemoody
Can we model topics to images?
lda2ae
TJ Torres
Can we also extend this to image generation? TJ is working on a ridiculous VAE/GAN model… can we throw in a topic
model? Can we say make me an image that is 80% sweater, and 10% zippers, and 10% elbow patches?
119. Crazy Approaches
Paragraph Vectors
(Just extend the context window)
Content dependency
(Change the window grammatically)
Social word2vec (deepwalk)
(Sentence is a walk on the graph)
Spotify
(Sentence is a playlist of song_ids)
Stitch Fix
(Sentence is a shipment of five items)
121. CBOW
“The fox jumped over the lazy dog”
Guess the word
given the context
~20x faster.
(this is the alternative.)
vOUT
vIN vIN vIN vIN
vIN vIN
SkipGram
“The fox jumped over the lazy dog”
vOUT vOUT
vIN
vOUT vOUT vOUT vOUT
Guess the context
given the word
Better at syntax.
(this is the one we went over)
CBOW sums the context word vectors and loses the word order in the sentence
Both are good at semantic relationships
Child and kid are nearby
Or gender in man, woman
If you blur words over the scale of the context (5-ish words), you lose a lot of grammatical nuance
But skipgram preserves order
Preserves the relationship in pluralizing, for example
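In gensim (4.x) the two flavors are one flag apart; a toy sketch:

```python
from gensim.models import Word2Vec

sentences = [["the", "fox", "jumped", "over", "the", "lazy", "dog"]]

# sg=1: skip-gram (guess the context given the word)
skipgram = Word2Vec(sentences, sg=1, vector_size=50, window=5, min_count=1)
# sg=0: CBOW (guess the word given the summed context)
cbow = Word2Vec(sentences, sg=0, vector_size=50, window=5, min_count=1)
```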
122. Shows that the many words similar to ‘vacation’ actually come in lots of flavors
— wedding words (bachelorette, rehearsals)
— holiday/event words (birthdays, brunch, christmas, thanksgiving)
— seasonal words (spring, summer)
— trip words (getaway)
— destinations
127. What I didn’t mention
A lot of text (you only need a lot if you have a specialized vocabulary)
Cleaning the text
Memory & performance
Traditional databases aren’t well-suited
False positives
You need hundreds of millions of words: 1,000 books, 500,000 comments, or 4,000,000 tweets.
You’ll want a high-memory, high-performance multicore machine.
Training can take several hours to several days, but shouldn't need frequent retraining.
If you use pretrained vectors, then this isn't an issue.
Databases. Modern SQL systems aren't well-suited to the vector addition, subtraction, and multiplication that searching in vector space requires. There are a few libraries that will help you quickly find the most similar items: annoy, ball trees, locality-sensitive hashing (LSH), or FLANN.
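For the similarity-search point, a minimal annoy sketch (random vectors stand in for real word vectors):

```python
import numpy as np
from annoy import AnnoyIndex

dim = 300
vectors = np.random.randn(10_000, dim)     # stand-in for real word vectors

index = AnnoyIndex(dim, "angular")         # angular distance ~ cosine similarity
for i, vec in enumerate(vectors):
    index.add_item(i, vec)
index.build(10)                            # 10 trees

nearest = index.get_nns_by_item(0, 10)     # the 10 items most similar to item 0
```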
False-positives & exactness. Despite the impressive results that come with word vectorization, no NLP technique is perfect.
Take care that your system is robust to results that a computer deems relevant but an expert human wouldn't.
129. All of the following ideas will change what
‘words’ and ‘context’ represent.
But we’ll still use the same w2v algo
130. paragraph vector
What about summarizing documents?
On the day he took office, President Obama reached out to America’s enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
131. On the day he took office, President Obama reached out to America’s enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
The framework nuclear agreement he reached with Iran on Thursday did not provide
the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist
Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.
paragraph vector
Normal skipgram extends C words before, and C words after.
IN
OUT OUT
Except we stay inside a sentence
132. On the day he took office, President Obama reached out to America’s enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
The framework nuclear agreement he reached with Iran on Thursday did not provide
the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist
Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.
paragraph vector
A document vector simply extends the context to the whole document.
IN
OUT OUT
OUT OUT doc_1347
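This is essentially what gensim’s Doc2Vec implements; a toy sketch (gensim 4.x):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["the", "fox", "jumped", "over", "the", "lazy", "dog"],
                       tags=["doc_1347"])]
model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, epochs=20)
print(model.dv["doc_1347"])   # the learned document vector
```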
134. translation
(using just a rotation
matrix)
Mikolov 2013
English → Spanish, via a matrix rotation
Blows my mind
Explain plot
Not a complicated NN here
Still have to learn the rotation matrix — but it generalizes very nicely.
Have analogies for every linalg op as a linguistic operator: + and - and matrix multiplies
Robust framework and new tools to do science on words
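A sketch of the translation-matrix idea with made-up arrays: fit one linear map from English vectors to Spanish vectors for known word pairs, then apply it to new words (a least-squares fit here; constraining it to a pure rotation would use orthogonal Procrustes):

```python
import numpy as np

# stand-ins for English/Spanish vectors of the same dictionary word pairs
X = np.random.randn(5000, 300)   # English word vectors
Y = np.random.randn(5000, 300)   # Spanish word vectors

W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # minimize ||X @ W - Y||

def translate(v_english):
    # the nearest Spanish word vector to this point is the proposed translation
    return v_english @ W
```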
139. context dependent
Levy & Goldberg 2014
Also show that SGNS is simply factorizing:
w * c = PMI(w, c) - log k
This is completely amazing!
Intuition: positive associations (canada, snow)
stronger in humans than negative associations
(what is the opposite of Canada?)
It also means we can do SVD-like techniques to get a convex w2v; it uses fast linear algebra libraries and a compressed word-count matrix, so storage is better too… but it’s not online.
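A rough sketch of that route, assuming you already have a word-by-context co-occurrence count matrix: build the shifted positive PMI matrix and factor it with an SVD.

```python
import numpy as np

def svd_word_vectors(counts, k=5, dim=100):
    """counts: (n_words, n_contexts) co-occurrence counts.
    Factor max(PMI - log k, 0) with an SVD to get dense word vectors."""
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total
    pc = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (pw * pc))
    sppmi = np.maximum(pmi - np.log(k), 0.0)     # shifted positive PMI
    sppmi[~np.isfinite(sppmi)] = 0.0
    U, S, _ = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])
```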
140. deepwalk
Perozzi et al. 2014
learn word vectors from
sentences
“The fox jumped over the lazy dog”
vOUT vOUT vOUT vOUT vOUTvOUT
‘words’ are graph vertices
‘sentences’ are random walks on the
graph
word2vec
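A toy deepwalk-style sketch: random walks on a graph become ‘sentences’ and go straight into word2vec (networkx graph, gensim 4.x):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

graph = nx.karate_club_graph()

def random_walk(g, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return [str(node) for node in walk]       # node ids become 'words'

walks = [random_walk(graph, n) for n in graph.nodes() for _ in range(10)]
model = Word2Vec(walks, sg=1, vector_size=64, window=5, min_count=1)
```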
150. A specific lda2vec model
Our text blob is a comment that comes from a region_id and a style_id
159. Can measure similarity between topic vectors m and n, and word vectors w
This gets you the ‘top’ words in a topic, can figure out what that topic is
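A sketch of that measurement with hypothetical arrays: cosine similarity between a topic vector and every word vector, then take the top k.

```python
import numpy as np

def top_words(topic_vec, word_vecs, vocab, k=10):
    """The k words whose vectors are most similar (by cosine) to a topic vector."""
    wv = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    tv = topic_vec / np.linalg.norm(topic_vec)
    sims = wv @ tv
    return [vocab[i] for i in np.argsort(-sims)[:k]]
```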
161. lda2vec
Let’s make vDOC into a mixture…
vDOC = 10% religion + 89% politics +…
topic 2 = “politics”
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
topic 1 = “religion”
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
This is now on the 20 newsgroups dataset…
Doc is now 10% religion 89% politics
mixture models are powerful for interpretability