Many people think data science is like a Kaggle competition; there are, however, big differences in approach. This presentation is about carefully designing your evaluation scheme to avoid overfitting and unexpected production performance.
This talk is composed of three major parts: the iterative creation of a recommender engine, the labeling of images, and the post-processing of images.
After introducing the main topic, labeling images to improve recommendation engine performance, we start with a discussion of recommendation engines. We briefly describe the “classical” recommender systems (collaborative filtering, content-based filtering) along with their advantages and limitations. We then describe the re-ranking approach we used to combine different engines into one. Re-ranking is a method (used by Google, for example) that takes the different rankings as features and optimizes a certain loss. In our case we combine our different recommendations through a logistic regression that predicts the probability of purchase for each (user, sale) tuple. This version of the engine led to +7% revenue per customer and is now running in production.
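As a minimal sketch of the re-ranking idea (synthetic data, hypothetical engine names; not the production code), a logistic regression can take the scores of several recommenders as features and predict the purchase probability used for the final ranking:

```python
# Illustrative re-ranking sketch: scores from two hypothetical recommenders
# become features of a logistic regression predicting purchase probability
# per (user, sale) pair. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Hypothetical scores from a collaborative-filtering and a content-based engine
cf_score = rng.random(n)
cb_score = rng.random(n)
# Synthetic purchase labels, loosely correlated with both scores
purchased = ((0.6 * cf_score + 0.4 * cb_score
              + 0.1 * rng.standard_normal(n)) > 0.5).astype(int)

X = np.column_stack([cf_score, cb_score])
reranker = LogisticRegression().fit(X, purchased)

# Re-rank candidates by predicted purchase probability, best first
proba = reranker.predict_proba(X)[:, 1]
ranking = np.argsort(-proba)
```

The point of the combination is that the meta-model learns how much to trust each engine, rather than relying on hand-tuned weights.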
We then explain why we wanted to use image information. It seemed that sales with certain images were performing better than others. If we had labels on all images, we could use them in a content-based recommender system (itself used in the re-ranking engine). We then describe how to label our images using pre-trained models, transfer learning, and external APIs. We also show how easy it is to steal these APIs.
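To illustrate the transfer-learning pattern without downloading an actual pre-trained network, the sketch below stands in for a frozen CNN with a fixed random projection; only the small classification head is trained. All names and data here are invented for illustration:

```python
# Transfer-learning sketch under a strong simplification: a fixed random
# projection plays the role of frozen pre-trained CNN features; a small
# logistic-regression head is trained on top with modest labeled data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def pretrained_embed(images, proj):
    """Stand-in for frozen CNN features (e.g. the penultimate layer)."""
    flat = images.reshape(len(images), -1)
    return np.tanh(flat @ proj)

# Fake "images": two classes with different mean pixel intensity
imgs = np.concatenate([rng.normal(0.2, 0.1, (50, 8, 8)),
                       rng.normal(0.8, 0.1, (50, 8, 8))])
labels = np.array([0] * 50 + [1] * 50)

proj = rng.standard_normal((64, 16)) / 8.0   # frozen "pre-trained" weights
features = pretrained_embed(imgs, proj)

# Only the small head is trained -- the cheap part of transfer learning
head = LogisticRegression().fit(features, labels)
accuracy = head.score(features, labels)
```

In practice the embedding would come from a network pre-trained on a large open dataset, which is what lets a team without deep learning expertise still benefit from state-of-the-art models.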
The final part deals with post-processing of the images. Since most pre-trained models output only one class prediction, we need to reshape these into broad themes that can be used in our engine. We use non-negative matrix factorization (NMF) for this purpose and show that the results are very interpretable. We conclude by visually comparing the different engines.
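A minimal NMF sketch of this post-processing step, on a synthetic image-by-class score matrix with planted themes (all sizes and names are illustrative):

```python
# NMF sketch: rows are images, columns are the fine-grained classes a
# pre-trained model predicted; NMF groups them into a few broad themes.
# The matrix below is synthetic, with three planted themes.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
n_images, n_classes, n_themes = 100, 20, 3

basis = rng.random((n_themes, n_classes)) ** 3        # sparse-ish class groups
weights = rng.random((n_images, n_themes))
scores = weights @ basis + 0.01 * rng.random((n_images, n_classes))

model = NMF(n_components=n_themes, init="nndsvda", random_state=0, max_iter=500)
image_themes = model.fit_transform(scores)   # (n_images, n_themes)
theme_classes = model.components_            # (n_themes, n_classes)

# Most representative classes per theme -- the interpretable part
top_classes = np.argsort(-theme_classes, axis=1)[:, :5]
```

Because both factors are non-negative, each theme reads as an additive mixture of classes, which is what makes the decomposition easy to inspect by eye.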
The key takeaways (more information in the pitch part) are these:
- Machine learning: an overview of recommender systems, re-ranking, image labeling, and transfer learning.
- Do iterative data science: start simple, then try more complex systems.
- Avoid rushing into deep learning without checking what you can find on the Internet; use pre-trained models and transfer learning.
There is a lot of hype around deep learning and image recognition. However, there are not that many success stories among pure-play web companies. In our case, we explain how we started with simple recommender systems before improving them gradually and finally using image information.
One of the key takeaways is this: do iterative data science. Always prefer shipping a minimum viable product before building something complex. At our clients, we commonly see teams rushing into image projects for the sole purpose of doing deep learning, without a clear ROI in mind.
We insist on the fact that deep learning is not an end in itself. Here, it boils down to making new information available in the system. In this sense, deep learning methods are just an extension of Business Intelligence.
Beyond Churn Prediction: An Introduction to Uplift Modeling - Pierre Gutierrez
These slides are from a talk I gave at the PAPIs conference in Boston in 2016. The main subject is uplift modelling. Starting from a churn-model approach for an e-gaming company, we introduce when to apply uplift methods, how to model them mathematically, and finally how to evaluate them.
I tried to bridge the gap between causal inference theory and uplift theory, especially concerning how to properly cross-validate the results. The notation used is that of uplift modelling.
From Labelling Open Data Images to Building a Private Recommender System - Pierre Gutierrez
Recommender systems are paramount for e-business companies. There is an increasing need to take all user information into account to tailor the best product proposition. One piece of that information is the content the user actually sees: the product visual.
When it comes to hostels, some people may be more attracted by pictures of the room, the building, or even the nearby beach.
In this talk, we will describe how we improved an e-business vacation retailer's recommender system using the content of images. We'll explain how to leverage open datasets and pre-trained deep learning models to derive user taste information. This transfer learning approach enables companies to use state-of-the-art machine learning methods without having deep learning expertise.
This document discusses how to model customer churn through machine learning. It defines churn as customers leaving or stopping usage. There are two types of churn: in subscription models, leaving can be clearly defined, while in non-subscription models, leaving must be approximated. The document recommends predicting churn through classification models to identify potential churners, using customer behavioral and profile features over time. It also discusses evaluating models on validation data and using models to predict future churn and inform retention offers.
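A toy version of the classification approach described above, on synthetic behavioral features (the feature names and the label-generating rule are invented for illustration):

```python
# Illustrative churn-classification sketch on synthetic data: behavioral
# features over an observation window predict churn in the following period,
# evaluated on a held-out validation split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n = 500
sessions_last_30d = rng.poisson(5, n)
days_since_last_visit = rng.exponential(10, n)
tenure_months = rng.integers(1, 36, n)

# Synthetic label: inactivity raises churn probability
churn_prob = 1 / (1 + np.exp(-(0.15 * days_since_last_visit
                               - 0.4 * sessions_last_30d)))
churned = (rng.random(n) < churn_prob).astype(int)

X = np.column_stack([sessions_last_30d, days_since_last_visit, tenure_months])
X_train, X_val, y_train, y_val = train_test_split(
    X, churned, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
val_auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
```

The held-out split is the important habit here: churn scores are only useful for retention offers if their ranking quality has been checked on data the model never saw.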
This document summarizes a presentation about personalizing artwork selection on Netflix using multi-armed bandit algorithms. Bandit algorithms were applied to choose representative, informative and engaging artwork for each title to maximize member satisfaction and retention. Contextual bandits were used to personalize artwork selection based on member preferences and context. Netflix deployed a system that precomputes personalized artwork using bandit models and caches the results to serve images quickly at scale. The system was able to lift engagement metrics based on A/B tests of the personalized artwork selection models.
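As a much-simplified illustration of the bandit idea (a non-contextual epsilon-greedy policy over made-up Bernoulli click-through rates, not Netflix's actual contextual-bandit system):

```python
# Toy epsilon-greedy bandit: pick the artwork with the best observed
# click rate most of the time, explore a random one occasionally.
# The true click-through rates below are invented for the example.
import random

random.seed(0)
true_ctr = {"artwork_a": 0.05, "artwork_b": 0.12, "artwork_c": 0.08}
counts = {a: 0 for a in true_ctr}
clicks = {a: 0 for a in true_ctr}
epsilon = 0.1

def choose():
    if random.random() < epsilon:                      # explore
        return random.choice(list(true_ctr))
    # exploit: highest observed click rate so far
    return max(true_ctr, key=lambda a: clicks[a] / counts[a] if counts[a] else 0.0)

for _ in range(20000):
    arm = choose()
    counts[arm] += 1
    clicks[arm] += random.random() < true_ctr[arm]     # simulated impression

best = max(counts, key=counts.get)                     # most-served artwork
```

A contextual bandit replaces the single per-arm estimate with a model of reward given member features, but the explore/exploit trade-off is the same.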
H2O World - Top 10 Data Science Pitfalls - Mark LandrySri Ambati
H2O World 2015 - Mark Landry
Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
The document discusses various machine learning concepts like model overfitting, underfitting, missing values, stratification, feature selection, and incremental model building. It also discusses techniques for dealing with overfitting and underfitting like adding regularization. Feature engineering techniques like feature selection and creation are important preprocessing steps. Evaluation metrics like precision, recall, F1 score and NDCG are discussed for classification and ranking problems. The document emphasizes the importance of feature engineering and proper model evaluation.
Machine Learning: Business Perspective - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Machine Learning has become a must to improve insight, quality and time to market. But it's also been called the 'high interest credit card of technical debt' with challenges in managing both how it's applied and how its results are consumed.
The document discusses clustering and nearest neighbor algorithms for deriving knowledge from data at scale. It provides an overview of clustering techniques like k-means clustering and discusses how they are used for applications such as recommendation systems. It also discusses challenges like class imbalance that can arise when applying these techniques to large, real-world datasets and evaluates different methods for addressing class imbalance. Additionally, it discusses performance metrics like precision, recall, and lift that can be used to evaluate models on large datasets.
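A quick k-means sketch on two synthetic, well-separated blobs, as a stand-in for the clustering examples the slides cover:

```python
# Minimal k-means example: cluster two synthetic 2-D blobs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
blob_a = rng.normal(0.0, 0.5, (100, 2))   # cluster around the origin
blob_b = rng.normal(5.0, 0.5, (100, 2))   # cluster around (5, 5)
points = np.vstack([blob_a, blob_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_
```

Real datasets are rarely this clean; the class-imbalance and evaluation issues the document raises are exactly what separates this toy case from practice.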
This document outlines an agenda for a data science boot camp covering various machine learning topics over several hours. The agenda includes discussions of decision trees, ensembles, random forests, data modelling, and clustering. It also provides examples of data leakage problems and discusses the importance of evaluating model performance. Homework assignments involve building models with Weka and identifying the minimum attributes needed to distinguish between red and white wines.
This document appears to be lecture slides for a course on deriving knowledge from data at scale. It covers many topics related to building machine learning models including data preparation, feature selection, classification algorithms like decision trees and support vector machines, and model evaluation. It provides examples applying these techniques to a Titanic passenger dataset to predict survival. It emphasizes the importance of data wrangling and discusses various feature selection methods.
How to Perform Churn Analysis for Your Mobile Application? - Tatvic Analytics
For every mobile application marketer, acquiring new customers requires more effort in terms of time and money. On the other hand, a firm can always focus on maintaining its existing customer base and getting the maximum out of it. If this is the case, predictive analysis is the correct approach.
The primary goal of this webinar is to predict the segments of mobile application users who:
* will uninstall the app, or
* will remain inactive for quite a long time (such users are also termed churners) and are expected to churn.
Churn analysis is the approach by which we predict the likelihood of this event occurring.
Our webinar covers:
* How to extract data from Google Analytics using R
* How to build churn model in R
* Identifying the customer/subscriber segments, classified from past data patterns, that are likely to churn (studying customer behavior patterns)
Watch Full Webinar - http://www.tatvic.com/webinar/churn-analysis-for-mobile-application/
This document discusses the past, present, and future of machine learning. It outlines how machine learning has evolved from early attempts at neural networks and expert systems to today's deep learning techniques powered by large datasets and distributed computing. The document argues that machine learning and predictive analytics will be core capabilities that impact many industries and applications going forward, including personalized insurance, fraud detection, equipment monitoring, and more. Intelligence from machine learning will become "ambient" and help solve hard problems by extracting value from big data.
The document discusses an agenda for a lecture on deriving knowledge from data at scale. The lecture will include a course project check-in, a thought exercise on data transformation, and a deeper dive into ensembling techniques. It also provides tips on gaining experience and intuition for data science, including becoming proficient in tools, deeply understanding algorithms, and focusing on specific data types through hands-on practice of experiments. Attribute selection techniques like filters, wrappers and embedded methods are also covered. Finally, the document discusses support vector machines and handling missing values in data.
Innovation in technology has revolutionized financial services to such an extent that large financial institutions like Goldman Sachs claim to be technology companies! It is no secret that technological innovations like data science and AI are fundamentally changing how financial products are created, tested, and delivered. While it is exciting to learn about the technologies themselves, there is very little guidance available on how companies and financial professionals should retool and gear themselves for the upcoming revolution.
In this master class, we will discuss key innovations in data science and AI and connect these novel fields to applications in forecasting and optimization. Through case studies and examples, we will demonstrate why now is the time to invest in learning about the topics that will reshape the financial services industry of the future!
Topic
- Frontier topics in Optimization
This document discusses various techniques for machine learning when labeled training data is limited, including semi-supervised learning approaches that make use of unlabeled data. It describes assumptions like the clustering assumption, low density assumption, and manifold assumption that allow algorithms to learn from unlabeled data. Specific techniques covered include clustering algorithms, mixture models, self-training, and semi-supervised support vector machines.
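A compact self-training sketch, one of the semi-supervised techniques mentioned (1-D Gaussian classes with only ten labeled points; the confidence threshold and sizes are arbitrary):

```python
# Self-training sketch: fit on the few labeled points, then absorb the
# unlabeled points the model is most confident about, and repeat.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
# Two 1-D Gaussian classes; only 10 of 200 points start out labeled
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)]).reshape(-1, 1)
y_true = np.array([0] * 100 + [1] * 100)
labeled = np.zeros(200, dtype=bool)
labeled[:5] = True
labeled[100:105] = True

y = np.where(labeled, y_true, -1)   # -1 marks "unlabeled"

for _ in range(5):                  # a few self-training rounds
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X)[:, 1]
    confident = (~labeled) & ((proba > 0.95) | (proba < 0.05))
    if not confident.any():
        break
    y[confident] = (proba[confident] > 0.5).astype(int)   # pseudo-labels
    labeled |= confident

accuracy = (clf.predict(X) == y_true).mean()
```

This works here because the low-density assumption holds: the two classes are separated by a sparse region, so confident pseudo-labels are mostly correct.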
This document provides an introduction to machine learning, including:
- It discusses how the human brain learns to classify images and how machine learning systems are programmed to perform similar tasks.
- It provides an example of image classification using machine learning and discusses how machines are trained on sample data and then used to classify new queries.
- It outlines some common applications of machine learning in areas like banking, biomedicine, and computer/internet applications. It also discusses popular machine learning algorithms like Bayes networks, artificial neural networks, PCA, SVM classification, and K-means clustering.
"You can't just turn the crank": Machine learning for fighting abuse on the c... - David Freeman
Fighting fake registrations, phishing, spam and other types of abuse on the consumer web appears at first glance to be an application tailor-made for machine learning: you have lots of data and lots of features, and you are looking for a binary response (is it an attack or not) on each request. However, building machine learning systems to address these problems in practice turns out to be anything but a textbook process. In particular, you must answer such questions as:
- How do we obtain quality labeled data?
- How do we keep models from "forgetting the past"?
- How do we test new models in adversarial environments?
- How do we stop adversaries from learning our classifiers?
In this talk I will explain how machine learning is typically used to solve abuse problems, discuss these and other challenges that arise, and describe some approaches that can be implemented to produce robust, scalable systems.
DutchMLSchool. Logistic Regression, Deepnets, Time Series - BigML, Inc
DutchMLSchool. Logistic Regression, Deepnets, and Time Series (Supervised Learning II) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
The document outlines Scott Triglia's recommendations for building an initial recommender system at Yelp. It recommends focusing on solving the specific retrieval problem, building for the available infrastructure and team size, and creating a good product rather than beating benchmarks. The proposed system uses multiple experts that each handle a single recommendation reason, like liked businesses from friends. The experts' suggestions are efficiently searched and combined to produce the final results. Future plans include adding more context and personalized ranking.
Anatomy of an Application: Machine Learning End-to-End - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
RecSys Challenge 2016: Job Recommendation Based on... - Vasily Leksin
These slides describe our solution to the RecSys Challenge 2016. In the challenge, several datasets were provided from XING, a social network for business. The goal of the competition was to use these data to predict job postings that a user would interact with positively (click, bookmark, or reply). Our solution includes three different types of models: a factorization machine, item-based collaborative filtering, and a content-based topic model on tags. Thus, we combined collaborative and content-based approaches in our solution.
Our best submission, a blend of ten models, achieved 7th place on the challenge's final leaderboard with a score of 1677898.52. The approaches presented here are general and scalable, and can therefore be applied to other problems of this type.
The document provides guidance on building an end-to-end machine learning project to predict California housing prices using census data. It discusses getting real data from open data repositories, framing the problem as a supervised regression task, preparing the data through cleaning, feature engineering, and scaling, selecting and training models, and evaluating on a held-out test set. The project emphasizes best practices like setting aside test data, exploring the data for insights, using pipelines for preprocessing, and techniques like grid search, randomized search, and ensembles to fine-tune models.
To download please go to: http://www.intelligentmining.com/knowledge-base.html
Slides as presented by Alex Lin to the NYC Predictive Analytics Meetup group: http://www.meetup.com/NYC-Predictive-Analytics/ on April 1, 2010 (no joke!) :)
The ACM RecSys Challenge 2016 focused on the problem of job recommendations: given a user, return a ranked list of jobs the user is likely to be interested in. More than 100 teams actively participated and submitted solutions. All the winning teams used an ensemble of recommender strategies (e.g., learning-to-rank approaches, matrix factorization techniques, etc.). More details: http://2016.recsyschallenge.com/
Traffic and Market Report – On the Pulse of the Networked Society - Ericsson... - Ericsson France
Global mobile subscriptions reached 6.2 billion in Q1 2012 and are expected to grow to around 9 billion by 2017. Mobile broadband subscriptions grew 60% year-over-year in Q1 2012 and are predicted to reach 5 billion by 2017. Mobile data traffic is expected to grow 15 times by 2017 due to increased smartphone and mobile broadband device usage. Regional trends show Asia Pacific having the largest growth in subscriptions while North America is transitioning to LTE earlier than other regions.
Customer churn is not just an abstract figure in reports. It means lost revenue, negative customer reviews, and low morale among your employees.
- Why do customers leave?
- How do you diagnose customer churn?
- What countermeasures exist?
- How do you estimate the damage from losing customers?
- Is it possible to win back departed customers?
Learn how to reduce churn and work more effectively with your customers.
How Coyote Systems uses Dataiku's Data Science Studio to optimi... - Le_GFII
A talk by Hugo Le Squeren, Sales Engineer at Dataiku, and Florian Servaux, Project Manager at Coyote.
DIXIT seminar: The new frontiers of "data intelligence": content analytics, machine learning, and predictive analytics.
Abstract: As in the telecom business, the COYOTE model is subscription-based. As such, retention of the subscriber base is a key success factor. To optimize its retention actions and increase customer knowledge, COYOTE, in partnership with DATAIKU, cross-referenced the various data sources at its disposal. The result is predictive analysis of customer behavior.
Source: http://www.gfii.fr/fr/document/seminaire-dixit-les-nouvelles-frontieres-de-la-data-intelligence-content-analytics-machine-learning-predictif
Dataiku, Pitch Data Innovation Night, Boston, September 16th - Dataiku
The document discusses how Dataiku aims to help data scientists focus on real problems by providing a ready-to-use data science studio platform. The platform offers visual and interactive data preparation tools for data cleaning, guided machine learning for non-ML experts, and production-ready models and insights. Dataiku was founded in 2013 to make data science accessible to anyone by handling real-life data challenges through a common and democratic data science environment.
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013 - Dataiku
Our pitch at Data-Driven NYC meetup on September 17th (http://datadrivennyc.com).
Speaking about data scientists' pains and how Dataiku Data Science Studio can help them be more than data cleaners and data leak fixers!
These slides are from a talk I gave at Google Campus Madrid for the Machine Learning Meetup. The main subject is uplift modelling. Starting from a churn model approach for an e-gaming company, we introduce when to apply uplift methods, how to mathematically model them, and finally, how to evaluate them.
Machine learning and Internet of Things, the future of medical preventionPierre Gutierrez
Title:
"Machine learning and Internet of Things, the future of medical prevention"
Abstract:
In this talk, Pierre Gutierrez, a data scientist at Dataiku, will discuss Dataiku's experiences using machine learning on IOT data. We will talk about the challenges processing and cleaning IoT data, and how to successfully train a model that can be deployed in production. We will illustrate our talk with two examples from our previous work. Creating algorithm for early epilepsy seizure detection based on wearable tech and Detecting people activity through sensor data.
This document proposes three options for renovating a corporate fitness center. Option 1 costs $149,962 and includes Cybex cardio and strength equipment plus a large functional training cage. Option 2 costs $149,596 and is similar but with a smaller training cage. Option 3 costs $146,752 and replaces some Cybex machines with less expensive options but adds more functional training accessories. The proposals include layout diagrams, equipment specifications, and options for financing.
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
Many organisations are creating groups dedicated to data. These groups have many names : Data Team, Data Labs, Analytics Teams….
But whatever the name, the success of those teams depends a lot on the quality of the data infrastructure and their ability to actually deploy data science applications in production.
In that regards a new role of “DataOps” is emerging. Similar, to Dev Ops for (Web) Dev, the Data Ops is a merge between a data engineer and a platform administrator. Well versed in cluster administration and optimisation, a data ops would have also a perspective on the quality of data quality and the relevance of predictive models.
Do you want to be a Data Ops ? We’ll discuss its role and challenges during this talk
Livre Blanc Attribution Management : entre technologie, marketing et statistiqueConverteo
Converteo et Adversitement, sociétés de conseil et services spécialisées en mesure et amélioration de l’efficacité marketing, ont conjointement élaboré ce livre blanc pour éclairer les annonceurs sur une discipline souvent évoquée dans le monde de l’investissement media mais encore mal maîtrisée : l’Attribution Management.
Each month, join us as we highlight and discuss hot topics ranging from the future of higher education to wearable technology, best productivity hacks and secrets to hiring top talent. Upload your SlideShares, and share your expertise with the world!
Not sure what to share on SlideShare?
SlideShares that inform, inspire and educate attract the most views. Beyond that, ideas for what you can upload are limitless. We’ve selected a few popular examples to get your creative juices flowing.
SlideShare is a global platform for sharing presentations, infographics, videos and documents. It has over 18 million pieces of professional content uploaded by experts like Eric Schmidt and Guy Kawasaki. The document provides tips for setting up an account on SlideShare, uploading content, optimizing it for searchability, and sharing it on social media to build an audience and reputation as a subject matter expert.
DataEngConf SF16 - Three lessons learned from building a production machine l...Hakka Labs
This document discusses three lessons learned from building machine learning systems at Stripe.
1. Don't treat models as black boxes. Early on, Stripe focused only on training with more data and features without understanding algorithms, results, or deeper reasons behind results. This led to overfitting. Introspecting models using "score reasons" helped debug issues.
2. Have a plan for counterfactual evaluation before production. Stripe's validation results did not predict poor production performance because the environment changed. Counterfactual evaluation using A/B testing with probabilistic reversals of block decisions allows estimating true precision and recall.
3. Invest in production monitoring of models. Monitoring inputs, outputs, action rates, score
This document discusses automated testing and different levels of testing. It explains that automated tests should be written to achieve good design, clarify how a system works, understand the system, and minimize risks. While automated tests require time to write, maintain, and execute, these efforts can be minimized by making tests easy to run and maintain with minimal dependencies. The document also discusses different perspectives on testing, including what developers and customers want. Developers are concerned with complexity and risk, while customers care about functionality and user scenarios. It notes that while unit tests have limitations, acceptance tests are also limited as they are slow, fragile, and not isolated. The "lost layer" of the testing pyramid is also mentioned - the presentation, service, and persistence
From science to engineering, the process to build a machine learning productBruce Kuo
This document discusses the process of developing a machine learning product from science to engineering. It begins with defining the business problem and objectives, then researching potential machine learning solutions through experimentation. Next, it covers evaluating solutions offline and defining metrics before integrating the model. Engineering aspects like serialization, APIs, pipelines and monitoring are also discussed. The goal is to share an overview of a machine learning project lifecycle and highlight connections between business needs and technical implementation.
This was my presentation at the World Congress for Project Managers and Business Analysts in Orlando (2013). While the title/teaser was simply playing with our fascination with the book "From good to great", the subject was very serious and pragmatic: using process simulations to learn, understand, explore, and ultimately decide.
Drifting Away: Testing ML Models in ProductionDatabricks
Deploying machine learning models has become a relatively frictionless process. However, properly deploying a model with a robust testing and monitoring framework is a vastly more complex task. There is no one-size-fits-all solution when it comes to productionizing ML models, oftentimes requiring custom implementations utilising multiple libraries and tools. There are however, a set of core statistical tests and metrics one should have in place to detect phenomena such as data and concept drift to prevent models from becoming unknowingly stale and detrimental to the business.
Combining our experiences from working with Databricks customers, we do a deep dive on how to test your ML models in production using open source tools such as MLflow, SciPy and statsmodels. You will come away from this talk armed with knowledge of the key tenets for testing both model and data validity in production, along with a generalizable demo which uses MLflow to assist with the reproducibility of this process.
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
Number 2 in the Data Science for Dummies series - We'll predict Titanic survival with Databricks, python and MLSpark.
These are the slides only (excuse the Powerpoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/)
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
This document discusses using a hackathon to explore problems in digital and retail behavior analysis, propensity marketing, and promotion modeling using case studies, analytics strategies, and machine learning/data mining techniques. It provides examples of long tail analysis from search engines, describes visualizing query length data in real-time, and reviews using demographic data and web logs for targeting. The document emphasizes structuring hackathon problems, iterating on visualizations before machine learning, and partnering with retailers rather than just releasing APIs.
Toolkits and tips for UX analytics CRO by Craig SullivanUXPA UK
This document summarizes Craig Sullivan's presentation on UX and conversion rate optimization (CRO). Some of the key topics and tools discussed include: conducting guerrilla usability testing outside the office; using session replay tools to analyze user behavior; leveraging the voice of the customer through surveys and customer feedback; acting like a private investigator to research competitors; and experimenting with split testing tools and analytics to optimize conversions and revenue. Specific examples are provided of split testing people in photos and the effectiveness of different images in TV advertising. The presentation emphasizes the importance of blending UX techniques with analytics, testing, and data to create successful, customer-delighted products and services.
How to Apply Machine Learning by Lyft Senior Product ManagerProduct School
This document describes courses offered by Product School to help product managers gain skills in areas like product management, coding, data analytics, digital marketing, UX design, and product leadership. It also provides an overview of a talk on applying machine learning given by a Lyft senior product manager. The talk explains what machine learning is, the different types of machine learning problems, and how product managers can identify opportunities, define problems, and guide machine learning solutions and teams. Examples are provided around replacing cash bail and automating food delivery order disputes.
The document discusses agile estimation and planning techniques. It defines estimation as measuring the effort and time required to complete tasks and user stories. Some benefits of estimation include allowing effective decision making and prioritization. Techniques covered include relative estimation using story points, planning poker for estimating story points, and using velocity to estimate what a team can complete per sprint. The document also discusses agile planning at the release, sprint, and daily levels.
This presentation introduces the concept of Machine Learning and then discusses how Machine Learning is being used in the Predictive Maintenance domain.
Choosing the right process improvement tool for your project.
Learn how an experienced engineer decides when simulation is the right tool for his projects,
and when it isn't.
With the evolution of process improvement software, it can be difficult to decide the right tool for the job. Using something too powerful and complex can be a lengthy and unnecessary process, but underestimating the depth of analysis required and choosing something too simplistic early in a project can result in repeated work later.
The document summarizes a data science project on bank marketing data using various tools in IBM Watson Studio. The project followed a standard methodology of data exploration, feature engineering, model selection, training and evaluation. Random forest, XGBoost, LightGBM and deep learning models were tested. LightGBM performed best with a 95.1% ROC AUC score from AutoAI hyperparameter tuning. The best model was deployed to IBM Watson Machine Learning for production use. Overall, the project demonstrated the effectiveness of the Watson Studio platform and tools in developing performant models from structured data.
The Machine Learning Workflow with AzureIvo Andreev
This document provides an overview of real world machine learning using Azure. It discusses the machine learning workflow including data understanding, preprocessing, feature engineering, model selection, evaluation and tuning. It then describes various Azure machine learning tools for building, testing and deploying machine learning models including Azure ML Workbench, Studio, Experimentation Service and Model Management Service. It concludes with an upcoming demo of predictive maintenance using Azure ML Studio.
Simulation involves developing a model of a real-world system over time to analyze its behavior and performance. The key aspects covered in this document include defining simulation as modeling the operation of a system over time through artificial history generation and observation. Simulation models can be used as analysis and design tools to predict the effects of changes to a system before actual implementation. Discrete event simulation is discussed as a common technique that models systems with state changes occurring at discrete points in time. The document also outlines the steps in a typical simulation study including problem formulation, model conceptualization, experimentation and analysis.
Sample Codes: https://github.com/davegautam/dotnetconfsamplecodes
Presentation on How you can get started with ML.NET. If you are existing .NET Stack Developer and Wanna use the same technology into Machine Learning, this slide focuses on how you can use ML.NET for Machine Learning.
This document discusses how data can be leveraged for product management. It outlines how data can be used as the core of a product, to optimize unit economics by ensuring lifetime value exceeds customer acquisition costs, for marketing optimizations by testing channels and optimizing return on marketing investment, and for product optimizations through A/B testing. However, it notes that most A/B tests are not statistically valid or impactful. It also discusses using data for personalization, including basic personalization triggers not requiring data science. Overall, the document advocates using data to understand customers, test changes, and optimize performance.
Travis Cox, Kathy Applebaum, and Kevin McClusky from Inductive Automation will discuss key concepts and best practices, show demos, and answer questions from the audience, to help you start integrating ML into your day-to-day processes.
Learn more about:
• Practical ways to use ML in your factory or facility
• What you'll need to get started
• Existing ML tools and platforms
• And more
The Power of Auto ML and How Does it WorkIvo Andreev
Automated ML is an approach to minimize the need of data science effort by enabling domain experts to build ML models without having deep knowledge of algorithms, mathematics or programming skills. The mechanism works by allowing end-users to simply provide data and the system automatically does the rest by determining approach to perform particular ML task. At first this may sound discouraging to those aiming to the “sexiest job of the 21st century” - the data scientists. However, Auto ML should be considered as democratization of ML, rather that automatic data science.
In this session we will talk about how Auto ML works, how is it implemented by Microsoft and how it could improve the productivity of even professional data scientists.
computer organization and assembly language : its about types of programming language along with variable and array description..https://www.nfciet.edu.pk/
Telangana State, India’s newest state that was carved from the erstwhile state of Andhra
Pradesh in 2014 has launched the Water Grid Scheme named as ‘Mission Bhagiratha (MB)’
to seek a permanent and sustainable solution to the drinking water problem in the state. MB is
designed to provide potable drinking water to every household in their premises through
piped water supply (PWS) by 2018. The vision of the project is to ensure safe and sustainable
piped drinking water supply from surface water sources
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsContify
AI competitor analysis helps businesses watch and understand what their competitors are doing. Using smart competitor intelligence tools, you can track their moves, learn from their strategies, and find ways to do better. Stay smart, act fast, and grow your business with the power of AI insights.
For more information please visit here https://www.contify.com/
Thingyan is now a global treasure! See how people around the world are search...Pixellion
We explored how the world searches for 'Thingyan' and 'သင်္ကြန်' and this year, it’s extra special. Thingyan is now officially recognized as a World Intangible Cultural Heritage by UNESCO! Dive into the trends and celebrate with us!
How iCode cybertech Helped Me Recover My Lost Fundsireneschmid345
I was devastated when I realized that I had fallen victim to an online fraud, losing a significant amount of money in the process. After countless hours of searching for a solution, I came across iCode cybertech. From the moment I reached out to their team, I felt a sense of hope that I can recommend iCode Cybertech enough for anyone who has faced similar challenges. Their commitment to helping clients and their exceptional service truly set them apart. Thank you, iCode cybertech, for turning my situation around!
[email protected]
2. • A Data Science competitions platform
(there are others: DataScience.net in France)
• 332,000 data scientists
• Today: 192 competitions, 18 active
+ 516 "in class", 12 active
• Prestigious clients: Axa, CERN, Caterpillar, Facebook, GM, Microsoft, Yandex…
What is Kaggle?
3. • Prize pool?
• $325,000 to be made as of August 31st
• Good luck with that!
• Not a good hourly wage
• Today: 192 competitions, 18 active
Understand:
• Lots of datasets about approximately every DS topic
• Lots of winners' solutions, tips and tricks, etc.
• Lots of "beat the benchmark" scripts for beginners
I discovered/tested there: GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ...
Why should I join?
4. Most of the time:
• You have a train set with labels and a test set without labels.
• You need to learn a model using the train features and predict the test set labels
• Your prediction is evaluated using a specific metric
• The best prediction wins
What is a Data Science Competition?
5. Most of the time:
• You have a train set with labels and a test set without labels.
• You need to learn a model using the train features and predict the test set labels
• Your prediction is evaluated using a specific metric
• The best prediction wins
What is a Data Science Competition?
Questions on the slide: Why AUC? F1 score? Log loss? Could that depend on my train/test split? Where do they come from? Do you always have some? Why is the split this way? Random? Time?
6. What you don't learn on Kaggle (or in class?):
• How to model a business question as an ML problem.
• How to manage/create labels (proxy / missing…)
• How to evaluate a model:
• How to choose your metric
• How to design your train/test split
• How to account for this in feature engineering
Understanding this actually helps you in Kaggle competitions:
• How to design your cross-validation scheme (and not overfit)
• How to create relevant features
• Hacks and tricks (leak exploitation :) )
What is a Data Science Competition?
9. • Introduction
• Labels?
• Train and test split?
• Feature Engineering?
• Evaluation Metric?
Introduction
10. • Introduction
• Labels?
• Train and test split?
• Feature Engineering?
• Evaluation Metric?
Introduction
The newcomer disillusion
The production bad surprise
The business obfuscation
11. • Senior Data Scientist at Dataiku
(worked on churn prediction, fraud detection, bot detection, recommender systems,
graph analytics, smart cities,…)
• (More than) Occasional Kaggle competitor
• Twitter @prrgutierrez
Who I am
14. • Fraud is everywhere
E-business, telco, Medicare,…
• Easily defined as a classification problem
• Is the target well defined?
• E-business: yes, with a lag
• Elsewhere: checks are needed,
labels are expensive
Fraud Detection
15. • Wikipedia:
“Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the
number of individuals or items moving out of a collective group over a specific period of
time”
= Customer leaving
Churn
16. • Subscription models:
• Telco
• E-gaming (WoW)
• Ex: Coyote -> 1-year subscription
-> you know when someone leaves
• Non-subscription models:
• E-business (Amazon, Price Minister, Vente Privée)
• E-gaming (Candy Crush, free MMORPGs)
-> you approximate when someone leaves
Candy Crush: days / weeks
MMORPG: 2 months (holidays)
Price Minister: months
Two types of Churn
17. • Predict whether a vehicle / machine / part is going to fail
• Classification problem:
• Given a future horizon and a failure type, will this happen for a given vehicle?
-> 2 parameters describe the target
• Varying the target a lot -> spurious correlations
• Just choose it from the exact business need
Predictive Maintenance
18. • Target is "will like" or "will buy"
• Target is often a proxy of real interest (implicit feedback)
Recommender System
19. • Can you model the problem as an ML problem?
• Ex: predictive maintenance
• Ask the right question from a business point of view,
not what you know how to do.
• Is your target a proxy?
• Ex: recommendation systems
• May need bandit algorithms
• Is it easy to get labels?
• Ex: fraud detection
• Can be expensive
• Mechanical Turk can be the answer
Summary on Labels
20. • Random split
• Just like in school
• When and why?
-> When each line is independent from the rest (not that common!)
e.g. image / document classification, sentiment analysis ("but aha is the new lol")
-> When you want to quickly iterate / benchmark: "is it even possible?"
-> When you want to sell something to your boss
Train / test split
21. • Column / group based
• Ex: Caterpillar challenge
• Predict a price for each tube id
• Tube ids in train and test are different
• Objective: being able to generalize to other tubes!
Train / test split
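The column/group-based split above can be sketched in a few lines. This is a minimal illustration on hypothetical rows keyed by a `tube` id (the function name and data are made up); scikit-learn's `GroupShuffleSplit` / `GroupKFold` provide the same behavior off the shelf:

```python
import random

def group_split(rows, group_key, test_frac=0.3, seed=0):
    """Split rows so that no group (e.g. tube id) appears in both
    train and test: the model must generalize to unseen groups."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test
```

A random row-level split would put rows of the same tube on both sides and overestimate performance; splitting by group avoids that.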
22. • Time based
• Simply separate train and test on a time variable
• When and why?
-> When you want a model that "predicts the future"
-> When things evolve with time! (most problems!)
-> Examples:
ad click prediction, churn prediction, e-business fraud detection, predictive maintenance,…
Train / test split
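A time-based split is simple to implement: pick a cutoff date and never let the model train on anything at or after it. A minimal sketch with made-up click data (names are illustrative):

```python
from datetime import date

def time_split(rows, date_key, cutoff):
    """Train on everything strictly before the cutoff, test on the rest,
    so the model is always evaluated on 'future' data."""
    train = [r for r in rows if r[date_key] < cutoff]
    test = [r for r in rows if r[date_key] >= cutoff]
    return train, test

rows = [
    {"day": date(2016, 1, 5), "clicked": 0},
    {"day": date(2016, 2, 1), "clicked": 1},
    {"day": date(2016, 3, 12), "clicked": 0},
    {"day": date(2016, 4, 2), "clicked": 1},
]
train, test = time_split(rows, "day", date(2016, 3, 1))
# train holds the two earliest rows, test the two latest
```

scikit-learn's `TimeSeriesSplit` generalizes this to several expanding train/test folds.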
23. • Non-subscription example
• Target: 4 months without buying
• Features?
Train / test split: Churn example
24. Ex : Train and predict scheme
Timeline (present time T):
• Data before T – 4 months is used for feature generation.
• Data from T – 4 months to T is used for target creation: activity during the last 4 months.
• Train the model using features and target.
• Use the model to predict future churn.
25. Ex : Train, Evaluation and Predict Scheme
Timeline (present time T):
• Training: data before T – 8 months is used for feature generation; data from T – 8 months to T – 4 months is used for target creation (activity during the last 4 months).
• Validation set: data before T – 4 months is used for feature generation; data from T – 4 months to T is used for target creation. Evaluate on the target of the validation set.
• Use the model to predict future churn.
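The windowing scheme above can be sketched as follows, assuming a hypothetical per-customer purchase log (function and field names are made up): features only use data strictly before the reference date, and the target is "no purchase in the following 4 months". Sliding the reference date back by 4 months turns the same code into the training set builder:

```python
from datetime import date, timedelta

def make_example(purchases, ref_date, window_days=120):
    """Build one (features, target) pair at a reference date:
    features come only from purchases strictly before ref_date,
    the target 'churned' = no purchase in the window after ref_date."""
    past = [d for d in purchases if d < ref_date]
    future = [d for d in purchases
              if ref_date <= d < ref_date + timedelta(days=window_days)]
    features = {
        "n_past_purchases": len(past),
        "days_since_last": (ref_date - max(past)).days if past else None,
    }
    churned = int(len(future) == 0)
    return features, churned
```

Keeping feature and target windows strictly separated is what prevents the "production bad surprise": the model never sees data it would not have at prediction time.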
26. • More complex designs
• Graph sampling (fraud rings?)
• Random sampling in a client's / machine's life
• Mix of column-based and time-based…
• The rule:
1) What is the problem?
2) To what would I like to generalize my model?
The future? Other individuals? …
3) => Train / test split
Train / test split
27. • Predictive maintenance problem
• Objective: predict failures in the next 3 days.
• The metric is proportional to accuracy (and 0.57 is the best score!)
• Link to data:
https://www.phmsociety.org/events/conference/phm/14/data-challenge
Ex: PHM Society (a fail example)
31. • How to design the evaluation scheme?
• What is the probability that an asset fails in the next 3 days from now?
-> classification problem
-> time-based split
-> but how do I create a train and a test set?
• Choose a date and evaluate what happens 3 days later?
-> problem: not enough failures happening
• Choose several dates for each asset?
-> beware of asset over-fitting
• In the challenge: random selection of (asset, date) pairs in the future + oversampling of failures.
Ex: PHM Society
37. • Beware of the distribution of your features!
• Is there a time dependency?
• Ex: counts, sums, … that will only increase with time
• -> Calculate counts and sums rescaled by time / in moving windows instead.
• Can be found in churn, fraud detection, ad click prediction,…
• A categorical variable dependency?
• Ex: email flag in fraud detection
• Is there a network dependency?
• Ex: fraud / bot detection (network features can be useful but leaky)
Feature Engineering
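The moving-window idea can be illustrated with a small helper (the event log and names are hypothetical): instead of a lifetime count, which only grows with time and leaks how old a row is, count events in a fixed window before the reference date:

```python
from datetime import date, timedelta

def windowed_count(events, ref_date, window_days=30):
    """Count events in the last `window_days` before ref_date.
    Unlike a lifetime count, this feature has the same scale for
    old and recent rows in a time-based train/test split."""
    lo = ref_date - timedelta(days=window_days)
    return sum(1 for d in events if lo <= d < ref_date)
```

With `window_days` large enough to cover all history, this degrades back to the leaky lifetime count, which is exactly what the moving window is meant to avoid.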
38. • Final trick:
- Stack train and test and add an is_test boolean
- Try to predict is_test
- Check whether the model is able to predict it
- If so:
- check the feature importances
- remove / modify features and iterate
Feature Engineering
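This trick is often called adversarial validation. A sketch using scikit-learn (the function name and data are illustrative, not the speaker's actual code): stack train and test, label which rows are which, and see whether a classifier can tell them apart. A cross-validated AUC well above 0.5 means the split is predictable, and the feature importances point at the offending features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

def adversarial_validation(X_train, X_test, feature_names):
    """Try to predict whether a row comes from train (0) or test (1).
    Returns the cross-validated AUC and features ranked by importance."""
    X = np.vstack([X_train, X_test])
    y = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_predict(clf, X, y, cv=3, method="predict_proba")[:, 1]
    auc = roc_auc_score(y, scores)
    clf.fit(X, y)  # refit on everything to read feature importances
    ranked = sorted(zip(feature_names, clf.feature_importances_),
                    key=lambda t: -t[1])
    return auc, ranked
```

An AUC near 0.5 means train and test are indistinguishable, which is what you want before trusting your cross-validation scores.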
39. • Final trick:
• Back to the PHM example:
Feature Engineering
Huge time leak!
40. • "Threshold dependent"
• Accuracy
• Precision and Recall
• F1 score
• "Threshold independent"
• AUC
• Log loss
• Others (mean average precision)…
Evaluation metric: Classification
41. • "Threshold dependent"
• Accuracy
• Precision and Recall
• F1 score
• "Threshold independent"
• AUC
• Log loss
• Others (mean average precision)…
• Customs
Evaluation metric: Classification
Annotations on the slide: "Not good if unbalanced target", "When you have an order problem", "When you are going stochastic", "When you need to stick to business", "Accuracy alternative".
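All of these metrics are available in scikit-learn. A small example (the toy labels and scores are made up) showing which ones need a hard threshold on the scores and which ones work on the scores directly:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]            # model probabilities
y_pred = [int(s >= 0.5) for s in y_score]  # threshold-dependent metrics need a cut

print(accuracy_score(y_true, y_pred))  # threshold dependent
print(f1_score(y_true, y_pred))        # threshold dependent
print(roc_auc_score(y_true, y_score))  # threshold independent: uses the ranking
print(log_loss(y_true, y_score))       # threshold independent: uses probabilities
```

Note that moving the 0.5 cut changes accuracy and F1 but leaves AUC and log loss untouched, which is exactly the "threshold dependent / independent" distinction of the slide.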
42. • Custom metrics
• Cost based
• Ex fraud:
• Mean loss of $50 per missed fraud (FN)
• Mean loss of $20 per wrongly cancelled transaction (FP)
• F1 score often used in papers
• In practice, you often have a business cost
Evaluation metric: Classification
(confusion matrix on the slide: TP, FN, TN, FP)
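The cost-based metric of the fraud example fits in a few lines (a sketch using the $50 / $20 figures above; the function name is illustrative):

```python
def business_cost(y_true, y_pred, fn_cost=50.0, fp_cost=20.0):
    """Cost-based metric: $50 per missed fraud (false negative) and
    $20 per wrongly cancelled transaction (false positive). Lower is better."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn * fn_cost + fp * fp_cost

y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0]  # one missed fraud, one false alarm
print(business_cost(y_true, y_pred))  # 70.0
```

Unlike F1, this metric is expressed in dollars, so the optimal threshold can be read directly off the business trade-off.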
43. • Custom metrics
• Fraud example 1:
• "I have fraudsters on my e-business website"
• I generate a score for each transaction
• Transactions with a score higher than a threshold are handled manually
• I have 1 person doing this full time, able to deal with 100 transactions / day
• The rest is automatically accepted
-> AUC is not bad
-> Recall in the top 100 transactions / day
-> Total money blocked in the top 100 transactions / day
In practice AUC is more stable… but the money metric can also be used for communication.
Evaluation metric: Classification
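The "recall in the top 100 transactions / day" metric is a recall-at-k style measure; a minimal sketch (names are illustrative, and k is shrunk for the toy example):

```python
def recall_at_k(y_true, y_score, k=100):
    """Share of all frauds caught when the analyst reviews only the
    k highest-scored transactions of the day."""
    ranked = sorted(zip(y_score, y_true), reverse=True)
    caught = sum(label for _, label in ranked[:k])
    total = sum(y_true)
    return caught / total if total else 0.0
```

Because it only looks at the top of the ranking, this metric matches the analyst's actual workload, whereas AUC averages over all thresholds.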
44. • Custom metrics
• Fraud example 2:
• "I have fraudsters on my e-business website"
• I generate a score for each transaction
• All transactions with a score higher than a threshold are automatically blocked
-> AUC is not bad… but doesn't give the threshold value.
-> F1 score?
-> Cost based is better
Evaluation metric: Classification
45. • My cheat sheet
Evaluation metric: Classification

Metric | Optimized by ML model? | Threshold dependent? | Application example
Accuracy | YES | YES | image classification, NLP, …
F1-score | NO | YES | papers?
AUC | NO | NO | fraud detection, churn, healthcare, …
Log loss | YES | NO | ad click prediction
Custom metric | NO | ? | all?
46. • The business question dictates the evaluation scheme!
• test set design
• evaluation metric
• Indirectly impacts feature engineering
• Indirectly impacts label quality
• Think (but not too much) before coding
• Don't try to optimize the wrong problem!
Conclusion