Automated Reports with RStudio Server
Automated KPI reporting with Shiny Server
Process Validation Documentation with Jupyter Notebook
Automated Machine Learning with Dataiku
Using Screaming Frog to crawl a website
Using R for SEO Analysis
Using PaasLogs to centralize logs
Using Kibana to build fancy dashboards
Tutorial : www.data-seo.com
10. • Docker on Ubuntu 16.04 Server
• From a shell on the Docker host, run:
• sudo docker run -d -p 8787:8787 rocker/rstudio
• Browse to http://yourIP:8787, and you should be greeted by the RStudio welcome screen.
Log in using:
• username: rstudio
• password: rstudio
RStudio Server - Install
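With the container running, report generation itself is just an R script. A minimal hedged sketch, assuming a report.Rmd file and the rmarkdown package (neither is shown in the deck); this is the kind of script the cron methods later in the deck would schedule:
library(rmarkdown)
# render a (hypothetical) report.Rmd to a date-stamped HTML file
render("report.Rmd", output_file = paste0("report-", Sys.Date(), ".html"))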
17. R – Scraper – OpenMPI
• MPI (Message Passing Interface) is a specification for an API for passing
messages between different computers.
• Programming with MPI
• Difficult, because the Rmpi package defines about 110 R functions
• Needs a parallel programming system to do the actual work in parallel
• The doMPI package acts as an adaptor to the Rmpi package, which in
turn is an R interface to an implementation of MPI
• Very easy to install Open MPI and Rmpi on Debian / Ubuntu
• You can test with a single computer
19. R – Scraper – Test doMPI
library(doMPI)
# start your cluster with 20 workers
cl <- startMPIcluster(count=20)
registerDoMPI(cl)
# crawl every row of the dataset in parallel, row-binding the results
max <- nrow(mydataset)
x <- foreach(i=1:max, .combine="rbind") %dopar% seocrawlerThread(mydataset, i)
# close your cluster
closeCluster(cl)
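The body of seocrawlerThread is not shown in the deck; a hypothetical sketch of such a worker, assuming mydataset has a url column and using the httr package:
library(httr)
# hypothetical worker: fetch one URL, return its status code as a one-row data frame
seocrawlerThread <- function(mydataset, i) {
  url <- mydataset$url[i]
  resp <- tryCatch(HEAD(url, timeout(10)), error = function(e) NULL)
  data.frame(url = url,
             status = if (is.null(resp)) NA_integer_ else status_code(resp),
             stringsAsFactors = FALSE)
}
Because each call returns a one-row data frame, the .combine="rbind" above stacks them into a single result table.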
27. • dplyr
• readxl
• searchConsoleR
• googleAuthR
• googleAnalyticsR
R – Packages for SEO
Thanks to Mark Edmondson
28. R – SearchConsoleR
library(googleAuthR)
library(searchConsoleR)
# get your OAuth client ID and secret from the Google API Console
options("searchConsoleR.client_id" = "41078866233615q3i3uXXXX.apps.googleusercontent.com")
options("searchConsoleR.client_secret" = "GO0m0XXXXXXXXXX")
## change this to the website you want to download data for. Include the protocol
website <- "https://data-seo.fr"
## data is reliably in Search Console 3 days back, so we download from then:
## today - 3 days
start <- Sys.Date() - 3
## one day's data, but change it as needed
end <- Sys.Date() - 3
29. R – SearchConsoleR
## what to download, choose between date, query, page, device, country
download_dimensions <- c('date','query')
## what type of Google search, choose between 'web', 'video' or 'image'
type <- c('web')
## Authorize the script with Search Console, using an account that has access to the website.
## The first time you will need to log in to Google, but the token should auto-refresh
## after that, so the script can then run unattended.
googleAuthR::gar_auth()
## the first time, stop here and wait for authorisation
## get the search analytics data
data <- search_analytics(siteURL = website, startDate = start, endDate = end,
                         dimensions = download_dimensions, searchType = type)
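To feed scheduled reports, the result can then be persisted; a small hedged addition (the deck stops at the download step):
# save one CSV per day so downstream reports can pick the data up later
write.csv(data, file = paste0("search-console-", start, ".csv"), row.names = FALSE)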
31. • Table: Crontab Fields and Allowed Ranges (Linux Crontab Syntax)
• MIN Minute field 0-59
• HOUR Hour field 0-23
• DOM Day of Month 1-31
• MON Month field 1-12
• DOW Day of Week 0-6 (0 = Sunday)
• CMD Command Any command to be executed
• $ crontab -e
• Run the R script filePath.R at 23:15 every day of the year:
15 23 * * * Rscript filePath.R
R – Crontab – Method 1
32. • R Package: https://github.com/bnosac/cronR
R – Cron – Method 2
library(cronR)
# build the command to schedule from your script path (example path)
cmd <- cron_rscript("/home/user/filePath.R")
cron_add(cmd, frequency = 'hourly', id = 'job4', at = '00:20',
days_of_week = c(1, 2))
cron_add(cmd, frequency = 'daily', id = 'job5', at = '14:20')
cron_add(cmd, frequency = 'daily', id = 'job6', at = '14:20',
days_of_week = c(0, 3, 5))
36. Shiny Server – Where and How
• ShinyApps.io
• A local server
• Hosted on your server
37. • docker run --rm -p 3838:3838 \
-v /srv/shinyapps/:/srv/shiny-server/ \
-v /srv/shinylog/:/var/log/ \
rocker/shiny
• If you have an app in /srv/shinyapps/appdir, you can run the app
by visiting http://yourIP:3838/appdir/.
Shiny Server - Install
38. Shiny – ui.R
fluidPage(
titlePanel("Compute your internal pagerank"),
sidebarLayout(
sidebarPanel(
a("data-seo.com", href="https://data-seo.com"),
tags$hr(),
p('Step 1 : Export your outlinks data from ScreamingFrog'),
fileInput('file1', 'Choose file to upload (e.g. all_outlinks.csv)',
accept = c('text/csv'), multiple = FALSE
),
tags$hr(),
downloadButton('downloadData', 'Download CSV')
),
mainPanel(
h3(textOutput("caption")),
tags$hr(),
tableOutput('contents')
)
)
)
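Only ui.R is shown above; for context, a minimal hedged sketch of a matching server.R, assuming Screaming Frog's all_outlinks.csv exposes Source and Destination columns and standing in for the real app's logic with igraph's PageRank (the deck does not show this code):
library(shiny)
library(igraph)
# server.R (sketch)
function(input, output) {
  ranks <- reactive({
    req(input$file1)
    links <- read.csv(input$file1$datapath)  # the uploaded all_outlinks.csv
    g <- graph_from_data_frame(links[, c("Source", "Destination")])
    data.frame(url = V(g)$name, pagerank = page_rank(g)$vector)
  })
  output$caption <- renderText("Internal PageRank")
  output$contents <- renderTable(head(ranks()[order(-ranks()$pagerank), ], 50))
  output$downloadData <- downloadHandler(
    filename = "internal-pagerank.csv",
    content = function(file) write.csv(ranks(), file, row.names = FALSE)
  )
}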
40. https://mark.shinyapps.io/GA-dashboard-demo
Code on GitHub: https://github.com/MarkEdmondson1234/ga-dashboard-demo
• Interactive trend graphs.
• Auto-updating Google Analytics data.
• Zoomable day-of-week heatmaps.
• Top Level Trends via Year on Year, Month on Month
and Last Month vs Month Last Year data modules.
• A MySQL connection for data blending your own data with GA data.
• An easy upload option to update a MySQL database.
• Analysis of the impact of marketing events via Google's CausalImpact (see the sketch after this slide).
• Detection of unusual time-points using Twitter's Anomaly Detection.
Shiny – Use case
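For the CausalImpact bullet above, a hedged, self-contained example of the kind of call involved (the dashboard's actual code is at the GitHub link); sessions is a hypothetical numeric vector of 100 daily session counts, with the marketing event on day 71:
library(CausalImpact)
# estimate the causal effect of the event on sessions
impact <- CausalImpact(sessions, pre.period = c(1, 70), post.period = c(71, 100))
summary(impact)   # estimated lift with credible intervals
plot(impact)      # observed vs modelled counterfactual series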
52. $ adduser vincent sudo
$ sudo apt-get install default-jre
$ wget https://downloads.dataiku.com/public/studio/4.0.1/dataiku-dss-4.0.1.tar.gz
$ tar xzf dataiku-dss-4.0.1.tar.gz
$ cd dataiku-dss-4.0.1
>> install all prerequisites
$ sudo -i "/home/dataiku-dss-4.0.1/scripts/install/install-deps.sh" -without-java
>> install dataiku
$ ./installer.sh -d DATA_DIR -p 11000
$ DATA_DIR/bin/dss start
Then browse to http://<your server address>:11000.
Dataiku – Install on a Cloud Instance
53. Go to the DSS data dir
$ cd DATADIR
Stop DSS
$ ./bin/dss stop
Run the R integration script, then restart DSS
$ ./bin/dssadmin install-R-integration
$ ./bin/dss start
Dataiku – Install R
56. • Get all your featured snippets with Ranxplorer
• Get the SERP for each keyword with Ranxplorer
• Use a homemade scraper to enrich the data:
• 'Keyword' 'Domain' 'StatusCode' 'ContentType' 'LastModified' 'Location'
• 'Title' 'TitleLength' 'TitleDist' 'TitleIsQuestion'
• 'noSnippet' 'isJsonLD' 'isItemType' 'isItemProp'
• 'Wordcount' 'Size' 'ResponseTime'
• 'H1' 'H1Length' 'H1Dist' 'H1IsQuestion'
• 'H2' 'H2Length' 'H2Dist' 'H2IsQuestion'
• Use AML to find the most important features (a sketch follows below)
Dataiku : Featured Snippet
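As a hedged illustration of that last step outside Dataiku, the same feature-importance question can be asked in R, assuming a hypothetical snippets data frame holding the scraped fields plus a hasSnippet target (1 if the page owns the featured snippet, 0 otherwise):
library(randomForest)
# fit a classifier on the scraped on-page features
snippets$hasSnippet <- factor(snippets$hasSnippet)
fit <- randomForest(hasSnippet ~ TitleLength + TitleIsQuestion + Wordcount +
                      ResponseTime + H1IsQuestion + H2IsQuestion,
                    data = snippets, importance = TRUE)
importance(fit)   # per-feature importance (mean decrease in accuracy / Gini)
varImpPlot(fit)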
64. Dataiku : My Plugins
• SEMrush
• SearchConsole
• Majestic
• Visiblis [ongoing]
A DSS plugin is a zip file.
Inside DSS, click the top right gear → Administration → Plugins → Store.
https://github.com/voltek62/Dataiku-SEO-Plugins
67. • Learn from the success of others with AML
• Use every method at your disposal to show Google you are the
answer to the question (Title, H1, H2, …)
Dataiku : Results
70. • Can you open-source your SEO tools? Yes, you can, because it means:
• Great advertising
• Customers coming for specific features and training
• Showing your work
• Attracting talent
• Teaching the next generation
Open Source & SEO?
71. • Automated Reports with RStudio Server
• Automated KPI reporting with Shiny Server
• Process Validation Documentation with Jupyter Notebook
• Automated Machine Learning with Dataiku
Take away
72. Now that machines can learn and adapt,
it is time to take advantage of the
opportunity to create new jobs.
Data-SEO, Data-Doctor, Data-Journalist …
#7: R is a programming language dedicated to statistics and data science. The best-known implementation of the R language is the GNU R software.
#13: HTTP response header: collect the contents of the header of an HTTP response.
#26: itoken: this function creates iterators over input objects for building vocabularies, corpora, or DTM and TCM matrices. This iterator is usually used in the following functions: create_vocabulary, create_corpus, create_dtm, vectorizers, create_tcm. See them for details.
create_vocabulary: this function collects unique terms and corresponding statistics. See below for details.
#36, #37: Shiny is a toolkit from RStudio that makes creating web applications much easier (HTML, CSS, JavaScript and jQuery).
Shiny is licensed GPLv3, and the source is available on GitHub.
#52: Benchmarking: AML can quickly present a lot of models using the same training set.
Detecting target leakage: AML builds candidate models extremely fast in an automated way.
Diagnostics: diagnostics such as learning curves and feature importances can be generated automatically.
Automation: tasks like exploratory data analysis, pre-processing of data, model selection and putting models into production can be automated.