Online Python Workshop at Mutirão Python, via pycursos.com
https://www.youtube.com/watch?feature=player_embedded&v=DFh6l-h6-gw
Programmers love Python because of how fast and easy it is to use. Its readable syntax and interpreted, compile-free workflow can cut development time dramatically, and the built-in debugger (pdb) makes debugging your programs a breeze. Python continues to be a favourite option for data scientists, who use it to build machine learning applications and other scientific computations.
Python has become one of the most preferred languages for data analytics, and rising search trends for Python also indicate that it is the next "Big Thing" and a must for professionals in the data analytics domain.
This document provides an overview and the objectives of a Python course for big data analytics. It discusses why Python is well suited to big data tasks thanks to libraries such as Pydoop and SciPy. The course includes demonstrations of web scraping using Beautiful Soup, collecting tweets using APIs, and running word count on Hadoop using Pydoop. It also discusses how Python supports key aspects of data science, such as accessing, analyzing, and visualizing large datasets.
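As a taste of the scraping demonstration, here is a minimal sketch with requests and Beautiful Soup; the URL and the tags extracted are placeholders, not the course's actual target.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute the page you actually want to scrape.
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

# Extract every link's text and target from the parsed document.
for anchor in soup.find_all("a"):
    print(anchor.get_text(strip=True), anchor.get("href"))
```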
Python and BIG Data analytics | Python Fundamentals | Python Architecture - Skillspeed
This Python tutorial unravels the pros and cons of Python, covering the fundamentals and advantages of the language. A comprehensive comparison of MapReduce and Python is included. By the end, you'll know why Python is a high-level scripting tool for BIG Data analytics.
---------
PPT Agenda:
Introduction to Python
Web Scraping Use Case
Introduction to BIG Data and Hadoop
MapReduce
PyDoop
Word Count Use Case
---------
What is Python? - Introduction to Python
Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.
----------
Why Python? - Python Advantages
Clear Syntax
Good for Text Processing
Extensible in C and C++
Generates HTML content
Pre-Defined Libraries – NumPy, SciPy
Interpreted Environment
Automatic Memory Management
Good for Code Steering
Merging Multiple Programs
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor-led training in BIG Data & Hadoop featuring 24/7 Lifetime Support, 100% Placement Assistance & Real-time Projects.
Email: [email protected]
Website: www.skillspeed.com
Number: +91-90660-20904
Facebook: https://www.facebook.com/SkillspeedOnline
Linkedin: https://www.linkedin.com/company/skillspeed
SciPy 2011: Time Series Analysis in Python - Wes McKinney
1) The document discusses statsmodels, a Python library for statistical modeling that implements standard statistical models. It includes tools for linear regression, descriptive statistics, statistical tests, time series analysis, and more.
2) The talk provides an overview of using statsmodels for time series analysis, including descriptive statistics, autoregressive moving average (ARMA) models, vector autoregression (VAR) models, and filtering tools.
3) The discussion highlights the development of statsmodels and the need for integrated statistical data structures and user interfaces to make Python more competitive with R for data analysis and statistics.
The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
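To make the mrjob discussion concrete, here is the canonical word-count job in mrjob's style; pass an input file on the command line, or add `-r hadoop` to run it on a cluster.

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit each word with a count of 1.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the counts produced for each word across all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```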
This document discusses data visualization tools in Python. It introduces Matplotlib as the first and still standard Python visualization tool. It also covers Seaborn which builds on Matplotlib, Bokeh for interactive visualizations, HoloViews as a higher-level wrapper for Bokeh, and Datashader for big data visualization. Additional tools discussed include Folium for maps, and yt for volumetric data visualization. The document concludes that Python is well-suited for data science and visualization with many options available.
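For flavor, a minimal Matplotlib example of the kind such a survey starts from; the data here are synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np

# A simple line plot: the "hello world" of Matplotlib.
x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```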
Keynote talk at PyCon Estonia 2019 where I discuss how to extend CPython and how that has led to a robust ecosystem around Python. I then discuss the need to define and build a Python extension language I later propose as EPython on OpenTeams: https://openteams.com/initiatives/2
SAGE is a free open-source mathematical software system that uses Python as its primary programming language. It aims to promote open-source software in mathematics. SAGE includes interfaces to many existing mathematical software packages and provides its own functionality for areas like algebra, calculus, and linear algebra. Using SAGE allows mathematical researchers to access powerful tools for computation while having the flexibility of an open-source system programmed in Python. However, SAGE currently has limited funding, which restricts its development and ability to match capabilities of closed-source alternatives.
Making NumPy-style and Pandas-style code faster and run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
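As an illustrative sketch of the two pieces named here (an assumption about the talk's exact examples): Numba compiles a NumPy-style loop with a decorator, and Dask chunks a large array and reduces it in parallel.

```python
import numpy as np
from numba import njit
import dask.array as da

@njit  # compile this NumPy-style loop to fast machine code
def total(values):
    s = 0.0
    for v in values:
        s += v
    return s

print(total(np.random.rand(1_000_000)))

# Dask: a larger-than-comfortable array split into chunks,
# reduced in parallel across cores.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
print(x.mean().compute())
```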
Collaborations in the Extreme: The rise of open code development in the scie... - Kelle Cruz
Video: https://www.simonsfoundation.org/event/collaborations-in-the-extreme-the-rise-of-open-code-development-in-the-scientific-community/
The internet is changing the scientific landscape by fostering international, interdisciplinary and collaborative software development. More than ever before, software is a crucial component of any scientific result. The ability to easily share code is reshaping expectations about reproducibility -- a fundamental tenet of the scientific process. In this lecture, Kelle Cruz will briefly provide the backstory of how these shifts have come about, describe some of the most impactful open source projects, and discuss efforts currently underway aimed at ensuring these community-led projects are sustainable and receive support.
Talk given at the first OmniSci user conference, where I discuss cooperating with open-source communities to ensure you get useful answers quickly from your data. I get a chance to introduce OpenTeams in this talk as well and discuss how it can help companies cooperate with communities.
This document discusses openness and reproducibility in computational science. It begins with an introduction and background on the challenges of analyzing non-model organisms. It then describes the goals and challenges of shotgun sequencing analysis, including assembly, counting, and variant calling. It emphasizes the need for efficient data structures, algorithms, and cloud-based analysis to handle large datasets. The document advocates for open science practices like publishing code, data, and analyses to ensure reproducibility of computational results.
Introducing TensorFlow: The game changer in building "intelligent" applications - Rokesh Jankie
This is the slide deck used for the presentation at the Amsterdam Pipeline of Data Science, held in December 2016. TensorFlow is the open source library from Google for implementing deep learning and neural networks. This is an introduction to TensorFlow.
Note: Videos are not included (which were shown during the presentation)
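For a first taste of the library the slides introduce, here is a minimal neural-network sketch in the modern Keras-style TensorFlow API (the 2016 deck would have used the older graph API); the data and layer sizes are toy placeholders.

```python
import numpy as np
import tensorflow as tf

# Toy data: learn y = 2x + 1 from noisy samples.
x = np.random.rand(256, 1).astype("float32")
y = 2 * x + 1 + 0.05 * np.random.randn(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=20, verbose=0)
print(model.predict(np.array([[0.5]], dtype="float32")))
```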
Jupyter notebooks are transforming the way we look at computing, coding and problem solving. But is this the only “data scientist experience” that this technology can provide?
In this webinar, Natalino will sketch how you could use Jupyter to create interactive and compelling data science web applications and provide new ways of data exploration and analysis. In the background, these apps are still powered by well understood and documented Jupyter notebooks.
They will present an architecture composed of four parts: a Jupyter server-only gateway, a Scala/Spark Jupyter kernel, a Spark cluster, and an Angular/Bootstrap web application.
Atomate: a high-level interface to generate, execute, and analyze computation... - Anubhav Jain
Atomate is a high-level interface that makes it easy to generate, execute, and analyze computational materials science workflows. It contains a library of simulation procedures for different packages like VASP. Each procedure translates instructions into workflows of jobs and tasks. Atomate encodes expertise to run simulations and allows customizing workflows. It integrates with FireWorks to execute workflows on supercomputers and store results in databases for further analysis. The goal is to automate simulations and scale to millions of calculations.
Personal point of view on scikit-learn: past, present, and future.
This talk gives a bit of history, mentions exciting developments, and offers a personal vision of the future.
RESTo - restful semantic search tool for geospatial - Gasperi Jerome
RESTo implements a search service with semantic query analysis on an Earth Observation metadata database. It conforms to the OGC 13-026 standard - OpenSearch Extension for Earth Observation.
This document provides an overview of machine learning and artificial intelligence presented by Arno Candel, Chief Architect at H2O.ai. It discusses the history and evolution of AI from early concepts in the 1950s to recent advances in deep learning. It also describes H2O.ai's platform for scalable machine learning and how it works, allowing users to easily build and deploy models on big data using APIs for R, Python, and other languages.
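As an illustrative sketch of the Python workflow such a platform enables, assuming the h2o package, a local H2O instance, and a placeholder CSV with a "label" column:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()  # start (or connect to) a local H2O instance

# Placeholder dataset; any CSV with a target column works.
frame = h2o.import_file("data.csv")
train, valid = frame.split_frame(ratios=[0.8])

model = H2OGradientBoostingEstimator(ntrees=50)
model.train(y="label", training_frame=train, validation_frame=valid)
print(model.model_performance(valid))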
At my first visit to SciPy in Latin America, I was able to review the history of PyData, SciPy, and NumFOCUS, and discuss how to grow its communities and cooperate in the future. I also introduce OpenTeams as a way for open-source contributors to grow their reputation and build businesses.
Presentation given at the Stockholm R useR Group (SRUG) meetup on Dec 6, 2016. Contains a general overview of deep learning, material on using Tensorflow in R etc.
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
"Spark, DeepLearning and Life Sciences, Systems Biology in the Big Data age" Dev Lakhani, Founder of Batch Insights
YouTube Link: https://www.youtube.com/watch?v=z6aTv0ZKndQ
Watch more from Data Natives 2015 here: http://bit.ly/1OVkK2J
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2016: http://bit.ly/1WMJAqS
About the author:
Dev Lakhani has a background in Software Engineering and Computational Statistics and is a founder of Batch Insights, a Big Data consultancy that has worked on numerous Big Data architectures and data science projects in Tier 1 banking, global telecoms, retail, media and fashion. Dev has been actively working with the Hadoop infrastructure since its inception and is currently researching and contributing to the Apache Spark and Tachyon communities.
This document provides an overview of Pandas, a Python library used for data analysis and manipulation. Pandas allows users to manage, clean, analyze and model data. It organizes data in a form suitable for plotting or displaying tables. Key data structures in Pandas include Series for 1D data and DataFrame for 2D (tabular) data. DataFrames can be created from various inputs and Pandas includes input/output tools to read data from files into DataFrames.
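A small sketch of the two core structures named above; the values are invented.

```python
import pandas as pd

# Series: 1-D labeled data.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# DataFrame: 2-D tabular data, built here from a dict of columns.
df = pd.DataFrame({"city": ["Recife", "Lagos"], "pop_m": [1.6, 15.4]})

# I/O tools read files straight into DataFrames, e.g.:
# df = pd.read_csv("data.csv")
print(df.describe())
```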
A talk presented at an NSF Workshop on Data-Intensive Computing, July 30, 2009.
Extreme scripting and other adventures in data-intensive computing
Data analysis in many scientific laboratories is performed via a mix of standalone analysis programs, often written in languages such as Matlab or R, and shell scripts, used to coordinate multiple invocations of these programs. These programs and scripts all run against a shared file system that is used to store both experimental data and computational results.
While superficially messy, the flexibility and simplicity of this approach makes it highly popular and surprisingly effective. However, continued exponential growth in data volumes is leading to a crisis of sorts in many laboratories. Workstations and file servers, even local clusters and storage arrays, are no longer adequate. Users also struggle with the logistical challenges of managing growing numbers of files and computational tasks. In other words, they face the need to engage in data-intensive computing.
We describe the Swift project, an approach to this problem that seeks not to replace the scripting approach but to scale it, from the desktop to larger clusters and ultimately to supercomputers. Motivated by applications in the physical, biological, and social sciences, we have developed methods that allow for the specification of parallel scripts that operate on large amounts of data, and the efficient and reliable execution of those scripts on different computing systems. A particular focus of this work is on methods for implementing, in an efficient and scalable manner, the POSIX file system semantics that underpin scripting applications. These methods have allowed us to run applications unchanged on workstations, clusters, infrastructure as a service ("cloud") systems, and supercomputers, and to scale applications from a single workstation to a 160,000-core supercomputer.
Swift is one of a variety of projects in the Computation Institute that seek individually and collectively to develop and apply software architectures and methods for data-intensive computing. Our investigations seek to treat data management and analysis as an end-to-end problem. Because interesting data often has its origins in multiple organizations, a full treatment must encompass not only data analysis but also issues of data discovery, access, and integration. Depending on context, data-intensive applications may have to compute on data at its source, move data to computing, operate on streaming data, or adopt some hybrid of these and other approaches.
Thus, our projects span a wide range, from software technologies (e.g., Swift, the Nimbus infrastructure as a service system, the GridFTP and DataKoa data movement and management systems, the Globus tools for service oriented science, the PVFS parallel file system) to application-oriented projects (e.g., text analysis in the biological sciences, metagenomic analysis, image analysis in neuroscience, information integration for health care applications, management of experimental data from X-ray sources, diffusion tensor imaging for computer aided diagnosis), and the creation and operation of national-scale infrastructures, including the Earth System Grid (ESG), cancer Biomedical Informatics Grid (caBIG), Biomedical Informatics Research Network (BIRN), TeraGrid, and Open Science Grid (OSG).
For more information, please see www.ci.uchicago/swift.
This document provides an overview of data science and machine learning with Anaconda. It begins with an introduction to Travis Oliphant, the founder of Continuum Analytics. It then discusses how Continuum created two organizations, NumFOCUS and Continuum Analytics, to support open source scientific computing and provide enterprise software and services. The rest of the document outlines how data science and machine learning are growing rapidly with Python and describes some of Anaconda's key capabilities for data science workflows and empowering data science teams.
This document provides an overview of data visualization in Python. It discusses popular Python libraries and modules for visualization like Matplotlib, Seaborn, Pandas, NumPy, Plotly, and Bokeh. It also covers different types of visualization plots like bar charts, line graphs, pie charts, scatter plots, histograms and how to create them in Python using the mentioned libraries. The document is divided into sections on visualization libraries, version overview of updates to plots, and examples of various plot types created in Python.
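To ground the survey, a short Matplotlib sketch producing two of the listed plot types; the numbers are invented.

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Bar chart.
ax1.bar(["Python", "R", "Julia"], [62, 25, 13])
ax1.set_ylabel("survey share (%)")

# Scatter plot.
ax2.scatter([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.1])
ax2.set_xlabel("x")
ax2.set_ylabel("y")

plt.tight_layout()
plt.show()
```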
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016 - MLconf
This document provides an overview of TensorFlow, an open source machine learning framework. It discusses how machine learning systems can become complex with modeling complexity, heterogeneous systems, and distributed systems. It then summarizes key aspects of TensorFlow, including its architecture, platforms, languages, parallelism approaches, algorithms, and tooling. The document emphasizes that TensorFlow handles complexity so users can focus on their machine learning ideas.
Recommender Systems with Ruby (adding machine learning, statistics, etc) - Marcel Caraciolo
This document discusses the use of Ruby for recommendation systems and related tasks like data analysis and visualization. It provides examples of how Ruby libraries and tools like Recommendable, NMatrix, BioRuby, and RubyDoop can be used for tasks like collaborative filtering, content-based recommendations, machine learning, scientific computing, and processing large datasets. The document also discusses some common challenges for recommendation systems and how different approaches like content-based and collaborative filtering attempt to address them.
GeoMapper, Python Script for Visualizing Data on Social Networks with Geo-loc... - Marcel Caraciolo
This document describes a tool called GeoLocation Friends Visualizer that plots social network location data on a map. It was created by Marcel Caraciolo, a Python developer from Recife, Brazil who has been working with Python for 6 years. The tool and its source code are available on GitHub at a provided link.
Benchy: Lightweight framework for Performance Benchmarks - Marcel Caraciolo
Benchy: Lightweight framework for Performance Benchmarks on Python Scripts.
Presented at XXVI Pernambuco Python User Group Meeting at Recife, Pernambuco, Brazil on 06.04.2013
Talk on scientific computing with Python, SciPy and NumPy, given at the XVI Meeting of the Pernambuco Python User Group, Recife, Pernambuco, on 03/09/2011, by Marcel Pinheiro Caraciolo.
Building Scientific Solutions with Big Data & MapReduce - Marcel Caraciolo
This document summarizes the key points about using MapReduce and Big Data. In three sentences:
MapReduce is an approach to distributed processing of large datasets through map and reduce functions. MrJob makes it easy to run MapReduce jobs in Python on Amazon EMR or Hadoop. Examples show how to use MapReduce for large-scale friend recommendation.
How Python is changing distance learning in Brazil - Marcel Caraciolo
1) Marcel Caraciolo is a chief scientist and teacher who uses Python to promote distance education in Brazil.
2) He co-founded PyCursos.com, which offers free online Python courses that have attracted hundreds of students across several cities.
3) Data show that interactive approaches, such as online exercises during the videos, improve student engagement and performance.
The document summarizes the VII Meeting of PUG-PE, the Python user group of Pernambuco. It briefly presents the topics discussed: artificial neural networks, data classification using perceptrons and MLPs, and example applications such as health classification and character recognition.
Talk: Data Scientist - Mastering Big Data with Free Software - Ambiente Livre
The document discusses Big Data and how free software, especially Hadoop and Pentaho, can be used to analyze large volumes of data. Speaker Marcio Junior Vieira presents his credentials and experience with free software and Big Data, and describes concepts such as the 4 Vs of Big Data, HDFS, MapReduce, and other components of the Hadoop ecosystem. Examples of Big Data use in sports and business are also presented.
Python does not force the programmer to think in objects, but objects have been part of the language from the start, including advanced concepts such as operator overloading, multiple inheritance, and introspection. With its simple syntax, learning object orientation in Python is very natural.
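A tiny sketch of those three concepts, with invented classes for illustration:

```python
class Vector:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        # Operator overloading: makes v1 + v2 work.
        return Vector(self.x + other.x, self.y + other.y)

    def __repr__(self):
        return f"Vector({self.x}, {self.y})"

class Named:
    def __init__(self, name):
        self.name = name

class NamedVector(Named, Vector):
    # Multiple inheritance: combines both base classes.
    def __init__(self, name, x, y):
        Named.__init__(self, name)
        Vector.__init__(self, x, y)

v = NamedVector("v", 1, 2) + Vector(3, 4)
print(v)                      # Vector(4, 6)
print(isinstance(v, Vector))  # introspection: True
```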
This document provides an introduction to the Python programming language. It summarizes the main points about what Python is and why to use it, and compares Python with other languages. It also provides details on Python's features, productivity, applications, and communities.
Crab - A Python Framework for Building Recommendation Systems - Marcel Caraciolo
Crab is a Python framework for building recommendation engines. It began as a Mahout alternative for Python developers and is being rewritten as a Scikit-learn submodule. Crab currently features collaborative filtering algorithms and evaluation metrics. Developers are working in sprints to optimize performance by integrating Numpy and migrating Crab to work as a Scikit in order to make it faster and more accessible to the scientific community.
This document provides an introduction to computer vision. It discusses how computer vision works by acquiring images, processing them, and analyzing them. It covers various computer vision techniques like template matching to find instances in images, keypoint matching to find features, and using Haar-like features to classify generic objects like faces. It also provides examples of code snippets using the SimpleCV library to implement these techniques. The document is meant to demonstrate how to get started with computer vision using Python.
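The deck's code uses SimpleCV; as an illustrative stand-in, here is the same template-matching idea with OpenCV (a swapped-in library, not the deck's API). The file names are placeholders.

```python
import cv2

# Placeholder file names; use your own image and template.
image = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

# Slide the template over the image and score each position.
scores = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_loc = cv2.minMaxLoc(scores)
print("best match at", best_loc, "score", best_score)
```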
Developing a full-JavaScript application - Denis Vieira
The document discusses developing a full-stack JavaScript application using MongoDB, Express, AngularJS and Node.js. It explains how to use these technologies together, covering NoSQL databases, MVC architecture, RESTful APIs and WebSockets. It also discusses tools such as Gulp for automating tasks.
This document provides a tutorial on how to think like a computer scientist using the Python programming language. It introduces basic programming concepts such as variables, data types, functions, conditionals, recursion, iteration, and data structures such as lists, dictionaries and objects. It also covers topics such as classes, inheritance, files and exceptions, to provide a solid foundation in object-oriented programming with Python.
The online Advanced Excel 2007 course gives you essential knowledge of the Visual Basic Editor, your first program, operators and data, and repetition loops, among other topics. Take the opportunity to let distance learning bring you quality material, earn your certificate, and reach a better professional position in little time.
The document describes the Facebook APIs, including what an API is, the basic and advanced permissions an application can request, Facebook Login, Open Graph, and development tools.
Recommendation systems with Mahout: introduction - Adam Warski
An introduction to recommendation systems: types of recommenders, types of data (input), user-user collaborative filtering, item-item collaborative filtering, running Mahout on a single node and on multiple nodes (on Hadoop)
The document presents NoSQL and MongoDB. It summarizes the main points about flexible schemas and horizontal scalability, and how MongoDB can store analytical data more efficiently than relational databases.
The document discusses the concept of scalability in computing. Operationally, scalability now means being able to utilize thousands of inexpensive computers. Formally, in the past scalability meant algorithms with polynomial time complexity, while now it means logarithmic time complexity, allowing data to be processed more efficiently as data sizes increase. The document provides examples of finding matching DNA sequences and word frequency analysis to illustrate how distributed and parallel algorithms can improve scalability.
Software tools for high-throughput materials data generation and data mining - Anubhav Jain
Atomate and matminer are open-source Python libraries for high-throughput materials data generation and data mining. Atomate makes it easy to automatically generate large datasets by running standardized computational workflows with different simulation packages. Matminer contains tools for featurizing materials data and integrating it with machine learning algorithms and data visualization methods. Both aim to accelerate materials discovery by automating and standardizing computational workflows and data analysis tasks.
The document discusses functional data structures. It begins by defining functional data structures as data structures suitable for functional programming languages or for coding in an imperative language using a functional style. Key characteristics include immutability, recursion, garbage collection, and pattern matching. Examples of functional implementations of stacks, sets using binary search trees, and priority queues (heaps) using skew heaps are provided in Haskell and Java. Functional data structures have advantages like fewer bugs due to immutability and increased sharing through lack of defensive cloning. The document discusses the tree-copying involved in operations on functional data structures and provides benchmark results showing improved performance of binary search trees over naive lists for sets.
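The document's examples are in Haskell and Java; to stay in this page's language, here is a minimal persistent (immutable) stack in Python in the same spirit: every operation returns a new stack that shares structure with the old one.

```python
# A persistent stack: each operation returns a new stack,
# sharing all existing nodes with the old one (no copying).
EMPTY = None

def push(stack, value):
    return (value, stack)

def pop(stack):
    head, tail = stack
    return head, tail

s1 = push(push(EMPTY, 1), 2)
v, s0 = pop(s1)
print(v)          # 2
print(s1[1] is s0)  # True: the old stack is shared, not copied
```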
eScience: A Transformed Scientific MethodDuncan Hull
The document discusses the concept of eScience, which involves synthesizing information technology and science. It explains how science is becoming more data-driven and computational, requiring new tools to manage large amounts of data. It recommends that organizations foster the development of tools to help with data capture, analysis, publication, and access across various scientific disciplines.
Lecture 1 Slides - Introduction to algorithms.pdf - RanvinuHewage
- The document discusses reasons for studying algorithms and their broad impacts.
- Key reasons include solving hard problems, intellectual stimulation, becoming a proficient programmer, unlocking secrets of life and the universe, and fun.
- Algorithms have roots in ancient times but find new opportunities in the modern era of computers and large data. They allow addressing problems that could not otherwise be solved.
In this video from ChefConf 2014 in San Francisco, Cycle Computing CEO Jason Stowe outlines the biggest challenge facing us today, Climate Change, and suggests how Cloud HPC can help find a solution, including ideas around Climate Engineering, and Renewable Energy.
"As proof points, Jason uses three use cases from Cycle Computing customers, including from companies like HGST (a Western Digital Company), Aerospace Corporation, Novartis, and the University of Southern California. It’s clear that with these new tools that leverage both Cloud Computing, and HPC – the power of Cloud HPC enables researchers, and designers to ask the right questions, to help them find better answers, faster. This all delivers a more powerful future, and means to solving these really difficult problems."
Watch the video presentation: http://insidehpc.com/2014/09/video-hpc-cluster-computing-64-156000-cores/
The document discusses various research projects involving the automated design and optimization of complex physical, chemical, and biological systems using evolutionary algorithms and machine learning techniques. It describes current and planned usage of computer clusters to run simulations and experiments for protein structure prediction, software self-assembly, and modeling physico-chemical systems through evolutionary optimization of parameters. The research requires significant computational resources to process large datasets and evaluate models in parallel.
The document outlines the schedule for a full-day machine learning course. The morning sessions introduce deep learning and cover a "hello world" machine learning exercise using MNIST data. Feedforward neural networks are also discussed. The afternoon focuses on computer vision with convolutional neural networks, natural language processing, generative models, time series prediction, and deploying machine learning models. Live coding exercises accompany many of the talks to provide hands-on learning.
This is a talk titled "Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack" that I gave at CAMDA 2009 on October 6, 2009.
Extending Complex Event Processing to Graph-structured Information - Antonio Vallecillo
Complex Event Processing (CEP) is a powerful technology in real-time distributed environments for analyzing fast and distributed streams of data and deriving conclusions from them. CEP permits defining complex events based on the events produced by the incoming sources, in order to identify complex, meaningful circumstances and to respond to them as quickly as possible. However, in many situations the information that needs to be analyzed is not structured as a mere sequence of events, but as graphs of interconnected data that evolve over time. This paper proposes an extension of CEP systems that permits dealing with graph-structured information. Two case studies are used to validate the proposal and to compare its performance with traditional CEP systems. We discuss the benefits and limitations of the CEP extensions presented.
What are algorithms? How can I build a machine learning model? In machine learning, training large models on a massive amount of data usually improves results. Our customers report, however, that training such models and deploying them is either operationally prohibitive or outright impossible for them. At Amazon, we created a collection of machine learning algorithms that scale to any amount of data, including k-means clustering for data segmentation, factorisation machines for recommendations, and time-series forecasting. This talk will discuss those algorithms, understand where and how they can be used, and our design choices.
Wikipedia Views As A Proxy For Social Engagement - Daniel Cuneo
Wikipedia is now offering up to 7 years of page view data.
Can we use this data to measure social engagement?
I gather some data in this test of the cancer drug Tarceva to see what the view data looks like.
The document discusses how computation can accelerate the generation of new knowledge by enabling large-scale collaborative research and extracting insights from vast amounts of data. It provides examples from astronomy, physics simulations, and biomedical research where computation has allowed more data and researchers to be incorporated, advancing various fields more quickly over time. Computation allows for data sharing, analysis, and hypothesis generation at scales not previously possible.
Real-time stream processing presentation at General Assemb.ly - Varun Vijayaraghavan
Real-time stream processing systems are useful for analyzing continuous data streams to gain instant insights. They need to be fast, scalable, and fault tolerant. Apache Storm is a popular open-source stream processing system that uses task parallelism. In Storm, spouts act as data sources, bolts perform processing tasks, and topologies define the graph of spouts and bolts. An example topology counts word occurrences in sentences from a random sentence spout. The Lambda architecture combines real-time and batch processing layers to provide both instant and reliable insights from streaming data. Apache Spark can also be used for stream processing using microbatches.
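As a sketch of the microbatch approach mentioned last, here is the classic Spark Streaming word count in PySpark; the host and port are placeholders for any line-oriented text source (e.g. `nc -lk 9999`).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second microbatches

# Placeholder source: a TCP text stream on localhost:9999.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each microbatch's counts

ssc.start()
ssc.awaitTermination()
```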
Opportunities for X-Ray science in future computing architecturesIan Foster
The world of computing continues to evolve rapidly. In just the past 10 years, we have seen the emergence of petascale supercomputing, cloud computing that provides on-demand computing and storage with considerable economies of scale, software-as-a-service methods that permit outsourcing of complex processes, and grid computing that enables federation of resources across institutional boundaries. These trends shown no signs of slowing down: the next 10 years will surely see exascale, new cloud offerings, and terabit networks. In this talk I review various of these developments and discuss their potential implications for a X-ray science and X-ray facilities.
Use of Spark for proteomic scoring - Seattle presentation - lordjoe
This document discusses using Apache Spark to parallelize proteomic scoring, which involves matching tandem mass spectra against a large database of peptides. The author developed a version of the Comet scoring algorithm and implemented it on a Spark cluster. This outperformed single machines by over 10x, allowing searches that took 8 hours to be done in under 30 minutes. Key considerations for running large jobs in parallel on Spark are discussed, such as input formatting, accumulator functions for debugging, and smart partitioning of data. The performance improvements allow searching larger databases and considering more modifications.
The document provides information about the Spartan HPC system at the University of Melbourne, including:
- Spartan is the University of Melbourne's general purpose high performance computing system.
- The document outlines logging into Spartan, using environment modules and job submission, and introduces parallel programming with OpenMP and MPI (a Python-flavored MPI sketch follows this summary).
- Spartan has been recognized internationally as a model HPC-cloud hybrid system and has supported over 150 research papers.
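The course teaches MPI in its native C/Fortran setting; as a Python-flavored taste of the same model (an assumption, not Spartan's actual material), here is a minimal mpi4py sketch, launched with something like `mpiexec -n 4 python sum.py`.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process computes a partial sum; rank 0 gathers the total.
partial = sum(range(rank, 1000, size))
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print("total =", total)
```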
Prepares students for (and is a prerequisite for) the more advanced material they will encounter in later courses. Data structures organize data, which leads to more efficient programs.
This document describes how to analyze your own genome using technologies such as Python. It presents the concepts of DNA sequencing, mapping, variant calling and interpretation. It explains the workflow of a simple pipeline for analyzing variants in a genome and provides resources for learning more about bioinformatics.
Joblib: Lightweight pipelining for parallel jobs (v2) - Marcel Caraciolo
This document discusses parallel computing in Python using joblib. It begins with an overview of different parallelization options in Python like threading and multiprocessing. It then discusses how joblib provides an easy way to parallelize Python code using multiprocessing without needing to explicitly manage processes. The document provides examples of using joblib to parallelize tasks like applying a function to a list of inputs and shows how it helps speed up computation by utilizing multiple CPU cores. It also discusses some considerations like interrupting jobs and memory usage when using joblib for parallelization.
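A minimal sketch of the pattern the talk describes, in the spirit of joblib's canonical example: parallelize a plain function over inputs without managing processes by hand.

```python
from math import sqrt
from joblib import Parallel, delayed

# Run sqrt over the inputs on 2 worker processes.
results = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))
print(results)
```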
Building bioinformatics software for clinical analysis: challenges and... - Marcel Caraciolo
The document discusses the challenges and opportunities in building bioinformatics software for clinical analysis. It presents the Genomika laboratory, which specializes in genetic testing, and shows how the fusion of molecular biology and information technology is essential for analyzing large volumes of genetic data. It also highlights the importance of bioinformatics for mining databases in the search for mutations, and how healthcare systems can be improved with new technologies.
How Python helped automate our laboratory v.2 - Marcel Caraciolo
The document describes how the Python programming language can be used to automate tasks in clinical analysis laboratories, including genetic variant analysis, laboratory process management, and server infrastructure. The author also provides resources for those who want to learn bioinformatics and work on genomic analysis with Python.
How Python can help automate your laboratory - Marcel Caraciolo
The document describes how Python can help automate the processes of a clinical analysis laboratory, including managing the storage and analysis of large volumes of genomic data, developing laboratory management and notification systems, and running server and backup infrastructure.
Marcel Caraciolo is a scientist and CTO who has worked with Python for 7 years. He is interested in machine learning, mobile education, and data. He is the current president of the Python Brazil Association. Caraciolo has created several scientific Python packages and taught Python online. He is now working on applying Python to bioinformatics and clinical sequencing through tools like biopandas.
This document presents a tutorial on hacking the web with Python 3, given by Marcel Caraciolo. The tutorial introduces Python 3 and shows how to interact with platforms such as Facebook, Reddit, MongoDB, Foursquare, Twitter and open data using the language. It provides links and code so participants can experiment with collecting and analyzing data from these platforms.
The document discusses several business models around open-source software, including support and training, consulting on open-source software, Software as a Service, and selling proprietary software packages built on open-source code. It also offers advice on starting and maintaining open-source projects.
Benchy, python framework for performance benchmarking of Python Scripts - Marcel Caraciolo
Benchy is a lightweight Python framework for performing benchmarks on code. It allows generating performance and memory usage graphs to compare different code implementations. Benchmarks can be written as objects and executed via a BenchmarkRunner to obtain results. Results are stored in a SQLite database and full reports can be generated in reStructuredText format. The framework aims to provide an easy way to integrate benchmarks into the development workflow.
The document presents Python and 10 reasons to get to know the language, including that it is easy to learn, multi-paradigm, and used by companies such as Google, Dropbox and Mozilla. It also discusses how expressive Python is and how it integrates with other languages such as C/C++, .NET and MATLAB. Community support networks for Python in Brazil are also presented.
In this tutorial I presented, using basic Python, the concepts behind building a collaborative filtering recommendation system; a minimal sketch follows after the video link.
Mutirão PyCursos:
Video at: https://plus.google.com/u/0/events/c3hqbk20omt3r5uoq13gpk82i9g
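A minimal user-based collaborative-filtering sketch in plain Python, in the spirit of the tutorial; the ratings matrix is invented.

```python
from math import sqrt

ratings = {
    "ana":  {"m1": 5, "m2": 3, "m3": 4},
    "bob":  {"m1": 4, "m2": 2, "m3": 5, "m4": 3},
    "caio": {"m2": 5, "m4": 4},
}

def similarity(a, b):
    # Cosine similarity over the items both users rated.
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    dot = sum(ratings[a][m] * ratings[b][m] for m in shared)
    na = sqrt(sum(ratings[a][m] ** 2 for m in shared))
    nb = sqrt(sum(ratings[b][m] ** 2 for m in shared))
    return dot / (na * nb)

def recommend(user):
    # Score unseen items by similarity-weighted neighbor ratings.
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = similarity(user, other)
        for movie, r in ratings[other].items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(recommend("ana"))  # e.g. [('m4', ...)]
```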
This document presents Python as an interpreted, easy-to-learn and highly productive programming language that supports object-oriented, functional and procedural paradigms. It shows basic examples of Python code and discusses how Python is used by many large companies, is open source, and has an active developer community.
New Trends in Distance Education: how can we reinvent education? - Marcel Caraciolo
Presentation given at the Talk a Bit Conference in June 2012 and during PET 2012, by Marcel Caraciolo.
Universidade Federal de Pernambuco, 2012
The document describes how to build a web crawler to track parcels using Python and regular expressions. First, it explains how to download the HTML page containing the tracking data and extract its content. It then discusses how to parse the downloaded HTML with regular expressions to obtain the parcel's status and location.
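An illustrative sketch of that flow, with a hypothetical tracking URL and placeholder regular expressions; a real carrier's HTML will need its own patterns.

```python
import re
import urllib.request

# Hypothetical tracking URL; real services differ.
url = "https://example.com/track?code=AB123456789XY"
html = urllib.request.urlopen(url).read().decode("utf-8")

# Placeholder patterns for the status and location fields.
status = re.search(r'<td class="status">(.*?)</td>', html)
place = re.search(r'<td class="place">(.*?)</td>', html)
if status and place:
    print(status.group(1), "-", place.group(1))
```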
The document describes how to manipulate ZIP archives in Python using the zipfile module. You can create, read and extract archives, and obtain metadata about archived files such as name, size and modification date.
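A short sketch of those operations with the standard-library zipfile module; the file names are placeholders.

```python
import zipfile

# Create an archive and add a file to it.
with zipfile.ZipFile("backup.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("notes.txt")

# Read it back: list members, print metadata, then extract.
with zipfile.ZipFile("backup.zip") as zf:
    for info in zf.infolist():
        print(info.filename, info.file_size, info.date_time)
    zf.extractall("restored/")
```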
PyFoursquare is a Python wrapper for the Foursquare API that allows developers to easily access Foursquare data from their Python applications. It currently supports searching for places and retrieving place details, tips, and user information. The wrapper follows a similar architecture to Tweepy, representing each Foursquare entity as a model. Developers can authenticate their app, make API requests, and access results as model objects. The project is open source and the author welcomes contributions to support additional Foursquare entities and features.
Recommender Systems: how do they work and where are they applied? - Marcel Caraciolo
This document discusses mobile recommender systems. It describes how recommender systems on mobile devices face challenges due to limitations of mobile contexts, such as location and processing capabilities. It presents the workflow and architecture of a mobile restaurant recommendation and navigation system. The system collects and analyzes location-based user data on the server side and provides personalized recommendations to users on their mobile clients. It discusses using context such as location, tags, and implicit feedback for recommendations on mobile.
Content Recommendation Based on Data Mining in Adaptive Social Networks - Marcel Caraciolo
The document discusses content recommendation in adaptive social networks based on data mining. It aims to design a methodology for social recommender systems that incorporate different knowledge sources from structured and unstructured data. The objectives are to design improved explanations for recommendations to increase user acceptance and enhance the student experience. The approach uses a hybrid recommender system that adapts the weighting of collaborative and content-based filtering based on the type of content being recommended. Current results show the system integrated into a Brazilian social network with over 70,000 students and items, with early user feedback being positive. Expected results include analyzing how recommendations can improve the learning process and exploring hidden knowledge in social networks.
Crab: A Python Framework for Building Recommender Systems - Marcel Caraciolo
Crab is a Python framework for building recommendation engines. It began as a community-driven project one year ago and was incorporated into the open-source labs Muriçoca in April 2011. Crab is being rewritten as a Scikit (toolkit for machine learning in Python) to take advantage of the Scikit-Learn algorithms and infrastructure. The current version of Crab implements collaborative filtering algorithms like user-based, item-based, and matrix factorization and can evaluate recommender algorithms with metrics like precision, recall, and RMSE. It also provides APIs to build recommendation systems and deploy them using REST frameworks. Crab is already used in some production recommender systems.
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf - Software Company
Explore the benefits and features of advanced logistics management software for businesses in Riyadh. This guide delves into the latest technologies, from real-time tracking and route optimization to warehouse management and inventory control, helping businesses streamline their logistics operations and reduce costs. Learn how implementing the right software solution can enhance efficiency, improve customer satisfaction, and provide a competitive edge in the growing logistics sector of Riyadh.
Technology Trends in 2025: AI and Big Data AnalyticsInData Labs
At InData Labs, we have been keeping an ear to the ground, looking out for AI-enabled digital transformation trends coming our way in 2025. Our report will provide a look into the technology landscape of the future, including:
-Artificial Intelligence Market Overview
-Strategies for AI Adoption in 2025
-Anticipated drivers of AI adoption and transformative technologies
-Benefits of AI and Big data for your business
-Tips on how to prepare your business for innovation
-AI and data privacy: Strategies for securing data privacy in AI models, etc.
Download your free copy nowand implement the key findings to improve your business.
AI and Data Privacy in 2025: Global TrendsInData Labs
In this infographic, we explore how businesses can implement effective governance frameworks to address AI data privacy. Understanding it is crucial for developing effective strategies that ensure compliance, safeguard customer trust, and leverage AI responsibly. Equip yourself with insights that can drive informed decision-making and position your organization for success in the future of data privacy.
This infographic contains:
-AI and data privacy: Key findings
-Statistics on AI data privacy in the today’s world
-Tips on how to overcome data privacy challenges
-Benefits of AI data security investments.
Keep up-to-date on how AI is reshaping privacy standards and what this entails for both individuals and organizations.
Quantum Computing Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
Spark is a powerhouse for large datasets, but when it comes to smaller data workloads, its overhead can sometimes slow things down. What if you could achieve high performance and efficiency without the need for Spark?
At S&P Global Commodity Insights, having a complete view of global energy and commodities markets enables customers to make data-driven decisions with confidence and create long-term, sustainable value. 🌍
Explore delta-rs + CDC and how these open-source innovations power lightweight, high-performance data applications beyond Spark! 🚀
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...SOFTTECHHUB
I started my online journey with several hosting services before stumbling upon Ai EngineHost. At first, the idea of paying one fee and getting lifetime access seemed too good to pass up. The platform is built on reliable US-based servers, ensuring your projects run at high speeds and remain safe. Let me take you step by step through its benefits and features as I explain why this hosting solution is a perfect fit for digital entrepreneurs.
Generative Artificial Intelligence (GenAI) in BusinessDr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...Alan Dix
Talk at the final event of Data Fusion Dynamics: A Collaborative UK-Saudi Initiative in Cybersecurity and Artificial Intelligence funded by the British Council UK-Saudi Challenge Fund 2024, Cardiff Metropolitan University, 29th April 2025
https://alandix.com/academic/talks/CMet2025-AI-Changes-Everything/
Is AI just another technology, or does it fundamentally change the way we live and think?
Every technology has a direct impact with micro-ethical consequences, some good, some bad. However more profound are the ways in which some technologies reshape the very fabric of society with macro-ethical impacts. The invention of the stirrup revolutionised mounted combat, but as a side effect gave rise to the feudal system, which still shapes politics today. The internal combustion engine offers personal freedom and creates pollution, but has also transformed the nature of urban planning and international trade. When we look at AI the micro-ethical issues, such as bias, are most obvious, but the macro-ethical challenges may be greater.
At a micro-ethical level AI has the potential to deepen social, ethnic and gender bias, issues I have warned about since the early 1990s! It is also being used increasingly on the battlefield. However, it also offers amazing opportunities in health and educations, as the recent Nobel prizes for the developers of AlphaFold illustrate. More radically, the need to encode ethics acts as a mirror to surface essential ethical problems and conflicts.
At the macro-ethical level, by the early 2000s digital technology had already begun to undermine sovereignty (e.g. gambling), market economics (through network effects and emergent monopolies), and the very meaning of money. Modern AI is the child of big data, big computation and ultimately big business, intensifying the inherent tendency of digital technology to concentrate power. AI is already unravelling the fundamentals of the social, political and economic world around us, but this is a world that needs radical reimagining to overcome the global environmental and human challenges that confront us. Our challenge is whether to let the threads fall as they may, or to use them to weave a better future.
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxshyamraj55
We’re bringing the TDX energy to our community with 2 power-packed sessions:
🛠️ Workshop: MuleSoft for Agentforce
Explore the new version of our hands-on workshop featuring the latest Topic Center and API Catalog updates.
📄 Talk: Power Up Document Processing
Dive into smart automation with MuleSoft IDP, NLP, and Einstein AI for intelligent document workflows.
Semantic Cultivators : The Critical Future Role to Enable AIartmondano
By 2026, AI agents will consume 10x more enterprise data than humans, but with none of the contextual understanding that prevents catastrophic misinterpretations.
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, presentation slides, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveScyllaDB
Want to learn practical tips for designing systems that can scale efficiently without compromising speed?
Join us for a workshop where we’ll address these challenges head-on and explore how to architect low-latency systems using Rust. During this free interactive workshop oriented for developers, engineers, and architects, we’ll cover how Rust’s unique language features and the Tokio async runtime enable high-performance application development.
As you explore key principles of designing low-latency systems with Rust, you will learn how to:
- Create and compile a real-world app with Rust
- Connect the application to ScyllaDB (NoSQL data store)
- Negotiate tradeoffs related to data modeling and querying
- Manage and monitor the database for consistently low latencies
How Can I use the AI Hype in my Business Context?Daniel Lehner
𝙄𝙨 𝘼𝙄 𝙟𝙪𝙨𝙩 𝙝𝙮𝙥𝙚? 𝙊𝙧 𝙞𝙨 𝙞𝙩 𝙩𝙝𝙚 𝙜𝙖𝙢𝙚 𝙘𝙝𝙖𝙣𝙜𝙚𝙧 𝙮𝙤𝙪𝙧 𝙗𝙪𝙨𝙞𝙣𝙚𝙨𝙨 𝙣𝙚𝙚𝙙𝙨?
Everyone’s talking about AI but is anyone really using it to create real value?
Most companies want to leverage AI. Few know 𝗵𝗼𝘄.
✅ What exactly should you ask to find real AI opportunities?
✅ Which AI techniques actually fit your business?
✅ Is your data even ready for AI?
If you’re not sure, you’re not alone. This is a condensed version of the slides I presented at a Linkedin webinar for Tecnovy on 28.04.2025.
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfAbi john
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility as they elevate themselves to a new era in cryptocurrency.
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfAbi john
Big Data with Python
1. Big Data with Python
A gentle and simple introduction
Marcel Caraciolo
@marcelcaraciolo
Developer, scientist, contributor to the Crab recsys project, has worked with Python for 6 years, interested in mobile, education, machine learning and dataaaaa!
Recife, Brazil - http://aimotion.blogspot.com
3. About me
Co-founder of Crab - a Python recsys library
Chief Scientist at Atepassar, an e-learning social network
Co-founder and instructor at PyCursos, teaching Python on-line
Co-founder of Pingmind, on-line infrastructure for MOOCs
Interested in Python, mobile, e-learning and machine learning!
5. Big Data
“Big Data is any data that is expensive to manage and hard to extract value from.”
Michael Franklin, Thomas M. Siebel Professor of Computer Science and Director of the Algorithms, Machines and People Lab, University of California, Berkeley
6. Big Data
“The keepers of big data say they do it for the consumer’s benefit. But data have a way of being used for purposes other than originally intended.”
Erik Larson, 1989, Harper’s Magazine
8. Challenges
Volume: the size of the data
Velocity: the latency of data processing relative to the growing demand for interactivity
Variety: the diversity of sources, formats, quality, and structures
18. What does “scalable” mean?
Operationally:
In the past: works even if the data doesn’t fit in main memory
Now: can make use of 1000s of cheap computers
Algorithmically:
In the past: if you have N data items, you must do no more than N^m operations (polynomial-time algorithms)
Now: if you have N data items, you must do no more than N^m / k operations, for some large k (“polynomial-time algorithms must be parallelized”)
Soon: if you have N data items, you should do no more than N * log(N) operations
19. Example
Given a set of DNA sequences, find all sequences equal to GATTACGATATTA.
30. Searching for GATTACGATATTA, narrowing from 0% to 100% of the sorted data: how many comparisons did we do? 40 records, only 4 comparisons. N records, log(N) comparisons. This algorithm is O(log(N)): far better scalability.
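To make the log(N) claim concrete, here is a minimal sketch (not from the slides) of a logarithmic lookup over a sorted list of sequences; the sample data and function name are illustrative assumptions.

from bisect import bisect_left

def find_sequence(sorted_seqs, target):
    # bisect_left does O(log N) comparisons on a sorted list
    i = bisect_left(sorted_seqs, target)
    return i < len(sorted_seqs) and sorted_seqs[i] == target

seqs = sorted(["GATTACGATATTA", "ATTACG", "GGGAAA", "TTTACG"])
print(find_sequence(seqs, "GATTACGATATTA"))  # True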
31. Relational Databases
• Databases are good at needle-in-a-haystack problems:
– Extracting small results from big datasets
– Transparently provide old-style scalability
– Your query will always* finish, regardless of dataset size
– Indexes are easily built and automatically used when appropriate
CREATE INDEX seq_idx ON sequence(seq);
SELECT seq
FROM sequence
WHERE seq = 'GATTACGATATTA';
32. Example
New task: read trimming. Given a set of DNA sequences, trim the final n bps of each sequence and generate a new dataset.
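As a minimal sketch of the task (sample reads and the function name are assumptions; trimming is just string slicing), applying a pure function to every item is all the “map” step needs:

def trim_read(read, n=5):
    # drop the final n base pairs of one read
    return read[:-n] if len(read) > n else ""

reads = ["GATTACGATATTAGCGT", "ATTACGGATTACAAGCT"]
# embarrassingly parallel: each item is processed independently
trimmed = [trim_read(r) for r in reads]
print(trimmed)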
43. Time = 7
How much time did this take? 7 cycles. 40 records, 6 workers: O(N/k).
44. Schematic of a Parallel “Read Trimming” Task
You are given short “reads”: genomic sequences about 35-75 characters each.
Distribute the reads among k computers.
f f f f f f
f is a function to trim a read; apply it to every item.
Now we have a big distributed set of trimmed reads.
45. New Task: Convert 405k TIFF images to PNG
You are given TIFF images. Distribute the images among k computers.
f f f f f f
f is a function to convert TIFF to PNG; apply it to every item.
Now we have a big distributed set of converted images.
http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
46. New Task: Run thousands of simulations
You have sets of parameters for thousands of small simulations. Divide the parameter sets among k computers.
f f f f f f
f runs the simulation and produces some output; apply it to every item.
Now we have a big distributed set of simulation results.
http://escience.washington.edu/get-help-now/dave-williams-simulating-muscle-dynamics-cloud
47. Find the most common word in each document
You have millions of documents. Distribute the documents among k computers.
f f f f f f
f finds the most common word in a single document.
Now we have a big distributed list of (doc_id, word) pairs.
48. Consider a slightly more general program to compute the word frequency of every word in a single document.
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
(people, 2)
(government, 6)
(assume, 1)
(history, 2)
…
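A minimal single-document version of this program in Python (illustrative only; the tokenizer regex and the file name declaration.txt are assumptions):

import re
from collections import Counter

def word_freq(text):
    # compute the frequency of every word in one document
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

freqs = word_freq(open("declaration.txt").read())
print(freqs["government"])  # e.g. 6 for the abridged text above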
49. Compute the word frequency of 5M documents
You have millions of documents. Distribute the documents among k computers.
f f f f f f
For each document, f returns a set of (word, freq) pairs.
Now we have a big distributed list of sets of word freqs.
50. There’s a pattern here…
• A function that maps a read to a trimmed read
• A function that maps a TIFF image to a PNG image
• A function that maps a set of parameters to a simulation result
• A function that maps a document to its most common word
• A function that maps a document to a histogram of word frequencies
51. What if we want to compute the word frequency across all documents?
US Constitution, Declaration of Independence, Articles of Confederation:
(people, 78)
(government, 123)
(assume, 23)
(history, 38)
…
52. Compute the word frequency across 5M documents
You have millions of documents. Distribute the documents among k computers.
map map map map map map
For each document, return a set of (word, freq) pairs.
Now what?
• But we don’t want a bunch of little histograms; we want one big histogram.
• How can we make sure that a single computer has access to every occurrence of a given word, regardless of which document it appeared in?
53. Compute the word frequency across 5M documents
Distribute the documents among k computers.
map map map map map map
For each document, return a set of (word, freq) pairs. Now we have a big distributed list of sets of word freqs.
reduce reduce reduce reduce
Now just count the occurrences of each word. We have our distributed histogram.
54. Compute the word frequency across 5M docs
Some distributed algorithm…
Map
(Shuffle)
Reduce
55. MapReduce Programming Model
• Input & output: each a set of key/value pairs
• Programmer specifies two functions:
– map: processes an input key/value pair and produces a set of intermediate pairs
– reduce: combines all intermediate values for a particular key and produces a set of merged output values (usually just one)
map (in_key, in_value) -> list(out_key, intermediate_value)
reduce (out_key, list(intermediate_value)) -> list(out_value)
Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
slide source: Google, Inc.
56. Example: What does this do?
map(String input_key, String input_value):
// input_key: document name
// input_value: document contents
for each word w in input_value:
EmitIntermediate(w, 1);
reduce(String output_key, Iterator intermediate_values):
// output_key: word
// output_values: ????
int result = 0;
for each v in intermediate_values:
result += v;
Emit(result);
slide source: Google, Inc.
57. Example: Document Processing
The running example document is the abridged Declaration of Independence shown above.
58. Example: Word length histogram
How many “big”, “medium”, and “small” words are used in the document above?
59. Example: Word length histogram
The slide highlights each word of the document by its length:
Big = Yellow = 10+ letters
Medium = Red = 5..9 letters
Small = Blue = 2..4 letters
Tiny = Pink = 1 letter
60. Example: Word length histogram
Split the document into chunks and process each chunk on a different computer (Chunk 1, Chunk 2).
61. Example: Word length histogram
Map Task 1 (204 words) emits: (yellow, 17), (red, 77), (blue, 107), (pink, 3)
Map Task 2 (190 words) emits: (yellow, 20), (red, 71), (blue, 93), (pink, 6)
Each map output is a (key, value) pair; note that each task’s counts sum to its word total (17+77+107+3 = 204; 20+71+93+6 = 190).
62. Example: Word length histogram
Map task 1 output: (yellow, 17), (red, 77), (blue, 107), (pink, 3)
Map task 2 output: (yellow, 20), (red, 71), (blue, 93), (pink, 6)
Shuffle step: group the pairs by key, so each reduce task sees all the values for its keys:
(yellow, 17), (yellow, 20)
(red, 77), (red, 71)
(blue, 93), (blue, 107)
(pink, 6), (pink, 3)
Reduce tasks sum the values per key:
(yellow, 37), (red, 148), (blue, 200), (pink, 9)
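A minimal local sketch of this map/shuffle/reduce pipeline in plain Python (the bucket thresholds follow the legend on slide 59; the sample chunks and function names are illustrative):

from collections import defaultdict

def bucket(word):
    # classify a word by length, per the legend on slide 59
    n = len(word)
    if n >= 10: return "yellow"   # big
    if n >= 5:  return "red"      # medium
    if n >= 2:  return "blue"     # small
    return "pink"                 # tiny

def map_task(chunk):
    # emit one (color, 1) pair per word
    return [(bucket(w), 1) for w in chunk.split()]

def shuffle(pairs):
    # group all values by key, as the MapReduce runtime would
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # sum the occurrences for one key
    return (key, sum(values))

chunks = ["a declaration by the representatives", "governments long established"]
pairs = [p for c in chunks for p in map_task(c)]
print([reduce_task(k, v) for k, v in shuffle(pairs).items()])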
66. Taxonomy of Parallel Architectures
One end of the spectrum is easiest to program, but $$$$; the other end scales to 1000s of computers.
69. Large-Scale Data Processing
• Many tasks process big data, produce big data
• Want to use hundreds or thousands of CPUs
– ... but this needs to be easy
– Parallel databases exist, but they are expensive, difficult to set
up, and do not necessarily scale to hundreds of nodes.
• MapReduce is a lightweight framework, providing:
– Automatic parallelization and distribution
– Fault-tolerance
– I/O scheduling
– Status and monitoring
70. MapReduce Contemporaries
• Dryad (Microsoft): relational algebra
• Pig (Yahoo): near-relational algebra over MapReduce
• Hive (Facebook): SQL over MapReduce
• Cascading: relational algebra
• Clustera (U of Wisconsin)
• HBase: indexing on HDFS
72. More Examples: Build an Inverted Index
Input:
tweet1, (“I love pancakes for breakfast”)
tweet2, (“I dislike pancakes”)
tweet3, (“What should I eat for breakfast?”)
tweet4, (“I love to eat”)
Desired output:
“pancakes”, (tweet1, tweet2)
“breakfast”, (tweet1, tweet3)
“eat”, (tweet3, tweet4)
“love”, (tweet1, tweet4)
…
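A minimal sketch of the inverted index in MapReduce terms (plain Python, simulated locally; the tweet data is taken from the slide, the rest is illustrative):

from collections import defaultdict

def mapper(doc_id, text):
    # emit (word, doc_id) for every word in the document
    for word in text.lower().replace("?", "").split():
        yield word, doc_id

tweets = {
    "tweet1": "I love pancakes for breakfast",
    "tweet2": "I dislike pancakes",
    "tweet3": "What should I eat for breakfast?",
    "tweet4": "I love to eat",
}

index = defaultdict(list)            # the shuffle groups pairs by word ...
for doc_id, text in tweets.items():
    for word, d in mapper(doc_id, text):
        index[word].append(d)        # ... and the reducer concatenates doc ids

print(index["pancakes"])  # ['tweet1', 'tweet2']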
73. More Examples: Relational Join
Employee:
Name SSN
Sue 999999999
Tony 777777777
Assigned Departments:
EmpSSN DepName
999999999 Accounts
777777777 Sales
777777777 Marketing
Employee ⨝ Assigned Departments:
Name SSN EmpSSN DepName
Sue 999999999 999999999 Accounts
Tony 777777777 777777777 Sales
Tony 777777777 777777777 Marketing
74. Relational Join in MapReduce: Before Map Phase
Key idea: lump all the tuples together into one dataset, tagging each with its relation name:
Employee, Sue, 999999999
Employee, Tony, 777777777
Department, 999999999, Accounts
Department, 777777777, Sales
Department, 777777777, Marketing
What is this for?
75. Relational Join in MapReduce: Map Phase
key=999999999, value=(Employee, Sue, 999999999)
key=777777777, value=(Employee, Tony, 777777777)
key=999999999, value=(Department, 999999999, Accounts)
key=777777777, value=(Department, 777777777, Sales)
key=777777777, value=(Department, 777777777, Marketing)
Why do we use the SSN as the key? Because the shuffle then brings every Employee tuple and every Department tuple with the same SSN to the same reducer, which can output the joined rows.
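A minimal local sketch of this reduce-side join (plain Python; the tuples are the ones from the slide):

from collections import defaultdict

tuples = [
    ("Employee", "Sue", "999999999"),
    ("Employee", "Tony", "777777777"),
    ("Department", "999999999", "Accounts"),
    ("Department", "777777777", "Sales"),
    ("Department", "777777777", "Marketing"),
]

# Map phase: key every tuple by the join attribute (the SSN)
keyed = defaultdict(list)
for t in tuples:
    ssn = t[2] if t[0] == "Employee" else t[1]
    keyed[ssn].append(t)

# Reduce phase: within each key, pair every Employee with every Department
for ssn, group in keyed.items():
    emps = [t for t in group if t[0] == "Employee"]
    deps = [t for t in group if t[0] == "Department"]
    for _, name, _ in emps:
        for _, _, dep in deps:
            print(name, ssn, dep)  # e.g. Tony 777777777 Sales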
80. Matrix Multiply in MapReduce
C = A x B, where A has dimensions L,M and B has dimensions M,N.
• In the map phase:
– for each element (i,j) of A, emit ((i,k), A[i,j]) for k in 1..N
– for each element (j,k) of B, emit ((i,k), B[j,k]) for i in 1..L
• In the reduce phase, emit:
– key = (i,k)
– value = sum over j of A[i,j] * B[j,k]
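In practice the reducer needs to pair A[i,j] with the matching B[j,k], so a working version also carries the source matrix and the index j in each emitted value; the slide glosses over this. A minimal local sketch with small illustrative matrices:

from collections import defaultdict

A = [[1, 2], [3, 4]]          # L x M, with L=2, M=2
B = [[5, 6, 7], [8, 9, 10]]   # M x N, with N=3
L, M, N = 2, 2, 3

emitted = defaultdict(list)
# Map phase: tag each value with its origin matrix and the shared index j
for i in range(L):
    for j in range(M):
        for k in range(N):
            emitted[(i, k)].append(("A", j, A[i][j]))
for j in range(M):
    for k in range(N):
        for i in range(L):
            emitted[(i, k)].append(("B", j, B[j][k]))

# Reduce phase: for each key (i,k), sum A[i,j] * B[j,k] over j
C = [[0] * N for _ in range(L)]
for (i, k), values in emitted.items():
    a = {j: v for tag, j, v in values if tag == "A"}
    b = {j: v for tag, j, v in values if tag == "B"}
    C[i][k] = sum(a[j] * b[j] for j in a)

print(C)  # [[21, 24, 27], [47, 54, 61]]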
83. Cluster Computing
• Large number of commodity servers, connected by a high-speed commodity network
• Rack: holds a small number of servers
• Data center: holds many racks
[Figure credit: Mining of Massive Datasets, by Rajaraman and Ullman]
84. Cluster Computing
• Massive parallelism: 100s, or 1000s, or 10000s of servers, running for many hours
• Failure: if the mean time between failures is 1 year, then 10000 servers have about one failure per hour
slide src: Dan Suciu and Magda Balazinska
85. Distributed File System (DFS)
• For very large files: TBs, PBs
• Each file is partitioned into chunks, typically 64MB
• Each chunk is replicated several times (typically 3), on different racks, for fault tolerance
• Implementations:
– Google’s DFS: GFS, proprietary
– Hadoop’s DFS: HDFS, open source
slide src: Dan Suciu and Magda Balazinska
98. MapReduce example
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for every word in the line
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # sum the counts for each word
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount().run()
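To run it (assuming the script above is saved as word_freq.py and input.txt is any text file; both names are assumptions), mrjob runs locally by default and can target Hadoop or EMR with the -r flag:

python word_freq.py input.txt
python word_freq.py -r hadoop hdfs:///user/you/input.txt
python word_freq.py -r emr s3://your-bucket/input.txt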
99. The mrjob project
Created by the Yelp engineering team
Fully open source
Written entirely in Python
Uses MapReduce for processing
Runs both on Amazon EMR and on Hadoop
100. Who is mrjob for?
If you want to learn MapReduce, it is for you
If you have a monster of a problem, need lots of processing power, and don’t feel like fiddling with Hadoop
If you already have a Hadoop cluster and want to run Python scripts on it
If you want to migrate your Python code from Hadoop to EMR
If you don’t want to write Python (impossible!), it is not for you!
105. Distributed Computing with mrJob
https://github.com/Yelp/mrjob
Elsayed et al.: Pairwise Document Similarity in Large Collections with MapReduce
114. Interesting projects
pandas: data structures for fast data manipulation
- slicing
- indexing
- subsetting
Handling missing data
Aggregations, time series
“Pandas: a data analysis library for Python, poised to give R a run for its money…” http://pandas.pydata.org/
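A minimal taste of pandas (the data is illustrative; it shows missing-data handling, boolean subsetting, and a group-by aggregation):

import pandas as pd

df = pd.DataFrame({
    "city": ["Recife", "Recife", "Sao Paulo"],
    "sales": [10.0, None, 25.0],
})
df["sales"] = df["sales"].fillna(0)       # handling missing data
print(df[df["sales"] > 5])                # boolean indexing / subsetting
print(df.groupby("city")["sales"].sum())  # aggregation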
115. Interesting projects
Disco: another framework for distributed computing in Python with MapReduce.
Created by Nokia’s research institute.
Its backend is written in Erlang (functional, concurrent, and highly scalable!).
It does not use HDFS, the most widely used filesystem, but its own standard, DDFS.
http://discoproject.org/
116. Interesting projects
Python & MongoDB
http://api.mongodb.org/python/2.0/examples/map_reduce.html
MongoDB: a non-relational (NoSQL) database.
It has built-in native support for MapReduce.
You write the code in JavaScript, which is not very readable, and you stay tied to Mongo...
>>> reduce = Code("function (key, values) {"
... " var total = 0;"
... " for (var i = 0; i < values.length; i++) {"
... " total += values[i];"
... " }"
... " return total;"
... "}")
118. Interesting projects
Pydoop: a Python wrapper on top of Hadoop for distributed computing.
http://pydoop.sourceforge.net/docs/index.html
Nice, but it takes some work to configure.
119. Interesting projects
MapReduce with Python on Google App Engine
https://developers.google.com/appengine/docs/python/dataprocessing/
Still experimental, and you stay “tied” to the App Engine platform.
120. Interesting projects
scikit-learn: machine learning algorithms
Supervised & unsupervised
Preprocessing, feature extraction
Classifier evaluation, pipelines, feature selection.
http://scikit-learn.org/stable/
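A minimal supervised-learning example with scikit-learn (the iris dataset ships with the library; the choice of classifier is illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data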
121. Interesting projects
NLTK: natural language processing
Tools for tokenization, POS tagging, named entity recognition, classifiers, etc.
Many corpora available!
http://nltk.org/
122. Keep an eye on...
PyPLN: a pipeline for distributed natural language processing, made in Python.
https://github.com/NAMD/pypln
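A minimal NLTK example covering two of the tools mentioned above, tokenization and POS tagging (the model downloads are required on first use; the sentence is illustrative):

import nltk

nltk.download("punkt")                       # tokenizer model
nltk.download("averaged_perceptron_tagger")  # POS tagger model

tokens = nltk.word_tokenize("Python makes big data analysis simple.")
print(nltk.pos_tag(tokens))  # [('Python', 'NNP'), ('makes', 'VBZ'), ...]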
124. Future Releases
Planned release 0.13
New home for python-recsys: https://github.com/python-recsys/crab
Planned release 0.14
Support for item-based recommenders using MapReduce with MrJob
New committers: vinnigracindo, ocelma, fcurella
125. Join us!
1. Read our wiki page: https://github.com/muricoca/crab/wiki/Developer-Resources
2. Check out our current sprints and open issues: https://github.com/muricoca/crab/issues
3. Fork the repo; contributions come in as pull requests
4. Join us at irc.freenode.net #muricoca or on our discussion list: http://groups.google.com/group/scikit-crab
133. Big Data with Python
A gentle and simple introduction
Marcel Caraciolo
@marcelcaraciolo
Developer, scientist, contributor to the Crab recsys project, has worked with Python for 6 years, interested in mobile, education, machine learning and dataaaaa!
Recife, Brazil - http://aimotion.blogspot.com