Using Anaconda to light up dark data. My talk given to the Berkeley Institute for Data Science describing Anaconda and the Blaze ecosystem for bringing a virtual analytical database to your data.
This document provides a summary of a presentation on Python and its role in big data analytics. It discusses Python's origins and growth, key packages like NumPy and SciPy, and new tools being developed by Continuum Analytics like Numba, Blaze, and Anaconda to make Python more performant for large-scale data processing and scientific computing. The presentation outlines Continuum's vision of an integrated platform for data analysis and scientific work in Python.
Making NumPy-style and Pandas-style code faster and run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
This document discusses tools for making NumPy and Pandas code faster and able to run in parallel. It introduces the Dask library, which allows users to work with large datasets in a familiar Pandas/NumPy style through parallel computing. Dask implements parallel DataFrames, Arrays, and other collections that mimic their Pandas/NumPy counterparts. It can scale computations across multiple cores on a single machine or across many machines in a cluster. The document provides examples of using Dask to analyze large CSV and text data in parallel through DataFrames and Bags. It also discusses scaling computations from a single laptop to large clusters.
This document summarizes Peter Wang's keynote speech at PyData Texas 2015. It begins by looking back at the history and growth of PyData conferences over the past 3 years. It then discusses some of the main data science challenges companies currently face. The rest of the speech focuses on the role of Python in data science, how the technology landscape has evolved, and PyData's mission to empower scientists to explore, analyze, and share their data.
This document provides an overview of data science and machine learning with Anaconda. It begins with an introduction to Travis Oliphant, the founder of Continuum Analytics. It then discusses how Continuum created two organizations, NumFOCUS and Continuum Analytics, to support open source scientific computing and provide enterprise software and services. The rest of the document outlines how data science and machine learning are growing rapidly with Python and describes some of Anaconda's key capabilities for data science workflows and empowering data science teams.
PyData NYC 2012 was a conference about using Python for scientific, engineering, and technical computing, as well as big data problems. Python has become widely used in industries like national labs, finance, oil and gas, and aerospace/defense. The PyData community aims to build tools for out-of-core and distributed data structures and algorithms using Python's accessibility. This will empower more domain experts and occasional programmers to solve real problems easily.
The document discusses Python and its suitability for data science. It describes Python's Zen-like approach of focusing on simplicity and empowering users. It promotes Python's data science stack, including NumPy, Pandas, scikit-learn and others, and how they allow for rapid data analysis and model building. It also describes the Anaconda distribution and conda package manager for easily managing Python environments and packages.
With Anaconda (in particular Numba and Dask) you can scale up your NumPy and Pandas stack to many CPUs and GPUs, as well as scale out to run on clusters of machines, including Hadoop.
A look inside pandas design and development (Wes McKinney)
This document summarizes Wes McKinney's presentation on pandas, an open source data analysis library for Python. McKinney is the lead developer of pandas and discusses its design, development, and performance advantages over other Python data analysis tools. He highlights key pandas features like the DataFrame for tabular data, fast data manipulation capabilities, and its use in financial applications. McKinney also discusses his development process, tools like IPython and Cython, and optimization techniques like profiling and algorithm exploration to ensure pandas' speed and reliability.
Enabling Python to be a Better Big Data Citizen (Wes McKinney)
These slides are from my talk at the NYC Python Meetup at ODSC Office NYC on February 17, 2016. It discusses Python's architectural challenges to interoperate with the Hadoop ecosystem and how a new project, Apache Arrow, will help.
Hadoop or Spark: is it an either-or proposition? (Slim Baltagi)
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...) (Wes McKinney)
This document discusses pandas, a popular Python library for data analysis, and its limitations. It introduces Badger, a new project from DataPad that aims to address some of pandas' shortcomings like slow performance on large datasets and lack of tight database integration. The creator describes Badger as using compressed columnar storage, immutable data structures, and C kernels to perform analytics queries much faster than pandas or databases on benchmark tests of a multi-million row dataset. He envisions Badger becoming a distributed, multicore analytics platform that can also be used for ETL jobs.
This document provides an overview and objectives of a Python course for big data analytics. It discusses why Python is well-suited for big data tasks due to its libraries like PyDoop and SciPy. The course includes demonstrations of web scraping using Beautiful Soup, collecting tweets using APIs, and running word count on Hadoop using Pydoop. It also discusses how Python supports key aspects of data science like accessing, analyzing, and visualizing large datasets.
New Developments in H2O: April 2017 Edition (Sri Ambati)
H2O presentation at Trevor Hastie and Rob Tibshirani's Short Course on Statistical Learning & Data Mining IV: http://web.stanford.edu/~hastie/sldm.html
PDF and Keynote version of the presentation available here: https://github.com/h2oai/h2o-meetups/tree/master/2017_04_06_SLDM4_H2O_New_Developments
Large scale, interactive ad-hoc queries over different datastores with Apache... (jaxLondonConference)
Presented at JAX London 2013
Apache Drill is a distributed system for interactive ad-hoc query and analysis of large-scale datasets. It is the open source version of Google’s Dremel technology. Apache Drill is designed to scale to thousands of servers and to process petabytes of data in seconds, enabling SQL-on-Hadoop and supporting a variety of data sources.
This document discusses Apache Arrow, an open source project that aims to standardize in-memory data representations to enable efficient data sharing across systems. It summarizes Arrow's goals of improving performance by 10-100x on many workloads through a common data layer, reducing serialization overhead. The document outlines Arrow's language bindings for Java, C++, Python, R, and Julia and efforts to integrate Arrow with systems like Spark, Drill and Impala to enable faster analytics. It encourages involvement in the Apache Arrow community.
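As a taste of the idea described above, here is a small sketch using today's pyarrow package (which postdates this talk; the column names and values are invented for illustration):

import pyarrow as pa

# Build a columnar record batch: values live in contiguous Arrow buffers
batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "label"],
)

# Serialize with the Arrow IPC stream format; the wire layout matches the
# in-memory layout, so there is no per-value serialization overhead
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

# Reading the stream back is effectively zero-copy
table = pa.ipc.open_stream(sink.getvalue()).read_all()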
This document discusses tools for social network analysis and visualization. It covers Netvizz, which extracts data from Facebook for research. It also covers Pajek and Gephi, two programs for analyzing and visualizing networks. Pajek is suitable for large networks with thousands of nodes, while Gephi is interactive and can handle networks of up to 100,000 nodes. Both support a variety of input and output formats and feature layout algorithms and metrics for analysis.
My Data Journey with Python (SciPy 2015 Keynote) (Wes McKinney)
Wes McKinney gave a keynote talk at SciPy 2015 about his journey with Python for data analysis from 2007 to present day. He started as a mathematician with no exposure to Python or data analysis tools. His first job was at a quant hedge fund where he encountered frustrations with productivity due to extensive use of SQL and Excel. In 2008, he began experimenting with Python and created early versions of pandas to improve productivity for his projects. This led to open sourcing pandas in 2009 and evangelizing Python more broadly within his company and community.
PyData: Past, Present, Future (PyData SV 2014 Keynote) (Peter Wang)
From the closing keynote: a look back at the last two years of PyData, a discussion of Python's role in the growing and changing data analytics landscape, and encouragement of ways to grow the community.
Slides from Matt Dowle's presentation at H2O Open Tour: NYC
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Travis Oliphant, "Python for Speed, Scale, and Science" (Fwdays)
Python is sometimes discounted as slow because of its dynamic typing and interpreted nature and not suitable for scale because of the GIL. But, in this talk, I will show how with the help of talented open-source contributors around the world, we have been able to build systems in Python that are fast and scalable to many machines and how this has helped Python take over Science.
Talk given at the first OmniSci user conference, where I discuss cooperating with open-source communities to ensure you get useful answers quickly from your data. I also get a chance to introduce OpenTeams in this talk and discuss how it can help companies cooperate with communities.
Linux Distribution Collaboration …on a Mainframe!All Things Open
Presented at All Things Open 2023
Presented by Elizabeth K. Joseph - IBM
Title: Linux Distribution Collaboration …on a Mainframe!
Abstract: Linux has run on the mainframe architecture (s390x) for over 20 years now, and there’s even Linux-only mainframe hardware! But tight collaboration between the Linux distributions is rather new. Enter the Open Mainframe Project Linux Distributions Working Group, founded in late 2021.
Bringing together various Linux distributions, both corporate-backed and community-driven, representatives from openSUSE, Debian, Fedora, SUSE, and more immediately joined the effort to share bug reports and patches that impact all the distributions. Issues are often shared and discussed on the mailing list, and more complicated topics covered during the monthly meetings. The working group has a number of success stories that will be shared.
Future potential issues are also tackled, and notes are shared about upstream changes that may soon impact the package processes. In the latest effort, the team has started thinking about upstream projects to invite to the group, to be more proactive about changes that may cause problems on the s390x architecture.
But more importantly, this is a story about community and collaboration. Many people view the various Linux distributions as a competitive space, but like so much of the open source software community, we are all more successful when we share knowledge about our core. The success of this working group, and growing enthusiasm for it from new Linux distributions who are joining, is a great example of this.
Large Data Analysis with PyTables
This presentation has been collected from several other PyTables presentations.
For more presentations in this field, please refer to this link (http://pytables.org/moin/HowToUse#Presentations).
An Incomplete Data Tools Landscape for Hackers in 2015 (Wes McKinney)
Wes McKinney gives an overview of the current data analysis tools landscape in Python and R. He discusses essential Python packages like NumPy, pandas, and scikit-learn. For R, he covers packages in the "Hadley stack" like dplyr and ggplot2. IPython/Jupyter notebooks are also mentioned as a platform for interactive data analysis across languages. The talk aims to highlight trends, opportunities, and challenges in the open source data science tool ecosystem.
This document discusses PyTables, a Python library for managing hierarchical datasets and efficiently analyzing large amounts of data. It begins by introducing PyTables and its use of HDF5 for portability and extensibility. Key features of PyTables discussed include its object-oriented interface, optimization of memory and disk usage, and fast querying capabilities. The document then covers techniques for maximizing performance like Numexpr for complex expressions, NumPy for powerful data containers, compression algorithms, and caching. Blosc compression is highlighted for its ability to compress faster than memory speed.
This document discusses using PyTables to analyze large datasets. PyTables is built on HDF5 and uses NumPy to provide an object-oriented interface for efficiently browsing, processing, and querying very large amounts of data. It addresses the problem of CPU starvation by utilizing techniques like caching, compression, and high performance libraries like Numexpr and Blosc to minimize data transfer times. PyTables allows fast querying of data through flexible iterators and indexing to facilitate extracting important information from large datasets.
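To make the in-kernel query idea concrete, here is a minimal PyTables sketch (the file name, column layout, and threshold are illustrative assumptions, not taken from the slides):

import tables as tb

class Particle(tb.IsDescription):
    name = tb.StringCol(16)    # fixed-width string column
    energy = tb.Float64Col()

with tb.open_file("demo.h5", mode="w") as f:
    table = f.create_table("/", "particles", Particle)
    row = table.row
    for i in range(1000):
        row["name"] = "p%d" % i
        row["energy"] = i / 1000.0
        row.append()
    table.flush()
    # In-kernel query: the condition is evaluated by Numexpr over chunks
    # of the table rather than as a Python loop over rows
    hot = [r["energy"] for r in table.where("energy > 0.5")]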
How can .NET contribute to data science? What is .NET Interactive? What do notebooks have to do with it? And Apache Spark? And Pythonism? And Azure? In this session we try to put these ideas in order.
Numba is a Python compiler that uses type information to generate optimized machine code from Python functions. It allows Python code to run as fast as natively compiled languages for numeric computation. The goal is to provide rapid iteration and development along with fast code execution. Numba works by compiling Python code to LLVM bitcode and then to machine code, using type information from NumPy. An example shows a sinc function being JIT-compiled. Future work includes supporting more Python features like structures and objects.
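A hedged sketch of what such a JIT-compiled sinc function can look like (illustrative, not the talk's exact code):

import math
from numba import jit

@jit(nopython=True)
def sinc(x):
    # normalized sinc; compiled to machine code on first call,
    # using the runtime type of x
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)

sinc(0.5)   # first call triggers compilation; later calls run at native speed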
ScicomP 2015 presentation discussing best practices for debugging CUDA and OpenACC applications with a case study on our collaboration with LLNL to bring debugging to the OpenPOWER stack and OMPT.
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L... (Simplilearn)
The document discusses several deep learning frameworks including TensorFlow, Keras, PyTorch, Theano, Deep Learning 4 Java, Caffe, Chainer, and Microsoft CNTK. TensorFlow was developed by Google Brain Team and uses dataflow graphs to process data. Keras is a high-level neural network API that runs on top of TensorFlow, Theano, and CNTK. PyTorch was designed for flexibility and speed using CUDA and C++ libraries. Theano defines and evaluates mathematical expressions involving multi-dimensional arrays efficiently in Python. Deep Learning 4 Java integrates with Hadoop and Apache Spark to bring AI to business environments. Caffe focuses on image detection and classification using C++ and Python. Chainer was developed in collaboration with several companies
SystemML is an Apache project that provides a declarative machine learning language for data scientists. It aims to simplify the development of custom machine learning algorithms and enable scalable execution on everything from single nodes to clusters. SystemML provides pre-implemented machine learning algorithms, APIs for various languages, and a cost-based optimizer to compile execution plans tailored to workload and hardware characteristics in order to maximize performance.
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015 (Mike Broberg)
Use Apache Spark Streaming with IBM Watson on Bluemix to perform sentiment analysis and track how a conversation is trending on Twitter.
By David Taieb: https://twitter.com/DTAIEB55
Video: https://youtu.be/KLc_wazud3s
Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/
The lightning talks covered various Netflix OSS projects including S3mper, PigPen, STAASH, Dynomite, Aegisthus, Suro, Zeno, Lipstick on GCE, AnsWerS, and IBM. 41 projects were discussed and the need for a cohesive Netflix OSS platform was highlighted. Matt Bookman then gave a presentation on running Lipstick and Hadoop on Google Cloud Platform using Google Compute Engine and Cloud Storage. He demonstrated running Pig jobs on Compute Engine and discussed design considerations for cloud-based Hadoop deployments. Finally, Peter Sankauskas from @Answers4AWS discussed initial ideas around CloudFormation for Asgard and deploying various Netflix OSS
Sanger, upcoming OpenStack for Bio-informaticians (Peter Clapham)
Delivery of a new bio-informatics infrastructure at the Wellcome Trust Sanger Center. We include how to programmatically create, manage, and provide provenance for images used both at Sanger and elsewhere, using open source tools and continuous integration.
The document provides an overview and agenda for an Amazon Deep Learning presentation. It discusses AI and deep learning at Amazon, gives a primer on deep learning and applications, provides an overview of MXNet and Amazon's investments in it, discusses deep learning tools and usage, and provides two application examples using MXNet on AWS. It concludes by discussing next steps and a call to action.
This document discusses current trends in high performance computing. It begins with an introduction to high performance computing and its applications in science, engineering, business analysis, and more. It then discusses why high performance computing is needed due to changes in scientific discovery, the need to solve larger problems, and modern business needs. The document also discusses the top 500 supercomputers in the world and provides examples of some of the most powerful systems. It then covers performance development trends and challenges in increasing processor speeds. The rest of the document discusses parallel computing approaches using multi-core and many-core architectures, as well as cluster, grid, and cloud computing models for high performance.
OS for AI: Elastic Microservices & the Next Gen of ML (Nordic APIs)
AI has been a hot topic lately, and advances are constantly being made in what is possible, but there has not been as much discussion of the infrastructure and scaling challenges that come with it. How do you support dozens of different languages and frameworks, and make them interoperate invisibly? How do you scale to run abstract code from thousands of different developers, simultaneously and elastically, while maintaining less than 15ms of overhead?
At Algorithmia, we’ve built, deployed, and scaled thousands of algorithms and machine learning models, using every kind of framework (from scikit-learn to tensorflow). We’ve seen many of the challenges faced in this area, and in this talk I’ll share some insights into the problems you’re likely to face, and how to approach solving them.
In brief, we’ll examine the need for, and implementations of, a complete “Operating System for AI” – a common interface for different algorithms to be used and combined, and a general architecture for serverless machine learning which is discoverable, versioned, scalable and sharable.
At my first visit to SciPy in Latin America, I was able to review the history of PyData, SciPy, and NumFOCUS, and discuss how to grow its communities and cooperate in the future. I also introduce OpenTeams as a way for open-source contributors to grow their reputation and build businesses.
Keynote talk at PyCon Estonia 2019 where I discuss how to extend CPython and how that has led to a robust ecosystem around Python. I then discuss the need to define and build a Python extension language I later propose as EPython on OpenTeams: https://openteams.com/initiatives/2
Standardizing arrays -- Microsoft Presentation (Travis Oliphant)
This document discusses standardizing N-dimensional arrays (tensors) in Python. It proposes creating a "uarray" interface that downstream libraries could use to work with different array implementations in a common way. This would include defining core concepts like shape, data type, and math operations for arrays. It also discusses collaborating with mathematicians on formalizing array operations and learning from NumPy's generalized ufunc approach. The goal is to enhance Python's array ecosystem and allow libraries to work across hardware backends through a shared interface rather than depending on a single implementation.
A lecture given for Stats 285 at Stanford on October 30, 2017. I discuss how OSS technology developed at Anaconda, Inc. has helped to scale Python to GPUs and Clusters.
This document provides an overview of Continuum Analytics and Python for data science. It discusses how Continuum created two organizations, Anaconda and NumFOCUS, to support open source Python data science software. It then describes Continuum's Anaconda distribution, which brings together 200+ open source packages like NumPy, SciPy, Pandas, Scikit-learn, and Jupyter that are used for data science workflows involving data loading, analysis, modeling, and visualization. The document outlines how Continuum helps accelerate adoption of data science through Anaconda and provides examples of industries using Python for data science.
Continuum Analytics provides the Anaconda platform for data science. It includes popular Python data science packages like NumPy, SciPy, Pandas, Scikit-learn, and the Jupyter notebook. Continuum was founded by Travis Oliphant, creator of NumPy and Numba, to support the open source Python data science community and make it easier to do data analytics and visualization using Python. The Anaconda platform has over 2 million users and makes it simple to install and work with Python and related packages for data science and machine learning.
Talk given to the Philly Python Users Group (PUG) on October 1, 2015: http://www.meetup.com/phillypug/ Thanks SIG (http://www.sig.com) for hosting!
Conda is a cross-platform package manager that lets you quickly and easily build environments containing complicated software stacks. It was built to manage the NumPy stack in Python but can be used to manage any complex software dependencies.
Blaze: a large-scale, array-oriented infrastructure for Python (Travis Oliphant)
This talk gives a high-level overview of the motivation, design goals, and status of the Blaze project from Continuum Analytics which is a large-scale array object for Python.
Numba: Array-oriented Python Compiler for NumPy (Travis Oliphant)
Numba is a Python compiler that translates Python code into fast machine code using the LLVM compiler infrastructure. It allows Python code that works with NumPy arrays to be just-in-time compiled to native machine instructions, achieving performance comparable to C, C++ and Fortran for numeric work. Numba provides decorators like @jit that can compile functions for improved performance on NumPy array operations. It aims to make Python a compiled and optimized language for scientific computing by leveraging type information from NumPy to generate fast machine code.
2. This talk will be about (mostly) the free and/or open source software we are building.
Enterprise Python, Scientific Computing, Data Processing, Data Analysis, Visualisation, Scalable Computing
• Products
• Training
• Support
• Consulting
7. NumPy and SciPy are quite successful
Thanks to the large and diverse community around them
(Matplotlib, IPython, SymPy, Pandas, etc.)
I estimate 1.5 million to 2 million users
Only incremental improvements are possible with these projects at this point.
Thus, we needed to start new projects...
8. Related Open Source Projects
Blaze: High-performance Python library for modern vector computing, distributed and streaming data
Numba: Vectorizing Python compiler for multicore and GPU, using LLVM
Bokeh: Interactive, grammar-based visualization system for large datasets
Common Thread: High-level, expressive language for domain experts; innovative compilers & runtimes for efficient, powerful data transformation
9. Conda and Anaconda
• Cross-platform package management
• Multiple environments allow you to have multiple versions of packages installed on a system
• Easy app deployment
• Taming open source
Free for all users; enterprise support available!
10. Why Conda?
• Linux users stopped complaining about Python deployment
• I made major mistakes in the management of NumPy/SciPy:
• too much in SciPy (SciPy as distribution) --- the scikits model is much better (tighter libraries developed by smaller teams)
• gave in to community desire for binary ABI compatibility in NumPy, motivated by the difficulty of reproducing a Python install
• Need for a cross-platform way to install major Python extensions (many with dependencies on large C or C++ libraries)
• Python can’t be ubiquitous if people struggle to just get it and then manage it.
12. What is Conda
• Full package management (like yum or apt-get), but cross-platform
• Control over environments (using hard-link farms) --- better than virtualenv (virtualenv today is like the distutils and setuptools of several years ago: great at first, but you will end up hating it)
• Architected to be able to manage any packages (R, Scala, Clojure, Haskell, Ruby, JS)
• SAT solver to manage dependencies
• User-definable repositories
13. New Features and Binstar
• Build command from recipe --- many recipes here: https://github.com/ContinuumIO/conda-recipes
• Upload recipes to Binstar (the last mile in binary package hosting and deployment for any language)
• “binstar in beta” is the beta code
• Personal conda repositories --- https://conda.binstar.org/travis
• The free Continuum Anaconda repo will be on binstar.org
• Private packages and behind-the-firewall satellites available
• Free build queue on Linux (Mac and Windows coming soon) for hosted conda recipes
14. Demo
create Python 3 environment with IPython and scipy
create new recipe from PyPI (yunomi)
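A sketch of those two demo steps as shell commands (the package set comes from the slide; the exact flags reflect conda and conda-build usage and are offered as an illustration, not a transcript of the demo):

conda create -n py3 python=3 ipython scipy   # new Python 3 environment
source activate py3                          # "conda activate py3" on current conda
conda skeleton pypi yunomi                   # generate a conda recipe from the PyPI package
conda build yunomi                           # build a binary conda package from the recipe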
15. Packaging and Distribution Solved
• conda and binstar solve most of the problems that we have seen people encounter in managing Python installations (especially in large-scale institutions).
• they are supported solutions that can remove the technology pain of managing Python
• some problems, though, are people
16. Anaconda (open)
Free enterprise-ready Python distribution of open-source tools for large-scale data processing, predictive analytics, and scientific computing
17. Anaconda Add-Ons (paid-for)
• Revolutionary Python-to-GPU compiler: extends Numba to take a subset of Python to the GPU (program CUDA in Python), with CUDA FFT / BLAS interfaces
• Fast, memory-efficient Python interface for SQL databases, NoSQL stores, Amazon S3, and large data files
• NumPy, SciPy, scikit-learn, and NumExpr compiled against Intel’s Math Kernel Library (MKL)
19. Why Numba?
• Python is too slow for loops
• Most people are not learning C/C++/Fortran today
• Cython is an improvement (but still verbose and needs a C compiler)
• NVIDIA is using LLVM for the GPU
• Many people work with large typed containers (NumPy arrays)
• We want to take high-level, array-oriented expressions and compile them to fast code
20. NumPy + Mamba = Numba
Diagram: a Python function goes through LLVMPY into the LLVM library, whose backends (OpenCL, ISPC, CUDA, CLANG, OpenMP, from Intel, NVIDIA, Apple, AMD, and ARM) emit machine code.
22. Numba
from numba import jit

@jit('void(f8[:,:], f8[:,:], f8[:,:])')
def filter(image, filt, output):
    # 2-D filtering written as plain Python loops;
    # Numba compiles the whole loop nest to machine code
    M, N = image.shape
    m, n = filt.shape
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2, j+l-n//2] * filt[k, l]
            output[i, j] = result
~1500x speed-up
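A minimal driver for the kernel above (the array sizes and random test data are illustrative assumptions, not from the slide):

import numpy as np

image = np.random.rand(256, 256)
filt = np.ones((5, 5)) / 25.0    # simple 5x5 averaging filter
output = np.zeros_like(image)
filter(image, filt, output)      # first call triggers JIT compilation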
23. Numba changes the game!
Diagram: C, C++, Fortran, and (via Numba) Python all lower to LLVM IR, which targets x86, ARM, and PTX.
Numba turns (a subset of) Python into a “compiled language” as fast as C (but much more flexible). You don’t have to reach for C/C++.
24. Laplace Example
from numba import jit

@jit('void(double[:,:], double, double)')
def numba_update(u, dx2, dy2):
    # explicit loops over the interior grid points
    # (the original slide used Python 2's xrange)
    nx, ny = u.shape
    for i in range(1, nx-1):
        for j in range(1, ny-1):
            u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                      (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))

@jit('void(double[:,:], double, double)')
def numbavec_update(u, dx2, dy2):
    # the same update written with NumPy slice notation
    u[1:-1,1:-1] = ((u[2:,1:-1] + u[:-2,1:-1])*dy2 +
                    (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))

Adapted from http://www.scipy.org/PerformancePython, originally by Prabhu Ramachandran
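A possible driver for these updates (the grid size, boundary condition, and iteration count are arbitrary choices for illustration):

import numpy as np

dx = dy = 0.1
u = np.zeros((100, 100))
u[0, :] = 1.0                      # fixed boundary values on one edge
for _ in range(1000):
    numba_update(u, dx*dx, dy*dy)  # one in-place relaxation sweep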
25. Results of Laplace example
Version        Time   Speed Up
NumPy          3.19   1.0
Numba          2.32   1.38
Vect. Numba    2.33   1.37
Cython         2.38   1.34
Weave          2.47   1.29
Numexpr        2.62   1.22
Fortran Loops  2.30   1.39
Vect. Fortran  1.50   2.13
https://github.com/teoliphant/speed.git
26. LLVMPy worth looking at
LLVM (via LLVMPy) has done much of the heavy lifting.
LLVMPy = compilers for everybody
27. What is wrong with NumPy?
• Dtype system is difficult to extend
• Many dtypes needed (missing data, enums, variable-length strings)
• Immediate mode creates huge temporaries (spawning Numexpr)
• “Almost” an in-memory database comparable to SQLite (missing indexes)
• Integration with sparse arrays
• Standard structure-of-arrays representation...
• Missing multi-methods
• Optimization: minimal support for multi-core / GPU
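To make the temporaries point concrete, here is a small illustration with NumPy and Numexpr (the array size is an arbitrary choice):

import numpy as np
import numexpr as ne

a, b, c = (np.random.rand(1_000_000) for _ in range(3))
out1 = a + b * c                 # NumPy materializes b*c as a full temporary array
out2 = ne.evaluate("a + b * c")  # Numexpr evaluates blockwise, avoiding the temporary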
28. Now What?
After watching NumPy and SciPy get used all over Wall Street and by many scientists and engineers in industry --- what would we do differently?
30. Blaze Array or Table
Diagram: a Blaze Array or Table is built from a Data Descriptor over Data Buffers, plus Indexes and Operations, with backends such as NumPy, BLZ (a persistent format), RDBMS, CSV, and data streams.
31. Blaze Deferred Arrays
+"
A" *"
B" C"
A + B*C
• Symbolic objects which build a graph
• Represents deferred computation
Usually what you have when
you have a Blaze Array
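A toy sketch of the deferred-graph idea (this is not Blaze's actual API; every name here is invented for illustration):

import numpy as np

class Expr:
    # a node in a deferred-computation graph
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def __add__(self, other): return Expr('+', self, other)
    def __mul__(self, other): return Expr('*', self, other)
    def eval(self):
        # computation only happens when eval() walks the graph
        lhs, rhs = self.left.eval(), self.right.eval()
        return lhs + rhs if self.op == '+' else lhs * rhs

class Leaf(Expr):
    # wraps concrete data; evaluation just returns it
    def __init__(self, data): self.data = data
    def eval(self): return self.data

A, B, C = Leaf(np.arange(3.0)), Leaf(np.ones(3)), Leaf(np.full(3, 2.0))
graph = A + B * C      # builds Expr('+', A, Expr('*', B, C)); nothing computed yet
result = graph.eval()  # -> array([2., 3., 4.])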
34. Progress
• Basic calculations work out-of-core (via Numba and LLVM)
• Hard dependency on dynd and dynd-python (a dynamic, C++-only multi-dimensional library like NumPy, but with many improvements)
• Persistent arrays from BLZ
• Basic array-server functionality for layering over CSV files
• 0.2 release in 1-2 weeks; 0.3 within a month after that (first usable release)
36. Bokeh Plotting Library
• Interactive graphics for the web
• Designed for large datasets
• Designed for streaming data
• Native interface in Python
• Fast JavaScript component
• DARPA funded
• v0.1 release imminent
37. Reasons for Bokeh
1. Plotting must happen near the data too
2. Quick iteration is essential => interactive visualization
3. Interactive visualization on remote data => use the browser
4. Almost all web plotting libraries are either: (a) designed for JavaScript programmers, or (b) designed to output static graphs
5. We designed Bokeh to be dynamic graphing on the web for Python programmers
6. Will include “abstract” or “synthetic” rendering (working on Hadoop and Spark compatibility)
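For flavor, here is a minimal line plot against today's bokeh.plotting API (these names postdate the v0.1 mentioned above and are offered as an illustration):

import numpy as np
from bokeh.plotting import figure, output_file, show

x = np.linspace(0, 4*np.pi, 200)
output_file("sine.html")                      # standalone HTML document
p = figure(title="sin(x)", width=400, height=300)
p.line(x, np.sin(x), line_width=2)            # an interactive JS plot, driven from Python
show(p)                                       # opens the plot in a browser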
38. Wakari
• Browser-based data analysis and visualization platform
• WordPress / YouTube / GitHub for data analysis
• Full Linux environment with Anaconda Python
• Can be installed on internal clusters & servers
39. Why Wakari?
• Data is too big to fit on your desktop
• You need compute power but don’t have easy access to a large cluster (the cloud is sitting there with lots of power)
• Configuration of software on a new system stinks (especially a cluster)
• Collaborative data analytics --- you want to build a complex technical workflow and then share it with others easily (without requiring them to do painful configuration to see your results)
• IPython Notebook is awesome --- let’s share it (but we also need the dependencies and data)
40. Wakari
• Free account has 512 MB RAM / 10 GB disk and shared multi-core CPU
• Easily spin up map-reduce clusters (Disco and Hadoop)
• Use IPython Parallel on many nodes in the cloud
• Develop GUI apps (possibly in Anaconda) and publish them easily to Wakari, based on the full power of scientific Python --- complex technical workflows (IPython notebook for now)
42. Continuum Data Explorer (CDX)
• Open Source
• Goal is interactivity
• Combination of IPython REPL, Bokeh, and tables
• Tight integration between GUI elements and REPL
• Current features:
- Namespace viewer (mapped to IPython namespace)
- DataTable widget with group-by, computed columns, and advanced filters
- Interactive plots connected to tables
44. Conclusion
Our projects circle around giving tools to experts (occasional programmers or domain experts) to enable them to move their expertise to the data to get insights (keep data where it is and move high-level but performant code).
Join us or ask how we can help you!