SlideShare a Scribd company logo
{
PANDAS
Not your ordinary fuzzy bear
Agenda:
6:45-6:50 Why do we need Pandas
6:50-7:00 General Information (Download etc..)
7:00-7:15 Quick Tour and Use Cases using Python Notebooks
7:15-7:30 Q&A
Data Science has been
around for much longer
than it’s been buzz
worthy. At it’s core we
simply apply the
scientific method to data:
(Data) Scientists ask
questions, do research,
construct hypotheses,
test, analyze , interpret
results and repeat.
Too much testing Too little
Observe and Prep.
We think Data looks clean and
pretty
But the reality is that it looks
like well… like…
But, “ETL is not as fun as
Statistics, Machine learning …”
Queue the PANDA !
Time to show you that it is
pandas is an open source, BSD-licensed library
providing high-performance, easy-to-use data
structures and data analysis tools for the Python
programming language.
You can download Pandas using standard pypi. I recommend, however you try
the continuum analytics package provided for free by Anaconda. It includes
pandas, numpy, scipy ,python notebooks and the Spyder editor.
http://pandas.pydata.org/
http://docs.continuum.io/anaconda/install.html
Quick Tour using open Financial data set
1. Data cleansing, manipulation, merging etc…
2.Web Crawling and data preparation using scrapy-pandas pipeline
A fast and efficient DataFrame object for data manipulation with integrated
indexing;
Tools for reading and writing data between in-memory data structures and
different formats: CSV and text files, Microsoft Excel, SQL databases, and the
fast HDF5 format;
Intelligent data alignment and integrated handling of missing data: gain
automatic label-based alignment in computations and easily manipulate
messy data into an orderly form;
Flexible reshaping and pivoting of data sets; Intelligent label-based slicing,
fancy indexing, and subsetting of large data sets;
Columns can be inserted and deleted from data structures for size
mutability; Aggregating or transforming data with a powerful group by
engine allowing split-apply-combine operations on data sets;
High performance merging and joining of data sets; Hierarchical axis
indexing provides an intuitive way of working with high-dimensional data
in a lower-dimensional data structure;
Time series-functionality: date range generation and frequency conversion,
moving window statistics,moving window linear regressions, date shifting
and lagging. Even create domain-specific time offsets and join time series
without losing data;
Use Case 1: Determine metric for sales data for company/client using
common mapping (geography, age, etc..)
Data: Data set is in numerous files with different file names, columns,
data types, null values, numeric and text data. How can you:
a. Combine all the files quickly
b. identify and clean erroneous data
c. create mapping that will link data sets (all data links to location, age
etc..)
Solution: PANDAS! Usually with excel and an RDBMS this can take
days/weeks. In pandas it took me a day and can probably take a few
hours.
Added benefit: The code can be repurposed for all input data. With an
rdbms you would need to re-write sql (or related queries) for each
particular database. *
* There is also pandassql which connects to a sql instance. Please see the
pandas docs.
This is what I got:
Files that were neither here nor there
Files that were everywhere
You could not tell which was which
And the data quality was a B***
Demo:
Original files : Sales txt file, customer file, and address book file. No files
had headers. They had to be inserted using a main data definitions file.
Outcome: One File showing sales by zip, region, province and country ,
mapped to specific column types for APACHE HIVE
Exciting Libraries
Geo Mapping: https://github.com/kjordahl/geopandas
Machine Learning: https://github.com/paulgb/sklearn-pandas
External Data Sources growing:
http://pandas.pydata.org/pandas-docs/stable/remote_data.html
What else
You can write in Cython to get maximize performance. It will be up to 10X faster
than pure python because in python it calls a series for each row so with cython and
numpy you can pass ndarrays instead.
Development Roadmap:
(0.13) Improved SQL / relational database tools
Tools for working with data sets that do not fit into memory
(0.10) Better memory usage and performance when reading very large CSV files
Better statistical graphics using matplotlib
Integration with D3.js
Better support for integer NA values
Extend GroupBy functionality to regular ndarrays, record arrays
✔ numpy.datetime64 integration, scikits.timeseries codebase integration.
Substantially improved time series functionality.
✔ Improved PyTables (HDF5) integration
✔ NDFrame data structure for arbitrarily high-dimensional labeled data
✔ Better support for NumPy dtype hierarchy without sacrificing usability
✔ Add a Factor data type (in R parlance)
Ad

More Related Content

What's hot (20)

Hadoop
HadoopHadoop
Hadoop
Mayuri Gupta
 
Big data ppt
Big data pptBig data ppt
Big data ppt
Shweta Sahu
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
MaulikLakhani
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
Ajay Ohri
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
datastack
 
Hadoop/Spark Non-Technical Basics
Hadoop/Spark Non-Technical BasicsHadoop/Spark Non-Technical Basics
Hadoop/Spark Non-Technical Basics
Zitao Liu
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Simplilearn
 
1. what is hadoop part 1
1. what is hadoop   part 11. what is hadoop   part 1
1. what is hadoop part 1
wintersnow181189
 
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizIntroduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
ITJobZone.biz
 
Big Data - Part IV
Big Data - Part IVBig Data - Part IV
Big Data - Part IV
Thanuja Seneviratne
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
nandhiniarumugam619
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop
Arohi Khandelwal
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
Bhavya Gulati
 
Introduction To Big Data Analytics On Hadoop - SpringPeople
Introduction To Big Data Analytics On Hadoop - SpringPeopleIntroduction To Big Data Analytics On Hadoop - SpringPeople
Introduction To Big Data Analytics On Hadoop - SpringPeople
SpringPeople
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Big Data Spain
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
Khalid Imran
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
Spotle.ai
 
Big data with java
Big data with javaBig data with java
Big data with java
Stefan Angelov
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With Python
Sarah Guido
 
Big Data - Part II
Big Data - Part IIBig Data - Part II
Big Data - Part II
Thanuja Seneviratne
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
Ajay Ohri
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
datastack
 
Hadoop/Spark Non-Technical Basics
Hadoop/Spark Non-Technical BasicsHadoop/Spark Non-Technical Basics
Hadoop/Spark Non-Technical Basics
Zitao Liu
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Simplilearn
 
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizIntroduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
ITJobZone.biz
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop
Arohi Khandelwal
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
Bhavya Gulati
 
Introduction To Big Data Analytics On Hadoop - SpringPeople
Introduction To Big Data Analytics On Hadoop - SpringPeopleIntroduction To Big Data Analytics On Hadoop - SpringPeople
Introduction To Big Data Analytics On Hadoop - SpringPeople
SpringPeople
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Big Data Spain
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
Khalid Imran
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
Spotle.ai
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With Python
Sarah Guido
 

Similar to Dc python meetup (20)

Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Jyrki Määttä
 
Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947
CMR WORLD TECH
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
MOHITKUMAR1379
 
Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)
Pavlo Baron
 
Big data sketch-and-possible-usecases2
Big data sketch-and-possible-usecases2Big data sketch-and-possible-usecases2
Big data sketch-and-possible-usecases2
Dmitri Apassov
 
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Lviv Startup Club
 
Introduction_OF_Hadoop_and_BigData
Introduction_OF_Hadoop_and_BigDataIntroduction_OF_Hadoop_and_BigData
Introduction_OF_Hadoop_and_BigData
Nilay Mishra
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science Stack
Denodo
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
Simplilearn
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera, Inc.
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
Denodo
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
nzhang
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
Stephen Alex
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
Stephen Alex
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
Chelle Gentemann
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overview
vhrocca
 
Lighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris PeetersLighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris Peeters
Data Science Leuven
 
TSE_Pres12.pptx
TSE_Pres12.pptxTSE_Pres12.pptx
TSE_Pres12.pptx
ssuseracaaae2
 
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Jyrki Määttä
 
Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947
CMR WORLD TECH
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
MOHITKUMAR1379
 
Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)
Pavlo Baron
 
Big data sketch-and-possible-usecases2
Big data sketch-and-possible-usecases2Big data sketch-and-possible-usecases2
Big data sketch-and-possible-usecases2
Dmitri Apassov
 
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Lviv Startup Club
 
Introduction_OF_Hadoop_and_BigData
Introduction_OF_Hadoop_and_BigDataIntroduction_OF_Hadoop_and_BigData
Introduction_OF_Hadoop_and_BigData
Nilay Mishra
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science Stack
Denodo
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
Simplilearn
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera, Inc.
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
Denodo
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
nzhang
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
Stephen Alex
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
Stephen Alex
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
Chelle Gentemann
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overview
vhrocca
 
Lighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris PeetersLighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris Peeters
Data Science Leuven
 
Ad

More from Jeffrey Clark (20)

Python memory management_v2
Python memory management_v2Python memory management_v2
Python memory management_v2
Jeffrey Clark
 
Python meetup
Python meetupPython meetup
Python meetup
Jeffrey Clark
 
Jwt with flask slide deck - alan swenson
Jwt with flask   slide deck - alan swensonJwt with flask   slide deck - alan swenson
Jwt with flask slide deck - alan swenson
Jeffrey Clark
 
Genericmeetupslides 110607190400-phpapp02
Genericmeetupslides 110607190400-phpapp02Genericmeetupslides 110607190400-phpapp02
Genericmeetupslides 110607190400-phpapp02
Jeffrey Clark
 
Pyramiddcpythonfeb2013 131006105131-phpapp02
Pyramiddcpythonfeb2013 131006105131-phpapp02Pyramiddcpythonfeb2013 131006105131-phpapp02
Pyramiddcpythonfeb2013 131006105131-phpapp02
Jeffrey Clark
 
Zpugdc2007 101105081808-phpapp01
Zpugdc2007 101105081808-phpapp01Zpugdc2007 101105081808-phpapp01
Zpugdc2007 101105081808-phpapp01
Jeffrey Clark
 
Zpugdc deformpresentation-100709203803-phpapp01
Zpugdc deformpresentation-100709203803-phpapp01Zpugdc deformpresentation-100709203803-phpapp01
Zpugdc deformpresentation-100709203803-phpapp01
Jeffrey Clark
 
Zpugdccherry 101105081729-phpapp01
Zpugdccherry 101105081729-phpapp01Zpugdccherry 101105081729-phpapp01
Zpugdccherry 101105081729-phpapp01
Jeffrey Clark
 
Tornado
TornadoTornado
Tornado
Jeffrey Clark
 
Science To Bfg
Science To BfgScience To Bfg
Science To Bfg
Jeffrey Clark
 
The PSF and You
The PSF and YouThe PSF and You
The PSF and You
Jeffrey Clark
 
Using Grok to Walk Like a Duck - Brandon Craig Rhodes
Using Grok to Walk Like a Duck - Brandon Craig RhodesUsing Grok to Walk Like a Duck - Brandon Craig Rhodes
Using Grok to Walk Like a Duck - Brandon Craig Rhodes
Jeffrey Clark
 
What Makes A Great Dev Team - Mike Robinson
What Makes A Great Dev Team - Mike RobinsonWhat Makes A Great Dev Team - Mike Robinson
What Makes A Great Dev Team - Mike Robinson
Jeffrey Clark
 
What Makes A Great Dev Team - Mike Robinson
What Makes A Great Dev Team - Mike RobinsonWhat Makes A Great Dev Team - Mike Robinson
What Makes A Great Dev Team - Mike Robinson
Jeffrey Clark
 
Plone I18n Tutorial - Hanno Schlichting
Plone I18n Tutorial - Hanno SchlichtingPlone I18n Tutorial - Hanno Schlichting
Plone I18n Tutorial - Hanno Schlichting
Jeffrey Clark
 
Real World Intranets - Joel Burton
Real World Intranets - Joel BurtonReal World Intranets - Joel Burton
Real World Intranets - Joel Burton
Jeffrey Clark
 
State Of Zope 3 - Stephan Richter
State Of Zope 3 - Stephan RichterState Of Zope 3 - Stephan Richter
State Of Zope 3 - Stephan Richter
Jeffrey Clark
 
KSS Techniques - Joel Burton
KSS Techniques - Joel BurtonKSS Techniques - Joel Burton
KSS Techniques - Joel Burton
Jeffrey Clark
 
Zenoss: Buildout
Zenoss: BuildoutZenoss: Buildout
Zenoss: Buildout
Jeffrey Clark
 
Opensourceweblion
OpensourceweblionOpensourceweblion
Opensourceweblion
Jeffrey Clark
 
Python memory management_v2
Python memory management_v2Python memory management_v2
Python memory management_v2
Jeffrey Clark
 
Jwt with flask slide deck - alan swenson
Jwt with flask   slide deck - alan swensonJwt with flask   slide deck - alan swenson
Jwt with flask slide deck - alan swenson
Jeffrey Clark
 
Genericmeetupslides 110607190400-phpapp02
Genericmeetupslides 110607190400-phpapp02Genericmeetupslides 110607190400-phpapp02
Genericmeetupslides 110607190400-phpapp02
Jeffrey Clark
 
Pyramiddcpythonfeb2013 131006105131-phpapp02
Pyramiddcpythonfeb2013 131006105131-phpapp02Pyramiddcpythonfeb2013 131006105131-phpapp02
Pyramiddcpythonfeb2013 131006105131-phpapp02
Jeffrey Clark
 
Zpugdc2007 101105081808-phpapp01
Zpugdc2007 101105081808-phpapp01Zpugdc2007 101105081808-phpapp01
Zpugdc2007 101105081808-phpapp01
Jeffrey Clark
 
Zpugdc deformpresentation-100709203803-phpapp01
Zpugdc deformpresentation-100709203803-phpapp01Zpugdc deformpresentation-100709203803-phpapp01
Zpugdc deformpresentation-100709203803-phpapp01
Jeffrey Clark
 
Zpugdccherry 101105081729-phpapp01
Zpugdccherry 101105081729-phpapp01Zpugdccherry 101105081729-phpapp01
Zpugdccherry 101105081729-phpapp01
Jeffrey Clark
 
Using Grok to Walk Like a Duck - Brandon Craig Rhodes
Using Grok to Walk Like a Duck - Brandon Craig RhodesUsing Grok to Walk Like a Duck - Brandon Craig Rhodes
Using Grok to Walk Like a Duck - Brandon Craig Rhodes
Jeffrey Clark
 
What Makes A Great Dev Team - Mike Robinson
What Makes A Great Dev Team - Mike RobinsonWhat Makes A Great Dev Team - Mike Robinson
What Makes A Great Dev Team - Mike Robinson
Jeffrey Clark
 
What Makes A Great Dev Team - Mike Robinson
What Makes A Great Dev Team - Mike RobinsonWhat Makes A Great Dev Team - Mike Robinson
What Makes A Great Dev Team - Mike Robinson
Jeffrey Clark
 
Plone I18n Tutorial - Hanno Schlichting
Plone I18n Tutorial - Hanno SchlichtingPlone I18n Tutorial - Hanno Schlichting
Plone I18n Tutorial - Hanno Schlichting
Jeffrey Clark
 
Real World Intranets - Joel Burton
Real World Intranets - Joel BurtonReal World Intranets - Joel Burton
Real World Intranets - Joel Burton
Jeffrey Clark
 
State Of Zope 3 - Stephan Richter
State Of Zope 3 - Stephan RichterState Of Zope 3 - Stephan Richter
State Of Zope 3 - Stephan Richter
Jeffrey Clark
 
KSS Techniques - Joel Burton
KSS Techniques - Joel BurtonKSS Techniques - Joel Burton
KSS Techniques - Joel Burton
Jeffrey Clark
 
Ad

Recently uploaded (20)

Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and CollaborateMeet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Maxim Salnikov
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
Download Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With LatestDownload Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With Latest
tahirabibi60507
 
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
F-Secure Freedome VPN 2025 Crack Plus Activation  New VersionF-Secure Freedome VPN 2025 Crack Plus Activation  New Version
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
saimabibi60507
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and CollaborateMeet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Maxim Salnikov
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
Download Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With LatestDownload Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With Latest
tahirabibi60507
 
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
F-Secure Freedome VPN 2025 Crack Plus Activation  New VersionF-Secure Freedome VPN 2025 Crack Plus Activation  New Version
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
saimabibi60507
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 

Dc python meetup

  • 1. { PANDAS Not your ordinary fuzzy bear Agenda: 6:45-6:50 Why do we need Pandas 6:50-7:00 General Information (Download etc..) 7:00-7:15 Quick Tour and Use Cases using Python Notebooks 7:15-7:30 Q&A
  • 2. Data Science has been around for much longer than it’s been buzz worthy. At it’s core we simply apply the scientific method to data: (Data) Scientists ask questions, do research, construct hypotheses, test, analyze , interpret results and repeat.
  • 3. Too much testing Too little Observe and Prep. We think Data looks clean and pretty But the reality is that it looks like well… like…
  • 4. But, “ETL is not as fun as Statistics, Machine learning …” Queue the PANDA ! Time to show you that it is
  • 5. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. You can download Pandas using standard pypi. I recommend, however you try the continuum analytics package provided for free by Anaconda. It includes pandas, numpy, scipy ,python notebooks and the Spyder editor. http://pandas.pydata.org/ http://docs.continuum.io/anaconda/install.html Quick Tour using open Financial data set 1. Data cleansing, manipulation, merging etc… 2.Web Crawling and data preparation using scrapy-pandas pipeline
  • 6. A fast and efficient DataFrame object for data manipulation with integrated indexing; Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format; Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form; Flexible reshaping and pivoting of data sets; Intelligent label-based slicing, fancy indexing, and subsetting of large data sets; Columns can be inserted and deleted from data structures for size mutability; Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets; High performance merging and joining of data sets; Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure; Time series-functionality: date range generation and frequency conversion, moving window statistics,moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
  • 7. Use Case 1: Determine metric for sales data for company/client using common mapping (geography, age, etc..) Data: Data set is in numerous files with different file names, columns, data types, null values, numeric and text data. How can you: a. Combine all the files quickly b. identify and clean erroneous data c. create mapping that will link data sets (all data links to location, age etc..) Solution: PANDAS! Usually with excel and an RDBMS this can take days/weeks. In pandas it took me a day and can probably take a few hours. Added benefit: The code can be repurposed for all input data. With an rdbms you would need to re-write sql (or related queries) for each particular database. * * There is also pandassql which connects to a sql instance. Please see the pandas docs.
  • 8. This is what I got: Files that were neither here nor there Files that were everywhere You could not tell which was which And the data quality was a B***
  • 9. Demo: Original files : Sales txt file, customer file, and address book file. No files had headers. They had to be inserted using a main data definitions file. Outcome: One File showing sales by zip, region, province and country , mapped to specific column types for APACHE HIVE
  • 10. Exciting Libraries Geo Mapping: https://github.com/kjordahl/geopandas Machine Learning: https://github.com/paulgb/sklearn-pandas External Data Sources growing: http://pandas.pydata.org/pandas-docs/stable/remote_data.html
  • 11. What else You can write in Cython to get maximize performance. It will be up to 10X faster than pure python because in python it calls a series for each row so with cython and numpy you can pass ndarrays instead. Development Roadmap: (0.13) Improved SQL / relational database tools Tools for working with data sets that do not fit into memory (0.10) Better memory usage and performance when reading very large CSV files Better statistical graphics using matplotlib Integration with D3.js Better support for integer NA values Extend GroupBy functionality to regular ndarrays, record arrays ✔ numpy.datetime64 integration, scikits.timeseries codebase integration. Substantially improved time series functionality. ✔ Improved PyTables (HDF5) integration ✔ NDFrame data structure for arbitrarily high-dimensional labeled data ✔ Better support for NumPy dtype hierarchy without sacrificing usability ✔ Add a Factor data type (in R parlance)

Editor's Notes

  • #6: Open Ipython notebooks have codes and all source data ready to go in notebook viewer. Set up notebook for this meeting specifically. Remember to leave Merck name out of all data.
  • #8: Demo the zip code pre-processing along with the mapping : Show code slices and then go to python notebook to show it in action Zip pre –process Clean the transfer sales Merge transfer sales with general sales Show mappings from digits to text (efficient way to change and manipulate data for visualization)
  • #10: Mention that this is one of the largest pharma companies in the world Mention hierarchical indexing