What’s new and awesome in pandas
pandas?
In [13]: foo
Out[13]:
    methyl1    age      edu         something   indic
0   38.36      30to39   geCollege   1           False
1   37.85      lt30     geCollege   1           False
2   38.57      30to39   geCollege   1           False
3   39.75      30to39   geCollege   1           True
4   43.83      30to39   geCollege   1           True
5   39.08      30to39   ltHS        1           True


  Size-mutable “labeled arrays” that
     can handle heterogeneous data
Kinda like a structured array??

•  Automatic data alignment with lots of
   reshaping and indexing methods

•  Implicit and explicit handling of missing
   data

•  Easy time series functionality
    –  Far less fuss than scikits.timeseries

•  Lots of in-memory SQL-like operations
   (group by, join, etc.)
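
A minimal sketch of three of the bullets above (label alignment, missing-data handling, and a SQL-like group by), written against a current pandas release rather than the 2011 version shown in the talk. The data and variable names are illustrative assumptions.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Arithmetic aligns on labels; labels present on only one side become NaN.
total = s1 + s2

# Missing data can then be handled explicitly.
filled = total.fillna(0.0)

# A SQL-like GROUP BY on a small frame.
df = pd.DataFrame({"key": ["x", "y", "x", "y"], "value": [1, 2, 3, 4]})
sums = df.groupby("key")["value"].sum()
```

Here `total` carries the union of both indexes, so no manual reindexing is needed before adding.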
pandas?
•  Extremely good for financial data
  –  StackOverflow: “this is a beast of a financial
     analysis tool”



•  One of the better relational data
   munging tools in any language?

•  But also has maybe 60+% of what R
   users expect when they come to
   Python
1. Heavily redesigned internals
•  Merged old DataFrame and DataMatrix
   into a single DataFrame: retain
   optimal performance where possible

•  Internal BlockManager class manages
   homogeneous ndarrays for optimal
   performance and reshaping
1. Heavily redesigned internals
•  Better handling of missing data for
   non-floating point dtypes

•  Soon: DataFrame variant with N-dim
   “hyperslabs”
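
One place where missing data meets a non-floating-point dtype can be sketched as follows. This shows how a current pandas release behaves for an integer column; it is an assumption that this is the kind of case the bullet above has in mind, and the data is illustrative.

```python
import pandas as pd

# An integer-valued Series (illustrative data).
counts = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Reindexing to a label that does not exist introduces a missing value.
# The default integer dtype cannot hold NaN, so the result is upcast to float.
expanded = counts.reindex(["a", "b", "c", "d"])
```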
2. Fancier indexing
Mix boolean / integer / label /
slice-based indexing

df.ix[0]
df.ix[date1:date2]
df.ix[:5, 'A':'F']


Setting works too

df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
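
The `.ix` indexer shown here was later removed (in pandas 1.0). A rough sketch of the same operations with `.iloc` and `.loc`, assuming a small frame with a daily DatetimeIndex and columns A through F (both invented for illustration):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2011-07-01", periods=6, freq="D")
df = pd.DataFrame(
    np.arange(36, dtype=float).reshape(6, 6) - 10.0,
    index=dates,
    columns=list("ABCDEF"),
)

first_row = df.iloc[0]                      # integer position (was df.ix[0])
window = df.loc["2011-07-02":"2011-07-04"]  # label slice, both ends included
block = df.loc[df.index[:5], "A":"F"]       # first five rows, columns A through F

# Boolean row mask plus a list of column labels, used for assignment.
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan
```

Splitting positional access (`.iloc`) from label access (`.loc`) removes the ambiguity `.ix` had when the index itself held integers.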
3. More robust IO
data_frame = read_csv('mydata.csv')

data_frame2 = read_table('mydata.txt', sep='\t',
                         skiprows=[1,2],
                         na_values=['#N/A NA'])



store = HDFStore('pytables.h5')
store['a'] = data_frame
store['b'] = data_frame2
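
A self-contained sketch of the text-parsing options above, reading from an in-memory string instead of a file so it runs anywhere. The file contents and the "MISSING" sentinel are assumptions made for illustration. `HDFStore` is left out because it needs the optional PyTables dependency.

```python
import io

import pandas as pd

# Illustrative tab-separated text standing in for 'mydata.txt'.
raw = "x\ty\nskip\tme\nskip\tme too\n1\tMISSING\n2\t3.5\n"

frame = pd.read_csv(
    io.StringIO(raw),
    sep="\t",               # tab delimiter
    skiprows=[1, 2],        # skip the second and third lines (0-based line numbers)
    na_values=["MISSING"],  # extra strings to treat as missing values
)
```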
4. Better pivoting / reshaping

    foo   bar    A         B         C
0   one   a     -0.0524    1.664     1.171
1   one   a      0.2514    0.8306   -1.396
2   one   b      0.1256    0.3897    0.5227
3   one   b     -0.9301    0.6513   -0.2313
4   one   c      2.037     1.938    -0.3454
5   two   a      0.2073    0.7857    0.9051
6   two   a     -1.032    -0.8615    1.028
7   two   b     -0.7319   -1.846     0.9294
8   two   b      0.1004   -1.19      0.6043
9   two   c     -1.008    -0.3339    0.09522
4. Better pivoting / reshaping

In [29]: pivoted = df.pivot('bar', 'foo')

In [30]: pivoted['B']
Out[30]:
    one      two
a   1.664    0.7857
b   0.8306  -0.8615
c   0.3897  -1.846
d   0.6513  -1.19
e   1.938   -0.3339
4. Better pivoting / reshaping

In [31]: pivoted.major_xs('a')
Out[31]:
      A         B        C
one  -0.0524    1.664    1.171
two   0.2073    0.7857   0.9051


In [32]: pivoted.minor_xs('one')
Out[32]:
    A         B        C
a  -0.0524    1.664    1.171
b   0.2514    0.8306  -1.396
c   0.1256    0.3897   0.5227
d  -0.9301    0.6513  -0.2313
e   2.037     1.938   -0.3454
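
In a current pandas release, pivoting several value columns returns a DataFrame with hierarchical columns rather than the panel-like object implied by `major_xs` and `minor_xs`. A rough sketch of equivalent selections, using invented data with exactly one row per (foo, bar) pair:

```python
import pandas as pd

# Illustrative long-format data; each (foo, bar) pair appears once.
df = pd.DataFrame({
    "foo": ["one", "one", "one", "two", "two", "two"],
    "bar": ["a", "b", "c", "a", "b", "c"],
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "B": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Columns become a two-level index: (value column, foo).
pivoted = df.pivot(index="bar", columns="foo")

b_table = pivoted["B"]                            # rows: bar, columns: foo
row_a = pivoted.loc["a"].unstack(level=0)         # roughly major_xs('a')
col_one = pivoted.xs("one", axis=1, level="foo")  # roughly minor_xs('one')
```

`pivot` raises an error when an index/column pair occurs more than once; `pivot_table` with an aggregation function is the usual choice in that case.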
5. Some other things
•  “Sparse” (mostly NA) versions of
   data structures
•  Time zone support in DateRange
•  Generic moving window function
   rolling_apply
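
The top-level `rolling_apply` function is no longer part of current pandas; the same idea is written as a method chain. A minimal sketch with illustrative data and a made-up window statistic (the range of each window):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])

# Apply an arbitrary function to each 3-observation window.
# raw=True passes each window to the function as a NumPy array.
spread = s.rolling(window=3).apply(lambda w: w.max() - w.min(), raw=True)
```

The first two results are NaN because a full window of three observations is not yet available.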
Near future
•  More powerful Group By

•  Flexible, fast frequency (time series) conversions

•  More integration with statsmodels
Thanks!
•  Hack: github.com/wesm/pandas

•  Twitter: @wesmckinn

•  Blog: blog.wesmckinney.com

SciPy 2011 pandas lightning talk
