SciPy 2011 pandas lightning talk

What's new and awesome in pandas

This document summarizes new and improved features in the pandas library. It describes how pandas provides labeled, structured arrays for heterogeneous data and time series functionality. Major updates include merging DataFrame and DataMatrix into a single optimized DataFrame, better handling of missing data, more robust IO capabilities, and improved pivoting and reshaping of data. Future plans include more powerful group by operations and integration with statsmodels.

pandas?
In [13]: foo
Out[13]:
   methyl1     age        edu  something  indic
0    38.36  30to39  geCollege          1  False
1    37.85    lt30  geCollege          1  False
2    38.57  30to39  geCollege          1  False
3    39.75  30to39  geCollege          1   True
4    43.83  30to39  geCollege          1   True
5    39.08  30to39       ltHS          1   True

Size-mutable “labeled arrays” that
can handle heterogeneous data
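
A minimal sketch of what "size-mutable" and "heterogeneous" mean in practice, assuming a current pandas install. It rebuilds the first three rows of `foo` from the output above; the derived column `methyl1_centered` is hypothetical and only there to show a column being added.

```python
import pandas as pd

# First three rows of the slide's `foo` frame; each column keeps its own dtype.
foo = pd.DataFrame({
    "methyl1": [38.36, 37.85, 38.57],
    "age": ["30to39", "lt30", "30to39"],
    "edu": ["geCollege", "geCollege", "geCollege"],
    "something": [1, 1, 1],
    "indic": [False, False, False],
})

# Size-mutable: columns can be added and removed in place.
foo["methyl1_centered"] = foo["methyl1"] - foo["methyl1"].mean()  # hypothetical column
del foo["something"]

print(foo.dtypes)
```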

Kinda like a structured array??

•  Automatic data alignment with lots of
reshaping and indexing methods

•  Implicit and explicit handling of missing
data

•  Easy time series functionality
–  Far less fuss than scikits.timeseries

•  Lots of in-memory SQL-like operations
(group by, join, etc.)
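
The bullets above can be sketched roughly as follows, assuming a current pandas install. The series, labels, and values are made up for illustration and do not come from the talk.

```python
import pandas as pd

# Two series with partially overlapping labels (made-up values).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Automatic alignment: arithmetic matches on labels, and labels
# missing from either side come out as NaN (implicit missing data).
total = s1 + s2

# Explicit missing-data handling: drop the gaps or fill them.
dropped = total.dropna()
filled = s1.add(s2, fill_value=0.0)

# SQL-like operation: group by a key and aggregate.
df = pd.DataFrame({"key": ["x", "y", "x", "y"], "value": [1, 2, 3, 4]})
sums = df.groupby("key")["value"].sum()

print(total)
print(sums)
```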

pandas?
•  Extremely good for financial data
–  StackOverflow: “this is a beast of a financial
analysis tool”

•  One of the better relational data
munging tools in any language?

•  But also has maybe 60+% of what R
users expect when they come to
Python

1. Heavily redesigned internals
•  Merged old DataFrame and DataMatrix
into a single DataFrame: retain
optimal performance where possible

•  Internal BlockManager class manages
homogeneous ndarrays for optimal
performance and reshaping

1. Heavily redesigned internals
•  Better handling of missing data for
non-floating point dtypes

•  Soon: DataFrame variant with N-dim
“hyperslabs”

2. Fancier indexing
Mix boolean / integer / label /
slice-based indexing

df.ix[0]
df.ix[date1:date2]
df.ix[:5, 'A':'F']

Setting works too

df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
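
Note for readers on a current pandas: `.ix` was deprecated in 0.20 and removed in 1.0. The sketch below shows roughly equivalent selections with `.loc` (labels, boolean masks) and `.iloc` (positions). The frame, its dates, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with a date index and columns 'A'..'F'.
dates = pd.date_range("2011-07-11", periods=6, freq="D")
df = pd.DataFrame(np.arange(36, dtype=float).reshape(6, 6) - 10.0,
                  index=dates, columns=list("ABCDEF"))

first_row = df.iloc[0]                  # by position, like df.ix[0]
window = df.loc[dates[1]:dates[3]]      # label slice, both ends included
block = df.iloc[:5].loc[:, "A":"F"]     # first five rows, columns 'A' through 'F'

# Setting works too: boolean row mask plus a list of column labels.
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan

print(df)
```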

3. More robust IO
data_frame = read_csv('mydata.csv')

data_frame2 = read_table('mydata.txt', sep='\t',
                         skiprows=[1,2],
                         na_values=['#N/A NA'])

store = HDFStore('pytables.h5')
store['a'] = data_frame
store['b'] = data_frame2
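
A self-contained sketch of the same parsing options, assuming a current pandas install. It reads from an in-memory string instead of `mydata.txt`, so the text, column names, and the two NA markers are made up. `HDFStore` is left out because it needs the optional PyTables dependency.

```python
import io
import pandas as pd

# Made-up tab-separated text standing in for 'mydata.txt'.
raw = (
    "name\tscore\tgroup\n"
    "# comment line to skip\n"
    "# another line to skip\n"
    "alice\t1.5\tx\n"
    "bob\t#N/A\ty\n"
    "carol\tNA\tx\n"
)

# sep="\t" splits on tabs; skiprows lists 0-based line numbers to drop;
# na_values lists strings to treat as missing.
frame = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=[1, 2],
                    na_values=["#N/A", "NA"])

print(frame)
```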

4. Better pivoting / reshaping

   foo  bar        A        B        C
0  one    a  -0.0524    1.664    1.171
1  one    b   0.2514   0.8306   -1.396
2  one    c   0.1256   0.3897   0.5227
3  one    d  -0.9301   0.6513  -0.2313
4  one    e    2.037    1.938  -0.3454
5  two    a   0.2073   0.7857   0.9051
6  two    b   -1.032  -0.8615    1.028
7  two    c  -0.7319   -1.846   0.9294
8  two    d   0.1004    -1.19   0.6043
9  two    e   -1.008  -0.3339  0.09522

4. Better pivoting / reshaping

In [29]: pivoted = df.pivot('bar', 'foo')

In [30]: pivoted['B']
Out[30]:
      one      two
a   1.664   0.7857
b  0.8306  -0.8615
c  0.3897   -1.846
d  0.6513    -1.19
e   1.938  -0.3339
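
For reference, a sketch of the same pivot on a current pandas, where `pivot` takes keyword arguments and returns a DataFrame with hierarchical columns. The `bar` labels are assumed to run a..e within each `foo` group, which is what the pivoted output above implies.

```python
import pandas as pd

# The slide's ten rows; 'bar' is assumed to run a..e within each 'foo' group.
df = pd.DataFrame({
    "foo": ["one"] * 5 + ["two"] * 5,
    "bar": list("abcde") * 2,
    "A": [-0.0524, 0.2514, 0.1256, -0.9301, 2.037,
          0.2073, -1.032, -0.7319, 0.1004, -1.008],
    "B": [1.664, 0.8306, 0.3897, 0.6513, 1.938,
          0.7857, -0.8615, -1.846, -1.19, -0.3339],
    "C": [1.171, -1.396, 0.5227, -0.2313, -0.3454,
          0.9051, 1.028, 0.9294, 0.6043, 0.09522],
})

# Rows become 'bar' labels, columns become (value column, 'foo' label) pairs.
pivoted = df.pivot(index="bar", columns="foo")

# Selecting one value column gives a bar-by-foo table.
print(pivoted["B"])
```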

4. Better pivoting / reshaping

In [31]: pivoted.major_xs('a')
Out[31]:
           A       B       C
one  -0.0524   1.664   1.171
two   0.2073  0.7857  0.9051

In [32]: pivoted.minor_xs('one')
Out[32]:
         A       B        C
a  -0.0524   1.664    1.171
b   0.2514  0.8306   -1.396
c   0.1256  0.3897   0.5227
d  -0.9301  0.6513  -0.2313
e    2.037   1.938  -0.3454
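
`major_xs` and `minor_xs` suggest the pivot result here was a panel-like object; on a current pandas the result is a DataFrame with hierarchical columns, so those methods do not apply. The sketch below is one way to get the same two cross-sections, assuming the `bar` labels run a..e within each `foo` group as the outputs imply.

```python
import pandas as pd

# Same ten rows as the slide; 'bar' is assumed to run a..e per 'foo' group.
df = pd.DataFrame({
    "foo": ["one"] * 5 + ["two"] * 5,
    "bar": list("abcde") * 2,
    "A": [-0.0524, 0.2514, 0.1256, -0.9301, 2.037,
          0.2073, -1.032, -0.7319, 0.1004, -1.008],
    "B": [1.664, 0.8306, 0.3897, 0.6513, 1.938,
          0.7857, -0.8615, -1.846, -1.19, -0.3339],
    "C": [1.171, -1.396, 0.5227, -0.2313, -0.3454,
          0.9051, 1.028, 0.9294, 0.6043, 0.09522],
})
pivoted = df.pivot(index="bar", columns="foo")

# Like major_xs('a'): fix one 'bar' label; rows are 'foo', columns are A/B/C.
row_a = pivoted.loc["a"].unstack(level=0)

# Like minor_xs('one'): fix one 'foo' label; rows are 'bar', columns are A/B/C.
col_one = pivoted.xs("one", axis=1, level="foo")

print(row_a)
print(col_one)
```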

5. Some other things
•  “Sparse” (mostly NA) versions of
data structures
•  Time zone support in DateRange
•  Generic moving window function
rolling_apply
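
On a current pandas the top-level `rolling_apply` function is gone; the same idea is spelled `Series.rolling(...).apply(...)`. A minimal sketch, with made-up data and a hypothetical helper `value_range`:

```python
import pandas as pd

# Made-up daily series.
s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0],
              index=pd.date_range("2011-07-11", periods=5, freq="D"))

def value_range(window):
    # Hypothetical window function: spread between max and min.
    return window.max() - window.min()

# Apply the function over a moving window of three observations.
# raw=True passes each window as a NumPy array.
spread = s.rolling(window=3).apply(value_range, raw=True)

print(spread)
```

The first two entries are NaN because a full three-observation window is not yet available.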

Near future
•  More powerful Group By

•  Flexible, fast frequency (time series) conversions

•  More integration with statsmodels

Thanks!
•  Hack: github.com/wesm/pandas

•  Twitter: @wesmckinn

•  Blog: blog.wesmckinney.com
