Slides from my lightning talk at the Boston Predictive Analytics Meetup hosted at Predictive Analytics World, Boston, October 1, 2012.
Full code and data are available on github: http://bit.ly/pawdata
Tapping the Data Deluge with R
1. Tapping the Data Deluge with R
Finding and using supplemental data
to add context to your analysis
by Jeffrey Breen
Principal, Think Big Academy
Code & Data on github
http://bit.ly/pawdata email: [email protected]
blog: http://jeffreybreen.wordpress.com
Twitter: @JeffreyBreen
1
2. Data data everywhere!
This may be how you picture the data deluge if you work for the Economist.
But those of us who wrangle data for a living know that it's usually not so prosaic or buttoned-down, proper or quaint.
3. Real data hits us in the face...
3
Real data can hit you in the face.
Yet we keep coming back for more.
4. ...and then there’s Big Data.
4
And I’m not even going to talk about Big Data tonight. (For a change!)
5. Finding the right data makes all the difference
5
Tonight we’re going to look at a few different places to find those data sets which can make a difference, and a few techniques
to access them so you can incorporate them into your analysis.
6. The two types of data
Data you have
Data you don’t
have... yet
6
Perhaps you’ve heard the joke: There are two kinds of people: People who think there are two kinds of people and people
who don’t.
I like to think that there are two kinds of data.
7. The two types of data
• Data you have
– CSV files, spreadsheets
– files from other statistics packages (SPSS, SAS, Stata, ...; see the sketch after this slide)
– databases, data warehouses (SQL, NoSQL, HBase,...)
– whatever your boss emailed you on his way to lunch
– datasets within R and R packages
• Data you don’t have... yet
– file downloads & web scraping
– data marketplaces and other APIs
Code & Data on github: http://bit.ly/pawdata 7
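The list above mentions files from other statistics packages. A minimal sketch using the foreign package, which ships with R; the file names here are hypothetical:
library(foreign)
# SPSS .sav file read into a data.frame (hypothetical file name)
survey = read.spss('data/survey.sav', to.data.frame=TRUE)
# Stata .dta file (hypothetical file name)
panel = read.dta('data/panel.dta')
# SAS transport (.xpt) files can be read with read.xport()
str(survey)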
8. Reading CSV files is easy
$ head -5 data/mpg-3-13-2012.csv | cut -c 1-60
"Model Yr","Mfr Name","Division","Carline","Verify Mfr Cd","
2012,"aston martin","Aston Martin Lagonda Ltd","V12 Vantage"
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
data = read.csv('data/mpg-3-13-2012.csv')
View(data)
see R/01-read.csv-mpg.R 8
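A minimal sketch, not from the original slides, of a few read.csv() arguments that often matter in practice; the option values are illustrative assumptions:
data = read.csv('data/mpg-3-13-2012.csv',
                stringsAsFactors=FALSE,  # keep text columns as character
                na.strings=c('', 'NA'))  # treat empty cells as missing
str(data)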
9. But so is reading Excel files directly
library(XLConnect)
wb = loadWorkbook("data/mpg.xlsx", create=F)
data = readWorksheet(wb, sheet='3-7-2012')
see R/02-XLConnect-mpg.R 9
11. Relational databases
library(RMySQL)
con = dbConnect(MySQL(), user="root", dbname="test")
data = dbGetQuery(con, "select * from airport")
dbDisconnect(con)
View(data)
airport_code airport_name location state_code country_name time_zone_code
1 ATL WILLIAM B. HARTSFIELD ATLANTA,GEORGIA GA USA EST
2 BOS LOGAN INTERNATIONAL BOSTON,MASSACHUSETTS MA USA EST
3 BWI BALTIMORE/WASHINGTON INTERNATIONAL BALTIMORE,MARYLAND MD USA EST
4 DEN STAPLETON INTERNATIONAL DENVER,COLORADO CO USA MST
5 DFW DALLAS/FORT WORTH INTERNATIONAL DALLAS/FT. WORTH,TEXAS TX USA CST
6 OAK METROPOLITAN OAKLAND INTERNATIONAL OAKLAND,CALIFORNIA CA USA PST
7 PHL PHILADELPHIA INTERNATIONAL PHILADELPHIA PA/WILM'TON,DE PA USA EST
8 PIT GREATER PITTSBURGH PITTSBURGH,PENNSYLVANIA PA USA EST
9 SFO SAN FRANCISCO INTERNATIONAL SAN FRANCISCO,CALIFORNIA CA USA PST
see R/04-RMySQL-airport.R 11
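DBI works in the other direction too. A minimal sketch, not from the original slides, of writing a data.frame back to MySQL with dbWriteTable(); the table name is just an example:
library(RMySQL)
con = dbConnect(MySQL(), user="root", dbname="test")
dbWriteTable(con, "airport_copy", data, row.names=FALSE)  # create a new table from the data.frame
dbListTables(con)
dbDisconnect(con)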
12. Non-relational databases too
> library(rhbase)
> hb.init(serialize='raw')
> x = hb.get(tablename='tweets', rows='221325531868692480')
> str(x)
List of 1
$ :List of 3
..$ : chr "221325531868692480"
..$ : chr [1:10] "created:" "favorited:" "id:" "replyToSID:" ...
..$ :List of 10
.. ..$ : chr "2012-07-06 19:31:33"
.. ..$ : chr "FALSE"
.. ..$ : chr "221325531868692480"
.. ..$ : chr "NA"
.. ..$ : chr "NA"
.. ..$ : chr "NA"
.. ..$ : chr "arnicas"
.. ..$ : chr "<a href="http://www.tweetdeck.com"
rel="nofollow">TweetDeck</a>"
.. ..$ : chr "RT @bycoffe: From @DrewLinzer, an #Rstats function for querying
the HuffPost Pollster API. http://t.co/fXnG32JX cc @thewhyaxis"
.. ..$ : chr "FALSE"
12
13. weird emails from the boss
con = textConnection('
# Hi:
#
# Please invite these paid volunteers to the spontaneous rally at 3PM today:
#
Name Department "Hourly Rate" email
Alice Operations 32 [email protected]
Billy Logistics 5 [email protected]
Winston Records 20 [email protected]
#
#Thanks,
#Your Boss
#! ! ! ! !
')
data = read.table(con, header=T, comment.char='#')
close.connection(con)
View(data)
    Name Department Hourly.Rate email
1 Alice Operations 32 [email protected]
2 Billy Logistics 5 [email protected]
3 Winston Records 20 [email protected]
see R/05-textConnection-email.R 13
14. > data()
Data sets in package ‘datasets’:
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European Stock
Indices, 1991-1998
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly pine trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Seatbelts Road Casualties in Great Britain 1969-84
Theoph Pharmacokinetics of Theophylline
Titanic Survival of passengers on the Titanic
ToothGrowth The Effect of Vitamin C on Tooth Growth in
Guinea Pigs
UCBAdmissions Student Admissions at UC Berkeley
UKDriverDeaths Road Casualties in Great Britain 1969-84
UKgas UK Quarterly Gas Consumption
USAccDeaths Accidental Deaths in the US 1973-1978
USArrests Violent Crime Rates by US State
USJudgeRatings Lawyers' Ratings of State Judges in the US
Superior Court
USPersonalExpenditure Personal Expenditure Data
VADeaths Death Rates in Virginia (1940)
WWWusage Internet Usage per Minute
WorldPhones The World's Telephones
ability.cov Ability and Intelligence Tests
airmiles Passenger Miles on Commercial US Airlines,
1937-1960
airquality New York Air Quality Measurements
[...]
15. > library(zipcode)
> data(zipcode)
> str(zipcode)
'data.frame': 44336 obs. of 5 variables:
$ zip : chr "00210" "00211" "00212" "00213" ...
$ city : chr "Portsmouth" "Portsmouth" "Portsmouth" "Portsmouth" ...
$ state : chr "NH" "NH" "NH" "NH" ...
$ latitude : num 43 43 43 43 43 ...
$ longitude: num -71 -71 -71 -71 -71 ...
> subset(zipcode, city=='Boston' & state=='MA')
zip city state latitude longitude
664 02101 Boston MA 42.37057 -71.02696
665 02102 Boston MA 42.33895 -70.91963
666 02103 Boston MA 42.33895 -70.91963
667 02104 Boston MA 42.33895 -70.91963
668 02105 Boston MA 42.33895 -70.91963
669 02106 Boston MA 42.35432 -71.07345
670 02107 Boston MA 42.33895 -70.91963
671 02108 Boston MA 42.35790 -71.06408
672 02109 Boston MA 42.36148 -71.05417
673 02110 Boston MA 42.35653 -71.05365
674 02111 Boston MA 42.34984 -71.06101
675 02112 Boston MA 42.33895 -70.91963
676 02113 Boston MA 42.36503 -71.05636
677 02114 Boston MA 42.36179 -71.06774
678 02115 Boston MA 42.34308 -71.09268
679 02116 Boston MA 42.34962 -71.07372
680 02117 Boston MA 42.33895 -70.91963
681 02118 Boston MA 42.33872 -71.07276
682 02119 Boston MA 42.32451 -71.08455
683 02120 Boston MA 42.33210 -71.09651
684 02121 Boston MA 42.30745 -71.08127
685 02122 Boston MA 42.29630 -71.05454
686 02123 Boston MA 42.33895 -70.91963
687 02124 Boston MA 42.28713 -71.07156
688 02125 Boston MA 42.31685 -71.05811
690 02127 Boston MA 42.33499 -71.04562
691 02128 Boston MA 42.37830 -71.02550
696 02133 Boston MA 42.33895 -70.91963
726 02163 Boston MA 42.36795 -71.12056
757 02196 Boston MA 42.33895 -70.91963
[...]
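The point of a lookup table like zipcode is to join it onto your own data. A minimal sketch, assuming a hypothetical data frame of customer records with a zip column:
library(zipcode)
data(zipcode)
customers = data.frame(id=1:3, zip=c('2101', '02139', '01742'))  # hypothetical records
customers$zip = clean.zipcodes(customers$zip)                    # pad to 5-digit character zips
customers = merge(customers, zipcode, by='zip', all.x=TRUE)      # adds city, state, latitude, longitude
customers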
17. The two types of data
• Data you have
– CSV files, spreadsheets
– files from other statistics packages (SPSS, SAS, Stata,...)
– databases, data warehouses (SQL, NoSQL, HBase,...)
– whatever your boss emailed you on his way to lunch
– datasets within R and R packages
• Data you don’t have... yet
– file downloads & web scraping
– data marketplaces and other APIs
Code & Data on github: http://bit.ly/pawdata 17
20. Many base functions take URLs
url = 'http://ichart.finance.yahoo.com/table.csv?s=YHOO&d=8&e=28&f=2012&g=d&a=3&b=12&c=1996&ignore=.csv'
data = read.csv(url)
ggplot(data) + geom_point(aes(x=as.Date(Date),
y=Close), size = 1) + scale_y_log10() + theme_bw()
see R/06-read.csv-url-yahoo.R 20
22. download.file() if URLs aren’t supported
library(XLConnect)
url = "http://www.fueleconomy.gov/feg/EPAGreenGuide/xls/
all_alpha_12.xls"
local.xls.file = 'data/all_alpha_12.xls'
download.file(url, local.xls.file)
wb = loadWorkbook(local.xls.file, create=F)
data = readWorksheet(wb, sheet='all_alpha_12')
View(data)
see R/07-download.file-XLConnect-green.R 22
24. not even HTML tables are safe
library(XML)
url = 'http://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'
state.capitals.df = readHTMLTable(url, which=2)
State Abr. Date of statehood Capital Capital since Land area (mi²) Most populous city?
1 Alabama AL 1819 Montgomery 1846 155.4 No
2 Alaska AK 1959 Juneau 1906 2716.7 No
3 Arizona AZ 1912 Phoenix 1889 474.9 Yes
4 Arkansas AR 1836 Little Rock 1821 116.2 Yes
5 California CA 1850 Sacramento 1854 97.2 No
6 Colorado CO 1876 Denver 1867 153.4 Yes
7 Connecticut CT 1788 Hartford 1875 17.3 No
8 Delaware DE 1787 Dover 1777 22.4 No
9 Florida FL 1845 Tallahassee 1824 95.7 No
10 Georgia GA 1788 Atlanta 1868 131.7 Yes
see R/08-readHTMLTable.R 24
As you’d expect from a package called “XML”, it parses well-formed XML files.
But I didn’t expect it would do such a good job with HTML.
And I certainly didn’t expect to find a function as handy as readHTMLTable()!
27. ..and couldn’t be easier to access.
library(rdatamarket)
oil.prod = dmseries("http://data.is/nyFeP9")
plot(oil.prod)
see R/09-rdatamarket.R 27
DataMarket includes its own URL shortener -- like bit.ly but just for their data.
Long or short, just give dmseries() the URL, and it will download the data set for you.
28. Make a withdrawal from the World Bank
> library(WDI)
> WDIsearch('population, total')
indicator name
"SP.POP.TOTL" "Population, total"
> WDIsearch('fertility .*total')
indicator name
"SP.DYN.TFRT.IN" "Fertility rate, total (births per woman)"
> WDIsearch('life expectancy .*birth.*total')
indicator name
"SP.DYN.LE00.IN" "Life expectancy at birth, total (years)"
> WDIsearch('GDP per capita .*constant')
indicator name
[1,] "NY.GDP.PCAP.KD" "GDP per capita (constant 2000 US$)"
[2,] "NY.GDP.PCAP.KN" "GDP per capita (constant LCU)"
> WDIsearch('population, total')
indicator name
"SP.POP.TOTL" "Population, total"
see R/10-WDI.R 28
29. Swedish Accent Not Included
data = WDI(country=c('BR', 'CN', 'GB', 'JP', 'IN', 'SE', 'US'),
           indicator=c('SP.DYN.TFRT.IN', 'SP.DYN.LE00.IN', 'SP.POP.TOTL',
                       'NY.GDP.PCAP.KD'),
           start=1900, end=2010)
library(googleVis)
g = gvisMotionChart(data, idvar='country', timevar='year')
plot(g)
see R/10-WDI.R 29
30. quantmod: the king of symbols
• getSymbols() downloads time series data from the source specified by the "src" parameter (see the sketch after this list):
– yahoo = Yahoo! Finance
– google = Google Finance
– FRED = St. Louis Fed’s Federal Reserve Economic Data
– oanda = OANDA Forex Trading & Exchange Rates
– csv
– MySQL
– RData
30
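A minimal sketch of switching sources with the src parameter; the ticker and series names are just examples:
library(quantmod)
aapl   = getSymbols('AAPL',    src='yahoo', auto.assign=FALSE)  # Yahoo! Finance quotes
gdp    = getSymbols('GDP',     src='FRED',  auto.assign=FALSE)  # FRED economic series
eurusd = getSymbols('EUR/USD', src='oanda', auto.assign=FALSE)  # OANDA exchange rates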
31. Hello, FRED
55,000 economic time series from 45 sources:
• Automatic Data Processing, Inc.
• Banca d'Italia
• Banco de Mexico
• Bank of Japan
• Bankrate, Inc.
• Board of Governors of the Federal Reserve System
• BofA Merrill Lynch
• British Bankers' Association
• Central Bank of the Republic of Turkey
• Chicago Board Options Exchange
• CredAbility Nonprofit Credit Counseling & Education
• Deutsche Bundesbank
• Dow Jones & Company
• Eurostat
• Federal Financial Institutions Examination Council
• Federal Housing Finance Agency
• Federal Reserve Bank of Chicago
• Federal Reserve Bank of Kansas City
• Federal Reserve Bank of Philadelphia
• Federal Reserve Bank of St. Louis
• Freddie Mac
• Haver Analytics
• Institute for Supply Management
• International Monetary Fund
• London Bullion Market Association
• National Association of Realtors
• National Bureau of Economic Research
• Organisation for Economic Co-operation and Development
• Reserve Bank of Australia
• Standard and Poor's
• Swiss National Bank
• The White House: Council of Economic Advisors
• The White House: Office of Management and Budget
• Thomson Reuters/University of Michigan
• U.S. Congress: Congressional Budget Office
• U.S. Department of Commerce: Bureau of Economic Analysis
• U.S. Department of Commerce: Census Bureau
• U.S. Department of Energy: Energy Information Administration
• U.S. Department of Housing and Urban Development
• U.S. Department of Labor: Bureau of Labor Statistics
• U.S. Department of Labor: Employment and Training Administration
• U.S. Department of the Treasury: Financial Management Service
• U.S. Department of Transportation: Federal Highway Administration
• Wilshire Associates Incorporated
• World Bank
31
32. BLS Jobless data (FRED) + S&P (Yahoo!)
library(quantmod)
initial.claims = getSymbols('ICSA', src='FRED', auto.assign=F)
sp500 = getSymbols('^GSPC', src='yahoo', auto.assign=F)
# Convert quotes to weekly and fetch Cl() closing price
sp500.weekly = Cl(to.weekly(sp500))
see R/11-quantmod.R 32
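One way to line the two series up, an assumption on my part rather than something shown on the slide, is to merge the xts objects on their dates and plot the panels:
combined = merge(initial.claims, sp500.weekly)   # outer join on dates; NAs where a series has no observation
plot(as.zoo(combined), main='Initial jobless claims vs. S&P 500 weekly close')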
33. Resources
• Expanded code snippets and all data for this talk
– http://bit.ly/pawdata
• R Data Import/Export manual
– http://cran.r-project.org/doc/manuals/R-data.html
• CRAN: Comprehensive R Archive Network
– package lists: http://cran.r-project.org/web/packages/
– Featured: XLConnect, foreign, RMySQL, XML, quantmod, rdatamarket, WDI
– Database: RODBC, DBI, RJDBC, ROracle, RPostgreSQL, RSQLite, RMongo, RCassandra
– Data sets: zipcode, agridat, GANPAdata
– Data access: crn, rgbif, RISmed, govdat, myepisodes, msProstate, corpora
• rhbase from the RHadoop project
– https://github.com/RevolutionAnalytics/RHadoop
33
34. When I first said that R is my “Swiss Army
Knife” for data, you might have pictured this:
36. Thank you!
by Jeffrey Breen
Principal, Think Big Academy
Code & Data on github
http://bit.ly/pawdata email: [email protected]
blog: http://jeffreybreen.wordpress.com
Twitter: @JeffreyBreen
36