Talk at ISIM 2017 in Durham, UK on applying database techniques to querying model results in the geosciences, with a broader position about the interaction between data science and simulation as modes of scientific inquiry.
Data Structures for Statistical Computing in Python (Wes McKinney)
The document discusses statistical data structures in Python. It summarizes that structured arrays are commonly used to store statistical data sets but have limitations. The R data frame is introduced as a flexible alternative that inspired the pandas library in Python. Pandas aims to create intuitive data structures for statistical analysis with labeled axes and automatic data alignment. Its core data structure, the DataFrame, functions similarly to R's data frame.
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
This document provides an overview of Continuum Analytics and Python for data science. It discusses how Continuum created two organizations, Anaconda and NumFOCUS, to support open source Python data science software. It then describes Continuum's Anaconda distribution, which brings together 200+ open source packages like NumPy, SciPy, Pandas, Scikit-learn, and Jupyter that are used for data science workflows involving data loading, analysis, modeling, and visualization. The document outlines how Continuum helps accelerate adoption of data science through Anaconda and provides examples of industries using Python for data science.
Data Wrangling and Visualization Using Python (MOHITKUMAR1379)
Python is open source and has many libraries for data wrangling and visualization that make data scientists' lives easier. For data wrangling, pandas is used: it represents tabular data and provides functions for parsing data from different sources, cleaning data, handling missing values, merging data sets, and more. To visualize data, the low-level matplotlib library can be used; it also serves as the base for higher-level packages such as seaborn, which draw well-customized plots in a single line of code. Python's Dash framework makes it possible to build interactive web applications in Python without JavaScript or HTML, and these Dash applications can be published on any server or cloud, such as Google Cloud, or for free on Heroku.
Scipy 2011 Time Series Analysis in Python (Wes McKinney)
1) The document discusses statsmodels, a Python library for statistical modeling that implements standard statistical models. It includes tools for linear regression, descriptive statistics, statistical tests, time series analysis, and more.
2) The talk provides an overview of using statsmodels for time series analysis, including descriptive statistics, autoregressive moving average (ARMA) models, vector autoregression (VAR) models, and filtering tools.
3) The discussion highlights the development of statsmodels and the need for integrated statistical data structures and user interfaces to make Python more competitive with R for data analysis and statistics.
Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014 (Jason Riedy)
High-performance graph analysis is unlocking knowledge in problems like anomaly detection in computer security, community structure in social networks, and many other data integration areas. While graphs provide a convenient abstraction, real-world problems' sparsity and lack of locality challenge current systems. This talk will cover current trends ranging from massive scales to low-power, low-latency systems and summarize opportunities and directions for graphs and computing systems.
Structured Data Challenges in Finance and Statistics (Wes McKinney)
This document discusses structured data challenges in finance and statistics. It introduces Wes McKinney and his work developing pandas, an open-source Python library designed for working with structured and time series data. Pandas includes data structures like the DataFrame, which allows for fast and flexible data manipulation, indexing, and aggregation of tabular data. The document argues that existing tools are still lacking for working with structured data and that pandas was created to optimize ease-of-use, flexibility, and performance.
Chapter 8.3, Data Mining: Concepts and Techniques, 2nd Ed. slides, Han & Kamber (error007)
The document discusses sequential pattern mining algorithms. It begins by introducing sequential patterns and challenges in mining them from transaction databases. It then describes the Apriori-based GSP algorithm, which generates candidate sequences level-by-level and scans the database multiple times. The document also introduces pattern-growth methods like PrefixSpan that avoid candidate generation by projecting databases based on prefixes. Finally, it discusses optimizations like pseudo-projection that speed up sequential pattern mining.
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st... (huguk)
By Rob Claxton, Chief Researcher in Big Data
Video at: https://www.youtube.com/watch?v=gMUSDRljGM8&index=2&list=PL5OOLwV_m9vaoNt0wM9BVjd_gWyseq0IR
STRIP: stream learning of influence probabilities (Albert Bifet)
This document presents a method called STRIP (Streaming Learning of Influence Probabilities) for learning influence probabilities between users in a social network from a streaming log of propagations. It describes three solutions: (1) storing the whole social graph in memory, (2) using min-wise independent hashing to estimate probabilities while using sublinear space, and (3) estimating probabilities only for the most active users to be more space efficient. Experimental results on a Twitter dataset showed these solutions provided good approximations while using reasonable memory and processing time.
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs (Jason Riedy)
Graph-structured data in network security, social networks, finance, and other applications not only are massive but also under continual evolution. The changes often are scattered across the graph, permitting novel parallel and incremental analysis algorithms. We discuss analysis algorithms for streaming graph data to maintain both local and global metrics with low latency and high efficiency.
Data engineering and analytics using Python (Purna Chander)
This document provides an overview of data engineering and analytics using Python. It discusses Jupyter notebooks and commonly used Python modules for data science like Pandas, NumPy, SciPy, Matplotlib and Seaborn. It describes Anaconda distribution and the key features of Pandas including data loading, structures like DataFrames and Series, and core operations like filtering, mapping, joining, sorting, cleaning and grouping. It also demonstrates data visualization using Seaborn and a machine learning example of linear regression.
A walk through the maze of understanding Data Visualization using several tools such as Python, R, Knime and Google Data Studio.
This workshop is hands-on and this set of presentations is designed to be an agenda to the workshop
This document discusses updating PageRank for streaming graphs. It begins with background on PageRank and motivation for incremental PageRank on changing graphs. It then presents an approach for incrementally updating PageRank based on a backward error view, solving the residual regionally around changes. Performance results show the incremental method has better latency than restarting PageRank iteration from scratch. The document concludes with discussion of implementing the incremental algorithm using the GraphBLAS framework to reduce overhead.
High-performance graph analysis is unlocking knowledge in computer security, bioinformatics, social networks, and many other data integration areas. Graphs provide a convenient abstraction for many data problems beyond linear algebra. Some problems map directly to linear algebra. Others, like community detection, look eerily similar to sparse linear algebra techniques. And then there are algorithms that strongly resist attempts at making them look like linear algebra. This talk will cover recent results with an emphasis on streaming graph problems where the graph changes and results need updated with minimal latency. We’ll also touch on issues of sensitivity and reliability where graph analysis needs to learn from numerical analysis and linear algebra.
The document discusses the need for an analytics query engine that allows machine learning algorithms to be specified declaratively and executed using distributed operators and optimization techniques. It proposes a language with a SQL-like syntax and the use of Datalog to express machine learning algorithms declaratively. Key operators for tasks like linear algebra, aggregation, and iteration would be defined. The engine would optimize queries by rewriting operators and using techniques from databases and machine learning.
The document proposes improved MapReduce algorithms (UP-Growth and UP-Growth+) for mining high utility itemsets from transactional databases. Existing algorithms often generate a huge number of candidate itemsets, degrading performance. The proposed algorithms use a UP-Tree structure and MapReduce framework on Hadoop to more efficiently identify high utility itemsets from large datasets in distributed storage. Experimental results show the improved algorithms outperform other methods, especially for databases with long transactions or low minimum utility thresholds. The goal is to address limitations of existing approaches for low-memory systems and databases with null transactions.
The document discusses various algorithms for searching data structures, including serial search with average time complexity of Θ(n), binary search with average time complexity of Θ(log n), and hashing techniques that can provide constant time Θ(1) search by storing items in an array using a hash function. It provides pseudocode for binary search and discusses improvements like interpolation search that can achieve Θ(log log n) search time on average.
A Comparison of Different Strategies for Automated Semantic Document Annotation (Ansgar Scherp)
We introduce a framework for automated semantic document annotation that is composed of four processes, namely concept extraction, concept activation, annotation selection, and evaluation. The framework is used to implement and compare different annotation strategies motivated by the literature. For concept extraction, we apply entity detection with semantic hierarchical knowledge bases, Tri-gram, RAKE, and LDA. For concept activation, we compare a set of statistical, hierarchy-based, and graph-based methods. For selecting annotations, we compare top-k as well as kNN. In total, we define 43 different strategies including novel combinations like using graph-based activation with kNN. We have evaluated the strategies using three different datasets of varying size from three scientific disciplines (economics, politics, and computer science) that contain 100,000 manually labeled documents in total. We obtain the best results on all three datasets by our novel combination of entity detection with graph-based activation (e.g., HITS and Degree) and kNN. For the economic and political science datasets, the best F-measure is .39 and .28, respectively. For the computer science dataset, the maximum F-measure of .33 can be reached. These are by far the largest experiments on scholarly content annotation, where datasets typically contain only up to a few hundred documents.
Gregor Große-Bölting, Chifumi Nishioka, and Ansgar Scherp. 2015. A Comparison of Different Strategies for Automated Semantic Document Annotation. In Proceedings of the 8th International Conference on Knowledge Capture (K-CAP 2015). ACM, New York, NY, USA, Article 8, 8 pages. DOI: http://dx.doi.org/10.1145/2815833.2815838
Efficient Online Evaluation of Big Data Stream Classifiers (Albert Bifet)
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
Leveraging Bagging for Evolving Data Streams (Albert Bifet)
The document presents new methods for leveraging bagging for evolving data streams. It discusses using randomization techniques like Poisson distributions for input data and random output codes to increase diversity among classifiers. Experimental results on data streams with concept drift show the proposed methods like Leveraging Bagging and Leveraging Bagging MC improve accuracy over baselines like Hoeffding Trees and Online Bagging, while methods like Leveraging Bagging ME reduce RAM-Hours usage. The paper aims to improve accuracy and resource usage for data stream mining under concept drift.
The document discusses different technologies for storing and querying large chemical datasets, known as "big chemical data". It evaluates PostgreSQL, SQLite, MessagePack, FlatBuffers, and Pandas on a test dataset of 4 million compounds from ZINC. For queries like retrieving atom counts for 50k molecules, counting molecules by atom number, and fingerprint lookups, SQLite and MessagePack performed the fastest, completing in under 50ms. PostgreSQL was also very fast with indices, finishing some queries in under 100ms. The document concludes no single technology is best and the complexity of the tool should match the task.
ffbase, statistical functions for large datasets (Edwin de Jonge)
This document introduces ffbase, an R package that adds statistical functions and utilities for working with large datasets stored in ff format. ffbase allows standard R code to be used on ff objects by rewriting expressions to operate chunkwise. It also connects ff data to other packages for large data analysis. The goal is to make working with large out-of-memory data more convenient and productive within the R environment.
Machine Learning with Apache Flink at Stockholm Machine Learning Group (Till Rohrmann)
This presentation presents Apache Flink's approach to scalable machine learning: Composable machine learning pipelines, consisting of transformers and learners, and distributed linear algebra.
The presentation was held at the Machine Learning Stockholm group on the 23rd of March 2015.
A talk I gave at the MMDS workshop June 2014 on the Myria system as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
The document discusses machine learning techniques for graphs and graph-parallel computing. It describes how graphs can model real-world data with entities as vertices and relationships as edges. Common machine learning tasks on graphs include identifying influential entities, finding communities, modeling dependencies, and predicting user behavior. The document introduces the concept of graph-parallel programming models that allow algorithms to be expressed by having each vertex perform computations based on its local neighborhood. It presents examples of graph algorithms like PageRank, product recommendations, and identifying leaders that can be implemented in a graph-parallel manner. Finally, it discusses challenges of analyzing large real-world graphs and how systems like GraphLab address these challenges through techniques like vertex-cuts and asynchronous execution.
Distributed approximate spectral clustering for large scale datasets (Bita Kazemi)
The document proposes a distributed approximate spectral clustering (DASC) algorithm to process large datasets in a scalable way. DASC uses locality sensitive hashing to group similar data points and then approximates the kernel matrix on each group to reduce computation. It implements DASC using MapReduce and evaluates it on real and synthetic datasets, showing it can achieve similar clustering accuracy to standard spectral clustering but with an order of magnitude better runtime by distributing the computation across clusters.
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S... (Pooyan Jamshidi)
https://arxiv.org/abs/1606.06543
Finding optimal configurations for Stream Processing Systems (SPS) is a challenging problem due to the large number of parameters that can influence their performance and the lack of analytical models to anticipate the effect of a change. To tackle this issue, we consider tuning methods where an experimenter is given a limited budget of experiments and needs to carefully allocate this budget to find optimal configurations. We propose in this setting Bayesian Optimization for Configuration Optimization (BO4CO), an auto-tuning algorithm that leverages Gaussian Processes (GPs) to iteratively capture posterior distributions of the configuration spaces and sequentially drive the experimentation. Validation based on Apache Storm demonstrates that our approach locates optimal configurations within a limited experimental budget, with an improvement of SPS performance typically of at least an order of magnitude compared to existing configuration algorithms.
Continuous Architecting of Stream-Based Systems (CHOOSE)
Pooyan Jamshidi CHOOSE Talk 2016-11-01
Big data architectures have been gaining momentum in recent years. For instance, Twitter uses stream processing frameworks like Storm to analyse billions of tweets per minute and learn the trending topics. However, architectures that process big data involve many different components interconnected via semantically different connectors making it a difficult task for software architects to refactor the initial designs. As an aid to designers and developers, we developed OSTIA (On-the-fly Static Topology Inference Analysis) that allows: (a) visualizing big data architectures for the purpose of design-time refactoring while maintaining constraints that would only be evaluated at later stages such as deployment and run-time; (b) detecting the occurrence of common anti-patterns across big data architectures; (c) exploiting software verification techniques on the elicited architectural models. In the lecture, OSTIA will be shown on three industrial-scale case studies.
See: http://www.choose.s-i.ch/events/jamshidi-2016/
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ... (Ian Foster)
This document discusses computing challenges posed by rapidly increasing data scales in scientific applications and high performance computing. It introduces the concept of online data analysis and reduction as an alternative to traditional offline analysis to help address these challenges. The key messages are that dramatic changes in HPC system geography due to different growth rates of technologies are driving new application structures and computational logistics problems, presenting exciting new computer science opportunities in online data analysis and reduction.
The document summarizes a presentation given by Chris Fregly on end-to-end real-time analytics using Apache Spark. It discusses topics like Spark streaming, machine learning, tuning Spark for performance, and demonstrates live demos of sorting, matrix multiplication, and thread synchronization optimized for CPU cache. The presentation emphasizes techniques like cache-friendly data layouts, prefetching, and lock-free algorithms to improve Spark performance.
This document discusses approximate query processing using sampling to enable interactive queries over large datasets. It describes BlinkDB, a framework that creates and maintains samples from underlying data to return fast, approximate query answers with error bars. BlinkDB verifies the correctness of the error bars it returns by periodically replacing samples and using diagnostics to check the accuracy without running many queries. The document discusses challenges like selecting appropriate samples, estimating errors, and verifying results to balance speed, accuracy and correctness for interactive analysis of big data.
This presentation focuses on Deep Learning (DL) concepts, such as neural neworks, backprop, activation functions, and Convolutional Neural Networks, with a short introduction to D3, and followed by a TypeScript-based code sample that replicates the TensorFlow playground. Basic knowledge of matrices is helpful.
Optimization for iterative queries on MapReduce (Makoto Onizuka)
This document discusses optimization techniques for iterative queries with convergence properties. It presents OptIQ, a framework that uses view materialization and incrementalization to remove redundant computations from iterative queries. View materialization reuses operations on unmodified attributes by decomposing tables into invariant and variant views. Incrementalization reuses operations on unmodified tuples by processing delta tables between iterations. The document evaluates OptIQ on Hive and Spark, showing it can improve performance of iterative algorithms like PageRank and k-means clustering by up to 5 times.
The document discusses using machine learning techniques like Gaussian processes (GPs) to optimize the configuration of software systems. It notes that software performance landscapes are often complex, with non-linear interactions between parameters and non-convex response surfaces. Measurements are also subject to noise. The document introduces an approach called TL4CO that uses multi-task Gaussian processes to model software performance across different versions/deployments, allowing it to leverage data from other versions to improve optimization. This helps address challenges in DevOps where new versions are continuously delivered.
This document provides an overview of deep learning, machine learning, and artificial intelligence. It discusses the differences between traditional AI, machine learning, and deep learning. Key deep learning concepts covered include neural networks, activation functions, cost functions, gradient descent, backpropagation, and hyperparameters. Convolutional neural networks and their applications are explained. Recurrent neural networks are also introduced. The document discusses TypeScript and how it can be used for deep learning applications.
This document discusses performing data science on HBase using the WibiData platform. It introduces WibiData Language (WDL), which allows analyzing data stored in HBase columns in a concise and interactive way using Scala and Apache Crunch. The document demonstrates building a histogram of editor metrics by reading user data from an HBase table, filtering and binning average edit deltas, and visualizing the results. WDL aims to make HBase data exploration more accessible for data scientists compared to other frameworks like Hive and Pig.
Lambda expressions allow code to be passed as data in Java 8. The talk discusses myths and mistakes around lambda expressions, providing an introduction and examples. It emphasizes that syntax is less important than functional thinking and addresses common issues like debugging, testing and compiler errors. Functional thinking focuses on inputs and outputs rather than steps and requires practice to learn.
Automatic Task-based Code Generation for High Performance DSEL (Joel Falcou)
Providing high-level tools for parallel programming while sustaining a high level of performance is a challenge that techniques like Domain Specific Embedded Languages (DSELs) try to solve. In previous work, we investigated the design of such a DSEL, NT2, which provides a Matlab-like syntax for parallel numerical computations inside a C++ library.
The main issue addressed here is how the limitations of classical DSEL generation and multithreaded code generation can be overcome.
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
Advanced Data Science on Spark (Reza Zadeh, Stanford), Spark Summit
The document provides an overview of Spark and its machine learning library MLlib. It discusses how Spark uses resilient distributed datasets (RDDs) to perform distributed computing tasks across clusters in a fault-tolerant manner. It summarizes the key capabilities of MLlib, including its support for common machine learning algorithms and how MLlib can be used together with other Spark components like Spark Streaming, GraphX, and SQL. The document also briefly discusses future directions for MLlib, such as tighter integration with DataFrames and new optimization methods.
The document discusses using machine learning techniques to learn vector representations of SQL queries that can then be used for various workload management tasks without requiring manual feature engineering. It shows that representations learned from SQL strings using models like Doc2Vec and LSTM autoencoders can achieve high accuracy for tasks like predicting query errors, auditing users, and summarizing workloads for index recommendation. These learned representations allow workload management to be database agnostic and avoid maintaining database-specific feature extractors.
This document discusses the responsible use of data science techniques and technologies. It describes data science as answering questions using large, noisy, and heterogeneous datasets that were collected for unrelated purposes. It raises concerns about the irresponsible use of data science, such as algorithms amplifying biases in data. The work of the DataLab group at the University of Washington is presented, which aims to address these issues by developing techniques to balance predictive accuracy with fairness, increase data sharing while protecting privacy, and ensure transparency in datasets and methods.
Brief remarks on big data trends and responsible data science at the Workshop on Science and Technology for Washington State: Advising the Legislature, October 4th 2017 in Seattle.
Bill Howe discussed emerging topics in responsible data science for the next decade. He described how the field will focus more on what should be done with data rather than just what can be done. Specifically, he talked about incorporating societal constraints like fairness, transparency and ethics into algorithmic decision making. He provided examples of unfair outcomes from existing algorithms and discussed approaches to measure and achieve fairness. Finally, he discussed the need for reproducibility in science and potential techniques for more automatic scientific claim checking and deep data curation.
This document discusses democratizing data science in the cloud. It describes how cloud data management involves sharing resources like infrastructure, schema, data, and queries between tenants. This sharing enables new query-as-a-service systems that can provide smart cross-tenant services by learning from metadata, queries, and data across all users. Examples of possible services discussed include automated data curation, query recommendation, data discovery, and semi-automatic data integration. The document also describes some cloud data systems developed at the University of Washington like SQLShare and Myria that aim to realize this vision.
The document discusses teaching data ethics in data science education. It provides context about the eScience Institute and a data science MOOC. It then presents a vignette on teaching data ethics using the example of an alcohol study conducted in Barrow, Alaska in 1979. The study had methodological and ethical issues in how it presented results to the community. The document concludes by discussing incorporating data ethics into all of the Institute's data science programs and initiatives like automated data curation and analyzing scientific literature visuals.
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction:” understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
Urban data science activities at the University of Washington, presented at the Urban@UW kickoff event.
http://urban.uw.edu/
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
A talk at the Urban Science workshop at the Puget Sound Regional Council July 20 2014 organized by the Northwest Institute for Advanced Computing, a joint effort between Pacific Northwest National Labs and the University of Washington.
This document summarizes a presentation about Myria, a relational algorithmics-as-a-service platform developed by researchers at the University of Washington. Myria allows users to write queries and algorithms over large datasets using declarative languages like Datalog and SQL, and executes them efficiently in a parallel manner. It aims to make data analysis scalable and accessible for researchers across many domains by removing the need to handle low-level data management and integration tasks. The presentation provides an overview of the Myria architecture and compiler framework, and gives examples of how it has been used for projects in oceanography, astronomy, biology and medical informatics.
Talk delivered at High Performance Transaction Processing 2013
Myria is a new Big Data service being developed at the University of Washington. We feature high level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation.
In this talk, we describe the motivation for another big data platform emphasizing requirements emerging from the physical, life, and social sciences.
A 25 minute talk from a panel on big data curricula at JSM 2013
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
The University of Washington eScience Institute aims to help position UW at the forefront of eScience techniques and technologies. Its strategy includes hiring research scientists, adding faculty in key fields, and building a consultancy of students. The exponential growth of data is transitioning science from data-poor to data-rich. Techniques like sensors, data management, and cloud computing are important. The "long tail" of smaller science projects is also worthy of investment and can have high impact if properly supported.
A taxonomy for data science curricula; a motivation for choosing a particular point in the design space; an overview of some our activities, including a coursera course slated for Spring 2012
Relational databases remain underused in the long tail of science, despite a number of significant success stories and a natural correspondence between scientific inquiry and ad hoc database query. Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented approaches still dominate. At the University of Washington eScience Institute, we are exploring a new "delivery vector" for selected database features targeting researchers in the long tail: a web-based query-as-a-service system called SQLShare that eschews conventional database design, instead emphasizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over "raw" tabular data. We augment the basic query interface with services for cleaning and integrating data, recommending and authoring queries, and automatically generating visualizations. We find that even non-programmers are able to create and share SQL views for a variety of tasks, including quality control, integration, basic analysis, and access control. Researchers in oceanography, molecular biology, and ecology report migrating data to our system from spreadsheets, from conventional databases, and from ASCII files. In this paper, we will provide some examples of how the platform has enabled science in other domains, describe our SQLShare system, and propose some emerging research directions in this space for the database community.
This document discusses the roles that cloud computing and virtualization can play in reproducible research. It notes that virtualization allows for capturing the full computational environment of an experiment. The cloud builds on this by providing scalable resources and services for storage, computation and managing virtual machines. Challenges include costs, handling large datasets, and cultural adoption issues. Databases in the cloud may help support exploratory analysis of large datasets. Overall, the cloud shows promise for improving reproducibility by enabling sharing of full experimental environments and resources for computationally intensive analysis.
This document discusses enabling end-to-end eScience through integrating query, workflow, visualization, and mashups at an ocean observatory. It describes using a domain-specific query algebra to optimize queries on unstructured grid data from ocean models. It also discusses enabling rapid prototyping of scientific mashups through visual programming frameworks to facilitate data integration and analysis.
The Other HPC: High Productivity Computing in Polystore Environments
1. The Other HPC: High Productivity Computing in Polystore Environments
Bill Howe, Ph.D.
Associate Professor, Information School
Adjunct Associate Professor, Computer Science & Engineering
Associate Director, eScience Institute
Director, Urbanalytics Group
8/7/2017, Bill Howe, UW
3. What is the rate-limiting step in data understanding?
[Figure: processing power vs. time. Processing power (Moore's Law) and the amount of data in the world grow rapidly, while human cognitive capacity stays flat. Idea adapted from "Less is More" by Bill Buxton (2001); slide src: Cecilia Aragon, UW HCDE]
4. Productivity: how long I have to wait for results
[Figure: a latency axis from milliseconds to months (milliseconds, seconds, minutes, hours, days, weeks, months), marking a feasibility threshold and an interactivity threshold, with HPC systems and databases positioned along it.]
Claim: Only these two performance thresholds are generally important; other performance requirements are application-specific.
5. HPC vs. DB/Dataflow
HPC: priority is machine efficiency; data manipulation considered pre-processing; batch.
DB/Dataflow: priority is developer efficiency; analysis considered post-processing; batch and interactive.
6. Observations
• Every interesting application has both a data manipulation component and an analytics component
• Different people like to express things different ways
• Different systems offer better performance at different things
• …but in between people and systems, there is no real difference in expressiveness between linear and relational algebra
• So we want full "anything anywhere" rewrites
8. Matrix Multiply
Matrix multiply in RA:
select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
Sparse means: |non-zero elements| < |rows|^1.2
Naïve sparse algorithm: |non-zero elements| * |rows|
Best-known dense algorithm: |rows|^2.38
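To make the correspondence concrete, here is a minimal sketch (mine, not from the talk) that runs the slide's RA formulation of sparse matrix multiply on a toy example with Python's sqlite3 and checks it against the naive definition C[i,k] = sum_j A[i,j]*B[j,k]; the (i, j, val) table layout follows the slide, everything else is illustrative.

# A minimal sketch (not from the talk): the slide's RA formulation of sparse
# matrix multiply on a toy example, checked against the naive definition.
import sqlite3

A = {(0, 0): 2.0, (0, 2): 1.0, (1, 1): 3.0}   # sparse matrix as {(i, j): val}
B = {(0, 0): 4.0, (2, 0): 5.0, (1, 1): 6.0}   # sparse matrix as {(j, k): val}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (i INT, j INT, val REAL)")
conn.execute("CREATE TABLE B (j INT, k INT, val REAL)")
conn.executemany("INSERT INTO A VALUES (?,?,?)", [(i, j, v) for (i, j), v in A.items()])
conn.executemany("INSERT INTO B VALUES (?,?,?)", [(j, k, v) for (j, k), v in B.items()])

# The query from the slide: matrix multiply as a join plus group-by aggregation.
rows = conn.execute("""
    SELECT A.i, B.k, SUM(A.val * B.val)
    FROM A, B
    WHERE A.j = B.j
    GROUP BY A.i, B.k
""").fetchall()
C_sql = {(i, k): v for i, k, v in rows}

# Naive reference: C[i,k] = sum over j of A[i,j] * B[j,k].
C_ref = {}
for (i, j), va in A.items():
    for (j2, k), vb in B.items():
        if j == j2:
            C_ref[(i, k)] = C_ref.get((i, k), 0.0) + va * vb

assert C_sql == C_ref
print(C_sql)   # {(0, 0): 13.0, (1, 1): 18.0}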
9. Complexity of matrix multiply
[Plot: complexity exponent vs. sparsity exponent r, where m = n^r; n = number of rows, m = number of non-zeros. Naïve sparse algorithm: mn. Best-known sparse algorithm: m^0.7 n^1.2 + n^2. Best-known dense algorithm: n^2.38. There is lots of room between these bounds. Slide adapted from Zwick; R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication.]
18. CombBLAS vs. MyriaX (Real Data)
• CombBLAS 10X faster on one dataset
• MyriaX 1.5X faster on another!
19. A x B x C
group . join . group . join:
select AB.i, C.m, sum(AB.val*C.val)
from
  (select A.i, B.k, sum(A.val*B.val)
   from A, B
   where A.j = B.j
   group by A.i, B.k
  ) AB,
  C
where AB.k = C.k
group by AB.i, C.m

group . join . join:
select A.i, C.m, sum(A.val*B.val*C.val)
from A, B, C
where A.j = B.j
and B.k = C.k
group by A.i, C.m
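The two plans are equivalent because the nested aggregation can be flattened; writing the product out as a sum makes this explicit:

C[i,m] = \sum_{k} \left( \sum_{j} A[i,j]\,B[j,k] \right) C[k,m] = \sum_{j,k} A[i,j]\,B[j,k]\,C[k,m]

The first form materializes the intermediate AB; the second aggregates once over both join variables.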
20. Observations
• Every interesting application has both a data manipulation component and an analytics component
• Different people like to express things different ways
• Different systems offer better performance at different things
• …but in between people and systems, there is no real difference in expressiveness between linear and relational algebra
• So we want full "anything anywhere" rewrites
26. Example: Combine measurements from sensors, compute means & covariances
Preprocessing (easier to express in RA); analysis (easier to express in LA).
[Dylan Hutchison]
27. Example: Sensor Difference Mean & Covariance
Array of Things sensor data (https://arrayofthings.github.io/), collected in CSV files. Sample rows (t, c, v): (466, temp, 55.2); (466, hum, 40.1); (492, temp, 56.3); (492, hum, 35.0); (528, temp, 56.5).
[Dataflow diagram: each stream is filtered and binned onto common time buckets, the streams are subtracted, then the mean and covariance are computed. Preprocessing (easier to express in RA); analysis (easier to express in LA).]
[Dylan Hutchison]
28. Bin query: easy in RA, harder in LA
Input table (t, c, v): (466, temp, 55.2); (466, hum, 40.1); (492, temp, 56.3); (492, hum, 35.0); (528, temp, 56.5). Equivalently, a matrix with rows 466, 492, 528 and columns temp, hum.
Binned table (t', c, v): (460, temp, 55.2); (460, hum, 40.1); (520, temp, 56.4); (520, hum, 35.0). Equivalently, a matrix with rows 460, 520 and columns temp, hum.
bin(t) = t - t % 60 + 60 * ⌊t % 60 / 60 + .5⌋
RA:
SELECT bin(t) AS t', c, avg(v) AS v
GROUP BY t', c
LA: multiply by a 0/1 bucket-assignment matrix (rows 460, 520; columns 466, 492, 528; bucket 460 covers time 466, bucket 520 covers times 492 and 528), using avg on added elements.
[Dylan Hutchison]
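As a concrete illustration of the asymmetry, the sketch below (my own toy code, not the talk's) performs the binning both ways: the RA style as a group-by with avg, and the LA style as multiplication by a 0/1 bucket-assignment matrix with averaging of collided entries. The bin60 rule used here (round t to the nearest multiple of 60) is an assumed stand-in, not necessarily the slide's exact rule, so the bucket boundaries differ from the slide's example.

# Toy sketch: bin sensor readings onto common time buckets, RA style vs. LA style.
# The bin60 rule (round to the nearest multiple of 60) is an assumption for
# illustration; it is not necessarily the rule used on the slide.
import numpy as np
from collections import defaultdict

readings = [(466, "temp", 55.2), (466, "hum", 40.1),
            (492, "temp", 56.3), (492, "hum", 35.0),
            (528, "temp", 56.5)]

def bin60(t):
    return int(round(t / 60.0)) * 60

# RA style: SELECT bin60(t) AS t2, c, avg(v) ... GROUP BY t2, c
groups = defaultdict(list)
for t, c, v in readings:
    groups[(bin60(t), c)].append(v)
ra_result = {key: sum(vs) / len(vs) for key, vs in groups.items()}

# LA style: lay the readings out as a times x channels matrix, then left-multiply
# by a 0/1 bucket-assignment matrix and divide each bucket row by its size.
times = sorted({t for t, _, _ in readings})
chans = sorted({c for _, c, _ in readings})
X = np.full((len(times), len(chans)), np.nan)
for t, c, v in readings:
    X[times.index(t), chans.index(c)] = v

buckets = sorted({bin60(t) for t in times})
B = np.zeros((len(buckets), len(times)))
for j, t in enumerate(times):
    B[buckets.index(bin60(t)), j] = 1.0

counts = B.sum(axis=1, keepdims=True)
la_result = (B @ np.nan_to_num(X)) / counts   # averages entries that collide

# Note: the missing (528, hum) reading shows up as the default 0 in the LA view,
# which is exactly the associative-table semantics discussed on the later slides.
print(ra_result)
print(dict(zip(buckets, la_result.tolist())))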
29. Covariance query: easy in LA, harder in RA
X is an n × d matrix (n points, d attributes).
M = (1/n) 1^T X is a 1 × d matrix.
C = (1/n) X^T X - M^T M is a d × d matrix.
LA:
N = size(X, 1);
M = mean(X, 1);
C = X'*X / N - M'*M;
RA (generated SQL statements for each entry):
T = SELECT FROM X sum(1.0) AS N,
  sum(X1) AS M1, sum(X2) AS M2, …, sum(Xd) AS Md,
  sum(X1*X1) AS Q11, sum(X1*X2) AS Q12, …,
  sum(Xd-1*Xd) AS Q(d-1)d, sum(Xd*Xd) AS Qdd
C = SELECT FROM T
  (1 AS i, 1 AS j, Q11/N - M1*M1 AS v) UNION
  (1 AS i, 2 AS j, Q12/N - M1*M2 AS v) UNION
  …
Carlos Ordonez. Building Statistical Models and Scoring with UDFs. SIGMOD 2007.
[Dylan Hutchison]
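A quick numerical check of the LA formulation (my own sketch with numpy, not the talk's code): with X an n x d data matrix, M = (1/n)·1^T·X and C = (1/n)·X^T·X - M^T·M is the population covariance matrix, which matches what numpy computes.

# Check that C = X'X/n - M'M equals the population covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # n = 100 points, d = 3 attributes

n = X.shape[0]
M = X.mean(axis=0, keepdims=True)      # 1 x d row vector of column means
C = X.T @ X / n - M.T @ M              # d x d covariance from one pass of sums

assert np.allclose(C, np.cov(X, rowvar=False, bias=True))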
32. [Architecture diagram: MyriaL on top; the Myria middleware compiles a logical algebra and applies rewrite rules into system-specific algebras (parallel, array, graph, key-value, serial); these target back-end APIs for Spark, Myria, CombBLAS, GEMS, Accumulo, and serial C. Shared services: visualization, logging, discovery, history, browsing. Orchestration and execution of the polystore plan.]
33. [The same architecture diagram, extended with the LARA algebra: a LARA API and LARA physical plans sit between the middleware and LaraDB (on Accumulo).]
34. LARA objects and operators
Objects: associative tables, i.e. total functions from keys to values with finite support, with a default value for each value attribute.
Example (keys k1, k2; values v1 [default 0], v2 [default '']):
(a, 37) -> (7, 'dan'); (a, 20) -> (0, ''); (b, 25) -> (0, 'dylan'); (b, 20) -> (2, 'bill')
Operators: Join ⋈⊗ ("horizontal concat"), Union ⋈⊕ ("vertical concat"), and Extension ext_f ("flatmap"; Ext is a restricted form of monadic bind).
UDFs: ⊗, ⊕, f. Think "semiring".
Join and Union adapted from: M. Spight and V. Tropashko. First steps in relational lattice. 2006.
35. Join: Horizontal Concat
A (keys a, c; values x [default 0], z [default 0]):
(a1, c1) -> (11, 1); (a1, c2) -> (12, 2); (a2, c1) -> (13, 3); (a3, c3) -> (14, 4)
B (keys c, b; values z [default 0'], y [default 0']):
(c1, b1) -> (5, 15); (c2, b1) -> (6, 16); (c2, b2) -> (7, 17); (c4, b1) -> (8, 18)
A ⋈⊗ B (keys a, c, b; value z [default 0 ⊗ 0']):
(a1, c1, b1) -> 1 ⊗ 5; (a2, c1, b1) -> 3 ⊗ 5; (a1, c2, b1) -> 2 ⊗ 6; …
(a row such as (a3, c3, b1) -> 4 ⊗ 0' = 0 ⊗ 0' falls back to the default)
Requires: vA ⊗ 0' = 0 ⊗ vB = 0 ⊗ 0'
36. Union: Vertical Concat
A (keys a, c; values x [default 0], z [default 0]):
(a1, c1) -> (11, 1); (a1, c2) -> (12, 2); (a2, c1) -> (13, 3); (a3, c3) -> (14, 4)
B (keys c, b; values z [default 0], y [default 0]):
(c1, b1) -> (5, 15); (c2, b1) -> (6, 16); (c2, b2) -> (7, 17); (c4, b1) -> (8, 18)
A ⋈⊕ B (key c; values x [default 0], z [default 0 ⊕ 0 = 0], y [default 0]):
c1 -> (11 ⊕ 13, 1 ⊕ 3 ⊕ 5, 15); c2 -> (12, 2 ⊕ 6 ⊕ 7, 16 ⊕ 17); c3 -> (14, 4, 0); c4 -> (0, 8, 18)
Requires: v ⊕ 0 = 0 ⊕ v = v
37. Ext: Flatmap
A (keys a, c; values x [default 0], z [default 0]):
(a1, c1) -> (11, 1); (a1, c2) -> (12, 2); (a2, c1) -> (13, 3)
f(a, c, x, z) = { k' = ac -> v' = x - z; k' = ca -> v' = z - x }
ext_f A (keys a, c, k'; value v' [default 0 - 0 = 0]):
(a1, c1, a1c1) -> 11 - 1; (a1, c1, c1a1) -> 1 - 11;
(a1, c2, a1c2) -> 12 - 2; (a1, c2, c2a1) -> 2 - 12;
(a2, c1, a2c1) -> 13 - 3; (a2, c1, c1a2) -> 3 - 13
Requires: …
38. Summary: Union, Join, Ext
Union (A ⋈⊕ B): key types = K_A ∩ K_B; value types = V_A ∪ V_B; support ⊆ S_A ∪ S_B
Join (A ⋈⊗ B): key types = K_A ∪ K_B; value types = V_A ∩ V_B; support ⊆ S_A ∩ S_B
Ext (ext_f A): key types extended by f; value types set by f; support ⊆ S_A × S_f
For Support, '⊆' becomes '=' if ⊕ is zero-sum-free or ⊗ has the zero-product property.
Duality.
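To make the summary concrete, here is a toy Python sketch (my own, not LaraDB code) of associative tables and the Join and Union operators tabulated above. A table is modeled as (key attributes, value defaults, rows); the ⊗ and ⊕ UDFs are passed in as plain functions; Ext and the requirements on defaults are omitted for brevity.

# Toy associative tables: (key_attrs, value_defaults, rows), where rows maps a
# key tuple to a dict of values and missing keys implicitly hold the defaults.
def lara_join(A, B, otimes):
    # Join: keys = K_A union K_B, values = V_A intersect V_B, combined with otimes.
    ka, da, ra = A
    kb, db, rb = B
    keys = ka + [k for k in kb if k not in ka]
    vals = [v for v in da if v in db]
    out = {}
    for key_a, row_a in ra.items():
        for key_b, row_b in rb.items():
            amap, bmap = dict(zip(ka, key_a)), dict(zip(kb, key_b))
            if all(amap[k] == bmap[k] for k in ka if k in kb):   # match shared keys
                merged = tuple({**amap, **bmap}[k] for k in keys)
                out[merged] = {v: otimes(row_a[v], row_b[v]) for v in vals}
    defaults = {v: otimes(da[v], db[v]) for v in vals}
    return keys, defaults, out

def lara_union(A, B, oplus):
    # Union: keys = K_A intersect K_B, values = V_A union V_B, collisions folded with oplus.
    ka, da, ra = A
    kb, db, rb = B
    keys = [k for k in ka if k in kb]
    defaults = {**da, **db}
    out = {}
    for src_keys, rows in ((ka, ra), (kb, rb)):
        for key, row in rows.items():
            kmap = dict(zip(src_keys, key))
            short = tuple(kmap[k] for k in keys)
            acc = out.setdefault(short, dict(defaults))
            for v, x in row.items():
                acc[v] = oplus(acc[v], x)
    return keys, defaults, out

# Example shaped like slides 35-36: A has keys (a, c) and value z; B has keys (c, b)
# and value z. Join with * looks like the inner step of a sparse matrix multiply.
A = (["a", "c"], {"z": 0}, {("a1", "c1"): {"z": 1}, ("a1", "c2"): {"z": 2}})
B = (["c", "b"], {"z": 0}, {("c1", "b1"): {"z": 5}, ("c2", "b1"): {"z": 6}})
print(lara_join(A, B, lambda x, y: x * y)[2])    # {('a1', 'c1', 'b1'): {'z': 5}, ('a1', 'c2', 'b1'): {'z': 12}}
print(lara_union(A, B, lambda x, y: x + y)[2])   # {('c1',): {'z': 6}, ('c2',): {'z': 8}}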
39. LARA Properties
> If ⊕ or ⊗ is associative, commutative, or idempotent, then so is Union or Join.
> (Push Aggregation into Join) If ⊗ distributes over ⊕, then … = sum(AB ⊗ C^T) [rewrite equation on the slide].
> (Distribute Join over Union) If …, then … [equations on the slide not captured in this transcript].
43. Microbenchmark: 1% selection, 20 GB. Avoid long code paths. [Benchmark chart; Brandon Myers, ICS 16]
44. Q2 of SP2Bench, 100M triples, multiple self-joins: communication optimization. (Brandon Myers, ICS 16)
45. Graph Patterns (RADISH)
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler
• One of Myria's supported back ends
• Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
(Brandon Myers, ICS 16)
47. Recap
• Productivity is the new performance
• …but this doesn't mean giving up orders of magnitude in performance by doing everything on one system
• Everything interesting is LA + RA
• There is no difference except syntax and systems
• We want to comprehensively optimize across them, generate code anywhere
48. Other Productivity Work
• Workload Analytics for SQL Data Lakes
– Shrainik Jain
• AI for Scientific Data Curation
– Maxim Grechkin, Hoifung Poon (MSR)
• Visualization Recommendation
– Kanit “Ham” Wongsuphasawat, Dom Moritz, Jeff
Heer
• Information Extraction from Scientific Figures
– Poshen Lee, Sean Yang
• Scalable Approximate Community Detection
– Seung-Hee Bae (Western Michigan)
49. The SQLShare Corpus: a multi-year log of hand-written SQL queries

Queries  24275
Views     4535
Tables    3891
Users      591

(Shrainik Jain, SIGMOD 2016)
https://uwescience.github.io/sqlshare
Workload Analytics for Data Lakes
50. Data "Grazing": short dataset lifetimes
lifetime = days between first and last access of a table
(Shrainik Jain, SIGMOD 2016)
https://uwescience.github.io/sqlshare/
52. Key idea: embed queries as vectors
• Learn query embeddings; use them for all workload analytics tasks:
– Query recommendation
– Workload summarization / index selection
– User behavior modeling
– Predicting heavy hitters
– Forensics
• Get rid of specialized feature engineering (see the sketch below)
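A minimal sketch with gensim (assuming gensim 4.x; illustrative, not the actual SQLShare pipeline) of learning fixed-length embeddings for SQL query strings; the toy queries, whitespace tokenizer, and hyperparameters are all assumptions for illustration:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

queries = [
    "SELECT name FROM users WHERE age > 30",
    "SELECT name FROM users WHERE age > 50",
    "SELECT count(*) FROM orders GROUP BY region",
]
corpus = [TaggedDocument(words=q.lower().split(), tags=[i])
          for i, q in enumerate(queries)]          # naive whitespace tokenization

model = Doc2Vec(vector_size=32, min_count=1, epochs=50)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# embed an unseen query and find its nearest neighbors in the workload
vec = model.infer_vector("SELECT name FROM users WHERE age > 40".lower().split())
print(model.dv.most_similar([vec], topn=2))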
53. Doc2Vec on SQL
Can we recover known patterns in the workload? TPC-H queries, generated with different parameters.

54. Doc2Vec on Templatized Query Plans
Can we recover known patterns in the workload? TPC-H queries, generated with different parameters.
58. Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the bottleneck to data sharing. (Maxim Grechkin, Hoifung Poon)
61. Can we curate algorithmically? (Maxim Grechkin, Hoifung Poon)
color = labels supplied as metadata
clusters = first two PCA dimensions on the gene expression data itself
The expression data and the text labels appear to disagree.
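A minimal sketch (scikit-learn and matplotlib, with a synthetic placeholder matrix rather than real GEO data) of the check described above: project expression profiles onto the first two principal components and color the points by the submitted metadata labels to see whether labels and expression clusters agree.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
expression = rng.normal(size=(200, 5000))     # placeholder: samples x genes
labels = rng.integers(0, 4, size=200)         # placeholder: metadata label id per sample

coords = PCA(n_components=2).fit_transform(expression)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=10)
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.title("Expression clusters vs. submitted labels")
plt.show()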
63. Deep Curation (Maxim Grechkin, Hoifung Poon)
Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other's results.
Free-text classifier
Expression classifier
NIPS 18 (review)
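A rough sketch of the co-learning loop described above (a hypothetical set-up, not the actual model from the submission): two sklearn-style classifiers over different views of the same samples, each retrained on labels the other is confident about.

import numpy as np

def co_train(text_clf, expr_clf, X_text, X_expr, y_seed, rounds=5, threshold=0.9):
    # y_seed: one integer label per sample, -1 where unlabeled (the distant-supervision seeds)
    y = y_seed.copy()
    for _ in range(rounds):
        labeled = y >= 0
        text_clf.fit(X_text[labeled], y[labeled])
        expr_clf.fit(X_expr[labeled], y[labeled])
        unlabeled = np.flatnonzero(~labeled)
        if unlabeled.size == 0:
            break
        # each classifier proposes labels for samples it is confident about
        for clf, X in ((text_clf, X_text), (expr_clf, X_expr)):
            proba = clf.predict_proba(X[unlabeled])
            confident = proba.max(axis=1) >= threshold
            y[unlabeled[confident]] = clf.classes_[proba.argmax(axis=1)[confident]]
    return text_clf, expr_clf, y

Any pair of probabilistic classifiers (for example, logistic regression over text features and over expression features) could stand in for text_clf and expr_clf.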
64. Deep Curation: our stuff wins, with ZERO training data (Maxim Grechkin, Hoifung Poon)
[Plot legend: state of the art; our reimplementation of the state of the art; our dueling-pianos NN. X-axis: amount of training data used]
NIPS 18 (review)
65. Viziometrics: Analysis of Visualization in the Scientific Literature (Poshen Lee)
[Plot: proportion of non-quantitative figures in a paper vs. paper impact, grouped into 5% percentiles]
70. "Philadelphia is grappling with the prospect of a racist computer algorithm"
The Special Committee on Criminal Justice Reform's hearing on reducing the pre-trial jail population. (Technical.ly, September 2016)

Any background signal of institutional racism in the data is
  amplified by the algorithm
  operationalized by the algorithm
  legitimized by the algorithm

"Should I be afraid of risk assessment tools?"
"No, you gotta tell me a lot more about yourself. At what age were you first arrested? What is the date of your most recent crime?"
"And what's the culture of policing in the neighborhood in which I grew up in?"
71. Amazon Prime Now Delivery Area: Atlanta (Bloomberg, 2016)
72. Amazon Prime Now Delivery Area: Boston (Bloomberg, 2016)
73. Amazon Prime Now Delivery Area: Chicago (Bloomberg, 2016)
74. The way I think about this… (1)
First decade of Data Science research and practice:
What can we do with massive, noisy, heterogeneous datasets?
Next decade of Data Science research and practice:
What should we do with massive, noisy, heterogeneous datasets?
75. The way I think about this… (2)
Decisions are based on two sources of information:
1. Past examples
e.g., "prior arrests tend to increase likelihood of future arrests"
2. Societal constraints
e.g., "we must avoid racial discrimination"
We've become very good at automating the use of past examples.
We've only just started to think about incorporating societal constraints.
76. The way I think about this… (3)
How do we apply societal constraints to algorithmic decision-making?
Option 1: Rely on human oversight
Ex: the EU General Data Protection Regulation requires that a human be involved in legally binding algorithmic decision-making
Ex: the Wisconsin Supreme Court says a human must review algorithmic decisions made by recidivism models
Issues with scalability and prejudice
Option 2: Build systems to help enforce these constraints
This is the approach we are exploring
77. The way I think about this… (4)
On transparency vs. accountability:
• For human decision-making, explanations are sometimes required, improving transparency
– Supreme Court decisions
– Employee reprimands/terminations
• But when transparency is difficult, accountability takes over
– medical emergencies, business decisions
• As we shift decisions to algorithms, we lose both transparency AND accountability
• "The buck stops where?"
#3: And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
#4: … but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
#10:
This is the complexity of three matrix multiply algorithms plotted against the sparsity – a naïve sparse
#14: BLAS: fail to assign memory!!
spBLAS: # of outputs > 2B with r1.6 data.
overflow issues based on 32 bit integers
long ints may work
hyperDB: thrashing with r1.6.
#15: speedup = T_HyperDB / T_SpBLAS
benchmark datasets with r is 1.2 and the real data cases (the three largest datasets: 1.17 < r < 1.20)
on star (nTh = 12), on dragon (nTh = 60)
As n increases, the relative speedup of SpBLAS over HyperDB is reduced.
soc-Pokec: the relative speedup is only around 5 times.
on star, hyperDB stuck on thrashing with soc-Pokec data.
#27: Bridging the gap: towards optimization across linear and relational
#28: Array of Things sensor data from Argonne National Labs. First started as a project with the city of Chicago.
Suppose we have two such sensors and we wish to study the differences in their measurements. We would like to know what the mean differences are and how the measurement types covary. For example, when the first sensor records a higher temperature than the second, does that correlate with a larger humidity measurement as well?
Calibration, compare newly manufactured sensor to the golden standard
#29: We know how to optimize SQL – AGG ---- YOU KNOW THIS
Problems: Matrix needs indexing set; LA doesn't typically have methods to transform indices like this. LA operators work well when transforming the values. Matrix is infinite in general
Assoc(bin(Row(A)), Row(A), 1) *_avg A
#30: This approach stores each column of the matrix X as a separate attribute.
Another approach is (i, j, v). We could use Lara to write the equations for this.
To prove that the MATLAB implements the LA, use: (1/n) (X − 1M)ᵀ(X − 1M) = (1/n) XᵀX − MᵀM
Use n/(n-1) for bias correction.
U = X - repmat(M, N, 1);
C = U.' * U ./ (N - 1);
Bessel’s Correction for sample variance: N/(N – 1)
#32:
So our approach is to model this overlap in capabilities as its own language.
We start
#35: These operators are not totally novel; they are inspired by the following work.
Two syntaxes: COBOL/SQL-style for writing scripts, algebraic/combinatory-style for proofs.
INSPIRED BY INTERESTING GENERALIZATION OF UNION. Flatmap is monotonic in key type.
#36: Multiply matching values by UDF ⊗s, one for each value attribute
Each ⊗ has default value as annihilator: vA ⊗ 0' = 0 ⊗ vB = 0 ⊗ 0'
Keys: = K_A ∪ K_B
Values: = V_A ∩ V_B
Support: ⊆ S_A ∩ S_B (on common keys), with equality when ⊗ has the zero-product property
#37: Sum colliding values by UDF ⨁s, one for each value attribute
Default values must match
Each ⨁ has default value as identity: v ⨁ 0 = 0 ⨁ v = v
Agg: special case when one table is empty
example: aggregates down to key i
Keys: = K_A ∩ K_B
Values: = V_A ∪ V_B
Support: ⊆ S_A ∪ S_B (on common keys), with equality when ⊕ is zero-sum-free
#38: Run f on each row independently, replaces values
Appends new keys (never deletes)
Monotonic in key type; no collisions
No free variables
Use Join/Union to interface with external tables
Requirement for finite support:
Map: special case when k' = ()
Rename: map that changes attribute names
RA: CROSS APPLY
LA: EXPLODE
#42:
NOTES:
Optimizations enable?
with better semantics on a hash table join with UDFs, can do redundant computation elimination, code motion from UDF
#50: We want to not just build the system, we want to understand how people are using it
#51: Why do we care about lifetime?
Table usage predictions for caching and partitioning. Move from reactive to proactive physical design services.
Query idioms are consistent, while the data is fleeting. Not exact queries as in a streaming system, but the “methods” are reused over and over.
Extracting and optimizing these idioms across tenants is our goal.
#59: Google knowledge graph
Specialized Ontologies
#60: Google knowledge graph
Specialized Ontologies
#61: Google knowledge graph
Specialized Ontologies
#62: "HeLa", "K562", "MCF-7" and "brain tumor”
PCA on expression values
#63: Google knowledge graph – common knowledge, high redundancy, possibly crowdsourcing (visual: question answering via Google)
Text features:
presence of ontology terms
sibling of ontology term
Expression features
#71: LSI-R model
25 states use it
Most for targeted programs
Idaho and Colorado use this for sentencing
“As a Black male,” Cobb asked Penn statistician and resident expert Richard Berk, “should I be afraid of risk assessment tools?”
“No,” Berk said, without skipping a beat. “You gotta tell me a lot more about yourself. … At what age were you first arrested? What is the date of your most recent crime? What are you charged with?”
Cobb interjected: “And what’s the culture of policing in the neighborhood in which I grew up in?”
(emphasis mine)
That's exactly the point (and to Michael -- this is what I was arguing about with the guy from Comcast): a little bit of institutional racism has a triple effect:
a) institutional racism is amplified by the algorithm (a small signal can now dominate the model)
b) institutional racism is operationalized by the algorithm (it's far easier now to make impactful decisions based on bad data)
c) institutional racism is legitimized by the algorithm (so that everyone thinks "it's just data" and actively defends the algorithm's assumed objectivity, even when the racist results are staring you right in the face. This vigorous defense doesn't happen when a human is shown to be correlating their decisions perfectly with race.)
#75: On which projects should we engage?
How can we ensure fairness, accountability, and transparency for algorithmic decision-making?
How do we ensure privacy?
How do we avoid junk science?