A talk given to the Bristol Clojurians on 21st April 2015.
The book (print and ebook) is available here: http://cljds.com/cljds-book
This document provides an overview of using Clojure for data science. It discusses why Clojure is suitable for data science due to its functional programming capabilities, performance on the JVM, and rich library ecosystem. It introduces core.matrix, a Clojure library that provides multi-dimensional array programming functionality through Clojure protocols. The document covers core.matrix concepts like array creation and manipulation, element-wise operations, broadcasting, and optional support for mutability. It also discusses core.matrix implementation details like the performance benefits of using Clojure protocols.
From Lisp to Clojure/Incanter and R: An Introduction (elliando dias)
This document provides a comparison between the statistical computing languages R and Clojure/Incanter. It discusses the histories and philosophies behind Lisp, Fortran, R and Clojure. Key differences noted are that Clojure runs on the Java Virtual Machine, allowing it to leverage Java libraries, while R is primarily written in C and Fortran. Incanter is presented as a Clojure-based platform for statistical computing and graphics that is less mature than R but allows easier access to Java capabilities. Basic syntax comparisons are provided.
Procedural Content Generation with Clojure (Mike Anderson)
This document provides an introduction to procedural content generation with Clojure. It defines procedural content generation as the programmatic generation of content using algorithms, which may incorporate random or pseudo-random processes. It gives some examples of content types like images, music, and game content that can be generated procedurally. The document then introduces Clojure as a functional programming language that runs on the JVM and has data types like keywords, symbols, and immutable collections. It describes the Clisk library for Clojure, which allows functional composition of images. The document provides examples of generating simple images and transforming them using techniques like scaling, offsetting, and warping with noise functions. It demonstrates live coding of images using Clisk.
This document summarizes a presentation about machine learning. It begins with a definition of machine learning as giving computers the ability to learn without being explicitly programmed. It then provides examples of tasks that machine learning can perform, such as spam filtering and stock market prediction. The document notes that machine learning works to some degree but not perfectly. It introduces a company called Nuroko that is building a machine learning toolkit with desirable properties such as being general-purpose, powerful, scalable, real-time, and pragmatic. The document explains why the company chose Clojure as its programming language and provides an overview of key machine learning concepts and abstractions like vectors, coders, tasks, modules, and algorithms.
Presentation given at the 2013 Clojure Conj on core.matrix, a library that brings multi-dimensional array and matrix programming capabilities to Clojure.
The document provides an agenda for a two-day workshop on Clojure. Day one covers Clojure overviews and fundamentals including syntax, functions, flow control, and collections. Day two covers additional topics like testing, concurrency, polymorphism, performance, and tooling. The document also provides background on Clojure being a Lisp designed for functional programming and concurrency on the JVM.
This document discusses various iteration techniques in Java including for loops, iterators, and enhanced for loops. It provides examples of iterating over lists, sets, maps, and arrays. It also summarizes common object methods like toString(), equals(), hashCode(), and finalize(). The finalize() method is called by the garbage collector before an object is destroyed to allow for cleanup.
These are the outline slides that I used for the Pune Clojure Course.
The slides may not be very useful standalone, but I have uploaded them for reference.
An example of using Kotlin language features to write a DSL for the Spark-Cassandra connector, with a comparison of Kotlin's DSL features against similar features in other JVM languages (Scala, Groovy).
The document discusses functional programming concepts in Ruby. It begins by stating that functional programming and Enumerable methods can be useful in Ruby. It then provides examples of various Enumerable methods like zip, select, partition, map, and inject. It encourages thinking functionally by avoiding side effects, mutating values, and using functional parts of the standard library. The document concludes by suggesting learning a true functional language to further improve functional programming skills.
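The Enumerable methods the summary mentions (zip, select, partition, map, inject) have close standard-library analogues in many languages; a minimal sketch in Python, purely for illustration:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

# zip: pair elements from two sequences
pairs = list(zip(nums, "abcde"))

# select -> filter/comprehension: keep elements matching a predicate
evens = [n for n in nums if n % 2 == 0]

# partition: split into (matching, non-matching)
odds, evens_too = [n for n in nums if n % 2], [n for n in nums if not n % 2]

# map: transform each element
squares = [n * n for n in nums]

# inject -> reduce: fold a sequence into one value
total = reduce(lambda acc, n: acc + n, nums, 0)

print(evens, squares, total)  # [2, 4] [1, 4, 9, 16, 25] 15
```

The functional-style advice in the summary (avoid side effects, avoid mutating values) applies equally here: each step builds a new value rather than modifying `nums`.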
The document outlines topics covered in a NetworkX tutorial, including installation, basic classes, generating graphs, analyzing graphs, saving/loading graphs, and plotting graphs with Matplotlib. Specific sections cover local and cluster installation of NetworkX, adding nodes and edges to graphs along with attributes, basic graph properties like number of nodes/edges and neighbors, simple graph generators, random graph generators, and the algorithms package.
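The basic graph operations the tutorial covers (adding nodes and edges, counting them, listing neighbors) can be sketched with a plain adjacency dict; NetworkX exposes the same ideas via `G.add_edge`, `G.number_of_nodes()`, and `G.neighbors()`. This dict-based version is an illustrative stand-in, not the NetworkX API:

```python
# Minimal adjacency-dict sketch of undirected-graph basics.
graph = {}

def add_edge(g, u, v):
    # Record each endpoint in the other's neighbor set.
    g.setdefault(u, set()).add(v)
    g.setdefault(v, set()).add(u)

add_edge(graph, "a", "b")
add_edge(graph, "a", "c")

num_nodes = len(graph)
# Each undirected edge appears in two neighbor sets, so halve the sum.
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
neighbors_of_a = sorted(graph["a"])

print(num_nodes, num_edges, neighbors_of_a)  # 3 2 ['b', 'c']
```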
The document discusses using Clojure for Hadoop programming. Clojure is a dynamic functional programming language that runs on the Java Virtual Machine. The document provides an overview of Clojure and how its features like immutability and concurrency make it well-suited for Hadoop. It then shows examples of implementing Hadoop MapReduce jobs using Clojure by defining mapper and reducer functions.
The document discusses using Clojure for Hadoop programming. It introduces Clojure as a new Lisp dialect that runs on the Java Virtual Machine. It then covers Clojure's data types and collections. The remainder of the document demonstrates how to write mappers and reducers for Hadoop jobs using Clojure, presenting three different approaches to defining jobs.
R is a programming language for data analysis and statistics. It allows users to enter commands at the prompt ">" to perform calculations and manipulate numeric and other objects like vectors and matrices. Basic objects in R include numeric, integer, character, complex, and logical values. Vectors are the most basic data structure and can contain elements of the same type. Matrices are two-dimensional vectors that store values in rows and columns. Functions like c(), seq(), and rep() can be used to create, combine and replicate vectors and sequences of values.
This document provides an overview of TensorFlow and how to implement machine learning models using TensorFlow. It discusses:
1) How to install TensorFlow either directly or within a virtual environment.
2) The key concepts of TensorFlow including computational graphs, sessions, placeholders, variables and how they are used to define and run computations.
3) An example one-layer perceptron model for MNIST image classification to demonstrate these concepts in action.
Scala is a multi-paradigm programming language that runs on the Java Virtual Machine. It integrates features of object-oriented and functional programming languages. Some key features of Scala include: supporting both object-oriented and functional programming, providing improvements over Java in areas like syntax, generics, and collections, and introducing new features like pattern matching, traits, and implicit conversions.
The document discusses programming with futures in Java and Scala. It introduces futures in Java 8 using CompletableFuture and shows how they allow composing asynchronous operations without blocking threads. It then discusses how streams and futures in Java 8 share similar composition concepts using thenApply and thenCompose. The talk moves on to introduce more abstract concepts from category theory - monads, foldables and monoids. It shows how these concepts can be implemented for futures and lists to provide generic sequencing and folding of asynchronous and synchronous operations in a precise way.
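The Java composition idioms described (thenApply to map a future's result, thenCompose to chain a future-returning step) have rough analogues in Python's `concurrent.futures`; Python's `Future` has no built-in combinators, so this sketch composes via submitted callables. The helper names mirror the Java methods and are my own:

```python
from concurrent.futures import ThreadPoolExecutor

def then_apply(executor, future, fn):
    # Map the result of `future` in another task, without blocking the caller.
    return executor.submit(lambda: fn(future.result()))

def then_compose(executor, future, fn):
    # Chain a step that itself returns a future, flattening the nesting.
    return executor.submit(lambda: fn(future.result()).result())

with ThreadPoolExecutor() as pool:
    f = pool.submit(lambda: 21)
    doubled = then_apply(pool, f, lambda x: x * 2)  # like thenApply
    chained = then_compose(pool, doubled,
                           lambda x: pool.submit(lambda: x + 1))  # like thenCompose
    print(chained.result())  # 43
```

The "generic sequencing" point from the talk shows up here too: both helpers take any future and any function, so they compose without caring what computation produced the input.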
The document compares TypeScript and Rust by providing examples of common programming concepts like variables, functions, collections, and iterators. It shows how concepts are implemented similarly in both languages, though the syntax differs. Key points covered include declaring immutable and mutable variables, defining and calling functions, working with collections like arrays/vectors through methods like map and filter, and how iterators are implemented and consumed in each language.
The Ring programming language version 1.4.1 book - Part 7 of 31 (Mahmoud Samir Fayed)
This document provides documentation on various file handling, system, and debugging functions in Ring. It describes functions for reading and writing files (Fread(), Fwrite()), checking if a file exists (Fexists()), getting environment variables (SysGet()), determining the operating system (IsWindows(), IsLinux(), etc.), handling errors (Try/Catch), executing code dynamically (Eval()), and more. Examples are provided to demonstrate the usage of each function.
Coscup2021 - Useful abstractions in Rust and its practical usage (Wayne Tsai)
This document provides a summary of a presentation in Chinese about useful abstractions and syntax in Rust. It begins with an introduction of the speaker and their background. The content covers why Rust is useful, collections and iterators in Rust, the Option and Result enums, and concludes with a discussion of how Rust is being used. Key points include:
- Rust provides memory safety and high performance through its borrowing system and compiler checks
- Collections like vectors can be iterated over and methods like map, filter and collect allow transforming and collecting values
- Option and Result are useful for handling errors and absent values, avoiding panics
- Fast-fail validation can be done by chaining Results with and
19. Java data structures algorithms and complexity (Intro C# Book)
In this chapter we will compare the data structures we have learned so far by the performance (execution speed) of the basic operations (addition, search, deletion, etc.). We will give specific tips in what situations what data structures to use.
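The kind of comparison the chapter makes (which structures are fast at addition, search, deletion) can be spot-checked empirically. A small Python sketch contrasting search in a list (linear scan) with search in a set (hash lookup); the sizes are arbitrary demo values:

```python
import timeit

n = 100_000
data_list = list(range(n))
data_set = set(data_list)
target = n - 1  # worst case for the linear scan

# Time 100 membership tests against each structure.
list_time = timeit.timeit(lambda: target in data_list, number=100)
set_time = timeit.timeit(lambda: target in data_set, number=100)

print(f"list search: {list_time:.4f}s, set search: {set_time:.4f}s")
# The hash-based lookup is typically orders of magnitude faster for large n,
# matching the O(n) vs O(1) complexities such chapters tabulate.
```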
The Ring programming language version 1.2 book - Part 24 of 84 (Mahmoud Samir Fayed)
The document describes various functions available in the Ring standard library (stdlib.ring). It provides 37 functions organized into categories like math functions, string functions, date functions, etc. Each function includes its syntax, description and an example of its usage. Some key functions described are: evenorodd() to check if a number is even or odd, factors() to compute factors of a number, matrixmulti() to multiply matrices, and dayofweek() to get the day of the week from a date.
This document provides an overview of basic data structures in Python including stacks, queues, deques, and linked lists. It describes each data structure as an abstract data type with common operations. Implementations of each data structure are provided using Python classes. The stack, queue, and deque classes implement the respective data structures using Python lists. The linked list class implements nodes to link elements and allow for traversal. Examples are given demonstrating usage of each data structure implementation.
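The structures described (list-backed stack and queue, a deque, and a node-linked list) can be sketched compactly with the standard library; this is a minimal illustration, not the document's own class-based implementations:

```python
from collections import deque

# Stack: Python list, LIFO via append/pop
stack = []
stack.append(1)
stack.append(2)   # push
top = stack.pop() # pop -> 2

# Queue: collections.deque, FIFO via append/popleft
queue = deque()
queue.append("a")
queue.append("b")        # enqueue
front = queue.popleft()  # dequeue -> "a"

# Linked list: nodes holding a value and a reference to the next node
class Node:
    def __init__(self, value, next=None):
        self.value, self.next = value, next

head = Node(1, Node(2, Node(3)))
values = []
node = head
while node:               # traverse by following the links
    values.append(node.value)
    node = node.next

print(top, front, values)  # 2 a [1, 2, 3]
```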
Rainer Grimm, “Functional Programming in C++11” (Platonov Sergey)
C++ is a multi-paradigm language, so the programmer can choose and combine structured, object-oriented, generic, and functional approaches. The functional side of C++ was significantly extended by the C++11 standard: lambda functions, variadic templates, std::function, std::bind. (Talk language: English.)
Building a website in Haskell coming from Node.js (Nicolas Hery)
This document summarizes Nicolas Hery's experience building a website in Haskell after coming from a Node.js background. It discusses choosing a web framework in Haskell, using types to document data, handling optional values, refactoring code, and deploying to Docker and Heroku. It also notes both benefits of Haskell, like compiler-checked refactoring, and challenges, like syntax and documentation.
Learn to manipulate strings in R using the built in R functions. This tutorial is part of the Working With Data module of the R Programming Course offered by r-squared.
Islamic Civilization in the Era of Caliph Abu Bakr (Zayyin YinLa)
The document outlines a strategic plan with 3 main goals and various tactics to achieve each goal. The first goal is to increase market share by 15% over the next year through targeted advertising campaigns, expanding product offerings, and improving customer service. The second goal is to reduce costs by 10% by streamlining operations, negotiating better supplier contracts, and automating repetitive tasks. The third goal is to develop a new line of products to capitalize on emerging industry trends and help drive future growth.
CEVA reported record revenues of $16.2 million for Q3 2015. Non-GAAP Earnings Per Share was 22 cents, driven by robust licensing and a record 27 million CEVA-powered LTE shipments.
Grupo Reifs focuses on improving the quality of life of older people with care needs, developing the promotion and management of the necessary resources.
CEVA, Inc. reports Q1 2015 total revenues of $13.8m, driven by a record twelve licensing deals signed in the quarter. Non-GAAP earnings per share is 8 cents. For more highlights from the quarter, including LTE and Bluetooth shipment updates, view the infographic.
CEVA, Inc. reported Q1 2016 total revenues of $16.5 million, and non-GAAP earnings per share of 17 cents. More than 230 million CEVA-powered devices shipped in the quarter, including a record 35 million LTE smartphones. For more highlights from Q1, view the infographic.
White Paper - CEVA-XM4 Intelligent Vision Processor (CEVA, Inc.)
A change has come to consumer electronics. Once confined to the desktop, processing-intensive algorithms for image enhancement, computational photography and computer vision have moved en masse to camera-ready smartphones, tablets, wearables and other embedded mobile devices. This movement has already hit the limits of today's underlying hardware's ability to keep pace in terms of performance, space and energy efficiency, yet we are only seeing the tip of the iceberg.
A clear and tangible indicator of recent advances in mobile imaging and vision that are pushing these limits of design is the dual-camera smartphone, with its accompanying sensor and signal-chain processing for 3D vision and scanning, along with many other image-enhancement features. While consumers may believe they are coming closer to the ideal camera-plus-phone converged solution, designers and equipment manufacturers understand that compromises have been made as the increasingly advanced algorithms are simply relying upon the pre-existing hardware.
This hardware, typically comprising a CPU and a GPU, was not designed to support such processing-intensive imaging algorithms, forcing developers to compromise on features and image quality to match the processing capabilities of the hardware. Even so, the total application continues to consume too much power and drastically shortens battery life.
As newer and more-complex algorithms develop to meet both consumer demand for increased functionality as well as manufacturers’ need for differentiation, an alternate approach to the underlying vision processing architecture is required if the delicate balance between functionality and acceptable battery life is to be maintained. This alternate approach relies on the adoption of dedicated, on-chip vision processors that are able to cope with both current and future complex imaging and vision algorithms. CEVA-XM4 is exactly that, a fully programmable processor that was designed from the ground up to accelerate the most demanding image-processing and computer-vision algorithms.
This document supplies an overview of the CEVA-XM4 processor’s capabilities, architecture, features, target applications, use cases and code examples.
Having programmers do data science is terrible; if only everyone else were not even worse. The problem is, of course, tools. We seem to have settled on either a bunch of disparate libraries thrown into a more or less agnostic IDE, or some point-and-click wonder which, no matter how glossy, never seems to truly fit our domain once we get down to it. The dual Lisp tradition of grow-your-own-language and grow-your-own-editor gives me hope there is a third way.
This talk is a meditation on the ideal environment for doing data science and how to (almost) get there. I will cover how I approach data problems with Clojure (and why Clojure in the first place), what I believe the process of doing data science should look like, and the tools needed to get there. Some already exist (or can at least be bodged together); others can be made with relative ease (and we are already working on some of these); but a few will take a lot more hammock time.
Clojure has always been good at manipulating data. With the release of spec and Onyx (“a masterless, cloud scale, fault tolerant, high performance distributed computation system”) good became best. In this talk you will learn about a data layer architecture built around Kafka and Onyx that is self-describing, declarative, scalable and convenient to work with for the end user. The focus will be on the power and elegance of describing data and computation with data, and the inferences and automations that can be built on top of that.
R is an open source statistical computing platform that is rapidly growing in popularity within academia. It allows for statistical analysis and data visualization. The document provides an introduction to basic R functions and syntax for assigning values, working with data frames, filtering data, plotting, and connecting to databases. More advanced techniques demonstrated include decision trees, random forests, and other data mining algorithms.
Data-Oriented Programming with Clojure and Jackdaw (Charles Reese, Funding Ci...) - confluent
When Funding Circle needed to scale its lending platform, we chose Kafka and Clojure. More than a programming language, Clojure is an interactive development environment with which you can build up an application function by function in a continuous unbroken flow. Since 2016 we have been developing our lending platform using Clojure and Kafka Streams, and today we process millions of transaction dollars daily. In 2018 we released "Jackdaw", our open-source Clojure library for working with Kafka Streams. In this talk, attendees will learn a radical new approach to building stream processing applications in a highly productive environment, one they can use immediately via Jackdaw or apply to their favorite programming system.
User Defined Aggregation in Apache Spark: A Love Story (Databricks)
This document summarizes a user's journey developing a custom aggregation function for Apache Spark using a T-Digest sketch. The user initially implemented it as a User Defined Aggregate Function (UDAF) but ran into performance issues due to excessive serialization/deserialization. They then worked to resolve it by implementing the function as a custom Aggregator using Spark 3.0's new aggregation APIs, which avoided unnecessary serialization and provided a 70x performance improvement. The story highlights the importance of understanding how custom functions interact with Spark's execution model and optimization techniques like avoiding excessive serialization.
User Defined Aggregation in Apache Spark: A Love Story (Databricks)
Defining customized scalable aggregation logic is one of Apache Spark’s most powerful features. User Defined Aggregate Functions (UDAF) are a flexible mechanism for extending both Spark data frames and Structured Streaming with new functionality ranging from specialized summary techniques to building blocks for exploratory data analysis.
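Spark's Aggregator contract (a zero state, a per-row reduce, a merge of partial states, and a finish step) can be sketched in plain Python to show why it scales: partitions fold locally, then only the compact partial states are merged. The class and the list-of-lists "partitions" here are illustrative, not Spark code:

```python
from functools import reduce

class MeanAggregator:
    """Computes a mean via the (zero, reduce, merge, finish) pattern."""
    def zero(self):
        return (0.0, 0)                   # (running sum, count)

    def reduce(self, acc, x):
        return (acc[0] + x, acc[1] + 1)   # fold one row into the state

    def merge(self, a, b):
        return (a[0] + b[0], a[1] + b[1]) # combine two partial states

    def finish(self, acc):
        return acc[0] / acc[1]            # final result from the state

agg = MeanAggregator()
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0]]  # stand-in for distributed data

# Each "partition" folds locally...
partials = []
for part in partitions:
    state = agg.zero()
    for x in part:
        state = agg.reduce(state, x)
    partials.append(state)

# ...then only the small partial states are merged and finished.
final = agg.finish(reduce(agg.merge, partials))
print(final)  # 3.0
```

This structure is also why the serialization issue in the story above matters: only the state tuples cross partition boundaries, so a state that is expensive to serialize (like a large sketch) dominates the cost.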
R is a free software environment for statistical computing and graphics that provides a wide variety of statistical techniques and graphical methods. It includes base functions and packages, and is used through interfaces like RStudio. R represents data using objects like vectors, matrices, and data frames. Common operations include calculations, generating random variables, and visualizing data. R can be used to analyze a glass fragment dataset to visualize compositions and potentially classify an unknown fragment.
This document provides an overview of the R programming language. It describes R as a functional programming language for statistical computing and graphics that is open source and has over 6000 packages. Key features of R discussed include matrix calculation, data visualization, statistical analysis, machine learning, and data manipulation. The document also covers using RStudio as an IDE, reading and writing different data types, programming features like flow control and functions, and examples of correlation, regression, and plotting in R.
Advanced Data Visualization in R - Some Examples (Dr. Volkan OBAN)
This document provides examples of using the geomorph package in R for advanced data visualization. It includes code snippets showing how to visualize geometric morphometric data using functions like plotspec() and plotRefToTarget(). It also includes an example of creating a customized violin plot function for comparing multiple groups and generating simulated data to plot.
Using R in financial modeling provides an introduction to using R for financial applications. It discusses importing stock price data from various sources and visualizing it using basic graphs and technical indicators. It also covers topics like calculating returns, estimating distributions of returns, correlations, volatility modeling, and value at risk calculations. The document provides examples of commands and functions in R to perform these financial analytics tasks on sample stock price data.
This document discusses refactoring Java code to Clojure using macros. It provides examples of refactoring Java code that uses method chaining to equivalent Clojure code using the threading macros (->> and -<>). It also discusses other Clojure features like type hints, the doto macro, and polyglot projects using Leiningen.
PostgreSQL 10 introduces several new features including parallel query, logical replication, performance improvements and other enhancements. Parallel query uses multiple CPUs to speed up queries, especially scans, joins and aggregations. Logical replication allows replicating specific table changes rather than entire transactions. Overall PostgreSQL 10 aims to improve performance, scalability and capabilities for large database workloads.
This document discusses building regression and classification models in R, including linear regression, generalized linear models, and decision trees. It provides examples of building each type of model using various R packages and datasets. Linear regression is used to predict CPI data. Generalized linear models and decision trees are built to predict body fat percentage. Decision trees are also built on the iris dataset to classify flower species.
F# is well-suited for data analysis tasks due to its capabilities in data access, manipulation, visualization and integration with other tools. The document outlines several F# libraries and techniques for:
1. Accessing data from various sources using FSharp.Data and type providers.
2. Visualizing data with libraries like FSharp.Charting.
3. Manipulating and transforming data using techniques like Deedle frames, Math.NET for statistics, and calling R from F#.
4. Leveraging parallelism through {m}brace for distributed computing.
This document discusses time series analysis techniques in R, including decomposition, forecasting, clustering, and classification. It provides examples of decomposing the AirPassengers dataset, forecasting with ARIMA models, hierarchical clustering on synthetic control chart data using Euclidean and DTW distances, and classifying the control chart data using decision trees with DWT features. Accuracy of over 88% was achieved on the classification task.
The document discusses various built-in functions in Python including numeric, string, and container data types. It provides examples of using list comprehensions, dictionary comprehensions, lambda functions, enumerate, zip, filter, any, all, map and reduce to manipulate data in Python. It also includes references to online resources for further reading.
This document discusses different types of data visualizations that can be created using the matplotlib library in Python. It covers basic visualizations like line charts, bar charts, histograms, and scatterplots. Examples are provided for each type of visualization to illustrate how to construct them programmatically and customize aspects like titles, labels, legends. Guidelines are given for effective visualization practices, such as using consistent axis scales. The document demonstrates how to explore and communicate data through different matplotlib plotting functions.
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxcarliotwaycave
INFORMATIVE ESSAY
The purpose of the Informative Essay assignment is to choose a job or task that you know how to do and then write a minimum of 2 full pages, maximum of 3 full pages, Informative Essay teaching the reader how to do that job or task. You will follow the organization techniques explained in Unit 6.
Here are the details:
1. Read the Lecture Notes in Unit 6. You may also find the information in Chapter 10.5 in our text on Process Analysis helpful. The lecture notes will really be the most important to read in writing this assignment. However, here is a link to that chapter that you may look at in addition to the lecture notes:
https://open.lib.umn.edu/writingforsuccess/chapter/10-5-process-analysis/ (Links to an external site.)
2. Choose your topic, that is, the job or task you want to teach. As the notes explain, this should be a job or task that you already know how to do, and it should be something you can do well. At this point, think about your audience (reader). Will your reader need any knowledge or experience to do this job or task, or will you write these instructions for a general reader where no experience is required to perform the job?
3. Plan your outline to organize this essay. Unit 6 notes offer advice on this organization process. Be sure to include an introductory paragraph that has the four main points presented in the lecture notes.
4. Write the essay. It will need to be at least 2 FULL pages long, maximum of 3 full pages long. You will use the MLA formatting that you used in previous essays from Units 3, 4, and 5.
5. Be sure to include a title for your essay.
6. After writing the essay, be sure to take time to read it several times for revision and editing. It would be helpful to have at least one other person proofread it as well before submitting the assignment.
Quiz2
# comments start with #
# to quit q()
# two steps to install any library
#install.packages("rattle")
#library(rattle)
setwd("D:/AJITH/CUMBERLANDS/Ph.D/SEMESTER 3/Data Science & Big Data Analy (ITS-836-51)/RStudio/Week2")
getwd()
x <- 3 # x is a vector of length 1
print(x)
v1 <- c(2,4,6,8,10)
print(v1)
print(v1[3])
v <- c(1:10) #creates a vector of 10 elements numbered 1 through 10. More complicated data
print(v)
print(v[6])
# Import test data
test<-read.csv("CVEs.csv")
test1<-read.csv("CVEs.csv", sep=",")
test2<-read.table("CVEs.csv", sep=",")
write.csv(test2, file="out.csv")
# Write CSV in R
write.table(test1, file = "out1.csv",row.names=TRUE, na="",col.names=TRUE, sep=",")
head(test)
tail(test)
summary(test)
head <- head(test)
tail <- tail(test)
cor(test$X, test$index)
sd(test$index)
var(test$index)
plot(test$index)
hist(test$index)
str(test$index)
quit()
Quiz3
setwd("C:/Users/ialsmadi/Desktop/University_of_Cumberlands/Lectures/Week2/RScripts")
getwd()
# Import test data
data<-read.csv("yearly_sales.csv")
#A 5-number summary is a set of 5 descriptive statistics for summarizing a continuous univariate data set.
#It consists o ...
The document discusses recent developments in the R programming environment for data analysis, including packages like magrittr, readr, tidyr, and dplyr that enable data wrangling workflows. It provides an overview of the key functions in these packages that allow users to load, reshape, manipulate, model, visualize, and report on data in a pipeline using the %>% operator.
This document provides examples of various plotting functions in R including plot(), boxplot(), hist(), pairs(), barplot(), densityplot(), dotplot(), histogram(), xyplot(), cloud(), and biplot/triplot. Functions are demonstrated using built-in datasets like iris and by plotting variables against each other to create scatter plots, histograms, and other visualizations.
At the Dublin Fashion Insights Centre, we are exploring methods of categorising the web into a set of known fashion related topics. This raises questions such as: How many fashion related topics are there? How closely are they related to each other, or to other non-fashion topics? Furthermore, what topic hierarchies exist in this landscape? Using Clojure and MLlib to harness the data available from crowd-sourced websites such as DMOZ (a categorisation of millions of websites) and Common Crawl (a monthly crawl of billions of websites), we are answering these questions to understand fashion in a quantitative manner.
The latest generation of big data tools such as Apache Spark routinely handle petabytes of data while also addressing real-world realities like node and network failures. Spark's transformations and operations on data sets are a natural fit with Clojure's everyday use of transformations and reductions. Spark MLlib's excellent implementations of distributed machine learning algorithms puts the power of large-scale analytics in the hands of Clojure developers. At Zalando's Dublin Fashion Insights Centre, we're using the Clojure bindings to Spark and MLlib to answer fashion-related questions that until recently have been nearly impossible to answer quantitatively.
Hunter Kelly @retnuh
tech.zalando.com
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Markus Eisele
We keep hearing that “integration” is old news, with modern architectures and platforms promising frictionless connectivity. So, is enterprise integration really dead? Not exactly! In this session, we’ll talk about how AI-infused applications and tool-calling agents are redefining the concept of integration, especially when combined with the power of Apache Camel.
We will discuss the the role of enterprise integration in an era where Large Language Models (LLMs) and agent-driven automation can interpret business needs, handle routing, and invoke Camel endpoints with minimal developer intervention. You will see how these AI-enabled systems help weave business data, applications, and services together giving us flexibility and freeing us from hardcoding boilerplate of integration flows.
You’ll walk away with:
An updated perspective on the future of “integration” in a world driven by AI, LLMs, and intelligent agents.
Real-world examples of how tool-calling functionality can transform Camel routes into dynamic, adaptive workflows.
Code examples how to merge AI capabilities with Apache Camel to deliver flexible, event-driven architectures at scale.
Roadmap strategies for integrating LLM-powered agents into your enterprise, orchestrating services that previously demanded complex, rigid solutions.
Join us to see why rumours of integration’s relevancy have been greatly exaggerated—and see first hand how Camel, powered by AI, is quietly reinventing how we connect the enterprise.
TrsLabs - Leverage the Power of UPI PaymentsTrs Labs
Revolutionize your Fintech growth with UPI Payments
"Riding the UPI strategy" refers to leveraging the Unified Payments Interface (UPI) to drive digital payments in India and beyond. This involves understanding UPI's features, benefits, and potential, and developing strategies to maximize its usage and impact. Essentially, it's about strategically utilizing UPI to promote digital payments, financial inclusion, and economic growth.
Viam product demo_ Deploying and scaling AI with hardware.pdfcamilalamoratta
Building AI-powered products that interact with the physical world often means navigating complex integration challenges, especially on resource-constrained devices.
You'll learn:
- How Viam's platform bridges the gap between AI, data, and physical devices
- A step-by-step walkthrough of computer vision running at the edge
- Practical approaches to common integration hurdles
- How teams are scaling hardware + software solutions together
Whether you're a developer, engineering manager, or product builder, this demo will show you a faster path to creating intelligent machines and systems.
Resources:
- Documentation: https://on.viam.com/docs
- Community: https://discord.com/invite/viam
- Hands-on: https://on.viam.com/codelabs
- Future Events: https://on.viam.com/updates-upcoming-events
- Request personalized demo: https://on.viam.com/request-demo
Generative Artificial Intelligence (GenAI) in BusinessDr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Raffi Khatchadourian
Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code—supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, imperative DL frameworks encouraging eager execution have emerged but at the expense of run-time performance. Though hybrid approaches aim for the “best of both worlds,” using them effectively requires subtle considerations to make code amenable to safe, accurate, and efficient graph execution—avoiding performance bottlenecks and semantically inequivalent results. We discuss the engineering aspects of a refactoring tool that automatically determines when it is safe and potentially advantageous to migrate imperative DL code to graph execution and vice-versa.
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptxMSP360
Data loss can be devastating — especially when you discover it while trying to recover. All too often, it happens due to mistakes in your backup strategy. Whether you work for an MSP or within an organization, your company is susceptible to common backup mistakes that leave data vulnerable, productivity in question, and compliance at risk.
Join 4-time Microsoft MVP Nick Cavalancia as he breaks down the top five backup mistakes businesses and MSPs make—and, more importantly, explains how to prevent them.
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungenpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web wird als die nächste Generation des HCL Notes-Clients gefeiert und bietet zahlreiche Vorteile, wie die Beseitigung des Bedarfs an Paketierung, Verteilung und Installation. Nomad Web-Client-Updates werden “automatisch” im Hintergrund installiert, was den administrativen Aufwand im Vergleich zu traditionellen HCL Notes-Clients erheblich reduziert. Allerdings stellt die Fehlerbehebung in Nomad Web im Vergleich zum Notes-Client einzigartige Herausforderungen dar.
Begleiten Sie Christoph und Marc, während sie demonstrieren, wie der Fehlerbehebungsprozess in HCL Nomad Web vereinfacht werden kann, um eine reibungslose und effiziente Benutzererfahrung zu gewährleisten.
In diesem Webinar werden wir effektive Strategien zur Diagnose und Lösung häufiger Probleme in HCL Nomad Web untersuchen, einschließlich
- Zugriff auf die Konsole
- Auffinden und Interpretieren von Protokolldateien
- Zugriff auf den Datenordner im Cache des Browsers (unter Verwendung von OPFS)
- Verständnis der Unterschiede zwischen Einzel- und Mehrbenutzerszenarien
- Nutzung der Client Clocking-Funktion
AI 3-in-1: Agents, RAG, and Local Models - Brent LasterAll Things Open
Presented at All Things Open RTP Meetup
Presented by Brent Laster - President & Lead Trainer, Tech Skills Transformations LLC
Talk Title: AI 3-in-1: Agents, RAG, and Local Models
Abstract:
Learning and understanding AI concepts is satisfying and rewarding, but the fun part is learning how to work with AI yourself. In this presentation, author, trainer, and experienced technologist Brent Laster will help you do both! We’ll explain why and how to run AI models locally, the basic ideas of agents and RAG, and show how to assemble a simple AI agent in Python that leverages RAG and uses a local model through Ollama.
No experience is needed on these technologies, although we do assume you do have a basic understanding of LLMs.
This will be a fast-paced, engaging mixture of presentations interspersed with code explanations and demos building up to the finished product – something you’ll be able to replicate yourself after the session!
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025BookNet Canada
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, transcript, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
Train Smarter, Not Harder – Let 3D Animation Lead the Way!
Discover how 3D animation makes inductions more engaging, effective, and cost-efficient.
Check out the slides to see how you can transform your safety training process!
Slide 1: Why 3D animation changes the game
Slide 2: Site-specific induction isn’t optional—it’s essential
Slide 3: Visitors are most at risk. Keep them safe
Slide 4: Videos beat text—especially when safety is on the line
Slide 5: TechEHS makes safety engaging and consistent
Slide 6: Better retention, lower costs, safer sites
Slide 7: Ready to elevate your induction process?
Can an animated video make a difference to your site's safety? Let's talk.
Canadian book publishing: Insights from the latest salary survey - Tech Forum...BookNet Canada
Join us for a presentation in partnership with the Association of Canadian Publishers (ACP) as they share results from the recently conducted Canadian Book Publishing Industry Salary Survey. This comprehensive survey provides key insights into average salaries across departments, roles, and demographic metrics. Members of ACP’s Diversity and Inclusion Committee will join us to unpack what the findings mean in the context of justice, equity, diversity, and inclusion in the industry.
Results of the 2024 Canadian Book Publishing Industry Salary Survey: https://publishers.ca/wp-content/uploads/2025/04/ACP_Salary_Survey_FINAL-2.pdf
Link to presentation recording and transcript: https://bnctechforum.ca/sessions/canadian-book-publishing-insights-from-the-latest-salary-survey/
Presented by BookNet Canada and the Association of Canadian Publishers on May 1, 2025 with support from the Department of Canadian Heritage.
Bepents tech services - a premier cybersecurity consulting firmBenard76
Introduction
Bepents Tech Services is a premier cybersecurity consulting firm dedicated to protecting digital infrastructure, data, and business continuity. We partner with organizations of all sizes to defend against today’s evolving cyber threats through expert testing, strategic advisory, and managed services.
🔎 Why You Need us
Cyberattacks are no longer a question of “if”—they are a question of “when.” Businesses of all sizes are under constant threat from ransomware, data breaches, phishing attacks, insider threats, and targeted exploits. While most companies focus on growth and operations, security is often overlooked—until it’s too late.
At Bepents Tech, we bridge that gap by being your trusted cybersecurity partner.
🚨 Real-World Threats. Real-Time Defense.
Sophisticated Attackers: Hackers now use advanced tools and techniques to evade detection. Off-the-shelf antivirus isn’t enough.
Human Error: Over 90% of breaches involve employee mistakes. We help build a "human firewall" through training and simulations.
Exposed APIs & Apps: Modern businesses rely heavily on web and mobile apps. We find hidden vulnerabilities before attackers do.
Cloud Misconfigurations: Cloud platforms like AWS and Azure are powerful but complex—and one misstep can expose your entire infrastructure.
💡 What Sets Us Apart
Hands-On Experts: Our team includes certified ethical hackers (OSCP, CEH), cloud architects, red teamers, and security engineers with real-world breach response experience.
Custom, Not Cookie-Cutter: We don’t offer generic solutions. Every engagement is tailored to your environment, risk profile, and industry.
End-to-End Support: From proactive testing to incident response, we support your full cybersecurity lifecycle.
Business-Aligned Security: We help you balance protection with performance—so security becomes a business enabler, not a roadblock.
📊 Risk is Expensive. Prevention is Profitable.
A single data breach costs businesses an average of $4.45 million (IBM, 2023).
Regulatory fines, loss of trust, downtime, and legal exposure can cripple your reputation.
Investing in cybersecurity isn’t just a technical decision—it’s a business strategy.
🔐 When You Choose Bepents Tech, You Get:
Peace of Mind – We monitor, detect, and respond before damage occurs.
Resilience – Your systems, apps, cloud, and team will be ready to withstand real attacks.
Confidence – You’ll meet compliance mandates and pass audits without stress.
Expert Guidance – Our team becomes an extension of yours, keeping you ahead of the threat curve.
Security isn’t a product. It’s a partnership.
Let Bepents tech be your shield in a world full of cyber threats.
🌍 Our Clientele
At Bepents Tech Services, we’ve earned the trust of organizations across industries by delivering high-impact cybersecurity, performance engineering, and strategic consulting. From regulatory bodies to tech startups, law firms, and global consultancies, we tailor our solutions to each client's unique needs.
2. WHY AM I GIVING THIS TALK?
I am in the final stages of writing Clojure for Data Science.
It will be published by http://packtpub.com later this year.
3. AM I QUALIFIED?
I co-founded and was CTO of a data analytics company.
I am a software engineer, not a statistician.
4. WHY IS DATA SCIENCE IMPORTANT?
The robots are coming!
The rise of the computational developer.
These trends influence the kinds of systems we are all
expected to build.
5. WHY CLOJURE?
Clojure lends itself to interactive exploration and learning.
It has fantastic data manipulating abstractions.
The JVM hosts many of the workhorse data storage and
processing frameworks.
6. WHAT I WILL COVER
Distributions
Statistics
Visualisation with Quil
Correlation
Simple linear regression
Multivariable linear regression with Incanter
Break
Categorical data
Bayes classification
Logistic regression with Apache Commons Math
Clustering with Parkour and Apache Mahout
7. FOLLOW ALONG
The book's GitHub is available at
http://github.com/clojuredatascience
ch1-introduction
ch2-statistical-inference
ch3-linear-regression
ch5-classification
ch6-clustering
10. IF YOU'RE FOLLOWING ALONG
git clone git@github.com:clojuredatascience/ch1-introduction.git
cd ch1-introduction
script/download-data.sh
lein run -e 1.1
14. …EXPLAINED
∑ is `(reduce + …)`.
∑ᵢ₌₁ⁿ is "for all xs".
(xᵢ − μₓ)² is a function of x and the mean of x.
(defn variance [xs]
(let [m (mean xs)
n (count xs)
square-error (fn [x]
(Math/pow (- x m) 2))]
(/ (reduce + (map square-error xs)) n)))
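The `variance` above calls a `mean` function defined earlier in the book. A minimal sketch (the one-line `mean` below is an assumption, not necessarily the book's version) makes the slide runnable at the REPL:

```clojure
;; `mean` is assumed from an earlier slide; a minimal definition:
(defn mean [xs]
  (/ (reduce + xs) (count xs)))

(defn variance [xs]
  (let [m (mean xs)
        n (count xs)
        square-error (fn [x]
                       (Math/pow (- x m) 2))]
    (/ (reduce + (map square-error xs)) n)))

;; The deviations of [1 2 3 4 5] from their mean 3, squared, are
;; [4 1 0 1 4], so the variance is 10/5:
(variance [1 2 3 4 5])
;; => 2.0
```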
17. POINCARÉ'S BREAD
Poincaré weighed his bread every day for a year.
He discovered that the weights of the bread followed a
normal distribution, but that the peak was at 950g, whereas
loaves of bread were supposed to be regulated at 1kg. He
reported his baker to the authorities.
The next year Poincaré continued to weigh his bread from
the same baker, who was now wary of giving him the lighter
loaves. After a year the mean loaf weight was 1kg, but this
time the distribution had a positive skew. This is consistent
with the baker giving Poincaré only the heaviest of his loaves.
The baker was reported to the authorities again.
34. CREDIT
A paper in the Proceedings of the National Academy of Sciences titled
"Statistical Detection of Election Irregularities", by a team led by
Santa Fe Institute External Professor Stefan Thurner.
38. SAMPLING SIZE
The values converge as the sample size increases.
We can often only infer the population parameters.
            Sample   Population
Size        n        N
Mean        x̄        μX
Std. dev.   SX       σX
49. SMALL SAMPLES
The standard error is calculated from the population
standard deviation, but we don't know it!
In practice they're assumed to be the same above around 30
samples, but there is another distribution that models the
loss of precision with small samples.
53. WHY THIS INTEREST IN MEANS?
Because often when we want to know if a difference in
populations is statistically significant, we'll compare the
means.
54. HYPOTHESIS TESTING
By convention the data is assumed not to support what the
researcher is looking for.
This conservative assumption is called the null hypothesis and
denoted h₀.
The alternate hypothesis, h₁, can then only be supported with a given
confidence interval.
55. SIGNIFICANCE
The greater the significance of a result, the more certainty we
have that the null hypothesis can be rejected.
Let's use our range controller to adjust the significance
threshold.
59. POPULATION OF OLYMPIC SWIMMERS
The Guardian has helpfully provided data on the vital
statistics of Olympians
http://www.theguardian.com/sport/datablog/2012/aug/07/olympics-2012-athletes-age-weight-height#data
62. LOG-NORMAL DISTRIBUTION
"A variable might be modeled as log-normal if it can be
thought of as the multiplicative product of many independent
random variables, each of which is positive. This is justified by
considering the central limit theorem in the log-domain."
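A quick sketch (not from the book) of why the central-limit-theorem argument works: the log of a product of positive values is a sum of logs, so a product of many independent positive variables behaves normally on the log scale:

```clojure
;; Sketch: log(x1 * x2 * ...) = log x1 + log x2 + ..., and sums of
;; independent terms tend to the normal distribution, so the product
;; itself is approximately log-normal.
(defn positive-sample []
  (+ 0.5 (rand)))   ; uniform on [0.5, 1.5), always positive

(let [xs (repeatedly 20 positive-sample)]
  [(Math/log (reduce * xs))             ; log of the product...
   (reduce + (map #(Math/log %) xs))])  ; ...equals the sum of the logs
```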
66. CORRELATION
A few ways of measuring it, depending on whether your data
is continuous or discrete
http://xkcd.com/552/
67. PEARSON'S CORRELATION
Covariance divided by the product of standard deviations. It
measures linear correlation.
ρX,Y = COV(X,Y) / (σX σY)
(defn pearsons-correlation [x y]
(/ (covariance x y)
(* (standard-deviation x)
(standard-deviation y))))
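`covariance` and `standard-deviation` are defined earlier in the book; minimal sketches of them (assumptions, included only so the example runs standalone) let us check that a perfectly linear relationship scores 1:

```clojure
;; Sketched helpers, assumed from earlier in the book:
(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn covariance [x y]
  (let [mx (mean x) my (mean y)]
    (/ (reduce + (map #(* (- %1 mx) (- %2 my)) x y))
       (count x))))

(defn standard-deviation [xs]
  (let [m (mean xs)]
    (Math/sqrt (/ (reduce + (map #(Math/pow (- % m) 2) xs))
                  (count xs)))))

(defn pearsons-correlation [x y]
  (/ (covariance x y)
     (* (standard-deviation x)
        (standard-deviation y))))

;; y is exactly 2x, so the linear correlation is 1 (up to
;; floating-point rounding):
(pearsons-correlation [1 2 3 4] [2 4 6 8])
;; ≈ 1.0
```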
68. PEARSON'S CORRELATION
If r is 0, it doesn’t necessarily mean that the variables are not
correlated. Pearson’s correlation only measures linear relationships.
69. THIS IS A STATISTIC
The unknown population parameter for correlation is the Greek letter
ρ. We are only able to calculate the sample statistic r.
How far we can trust r as an estimate of ρ will depend on two factors:
the size of the coefficient
the size of the sample
rX,Y = COV(X,Y) / (sX sY)
71. SIMPLE LINEAR REGRESSION
(defn slope [x y]
(/ (covariance x y)
(variance x)))
(defn intercept [x y]
(- (mean y)
(* (mean x)
(slope x y))))
(defn predict [a b x]
(+ a (* b x)))
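These three functions lean on `mean`, `variance` and `covariance` from earlier slides; with minimal sketches of those helpers (assumed definitions, not necessarily the book's exact code), points lying exactly on y = 1 + 2x recover the coefficients:

```clojure
;; Sketched helpers, assumed from earlier slides:
(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn variance [xs]
  (let [m (mean xs)]
    (/ (reduce + (map #(Math/pow (- % m) 2) xs)) (count xs))))

(defn covariance [x y]
  (let [mx (mean x) my (mean y)]
    (/ (reduce + (map #(* (- %1 mx) (- %2 my)) x y)) (count x))))

(defn slope [x y] (/ (covariance x y) (variance x)))
(defn intercept [x y] (- (mean y) (* (mean x) (slope x y))))
(defn predict [a b x] (+ a (* b x)))

;; The points (1,3) (2,5) (3,7) lie on y = 1 + 2x:
(slope [1 2 3] [3 5 7])      ;; ≈ 2.0
(intercept [1 2 3] [3 5 7])  ;; ≈ 1.0
(predict 1 2 4)              ;; => 9, since a + bx = 1 + 2 * 4
```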
72. TRAINING A MODEL
(defn swimmer-data []
(->> (athlete-data)
($where {"Height, cm" {:$ne nil} "Weight" {:$ne nil}
"Sport" {:$eq "Swimming"}})))
(defn ex-3-12 []
(let [data (swimmer-data)
heights ($ "Height, cm" data)
weights (log ($ "Weight" data))
a (intercept heights weights)
b (slope heights weights)]
(println "Intercept: " a)
(println "Slope: " b)))
73. MAKING A PREDICTION
(predict 1.691 0.0143 185)
;; => 4.3365
(i/exp (predict 1.691 0.0143 185))
;; => 76.44
Corresponding to a predicted weight of 76.4kg
In 1979, Mark Spitz was 79kg.
http://www.topendsports.com/sport/swimming/profiles/spitz-mark.htm
74. MORE DATA!
(defn features [dataset col-names]
(->> (i/$ col-names dataset)
(i/to-matrix)))
(defn gender-dummy [gender]
(if (= gender "F")
0.0 1.0))
(defn ex-3-26 []
(let [data (->> (swimmer-data)
(i/add-derived-column "Gender Dummy"
["Sex"] gender-dummy))
x (features data ["Height, cm" "Age" "Gender Dummy"])
y (i/log ($ "Weight" data))
model (s/linear-model y x)]
(:coefs model)))
;; => [2.2307529431422637 0.010714697827121089 0.002372188749408574 0.0975412532492026]
86. STANDARD ERROR FOR A PROPORTION
SE = √(p(1 − p) / n)
(defn standard-error-proportion [p n]
(-> (- 1 p)
(* p)
(/ n)
(Math/sqrt)))
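A quick sanity check of the formula (the function is repeated from the slide so the snippet runs standalone): with p = 0.5 and n = 100, SE = √(0.25 / 100):

```clojure
;; Repeated from the slide above:
(defn standard-error-proportion [p n]
  (-> (- 1 p)
      (* p)
      (/ n)
      (Math/sqrt)))

;; √(0.5 * 0.5 / 100) = √0.0025:
(standard-error-proportion 0.5 100)
;; ≈ 0.05
```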
p = (161 + 339) / (682 + 127) = 500 / 809 = 0.61
SE = 0.013
87. HOW SIGNIFICANT?
z = (p₁ − p₂) / SE
p₁: the proportion of women who survived = 339 / 446 = 0.76
p₂: the proportion of men who survived = 161 / 843 = 0.19
SE: 0.013
z = 20.36
This is essentially impossible.
94. BAYES CLASSIFICATION
P(survive|third, male) = P(survive) P(third|survive) P(male|survive) / P(third, male)
P(perish|third, male) = P(perish) P(third|perish) P(male|perish) / P(third, male)
Because the evidence is the same for all classes, we can
cancel this out.
95. PARSE THE DATA
(titanic-samples)
;; => ({:survived true, :gender :female, :class :first, :embarked "S", :age "20-30"}
;;     {:survived true, :gender :male, :class :first, :embarked "S", :age "30-40"} ...)
96. IMPLEMENTING A NAIVE BAYES MODEL
(defn safe-inc [v]
(inc (or v 0)))
(defn inc-class-total [model class]
(update-in model [class :total] safe-inc))
(defn inc-predictors-count-fn [row class]
(fn [model attr]
(let [val (get row attr)]
(update-in model [class attr val] safe-inc))))
99. MAKING PREDICTIONS
(defn n [model]
(->> (vals model)
(map :total)
(apply +)))
(defn conditional-probability [model test class]
(let [evidence (get model class)
prior (/ (:total evidence)
(n model))]
(apply * prior
(for [kv test]
(/ (get-in evidence kv)
(:total evidence))))))
(defn bayes-classify [model test]
(let [probs (map (fn [class]
[class (conditional-probability model test class)])
(keys model))]
(-> (sort-by second > probs)
(ffirst))))
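To make the model shape concrete, here are the slide's functions applied to a tiny hand-built model (the functions are repeated so the example runs standalone; the counts are invented for illustration):

```clojure
;; Repeated from the slide above:
(defn n [model]
  (->> (vals model)
       (map :total)
       (apply +)))

(defn conditional-probability [model test class]
  (let [evidence (get model class)
        prior (/ (:total evidence)
                 (n model))]
    (apply * prior
           (for [kv test]
             (/ (get-in evidence kv)
                (:total evidence))))))

(defn bayes-classify [model test]
  (let [probs (map (fn [class]
                     [class (conditional-probability model test class)])
                   (keys model))]
    (-> (sort-by second > probs)
        (ffirst))))

;; The model maps class -> {:total n, attribute -> {value count}}.
;; These counts are made up for illustration:
(def tiny-model
  {true  {:total 4, :gender {:male 1, :female 3}}
   false {:total 4, :gender {:male 3, :female 1}}})

(bayes-classify tiny-model {:gender :female})
;; => true
```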
100. DOES IT WORK?
(defn ex-5-7 []
(let [data (titanic-samples)
model (naive-bayes data :survived [:gender :class])]
(bayes-classify model {:gender :male :class :third})))
;; => false
(defn ex-5-8 []
(let [data (titanic-samples)
model (naive-bayes data :survived [:gender :class])]
(bayes-classify model {:gender :female :class :first})))
;; => true
101. WHY NAIVE?
Because it assumes all variables are independent. We know they are not
(being male and travelling in third class are correlated, for
example), but Naive Bayes weights all attributes equally.
In practice it works surprisingly well, particularly where there
are large numbers of features.
103. LOGISTIC REGRESSION
Logistic regression uses similar techniques to linear
regression but guarantees an output only between 0 and 1.
hθ(x) = θᵀx
hθ(x) = g(θᵀx)
Where the sigmoid function is
g(z) = 1 / (1 + e⁻ᶻ)
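A direct Clojure translation of the sigmoid (a sketch, not necessarily the book's implementation) shows the squashing behaviour:

```clojure
;; g(z) = 1 / (1 + e^-z), which maps any real z into (0, 1):
(defn sigmoid [z]
  (/ 1 (+ 1 (Math/exp (- z)))))

(sigmoid 0)    ;; => 0.5 (the midpoint)
(sigmoid 10)   ;; ≈ 1.0
(sigmoid -10)  ;; ≈ 0.0
```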
109. CALCULATING THE GRADIENT
(defn gradient-fn [h-theta xs ys]
(let [g (fn [x y]
(matrix/mmul (- (h-theta x) y) x))]
(->> (map g xs ys)
(matrix/transpose)
(map avg))))
We transpose to calculate the average for each feature
across all xs rather than average for each x across all
features.
115. PRODUCING A MODEL
(defn ex-5-11 []
(let [data (titanic-features)
initial-guess (-> data first count (take (repeatedly rand)))]
(run-logistic-regression data initial-guess)))
119. CLUSTERING
Find a grouping of a set of objects such that objects in the
same group are more similar to each other than to those in
other groups.
120. SIMILARITY MEASURES
Many to choose from: Jaccard, Euclidean.
For text documents the Cosine measure is often chosen.
Good for high-dimensional spaces
In positive spaces the similarity is between 0 and 1.
121. COSINE SIMILARITY
cos(θ) = (A ⋅ B) / (∥A∥ ∥B∥)
(defn cosine [a b]
(let [dot-product (->> (map * a b)
(apply +))
magnitude (fn [d]
(->> (map #(Math/pow % 2) d)
(apply +)
Math/sqrt))]
(/ dot-product
(* (magnitude a) (magnitude b)))))
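Repeating the function so the example runs standalone: orthogonal vectors score 0, and vectors pointing the same way score (approximately) 1:

```clojure
;; Repeated from the slide above:
(defn cosine [a b]
  (let [dot-product (->> (map * a b)
                         (apply +))
        magnitude (fn [d]
                    (->> (map #(Math/pow % 2) d)
                         (apply +)
                         Math/sqrt))]
    (/ dot-product
       (* (magnitude a) (magnitude b)))))

(cosine [1 0] [0 1])  ;; => 0.0  (orthogonal)
(cosine [1 2] [2 4])  ;; ≈ 1.0  (same direction)
```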
125. WHY?
(cosine-sparse
(->> "music is the food of love"
stemmer/stems
(document-vector dictionary))
(->> "war is the locomotive of history"
stemmer/stems
(document-vector dictionary)))
;; => 0.0
(cosine-sparse
(->> "music is the food of love"
stemmer/stems
(document-vector dictionary))
(->> "it's lovely that you're musical" stemmer/stems
(document-vector dictionary)))
;; => 0.8164965809277259
128. GET THE DATA
We're going to be clustering the Reuters dataset.
Follow the readme instructions:
brew install mahout
script/download-reuters.sh
lein run -e 6.7
mahout seqdirectory -i data/reuters-txt -o data/reuters-sequencefile
129. VECTOR REPRESENTATION
Each document is converted into a vector representation.
All vectors share a dictionary providing a unique index for
each word.
133. WE NEED A UNIQUE ID
And we need to compute it in parallel.
134. PARKOUR MAPPING
(require '[clojure.core.reducers :as r]
'[parkour.mapreduce :as mr])
(defn document->terms [doc]
  (clojure.string/split doc #"\W+"))
(defn document-count-m
"Emits the unique words from each document"
{::mr/source-as :vals}
[documents]
(->> documents
(r/mapcat (comp distinct document->terms))
(r/map #(vector % 1))))
135. SHAPE METADATA
:keyvals ;; Re-shape as vectors of key-vals pairs.
:keys ;; Just the keys from each key-value pair.
:vals ;; Just the values from each key-value pair.
136. PLAIN OLD FUNCTIONS
(->> (document-count-m ["it's lovely that you're musical"
"music is the food of love"
"war is the locomotive of history"])
(into []))
;; => [["love" 1] ["music" 1] ["music" 1] ["food" 1] ["love" 1] ["war" 1] ["locomot" 1] ["histori" 1]]
148. WHAT DID I LEAVE OUT?
Cluster quality measures
Spectral and LDA clustering
Collaborative filtering with Mahout
Random forests
Spark for movie recommendations with Sparkling
Graph data with Loom and Datomic
MapReduce with Cascalog and PigPen
Adapting algorithms for massive scale
Time series and forecasting
Dimensionality reduction, feature selection
More visualisation techniques
Lots more…
149. BOOK
Clojure for Data Science will be available in the second half
of the year from http://packtpub.com.
http://cljds.com