The Shape of Data: Geometry-Based Machine Learning and Data Analysis in R
Ebook, 488 pages, 4 hours


About this ebook

This advanced machine learning book highlights many algorithms from a geometric perspective and introduces tools in network science, metric geometry, and topological data analysis through practical application.

Whether you’re a mathematician, seasoned data scientist, or marketing professional, you’ll find The Shape of Data to be the perfect introduction to the critical interplay between the geometry of data structures and machine learning.

This book’s extensive collection of case studies (drawn from medicine, education, sociology, linguistics, and more) and gentle explanations of the math behind dozens of algorithms provide a comprehensive yet accessible look at how geometry shapes the algorithms that drive data analysis.

In addition to gaining a deeper understanding of how to implement geometry-based algorithms with code, you’ll explore:

  • Supervised and unsupervised learning algorithms and their application to network data analysis
  • The way distance metrics and dimensionality reduction impact machine learning
  • How to visualize, embed, and analyze survey and text data with topology-based algorithms
  • New approaches to computational solutions, including distributed computing and quantum algorithms
Language: English
Publisher: No Starch Press
Release date: Sep 12, 2023
ISBN: 9781718503090

    Book preview

    The Shape of Data - Colleen M. Farrelly

    PRAISE FOR

    The Shape of Data

    "The title says it all. Data is bound by many complex relationships not easily shown in our two-dimensional, spreadsheet-filled world. The Shape of Data walks you through this richer view and illustrates how to put it into practice."

    —Stephanie Thompson, data scientist and speaker

    "The Shape of Data is a novel perspective and phenomenal achievement in the application of geometry to the field of machine learning. It is expansive in scope and contains loads of concrete examples and coding tips for practical implementations, as well as extremely lucid, concise writing to unpack the concepts. Even as a more veteran data scientist who has been in the industry for years now, having read this book I’ve come away with a deeper connection to and new understanding of my field."

    —Kurt Schuepfer, PhD, McDonald’s Corporation

    "The Shape of Data is a great source for the application of topology and geometry in data science. Topology and geometry advance the field of machine learning on unstructured data, and The Shape of Data does a great job introducing new readers to the subject."

    —Uchenna Ike Chukwu, senior quantum developer

    The Shape of Data

    Geometry-Based Machine Learning and Data Analysis in R

    by Colleen M. Farrelly and Yaé Ulrich Gaba

    THE SHAPE OF DATA. Copyright © 2023 by Colleen M. Farrelly and Yaé Ulrich Gaba.

    All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

    First printing

    27 26 25 24 23 1 2 3 4 5

    ISBN-13: 978-1-7185-0308-3 (print)

    ISBN-13: 978-1-7185-0309-0 (ebook)

    Publisher: William Pollock

    Managing Editor: Jill Franklin

    Production Manager: Sabrina Plomitallo-González

    Production Editor: Sydney Cromwell

    Developmental Editor: Alex Freed

    Cover Illustrator: Gina Redman

    Interior Design: Octopod Studios

    Technical Reviewer: Franck Kalala Mutombo

    Copyeditor: Kim Wimpsett

    Compositor: Jeff Lytle, Happenstance Type-O-Rama

    Proofreader: Scout Festa

    Indexer: BIM Creatives, LLC

    For information on distribution, bulk sales, corporate sales, or translations, please contact No Starch Press ® directly at [email protected] or:

    No Starch Press, Inc.

    245 8th Street, San Francisco, CA 94103

    phone: 1.415.863.9900

    www.nostarch.com

    Library of Congress Cataloging-in-Publication Data

    Names: Farrelly, Colleen, author. | Gaba, Yaé Ulrich, author.

    Title: The shape of data : network science, geometry-based machine learning, and topological data

       analysis in R / by Colleen M. Farrelly and Yaé Ulrich Gaba.

    Description: San Francisco, CA : No Starch Press, [2023] | Includes bibliographical references.

    Identifiers: LCCN 2022059967 (print) | LCCN 2022059968 (ebook) | ISBN 9781718503083 (paperback) |

       ISBN 9781718503090 (ebook)

    Subjects: LCSH: Geometric programming. | Topology. | Machine learning. | System analysis--Data

       processing. | R (Computer program language)

    Classification: LCC T57.825 .F37 2023 (print) | LCC T57.825 (ebook) | DDC 006.3/1--dc23/

       eng/20230301

    LC record available at https://lccn.loc.gov/2022059967

    LC ebook record available at https://lccn.loc.gov/2022059968

    No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

    The information in this book is distributed on an As Is basis, without warranty. While every precaution has been taken in the preparation of this work, neither the authors nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.

    To my grandmother Irene Borree, who enjoyed our discussions about new technologies into her late nineties.

    —Colleen M. Farrelly

    To God Almighty, Yeshua Hamashiach, the Key to all treasures of wisdom and knowledge and the Waymaker.

    To my beloved wife, Owolabi. Thank you for believing.

    To my parents, Prudence and Gilberte, and my siblings, Olayèmi, Boladé, and Olabissi.

    To Jeff Sanders, the man I would call my academic father.

    —Yaé Ulrich Gaba

    About the Authors

    Colleen M. Farrelly is a senior data scientist whose academic and industry research has focused on topological data analysis, quantum machine learning, geometry-based machine learning, network science, hierarchical modeling, and natural language processing. Since graduating from the University of Miami with an MS in biostatistics, Colleen has worked as a data scientist in a variety of industries, including healthcare, consumer packaged goods, biotech, nuclear engineering, marketing, and education. Colleen often speaks at tech conferences, including PyData, SAS Global, WiDS, Data Science Africa, and DataScience SALON. When not working, Colleen can be found writing haibun/haiga or swimming.

    Yaé Ulrich Gaba completed his doctoral studies at the University of Cape Town (UCT, South Africa) with a specialization in topology and is currently a research associate at Quantum Leap Africa (QLA, Rwanda). His research interests are computational geometry, applied algebraic topology (topological data analysis), and geometric machine learning (graph and point-cloud representation learning). His current focus lies in geometric methods in data analysis, and his work seeks to develop effective and theoretically justified algorithms for data and shape analysis using geometric and topological ideas and methods.

    About the Technical Reviewer

    Franck Kalala Mutombo is a professor of mathematics at Lubumbashi University and the former academic director of AIMS Senegal. He previously worked in a research position at University of Strathclyde and at AIMS South Africa in a joint appointment with the University of Cape Town. He holds a PhD in mathematical sciences from the University of Strathclyde, Glasgow, Scotland. He is an expert in the study and analysis of complex network structure and applications. His most recent research considers the impact of network structure on long-range interactions applied to epidemics, diffusion, and object clustering. His other research interests include differential geometry of manifolds, finite element methods for partial differential equations, and data science.

    Foreword

    The title of Colleen M. Farrelly and Yaé Ulrich Gaba’s book, The Shape of Data, is as fitting and beautiful as the journey that the authors invite us to experience, as we discover the geometric shapes that paint the deeper meaning of our analytical data insights.

    Enabling and combining common machine learning, data science, and statistical solutions, including the combinations of supervised/unsupervised or deep learning methods, by leveraging topological and geometric data analysis provides new insights into the underlying data problem. It reminds us of our responsibilities as data scientists, that with any algorithmic approach a certain data bias can greatly skew our expected results. As an example, the data scientist needs to understand the underlying data context well to avoid performing a two-dimensional Euclidean-based distance analysis when the underlying data needs to account for three-dimensional nuances, such as what a routing analysis would require when traveling the globe.

    Throughout the book’s mathematical data analytics tour, we encounter the origin of data analysis on structured data and the many seemingly unstructured data scenarios that can be turned into structured data, which enables standard machine learning algorithms to perform predictive and prescriptive analytical insights. As we ride through the valleys and peaks of our data, we learn to collect features along the way that become key inputs into other data layers, forming geometrical interpretations of varying unstructured data sources including network data, images, and text-based data. In addition, Farrelly and Gaba are masterful in detailing the foundational and advanced concepts supported by the well-defined examples in both R and Python, available for download from their book’s web page.

    Throughout my opportunities to collaborate with Farrelly and Gaba on several exciting projects over the past years, I always hoped for a book to emerge that would explain as clearly and eloquently as The Shape of Data does the evolution of the topological data analysis space all the way to leveraging distributed and quantum computing solutions.

    During my days as a CTO at Cypher Genomics, Farrelly was leading our initiatives in genomic data analytics. She immediately inspired me with her keen understanding of how best to establish correlations between disease ontologies versus symptom ontologies, while also using simulations to understand the implications of missing links in the map. Farrelly’s pragmatic approach helped us successfully resolve critical issues by creating an algorithm that mapped across gene, symptom, and disease ontologies in order to predict disease from gene or symptom data. Her focus on topology-based network mining for diagnostics helped us define the underlying data network shape, properties, and link distributions using graph summaries and statistical testing. Our combined efforts around ontology mapping, graph-based prediction, and network mining and decomposition resulted in critical data network discoveries related to metabolomics, proteomics, gene regulatory networks, patient similarity networks, and variable correlation networks.

    From our joint genomics and related life sciences analytics days to our most recent quantum computing initiatives, Farrelly and Gaba have consistently demonstrated a strong passion and unique understanding of all the related complexities and how to apply their insights to several everyday problems. Joining them on their shape of data journey will be valuable time spent as you embark on a well-scripted adventure of R and Python algorithms that solve general or niche problems in machine learning and data analysis using geometric patterns to help shape the desired results.

    This book will be relevant and captivating to beginners and devoted experts alike. First-time travelers will find it easy to dive into algorithm examples designed for analyzing network data, including social and geographic networks, as well as local and global metrics, to understand network structure and the role of individuals in the network. The discussion covers clustering methods developed for use on network data, link prediction algorithms to suggest new edges in a network, and tools for understanding how, for example, processes or epidemics spread through networks.

    Advanced readers will find it intriguing to dive into recently developing topics such as replacing linear algebra with nonlinear algebra in machine learning algorithms and exterior calculus to quantify needs in disaster planning. The Shape of Data has made me want to roll up my sleeves and dive into many new challenges, because I feel as well equipped as Lara Croft in Tomb Raider thanks to Farrelly’s tremendous treasure map and deeply insightful exploration work. Could there be a hidden bond or hidden layer between them?

    Michael Giske

    Technology executive, global CIO of B-ON, and chairman of Inomo Technologies

    Acknowledgments

    I, Colleen, would like to thank my parents, John and Nancy, and my grandmother Irene for their support while I was writing this book and for encouraging me to play with mathematics when I was young.

    I would also like to thank Justin Moeller for the sports and art conversations that led me into data science as a career, as well as his and Christy Moeller’s support over the long course of writing this book, and Ross Eggebeen, Mark Mayor, Matt Mayor, and Malori Mayor for their ongoing support with this project and other writing endeavors over the years.

    My career in this field and this book would not have been possible without the support of Cynthia DeJong, John Pustejovsky, Kathleen Karrer, Dan Feaster, Willie Prado, Richard Schoen, and Ken Baker during my educational years, particularly my transition from the medical/social sciences to mathematics during medical school. I’m grateful for the support of Jay Wigdale and Michael Giske over the course of my career, as well as the support from many friends and colleagues, including Peter Wittek, Diana Kachan, Recinda Sherman, Natashia Lewis, Louis Fendji, Luke Robinson, Joseph Fustero, Uchenna Chukwu, Jay and Jenny Rooney, and Christine and Junwen Lin.

    This book would not have been possible without our editor, Alex Freed; our managing editor, Jill Franklin; and our technical reviewer, Franck Kalala Mutombo. We both would also like to acknowledge the contributions of Bastian Rieck and Noah Giansiracusa. We are grateful for the support of No Starch Press’s marketing team, particularly Briana Blackwell in publicizing our speaking engagements.

    We are also grateful to R, which provided open source packages and graphics generated with code, as well as Microsoft PowerPoint, which was used with permission to generate the additional images in this book. We would also like to thank NightCafe for providing a platform to generate images and granting full rights to creators.

    No achievement in life is without the help of many known and unknown individuals. I, Yaé, would like to thank just a few who made this work possible.

    To my wife, Owolabi, for your unwavering support. To Colleen Farrelly, for initiating this venture and taking me along. To Franck Kalala, my senior colleague and friend, for his excellent reviewer skills.

    To my friends and colleagues: Collins A. Agyingi, David S. Attipoe, Rock S. Koffi, Evans D. Ocansey, Michael Kateregga, Mamana Mbiyavanga, Jordan F. Masakuna and Gershom Buri for the care. To Jan Groenewald and the entire AIMS-NEI family.

    To my spiritual fathers and mentors, Pst Dieudonné Kantu and the entire SONRISE family, Pst Daniel Mukanya, and Pst Magloire N. Kunantu, whose leadership inspired my own.

    Introduction

    The first time I, Colleen, confronted my own hesitancy with math was when geometry provided a solution to an art class problem I faced: translating a flat painting onto a curved vase. Straight lines from my friend’s canvas didn’t behave the same way on the curved vase. Distances between points on the painting grew or shrank with the curvature. We’d stumbled upon the differences between the geometry we’d learned in class (where geometry behaved like the canvas painting) and the geometry of real-world objects like the vase. Real-world data often behaves more like the vase than the canvas painting. As an industry data scientist, I’ve worked with many non-data-science professionals who want to learn new data science methods but either haven’t encountered a lot of math or coding in their career path or have a lingering fear of math from prior educational experiences. Math-heavy papers without coding examples often limit the toolsets other professionals can use to solve important problems in their own fields.

    Math is simply another language with which to understand the world around us; like any language, it’s possible to learn. This book is focused on geometry, but it is not a math textbook. We avoid proofs, rarely use equations, and try to simplify the math behind the algorithms as much as possible to make these tools accessible to a wider audience. If you are more mathematically advanced and want the full mathematical theory, we provide references at the end of the book.

    Geometry underlies every single machine learning algorithm and problem setup, and thousands of geometry-based algorithms exist today. This book focuses on a few dozen algorithms in use now, with preference given to those with packages to implement them in R. If you want to understand how geometry relates to algorithms, how to implement geometry-based algorithms with code, or how to think about problems you encounter through the lens of geometry, keep reading.
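
    As a small illustration of how geometry shapes an algorithm's answer, the following base R sketch finds the nearest neighbor of a query point under two different distance metrics. The points and the helper name nearest_to_query are made up for this example and do not come from the book.

```r
# Three hypothetical 2D points: a query point and two candidate neighbors.
pts <- rbind(query = c(0, 0), a = c(3, 3), b = c(0, 5))

# Return the name of the candidate closest to the query under a given metric.
nearest_to_query <- function(method) {
  d <- as.matrix(dist(pts, method = method))["query", c("a", "b")]
  names(which.min(d))
}

nearest_to_query("euclidean")  # "a": sqrt(18) is about 4.24, less than 5
nearest_to_query("manhattan")  # "b": 5 is less than 3 + 3 = 6
```

    The data are identical in both calls; only the notion of distance changes, and with it the answer a nearest-neighbor method would give.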

    Who Is This Book For?

    Though this book is for anyone anywhere who wants a hands-on guide to network science, geometry-based aspects of machine learning, and topology-based algorithms, some background in statistics, machine learning, and a programming language (R or Python, ideally) will be helpful. This book was designed for the following:

    Healthcare professionals working with small sets of patient data

    Math students looking for an applied side of what they’re learning

    Small-business owners who want to use their data to drive sales

    Physicists or chemists interested in using topological data analysis for a research project

    Curious sociologists who are wary of proof-based texts

    Statisticians or data scientists looking to beef up their toolsets

    Educators looking for practical examples to show their students

    Engineers branching out into machine learning

    We’ll be surveying many areas of science and business in our examples and will cover dozens of algorithms shaping data science today. Each chapter will focus on the intuition behind the algorithms discussed and will provide examples of how to use those algorithms to solve a problem using the R programming language. While the book is written with examples presented in R, our downloadable repository (https://nostarch.com/download/ShapeofData_PythonCode.zip) includes R code and, for examples where Python has an analogous function, Python code as well, to support users of both languages. Feel free to skip around to sections most relevant to your interests.

    About This Book

    This book starts with an introduction to geometry in machine learning. Topics relevant to geometry-based algorithms are built through a series of network science chapters that transition into metric geometry, geometry- and topology-based algorithms, and some newer implementations of these algorithms in natural language processing, distributed computing, and quantum computing. Here’s a quick overview of the chapters in this book:

    Chapter 1: The Geometric Structure of Data Details how machine learning algorithms can be examined from a geometric perspective with examples from medical and image data

    Chapter 2: The Geometric Structure of Networks Introduces network data metrics, structure, and types through examples of social networks

    Chapter 3: Network Analysis Introduces supervised and unsupervised learning on network data, network-based clustering algorithms, comparisons of different networks, and disease spread across networks

    Chapter 4: Network Filtration Moves from network data to simplicial complex data, extends network metrics to higher-dimensional interactions, and introduces hole-counting in objects like networks

    Chapter 5: Geometry in Data Science Provides an overview on the curse of dimensionality, the role of distance metrics in machine learning, dimensionality reduction and data visualization, and applications to time series and probability distributions

    Chapter 6: Newer Applications of Geometry in Machine Learning Details several geometry-based algorithms, including supervised learning in educational data, geometry-based disaster planning, and activity preference ranking

    Chapter 7: Tools for Topological Data Analysis Focuses on topology-based unsupervised learning algorithms and their application to student data

    Chapter 8: Homotopy Algorithms Introduces an algorithm related to path planning and small data analysis

    Chapter 9: Final Project: Analyzing Text Data Focuses on a text dataset, a deep learning algorithm used in text embedding, and analytics of processed text data through algorithms from previous chapters

    Chapter 10: Multicore and Quantum Computing Dives into distributed computing solutions and quantum algorithms, including a quantum network science example and a quantum image analytics algorithm

    Downloading and Installing R

    We’ll be using the R programming language in this book. R is easy to install and compatible with macOS, Linux, and Windows operating systems. You can choose the download for your system at https://cloud.r-project.org. You might be prompted to click a link for your geographic location (or a general cloud connection option). If you haven’t installed R before, you can choose the first-time installation of the base, which is the first download option on the R for Windows page.

    Once you click the first-time option, you should see a screen that will give you an option to download R for Windows.

    After R downloads, you’ll follow the installation instructions that your system provides as a prompt. This will vary slightly depending on the operating system. However, the installation guide will take you through the steps needed to set up R.

    You may want to publish your projects or connect R with other open source projects, such as Python. RStudio provides a comfortable interface with options to connect R more easily with other platforms. You can find RStudio’s download at https://www.rstudio.com. Once you download RStudio, simply follow your operating system’s command prompts to install with the configurations that work best for your use case.

    Installing R Packages

    R has several options for installing new packages on your system. The command line option is probably the easiest. You’ll use the install.packages("package_name") function, where package_name is the name of the package you want to install, such as install.packages("mboost") to install the mboost package. From there, you may be asked to choose your geographic location for the download. The package will then download, along with any package dependencies that are not already on your machine.
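
    One way to script the command line option is to install a package only when it is missing. The sketch below is an illustration rather than the book's own code: the helper name install_if_missing is hypothetical, and the mirror URL is an assumption that can be swapped for any CRAN mirror.

```r
# Hypothetical helper: install any packages in `pkgs` that are not yet
# available, and invisibly return the names of the ones that were missing.
install_if_missing <- function(pkgs, repos = "https://cloud.r-project.org") {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!is_installed]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs, repos = repos)
  }
  invisible(missing_pkgs)
}

# stats and utils ship with R, so this call installs nothing.
install_if_missing(c("stats", "utils"))

# Uncomment to download mboost from CRAN if it is not already installed.
# install_if_missing("mboost")
```

    Passing repos explicitly avoids the interactive prompt for a download location, which is convenient in scripts.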

    You can also use your graphical user interface (GUI) to install a package. This might be preferable if you want to browse available packages rather than install just one specific package to meet your needs. You can select Install package(s) from the Packages menu option after you launch R on your machine.

    You’ll be prompted to select your location, and the installation will happen as it would with the command line option for package installation.

    Getting Help with R

    R has many useful features if you need help with a function or a package in your code. The help() function allows you to get information about a function or package that you have installed in R. Adding the package name after the function (such as help(glmboost, mboost) for help with the generalized linear modeling boosted regression function through the mboost package) will pull up information about a function in a package that is installed but not yet loaded into your session, so that you can understand what the function does before deciding to use the package. This is helpful if you’re looking for something specific but not sure that what you’re finding online is exactly what you need. In lieu of using the help() function, you can add a question mark before the function name (such as ?glmboost).
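
    The help lookup can be sketched as follows. The first call uses lm from the stats package, chosen here only because it ships with every R installation; the mboost call is guarded so that the sketch runs whether or not mboost is present on your machine.

```r
# Ask for help on a function within a named package.
lm_help <- help("lm", package = "stats")
class(lm_help)

# The same pattern for the glmboost example, guarded by an availability check.
if (requireNamespace("mboost", quietly = TRUE)) {
  print(help("glmboost", package = "mboost"))
}
```

    In an interactive session you would normally type help("lm", package = "stats") directly and let R display the page; assigning the result, as done here, simply avoids opening a help viewer from a script.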

    You can also browse for vignettes demonstrating how to use functions in a package using the command browseVignettes(), which will pull up vignettes for each package you have installed in R. If you want a vignette for a specific package, you can name that package like so: browseVignettes(package = "mboost"). Many packages come with a good overview of how to apply the package’s functions to an example dataset.
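
    If you would rather list vignettes in the console than open a browser, the related vignette() function returns the same information as a table. This sketch uses the grid package as a stand-in because it is part of a standard R installation; the mboost call is guarded in case that package is not installed.

```r
# List the vignettes that ship with one package.
grid_vignettes <- vignette(package = "grid")
grid_vignettes$results[, c("Item", "Title"), drop = FALSE]

# The same listing for mboost, if it is available.
if (requireNamespace("mboost", quietly = TRUE)) {
  print(vignette(package = "mboost")$results[, c("Item", "Title"), drop = FALSE])
}
```

    Each Item value can then be opened by name with vignette("item_name", package = "package_name").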

    R has a broad user base, and internet searches or coding forums can provide additional resources for specific issues related to a package. There are also many good tutorials that overview the basic programming concepts and common functions in R. If you are less familiar with programming, you may want to go through a free tutorial on R programming or work with data in R before attempting the code in this book.

    Because R is an evolving language with new packages added and removed regularly, we encourage you to keep up with developments via package websites and web searches. Packages that are discontinued can still be installed and used as legacy packages but require some caution, as they aren’t updated by the package author. We’ll see one of these in this book with an example of how to install a legacy package. Similarly, new packages are developed regularly, and you should find and use new packages in the field of geometry as they become available.
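
    The text above does not say how the legacy package is installed, so the following is only one common approach, not necessarily the authors' method: CRAN keeps source tarballs of discontinued packages in its Archive directory, and install.packages() can install directly from such a URL. The helper names and the package name and version in the usage line are hypothetical placeholders.

```r
# Build the CRAN Archive URL for a given package name and version.
cran_archive_url <- function(pkg, version) {
  sprintf("https://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz",
          pkg, pkg, version)
}

# Hypothetical helper: install an archived package from its source tarball.
# Building from source may require compiler tools on your system.
install_archived <- function(pkg, version) {
  install.packages(cran_archive_url(pkg, version), repos = NULL, type = "source")
}

# Placeholder name and version, shown only to illustrate the URL layout.
cran_archive_url("somepackage", "1.0.0")
```

    Because archived packages no longer receive updates, dependencies they rely on may also need to be installed by hand.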

    Support for Python Users

    While this book presents examples in R code, our downloadable repository (https://nostarch.com/download/ShapeofData_PythonCode.zip) includes translations to Python packages and functions where possible. Most examples have a Python
