The 1st International Symposium on Human InformatiX
X-Dimensional Human Informatics and Biology
ATR, Kyoto, February 27-28, 2020
https://human-informatix.atr.jp
This document discusses the potential for machine learning to accelerate scientific discovery by rationalizing the inductive process of generating hypotheses from data. It outlines two approaches in science - theory/hypothesis-driven modeling and data-driven modeling using machine learning. It argues that machine learning can help "rationalize" the intuitive, non-logical parts of the scientific process by using data to generate and test hypotheses. The document also discusses how machine learning may automate parts of the scientific method, from hypothesis generation to model building and experimentation, thereby amplifying a scientist's progress.
This document provides a summary of recent publications related to research conducted at the WPI-ICReDD. It lists five publications from 2018-2019 related to catalysis and materials science. It then discusses the research projects and personnel involved in the JST CREST program that is funding this work. The document outlines the goals of using data-driven approaches and machine learning to optimize materials discovery and design. It proposes a multilevel framework that combines in-house and public data along with quality control and annotations to advance the field.
How to use data to design and optimize reaction? A quick introduction to work... (Ichigaku Takigawa)
(Journal Club) ICReDD Seminar, Apr 27 2020
Institute for Chemical Reaction Design and Discovery (ICReDD)
Hokkaido University
Sapporo, JAPAN
https://www.icredd.hokudai.ac.jp
Friday, October 15th, 2021, Sapporo, Hokkaido, Japan.
Hokkaido University ICReDD - Faculty of Medicine Joint Symposium
https://www.icredd.hokudai.ac.jp/event/5993
ICReDD (Institute for Chemical Reaction Design and Discovery)
https://www.icredd.hokudai.ac.jp
Machine Learning for Chemistry: Representing and Intervening (Ichigaku Takigawa)
Joint Symposium of Engineering & Information Science & WPI-ICReDD in Hokkaido University
Apr. 26 (Mon), 2021
https://www.icredd.hokudai.ac.jp/event/5430
Machine Learning for Molecules: Lessons and Challenges of Data-Centric Chemistry (Ichigaku Takigawa)
Perspectives on Artificial Intelligence and Machine Learning in Materials Science
February 4, 2022 – February 6, 2022
https://joint.imi.kyushu-u.ac.jp/post-2698/
The document summarizes a presentation by Takigawa Ichigaku on using machine learning and optimal experimental design for heterogeneous catalysis research. Takigawa introduces himself and his background in machine learning and its applications in the natural sciences. He emphasizes that applying machine learning to the natural sciences requires close collaboration with domain experts and an understanding of ML's capabilities and limitations. The presentation aims to help audiences understand these points and properly position ML's role in exploratory research through examples from his past work.
1) Machine learning can help rationalize the "experience and intuition" of chemical research by finding patterns and exceptions from large amounts of chemical data to predict new materials and phenomena.
2) While in theory chemical structures and properties can be described by the Schrödinger equation, it cannot be solved exactly for realistic systems, so approximations are required. Machine learning may help address this challenge.
3) Chemists have successfully created compounds with desired properties through "experience and intuition", which involves inductive reasoning from experiments rather than purely deductive logic, incorporating serendipitous findings.
Machine Learning and Model-Based Optimization for Heterogeneous Catalyst Desi... (Ichigaku Takigawa)
2nd ICReDD International Symposium—Toward Interdisciplinary Research Guided by Theory and Calculation
Nov. 27 (Wed) - Nov. 29 (Fri), 2019
https://www.icredd.hokudai.ac.jp/event/1229
Scikit-learn: Machine Learning in Python (Ajay Ohri)
- Scikit-learn is a Python module for machine learning that provides simple and efficient tools for data mining and data analysis.
- It incorporates optimized algorithms like support vector machines and linear models while maintaining an easy-to-use interface.
- The goal is to make machine learning accessible to non-experts through a consistent API and minimal dependencies.
The document provides a self-introduction by Takigawa Ichigaku, who specializes in machine learning and data-driven natural science research, particularly those involving discrete structures. It outlines his work experience and current affiliations with RIKEN and Hokkaido University. It then previews the topics to be covered in the talk, including machine learning applications in molecular representation and chemical reaction design, as well as challenges in interpreting machine learning models.
The Materials Genome Initiative (MGI) is a multi-agency effort led by the Department of Energy, National Science Foundation, National Institute of Standards and Technology, and other agencies to reduce the time to develop and deploy new materials by 50% while reducing costs. The goal is to develop a materials innovation infrastructure through improved data sharing and predictive models to achieve national goals in energy, security, and human welfare. Key aspects of MGI include developing repositories of high-quality, shared data and predictive models that span multiple length and time scales from the quantum to the macro level to enable the accelerated discovery and design of new materials with targeted properties.
Machine Learning for Molecular Graph Representations and Geometries (Ichigaku Takigawa)
Dec 1, 2021, Pacifico Yokohama, Japan.
Symposium 1AS-17 "Data science and machine learning: Tackling the Noise and Heterogeneity of the Real World"
The 44th Annual Meeting of the Molecular Biology Society of Japan
https://www2.aeplan.co.jp/mbsj2021/english/designation/index.html
This document provides an overview of the ORCHID fundamental research project, a collaboration between Caltech and universities in Austria, Germany, Switzerland, and Canada, funded by DARPA. The project aims to advance the field of optomechanics, using light to manipulate mechanical devices at the nanoscale. It has produced some milestone experimental findings despite challenges of multi-university collaboration. Key coordination mechanisms that supported the virtual organization included graduate student exchanges, face-to-face meetings, and facilitation from the DARPA program manager. The experience offers insights into managing multi-organizational collaboration for fundamental research.
This paper introduces computational methods for analyzing biological data and some of the many resources available for analyzing sequence data with bioinformatics software. It covers theoretical approaches to data resources and describes some sequence alignment methods and their associated databases. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics, and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques. Databases are essential for bioinformatics research and applications. Many databases exist, covering various information types, for example DNA and protein sequences, molecular structures, phenotypes, and biodiversity. Databases may contain empirical data. Bioinformatics conceptualizes biology in terms of molecules and then applies informatics techniques from mathematics, computer science, and statistics to understand and organize the information associated with these molecules on a large scale. People study bioinformatics in different ways. Some are devoted to developing new computational tools, from both software and hardware viewpoints, for better handling and processing of biological data. They develop new models and new algorithms for existing questions, and propose and tackle new questions when new experimental techniques bring in new data. Others treat bioinformatics as the study of biology from the viewpoint of informatics and systems.
Durgesh Raghuvanshi | Vivek Solanki | Neha Arora | Faiz Hashmi "Computational of Bioinformatics" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-4, June 2020, URL: https://www.ijtsrd.com/papers/ijtsrd30891.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/30891/computational-of-bioinformatics/durgesh-raghuvanshi
Over the past decade, unprecedented progress in the development of neural networks has influenced dozens of industries, including weed recognition in the agro-industrial sector. The use of neural networks to recognize cultivated crops is a new direction in agro-industrial activity. The absence of any standards significantly complicates understanding of how neural networks are actually used in the agricultural sector. The manuscript presents a complete analysis of research over the past 10 years on the use of neural networks for the classification and tracking of weeds. In particular, it analyzes the results of using various neural network algorithms for classification and tracking. As a result, we present recommendations for the use of neural networks in recognizing cultivated plants and weeds. Using this standard can significantly improve the quality of research on this topic and simplify the analysis and understanding of any paper.
This document summarizes the history and activities of SIG-FPAI, a special interest group on artificial intelligence and natural language processing in Japan. It discusses past annual meetings and key topics discussed. It also provides an overview of the development of AI and the internet in Japan from the 1980s to present day. Key events and technologies discussed include the emergence of ISPs in the early 1990s, the rise of search engines and e-commerce in the late 1990s/early 2000s, and the growth of social media and mobile internet in the mid-2000s.
Xin Yao: "What can evolutionary computation do for you?" (ieee_cis_cyprus)
Evolutionary computation techniques like genetic programming and evolutionary algorithms can be used for adaptive optimization, data mining, and machine learning. They have been successfully applied to problems like modeling galaxy distributions, material modeling, constraint handling, dynamic optimization, multi-objective optimization, and ensemble learning. While evolutionary computation has had many real-world applications, challenges remain in improving theoretical foundations, scalability to large problems, dealing with dynamic and uncertain environments, and developing the ability to learn from previous optimization experiences.
Analysis of Existing Models in Relation to the Problems of Mass Exchange betw... (YogeshIJTSRD)
The main recommendations of this article are based on analyzing the rate of harmful elements released during the operating period of motor vehicles and their servicing, with the aim of improving vehicle operation over that period. Shavkat Giyazov "Analysis of Existing Models in Relation to the Problems of Mass Exchange between Autotransport Complex and the Environment" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-3, April 2021, URL: https://www.ijtsrd.com/papers/ijtsrd38681.pdf Paper URL: https://www.ijtsrd.com/engineering/automotive-engineering/38681/analysis-of-existing-models-in-relation-to-the-problems-of-mass-exchange-between-autotransport-complex-and-the-environment/shavkat-giyazov
Dendral was an early artificial intelligence system developed in the 1960s at Stanford University to help chemists identify unknown organic molecules. It used mass spectrometry data and knowledge of chemistry to generate possible molecular structures and test them against the data. Dendral consisted of two subprograms: Heuristic Dendral, which produced potential structures, and Meta Dendral, which learned to explain the correlation between structures and spectra. The system pioneered the use of heuristics, knowledge engineering, and the plan-generate-test problem-solving paradigm in expert systems.
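The generate-and-test idea in the Dendral summary above can be illustrated with a toy sketch. This is not Dendral's actual algorithm: it only enumerates molecular formulas (not structures), uses integer nominal masses, and applies a crude valence heuristic for pruning. The function names, the element set, and the `max_atoms` limit are all assumptions made for illustration.

```python
from itertools import product

# Nominal (integer) atomic masses; a simplification chosen for this sketch.
NOMINAL_MASS = {"C": 12, "H": 1, "N": 14, "O": 16}


def generate_candidates(max_atoms=10):
    """Generate step: enumerate C/H/N/O atom counts, each from 0 to max_atoms."""
    for c, h, n, o in product(range(max_atoms + 1), repeat=4):
        if c >= 1:  # keep only carbon-containing formulas
            yield {"C": c, "H": h, "N": n, "O": o}


def is_plausible(formula):
    """Heuristic pruning with simple valence rules (hypothetical helper)."""
    c, h, n = formula["C"], formula["H"], formula["N"]
    # A saturated acyclic molecule has at most 2C + N + 2 hydrogens,
    # and the parity of the H count must match the parity of the N count.
    return h <= 2 * c + n + 2 and (h - n) % 2 == 0


def formula_mass(formula):
    """Sum of nominal masses over all atoms in the formula."""
    return sum(NOMINAL_MASS[el] * count for el, count in formula.items())


def candidates_for_mass(observed_mass, max_atoms=10):
    """Test step: keep plausible candidates whose mass matches the observation."""
    return [
        f
        for f in generate_candidates(max_atoms)
        if is_plausible(f) and formula_mass(f) == observed_mass
    ]


def to_string(formula):
    """Render a formula such as {'C': 2, 'H': 6, 'N': 0, 'O': 1} as 'C2H6O'."""
    return "".join(
        f"{el}{count if count > 1 else ''}"
        for el, count in formula.items()
        if count
    )


if __name__ == "__main__":
    # 46 is the nominal mass of ethanol (C2H6O), used here as an example input.
    for candidate in candidates_for_mass(46):
        print(to_string(candidate))
```

The heuristic filter is what keeps the candidate list short; without it, the test step alone would also admit chemically implausible formulas with the right mass.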
Listing of the intellectual work of Patanjali Kashyap, mainly containing the names, details, and references of papers, presentations, patents, and public speaking.
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey (Nathan Frey, PhD)
Machine learning and artificial intelligence have transformed our online experience, and for an increasing number of individuals, these fields are fundamentally changing the way we work. In this talk, I will discuss how machine learning is used in the physical sciences, particularly materials science and chemistry, and what transformative impacts we have seen or might expect to see in the future. This discussion will focus on the unique challenges (and opportunities) faced by materials and chemistry researchers applying machine learning in their work. I will present a brief introduction to machine learning for physical scientists and give examples related to synthesis, property prediction and engineering, and artificial intelligence that “reads” research articles. These examples will introduce some of the most prevalent and useful open-source software tools that drive modern machine learning applications. Two significant themes will be emphasized throughout: the careful evaluation of machine learning results and the central importance of data quality and quantity. Finally, I will provide some mundane, “human learned” speculation about the future of machine learning in physical science and recommended resources for further study.
The Center for Applied Optimization at the University of Florida conducts interdisciplinary research in optimization involving faculty from various departments. Over the past 5 years, their research has included global optimization, optimization in biomedicine like predicting epileptic seizures, analyzing massive datasets like social networks, developing approximation algorithms, and algorithms for problems like multicast networks. Current projects also involve computational neuroscience, probabilistic classifiers in medicine, research on energy problems, and using Raman spectroscopy for cancer research. The Center collaborates with researchers from other institutions and hosts many visiting scholars each year.
2D/3D Materials screening and genetic algorithm with ML model (aimsnist)
JARVIS-ML provides concise summaries of materials properties using machine learning models trained on the extensive data in the JARVIS repositories. It has developed regression and classification models that can predict formation energies, bandgaps, and other material properties in seconds, much faster than traditional DFT calculations. The models use gradient boosting decision trees and feature importance analysis to provide explanations. JARVIS-ML is available as a public web app and API for rapid screening and discovery of new materials.
Jeremy Hadidjojo is a PhD candidate in physics at the University of Michigan with expertise in computational physics, mathematical modeling, simulation, and data analysis. His research focuses on developing physical models of biological pattern formation and applying machine learning techniques to analyze complex systems. He has extensive programming skills in MATLAB, Python, C++ and experience with parallel and GPU computing. His published works include modeling mechanisms of planar cell chirality and retinal cone patterning in zebrafish.
The document summarizes 5 papers from Zhejiang University of Finance and Economics that were included in the Ei Compendex database in 2005. It provides the title, authors, source, and brief summaries for each of the 5 papers.
The document discusses using artificial intelligence (AI) to accelerate materials innovation for clean energy applications. It outlines six elements needed for a Materials Acceleration Platform: 1) automated experimentation, 2) AI for materials discovery, 3) modular robotics for synthesis and characterization, 4) computational methods for inverse design, 5) bridging simulation length and time scales, and 6) data infrastructure. Examples of opportunities include using AI to bridge simulation scales, assist complex measurements, and enable automated materials design. The document argues that a cohesive infrastructure is needed to make effective use of AI, data, computation, and experiments for materials science.
This document describes a classroom exercise where students developed deep neural networks to model and predict adsorption equilibrium data. The exercise introduced students to artificial intelligence and deep learning concepts. Students used MATLAB to create neural networks that modeled adsorption of acids by activated carbon at different temperatures, comparing results to theoretical models. The goals were to teach AI methodology, increase coding skills, and show neural networks can accurately model complex chemical engineering processes. Feedback confirmed students gained knowledge of machine learning terms and abilities to develop simple or sophisticated neural networks for modeling unit operations.
Future Directions in Chemical Engineering and Bioengineering (Ilya Klabukov)
"Future Directions in Chemical Engineering and Bioengineering"
January 16-18, 2013
Austin, Texas
Chair: John G. Ekerdt, The University of Texas at Austin
Sponsored by Department of Defense,
Office of the Assistant Secretary of Defense for Research and Engineering
Chemical and biological engineers use math, physics, chemistry, and biology to develop chemical transformations and processes, creating useful products and materials that improve society. In recent years, the boundaries between chemical engineering and bioengineering have blurred as biology has become molecular science, more seamlessly connecting with the historic focus of chemical engineering on molecular interactions and transformations.
This disappearing boundary creates new opportunities for the next generation of engineered systems – hybrid systems that integrate the specificity of biology with chemical and material systems to enable novel applications in catalysis, biomaterials, electronic materials, and energy conversion materials.
Basic research for the U.S. Department of Defense covers a wide range of topics such as metamaterials and plasmonics, quantum information science, cognitive neuroscience, understanding human behavior, synthetic biology, and nanoscience and nanotechnology. Future Directions workshops such as this one identify opportunities for continuing and future DOD investment. The intent is to create conditions for discovery and transformation, maximize the discovery potential, bring balance and coherence, and foster connections. Basic research stretches the limits of today’s technologies and discovers new phenomena and know-how that ultimately lead to future technologies and enable military and societal progress.
Skoltech at a glance. What is this new type of university? What do we do? What sets us apart from traditional universities?
And what does it take to become a Skoltech student?
Skoltech is a new university in Russia aimed at accelerating innovation. It offers master's programs in fields like energy, IT, and biomedical science. Students take intensive courses and engage in research with a focus on practical applications. The university works with industry and has 15 research centers in areas such as biomedicine, energy, IT, and space. It aims to produce global leaders by giving students skills in both research and entrepreneurship.
1) The document discusses challenges in using machine learning and data analytics for materials science research. Specifically, most materials are irrelevant for a given purpose, so models need to identify statistically exceptional subgroups rather than averaging all data.
2) Two potential methods for identifying promising subgroups are discussed: focusing on materials with small oxygen-carbon-oxygen angles or large carbon-oxygen bond lengths for catalysis applications.
3) The concept of a model's domain of applicability is introduced, wherein models perform best when applied only to similar data they were trained on, rather than all data globally. Identifying these reliable domains is important.
This paper discusses the several research methodologies that can be used in Computer Science (CS) and Information Systems (IS). The research methods vary according to the science domain and project field. However, a few research methodologies can be reasonable for both Computer Science and Information Systems.
Machine Learning in Material Characterizationijtsrd
Machine learning has shown great potential applications in material science. It is widely used in material design, corrosion detection, material screening, new material discovery, and other fields of materials science. The majority of ML approaches in materials science are based on artificial neural networks (ANNs). The use of ML and related techniques for materials design, development, and characterization has matured into a mainstream field. This paper focuses on the applications of machine learning strategies for material characterization. Matthew N. O. Sadiku | Guddi K. Suman | Sarhan M. Musa "Machine Learning in Material Characterization" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-6, October 2021, URL: https://www.ijtsrd.com/papers/ijtsrd46392.pdf Paper URL: https://www.ijtsrd.com/engineering/electrical-engineering/46392/machine-learning-in-material-characterization/matthew-n-o-sadiku
Predicting Material Properties Using Machine Learning for Accelerated Materia...Nikhil Sanjay Suryawanshi
The rapid prediction of material properties has become a pivotal factor in accelerating materials discovery and development, driven by advancements in machine learning and data-driven methodologies. This paper presents a novel system for predicting material properties using machine learning techniques, offering a scalable and efficient framework for exploring new materials with optimized properties. The system incorporates large datasets, feature engineering, and multiple machine learning models, such as Kernel Ridge Regression, Random Forest, and Neural Networks, to predict material properties like thermal conductivity, elastic modulus, and electronic bandgap. By integrating physics-based knowledge into machine learning models, the proposed system enhances the accuracy and interpretability of predictions. The results indicate that the system can significantly reduce the time and cost of material discovery while delivering high prediction accuracy. This approach has the potential to revolutionize materials science by enabling researchers to identify promising material candidates in silico, paving the way for breakthroughs in energy, electronics, and sustainable materials.
In this deck from the HPC User Forum, Rick Stevens from Argonne presents: AI for Science.
"Artificial Intelligence (AI) is making strides in transforming how we live. From the tech industry embracing AI as the most important technology for the 21st century to governments around the world growing efforts in AI, initiatives are rapidly emerging in the space. In sync with these emerging initiatives including U.S. Department of Energy efforts, Argonne has launched an “AI for Science” initiative aimed at accelerating the development and adoption of AI approaches in scientific and engineering domains with the goal to accelerate research and development breakthroughs in energy, basic science, medicine, and national security, especially where we have significant volumes of data and relatively less developed theory. AI methods allow us to discover patterns in data that can lead to experimental hypotheses and thus link data driven methods to new experiments and new understanding."
Watch the video: https://wp.me/p3RLHQ-kQi
Learn more: https://www.anl.gov/topic/science-technology/artificial-intelligence
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The Phase Field Methods Workshop was held at Northwestern University on January 9, 2015. The workshop brought together researchers from national laboratories, universities, and industry to discuss phase field modeling tools and methods. The agenda included sessions on current phase field codes and capabilities, large-scale computing approaches, potential focus areas for research, and how to structure a community code. Attendees discussed formulating standard benchmark problems and organizing a community repository to enable further collaboration on phase field modeling code development.
Knowledge Management in the AI Driven Scientific SystemSubhasis Dasgupta
In this dynamic talk, we'll explore the transformative role of AI in scientific knowledge management. We'll delve into how AI revolutionizes data organization, analysis, and hypothesis testing, enhancing efficiency and discovery. Highlighting the seamless integration with existing research processes, we'll address the training and ethical considerations of AI adoption. Through real-world examples, we'll demonstrate AI's impact on scientific breakthroughs, emphasizing the shift towards more collaborative and innovative research landscapes. This presentation aims to inspire the scientific community to embrace AI, leveraging its potential to redefine the boundaries of knowledge and innovation.
The document describes the eBank UK project, which seeks to link e-research data, scholarly communication, and e-learning by building connections from data generated in experiments through publications and into educational resources. It discusses the scholarly knowledge cycle and how eBank UK is addressing the bottleneck of data publication by developing a distributed information architecture with common data standards and ontologies. This will allow an aggregator to harvest metadata from repositories holding experimental data and publications and provide a single access point for discovery across distributed resources through services like search and retrieval.
The Materials Project: Applications to energy storage and functional materia...Anubhav Jain
The Materials Project is a free online database containing calculated properties of over 150,000 materials designed to help researchers discover new functional materials. It has been used extensively in academia and industry to identify novel battery electrode materials and solid electrolytes through high-throughput computational screening. Researchers are now using the Materials Project dataset to train machine learning models to predict battery properties and screen for new materials. Related efforts aim to bridge the gap between computational design and physical synthesis by developing an automated synthesis lab to experimentally validate candidate materials identified from the database.
Deep learning is finding applications in science such as predicting material properties. DLHub is being developed to facilitate sharing of deep learning models, data, and code for science. It will collect, publish, serve, and enable retraining of models on new data. This will help address challenges of applying deep learning to science like accessing relevant resources and integrating models into workflows. The goal is to deliver deep learning capabilities to thousands of scientists through software for managing data, models and workflows.
Machine learning for materials design: opportunities, challenges, and methodsAnubhav Jain
Machine learning techniques show promise for accelerating materials design by serving as surrogates for experiments and computations, enabling "self-driving laboratories", and extracting insights from natural language text. Key opportunities include using ML to screen large areas of chemical space before running computationally expensive DFT calculations or laboratory experiments. Challenges include limited materials data, data heterogeneity across problems, and ensuring ML models can accurately extrapolate beyond the training data distribution. Overcoming these challenges could substantially reduce the decades-long timelines currently needed for new materials discovery and optimization.
Linked Open Data: Combining Data for the Social Sciences and Humanities (and ...Richard Zijdeman
A glimpse of how we are used to connecting datasets on our laptops and how, imho, we need to move to the Web of Data, including a demo connecting various sources all from your(!) machine.
This document contains a discussion of AI technologies from 2023, including ChatGPT, DALL-E, Copilot, and developments from Microsoft, Google, Apple, and other companies. It also discusses gradient descent optimization techniques like momentum, AdaGrad, RMSProp and Adam for training neural networks. Methods for calculating gradients using the chain rule are explained for multi-layer models.
The document discusses the rise of major technology companies and social networks. It notes that Google was founded in 1996 and Facebook in 2004, and that smartphones running iOS and Android became popular around 2008-2009. It also mentions that Japan's GDP ranking may improve to #1 by 2023, surpassing other countries, and that the integration of technologies like artificial intelligence and the internet of things will continue to change our lives. Charts show the growth of data and devices over time.
Exploring Practices in Machine Learning and Machine Discovery for Heterogeneo...Ichigaku Takigawa
Video https://youtu.be/P4QogT8bdqY
ACS Spring 2023 Symposium on AI-Accelerated Scientific Workflow
https://acs.digitellinc.com/acs/sessions/526630/view
ACS SPRING 2023 ———— Crossroads of Chemistry
Indianapolis, IN & Hybrid, March 26-30
https://www.acs.org/meetings/acs-meetings/spring-2023.html
Slide PDF
https://itakigawa.page.link/acs2023spring
Our Paper
Accelerated discovery of multi-elemental reverse water-gas shift catalysts using extrapolative machine learning approach (2022, ChemRxiv)
https://doi.org/10.26434/chemrxiv-2022-695rj
Ichi Takigawa
https://itakigawa.github.io/
The document appears to be a research paper discussing the relationship between weight (g) and distance (cm) when objects are thrown. It contains several scatter plots with data points showing the weight and distance of different objects. The paper suggests running regression analysis on the data to determine the linear relationship between weight and distance. Additional analysis may include investigating the effects of other variables like air resistance.
The document contains contact information for Ichigaku Takigawa including their email address [email protected], personal website URL https://itakigawa.github.io/, and mentions they are working with IBISML and ATR on materials informatics and bioinformatics. It also includes a link to their page https://itakigawa.page.link/IBISML for a PDF document.
The document discusses the Rubik's Cube and cubing as a competitive sport. It provides an overview of popular cubing methods like CFOP and compares the number of combinations for different puzzle sizes. It also mentions top cubing brands and provides statistics on the growth of the cubing market in recent years.
1. The document discusses issues related to technology companies and artificial intelligence, focusing on topics like social media, big data, and privacy concerns.
2. It notes that companies like Google, Facebook, Amazon, Apple, and Microsoft now dominate the global technology industry and have significant influence over people's lives and access to information.
3. Concerns are raised about the use of personal data by these companies and how technologies like social media, internet of things, and artificial intelligence could impact privacy, democracy, and society.
The document appears to be a biography or CV for Ichigaku Takigawa. It details his employment history from 1995-2004 at an unnamed organization, 2004 at another unnamed organization, 2005-2011 at the RIKEN Center for Developmental Biology, and 2012-2018 at the Japan Science and Technology Agency. In 2019 he became the director of the RIKEN Center for Integrative Medical Sciences. His research focuses on induced pluripotent stem cells.
- The document contains the results of an experiment with multiple data points plotted as dots across various x and y-axis values.
- There are a large number of data points densely plotted in the graph, with some outliers at the edges.
- The data points are recorded measurements from an experiment, but no other context is provided about the experiment, variables, or what is being measured.
Structure formation with primordial black holes: collisional dynamics, binari...Sérgio Sacani
Primordial black holes (PBHs) could compose the dark matter content of the Universe. We present the first simulations of cosmological structure formation with PBH dark matter that consistently include collisional few-body effects, post-Newtonian orbit corrections, orbital decay due to gravitational wave emission, and black-hole mergers. We carefully construct initial conditions by considering the evolution during radiation domination as well as early-forming binary systems. We identify numerous dynamical effects due to the collisional nature of PBH dark matter, including evolution of the internal structures of PBH halos and the formation of a hot component of PBHs. We also study the properties of the emergent population of PBH binary systems, distinguishing those that form at primordial times from those that form during the nonlinear structure formation process. These results will be crucial to sharpen constraints on the PBH scenario derived from observational constraints on the gravitational wave background. Even under conservative assumptions, the gravitational radiation emitted over the course of the simulation appears to exceed current limits from ground-based experiments, but this depends on the evolution of the gravitational wave spectrum and PBH merger rate toward lower redshifts.
2025 Insilicogen Company English BrochureInsilico Gen
Insilicogen is a company that specializes in bioinformatics. Our company provides a platform for sharing and communicating various biological data analyses effectively.
Poultry require at least 38 dietary nutrients in appropriate concentrations for a balanced diet. A nutritional deficiency may be due to a nutrient being omitted from the diet, adverse interaction between nutrients in otherwise apparently well-fortified diets, or the overriding effect of specific anti-nutritional factors.
Major components of foods are: protein, fats, carbohydrates, minerals, and vitamins.
Vitamins are (A) fat-soluble vitamins: A, D, E, and K; (B) water-soluble vitamins: thiamin (B1), riboflavin (B2), nicotinic acid (niacin), pantothenic acid (B5), biotin, folic acid, pyridoxine, and choline.
Causes: Low levels of vitamin A in the feed, oxidation of vitamin A in the feed, errors in mixing, and intercurrent disease, e.g. coccidiosis or worm infestation.
Clinical signs: Lacrimation (ocular discharge). White cheesy exudates under the eyelids (conjunctivitis). Sticky eyelids (xerophthalmia). Keratoconjunctivitis.
Watery discharge from the nostrils. Sinusitis. Gasping and sneezing. Lack of yellow pigments.
Respiratory signs due to affection of the epithelium of the respiratory tract.
Lesions:
Pseudo-diphtheritic membrane in the digestive and respiratory systems (keratinized epithelia).
Nutritional roup: respiratory signs due to affection of the epithelium of the respiratory tract.
Pustule-like nodules in the upper digestive tract (buccal cavity, pharynx, esophagus).
Urate deposits may be found on other visceral organs.
Treatment:
Administer 3-5 times the recommended levels of vitamin A at 10,000 IU/kg ration, either through water or feed.
The human eye is a complex organ responsible for vision, composed of various structures working together to capture and process light into images. The key components include the sclera, cornea, iris, pupil, lens, retina, optic nerve, and various fluids like aqueous and vitreous humor. The eye is divided into three main layers: the fibrous layer (sclera and cornea), the vascular layer (uvea, including the choroid, ciliary body, and iris), and the neural layer (retina).
Here's a more detailed look at the eye's anatomy:
1. Outer Layer (Fibrous Layer):
Sclera:
The tough, white outer layer that provides shape and protection to the eye.
Cornea:
The transparent, clear front part of the eye that helps focus light entering the eye.
2. Middle Layer (Vascular Layer/Uvea):
Choroid:
A layer of blood vessels located between the retina and the sclera, providing oxygen and nourishment to the outer retina.
Ciliary Body:
A ring of tissue behind the iris that produces aqueous humor and controls the shape of the lens for focusing.
Iris:
The colored part of the eye that controls the size of the pupil, regulating the amount of light entering the eye.
Pupil:
The black opening in the center of the iris that allows light to enter the eye.
3. Inner Layer (Neural Layer):
Retina:
The light-sensitive layer at the back of the eye that converts light into electrical signals that are sent to the brain via the optic nerve.
Optic Nerve:
A bundle of nerve fibers that carries visual signals from the retina to the brain.
4. Other Important Structures:
Lens:
A transparent, flexible structure behind the iris that focuses light onto the retina.
Aqueous Humor:
A clear, watery fluid that fills the space between the cornea and the lens, providing nourishment and maintaining eye shape.
Vitreous Humor:
A clear, gel-like substance that fills the space between the lens and the retina, helping maintain eye shape.
Macula:
A small area in the center of the retina responsible for sharp, central vision.
Fovea:
The central part of the macula with the highest concentration of cone cells, providing the sharpest vision.
These structures work together to allow us to see, with the light entering the eye being focused by the cornea and lens onto the retina, where it is converted into electrical signals that are transmitted to the brain for interpretation.
The eye sits in a protective bony socket called the orbit. Six extraocular muscles in the orbit are attached to the eye. These muscles move the eye up and down, side to side, and rotate the eye.
The extraocular muscles are attached to the white part of the eye called the sclera. This is a strong layer of tissue that covers nearly the entire surface of the eyeball. The layers of the tear film keep the front of the eye lubricated.
Tears lubricate the eye and are made up of three layers. These three layers together are called the tear film. The mucous layer is made by the conjunctiva. The watery part of the tears is made by the lacrimal gland.
Direct Evidence for r-process Nucleosynthesis in Delayed MeV Emission from th...Sérgio Sacani
The origin of heavy elements synthesized through the rapid neutron capture process (r-process) has been an enduring mystery for over half a century. J. Cehula et al. recently showed that magnetar giant flares, among the brightest transients ever observed, can shock heat and eject neutron star crustal material at high velocity, achieving the requisite conditions for an r-process. A. Patel et al. confirmed an r-process in these ejecta using detailed nucleosynthesis calculations. Radioactive decay of the freshly synthesized nuclei releases a forest of gamma-ray lines, Doppler broadened by the high ejecta velocities v ≳ 0.1c into a quasi-continuous spectrum peaking around 1 MeV. Here, we show that the predicted emission properties (light curve, fluence, and spectrum) match a previously unexplained hard gamma-ray signal seen in the aftermath of the famous 2004 December giant flare from the magnetar SGR 1806–20. This MeV emission component, rising to peak around 10 minutes after the initial spike before decaying away over the next few hours, is direct observational evidence for the synthesis of ∼10⁻⁶ M☉ of r-process elements. The discovery of magnetar giant flares as confirmed r-process sites, contributing at least ∼1%–10% of the total Galactic abundances, has implications for the Galactic chemical evolution, especially at the earliest epochs probed by low-metallicity stars. It also implicates magnetars as potentially dominant sources of heavy cosmic rays. Characterization of the r-process emission from giant flares by resolving decay line features offers a compelling science case for NASA’s forthcoming COSI nuclear spectrometer, as well as next-generation MeV telescope missions.
The interplay between data-driven and theory-driven methods for chemical sciences
1. The interplay between data-driven and
theory-driven methods for chemical sciences
Ichigaku Takigawa
• Medical-risk Avoidance based on iPS Cells Team,
RIKEN Center for Advanced Intelligence Project (AIP)
• Institute for Chemical Reaction Design and Discovery
(WPI-ICReDD), Hokkaido University
[email protected]
2. Research Interests: Ichigaku Takigawa
• Machine learning and data mining technologies
• Data-intensive approaches to natural sciences
10 years Hokkaido University (Computer Sciences) (1995-2004)
• Grad School. Engineering
7 years Kyoto University (Bioinformatics, Chemoinformatics) (2005-2011)
• Bioinformatics Center, Inst. Chemical Research
• Grad School. Pharmaceutical Sciences
7 years Hokkaido University (Computer Sciences) (2012-2018)
• Grad School. Information Science and Technology
• JST PRESTO on Materials Informatics (2015-2018)
? years RIKEN@Kyoto (Machine Learning + Stem Cell Biology)
• Center for AI Project (2019-)
Hokkaido University (Machine Learning + Chemistry)
• Inst. Chemical Reaction Design and Discovery
[Figure labels: Kyoto, Hokkaido, Sapporo]
From AAAI-20 Oxford-Style Debate
5. Machine Learning with "Discrete Structures"
Q. How can we make use of "discrete structures" for prediction...?
sets, sequences, branchings or hierarchies (trees), networks (graphs), relations,
logic, rules, combinations, permutations, point clouds, algebra, languages, ...
[Figure: molecular structure diagrams with H, N, O, and Cl atoms]
• The target variables can come with "discrete structures"
• The relationship between variables can have "discrete structures"
• The ML models themselves can involve "discrete structures"
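As a minimal illustration of the question on this slide, the sketch below treats a molecule as a small labeled graph and turns it into a fixed-length feature vector that a standard ML model could consume. The toy molecule (water), the function names `bond_type_counts` and `to_feature_vector`, and the bond-type vocabulary are all assumptions made for illustration; they are not taken from the talk, and real work would typically use richer graph descriptors or graph neural networks.

```python
from collections import Counter

# Hypothetical toy molecule: water (H2O) as an undirected labeled graph.
# Nodes are atom indices mapped to element symbols; edges are bonds.
atoms = {0: "O", 1: "H", 2: "H"}
bonds = [(0, 1), (0, 2)]


def bond_type_counts(atoms, bonds):
    """Count bonds by the alphabetically sorted pair of elements they connect."""
    return Counter(tuple(sorted((atoms[a], atoms[b]))) for a, b in bonds)


def to_feature_vector(counts, vocabulary):
    """Map bond-type counts onto a fixed vocabulary, giving a numeric vector."""
    return [counts.get(key, 0) for key in vocabulary]


# Assumed vocabulary of bond types; anything outside it is ignored.
vocabulary = [("H", "O"), ("C", "H"), ("C", "O")]

counts = bond_type_counts(atoms, bonds)
features = to_feature_vector(counts, vocabulary)
print(features)  # [2, 0, 0]: two O-H bonds, no C-H or C-O bonds
```

Counting substructures in this way discards most of the graph's topology, which is one reason why models that operate on the discrete structure directly are of interest.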
6. Past work: Data-intensive approaches to life science
• Transcriptional regulation of metabolism
Bioinformatics 2007, 2008a, 2008b, 2009, 2010
Nucleic Acids Res 2011, PLoS One 2012, 2013, KDD'07
• Transcription regulation by mediator complex
Nat Commun 2015, Nat Commun 2020
• Repetitive sequences in genomes
Discrete Appl Math 2013, 2016, AAAI 2020
• Polypharmacology of drug-target interaction networks
PLoS One 2011, Drug Discov Today 2013, Brief Bioinform 2014, BMC Bioinformatics 2020
• Copy number variations in neurological disorder (MSA, MS)
Mol Brain 2017
• Substrate analysis of modulator protease (Calpain family)
Mol Cell Proteom 2016, Genome Informatics 2009, http://calpain.org
• Cell competition in cancer
Cell Reports 2018, Sci Rep 2015
• Phenotype patterns of somatic mutations in breast cancer
Brief Bioinform 2014
7. Recent work: Data-intensive approaches to chemistry
Machine learning for heterogeneous catalysis
Heterogeneous catalysis: catalysis where the phase of the catalyst differs from the phase of the reactants or products (reactants in the gas phase, catalysts in the solid phase).
• Haber–Bosch Process (industrial synthesis of ammonia): Ferrous Metal Catalysis; “Fertilizer from Air”, artificial nitrogen fixation
• Exhaust Gas Purification: exhaust gas (NOx, CO, HC) → harmless gas (N2, CO2, H2O); Noble Metal Catalysis (Pt, Pd, Rh…)
• Conversion of Methane: methane → ethane, ethylene, methanol, …; Various Metallic Catalysts (Li, rare earths, alkaline earths)
• ACS Catalysis 2020 (review)
• ChemCatChem 2019 (front cover)
• J Phys Chem C 2018a, 2018b, 2019 (cover)
• RSC Advances 2016 (highlighted in Chemical World)
8. Effective use of data is another key in natural sciences
REVIEW
Inverse molecular design using machine learning: Generative models for matter engineering
Benjamin Sanchez-Lengeling and Alán Aspuru-Guzik
The discovery of new materials can bring enormous societal and technological progress. In this context, exploring completely the large space of potential materials is computationally intractable. Here, we review methods for achieving inverse design, which aims to discover tailored materials from the starting point of a particular desired functionality. Recent advances from the rapidly growing field of artificial intelligence, mostly from the subfield of machine learning, have resulted in a fertile exchange of ideas, where approaches to inverse molecular design are being proposed and employed at a rapid pace. Among these, deep generative models have been applied to numerous classes of materials: rational design of prospective drugs, synthetic routes to organic compounds, and optimization of photovoltaics and redox flow batteries, as well as a variety of other solid-state materials.
Many of the challenges of the 21st century (1), from personalized health care to energy production and storage, share a common theme: materials are part of the solution (2). In some cases, the solu- […]
[…] mapped 166.4 billion molecules that contain at most 17 heavy atoms. For pharmacologically relevant small molecules, the number of structures is estimated to be on the order of 10^60 (9). Adding consideration of the hierarchy of scale from sub- […]
[…] act properties. In practice, approximations are used to lower computational time at the cost of accuracy.
Although theory enjoys enormous progress, now routinely modeling molecules, clusters, and perfect as well as defect-laden periodic solids, the size of chemical space is still overwhelming, and smart navigation is required. For this purpose, machine learning (ML), deep learning (DL), and artificial intelligence (AI) have a potential role to play because their computational strategies automatically improve through experience (11). In the context of materials, ML techniques are often used for property prediction, seeking to learn a function that maps a molecular material to the property of choice. Deep generative models are a special class of DL methods that seek to model the underlying probability distribution of both structure and property and relate them in a nonlinear way. By exploiting patterns in massive datasets, these models can distill average and salient features that characterize molecules (12, 13).
Inverse design is a component of a more complex materials discovery process. The time scale for deployment of new technologies, from discovery in a laboratory to a commercial product, historically, is 15 to 20 years (14). The process (Fig. 1) conventionally involves the following steps: (i) generate a new or improved material […]
FRONTIERS IN COMPUTATION
REVIEW https://doi.org/10.1038/s41586-018-0337-2
Machine learning for molecular and
materials science
Keith T. Butler1
, Daniel W. Davies2
, Hugh Cartwright3
, Olexandr Isayev4
* & Aron Walsh5,6
*
Here we summarize recent progress in machine learning for the chemical sciences. We outline machine-learning
techniques that are suitable for addressing research questions in this domain, as well as future directions for the field.
We envisage a future in which the design, synthesis, characterization and application of molecules and materials is
accelerated by artificial intelligence.
T
he Schrödinger equation provides a powerful structure–
property relationship for molecules and materials. For a given
spatial arrangement of chemical elements, the distribution of
electrons and a wide range of physical responses can be described. The
development of quantum mechanics provided a rigorous theoretical
foundationforthechemicalbond.In1929,PaulDiracfamouslyproclaimed
that the underlying physical laws for the whole of chemistry are “completely
known”1
. John Pople, realizing the importance of rapidly developing
generating, testing and refining scientific models. Such techniques are
suitable for addressing complex problems that involve massive combi-
natorial spaces or nonlinear processes, which conventional procedures
either cannot solve or can tackle only at great computational cost.
As the machinery for artificial intelligence and machine learning
matures, important advances are being made not only by those in main-
stream artificial-intelligence research, but also by experts in other fields
(domain experts) who adopt these approaches for their own purposes. As
DNA to be sequences into distinct pieces,
parcel out the detailed work of sequencing,
and then reassemble these independent ef-
forts at the end. It is not quite so simple in the
world of genome semantics.
Despite the differences between genome se-
quencing and genetic network discovery, there
are clear parallels that are illustrated in Table 1.
In genome sequencing, a physical map is useful
to provide scaffolding for assembling the fin-
ished sequence. In the case of a genetic regula-
tory network, a graphical model can play the
same role. A graphical model can represent a
high-level view of interconnectivity and help
isolate modules that can be studied indepen-
dently. Like contigs in a genomic sequencing
project, low-level functional models can ex-
plore the detailed behavior of a module of genes
in a manner that is consistent with the higher
level graphical model of the system. With stan-
dardized nomenclature and compatible model-
ing techniques, independent functional models
can be assembled into a complete model of the
cell under study.
To enable this process, there will need to
be standardized forms for model representa-
tion. At present, there are many different
modeling technologies in use, and although
models can be easily placed into a database,
they are not useful out of the context of their
specific modeling package. The need for a
standardized way of communicating compu-
tational descriptions of biological systems ex-
tends to the literature. Entire conferences
have been established to explore ways of
mining the biology literature to extract se-
mantic information in computational form.
Going forward, as a community we need
to come to consensus on how to represent
what we know about biology in computa-
tional form as well as in words. The key to
postgenomic biology will be the computa-
tional assembly of our collective knowl-
edge into a cohesive picture of cellular and
organism function. With such a comprehen-
sive model, we will be able to explore new
types of conservation between organisms
and make great strides toward new thera-
peutics that function on well-characterized
pathways.
References
1. S. K. Kim et al., Science 293 , 2087 (2001).
2. A. Hartemink et al., paper presented at the Pacific
Symposium on Biocomputing 2000, Oahu, Hawaii, 4
to 9 January 2000.
3. D. Pe’er et al., paper presented at the 9th Conference
on Intelligent Systems in Molecular Biology (ISMB),
Copenhagen, Denmark, 21 to 25 July 2001.
4. H. McAdams, A. Arkin, Proc. Natl. Acad. Sci. U.S.A.
94 , 814 ( 1997 ).
5. A. J. Hartemink, thesis, Massachusetts Institute of
Technology, Cambridge (2001).
V I E W P O I N T
Machine Learning for Science: State of the
Art and Future Prospects
Eric Mjolsness* and Dennis DeCoste
Recent advances in machine learning methods, along with successful
applications across a wide variety of fields such as planetary science and
bioinformatics, promise powerful new tools for practicing scientists. This
viewpoint highlights some useful characteristics of modern machine learn-
ing methods and their relevance to scientific applications. We conclude
with some speculations on near-term progress and promising directions.
Machine learning (ML) (1) is the study of
computer algorithms capable of learning to im-
prove their performance of a task on the basis of
their own previous experience. The field is
closely related to pattern recognition and statis-
tical inference. As an engineering field, ML has
become steadily more mathematical and more
successful in applications over the past 20
years. Learning approaches such as data clus-
tering, neural network classifiers, and nonlinear
regression have found surprisingly wide appli-
cation in the practice of engineering, business,
and science. A generalized version of the stan-
dard Hidden Markov Models of ML practice
have been used for ab initio prediction of gene
structures in genomic DNA (2). The predictions
correlate surprisingly well with subsequent
gene expression analysis (3). Postgenomic bi-
ology prominently features large-scale gene ex-
pression data analyzed by clustering methods
(4), a standard topic in unsupervised learning.
Many other examples can be given of learning
and pattern recognition applications in science.
Where will this trend lead? We believe it will
lead to appropriate, partial automation of every
element of scientific method, from hypothesis
generation to model construction to decisive
experimentation. Thus, ML has the potential to
amplify every aspect of a working scientist’s
progress to understanding. It will also, for better
or worse, endow intelligent computer systems
with some of the general analytic power of
scientific thinking.
creating hypotheses, testing by decisive exper-
iment or observation, and iteratively building
up comprehensive testable models or theories is
shared across disciplines. For each stage of this
abstracted scientific process, there are relevant
developments in ML, statistical inference, and
pattern recognition that will lead to semiauto-
matic support tools of unknown but potentially
broad applicability.
Increasingly, the early elements of scientific
method—observation and hypothesis genera-
tion—face high data volumes, high data acqui-
sition rates, or requirements for objective anal-
ysis that cannot be handled by human percep-
tion alone. This has been the situation in exper-
imental particle physics for decades. There
automatic pattern recognition for significant
events is well developed, including Hough
transforms, which are foundational in pattern
recognition. A recent example is event analysis
for Cherenkov detectors (8) used in neutrino
oscillation experiments. Microscope imagery in
cell biology, pathology, petrology, and other
fields has led to image-processing specialties.
So has remote sensing from Earth-observingMachine Learning Systems Group, Jet Propulsion Lab-
Table 1. Parallels between genome sequencing
and genetic network discovery.
Genome
sequencing
Genome semantics
Physical maps Graphical model
Contigs Low-level functional
models
Contig
reassembly
Module assembly
Finished genome
sequence
Comprehensive model
C O M P U T E R S A N D S C I E N C E
onAugust29,2018http://science.sciencemag.org/Downloadedfrom
Nature, 559
pp. 547–555 (2018)
Science, 293
pp. 2051-2055 (2001)
Science, 361
pp. 360-365 (2018)
Science is changing, the tools of science are changing. And
that requires different approaches. ── Erich Bloch, 1925-2016
(alongside experiments and simulations)
And please keep in mind that unplanned data collection is too risky.
We need the right designs for data collection and the right tools to analyze the data.
A bitter lesson: "low input, high throughput, no output science." (Sydney Brenner)
9. Today's AI has stark limitations, facing big problems
• Deep learning techniques thus far have proven to be data hungry, shallow,
brittle, and limited in their ability to generalize (Marcus, 2018)
• Current machine learning techniques are data-hungry and brittle—they can
only make sense of patterns they've seen before. (Chollet, 2020)
• A growing body of evidence shows that state-of-the-art models learn to exploit
spurious statistical patterns in datasets... instead of learning meaning in the
flexible and generalizable way that humans do. (Nie et al., 2019)
• Current machine learning methods seem weak when they are required to
generalize beyond the training distribution, which is what is often needed in
practice. (Bengio et al., 2019)
Though it's extremely powerful and useful in suitable applications!
10. The AI/ML community is now struggling with these...
• The Winograd Schema Challenge (Levesque et al., 2011)
A collection of 273 multiple-choice problems that require commonsense
reasoning to solve, proposed as an alternative to the Turing test (Turing, 1950).
Many methods that passed the Turing test (the "Imitation Game") were
actually tricks that were fairly adept at fooling humans...
Unexpectedly, it turned out that several tricks using BERT/Transformer
can crack many of these problems (with >90% accuracy)...
• WinoGrande (Sakaguchi et al., 2020)
An improved collection of 44k very hard problems
(AAAI-20 Outstanding Paper Award)
Though some top-ranking methods can reach 70-80% accuracy...
• Abstraction and Reasoning Challenge at Kaggle (Chollet, 2020)
"Kaggle’s most important AI competition" "Impossible AI Challenge"
At the moment, to what extent can we create an AI capable of solving
reasoning tasks it has never seen before?
11. Examples
WSC-273
(Q0) The town councillors refused to give the angry demonstrators
a permit because they feared violence. Who feared violence?
(due to Terry Winograd)
• the town councillors
• the angry demonstrators
(Q1) The trophy would not fit in the brown suitcase because it was
so small. What was so small?
• the trophy
• the brown suitcase
WinoGrande-1.1
(Q) He never comes to my home, but I always go to his house
because the ______ is smaller.
1. home
2. house
15. Take home message
The current "end-to-end" or "fully data-driven" strategy of ML is too
data-hungry, but in many cases we cannot obtain enough data because of
various practical restrictions (cost, time, ethics, privacy, etc.).
• Model-based learning, neuro-symbolic or ML-GOFAI integration.
We can partly use explicit models for well-understood parts, both for sample
efficiency and for filling the gap between correlation and causation.
• Need for modeling the unverbalizable common sense of domain experts
We need a good strategy for building a gigantic data collection for this,
as well as self-supervised learning and/or meta-learning algorithms.
"Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI
systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun)
• Need for novel techniques for compositionality (combinatorial
generalization), out-of-distribution prediction (extrapolation), and
their flexible transfer
We need to somehow combine and flexibly transfer partial knowledge
to generate something new or deal with completely new situations.
17. Chemistry has the first principle at the electron level
Every substance is a combination of only 118 elements in the periodic table.
Recombinations of atoms and chemical bonds are subject to
the laws of nature (the Schrödinger equation).
Our end goal:
take full control over chemical reactions,
the basis for transforming one substance into
another (energy, materials, foods, colors, ...)
[Figure: Potential Energy Surface (PES) over coordinates θ1 and θ2, with energy on the vertical axis: stable state 1 → transition state → stable state 2]
Outcomes we see are just discrete and combinatorial.
They are transitions from a valley to a hill
to another valley, but solving a quantum
chemical equation is needed at every
point to get the energy surface.
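The valley-hill-valley picture can be sketched with a toy example. The double-well function below is a hypothetical stand-in chosen only for illustration; it is not a quantum-chemical calculation, and the names `toy_energy` and `find_stationary_states` are assumptions of this sketch. The point it tries to show is that the energy has to be evaluated at every grid point before the valleys and the hill between them can be located.

```python
import numpy as np

def toy_energy(x):
    """Hypothetical 1-D double-well 'energy surface' (not a real chemical system)."""
    return x**4 - 2.0 * x**2

def find_stationary_states(energy, lo=-2.0, hi=2.0, n=4001):
    """Scan a coordinate grid and locate two valleys and the hill between them."""
    xs = np.linspace(lo, hi, n)
    # In real chemistry, each of these evaluations is an expensive calculation.
    es = energy(xs)
    left = xs < 0
    x_min1 = xs[left][np.argmin(es[left])]      # valley on the left half
    x_min2 = xs[~left][np.argmin(es[~left])]    # valley on the right half
    between = (xs >= x_min1) & (xs <= x_min2)
    x_ts = xs[between][np.argmax(es[between])]  # highest point between the valleys
    return x_min1, x_ts, x_min2

state1, ts, state2 = find_stationary_states(toy_energy)
barrier = toy_energy(ts) - toy_energy(state1)
print(f"stable state 1 at x={state1:.3f}, transition state at x={ts:.3f}, "
      f"stable state 2 at x={state2:.3f}")
print(f"barrier height from state 1: {barrier:.3f}")
```

With a one-dimensional toy function a grid scan is cheap; for a real system the number of coordinates and the cost per point grow quickly, which is the difficulty the slide points at.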
20. But chemistry still remains quite empirical. Why?
Even though current chemical calculations are fairly accurate,
chemists still rely heavily on "Edisonian empiricism" (trial and error):
1. empirical knowledge (from the literature) and its flexible transfer
2. intuitions and experiences (unverbalizable common sense of experts)
Takeaway: We also need 'data-driven' bridges!
First principles are not enough for us to throw away this empirical
knowledge; data-driven approaches (ML) play a complementary role!
Fusing multi-disciplinary methods...?
1. Theory-driven: (slow but) high-fidelity quantum-chemical simulations
2. Knowledge-driven: explicit knowns and inference
3. Data-driven: empirical prediction grounded upon multifaceted data
21. Indeed computational chemistry has many limitations...
Chemical reactions = recombinations of atoms and chemical bonds
subjected to the laws of nature
• Intractably large chemical space: An intractably large number of
"theoretically possible" candidates for reactions and compounds...
• Computing time and scalability issues: Simulating an Avogadro-constant
number of atoms is utterly infeasible... (we need some compromise here)
• Complexity and uncertainty of real-world systems:
Many uncertain factors and arbitrary parameters are involved...
• Known and unknown imperfections of currently established theories:
Current theoretical calculations have many exceptions and limitations...
22. Yet another approach: Data-driven
[Diagram: cause-and-effect relationship. Related factors and their states [Inputs] → some mechanism (governing equation?) → outcome of reactions [Outputs]]
Theory-driven methods try to explicitly model the inner workings
of a target phenomenon (e.g. through first-principles simulations).
Data-driven methods try to precisely approximate its outer behavior
(the input-output relationship) observable as "data"
(e.g. through machine learning from a large collection of data).
The two are based on very different principles and are quite complementary!
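The contrast between the two approaches can be sketched in a few lines. As an assumed illustration, the theory-driven side uses the Arrhenius rate law with hypothetical values for the prefactor `A` and activation energy `Ea`, and the data-driven side fits a cubic polynomial to noisy (temperature, log rate) pairs without being told the law. The "observations" here are simulated from the same law plus noise, purely so the sketch is self-contained.

```python
import numpy as np

R = 8.314  # gas constant in J/(mol K)

def theory_driven_rate(T, A=1.0e13, Ea=8.0e4):
    """Explicit model of the mechanism: Arrhenius law (A and Ea are hypothetical)."""
    return A * np.exp(-Ea / (R * T))

# Simulated "observed data": inputs T, outputs log k with measurement noise.
rng = np.random.default_rng(0)
T_train = np.linspace(300.0, 400.0, 30)
log_k_train = np.log(theory_driven_rate(T_train)) + rng.normal(0.0, 0.05, T_train.size)

# Data-driven model: a cubic polynomial in (T - 350) fitted to log k.
# It only approximates the input-output relationship; it knows nothing about Arrhenius.
coeffs = np.polyfit(T_train - 350.0, log_k_train, deg=3)

def data_driven_rate(T):
    return np.exp(np.polyval(coeffs, np.asarray(T) - 350.0))

# Compare both models at temperatures inside the range covered by the data.
T_test = np.array([325.0, 375.0])
rel_err = np.abs(data_driven_rate(T_test) / theory_driven_rate(T_test) - 1.0)
print("relative difference inside the data range:", rel_err)
```

Inside the sampled temperature range the fitted polynomial tracks the explicit model closely, even though it encodes no mechanism. Outside that range there is no such guarantee.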
23. Machine Learning (ML)
A new style of programming:
a technique to reproduce a transformation process (or function) whose
underlying principle is unclear and hard to model explicitly,
just by giving a lot of input-output examples.
• Generic object recognition
• Speech recognition (e.g. audio → “ありがとう”, "thank you")
• Machine translation (e.g. "J’aime la musique" → "I love music")
• QSAR/QSPR prediction (e.g. molecular structure → growth inhibition 0.739)
• AI game players
24. How ML works: fitting a function to data
Inputs $x$ → ML model $f(x; \theta)$ → Outputs $y$
A function $f(x; \theta)$ with parameters $\theta$, best fitted to
a given set of example input-output pairs
(the training data) $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
[Figure: the fitted function $f(x; \theta)$ over the training data in a (high-dimensional) input space, with regions of interpolative prediction and extrapolative prediction]
[Figure: "The bias-variance tradeoff" along model complexity from low to high: underfitting (high bias, low variance) versus overfitting (low bias, high variance)]
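A small sketch of fitting $f(x; \theta)$ to training pairs, assuming polynomials as the model family and a noisy sine curve as a hypothetical ground truth (neither comes from the slides). It compares a low, a moderate, and a high polynomial degree, and measures the error separately inside the training range (interpolation) and outside it (extrapolation).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_function(x):
    """Hypothetical ground truth, unknown to the learner."""
    return np.sin(2.0 * np.pi * x)

# Training data (x_1, y_1), ..., (x_n, y_n) with observation noise.
x_train = np.sort(rng.uniform(0.0, 1.0, 20))
y_train = true_function(x_train) + rng.normal(0.0, 0.2, x_train.size)

x_interp = np.linspace(x_train.min(), x_train.max(), 200)  # inside the training range
x_extrap = np.linspace(1.2, 1.5, 50)                        # outside the training range

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

results = {}
for degree in (1, 3, 12):
    # Least-squares fit of the parameters theta of a degree-`degree` polynomial.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (
        rmse(model(x_train), y_train),                  # error on the training data
        rmse(model(x_interp), true_function(x_interp)), # interpolative prediction
        rmse(model(x_extrap), true_function(x_extrap)), # extrapolative prediction
    )
    print(f"degree {degree:2d}: train={results[degree][0]:.3f} "
          f"interp={results[degree][1]:.3f} extrap={results[degree][2]:.3f}")
```

The training error can only go down as the degree grows, which is why it is a poor guide to model quality on its own. The straight line underfits, and even a model that interpolates well can be far off once it is asked to predict outside the region covered by the training data.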
25. How ML works: fitting a function to data
Inputs $x$ → ML model $f(x; \theta)$ → Outputs $y$
A function $f(x; \theta)$ with parameters $\theta$, best fitted to
a given set of example input-output pairs
(the training data) $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
x<latexit sha1_base64="BkOuic6isW1cYY2ZNWiUOdOU/tM=">AAACq3ichVFNLwNRFD3G92eLjcRGNMRG8wYJsWrYWJbqR2jTzIxXHvOVmdcGTf+AlZ1gRWIhfoaNP2DhJ4gliY2FO9NJBFF3MvPOO/ee+86bq7um8CVjT21Ke0dnV3dPb1//wOBQLD48kvOdqmfwrOGYjlfQNZ+bwuZZKaTJC67HNUs3eV4/WA3y+Rr3fOHYm/LI5SVL27VFRRiaJCpX1K36YaMcT7AkC2PiN1AjkEAUaSf+gCJ24MBAFRY4bEjCJjT49GxDBYNLXAl14jxCIsxzNNBH2ipVcarQiD2g7y7ttiPWpn3Q0w/VBp1i0uuRcgJT7JHdslf2wO7YM/v4s1c97BF4OaJVb2q5W46djGXe/1VZtErsfalaepaoYCn0Ksi7GzLBLYymvnZ89ppZ3piqT7Nr9kL+r9gTu6cb2LU342adb1y28KOTl7//WJCPKmiE6s+B/Qa5uaQ6n5xbX0ikVqJh9mAck5ihiS0ihTWkkaUT9nGKc1wos0pG2VKKzVKlLdKM4lso/BN9vpmr</latexit>
y<latexit sha1_base64="getvLqfzl+lmP3jVELri0P4Sr2g=">AAACq3ichVG7SgNBFD1ZX/EdtRFsxBCxMUxUUKxEG0tNzAMTkd11Ekf3xe4kEEN+wMpO1ErBQvwMG3/Awk8Qywg2Ft7dLIgG9S67c+bce+6c2as5hvAkY88Rpau7p7cv2j8wODQ8MhobG895dtXVeVa3DdstaKrHDWHxrBTS4AXH5aqpGTyvHW/4+XyNu56wrR1Zd/ieqVYsURa6KonKlTSzUW/ux+IsyYKY7gSpEMQRxpYde0QJB7ChowoTHBYkYQMqPHqKSIHBIW4PDeJcQiLIczQxQNoqVXGqUIk9pm+FdsWQtWjv9/QCtU6nGPS6pJxGgj2xO9Zij+yevbCPX3s1gh6+lzqtWlvLnf3R08nM+78qk1aJwy/Vn54lylgJvAry7gSMfwu9ra+dnLcyq+lEY5bdsFfyf82e2QPdwKq96bfbPH31hx+NvPz+x/x8WEEjTP0cWCfILSRTi8mF7aX42no4zCimMIM5mtgy1rCJLWTphCOc4QKXyrySUXaVUrtUiYSaCXwLhX8CgAGZrA==</latexit>
interpolative
prediction
(High-dimensional)
Model Complexity: Low → High
Underfitting
(High bias, Low variance)
Overfitting
(Low bias, High variance)
"The bias-variance tradeoff"
The training data
f(x; θ)
extrapolative
prediction
e.g.) Deep learning usually comes with overparametrization,
and the number of parameters can be tens of millions
or even billions... (This is why model-complexity control or
regularization is so critical and essential in ML.)
This can often be ultra-high-dimensional in modern ML
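The tradeoff between underfitting and overfitting, and the gap between interpolative and extrapolative prediction, can be sketched with a toy experiment. Everything here is an illustrative assumption rather than the setup behind the slide: the target function sin(πx), the noise level, the polynomial model family, and the helper names fit_polynomial, predict, and mse.

```python
import math
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (A^T A) w = A^T y."""
    m = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y] for the monomial basis.
    aug = []
    for i in range(m):
        row = [sum(x ** (i + j) for x in xs) for j in range(m)]
        row.append(sum((x ** i) * y for x, y in zip(xs, ys)))
        aug.append(row)
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    w = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(aug[i][j] * w[j] for j in range(i + 1, m))
        w[i] = (aug[i][m] - tail) / aug[i][i]
    return w


def predict(w, x):
    return sum(c * x ** i for i, c in enumerate(w))


def mse(w, xs, ys):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def target(x):
    # Assumed ground truth, chosen only for illustration.
    return math.sin(math.pi * x)


rng = random.Random(0)
x_train = [-1.0 + 2.0 * i / 11 for i in range(12)]          # training inputs in [-1, 1]
y_train = [target(x) + rng.gauss(0.0, 0.2) for x in x_train]
x_inside = [rng.uniform(-1.0, 1.0) for _ in range(200)]     # interpolative queries
x_outside = [rng.uniform(1.0, 1.5) for _ in range(200)]     # extrapolative queries

errors = {}
for degree in (1, 3, 9):  # low, moderate, and high model complexity
    w = fit_polynomial(x_train, y_train, degree)
    errors[degree] = {
        "train": mse(w, x_train, y_train),
        "interpolation": mse(w, x_inside, [target(x) for x in x_inside]),
        "extrapolation": mse(w, x_outside, [target(x) for x in x_outside]),
    }
    print(degree, errors[degree])
```

The training error can only go down as the degree grows, because each model contains the previous one. The errors on held-out points, especially those outside the training range, are the quantities to watch when judging model complexity.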
26. Multilevel representations of chemical reactions
SMILES: Brc1cncc(Br)c1 C[O-] CN(C)C=O Na+ COc1cncc(Br)c1
Structural Formula
Steric Structures
Electronic States
Reactants Reagents Products
As pattern languages (e.g. known facts in textbooks/databases)
As physical entities (e.g. quantum chemical calculations)
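As a minimal sketch of the "pattern language" level, a reaction can be stored as a reaction SMILES string of the form reactants>reagents>products, with "." separating molecules. The Reaction class and parse_reaction_smiles helper are hypothetical names introduced here. The grouping of the slide's molecules into reactants and reagents is an assumption, and Na+ is written as [Na+] because SMILES requires brackets around charged atoms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    reactants: tuple
    reagents: tuple
    products: tuple


def parse_reaction_smiles(rxn):
    """Split 'reactants>reagents>products' into its three molecule lists."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected the form reactants>reagents>products")

    def split_molecules(field):
        # '.' separates individual molecules inside one field.
        return tuple(s for s in field.split(".") if s)

    return Reaction(*(split_molecules(p) for p in parts))


# Role assignment below is assumed for illustration only.
example = parse_reaction_smiles(
    "Brc1cncc(Br)c1.C[O-]>CN(C)C=O.[Na+]>COc1cncc(Br)c1"
)
print(example.reactants)
print(example.reagents)
print(example.products)
```

This only handles the string level. Structural formulas, steric structures, and electronic states would need a cheminformatics toolkit or quantum chemical calculations on top of it.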
27. Computer-assisted synthetic planning
Corey+ 1972
J Am Chem Soc (JACS), 94(2), 440, 1972.
Computer-Assisted Synthetic Analysis for Complex Molecules.
Methods and Procedures for Machine Generation of
Synthetic Intermediates
E. J. Corey,* Richard D. Cramer III, and W. Jeffrey Howe
Contribution from the Department of Chemistry, Harvard University,
Cambridge, Massachusetts 02138. Received January 30, 1971
Abstract: A classification of synthetic reactions is outlined which is suitable for use in a machine program to
generate a tree of synthetic intermediates starting from a given target molecule. The generation of a particular
intermediate by the program involves the search of appropriate data tables of synthetic processes, the search being
driven by the information obtained by machine perception of the parent structure and certain basic strategies.
Procedures have been developed for the evaluation of chemical interconversions which allow the effective exclusion
of invalid or naive structures. The paper provides a view of the status of computer-assisted synthetic problem
solving as of 1970.
The communication of chemical structural information to and from a digital computer by graphical
methods has been discussed in detail in a foregoing paper,1 as has the machine representation and
perception of key features within structures,2 as for example, functional groups and rings. This
paper is concerned with the ways in which the structural information made available by the
perception process can be utilized to generate a tree of chemical structures3 which represent
possible synthetic intermediates for the construction of a complex target molecule. More
specifically, the following topics will be treated: (1) classification of
and necessary control strategies, and also for eventual inclusion of a fairly complete collection
of families. In the discussion which follows, the degree of implementation of each area of study
will be cited.
A variety of rational schemes for creating families of synthetic reactions already exists.
However, most of these depend on properties of the reactants,5 and as such they are irrelevant to
a computer program which analyzes the features of a target or product molecule in order to
generate appropriate starting materials. One very general treatment of synthetic reactions
involves classification on the basis of the structural
Computer-Assisted Solution of Chemical Problems-
The Historical Development and the Present State of the Art
of a New Discipline of Chemistry
By Ivar Ugi,* Johannes Bauer, Klemens Bley, Alf Dengler, Andreas Dietz,
Eric Fontain, Bernhard Gruber, Rainer Herges, Michael Knauer, Klaus Reitsam,
and Natalie Stein
Dedicated to Professor Karl-Heinz Büchel
The topic of this article is the development and the present state of the art of computer
chemistry, the computer-assisted solution of chemical problems. Initially the problems in
computer chemistry were confined to structure elucidation on the basis of spectroscopic data,
then programs for synthesis design based on libraries of reaction data for relatively narrow
classes of target compounds were developed, and now computer programs for the solution of
a great variety of chemical problems are available or are under development. Previously it was
an achievement when any solution of a chemical problem could be generated by computer
assistance. Today, the main task is the efficient, transparent, and non-arbitrary selection of
meaningful results from the immense set of potential solutions--that also may contain innovative
proposals. Chemistry has two aspects, constitutional chemistry and stereochemistry,
which are interrelated, but still require different approaches. As a result, about twenty years
ago, an algebraic model of the logical structure of chemistry was presented that consisted of
two parts: the constitution-oriented algebra of be- and r-matrices, and the theory of the
Ugi+ 1993
Angew Chem Int Ed Engl. 32, 202-227, 1993.
A traditional chemoinformatics topic since the 1970s:
collect known chemical reactions, and search for reaction paths
e.g.) Representation, classification, exploration of chemical reactions
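The idea of collecting known reactions and searching backward from a target for a reaction path can be sketched as a depth-limited tree search. The reaction table, the set of available starting materials, and all molecule names below are hypothetical placeholders. This is not the procedure of Corey+ or Ugi+, only a toy illustration of searching a tree of synthetic intermediates.

```python
# Hypothetical "reaction table": product -> alternative sets of precursors.
REACTION_TABLE = {
    "target": [("intermediate_A", "reagent_X"), ("intermediate_B",)],
    "intermediate_A": [("start_1", "start_2")],
    "intermediate_B": [("unavailable_C",)],
}

# Hypothetical set of purchasable starting materials.
AVAILABLE = {"start_1", "start_2", "reagent_X"}


def find_route(molecule, max_depth=5):
    """Return a list of (product, precursors) steps, or None if no route is found."""
    if molecule in AVAILABLE:
        return []          # nothing to synthesize
    if max_depth == 0:
        return None        # give up on overly long routes
    for precursors in REACTION_TABLE.get(molecule, []):
        steps = []
        for precursor in precursors:
            sub_route = find_route(precursor, max_depth - 1)
            if sub_route is None:
                break      # this disconnection fails; try the next one
            steps.extend(sub_route)
        else:
            # Every precursor is reachable, so record this step last.
            return steps + [(molecule, precursors)]
    return None


route = find_route("target")
for product, precursors in route:
    print(" + ".join(precursors), "->", product)
```

Real systems differ mainly in how the table is built (hand-coded rules, mined databases, or learned models) and in how the search is pruned and ranked.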
29. Purely ML-based approaches and more
ML-based chemical reaction predictions
Fermionic Neural Network
Pfau+ Ab-Initio Solution of the Many-Electron Schrödinger Equation with Deep Neural Networks.
arXiv:1909.02487, Sep 2019.
Hamiltonian Graph Networks with ODE Integrators
Sanchez-Gonzalez+ Hamiltonian Graph Networks with ODE Integrators.
arXiv:1909.12790, Sep 2019.
Both from
ML + First-principles simulations
3N-MCTS/AlphaChem
Segler+ Nature 2018
Molecular Transformer
Schwaller+ ACS Cent Sci 2019
seq2seq
Liu+ ACS Cent Sci 2017
WLDN
Jin+ NeurIPS 2017
ELECTRO
Bradshaw+ ICLR 2019
WLN
Coley+ Chem Sci 2019
GPTN
Do+ KDD 2019
Graph Neural Networks Sequence Neural Networks Combined or Other
IBM RXN
Schwaller+ Chem Sci 2018
Molecule Chef
Bradshaw+ DeepGenStruct (ICLR WS) 2019
Neural-Symbolic ML
Segler+ Chemistry 2017
Similarity-based
Coley+ ACS Cent Sci 2017
GLN
Dai+ NeurIPS 2019
Transformer
Karpov+ ICANN 2019
30. Similar technologies go for biology!
DeepMind's AlphaFold paper
A successful example from MIT MLPDS
31. How to integrate data-driven and theory-driven?
• Theory-driven: first-principles simulations, logical reasoning,
mathematical models, etc.
• Data-driven: machine learning
These are still emerging topics,
but examples of approaches already available include:
1. Wrapper: Use ML to control or plan simulations
Basically we solve the problem by simulations, but use ML models to
ask "what if?" questions that guide which simulations to run.
(Model-based optimization, sequential design of experiments,
reinforcement learning, generative methods, etc.)
2. Hybrid: Use ML as an approximator for uncertain parts of simulations
Plug ML models into the uncertain parts of simulations, or
call simulations when ML predictions are less confident.
(Data assimilation, domain adaptation, semi-empirical methods, etc.)
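The "call simulations when ML predictions are less confident" pattern can be sketched as follows. All of it is assumed for illustration: expensive_simulation is a stand-in function, the surrogate is a plain 1-nearest-neighbor lookup, and "confidence" is approximated by the distance to the nearest stored sample (the trust_radius parameter).

```python
import math


def expensive_simulation(x):
    # Stand-in for a costly first-principles calculation.
    return math.sin(3.0 * x) + 0.5 * x


class HybridPredictor:
    """Answer from stored results when a close one exists; otherwise simulate."""

    def __init__(self, trust_radius=0.1):
        self.trust_radius = trust_radius
        self.data = []              # (x, y) pairs obtained from simulations
        self.simulation_calls = 0

    def predict(self, x):
        if self.data:
            nearest_x, nearest_y = min(self.data, key=lambda p: abs(p[0] - x))
            if abs(nearest_x - x) <= self.trust_radius:
                return nearest_y    # confident enough: use the cheap surrogate
        # Not confident: fall back to the simulation and remember the result.
        y = expensive_simulation(x)
        self.simulation_calls += 1
        self.data.append((x, y))
        return y


hybrid = HybridPredictor(trust_radius=0.1)
first = hybrid.predict(0.0)     # no data yet, so the simulation runs
second = hybrid.predict(0.05)   # close to a stored sample, surrogate answers
third = hybrid.predict(1.0)     # far from all samples, simulation runs again
print(first, second, third, hybrid.simulation_calls)
```

In practice the surrogate would be a trained model with a proper uncertainty estimate, but the control flow stays the same.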
32. Back to our example: Heterogeneous Catalysis
Wolfgang Pauli
“God made the bulk;
the surface was invented by the devil.”
adsorption
diffusion
desorption
dissociation
recombination
kinks
terraces
adatom
vacancy
steps
Gas-Phase
(Reactants)
Solid-Phase
(Catalysts)
Many hard-to-quantify intertwined factors involved.
Too complicated (impossible?) to model everything...
• multiple elementary reaction processes
• composition, support, surface termination, particle size, particle
morphology, atomic coordination environment
• reaction conditions
Notoriously complex surface reactions between different phases.
33. Our ML-based case studies
1. Can we predict the d-band center?
predicting DFT-calculated values by machine learning
(Takigawa et al, RSC Advances, 2016)
2. Can we predict the adsorption energy?
predicting DFT-calculated values by machine learning
(Toyao et al, JPCC, 2018)
3. Can we predict the catalytic activity?
predicting values from experiments reported in the
literature by machine learning
(Suzuki et al, ChemCatChem, 2019)
34. Problem: Very strong "selection bias" in existing datasets
Catalyst research has relied heavily on prior published data,
which is strongly biased toward catalyst compositions that were successful.
One of the big problems we had
Example) Oxidative coupling of methane (OCM)
• 1868 catalysts in the original dataset [Zavyalova+ 2011]
• Composed of 68 different elements: 61 cations and 7 anions
(Cl, F, Br, B, S, C, and P) excluding oxygen
• Occurrences of only a few elements such as La, Ba, Sr, Cl,
Mn, and F are very high.
• Widely used elements such as Li, Mg, Na, Ca, and La are also
frequent in the data
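One simple way to make this kind of bias visible is to count how often each element occurs across the catalyst compositions. The compositions below are a made-up toy list, not the Zavyalova+ 2011 dataset, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical toy compositions (sets of elements), for illustration only.
catalysts = [
    {"La", "Sr"}, {"La", "Ba"}, {"Mn", "Na", "W"}, {"La", "Sr", "Ca"},
    {"Li", "Mg"}, {"La"}, {"Li", "Mg", "Cl"}, {"Ce", "Zr"},
]

# Number of catalysts in which each element appears.
counts = Counter(element for composition in catalysts for element in composition)

total = len(catalysts)
for element, n in counts.most_common():
    print(f"{element:>2}: appears in {n}/{total} catalysts")

# Elements seen only once are where a model has almost nothing to learn from.
rare = sorted(element for element, n in counts.items() if n == 1)
print("seen only once:", rare)
```

A heavily skewed count table is a warning that predictions for the rarely seen elements rest on very little evidence.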
36. An ML model is just representative of the training data
Highly Inaccurate Model Predictions
from Extrapolation (Lohninger 1999)
"Beware of the perils of extrapolation,
and understand that ML algorithms
build models that are representative of
the available training samples."
Simply focusing on targets that an ML model built on currently available data
predicts to be high-performing is clearly not a good idea...
It is like scrutinizing too closely whatever so-so candidates
we happened to stumble upon at a very early stage of research...
In reality, the distinction between
interpolation and extrapolation
is not that clear due to the
high dimensionality.
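Why high dimensionality blurs the line between interpolation and extrapolation can be illustrated with a small simulation. It asks how often a new point falls inside the axis-aligned bounding box of the training points. The uniform data, the sample sizes, and the function name are assumptions made for this sketch. The bounding box contains the convex hull, so the fraction of true "interpolation" queries is even smaller than what is printed.

```python
import random


def fraction_inside_training_box(n_train, dim, n_test=2000, seed=0):
    """Fraction of new uniform points lying within the per-axis min/max of the training set."""
    rng = random.Random(seed)
    train = [[rng.random() for _ in range(dim)] for _ in range(n_train)]
    lows = [min(p[j] for p in train) for j in range(dim)]
    highs = [max(p[j] for p in train) for j in range(dim)]
    inside = 0
    for _ in range(n_test):
        query = [rng.random() for _ in range(dim)]
        if all(lo <= v <= hi for v, lo, hi in zip(query, lows, highs)):
            inside += 1
    return inside / n_test


fraction_1d = fraction_inside_training_box(n_train=100, dim=1)
fraction_10d = fraction_inside_training_box(n_train=100, dim=10)
fraction_100d = fraction_inside_training_box(n_train=100, dim=100)
print(fraction_1d, fraction_10d, fraction_100d)
```

With 100 training points, almost every query is "inside" in one dimension, but in 100 dimensions most queries fall outside the training range along at least one axis, even though they come from the same distribution.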
37. No guarantee for data-driven predictions outside the given data
Cause-and-Effect
Relationship
Related factors
(and their states)
Outcome
Reactions
Some
mechanism
[Inputs] [Outputs]
Data-driven methods try to precisely approximate its outer behavior
(the input-output relationship) observable as "data".
(e.g. through machine learning from a large collection of data)
Keep in mind: Given data DEFINES the data-driven prediction!
"Current machine learning techniques are data-hungry and brittle
—they can only make sense of patterns they've seen before."
(Chollet, 2020)
40. Our Solution: Model-based optimization
Use ML to guide the balance between "exploitation" and "exploration"!
• Exploitation:
use what we already know and get something close to what we expect
• Exploration:
try something we aren't sure about and possibly learn more
Either through experiments or simulations,
we face a dilemma in choosing between options to learn new things.
ML prediction
(function fitting)
prediction variance
(uncertainty)
amount
we want to
maximize
max(best)
for now
very confident
(we have data)
not confident
(no data around)
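A model that reports both a prediction and its uncertainty can be sketched with Gaussian process regression. The slides do not say which model is behind the figure, so the Gaussian process, the RBF kernel, the length scale, the noise level, and the toy data are all assumptions chosen for illustration.

```python
import math


def rbf(a, b, length_scale=0.3):
    """Squared-exponential kernel with unit prior variance."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at a single query point."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    k_star = [rbf(a, x_query) for a in x_train]
    alpha = solve(K, y_train)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    variance = rbf(x_query, x_query) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, max(variance, 0.0)


x_train = [0.0, 0.2, 0.4]   # toy observations
y_train = [0.1, 0.5, 0.3]

mean_near, var_near = gp_posterior(x_train, y_train, 0.2)  # we have data here
mean_far, var_far = gp_posterior(x_train, y_train, 3.0)    # no data around
print("near data:", mean_near, var_near)
print("far away :", mean_far, var_far)
```

Near an observed point the variance collapses toward the noise level, while far from all data it returns to the prior variance, which is the "very confident" versus "not confident" contrast in the figure.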
42. Our Solution: Model-based optimization
Use ML to guide the balance between "exploitation" and "exploration"!
• Exploitation:
use what we already know and get something close to what we expect
• Exploration:
try something we aren't sure about and possibly learn more
Either through experiments or simulations,
we face a dilemma in choosing between options to learn new things.
e.g.
"expected
improvement"
ML prediction
(function fitting)
prediction variance
(uncertainty)
probing areas with no data or scarce data, relying on the available data
amount
we want to
maximize
max(best)
for now
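As a sketch of how "expected improvement" trades off the two, assume the model's prediction at a candidate is Gaussian with mean μ and standard deviation σ, and that f_best is the best value observed so far. Then EI = (μ − f_best) Φ(z) + σ φ(z) with z = (μ − f_best) / σ, where Φ and φ are the standard normal CDF and PDF. The candidate names and numbers below are made up for illustration.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected improvement for maximization under a Gaussian predictive distribution."""
    if std <= 0.0:
        # No uncertainty: improvement is certain or impossible.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_so_far) * cdf + std * pdf


best_so_far = 1.0  # "max (best) for now"

# Hypothetical candidates: (predicted mean, predicted standard deviation).
candidates = {
    "near_known_best": (0.9, 0.01),    # well explored, slightly below the best
    "unexplored_region": (0.5, 1.0),   # lower prediction, but very uncertain
}

scores = {
    name: expected_improvement(mean, std, best_so_far)
    for name, (mean, std) in candidates.items()
}
next_choice = max(scores, key=scores.get)
print(scores)
print("next to test:", next_choice)
```

Here the uncertain candidate wins even though its predicted mean is lower, because only it has a realistic chance of beating the current best. That is the exploration side of the balance.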
43. Model-based reinforcement learning
AlphaGo (Nature, Jan 2016)
AlphaGo Zero (Nature, Oct 2017)
AlphaZero (Science, Dec 2018)
MuZero (arXiv, Nov 2019)
AutoML (Use ML for tuning ML)
• Algorithm Configuration
• Hyperparameter Optimization (HPO)
• Neural Architecture Search (NAS)
• Meta Learning / Learning to Learn
Amazon SageMaker
Key: model-based and exploitation-exploration balance
44. Facts from experiments and calculations
Rationalize & Accelerate Chemical Design and Discovery
In-House data + Public data + Knowledge base
(and their quality control & annotations)
Hypothesis generation (Machine learning, Data mining)
• Planning what to test in the next experiments or simulations
• Surrogates for expensive or time-consuming experiments or simulations
• Optimize uncertain factors or conditions
• Multilevel information fusion
Validation (Experiments and/or Simulations)
• Highly reproducible experiments with high accuracies and speeds
• Acceleration with ML-based surrogates for time-consuming subproblems
• Simulating many 'what-if' situations
Key: Effective use of data with the help of data-driven techniques
46. Next expanded to materials science
Little human intervention for highly
reproducible large-scale production lines
Automation, monitoring with IoT, and big-data
management are also the key to manufacturing.
Now this focus has shifted to the R&D phases
(traditionally very experimental and empirical).
49. Summary
• Model-based learning, Neuro-symbolic or ML-GOFAI integration.
We can partly use explicit models for well-understood parts for sample
efficiency and for filling the gap between correlation and causation
• Need for modeling the unverbalizable common sense of domain experts
We need a good strategy for building a gigantic data collection for this,
as well as self-supervised learning and/or meta learning algorithms.
• Need for novel techniques for compositionality (combinatorial
generalization), out-of-distribution prediction (extrapolation), and
their flexible transfer
We need to somehow combine and flexibly transfer partial knowledge
to generate a new thing or deal with completely new situations.
"Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI
systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun)
The current "end-to-end" or "fully data-driven" strategy of ML is too
data-hungry. In many cases, we cannot get enough data because of
various practical restrictions (cost, time, ethics, privacy, etc.).
50. Reviewer: 1
I don't usually recommend that papers should be accepted "as is", but in this case I don't see
the need for changes. This review should be accepted and published in ACS Catalysis. ... I will
certainly recommend it to my group and my students when it is published.
Reviewer: 2
The manuscript gives an excellent overview of the field of machine learning especially with regard to
heterogeneous catalysis and I would highly recommend the article for the publication in ACS
Catalysis.
Reviewer: 3
This is one of the best reviews for catalyst informatics that the Reviewer has read. In particular,
the chapter 2 delivers a very good tutorial, which is concisely and professionally written.
Chapter 2 is the general user's guide of ML for natural sciences.
ACS Catalysis, 2020; 10: 2260-2297.