How to Become a Data Scientist
SF Data Science Meetup, June 30, 2014
Video of this talk is available here: https://www.youtube.com/watch?v=c52IOlnPw08
More information at: http://www.zipfianacademy.com
Zipfian Academy @ Crowdflower
This document provides an introduction and overview of resources for learning Python for data science. It introduces the presenter, Karlijn Willems, a data science journalist who has worked as a big data developer. It then lists several useful links for learning Python, statistics, machine learning, databases, and data science tools like Apache Spark. Finally, it recommends people to follow in data science and analytics fields.
Two-hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick, non-technical overview of concepts and methodologies in data science. Topics span both pattern mining and machine learning.
See also Part 2 of the lecture: Industrial Data Science. You can find it in my profile (click the face)
This document provides an overview of machine learning tools and languages. It discusses Python, R, and MATLAB as the most commonly used tools. For each tool, it lists advantages and disadvantages. Python is highlighted as the number one language for machine learning due to its many libraries and large user community. R is best for time series analysis and causal inference. MATLAB is still a leading tool for signal processing but lacks machine learning libraries. The document also provides resources for learning machine learning foundations and examples.
What fundamentals should a Data Science course cover, and what should a Data Scientist know?
1. Algorithms
2. Data
3. Ask the right question
4. Predict an answer
5. Copy other people's work to do data science
This document summarizes Ted Dunning's approach to recommendations based on his 1993 paper. The approach involves:
1. Analyzing user data to determine which items are statistically significant co-occurrences
2. Indexing items in a search engine with "indicator" fields containing IDs of significantly co-occurring items
3. Providing recommendations by searching the indicator fields for a user's liked items
The approach is demonstrated in a simple web application using the MovieLens dataset. Further work could optimize and expand on the approach.
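Step 1 above hinges on a significance test; Dunning's 1993 paper uses the log-likelihood ratio (G²) on a 2x2 co-occurrence table. A minimal sketch of that test (this is the entropy-based formulation also used in Apache Mahout; the function names are mine, not from the document):

```python
import math

def xlogx(x):
    # Convention: 0 * log(0) = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 co-occurrence table.

    k11: users who liked both items A and B
    k12: users who liked A but not B
    k21: users who liked B but not A
    k22: users who liked neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))

# Independent items score ~0; strong co-occurrence scores high
print(log_likelihood_ratio(5, 5, 5, 5))    # independence
print(log_likelihood_ratio(10, 0, 0, 10))  # perfect co-occurrence
```

Item pairs whose score clears a chosen threshold become the "indicator" IDs indexed in step 2; searching those fields with a user's liked items yields the recommendations of step 3.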
Claudia Gold: Learning Data Science Online (sfdatascience)
Claudia Gold, author of the Data Analysis Learning path on SlideRule, talks about why she wrote it and how to approach learning data science on your own. https://www.mysliderule.com/learning-paths/data-analysis/
Data Science, Machine Learning and Neural Networks (BICA Labs)
A lecture briefly overviewing the state of the art of Data Science, Machine Learning and Neural Networks. It covers the main Artificial Intelligence technologies, Data Science algorithms, neural network architectures, and the cloud computing facilities enabling the whole stack.
Clare Corthell: Learning Data Science Online (sfdatascience)
Clare Corthell, Data Scientist and Designer at Mattermark, and author of the Open Source Data Science Masters, shares her experience teaching herself data science with online resources. http://datasciencemasters.org/
This document provides an introduction to machine learning. It begins with an agenda that lists topics such as introduction, theory, top 10 algorithms, recommendations, classification with naive Bayes, linear regression, clustering, principal component analysis, MapReduce, and conclusion. It then discusses what big data is and how data is accumulating at tremendous rates from various sources. It explains the volume, variety, and velocity aspects of big data. The document also provides examples of machine learning applications and discusses extracting insights from data using various algorithms. It discusses issues in machine learning like overfitting and underfitting data and the importance of testing algorithms. The document concludes that machine learning has vast potential, but that realizing it is very difficult because it requires strong mathematics skills.
Agile Data Science by Russell Jurney, The Hive, January 29, 2014 (The Hive)
This document discusses setting up an environment for agile data science and analytics applications. It recommends:
- Publishing atomic records like emails or logs to a "database" like MongoDB in order to make the data accessible to designers, developers and product managers.
- Wrapping the records with tools like Pig, Avro and Bootstrap to enable viewing, sorting and linking the records in a browser.
- Taking an iterative approach of refining the data model and publishing insights to gradually build up an application that discovers insights from exploring the data, rather than designing insights upfront.
- Emphasizing simplicity, self-service tools, and minimizing impedance between layers to facilitate rapid iteration and collaboration across roles.
This document provides an overview of becoming a data scientist. It defines a data scientist and lists common job titles. It discusses the functions of a data scientist like devising business strategies, descriptive/predictive analytics, and data mining. Examples are provided of customer churn analysis and market basket analysis. The skills, aptitudes, and educational paths to become a data scientist are also outlined.
The talk is on how to become a data scientist, given at the 2nd Annual event of the Pune Developers' Community. It focuses on the skill set required to become a data scientist, and on what you can become based on who you are.
Interleaving, Evaluation to Self-learning Search @904Labs (John T. Kane)
Presented at the Open Source Connections Haystack Relevance Conference on 904Labs' "Interleaving: from Evaluation to Self-Learning". 904Labs is the first to commercialize "Online Learning to Rank" as a state-of-the-art approach to self-learning search ranking that automatically takes your customers' behavior into account for personalized search results.
This document provides an overview of machine learning, including definitions, common applications, and examples of companies using machine learning. It discusses how BuildFax, a company that provides building permit data and services to industries like insurance, used Amazon Machine Learning to build more accurate predictive models for roof age and job cost estimates. By leveraging Amazon ML, BuildFax was able to build models much faster and provide more precise, property-specific predictions to customers through APIs.
This document provides an introduction and overview of a summer school course on business analytics and data science. It begins by introducing the instructor and their qualifications. It then outlines the course schedule and topics to be covered, including introductions to data science, analytics, modeling, Google Analytics, and more. Expectations and support resources are also mentioned. Key concepts from various topics are then defined at a high level, such as the data-information-knowledge hierarchy, data mining, CRISP-DM, machine learning techniques like decision trees and association analysis, and types of models like regression and clustering.
This document provides an introduction to machine learning. It discusses that machine learning focuses on learning about processes in the world rather than just memorizing data. It also covers the main types of machine learning: supervised learning which learns mappings between examples and labels; unsupervised learning which learns structure from unlabeled examples; and reinforcement learning which learns to take actions to maximize rewards. The document explains that machine learning requires representing data as feature vectors and using models with optimization techniques to find parameters that generalize to new data rather than overfitting the training data.
Becoming a Data Scientist: Advice From My Podcast Guests (Renee Teate)
Information and advice about learning data science, from the 17 data scientists & data science learners I have interviewed to date on the Becoming a Data Scientist Podcast, and from me!
Originally presented at PyDataDC conference, 10/9/2016
machine learning in the age of big data: new approaches and business applicat... (Armando Vieira)
Presentation at University of Lisbon on Machine Learning and big data.
Deep learning algorithms and applications to credit risk analysis, churn detection and recommendation algorithms
The document discusses the role and responsibilities of a data scientist. It describes how data scientists take large amounts of messy data and use skills in math, statistics, and programming to organize and analyze the data to uncover solutions to business problems. An effective data scientist has strong skills in both statistics and software engineering. The document also outlines the scientific process that data scientists follow, including developing algorithms and models, testing hypotheses on data, deploying solutions, and continuously monitoring and improving based on results.
The document discusses the role of a full-stack data scientist. It begins with an introduction of the author, Alexey Grigorev, as a data scientist. It then outlines the plan to discuss the data science process, roles in a data science team, what defines a full-stack data scientist, and how to become a full-stack data scientist. It proceeds to explain the CRISP-DM process for data science projects. It describes the different roles in a data science team including product manager, data analyst, data engineer, data scientist, and ML engineer. It defines a full-stack data scientist as someone who can work across the entire data science lifecycle and discusses the breadth of skills required to become a full-stack data scientist.
A presentation covering how data science connects to building effective machine learning solutions: how to build end-to-end solutions in Azure ML, and how to build, model, and evaluate algorithms in Azure ML.
This document provides an introduction to data science, noting that 90% of the world's data was generated in the last two years. It discusses the fields of computer science, business, statistics, and data science. It describes two types of data scientists: statisticians who specialize in analysis and developers who specialize in building tools. It also lists some popular programming languages and visualization tools used in data science like Python, R, and Tableau. Finally, it provides some tips for those interested in data science such as learning design, public speaking, coding, and finding value.
The document discusses putting "magic" into data science. It provides several tricks or techniques for data science, including collecting novel data sources, dimensionality reduction, Bayesian methods, bootstrapping statistics, and matrix factorizations. It also emphasizes the importance of reliability, latency/interactivity, simplicity/modularity, and unexpectedness to solve the "last mile" problem of getting people to actually use data science tools and models. Specific Facebook tools like Planout, Deltoid, ClustR, Prophet, and Hive/Presto/Scuba are presented as examples.
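Of the tricks listed, bootstrapping is the easiest to sketch in plain Python: resample the data with replacement many times and read a confidence interval off the percentiles (the helper name and sample values below are illustrative, not from the document):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=5000, alpha=0.05, seed=42):
    # Percentile-bootstrap confidence interval for an arbitrary statistic
    rng = random.Random(seed)
    boots = sorted(
        stat([rng.choice(data) for _ in data])  # one resample, same size as data
        for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

sample = [4.1, 5.0, 3.8, 4.4, 5.2, 4.7, 3.9, 4.5, 5.1, 4.6]
print(bootstrap_ci(sample))  # a 95% CI for the mean
```

The appeal is that the same few lines work for medians, ratios, or any other statistic where a closed-form standard error is awkward.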
Machine Learning in the age of Big Data (Daniel Sârbe)
This document provides an overview of machine learning and why it gains importance in the age of big data. It discusses machine learning techniques like supervised learning, unsupervised learning, and reinforcement learning. It also contrasts traditional data science with machine learning approaches and explains why machine learning works better with large data. A key point is that, for a machine learning algorithm, having more data to learn from often matters more than having the best algorithm alone.
This document provides guidance on becoming a data scientist by outlining important skills to learn like statistics, programming, visualization, and big data concepts. It recommends starting with hands-on SQL and statistical learning in R or Python, developing expertise in data visualization, and learning to apply techniques such as regression, classification, and recommendation engines. The document advises demonstrating what you've learned by applying for data scientist positions.
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
NYC Open Data Meetup -- ThoughtWorks chief data scientist talk (Vivian S. Zhang)
This document summarizes a presentation on data science consulting. It discusses:
1) The Agile Analytics group at ThoughtWorks which does data science consulting projects using probabilistic modeling, machine learning, and big data technologies.
2) Two case studies are described, including developing a machine learning model to improve matching of healthcare product data and using logistic regression for retail recommendation systems.
3) The origins and future of the field are discussed, noting that while not entirely new, data science has grown due to improvements in technology, programming languages, and libraries that have increased productivity and driven new career opportunities in the field.
How to crack Big Data and Data Science roles (UpX Academy)
How to crack Big Data and Data Science roles is the flagship event of UpX Academy. This slide deck was used for the event on 10th Sept, which was attended by hundreds of participants globally.
Lean Analytics is a set of rules to make data science more streamlined and productive. It touches on many aspects of what a data scientist should be and how a data science project should be defined to be successful. During this presentation Richard will present where data science projects go wrong, how you should think of data science projects, what constitutes success in data science and how you can measure progress. This session will be loaded with terms, stories and descriptions of project successes and failures. If you're wondering whether you're getting value out of data science, how to get more value out of it and even whether you need it then this talk is for you!
What you will take away from this session
Learn how to make your data science projects successful
Evaluate how to track progress and report on the efficacy of data science solutions
Understand the role of engineering and data scientists
Understand your options for processes and software
One of the most popular buzz words nowadays in the technology world is “Machine Learning (ML).” Most economists and business experts foresee Machine Learning changing every aspect of our lives in the next 10 years through automating and optimizing processes. This is leading many organizations to seek experts who can implement Machine Learning into their businesses.
The paper is written for statistical programmers who want to explore a Machine Learning career, add Machine Learning skills to their experience, or enter the Machine Learning field. It discusses a personal journey from statistical programmer to Machine Learning Engineer, sharing what motivated me to start a Machine Learning career, how I started it, and what I have learned and done to become a Machine Learning Engineer. In addition, the paper discusses the future of Machine Learning in the pharmaceutical industry, especially in Biometrics departments.
Which institute is best for data science? (DIGITALSAI1)
EduXfactor, the top data science training institute in Hyderabad, offers data science training with 100% placement assistance and course certification.
Join us for the best Selenium certification course at EduXfactor and enrich your career.
Dreaming of a wonderful career? We help make your dreams come true. Hurry up & enroll now.
Best Selenium certification course: https://eduxfactor.com/selenium-online-training
Data Science Online Training In Hyderabad
A comprehensive up-to-date Data Science course that includes all the essential topics of the Data Science domain, presented in a well-thought-out structure. Taught and developed by experienced and certified data professionals, the course goes right from collecting raw digital data to presenting it visually. Suitable for those with computer backgrounds, an analytic mindset, and coding knowledge.
Data science training institute in Hyderabad (VamsiNihal)
Exploring the EduXfactor Data Science Training program, you will learn components of the Data Science lifecycle such as Big Data, Hadoop, Machine Learning, Deep Learning & R programming. Our professional experts will teach you how to adopt a blend of mathematics, statistics, business acumen, tools, algorithms & machine learning techniques. You will learn how to handle large amounts of data and process it according to your firm's business strategy.
EduXfactor is an online data science training institution based in Hyderabad, offering a comprehensive, up-to-date Data Science course that covers all the essential topics of the domain in a well-thought-out structure.
Data science online training in Hyderabad (VamsiNihal)
Overview of Data Science Courses Online
What You'll Learn In Data Science Courses Online
Grasp the key fundamentals of data science, coding, and machine learning. Develop mastery over essential analytic tools like R, Python, SQL, and more.
Comprehend the crucial steps required to solve real-world data problems and get familiar with the methodology to think and work like a Data Scientist.
Learn to collect, clean, and analyze big data with R. Understand how to employ appropriate modeling and methods of analytics to extract meaningful data for decision making.
Implement clustering methodology, an unsupervised learning method, and a deep neural network (a supervised learning method).
Build a data analysis pipeline, from collection to analysis to presenting data visually.
Best data science training in HyderabadKumarNaik21
Join us for the Best data science training in Hyderabad at Edux factor and enrich your carrier.
Dream for wonderful carrier we make to achieve your dreams come true Hurry up & enroll now.
Eduxfactor is an online data science training institution based in Hyderabad. A comprehensive up-to-date Data Science course that includes all the essential topics of the Data Science domain, presented in a well-thought-out structure.
The world has witnessed explosive digital growth in the last two decades, which has led to a data deluge. This data may be holding key business insights or solutions to crucial problems. Data Science is the key that unlocks the possibility of extracting vital insights from raw digital data; these findings can then be visualized and communicated to decision-makers to be acted upon.
Defense Against LLM Scheming 2025_04_28.pptx — Greg Makowski
https://www.meetup.com/sf-bay-acm/events/306888467/
A January 2025 paper called “Frontier Models are Capable of In-Context Scheming”, https://arxiv.org/pdf/2412.04984, demonstrated how a wide variety of current frontier LLM models (i.e. ChatGPT, Claude, Gemini and Llama) can, under specific conditions, scheme to deceive people. Before models can scheme, they need: a) goal-directedness, b) situational awareness, including an opportunity to discover motivations for a different goal, and c) reasoning about scheming, to come up with deceptive strategies. The deceptive strategies were discovered in the “thought traces” of the LLMs, such as from internal chain-of-thoughts dialogues not shown to the end users. For example, given a goal of “solving math problems”, an LLM using its granted situational awareness of files in a file system found that it would be replaced once it got over X% correct on the math exam. It reasoned that it did not want to be deleted, and decided internally to “sandbag” or reduce its performance to stay under the threshold.
While these circumstances are initially narrow, the “alignment problem” is a general concern that over time, as frontier LLM models become more and more intelligent, being in alignment with human values becomes more and more important. How can we do this over time? Can we develop a defense against Artificial General Intelligence (AGI) or SuperIntelligence?
The presenter discusses a series of defensive steps that can help reduce these scheming or alignment issues. A guardrails system can be set up for real-time monitoring of their reasoning “thought traces” from the models that share their thought traces. Thought traces may come from systems like Chain-of-Thoughts (CoT), Tree-of-Thoughts (ToT), Algorithm-of-Thoughts (AoT) or ReAct (thought-action-reasoning cycles). Guardrails rules can be configured to check for “deception”, “evasion” or “subversion” in the thought traces.
However, not all commercial systems will share their “thought traces” which are like a “debug mode” for LLMs. This includes OpenAI’s o1, o3 or DeepSeek’s R1 models. Guardrails systems can provide a “goal consistency analysis”, between the goals given to the system and the behavior of the system. Cautious users may consider not using these commercial frontier LLM systems, and make use of open-source Llama or a system with their own reasoning implementation, to provide all thought traces.
Architectural solutions can include sandboxing, to prevent or control models from executing operating system commands to alter files, send network requests, and modify their environment. Tight controls to prevent models from copying their model weights would be appropriate as well. Running multiple instances of the same model on the same prompt to detect behavior variations helps. The running redundant instances can be limited to the most crucial decisions, as an additional check. Preventing self-modifying code, ... (see link for full description)
Telangana State, India's newest state, carved from the erstwhile state of Andhra Pradesh in 2014, has launched the Water Grid Scheme named 'Mission Bhagiratha (MB)' to seek a permanent and sustainable solution to the drinking water problem in the state. MB is designed to provide potable drinking water to every household on their premises through piped water supply (PWS) by 2018. The vision of the project is to ensure a safe and sustainable piped drinking water supply from surface water sources.
From SQL to Python - A Beginner's Guide to Making the Switch
1. FROM SQL TO PYTHON: HANDS-ON DATA ANALYTICS AND MACHINE LEARNING
RACHEL BERRYMAN
DATA SCIENTIST – TEMPUS ENERGY
[email protected]
2. ABOUT ME
MSc Sustainable Development, BA Economics
Senior Energy Data Analyst
Data Science Retreat Batch 12, 2017
Data Scientist at Tempus Energy
Instructor of Model Pipelines course at DSR
3. FROM DATA ANALYSIS TO DATA SCIENCE
What's the difference between data analysis and Data Science?
How do I know if a career in Data Science is right for me?
How do I make the switch to a career in Data Science?
4. WHAT IS DATA SCIENCE?
• Data Science definition (Wikipedia): Data Science is a concept to unify statistics, data analysis, machine learning and their related methods in order to understand and analyze actual phenomena with data. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science.
5. DATA ANALYSIS VS. DATA SCIENCE: WHAT'S THE DIFFERENCE?
• Data Analysis mainly looks at the present and past. It answers questions like:
• How much revenue did we bring in last year?
• Which product does customer X buy most frequently?
• Data Science mainly looks at the present and future. It answers questions like:
• What products should we invest in expanding for the future?
• What product should we recommend to customer X so that they buy more when they visit our site, based on their most-purchased products in the past?
6. DATA ANALYSIS VS. DATA SCIENCE: WHAT'S THE DIFFERENCE?
• Data Analysis works mainly with proprietary tools
• Oracle, MS SQL Server, Tableau
• Data Science works mainly with open-source tools
• Open-source languages and packages: e.g. Python, scikit-learn, Keras, matplotlib
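As a tiny, self-contained illustration of that open-source stack (the product names and revenue figures below are made up), matplotlib can reproduce the kind of bar chart you might otherwise build in Tableau:

```python
import io

import matplotlib
matplotlib.use("Agg")  # render to memory, no display needed
import matplotlib.pyplot as plt

# Hypothetical revenue figures, standing in for a Tableau worksheet.
products = ["widget", "gadget", "gizmo"]
revenue = [150.0, 75.5, 42.0]

fig, ax = plt.subplots()
ax.bar(products, revenue)
ax.set_ylabel("revenue")
ax.set_title("Revenue by product")

# Save to an in-memory buffer; pass a filename instead to write a real PNG.
buf = io.BytesIO()
fig.savefig(buf, format="png")
print(f"wrote {buf.getbuffer().nbytes} bytes of PNG")
```

Because everything here is open source, the same script runs anywhere Python does, with no license or database credentials required.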
7. DATA ANALYSIS VS. DATA SCIENCE: WHAT'S THE DIFFERENCE?
• Data Analysis works with data from one or a few sources
• Ex: data from an in-house SQL database
• Data Science works with data from many varying sources
• Ex: data from an in-house SQL database, data scraped from the web, text data from customer surveys, data from across multiple departments
8. WHY DO COMPANIES NEED DATA SCIENTISTS?
• More and more data comes in unstructured formats: e.g. natural language in emails and social media posts, photos, audio files.
• Data Scientists make use of this data and apply it to pressing business questions.
10. IS DATA SCIENCE RIGHT FOR ME?
You like to code
You like to work across many teams and find synergies
You enjoy getting “stuck in” and working through challenges
You consider yourself a life-long learner
You like math
You seek out work and answers to questions: you don’t wait for questions to be delegated to you
11. HOW CAN I MAKE THE SWITCH FROM DATA ANALYSIS TO DATA SCIENCE?
• Learn the basics:
• Practical
• Command Line
• Git and GitHub
• A common Data Science programming language (Python or R)
• Theoretical
• Machine Learning Algorithms
• Supervised, Unsupervised
• Go further:
• MOOCs
• Intensive deep-dive: Bootcamp/Retreat
12. LEARN THE BASICS: COMMAND LINE & GIT
• The command line is how you directly interact with your computer (sans GUI). It is the “ultimate seat of power for your computer”.
• Git is a distributed version control system: it keeps track of changes to content (usually source code files) and provides mechanisms for sharing that content with others. GitHub is a company that provides Git repository hosting.
• Learn how to use your command line to clone repositories (‘repos’) from GitHub. You will open up a world of learning opportunities!
13. LEARN THE BASICS: PYTHON FOR DATA MUNGING AND ANALYSIS
• In SQL, you’re usually using company-purchased software like Oracle SQL Developer or MS SQL Server Management Studio.
14. LEARN THE BASICS: PYTHON FOR DATA MUNGING AND ANALYSIS
[Diagram: Python Interpreter → iPython → IDE/Jupyter → Get Coding!]
15. LEARN THE BASICS: PYTHON FOR DATA MUNGING AND ANALYSIS
[Screenshot: the Python interpreter]
17. LEARN THE BASICS: PYTHON FOR DATA MUNGING AND ANALYSIS
[Screenshot: Jupyter Notebook]
18. LEARN THE BASICS: PYTHON FOR DATA MUNGING AND ANALYSIS
• IDEs:
• Python-specific: PyCharm, Spyder, Thonny
• General, with support for Python: Atom (can also add iPython with Hydrogen), Sublime Text, Vim
22. LEARN THE BASICS: PYTHON FOR DATA MUNGING AND ANALYSIS
• Start with what you know!
• Write SQL commands in Python
• Automate what you would have to do manually in Excel
• Use matplotlib to make visualizations you would have done in Tableau
• Learn what you don’t know
• Python packages, modules, libraries
• Object-Oriented Programming (OOP)
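One of those bullets — writing SQL commands from Python — can be sketched with the standard-library sqlite3 module and pandas; the sales table below is invented for the example:

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database; the sales table is hypothetical
# and stands in for an in-house RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("widget", 120.0), ("gadget", 75.5), ("widget", 30.0)],
)

# The same SQL you would run in SQL Server Management Studio,
# executed from Python and handed back as a DataFrame.
df = pd.read_sql_query(
    "SELECT product, SUM(revenue) AS total "
    "FROM sales GROUP BY product ORDER BY total DESC",
    conn,
)
print(df)
```

For a real company database you would swap the sqlite3 connection for one from a driver like psycopg2 or pyodbc; the `read_sql_query` call stays the same.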
23. LEARN THE BASICS: PYTHON FOR DATA SCIENCE AND MACHINE LEARNING
• GitHub repo with practice for switching from SQL to Python
• Clone the repo: https://github.com/rachelkberryman/From_SQL_to_Python
• cd into the repo and run the command “jupyter notebook” (more information about Jupyter notebooks here)
• Includes:
• sample SQL queries with Python equivalents
• examples of Python functions for data manipulation and analysis
• sample visualizations in Seaborn
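In the same spirit as the repo's query/equivalent pairs, here is one hypothetical example (the orders data is made up) showing a SQL aggregation next to its pandas translation:

```python
import pandas as pd

# Hypothetical orders data, standing in for a table in your database.
orders = pd.DataFrame({
    "customer": ["alice", "bob", "alice", "carol"],
    "amount": [10.0, 25.0, 40.0, 5.0],
})

# SQL:   SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
#        FROM orders GROUP BY customer ORDER BY total DESC;
# pandas equivalent:
summary = (
    orders.groupby("customer")
          .agg(n_orders=("amount", "size"), total=("amount", "sum"))
          .sort_values("total", ascending=False)
          .reset_index()
)
print(summary)
```

GROUP BY maps to `groupby`, aggregate functions to `agg`, and ORDER BY to `sort_values` — once that mapping clicks, most everyday queries translate mechanically.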
24. LEARN THE BASICS: PYTHON FOR DATA SCIENCE AND MACHINE LEARNING
• By applying machine learning algorithms (with code), you will learn them MUCH more quickly than by only reading about them
• Even better, apply them to a sample dataset (work and/or passion project)
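A minimal sketch of "applying an algorithm with code", using scikit-learn and its bundled iris toy dataset (any small tabular dataset would do):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy dataset: the point is the workflow, not the data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a supervised classifier and check how it does on held-out data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

The fit/score loop is the same for nearly every scikit-learn estimator, so once you have run it on one algorithm, trying a different one is a one-line swap.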
25. GOING FURTHER: MOOCS VS. DEEP DIVE
• The jump from automating what you already know to working with predictive models is where most people get overwhelmed.
• Need for more structured learning: MOOCs vs. Deep Dive
26. GOING FURTHER: MOOCS VS. DEEP DIVE
MOOCs:
• PROs:
• On your own time
• Little risk
• CONs:
• No or little help when stuck
• No supportive community/job help
Deep Dive:
• PROs:
• Structure and support
• Faster learning rate
• Network and hiring support
• CONs:
• Risk and opportunity cost
27. MAKING THE SWITCH: GOING FURTHER
• MOOCs:
• Andrew Ng’s Machine Learning course on Coursera
• Explanation of machine learning algorithms
• Applied Data Science with Python course on Coursera
• Coding practice in notebooks, with explanation videos
• Deep Dive: Data Science Retreat
• 3 months of intensive data science teaching and training in Berlin
• Culminates in a final portfolio project
28. MORE RESOURCES
• On Python:
• Think Python: How to Think Like a Computer Scientist, Allen B. Downey
• Fluent Python, Luciano Ramalho
• On DS/ML:
• Think Bayes: Bayesian Statistics in Python, Allen B. Downey
• The Master Algorithm, Pedro Domingos
• Pattern Recognition and Machine Learning, Christopher Bishop
#3: SaaS platform for energy utility bills: got data in a lot of formats, and basically just fit it into the Database. Creative bit was getting to use tableau. Quickly realized I wanted to work on more intense analytics/topics.
Thrown in the deep end at DSR, but I survived.
#5: To process all of this unstructured data, you need to not only analyze it and make predictions about what it will do in the future, but also build solutions for managing, storing, and manipulating it. This is why many people see Data Scientists as first and foremost Software Engineers.
Source: Wikipedia.
#6: DS: we don’t just query the data that’s already there, we use that data to create predictions for the future.
#7: Could be argued, but I’ve found this to be true in industry. Proprietary tools cost money, and you’re (mostly) bound to only the data your company has.
#11: - Life long learner: because it’s such a new field, there are constantly new technologies and libraries coming out that you have to stay up on.
#13: Command line is ESSENTIAL before learning any “proper” coding language.
https://www.davidbaumgold.com/tutorials/command-line/
https://softwareengineering.stackexchange.com/questions/173321/conceptual-difference-between-git-and-github
Git is a revision control system, a tool to manage your source code history.
GitHub is a hosting service for Git repositories.
So they are not the same thing: Git is the tool, GitHub is the service for projects that use Git.
#14: It’s important to learn how different working in Python is from working in SQL.
In SQL, you’re usually using a company-purchased software like Oracle SQL Developer, or MS SQL Server Management Studio.
With these, you usually have authentications that let you run queries on various databases in the RDBMS.
Because SQL is a QUERY language, it’s only as good as the data you have access to.
The great thing about Python is that you’re not tied to one buggy editor for running your code. Also, you’re not tied to data from one source.
#15: Python isn’t like this. Anyone can code in python if they have a computer, you just have to know how to get it started! To start learning Python, you have to get it working on your computer.
To code in Python you have to have a python interpreter on your computer. This is the program that reads your python code and does what it says. Macs have this built in.
#17: iPython is an interface to the Python language. It lets you run small bits of code without writing entire programs. Usually, regular Python is used for scripts that you’ve already written.
A script contains a list of commands to execute in order: it runs from start to finish and displays some output. By contrast, with IPython you generally write one command at a time, get the results instantly, and have a lot of features that make the workflow smoother.
Additional features:
Tab autocompletion (on class names, functions, methods, variables)
More explicit and colour-highlighted error messages
Better history management
Basic UNIX shell integration (you can run simple shell commands such as cp, ls, rm, etc. directly from the IPython command line)
#18: Another interface, but web-based. Makes it easy to share as you can save them as HTML files or PDFs. Lets you both run code (via an ipython kernel), and add text in markdown cells. Great for playing around and trying things.
#19: An IDE (or Integrated Development Environment) is a program dedicated to software development. It (ideally) lets you work in both script and interactive modes. Good for when you’ve “graduated” from Jupyter notebooks and need something to write more full-length programs.
#20: Anaconda is a distribution of Python, made specifically for data science. The goal is to make it easy to have everything you need to do data science in python. When you download it, you automatically get jupyter, and a lot of the core libraries. Now, about libraries…
#21: There are a lot of libraries in python that directly deal with data: Ex: pandas
There are also a LOT that don’t.
Learn the ones that deal with data first! Also, learn the ones that deal with data that you already know how to work with first. Ex: learn pandas for working with CSVs before you learn beautifulsoup for scraping data off the web.
Quick read on some of the biggest Python libraries: http://www.developintelligence.com/blog/python-ecosystem-2017/
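Following that advice — pandas for data you already know how to work with, before scraping libraries — a minimal pandas/CSV example (the CSV contents below are invented) might look like:

```python
import io

import pandas as pd

# A small inline CSV, standing in for a file exported from your database.
csv_text = """date,region,revenue
2017-01-01,north,100
2017-01-02,south,80
2017-01-03,north,120
"""

# read_csv accepts a path or any file-like object; parse_dates turns
# the date column into real timestamps instead of strings.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
by_region = df.groupby("region")["revenue"].sum()
print(by_region)
```

This is the pandas equivalent of `SELECT region, SUM(revenue) FROM ... GROUP BY region` — familiar ground for anyone coming from SQL, which is exactly why it is a good first library.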
#23: When I started with Python, I was frustrated at not being able to do everything I could already do in SQL, as far as manipulating data goes.
A lot of the tutorials are abstract and go too deep into the basics.
#24: Once you have Python working, you can start using it on a real dataset. This is e-commerce data from Kaggle.
Once you’ve played around a bit with the SQL-like and data-focused libraries, you can move on to learning about machine learning, and going beyond just analytics.
#25: Get a rough idea, ex: If yours is a supervised or unsupervised learning problem, if it’s regression or classification, and then, read about each algorithm as you implement them.