Introduction to Statistics for Data Science: openSAP
00:00:06 Hello and welcome back to week six of the openSAP course, "Introduction to Statistics for
Data Science".
00:00:14 In this unit we'll be looking at how statistics can help in the real world.
00:00:21 Every day in our professional and personal lives, we are faced with making decisions,
00:00:27 but the data that should feed those decisions are often huge and complex.
00:00:33 You can use statistics to simplify this complexity in order to summarize data,
00:00:39 make comparisons, forecast the future, test claims and hypotheses, and check probabilities.
00:00:47 Let's look at examples of each of these. Statistics to summarize data enable you to gain
00:00:55 an understanding of large volumes of data. For example, you may want to calculate
00:01:03 the average of how much it costs students to go to university,
00:01:08 or the average pay for men and women. Average pay can be skewed by those at the top.
00:01:17 Therefore, you may want to use the median, the middle value. Or you may also want to
calculate
00:01:26 the spread around the mean using standard deviation to understand the range of pay from top
to bottom.
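
The effect of a few top earners on the mean can be seen in a small sketch. The salary figures below are hypothetical and chosen only to illustrate the point; the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical annual salaries; one very high earner skews the mean.
salaries = [28_000, 30_000, 32_000, 35_000, 38_000, 41_000, 250_000]

mean_pay = statistics.mean(salaries)      # pulled upward by the top earner
median_pay = statistics.median(salaries)  # the middle value, robust to the outlier
spread = statistics.stdev(salaries)       # sample standard deviation around the mean

print(f"mean={mean_pay:.0f}  median={median_pay}  stdev={spread:.0f}")
```

In this made-up sample the mean sits well above the median, which is why the median is often the more representative summary of pay.
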
00:01:35 There are four major types of descriptive statistics. Firstly, measures of frequency,
00:01:40 so count, percentage, frequency. Measures of central tendency,
00:01:45 so the mean, median, and mode. Measures of dispersion or variation,
00:01:51 so range, variance, standard deviation. And measures of position, so percentile,
00:01:58 ranks, and quartile. Statistics can be used to compare
00:02:05 two means or two distributions. You might want to confirm if two means
00:02:11 are significantly different from each other. Perhaps your mean is almost the same
00:02:17 for two sets of data showing exam results for two class groups. You can then use standard
deviation to check
00:02:25 whether the spreads are significantly different. There are lots of useful calculators
00:02:32 available on the Internet. You can also make these comparisons using MS Excel,
00:02:38 or using R or Python, or a range of other statistical software.
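
As a rough illustration of the exam-results comparison, the sketch below uses two hypothetical class groups with the same mean but very different spreads. It computes Welch's t statistic for the difference in means by hand; it does not carry out a formal significance test of the spreads, and the data and the function name `welch_t` are assumptions made for this example.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two sample means."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Standard error of the difference, allowing unequal variances.
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical exam results for two class groups.
class_1 = [62, 65, 68, 70, 72, 75, 78]
class_2 = [40, 55, 66, 70, 74, 85, 100]

print("means:", statistics.mean(class_1), statistics.mean(class_2))
print("standard deviations:", statistics.stdev(class_1), statistics.stdev(class_2))
print("Welch t statistic:", welch_t(class_1, class_2))
```

The means are identical, so the t statistic is zero, yet the standard deviations show that the second group's results are far more spread out.
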
00:02:46 A valuable use of statistics within SAP solutions, such as SAP Analytics Cloud,
00:02:52 allows you to use historical data to predict the future. SAP solutions make this simple and easy
to use.
00:03:02 A couple of examples. Linear regression allows you to understand
00:03:06 how much of the change in one variable can be explained by the change in another.
00:03:12 This approach can be used to make predictions about the future.
00:03:18 For example, how much of the change in sales of children's buckets and spades
00:03:24 can be explained by weather change? Time-series analysis is also a useful way
00:03:30 of predicting the future values based on historical data, identifying trends and seasonal
variations over time.
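
The linear regression idea can be sketched in a few lines. The temperature and sales figures are hypothetical, and `fit_line` is a helper written for this example, not part of any SAP product. The value `r_squared` is the share of the variation in sales that the fitted line explains.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x, with R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # proportion of variance explained
    return slope, intercept, r_squared


# Hypothetical weekly data: temperature in Celsius, bucket-and-spade sets sold.
temperature_c = [14, 16, 18, 20, 22, 24, 26]
bucket_sales = [10, 15, 22, 24, 31, 35, 38]

slope, intercept, r_squared = fit_line(temperature_c, bucket_sales)
forecast_at_28 = intercept + slope * 28  # prediction for a warmer week

print(f"slope={slope:.3f} intercept={intercept:.1f} R^2={r_squared:.3f}")
print(f"forecast at 28 C: {forecast_at_28:.1f}")
```
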
00:03:41 We will present more about SAP solutions in an upcoming unit.
00:03:48 You use hypothesis testing to challenge whether a claim about a population is true.
00:03:54 For example, a claim that men are paid more than women. To test a statistical hypothesis,
00:04:02 you take a sample, collect data, form a statistic, standardize it to form a test statistic,
00:04:09 and decide whether the test statistic refutes the claim. Probability is a notion which you can
use
00:04:18 to deal with uncertainty. If an event can have a number of outcomes,
00:04:25 and you don't know for certain which outcome will occur, you can use probability to describe
00:04:31 the likelihood of each of the possible events. When you analyze probability,
00:04:38 you can establish the likelihood of an event. There is a 30% chance of rain tomorrow.
00:04:47 Based on how poorly the interview went, it is unlikely I will get the job.
00:04:52 Since it is 10 degrees Celsius outside, it's impossible that it will snow.
00:04:58 You have seen some simple examples to understand probability,
00:05:02 but there are many real-world applications. Weather forecasting, insurance claims,
00:05:10 predictive text on mobile phones. To summarize.
00:05:18 Now that you understand the fundamentals of these statistical approaches,
00:05:24 you are in a position to use them in the real world. Summarize - can I see my way through this
mass of numbers?
00:05:34 Compare - how is one group of people treated in comparison to another?
00:05:39 Such as access to resources, pay, jobs, and so on. Forecast - how much influence does the
temperature
00:05:48 have on my sales of winter clothes? How much stock should I buy?
00:05:55 Test claims and hypotheses - for example, are men paid more than women?
00:06:01 Check probabilities - what's the probability of having a certain disease given a positive result
00:06:08 in a test that is not perfectly accurate? In the next unit we will consider
00:06:14 how to critically evaluate reports.
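
The disease-testing question in the summary above can be worked through with Bayes' theorem. The prevalence, sensitivity, and specificity values below are assumptions chosen for illustration, not figures from the course.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) by Bayes' theorem."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)


# Assumed values: 1% of people have the disease, the test detects 95% of
# true cases, and it correctly clears 90% of healthy people.
probability = posterior_given_positive(
    prevalence=0.01, sensitivity=0.95, specificity=0.90
)
print(f"P(disease | positive) = {probability:.3f}")
```

Under these assumptions, fewer than one in ten people who test positive actually have the disease, because false positives from the large healthy group outnumber the true positives.
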
Week 6 Unit 02
00:00:06 Hello and welcome back to week six, unit two of the openSAP course,
00:00:11 "Introduction to Statistics for Data Science". In this unit we'll be looking at
00:00:17 how to critically evaluate reports. Fake news is a type of propaganda
00:00:25 that consists of deliberate disinformation or hoaxes spread via the media or online social
media.
00:00:32 It is written and published usually with the intent to mislead in order to damage an agency,
entity,
00:00:40 or person, and/or gain financially or politically, often using sensationalist, dishonest,
00:00:47 or outright fabricated headlines to increase readership. Unfortunately, it is also very easy
00:00:54 for business report authors to deceive readers so that their misleading statistics become
accepted fact.
00:01:05 This unit presents some common practices and interesting examples to demonstrate
00:01:12 how easy it is for reports to be misleading, and to help you to critically evaluate
00:01:19 the content and decide if it is genuine or not. You will have undoubtedly seen claims
00:01:28 such as when we started to use XYZ's software, our profit conversion rates, and so on,
increased by 200%.
00:01:37 There might be some truth in this claim, but you need to see the numbers that back it up.
00:01:43 You need to know the exact situation and numbers before you can confirm this report.
00:01:51 For example, if a company is measuring conversion rate, and starts with only a small number
of customers,
00:01:58 say, one or two, but then implements solution XYZ and attracts a few more customers,
00:02:05 say, another two or three, then the 200% claim might be true.
00:02:10 However, without the actual number of customers, you cannot be sure if the increase is really
significant.
00:02:19 Simply giving the percentage increase or decrease without the actual numbers can be
misleading.
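
A short sketch shows why a percentage alone says little. The customer counts are hypothetical, and `percent_change` is a helper defined for this example.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Two hypothetical companies reporting the same headline figure.
small_company = percent_change(1, 3)        # two extra customers
large_company = percent_change(1000, 3000)  # two thousand extra customers

print(small_company, large_company)
```

Both calls report a 200% increase, yet one reflects two additional customers and the other two thousand.
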
00:02:30 Even if the targeted KPI changed significantly and there really was an increase of 200%,
00:02:37 you also need to examine how other important KPIs changed. So for example, conversion rate
went from
00:02:45 20% per 1000 to 30% per 1000, but how did open rates, bounce rates,
00:02:52 activation, and retention rates change? It's very easy to focus on one of the KPIs
00:02:59 and tell a story, ignoring other KPIs. Business is very complex and you need to be aware
00:03:06 of all of the metrics to build up a complete picture. You always need to understand
00:03:15 how the analysis was conducted. If a proper statistical test,
00:03:20 such as an A/B test, was used, you need to ensure you know what was compared
00:03:26 in each group, and what the outcome was. A/B testing is a randomized experiment
00:03:33 with two variants, A and B. The two versions, A and B,
00:03:38 are identical except for one variation that might affect a user's behavior.
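
One common way to analyze the outcome of such an experiment is a two-proportion z-test on the conversion rates of the two versions. The sketch below is a minimal illustration under stated assumptions: the conversion counts are hypothetical, the normal approximation is used (it is unreliable for very small samples), and the course does not say which test a given report used.

```python
import math


def two_proportion_z(conversions_a, n_a, conversions_b, n_b):
    """z statistic and two-sided p-value for a difference in conversion rates."""
    rate_a = conversions_a / n_a
    rate_b = conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return z, p_value


# Same rates (20% vs 30%), very different sample sizes.
z_large, p_large = two_proportion_z(200, 1000, 300, 1000)
z_small, p_small = two_proportion_z(2, 10, 3, 10)

print(f"1000 users per group: z={z_large:.2f}, p={p_large:.2g}")
print(f"10 users per group:   z={z_small:.2f}, p={p_small:.2g}")
```

With 1000 users per group the difference is very unlikely to be chance; with 10 users per group the same rates are entirely consistent with chance.
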
00:03:45 Version A might be the currently used version, called the control, while version B
00:03:51 is modified in some respect and is called the treatment. If no statistical test was used,
00:03:57 how can you be sure that the increase was not due to some other effect,
00:04:03 such as seasonality or organic growth? Remember that pie charts
00:04:11 are a poor way of communicating information. The eye is good at judging linear measures
00:04:18 and bad at judging relative areas. A bar chart or dot chart
00:04:23 is a preferable way of displaying data. The main problem with 3D pie charts
00:04:30 is that in these graphics, the frontmost wedges of the pie appear to be larger than the rear
wedges.
00:04:39 There are two reasons. Firstly, the viewer sees the front edge
00:04:43 but not the back edge of the pie chart. Secondly, if the chart is displayed using perspective,
00:04:51 the upper surface of a wedge in the front of the pie will be greater than the upper surface
00:04:58 of a rear wedge spanning the same angle. Angles and areas at the bottom of the chart
00:05:05 must be exaggerated, and the angles and areas at the top of the chart reduced,
00:05:12 in order to create the dimensional effect. Bar charts can also be used to confuse you,
00:05:20 especially when the axes fail to reach zero. In this chart, the value for 2014
00:05:27 is approximately 1.08 times the value for 2010, but because the vertical axis has been
truncated,
00:05:35 the bar for 2014 looks approximately three times greater than the bar for 2010.
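
The arithmetic behind this distortion is simple. In the sketch below the values 100 and 108 and the axis starting point of 96 are assumptions, chosen so that the true ratio is 1.08 and the drawn bars differ by a factor of three, as in the example described.

```python
def apparent_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of drawn bar heights when the axis begins at `axis_start`."""
    return (value_b - axis_start) / (value_a - axis_start)


honest = apparent_ratio(100, 108)                    # axis starts at zero
truncated = apparent_ratio(100, 108, axis_start=96)  # axis starts at 96

print(f"true ratio: {honest:.2f}, ratio of drawn bars: {truncated:.1f}")
```
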
00:05:41 Graphs and charts are the basis of communicating and explaining many statistics,
00:05:48 but beware of any graph that does not start at zero and any chart that makes use of
pictographs instead of bars.
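
Pictographs and bubbles mislead for a geometric reason: when a picture is scaled in both height and width, its area grows with the square of the scale factor. The following is a minimal sketch of that relationship; both function names are invented for this example.

```python
import math


def pictogram_area_ratio(value_ratio):
    """Area ratio when both height and width are scaled by `value_ratio`."""
    return value_ratio ** 2


def bubble_radius_for_area(value, reference_value, reference_radius):
    """Radius that makes the disk AREA proportional to `value`."""
    return reference_radius * math.sqrt(value / reference_value)


# Doubling a value and scaling the picture in both directions quadruples its area.
print(pictogram_area_ratio(2))

# A value half as large needs about 0.71 of the radius, not half of it.
print(bubble_radius_for_area(13, 26, reference_radius=1.0))
```
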
00:05:59 A company that sells cars presented annual sales increases in a pictogram using a car image.
00:06:07 Sales in 2017 were double sales in 2016, so this was represented by doubling
00:06:13 the height of the second car. However, when they doubled the height,
00:06:19 they also had to double the width, and the second car was then four times the size
00:06:25 of the first, taking into account height and width. The second car was far more impressive than
the first,
00:06:33 as you can see. The power of the bubble chart is that,
00:06:38 by using color and size as well as vertical and horizontal position,
00:06:44 you can simultaneously encode four different attributes for each item in the dataset.
00:06:51 In this example, the disk representing people aged 65 plus has a value of 13%,
00:06:57 and for 45 to 54 years old, the disk has a value of 26%. The 65 plus disk should have 1/2 the
area
00:07:06 of the 45 to 54 disk. However, in the diagram, the 65 plus disk
00:07:11 has 1/2 the radius and thus 1/4 of the area. By representing the percentage value with the
radius
00:07:20 rather than the area of the disk, this graph visually exaggerates the differences
00:07:26 between the groups represented by each bubble. The only information in this chart
00:07:32 is contained in the sizes of the disks, which is misleading. It does not use the disks' vertical
and horizontal positions
00:07:42 and colors to convey multidimensional information. On February 7, the Sunday Times
00:07:50 had a front page article asserting that six out of 10 Conservative backbenchers
00:07:57 contacted by this newspaper said they backed Brexit. On page two, the report explained
00:08:05 that 238 MPs had been contacted, 144 had replied, 66 of those replying said that they would
campaign to leave,
00:08:15 and 50 said they were committed to remain. 66 out of 238 contacted is 28%,
00:08:23 while 66 out of the 144 giving an answer is 46%, so six out of 10 is clearly wrong.
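
The percentages can be checked directly from the reported counts. The last line of the sketch tries a third denominator, the 116 MPs who declared for either side; whether the newspaper calculated its figure that way is an assumption, not something the report states.

```python
contacted, replied, leave, remain = 238, 144, 66, 50

share_of_contacted = leave / contacted       # denominator: everyone contacted
share_of_replied = leave / replied           # denominator: everyone who answered
share_of_decided = leave / (leave + remain)  # denominator: declared leave or remain

print(f"of contacted: {share_of_contacted:.1%}")
print(f"of replied:   {share_of_replied:.1%}")
print(f"of decided:   {share_of_decided:.1%}")
```

Even the most favorable denominator falls short of six out of 10 "contacted", which is what the headline claimed.
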
00:08:35 This visualization appeared in a Time magazine article. It illustrates the most common causes
00:08:42 of death by age group. There is a large amount of very complex information
00:08:47 packed into a single figure. Let's look at it in a little bit more detail.
00:08:54 Vertically, for one particular age, the graph is correctly proportioned:
00:08:59 the height of each shaded block corresponds to the fraction of total deaths from the associated
cause.
00:09:08 Horizontally, among age groups, the total number of deaths for each age group changes.
00:09:16 But the graph represents the fraction of deaths, not the total number of deaths.
00:09:22 Total deaths differ widely by age group, so this graph is misleading.
00:09:27 So for example, far more people aged 65 plus died of heart disease, the red,
00:09:32 than children aged one to four died of accidents, the blue-gray.
00:09:36 However, the children dying of accidents is represented by a much larger area
00:09:43 because it represents a larger percentage of the relatively low total deaths at that age.
00:09:50 Therefore, the within-age comparisons are entirely correct. However, between-age
comparisons
00:09:58 have a shifting denominator, which is the number of total deaths in each age group.
00:10:05 And this obscures the true risks associated with different causes of death.
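
The shifting denominator can be made concrete with a small sketch. All counts below are hypothetical and are not taken from the Time magazine graphic; they only show how a larger fraction can correspond to a far smaller number of deaths.

```python
# Hypothetical counts per age group: total deaths and deaths from one cause.
deaths = {
    "1-4": {"total": 4_000, "cause": "accidents", "count": 1_200},
    "65+": {"total": 2_000_000, "cause": "heart disease", "count": 500_000},
}

# Fraction of each age group's deaths due to the cause (what the chart shows).
fractions = {age: row["count"] / row["total"] for age, row in deaths.items()}

for age, row in deaths.items():
    print(f"{age}: {row['cause']} = {fractions[age]:.0%} of deaths, "
          f"{row['count']:,} deaths in absolute terms")
```

In this made-up data, accidents take up a larger share of the chart for young children even though heart disease among the over-65s accounts for several hundred times as many deaths.
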
00:10:13 To summarize, when you judge the veracity of the claims presented in a business report,
00:10:20 you should become accustomed to saying, "It depends." Many statistics are either simply
wrong
00:10:28 or have been calculated from slightly different populations so as to give support to a spurious
argument.
00:10:37 Calculating impossibly precise figures, using the wrong concepts and definitions,
00:10:43 or mixing up sources in a manner that amounts to comparing apples with oranges,
00:10:49 seems to be becoming more common, especially in media, business, and politics.
00:10:57 Even simple graphics can be misleading. In the next unit, we'll consider
00:11:04 lying and misleading with statistics.
Week 6 Unit 03
00:00:05 Hello and welcome back to week six, unit three of the openSAP course,
00:00:10 "Introduction to Statistics for Data Science". In this unit, we'll be looking at
00:00:16 lying and misleading with statistics. Modern statistical tools enable you to create
00:00:23 very powerful visualizations to illustrate the key insights from your statistical analysis.
00:00:30 For example, SAP Analytics Cloud provides powerful business planning, BI - for reporting,
dashboarding,
00:00:40 data-discovery and visualization - and predictive analytics capabilities.
00:00:47 Although it's now very easy to analyze data, it's also very easy to accidentally,
00:00:54 or deliberately in fact, mislead. In this unit, we'll be looking at some typical ways
00:00:59 that visualizations can mislead, and how to avoid them. Misleading trends, incorrectly inferring
causality
00:01:07 from correlation, not taking account of inflation, starting from zero or not.
00:01:17 So does the graph on the left look like a truthful and honest representation of the trend?
00:01:23 Of course it does, but there is missing information. Notice there is no scale for revenue or time.
00:01:31 Therefore you cannot really judge whether this is a short-term or longer-term improvement.
00:01:40 The person who designed this graph chose to start with the representation from the lowest
point,
00:01:45 suggesting that it's a good news story. In fact, the visualization is hiding key information
00:01:52 about the previous poor revenue performance. And you can see that on the right.
00:02:02 A scatter plot is a very good way of spotting if there is a statistical correlation between two
variables.
00:02:12 A chart like this though is giving the impression there is a predictive relationship.
00:02:17 How do ice cream sales cause shark attacks? Clearly that's spurious. In fact, there are key
missing data -
00:02:24 the weather information. Hot weather tends to drive more ice cream sales,
00:02:29 and also more people venturing into the sea to cool down. And this is where there may be
sharks swimming.
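
The role of the missing weather data can be illustrated with a partial correlation, which measures the relationship between two variables after removing the part explained by a third. The four observations below are hypothetical and deliberately constructed so that temperature accounts for the entire relationship; real data would not be this clean.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical observations: temperature drives both other variables.
temperature = [30, 30, 20, 20]
ice_cream_sales = [320, 280, 220, 180]
shark_attacks = [7, 5, 3, 5]

raw = pearson(ice_cream_sales, shark_attacks)
controlled = partial_correlation(ice_cream_sales, shark_attacks, temperature)

print(f"raw correlation: {raw:.2f}")
print(f"controlling for temperature: {controlled:.2f}")
```

The raw correlation is clearly positive, but it vanishes once temperature is taken into account.
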
00:02:38 Remember that correlation shows a possible statistical relationship, but it is very important
00:02:45 that the relationship is checked and validated. At least we have a scale here.
00:02:53 However, the graph is asking us to look at the changing revenue figures every five years.
00:03:01 How do we know if the graph takes inflation into account? Should we start from zero?
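
Adjusting for inflation means expressing every figure in the prices of one base year. The sketch below assumes a hypothetical price index and hypothetical revenue figures at five-year intervals; `to_real` is a helper written for this example.

```python
def to_real(nominal, index, base_index):
    """Convert a nominal amount into base-year prices using a price index."""
    return nominal * base_index / index


# Hypothetical revenue (in millions) and price index, every five years.
nominal_revenue = {2000: 100.0, 2005: 110.0, 2010: 120.0}
price_index = {2000: 100.0, 2005: 112.0, 2010: 125.0}

real_revenue = {
    year: to_real(nominal_revenue[year], price_index[year], price_index[2000])
    for year in nominal_revenue
}

for year in nominal_revenue:
    print(f"{year}: nominal {nominal_revenue[year]:.1f}, real {real_revenue[year]:.1f}")
```

In these made-up numbers nominal revenue rises by 20% over the decade, while revenue in year-2000 prices actually falls.
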
00:03:11 Fox Business used a graphic with a fairly badly distorted scale to exaggerate the effect that the
expiration
00:03:19 of the Bush tax cuts would have on the rich. This screenshot shows how Fox Business
presented
00:03:28 a return to the Clinton-era tax rate of 39.6% on the top income bracket, from the Bush-era rate
of 35%.
00:03:37 It looks like it will have a huge impact on people's tax bill.
00:03:41 But note the y-axis and where the tax rate starts. This is misleading.
00:03:49 In this case, the graph on the left does start from zero. However, is this an honest
representation?
00:03:57 Arguably, we are interested here less in the absolute value, and more in the volatility of the
stock price.
00:04:06 In the left-hand graph, it's difficult to make out the volatility.
00:04:11 The volatility is much clearer on the right-hand graph. Therefore, there can be good examples and
good reasons
00:04:20 for not having a zero baseline. To summarize - when presenting the results
00:04:26 of your statistical analysis, it's legitimate to emphasize the data and story you want to tell.
00:04:33 But it is very important that this does not mislead the reader.
00:04:39 Think about what you choose not to display. Can this omission be misleading?
00:04:46 Are you misleading by presenting correlations as causation without justification?
00:04:53 Do you have the appropriate scales, and where should they start?
00:04:57 And do your financial comparisons take account of inflation? Think about whether starting from
zero
00:05:04 helps or hinders clarity of representation. In the next and final unit, we will consider
00:05:11 some of the different statistical tools that are available.
Week 6 Unit 04
00:00:06 Hello and welcome back to week six, unit four of the openSAP course, "Introduction to
Statistics for Data Science".
00:00:14 In this final unit, we will be looking at some of the different statistical tools
00:00:20 that are available. This unit will introduce you to some of
00:00:25 the most popular tools used for statistics: MS Excel, R, and Python.
00:00:32 Of course, there are many more tools available, but these are very popular.
00:00:39 There is a large body of reference and introductory material that's available on the Web
00:00:45 for each of these different tools. You will also be introduced to some of
00:00:51 the powerful SAP tools that have been developed that enable you to perform statistical
analysis
00:00:59 and predictive modeling. SAP HANA, SAP Predictive Analytics,
00:01:05 SAP Analytics Cloud, SAP Data Intelligence. We've included some interesting references
00:01:14 so that you can learn more about how to use these tools for statistical analysis.
00:01:23 Microsoft Excel uses a grid of cells arranged in numbered rows and letter-named columns
00:01:30 to organize data manipulations like arithmetic operations. It provides a range of functions for
statistical,
00:01:37 engineering, and financial analysis. It can display data as line graphs, histograms, and charts,
00:01:45 and with a very limited three-dimensional graphical display. R is a programming language and
free software environment
00:01:57 for statistical computing and graphics supported by the R Foundation for Statistical Computing.
00:02:05 The R language is widely used among statisticians and data miners for developing statistical
software.
00:02:14 The basic installation includes all of the commonly used statistical techniques such as
univariate analysis,
00:02:23 categorical data analysis, hypothesis testing, generalized linear modeling, multivariate
analysis,
00:02:31 and time series analysis. There are also powerful facilities
00:02:36 to produce statistical graphics. The base software is supplemented by over
00:02:41 5000 add-on packages developed by R users. These packages cover a broad range
00:02:49 of statistical techniques. R is command driven, so it takes longer to master
00:02:56 than point-and-click software. However, it has greater flexibility.
00:03:03 Python is a widely used general-purpose, high-level programming language.
00:03:10 It was developed with an emphasis on code readability, and its syntax allows programmers
to express concepts
00:03:19 in fewer lines of code. Python is concise and easy to read,
00:03:24 and it can be used for everything from Web development to software development and
scientific applications.
00:03:33 Python's standard library is very extensive and contains built-in modules written in C
00:03:40 that provide access to system functionality that would otherwise be inaccessible to Python
programmers,
00:03:48 as well as modules written in Python that provide standardized solutions
00:03:53 for many problems that occur in everyday programming. There are thousands of external
libraries that can be added on.
00:04:04 Python 2.0 was released in 2000 and Python 3.0 was released in 2008.
00:04:11 This was a major revision of the language that is not completely backward-compatible,
00:04:16 and much of Python 2 code does not run unmodified on Python 3.
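
Two well-known examples of the incompatibility are division and printing. The snippet below runs on Python 3; the comments describe how Python 2 behaved.

```python
# Division of two integers: Python 3 returns a float.
# In Python 2 the same expression, 7 / 2, gave the integer 3.
true_division = 7 / 2

# Floor division behaves the same way in both versions.
floor_division = 7 // 2

# print is a function in Python 3. Python 2 used a print statement,
# written as: print "hello", which is a syntax error in Python 3.
print(true_division, floor_division)
```
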
00:04:23 The SAP Analytics Cloud solution offers all analytics for all users on one platform,
00:04:31 in one user experience, combining BI, planning, and predictive analytics in a single solution.
00:04:38 In SAC, you can present the results of all the statistical analysis
00:04:43 you've been learning about in this course. SAC makes the whole process of developing
statistical
00:04:51 insight from problem to presentation quite simple. You can use SAC to build
00:04:57 interactive visualizations and stories, to create predictive models, and integrate them
00:05:04 into business intelligence and planning workflows. There is no data science experience
required.
00:05:12 You can discover the key influencers behind business-critical KPIs, and run powerful simulations
00:05:20 to see how the KPIs can change based on changing contributory factors.
00:05:26 You can uncover contributing factors to your data points using natural language.
00:05:33 You can also easily use SAC to quickly develop a clear understanding of even the most
complex
00:05:41 aspects of your data. The Predictive Analysis Library, PAL,
00:05:48 is delivered with SAP HANA. This application function library
00:05:53 defines functions that can be called from within SAP HANA database SQLScript procedures
00:06:01 to perform analytic algorithms. SQLScript is an extension of SQL.
00:06:07 It includes enhanced control-flow capabilities and lets developers define complex application
logic
00:06:16 inside database procedures. The PAL defines functions that can be called
00:06:24 from within SQLScript procedures to perform analytic algorithms.
00:06:30 PAL includes classic and universal predictive analysis algorithms in the following data-mining
categories.
00:06:38 Clustering, where you create groups of objects, such as customers who have similar
characteristics, for example.
00:06:48 So you can develop relevant marketing messages and communications to specific customer
groups.
00:06:56 Classification, where you can identify the category a new observation belongs to, so for
example to decide
00:07:05 if a customer is going to buy a product or not, or if a machine will break down or not.
00:07:11 Regression, where you estimate the value of a target variable, so for example,
00:07:17 to estimate customer spend in dollars over the next six months, or estimate costs,
00:07:23 or how much investors will invest over a certain period. Association analysis, which you can
use
00:07:31 for market basket analysis to discover next best product recommendations for retailers.
00:07:38 Time series analysis, where a time series is a sequence of data points measured at successive time instants
00:07:45 spaced at uniform time intervals. You can use time series analysis to forecast
00:07:51 monthly sales figures, or stock market prices. There's data pre-processing, statistics.
00:07:58 Social Network Analysis, where you analyze the links between customers
00:08:03 or between customers and the products they've purchased. These algorithms are explored in
much more detail
00:08:13 in another openSAP course, "Getting Started with Data Science", if you're interested.
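
To give a feel for one of these categories, the sketch below computes the two basic measures used in association analysis, support and confidence, in plain Python. It does not use or imitate the PAL interface; the baskets and function names are hypothetical.

```python
# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset, transactions):
    """Share of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions with `antecedent` that also contain `consequent`."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)

print(f"support(bread, butter) = {rule_support:.2f}")
print(f"confidence(bread -> butter) = {rule_confidence:.2f}")
```

A rule with high support and high confidence, such as "customers who buy bread also buy butter" in this toy data, is a candidate for a next best product recommendation.
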
00:08:22 SAP Predictive Analytics provides business analysts and data scientists
00:08:27 with a fully automated analysis and modeling capability. Data preparation allows you to easily
create
00:08:35 thousands of derived variables, you can build dynamic analytical datasets automatically,
00:08:42 and it automates data preparation and data encoding, automatically preparing data with
missing values,
00:08:50 outliers, and non-linear distributions. It provides predictive modeling,
00:08:55 regression, and classification. It provides clustering, segmentation, forecasting,
00:09:01 association rules, Social Network Analysis, and a range of advanced modeling deployment
00:09:07 and management capability in Predictive Factory. SAP Data Intelligence is a cloud solution,
focusing on
00:09:17 developing artificial intelligence projects, extracting value from distributed data sources,
00:09:24 using open source technology such as R, Python, and TensorFlow.
00:09:30 You can use Data Intelligence to explore, preview, and profile data assets, prepare data
directly
00:09:38 from the Data Intelligence Data Explorer, without any technical scripting skills.
00:09:44 Discover, refine, govern, and orchestrate any type, variety, and volume of data.
00:09:52 Develop data analysis and predictive models using notebooks, code, and data.
00:09:59 Integrate Jupyter Notebooks into a Pipeline Modeler application, so it is possible
00:10:06 to directly harness Jupyter Notebooks and Python in the operationalization of model pipelines.
00:10:17 So to summarize, this unit has introduced you to some of the most popular tools
00:10:23 that are used for statistical analysis. Microsoft Excel is commonly used
00:10:28 for relatively small data sets. It's very accessible, provides lots of easy-to-use functionality,
00:10:36 and has good graphical features. R and Python require you to learn a programming language,
00:10:44 but provide a huge amount of functionality, are extremely powerful,
00:10:51 and have very good graphical capability. If you take this interest in statistics further,
00:10:57 you should consider learning one or both of these programming languages.
00:11:03 The SAP tools provide easy-to-use statistical analysis, predictive modeling, and machine
learning capabilities
00:11:12 that can be used to quickly develop a clear understanding of even the most complex aspects
of your data.
00:11:22 Well, you've made it to the end of the course. You now know about some of
00:11:28 the fundamental principles of statistics. We hope you've enjoyed it - you are now well prepared
00:11:35 to do the last weekly assignment and the final exam that will be open next week.
00:11:42 We wish you good luck for both. We're happy to receive your feedback
00:11:46 in the, "I like, I wish" forum, and of course we'll still be available in the discussion forum
00:11:53 for your content-related questions until the course closes.
00:11:59 We enjoyed being your instructors for this course, and who knows - maybe we will meet again.