EDA Unit 1 Notes

Copyright
© All Rights Reserved
Exploratory Data Analysis

Syllabus
EDA fundamentals - Understanding data science - Significance of EDA - Making sense of data - Comparing EDA with classical and Bayesian analysis - Software tools for EDA - Visual aids for EDA - Data transformation techniques - merging database, reshaping and pivoting, transformation techniques.

Contents
1.1 Data and EDA Fundamentals
1.2 Understanding Data Science
1.3 Significance of EDA
1.4 Types of Exploratory Data Analysis
1.5 Making Sense of Data
1.6 Comparing EDA with Classical and Bayesian Analysis
1.7 Software Tools for EDA
1.8 Visual Aids for EDA
1.9 Data Transformation Techniques
1.10 Merging Database (Using Pandas Library)
1.11 Reshaping and Pivoting
1.12 Two Marks Questions with Answers
1.13 Long Answered Questions

1.1 Data and EDA Fundamentals
* Data can be qualitative or quantitative. Qualitative data is descriptive information. Quantitative data is numerical information (numbers).
* Discrete data can only take certain values (like whole numbers); continuous data can take any value (within a range). Discrete data is counted, continuous data is measured.
* Examples of data :
o Qualitative : My friend's favorite holiday destination.
o Quantitative : Customers in a shop (discrete).
* Information is classified or organized data that has some meaningful value. It is also the processed data used to make decisions and take action.
* Data collection is defined as a method of collecting and analyzing data for research using some techniques. Data collection is done to analyze a problem and learn about its outcome and future trends. When there is a need to arrive at an answer to a question, data collection methods help to make assumptions about the result.
* Below are widely used data collection methods :
1. Interviews
2. Questionnaires and surveys
3. Observations
4. Documents and records
5. Focus groups
6. Oral histories.

TECHNICAL PUBLICATIONS® - an up-thrust for knowledge
* Sources of data include :
o Customer personal information (e.g., name, address, age, contact info)
o Business journals
o Government records (e.g., census, tax records, Social Security info)
o Trade / Business magazines
o The Internet.

Data collection tools : Below are widely used tools for data collection.
* Word association : The researcher gives the respondent a set of words and asks them what comes to mind when they hear each word.
* Sentence completion : Researchers use sentence completion to understand what kind of ideas the respondent has. This tool involves giving an incomplete sentence and seeing how the interviewee finishes it.
* Role-playing : Respondents are presented with an imaginary situation and asked how they would act or react if it was real.
* In-person surveys : The researcher asks questions in person.
* Online / Web surveys : These surveys are easy to accomplish, but some users may be unwilling to answer truthfully, if at all.
* Mobile surveys : These surveys take advantage of the increasing proliferation of mobile technology. Mobile collection surveys rely on mobile devices like tablets or smartphones to conduct surveys via SMS or mobile apps.
* Phone surveys : No researcher can call thousands of people at once, so they need a third party to handle the chore. However, many people have call screening and won't answer.
* Observation : Sometimes, the simplest method is the best. Researchers who make direct observations collect data quickly and easily, with little intrusion or third-party bias. Naturally, it is only effective in small-scale situations.

1.1.3 Common Issues / Problems in Data
Inconsistent data
* When working with various data sources, it is conceivable that the same information will have discrepancies between sources.
The differences could be in formats, units or occasionally spellings. The introduction of inconsistent data might also occur during firm mergers or relocations. Inconsistencies in data have a tendency to accumulate and reduce the value of the data. Organizations that rely heavily on data need it to be reliable.

Inaccurate data
* In highly regulated businesses like healthcare, data accuracy is crucial. Given the current experience, it is more important than ever to increase the data quality for COVID-19 and later pandemics. Inaccurate information does not provide a true picture of the situation and cannot be used to plan the best course of action. Personalized customer experiences and marketing strategies underperform if the customer data is inaccurate.
* Data inaccuracies can be attributed to a number of things, including data degradation, human mistake and data drift. Worldwide data decay occurs at a rate of about 3 % per month, which is quite concerning. Data integrity can be compromised while data is transferred between different systems and data quality might deteriorate with time.

Hidden data
* The majority of businesses only utilize a portion of their data, with the remainder sometimes being lost in data silos or discarded in data graveyards. For instance, the customer service team might not receive client data from sales, missing an opportunity to build more precise and comprehensive customer profiles. Hidden data causes missed opportunities to develop products, enhance services and streamline procedures.

* The relevant time period and many more factors need to be considered while trying to find relevant data. Data that is not relevant to the study in any of these factors is obsolete and one cannot effectively proceed with its analysis. This could lead to incomplete research or analysis, re-collecting data again and again or shutting down the study.
Deciding the data to collect
* Determining what data to collect is one of the most important factors while collecting data and should be one of the first decisions made. One must choose the subjects the data will cover, the sources used to gather it and the required quantity of information. For instance, one may decide to gather information on what website visitors between the ages of 20 and 50 most frequently access. One can also decide to compile data on the typical age of all the clients who made a purchase from the business over the previous month.

Exploratory Data Analysis (EDA)
* EDA is the process of using summary statistics and graphical representations to investigate data in order to uncover patterns, test hypotheses and check assumptions.
* EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow with the more direct approach of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques.
* EDA is a philosophy as to how one dissects a data set; what one looks for; how one looks; and how one interprets. It is true that EDA heavily uses the collection of techniques called "statistical graphics", but it is not identical to statistical graphics.
* The purpose of EDA is to spot problems in data (as part of data wrangling) and understand variable properties like :
o Central trends (mean)
o Spread (variance)
o Skew
o Outliers and anomalies.
* Below are the most prominent reasons to use EDA :
o Detection of mistakes.
o Determining relationships among the explanatory variables.
o Assessing the direction and rough size of relationships between explanatory and outcome variables.

1.2 Understanding Data Science
* Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information and make business decisions.
* It is used to uncover actionable insights hidden in an organization's data. These insights can be used to guide decision making and strategic planning.
* Data science work proceeds in stages, each with its own tasks. The tasks include signal reception and data extraction; data staging and data processing, which take the raw data and put it in a form that can be used; data summarization, including a look at ranges and biases; and analysis, regression and text mining, which perform the various analyses on the data.
* Tools used in data science :
o Data analysis - R Studio, MATLAB, Excel, RapidMiner.
o Data warehousing - Informatica / Talend, AWS Redshift.
o Data visualization - Jupyter, Tableau, Cognos, RAW.
o Machine learning - Spark MLib, Mahout, Azure ML studio.

Use of data science
1. Data science may detect patterns in seemingly unstructured or unconnected data, allowing conclusions and predictions to be made.
2. Tech businesses that acquire user data can utilize strategies to transform that data into valuable or profitable information.
3. Data science has also made inroads into the transportation industry, such as with driverless cars. Driverless cars can help to lower the number of accidents. For example, with driverless cars, training data such as the speed limit on the highway, busy streets, etc. is supplied to the algorithm and examined using data science approaches.
4. Data science applications provide a better level of therapeutic customization through genetics and genomics research.
* Data science has found its applications in almost every industry.

Applications of data science
* Healthcare : Healthcare companies are using data science to build sophisticated medical instruments to detect and cure diseases. Machine learning models and other data science components are used by hospitals and other healthcare providers to automate X-ray analysis and assist doctors in diagnosing illnesses and planning treatments based on previous patient outcomes.
* Gaming : Video and computer games are now being created with the help of data science and that has taken the gaming experience to the next level.
* Image recognition : Identifying patterns in images and detecting objects in an image is one of the most popular data science applications.
* Recommendation systems : Netflix and Amazon give movie and product recommendations based on what people like to watch, purchase or browse on their platforms.
* Logistics : Data science is used by logistics companies to optimize routes to ensure faster delivery of products and increase operational efficiency.
* Fraud detection : Banking and financial institutions use data science and related algorithms to detect fraudulent transactions.
* Entertainment : Data science enables streaming services to follow and evaluate what consumers watch, which aids in the creation of new TV series and films. Data-driven algorithms are also utilised to provide tailored suggestions based on the watching history.

1.3 Significance of EDA
* EDA helps to gain an understanding of the data set beyond the formal modeling.
* Exploratory data analysis is essential for any research analysis. EDA gives a better understanding of the data set and helps to find outliers or anomalous events.
* Exploratory data analysis is key and usually the first exercise in data mining. It allows us to visualize data to understand it as well as to create hypotheses for further analysis. The exploratory analysis centers around creating a synopsis of data or insights for the next steps in a data mining project.

1.4 Types of Exploratory Data Analysis
* There are four primary types of EDA, namely univariate non-graphical, univariate graphical, multivariate non-graphical and multivariate graphical.
* Univariate non-graphical : Because only one variable is involved, this analysis does not deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
* Univariate graphical : Non-graphical methods do not provide a full picture of the data. Graphical methods are therefore required. Common types of univariate graphics include :
o Stem-and-leaf plots, which show all data values and the shape of the distribution.
o Histograms, a bar plot in which each bar represents the frequency (count) or proportion (count / total count) of cases for a range of values.
o Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile and maximum.
* Multivariate graphical : A grouped bar plot is commonly used, with each group representing one level of one of the variables and each bar within a group representing the levels of the other variable. Other common types of multivariate graphics include :
o Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how much one variable is affected by another.
o Multivariate chart, which is a graphical representation of the relationships between factors and a response.
o Run chart, which is a line graph of data plotted over time.
o Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
o Heat map, which is a graphical representation of data where values are depicted by color.

1.5 Making Sense of Data
* Once data is collected, a well-organized analysis process can make sense of the data. For analyzing data there is a need to have data in a viewable format. Data is best seen using tables, charts, graphs and patterns. This helps to analyze the dataset.

1.6 Comparing EDA with Classical and Bayesian Analysis
* The three approaches are similar in that they all begin with a general science / engineering problem and all yield science / engineering conclusions. The difference is the sequence and focus of the intermediate steps.
* Classical analysis follows below steps,
Problem -> Data -> Model -> Analysis -> Conclusions
* Exploratory data analysis follows below steps,
Problem -> Data -> Analysis -> Model -> Conclusions
* Bayesian data analysis follows below steps,
Problem -> Data -> Model -> Prior Distribution -> Analysis -> Conclusions
* For classical analysis, the data collection is followed by the imposition of a model (normality, linearity, etc.) and the analysis, estimation and testing that follows are focused on the parameters of that model.
* For EDA, the data collection is not followed by a model imposition; rather it is followed immediately by analysis with a goal of inferring what model would be appropriate.
* For a Bayesian analysis, the analyst attempts to incorporate scientific / engineering knowledge into the analysis by imposing a data-independent distribution on the parameters of the selected model; the analysis thus consists of formally combining both the prior distribution on the parameters and the collected data to jointly make inferences and / or test assumptions about the model parameters.
* In the real world, data analysts freely mix elements of all of the above three approaches and, if required, other approaches as well.

Classical analysis vs EDA
* Classical analysis and EDA can be compared on the basis of below parameters.
* Models : The classical approach imposes models (both deterministic and probabilistic) on the data, for example, regression models and analysis of variance models. The EDA approach does not impose deterministic or probabilistic models on the data. On the contrary, the EDA approach allows the data to suggest the models; the model is suggested by the data.
* Data treatment : Classical estimation techniques have the characteristic of taking all of the data and mapping the data into a few numbers ("estimates"). This is both a virtue and a vice. The virtue is that these few numbers focus on important characteristics (location, variation, etc.) of the population. The vice is that concentrating on these few characteristics can filter out other characteristics (skewness, tail length, autocorrelation, etc.) of the same population. In this sense there is a loss of information due to this "filtering" process. The EDA approach often makes use of (and shows) all of the available data. In this sense there is no corresponding loss of information.
* Assumptions : In the classical approach, the validity of the scientific conclusions is linked to the validity of the underlying assumptions. In practice, if such assumptions are unknown or untested, the validity of the scientific conclusions becomes suspect.
Many EDA techniques make little or no assumptions; they present and show the data - all of the data - as is, with fewer encumbering assumptions.

1.7 Software Tools for EDA
* Python, R and Excel are some of the popular EDA tools.
1. R - An open-source programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians in developing statistical observations and data analysis.
2. Python - An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components together. Python and EDA can be used together to identify missing values in a dataset, which is important so one can decide how to handle them. Python also supports tasks such as handling outliers and helps in saving time.
3. Excel (spreadsheet) - Excel remains a dependable tool for EDA.

1.8 Visual Aids for EDA
1. Univariate plots (Used for univariate data - the data containing one variable)
i. Histogram
* A histogram shows the frequency or the distribution shape of a variable. Bar graphs have gaps between the bars, whereas there are no gaps in histograms.
* A histogram can be left-skewed (most of the data falls to the right side), right-skewed (most of the data falls to the left side) or bi-modal.
* Example : Below is the waiting time of the customer at the cash counter of the grocery shop during peak hours, which was observed by the cashier. The first bin range is 2.30 mins to 2.86 mins. It can be noted from the table, and as seen in the graph, that the count is three for that category.

Fig. 1.8.1 Histogram (bins : (2.30, 2.86), (2.86, 3.43), (3.43, 3.99), (3.99, 4.56), (4.56, 5.12))
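The binning behind a histogram such as Fig. 1.8.1 can be sketched with NumPy. This is only an illustration under stated assumptions: the waiting times below are hypothetical values (the original table is not available here), chosen so that the range runs from 2.30 to 5.12 mins and three observations fall in the first bin; `waiting_times` is an assumed name.

```python
import numpy as np

# Hypothetical waiting times in minutes (illustrative only, not the original table).
waiting_times = np.array([2.30, 2.50, 2.80, 3.00, 3.10, 3.30, 3.40,
                          3.50, 3.70, 3.90, 4.10, 4.30, 4.70, 5.12])

# Split the observed range into five equal-width bins and count values per bin.
counts, edges = np.histogram(waiting_times, bins=5)

# Print each bin range with its frequency.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"({left:.2f}, {right:.2f}): {count}")
```

With five equal-width bins over this range, the bin edges land on the same labels as the figure, and the first bin holds three values. Choosing a different number of bins changes the apparent shape of the distribution, so it is worth trying a few.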
ii. Distplot
* A distplot shows the distribution of a variable, that is, how probable its different values are. The figure is a distplot of the distribution of Release_year.

Fig. 1.8.2 Distplot (density against Release_year)

iii. Line chart
* A line chart connects a series of data points with a continuous line. This is the most basic type of chart used in finance and it typically only depicts a security's closing prices over time. Line charts can be used for any timeframe, but they most often use day-to-day price changes. Line charts are simplistic and may not fully capture patterns or trends.
* There are different types of line charts. They are :
o Line chart - It shows the trend over time (years, months, days) or other categories. It is used when the order of time or categories is important.
o Line chart with markers - It is similar to the line chart, but it will highlight data points with markers.
o Stacked line chart - This is a line chart where lines of the data points do not get overlapped because they will be cumulative at each point.
o Stacked line chart with markers - It is similar to the stacked line chart, but it will highlight data points with markers.
o 100 % stacked line chart - It shows the percentage contribution to a whole time or category.
o 100 % stacked line chart with markers - It is similar to the 100 % stacked line chart, but it will highlight data points.
* Below line chart shows number of houses sold in particular months.

Fig. 1.8.3 Lineplot (number of houses sold against months, Jan to Jun)

iv. Stacked area plot / chart
* An area chart combines the line chart and bar chart to show how one or more groups' numeric values change over the progression of a second variable, typically that of time. An area chart is distinguished from a line chart by the addition of shading between lines and a baseline, like in a bar chart.

v. Probability distribution plots
* Probability distributions are mathematical functions that describe all the possible values a random variable can assume within a given range. They help model random phenomena in order to estimate the probability of a particular event.
This is helpful to know the likely outcomes and the spread of potential values.
* For a single random variable, probability distributions can be divided into two types :
o Discrete distribution (binomial) : There are two possible outcomes in this distribution - success or failure - and multiple trials are carried out. The probability of success and failure is the same for all trials. The sum of all probabilities must equal one. Success probability : For example, suppose there are 7 successes in 10 attempts. The number of trials is ten and the number of successes is 7.
o Continuous distribution (also known as probability density function) : A continuous variable can take an infinite number of values between any two values; like weight can take any value like 45.3, 45.36, 45.369 or 45.3698 and so on. Probabilities for continuous distributions are measured over ranges of values rather than single points. A probability indicates the likelihood that a value will fall within an interval. The entire area under the distribution curve equals 1. For instance, the proportion of the area under the curve that falls within a range of values along the X-axis is the likelihood that a value will fall within that range. Hence the output is between 0 and 1.
* There are a variety of distributions that can be used to model different types of data.

Figure : Probability density function

Tables
* Tables are widely used in communication, research and data analysis. Tables appear in print media, handwritten notes, computer software, architectural ornamentation and many other places.
* Tables are used when the data cannot easily be presented visually or when the data requires more specific attention.

Correlation plots (Heat maps)
* Heat maps are used when there is a large amount of data. They are used during A/B testing to see which parts of a web page are accessed by users on a website, to see the number of reviews generated every hour or to analyze a cricket match to understand where a batsman is scoring the bulk of his runs.

Run chart
* A run chart displays observed data in a time sequence. Often the data represent some aspect of a business process's performance. Run charts are often analyzed to look for changes in a process over time.
* One can also use a cluster map to understand the relationship between two categorical variables. A cluster map basically plots a dendrogram that shows the categories of similar behavior together.

2. Bivariate plots (Used for bivariate data - the data containing two variables)

3. Special purpose plots
i. Pair plots
* Pair plots are a simple way to visualize relationships between multiple variables. A pair plot produces a matrix of relationships between variables in the data for a direct examination of the data.
* This plot shows how registered and casual users are using bike rentals. It also shows the effect of temperature, humidity and wind speed on bike rentals. This gives an overview of the correlation between multiple variables.

ii. Contour plots
* The contour plot can be used for representing a 3D surface in a 2D format. Contour plots are generally used for continuous variables rather than categorical data.
* The contour maps are inspired by seismic data analysis. They can show how the data points are spread out and can be used to explore deep learning error functions or gradient analysis.

iii. Density plots
* A density plot is a smoothed, continuous version of a histogram estimated from the data. The most common form of estimation is kernel density estimation. In this method, a continuous curve (the kernel) is drawn at every individual data point. All of these curves are then combined to make a single smooth density estimation.

vi. Lag plots
* A relationship between an observation and the previous observation is beneficial in time series modeling. Previous observations in a time series are called lags. The observation at one previous time step is known as lag 1, the observation at two previous steps as lag 2 and so on.
* A lag plot is a useful type of plot to explore the relationship between each observation and a lag of that observation. It is displayed as a scatter plot. If the points cluster along a diagonal line from the bottom-left to the top-right of the plot, it suggests a positive relationship. If the points cluster along a diagonal line from the top-left to the lower right, it suggests a negative relationship.
* Lag plots can help compare observations with those of the last week, the last month or the previous year by using corresponding lag values.
* The plot here shows the count of bike rentals compared to the previous day's count and it displays a relatively strong positive correlation.

vii. Auto-correlation plots
* The correlation between observations and their lag values in a time series is named autocorrelation. Correlation coefficients are plotted on an autocorrelation plot. Each coefficient is the correlation between the observations and their values at a given lag, a number between - 1 and + 1. A value close to zero suggests a weak correlation.
* The plot helps to better understand how this relationship changes over the lag. It shows the lag on the x-axis and the correlation on the y-axis.

viii. Lognormal plots
* A normal distribution can be converted to a lognormal distribution using logarithmic mathematics : a variable follows a lognormal distribution when its logarithm follows a normal distribution. The plot displays the Probability Density Function (PDF) and is of particular interest when the variable must be positive, as lognormal values are always positive.
* Many examples follow the lognormal distribution, like the concentration of elements and their radioactivity in the Earth's crust and latent periods of infectious diseases.

1.10 Merging Database (Using Pandas Library)
* suffixes is a tuple of strings to append to identical column names that aren't merge keys. This allows us to keep track of the origins of columns with the same name.
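The effect of suffixes can be sketched with two small DataFrames. This is a minimal illustration under stated assumptions: the frames `left` and `right`, their column names and the suffix strings are all hypothetical, chosen only so that both frames share a non-key column called score.

```python
import pandas as pd

# Hypothetical data: both frames carry a "score" column that is not a merge key.
left = pd.DataFrame({"id": [1, 2, 3], "score": [10, 20, 30]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [200, 300, 400]})

# Default how="inner": keep only the ids present in both frames.
# The overlapping "score" columns get the given suffixes appended.
merged = pd.merge(left, right, on="id", suffixes=("_left", "_right"))
print(merged)

# how="outer": keep every id from either frame; missing values become NaN.
outer = pd.merge(left, right, on="id", how="outer", suffixes=("_left", "_right"))
print(outer)
```

The inner merge keeps ids 2 and 3 with columns score_left and score_right, so the origin of each value stays visible. The outer merge keeps all four ids at the cost of missing values where one frame has no matching row.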
* These are some of the most important parameters to pass to merge().

Using merge()
* Before getting into the details of how to use merge(), one should first understand the various forms of joins :
o Inner
o Outer
o Left
o Right
* Many-to-many joins are more complex and result in the Cartesian product of the joined rows. This means that, after the merge, there will be every combination of rows that share the same value in the key column.
* What makes merge() so flexible is the sheer number of options for defining the behavior of the merge.

Fig. 1.10.4 Merge and joins

* In each image, the two circles are the two datasets and the labels point to which part or parts of the datasets can be seen.

Using join()
* how has the same options as how from merge(). The difference is that it is index-based unless columns are also specified.
* lsuffix and rsuffix are similar to suffixes in merge(). They specify a suffix to add to any overlapping columns but have no effect when passing a list of other DataFrames.
* sort can be enabled to sort the resulting DataFrame by the join key.

Using concat() : Combining data across rows or columns
* When concatenating datasets, one can specify the axis along which to concatenate. With no parameters, the concatenation is along rows.
* By default, a concatenation results in a set union, where all data is preserved. This is the same as an outer join with merge() and join(), and it can be set with the join parameter. With the value inner, concat() will perform an inner join or set intersection.
* As with the case of other inner joins, some data loss can occur when an inner join with concat() is carried out. Rows or columns are preserved only where the axis labels match.
* Below are some of the other parameters that concat() takes,
o axis represents the axis to concatenate along. The default value is 0, which concatenates along the index or row axis. Alternatively, a value of 1 will concatenate along columns. One can also use the string values "index" or "columns".
o join is similar to the how parameter in the other techniques, but it only accepts the values inner or outer. The default value is outer, which preserves data, while inner would eliminate data that doesn't have a match in the other dataset.
o ignore_index takes a Boolean True or False value. It defaults to False. If True, then the new combined dataset won't preserve the original index values in the axis specified in the axis parameter. This gives entirely new index values.
o keys allows building a hierarchical index while preserving the original indices so that one can tell which rows come from which original dataset.

1.11 Reshaping and Pivoting
melt()
Syntax :
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

Parameters :
Name | Description | Type / Default value | Required / Optional
frame | DataFrame | DataFrame | Required
id_vars | Column(s) to use as identifier variables | tuple, list or ndarray | Optional
value_vars | Column(s) to unpivot | tuple, list or ndarray | Optional
var_name | Name to use for the 'variable' column. If None it uses frame.columns.name or 'variable' | scalar | Optional
value_name | Name to use for the 'value' column | scalar, default 'value' | Optional
col_level | If columns are a MultiIndex, the level to melt | int or string | Optional

Returns : Unpivoted DataFrame.

Example Program - 3
# Using melt function
import pandas as pd
d1 = {"Name": ...

pivot()
Returns : Reshaped DataFrame.
Raises : ValueError, when there are any index, columns combinations with multiple values.

Example Program - 4
import pandas as pd
d1 = {"Name": ["Revi", "Aniket", "An"], "ID": [1, 2, 3], "Role": ["CEO", "Editor", "Author"]}
df = pd.DataFrame(d1)
print(df)
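How melt() and pivot() relate can be sketched as a round trip from wide to long and back. This is a minimal illustration, not the book's program: the frame below uses placeholder names "A", "B" and "C", and the choice of ID as the identifier column is an assumption made for the example.

```python
import pandas as pd

# Hypothetical records with placeholder names (illustrative only).
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "ID": [1, 2, 3],
    "Role": ["CEO", "Editor", "Author"],
})

# Wide -> long: ID stays as the identifier, Name and Role are unpivoted
# into a 'variable' column (old column name) and a 'value' column (old cell).
long_df = pd.melt(df, id_vars=["ID"], value_vars=["Name", "Role"])
print(long_df)

# Long -> wide: each distinct entry of 'variable' becomes a column again.
# This works because every (ID, variable) pair occurs exactly once;
# duplicates would raise ValueError.
wide_df = long_df.pivot(index="ID", columns="variable", values="value")
print(wide_df)
```

The melted frame has one row per (ID, variable) pair, six rows in total, and pivoting it restores one row per ID with Name and Role as columns. When (index, columns) pairs repeat and need aggregating, pivot_table() is the usual alternative.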
