Data Analysis W Pandas

The document provides code examples and descriptions for importing data sets, data wrangling, exploratory data analysis, model development, and model evaluation/refinement in Python. Some key steps include reading CSV data into a pandas dataframe, handling missing data, data normalization, correlation analysis, linear and polynomial regression modeling, evaluating models using metrics like R^2 and MSE, and optimizing models with techniques like k-fold cross validation and grid search.

Uploaded by

x7jn4sxdn9
Copyright: © All Rights Reserved

Cheat Sheet: Importing Data Sets

| Package/Method | Description | Code Example |
| --- | --- | --- |
| Read CSV data set | Read CSV file to a pandas data frame | pd.read_csv('data.csv', header=None) <br> pd.read_csv('data.csv', header=0) |
| Print first few entries | Print first few entries of the pandas data frame | df.head() # default prints first 5 entries |
| Print last few entries | Print last few entries of the pandas data frame | df.tail() # default prints last 5 entries |
| Assign header names | Assign appropriate header names to the data frame | df.columns = ['Column1', 'Column2', ...] |
| Replace "?" with NaN | Replace "?" entries with NaN from Numpy library | df = df.replace("?", np.nan) |
| Retrieve data types | Retrieve data types of the data frame columns | df.dtypes |
| Retrieve statistical description | Retrieve statistical description of the data set | df.describe(include="all") |
| Retrieve data set summary | Retrieve summary of the data set from the data frame | df.info() |
| Save data frame to CSV | Save processed data frame to a CSV file with a specified path | df.to_csv('processed_data.csv') |
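
The import steps can be chained as in the sketch below. The CSV content, the column names (`make`, `price`, `mpg`), and the use of `io.StringIO` in place of a file on disk are all assumptions made so the example runs on its own.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical headerless CSV content standing in for 'data.csv'.
raw = io.StringIO("alfa,13495,21\naudi,?,24\nbmw,16430,?\n")

df = pd.read_csv(raw, header=None)       # the file has no header row
df.columns = ["make", "price", "mpg"]    # assign header names (illustrative)
df = df.replace("?", np.nan)             # mark "?" entries as missing

print(df.head())     # first rows (up to 5)
print(df.dtypes)     # price and mpg are not numeric yet
df.info()            # column summary with non-null counts

# Count missing entries per column, then serialize without the index column.
missing_counts = df.isna().sum()
csv_text = df.to_csv(index=False)
```

Calling `to_csv` without a path returns the CSV text instead of writing a file, which is convenient for a quick check. Passing a path such as `'processed_data.csv'` writes to disk as in the table.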

Cheat Sheet: Data Wrangling

| Package/Method | Description | Code Example |
| --- | --- | --- |
| Replace missing data with frequency | Replace missing values with the mode (most common entry) | df['attribute'] = df['attribute'].fillna(df['attribute'].mode()[0]) |
| Replace missing data with mean | Replace missing values with the mean of the entries | df['attribute'] = df['attribute'].fillna(df['attribute'].mean()) |
| Fix the data types | Fix data types of columns in the dataframe | df['numeric_col'] = df['numeric_col'].astype('float64') |
| Data Normalization | Normalize data in a column between 0 and 1 | df['attribute'] = (df['attribute'] - df['attribute'].min()) / (df['attribute'].max() - df['attribute'].min()) |
| Binning | Create bins for better analysis and visualization | pd.cut(df['numeric_col'], bins=5) |
| Change column name | Change label name of a dataframe column | df.rename(columns={'old_name':'new_name'}, inplace=True) |
| Indicator Variables | Create indicator variables for categorical data | pd.get_dummies(df['categorical_col']) |
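
A minimal sketch of the wrangling steps applied in sequence. The data frame, its column names, and the bin labels are hypothetical. The missing values are filled by assigning the result back to the column, because calling `fillna(..., inplace=True)` on a selected column triggers a warning in recent pandas 2.x releases.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'price' arrives as strings, and both columns have a gap.
df = pd.DataFrame({
    "price": ["10", "20", np.nan, "40"],
    "fuel": ["gas", "gas", np.nan, "diesel"],
})

# Fix the data type first so that the mean can be computed.
df["price"] = df["price"].astype("float64")

# Fill gaps: mean for the numeric column, mode for the categorical one.
df["price"] = df["price"].fillna(df["price"].mean())
df["fuel"] = df["fuel"].fillna(df["fuel"].mode()[0])

# Min-max normalization to the range [0, 1].
price_range = df["price"].max() - df["price"].min()
df["price_norm"] = (df["price"] - df["price"].min()) / price_range

# Three equal-width bins with illustrative labels.
df["price_bin"] = pd.cut(df["price"], bins=3, labels=["low", "medium", "high"])

# Rename a column, then build indicator (dummy) variables from it.
df = df.rename(columns={"fuel": "fuel_type"})
dummies = pd.get_dummies(df["fuel_type"])
```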

Cheat Sheet: Exploratory Data Analysis

| Package/Method | Description | Code Example |
| --- | --- | --- |
| Complete dataframe correlation | Correlation matrix using all attributes | df.corr(numeric_only=True) # skips non-numeric columns, which otherwise raise an error in pandas 2.x |
| Specific Attribute correlation | Correlation matrix using specific attributes | df[['attr1', 'attr2']].corr() |
| Scatter Plot | Create scatter plot for dependent vs independent variables | plt.scatter(df['independent'], df['dependent']) |
| Regression Plot | Create regression plot using dependent and independent variables | sns.regplot(x='independent', y='dependent', data=df) |
| Box plot | Create box-and-whisker plot for variables | sns.boxplot(x='category', y='numeric', data=df) |
| Grouping by attributes | Create subset of data based on different attributes | df_group = df.groupby('attribute') |
| GroupBy statements | Group data and display average value of numerical attributes | df_group = df.groupby('attr')['numeric'].mean() |
| Pivot Tables | Create Pivot tables for data representation | pivot = df.pivot_table(index='attr1', columns='attr2', values='numeric') |
| Pseudocolor plot | Create heatmap using Pivot table data | plt.pcolor(pivot, cmap='RdBu') |
| Pearson Coefficient and p-value | Calculate Pearson Coefficient and p-value | from scipy import stats <br> pearson_coef, p_value = stats.pearsonr(df['attr1'], df['attr2']) |

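
The non-plotting parts of this sheet can be tried with pandas alone, as in the sketch below. The data set and its column names (`drive`, `body`, `engine_size`, `price`) are made up for illustration, and the plotting calls are left out so the example has no matplotlib or seaborn dependency.

```python
import pandas as pd

# Hypothetical data set; column names and values are illustrative only.
df = pd.DataFrame({
    "drive": ["fwd", "fwd", "rwd", "rwd", "4wd", "4wd"],
    "body": ["sedan", "wagon", "sedan", "wagon", "sedan", "wagon"],
    "engine_size": [100, 110, 150, 160, 130, 140],
    "price": [10000, 11000, 20000, 21000, 15000, 16000],
})

# Correlation over every numeric column, then over a chosen pair.
corr_all = df.corr(numeric_only=True)
corr_pair = df[["engine_size", "price"]].corr()

# Average price per drive type.
mean_price = df.groupby("drive")["price"].mean()

# Pivot table: drive types as rows, body styles as columns, mean price as values.
pivot = df.pivot_table(index="drive", columns="body", values="price")

print(corr_pair)
print(mean_price)
print(pivot)
```

The resulting `pivot` frame is the kind of input that `plt.pcolor` expects for the pseudocolor plot listed in the table.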
Cheat Sheet: Model Development

| Process | Description | Code Example |
| --- | --- | --- |
| Linear Regression | Create Linear Regression model object | from sklearn.linear_model import LinearRegression <br> lr = LinearRegression() |
| Train Linear Regression model | Train Linear Regression model on input and output attributes | X = df[['attr1', 'attr2']] <br> Y = df['target'] <br> lr.fit(X, Y) |
| Generate output predictions | Predict output for set of input attribute values | Y_hat = lr.predict(X) |
| Identify the coefficient and intercept | Get slope coefficient and intercept values of the model | coeff = lr.coef_ <br> intercept = lr.intercept_ |
| Residual Plot | Create residual plot for regression analysis | sns.residplot(x=df['attr1'], y=df['attr2']) |
| Distribution Plot | Plot distribution of data with respect to an attribute | sns.kdeplot(df['attribute']) # replaces the deprecated sns.distplot(df['attribute'], hist=False) |
| Polynomial Regression | Fit polynomial regression model using numpy | f = np.polyfit(x, y, deg) <br> p = np.poly1d(f) <br> Y_hat = p(x) |
| Multi-variate Polynomial Regression | Generate new feature matrix with polynomial combinations | from sklearn.preprocessing import PolynomialFeatures <br> pr = PolynomialFeatures(degree=2) <br> Z_pr = pr.fit_transform(Z) |
| Pipeline | Create data pipelines to simplify processing steps | from sklearn.pipeline import Pipeline <br> from sklearn.preprocessing import StandardScaler <br> pipe = Pipeline([('scale', StandardScaler()), ('model', LinearRegression())]) |
| R^2 value | Calculate R^2 for linear and polynomial regression | from sklearn.metrics import r2_score <br> R2_score = lr.score(X, Y) <br> R2_score = r2_score(y, p(x)) |
| MSE value | Calculate Mean Squared Error | from sklearn.metrics import mean_squared_error <br> mse = mean_squared_error(Y, Y_hat) |
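
The sketch below compares a linear fit, a numpy polynomial fit, and a scikit-learn pipeline on the same data. The data is synthetic (a noise-free quadratic chosen for illustration), and placing `PolynomialFeatures` inside the pipeline is one possible arrangement, not something the table prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (an assumption for illustration): y = 2x^2 + 3x + 1, no noise.
x = np.linspace(-3, 3, 50)
y = 2 * x**2 + 3 * x + 1
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# Straight-line fit: underfits the curved relationship.
lr = LinearRegression().fit(X, y)
r2_linear = lr.score(X, y)

# Degree-2 polynomial fit with numpy.
coeffs = np.polyfit(x, y, 2)
p = np.poly1d(coeffs)
r2_poly = r2_score(y, p(x))
mse_poly = mean_squared_error(y, p(x))

# Pipeline: scale, expand to polynomial features, then fit a linear model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
r2_pipe = pipe.score(X, y)

print(f"linear R^2: {r2_linear:.3f}, polynomial R^2: {r2_poly:.3f}")
```

On data like this, the R^2 of the straight-line model stays well below that of the two quadratic models, which is the comparison the R^2 and MSE rows are meant to support.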

Cheat Sheet: Model Evaluation and Refinement

| Process | Description | Code Example |
| --- | --- | --- |
| Splitting data for training and testing | Separate data into training and testing sets | from sklearn.model_selection import train_test_split <br> x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42) |
| Cross validation score | Evaluate model performance using cross-validation | from sklearn.model_selection import cross_val_score <br> scores = cross_val_score(model, X, Y, cv=5) |
| Cross validation prediction | Predict output using cross-validated model | from sklearn.model_selection import cross_val_predict <br> y_pred = cross_val_predict(model, X, Y, cv=4) |
| Ridge Regression and Prediction | Implement Ridge Regression model | from sklearn.linear_model import Ridge <br> ridge_model = Ridge(alpha=0.5) <br> ridge_model.fit(x_train, y_train) <br> yhat = ridge_model.predict(x_test) |
| Grid Search | Use Grid Search to find best model parameters | from sklearn.model_selection import GridSearchCV <br> param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]} <br> grid_search = GridSearchCV(Ridge(), param_grid, cv=5) <br> grid_search.fit(X, Y) <br> best_params = grid_search.best_params_ |
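
An end-to-end sketch of the evaluation steps: split, fit Ridge, cross-validate, then grid-search `alpha`. The data is randomly generated for illustration, and running the cross-validation and grid search on the training split only is a choice made here so that the test split stays unseen.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic regression data (an assumption): 3 features, small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Ridge regression scored on the held-out data (R^2).
ridge_model = Ridge(alpha=0.5).fit(x_train, y_train)
test_r2 = ridge_model.score(x_test, y_test)

# 5-fold cross-validation scores on the training split.
scores = cross_val_score(Ridge(alpha=0.5), x_train, y_train, cv=5)

# Grid search over the regularization strength.
param_grid = {"alpha": [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5)
grid_search.fit(x_train, y_train)
best_alpha = grid_search.best_params_["alpha"]

print(f"test R^2: {test_r2:.3f}, mean CV R^2: {scores.mean():.3f}, best alpha: {best_alpha}")
```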
