The document provides code examples and descriptions for importing data sets, data wrangling, exploratory data analysis, model development, and model evaluation/refinement in Python. Some key steps include reading CSV data into a pandas dataframe, handling missing data, data normalization, correlation analysis, linear and polynomial regression modeling, evaluating models using metrics like R^2 and MSE, and optimizing models with techniques like k-fold cross validation and grid search.
Data Analysis W Pandas
Cheat Sheet: Importing Data Sets
| Package/Method | Description | Code Example |
| --- | --- | --- |
| Read CSV data set | Read CSV file to a pandas data frame | `pd.read_csv('data.csv', header=None)` <br> `pd.read_csv('data.csv', header=0)` |
| Print first few entries | Print first few entries of the pandas data frame | `df.head()  # default prints first 5 entries` |
| Print last few entries | Print last few entries of the pandas data frame | `df.tail()  # default prints last 5 entries` |
| Assign header names | Assign appropriate header names to the data frame | `df.columns = ['Column1', 'Column2', ...]` |
| Replace "?" with NaN | Replace "?" entries with NaN from Numpy library | `df = df.replace("?", np.nan)` |
| Retrieve data types | Retrieve data types of the data frame columns | `df.dtypes` |
| Retrieve statistical description | Retrieve statistical description of the data set | `df.describe(include="all")` |
| Retrieve data set summary | Retrieve summary of the data set from the data frame | `df.info()` |
| Save data frame to CSV | Save processed data frame to a CSV file with a specified path | `df.to_csv('processed_data.csv')` |
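The import steps above can be chained into one short run. The following is a minimal sketch, not a definitive recipe: the CSV text, the column names (`make`, `horsepower`, `price`), and the use of `io.StringIO` in place of a file on disk are all assumptions made so the example is self-contained.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content: no header row, "?" marks missing values.
raw_csv = "audi,?,13950\nbmw,192,16430\nvolvo,114,?\n"

# header=None because the first row is data, not column names.
df = pd.read_csv(io.StringIO(raw_csv), header=None)

# Assign illustrative header names (assumed, not from a real data set).
df.columns = ["make", "horsepower", "price"]

# Replace the "?" placeholders with NaN so pandas treats them as missing.
df = df.replace("?", np.nan)

# The numeric columns were read as strings because of "?"; convert one.
df["price"] = df["price"].astype("float64")

print(df.head())   # first rows (default is 5)
print(df.dtypes)   # data type of each column
df.info()          # summary including non-null counts

# With no path, to_csv returns the CSV text; index=False omits the row index.
csv_text = df.to_csv(index=False)
```

Passing `index=False` is a design choice: without it, the row index is written as an extra unnamed column and reappears when the file is read back.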
Cheat Sheet: Data Wrangling
| Package/Method | Description | Code Example |
| --- | --- | --- |
| Replace missing data with frequency | Replace missing values with the mode (most common entry) | `df['attribute'] = df['attribute'].fillna(df['attribute'].mode()[0])` |
| Replace missing data with mean | Replace missing values with the mean of the entries | `df['attribute'] = df['attribute'].fillna(df['attribute'].mean())` |
| Fix the data types | Fix data types of columns in the dataframe | `df['numeric_col'] = df['numeric_col'].astype('float64')` |
| Data Normalization | Normalize data in a column between 0 and 1 | `df['attribute'] = (df['attribute'] - df['attribute'].min()) / (df['attribute'].max() - df['attribute'].min())` |
| Binning | Create bins for better analysis and visualization | `pd.cut(df['numeric_col'], bins=5)` |
| Change column name | Change label name of a dataframe column | `df.rename(columns={'old_name':'new_name'}, inplace=True)` |
| Indicator Variables | Create indicator variables for categorical data | `pd.get_dummies(df['categorical_col'])` |
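The wrangling steps above can be tried on a toy data frame. This is a rough sketch under stated assumptions: the columns `fuel` and `price`, their values, and the bin labels are invented for illustration. Missing values are filled by assigning the result back to the column, which avoids relying on `inplace=True` applied to a column selection.

```python
import numpy as np
import pandas as pd

# Made-up data: prices stored as strings, with one missing entry per column.
df = pd.DataFrame({
    "fuel": ["gas", "diesel", np.nan, "gas"],
    "price": ["10", "20", np.nan, "40"],
})

# Fix the data type first so numeric operations work.
df["price"] = df["price"].astype("float64")

# Fill missing values: mode for the categorical column, mean for the numeric one.
df["fuel"] = df["fuel"].fillna(df["fuel"].mode()[0])
df["price"] = df["price"].fillna(df["price"].mean())

# Min-max normalization scales the column to the range [0, 1].
price_range = df["price"].max() - df["price"].min()
df["price_norm"] = (df["price"] - df["price"].min()) / price_range

# Three equal-width bins with hypothetical labels.
df["price_bin"] = pd.cut(df["price"], bins=3, labels=["low", "medium", "high"])

# Rename a column, then add one indicator column per category.
df = df.rename(columns={"fuel": "fuel_type"})
df = pd.concat([df, pd.get_dummies(df["fuel_type"])], axis=1)

print(df)
```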
Cheat Sheet: Exploratory Data Analysis
| Package/Method | Description | Code Example |
| --- | --- | --- |
| Complete dataframe correlation | Correlation matrix using all attributes | `df.corr()` |
| Specific Attribute correlation | Correlation matrix using specific attributes | `df[['attr1', 'attr2']].corr()` |
| Scatter Plot | Create scatter plot for dependent vs independent variables | `plt.scatter(df['independent'], df['dependent'])` |
| Regression Plot | Create regression plot using dependent and independent variables | `sns.regplot(x='independent', y='dependent', data=df)` |
| Box plot | Create box-and-whisker plot for variables | `sns.boxplot(x='category', y='numeric', data=df)` |
| Grouping by attributes | Create subset of data based on different attributes | `df_group = df.groupby('attribute')` |
| GroupBy statements | Group data and display average value of numerical attributes | `df_group = df.groupby('attr')['numeric'].mean()` |
| Pivot Tables | Create Pivot tables for data representation | `pivot = df.pivot_table(index='attr1', columns='attr2', values='numeric')` |
| Pseudocolor plot | Create heatmap using Pivot table data | `plt.pcolor(pivot, cmap='RdBu')` |
| Pearson Coefficient and p-value | Calculate Pearson Coefficient and p-value | `pearson_coef, p_value = stats.pearsonr(df['attr1'], df['attr2'])` |

Cheat Sheet: Model Development
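Before any model is fitted, the non-plotting exploratory steps (correlation, grouping, pivot tables, Pearson coefficient) can be run end to end on a small data frame. The sketch below is illustrative only: the column names and values are invented, and `stats` is assumed to refer to `scipy.stats`.

```python
import pandas as pd
from scipy import stats

# Invented data for illustration only.
df = pd.DataFrame({
    "drive": ["fwd", "fwd", "rwd", "rwd", "fwd", "rwd"],
    "body": ["sedan", "wagon", "sedan", "wagon", "sedan", "sedan"],
    "engine_size": [100, 120, 150, 180, 110, 200],
    "price": [10000, 12500, 16000, 19000, 11000, 22000],
})

# Correlation matrix restricted to two numeric attributes.
corr = df[["engine_size", "price"]].corr()

# Average price within each drive type.
mean_by_drive = df.groupby("drive")["price"].mean()

# Pivot table: rows are drive types, columns are body styles,
# cells hold the mean price (the default aggregation).
pivot = df.pivot_table(index="drive", columns="body", values="price")

# Pearson coefficient and its p-value for the same pair of attributes.
pearson_coef, p_value = stats.pearsonr(df["engine_size"], df["price"])

print(corr)
print(mean_by_drive)
print(pivot)
print(pearson_coef, p_value)
```

The off-diagonal entry of `corr` and `pearson_coef` measure the same quantity; `stats.pearsonr` additionally reports a p-value.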
| Process | Description | Code Example |
| --- | --- | --- |
| Linear Regression | Create Linear Regression model object | `from sklearn.linear_model import LinearRegression` <br> `lr = LinearRegression()` |
| Train Linear Regression model | Train Linear Regression model on input and output attributes | `X = df[['attr1', 'attr2']]` <br> `Y = df['target']` <br> `lr.fit(X, Y)` |
| Generate output predictions | Predict output for a set of input attribute values | `Y_hat = lr.predict(X)` |
| Identify the coefficient and intercept | Get slope coefficient and intercept values of the model | `coeff = lr.coef_` <br> `intercept = lr.intercept_` |
| Residual Plot | Create residual plot for regression analysis | `sns.residplot(x=df['attr1'], y=df['attr2'])` |
| Distribution Plot | Plot distribution of data with respect to an attribute | `sns.kdeplot(df['attribute'])  # replaces the deprecated sns.distplot(df['attribute'], hist=False)` |
| Polynomial Regression | Fit polynomial regression model using numpy | `f = np.polyfit(x, y, deg)` <br> `p = np.poly1d(f)` <br> `Y_hat = p(x)` |
| Multi-variate Polynomial Regression | Generate new feature matrix with polynomial combinations | `from sklearn.preprocessing import PolynomialFeatures` <br> `pr = PolynomialFeatures(degree=2)` <br> `Z_pr = pr.fit_transform(Z)` |
| Pipeline | Create data pipelines to simplify processing steps | `from sklearn.pipeline import Pipeline` <br> `pipe = Pipeline([('scale', StandardScaler()), ('model', LinearRegression())])` |
| R^2 value | Calculate R^2 for linear and polynomial regression | `R2_score = lr.score(X, Y)` <br> `R2_score = r2_score(y, p(x))` |
| MSE value | Calculate Mean Squared Error | `from sklearn.metrics import mean_squared_error` <br> `mse = mean_squared_error(Y, Y_hat)` |
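To see how the linear, polynomial, and pipeline entries compare, the sketch below fits all three to the same synthetic data. It is a minimal illustration under assumptions: the data is generated from an exact quadratic chosen for this example, and the `poly` pipeline step is an addition to the two-step pipeline shown in the table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: y is an exact quadratic in x (an assumption for illustration).
x = np.linspace(-3, 3, 30)
y = 2.0 * x**2 - x + 1.0
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# Plain linear regression: captures only the linear trend.
lr = LinearRegression()
lr.fit(X, y)
r2_linear = lr.score(X, y)

# Degree-2 polynomial fit with NumPy.
f = np.polyfit(x, y, 2)   # coefficients, highest degree first
p = np.poly1d(f)
r2_poly = r2_score(y, p(x))

# Pipeline: scale, expand to polynomial features, then fit a linear model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
mse_pipe = mean_squared_error(y, pipe.predict(X))

print(r2_linear, r2_poly, mse_pipe)
```

Because the data is quadratic, the straight-line fit should score a low R^2 while both polynomial approaches fit almost exactly; on real data the gap would depend on how nonlinear the relationship is.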
Cheat Sheet: Model Evaluation and Refinement
| Process | Description | Code Example |
| --- | --- | --- |
| Splitting data for training and testing | Separate data into training and testing sets | `from sklearn.model_selection import train_test_split` <br> `x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)` |
| Cross validation score | Evaluate model performance using cross-validation | `from sklearn.model_selection import cross_val_score` <br> `scores = cross_val_score(model, X, Y, cv=5)` |
| Cross validation prediction | Predict output using cross-validated model | `from sklearn.model_selection import cross_val_predict` <br> `y_pred = cross_val_predict(model, X, Y, cv=4)` |
| Ridge Regression and Prediction | Implement Ridge Regression model | `from sklearn.linear_model import Ridge` <br> `ridge_model = Ridge(alpha=0.5)` <br> `ridge_model.fit(x_train, y_train)` <br> `yhat = ridge_model.predict(x_test)` |
| Grid Search | Use Grid Search to find best model parameters | `from sklearn.model_selection import GridSearchCV` <br> `param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}` <br> `grid_search = GridSearchCV(Ridge(), param_grid, cv=5)` <br> `grid_search.fit(X, Y)` <br> `best_params = grid_search.best_params_` |
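The evaluation steps above fit together as one workflow: split, fit, cross-validate, then tune. The following is a hedged sketch rather than a prescribed procedure; the data is synthetic (a noisy linear combination of three random features), and running the grid search on the training split only is a choice made here so that the test split stays untouched until the end.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic data (an assumption for illustration): 3 features, linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 20% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Ridge regression with a fixed regularization strength.
ridge_model = Ridge(alpha=0.5)
ridge_model.fit(x_train, y_train)
test_r2 = ridge_model.score(x_test, y_test)  # R^2 on the held-out rows

# 5-fold cross-validation scores (R^2 by default for regressors).
scores = cross_val_score(Ridge(alpha=0.5), x_train, y_train, cv=5)

# Grid search over alpha, also with 5-fold cross-validation.
param_grid = {"alpha": [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5)
grid_search.fit(x_train, y_train)
best_params = grid_search.best_params_

print(test_r2, scores.mean(), best_params)
```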