
Feature engineering techniques: correlation, variance inflation factor, practical implementation

Contents:

- Aggregated Statistical Summaries for Numeric Features with cuDF
- Hash-based Frequency Encoding for Tokenized Categorical Data
- Feature Importance with cuML Random Forest on Encrypted Data
- Feature Importance with RandomForestClassifier in Pandas

10/27/2024

For your upcoming hackathon, using RAPIDS cuDF and cuML libraries on
Google Colab’s paid tier with GPUs (often T4, V100, or A100), you’ll be able
to explore and apply feature engineering techniques on encrypted data
efficiently. Since the dataset is encrypted and in tabular form, you’ll focus on
patterns and feature extraction without needing to decrypt the data directly.

Here’s a recommended approach:

### Step 1: Set Up RAPIDS Environment in Google Colab

Since RAPIDS isn’t natively available in Colab, you’ll need to set it up. This
installation works best on GPUs like T4, P100, or V100:

```python

# Install RAPIDS cuDF and cuML in Google Colab using the official
# rapidsai-csp-utils helper scripts (check the RAPIDS installation
# docs for the currently supported method and version)
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

```

Once installed, restart the runtime. After restarting, you’ll be ready to use
RAPIDS libraries for fast data processing.

### Step 2: Data Preprocessing and Exploration

With an encrypted dataset, you won’t be able to access the raw values
directly, so use feature engineering techniques like **statistical
aggregations** or **hash-based encodings** to extract relevant patterns.

#### 1. Frequency Encoding for Encrypted Categorical Data

Encrypted categorical columns (e.g., hashed or tokenized values) can be frequency encoded, which assigns each unique value its count of occurrences in the dataset. This technique is effective for categorical data: it preserves privacy while still revealing distributional patterns.

```python

import cudf

# Sample encrypted-like categorical data
data = cudf.DataFrame({'encrypted_feature': ['hash1', 'hash2', 'hash1',
                                             'hash3', 'hash2', 'hash1']})

# Frequency encoding: map each unique value to its count
frequency_encoded = data['encrypted_feature'].value_counts().reset_index()
frequency_encoded.columns = ['encrypted_feature', 'frequency']

# Merge the frequencies back onto the original rows
data = data.merge(frequency_encoded, on='encrypted_feature', how='left')
print("Frequency Encoded Data:\n", data)

```
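Since the section names *hash-based* encodings but the snippet above starts from already-hashed strings, here is a CPU sketch that hashes raw tokens first with Python's `hashlib` before frequency encoding. It uses pandas (whose API cuDF mirrors) so it runs without a GPU; the token values are illustrative, not from the original dataset:

```python
import hashlib

import pandas as pd

# Illustrative raw tokens; in practice these may already arrive hashed
tokens = pd.Series(['apple', 'pear', 'apple', 'plum', 'pear', 'apple'],
                   name='token')

# Hash each token to an anonymized identifier
# (first 8 hex chars of SHA-256, for readability)
hashed = tokens.map(lambda t: hashlib.sha256(t.encode()).hexdigest()[:8])

# Frequency-encode the hashed values, as in the cuDF example above
freq = hashed.map(hashed.value_counts())
df = pd.DataFrame({'hashed_token': hashed, 'frequency': freq})
print(df)
```

On a cuDF DataFrame the same steps apply unchanged apart from the hashing, which cuDF can do on-GPU instead of via a Python `map`.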

#### 2. Statistical Aggregations on Encrypted Numeric Features

For numeric columns in encrypted datasets, aggregated statistics such as the **mean, sum, and standard deviation** of each feature can reveal trends or outliers without exposing the original values.

```python

# Sample encrypted-like numeric data
data = cudf.DataFrame({
    'feature1': [2345, 5234, 1987, 7342, 4251],
    'feature2': [1001, 2340, 897, 3451, 2033]
})

# Aggregate each feature: mean, sum, standard deviation
aggregated_features = data.agg(['mean', 'sum', 'std'])
print("Aggregated Features:\n", aggregated_features)

```
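The outlier angle mentioned above can be sketched on CPU with NumPy (mirroring what cuDF would compute on the GPU): standardize a feature with its aggregated mean and standard deviation, then flag rows whose z-score is large. The threshold of 2 here is an arbitrary illustrative choice:

```python
import numpy as np

# Same sample values as feature1 in the cuDF example above
feature1 = np.array([2345, 5234, 1987, 7342, 4251], dtype=float)

# Aggregated summaries: mean and sample standard deviation
mean, std = feature1.mean(), feature1.std(ddof=1)

# Flag rows whose z-score magnitude exceeds 2 (illustrative threshold)
z_scores = (feature1 - mean) / std
outliers = np.abs(z_scores) > 2
print("z-scores:", z_scores.round(2))
print("outlier mask:", outliers)
```

With this small sample no row exceeds the threshold; on a real 10GB dataset the same mask computed in cuDF would isolate anomalous rows without ever decoding them.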

### Step 3: Feature Importance Using cuML’s Random Forest

Training a `RandomForestClassifier` from RAPIDS’ `cuML` library on the encrypted features can provide insight into which of them carry predictive signal.

```python

from cuml.ensemble import RandomForestClassifier

# Features and target from the encrypted dataset (cuML expects float32)
X = data[['feature1', 'feature2']].astype('float32')
y = cudf.Series([1, 0, 1, 0, 1], dtype='int32')  # binary labels

# Train a Random Forest model
rf_model = RandomForestClassifier(n_estimators=10, max_depth=3,
                                  random_state=42)
rf_model.fit(X, y)

# cuML's RandomForestClassifier does not expose feature_importances_,
# so estimate importance by permuting each column and measuring the
# resulting drop in training accuracy (permutation importance)
baseline = rf_model.score(X, y)
importances = {}
for col in X.columns:
    X_perm = X.copy()
    X_perm[col] = X_perm[col].sample(frac=1.0, random_state=0) \
                             .reset_index(drop=True)
    importances[col] = float(baseline - rf_model.score(X_perm, y))

importance_df = cudf.DataFrame({'feature': list(importances.keys()),
                                'importance': list(importances.values())})
print("Feature Importances:\n",
      importance_df.sort_values(by='importance', ascending=False))

```

### Best Practices and Tips

- **Use cuDF for Large Data**: Since your data is 10GB, `cuDF` on GPUs will significantly speed up processing.
- **Focus on Secure Aggregations**: Techniques like `mean`, `sum`, and frequency encoding are useful for anonymized insights.
- **Consider Non-Decryption-Based Feature Selection**: Look at statistical summaries, transformations, and feature importance rankings rather than relying on decrypted values.

By following these steps, you can extract meaningful patterns and insights from encrypted data and apply feature engineering methods in a secure, GPU-accelerated environment with RAPIDS on Google Colab.
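The title also lists correlation and the variance inflation factor (VIF), which the walkthrough above does not implement. A minimal CPU sketch with NumPy follows, using the definition VIF_i = 1 / (1 − R²_i), where R²_i comes from regressing feature i on the remaining features; the sample values are illustrative:

```python
import numpy as np

# Illustrative numeric features (one column per feature)
X = np.array([[2345., 1001., 3.1],
              [5234., 2340., 6.0],
              [1987.,  897., 2.2],
              [7342., 3451., 8.9],
              [4251., 2033., 4.7]])

# Pearson correlation matrix between features
corr = np.corrcoef(X, rowvar=False)
print("Correlation matrix:\n", corr.round(3))

def vif(X):
    """VIF per feature: regress it on the others (with intercept)."""
    out = []
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

print("VIFs:", [round(v, 2) for v in vif(X)])
```

On GPU the same computation can be run with `cupy` or cuDF's `corr`, and features with very high VIF (strong multicollinearity) are candidates for removal before model training.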
