Machine learning Assignment

The document outlines an exam for valuing machine learning's potential in real-time fraud prediction for a bank's credit card transactions. It details specific tasks, including identifying data leakage, ethical considerations, feature engineering, and modeling procedures, while also emphasizing the importance of a structured report. Deliverables are divided into two sections, with the first focusing on specific questions and the second on a comprehensive report of findings and methodologies.

Uploaded by

Sonal Katiyar

EXAM DESCRIPTION

You work for a bank that issues credit cards. The bank has asked you to value machine learning’s potential use
for real-time fraud prediction. You have received some sample data and a data dictionary, all available in this
folder:

https://drive.google.com/drive/folders/18wSh3FxhwVrDA9EIzNtmBDAmVPN9UA8O?usp=sharing

Your job is as follows. Use the provided data and data dictionary to build a machine learning model predicting
credit card fraud. Estimate the annual profitability of deploying your ML model.

For purposes of the exam, assume:

• Transactions are in dollars
• These are transactions for two years (and those years are representative of all other years)
• The bank receives 2% of every completed transaction as a processing fee
• The bank is 100% liable for any fraudulent transaction (i.e., the bank has to cover fraud losses)
• 10% of customers whose transaction is a false positive will call, speak with customer service, and complete their transaction
o These calls take an average of 15 minutes
• Customer service personnel make 30 dollars per hour
• There is no customer goodwill cost or benefit associated with predicting fraud
• There are no costs, beyond those mentioned above, for predicted frauds (i.e., no further auditing,
reporting, or other potential costs for transactions predicted/found to be fraudulent). There are no
benefits, beyond those mentioned above, for the bank in its credit card business.

If you need to make other assumptions, feel free to do so. Just note the assumption and any references you
use to ground the assumption (if necessary).
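The assumptions above can be turned into a per-transaction profit rule. The following is a minimal sketch, not a prescribed solution. It assumes the intervention is "decline every transaction the model flags", that an approved fraudulent transaction costs the bank the full amount with no fee retained, and that fraudsters never call customer service. All function and constant names are hypothetical.

```python
FEE_RATE = 0.02               # processing fee on every completed transaction
CALLBACK_RATE = 0.10          # share of false positives who call and complete
CALL_COST = 30.0 * 15 / 60    # 15-minute call at 30 dollars per hour = 7.50

def transaction_profit(amount, is_fraud, flagged):
    """Expected profit of one transaction under a 'decline if flagged' action."""
    if not flagged:
        # Approved: earn the fee on a legitimate purchase,
        # or absorb the full loss on a fraud (assumption: no fee is kept).
        return -amount if is_fraud else FEE_RATE * amount
    if is_fraud:
        # Fraud correctly blocked: no fee and no loss.
        return 0.0
    # Legitimate purchase declined: 10% call in, which costs one call
    # and recovers the fee; the other 90% are lost transactions.
    return CALLBACK_RATE * (FEE_RATE * amount - CALL_COST)

def annual_profit(amounts, is_fraud, flagged, years=2.0):
    """Total profit over the sample, scaled to one year (the data cover two)."""
    total = sum(
        transaction_profit(a, f, p)
        for a, f, p in zip(amounts, is_fraud, flagged)
    )
    return total / years
```

Setting every `flagged` value to `False` gives the no-model baseline against which a model's profit can be compared.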

DELIVERABLES

Answer the following questions 1 to 5 (max 275 words altogether: number your responses; comprising 25% of
your exam grade):

1. Is there leakage in the dataset? If yes, is it a case of data leakage, target leakage or time leakage, and
which variable(s) is (are) responsible?
2. What are your ethical considerations when building this machine learning model? How would you
safeguard against these potential issues?
3. How could you combine the customer and merchant’s location to derive a new feature? Why may that
feature hold predictive power?
4. What would be the rationale behind including a new feature that measures the frequency of purchases by
a cardholder in your model to predict fraud?
5. Time_since_last_trans has many missing values. Use the count of the missing values and summary
statistics of other variables (and the notion that these are time differences) to make an educated guess
about what the missing values correspond to. What is the fraud rate for these values? What would be the
rationale for an abnormal fraud rate for these missing values? How can you improve the model’s
predictive performance by incorporating this insight in your model?
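Questions 3 to 5 each point at a derived feature. The sketch below shows one possible way to compute them in plain Python; the same logic maps onto data frame columns. Only `Time_since_last_trans` is a name from the exam text. The function names, the use of latitude/longitude in degrees, and the use of `None` to mark a missing value are assumptions for illustration.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between customer and merchant (degrees in)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def prior_purchase_counts(card_ids):
    """For rows sorted by time, count earlier transactions by the same card.

    Only past rows are used, so the feature does not look into the future.
    """
    seen = defaultdict(int)
    counts = []
    for card in card_ids:
        counts.append(seen[card])
        seen[card] += 1
    return counts

def fraud_rate_by_missing(time_since_last, is_fraud):
    """Fraud rate split by whether Time_since_last_trans is missing (None)."""
    tallies = {True: [0, 0], False: [0, 0]}  # missing flag -> [frauds, rows]
    for t, y in zip(time_since_last, is_fraud):
        missing = t is None
        tallies[missing][0] += int(y)
        tallies[missing][1] += 1
    return {k: (f / n if n else float("nan")) for k, (f, n) in tallies.items()}
```

Comparing the two rates returned by `fraud_rate_by_missing` shows whether a binary "is missing" indicator is worth adding as a model feature.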

75% of your exam grade will be determined by the following deliverables:


Write a report based on your findings (5 pages max and 1650 words max; “and” means both limits apply.
This includes the front page, table of contents, references, and figure list, as well as any appendices, though no
appendix is necessary). This should include:

1. A concrete and valid recommendation (specifying the baseline, an appropriate action or actions and
well-motivated prediction thresholds) and the associated total profit
2. The action/intervention on which you base your profit calculation
3. Your data management procedures, consisting of:
a. Your review of the data;
b. Any preprocessing and feature engineering steps (do these in Excel, Python or other appropriate
programs). Note that a basic level of preprocessing and feature engineering would
have to be done. Motivating the rationale and implications of the consequential decisions you
make here is required and valuable; e.g., these could be motivated by summary statistics
and data explorations.
i. To be specific, a basic level of preprocessing and feature engineering would
constitute some of the following (among others referred to in class), where
appropriate:
1. Feature creation
2. Feature exclusion (redundant features, etc.)
3. Encoding (if appropriate)
4. Row exclusions (if appropriate)
5. Outlier removal (if appropriate)
6. Numerical transformations (if appropriate)
7. Dealing with excess zeros, missing values, etc.
c. Your employed data partitioning and justification of it;
d. The composition of the profit components (e.g., what are your high-level profit
components, and how does each contribute to the total?). Be sure to specify a qualitative and
numerical representation of each component of a profit matrix (the numerical
representation could be a formula or a number; you are to decide how to represent it.)
e. Technical/data issues you think might affect results
4. Your modeling procedures:
a. Your model selection process, that:
i. Searches across possible models and hyperparameters (this can be done using
automated tools, such as DataRobot or tpot)
ii. Searches across meaningful sampling routines (e.g. downsampling or SMOTE)
iii. Searches across potential prediction decision thresholds
iv. Searches across potential actions/targets
5. Your evaluation process, that:
a. Executes the above without compromising the holdout, overfitting or failing to address any
leakage
b. Describes which features hold the most signal and motivates the reasoning
c. Makes correct use of, and interpretation of, partitioning decisions
d. Addresses other key modelling issues
6. CRUCIAL NOTE: You must begin all paragraphs with a 1- to 7-word title that describes the
paragraph (the purpose of this is to provide more structure, cohesion & flow to your writing). An
example would be: "On data splitting: We split the dataset into three subsets: training, validation
and holdout. We used 60% of the data for training, 20% for validation and 20% for the holdout."
7. Clear and organized reporting of machine learning model prediction and evaluation steps
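As an illustration of the partitioning and threshold-search steps listed above, the sketch below splits time-ordered rows 60/20/20 and picks the decision threshold that maximizes profit on the validation partition only, leaving the holdout untouched. It is a sketch under stated assumptions: the chronological split, the "decline if flagged" action, the profit rules (full loss on approved fraud, 7.50 dollars per call), and all names are hypothetical choices, not requirements of the exam.

```python
FEE_RATE = 0.02
CALLBACK_RATE = 0.10
CALL_COST = 7.50  # 15 minutes at 30 dollars per hour

def chronological_split(rows, train=0.6, valid=0.2):
    """Split time-ordered rows into train / validation / holdout, no shuffling."""
    n = len(rows)
    i = round(n * train)
    j = i + round(n * valid)
    return rows[:i], rows[i:j], rows[j:]

def profit_at_threshold(scores, amounts, is_fraud, threshold):
    """Total profit if every transaction scoring >= threshold is declined."""
    total = 0.0
    for score, amount, fraud in zip(scores, amounts, is_fraud):
        flagged = score >= threshold
        if flagged and not fraud:
            # False positive: 10% call in and complete the purchase.
            total += CALLBACK_RATE * (FEE_RATE * amount - CALL_COST)
        elif not flagged and fraud:
            # False negative: bank covers the fraud (assumed full amount).
            total -= amount
        elif not flagged:
            # True negative: bank earns the processing fee.
            total += FEE_RATE * amount
        # True positive: blocked fraud contributes nothing.
    return total

def best_threshold(scores, amounts, is_fraud, candidates):
    """Candidate threshold with the highest profit on the given partition."""
    return max(
        candidates,
        key=lambda t: profit_at_threshold(scores, amounts, is_fraud, t),
    )
```

The threshold chosen on validation data would then be applied once to the holdout partition to report an unbiased profit estimate.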

TIPS, TRICKS, and NOTES

1. I strongly suggest using the same structure as the enumerated list above. That means using a larger font
heading for sections corresponding to words in bold font in the “deliverable” instructions above belonging
to numbers 1 to 6, smaller headings for subsections corresponding to words in bold belonging to a)
to e), and yet smaller headings for the subsubsections, i) to iv).
2. The words in bold font in the “deliverable” instructions above are great (not exhaustive) suggestions for
the titles (or parts of titles) with which each paragraph in the deliverables is to start.
3. You may use Python (in conjunction with Excel), DataRobot (in conjunction with Excel), or some
combination for the exam. If using DataRobot, choosing “Quick” as the modeling mode (under Start), will
be adequate.
4. You will not be graded based on your solution’s profit. If one student’s model yields expected profit X, and
another student’s model yields expected profit 10*X, the 10*X exam will not necessarily receive a higher
score. Exams will be graded on their model building, model evaluation, and model valuation processes.
Solution profitability may spuriously correlate with exam grades to the extent that a more thorough
modeling procedure may yield a more profitable solution.
5. The data contain ~50K rows. Be aware of your time limit, taking into consideration the modeling platform
(Python, DataRobot) and models that you build, and your resource limitations (hardware, memory, etc).
6. You are not to submit Python code (references to code are fine). However, where Python is used for
modelling, explaining the modelling procedure(s) is a requirement.
7. I strongly recommend that you do not rely on online solutions for similar code. Using existing code runs a
risk of yielding a good solution without demonstrating that you know how to properly execute the
modeling process. Because you will not be graded on your model’s profitability (the goal of many online
solutions), but will be graded based on the thoroughness of your modeling process and explanation of model
value, adapting existing solutions online can often become a minefield of shortcuts and misunderstandings
that hurt exams.
8. Many Python machine learning models and methods may require your data to be encoded (e.g. one-hot
encoded instead of categorical values). In that case, either implement the correct encoding or be sure to
find and apply the equivalent method for your data type. Be sure to do so in DataRobot too. You can find
the appropriate methods online (by googling or searching the documentation of the packages we have
introduced you to).
9. If you modify the dataset in Excel, it is possible that automated formatting changes could make the dataset
un-loadable into DataRobot. This is usually a function of semi-colons, commas, and decimals being used
differently around the world. If you find that a dataset you’ve modified in Excel cannot be loaded to
DataRobot, you may need to come up with a workaround. If this is a problem for you, I often find it easiest
to load Excel data into Python, save data from Python into .txt or .csv, and then load that into DataRobot.
10. A recommendation given this exam has a time limit: In exams, work projects, or your thesis project, always
get 1 “OK” solution first. Even if you immediately envision a grand solution and the “OK” solution seems
like a wasted interim step, start with a simple but informative approach! You’ll learn something from this
process, and you’ll ensure that you’ve got something to hand in by the deadline.
11. You will be graded on the first 1650 words (and 5 pages) of your report. Title words do count towards the
total word count. The limits are meant to help replicate reports to technically-oriented business managers
and data scientists in firms.
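For the Excel formatting problem described in tip 9, one possible workaround is to normalize the exported text in Python before loading it elsewhere. The sketch below assumes the export uses semicolons as delimiters and commas as decimal separators, with no thousands separators; the function name is hypothetical and the standard library is used so that it runs without extra packages.

```python
import csv
import io
import re

# Matches a plain decimal-comma number such as "12,50" or "-3,7".
DECIMAL_COMMA = re.compile(r"^-?\d+,\d+$")

def normalize_csv_text(text):
    """Convert semicolon-delimited, decimal-comma text to standard CSV."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Only fields that look like decimal-comma numbers are rewritten;
        # text fields containing commas are left alone and quoted on output.
        writer.writerow(
            [f.replace(",", ".") if DECIMAL_COMMA.match(f) else f for f in row]
        )
    return out.getvalue()
```

For example, `normalize_csv_text("amount;city\n12,50;Oslo\n")` yields comma-delimited text with `12.50` as the amount, which can then be written to a .csv file.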
