Financial Data
This dataset (.csv) collects 200+ financial indicators for all the stocks of the US stock market.
The financial indicators are found in the 10-K filings that publicly traded companies release
yearly.
The last column of the dataset represents the class of each stock (1 or 0, as defined in item 5 below).
In other words, stocks that belong to class 1 are stocks that one should buy at the start of
year 2018, and sell at the end of year 2018.
1. Some financial indicator values are missing (nan cells), so the user can select the
best technique to clean each dataset (dropna, fillna, etc.).
2. There are outliers, meaning extreme values that are probably caused by typing errors.
In this case too, the user can choose how to clean each dataset (have a look at the
1st and 99th percentile values).
3. The third-to-last column, Sector, lists the sector of each stock. Indeed, in the US
stock market each company is part of a sector that classifies it in a macro-area.
Since all the sectors have been collected (Basic Materials, Communication Services,
Consumer Cyclical, Consumer Defensive, Energy, Financial Services, Healthcare,
Industrial, Real Estate, Technology and Utilities), the user has the option to perform
per-sector analyses and comparisons.
4. The second-to-last column, PRICE VAR [%], lists the percent price variation of each
stock for the year. For example, if we consider the dataset 2018_Financial_Data.csv,
we will have the percent price variation for the year 2018 (meaning from the first
trading day in January 2018 to the last trading day in December 2018).
5. The last column, class, lists a binary classification for each stock, where:
○ for each stock, if the PRICE VAR [%] value is positive, class = 1. From a
trading perspective, the 1 identifies those stocks that a hypothetical trader
should BUY at the start of the year and sell at the end of the year for a profit.
○ for each stock, if the PRICE VAR [%] value is negative, class = 0. From a
trading perspective, the 0 identifies those stocks that a hypothetical trader
should NOT BUY, since their value will decrease, meaning a loss of capital.
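Remarks 1, 2 and 5 above can be sketched in a few lines. The snippet below is only an illustration using the Python standard library, not a prescribed cleaning procedure: the helper names are hypothetical, median imputation is just one possible choice next to dropna/fillna-style approaches, and mapping a price variation of exactly zero to class 0 is an assumption, since the description only covers the positive and negative cases.

```python
import math
import statistics


def clip_to_percentiles(values, lower_pct=1, upper_pct=99):
    """Clip non-missing values to the [lower_pct, upper_pct] percentile range."""
    present = [v for v in values if not math.isnan(v)]
    # With n=100, quantiles() returns the 99 cut points P1..P99.
    # It needs at least two non-missing values.
    cuts = statistics.quantiles(present, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    # Missing cells are left untouched so they can be handled separately.
    return [v if math.isnan(v) else min(max(v, low), high) for v in values]


def fill_missing_with_median(values):
    """Replace nan cells with the median of the observed values (one option among many)."""
    present = [v for v in values if not math.isnan(v)]
    median = statistics.median(present)
    return [median if math.isnan(v) else v for v in values]


def price_var_to_class(price_var_pct):
    """Class 1 for a positive PRICE VAR [%], class 0 otherwise (zero -> 0 is an assumption)."""
    return 1 if price_var_pct > 0 else 0


# Made-up values for a single indicator column, with one missing cell and one extreme value.
indicator = [0.02, 0.05, float("nan"), -0.01, 0.03, 250.0]
cleaned = fill_missing_with_median(clip_to_percentiles(indicator))
```

On a real column with thousands of stocks the 1st and 99th percentiles cut far more aggressively than on this six-value toy list, so it is worth inspecting both bounds before clipping.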
This dataset has been developed in order to understand whether or not it is possible to
classify the future performance of a stock by looking at the financial information released in
the 10-K filings.
How can you achieve that?
1. Build a classification model for the stocks that would and would not increase their
value in 2018
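As a starting point for the classification task, here is a minimal sketch of a logistic regression fitted by batch gradient descent, written with the standard library only. It is one possible baseline, not the expected solution: the two "indicators" in the toy data are made up and assumed to be already standardized, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Return weights (bias first) minimising the average log-loss by gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y  # derivative of the log-loss with respect to the linear score
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def predict_class(weights, x, threshold=0.5):
    """Predict class 1 when the estimated probability reaches the threshold."""
    p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
    return 1 if p >= threshold else 0


# Toy data: two hypothetical standardized indicators per stock and the class column.
rows = [[1.2, 0.8], [0.9, 1.1], [1.5, 0.3], [-1.0, -0.7], [-0.8, -1.2], [-1.3, -0.2]]
labels = [1, 1, 1, 0, 0, 0]
weights = fit_logistic(rows, labels)
```

The toy fit is evaluated on its own training rows only; on the real dataset the model should be judged on stocks held out from fitting.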
Big Mart Sales
This dataset contains information collected by BigMart (a supermarket chain in the US)
about sales data for 1559 products across 10 stores in different cities.
The attributes recorded for each product and store are defined below.
BigMart has collected such data in order to understand what kind of products sell more in
what kind of stores. Furthermore, it would like to investigate how much the item_visibility
impacts sales. We, as third parties, may be interested in segmenting the products according
to their characteristics and/or sales at different stores.
How can you achieve that?
1. Build a predictive model and find out the sales of each product at a particular store
(or at generic stores with different characteristics).
2. Cluster items according to the available covariates, perhaps also considering the
different sales in different stores (you should spread() the dataset for this last task).
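The spread() step in task 2 turns a long table (one row per item and store) into a wide one (one row per item, one column per store). The sketch below shows the same reshaping in plain Python; the field names item, store and sales are hypothetical stand-ins, since the actual column names are not listed here.

```python
from collections import defaultdict


def spread_sales(records):
    """Reshape long (item, store, sales) records into one row per item, one column per store."""
    stores = sorted({r["store"] for r in records})
    by_item = defaultdict(dict)
    for r in records:
        by_item[r["item"]][r["store"]] = r["sales"]
    # Item/store pairs that never occur become None rather than a guessed value.
    wide = {item: [sold.get(s) for s in stores] for item, sold in by_item.items()}
    return stores, wide


# Made-up long-format records.
records = [
    {"item": "A", "store": "S1", "sales": 10.0},
    {"item": "A", "store": "S2", "sales": 12.5},
    {"item": "B", "store": "S1", "sales": 7.0},
]
stores, wide = spread_sales(records)
```

Because not every product is necessarily sold in every store, the wide table may contain gaps, and these have to be handled before any distance-based clustering.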
Brazilian Houses
This dataset contains information about 10962 houses for rent in different Brazilian cities. The
data have been gathered through a web crawler (data have been automatically scraped from
publicly available rent ads on the web), so be aware of possible errors or inconsistencies
in the data (outliers, duplicates, missing values, etc.).
The following 13 different features have been collected.
These data have been collected in order to better understand the house-rent market in some
of the most important cities in Brazil. A new company wants to enter the real-estate market,
and wants to understand what kind of houses yield the largest rental revenue before investing
its money: what are the driving forces behind high rents?
Furthermore, we may want to segment the rental market into different groups: does the
segmentation match the geographical location of the houses?
1. Build a predictive model and find out the rent amount according to the house
characteristics.
2. Cluster the rental houses according to their characteristics.
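For the clustering task, a minimal sketch of k-means (Lloyd's algorithm) is given below as one possible approach among many. It assumes that the features have already been cleaned and put on a comparable scale; the two features in the toy data are made up, and using the first k points as initial centroids is a simplification chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy data: two hypothetical scaled features per house (for instance size and number of rooms).
houses = [[0.1, 0.2], [2.0, 2.1], [0.2, 0.1], [2.2, 1.9], [0.0, 0.3], [1.9, 2.2]]
assignment, centroids = kmeans(houses, k=2)
```

Comparing the resulting cluster labels with the city of each house is one way to check whether the segmentation follows geography.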
Garment Workers
This dataset includes important attributes of the garment manufacturing process and the
productivity of the employees, which were collected manually and validated by
industry experts. Data have been collected on different days throughout the year, and
each row (instance) contains different characteristics of a specific worker team devoted to
performing a specific task.
The following 14 different features have been collected.
The Garment Industry is one of the key examples of the industrial globalization of this
modern era. It is a highly labour-intensive industry with lots of manual processes. Satisfying
the huge global demand for garment products is mostly dependent on the production and
delivery performance of the employees in the garment manufacturing companies. Decision
makers in the garment industry are therefore keen to track, analyse and
predict the productivity performance of the working teams in their factories.
In particular, they would like to understand what really impacts the productivity of a team.
From a practical point of view, they also know that a productivity score larger than 0.8 is
good enough, while a productivity score lower than 0.8 is not.
1. Build a predictive model for the actual productivity of different teams on different days.
The response lies in (0, 1). What can we do to make the regression setting feasible?
2. Build a classification model to understand which teams, and on which days, generally
achieve a good enough performance.
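One common way to handle a response confined to (0, 1) is to model its logit and map predictions back with the inverse transform; this is a suggestion, not the only valid answer to the question in task 1. The sketch below also derives the binary target for task 2 from the 0.8 cut-off. The function names are hypothetical, the small eps guard against values of exactly 0 or 1 is an assumption, and so is treating a score of exactly 0.8 as not good enough, since the text leaves that case open.

```python
import math


def logit(p, eps=1e-6):
    """Map a productivity score from (0, 1) to the whole real line."""
    p = min(max(p, eps), 1 - eps)  # guard against exact 0 or 1 (assumption)
    return math.log(p / (1 - p))


def inv_logit(z):
    """Map a prediction on the logit scale back to (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def good_enough(productivity, threshold=0.8):
    """Binary target for task 2: 1 if the score is larger than the threshold, else 0."""
    return 1 if productivity > threshold else 0


# Made-up productivity scores.
scores = [0.95, 0.80, 0.42]
transformed = [logit(s) for s in scores]
targets = [good_enough(s) for s in scores]
```

A regression fitted on the transformed response always yields predictions inside (0, 1) once inv_logit is applied, which a regression on the raw score does not guarantee.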
Telecom Churn
This dataset contains information about US Telecom customers. Each row represents a
customer. The dataset has the following variables on the columns:
Telecom wants to understand which customers are more likely to churn (change provider)
according to the available covariates, so that it can target these customers with an ad-hoc
promotional campaign.
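Once a model produces a churn probability for each customer, targeting the campaign amounts to ranking customers by that probability. The sketch below illustrates this last step only; the customer identifiers and probabilities are made up, and the fraction of customers to contact is a hypothetical business parameter, not something specified in the description.

```python
def top_churn_risks(scores, fraction=0.1):
    """Return the customer ids with the highest predicted churn probability."""
    n = max(1, round(len(scores) * fraction))  # always contact at least one customer
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [customer_id for customer_id, _ in ranked[:n]]


# Made-up predicted churn probabilities keyed by customer id.
predicted = {"c1": 0.9, "c2": 0.1, "c3": 0.7, "c4": 0.3, "c5": 0.5}
targets = top_churn_risks(predicted, fraction=0.4)
```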
This dataset contains information about apartment transaction data generated from August
2007 to August 2017 in Daebong district, Daegu city, South Korea. Each row refers to an
apartment. The dataset has the following variables:
A new company wants to enter the real-estate market in Korea, and wants to understand
what kind of houses, and in what areas (facilities, etc.), command the highest sale prices.
1. Build a predictive model for the sale prices of houses in Korea according to the
available covariates.
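A simple baseline for this task is a least-squares line on a single covariate, scored by root mean squared error. The sketch below is only a starting point under stated assumptions: the covariate (size) and all numbers are made up, and a realistic model would use several covariates and be evaluated on data held out from fitting.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up data: apartment size and sale price, in arbitrary units.
size = [50.0, 65.0, 80.0, 110.0]
price = [120.0, 150.0, 190.0, 260.0]
intercept, slope = fit_line(size, price)
error = rmse(price, [intercept + slope * s for s in size])
```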
Car Prices
This dataset contains information about automobiles. Each row represents a different car
model. For each observation we have information about the price and the car features, as
well as an insurance risk score.
A car manufacturer has to come up with a new car model. It wants to target a
specific segment of the market, so it needs a model able to predict the price of the
newly designed car according to its specifications. Furthermore, it is well known that the car
market is strongly segmented into different types (van, SUV, coupe, etc.). Which cluster will
the new car belong to?
1. Build a predictive model for the sale prices of cars according to their characteristics
2. Can you obtain a clustering of car types just by looking at the available covariates
(excluding model and carbody)?
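To judge whether a clustering recovers the car types, the cluster labels can be cross-tabulated against the carbody column that was left out of the clustering. The sketch below computes such a table and the cluster purity (the share of cars that fall in the majority body type of their cluster). It is one possible check among others; the cluster ids and body types in the example are made up.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Return (purity, table), where table[cluster] counts the true labels in that cluster."""
    table = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        table[cluster][label] += 1
    # For each cluster, count the cars belonging to its most frequent body type.
    majority_total = sum(counts.most_common(1)[0][1] for counts in table.values())
    return majority_total / len(true_labels), table


# Made-up cluster assignments and body types.
clusters = [0, 0, 0, 1, 1, 1]
carbody = ["suv", "suv", "van", "coupe", "coupe", "coupe"]
purity, table = cluster_purity(clusters, carbody)
```

Purity increases mechanically with the number of clusters, so it is best compared across solutions that use the same number of clusters.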
Affairs
This dataset contains information about extramarital affairs. For each subject the following
features were recorded:
Some possible questions of interest (not exhaustive: these are just some ideas): Is the
amount of time spent in extramarital affairs affected by the other covariates? Does the
presence of children (or other features) tend to affect in some way the probability of being in
a successful marriage?